RE: [PATCH v3 net-next 3/6] net: Add SW fallback infrastructure for offloaded sockets

2017-12-18 Thread Ilya Lesokhin
On Monday, December 18, 2017 9:18 PM, Marcelo Ricardo Leitner wrote:

> > +
> > +   if (sk && sk_fullsock(sk) && sk->sk_offload_check)
> 
> Isn't this going to hurt the fast path, checking for sk fields here?
> 

We do add code to the fast path, but it seems unavoidable if you want to have
SW fallback.
The XFRM device offload also does that:
http://elixir.free-electrons.com/linux/v4.14.7/source/net/core/dev.c#L3058

The check can be optimized, but I didn't want to do that before I saw that it
is an issue.
I'm also not sure what the correct solution is.
I don't like the fact that each "stateful protocol" we offload requires its
own check.
We need to think about whether we can find a generic way of doing it.

Perhaps we can hold the expected netdev somewhere in the SKB, and only if the
SKB does not go out of the expected netdev take a slow path that does a check
for each protocol.
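
A minimal userspace sketch of that idea, not code from the patch set. The types model_netdev and model_skb and both functions are hypothetical stand-ins for struct net_device, struct sk_buff and the transmit validation hook; the point is only that the fast path pays for one pointer comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct net_device and struct sk_buff. */
struct model_netdev {
	int ifindex;
};

struct model_skb {
	const struct model_netdev *expected_dev; /* NULL: no offload state */
	bool needs_fallback;                     /* set by the slow path */
};

/* Slow path: this is where each offloaded protocol would run its own check. */
static void model_slow_path(struct model_skb *skb)
{
	assert(skb->expected_dev != NULL);
	skb->needs_fallback = true;
}

/*
 * Fast path: a single pointer comparison per packet.
 * Returns true if the slow path had to run.
 */
static bool model_validate_xmit(struct model_skb *skb,
				const struct model_netdev *out_dev)
{
	if (skb->expected_dev == NULL || skb->expected_dev == out_dev)
		return false;

	model_slow_path(skb);
	return true;
}
```

Packets without offload state, and offloaded packets leaving through the expected device, never reach the per-protocol checks in this model.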

Thanks,
Ilya


RE: [PATCH v3 net-next 6/6] tls: Add generic NIC offload infrastructure.

2017-12-18 Thread Ilya Lesokhin
On Monday, December 18, 2017 9:54 PM, Marcelo Ricardo Leitner wrote:

> On Mon, Dec 18, 2017 at 01:10:33PM +0200, Ilya Lesokhin wrote:
> > This patch adds a generic infrastructure to offload TLS crypto to
> > network devices. It enables the kernel TLS socket to skip encryption
> > and authentication operations on the transmit side of the data path,
> > leaving those computationally expensive operations to the NIC.
> 
> I have a hard time understanding why this was named 'tls_device' if no
> net_device's are registered.
> 
I'm not quite sure what you mean by "no net_device's are registered".
Presumably you mean there is no device that implements the
NETIF_F_HW_TLS_TX capability yet.
I'll just say that the IPSEC device offload infrastructure was also submitted
https://github.com/torvalds/linux/commit/d77e38e612a017480157fe6d2c1422f42cb5b7e3
before the first implementation
https://github.com/torvalds/linux/commit/bebb23e6cb02d2fc752905e39d09ff6152852c6c

And we did provide a link to an implementation
https://github.com/Mellanox/tls-offload/tree/tls_device_v3
for people who want to take a look.
Unfortunately, it is not ready for upstream submission yet.


> > +   percpu_down_read(_offload_lock);
> > +   netdev = get_netdev_for_sock(sk);
> > +   if (!netdev) {
> > +   pr_err("%s: netdev not found\n", __func__);
> 
> _ratelimit?
> 

Thanks, we will fix it in the future.


Re: [alsa-devel] [trivial PATCH] treewide: Align function definition open/close braces

2017-12-18 Thread Takashi Iwai
On Mon, 18 Dec 2017 01:28:44 +0100,
Joe Perches wrote:
> 
> Some function definitions have the initial open brace and/or
> the closing brace outside of column 1.
> 
> Move those braces to column 1.
> 
> This allows various function analyzers like gnu complexity to work
> properly for these modified functions.
> 
> Miscellanea:
> 
> o Remove extra trailing ; and blank line from xfs_agf_verify
> 
> Signed-off-by: Joe Perches 
> ---
> git diff -w shows no difference other than the above 'Miscellanea'
> 
> (this is against -next, but it applies against Linus' tree
>  with a couple offsets)
> 
>  arch/x86/include/asm/atomic64_32.h                   |  2 +-
>  drivers/acpi/custom_method.c                         |  2 +-
>  drivers/acpi/fan.c                                   |  2 +-
>  drivers/gpu/drm/amd/display/dc/core/dc.c             |  2 +-
>  drivers/media/i2c/msp3400-kthreads.c                 |  2 +-
>  drivers/message/fusion/mptsas.c                      |  2 +-
>  drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c |  2 +-
>  drivers/net/wireless/ath/ath9k/xmit.c                |  2 +-
>  drivers/platform/x86/eeepc-laptop.c                  |  2 +-
>  drivers/rtc/rtc-ab-b5ze-s3.c                         |  2 +-
>  drivers/scsi/dpt_i2o.c                               |  2 +-
>  drivers/scsi/sym53c8xx_2/sym_glue.c                  |  2 +-
>  fs/locks.c                                           |  2 +-
>  fs/ocfs2/stack_user.c                                |  2 +-
>  fs/xfs/libxfs/xfs_alloc.c                            |  5 ++---
>  fs/xfs/xfs_export.c                                  |  2 +-
>  kernel/audit.c                                       |  6 +++---
>  kernel/trace/trace_printk.c                          |  4 ++--
>  lib/raid6/sse2.c                                     | 14 +++---
>  sound/soc/fsl/fsl_dma.c                              |  2 +-

For sound bits,
  Acked-by: Takashi Iwai 


thanks,

Takashi


Re: [PATCH v3 net-next 6/6] tls: Add generic NIC offload infrastructure.

2017-12-18 Thread kbuild test robot
Hi Ilya,

I love your patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Ilya-Lesokhin/tls-Add-generic-NIC-offload-infrastructure/20171219-140819
config: tile-allmodconfig (attached as .config)
compiler: tilegx-linux-gcc (GCC) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=tile 

All errors (new ones prefixed by >>):

   net/tls/tls_device_fallback.c: In function 'update_chksum':
>> net/tls/tls_device_fallback.c:190:16: error: implicit declaration of function 'csum_ipv6_magic'; did you mean 'csum_tcpudp_magic'? [-Werror=implicit-function-declaration]
      th->check = ~csum_ipv6_magic(&ipv6h->saddr, &ipv6h->daddr,
                   ^~~~~~~~~~~~~~~
                   csum_tcpudp_magic
   cc1: some warnings being treated as errors

vim +190 net/tls/tls_device_fallback.c

   166  
   167  static inline void update_chksum(struct sk_buff *skb, int headln)
   168  {
   169  /* Can't use icsk->icsk_af_ops->send_check here because the ip addresses
   170   * might have been changed by NAT.
   171   */
   172  
   173  const struct ipv6hdr *ipv6h;
   174  const struct iphdr *iph;
   175  struct tcphdr *th = tcp_hdr(skb);
   176  int datalen = skb->len - headln;
   177  
   178  /* We only changed the payload so if we are using partial we don't
   179   * need to update anything.
   180   */
   181  if (likely(skb->ip_summed == CHECKSUM_PARTIAL))
   182  return;
   183  
   184  skb->ip_summed = CHECKSUM_PARTIAL;
   185  skb->csum_start = skb_transport_header(skb) - skb->head;
   186  skb->csum_offset = offsetof(struct tcphdr, check);
   187  
   188  if (skb->sk->sk_family == AF_INET6) {
   189  ipv6h = ipv6_hdr(skb);
 > 190  th->check = ~csum_ipv6_magic(&ipv6h->saddr, &ipv6h->daddr,
   191   datalen, IPPROTO_TCP, 0);
   192  } else {
   193  iph = ip_hdr(skb);
   194  th->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr, datalen,
   195 IPPROTO_TCP, 0);
   196  }
   197  }
   198  

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation




Re: [PATCH v3 net-next 6/6] tls: Add generic NIC offload infrastructure.

2017-12-18 Thread kbuild test robot
Hi Ilya,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Ilya-Lesokhin/tls-Add-generic-NIC-offload-infrastructure/20171219-140819
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=xtensa 

All warnings (new ones prefixed by >>):

   net//tls/tls_device_fallback.c: In function 'tls_sw_fallback':
>> net//tls/tls_device_fallback.c:360:1: warning: the frame size of 1040 bytes is larger than 1024 bytes [-Wframe-larger-than=]
}
^

vim +360 net//tls/tls_device_fallback.c

   214  
   215  /* This function may be called after the user socket is already
   216   * closed so make sure we don't use anything freed during
   217   * tls_sk_proto_close here
   218   */
   219  struct sk_buff *tls_sw_fallback(struct sock *sk, struct sk_buff *skb)
   220  {
   221  int tcp_header_size = tcp_hdrlen(skb);
   222  int tcp_payload_offset = skb_transport_offset(skb) + tcp_header_size;
   223  int payload_len = skb->len - tcp_payload_offset;
   224  struct tls_context *tls_ctx = tls_get_ctx(sk);
   225  struct tls_offload_context *ctx = tls_offload_ctx(tls_ctx);
   226  int remaining, buf_len, resync_sgs, rc, i = 0;
   227  void *buf, *dummy_buf, *iv, *aad;
   228  struct scatterlist sg_in[2 * (MAX_SKB_FRAGS + 1)];
   229  struct scatterlist sg_out[3];
   230  u32 tcp_seq = ntohl(tcp_hdr(skb)->seq);
   231  struct aead_request *aead_req;
   232  struct sk_buff *nskb = NULL;
   233  struct tls_record_info *record;
   234  unsigned long flags;
   235  s32 sync_size;
   236  u64 rcd_sn;
   237  
   238  if (!payload_len)
   239  return skb;
   240  
   241  sg_init_table(sg_in, ARRAY_SIZE(sg_in));
   242  sg_init_table(sg_out, ARRAY_SIZE(sg_out));
   243  
   244  spin_lock_irqsave(&ctx->lock, flags);
   245  record = tls_get_record(ctx, tcp_seq, &rcd_sn);
   246  if (!record) {
   247  spin_unlock_irqrestore(&ctx->lock, flags);
   248  WARN(1, "Record not found for seq %u\n", tcp_seq);
   249  goto free_orig;
   250  }
   251  
   252  sync_size = tcp_seq - tls_record_start_seq(record);
   253  if (sync_size < 0) {
   254  int is_start_marker = tls_record_is_start_marker(record);
   255  
   256  spin_unlock_irqrestore(&ctx->lock, flags);
   257  if (!is_start_marker)
   258  /* This should only occur if the relevant record was
   259   * already acked. In that case it should be ok
   260   * to drop the packet and avoid retransmission.
   261   *
   262   * There is a corner case where the packet contains
   263   * both an acked and a non-acked record.
   264   * We currently don't handle that case and rely
   265   * on TCP to retransmit a packet that doesn't contain
   266   * already acked payload.
   267   */
   268  goto free_orig;
   269  
   270  if (payload_len > -sync_size) {
   271  WARN(1, "Fallback of partially offloaded packets is not supported\n");
   272  goto free_orig;
   273  } else {
   274  return skb;
   275  }
   276  }
   277  
   278  remaining = sync_size;
   279  while (remaining > 0) {
   280  skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
   281  
   282  __skb_frag_ref(frag);
   283  sg_set_page(sg_in + i, skb_frag_page(frag),
   284  skb_frag_size(frag), frag->page_offset);
   285  
   286  remaining -= skb_frag_size(frag);
   287  
   288  if (remaining < 0)
   289  sg_in[i].length += remaining;
   290  
   291  i++;
   292  }
   293  spin_unlock_irqrestore(&ctx->lock, flags);
   294  resync_sgs = i;
   295  
   296  aead_req = tls_alloc_aead_request(ctx->aead_send, GFP_ATOMIC);
   297  if (!aead_req)
   298  goto put_sg;
   299  
   300  buf_len = TLS_CIPHER_AES_GCM_128_SALT_SIZE +
   301TLS_CIPHER_AES_GCM_128_IV_SIZE +
   302TLS_AAD_SPACE_SIZE +
   303sync_size +
   304tls_ctx->tag_size;
   305  buf = kmalloc(buf_len, GFP_ATOMIC);
   

Re: [PATCH v3 07/33] nds32: MMU initialization

2017-12-18 Thread Greentime Hu
Hi, Guo Ren:

2017-12-18 20:22 GMT+08:00 Guo Ren :
> On Mon, Dec 18, 2017 at 07:21:30PM +0800, Greentime Hu wrote:
>> Hi, Guo Ren:
>>
>> 2017-12-18 17:08 GMT+08:00 Guo Ren :
>> > Hi Greentime,
>> >
>> > On Fri, Dec 08, 2017 at 05:11:50PM +0800, Greentime Hu wrote:
>> > [...]
>> >>
>> >> diff --git a/arch/nds32/mm/highmem.c b/arch/nds32/mm/highmem.c
>> > [...]
>> >> +void *kmap(struct page *page)
>> >> +{
>> >> + unsigned long vaddr;
>> >> + might_sleep();
>> >> + if (!PageHighMem(page))
>> >> + return page_address(page);
>> >> + vaddr = (unsigned long)kmap_high(page);
>> > The CPU MMU TLB entry should be invalidated here, or in
>> > set_pte().
>> >
>> > eg:
>> > vaddr0 = kmap(page0)
>> > *vaddr0 = val0 // Causes a TLB miss and a hardware refill of the MMU TLB
>> > kunmap(page0)
>> > vaddr1 = kmap(page1) // Mostly vaddr1 = vaddr0
>> > val = *vaddr1; // No TLB miss, so it reads page0's value, not page1's,
>> > because the stale vaddr0 entry is still in the CPU MMU TLB.
>> >
>>
>> Thanks.
>> I will add __nds32__tlbop_inv(vaddr); to invalidate this mapping
>> before returning vaddr.
>
> Sorry, perhaps I'm wrong. See
> kmap->kmap_high->map_new_virtual->get_next_pkmap_nr(color).
>
> It seems pkmap hands out the next vaddr (vaddr + 1) each time until
> no_more_pkmaps(), and then calls flush_all_zero_pkmaps().
> Only kmap_atomic needs the invalidation, and you've already done that.

Thanks for double-checking this case. :)
As you said, the generic code flow will flush the TLB.


Re: [PATCH v10 3/5] bpf: add a bpf_override_function helper

2017-12-18 Thread Masami Hiramatsu
On Mon, 18 Dec 2017 16:09:30 +0100
Daniel Borkmann  wrote:

> On 12/18/2017 10:51 AM, Masami Hiramatsu wrote:
> > On Fri, 15 Dec 2017 14:12:54 -0500
> > Josef Bacik  wrote:
> >> From: Josef Bacik 
> >>
> >> Error injection is sloppy and very ad-hoc.  BPF could fill this niche
> >> perfectly with its kprobe functionality.  We could make sure errors are
> >> only triggered in specific call chains that we care about with very
> >> specific situations.  Accomplish this with the bpf_override_function
> >> helper.  This will modify the probe'd caller's return value to the
> >> specified value and set the PC to an override function that simply
> >> returns, bypassing the originally probed function.  This gives us a nice
> >> clean way to implement systematic error injection for all of our code
> >> paths.
> > 
> > OK, got it. I think the error_injectable function list should be defined
> > in kernel/trace/bpf_trace.c because only bpf calls it and needs to care
> > about the "safeness".
> > 
> > [...]
> >> diff --git a/arch/x86/kernel/kprobes/ftrace.c b/arch/x86/kernel/kprobes/ftrace.c
> >> index 8dc0161cec8f..1ea748d682fd 100644
> >> --- a/arch/x86/kernel/kprobes/ftrace.c
> >> +++ b/arch/x86/kernel/kprobes/ftrace.c
> >> @@ -97,3 +97,17 @@ int arch_prepare_kprobe_ftrace(struct kprobe *p)
> >>p->ainsn.boostable = false;
> >>return 0;
> >>  }
> >> +
> >> +asmlinkage void override_func(void);
> >> +asm(
> >> +  ".type override_func, @function\n"
> >> +  "override_func:\n"
> >> +  "   ret\n"
> >> +  ".size override_func, .-override_func\n"
> >> +);
> >> +
> >> +void arch_ftrace_kprobe_override_function(struct pt_regs *regs)
> >> +{
> >> +  regs->ip = (unsigned long)&override_func;
> >> +}
> >> +NOKPROBE_SYMBOL(arch_ftrace_kprobe_override_function);
> > 
> > Calling this "override_function" is meaningless. This is a function
> > which just returns. So I think a combination of just_return_func() and
> > arch_bpf_override_func_just_return() would be better.
> > 
> > Moreover, this arch/x86/kernel/kprobes/ftrace.c is an architecture
> > dependent implementation of kprobes, not bpf.
> 
> Josef, please work out any necessary cleanups that would still need
> to be addressed based on Masami's feedback and send them as follow-up
> patches, thanks.
> 
> > Hmm, would arch/x86/net/bpf_jit_comp.c be a better place?
> 
> (No, it's JIT only and I'd really prefer to keep it that way, mixing
>  this would result in a huge mess.)

OK, the same applies to kprobes. kernel/kprobes.c and arch/x86/kernel/kprobes/*
are for instrumentation code, and kernel/trace/trace_kprobe.c is ftrace's
kprobe user interface, just one implementation of kprobe usage. So please
do not mix them up. To me, that would also result in a huge mess.

Thank you,

-- 
Masami Hiramatsu 


[patch iproute2] tc: add -bs option for batch mode

2017-12-18 Thread Chris Mi
Currently in tc batch mode, only one command is read from the batch
file and sent to the kernel to process. With this patch, we can accumulate
several commands before sending them to the kernel. The batch size is
specified using the option -bs or -batchsize.

To accumulate the commands in tc, we allocate an array of struct iovec.
If batchsize is bigger than 1 and we haven't accumulated enough
commands, rtnl_talk() will return without actually sending the message.
The one exception is when there are no more commands in the batch file.

But please note that the kernel still processes the requests one by one.
Processing the requests in parallel in the kernel is a separate effort.
The time saved by this patch is the context switching between user mode
and kernel mode. So this patch works on top of the current kernel.
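
As a rough illustration of the accumulate-then-send behaviour described here, a minimal userspace sketch follows. It is not the patch's code: struct model_batch, model_queue() and model_flush() are hypothetical names, and the flush only counts what a single sendmsg() call would have carried instead of talking to netlink.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

#define MODEL_IOV_MAX 256

/* Hypothetical batching state; the patch keeps similar state in globals. */
struct model_batch {
	struct iovec iov[MODEL_IOV_MAX];
	int count;      /* messages queued so far */
	int batch_size; /* flush threshold, 1..MODEL_IOV_MAX */
	int flushes;    /* number of simulated sendmsg() calls */
	size_t bytes;   /* total bytes handed over in those calls */
};

/* Simulate one sendmsg() carrying every queued message. */
static void model_flush(struct model_batch *b)
{
	if (b->count == 0)
		return;

	for (int i = 0; i < b->count; i++)
		b->bytes += b->iov[i].iov_len;

	b->flushes++;
	b->count = 0;
}

/*
 * Queue one message. Send only when the batch is full or when the
 * caller signals that the batch file has no more commands.
 */
static void model_queue(struct model_batch *b, void *msg, size_t len, int last)
{
	assert(b->batch_size >= 1 && b->batch_size <= MODEL_IOV_MAX);

	b->iov[b->count].iov_base = msg;
	b->iov[b->count].iov_len = len;
	b->count++;

	if (b->count >= b->batch_size || last)
		model_flush(b);
}
```

With a batch size of 10, queuing 25 messages results in three simulated sends (10, 10 and 5 messages), which is where the saved user/kernel transitions come from.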

Using the following script in the kernel tree, we can generate 1,000,000 rules:
tools/testing/selftests/tc-testing/tdc_batch.py

Without this patch, 'tc -b $file' execution time is:

real    0m14.916s
user    0m6.808s
sys     0m8.046s

With this patch, 'tc -b $file -bs 10' execution time is:

real    0m12.286s
user    0m5.903s
sys     0m6.312s

The insertion rate is improved by more than 10%.
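
For reference, the improvement can be worked out from the two "real" times quoted in this message. The helper names in this small sketch are hypothetical; it assumes the same 1,000,000 rules were inserted in both runs.

```c
#include <assert.h>

/* Fractional reduction in wall-clock time: (before - after) / before. */
static double time_reduction(double before_s, double after_s)
{
	assert(before_s > 0.0);
	return (before_s - after_s) / before_s;
}

/* Fractional gain in rules inserted per second for a fixed rule count. */
static double rate_gain(double before_s, double after_s)
{
	assert(after_s > 0.0);
	return before_s / after_s - 1.0;
}
```

With 14.916 s before and 12.286 s after, time_reduction() gives roughly 0.176 and rate_gain() roughly 0.214, so both readings of "more than 10%" hold.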

Signed-off-by: Chris Mi 
---
 include/libnetlink.h | 27 
 include/utils.h  |  8 +
 lib/libnetlink.c | 30 +-
 lib/utils.c  | 20 
 man/man8/tc.8|  5 +++
 tc/m_action.c| 63 -
 tc/tc.c  | 27 ++--
 tc/tc_common.h   |  3 ++
 tc/tc_filter.c   | 89 
 9 files changed, 221 insertions(+), 51 deletions(-)

diff --git a/include/libnetlink.h b/include/libnetlink.h
index a4d83b9e..07e88c94 100644
--- a/include/libnetlink.h
+++ b/include/libnetlink.h
@@ -13,6 +13,8 @@
 #include 
 #include 
 
+#define MSG_IOV_MAX 256
+
 struct rtnl_handle {
int fd;
struct sockaddr_nl  local;
@@ -93,6 +95,31 @@ int rtnl_dump_filter_nc(struct rtnl_handle *rth,
void *arg, __u16 nc_flags);
 #define rtnl_dump_filter(rth, filter, arg) \
rtnl_dump_filter_nc(rth, filter, arg, 0)
+
+extern int msg_iov_index;
+static inline int get_msg_iov_index(void)
+{
+   return msg_iov_index;
+}
+static inline void set_msg_iov_index(int index)
+{
+   msg_iov_index = index;
+}
+static inline void incr_msg_iov_index(void)
+{
+   ++msg_iov_index;
+}
+
+extern int batch_size;
+static inline int get_batch_size(void)
+{
+   return batch_size;
+}
+static inline void set_batch_size(int size)
+{
+   batch_size = size;
+}
+
 int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
  struct nlmsghdr **answer)
__attribute__((warn_unused_result));
diff --git a/include/utils.h b/include/utils.h
index d3895d56..66cb4747 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -235,6 +235,14 @@ void print_nlmsg_timestamp(FILE *fp, const struct nlmsghdr *n);
 
 extern int cmdlineno;
 ssize_t getcmdline(char **line, size_t *len, FILE *in);
+
+extern int cmdlinetotal;
+static inline int getcmdlinetotal(void)
+{
+   return cmdlinetotal;
+}
+void setcmdlinetotal(const char *name);
+
 int makeargs(char *line, char *argv[], int maxargs);
 int inet_get_addr(const char *src, __u32 *dst, struct in6_addr *dst6);
 
diff --git a/lib/libnetlink.c b/lib/libnetlink.c
index 00e6ce0c..f9be1c6d 100644
--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -24,6 +24,7 @@
 #include 
 
 #include "libnetlink.h"
+#include "utils.h"
 
 #ifndef SOL_NETLINK
 #define SOL_NETLINK 270
@@ -581,6 +582,10 @@ static void rtnl_talk_error(struct nlmsghdr *h, struct nlmsgerr *err,
strerror(-err->error));
 }
 
+static struct iovec msg_iov[MSG_IOV_MAX];
+int msg_iov_index;
+int batch_size = 1;
+
 static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
   struct nlmsghdr **answer,
   bool show_rtnl_err, nl_ext_ack_fn_t errfn)
@@ -589,23 +594,34 @@ static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
unsigned int seq;
struct nlmsghdr *h;
struct sockaddr_nl nladdr = { .nl_family = AF_NETLINK };
-   struct iovec iov = {
-   .iov_base = n,
-   .iov_len = n->nlmsg_len
-   };
+   struct iovec *iov = &msg_iov[get_msg_iov_index()];
+   int index;
+   char *buf;
+
+   iov->iov_base = n;
+   iov->iov_len = n->nlmsg_len;
+
+   incr_msg_iov_index();
struct msghdr msg = {
.msg_name = &nladdr,
.msg_namelen = sizeof(nladdr),
-   .msg_iov = &iov,
-   .msg_iovlen = 1,
+   .msg_iov = msg_iov,
+   .msg_iovlen = get_msg_iov_index(),
};
-   char *buf;
 
n->nlmsg_seq = seq = ++rtnl->seq;
 
if (answer == NULL)
n->nlmsg_flags |= NLM_F_ACK;
 
+   index = get_msg_iov_index() % 

Re: [PATCH v2 2/3] vsprintf: print if symbol not found

2017-12-18 Thread Tobin C. Harding
On Mon, Dec 18, 2017 at 10:18:27PM -0800, Joe Perches wrote:
> On Tue, 2017-12-19 at 14:28 +1100, Tobin C. Harding wrote:
> > Depends on: commit 40eee173a35e ("kallsyms: don't leak address when
> > symbol not found")
> > 
> > Currently vsprintf for specifiers %p[SsB] relies on the behaviour of
> > kallsyms (sprint_symbol()) and prints the actual address if a symbol is
> > not found. Previous patch changes this behaviour so that sprint_symbol()
> > returns an error if symbol not found. With this patch in place we can
> > print a sanitized message '' instead of leaking the
> > address.
> > 
> > Print '' for printk specifier %p[sSB] if symbol look
> > up fails.
> > 
> > Signed-off-by: Tobin C. Harding 
> > ---
> >  lib/vsprintf.c | 11 ---
> >  1 file changed, 8 insertions(+), 3 deletions(-)
> > 
> > diff --git a/lib/vsprintf.c b/lib/vsprintf.c
> > index 01c3957b2de6..820ed4fe6e6c 100644
> > --- a/lib/vsprintf.c
> > +++ b/lib/vsprintf.c
> > @@ -674,6 +674,8 @@ char *symbol_string(char *buf, char *end, void *ptr,
> > unsigned long value;
> >  #ifdef CONFIG_KALLSYMS
> > char sym[KSYM_SYMBOL_LEN];
> > +   const char *sym_not_found = "";
> 
> This will be reinitialized on every use.
> 
> > +   int ret;
> >  #endif
> >  
> > if (fmt[1] == 'R')
> > @@ -682,11 +684,14 @@ char *symbol_string(char *buf, char *end, void *ptr,
> >  
> >  #ifdef CONFIG_KALLSYMS
> > if (*fmt == 'B')
> > -   sprint_backtrace(sym, value);
> > +   ret = sprint_backtrace(sym, value);
> > else if (*fmt != 'f' && *fmt != 's')
> > -   sprint_symbol(sym, value);
> > +   ret = sprint_symbol(sym, value);
> > else
> > -   sprint_symbol_no_offset(sym, value);
> > +   ret = sprint_symbol_no_offset(sym, value);
> > +
> > +   if (ret == -1)
> > +   strcpy(sym, sym_not_found);
> 
> 
> This could avoid the unnecessary strcpy if sym_not_found
> was not used at all and this was used instead
> 
>   if (ret == -1)
>   return string(buf, end, "", spec);
> 
>   return string(buf, end, sym, spec);
> 
> or maybe
> 
>   return string(buf, end, ret == -1 ? "" : sym, spec);

Oh, thanks. This is much cleaner. Will re-spin.
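
A minimal userspace sketch of the suggested shape, for illustration only. The lookup function, the address 0x1000 and the placeholder text "(no symbol)" are all hypothetical; the placeholder the patch actually prints is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical placeholder text, not the string used by the patch. */
#define MODEL_NO_SYMBOL "(no symbol)"

/*
 * Hypothetical stand-in for sprint_symbol(): it knows a single address
 * and returns -1 when the lookup fails, as the previous patch proposes.
 */
static int model_sprint_symbol(char *sym, size_t len, unsigned long addr)
{
	assert(len > 0);

	if (addr != 0x1000UL)
		return -1;

	return snprintf(sym, len, "%s", "known_func+0x0/0x10");
}

/*
 * Pick the text to print: the symbol on success, a constant placeholder
 * on failure. No strcpy is needed and the raw address is never printed.
 */
static const char *model_symbol_string(char *sym, size_t len,
				       unsigned long addr)
{
	int ret = model_sprint_symbol(sym, len, addr);

	return ret == -1 ? MODEL_NO_SYMBOL : sym;
}
```

Selecting between two pointers, rather than copying the placeholder into the buffer, is what removes both the extra local variable and the strcpy().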

thanks,
Tobin.


Re: [PATCH v10 1/5] add infrastructure for tagging functions as error injectable

2017-12-18 Thread Masami Hiramatsu
On Fri, 15 Dec 2017 14:12:52 -0500
Josef Bacik  wrote:

> From: Josef Bacik 
> 
> Using BPF we can override kprobe'd functions and return arbitrary
> values.  Obviously this can be a bit unsafe, so make this feature opt-in
> for functions.  Simply tag a function with KPROBE_ERROR_INJECT_SYMBOL in
> order to give BPF access to that function for error injection purposes.
> 
> Signed-off-by: Josef Bacik 
> Acked-by: Ingo Molnar 
> ---
>  include/asm-generic/vmlinux.lds.h |  10 +++
>  include/linux/bpf.h               |  11 +++
>  include/linux/kprobes.h           |   1 +
>  include/linux/module.h            |   5 ++
>  kernel/kprobes.c                  | 163 ++
>  kernel/module.c                   |   6 +-
>  6 files changed, 195 insertions(+), 1 deletion(-)
> 
> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> index ee8b707d9fa9..a2e8582d094a 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -136,6 +136,15 @@
>  #define KPROBE_BLACKLIST()
>  #endif
>  
> +#ifdef CONFIG_BPF_KPROBE_OVERRIDE
> +#define ERROR_INJECT_LIST()  . = ALIGN(8);   \
> + VMLINUX_SYMBOL(__start_kprobe_error_inject_list) = .;   \
> + KEEP(*(_kprobe_error_inject_list))  \
> + VMLINUX_SYMBOL(__stop_kprobe_error_inject_list) = .;
> +#else
> +#define ERROR_INJECT_LIST()
> +#endif
> +
>  #ifdef CONFIG_EVENT_TRACING
>  #define FTRACE_EVENTS()  . = ALIGN(8);   \
>   VMLINUX_SYMBOL(__start_ftrace_events) = .;  \
> @@ -564,6 +573,7 @@
>   FTRACE_EVENTS() \
>   TRACE_SYSCALLS()\
>   KPROBE_BLACKLIST()  \
> + ERROR_INJECT_LIST() \
>   MEM_DISCARD(init.rodata)\
>   CLK_OF_TABLES() \
>   RESERVEDMEM_OF_TABLES() \
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index e55e4255a210..7f4d2a953173 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -576,4 +576,15 @@ extern const struct bpf_func_proto bpf_sock_map_update_proto;
>  void bpf_user_rnd_init_once(void);
>  u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>  
> +#if defined(__KERNEL__) && !defined(__ASSEMBLY__)
> +#ifdef CONFIG_BPF_KPROBE_OVERRIDE

BTW, CONFIG_BPF_KPROBE_OVERRIDE is also a confusing name.
Since this feature overrides a function so that it just returns with
some return value (as far as I understand; or do you also plan to
modify the execution path inside a function?), I think
CONFIG_BPF_FUNCTION_OVERRIDE or CONFIG_BPF_EXECUTION_OVERRIDE
would be better.

Indeed, BPF is based on kprobes, but it seems you are limiting it
to ftrace (function-call trace) (I'm not sure why), so using
"kprobes" for this feature seems strange to me.

The idea in this patch itself (marking injectable functions on
a list) is OK with me.
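
A minimal userspace sketch of that opt-in idea, under stated assumptions. The patch collects tagged addresses in a dedicated linker section; this model replaces the section with a plain array, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef int (*model_fn)(void);

/* Two hypothetical functions: only the first opts in to error injection. */
static int model_injectable(void) { return 0; }
static int model_untagged(void) { return 0; }

/*
 * Stand-in for the linker-section list: the patch gathers the addresses
 * of tagged functions between __start/__stop symbols, a plain array
 * plays that role here.
 */
static const model_fn model_error_inject_list[] = {
	model_injectable,
};

/* Returns true only for functions that were explicitly tagged. */
static bool model_within_error_inject_list(model_fn fn)
{
	size_t n = sizeof(model_error_inject_list) /
		   sizeof(model_error_inject_list[0]);

	assert(fn != NULL);

	for (size_t i = 0; i < n; i++)
		if (model_error_inject_list[i] == fn)
			return true;

	return false;
}
```

A caller that wants to override a return value would consult the list first and refuse anything that is not on it, which is what makes the feature opt-in.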

Thank you,

> +#define BPF_ALLOW_ERROR_INJECTION(fname) \
> +static unsigned long __used  \
> + __attribute__((__section__("_kprobe_error_inject_list")))   \
> + _eil_addr_##fname = (unsigned long)fname;
> +#else
> +#define BPF_ALLOW_ERROR_INJECTION(fname)
> +#endif
> +#endif
> +
>  #endif /* _LINUX_BPF_H */
> diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
> index 9440a2fc8893..963fd364f3d6 100644
> --- a/include/linux/kprobes.h
> +++ b/include/linux/kprobes.h
> @@ -271,6 +271,7 @@ extern bool arch_kprobe_on_func_entry(unsigned long offset);
>  extern bool kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, unsigned long offset);
>  
>  extern bool within_kprobe_blacklist(unsigned long addr);
> +extern bool within_kprobe_error_injection_list(unsigned long addr);
>  
>  struct kprobe_insn_cache {
>   struct mutex mutex;
> diff --git a/include/linux/module.h b/include/linux/module.h
> index c69b49abe877..548fa09fa806 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -475,6 +475,11 @@ struct module {
>   ctor_fn_t *ctors;
>   unsigned int num_ctors;
>  #endif
> +
> +#ifdef CONFIG_BPF_KPROBE_OVERRIDE
> + unsigned int num_kprobe_ei_funcs;
> + unsigned long *kprobe_ei_funcs;
> +#endif
>  } cacheline_aligned __randomize_layout;
>  #ifndef MODULE_ARCH_INIT
>  #define MODULE_ARCH_INIT {}
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index da2ccf142358..b4aab48ad258 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -83,6 

Re: pull-request: bpf-next 2017-12-18

2017-12-18 Thread Alexei Starovoitov
On Mon, Dec 18, 2017 at 10:51:53AM -0500, David Miller wrote:
> From: Daniel Borkmann 
> Date: Mon, 18 Dec 2017 01:33:07 +0100
> 
> > The following pull-request contains BPF updates for your *net-next* tree.
> > 
> > The main changes are:
> > 
> > 1) Allow arbitrary function calls from one BPF function to another BPF
> >    function. As of today when writing BPF programs, __always_inline had
> >    to be used in the BPF C programs for all functions, unnecessarily
> >    causing LLVM to inflate code size. Handle this more naturally with
> >    support for BPF to BPF calls such that this __always_inline
> >    restriction can be overcome. As a result, it allows for better
> >    optimized code and finally enables the introduction of core BPF
> >    libraries in the future that can be reused across different projects.
> >    x86 and arm64 JIT support was added as well, from Alexei.
> 
> Exciting... but now there's a lot of JIT work to do.

I've looked at sparc64. It should be simpler than arm64.
My first reaction was that it would need a dumb version of
emit_loadimm64() (similar to arm's emit_addr_mov_i64), but no,
since it's not used in emit_call.
I can take a stab at it, but cannot test it. The most
time-consuming part is setting up the latest llvm on the system
to compile the *_noinline.c tests.
Note to self: I really need to make test_verifier run the tests.



[PATCH bpf 07/11] bpf: Add support for reading sk_state and more

2017-12-18 Thread Lawrence Brakmo
Add support for reading many more tcp_sock fields:

  state             same as sk->sk_state
  rtt_min           same as sk->rtt_min.s[0].v (current rtt_min)
  snd_ssthresh
  rcv_nxt
  snd_nxt
  snd_una
  mss_cache
  ecn_flags
  rate_delivered
  rate_interval_us
  packets_out
  retrans_out
  total_retrans
  segs_in
  data_segs_in
  segs_out
  data_segs_out
  bytes_received    (__u64)
  bytes_acked       (__u64)
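
Because the last two fields are __u64, the context access check in this patch has to accept 8-byte loads as well as 4-byte ones. A minimal userspace model of the size and alignment part of that check follows; the function name is hypothetical and the bounds check on the offset is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Model of the relaxed check: an access is accepted only if it is
 * naturally aligned and exactly 4 or 8 bytes wide.
 */
static bool model_is_valid_sock_ops_access(int off, int size)
{
	/* The verifier guarantees that size > 0. */
	assert(size > 0);

	if (off < 0 || off % size != 0)
		return false;

	if ((size_t)size != sizeof(uint32_t) &&
	    (size_t)size != sizeof(uint64_t))
		return false;

	return true;
}
```

An 8-byte read at offset 8 passes, while an 8-byte read at offset 4 is rejected for misalignment and a 1-byte or 2-byte read is rejected for its width.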

Signed-off-by: Lawrence Brakmo 
---
 include/uapi/linux/bpf.h | 19 ++
 net/core/filter.c        | 96 +++-
 2 files changed, 114 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1c36795..19a0b1b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -954,6 +954,25 @@ struct bpf_sock_ops {
__u32 snd_cwnd;
__u32 srtt_us;  /* Averaged RTT << 3 in usecs */
__u32 bpf_sock_ops_flags; /* flags defined in uapi/linux/tcp.h */
+   __u32 state;
+   __u32 rtt_min;
+   __u32 snd_ssthresh;
+   __u32 rcv_nxt;
+   __u32 snd_nxt;
+   __u32 snd_una;
+   __u32 mss_cache;
+   __u32 ecn_flags;
+   __u32 rate_delivered;
+   __u32 rate_interval_us;
+   __u32 packets_out;
+   __u32 retrans_out;
+   __u32 total_retrans;
+   __u32 segs_in;
+   __u32 data_segs_in;
+   __u32 segs_out;
+   __u32 data_segs_out;
+   __u64 bytes_received;
+   __u64 bytes_acked;
 };
 
 /* List of known BPF sock_ops operators.
diff --git a/net/core/filter.c b/net/core/filter.c
index 2692514..2628077 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3828,7 +3828,7 @@ static bool __is_valid_sock_ops_access(int off, int size)
/* The verifier guarantees that size > 0. */
if (off % size != 0)
return false;
-   if (size != sizeof(__u32))
+   if (size != sizeof(__u32) && size != sizeof(__u64))
return false;
 
return true;
@@ -4448,6 +4448,32 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
   is_fullsock));
break;
 
+   case offsetof(struct bpf_sock_ops, state):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_state) != 1);
+
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+   struct bpf_sock_ops_kern, sk),
+ si->dst_reg, si->src_reg,
+ offsetof(struct bpf_sock_ops_kern, sk));
+   *insn++ = BPF_LDX_MEM(BPF_B, si->dst_reg, si->dst_reg,
+ offsetof(struct sock_common, skc_state));
+   break;
+
+   case offsetof(struct bpf_sock_ops, rtt_min):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, rtt_min) !=
+sizeof(struct minmax));
+   BUILD_BUG_ON(sizeof(struct minmax) <
+sizeof(struct minmax_sample));
+
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+   struct bpf_sock_ops_kern, sk),
+ si->dst_reg, si->src_reg,
+ offsetof(struct bpf_sock_ops_kern, sk));
+   *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
+ offsetof(struct tcp_sock, rtt_min) +
+ FIELD_SIZEOF(struct minmax_sample, t));
+   break;
+
 /* Helper macro for adding read access to tcp_sock or sock fields. */
 #define SOCK_OPS_GET_FIELD(FIELD_NAME, OBJ)  \
do {  \
@@ -4528,6 +4554,74 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
SOCK_OPS_GET_OR_SET_FIELD(bpf_sock_ops_flags, struct tcp_sock,
  type);
break;
+
+   case offsetof(struct bpf_sock_ops, snd_ssthresh):
+   SOCK_OPS_GET_FIELD(snd_ssthresh, struct tcp_sock);
+   break;
+
+   case offsetof(struct bpf_sock_ops, rcv_nxt):
+   SOCK_OPS_GET_FIELD(rcv_nxt, struct tcp_sock);
+   break;
+
+   case offsetof(struct bpf_sock_ops, snd_nxt):
+   SOCK_OPS_GET_FIELD(snd_nxt, struct tcp_sock);
+   break;
+
+   case offsetof(struct bpf_sock_ops, snd_una):
+   SOCK_OPS_GET_FIELD(snd_una, struct tcp_sock);
+   break;
+
+   case offsetof(struct bpf_sock_ops, mss_cache):
+   SOCK_OPS_GET_FIELD(mss_cache, struct tcp_sock);
+   break;
+
+   case offsetof(struct bpf_sock_ops, ecn_flags):
+   SOCK_OPS_GET_FIELD(ecn_flags, struct tcp_sock);
+   break;
+
+   case offsetof(struct bpf_sock_ops, rate_delivered):
+   

[PATCH bpf 08/11] bpf: Add sock_ops R/W access to tclass & sk_txhash

2017-12-18 Thread Lawrence Brakmo
Adds direct R/W access to sk_txhash and access to tclass for ipv6 flows
through getsockopt and setsockopt. Sample usage for tclass:

  bpf_getsockopt(skops, SOL_IPV6, IPV6_TCLASS, &v, sizeof(v))

where skops is a pointer to the ctx (struct bpf_sock_ops).
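
On the setsockopt side, the patch accepts a traffic class from -1 to 0xff and treats -1 as 0. A minimal userspace model of that validation is sketched here; struct model_ipv6_pinfo and model_set_tclass() are hypothetical stand-ins for struct ipv6_pinfo and the IPV6_TCLASS case.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the tclass member of struct ipv6_pinfo. */
struct model_ipv6_pinfo {
	unsigned char tclass;
};

/*
 * Model of the IPV6_TCLASS case: reject values outside -1..0xff,
 * map -1 to 0, and store everything else unchanged.
 */
static int model_set_tclass(struct model_ipv6_pinfo *np, int val)
{
	assert(np != NULL);

	if (val < -1 || val > 0xff)
		return -EINVAL;

	if (val == -1)
		val = 0;

	np->tclass = (unsigned char)val;
	return 0;
}
```

Out-of-range values leave the stored class untouched and return -EINVAL in this model.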

Signed-off-by: Lawrence Brakmo 
---
 include/uapi/linux/bpf.h |  1 +
 net/core/filter.c        | 47 ++-
 2 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 19a0b1b..fe2b692 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -973,6 +973,7 @@ struct bpf_sock_ops {
__u32 data_segs_out;
__u64 bytes_received;
__u64 bytes_acked;
+   __u32 sk_txhash;
 };
 
 /* List of known BPF sock_ops operators.
diff --git a/net/core/filter.c b/net/core/filter.c
index 2628077..5cb2b70 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3229,6 +3229,29 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, 
bpf_sock,
ret = -EINVAL;
}
 #ifdef CONFIG_INET
+#if IS_ENABLED(CONFIG_IPV6)
+   } else if (level == SOL_IPV6) {
+   if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
+   return -EINVAL;
+
+   val = *((int *)optval);
+   /* Only some options are supported */
+   switch (optname) {
+   case IPV6_TCLASS:
+   if (val < -1 || val > 0xff) {
+   ret = -EINVAL;
+   } else {
+   struct ipv6_pinfo *np = inet6_sk(sk);
+
+   if (val == -1)
+   val = 0;
+   np->tclass = val;
+   }
+   break;
+   default:
+   ret = -EINVAL;
+   }
+#endif
} else if (level == SOL_TCP &&
   sk->sk_prot->setsockopt == tcp_setsockopt) {
if (optname == TCP_CONGESTION) {
@@ -3238,7 +3261,8 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, 
bpf_sock,
strncpy(name, optval, min_t(long, optlen,
TCP_CA_NAME_MAX-1));
name[TCP_CA_NAME_MAX-1] = 0;
-   ret = tcp_set_congestion_control(sk, name, false, 
reinit);
+   ret = tcp_set_congestion_control(sk, name, false,
+reinit);
} else {
struct tcp_sock *tp = tcp_sk(sk);
 
@@ -3304,6 +3328,22 @@ BPF_CALL_5(bpf_getsockopt, struct bpf_sock_ops_kern *, 
bpf_sock,
} else {
goto err_clear;
}
+#if IS_ENABLED(CONFIG_IPV6)
+   } else if (level == SOL_IPV6) {
+   struct ipv6_pinfo *np = inet6_sk(sk);
+
+   if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
+   goto err_clear;
+
+   /* Only some options are supported */
+   switch (optname) {
+   case IPV6_TCLASS:
+   *((int *)optval) = (int)np->tclass;
+   break;
+   default:
+   goto err_clear;
+   }
+#endif
} else {
goto err_clear;
}
@@ -3843,6 +3883,7 @@ static bool sock_ops_is_valid_access(int off, int size,
case offsetof(struct bpf_sock_ops, op) ...
 offsetof(struct bpf_sock_ops, replylong[3]):
case offsetof(struct bpf_sock_ops, bpf_sock_ops_flags):
+   case offsetof(struct bpf_sock_ops, sk_txhash):
break;
default:
return false;
@@ -4622,6 +4663,10 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
case offsetof(struct bpf_sock_ops, bytes_acked):
SOCK_OPS_GET_FIELD(bytes_acked, struct tcp_sock);
break;
+
+   case offsetof(struct bpf_sock_ops, sk_txhash):
+   SOCK_OPS_GET_OR_SET_FIELD(sk_txhash, struct sock, type);
+   break;
}
return insn - insn_buf;
 }
-- 
2.9.5



[PATCH bpf 10/11] bpf: Add BPF_SOCK_OPS_STATE_CB

2017-12-18 Thread Lawrence Brakmo
Adds support for calling sock_ops BPF program when there is a TCP state
change. Two arguments are used; one for the old state and another for
the new state.

New op: BPF_SOCK_OPS_STATE_CB.

Signed-off-by: Lawrence Brakmo 
---
 include/uapi/linux/bpf.h | 4 
 include/uapi/linux/tcp.h | 1 +
 net/ipv4/tcp.c   | 2 ++
 3 files changed, 7 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7165619..b018d6f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1018,6 +1018,10 @@ enum {
 * Arg1: sequence number of 1st byte
 * Arg2: # segments
 */
+   BPF_SOCK_OPS_STATE_CB,  /* Called when TCP changes state.
+* Arg1: old_state
+* Arg2: new_state
+*/
 };
 
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index dc36d3c..211322c 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -262,6 +262,7 @@ struct tcp_md5sig {
 /* Definitions for bpf_sock_ops_flags */
 #define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
 #define BPF_SOCK_OPS_RETRANS_CB_FLAG   (1<<1)
+#define BPF_SOCK_OPS_STATE_CB_FLAG (1<<2)
 
 /* INET_DIAG_MD5SIG */
 struct tcp_diag_md5sig {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 817df3f..e70dd2f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2041,6 +2041,8 @@ void tcp_set_state(struct sock *sk, int state)
int oldstate = sk->sk_state;
 
trace_tcp_set_state(sk, oldstate, state);
+   if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_STATE_CB_FLAG))
+   tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_STATE_CB, oldstate, state);
 
switch (state) {
case TCP_ESTABLISHED:
-- 
2.9.5



[PATCH bpf 09/11] bpf: Add BPF_SOCK_OPS_RETRANS_CB

2017-12-18 Thread Lawrence Brakmo
Adds support for calling sock_ops BPF program when there is a
retransmission. Two arguments are used: one for the sequence number and
another for the number of segments retransmitted. Does not include syn-ack
retransmissions.
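
As a rough sketch of how a consumer might use the two arguments, the
user-space model below accumulates them per connection. The names
retrans_stats and on_retrans() are hypothetical; only the meaning of
args[0] and args[1] is taken from the enum comment in this patch.

```c
#include <stdint.h>

/* Hypothetical per-connection counters kept by a callback consumer. */
struct retrans_stats {
	uint32_t events;    /* number of callbacks seen */
	uint32_t segs;      /* total segments reported as retransmitted */
	uint32_t last_seq;  /* sequence number from the latest callback */
};

/*
 * args[0] is the sequence number of the first byte and args[1] the
 * number of segments, as documented for BPF_SOCK_OPS_RETRANS_CB.
 */
static void on_retrans(struct retrans_stats *st, const uint32_t args[4])
{
	st->events++;
	st->segs += args[1];
	st->last_seq = args[0];
}
```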

New op: BPF_SOCK_OPS_RETRANS_CB.

Signed-off-by: Lawrence Brakmo 
---
 include/uapi/linux/bpf.h | 4 
 include/uapi/linux/tcp.h | 1 +
 net/ipv4/tcp_output.c| 3 +++
 3 files changed, 8 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index fe2b692..7165619 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1014,6 +1014,10 @@ enum {
 * Arg2: value of icsk_rto
 * Arg3: whether RTO has expired
 */
+   BPF_SOCK_OPS_RETRANS_CB,/* Called when skb is retransmitted.
+* Arg1: sequence number of 1st byte
+* Arg2: # segments
+*/
 };
 
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 089c19e..dc36d3c 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -261,6 +261,7 @@ struct tcp_md5sig {
 
 /* Definitions for bpf_sock_ops_flags */
 #define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
+#define BPF_SOCK_OPS_RETRANS_CB_FLAG   (1<<1)
 
 /* INET_DIAG_MD5SIG */
 struct tcp_diag_md5sig {
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 50cb242..b8ad088 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2910,6 +2910,9 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff 
*skb, int segs)
if (likely(!err)) {
TCP_SKB_CB(skb)->sacked |= TCPCB_EVER_RETRANS;
trace_tcp_retransmit_skb(sk, skb);
+   if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RETRANS_CB_FLAG))
+   tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_RETRANS_CB,
+ TCP_SKB_CB(skb)->seq, segs);
} else if (err != -EBUSY) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL);
}
-- 
2.9.5



[PATCH bpf 0/11] bpf: more sock_ops callbacks

2017-12-18 Thread Lawrence Brakmo
This patchset adds support for:

- direct R or R/W access to many tcp_sock fields
- passing up to 4 arguments to sock_ops BPF functions
- tcp_sock field bpf_sock_ops_flags for controlling callbacks
- optionally calling sock_ops BPF program when RTO fires
- optionally calling sock_ops BPF program when packet is retransmitted
- optionally calling sock_ops BPF program when TCP state changes
- access to tclass and sk_txhash
- new selftest

Signed-off-by: Lawrence Brakmo 

Consists of the following patches:
[PATCH bpf 01/11] bpf: Make SOCK_OPS_GET_TCP size independent
[PATCH bpf 02/11] bpf: Make SOCK_OPS_GET_TCP struct independent
[PATCH bpf 03/11] bpf: Add write access to tcp_sock and sock fields
[PATCH bpf 04/11] bpf: Support passing args to sock_ops bpf function
[PATCH bpf 05/11] bpf: Adds field bpf_sock_ops_flags to tcp_sock
[PATCH bpf 06/11] bpf: Add sock_ops RTO callback
[PATCH bpf 07/11] bpf: Add support for reading sk_state and more
[PATCH bpf 08/11] bpf: Add sock_ops R/W access to tclass & sk_txhash
[PATCH bpf 09/11] bpf: Add BPF_SOCK_OPS_RETRANS_CB
[PATCH bpf 10/11] bpf: Add BPF_SOCK_OPS_STATE_CB
[PATCH bpf 11/11] bpf: add selftest for tcpbpf

include/linux/filter.h |   4 +
include/linux/tcp.h|   8 ++
include/net/tcp.h  |  66 +-
include/uapi/linux/bpf.h   |  39 +-
include/uapi/linux/tcp.h   |   5 +
net/core/filter.c  | 212 
++--
net/ipv4/tcp.c |   4 +-
net/ipv4/tcp_nv.c  |   2 +-
net/ipv4/tcp_output.c  |   5 +-
net/ipv4/tcp_timer.c   |   9 ++
tools/include/uapi/linux/bpf.h |  45 ++-
tools/testing/selftests/bpf/Makefile   |   5 +-
tools/testing/selftests/bpf/tcp_client.py  |  57 +
tools/testing/selftests/bpf/tcp_server.py  |  83 +
tools/testing/selftests/bpf/test_tcpbpf_kern.c | 133 
tools/testing/selftests/bpf/test_tcpbpf_user.c | 119 ++
16 files changed, 772 insertions(+), 24 deletions(-)



[PATCH bpf 05/11] bpf: Adds field bpf_sock_ops_flags to tcp_sock

2017-12-18 Thread Lawrence Brakmo
Adds field bpf_sock_ops_flags to tcp_sock and bpf_sock_ops. Its primary
use is to determine if there should be calls to sock_ops bpf program at
various points in the TCP code. The field is initialized to zero,
disabling the calls. A sock_ops BPF program can set it, per connection and
as necessary, when the connection is established.

It also adds support for reading and writing the field within a
sock_ops BPF program.

Examples of where to call the bpf program:

1) When RTO fires
2) When a packet is retransmitted
3) When the connection terminates
4) When a packet is sent
5) When a packet is received
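
A minimal user-space sketch of the flag test follows. It assumes the bit
values used elsewhere in this series for the RTO and retransmit flags;
tp_model, TEST_FLAG and enable_cb() are hypothetical names. The macro is
written fully parenthesized, which is a deliberate difference from the
one in the patch: with an unparenthesized ARG, a compound argument such
as A | B would bind as (flags & A) | B.

```c
#include <stdint.h>

/* Same bit positions as the RTO and RETRANS flags in this series. */
#define RTO_CB_FLAG     (1u << 0)
#define RETRANS_CB_FLAG (1u << 1)

/* Hypothetical stand-in for the new field in struct tcp_sock. */
struct tp_model {
	uint32_t bpf_sock_ops_flags;
};

/* Parenthesized flag test: true if any bit in ARG is set. */
#define TEST_FLAG(TP, ARG) ((((TP)->bpf_sock_ops_flags) & (ARG)) != 0)

/* Enable one or more callbacks for this connection. */
static inline void enable_cb(struct tp_model *tp, uint32_t flags)
{
	tp->bpf_sock_ops_flags |= flags;
}
```

With the field left at zero every test is false, so no callback is made
until a program opts in for that connection.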

Signed-off-by: Lawrence Brakmo 
---
 include/linux/tcp.h  | 8 
 include/uapi/linux/bpf.h | 1 +
 net/core/filter.c| 6 ++
 3 files changed, 15 insertions(+)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index df5d97a..c46553f 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -372,6 +372,14 @@ struct tcp_sock {
 */
struct request_sock *fastopen_rsk;
u32 *saved_syn;
+
+/* Sock_ops bpf program related variables */
+#ifdef CONFIG_BPF
+   u32 bpf_sock_ops_flags; /* values defined in uapi/linux/tcp.h */
+#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_flags & ARG)
+#else
+#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
+#endif
 };
 
 enum tsq_enum {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index addd849..dfbf43a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -953,6 +953,7 @@ struct bpf_sock_ops {
 */
__u32 snd_cwnd;
__u32 srtt_us;  /* Averaged RTT << 3 in usecs */
+   __u32 bpf_sock_ops_flags; /* flags defined in uapi/linux/tcp.h */
 };
 
 /* List of known BPF sock_ops operators.
diff --git a/net/core/filter.c b/net/core/filter.c
index 97e65df..2692514 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3842,6 +3842,7 @@ static bool sock_ops_is_valid_access(int off, int size,
switch (off) {
case offsetof(struct bpf_sock_ops, op) ...
 offsetof(struct bpf_sock_ops, replylong[3]):
+   case offsetof(struct bpf_sock_ops, bpf_sock_ops_flags):
break;
default:
return false;
@@ -4522,6 +4523,11 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
case offsetof(struct bpf_sock_ops, srtt_us):
SOCK_OPS_GET_FIELD(srtt_us, struct tcp_sock);
break;
+
+   case offsetof(struct bpf_sock_ops, bpf_sock_ops_flags):
+   SOCK_OPS_GET_OR_SET_FIELD(bpf_sock_ops_flags, struct tcp_sock,
+ type);
+   break;
}
return insn - insn_buf;
 }
-- 
2.9.5



[PATCH bpf 11/11] bpf: add selftest for tcpbpf

2017-12-18 Thread Lawrence Brakmo
Added a selftest for tcpbpf (sock_ops) that checks that the appropriate
callbacks occurred, that it can access tcp_sock fields, and that their
values are correct.

Signed-off-by: Lawrence Brakmo 
---
 tools/include/uapi/linux/bpf.h |  45 -
 tools/testing/selftests/bpf/Makefile   |   5 +-
 tools/testing/selftests/bpf/tcp_client.py  |  57 +++
 tools/testing/selftests/bpf/tcp_server.py  |  83 +++
 tools/testing/selftests/bpf/test_tcpbpf_kern.c | 133 +
 tools/testing/selftests/bpf/test_tcpbpf_user.c | 119 ++
 6 files changed, 438 insertions(+), 4 deletions(-)
 create mode 100755 tools/testing/selftests/bpf/tcp_client.py
 create mode 100755 tools/testing/selftests/bpf/tcp_server.py
 create mode 100644 tools/testing/selftests/bpf/test_tcpbpf_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_tcpbpf_user.c

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index cf446c2..b018d6f 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -936,8 +936,9 @@ struct bpf_map_info {
 struct bpf_sock_ops {
__u32 op;
union {
-   __u32 reply;
-   __u32 replylong[4];
+   __u32 args[4];  /* Optionally passed to bpf program */
+   __u32 reply;/* Returned by bpf program  */
+   __u32 replylong[4]; /* Optionally returned by bpf prog  */
};
__u32 family;
__u32 remote_ip4;   /* Stored in network byte order */
@@ -946,6 +947,33 @@ struct bpf_sock_ops {
__u32 local_ip6[4]; /* Stored in network byte order */
__u32 remote_port;  /* Stored in network byte order */
__u32 local_port;   /* stored in host byte order */
+   __u32 is_fullsock;  /* Some TCP fields are only valid if
+* there is a full socket. If not, the
+* fields read as zero.
+*/
+   __u32 snd_cwnd;
+   __u32 srtt_us;  /* Averaged RTT << 3 in usecs */
+   __u32 bpf_sock_ops_flags; /* flags defined in uapi/linux/tcp.h */
+   __u32 state;
+   __u32 rtt_min;
+   __u32 snd_ssthresh;
+   __u32 rcv_nxt;
+   __u32 snd_nxt;
+   __u32 snd_una;
+   __u32 mss_cache;
+   __u32 ecn_flags;
+   __u32 rate_delivered;
+   __u32 rate_interval_us;
+   __u32 packets_out;
+   __u32 retrans_out;
+   __u32 total_retrans;
+   __u32 segs_in;
+   __u32 data_segs_in;
+   __u32 segs_out;
+   __u32 data_segs_out;
+   __u64 bytes_received;
+   __u64 bytes_acked;
+   __u32 sk_txhash;
 };
 
 /* List of known BPF sock_ops operators.
@@ -981,6 +1009,19 @@ enum {
 * a congestion threshold. RTTs above
 * this indicate congestion
 */
+   BPF_SOCK_OPS_RTO_CB,/* Called when an RTO has triggered.
+* Arg1: value of icsk_retransmits
+* Arg2: value of icsk_rto
+* Arg3: whether RTO has expired
+*/
+   BPF_SOCK_OPS_RETRANS_CB,/* Called when skb is retransmitted.
+* Arg1: sequence number of 1st byte
+* Arg2: # segments
+*/
+   BPF_SOCK_OPS_STATE_CB,  /* Called when TCP changes state.
+* Arg1: old_state
+* Arg2: new_state
+*/
 };
 
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 255fb1f..f3632b2 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -13,11 +13,12 @@ CFLAGS += -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(GENDIR) 
$(GENFLAGS) -I../../../i
 LDLIBS += -lcap -lelf
 
 TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map 
test_progs \
-   test_align test_verifier_log test_dev_cgroup
+   test_align test_verifier_log test_dev_cgroup test_tcpbpf_user
 
 TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o 
test_obj_id.o \
test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o 
sockmap_parse_prog.o \
-   sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o
+   sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \
+   test_tcpbpf_kern.o
 
 TEST_PROGS := test_kmod.sh test_xdp_redirect.sh test_xdp_meta.sh \
test_offload.py
diff --git 

[PATCH bpf 03/11] bpf: Add write access to tcp_sock and sock fields

2017-12-18 Thread Lawrence Brakmo
This patch adds a macro, SOCK_OPS_SET_FIELD, for writing to
struct tcp_sock or struct sock fields. This required adding a new
field "temp" to struct bpf_sock_ops_kern for temporary storage that
is used by sock_ops_convert_ctx_access. It is used to store and recover
the contents of a register, so the register can be used to store the
address of the sk. Since we cannot overwrite the dst_reg because it
contains the pointer to ctx, nor the src_reg since it contains the value
we want to store, we need an extra register to contain the address
of the sk.

Also adds the macro SOCK_OPS_GET_OR_SET_FIELD that calls one of the
GET or SET macros depending on the value of the TYPE field.
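
The way the extra register is chosen can be sketched in isolation. The
function below is a user-space model, assuming the usual eBPF register
numbering (R0 to R10); pick_scratch_reg() is a hypothetical name, and the
loop is the same one SOCK_OPS_SET_FIELD uses.

```c
/* Register numbering assumed to follow the eBPF convention R0..R10. */
enum { REG_9 = 9 };

/*
 * Start at R9 and step down until the candidate differs from both
 * dst_reg (pointer to ctx) and src_reg (value to store). At most two
 * steps are ever needed because only two registers are excluded.
 */
static int pick_scratch_reg(int dst_reg, int src_reg)
{
	int reg = REG_9;

	while (dst_reg == reg || src_reg == reg)
		reg--;
	return reg;
}
```

The chosen register is then saved to the temp field, used to hold the sk
pointer, and restored afterwards, so the program never sees it change.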

Signed-off-by: Lawrence Brakmo 
---
 include/linux/filter.h |  3 +++
 include/net/tcp.h  |  2 +-
 net/core/filter.c  | 46 ++
 3 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 5feb441..8929162 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -987,6 +987,9 @@ struct bpf_sock_ops_kern {
u32 replylong[4];
};
u32 is_fullsock;
+   u64 temp;   /* Used by sock_ops_convert_ctx_access
+* as temporary storage of a register
+*/
 };
 
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6cc205c..e0213f1 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2011,7 +2011,7 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
struct bpf_sock_ops_kern sock_ops;
int ret;
 
-   memset(&sock_ops, 0, sizeof(sock_ops));
+   memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, is_fullsock));
if (sk_fullsock(sk)) {
sock_ops.is_fullsock = 1;
sock_owned_by_me(sk);
diff --git a/net/core/filter.c b/net/core/filter.c
index f808269..97e65df 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4469,6 +4469,52 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
  offsetof(OBJ, FIELD_NAME)); \
} while (0)
 
+/* Helper macro for adding write access to tcp_sock or sock fields.
+ * The macro is called with two registers, dst_reg which contains a pointer
+ * to ctx (context) and src_reg which contains the value that should be
+ * stored. However, we need an additional register since we cannot overwrite
+ * dst_reg because it may be used later in the program.
+ * Instead we "borrow" one of the other registers. We first save its value
+ * into a new (temp) field in bpf_sock_ops_kern, use it, and then restore
+ * it at the end of the macro.
+ */
+#define SOCK_OPS_SET_FIELD(FIELD_NAME, OBJ)  \
+   do {  \
+   int reg = BPF_REG_9;  \
+   BUILD_BUG_ON(FIELD_SIZEOF(OBJ, FIELD_NAME) >  \
+FIELD_SIZEOF(struct bpf_sock_ops, FIELD_NAME));  \
+   while (si->dst_reg == reg || si->src_reg == reg)  \
+   reg--;\
+   *insn++ = BPF_STX_MEM(BPF_DW, si->dst_reg, reg,   \
+ offsetof(struct bpf_sock_ops_kern,  \
+  temp));\
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
+   struct bpf_sock_ops_kern, \
+   is_fullsock), \
+ reg, si->dst_reg,   \
+ offsetof(struct bpf_sock_ops_kern,  \
+  is_fullsock)); \
+   *insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2);\
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
+   struct bpf_sock_ops_kern, sk),\
+ reg, si->dst_reg,   \
+ offsetof(struct bpf_sock_ops_kern, sk));\
+   *insn++ = BPF_STX_MEM(BPF_FIELD_SIZEOF(OBJ, FIELD_NAME),  \
+ reg, si->src_reg,   \
+ offsetof(OBJ, FIELD_NAME)); \
+   *insn++ = BPF_LDX_MEM(BPF_DW, reg, si->dst_reg,   \
+ offsetof(struct bpf_sock_ops_kern,  \
+  temp));\
+   } while (0)
+
+#define 

[PATCH bpf 04/11] bpf: Support passing args to sock_ops bpf function

2017-12-18 Thread Lawrence Brakmo
Adds support for passing up to 4 arguments to sock_ops bpf functions. It
reuses the reply union, so the bpf_sock_ops structures are not
increased in size.
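
A small user-space model of the union reuse is sketched below.
sock_ops_model and fill_args() are hypothetical names; the layout follows
the union added by this patch. The clamp on nargs is an addition for the
sketch and is not part of the patch.

```c
#include <stdint.h>
#include <string.h>

#define MAX_ARGS 4

/* Hypothetical model of the op field plus the shared args/reply union. */
struct sock_ops_model {
	uint32_t op;
	union {
		uint32_t args[MAX_ARGS];      /* passed to the program */
		uint32_t reply;               /* returned by the program */
		uint32_t replylong[MAX_ARGS]; /* optionally returned */
	};
};

/* Zero the structure, then copy in up to MAX_ARGS arguments. */
static void fill_args(struct sock_ops_model *ops, uint32_t op,
		      uint32_t nargs, const uint32_t *args)
{
	memset(ops, 0, sizeof(*ops));
	ops->op = op;
	if (nargs > MAX_ARGS)	/* not in the patch: guard for the sketch */
		nargs = MAX_ARGS;
	if (nargs > 0)
		memcpy(ops->args, args, nargs * sizeof(uint32_t));
}
```

Because args, reply and replylong share storage, the structure stays at
five 32-bit words, and a program that writes reply overwrites args[0].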

Signed-off-by: Lawrence Brakmo 
---
 include/linux/filter.h   |  1 +
 include/net/tcp.h| 64 
 include/uapi/linux/bpf.h |  5 ++--
 net/ipv4/tcp.c   |  2 +-
 net/ipv4/tcp_nv.c|  2 +-
 net/ipv4/tcp_output.c|  2 +-
 6 files changed, 66 insertions(+), 10 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 8929162..2a09f27 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -983,6 +983,7 @@ struct bpf_sock_ops_kern {
struct  sock *sk;
u32 op;
union {
+   u32 args[4];
u32 reply;
u32 replylong[4];
};
diff --git a/include/net/tcp.h b/include/net/tcp.h
index e0213f1..c262be6 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2006,7 +2006,7 @@ void tcp_cleanup_ulp(struct sock *sk);
  * program loaded).
  */
 #ifdef CONFIG_BPF
-static inline int tcp_call_bpf(struct sock *sk, int op)
+static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 {
struct bpf_sock_ops_kern sock_ops;
int ret;
@@ -2019,6 +2019,8 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
 
sock_ops.sk = sk;
sock_ops.op = op;
+   if (nargs > 0)
+   memcpy(sock_ops.args, args, nargs*sizeof(u32));
 
ret = BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
if (ret == 0)
@@ -2027,18 +2029,70 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
ret = -1;
return ret;
 }
+
+static inline int tcp_call_bpf_1arg(struct sock *sk, int op, u32 arg)
+{
+   return tcp_call_bpf(sk, op, 1, &arg);
+}
+
+static inline int tcp_call_bpf_2arg(struct sock *sk, int op, u32 arg1, u32 
arg2)
+{
+   u32 args[2] = {arg1, arg2};
+
+   return tcp_call_bpf(sk, op, 2, args);
+}
+
+static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, u32 
arg2,
+   u32 arg3)
+{
+   u32 args[3] = {arg1, arg2, arg3};
+
+   return tcp_call_bpf(sk, op, 3, args);
+}
+
+static inline int tcp_call_bpf_4arg(struct sock *sk, int op, u32 arg1, u32 
arg2,
+   u32 arg3, u32 arg4)
+{
+   u32 args[4] = {arg1, arg2, arg3, arg4};
+
+   return tcp_call_bpf(sk, op, 4, args);
+}
+
 #else
-static inline int tcp_call_bpf(struct sock *sk, int op)
+static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 {
return -EPERM;
 }
+
+static inline int tcp_call_bpf_1arg(struct sock *sk, int op, u32 arg)
+{
+   return -EPERM;
+}
+
+static inline int tcp_call_bpf_2arg(struct sock *sk, int op, u32 arg1, u32 
arg2)
+{
+   return -EPERM;
+}
+
+static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, u32 
arg2,
+   u32 arg3)
+{
+   return -EPERM;
+}
+
+static inline int tcp_call_bpf_4arg(struct sock *sk, int op, u32 arg1, u32 
arg2,
+   u32 arg3, u32 arg4)
+{
+   return -EPERM;
+}
+
 #endif
 
 static inline u32 tcp_timeout_init(struct sock *sk)
 {
int timeout;
 
-   timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT);
+   timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT, 0, NULL);
 
if (timeout <= 0)
timeout = TCP_TIMEOUT_INIT;
@@ -2049,7 +2103,7 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk)
 {
int rwnd;
 
-   rwnd = tcp_call_bpf(sk, BPF_SOCK_OPS_RWND_INIT);
+   rwnd = tcp_call_bpf(sk, BPF_SOCK_OPS_RWND_INIT, 0, NULL);
 
if (rwnd < 0)
rwnd = 0;
@@ -2058,7 +2112,7 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk)
 
 static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk)
 {
-   return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN) == 1);
+   return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN, 0, NULL) == 1);
 }
 
 #if IS_ENABLED(CONFIG_SMC)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 595bda1..addd849 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -936,8 +936,9 @@ struct bpf_map_info {
 struct bpf_sock_ops {
__u32 op;
union {
-   __u32 reply;
-   __u32 replylong[4];
+   __u32 args[4];  /* Optionally passed to bpf program */
+   __u32 reply;/* Returned by bpf program  */
+   __u32 replylong[4]; /* Optionally returned by bpf prog  */
};
__u32 family;
__u32 remote_ip4;   /* Stored in network byte order */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1803116..817df3f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -465,7 +465,7 @@ void tcp_init_transfer(struct sock *sk, int bpf_op)
tcp_mtup_init(sk);
icsk->icsk_af_ops->rebuild_header(sk);

[PATCH bpf 06/11] bpf: Add sock_ops RTO callback

2017-12-18 Thread Lawrence Brakmo
Adds an optional call to sock_ops BPF program based on whether the
BPF_SOCK_OPS_RTO_CB_FLAG is set in bpf_sock_ops_flags.
The BPF program is passed 3 arguments: icsk_retransmits, icsk_rto and
whether the RTO has expired.
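
The gating and the values handed to the callback can be modelled in user
space as follows. This is only a sketch: rto_event, rto_cb() and
write_timeout() are hypothetical names, and the two call sites in
tcp_write_timeout() are folded into one here. The argument order follows
the enum comment in the patch (icsk_retransmits, icsk_rto, expired).

```c
#include <stdint.h>

#define RTO_CB_FLAG (1u << 0)	/* same bit as BPF_SOCK_OPS_RTO_CB_FLAG */

/* Hypothetical record of what the callback last received. */
struct rto_event {
	uint32_t retransmits;
	uint32_t rto;
	uint32_t expired;
	int calls;
};

static void rto_cb(struct rto_event *ev, uint32_t retransmits,
		   uint32_t rto, uint32_t expired)
{
	ev->retransmits = retransmits;
	ev->rto = rto;
	ev->expired = expired;
	ev->calls++;
}

/*
 * Returns 1 when the connection is given up, 0 otherwise. The callback
 * only runs if the connection opted in through its flags.
 */
static int write_timeout(uint32_t flags, uint32_t retransmits,
			 uint32_t rto, int expired, struct rto_event *ev)
{
	if (flags & RTO_CB_FLAG)
		rto_cb(ev, retransmits, rto, expired ? 1 : 0);
	return expired ? 1 : 0;
}
```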

Signed-off-by: Lawrence Brakmo 
---
 include/uapi/linux/bpf.h | 5 +
 include/uapi/linux/tcp.h | 3 +++
 net/ipv4/tcp_timer.c | 9 +
 3 files changed, 17 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index dfbf43a..1c36795 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -989,6 +989,11 @@ enum {
 * a congestion threshold. RTTs above
 * this indicate congestion
 */
+   BPF_SOCK_OPS_RTO_CB,/* Called when an RTO has triggered.
+* Arg1: value of icsk_retransmits
+* Arg2: value of icsk_rto
+* Arg3: whether RTO has expired
+*/
 };
 
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index b4a4f64..089c19e 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -259,6 +259,9 @@ struct tcp_md5sig {
__u8tcpm_key[TCP_MD5SIG_MAXKEYLEN]; /* key (binary) */
 };
 
+/* Definitions for bpf_sock_ops_flags */
+#define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
+
 /* INET_DIAG_MD5SIG */
 struct tcp_diag_md5sig {
__u8tcpm_family;
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 16df6dd..e6afd93 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -230,9 +230,18 @@ static int tcp_write_timeout(struct sock *sk)
}
if (expired) {
/* Has it gone just too far? */
+   if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RTO_CB_FLAG))
+   tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RTO_CB,
+ icsk->icsk_retransmits,
+ icsk->icsk_rto, 1);
tcp_write_err(sk);
return 1;
}
+
+   if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RTO_CB_FLAG))
+   tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RTO_CB,
+ icsk->icsk_retransmits,
+ icsk->icsk_rto, 0);
return 0;
 }
 
-- 
2.9.5



[PATCH bpf 01/11] bpf: Make SOCK_OPS_GET_TCP size independent

2017-12-18 Thread Lawrence Brakmo
Make SOCK_OPS_GET_TCP helper macro size independent (previously it only
worked with 4-byte fields).
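
The relaxed size check can be illustrated outside the kernel with a
compile-time assertion. This is a sketch under assumptions: kern_model
and ctx_model are hypothetical structs standing in for struct tcp_sock
and struct bpf_sock_ops, and C11 _Static_assert stands in for
BUILD_BUG_ON.

```c
#include <stddef.h>
#include <stdint.h>

/* Size of a struct member, as the kernel's FIELD_SIZEOF computes it. */
#define FIELD_SIZEOF(t, f) (sizeof(((t *)0)->f))

/* Hypothetical kernel-side and context-side structures. */
struct kern_model { uint16_t small; uint32_t word; uint64_t wide; };
struct ctx_model  { uint32_t small; uint32_t word; uint64_t wide; };

/* Build fails if the kernel field is wider than the context field. */
#define CHECK_FIELD_FITS(F)						\
	_Static_assert(FIELD_SIZEOF(struct kern_model, F) <=		\
		       FIELD_SIZEOF(struct ctx_model, F),		\
		       "kernel field wider than context field")

CHECK_FIELD_FITS(small);	/* 2 bytes into 4: allowed */
CHECK_FIELD_FITS(word);		/* 4 bytes into 4: allowed */
CHECK_FIELD_FITS(wide);		/* 8 bytes into 8: allowed */

/* Run-time form of the same rule, convenient for testing. */
static int field_fits(size_t kern_size, size_t ctx_size)
{
	return kern_size <= ctx_size;
}
```

The old check required exactly 4 bytes; the new one only requires that
the source field is no wider than the field exposed to the program.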

Signed-off-by: Lawrence Brakmo 
---
 net/core/filter.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 754abe1..d47d126 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4448,9 +4448,10 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
break;
 
 /* Helper macro for adding read access to tcp_sock fields. */
-#define SOCK_OPS_GET_TCP32(FIELD_NAME)   \
+#define SOCK_OPS_GET_TCP(FIELD_NAME) \
do {  \
-   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) != 4); \
+   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) >  \
+FIELD_SIZEOF(struct bpf_sock_ops, FIELD_NAME));  \
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
struct bpf_sock_ops_kern, \
is_fullsock), \
@@ -4462,16 +4463,18 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
struct bpf_sock_ops_kern, sk),\
  si->dst_reg, si->src_reg,   \
  offsetof(struct bpf_sock_ops_kern, sk));\
-   *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,\
+   *insn++ = BPF_LDX_MEM(FIELD_SIZEOF(struct tcp_sock,   \
+  FIELD_NAME), si->dst_reg,  \
+ si->dst_reg,\
  offsetof(struct tcp_sock, FIELD_NAME)); \
} while (0)
 
case offsetof(struct bpf_sock_ops, snd_cwnd):
-   SOCK_OPS_GET_TCP32(snd_cwnd);
+   SOCK_OPS_GET_TCP(snd_cwnd);
break;
 
case offsetof(struct bpf_sock_ops, srtt_us):
-   SOCK_OPS_GET_TCP32(srtt_us);
+   SOCK_OPS_GET_TCP(srtt_us);
break;
}
return insn - insn_buf;
-- 
2.9.5



[PATCH bpf 02/11] bpf: Make SOCK_OPS_GET_TCP struct independent

2017-12-18 Thread Lawrence Brakmo
Changed SOCK_OPS_GET_TCP to SOCK_OPS_GET_FIELD and added a new
argument so now it can also work with struct sock fields.

Previous: SOCK_OPS_GET_TCP(FIELD_NAME)
New:  SOCK_OPS_GET_FIELD(FIELD_NAME, OBJ)

Where OBJ is either "struct tcp_sock" or "struct sock" (without the
quotation marks). Assumes FIELD_NAME is a field in both struct
bpf_sock_ops and the specified OBJ.

Signed-off-by: Lawrence Brakmo 
---
 net/core/filter.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index d47d126..f808269 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4447,10 +4447,10 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
   is_fullsock));
break;
 
-/* Helper macro for adding read access to tcp_sock fields. */
-#define SOCK_OPS_GET_TCP(FIELD_NAME) \
+/* Helper macro for adding read access to tcp_sock or sock fields. */
+#define SOCK_OPS_GET_FIELD(FIELD_NAME, OBJ)  \
do {  \
-   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) >  \
+   BUILD_BUG_ON(FIELD_SIZEOF(OBJ, FIELD_NAME) >  \
 FIELD_SIZEOF(struct bpf_sock_ops, FIELD_NAME));  \
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
struct bpf_sock_ops_kern, \
@@ -4463,18 +4463,18 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
struct bpf_sock_ops_kern, sk),\
  si->dst_reg, si->src_reg,   \
  offsetof(struct bpf_sock_ops_kern, sk));\
-   *insn++ = BPF_LDX_MEM(FIELD_SIZEOF(struct tcp_sock,   \
-  FIELD_NAME), si->dst_reg,  \
- si->dst_reg,\
- offsetof(struct tcp_sock, FIELD_NAME)); \
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(OBJ,   \
+  FIELD_NAME),   \
+ si->dst_reg, si->dst_reg,   \
+ offsetof(OBJ, FIELD_NAME)); \
} while (0)
 
case offsetof(struct bpf_sock_ops, snd_cwnd):
-   SOCK_OPS_GET_TCP(snd_cwnd);
+   SOCK_OPS_GET_FIELD(snd_cwnd, struct tcp_sock);
break;
 
case offsetof(struct bpf_sock_ops, srtt_us):
-   SOCK_OPS_GET_TCP(srtt_us);
+   SOCK_OPS_GET_FIELD(srtt_us, struct tcp_sock);
break;
}
return insn - insn_buf;
-- 
2.9.5



Re: [PATCH v2 2/3] vsprintf: print if symbol not found

2017-12-18 Thread Joe Perches
On Tue, 2017-12-19 at 14:28 +1100, Tobin C. Harding wrote:
> Depends on: commit 40eee173a35e ("kallsyms: don't leak address when
> symbol not found")
> 
> Currently vsprintf for specifiers %p[SsB] relies on the behaviour of
> kallsyms (sprint_symbol()) and prints the actual address if a symbol is
> not found. Previous patch changes this behaviour so that sprint_symbol()
> returns an error if symbol not found. With this patch in place we can
> print a sanitized message '' instead of leaking the
> address.
> 
> Print '' for printk specifier %p[sSB] if symbol look
> up fails.
> 
> Signed-off-by: Tobin C. Harding 
> ---
>  lib/vsprintf.c | 11 ---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/vsprintf.c b/lib/vsprintf.c
> index 01c3957b2de6..820ed4fe6e6c 100644
> --- a/lib/vsprintf.c
> +++ b/lib/vsprintf.c
> @@ -674,6 +674,8 @@ char *symbol_string(char *buf, char *end, void *ptr,
>   unsigned long value;
>  #ifdef CONFIG_KALLSYMS
>   char sym[KSYM_SYMBOL_LEN];
> + const char *sym_not_found = "";

This will be reinitialized on every use.

> + int ret;
>  #endif
>  
>   if (fmt[1] == 'R')
> @@ -682,11 +684,14 @@ char *symbol_string(char *buf, char *end, void *ptr,
>  
>  #ifdef CONFIG_KALLSYMS
>   if (*fmt == 'B')
> - sprint_backtrace(sym, value);
> + ret = sprint_backtrace(sym, value);
>   else if (*fmt != 'f' && *fmt != 's')
> - sprint_symbol(sym, value);
> + ret = sprint_symbol(sym, value);
>   else
> - sprint_symbol_no_offset(sym, value);
> + ret = sprint_symbol_no_offset(sym, value);
> +
> + if (ret == -1)
> + strcpy(sym, sym_not_found);


This could avoid the unnecessary strcpy if sym_not_found
was not used at all and this was used instead

if (ret == -1)
return string(buf, end, "", spec);

return string(buf, end, sym, spec);

or maybe

return string(buf, end, ret == -1 ? "" : sym, spec);

>  
>   return string(buf, end, sym, spec);
>  #else


net/wireless/certs/*.x509 binary files

2017-12-18 Thread Randy Dunlap
This is just an FYI/acknowledgment that net/wireless/certs/*.x509
binary file(s) practically kills use of Linux kernel tarballs.

Of course, someone can always enable EXPERT and CFG80211_CERTIFICATION_ONUS
and disable the REGDB kconfig symbols to get around this.
Oh, and then chmod +x tools/objtool/sync-check.sh (unrelated problem).

Then you are good to go. :)

Background:
I was getting build errors on net/wireless/shipped-certs.o and the
build log didn't help me at all. It just said something like,
Error: build failed on net/wireless/shipped-certs.o.
Even building with V=1 didn't help.
So I finally discovered the reason and worked around it.

-- 
~Randy

PS:  Yes, I know about git.


Re: r8169 regression: UDP packets dropped intermittantly

2017-12-18 Thread Jonathan Woithe
Hi again

This is a follow up to my earlier message.

On Tue, Dec 19, 2017 at 09:02:25AM +1030, Jonathan Woithe wrote:
> On Mon, Dec 18, 2017 at 02:38:53PM +0100, Holger Hoffstätte wrote:
> > Since I've seen your postings several times now with no comment or 
> > resolution
> > I've decided to try your reproducer on my own systems. In short, I cannot
> > reproduce any packet loss, despite having 2 (cheap) 1Gb switches between the
> > two machines. Both are running 4.14.7.
> 
> Thanks for trying the test program on your system.  The result indicates
> that the problem might be specific to the behaviour of a particular network
> variant of the r8169 chip.

I was able to temporarily acquire a PCIe card which uses the r8169 driver.
This allowed me to run the reproducer on the same machine with two different
r8169-based cards.  The original NIC is this:

  05:01.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 
  Gigabit Ethernet (rev 10) [10ec:8169]
  Subsystem: Netgear GA311 [1385:311a]

The PCIe card is this:

  02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B 
  PCI Express Gigabit Ethernet controller (rev 06) [10ec:8168]
  Subsystem: Realtek Semiconductor Co., Ltd. Device 0123 [10ec:0123]

The test was conducted with kernel 4.3.0 since both the 4.3.0 driver (which
triggers the fault) and the forward ported driver (which predates commit
da78dbff2e05630921c551dbbc70a4b7981a8fff) were available.  For the record,
the machine used as the slave in these tests (the one receiving the 6 byte
request and sending the 14 byte response) was using its onboard NIC:

  00:19.0 Ethernet controller: Intel Corporation 82579V Gigabit Network 
  Connection (rev 05) [8086:1503]
  Subsystem: Gigabyte Technology Co., Ltd 82579V Gigabit Network 
  Connection [1458:e000]

Test outcomes were as follows:

  PCIe card, unpatched 4.3.0 r8169 driver: no error (tested for 1 hour)
  PCIe card, forward ported r8169 driver:  no error (tested for 1 hour)

  GA311 card, unpatched 4.3.0 r8169 driver: test fail in under 4 minutes
  GA311 card, forward ported r8169 driver:  no error (tested for 1 hour)

For completeness, I then booted 4.14 and repeated the test with its r8169
driver.  The PCIe card ran for an hour without triggering the error, while
the GA311 triggered it quickly (in under 3 minutes).

This clearly indicates that not every card using the r8169 driver is
vulnerable to the problem.  It also explains why Holger was unable to
reproduce the result on his system: the PCIe cards do not appear to suffer
from the problem.  Most likely the PCI RTL-8169 chip is affected, but newer
PCIe variations are not.  However, obviously more testing will be required
with a wider variety of cards if this inference is to hold up.

The above result (and those from Holger) allow the problem description to be
refined a little: changes in commit da78dbff2e05630921c551dbbc70a4b7981a8fff
cause GA311 NICs (and possibly other PCI cards using an RTL-8169) to have
trouble with small UDP packets, while PCIe variants are seemingly
unaffected.

Does this help?

Regards
  jonathan


[PATCH V4 14/26] pch_gbe: deprecate pci_get_bus_and_slot()

2017-12-18 Thread Sinan Kaya
pci_get_bus_and_slot() is restrictive in that it assumes the PCI device is
present in domain 0. This prevents device drivers from being reused for
other domain numbers.

Getting ready to remove pci_get_bus_and_slot() function in favor of
pci_get_domain_bus_and_slot().

Use the domain information from pdev while calling into
pci_get_domain_bus_and_slot() function.

Signed-off-by: Sinan Kaya 
---
 drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 40e52ff..7cd4946 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -2594,8 +2594,10 @@ static int pch_gbe_probe(struct pci_dev *pdev,
if (adapter->pdata && adapter->pdata->platform_init)
adapter->pdata->platform_init(pdev);
 
-   adapter->ptp_pdev = pci_get_bus_and_slot(adapter->pdev->bus->number,
-  PCI_DEVFN(12, 4));
+   adapter->ptp_pdev =
+   pci_get_domain_bus_and_slot(pci_domain_nr(adapter->pdev->bus),
+   adapter->pdev->bus->number,
+   PCI_DEVFN(12, 4));
 
netdev->netdev_ops = &pch_gbe_netdev_ops;
netdev->watchdog_timeo = PCH_GBE_WATCHDOG_PERIOD;
-- 
1.9.1
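
The effect of the hard-coded domain described in the commit message can be sketched in user space. This is only an illustration: struct mock_pci_dev, the device table and both mock_* lookup functions are hypothetical stand-ins, not kernel code. Only the PCI_DEVFN() bit layout follows the kernel macro.

```c
#include <assert.h>
#include <stddef.h>

/* Same bit layout as the kernel's PCI_DEVFN(): 5-bit slot, 3-bit function. */
#define PCI_DEVFN(slot, func) ((((slot) & 0x1f) << 3) | ((func) & 0x07))

/* Hypothetical stand-in for struct pci_dev; not the kernel structure. */
struct mock_pci_dev {
	int domain;
	unsigned int bus;
	unsigned int devfn;
};

/* Two devices at the same bus/devfn, distinguished only by their domain. */
static const struct mock_pci_dev mock_devices[] = {
	{ 0, 2, PCI_DEVFN(12, 4) },
	{ 1, 2, PCI_DEVFN(12, 4) },
};

/* Mock of a domain-aware lookup: all three keys must match. */
static const struct mock_pci_dev *
mock_get_domain_bus_and_slot(int domain, unsigned int bus, unsigned int devfn)
{
	size_t i;

	assert(devfn <= 0xff);	/* devfn is an 8-bit quantity */
	for (i = 0; i < sizeof(mock_devices) / sizeof(mock_devices[0]); i++) {
		const struct mock_pci_dev *d = &mock_devices[i];

		if (d->domain == domain && d->bus == bus && d->devfn == devfn)
			return d;
	}
	return NULL;
}

/* Mock of the legacy lookup: the domain is silently fixed to 0. */
static const struct mock_pci_dev *
mock_get_bus_and_slot(unsigned int bus, unsigned int devfn)
{
	return mock_get_domain_bus_and_slot(0, bus, devfn);
}
```

With this table, the legacy-style lookup can only ever return the domain 0 entry, while passing the caller's own domain reaches the device that shares its domain.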



[PATCH V4 13/26] bnx2x: deprecate pci_get_bus_and_slot()

2017-12-18 Thread Sinan Kaya
pci_get_bus_and_slot() is restrictive in that it assumes the PCI device is
present in domain 0. This prevents device drivers from being reused for
other domain numbers.

Getting ready to remove pci_get_bus_and_slot() function in favor of
pci_get_domain_bus_and_slot().

Introduce bnx2x_vf_domain() function to extract the domain information
and save it to VF specific data structure.

Use the saved domain value while calling pci_get_domain_bus_and_slot().

Signed-off-by: Sinan Kaya 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c | 10 +++++++++-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h |  1 +
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
index 3591077..ffa7959 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
@@ -812,7 +812,7 @@ static u8 bnx2x_vf_is_pcie_pending(struct bnx2x *bp, u8 abs_vfid)
if (!vf)
return false;
 
-   dev = pci_get_bus_and_slot(vf->bus, vf->devfn);
+   dev = pci_get_domain_bus_and_slot(vf->domain, vf->bus, vf->devfn);
if (dev)
return bnx2x_is_pcie_pending(dev);
return false;
@@ -1041,6 +1041,13 @@ void bnx2x_iov_init_dmae(struct bnx2x *bp)
REG_WR(bp, DMAE_REG_BACKWARD_COMP_EN, 0);
 }
 
+static int bnx2x_vf_domain(struct bnx2x *bp, int vfid)
+{
+   struct pci_dev *dev = bp->pdev;
+
+   return pci_domain_nr(dev->bus);
+}
+
 static int bnx2x_vf_bus(struct bnx2x *bp, int vfid)
 {
struct pci_dev *dev = bp->pdev;
@@ -1606,6 +1613,7 @@ int bnx2x_iov_nic_init(struct bnx2x *bp)
struct bnx2x_virtf *vf = BP_VF(bp, vfid);
 
/* fill in the BDF and bars */
+   vf->domain = bnx2x_vf_domain(bp, vfid);
vf->bus = bnx2x_vf_bus(bp, vfid);
vf->devfn = bnx2x_vf_devfn(bp, vfid);
bnx2x_vf_set_bars(bp, vf);
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h
index 53466f6..eb814c6 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h
@@ -182,6 +182,7 @@ struct bnx2x_virtf {
u32 error;  /* 0 means all's-well */
 
/* BDF */
+   unsigned int domain;
unsigned int bus;
unsigned int devfn;
 
-- 
1.9.1



Re: [PATCH v4 25/36] nds32: Miscellaneous header files

2017-12-18 Thread Greentime Hu
Hi, Arnd:

2017-12-18 19:13 GMT+08:00 Arnd Bergmann :
> On Mon, Dec 18, 2017 at 7:46 AM, Greentime Hu  wrote:
>> From: Greentime Hu 
>>
>> This patch introduces some miscellaneous header files.
>
>> +static inline void __delay(unsigned long loops)
>> +{
>> +   __asm__ __volatile__(".align 2\n"
>> +"1:\n"
>> +"\taddi\t%0, %0, -1\n"
>> +"\tbgtz\t%0, 1b\n"
>> +:"=r"(loops)
>> +:"0"(loops));
>> +}
>> +
>> +static inline void __udelay(unsigned long usecs, unsigned long lpj)
>> +{
>> +   usecs *= (unsigned long)(((0x8000000000000000ULL / (500000 / HZ)) +
>> + 0x80000000ULL) >> 32);
>> +   usecs = (unsigned long)(((unsigned long long)usecs * lpj) >> 32);
>> +   __delay(usecs);
>> +}
>
> Do you have a reliable clocksource that you can read here instead of doing the
> loop? It's generally preferred to have an accurate delay if at all possible, 
> the
> delay loop calibration is only for those architectures that don't have any
> way to observe how much time has passed accurately.
>

We currently only have atcpit100 as a clocksource, but it is an IP block of
the SoC. These delay APIs would be unavailable if we changed to another SoC
unless all of the timer drivers provided the same APIs.
Our customers may suffer if they forget to port these APIs in their
timer drivers when they first try to use nds32.
Or maybe I can use a CONFIG_USE_ACCURATE_DELAY option to keep both
implementations for these purposes?
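
For reference, the fixed-point arithmetic of the quoted __udelay() can be worked through in user space. This is a sketch under stated assumptions: the constants are taken to be 2^63, 500000 and 2^31, unsigned long is modelled as a 32-bit type, and the names udelay_multiplier() and udelay_loops() are made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed constants: 2^63, 500000 and 2^31. The result approximates
 * 2^32 * hz / 10^6, rounded to the nearest integer.
 */
static uint32_t udelay_multiplier(uint32_t hz)
{
	assert(hz > 0 && hz <= 500000);	/* keeps 500000 / hz non-zero */
	return (uint32_t)(((0x8000000000000000ULL / (500000 / hz)) +
			   0x80000000ULL) >> 32);
}

/* Number of delay-loop iterations for usecs, given loops per jiffy (lpj). */
static uint32_t udelay_loops(uint32_t usecs, uint32_t lpj, uint32_t hz)
{
	usecs *= udelay_multiplier(hz);	/* 32-bit multiply, wraps if too large */
	return (uint32_t)(((uint64_t)usecs * lpj) >> 32);
}
```

With HZ=100 the multiplier comes out as 429497, and a 1000 us delay at 500000 loops per jiffy yields 50000 loops, which matches usecs * lpj * HZ / 10^6.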


Re: Linux ECN Handling

2017-12-18 Thread Steve Ibanez
Hi Neal,

I started looking into this receiver ACKing issue today. Strangely,
when I tried adding printk statements at the top of the
tcp_v4_do_rcv(), tcp_rcv_established(), __tcp_ack_snd_check() and
tcp_send_delayed_ack() functions they were never executed on the
machine running the iperf3 server (i.e. the destination of the flows).
Maybe the iperf3 server is using its own TCP stack?

In any case, the ACKing problem is reproducible using just normal
iperf for which I do see my printk statements being executed. I can
now confirm that when the CWR marked packet (for which no ACK is sent)
arrives at the receiver, the __tcp_ack_snd_check() function is never
invoked; and hence neither is the tcp_send_delayed_ack() function.
Hopefully this helps narrow down where the issue might be? I started
adding some printk statements into the tcp_rcv_established() function,
but I'm not sure where the best places to look would be so I wanted to
ask your advice on this.

In case you're interested, I instrumented the __tcp_ack_snd_check()
function with the following printk statements:

@@ -5057,9 +5117,15 @@ static inline void tcp_data_snd_check(struct sock *sk)
 /*
  * Check if sending an ack is needed.
  */
-static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
+static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible,
const struct tcphdr *th)
 {
struct tcp_sock *tp = tcp_sk(sk);
+struct inet_sock *inet = inet_sk(sk);

/* More than one full frame received... */
if (((tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss &&
@@ -5071,21 +5137,31 @@ static void __tcp_ack_snd_check(struct sock
*sk, int ofo_possible)
tcp_in_quickack_mode(sk) ||
/* We have out of order data. */
(ofo_possible && !RB_EMPTY_ROOT(&tp->out_of_order_queue))) {
+// SI: Debugging TCP ECN handling
+if (sk->sk_family == AF_INET && th->cwr) {
+printk("tcp_debug: __tcp_ack_snd_check: %pI4/%u CWR set and sending ACK now - rcv_nxt=%u\n",
+ &inet->inet_daddr, ntohs(inet->inet_sport), tp->rcv_nxt);
+ }
/* Then ack it now */
tcp_send_ack(sk);
} else {
+// SI: Debugging TCP ECN handling
+if (sk->sk_family == AF_INET && th->cwr) {
+printk("tcp_debug: __tcp_ack_snd_check: %pI4/%u CWR set and sending delayed ACK - rcv_nxt=%u\n",
+ &inet->inet_daddr, ntohs(inet->inet_sport), tp->rcv_nxt);
+ }
/* Else, send delayed ack. */
-   tcp_send_delayed_ack(sk);
+   tcp_send_delayed_ack(sk, th);
}
 }

In the kernel log on the receiver, I see the following sequence of
events at a timeout:

[ 2730.145023] tcp_debug: __tcp_ack_snd_check: 10.0.0.5/916 CWR set
and sending ACK now - rcv_nxt=2317949387
[ 2730.145543] tcp_debug: __tcp_ack_snd_check: 10.0.0.5/916 CWR set
and sending ACK now - rcv_nxt=2318243331 <-- last log statement before
timeout
[ 2730.452540] tcp_debug: __tcp_ack_snd_check: 10.0.0.5/916 CWR set
and sending ACK now - rcv_nxt=2318593747
[ 2730.453137] tcp_debug: __tcp_ack_snd_check: 10.0.0.5/916 CWR set
and sending ACK now - rcv_nxt=2318813843

From the tcpdump trace at the receiver's interface I see that the last
log statement before the timeout corresponds exactly to the CWR packet
just before the unACKed CWR packet. For example, in this case, the CWR
packet to which the indicated log statement corresponds has seqNo =
2318196995 and length 46336 ==> 2318196995 + 46336 = 2318243331, which
is exactly the rcv_nxt value of the indicated log statement. Then the
unACKed CWR packet arrives and it is completely missing from the log
file; there is no indication of sending a delayed ACK either. Hence my
conclusion that the __tcp_ack_snd_check() function is never invoked by
the receiver upon receiving the unACKed CWR packet.
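
The sequence-number arithmetic above can be double-checked with a few lines of C. This is a minimal sketch: expected_rcv_nxt() and check_trace_values() are names invented here, and seq_before() mirrors the wrap-safe comparison the kernel uses for 32-bit sequence numbers.

```c
#include <assert.h>
#include <stdint.h>

/* Next expected sequence number after a segment; wraps modulo 2^32. */
static uint32_t expected_rcv_nxt(uint32_t seq, uint32_t len)
{
	return seq + len;
}

/* Wrap-safe "a comes before b" comparison on 32-bit sequence numbers. */
static int seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Values taken from the trace and kernel log quoted in this message. */
static void check_trace_values(void)
{
	assert(expected_rcv_nxt(2318196995u, 46336u) == 2318243331u);
	/* The first rcv_nxt logged after the timeout is later in the stream. */
	assert(seq_before(2318243331u, 2318593747u));
}
```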

Sorry if that was long and verbose, I just wanted to be clear on what
I had done. Please do let me know if you have any questions though.

Thanks,
-Steve


On Tue, Dec 5, 2017 at 12:04 PM, Neal Cardwell  wrote:
> On Tue, Dec 5, 2017 at 2:36 PM, Steve Ibanez  wrote:
>> Hi Neal,
>>
>> I've included a link to small trace of 13 packets which is different
>> from the screenshot I attached in my last email, but shows the same
>> sequence of events. It's a bit hard to read the tcptrace due to the
>> 300ms timeout, so I figured this was the best approach.
>>
>> slice.pcap: 
>> https://drive.google.com/open?id=1hYXbUClHGbQv1hWG1HZWDO2WYf30N6G8
>
> Thanks for the trace! Attached is a screen shot (first screen shot is
> for the arriving packets with CWR; second is after the RTO). The
> sender behavior looks reasonable. I don't see why the receiver is not
> ACKing. As you say, it does look like a receiver bug. You could try
> adding instrumentation to 

Re: [Patch net-next] net_sched: properly check for empty skb array on error path

2017-12-18 Thread John Fastabend
On 12/18/2017 08:31 PM, Cong Wang wrote:
> On Mon, Dec 18, 2017 at 7:58 PM, John Fastabend
>  wrote:
>> On 12/18/2017 06:20 PM, Cong Wang wrote:
>>> On Mon, Dec 18, 2017 at 5:25 PM, John Fastabend
>>>  wrote:
>>>> On 12/18/2017 02:34 PM, Cong Wang wrote:
>>>>> First, the check of &q->ring.queue against NULL is wrong, it
>>>>> is always false. We should check the value rather than the address.
>>>>>
>>>>
>>>> Thanks.
>>>>
>>>>> Secondly, we need the same check in pfifo_fast_reset() too,
>>>>> as both ->reset() and ->destroy() are called in qdisc_destroy().
>>>>>
>>>>
>>>> not that it hurts to have the check here, but if init fails
>>>> in qdisc_create it seems only ->destroy() is called without
>>>> a ->reset().
>>>>
>>>> Is there another path for init() to fail that I'm missing.
>>>
>>> Pretty sure ->reset() is called in qdisc_destroy() and also before
>>> ->destroy():
>>>
>>
>> Except, the failed init path does not call qdisc_destroy.
>>
>> static struct Qdisc *qdisc_create(struct net_device *dev,
>> [...]
>>
>> if (ops->init) {
>> err = ops->init(sch, tca[TCA_OPTIONS]);
>> if (err != 0)
>> goto err_out5;
>> }
>> [...]
>>
>> err_out5:
>> /* ops->init() failed, we call ->destroy() like qdisc_create_dflt() 
>> */
>> if (ops->destroy)
>> ops->destroy(sch);
> 
> Didn't I say qdisc_destroy() rather than ->destroy()? :-)
> 

Yep, thanks for the fix.

Acked-by: John Fastabend 
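
The first point of the patch description, checking the address of a member instead of its value, can be illustrated in user space. This is a sketch only: struct mock_ring and struct mock_skb_array are hypothetical stand-ins, not the kernel's ptr_ring/skb_array definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the ring structures; not the kernel types. */
struct mock_ring {
	int size;
	void **queue;	/* stays NULL until the ring has been initialised */
};

struct mock_skb_array {
	struct mock_ring ring;
};

/* Tests the ADDRESS of the member: non-NULL for any valid q. */
static int queue_address_is_set(const struct mock_skb_array *q)
{
	const void *addr;

	assert(q != NULL);
	addr = &q->ring.queue;
	return addr != NULL;
}

/* Tests the VALUE stored in the member: NULL if init never ran. */
static int queue_value_is_set(const struct mock_skb_array *q)
{
	assert(q != NULL);
	return q->ring.queue != NULL;
}
```

For an array whose init failed, the address test still reports the queue as present, so a "skip if not allocated" branch based on it never triggers. Only the value test distinguishes the two states.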


RE: [PATCH net-next] netdevsim: correctly check return value of debugfs_create_dir

2017-12-18 Thread Prashant Bhole

> From: Jakub Kicinski [mailto:jakub.kicin...@netronome.com]
> 
> On Mon, 11 Dec 2017 13:46:48 +0900, Prashant Bhole wrote:
> > > From: David Miller [mailto:da...@davemloft.net]
> > >
> > > From: Prashant Bhole 
> > > Date: Fri,  8 Dec 2017 09:52:50 +0900
> > >
> > > > Return value is now checked with IS_ERR_OR_NULL because
> > > > debugfs_create_dir doesn't return error value. It either returns
> > > > NULL or a valid pointer.
> > > >
> > > > Signed-off-by: Prashant Bhole 
> > > > ---
> > > >  drivers/net/netdevsim/netdev.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/net/netdevsim/netdev.c
> > > > b/drivers/net/netdevsim/netdev.c index eb8c679fca9f..88d8ee2c89da
> > > > 100644
> > > > --- a/drivers/net/netdevsim/netdev.c
> > > > +++ b/drivers/net/netdevsim/netdev.c
> > > > @@ -469,7 +469,7 @@ static int __init nsim_module_init(void)
> > > > int err;
> > > >
> > > > nsim_ddir = debugfs_create_dir(DRV_NAME, NULL);
> > > > -   if (IS_ERR(nsim_ddir))
> > > > +   if (IS_ERR_OR_NULL(nsim_ddir))
> > > > return PTR_ERR(nsim_ddir);
> > >
> > > debugfs_create_dir() should really be fixed, either it uses error
> > > pointers consistently and therefore always provides a suitable error
> > > code to return or it always uses NULL.
> > >
> > > This in-between behavior makes using it as an interface painful
> > > because no clear meaning is given to NULL.
> > >
> > > So please do the work necessary to make debugfs_create_dir()'s
> > > return semantics clearer and more useful.
> > >
> > > Thank you.
> >
> > Dave,
> > Thanks for comments. I will try to fix error handling in netdevsim first.
> >
> > Jakub,
> > Let's decide with an example. The typical directory structure for
> > netdevsim interface is as below:
> > /sys/kernel/debug/netdevsim/sim0/bpf_bound_progs/
> > Please let me know if you are ok with following:
> >
> > 1) If debugfs_create_dir() fails in module_init, let's keep it fatal
> > error with corrected condition:
> > +   if (IS_ERR_OR_NULL(nsim_ddir))
> > +   return -ENOMEM;
> >
> > 2) In case sim0 or bpf_bound_progs fail to be created, we need to add
> > checks before creating any file in them.
> 
> Fine with me, although if you fix DebugFS first you could use the real
> error from the start here.

Jakub, Dave,
Sorry for the late reply.
I tried to evaluate whether fixing the return value of debugfs_create_dir()
(and friends) would be useful, because it has not been changed for a very
long time. Now I am not convinced about changing this API.

The important possible error codes would be -EEXIST and -ENOMEM. Suppose
-EEXIST is returned: IMO the directory shouldn't exist in the first place
because it is specific to a particular module. Also, there is no point in
creating a file in such a directory, because the directory owner (creator)
might remove it too. This means there is little chance that the API change
will be useful. Please let me know your opinion on it.

If you are OK with the above explanation, shall I submit v2 for this patch?

-Prashant  
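
The difference between the two error returns discussed in this thread can be seen in a user-space sketch. The helpers below are re-implemented here for illustration and only mirror the semantics of the kernel's err.h macros. The two init_* functions are hypothetical and take the directory pointer as a parameter instead of calling debugfs.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* User-space re-implementation of the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095
#define IS_ERR_VALUE(x) ((unsigned long)(x) >= (unsigned long)-MAX_ERRNO)

static inline void *ERR_PTR(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return IS_ERR_VALUE((unsigned long)ptr);
}

static inline int IS_ERR_OR_NULL(const void *ptr)
{
	return !ptr || IS_ERR(ptr);
}

/* Variant from the original patch: NULL is caught, but PTR_ERR(NULL) is 0. */
static int init_returning_ptr_err(const void *ddir)
{
	if (IS_ERR_OR_NULL(ddir))
		return (int)PTR_ERR(ddir);
	return 0;
}

/* Variant proposed in the reply: every failure maps to -ENOMEM. */
static int init_returning_enomem(const void *ddir)
{
	if (IS_ERR_OR_NULL(ddir))
		return -ENOMEM;
	return 0;
}
```

When the directory pointer is NULL, the first variant returns 0 and the caller sees success, which is why a fixed error code (or a real error pointer from debugfs) is needed.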




Re: [Patch net-next] net_sched: properly check for empty skb array on error path

2017-12-18 Thread Cong Wang
On Mon, Dec 18, 2017 at 7:58 PM, John Fastabend
 wrote:
> On 12/18/2017 06:20 PM, Cong Wang wrote:
>> On Mon, Dec 18, 2017 at 5:25 PM, John Fastabend
>>  wrote:
>>> On 12/18/2017 02:34 PM, Cong Wang wrote:
>>>> First, the check of &q->ring.queue against NULL is wrong, it
>>>> is always false. We should check the value rather than the address.
>>>>
>>>
>>> Thanks.
>>>
>>>> Secondly, we need the same check in pfifo_fast_reset() too,
>>>> as both ->reset() and ->destroy() are called in qdisc_destroy().
>>>>
>>>
>>> not that it hurts to have the check here, but if init fails
>>> in qdisc_create it seems only ->destroy() is called without
>>> a ->reset().
>>>
>>> Is there another path for init() to fail that I'm missing.
>>
>> Pretty sure ->reset() is called in qdisc_destroy() and also before
>> ->destroy():
>>
>
> Except, the failed init path does not call qdisc_destroy.
>
> static struct Qdisc *qdisc_create(struct net_device *dev,
> [...]
>
> if (ops->init) {
> err = ops->init(sch, tca[TCA_OPTIONS]);
> if (err != 0)
> goto err_out5;
> }
> [...]
>
> err_out5:
> /* ops->init() failed, we call ->destroy() like qdisc_create_dflt() */
> if (ops->destroy)
> ops->destroy(sch);

Didn't I say qdisc_destroy() rather than ->destroy()? :-)


struct Qdisc *qdisc_create_dflt(struct netdev_queue *dev_queue,
const struct Qdisc_ops *ops,
unsigned int parentid)
{
struct Qdisc *sch;

if (!try_module_get(ops->owner))
return NULL;

sch = qdisc_alloc(dev_queue, ops);
if (IS_ERR(sch)) {
module_put(ops->owner);
return NULL;
}
sch->parent = parentid;

if (!ops->init || ops->init(sch, NULL) == 0)
return sch;

qdisc_destroy(sch);
return NULL;
}


Re: [PATCH 3/3] trace: print address if symbol not found

2017-12-18 Thread Tobin C. Harding
On Mon, Dec 18, 2017 at 10:37:38PM -0500, Steven Rostedt wrote:
> On Tue, 19 Dec 2017 14:00:11 +1100
> "Tobin C. Harding"  wrote:
> 
> > I ran through these as outlined here for the new version (v4). This hits
> > the modified code but doesn't test symbol look up failure.
> 
> stacktrace shouldn't post non kernel values, unless there's a frame
> pointer that isn't handled by kallsyms.
> 
> As for the other two, we could probably force a failure, like:
> 
>  # echo 'hist:keys=hrtimer.sym' > \
>  events/timer/hrtimer_start/trigger
>  # cat events/timer/hrtimer_start/hist
> 
> And then just add sym-offset too.
>
> > I also configured kernel with 'Perform a startup test on ftrace' for
> > good luck.
> > 
> > Are you happy with this level of testing?
> 
> Can you try the above.

Did both and in both cases we get the addresses as hoped :)

thanks,
Tobin.


[PATCH bpf] bpf: do not allow root to mangle valid pointers

2017-12-18 Thread Alexei Starovoitov
Do not allow root to convert valid pointers into unknown scalars.
In particular disallow:
 ptr &= reg
 ptr <<= reg
 ptr += ptr
and explicitly allow:
 ptr -= ptr
since pkt_end - pkt == length

1.
This minimizes the number of address leaks root can perform.
In the future we may need to further tighten the leaks with kptr_restrict.

2.
If a program has such pointer math it's likely a user mistake, and
when the verifier complains about it right away, instead of many instructions
later on an invalid memory access, it's easier for users to fix their progs.

3.
When a register holding a pointer cannot change to a scalar, it allows JITs to
optimize better. For example, 32-bit archs could use a single register for
pointers instead of the pair required to hold 64-bit scalars.

4.
It reduces architecture-dependent behavior, since code such as:
r1 = r10;
r1 &= 0xff;
if (r1 ...)
will behave differently on arm64 vs x64 and offloaded vs native.

A significant chunk of ptr mangling was allowed by
commit f1174f77b50c ("bpf/verifier: rework value tracking")
yet some of it was allowed even earlier.
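
As a plain C analogy (not BPF and not verifier code), the contrast between the allowed "ptr -= ptr" and the disallowed "ptr &= reg" might look like this. The function names are invented for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ptr -= ptr: the difference of two pointers into one buffer is a length. */
static ptrdiff_t packet_length(const unsigned char *pkt,
			       const unsigned char *pkt_end)
{
	assert(pkt <= pkt_end);
	return pkt_end - pkt;
}

/*
 * ptr &= 0xff: the result is a scalar made of raw address bits, so it
 * depends on where the buffer happens to live in memory.
 */
static uintptr_t low_address_bits(const void *ptr)
{
	return (uintptr_t)ptr & 0xff;
}
```

The first result is the same wherever the buffer is placed. The second varies with memory layout, which is the kind of address-dependent scalar the patch stops programs from creating.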

Signed-off-by: Alexei Starovoitov 
---
 kernel/bpf/verifier.c   | 102 ++--
 tools/testing/selftests/bpf/test_verifier.c |  56 +++
 2 files changed, 63 insertions(+), 95 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 86dfe6b5c243..04b24876cd23 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1890,29 +1890,25 @@ static int adjust_ptr_min_max_vals(struct 
bpf_verifier_env *env,
 
if (BPF_CLASS(insn->code) != BPF_ALU64) {
/* 32-bit ALU ops on pointers produce (meaningless) scalars */
-   if (!env->allow_ptr_leaks)
-   verbose(env,
-   "R%d 32-bit pointer arithmetic prohibited\n",
-   dst);
+   verbose(env,
+   "R%d 32-bit pointer arithmetic prohibited\n",
+   dst);
return -EACCES;
}
 
if (ptr_reg->type == PTR_TO_MAP_VALUE_OR_NULL) {
-   if (!env->allow_ptr_leaks)
-   verbose(env, "R%d pointer arithmetic on 
PTR_TO_MAP_VALUE_OR_NULL prohibited, null-check it first\n",
-   dst);
+   verbose(env, "R%d pointer arithmetic on 
PTR_TO_MAP_VALUE_OR_NULL prohibited, null-check it first\n",
+   dst);
return -EACCES;
}
if (ptr_reg->type == CONST_PTR_TO_MAP) {
-   if (!env->allow_ptr_leaks)
-   verbose(env, "R%d pointer arithmetic on 
CONST_PTR_TO_MAP prohibited\n",
-   dst);
+   verbose(env, "R%d pointer arithmetic on CONST_PTR_TO_MAP 
prohibited\n",
+   dst);
return -EACCES;
}
if (ptr_reg->type == PTR_TO_PACKET_END) {
-   if (!env->allow_ptr_leaks)
-   verbose(env, "R%d pointer arithmetic on 
PTR_TO_PACKET_END prohibited\n",
-   dst);
+   verbose(env, "R%d pointer arithmetic on PTR_TO_PACKET_END 
prohibited\n",
+   dst);
return -EACCES;
}
 
@@ -1979,9 +1975,8 @@ static int adjust_ptr_min_max_vals(struct 
bpf_verifier_env *env,
case BPF_SUB:
if (dst_reg == off_reg) {
/* scalar -= pointer.  Creates an unknown scalar */
-   if (!env->allow_ptr_leaks)
-   verbose(env, "R%d tried to subtract pointer 
from scalar\n",
-   dst);
+   verbose(env, "R%d tried to subtract pointer from 
scalar\n",
+   dst);
return -EACCES;
}
/* We don't allow subtraction from FP, because (according to
@@ -1989,9 +1984,8 @@ static int adjust_ptr_min_max_vals(struct 
bpf_verifier_env *env,
 * be able to deal with it.
 */
if (ptr_reg->type == PTR_TO_STACK) {
-   if (!env->allow_ptr_leaks)
-   verbose(env, "R%d subtraction from stack 
pointer prohibited\n",
-   dst);
+   verbose(env, "R%d subtraction from stack pointer 
prohibited\n",
+   dst);
return -EACCES;
}
if (known && (ptr_reg->off - smin_val ==
@@ -2040,19 +2034,14 @@ static int adjust_ptr_min_max_vals(struct 
bpf_verifier_env *env,
case BPF_AND:
case BPF_OR:
case BPF_XOR:
-   /* bitwise ops on pointers are troublesome, prohibit for now.
-* (However, in principle we could allow some cases, e.g.
-* ptr &= ~3 which would reduce min_value by 3.)
-*/
-  

[PATCH bpf 9/9] selftests/bpf: add tests for recent bugfixes

2017-12-18 Thread Alexei Starovoitov
From: Jann Horn 

These tests should cover the following cases:

 - MOV with both zero-extended and sign-extended immediates
 - implicit truncation of register contents via ALU32/MOV32
 - implicit 32-bit truncation of ALU32 output
 - oversized register source operand for ALU32 shift
 - right-shift of a number that could be positive or negative
 - map access where adding the operation size to the offset causes signed
   32-bit overflow
 - direct stack access at a ~4GiB offset

Also remove the F_LOAD_WITH_STRICT_ALIGNMENT flag from a bunch of tests
that should fail independent of what flags userspace passes.

Signed-off-by: Jann Horn 
Signed-off-by: Alexei Starovoitov 
---
 tools/testing/selftests/bpf/test_verifier.c | 549 +++-
 1 file changed, 533 insertions(+), 16 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index b03ecfd7185b..961c1426fbf2 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -606,7 +606,6 @@ static struct bpf_test tests[] = {
},
.errstr = "misaligned stack access",
.result = REJECT,
-   .flags = F_LOAD_WITH_STRICT_ALIGNMENT,
},
{
"invalid map_fd for function call",
@@ -1797,7 +1796,6 @@ static struct bpf_test tests[] = {
},
.result = REJECT,
.errstr = "misaligned stack access off (0x0; 0x0)+-8+2 size 8",
-   .flags = F_LOAD_WITH_STRICT_ALIGNMENT,
},
{
"PTR_TO_STACK store/load - bad alignment on reg",
@@ -1810,7 +1808,6 @@ static struct bpf_test tests[] = {
},
.result = REJECT,
.errstr = "misaligned stack access off (0x0; 0x0)+-10+8 size 8",
-   .flags = F_LOAD_WITH_STRICT_ALIGNMENT,
},
{
"PTR_TO_STACK store/load - out of bounds low",
@@ -6324,7 +6321,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map1 = { 3 },
-   .errstr = "R0 min value is negative",
+   .errstr = "unbounded min value",
.result = REJECT,
},
{
@@ -6348,7 +6345,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map1 = { 3 },
-   .errstr = "R0 min value is negative",
+   .errstr = "unbounded min value",
.result = REJECT,
},
{
@@ -6374,7 +6371,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map1 = { 3 },
-   .errstr = "R8 invalid mem access 'inv'",
+   .errstr = "unbounded min value",
.result = REJECT,
},
{
@@ -6399,7 +6396,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map1 = { 3 },
-   .errstr = "R8 invalid mem access 'inv'",
+   .errstr = "unbounded min value",
.result = REJECT,
},
{
@@ -6447,7 +6444,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map1 = { 3 },
-   .errstr = "R0 min value is negative",
+   .errstr = "unbounded min value",
.result = REJECT,
},
{
@@ -6518,7 +6515,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map1 = { 3 },
-   .errstr = "R0 min value is negative",
+   .errstr = "unbounded min value",
.result = REJECT,
},
{
@@ -6569,7 +6566,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map1 = { 3 },
-   .errstr = "R0 min value is negative",
+   .errstr = "unbounded min value",
.result = REJECT,
},
{
@@ -6596,7 +6593,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map1 = { 3 },
-   .errstr = "R0 min value is negative",
+   .errstr = "unbounded min value",
.result = REJECT,
},
{
@@ -6622,7 +6619,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
.fixup_map1 = { 3 },
-   .errstr = "R0 min value is negative",
+   .errstr = "unbounded min value",
.result = REJECT,
},
{
@@ -6651,7 +6648,7 @@ static struct bpf_test tests[] = {
BPF_EXIT_INSN(),
},
   

[PATCH bpf 2/9] bpf: fix incorrect sign extension in check_alu_op()

2017-12-18 Thread Alexei Starovoitov
From: Jann Horn 

Distinguish between
BPF_ALU64|BPF_MOV|BPF_K (load 32-bit immediate, sign-extended to 64-bit)
and BPF_ALU|BPF_MOV|BPF_K (load 32-bit immediate, zero-padded to 64-bit);
only perform sign extension in the first case.

Starting with v4.14, this is exploitable by unprivileged users as long as
the unprivileged_bpf_disabled sysctl isn't set.

Debian assigned CVE-2017-16995 for this issue.
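
The two cases distinguished above differ only in how the 32-bit immediate is widened to 64 bits. A small user-space sketch, with function names invented for the illustration:

```c
#include <assert.h>
#include <stdint.h>

/* BPF_ALU64 | BPF_MOV | BPF_K: the 32-bit immediate is sign-extended. */
static uint64_t mov64_imm(int32_t imm)
{
	return (uint64_t)(int64_t)imm;
}

/* BPF_ALU | BPF_MOV | BPF_K: the upper 32 bits of the result are zero. */
static uint64_t mov32_imm(int32_t imm)
{
	return (uint64_t)(uint32_t)imm;
}

/* The two forms agree for non-negative immediates and differ otherwise. */
static void check_mov_examples(void)
{
	assert(mov64_imm(5) == mov32_imm(5));
	assert(mov64_imm(-1) == 0xffffffffffffffffULL);
	assert(mov32_imm(-1) == 0x00000000ffffffffULL);
}
```

If a verifier tracks the sign-extended value for both forms, its idea of the register contents differs from the value seen at run time for any negative immediate.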

v3:
 - add CVE number (Ben Hutchings)

Fixes: 484611357c19 ("bpf: allow access into map value arrays")
Signed-off-by: Jann Horn 
Acked-by: Edward Cree 
Signed-off-by: Alexei Starovoitov 
---
 kernel/bpf/verifier.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 625e358ca765..c086010ae51e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2408,7 +2408,13 @@ static int check_alu_op(struct bpf_verifier_env *env, 
struct bpf_insn *insn)
 * remember the value we stored into this reg
 */
regs[insn->dst_reg].type = SCALAR_VALUE;
-   __mark_reg_known(regs + insn->dst_reg, insn->imm);
+   if (BPF_CLASS(insn->code) == BPF_ALU64) {
+   __mark_reg_known(regs + insn->dst_reg,
+insn->imm);
+   } else {
+   __mark_reg_known(regs + insn->dst_reg,
+(u32)insn->imm);
+   }
}
 
} else if (opcode > BPF_END) {
-- 
2.9.5



[PATCH bpf 4/9] bpf: fix 32-bit ALU op verification

2017-12-18 Thread Alexei Starovoitov
From: Jann Horn 

32-bit ALU ops operate on 32-bit values and have 32-bit outputs.
Adjust the verifier accordingly.
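
A user-space sketch of the two properties this patch relies on: the valid shift range depends on the instruction width, and a 32-bit ALU result is truncated to 32 bits. The helper names are made up here; only the insn_bitness idea comes from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * A shift is well defined only if the largest possible shift count is
 * below the instruction width: 64 for ALU64, 32 for 32-bit ALU ops.
 */
static int shift_is_defined(uint64_t umax_shift, int is_alu64)
{
	uint64_t insn_bitness = is_alu64 ? 64 : 32;

	return umax_shift < insn_bitness;
}

/* 32-bit left shift: operates on the low 32 bits, result zero-extended. */
static uint64_t alu32_lsh(uint64_t dst, uint32_t shift)
{
	assert(shift < 32);
	return (uint64_t)((uint32_t)dst << shift);
}
```

A shift count of 32 passes a "greater than 63" test but is already out of range for a 32-bit operation, which is the case the width-dependent check is meant to catch.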

Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn 
Signed-off-by: Alexei Starovoitov 
---
 kernel/bpf/verifier.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f716bdf29dd0..ecdc265244ca 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2017,6 +2017,10 @@ static int adjust_ptr_min_max_vals(struct 
bpf_verifier_env *env,
return 0;
 }
 
+/* WARNING: This function does calculations on 64-bit values, but the actual
+ * execution may occur on 32-bit values. Therefore, things like bitshifts
+ * need extra checks in the 32-bit case.
+ */
 static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env,
  struct bpf_insn *insn,
  struct bpf_reg_state *dst_reg,
@@ -2027,12 +2031,8 @@ static int adjust_scalar_min_max_vals(struct 
bpf_verifier_env *env,
bool src_known, dst_known;
s64 smin_val, smax_val;
u64 umin_val, umax_val;
+   u64 insn_bitness = (BPF_CLASS(insn->code) == BPF_ALU64) ? 64 : 32;
 
-   if (BPF_CLASS(insn->code) != BPF_ALU64) {
-   /* 32-bit ALU ops are (32,32)->64 */
-   coerce_reg_to_size(dst_reg, 4);
-   coerce_reg_to_size(&src_reg, 4);
-   }
smin_val = src_reg.smin_value;
smax_val = src_reg.smax_value;
umin_val = src_reg.umin_value;
@@ -2168,9 +2168,9 @@ static int adjust_scalar_min_max_vals(struct 
bpf_verifier_env *env,
__update_reg_bounds(dst_reg);
break;
case BPF_LSH:
-   if (umax_val > 63) {
-   /* Shifts greater than 63 are undefined.  This includes
-* shifts by a negative number.
+   if (umax_val >= insn_bitness) {
+   /* Shifts greater than 31 or 63 are undefined.
+* This includes shifts by a negative number.
 */
mark_reg_unknown(env, regs, insn->dst_reg);
break;
@@ -2196,9 +2196,9 @@ static int adjust_scalar_min_max_vals(struct 
bpf_verifier_env *env,
__update_reg_bounds(dst_reg);
break;
case BPF_RSH:
-   if (umax_val > 63) {
-   /* Shifts greater than 63 are undefined.  This includes
-* shifts by a negative number.
+   if (umax_val >= insn_bitness) {
+   /* Shifts greater than 31 or 63 are undefined.
+* This includes shifts by a negative number.
 */
mark_reg_unknown(env, regs, insn->dst_reg);
break;
@@ -2234,6 +2234,12 @@ static int adjust_scalar_min_max_vals(struct 
bpf_verifier_env *env,
break;
}
 
+   if (BPF_CLASS(insn->code) != BPF_ALU64) {
+   /* 32-bit ALU ops are (32,32)->32 */
+   coerce_reg_to_size(dst_reg, 4);
+   coerce_reg_to_size(&src_reg, 4);
+   }
+
__reg_deduce_bounds(dst_reg);
__reg_bound_offset(dst_reg);
return 0;
-- 
2.9.5



[PATCH bpf 5/9] bpf: fix missing error return in check_stack_boundary()

2017-12-18 Thread Alexei Starovoitov
From: Jann Horn 

Prevent indirect stack accesses at non-constant addresses, which would
permit reading and corrupting spilled pointers.

Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn 
Signed-off-by: Alexei Starovoitov 
---
 kernel/bpf/verifier.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index ecdc265244ca..77e4b5223867 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1303,6 +1303,7 @@ static int check_stack_boundary(struct bpf_verifier_env 
*env, int regno,
tnum_strn(tn_buf, sizeof(tn_buf), regs[regno].var_off);
verbose(env, "invalid variable stack read R%d var_off=%s\n",
regno, tn_buf);
+   return -EACCES;
}
off = regs[regno].off + regs[regno].var_off.value;
if (off >= 0 || off < -MAX_BPF_STACK || off + access_size > 0 ||
-- 
2.9.5



[PATCH bpf 1/9] bpf/verifier: fix bounds calculation on BPF_RSH

2017-12-18 Thread Alexei Starovoitov
From: Edward Cree 

Incorrect signed bounds were being computed.
If the old upper signed bound was positive and the old lower signed bound was
negative, this could cause the new upper signed bound to be too low,
leading to security issues.

Fixes: b03c9f9fdc37 ("bpf/verifier: track signed and unsigned min/max values")
Reported-by: Jann Horn 
Signed-off-by: Edward Cree 
Acked-by: Alexei Starovoitov 
[ja...@google.com: changed description to reflect bug impact]
Signed-off-by: Jann Horn 
Signed-off-by: Alexei Starovoitov 
---
 kernel/bpf/verifier.c | 30 --
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index e39b01317b6f..625e358ca765 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2190,20 +2190,22 @@ static int adjust_scalar_min_max_vals(struct 
bpf_verifier_env *env,
mark_reg_unknown(env, regs, insn->dst_reg);
break;
}
-   /* BPF_RSH is an unsigned shift, so make the appropriate casts */
-   if (dst_reg->smin_value < 0) {
-   if (umin_val) {
-   /* Sign bit will be cleared */
-   dst_reg->smin_value = 0;
-   } else {
-   /* Lost sign bit information */
-   dst_reg->smin_value = S64_MIN;
-   dst_reg->smax_value = S64_MAX;
-   }
-   } else {
-   dst_reg->smin_value =
-   (u64)(dst_reg->smin_value) >> umax_val;
-   }
+   /* BPF_RSH is an unsigned shift.  If the value in dst_reg might
+* be negative, then either:
+* 1) src_reg might be zero, so the sign bit of the result is
+*unknown, so we lose our signed bounds
+* 2) it's known negative, thus the unsigned bounds capture the
+*signed bounds
+* 3) the signed bounds cross zero, so they tell us nothing
+*about the result
+* If the value in dst_reg is known nonnegative, then again the
+* unsigned bounds capture the signed bounds.
+* Thus, in all cases it suffices to blow away our signed bounds
+* and rely on inferring new ones from the unsigned bounds and
+* var_off of the result.
+*/
+   dst_reg->smin_value = S64_MIN;
+   dst_reg->smax_value = S64_MAX;
if (src_known)
dst_reg->var_off = tnum_rshift(dst_reg->var_off,
   umin_val);
-- 
2.9.5
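
To see why resetting the signed bounds in patch 1/9 is enough, here is a minimal standalone sketch, not the verifier code itself. It assumes a constant shift amount (the kernel tracks a range of shift values) and uses a hypothetical struct bounds and rsh_bounds() helper. It re-derives signed bounds only when the whole unsigned range is nonnegative when read as a signed value, which is deliberately looser than what the verifier's own deduction helpers may infer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the bounds kept in struct bpf_reg_state. */
struct bounds {
	uint64_t umin, umax;
	int64_t smin, smax;
};

/* Bounds after an unsigned right shift by a constant `shift` (0..63). */
static struct bounds rsh_bounds(struct bounds b, unsigned int shift)
{
	assert(shift < 64);

	/* An unsigned shift is monotonic on the unsigned range. */
	b.umin >>= shift;
	b.umax >>= shift;

	/* Forget the old signed bounds, as the patch does. */
	b.smin = INT64_MIN;
	b.smax = INT64_MAX;

	/* If the top bit of every possible result is clear, the signed
	 * and unsigned views agree, so the signed bounds can be copied
	 * from the unsigned ones.
	 */
	if (b.umax <= (uint64_t)INT64_MAX) {
		b.smin = (int64_t)b.umin;
		b.smax = (int64_t)b.umax;
	}
	return b;
}
```

For a register with signed range [-1, 1] (unsigned range [0, U64_MAX]) shifted right by 1, the sketch yields smin = 0 and smax = S64_MAX. As far as the removed lines show, the old code set smin_value to 0 in that case but left smax_value at 1, which is too low because -1 shifts to S64_MAX.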



[PATCH bpf 7/9] bpf: don't prune branches when a scalar is replaced with a pointer

2017-12-18 Thread Alexei Starovoitov
From: Jann Horn 

This could be made safe by passing through a reference to env and checking
for env->allow_ptr_leaks, but it would only work one way and is probably
not worth the hassle - not doing it will not directly lead to program
rejection.

Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn 
Signed-off-by: Alexei Starovoitov 
---
 kernel/bpf/verifier.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 102c519836f6..982bd9ec721a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3467,15 +3467,14 @@ static bool regsafe(struct bpf_reg_state *rold, struct 
bpf_reg_state *rcur,
return range_within(rold, rcur) &&
   tnum_in(rold->var_off, rcur->var_off);
} else {
-   /* if we knew anything about the old value, we're not
-* equal, because we can't know anything about the
-* scalar value of the pointer in the new value.
+   /* We're trying to use a pointer in place of a scalar.
+* Even if the scalar was unbounded, this could lead to
+* pointer leaks because scalars are allowed to leak
+* while pointers are not. We could make this safe in
+* special cases if root is calling us, but it's
+* probably not worth the hassle.
 */
-   return rold->umin_value == 0 &&
-  rold->umax_value == U64_MAX &&
-  rold->smin_value == S64_MIN &&
-  rold->smax_value == S64_MAX &&
-  tnum_is_unknown(rold->var_off);
+   return false;
}
case PTR_TO_MAP_VALUE:
/* If the new min/max/var_off satisfy the old ones and
-- 
2.9.5



[PATCH bpf 6/9] bpf: force strict alignment checks for stack pointers

2017-12-18 Thread Alexei Starovoitov
From: Jann Horn 

Force strict alignment checks for stack pointers because the tracking of
stack spills relies on it; unaligned stack accesses can lead to corruption
of spilled registers, which is exploitable.

Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
Signed-off-by: Jann Horn 
Signed-off-by: Alexei Starovoitov 
---
 kernel/bpf/verifier.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 77e4b5223867..102c519836f6 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1059,6 +1059,11 @@ static int check_ptr_alignment(struct bpf_verifier_env 
*env,
break;
case PTR_TO_STACK:
pointer_desc = "stack ";
+   /* The stack spill tracking logic in check_stack_write()
+* and check_stack_read() relies on stack accesses being
+* aligned.
+*/
+   strict = true;
break;
default:
break;
-- 
2.9.5



[PATCH bpf 8/9] bpf: fix integer overflows

2017-12-18 Thread Alexei Starovoitov
There were various issues related to the limited size of integers used in
the verifier:
 - `off + size` overflow in __check_map_access()
 - `off + reg->off` overflow in check_mem_access()
 - `off + reg->var_off.value` overflow or 32-bit truncation of
   `reg->var_off.value` in check_mem_access()
 - 32-bit truncation in check_stack_boundary()

Make sure that any integer math cannot overflow by not allowing
pointer math with large values.

Also reduce the scope of "scalar op scalar" tracking.

Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
Reported-by: Jann Horn 
Signed-off-by: Alexei Starovoitov 
---
 include/linux/bpf_verifier.h |  4 ++--
 kernel/bpf/verifier.c| 48 
 2 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index c561b986bab0..1632bb13ad8a 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -15,11 +15,11 @@
  * In practice this is far bigger than any realistic pointer offset; this limit
  * ensures that umax_value + (int)off + (int)size cannot overflow a u64.
  */
-#define BPF_MAX_VAR_OFF (1ULL << 31)
+#define BPF_MAX_VAR_OFF (1 << 29)
 /* Maximum variable size permitted for ARG_CONST_SIZE[_OR_ZERO].  This ensures
  * that converting umax_value to int cannot overflow.
  */
-#define BPF_MAX_VAR_SIZ INT_MAX
+#define BPF_MAX_VAR_SIZ (1 << 29)
 
 /* Liveness marks, used for registers and spilled-regs (in stack slots).
  * Read marks propagate upwards until they find a write mark; they record that
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 982bd9ec721a..86dfe6b5c243 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1819,6 +1819,41 @@ static bool signed_sub_overflows(s64 a, s64 b)
return res > a;
 }
 
+static bool check_reg_sane_offset(struct bpf_verifier_env *env,
+ const struct bpf_reg_state *reg,
+ enum bpf_reg_type type)
+{
+   bool known = tnum_is_const(reg->var_off);
+   s64 val = reg->var_off.value;
+   s64 smin = reg->smin_value;
+
+   if (known && (val >= BPF_MAX_VAR_OFF || val <= -BPF_MAX_VAR_OFF)) {
+   verbose(env, "math between %s pointer and %lld is not allowed\n",
+   reg_type_str[type], val);
+   return false;
+   }
+
+   if (reg->off >= BPF_MAX_VAR_OFF || reg->off <= -BPF_MAX_VAR_OFF) {
+   verbose(env, "%s pointer offset %d is not allowed\n",
+   reg_type_str[type], reg->off);
+   return false;
+   }
+
+   if (smin == S64_MIN) {
+   verbose(env, "math between %s pointer and register with unbounded min value is not allowed\n",
+   reg_type_str[type]);
+   return false;
+   }
+
+   if (smin >= BPF_MAX_VAR_OFF || smin <= -BPF_MAX_VAR_OFF) {
+   verbose(env, "value %lld makes %s pointer be out of bounds\n",
+   smin, reg_type_str[type]);
+   return false;
+   }
+
+   return true;
+}
+
 /* Handles arithmetic on a pointer and a scalar: computes new min/max and 
var_off.
  * Caller should also handle BPF_MOV case separately.
  * If we return -EACCES, caller may want to try again treating pointer as a
@@ -1887,6 +1922,10 @@ static int adjust_ptr_min_max_vals(struct 
bpf_verifier_env *env,
dst_reg->type = ptr_reg->type;
dst_reg->id = ptr_reg->id;
 
+   if (!check_reg_sane_offset(env, off_reg, ptr_reg->type) ||
+   !check_reg_sane_offset(env, ptr_reg, ptr_reg->type))
+   return -EINVAL;
+
switch (opcode) {
case BPF_ADD:
/* We can take a fixed offset as long as it doesn't overflow
@@ -2017,6 +2056,9 @@ static int adjust_ptr_min_max_vals(struct 
bpf_verifier_env *env,
return -EACCES;
}
 
+   if (!check_reg_sane_offset(env, dst_reg, ptr_reg->type))
+   return -EINVAL;
+
__update_reg_bounds(dst_reg);
__reg_deduce_bounds(dst_reg);
__reg_bound_offset(dst_reg);
@@ -2046,6 +2088,12 @@ static int adjust_scalar_min_max_vals(struct 
bpf_verifier_env *env,
src_known = tnum_is_const(src_reg.var_off);
dst_known = tnum_is_const(dst_reg->var_off);
 
+   if (!src_known &&
+   opcode != BPF_ADD && opcode != BPF_SUB && opcode != BPF_AND) {
+   __mark_reg_unknown(dst_reg);
+   return 0;
+   }
+
switch (opcode) {
case BPF_ADD:
if (signed_add_overflows(dst_reg->smin_value, smin_val) ||
-- 
2.9.5



[PATCH bpf 0/9] bpf: verifier security fixes

2017-12-18 Thread Alexei Starovoitov
This patch set addresses a set of security vulnerabilities
in bpf verifier logic discovered by Jann Horn.
All of the patches are candidates for 4.14 stable.

Alexei Starovoitov (1):
  bpf: fix integer overflows

Edward Cree (1):
  bpf/verifier: fix bounds calculation on BPF_RSH

Jann Horn (7):
  bpf: fix incorrect sign extension in check_alu_op()
  bpf: fix incorrect tracking of register size truncation
  bpf: fix 32-bit ALU op verification
  bpf: fix missing error return in check_stack_boundary()
  bpf: force strict alignment checks for stack pointers
  bpf: don't prune branches when a scalar is replaced with a pointer
  selftests/bpf: add tests for recent bugfixes

 include/linux/bpf_verifier.h|   4 +-
 kernel/bpf/verifier.c   | 175 ++---
 tools/testing/selftests/bpf/test_verifier.c | 549 +++-
 3 files changed, 661 insertions(+), 67 deletions(-)

-- 
2.9.5



[PATCH bpf 3/9] bpf: fix incorrect tracking of register size truncation

2017-12-18 Thread Alexei Starovoitov
From: Jann Horn 

Properly handle register truncation to a smaller size.

The old code first mirrors the clearing of the high 32 bits in the bitwise
tristate representation, which is correct. But then, it computes the new
arithmetic bounds as the intersection between the old arithmetic bounds and
the bounds resulting from the bitwise tristate representation. Therefore,
when coerce_reg_to_32() is called on a number with bounds
[0xffff'fff8, 0x1'0000'0007], the verifier computes
[0xffff'fff8, 0xffff'ffff] as bounds of the truncated number.
This is incorrect: The truncated number could also be in the range [0, 7],
and no meaningful arithmetic bounds can be computed in that case apart from
the obvious [0, 0xffff'ffff].

Starting with v4.14, this is exploitable by unprivileged users as long as
the unprivileged_bpf_disabled sysctl isn't set.

Debian assigned CVE-2017-16996 for this issue.

v2:
 - flip the mask during arithmetic bounds calculation (Ben Hutchings)
v3:
 - add CVE number (Ben Hutchings)

Fixes: b03c9f9fdc37 ("bpf/verifier: track signed and unsigned min/max values")
Signed-off-by: Jann Horn 
Acked-by: Edward Cree 
Signed-off-by: Alexei Starovoitov 
---
 kernel/bpf/verifier.c | 44 +++-
 1 file changed, 27 insertions(+), 17 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index c086010ae51e..f716bdf29dd0 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1067,6 +1067,29 @@ static int check_ptr_alignment(struct bpf_verifier_env 
*env,
   strict);
 }
 
+/* truncate register to smaller size (in bytes)
+ * must be called with size < BPF_REG_SIZE
+ */
+static void coerce_reg_to_size(struct bpf_reg_state *reg, int size)
+{
+   u64 mask;
+
+   /* clear high bits in bit representation */
+   reg->var_off = tnum_cast(reg->var_off, size);
+
+   /* fix arithmetic bounds */
+   mask = ((u64)1 << (size * 8)) - 1;
+   if ((reg->umin_value & ~mask) == (reg->umax_value & ~mask)) {
+   reg->umin_value &= mask;
+   reg->umax_value &= mask;
+   } else {
+   reg->umin_value = 0;
+   reg->umax_value = mask;
+   }
+   reg->smin_value = reg->umin_value;
+   reg->smax_value = reg->umax_value;
+}
+
 /* check whether memory at (regno + off) is accessible for t = (read | write)
  * if t==write, value_regno is a register which value is stored into memory
  * if t==read, value_regno is a register which will receive the value from 
memory
@@ -1200,9 +1223,7 @@ static int check_mem_access(struct bpf_verifier_env *env, 
int insn_idx, u32 regn
if (!err && size < BPF_REG_SIZE && value_regno >= 0 && t == BPF_READ &&
regs[value_regno].type == SCALAR_VALUE) {
/* b/h/w load zero-extends, mark upper bits as known 0 */
-   regs[value_regno].var_off =
-   tnum_cast(regs[value_regno].var_off, size);
-   __update_reg_bounds(&regs[value_regno]);
+   coerce_reg_to_size(&regs[value_regno], size);
}
return err;
 }
@@ -1772,14 +1793,6 @@ static int check_call(struct bpf_verifier_env *env, int 
func_id, int insn_idx)
return 0;
 }
 
-static void coerce_reg_to_32(struct bpf_reg_state *reg)
-{
-   /* clear high 32 bits */
-   reg->var_off = tnum_cast(reg->var_off, 4);
-   /* Update bounds */
-   __update_reg_bounds(reg);
-}
-
 static bool signed_add_overflows(s64 a, s64 b)
 {
/* Do the add in u64, where overflow is well-defined */
@@ -2017,8 +2030,8 @@ static int adjust_scalar_min_max_vals(struct 
bpf_verifier_env *env,
 
if (BPF_CLASS(insn->code) != BPF_ALU64) {
/* 32-bit ALU ops are (32,32)->64 */
-   coerce_reg_to_32(dst_reg);
-   coerce_reg_to_32(&src_reg);
+   coerce_reg_to_size(dst_reg, 4);
+   coerce_reg_to_size(&src_reg, 4);
}
smin_val = src_reg.smin_value;
smax_val = src_reg.smax_value;
@@ -2398,10 +2411,7 @@ static int check_alu_op(struct bpf_verifier_env *env, 
struct bpf_insn *insn)
return -EACCES;
}
mark_reg_unknown(env, regs, insn->dst_reg);
-   /* high 32 bits are known zero. */
-   regs[insn->dst_reg].var_off = tnum_cast(
-   regs[insn->dst_reg].var_off, 4);
-   __update_reg_bounds(&regs[insn->dst_reg]);
+   coerce_reg_to_size(&regs[insn->dst_reg], 4);
}
} else {
/* case: R = imm
-- 
2.9.5
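
The unsigned-bounds rule that coerce_reg_to_size() applies in patch 3/9 can be tried in isolation with the sketch below. It is a standalone illustration under stated assumptions: struct ubounds and truncate_bounds() are hypothetical names, and only the umin/umax part of the register state is modelled (the tristate var_off and the signed bounds are left out).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the unsigned bounds in struct bpf_reg_state. */
struct ubounds {
	uint64_t umin;
	uint64_t umax;
};

/* Truncate the range [umin, umax] to `size` bytes, with size < 8. */
static struct ubounds truncate_bounds(struct ubounds b, int size)
{
	uint64_t mask;

	assert(size > 0 && size < 8);
	mask = ((uint64_t)1 << (size * 8)) - 1;

	if ((b.umin & ~mask) == (b.umax & ~mask)) {
		/* The discarded high bits are identical for every value
		 * in the range, so truncation preserves the ordering.
		 */
		b.umin &= mask;
		b.umax &= mask;
	} else {
		/* The range crosses a 2^(size*8) boundary: after
		 * truncation the low bits may take any value.
		 */
		b.umin = 0;
		b.umax = mask;
	}
	return b;
}
```

Calling truncate_bounds() with the range from the commit message, [0xffff'fff8, 0x1'0000'0007], and size 4 gives [0, 0xffff'ffff], while a range such as [0x1'0000'0001, 0x1'0000'0005] keeps the tight bounds [1, 5].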



Re: [PATCH net] sctp: add SCTP_CID_RECONF conversion in sctp_cname

2017-12-18 Thread Xin Long
On Mon, Dec 18, 2017 at 9:08 PM, Marcelo Ricardo Leitner
 wrote:
> On Mon, Dec 18, 2017 at 02:13:17PM +0800, Xin Long wrote:
>> Whenever a new type of chunk is added, the corresp conversion in
>> sctp_cname should be added. Otherwise, in some places, pr_debug
>> will print it as "unknown chunk".
>>
>> Fixes: cc16f00f6529 ("sctp: add support for generating stream reconf ssn 
>> reset request chunk")
>> Signed-off-by: Xin Long 
>
> Acked-by: Marcelo R. Leitner 
>
> ...
>>   case SCTP_CID_AUTH:
>>   return "AUTH";
>>
>> + case SCTP_CID_RECONF:
>> + return "RECONF";
>> +
>>   default:
>>   break;
>
> Now we also need idata and ifwdtsn in there too, btw.
Yes, waiting for the merge from net-next to net.

>
>   Marcelo


Re: [Patch net-next] net_sched: properly check for empty skb array on error path

2017-12-18 Thread John Fastabend
On 12/18/2017 06:20 PM, Cong Wang wrote:
> On Mon, Dec 18, 2017 at 5:25 PM, John Fastabend
>  wrote:
>> On 12/18/2017 02:34 PM, Cong Wang wrote:
>>> First, the check of &q->ring.queue against NULL is wrong, it
>>> is always false. We should check the value rather than the address.
>>>
>>
>> Thanks.
>>
>>> Secondly, we need the same check in pfifo_fast_reset() too,
>>> as both ->reset() and ->destroy() are called in qdisc_destroy().
>>>
>>
>> not that it hurts to have the check here, but if init fails
>> in qdisc_create it seems only ->destroy() is called without
>> a ->reset().
>>
>> Is there another path for init() to fail that I'm missing?
> 
> Pretty sure ->reset() is called in qdisc_destroy() and also before
> ->destroy():
> 

Except, the failed init path does not call qdisc_destroy.

static struct Qdisc *qdisc_create(struct net_device *dev,
[...]

if (ops->init) {
err = ops->init(sch, tca[TCA_OPTIONS]);
if (err != 0)
goto err_out5;
}
[...]

err_out5:
/* ops->init() failed, we call ->destroy() like qdisc_create_dflt() */
if (ops->destroy)
ops->destroy(sch);
err_out3:
dev_put(dev);
kfree((char *) sch - sch->padded);
err_out2:
module_put(ops->owner);
err_out:
*errp = err;
return NULL;
[...]



> 
> void qdisc_destroy(struct Qdisc *qdisc)
> {
> const struct Qdisc_ops  *ops = qdisc->ops;
> struct sk_buff *skb, *tmp;
> 
> if (qdisc->flags & TCQ_F_BUILTIN ||
> !refcount_dec_and_test(&qdisc->refcnt))
> return;
> 
> #ifdef CONFIG_NET_SCHED
> qdisc_hash_del(qdisc);
> 
> qdisc_put_stab(rtnl_dereference(qdisc->stab));
> #endif
> gen_kill_estimator(&qdisc->rate_est);
> if (ops->reset)
> ops->reset(qdisc);
> if (ops->destroy)
> ops->destroy(qdisc);
> 
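
The two points in this thread (check the pointer's value rather than its address, and the failed-init path calling ->destroy() without ->reset()) can be modelled with a toy example. Everything here is hypothetical: toy_ring, toy_qdisc and the toy_* functions are illustrative stand-ins, not the kernel's skb_array or Qdisc code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins; not the kernel's struct skb_array or struct Qdisc. */
struct toy_ring {
	void **queue;		/* stays NULL until init succeeds */
};

struct toy_qdisc {
	struct toy_ring ring;
	int destroy_calls;
};

static int toy_init(struct toy_qdisc *q, bool fail)
{
	if (fail)
		return -1;	/* ring.queue is left as NULL */
	q->ring.queue = calloc(4, sizeof(*q->ring.queue));
	return q->ring.queue ? 0 : -1;
}

static void toy_destroy(struct toy_qdisc *q)
{
	q->destroy_calls++;
	/* Test the pointer's value. The address &q->ring.queue is
	 * never NULL for a valid q, so testing it would not help.
	 */
	if (!q->ring.queue)
		return;
	free(q->ring.queue);
	q->ring.queue = NULL;
}

/* Models the err_out5 path quoted above: on init failure only the
 * destroy hook runs, with no reset call before it.
 */
static int toy_create(struct toy_qdisc *q, bool fail_init)
{
	int err = toy_init(q, fail_init);

	if (err)
		toy_destroy(q);
	return err;
}
```

In this model the destroy hook has to tolerate a queue that was never allocated, which is the situation the patch under discussion guards against.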



[PATCH V2 net-next 14/17] net: hns3: add Asym Pause support to phy default features

2017-12-18 Thread Lipeng
From: Fuyun Liang 

Commit c4fb2cdf575d ("net: hns3: fix a bug for phy supported feature
initialization") added default supported features for the phy, but our
hardware also supports Asym Pause. This patch adds Asym Pause to the phy
default features so that Asym Pause can be advertised when the phy
negotiates flow control.

Fixes: c4fb2cdf575d ("net: hns3: fix a bug for phy supported feature 
initialization")
Signed-off-by: Fuyun Liang 
Signed-off-by: Lipeng 
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c
index 3745153..c1dea3a 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c
@@ -17,6 +17,7 @@
 #define HCLGE_PHY_SUPPORTED_FEATURES   (SUPPORTED_Autoneg | \
 SUPPORTED_TP | \
 SUPPORTED_Pause | \
+SUPPORTED_Asym_Pause | \
 PHY_10BT_FEATURES | \
 PHY_100BT_FEATURES | \
 PHY_1000BT_FEATURES)
-- 
1.9.1



[PATCH V2 net-next 10/17] net: hns3: cleanup mac auto-negotiation state query

2017-12-18 Thread Lipeng
From: Fuyun Liang 

When checking whether auto-negotiation is on, the driver only needs to
check the value of mac.autoneg (SW) directly and does not need to
query it from hardware, because this value is always synchronized
with the auto-negotiation state of the hardware.

This patch removes the mac auto-negotiation state query.

Signed-off-by: Fuyun Liang 
Signed-off-by: Lipeng 
---
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 24 --
 1 file changed, 24 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index e253f73..9ccfe86 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -2137,28 +2137,6 @@ static int hclge_query_mac_an_speed_dup(struct hclge_dev 
*hdev, int *speed,
return 0;
 }
 
-static int hclge_query_autoneg_result(struct hclge_dev *hdev)
-{
-   struct hclge_mac *mac = &hdev->hw.mac;
-   struct hclge_query_an_speed_dup_cmd *req;
-   struct hclge_desc desc;
-   int ret;
-
-   req = (struct hclge_query_an_speed_dup_cmd *)desc.data;
-
-   hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_QUERY_AN_RESULT, true);
-   ret = hclge_cmd_send(&hdev->hw, &desc, 1);
-   if (ret) {
-   dev_err(&hdev->pdev->dev,
-   "autoneg result query cmd failed %d.\n", ret);
-   return ret;
-   }
-
-   mac->autoneg = hnae_get_bit(req->an_syn_dup_speed, HCLGE_QUERY_AN_B);
-
-   return 0;
-}
-
 static int hclge_set_autoneg_en(struct hclge_dev *hdev, bool enable)
 {
struct hclge_config_auto_neg_cmd *req;
@@ -2195,8 +2173,6 @@ static int hclge_get_autoneg(struct hnae3_handle *handle)
struct hclge_vport *vport = hclge_get_vport(handle);
struct hclge_dev *hdev = vport->back;
 
-   hclge_query_autoneg_result(hdev);
-
return hdev->hw.mac.autoneg;
 }
 
-- 
1.9.1



[PATCH V2 net-next 06/17] net: hns3: Add a mask initialization for mac_vlan table

2017-12-18 Thread Lipeng
This patch marks the vlan field as masked in the mac_vlan table, so
that received packets are not filtered out.

Signed-off-by: Shenjian 
Signed-off-by: Lipeng 
---
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h | 10 ++
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 39 +-
 2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
index 1eb9ff0..10adf86 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
@@ -191,6 +191,7 @@ enum hclge_opcode_type {
HCLGE_OPC_MAC_VLAN_INSERT   = 0x1003,
HCLGE_OPC_MAC_ETHTYPE_ADD   = 0x1010,
HCLGE_OPC_MAC_ETHTYPE_REMOVE= 0x1011,
+   HCLGE_OPC_MAC_VLAN_MASK_SET = 0x1012,
 
/* Multicast linear table cmd */
HCLGE_OPC_MTA_MAC_MODE_CFG  = 0x1020,
@@ -589,6 +590,15 @@ struct hclge_mac_vlan_tbl_entry_cmd {
u8  rsv2[6];
 };
 
+#define HCLGE_VLAN_MASK_EN_B   0x0
+struct hclge_mac_vlan_mask_entry_cmd {
+   u8 rsv0[2];
+   u8 vlan_mask;
+   u8 rsv1;
+   u8 mac_mask[6];
+   u8 rsv2[14];
+};
+
 #define HCLGE_CFG_MTA_MAC_SEL_S 0x0
 #define HCLGE_CFG_MTA_MAC_SEL_M GENMASK(1, 0)
 #define HCLGE_CFG_MTA_MAC_EN_B 0x7
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index b8658b8..d7f6063 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -2200,9 +2200,34 @@ static int hclge_get_autoneg(struct hnae3_handle *handle)
return hdev->hw.mac.autoneg;
 }
 
+static int hclge_set_default_mac_vlan_mask(struct hclge_dev *hdev,
+  bool mask_vlan,
+  u8 *mac_mask)
+{
+   struct hclge_mac_vlan_mask_entry_cmd *req;
+   struct hclge_desc desc;
+   int status;
+
+   req = (struct hclge_mac_vlan_mask_entry_cmd *)desc.data;
+   hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_MAC_VLAN_MASK_SET, false);
+
+   hnae_set_bit(req->vlan_mask, HCLGE_VLAN_MASK_EN_B,
+mask_vlan ? 1 : 0);
+   ether_addr_copy(req->mac_mask, mac_mask);
+
+   status = hclge_cmd_send(&hdev->hw, &desc, 1);
+   if (status)
+   dev_err(&hdev->pdev->dev,
+   "Config mac_vlan_mask failed for cmd_send, ret =%d\n",
+   status);
+
+   return status;
+}
+
 static int hclge_mac_init(struct hclge_dev *hdev)
 {
struct hclge_mac *mac = &hdev->hw.mac;
+   u8 mac_mask[ETH_ALEN] = {0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
int ret;
 
ret = hclge_cfg_mac_speed_dup(hdev, hdev->hw.mac.speed, HCLGE_MAC_FULL);
@@ -2228,7 +2253,19 @@ static int hclge_mac_init(struct hclge_dev *hdev)
return ret;
}
 
-   return hclge_cfg_func_mta_filter(hdev, 0, hdev->accept_mta_mc);
+   ret = hclge_cfg_func_mta_filter(hdev, 0, hdev->accept_mta_mc);
+   if (ret) {
+   dev_err(&hdev->pdev->dev,
+   "set mta filter mode fail ret=%d\n", ret);
+   return ret;
+   }
+
+   ret = hclge_set_default_mac_vlan_mask(hdev, true, mac_mask);
+   if (ret)
+   dev_err(&hdev->pdev->dev,
+   "set default mac_vlan_mask fail ret=%d\n", ret);
+
+   return ret;
 }
 
 static void hclge_mbx_task_schedule(struct hclge_dev *hdev)
-- 
1.9.1



Re: [PATCH 3/3] trace: print address if symbol not found

2017-12-18 Thread Steven Rostedt
On Tue, 19 Dec 2017 14:00:11 +1100
"Tobin C. Harding"  wrote:

> I ran through these as outlined here for the new version (v4). This hits
> the modified code but doesn't test symbol look up failure.

stacktrace shouldn't post non kernel values, unless there's a frame
pointer that isn't handled by kallsyms.

As for the other two, we could probably force a failure, like:

 # echo 'hist:keys=hrtimer.sym' > \
 events/timer/hrtimer_start/trigger
 # cat events/timer/hrtimer_start/hist

And then just add sym-offset too.

> 
> I also configured kernel with 'Perform a startup test on ftrace' for
> good luck.
> 
> Are you happy with this level of testing?

Can you try the above.

-- Steve


[PATCH V2 net-next 17/17] net: hns3: change TM sched mode to TC-based mode when SRIOV enabled

2017-12-18 Thread Lipeng
TC-based sched mode supports both SRIOV enabled and SRIOV disabled. This
patch changes the TM sched mode to TC-based mode in the initialization
process.

Fixes: cc9bb43ab394 ("net: hns3: Add tc-based TM support for sriov enabled 
port")
Signed-off-by: Lipeng 
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index ff63bca..01bc744 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -1100,10 +1100,7 @@ static int hclge_configure(struct hclge_dev *hdev)
for (i = 0; i < hdev->tm_info.num_tc; i++)
hnae_set_bit(hdev->hw_tc_map, i, 1);
 
-   if (!hdev->num_vmdq_vport && !hdev->num_req_vfs)
-   hdev->tx_sch_mode = HCLGE_FLAG_TC_BASE_SCH_MODE;
-   else
-   hdev->tx_sch_mode = HCLGE_FLAG_VNET_BASE_SCH_MODE;
+   hdev->tx_sch_mode = HCLGE_FLAG_TC_BASE_SCH_MODE;
 
return ret;
 }
-- 
1.9.1



[PATCH V2 net-next 11/17] net: hns3: fix for getting auto-negotiation state in hclge_get_autoneg

2017-12-18 Thread Lipeng
From: Fuyun Liang 

When phy exists, we use the value of phydev.autoneg to represent the
auto-negotiation state of hardware. Otherwise, we use the value of
mac.autoneg to represent it.

This patch fixes hclge_get_autoneg() returning a wrong value for the
auto-negotiation state.

Fixes: 46a3df9f9718 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility 
Layer Support")
Signed-off-by: Fuyun Liang 
Signed-off-by: Lipeng 
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 9ccfe86..b65c74f 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -2172,6 +2172,10 @@ static int hclge_get_autoneg(struct hnae3_handle *handle)
 {
struct hclge_vport *vport = hclge_get_vport(handle);
struct hclge_dev *hdev = vport->back;
+   struct phy_device *phydev = hdev->hw.mac.phydev;
+
+   if (phydev)
+   return phydev->autoneg;
 
return hdev->hw.mac.autoneg;
 }
-- 
1.9.1



[PATCH V2 net-next 15/17] net: hns3: add support for querying advertised pause frame by ethtool ethx

2017-12-18 Thread Lipeng
This patch adds support for querying advertised pause frame by using
ethtool command(ethtool ethx).

Fixes: 496d03e960ae ("net: hns3: Add Ethtool support to HNS3 driver")
Signed-off-by: Fuyun Liang 
Signed-off-by: Lipeng 
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.h |  2 ++
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c  | 15 +++
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 15 +++
 3 files changed, 32 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h 
b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index a67d02a9..82e9a80 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -394,6 +394,8 @@ struct hnae3_ae_ops {
void (*get_tqps_and_rss_info)(struct hnae3_handle *h,
  u16 *free_tqps, u16 *max_rss_size);
int (*set_channels)(struct hnae3_handle *handle, u32 new_tqps_num);
+   void (*get_flowctrl_adv)(struct hnae3_handle *handle,
+u32 *flowctrl_adv);
 };
 
 struct hnae3_dcb_ops {
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
index b829ec7..2ae4d39 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
@@ -575,6 +575,7 @@ static int hns3_get_link_ksettings(struct net_device 
*netdev,
   struct ethtool_link_ksettings *cmd)
 {
struct hnae3_handle *h = hns3_get_handle(netdev);
+   u32 flowctrl_adv = 0;
u32 supported_caps;
u32 advertised_caps;
u8 media_type = HNAE3_MEDIA_TYPE_UNKNOWN;
@@ -650,6 +651,8 @@ static int hns3_get_link_ksettings(struct net_device 
*netdev,
if (!cmd->base.autoneg)
advertised_caps &= ~HNS3_LM_AUTONEG_BIT;
 
+   advertised_caps &= ~HNS3_LM_PAUSE_BIT;
+
/* now, map driver link modes to ethtool link modes */
hns3_driv_to_eth_caps(supported_caps, cmd, false);
hns3_driv_to_eth_caps(advertised_caps, cmd, true);
@@ -662,6 +665,18 @@ static int hns3_get_link_ksettings(struct net_device 
*netdev,
/* 4.mdio_support */
cmd->base.mdio_support = ETH_MDIO_SUPPORTS_C22;
 
+   /* 5.get flow control settings */
+   if (h->ae_algo->ops->get_flowctrl_adv)
+   h->ae_algo->ops->get_flowctrl_adv(h, &flowctrl_adv);
+
+   if (flowctrl_adv & ADVERTISED_Pause)
+   ethtool_link_ksettings_add_link_mode(cmd, advertising,
+Pause);
+
+   if (flowctrl_adv & ADVERTISED_Asym_Pause)
+   ethtool_link_ksettings_add_link_mode(cmd, advertising,
+Asym_Pause);
+
return 0;
 }
 
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index f5465a8..ff63bca 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -4660,6 +4660,20 @@ static u32 hclge_get_fw_version(struct hnae3_handle 
*handle)
return hdev->fw_version;
 }
 
+static void hclge_get_flowctrl_adv(struct hnae3_handle *handle,
+  u32 *flowctrl_adv)
+{
+   struct hclge_vport *vport = hclge_get_vport(handle);
+   struct hclge_dev *hdev = vport->back;
+   struct phy_device *phydev = hdev->hw.mac.phydev;
+
+   if (!phydev)
+   return;
+
+   *flowctrl_adv |= (phydev->advertising & ADVERTISED_Pause) |
+(phydev->advertising & ADVERTISED_Asym_Pause);
+}
+
 static void hclge_set_flowctrl_adv(struct hclge_dev *hdev, u32 rx_en, u32 
tx_en)
 {
struct phy_device *phydev = hdev->hw.mac.phydev;
@@ -5477,6 +5491,7 @@ static int hclge_set_channels(struct hnae3_handle 
*handle, u32 new_tqps_num)
.get_tqps_and_rss_info = hclge_get_tqps_and_rss_info,
.set_channels = hclge_set_channels,
.get_channels = hclge_get_channels,
+   .get_flowctrl_adv = hclge_get_flowctrl_adv,
 };
 
 static struct hnae3_ae_algo ae_algo = {
-- 
1.9.1



[PATCH V2 net-next 08/17] net: hns3: Add ethtool related offload command

2017-12-18 Thread Lipeng
This patch adds offload command related to "ethtool -K".

Signed-off-by: Shenjian 
Signed-off-by: Lipeng 
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.h |  3 +++
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 16 
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 13 +
 3 files changed, 32 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h 
b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index a5d3d22..a67d02a9 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -278,6 +278,8 @@ struct hnae3_ae_dev {
  *   Set vlan filter config of Ports
  * set_vf_vlan_filter()
  *   Set vlan filter config of vf
+ * enable_hw_strip_rxvtag()
+ *   Enable/disable hardware strip vlan tag of packets received
  */
 struct hnae3_ae_ops {
int (*init_ae_dev)(struct hnae3_ae_dev *ae_dev);
@@ -384,6 +386,7 @@ struct hnae3_ae_ops {
   u16 vlan_id, bool is_kill);
int (*set_vf_vlan_filter)(struct hnae3_handle *handle, int vfid,
  u16 vlan, u8 qos, __be16 proto);
+   int (*enable_hw_strip_rxvtag)(struct hnae3_handle *handle, bool enable);
void (*reset_event)(struct hnae3_handle *handle,
enum hnae3_reset_type reset);
void (*get_channels)(struct hnae3_handle *handle,
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index b7fe980..377964a 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -1032,6 +1032,9 @@ static int hns3_nic_set_features(struct net_device 
*netdev,
 netdev_features_t features)
 {
struct hns3_nic_priv *priv = netdev_priv(netdev);
+   struct hnae3_handle *h = priv->ae_handle;
+   netdev_features_t changed;
+   int ret;
 
if (features & (NETIF_F_TSO | NETIF_F_TSO6)) {
priv->ops.fill_desc = hns3_fill_desc_tso;
@@ -1041,6 +1044,17 @@ static int hns3_nic_set_features(struct net_device 
*netdev,
priv->ops.maybe_stop_tx = hns3_nic_maybe_stop_tx;
}
 
+   changed = netdev->features ^ features;
+   if (changed & NETIF_F_HW_VLAN_CTAG_RX) {
+   if (features & NETIF_F_HW_VLAN_CTAG_RX)
+   ret = h->ae_algo->ops->enable_hw_strip_rxvtag(h, true);
+   else
+   ret = h->ae_algo->ops->enable_hw_strip_rxvtag(h, false);
+
+   if (ret)
+   return ret;
+   }
+
netdev->features = features;
return 0;
 }
@@ -1492,6 +1506,7 @@ static void hns3_set_default_feature(struct net_device 
*netdev)
 
netdev->features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
NETIF_F_HW_VLAN_CTAG_FILTER |
+   NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX |
NETIF_F_RXCSUM | NETIF_F_SG | NETIF_F_GSO |
NETIF_F_GRO | NETIF_F_TSO | NETIF_F_TSO6 | NETIF_F_GSO_GRE |
NETIF_F_GSO_GRE_CSUM | NETIF_F_GSO_UDP_TUNNEL |
@@ -1506,6 +1521,7 @@ static void hns3_set_default_feature(struct net_device 
*netdev)
 
netdev->hw_features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
NETIF_F_HW_VLAN_CTAG_FILTER |
+   NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX |
NETIF_F_RXCSUM | NETIF_F_SG | NETIF_F_GSO |
NETIF_F_GRO | NETIF_F_TSO | NETIF_F_TSO6 | NETIF_F_GSO_GRE |
NETIF_F_GSO_GRE_CSUM | NETIF_F_GSO_UDP_TUNNEL |
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index d4cdc8d..e253f73 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -4547,6 +4547,18 @@ static int hclge_init_vlan_config(struct hclge_dev *hdev)
return hclge_set_port_vlan_filter(handle, htons(ETH_P_8021Q), 0, false);
 }
 
+static int hclge_en_hw_strip_rxvtag(struct hnae3_handle *handle, bool enable)
+{
+   struct hclge_vport *vport = hclge_get_vport(handle);
+
+   vport->rxvlan_cfg.strip_tag1_en = false;
+   vport->rxvlan_cfg.strip_tag2_en = enable;
+   vport->rxvlan_cfg.vlan1_vlan_prionly = false;
+   vport->rxvlan_cfg.vlan2_vlan_prionly = false;
+
+   return hclge_set_vlan_rx_offload_cfg(vport);
+}
+
 static int hclge_set_mtu(struct hnae3_handle *handle, int new_mtu)
 {
struct hclge_vport *vport = hclge_get_vport(handle);
@@ -5361,6 +5373,7 @@ static int hclge_set_channels(struct hnae3_handle 
*handle, u32 new_tqps_num)
.get_mdix_mode = hclge_get_mdix_mode,
.set_vlan_filter = hclge_set_port_vlan_filter,
.set_vf_vlan_filter = hclge_set_vf_vlan_filter,
+   .enable_hw_strip_rxvtag = 

[PATCH V2 net-next 09/17] net: hns3: Add handling vlan tag offload in bd

2017-12-18 Thread Lipeng
This patch handles the vlan tag information passed between the
sk_buff and the rx/tx buffer descriptor (bd).

Signed-off-by: Shenjian 
Signed-off-by: Lipeng 
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 83 +++--
 1 file changed, 78 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 377964a..212d0dc 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -723,6 +723,58 @@ static void hns3_set_txbd_baseinfo(u16 
*bdtp_fe_sc_vld_ra_ri, int frag_end)
hnae_set_field(*bdtp_fe_sc_vld_ra_ri, HNS3_TXD_SC_M, HNS3_TXD_SC_S, 0);
 }
 
+static int hns3_fill_desc_vtags(struct sk_buff *skb,
+   struct hns3_enet_ring *tx_ring,
+   u32 *inner_vlan_flag,
+   u32 *out_vlan_flag,
+   u16 *inner_vtag,
+   u16 *out_vtag)
+{
+#define HNS3_TX_VLAN_PRIO_SHIFT 13
+
+   if (skb->protocol == htons(ETH_P_8021Q) &&
+   !(tx_ring->tqp->handle->kinfo.netdev->features &
+   NETIF_F_HW_VLAN_CTAG_TX)) {
+   /* When HW VLAN acceleration is turned off, and the stack
+* sets the protocol to 802.1q, the driver just needs to
+* set the protocol to the encapsulated ethertype.
+*/
+   skb->protocol = vlan_get_protocol(skb);
+   return 0;
+   }
+
+   if (skb_vlan_tag_present(skb)) {
+   u16 vlan_tag;
+
+   vlan_tag = skb_vlan_tag_get(skb);
+   vlan_tag |= (skb->priority & 0x7) << HNS3_TX_VLAN_PRIO_SHIFT;
+
+   /* Based on hw strategy, use out_vtag in two layer tag case,
+* and use inner_vtag in one tag case.
+*/
+   if (skb->protocol == htons(ETH_P_8021Q)) {
+   hnae_set_bit(*out_vlan_flag, HNS3_TXD_OVLAN_B, 1);
+   *out_vtag = vlan_tag;
+   } else {
+   hnae_set_bit(*inner_vlan_flag, HNS3_TXD_VLAN_B, 1);
+   *inner_vtag = vlan_tag;
+   }
+   } else if (skb->protocol == htons(ETH_P_8021Q)) {
+   struct vlan_ethhdr *vhdr;
+   int rc;
+
+   rc = skb_cow_head(skb, 0);
+   if (rc < 0)
+   return rc;
+   vhdr = (struct vlan_ethhdr *)skb->data;
+   vhdr->h_vlan_TCI |= cpu_to_be16((skb->priority & 0x7)
+   << HNS3_TX_VLAN_PRIO_SHIFT);
+   }
+
+   skb->protocol = vlan_get_protocol(skb);
+   return 0;
+}
+
 static int hns3_fill_desc(struct hns3_enet_ring *ring, void *priv,
  int size, dma_addr_t dma, int frag_end,
  enum hns_desc_type type)
@@ -733,6 +785,8 @@ static int hns3_fill_desc(struct hns3_enet_ring *ring, void 
*priv,
u16 bdtp_fe_sc_vld_ra_ri = 0;
u32 type_cs_vlan_tso = 0;
struct sk_buff *skb;
+   u16 inner_vtag = 0;
+   u16 out_vtag = 0;
u32 paylen = 0;
u16 mss = 0;
__be16 protocol;
@@ -756,15 +810,16 @@ static int hns3_fill_desc(struct hns3_enet_ring *ring, 
void *priv,
skb = (struct sk_buff *)priv;
paylen = skb->len;
 
+   ret = hns3_fill_desc_vtags(skb, ring, &type_cs_vlan_tso,
+  &ol_type_vlan_len_msec,
+  &inner_vtag, &out_vtag);
+   if (unlikely(ret))
+   return ret;
+
if (skb->ip_summed == CHECKSUM_PARTIAL) {
skb_reset_mac_len(skb);
protocol = skb->protocol;
 
-   /* vlan packet*/
-   if (protocol == htons(ETH_P_8021Q)) {
-   protocol = vlan_get_protocol(skb);
-   skb->protocol = protocol;
-   }
ret = hns3_get_l4_protocol(skb, &ol4_proto, &il4_proto);
if (ret)
return ret;
@@ -790,6 +845,8 @@ static int hns3_fill_desc(struct hns3_enet_ring *ring, void 
*priv,
cpu_to_le32(type_cs_vlan_tso);
desc->tx.paylen = cpu_to_le32(paylen);
desc->tx.mss = cpu_to_le16(mss);
+   desc->tx.vlan_tag = cpu_to_le16(inner_vtag);
+   desc->tx.outer_vlan_tag = cpu_to_le16(out_vtag);
}
 
/* move ring pointer to next.*/
@@ -2101,6 +2158,22 @@ static int hns3_handle_rx_bd(struct hns3_enet_ring *ring,
 
prefetchw(skb->data);
 
+   /* Based on hw strategy, the tag offloaded will be stored at
+* ot_vlan_tag in two layer tag case, and stored at vlan_tag
+* in one layer tag case.

[PATCH V2 net-next 16/17] net: hns3: Increase the default depth of bucket for TM shaper

2017-12-18 Thread Lipeng
The burstiness of a flow is determined by the depth of its bucket. When
the upper rate of the shaper is large, the current bucket depth is not
enough.

The default upper rate of the shaper is 100G, so increase the bucket
depth according to the UM.

Signed-off-by: Yunsheng Lin 
Signed-off-by: Lipeng 
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c
index 7cfe1eb..ea9355d 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c
@@ -23,8 +23,8 @@ enum hclge_shaper_level {
HCLGE_SHAPER_LVL_PF = 1,
 };
 
-#define HCLGE_SHAPER_BS_U_DEF  1
-#define HCLGE_SHAPER_BS_S_DEF  4
+#define HCLGE_SHAPER_BS_U_DEF  5
+#define HCLGE_SHAPER_BS_S_DEF  20
 
 #define HCLGE_ETHER_MAX_RATE   10
 
-- 
1.9.1
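
To illustrate why a deeper bucket permits more burstiness, here is a
minimal, generic token-bucket sketch. It is not the hns3 hardware
algorithm and not code from hclge_tm.c; struct token_bucket, tb_refill()
and tb_try_send() are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical generic token bucket; not the hns3 shaper implementation. */
struct token_bucket {
	uint64_t depth;   /* maximum tokens (bytes) the bucket can hold */
	uint64_t tokens;  /* tokens currently available */
	uint64_t rate;    /* tokens (bytes) added per tick */
};

/* Add tokens for the elapsed ticks, capping at the bucket depth. */
static void tb_refill(struct token_bucket *tb, uint64_t ticks)
{
	uint64_t added = tb->rate * ticks;

	assert(tb->tokens <= tb->depth);
	if (added > tb->depth - tb->tokens)
		tb->tokens = tb->depth;
	else
		tb->tokens += added;
}

/* Consume tokens for one packet; return false if the packet must wait. */
static bool tb_try_send(struct token_bucket *tb, uint64_t bytes)
{
	if (bytes > tb->tokens)
		return false;
	tb->tokens -= bytes;
	return true;
}
```

With the same rate, a bucket of depth 3000 lets two back-to-back
1500-byte packets through after an idle period, while a depth of 1500
would let only one through. The depth, not the rate, bounds the burst.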



Re: Shutting down a VM with Kernel 4.14 will sometime hang and a reboot is the only way to recover.

2017-12-18 Thread Jason Wang



On 2017-12-12 11:53, David Hill wrote:
On 2017-12-08 01:03 PM, David Hill wrote:
On 2017-12-07 12:13 AM, Jason Wang wrote:
On 2017-12-07 12:42, David Hill wrote:
On 2017-12-06 11:34 PM, David Hill wrote:
On 2017-12-04 02:51 PM, David Hill wrote:
On 2017-12-03 11:08 PM, Jason Wang wrote:
On 2017-12-02 00:38, David Hill wrote:


Finally, I reverted 581fe0ea61584d88072527ae9fb9dcb9d1f2783e 
too ... compiling and I'll keep you posted.


So I'm still able to reproduce this issue even after reverting 
these 3 commits. Would you have any other suspect commits?


Thanks for the testing. No, I don't have other suspect commits.

Looks like somebody else is hitting your issue too (see 
https://www.spinics.net/lists/netdev/msg468319.html)

But he claims the issue was fixed by using qemu 2.10.1.

So you may:

-try to see if qemu 2.10.1 solves your issue

It didn't solve it for him... it's only harder to reproduce. [1]
-if not, try to see if commit 
2ddf71e23cc246e95af72a6deed67b4a50a7b81c ("net: add notifier 
hooks for devmap bpf map") is the first bad commit

I'll try to see what I can do here
I'm looking at that commit and, if I'm not mistaken, it was introduced 
before v4.13, while this issue appeared between v4.13 and v4.14-rc1. 
Between those two releases, there are 1352 commits.
Is there a way to quickly find out which commits touch vhost-net or 
zerocopy?



[ 7496.553044]  __schedule+0x2dc/0xbb0
[ 7496.553055]  ? trace_hardirqs_on+0xd/0x10
[ 7496.553074]  schedule+0x3d/0x90
[ 7496.553087]  vhost_net_ubuf_put_and_wait+0x73/0xa0 [vhost_net]
[ 7496.553100]  ? finish_wait+0x90/0x90
[ 7496.553115]  vhost_net_ioctl+0x542/0x910 [vhost_net]
[ 7496.553144]  do_vfs_ioctl+0xa6/0x6c0
[ 7496.553166]  SyS_ioctl+0x79/0x90
[ 7496.553182]  entry_SYSCALL_64_fastpath+0x1f/0xbe


That vhost_net_ubuf_put_and_wait call was changed in this 
commit, with the following comment:


commit 0ad8b480d6ee916aa84324f69acf690142aecd0e
Author: Michael S. Tsirkin 
Date:   Thu Feb 13 11:42:05 2014 +0200

    vhost: fix ref cnt checking deadlock

    vhost checked the counter within the refcnt before decrementing.  It
    really wanted to know that it is the one that has the last reference, as
    a way to batch freeing resources a bit more efficiently.

    Note: we only let refcount go to 0 on device release.

    This works well but we now access the ref counter twice so there's a
    race: all users might see a high count and decide to defer freeing
    resources.
    In the end no one initiates freeing resources until the last reference
    is gone (which is on VM shotdown so might happen after a long time).

    Let's do what we probably should have done straight away:
    switch from kref to plain atomic, documenting the
    semantics, return the refcount value atomically after decrement,
    then use that to avoid the deadlock.

    Reported-by: Qin Chuanyu 
    Signed-off-by: Michael S. Tsirkin 
    Acked-by: Jason Wang 
    Signed-off-by: David S. Miller 
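
For readers following along, the "return the refcount value atomically
after decrement" idea from that commit message can be sketched in plain
C11 as below. This is a userspace illustration only, not the vhost code;
struct ubuf_ref and ubuf_ref_put() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userspace stand-in for a reference-counted ubuf tracker. */
struct ubuf_ref {
	atomic_int refcount;
};

/*
 * Drop one reference and return the count that remains, as one atomic
 * step. Because the decrement and the read are a single operation, two
 * callers can never both believe somebody else holds the last reference.
 */
static int ubuf_ref_put(struct ubuf_ref *ref)
{
	int old = atomic_fetch_sub(&ref->refcount, 1);

	assert(old > 0);	/* catches a put without a matching get */
	return old - 1;
}
```

A caller that gets 0 back knows it dropped the last reference and is the
one responsible for waking any waiter. Reading the counter first and
decrementing afterwards would reopen the race the commit describes.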



So at this point, are we hitting a deadlock when using 
experimental_zcopytx ? 


Yes. But another possibility is that it is not caused by vhost_net 
itself but by some other place that holds a packet.


Thanks


While bisecting, when I reach commit 
46d4b68f891bee5d83a32508bfbd9778be6b1b63, the kernel panics 
when I run virt-customize:


Message from syslogd@zappa at Dec  8 12:52:06 ...
 kernel:[  350.016376] Kernel panic - not syncing: Fatal exception in 
interrupt


I marked that commit as bad again.   Will continue bisecting!



It looks like the first bad commit would be the following:

[jenkins@zappa linux-stable-new]$ sudo bash bisect.sh -g
3ece782693c4b64d588dd217868558ab9a19bfe7 is the first bad commit
commit 3ece782693c4b64d588dd217868558ab9a19bfe7
Author: Willem de Bruijn 
Date:   Thu Aug 3 16:29:38 2017 -0400

    sock: skb_copy_ubufs support for compound pages

    Refine skb_copy_ubufs to support compound pages. With upcoming TCP
    zerocopy sendmsg, such fragments may appear.

    The existing code replaces each page one for one. Splitting each
    compound page into an independent number of regular pages can result
    in exceeding limit MAX_SKB_FRAGS if data is not exactly page aligned.

    Instead, fill all destination pages but the last to PAGE_SIZE.
    Split the existing alloc + copy loop into separate stages:
    1. compute bytelength and minimum number of pages to store this.
    2. allocate
    3. copy, filling each page except the last to PAGE_SIZE bytes
    4. update skb frag array

    Signed-off-by: Willem de Bruijn 
    Signed-off-by: David S. Miller 
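
The page accounting in steps 1 and 3 of that commit message can be
illustrated with the small sketch below. It is not the skb_copy_ubufs()
code; SKETCH_PAGE_SIZE, pages_needed() and bytes_in_page() are
hypothetical names, and a 4096-byte page is assumed.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, for illustration */

/* Step 1: minimum number of pages needed to hold len bytes. */
static size_t pages_needed(size_t len)
{
	return (len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* Step 3: bytes stored in page idx when every page but the last is full. */
static size_t bytes_in_page(size_t len, size_t idx)
{
	size_t n = pages_needed(len);

	assert(idx < n);
	if (idx + 1 < n)
		return SKETCH_PAGE_SIZE;
	return len - idx * SKETCH_PAGE_SIZE;
}
```

Packing the data this way keeps the destination page count at the
minimum, which is how the commit avoids exceeding MAX_SKB_FRAGS when the
source fragments are not page aligned.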

:04 04 f1b652be7e59b1046400cad8e6be25028a88b8e2 
6ecf86d9f06a2d98946f531f1e4cf803de071b10 M    include
:04 04 8420cf451fcf51f669ce81437ce7e0aacc33d2eb 

[PATCH V2 net-next 07/17] net: hns3: Add vlan offload config command

2017-12-18 Thread Lipeng
This patch adds the vlan offload config commands and initializes
the hw rules for tx/rx vlan tag handling.

Signed-off-by: Shenjian 
Signed-off-by: Lipeng 
---
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h |  45 ++
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 158 -
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h|  36 +
 3 files changed, 233 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
index 10adf86..f5baba21 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
@@ -180,6 +180,10 @@ enum hclge_opcode_type {
/* Promisuous mode command */
HCLGE_OPC_CFG_PROMISC_MODE  = 0x0E01,
 
+   /* Vlan offload command */
+   HCLGE_OPC_VLAN_PORT_TX_CFG  = 0x0F01,
+   HCLGE_OPC_VLAN_PORT_RX_CFG  = 0x0F02,
+
/* Interrupts cmd */
HCLGE_OPC_ADD_RING_TO_VECTOR= 0x1503,
HCLGE_OPC_DEL_RING_TO_VECTOR= 0x1504,
@@ -670,6 +674,47 @@ struct hclge_vlan_filter_vf_cfg_cmd {
u8  vf_bitmap[16];
 };
 
+#define HCLGE_ACCEPT_TAG_B 0
+#define HCLGE_ACCEPT_UNTAG_B   1
+#define HCLGE_PORT_INS_TAG1_EN_B   2
+#define HCLGE_PORT_INS_TAG2_EN_B   3
+#define HCLGE_CFG_NIC_ROCE_SEL_B   4
+struct hclge_vport_vtag_tx_cfg_cmd {
+   u8 vport_vlan_cfg;
+   u8 vf_offset;
+   u8 rsv1[2];
+   __le16 def_vlan_tag1;
+   __le16 def_vlan_tag2;
+   u8 vf_bitmap[8];
+   u8 rsv2[8];
+};
+
+#define HCLGE_REM_TAG1_EN_B    0
+#define HCLGE_REM_TAG2_EN_B    1
+#define HCLGE_SHOW_TAG1_EN_B   2
+#define HCLGE_SHOW_TAG2_EN_B   3
+struct hclge_vport_vtag_rx_cfg_cmd {
+   u8 vport_vlan_cfg;
+   u8 vf_offset;
+   u8 rsv1[6];
+   u8 vf_bitmap[8];
+   u8 rsv2[8];
+};
+
+struct hclge_tx_vlan_type_cfg_cmd {
+   __le16 ot_vlan_type;
+   __le16 in_vlan_type;
+   u8 rsv[20];
+};
+
+struct hclge_rx_vlan_type_cfg_cmd {
+   __le16 ot_fst_vlan_type;
+   __le16 ot_sec_vlan_type;
+   __le16 in_fst_vlan_type;
+   __le16 in_sec_vlan_type;
+   u8 rsv[16];
+};
+
 struct hclge_cfg_com_tqp_queue_cmd {
__le16 tqp_id;
__le16 stream_id;
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index d7f6063..d4cdc8d 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -4380,23 +4380,169 @@ static int hclge_set_vf_vlan_filter(struct 
hnae3_handle *handle, int vfid,
return hclge_set_vf_vlan_common(hdev, vfid, false, vlan, qos, proto);
 }
 
+static int hclge_set_vlan_tx_offload_cfg(struct hclge_vport *vport)
+{
+   struct hclge_tx_vtag_cfg *vcfg = &vport->txvlan_cfg;
+   struct hclge_vport_vtag_tx_cfg_cmd *req;
+   struct hclge_dev *hdev = vport->back;
+   struct hclge_desc desc;
+   int status;
+
+   hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_VLAN_PORT_TX_CFG, false);
+
+   req = (struct hclge_vport_vtag_tx_cfg_cmd *)desc.data;
+   req->def_vlan_tag1 = cpu_to_le16(vcfg->default_tag1);
+   req->def_vlan_tag2 = cpu_to_le16(vcfg->default_tag2);
+   hnae_set_bit(req->vport_vlan_cfg, HCLGE_ACCEPT_TAG_B,
+vcfg->accept_tag ? 1 : 0);
+   hnae_set_bit(req->vport_vlan_cfg, HCLGE_ACCEPT_UNTAG_B,
+vcfg->accept_untag ? 1 : 0);
+   hnae_set_bit(req->vport_vlan_cfg, HCLGE_PORT_INS_TAG1_EN_B,
+vcfg->insert_tag1_en ? 1 : 0);
+   hnae_set_bit(req->vport_vlan_cfg, HCLGE_PORT_INS_TAG2_EN_B,
+vcfg->insert_tag2_en ? 1 : 0);
+   hnae_set_bit(req->vport_vlan_cfg, HCLGE_CFG_NIC_ROCE_SEL_B, 0);
+
+   req->vf_offset = vport->vport_id / HCLGE_VF_NUM_PER_CMD;
+   req->vf_bitmap[req->vf_offset] =
+   1 << (vport->vport_id % HCLGE_VF_NUM_PER_BYTE);
+
+   status = hclge_cmd_send(&hdev->hw, &desc, 1);
+   if (status)
+   dev_err(&hdev->pdev->dev,
+   "Send port txvlan cfg command fail, ret =%d\n",
+   status);
+
+   return status;
+}
+
+static int hclge_set_vlan_rx_offload_cfg(struct hclge_vport *vport)
+{
+   struct hclge_rx_vtag_cfg *vcfg = &vport->rxvlan_cfg;
+   struct hclge_vport_vtag_rx_cfg_cmd *req;
+   struct hclge_dev *hdev = vport->back;
+   struct hclge_desc desc;
+   int status;
+
+   hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_VLAN_PORT_RX_CFG, false);
+
+   req = (struct hclge_vport_vtag_rx_cfg_cmd *)desc.data;
+   hnae_set_bit(req->vport_vlan_cfg, HCLGE_REM_TAG1_EN_B,
+vcfg->strip_tag1_en ? 1 : 0);
+   hnae_set_bit(req->vport_vlan_cfg, HCLGE_REM_TAG2_EN_B,
+vcfg->strip_tag2_en ? 1 : 0);
+   

[PATCH V2 net-next 12/17] net: hns3: add support for set_pauseparam

2017-12-18 Thread Lipeng
This patch adds set_pauseparam support for ethtool cmd.

Signed-off-by: Fuyun Liang 
Signed-off-by: Lipeng 
---
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 13 
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 83 ++
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c  |  2 +-
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.h  |  1 +
 4 files changed, 98 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
index 2fd2656..b829ec7 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
@@ -559,6 +559,18 @@ static void hns3_get_pauseparam(struct net_device *netdev,
&param->rx_pause, &param->tx_pause);
 }
 
+static int hns3_set_pauseparam(struct net_device *netdev,
+  struct ethtool_pauseparam *param)
+{
+   struct hnae3_handle *h = hns3_get_handle(netdev);
+
+   if (h->ae_algo->ops->set_pauseparam)
+   return h->ae_algo->ops->set_pauseparam(h, param->autoneg,
+  param->rx_pause,
+  param->tx_pause);
+   return -EOPNOTSUPP;
+}
+
 static int hns3_get_link_ksettings(struct net_device *netdev,
   struct ethtool_link_ksettings *cmd)
 {
@@ -880,6 +892,7 @@ void hns3_get_channels(struct net_device *netdev,
.get_ringparam = hns3_get_ringparam,
.set_ringparam = hns3_set_ringparam,
.get_pauseparam = hns3_get_pauseparam,
+   .set_pauseparam = hns3_set_pauseparam,
.get_strings = hns3_get_strings,
.get_ethtool_stats = hns3_get_stats,
.get_sset_count = hns3_get_sset_count,
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index b65c74f..fbe5dee 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -4660,6 +4660,53 @@ static u32 hclge_get_fw_version(struct hnae3_handle 
*handle)
return hdev->fw_version;
 }
 
+static void hclge_set_flowctrl_adv(struct hclge_dev *hdev, u32 rx_en, u32 
tx_en)
+{
+   struct phy_device *phydev = hdev->hw.mac.phydev;
+
+   if (!phydev)
+   return;
+
+   phydev->advertising &= ~(ADVERTISED_Pause | ADVERTISED_Asym_Pause);
+
+   if (rx_en)
+   phydev->advertising |= ADVERTISED_Pause | ADVERTISED_Asym_Pause;
+
+   if (tx_en)
+   phydev->advertising ^= ADVERTISED_Asym_Pause;
+}
+
+static int hclge_cfg_pauseparam(struct hclge_dev *hdev, u32 rx_en, u32 tx_en)
+{
+   enum hclge_fc_mode fc_mode;
+   int ret;
+
+   if (rx_en && tx_en)
+   fc_mode = HCLGE_FC_FULL;
+   else if (rx_en && !tx_en)
+   fc_mode = HCLGE_FC_RX_PAUSE;
+   else if (!rx_en && tx_en)
+   fc_mode = HCLGE_FC_TX_PAUSE;
+   else
+   fc_mode = HCLGE_FC_NONE;
+
+   if (hdev->tm_info.fc_mode == HCLGE_FC_PFC) {
+   hdev->fc_mode_last_time = fc_mode;
+   return 0;
+   }
+
+   ret = hclge_mac_pause_en_cfg(hdev, tx_en, rx_en);
+   if (ret) {
+   dev_err(&hdev->pdev->dev, "configure pauseparam error, ret = %d.\n",
+   ret);
+   return ret;
+   }
+
+   hdev->tm_info.fc_mode = fc_mode;
+
+   return 0;
+}
+
 static void hclge_get_pauseparam(struct hnae3_handle *handle, u32 *auto_neg,
 u32 *rx_en, u32 *tx_en)
 {
@@ -4689,6 +4736,41 @@ static void hclge_get_pauseparam(struct hnae3_handle 
*handle, u32 *auto_neg,
}
 }
 
+static int hclge_set_pauseparam(struct hnae3_handle *handle, u32 auto_neg,
+   u32 rx_en, u32 tx_en)
+{
+   struct hclge_vport *vport = hclge_get_vport(handle);
+   struct hclge_dev *hdev = vport->back;
+   struct phy_device *phydev = hdev->hw.mac.phydev;
+   u32 fc_autoneg;
+
+   /* Only support flow control negotiation for netdev with
+* phy attached for now.
+*/
+   if (!phydev)
+   return -EOPNOTSUPP;
+
+   fc_autoneg = hclge_get_autoneg(handle);
+   if (auto_neg != fc_autoneg) {
+   dev_info(&hdev->pdev->dev,
+"To change autoneg please use: ethtool -s  
autoneg \n");
+   return -EOPNOTSUPP;
+   }
+
+   if (hdev->tm_info.fc_mode == HCLGE_FC_PFC) {
+   dev_info(&hdev->pdev->dev,
+"Priority flow control enabled. Cannot set link flow 
control.\n");
+   return -EOPNOTSUPP;
+   }
+
+   hclge_set_flowctrl_adv(hdev, rx_en, tx_en);
+
+   if (!fc_autoneg)
+   return hclge_cfg_pauseparam(hdev, rx_en, tx_en);
+
+   return 

[PATCH V2 net-next 01/17] net: hns3: add support to query tqps number

2017-12-18 Thread Lipeng
This patch adds support for querying the tqps number of the PF driver
by using the ethtool -l command.

Signed-off-by: qumingguang 
Signed-off-by: Lipeng 
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.h |  2 ++
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c  | 10 ++
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 21 +
 3 files changed, 33 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h 
b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index a9e2b32..d887721 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -386,6 +386,8 @@ struct hnae3_ae_ops {
  u16 vlan, u8 qos, __be16 proto);
void (*reset_event)(struct hnae3_handle *handle,
enum hnae3_reset_type reset);
+   void (*get_channels)(struct hnae3_handle *handle,
+struct ethtool_channels *ch);
 };
 
 struct hnae3_dcb_ops {
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
index 65a69b4..23af36c 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
@@ -849,6 +849,15 @@ static int hns3_nway_reset(struct net_device *netdev)
return genphy_restart_aneg(phy);
 }
 
+void hns3_get_channels(struct net_device *netdev,
+  struct ethtool_channels *ch)
+{
+   struct hnae3_handle *h = hns3_get_handle(netdev);
+
+   if (h->ae_algo->ops->get_channels)
+   h->ae_algo->ops->get_channels(h, ch);
+}
+
 static const struct ethtool_ops hns3vf_ethtool_ops = {
.get_drvinfo = hns3_get_drvinfo,
.get_ringparam = hns3_get_ringparam,
@@ -883,6 +892,7 @@ static int hns3_nway_reset(struct net_device *netdev)
.get_link_ksettings = hns3_get_link_ksettings,
.set_link_ksettings = hns3_set_link_ksettings,
.nway_reset = hns3_nway_reset,
+   .get_channels = hns3_get_channels,
 };
 
 void hns3_ethtool_set_ops(struct net_device *netdev)
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index e97fd66..533e15e5 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -5002,6 +5002,26 @@ static void hclge_uninit_ae_dev(struct hnae3_ae_dev 
*ae_dev)
ae_dev->priv = NULL;
 }
 
+static u32 hclge_get_max_channels(struct hnae3_handle *handle)
+{
+   struct hclge_vport *vport = hclge_get_vport(handle);
+   struct hnae3_knic_private_info *kinfo = >kinfo;
+   struct hclge_dev *hdev = vport->back;
+
+   return min_t(u32, hdev->rss_size_max * kinfo->num_tc, hdev->num_tqps);
+}
+
+static void hclge_get_channels(struct hnae3_handle *handle,
+  struct ethtool_channels *ch)
+{
+   struct hclge_vport *vport = hclge_get_vport(handle);
+
+   ch->max_combined = hclge_get_max_channels(handle);
+   ch->other_count = 1;
+   ch->max_other = 1;
+   ch->combined_count = vport->alloc_tqps;
+}
+
 static const struct hnae3_ae_ops hclge_ops = {
.init_ae_dev = hclge_init_ae_dev,
.uninit_ae_dev = hclge_uninit_ae_dev,
@@ -5046,6 +5066,7 @@ static void hclge_uninit_ae_dev(struct hnae3_ae_dev 
*ae_dev)
.set_vlan_filter = hclge_set_port_vlan_filter,
.set_vf_vlan_filter = hclge_set_vf_vlan_filter,
.reset_event = hclge_reset_event,
+   .get_channels = hclge_get_channels,
 };
 
 static struct hnae3_ae_algo ae_algo = {
-- 
1.9.1
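
As a worked illustration of the limit reported as max_combined by
"ethtool -l", the min_t() expression in hclge_get_max_channels() can be
mirrored in userspace as below. struct chan_limits and
max_combined_channels() are hypothetical names, and the field comments
are my reading of the patch rather than documented semantics.

```c
#include <stdint.h>

/* Hypothetical stand-in for the values hclge_get_max_channels() reads. */
struct chan_limits {
	uint32_t rss_size_max;	/* assumed: max queues per traffic class */
	uint32_t num_tc;	/* number of traffic classes */
	uint32_t num_tqps;	/* total queue pairs on the device */
};

/* Smaller of (rss_size_max * num_tc) and the device-wide tqp count. */
static uint32_t max_combined_channels(const struct chan_limits *l)
{
	uint32_t by_rss = l->rss_size_max * l->num_tc;

	return by_rss < l->num_tqps ? by_rss : l->num_tqps;
}
```

For example, with rss_size_max = 16, num_tc = 4 and num_tqps = 32 the
result is capped at 32 by the device, while rss_size_max = 4 and
num_tc = 2 give 8.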



[PATCH V2 net-next 05/17] net: hns3: Get rss_size_max from configuration but not hardcode

2017-12-18 Thread Lipeng
From: qumingguang 

Add configuration for rss_size_max in hdev instead of hardcoding it.

Signed-off-by: qumingguang 
Signed-off-by: Lipeng 
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h  | 2 ++
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 6 +-
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h | 1 +
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
index ce5ed88..1eb9ff0 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
@@ -399,6 +399,8 @@ struct hclge_pf_res_cmd {
 #define HCLGE_CFG_MAC_ADDR_H_M GENMASK(15, 0)
 #define HCLGE_CFG_DEFAULT_SPEED_S  16
 #define HCLGE_CFG_DEFAULT_SPEED_M  GENMASK(23, 16)
+#define HCLGE_CFG_RSS_SIZE_S   24
+#define HCLGE_CFG_RSS_SIZE_M   GENMASK(31, 24)
 
 struct hclge_cfg_param_cmd {
__le32 offset;
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index f354681..b8658b8 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -982,6 +982,10 @@ static void hclge_parse_cfg(struct hclge_cfg *cfg, struct 
hclge_desc *desc)
cfg->default_speed = hnae_get_field(__le32_to_cpu(req->param[3]),
HCLGE_CFG_DEFAULT_SPEED_M,
HCLGE_CFG_DEFAULT_SPEED_S);
+   cfg->rss_size_max = hnae_get_field(__le32_to_cpu(req->param[3]),
+  HCLGE_CFG_RSS_SIZE_M,
+  HCLGE_CFG_RSS_SIZE_S);
+
for (i = 0; i < ETH_ALEN; i++)
cfg->mac_addr[i] = (mac_addr_tmp >> (8 * i)) & 0xff;
 
@@ -1059,7 +1063,7 @@ static int hclge_configure(struct hclge_dev *hdev)
 
hdev->num_vmdq_vport = cfg.vmdq_vport_num;
hdev->base_tqp_pid = 0;
-   hdev->rss_size_max = 1;
+   hdev->rss_size_max = cfg.rss_size_max;
hdev->rx_buf_len = cfg.rx_buf_len;
ether_addr_copy(hdev->hw.mac.mac_addr, cfg.mac_addr);
hdev->hw.mac.media_type = cfg.media_type;
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
index fb043b5..4858909 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
@@ -220,6 +220,7 @@ struct hclge_cfg {
u8 tc_num;
u16 tqp_desc_num;
u16 rx_buf_len;
+   u16 rss_size_max;
u8 phy_addr;
u8 media_type;
u8 mac_addr[ETH_ALEN];
-- 
1.9.1
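
For reference, the mask-and-shift extraction used for rss_size_max can
be sketched in userspace as follows. This assumes hnae_get_field() masks
the word and shifts the result down, as its arguments suggest;
SKETCH_GENMASK() and get_field() are hypothetical stand-ins, not the
kernel macros themselves.

```c
#include <stdint.h>

/* Userspace equivalent of GENMASK(h, l) for 32-bit values (0 <= l <= h <= 31). */
#define SKETCH_GENMASK(h, l) \
	((~UINT32_C(0) << (l)) & (~UINT32_C(0) >> (31 - (h))))

/* Same bit positions as HCLGE_CFG_RSS_SIZE_S / HCLGE_CFG_RSS_SIZE_M. */
#define SKETCH_RSS_SIZE_S	24
#define SKETCH_RSS_SIZE_M	SKETCH_GENMASK(31, 24)

/* Mask a field out of a 32-bit word and shift it down to bit 0. */
static uint32_t get_field(uint32_t word, uint32_t mask, unsigned int shift)
{
	return (word & mask) >> shift;
}
```

With param[3] = 0x10ABCDEF, bits 31:24 yield 0x10, so under these
assumptions rss_size_max would be read as 16.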



[PATCH V2 net-next 04/17] net: hns3: Free the ring_data structrue when change tqps

2017-12-18 Thread Lipeng
This patch fixes a memory leak in the tqps change process, where
the functions hns3_uninit_all_ring and hns3_init_all_ring
may be called many times.

Signed-off-by: qumingguang 
Signed-off-by: Lipeng 
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index be43d09..b7fe980 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -2800,8 +2800,12 @@ int hns3_uninit_all_ring(struct hns3_nic_priv *priv)
h->ae_algo->ops->reset_queue(h, i);
 
hns3_fini_ring(priv->ring_data[i].ring);
+   devm_kfree(priv->dev, priv->ring_data[i].ring);
hns3_fini_ring(priv->ring_data[i + h->kinfo.num_tqps].ring);
+   devm_kfree(priv->dev,
+  priv->ring_data[i + h->kinfo.num_tqps].ring);
}
+   devm_kfree(priv->dev, priv->ring_data);
 
return 0;
 }
-- 
1.9.1



[PATCH V2 net-next 02/17] net: hns3: add support to modify tqps number

2017-12-18 Thread Lipeng
This patch adds support for changing the tqps number of the PF driver
by using the ethtool -L command.

Signed-off-by: qumingguang 
Signed-off-by: Lipeng 
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.h|   3 +
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c| 122 +
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h|   2 +
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c |   1 +
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 111 +++
 5 files changed, 239 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h 
b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index d887721..a5d3d22 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -388,6 +388,9 @@ struct hnae3_ae_ops {
enum hnae3_reset_type reset);
void (*get_channels)(struct hnae3_handle *handle,
 struct ethtool_channels *ch);
+   void (*get_tqps_and_rss_info)(struct hnae3_handle *h,
+ u16 *free_tqps, u16 *max_rss_size);
+   int (*set_channels)(struct hnae3_handle *handle, u32 new_tqps_num);
 };
 
 struct hnae3_dcb_ops {
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index c2c1323..be43d09 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -2651,6 +2651,19 @@ static int hns3_get_ring_config(struct hns3_nic_priv 
*priv)
return ret;
 }
 
+static void hns3_put_ring_config(struct hns3_nic_priv *priv)
+{
+   struct hnae3_handle *h = priv->ae_handle;
+   u16 i;
+
+   for (i = 0; i < h->kinfo.num_tqps; i++) {
+   devm_kfree(priv->dev, priv->ring_data[i].ring);
+   devm_kfree(priv->dev,
+  priv->ring_data[i + h->kinfo.num_tqps].ring);
+   }
+   devm_kfree(priv->dev, priv->ring_data);
+}
+
 static int hns3_alloc_ring_memory(struct hns3_enet_ring *ring)
 {
int ret;
@@ -3162,6 +3175,115 @@ static int hns3_reset_notify(struct hnae3_handle 
*handle,
return ret;
 }
 
+static u16 hns3_get_max_available_channels(struct net_device *netdev)
+{
+   struct hnae3_handle *h = hns3_get_handle(netdev);
+   u16 free_tqps, max_rss_size, max_tqps;
+
+   h->ae_algo->ops->get_tqps_and_rss_info(h, &free_tqps, &max_rss_size);
+   max_tqps = h->kinfo.num_tc * max_rss_size;
+
+   return min_t(u16, max_tqps, (free_tqps + h->kinfo.num_tqps));
+}
+
+static int hns3_modify_tqp_num(struct net_device *netdev, u16 new_tqp_num)
+{
+   struct hns3_nic_priv *priv = netdev_priv(netdev);
+   struct hnae3_handle *h = hns3_get_handle(netdev);
+   int ret;
+
+   ret = h->ae_algo->ops->set_channels(h, new_tqp_num);
+   if (ret)
+   return ret;
+
+   ret = hns3_get_ring_config(priv);
+   if (ret)
+   return ret;
+
+   ret = hns3_nic_init_vector_data(priv);
+   if (ret)
+   goto err_uninit_vector;
+
+   ret = hns3_init_all_ring(priv);
+   if (ret)
+   goto err_put_ring;
+
+   return 0;
+
+err_put_ring:
+   hns3_put_ring_config(priv);
+err_uninit_vector:
+   hns3_nic_uninit_vector_data(priv);
+   return ret;
+}
+
+static int hns3_adjust_tqps_num(u8 num_tc, u32 new_tqp_num)
+{
+   return (new_tqp_num / num_tc) * num_tc;
+}
+
+int hns3_set_channels(struct net_device *netdev,
+ struct ethtool_channels *ch)
+{
+   struct hns3_nic_priv *priv = netdev_priv(netdev);
+   struct hnae3_handle *h = hns3_get_handle(netdev);
+   struct hnae3_knic_private_info *kinfo = &h->kinfo;
+   bool if_running = netif_running(netdev);
+   u32 new_tqp_num = ch->combined_count;
+   u16 org_tqp_num;
+   int ret;
+
+   if (ch->rx_count || ch->tx_count)
+   return -EINVAL;
+
+   if (new_tqp_num > hns3_get_max_available_channels(netdev) ||
+   new_tqp_num < kinfo->num_tc) {
+   dev_err(&netdev->dev,
+   "Change tqps fail, the tqp range is from %d to %d",
+   kinfo->num_tc,
+   hns3_get_max_available_channels(netdev));
+   return -EINVAL;
+   }
+
+   new_tqp_num = hns3_adjust_tqps_num(kinfo->num_tc, new_tqp_num);
+   if (kinfo->num_tqps == new_tqp_num)
+   return 0;
+
+   if (if_running)
+   dev_close(netdev);
+
+   hns3_clear_all_ring(h);
+
+   ret = hns3_nic_uninit_vector_data(priv);
+   if (ret) {
+   dev_err(&netdev->dev,
+   "Unbind vector with tqp fail, nothing is changed");
+   goto open_netdev;
+   }
+
+   hns3_uninit_all_ring(priv);
+
+   org_tqp_num = h->kinfo.num_tqps;
+   ret = hns3_modify_tqp_num(netdev, new_tqp_num);
+   if (ret) {
+   ret = 
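
The range check and rounding that hns3_set_channels() applies to a requested channel count can be sketched in user space roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the arithmetic is done in 32 bits and clamped (the patch itself uses u16 with min_t()), and -1 stands in for -EINVAL.

```c
#include <stdint.h>

/* Upper bound on channels: limited by RSS capacity (num_tc * max_rss_size)
 * and by the queues available (free ones plus those already owned). */
uint16_t max_available_channels(uint8_t num_tc, uint16_t max_rss_size,
                                uint16_t free_tqps, uint16_t cur_tqps)
{
    uint32_t by_rss = (uint32_t)num_tc * max_rss_size;
    uint32_t by_pool = (uint32_t)free_tqps + cur_tqps;
    uint32_t m = by_rss < by_pool ? by_rss : by_pool;

    return m > UINT16_MAX ? UINT16_MAX : (uint16_t)m;
}

/* Round a requested count down to a whole multiple of num_tc. */
uint32_t adjust_tqps_num(uint8_t num_tc, uint32_t requested)
{
    return (requested / num_tc) * num_tc;
}

/* Return the adjusted count, or -1 if the request is out of range. */
int32_t validate_channels(uint8_t num_tc, uint16_t max_channels,
                          uint32_t requested)
{
    if (num_tc == 0 || requested > max_channels || requested < num_tc)
        return -1;
    return (int32_t)adjust_tqps_num(num_tc, requested);
}
```

With 4 traffic classes and an upper bound of 18, a request for 18 channels would be accepted and rounded down to 16, so every traffic class gets the same number of queues.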

[PATCH V2 net-next 13/17] net: hns3: add support to update flow control settings after autoneg

2017-12-18 Thread Lipeng
When auto-negotiation is enabled, the MAC flow control settings are
based on the flow control negotiation result, and they should be configured
after a valid link has been established. This patch adds support for updating
flow control settings after auto-negotiation has completed.

Signed-off-by: Fuyun Liang 
Signed-off-by: Lipeng 
---
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 36 ++
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h|  1 +
 .../ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c|  4 +++
 3 files changed, 41 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index fbe5dee..f5465a8 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -4707,6 +4707,42 @@ static int hclge_cfg_pauseparam(struct hclge_dev *hdev, 
u32 rx_en, u32 tx_en)
return 0;
 }
 
+int hclge_cfg_flowctrl(struct hclge_dev *hdev)
+{
+   struct phy_device *phydev = hdev->hw.mac.phydev;
+   u16 local_advertising = 0;
+   u16 remote_advertising = 0;
+   u32 rx_pause, tx_pause;
+   u8 flowctl;
+
+   if (!phydev->link || !phydev->autoneg)
+   return 0;
+
+   if (phydev->advertising & ADVERTISED_Pause)
+   local_advertising = ADVERTISE_PAUSE_CAP;
+
+   if (phydev->advertising & ADVERTISED_Asym_Pause)
+   local_advertising |= ADVERTISE_PAUSE_ASYM;
+
+   if (phydev->pause)
+   remote_advertising = LPA_PAUSE_CAP;
+
+   if (phydev->asym_pause)
+   remote_advertising |= LPA_PAUSE_ASYM;
+
+   flowctl = mii_resolve_flowctrl_fdx(local_advertising,
+  remote_advertising);
+   tx_pause = flowctl & FLOW_CTRL_TX;
+   rx_pause = flowctl & FLOW_CTRL_RX;
+
+   if (phydev->duplex == HCLGE_MAC_HALF) {
+   tx_pause = 0;
+   rx_pause = 0;
+   }
+
+   return hclge_cfg_pauseparam(hdev, rx_pause, tx_pause);
+}
+
 static void hclge_get_pauseparam(struct hnae3_handle *handle, u32 *auto_neg,
 u32 *rx_en, u32 *tx_en)
 {
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
index cda520c..28cc063 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
@@ -602,4 +602,5 @@ int hclge_set_vf_vlan_common(struct hclge_dev *vport, int 
vfid,
 
 void hclge_mbx_handler(struct hclge_dev *hdev);
 void hclge_reset_tqp(struct hnae3_handle *handle, u16 queue_id);
+int hclge_cfg_flowctrl(struct hclge_dev *hdev);
 #endif
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c
index 7069e94..3745153 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c
@@ -183,6 +183,10 @@ static void hclge_mac_adjust_link(struct net_device 
*netdev)
ret = hclge_cfg_mac_speed_dup(hdev, speed, duplex);
if (ret)
netdev_err(netdev, "failed to adjust link.\n");
+
+   ret = hclge_cfg_flowctrl(hdev);
+   if (ret)
+   netdev_err(netdev, "failed to configure flow control.\n");
 }
 
 int hclge_mac_start_phy(struct hclge_dev *hdev)
-- 
1.9.1
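
The pause resolution that hclge_cfg_flowctrl() delegates to mii_resolve_flowctrl_fdx() can be illustrated with a small stand-alone sketch. The macro and function names below are hypothetical stand-ins, and the bit values are assumed to match the usual MII pause advertisement bits; the logic follows the commonly described full-duplex resolution (symmetric pause wins, otherwise asymmetric pause picks one direction).

```c
#include <stdbool.h>
#include <stdint.h>

#define ADV_PAUSE_CAP  0x0400u  /* symmetric pause advertised (assumed value) */
#define ADV_PAUSE_ASYM 0x0800u  /* asymmetric pause advertised (assumed value) */
#define FC_TX          0x01u    /* we may send pause frames */
#define FC_RX          0x02u    /* we honour received pause frames */

/* Resolve flow control from local and link-partner advertisements. */
uint8_t resolve_flowctrl_fdx(uint16_t lcladv, uint16_t rmtadv)
{
    uint8_t cap = 0;

    if (lcladv & rmtadv & ADV_PAUSE_CAP) {
        cap = FC_TX | FC_RX;
    } else if (lcladv & rmtadv & ADV_PAUSE_ASYM) {
        if (lcladv & ADV_PAUSE_CAP)
            cap = FC_RX;
        else if (rmtadv & ADV_PAUSE_CAP)
            cap = FC_TX;
    }
    return cap;
}

/* Pause frames are only meaningful on full-duplex links. */
uint8_t resolve_pause(uint16_t lcladv, uint16_t rmtadv, bool full_duplex)
{
    if (!full_duplex)
        return 0;
    return resolve_flowctrl_fdx(lcladv, rmtadv);
}
```

The half-duplex branch corresponds to the patch forcing tx_pause and rx_pause to 0 when phydev->duplex is HCLGE_MAC_HALF.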



[PATCH V2 net-next 00/17] add some features and fix some bugs for HNS3 driver

2017-12-18 Thread Lipeng
This patchset adds support for some new features and fixes some bugs:
[Patch 1/17 - 5/17] add support for modifying/querying the tqp number
through the ethtool -L/-l commands, and also fix some related bugs in
changing the tqp number.
[Patch 6/17 - 9/17] add support for vlan tag offload in the tx and rx
directions for pf, and fix some related bugs.
[Patch 10/17 - 11/17] fix bugs in auto-negotiation.
[Patch 12/17] adds support for the ethtool command set_pauseparam.
[Patch 13/17 - 14/17] add support for updating flow control settings after
autoneg.
[Patch 15/17 - 17/17] fix some other bugs in net-next.

---
Change Log:
V1 -> V2:
1, fix the comments from Sergei Shtylyov.
---

Fuyun Liang (3):
  net: hns3: cleanup mac auto-negotiation state query
  net: hns3: fix for getting auto-negotiation state in hclge_get_autoneg
  net: hns3: add Asym Pause support to phy default features

Lipeng (13):
  net: hns3: add support to query tqps number
  net: hns3: add support to modify tqps number
  net: hns3: change the returned tqp number by ethtool -x
  net: hns3: Free the ring_data structrue when change tqps
  net: hns3: Add a mask initialization for mac_vlan table
  net: hns3: Add vlan offload config command
  net: hns3: Add ethtool related offload command
  net: hns3: Add handling vlan tag offload in bd
  net: hns3: add support for set_pauseparam
  net: hns3: add support to update flow control settings after autoneg
  net: hns3: add support for querying advertised pause frame by ethtool
ethx
  net: hns3: Increase the default depth of bucket for TM shaper
  net: hns3: change TM sched mode to TC-based mode when SRIOV enabled

qumingguang (1):
  net: hns3: Get rss_size_max from configuration but not hardcode

 drivers/net/ethernet/hisilicon/hns3/hnae3.h|  10 +
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c| 225 -
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h|   2 +
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c |  41 +-
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h |  57 +++
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 513 +++--
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h|  38 ++
 .../ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c|   5 +
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c  |   6 +-
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.h  |   1 +
 10 files changed, 854 insertions(+), 44 deletions(-)

-- 
1.9.1



[PATCH V2 net-next 03/17] net: hns3: change the returned tqp number by ethtool -x

2017-12-18 Thread Lipeng
This patch modifies the data returned by get_rxnfc: it now returns
the current handle's rss_size rather than the total tqp number,
because tc_size has been changed to the log2 of rss_size rounded up
to a power of two.

Signed-off-by: qumingguang 
Signed-off-by: Lipeng 
---
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
index 1b2d79b..2fd2656 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c
@@ -730,7 +730,7 @@ static int hns3_get_rxnfc(struct net_device *netdev,
 
switch (cmd->cmd) {
case ETHTOOL_GRXRINGS:
-   cmd->data = h->kinfo.num_tc * h->kinfo.rss_size;
+   cmd->data = h->kinfo.rss_size;
break;
case ETHTOOL_GRXFH:
return h->ae_algo->ops->get_rss_tuple(h, cmd);
-- 
1.9.1
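
The relationship the commit message describes between rss_size and tc_size (log2 of rss_size rounded up to a power of two) can be sketched as below. The helper name is hypothetical and this is not the driver's code, only the arithmetic.

```c
#include <stdint.h>

/* Smallest exponent e such that (1 << e) >= rss_size, i.e. the log2 of
 * rss_size rounded up to a power of two. Returns 0 for rss_size 0 or 1. */
unsigned int tc_size_log2(uint32_t rss_size)
{
    unsigned int e = 0;

    while (e < 31 && (UINT32_C(1) << e) < rss_size)
        e++;
    return e;
}
```

For example, an rss_size of 5 rounds up to 8, giving a tc_size of 3, while an rss_size of 16 is already a power of two and gives 4.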



Re: [trivial PATCH] treewide: Align function definition open/close braces

2017-12-18 Thread Martin K. Petersen

Joe,

> Some functions definitions have either the initial open brace and/or
> the closing brace outside of column 1.
>
> Move those braces to column 1.

SCSI bits look OK.

Acked-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


[PATCH v2 2/3] vsprintf: print if symbol not found

2017-12-18 Thread Tobin C. Harding
Depends on: commit 40eee173a35e ("kallsyms: don't leak address when
symbol not found")

Currently vsprintf for specifiers %p[SsB] relies on the behaviour of
kallsyms (sprint_symbol()) and prints the actual address if a symbol is
not found. Previous patch changes this behaviour so that sprint_symbol()
returns an error if symbol not found. With this patch in place we can
print a sanitized message '' instead of leaking the
address.

Print '' for printk specifier %p[sSB] if symbol look
up fails.

Signed-off-by: Tobin C. Harding 
---
 lib/vsprintf.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 01c3957b2de6..820ed4fe6e6c 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -674,6 +674,8 @@ char *symbol_string(char *buf, char *end, void *ptr,
unsigned long value;
 #ifdef CONFIG_KALLSYMS
char sym[KSYM_SYMBOL_LEN];
+   const char *sym_not_found = "";
+   int ret;
 #endif
 
if (fmt[1] == 'R')
@@ -682,11 +684,14 @@ char *symbol_string(char *buf, char *end, void *ptr,
 
 #ifdef CONFIG_KALLSYMS
if (*fmt == 'B')
-   sprint_backtrace(sym, value);
+   ret = sprint_backtrace(sym, value);
else if (*fmt != 'f' && *fmt != 's')
-   sprint_symbol(sym, value);
+   ret = sprint_symbol(sym, value);
else
-   sprint_symbol_no_offset(sym, value);
+   ret = sprint_symbol_no_offset(sym, value);
+
+   if (ret == -1)
+   strcpy(sym, sym_not_found);
 
return string(buf, end, sym, spec);
 #else
-- 
2.7.4



[PATCH v2 1/3] kallsyms: don't leak address when symbol not found

2017-12-18 Thread Tobin C. Harding
Currently if kallsyms_lookup() fails to find the symbol then the address
is printed. This potentially leaks sensitive information but is useful
for debugging. We would like to stop the leak but keep the current
behaviour when needed for debugging. To achieve this we can add a
command-line parameter that if enabled maintains the current
behaviour. If the command-line parameter is not enabled we can return an
error instead of printing the address giving the calling code the option
of how to handle the look up failure.

Add command-line parameter 'insecure_print_all_symbols'. If parameter is
not enabled return an error value instead of printing the raw address.

Signed-off-by: Tobin C. Harding 
---
 kernel/kallsyms.c | 31 +--
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
index d5fa4116688a..2707cf751437 100644
--- a/kernel/kallsyms.c
+++ b/kernel/kallsyms.c
@@ -383,6 +383,16 @@ int lookup_symbol_attrs(unsigned long addr, unsigned long 
*size,
return lookup_module_symbol_attrs(addr, size, offset, modname, name);
 }
 
+/* Enables printing of raw address when symbol look up fails */
+static bool insecure_print_all_symbols;
+
+static int __init enable_insecure_print_all_symbols(char *unused)
+{
+   insecure_print_all_symbols = true;
+   return 0;
+}
+early_param("insecure_print_all_symbols", enable_insecure_print_all_symbols);
+
 /* Look up a kernel symbol and return it in a text buffer. */
 static int __sprint_symbol(char *buffer, unsigned long address,
   int symbol_offset, int add_offset)
@@ -394,8 +404,15 @@ static int __sprint_symbol(char *buffer, unsigned long 
address,
 
address += symbol_offset;
name = kallsyms_lookup(address, &size, &offset, &modname, buffer);
-   if (!name)
-   return sprintf(buffer, "0x%lx", address - symbol_offset);
+   if (insecure_print_all_symbols) {
+   if (!name)
+   return sprintf(buffer, "0x%lx", address - 
symbol_offset);
+   } else {
+   if (!name) {
+   buffer[0] = '\0';
+   return -1;
+   }
+   }
 
if (name != buffer)
strcpy(buffer, name);
@@ -417,8 +434,9 @@ static int __sprint_symbol(char *buffer, unsigned long 
address,
  * @address: address to lookup
  *
  * This function looks up a kernel symbol with @address and stores its name,
- * offset, size and module name to @buffer if possible. If no symbol was found,
- * just saves its @address as is.
+ * offset, size and module name to @buffer if possible. If no symbol was found
+ * returns -1 unless kernel command-line parameter 'insecure_print_all_symbols'
+ * is enabled, in which case saves @address as is to buffer.
  *
  * This function returns the number of bytes stored in @buffer.
  */
@@ -434,8 +452,9 @@ EXPORT_SYMBOL_GPL(sprint_symbol);
  * @address: address to lookup
  *
  * This function looks up a kernel symbol with @address and stores its name
- * and module name to @buffer if possible. If no symbol was found, just saves
- * its @address as is.
+ * and module name to @buffer if possible. If no symbol was found, returns -1
+ * unless kernel command-line parameter 'insecure_print_all_symbols' is 
enabled,
+ * in which case saves @address as is to buffer.
  *
  * This function returns the number of bytes stored in @buffer.
  */
-- 
2.7.4
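
The behaviour this patch gives __sprint_symbol() can be modelled in user space with a toy symbol table. Everything below is a hypothetical sketch: the table, the symbol names, and the print_raw_on_miss flag (standing in for the 'insecure_print_all_symbols' parameter) are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct demo_sym {
    unsigned long addr;
    const char *name;
};

/* Toy symbol table standing in for kallsyms. */
static const struct demo_sym demo_table[] = {
    { 0x1000UL, "demo_init" },
    { 0x2000UL, "demo_exit" },
};

/* Stand-in for the command-line parameter; false by default. */
bool print_raw_on_miss;

/* Write the symbol name for addr into buf and return its length.
 * On a failed lookup, either print the raw address (insecure mode)
 * or leave buf empty and return -1 so the caller decides what to show. */
int sketch_sprint_symbol(char *buf, size_t len, unsigned long addr)
{
    size_t i;

    for (i = 0; i < sizeof(demo_table) / sizeof(demo_table[0]); i++) {
        if (demo_table[i].addr == addr)
            return snprintf(buf, len, "%s", demo_table[i].name);
    }

    if (print_raw_on_miss)
        return snprintf(buf, len, "0x%lx", addr);

    if (len)
        buf[0] = '\0';
    return -1;
}
```

Returning an error instead of a formatted address pushes the policy decision to each caller, which is what allows vsprintf and ftrace to handle the failure differently.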



[PATCH v2 3/3] trace: print address if symbol not found

2017-12-18 Thread Tobin C. Harding
Fixes behaviour modified by: commit 40eee173a35e ("kallsyms: don't leak
address when symbol not found")

Previous patch changed behaviour of kallsyms function sprint_symbol() to
return an error code instead of printing the address if a symbol was not
found. Ftrace relies on the original behaviour. We should not break
tracing when applying the previous patch. We can maintain the original
behaviour by checking the return code on calls to sprint_symbol() and
friends.

Check return code and print actual address on error (i.e symbol not
found).

Signed-off-by: Tobin C. Harding 
---
 kernel/trace/trace.h | 24 
 kernel/trace/trace_events_hist.c |  6 +++---
 kernel/trace/trace_output.c  |  2 +-
 3 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 2a6d0325a761..881b1a577d75 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -1814,4 +1814,28 @@ static inline void trace_event_eval_update(struct 
trace_eval_map **map, int len)
 
 extern struct trace_iterator *tracepoint_print_iter;
 
+static inline int
+trace_sprint_symbol(char *buffer, unsigned long address)
+{
+   int ret;
+
+   ret = sprint_symbol(buffer, address);
+   if (ret == -1)
+   ret = sprintf(buffer, "0x%lx", address);
+
+   return ret;
+}
+
+static inline int
+trace_sprint_symbol_no_offset(char *buffer, unsigned long address)
+{
+   int ret;
+
+   ret = sprint_symbol_no_offset(buffer, address);
+   if (ret == -1)
+   ret = sprintf(buffer, "0x%lx", address);
+
+   return ret;
+}
+
 #endif /* _LINUX_KERNEL_TRACE_H */
diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c
index 1e1558c99d56..ca523327c058 100644
--- a/kernel/trace/trace_events_hist.c
+++ b/kernel/trace/trace_events_hist.c
@@ -982,7 +982,7 @@ static void hist_trigger_stacktrace_print(struct seq_file 
*m,
return;
 
seq_printf(m, "%*c", 1 + spaces, ' ');
-   sprint_symbol(str, stacktrace_entries[i]);
+   trace_sprint_symbol(str, stacktrace_entries[i]);
seq_printf(m, "%s\n", str);
}
 }
@@ -1014,12 +1014,12 @@ hist_trigger_entry_print(struct seq_file *m,
seq_printf(m, "%s: %llx", field_name, uval);
} else if (key_field->flags & HIST_FIELD_FL_SYM) {
uval = *(u64 *)(key + key_field->offset);
-   sprint_symbol_no_offset(str, uval);
+   trace_sprint_symbol_no_offset(str, uval);
seq_printf(m, "%s: [%llx] %-45s", field_name,
   uval, str);
} else if (key_field->flags & HIST_FIELD_FL_SYM_OFFSET) {
uval = *(u64 *)(key + key_field->offset);
-   sprint_symbol(str, uval);
+   trace_sprint_symbol(str, uval);
seq_printf(m, "%s: [%llx] %-55s", field_name,
   uval, str);
} else if (key_field->flags & HIST_FIELD_FL_EXECNAME) {
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 90db994ac900..f3c3a0a60f72 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -365,7 +365,7 @@ seq_print_sym_offset(struct trace_seq *s, const char *fmt,
 #ifdef CONFIG_KALLSYMS
const char *name;
 
-   sprint_symbol(str, address);
+   trace_sprint_symbol(str, address);
name = kretprobed(str);
 
if (name && strlen(name)) {
-- 
2.7.4
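
The two wrappers added to trace.h differ only in which lookup function they call. One possible way to express the shared pattern is a single helper that takes the lookup as a function pointer; this is a hypothetical user-space sketch with stub lookups, not a proposal for the actual patch.

```c
#include <stddef.h>
#include <stdio.h>

/* Signature of a lookup that returns -1 when no symbol is found. */
typedef int (*sym_lookup_fn)(char *buf, size_t len, unsigned long addr);

/* Stub lookup that always succeeds (hypothetical symbol name). */
int stub_found(char *buf, size_t len, unsigned long addr)
{
    (void)addr;
    return snprintf(buf, len, "demo_sym");
}

/* Stub lookup that always fails, as if the symbol were unknown. */
int stub_missing(char *buf, size_t len, unsigned long addr)
{
    (void)addr;
    if (len)
        buf[0] = '\0';
    return -1;
}

/* Call the lookup; on failure fall back to printing the raw address,
 * which is the pre-patch behaviour that tracing output relies on. */
int sprint_with_addr_fallback(sym_lookup_fn lookup, char *buf, size_t len,
                              unsigned long addr)
{
    int ret = lookup(buf, len, addr);

    if (ret == -1)
        ret = snprintf(buf, len, "0x%lx", addr);
    return ret;
}
```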



[PATCH v2 0/3] kallsyms: don't leak address

2017-12-18 Thread Tobin C. Harding
This set plugs a kernel address leak that occurs if kallsyms symbol
look up fails. This set was prompted by a leaking address found using
scripts/leaking_addresses.pl on a PowerPC machine in the wild.

$ perl scripts/leaking_addresses.pl [address sanitized]
...
/proc/8025/task/8025/stack: [<>] 0xc001

$ uname -r
4.4.0-79-powerpc64-smp


Patch set does not change behaviour when KALLSYMS is not defined
(suggested by Linus).

Comments on version 1 indicated that current behaviour may be useful for
debugging. This version adds a kernel command-line parameter in order to
be able to preserve current behaviour (print raw address if kallsyms
symbol look up fails). (Command-line parameter suggested by Steve.)

New command-line parameter is documented only in the kernel-doc for
kallsyms functions sprint_symbol() and sprint_symbol_no_offset(). Is
this sufficient? Perhaps an entry in printk-formats.txt also?

Patch 1 - return error code if symbol look up fails unless new
  command-line parameter 'insecure_print_all_symbols' is enabled.
Patch 2 - print  to buffer if symbol look up returns
  an error.
Patch 3 - maintain current behaviour in ftrace.

thanks,
Tobin.

v2:
 - Add kernel command-line parameter.
 - Remove unnecessary function.
 - Fix broken ftrace code (and actually build and test ftrace code).

Patch 1 and 2 tested. Patch 3 (ftrace) tested but not all code paths
executed (discussed with Steve in another thread).

Tobin C. Harding (3):
  kallsyms: don't leak address when symbol not found
  vsprintf: print  if symbol not found
  trace: print address if symbol not found

 kernel/kallsyms.c| 31 +--
 kernel/trace/trace.h | 24 
 kernel/trace/trace_events_hist.c |  6 +++---
 kernel/trace/trace_output.c  |  2 +-
 lib/vsprintf.c   | 11 ---
 5 files changed, 61 insertions(+), 13 deletions(-)

-- 
2.7.4



Re: [PATCH v7 2/3] sock: Move the socket inuse to namespace.

2017-12-18 Thread David Miller
From: Cong Wang 
Date: Mon, 18 Dec 2017 13:38:39 -0800

> On Mon, Dec 18, 2017 at 11:30 AM, David Miller  wrote:
>> From: Tonghao Zhang 
>> Date: Thu, 14 Dec 2017 05:51:58 -0800
>>
>>> In some case, we want to know how many sockets are in use in
>>> different _net_ namespaces. It's a key resource metric.
>>
>> Useful or not, you're not exporting this value.
>>
>> All this patch series does is convert the existing export of the
>> global tally to add up the per-net values.
>>
>> So if you're not exporting the per-net value on its own in any way,
>> this patch series isn't achieving the stated goal.
>>
>> I'm not applying this series, sorry.
> 
> 
> This value is already exported via procfs:
> sockstat_seq_show() -> socket_seq_show().
> 
> And the proc file itself should already be per-net:
> 
> static int sockstat_seq_open(struct inode *inode, struct file *file)
> {
> return single_open_net(inode, file, sockstat_seq_show);
> }
> 
> 
> This patch just makes that value to be per-net too.

You're right, my bad.

I'll keep reviewing this.


Re: [PATCH 3/3] trace: print address if symbol not found

2017-12-18 Thread Tobin C. Harding
On Tue, Dec 19, 2017 at 02:00:11PM +1100, Tobin C. Harding wrote:
> On Mon, Dec 18, 2017 at 06:51:43PM -0500, Steven Rostedt wrote:
> > On Tue, 19 Dec 2017 08:16:14 +1100
> > "Tobin C. Harding"  wrote:
> > 
> > > > >  #endif /* _LINUX_KERNEL_TRACE_H */
> > > > > diff --git a/kernel/trace/trace_events_hist.c 
> > > > > b/kernel/trace/trace_events_hist.c
> > > > > index 1e1558c99d56..3e28522a76f4 100644
> > > > > --- a/kernel/trace/trace_events_hist.c
> > > > > +++ b/kernel/trace/trace_events_hist.c
> > > > > @@ -982,7 +982,7 @@ static void hist_trigger_stacktrace_print(struct 
> > > > > seq_file *m,
> > > > >   return;
> > > > >  
> > > > >   seq_printf(m, "%*c", 1 + spaces, ' ');
> > > > > - sprint_symbol(str, stacktrace_entries[i]);
> > > > > + trace_sprint_symbol_addr(str, stacktrace_entries[i]);  
> > > > 
> > 
> > > 
> > > If you have the time to give me some brief pointers on how I should go
> > > about testing this I'd love to test it before the next version. I know
> > > very little about ftrace.
> > 
> > For hitting the histogram stacktrace trigger (this code path), make
> > sure you have CONFIG_HIST_TRIGGERS enabled. And then do:
> > 
> >  # cd /sys/kernel/debug/tracing
> >  # echo 'hist:keys=common_pid.execname,stacktrace:vals=prev_state' > \
> >  events/sched/sched_switch/trigger
> >  # cat events/sched/sched_switch/hist
> > 
> > For the "sym" part, you can do (from the same directory):
> > 
> >  # echo 'hist:keys=call_site.sym:vals=bytes_req' > \
> >  events/kmem/kmalloc/trigger
> >  # cat events/kmem/kmalloc/hist
> > 
> > 
> > And for sym-offset:
> > 
> >  # echo 'hist:keys=call_site.sym-offset:vals=bytes_req' > \
> > events/kmem/kmalloc/trigger
> >  # cat events/kmem/kmalloc/hist
> 
> I ran through these as outlined here for the new version (v4). This hits

Should have been:

v2

thanks,
Tobin.


Re: [PATCH 3/3] trace: print address if symbol not found

2017-12-18 Thread Tobin C. Harding
On Mon, Dec 18, 2017 at 06:51:43PM -0500, Steven Rostedt wrote:
> On Tue, 19 Dec 2017 08:16:14 +1100
> "Tobin C. Harding"  wrote:
> 
> > > >  #endif /* _LINUX_KERNEL_TRACE_H */
> > > > diff --git a/kernel/trace/trace_events_hist.c 
> > > > b/kernel/trace/trace_events_hist.c
> > > > index 1e1558c99d56..3e28522a76f4 100644
> > > > --- a/kernel/trace/trace_events_hist.c
> > > > +++ b/kernel/trace/trace_events_hist.c
> > > > @@ -982,7 +982,7 @@ static void hist_trigger_stacktrace_print(struct 
> > > > seq_file *m,
> > > > return;
> > > >  
> > > > seq_printf(m, "%*c", 1 + spaces, ' ');
> > > > -   sprint_symbol(str, stacktrace_entries[i]);
> > > > +   trace_sprint_symbol_addr(str, stacktrace_entries[i]);  
> > > 
> 
> > 
> > If you have the time to give me some brief pointers on how I should go
> > about testing this I'd love to test it before the next version. I know
> > very little about ftrace.
> 
> For hitting the histogram stacktrace trigger (this code path), make
> sure you have CONFIG_HIST_TRIGGERS enabled. And then do:
> 
>  # cd /sys/kernel/debug/tracing
>  # echo 'hist:keys=common_pid.execname,stacktrace:vals=prev_state' > \
>  events/sched/sched_switch/trigger
>  # cat events/sched/sched_switch/hist
> 
> For the "sym" part, you can do (from the same directory):
> 
>  # echo 'hist:keys=call_site.sym:vals=bytes_req' > \
>  events/kmem/kmalloc/trigger
>  # cat events/kmem/kmalloc/hist
> 
> 
> And for sym-offset:
> 
>  # echo 'hist:keys=call_site.sym-offset:vals=bytes_req' > \
> events/kmem/kmalloc/trigger
>  # cat events/kmem/kmalloc/hist

I ran through these as outlined here for the new version (v4). This hits
the modified code but doesn't test symbol look up failure.

I also configured kernel with 'Perform a startup test on ftrace' for
good luck.

Are you happy with this level of testing?

thanks,
Tobin.


[PATCH net] net: always reevalulate autoflowlabel setting for reset packet

2017-12-18 Thread Shaohua Li
From: Shaohua Li 

ipv6_pinfo.autoflowlabel is set at sock creation. If we later change
sysctl.ipv6.auto_flowlabels, ipv6_pinfo.autoflowlabel isn't changed,
so the sock keeps the old behavior in terms of auto flowlabel. Reset
packets suffer from this problem, because a reset packet is sent
from a special control socket, which is created at boot time. Since
sysctl.ipv6.auto_flowlabels is 1 by default, the control socket will
always have its ipv6_pinfo.autoflowlabel set, even after the user sets
sysctl.ipv6.auto_flowlabels to 0, so reset packets will always have
a flowlabel.

To fix this, we always reevaluate autoflowlabel setting for reset
packet. Normal sock has the same issue too, but since the
sysctl.ipv6.auto_flowlabels is usually set at host startup, this isn't a
big issue for normal sock.

Cc: Martin KaFai Lau 
Signed-off-by: Shaohua Li 
---
 net/ipv6/tcp_ipv6.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 7178476..fc35233 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -789,7 +789,9 @@ static void tcp_v6_send_response(const struct sock *sk, 
struct sk_buff *skb, u32
unsigned int tot_len = sizeof(struct tcphdr);
struct dst_entry *dst;
__be32 *topt;
+   struct ipv6_pinfo *np = inet6_sk(ctl_sk);
 
+   np->autoflowlabel = ip6_default_np_autolabel(net);
if (tsecr)
tot_len += TCPOLEN_TSTAMP_ALIGNED;
 #ifdef CONFIG_TCP_MD5SIG
-- 
2.9.5
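
The stale-setting problem described in this patch can be reproduced in a toy model: a per-socket flag is computed once from the sysctl at creation time and never refreshed. The names below are hypothetical, and the mapping from sysctl value to per-socket default (0 and 2 give off, 1 and 3 give on) is an assumption based on how the sysctl modes are usually documented.

```c
#include <stdbool.h>

enum { FL_OFF = 0, FL_OPTOUT = 1, FL_OPTIN = 2, FL_FORCED = 3 };

struct demo_sock {
    int autoflowlabel;  /* cached per-socket default */
};

/* Per-socket default derived from the sysctl mode (assumed mapping). */
int default_np_autolabel(int sysctl_mode)
{
    switch (sysctl_mode) {
    case FL_OPTOUT:
    case FL_FORCED:
        return 1;
    case FL_OFF:
    case FL_OPTIN:
    default:
        return 0;
    }
}

/* Socket creation caches the default once. */
void demo_sock_init(struct demo_sock *sk, int sysctl_mode)
{
    sk->autoflowlabel = default_np_autolabel(sysctl_mode);
}

/* Decide whether a reset sent from the control socket carries a flow
 * label. With reevaluate set, the cached value is refreshed first. */
int reset_uses_flowlabel(struct demo_sock *ctl, int sysctl_now, bool reevaluate)
{
    if (reevaluate)
        ctl->autoflowlabel = default_np_autolabel(sysctl_now);
    return ctl->autoflowlabel;
}
```

A control socket created while labels are enabled keeps reporting 1 after the sysctl is switched off unless the value is reevaluated on each send, which is the one-line change the patch makes.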



Re: [Patch net-next] net_sched: properly check for empty skb array on error path

2017-12-18 Thread Cong Wang
On Mon, Dec 18, 2017 at 5:25 PM, John Fastabend
 wrote:
> On 12/18/2017 02:34 PM, Cong Wang wrote:
>> First, the check of &q->ring.queue against NULL is wrong, it
>> is always false. We should check the value rather than the address.
>>
>
> Thanks.
>
>> Secondly, we need the same check in pfifo_fast_reset() too,
>> as both ->reset() and ->destroy() are called in qdisc_destroy().
>>
>
> not that it hurts to have the check here, but if init fails
> in qdisc_create it seems only ->destroy() is called without
> a ->reset().
>
> Is there another path for init() to fail that I'm missing?

Pretty sure ->reset() is called in qdisc_destroy() and also before
->destroy():


void qdisc_destroy(struct Qdisc *qdisc)
{
const struct Qdisc_ops  *ops = qdisc->ops;
struct sk_buff *skb, *tmp;

if (qdisc->flags & TCQ_F_BUILTIN ||
!refcount_dec_and_test(&qdisc->refcnt))
return;

#ifdef CONFIG_NET_SCHED
qdisc_hash_del(qdisc);

qdisc_put_stab(rtnl_dereference(qdisc->stab));
#endif
gen_kill_estimator(&qdisc->rate_est);
if (ops->reset)
ops->reset(qdisc);
if (ops->destroy)
ops->destroy(qdisc);
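
The first point of the patch description (checking the address of a member instead of its value) can be shown with a minimal stand-alone sketch. The struct and function names are hypothetical and only mimic the shape of an skb array with an embedded ring.

```c
#include <stddef.h>

struct demo_ring {
    void **queue;   /* NULL until the ring has been allocated */
};

struct demo_array {
    struct demo_ring ring;
};

/* Wrong check: the address of a member of a valid object is never NULL,
 * so this returns 0 even when the queue was never allocated. */
int queue_address_is_null(const struct demo_array *a)
{
    const void *addr = &a->ring.queue;

    return addr == NULL;
}

/* Right check: look at the pointer value stored in the member. */
int queue_value_is_null(const struct demo_array *a)
{
    return a->ring.queue == NULL;
}
```

Only the value check detects an array whose init failed before the queue was allocated.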


RE: [Patch v2] net: phy: marvell: Limit 88m1101 autoneg errata to 88E1145 as well.

2017-12-18 Thread Qiang Zhao
 From: David Miller 
 Date: Tue, 19 Dec 2017 2:20AM
> -Original Message-
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Tuesday, December 19, 2017 2:20 AM
> To: Qiang Zhao 
> Cc: netdev@vger.kernel.org
> Subject: Re: [Patch v2] net: phy: marvell: Limit 88m1101 autoneg errata to
> 88E1145 as well.
> 
> From: Zhao Qiang 
> Date: Mon, 18 Dec 2017 10:26:43 +0800
> 
> > 88E1145 also need this autoneg errata.
> >
> > Fixes: f2899788353c ("net: phy: marvell: Limit errata to 88m1101")
> > Signed-off-by: Zhao Qiang 
> > ---
> > Changes for v2
> > - modify the commit msg in a proper way.
> 
> Applied and queued up for -stable.

Thank you!

Best Regards
Qiang Zhao


Re: [PATCH v4 16/36] nds32: System calls handling

2017-12-18 Thread Vincent Chen
2017-12-18 19:19 GMT+08:00 Arnd Bergmann :
> On Mon, Dec 18, 2017 at 7:46 AM, Greentime Hu  wrote:
>
>
>> new file mode 100644
>> index 000..90da745
>> --- /dev/null
>> +++ b/arch/nds32/include/uapi/asm/unistd.h
>> @@ -0,0 +1,12 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +// Copyright (C) 2005-2017 Andes Technology Corporation
>> +
>> +#define __ARCH_WANT_SYNC_FILE_RANGE2
>> +
>> +/* Use the standard ABI for syscalls */
>> +#include 
>> +
>> +/* Additional NDS32 specific syscalls. */
>> +#define __NR_cacheflush (__NR_arch_specific_syscall)
>> +#define __NR__llseek __NR_llseek
>> +__SYSCALL(__NR_cacheflush, sys_cacheflush)
>
> I'm still confused by __NR__llseek here, why do you need that one?
>

Dear Arnd:
We hoped to solve the ABI register alignment problem for llseek in glibc
with __NR__llseek.
After checking glibc again, I found that glibc has the same __NR__llseek macro
and it is better to solve this problem there.
So, I will remove this definition in the next version of the patch.


>> +SYSCALL_DEFINE6(mmap2, unsigned long, addr, unsigned long, len,
>> +  unsigned long, prot, unsigned long, flags,
>> +  unsigned long, fd, unsigned long, pgoff)
>> +{
>> +   if (pgoff & (~PAGE_MASK >> 12))
>> +   return -EINVAL;
>> +
>> +   return sys_mmap_pgoff(addr, len, prot, flags, fd,
>> + pgoff >> (PAGE_SHIFT - 12));
>> +}
>> +
>> +SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
>> +  unsigned long, prot, unsigned long, flags,
>> +  unsigned long, fd, unsigned long, pgoff)
>> +{
>> +   if (unlikely(pgoff & ~PAGE_MASK))
>> +   return -EINVAL;
>> +
>> +   return sys_mmap_pgoff(addr, len, prot, flags, fd,
>> + pgoff >> PAGE_SHIFT);
>> +}
>
> And I don't see why you define sys_mmap() in addition to sys_mmap2().
>
This is my mistake. I will remove it in the next version patch.

> The rest of the syscall handling looks good now.
>
>  Arnd


Thanks
Vincent
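
The offset handling in the quoted mmap2 definition (the offset arrives in 4096-byte units and is converted to pages) can be sketched as follows. The function name is hypothetical, the page shift is passed in as a parameter for illustration, and -1 stands in for -EINVAL.

```c
/* Convert an mmap2-style offset, given in 4096-byte units, into a page
 * offset for a system whose page size is (1 << page_shift) bytes.
 * Returns 0 on success, or -1 if the offset is not page aligned. */
int mmap2_units_to_pages(unsigned long pgoff_4k, unsigned int page_shift,
                         unsigned long *pages)
{
    unsigned long page_mask;

    if (page_shift < 12)
        return -1;

    page_mask = ~((1UL << page_shift) - 1);

    /* Low bits of the 4K-unit offset that fall inside one page. */
    if (pgoff_4k & (~page_mask >> 12))
        return -1;

    *pages = pgoff_4k >> (page_shift - 12);
    return 0;
}
```

With 4 KiB pages the conversion is the identity; with 8 KiB pages an odd offset is rejected because it would point into the middle of a page.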


[PATCH net-next v2] cxgb4: RSS table is 4k for T6

2017-12-18 Thread Ganesh Goudar
RSS table is 4k for T6 and later cards, add check for the
same.

Signed-off-by: Ganesh Goudar 
---
v2: Not a series, It is single patch
---
 drivers/net/ethernet/chelsio/cxgb4/cudbg_lib.c |  5 ++--
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h |  1 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c   |  2 +-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c |  7 ++---
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c | 13 +++--
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.h | 31 +++---
 6 files changed, 36 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cudbg_lib.c 
b/drivers/net/ethernet/chelsio/cxgb4/cudbg_lib.c
index d73fb6a..336670d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cudbg_lib.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cudbg_lib.c
@@ -1004,9 +1004,10 @@ int cudbg_collect_rss(struct cudbg_init *pdbg_init,
 {
struct adapter *padap = pdbg_init->adap;
struct cudbg_buffer temp_buff = { 0 };
-   int rc;
+   int rc, nentries;
 
-   rc = cudbg_get_buff(dbg_buff, RSS_NENTRIES * sizeof(u16), &temp_buff);
+   nentries = t4_chip_rss_size(padap);
+   rc = cudbg_get_buff(dbg_buff, nentries * sizeof(u16), &temp_buff);
if (rc)
return rc;
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index b1df2aa..69d0b64 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -1528,6 +1528,7 @@ int t4_init_portinfo(struct port_info *pi, int mbox,
 int port, int pf, int vf, u8 mac[]);
 int t4_port_init(struct adapter *adap, int mbox, int pf, int vf);
 void t4_fatal_err(struct adapter *adapter);
+unsigned int t4_chip_rss_size(struct adapter *adapter);
 int t4_config_rss_range(struct adapter *adapter, int mbox, unsigned int viid,
int start, int n, const u16 *rspq, unsigned int nrspq);
 int t4_config_glbl_rss(struct adapter *adapter, int mbox, unsigned int mode,
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
index 41c8736..581d628 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
@@ -179,7 +179,7 @@ static u32 cxgb4_get_entity_length(struct adapter *adap, 
u32 entity)
len = cudbg_mbytes_to_bytes(len);
break;
case CUDBG_RSS:
-   len = RSS_NENTRIES * sizeof(u16);
+   len = t4_chip_rss_size(adap) * sizeof(u16);
break;
case CUDBG_RSS_VF_CONF:
len = adap->params.arch.vfcount *
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
index d8efcd9..d3ced04 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
@@ -2021,11 +2021,12 @@ static int rss_show(struct seq_file *seq, void *v, int 
idx)
 
 static int rss_open(struct inode *inode, struct file *file)
 {
-   int ret;
-   struct seq_tab *p;
struct adapter *adap = inode->i_private;
+   int ret, nentries;
+   struct seq_tab *p;
 
-   p = seq_open_tab(file, RSS_NENTRIES / 8, 8 * sizeof(u16), 0, rss_show);
+   nentries = t4_chip_rss_size(adap);
+   p = seq_open_tab(file, nentries / 8, 8 * sizeof(u16), 0, rss_show);
if (!p)
return -ENOMEM;
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c 
b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
index f044717..242bcdd 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
@@ -4927,6 +4927,14 @@ void t4_intr_disable(struct adapter *adapter)
t4_set_reg_field(adapter, PL_INT_MAP0_A, 1 << pf, 0);
 }
 
+unsigned int t4_chip_rss_size(struct adapter *adap)
+{
+   if (CHELSIO_CHIP_VERSION(adap->params.chip) <= CHELSIO_T5)
+   return RSS_NENTRIES;
+   else
+   return T6_RSS_NENTRIES;
+}
+
 /**
  * t4_config_rss_range - configure a portion of the RSS mapping table
  * @adapter: the adapter
@@ -5065,10 +5073,11 @@ static int rd_rss_row(struct adapter *adap, int row, 
u32 *val)
  */
 int t4_read_rss(struct adapter *adapter, u16 *map)
 {
+   int i, ret, nentries;
u32 val;
-   int i, ret;
 
-   for (i = 0; i < RSS_NENTRIES / 2; ++i) {
+   nentries = t4_chip_rss_size(adapter);
+   for (i = 0; i < nentries / 2; ++i) {
ret = rd_rss_row(adapter, i, &val);
if (ret)
return ret;
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.h 
b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.h
index 872a91b..361d503 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.h
@@ -38,21 +38,22 @@
 #include 
 
 enum {
-   NCHAN  = 4,  

Re: [PATCH net-next 17/17] net: hns3: change TM sched mode to TC-based mode when SRIOV enabled

2017-12-18 Thread lipeng (Y)



On 2017/12/18 17:08, Sergei Shtylyov wrote:

On 12/18/2017 12:31 PM, Lipeng wrote:


TC-based sched mode supports both SRIOV enabled and SRIOV disabled. This
patch changes the TM sched mode to TC-based mode in the initialization
process.

Fixes: cc9bb43 (net: hns3: Add tc-based TM support for sriov enabled 
port)


   Need at least 12 hex digits.



Agreed, some hex digits may have been lost; will fix it.


Signed-off-by: Lipeng 

[...]

MBR, Sergei







Re: [PATCH net-next 14/17] net: hns3: add Asym Pause support to phy default features

2017-12-18 Thread lipeng (Y)



On 2017/12/18 17:07, Sergei Shtylyov wrote:

Hello!

On 12/18/2017 12:31 PM, Lipeng wrote:


From: Fuyun Liang 

commit c4fb2cdf575d (net: hns3: fix a bug for phy supported feature
initialization) adds default supported features for phy, but our 
hardware


   The cited commit's summary needs to be enclosed in (""), not just ()...



Thanks , will fix it.


also supports Asym Pause. This patch adds Asym Pause support to phy
default features so that Asym Pause can be advertised when the phy
negotiates flow control.

Fixes: c4fb2cdf575d (net: hns3: fix a bug for phy supported feature 
initialization)


   Here as well...


will fix here too.

Thanks


Signed-off-by: Fuyun Liang 
Signed-off-by: Lipeng 

[...]

MBR, Sergei







[PATCH net-next 1/2] cxgb4: RSS table is 4k for T6

2017-12-18 Thread Ganesh Goudar
RSS table is 4k for T6 and later cards, add check for the
same.

Signed-off-by: Ganesh Goudar 
---
 drivers/net/ethernet/chelsio/cxgb4/cudbg_lib.c |  5 ++--
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h |  1 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c   |  2 +-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c |  7 ++---
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c | 13 +++--
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.h | 31 +++---
 6 files changed, 36 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cudbg_lib.c 
b/drivers/net/ethernet/chelsio/cxgb4/cudbg_lib.c
index d73fb6a..336670d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cudbg_lib.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cudbg_lib.c
@@ -1004,9 +1004,10 @@ int cudbg_collect_rss(struct cudbg_init *pdbg_init,
 {
struct adapter *padap = pdbg_init->adap;
struct cudbg_buffer temp_buff = { 0 };
-   int rc;
+   int rc, nentries;
 
-   rc = cudbg_get_buff(dbg_buff, RSS_NENTRIES * sizeof(u16), &temp_buff);
+   nentries = t4_chip_rss_size(padap);
+   rc = cudbg_get_buff(dbg_buff, nentries * sizeof(u16), &temp_buff);
if (rc)
return rc;
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index b1df2aa..69d0b64 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -1528,6 +1528,7 @@ int t4_init_portinfo(struct port_info *pi, int mbox,
 int port, int pf, int vf, u8 mac[]);
 int t4_port_init(struct adapter *adap, int mbox, int pf, int vf);
 void t4_fatal_err(struct adapter *adapter);
+unsigned int t4_chip_rss_size(struct adapter *adapter);
 int t4_config_rss_range(struct adapter *adapter, int mbox, unsigned int viid,
int start, int n, const u16 *rspq, unsigned int nrspq);
 int t4_config_glbl_rss(struct adapter *adapter, int mbox, unsigned int mode,
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
index 41c8736..581d628 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
@@ -179,7 +179,7 @@ static u32 cxgb4_get_entity_length(struct adapter *adap, 
u32 entity)
len = cudbg_mbytes_to_bytes(len);
break;
case CUDBG_RSS:
-   len = RSS_NENTRIES * sizeof(u16);
+   len = t4_chip_rss_size(adap) * sizeof(u16);
break;
case CUDBG_RSS_VF_CONF:
len = adap->params.arch.vfcount *
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
index 4956e42..200bf67 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
@@ -2021,11 +2021,12 @@ static int rss_show(struct seq_file *seq, void *v, int 
idx)
 
 static int rss_open(struct inode *inode, struct file *file)
 {
-   int ret;
-   struct seq_tab *p;
struct adapter *adap = inode->i_private;
+   int ret, nentries;
+   struct seq_tab *p;
 
-   p = seq_open_tab(file, RSS_NENTRIES / 8, 8 * sizeof(u16), 0, rss_show);
+   nentries = t4_chip_rss_size(adap);
+   p = seq_open_tab(file, nentries / 8, 8 * sizeof(u16), 0, rss_show);
if (!p)
return -ENOMEM;
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c 
b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
index f044717..242bcdd 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
@@ -4927,6 +4927,14 @@ void t4_intr_disable(struct adapter *adapter)
t4_set_reg_field(adapter, PL_INT_MAP0_A, 1 << pf, 0);
 }
 
+unsigned int t4_chip_rss_size(struct adapter *adap)
+{
+   if (CHELSIO_CHIP_VERSION(adap->params.chip) <= CHELSIO_T5)
+   return RSS_NENTRIES;
+   else
+   return T6_RSS_NENTRIES;
+}
+
 /**
  * t4_config_rss_range - configure a portion of the RSS mapping table
  * @adapter: the adapter
@@ -5065,10 +5073,11 @@ static int rd_rss_row(struct adapter *adap, int row, 
u32 *val)
  */
 int t4_read_rss(struct adapter *adapter, u16 *map)
 {
+   int i, ret, nentries;
u32 val;
-   int i, ret;
 
-   for (i = 0; i < RSS_NENTRIES / 2; ++i) {
+   nentries = t4_chip_rss_size(adapter);
+   for (i = 0; i < nentries / 2; ++i) {
ret = rd_rss_row(adapter, i, &val);
if (ret)
return ret;
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.h 
b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.h
index 872a91b..361d503 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.h
@@ -38,21 +38,22 @@
 #include 
 
 enum {
-   NCHAN  = 4, /* # of HW channels */
-   
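
The size selection this patch introduces can be modelled outside the driver. The following is only a sketch: the chip version enum is a hypothetical stand-in for the driver's own, and the value 2048 for pre-T6 chips is an assumption (the description above only states that T6 and later use a 4k table).

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed table sizes: the patch text only says "4k for T6";
 * 2048 entries for older chips is an assumption of this sketch. */
enum { RSS_NENTRIES = 2048, T6_RSS_NENTRIES = 4096 };

/* Hypothetical stand-in for the driver's chip version values. */
enum chip_version { CHIP_T4 = 4, CHIP_T5 = 5, CHIP_T6 = 6 };

/* Number of RSS table entries for a given chip generation. */
static unsigned int chip_rss_size(enum chip_version ver)
{
	return ver <= CHIP_T5 ? RSS_NENTRIES : T6_RSS_NENTRIES;
}

/* Bytes needed to dump the whole table, one 16-bit value per entry. */
static size_t rss_dump_bytes(enum chip_version ver)
{
	return (size_t)chip_rss_size(ver) * sizeof(uint16_t);
}
```

Every caller that previously multiplied or divided RSS_NENTRIES directly would go through the helper instead, which is what the hunks above do for the debugfs, cudbg and t4_read_rss paths.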

Re: [v2 PATCH -tip 3/6] net: sctp: Add SCTP ACK tracking trace event

2017-12-18 Thread Masami Hiramatsu
On Mon, 18 Dec 2017 12:05:16 -0500
Steven Rostedt  wrote:

> On Mon, 18 Dec 2017 17:12:15 +0900
> Masami Hiramatsu  wrote:
> 
> > Add SCTP ACK tracking trace event to trace the changes of SCTP
> > association state in response to incoming packets.
> > It is used for debugging SCTP congestion control algorithms,
> > and will replace sctp_probe module.
> > 
> > Note that this event is a bit tricky. Since it consists of 2
> > events (sctp_probe and sctp_probe_path), you have to enable
> > both events as below.
> > 
> >   # cd /sys/kernel/debug/tracing
> >   # echo 1 > events/sctp/sctp_probe/enable
> >   # echo 1 > events/sctp/sctp_probe_path/enable
> > 
> > Or, you can enable all the events under sctp.
> > 
> >   # echo 1 > events/sctp/enable
> > 
> > Since sctp_probe_path event is always invoked from sctp_probe
> > event, you can not see any output if you only enable
> > sctp_probe_path.
> 
> I have to ask, why did you do it this way?
> 
> 
> > +#include 
> > diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
> > index 8f8ccded13e4..c5f92b2cc5c3 100644
> > --- a/net/sctp/sm_statefuns.c
> > +++ b/net/sctp/sm_statefuns.c
> > @@ -59,6 +59,9 @@
> >  #include 
> >  #include 
> >  
> > +#define CREATE_TRACE_POINTS
> > +#include 
> > +
> >  static struct sctp_packet *sctp_abort_pkt_new(
> > struct net *net,
> > const struct sctp_endpoint *ep,
> > @@ -3219,6 +3222,8 @@ enum sctp_disposition sctp_sf_eat_sack_6_2(struct net 
> > *net,
> > struct sctp_sackhdr *sackh;
> > __u32 ctsn;
> >  
> > +   trace_sctp_probe(ep, asoc, chunk);
> 
> What about doing this right after this probe:
> 
>   if (trace_sctp_probe_path_enabled()) {
>   struct sctp_transport *sp;
> 
>   list_for_each_entry(sp, &asoc->peer.transport_addr_list,
>   transports) {
>   trace_sctp_probe_path(sp, asoc);
>   }
>   }
> 
> The "trace_sctp_probe_path_enabled()" is a static branch, which means
> it's a nop just like a tracepoint is, and will not add any overhead if
> the trace_sctp_probe_path is not enabled.

That's a good idea! I'll update to use it :)

Thank you,

> 
> -- Steve
> 
> > +
> > if (!sctp_vtag_verify(chunk, asoc))
> > return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
> >  
> 


-- 
Masami Hiramatsu 
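
The shape of Steven's suggestion, walking the transport list only when the per-path event is enabled, can be sketched in plain C. This is a userspace model under stated assumptions: probe_path_enabled is a hypothetical ordinary variable, whereas the kernel's trace_..._enabled() check is a static branch patched into the code, and struct transport is a made-up miniature of the real list.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-event switch. The kernel uses a static branch here,
 * so the disabled case costs a nop rather than a variable load. */
static bool probe_path_enabled;

/* Made-up miniature of a peer transport list. */
struct transport {
	int id;
	struct transport *next;
};

/* Counts how many per-path events were "emitted". */
static int path_events_emitted;

static void trace_path(const struct transport *t)
{
	(void)t;
	path_events_emitted++;
}

/* Walk the list only when the per-path event is enabled, so the loop
 * adds no work when nobody is listening. */
static void probe(const struct transport *head)
{
	if (!probe_path_enabled)
		return;
	for (const struct transport *t = head; t; t = t->next)
		trace_path(t);
}
```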


Re: [Patch net-next] net_sched: properly check for empty skb array on error path

2017-12-18 Thread John Fastabend
On 12/18/2017 02:34 PM, Cong Wang wrote:
> First, the check of &q->ring.queue against NULL is wrong, it
> is always false. We should check the value rather than the address.
> 

Thanks.

> Secondly, we need the same check in pfifo_fast_reset() too,
> as both ->reset() and ->destroy() are called in qdisc_destroy().
> 

not that it hurts to have the check here, but if init fails
in qdisc_create it seems only ->destroy() is called without
a ->reset().

Is there another path for init() to fail that I'm missing?

> Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
> Reported-by: syzbot 
> Cc: John Fastabend 
> Signed-off-by: Cong Wang 
> ---
>  net/sched/sch_generic.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
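
The first point in the quoted description (check the value, not the address) can be illustrated with a small standalone sketch. The struct names below are hypothetical and are not the qdisc's real types; the address is stored in a local variable first only to keep compilers from warning about the always-true comparison.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of a ring that owns a heap-allocated slot array. */
struct ring {
	void **queue;
};

struct band {
	int id;
	struct ring ring;
};

/* Buggy form: the address of an embedded member is never NULL for a
 * valid band, so this says nothing about whether queue was allocated. */
static bool ring_allocated_by_address(struct band *b)
{
	void ***slot = &b->ring.queue;

	return slot != NULL;
}

/* Intended form: test the pointer value stored in the member. */
static bool ring_allocated_by_value(const struct band *b)
{
	return b->ring.queue != NULL;
}
```

With an unallocated queue the address-based check still reports "allocated", which is why an error path relying on it would go on to touch a NULL array.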


linux-next: manual merge of the net-next tree with the net tree

2017-12-18 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  drivers/net/phy/marvell.c

between commit:

  c505873eaece ("net: phy: marvell: Limit 88m1101 autoneg errata to 88E1145 as 
well.")

from the net tree and commit:

  80274abafc60 ("net: phy: remove generic settings for callbacks config_aneg 
and read_status from drivers")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc drivers/net/phy/marvell.c
index 82104edca393,2fc026dc170a..
--- a/drivers/net/phy/marvell.c
+++ b/drivers/net/phy/marvell.c
@@@ -2085,8 -2070,7 +2082,7 @@@ static struct phy_driver marvell_driver
.flags = PHY_HAS_INTERRUPT,
.probe = marvell_probe,
.config_init = _config_init,
 -  .config_aneg = _config_aneg,
 +  .config_aneg = _config_aneg,
-   .read_status = _read_status,
.ack_interrupt = _ack_interrupt,
.config_intr = _config_intr,
.resume = _resume,


Re: [PATCH net-next] bpf/cgroup: fix a verification error for a CGROUP_DEVICE type prog

2017-12-18 Thread Daniel Borkmann
On 12/18/2017 07:13 PM, Yonghong Song wrote:
> The tools/testing/selftests/bpf test program
> test_dev_cgroup fails with the following error
> when compiled with llvm 6.0. (I did not try
> with earlier versions.)
> 
>   libbpf: load bpf program failed: Permission denied
>   libbpf: -- BEGIN DUMP LOG ---
>   libbpf:
>   0: (61) r2 = *(u32 *)(r1 +4)
>   1: (b7) r0 = 0
>   2: (55) if r2 != 0x1 goto pc+8
>R0=inv0 R1=ctx(id=0,off=0,imm=0) R2=inv1 R10=fp0
>   3: (69) r2 = *(u16 *)(r1 +0)
>   invalid bpf_context access off=0 size=2
>   ...
> 
> The culprit is the following statement in dev_cgroup.c:
>   short type = ctx->access_type & 0xFFFF;
> This code is typical as the ctx->access_type is assigned
> as below in kernel/bpf/cgroup.c:
>   struct bpf_cgroup_dev_ctx ctx = {
> .access_type = (access << 16) | dev_type,
> .major = major,
> .minor = minor,
>   };
> 
> The compiler converts it to u16 access while
> the verifier cgroup_dev_is_valid_access rejects
> any non u32 access.
> 
> This patch permits the field access_type to be accessible
> with type u16 and u8 as well.
> 
> Signed-off-by: Yonghong Song 
> Tested-by: Roman Gushchin 

Looks good, applied to bpf-next, thanks Yonghong!
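
The packing quoted above explains why a narrow load is tempting for the compiler: the device type lives entirely in the low 16 bits of access_type. A minimal sketch of the packing and unpacking follows; the helper names are hypothetical, and the 16-bit mask is assumed to be 0xFFFF.

```c
#include <stdint.h>

/* Mirrors the quoted initializer: access in the high half,
 * device type in the low half. Helper names are hypothetical. */
static uint32_t pack_access_type(uint16_t access, uint16_t dev_type)
{
	return ((uint32_t)access << 16) | dev_type;
}

/* Only the low 16 bits are needed, so a compiler may legally turn
 * this into a 2-byte load of the field. */
static uint16_t unpack_dev_type(uint32_t access_type)
{
	return (uint16_t)(access_type & 0xFFFF);
}

static uint16_t unpack_access(uint32_t access_type)
{
	return (uint16_t)(access_type >> 16);
}
```

That 2-byte load matches the "off=0 size=2" access in the verifier log above, which is the access the patch makes acceptable.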


[RFC PATCH] virtio_net: Extend virtio to use VF datapath when available

2017-12-18 Thread Sridhar Samudrala
This patch enables virtio to switch over to a VF datapath when a VF netdev
is present with the same MAC address.  It allows live migration of a VM
with a direct attached VF without the need to set up a bond/team between a
VF and virtio net device in the guest.

The hypervisor needs to unplug the VF device from the guest on the source
host and reset the MAC filter of the VF to initiate failover of datapath to
virtio before starting the migration. After the migration is completed, the
destination hypervisor sets the MAC filter on the VF and plugs it back to
the guest to switch over to VF datapath.

It is entirely based on netvsc implementation and it should be possible to
make this code generic and move it to a common location that can be shared
by netvsc and virtio.

Also, I think we should make this a negotiated feature that is off by
default via a new feature bit.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

Signed-off-by: Sridhar Samudrala 
Reviewed-by: Jesse Brandeburg 
---
 drivers/net/virtio_net.c | 341 ++-
 1 file changed, 339 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 559b215c0169..a34c717bb15b 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -31,6 +31,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -56,6 +58,8 @@ module_param(napi_tx, bool, 0644);
  */
 DECLARE_EWMA(pkt_len, 0, 64)
 
+#define VF_TAKEOVER_INT (HZ / 10)
+
 #define VIRTNET_DRIVER_VERSION "1.0.0"
 
 static const unsigned long guest_offloads[] = {
@@ -117,6 +121,15 @@ struct receive_queue {
char name[40];
 };
 
+struct virtnet_vf_pcpu_stats {
+   u64 rx_packets;
+   u64 rx_bytes;
+   u64 tx_packets;
+   u64 tx_bytes;
+   struct u64_stats_sync   syncp;
+   u32 tx_dropped;
+};
+
 struct virtnet_info {
struct virtio_device *vdev;
struct virtqueue *cvq;
@@ -179,6 +192,11 @@ struct virtnet_info {
u32 speed;
 
unsigned long guest_offloads;
+
+   /* State to manage the associated VF interface. */
+   struct net_device __rcu *vf_netdev;
+   struct virtnet_vf_pcpu_stats __percpu *vf_stats;
+   struct delayed_work vf_takeover;
 };
 
 struct padded_vnet_hdr {
@@ -1300,16 +1318,51 @@ static int xmit_skb(struct send_queue *sq, struct 
sk_buff *skb)
return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);
 }
 
+/* Send skb on the slave VF device. */
+static int virtnet_vf_xmit(struct net_device *dev, struct net_device 
*vf_netdev,
+  struct sk_buff *skb)
+{
+   struct virtnet_info *vi = netdev_priv(dev);
+   unsigned int len = skb->len;
+   int rc;
+
+   skb->dev = vf_netdev;
+   skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
+
+   rc = dev_queue_xmit(skb);
+   if (likely(rc == NET_XMIT_SUCCESS || rc == NET_XMIT_CN)) {
+   struct virtnet_vf_pcpu_stats *pcpu_stats
+   = this_cpu_ptr(vi->vf_stats);
+
+   u64_stats_update_begin(&pcpu_stats->syncp);
+   pcpu_stats->tx_packets++;
+   pcpu_stats->tx_bytes += len;
+   u64_stats_update_end(&pcpu_stats->syncp);
+   } else {
+   this_cpu_inc(vi->vf_stats->tx_dropped);
+   }
+
+   return rc;
+}
+
 static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
struct virtnet_info *vi = netdev_priv(dev);
int qnum = skb_get_queue_mapping(skb);
struct send_queue *sq = &vi->sq[qnum];
+   struct net_device *vf_netdev;
int err;
struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
bool kick = !skb->xmit_more;
bool use_napi = sq->napi.weight;
 
+   /* if VF is present and up then redirect packets
+* called with rcu_read_lock_bh
+*/
+   vf_netdev = rcu_dereference_bh(vi->vf_netdev);
+   if (vf_netdev && netif_running(vf_netdev) && !netpoll_tx_running(dev))
+   return virtnet_vf_xmit(dev, vf_netdev, skb);
+
/* Free up any pending old buffers before queueing new ones. */
free_old_xmit_skbs(sq);
 
@@ -1456,10 +1509,41 @@ static int virtnet_set_mac_address(struct net_device 
*dev, void *p)
return ret;
 }
 
+static void virtnet_get_vf_stats(struct net_device *dev,
+struct virtnet_vf_pcpu_stats *tot)
+{
+   struct virtnet_info *vi = netdev_priv(dev);
+   int i;
+
+   memset(tot, 0, sizeof(*tot));
+
+   for_each_possible_cpu(i) {
+   const struct virtnet_vf_pcpu_stats *stats
+   = per_cpu_ptr(vi->vf_stats, i);
+   u64 rx_packets, rx_bytes, 
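
The transmit-side decision in the start_xmit() hunk above boils down to a three-way condition. The sketch below models only that decision in userspace; struct netdev, enum datapath and select_datapath() are hypothetical names, not part of the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of a network device: only the "up" state. */
struct netdev {
	bool running;
};

enum datapath { PATH_VIRTIO, PATH_VF };

/* Prefer the VF when one is registered, it is up, and netpoll is not
 * transmitting on the virtio device; otherwise stay on virtio. */
static enum datapath select_datapath(const struct netdev *vf,
				     bool netpoll_active)
{
	if (vf && vf->running && !netpoll_active)
		return PATH_VF;
	return PATH_VIRTIO;
}
```

Unplugging the VF before migration corresponds to the vf == NULL case, so traffic falls back to virtio without any guest-side bond or team device.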

Re: [PATCH] bpf: make function xdp_do_generic_redirect_map() static

2017-12-18 Thread Daniel Borkmann
On 12/19/2017 12:17 AM, Xiongwei Song wrote:
> The function xdp_do_generic_redirect_map() is only used in this file, so
> make it static.
> 
> Clean up sparse warning:
> net/core/filter.c:2687:5: warning: no previous prototype
> for 'xdp_do_generic_redirect_map' [-Wmissing-prototypes]
> 
> Signed-off-by: Xiongwei Song 

Applied to bpf-next, thanks Xiongwei!


Re: [PATCH bpf-next] selftests/bpf: add netdevsim to config

2017-12-18 Thread Daniel Borkmann
On 12/19/2017 12:11 AM, Jakub Kicinski wrote:
> BPF offload tests (test_offload.py) will require netdevsim
> to be built, add it to config.
> 
> Signed-off-by: Jakub Kicinski 
> Reviewed-by: Quentin Monnet 

Applied to bpf-next, thanks Jakub!


Re: [PATCH bpf-next] bpf: arm64: fix uninitialized variable

2017-12-18 Thread Daniel Borkmann
On 12/18/2017 07:36 PM, Alexei Starovoitov wrote:
> On 12/18/17 10:19 AM, Daniel Borkmann wrote:
>> On 12/18/2017 07:09 PM, Alexei Starovoitov wrote:
>>> From: Alexei Starovoitov 
>>>
>>> fix the following issue:
>>> arch/arm64/net/bpf_jit_comp.c: In function 'bpf_int_jit_compile':
>>> arch/arm64/net/bpf_jit_comp.c:982:18: error: 'image_size' may be used
>>> uninitialized in this function [-Werror=maybe-uninitialized]
>>>
>>> Fixes: db496944fdaa ("bpf: arm64: add JIT support for multi-function 
>>> programs")
>>> Reported-by: Arnd Bergmann 
>>> Signed-off-by: Alexei Starovoitov 
>>> ---
>>>  arch/arm64/net/bpf_jit_comp.c | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
>>> index 396490cf7316..acaa935ed977 100644
>>> --- a/arch/arm64/net/bpf_jit_comp.c
>>> +++ b/arch/arm64/net/bpf_jit_comp.c
>>> @@ -897,6 +897,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog 
>>> *prog)
>>>  image_ptr = jit_data->image;
>>>  header = jit_data->header;
>>>  extra_pass = true;
>>> +    image_size = sizeof(u32) * ctx.idx;
>>>  goto skip_init_ctx;
>>>  }
>>>  memset(&ctx, 0, sizeof(ctx));
>>
>> I don't really mind, but it feels more complex than it needs to be
>> imho, since in the initial pass you fetch 'image_size' in fake pass
>> from ctx.idx, then we set ctx.idx to 0 again, do another pass and
>> use the cached ctx.idx from that second pass instead of the first
>> one where we set 'image_size' originally, so we definitely need to
>> take that into consideration in future reviews at least.
> 
> not sure what you mean.
> This check: ctx.idx != jit_data->ctx.idx matters the most.
> After first alloc the 'image_size' variable used for dumping only.
> That's why the JITing itself worked fine. We could have removed it
> since it's computable from idx, but imo it's fine this way.

Fair enough, given final ctx.idx value must be guaranteed to never change
in future between pass#1 and pass#2 from the first bpf_int_jit_compile()
run, then lets go with this smaller version; applied to bpf-next, thanks
Alexei!
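
The relationship discussed here, image_size being derivable from ctx.idx because every emitted instruction is one u32, can be sketched as follows. struct jit_ctx and emit() below are simplified, hypothetical versions for illustration and not the arm64 JIT's real definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical miniature of a JIT context: idx counts emitted
 * instructions, image is NULL during the sizing pass. */
struct jit_ctx {
	int idx;
	uint32_t *image;
};

/* Emit one fixed-width instruction; with image == NULL this only
 * advances the counter, which is how a sizing pass works. */
static void emit(struct jit_ctx *ctx, uint32_t insn)
{
	if (ctx->image)
		ctx->image[ctx->idx] = insn;
	ctx->idx++;
}

/* The image size follows directly from the instruction count. */
static size_t image_size(const struct jit_ctx *ctx)
{
	return sizeof(uint32_t) * (size_t)ctx->idx;
}
```

As long as the sizing pass and the final pass emit the same number of instructions, recomputing the size from idx on the extra pass gives the same value as the one cached on the first run.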


Re: [PATCH][next] bpf: make function skip_callee static and return NULL rather than 0

2017-12-18 Thread Daniel Borkmann
On 12/18/2017 06:47 PM, Colin King wrote:
> From: Colin Ian King 
> 
> Function skip_callee is local to the source and does not need to
> be in global scope, so make it static. Also return NULL rather than 0.
> Cleans up two sparse warnings:
> 
> symbol 'skip_callee' was not declared. Should it be static?
> Using plain integer as NULL pointer
> 
> Signed-off-by: Colin Ian King 

Makes sense, applied to bpf-next, thanks Colin!


Re: [PATCH][next] bpf: fix spelling mistake: "funcation"-> "function"

2017-12-18 Thread Daniel Borkmann
On 12/18/2017 03:03 PM, Colin King wrote:
> From: Colin Ian King 
> 
> Trivial fix to spelling mistake in error message text.
> 
> Signed-off-by: Colin Ian King 

Applied to bpf-next, thanks Colin!


Re: [PATCH 1/3] kallsyms: don't leak address when symbol not found

2017-12-18 Thread Tobin C. Harding
On Mon, Dec 18, 2017 at 06:43:24PM -0500, Steven Rostedt wrote:
> On Tue, 19 Dec 2017 09:41:29 +1100
> "Tobin C. Harding"  wrote:
> 
> > Current suggestion on list is to remove this function. Do you have a use
> > case in mind where debugging will break? We could add a fix to this
> > series if so. Otherwise next version will likely drop
> > string_is_no_symbol()
> 
> What about adding a kernel command line parameter that lets one put
> back the old behavior.
> 
> "insecure_print_all_symbols" ?

Cool. I've not done that before it will be a good learning
experience. I'll hack it up and see what people think.

thanks,
Tobin.


Re: [PATCH 3/3] trace: print address if symbol not found

2017-12-18 Thread Tobin C. Harding
On Mon, Dec 18, 2017 at 06:51:43PM -0500, Steven Rostedt wrote:
> On Tue, 19 Dec 2017 08:16:14 +1100
> "Tobin C. Harding"  wrote:
> 
> > > >  #endif /* _LINUX_KERNEL_TRACE_H */
> > > > diff --git a/kernel/trace/trace_events_hist.c 
> > > > b/kernel/trace/trace_events_hist.c
> > > > index 1e1558c99d56..3e28522a76f4 100644
> > > > --- a/kernel/trace/trace_events_hist.c
> > > > +++ b/kernel/trace/trace_events_hist.c
> > > > @@ -982,7 +982,7 @@ static void hist_trigger_stacktrace_print(struct 
> > > > seq_file *m,
> > > > return;
> > > >  
> > > > seq_printf(m, "%*c", 1 + spaces, ' ');
> > > > -   sprint_symbol(str, stacktrace_entries[i]);
> > > > +   trace_sprint_symbol_addr(str, stacktrace_entries[i]);  
> > > 
> 
> > 
> > If you have the time to give me some brief pointers on how I should go
> > about testing this I'd love to test it before the next version. I know
> > very little about ftrace.
> 
> For hitting the histogram stacktrace trigger (this code path), make
> sure you have CONFIG_HIST_TRIGGERS enabled. And then do:
> 
>  # cd /sys/kernel/debug/tracing
>  # echo 'hist:keys=common_pid.execname,stacktrace:vals=prev_state' > \
>  events/sched/sched_switch/trigger
>  # cat events/sched/sched_switch/hist
> 
> For the "sym" part, you can do (from the same directory):
> 
>  # echo 'hist:keys=call_site.sym:vals=bytes_req' > \
>  events/kmem/kmalloc/trigger
>  # cat events/kmem/kmalloc/hist
> 
> 
> And for sym-offset:
> 
>  # echo 'hist:keys=call_site.sym-offset:vals=bytes_req' > \
> events/kmem/kmalloc/trigger
>  # cat events/kmem/kmalloc/hist
> 
> -- Steve

Thanks, you're the man


Re: [PATCH] bpf: fix broken BPF selftest build on s390

2017-12-18 Thread Daniel Borkmann
On 12/18/2017 02:09 PM, Hendrik Brueckner wrote:
> With 720f228e8d31 ("bpf: fix broken BPF selftest build") the
> inclusion of arch-specific header files changed.  Including the
> asm/bpf_perf_event.h on s390, correctly includes the s390 specific
> header file.  This header file tries then to include the s390
> asm/ptrace.h and the build fails with:
> 
> cc -Wall -O2 -I../../../include/uapi -I../../../lib 
> -I../../../../include/generated  -I../../../include test_verifier.c
> +/root/git/linux/tools/testing/selftests/bpf/libbpf.a 
> /root/git/linux/tools/testing/selftests/bpf/cgroup_helpers.c -lcap -lelf -o
> +/root/git/linux/tools/testing/selftests/bpf/test_verifier
> In file included from ../../../include/uapi/asm/bpf_perf_event.h:4:0,
>  from ../../../include/uapi/linux/bpf_perf_event.h:11,
>  from test_verifier.c:29:
> ../../../include/uapi/../../arch/s390/include/uapi/asm/bpf_perf_event.h:7:9: 
> error: unknown type name 'user_pt_regs'
>  typedef user_pt_regs bpf_user_pt_regs_t;
>  ^~~~
> make: *** [../lib.mk:109: 
> /root/git/linux/tools/testing/selftests/bpf/test_verifier] Error 1
> 
> This is caused by a recent update to the s390 asm/ptrace.h file
> that is not (yet) available in the local installation.  That means,
> the s390 asm/ptrace.h must be included from the tools/arch/s390
> directory.
> 
> Because there is no proper framework to deal with asm specific
> includes in tools/, slightly modify the s390 asm/bpf_perf_event.h
> to include the local ptrace.h header file.
> 
> See also discussion on
> https://marc.info/?l=linux-s390&m=151359424420691&w=2
> 
> Please note that this needs to be preserved until tools/ is able to
> correctly handle asm specific headers.
> 
> References: https://marc.info/?l=linux-s390&m=151359424420691&w=2
> Fixes: 720f228e8d31 ("bpf: fix broken BPF selftest build")
> Signed-off-by: Hendrik Brueckner 
> Cc: Daniel Borkmann 
> Cc: Hendrik Brueckner 
> Cc: Arnaldo Carvalho de Melo 
> Cc: Alexei Starovoitov 

Applied to bpf tree, thanks Hendrik!


Re: [PATCH 3/3] trace: print address if symbol not found

2017-12-18 Thread Steven Rostedt
On Tue, 19 Dec 2017 08:16:14 +1100
"Tobin C. Harding"  wrote:

> > >  #endif /* _LINUX_KERNEL_TRACE_H */
> > > diff --git a/kernel/trace/trace_events_hist.c 
> > > b/kernel/trace/trace_events_hist.c
> > > index 1e1558c99d56..3e28522a76f4 100644
> > > --- a/kernel/trace/trace_events_hist.c
> > > +++ b/kernel/trace/trace_events_hist.c
> > > @@ -982,7 +982,7 @@ static void hist_trigger_stacktrace_print(struct 
> > > seq_file *m,
> > >   return;
> > >  
> > >   seq_printf(m, "%*c", 1 + spaces, ' ');
> > > - sprint_symbol(str, stacktrace_entries[i]);
> > > + trace_sprint_symbol_addr(str, stacktrace_entries[i]);  
> > 

> 
> If you have the time to give me some brief pointers on how I should go
> about testing this I'd love to test it before the next version. I know
> very little about ftrace.

For hitting the histogram stacktrace trigger (this code path), make
sure you have CONFIG_HIST_TRIGGERS enabled. And then do:

 # cd /sys/kernel/debug/tracing
 # echo 'hist:keys=common_pid.execname,stacktrace:vals=prev_state' > \
 events/sched/sched_switch/trigger
 # cat events/sched/sched_switch/hist

For the "sym" part, you can do (from the same directory):

 # echo 'hist:keys=call_site.sym:vals=bytes_req' > \
 events/kmem/kmalloc/trigger
 # cat events/kmem/kmalloc/hist


And for sym-offset:

 # echo 'hist:keys=call_site.sym-offset:vals=bytes_req' > \
events/kmem/kmalloc/trigger
 # cat events/kmem/kmalloc/hist

-- Steve



Re: [PATCH 1/3] kallsyms: don't leak address when symbol not found

2017-12-18 Thread Steven Rostedt
On Tue, 19 Dec 2017 09:41:29 +1100
"Tobin C. Harding"  wrote:

> Current suggestion on list is to remove this function. Do you have a use
> case in mind where debugging will break? We could add a fix to this
> series if so. Otherwise next version will likely drop
> string_is_no_symbol()

What about adding a kernel command line parameter that lets one put
back the old behavior.

"insecure_print_all_symbols" ?

-- Steve


Re: [PATCH 2/3] rhashtable: Add rhashtable_walk_curr

2017-12-18 Thread Herbert Xu
On Mon, Dec 18, 2017 at 02:31:21PM +0100, Andreas Gruenbacher wrote:
> When iterating through an rhashtable is stopped with
> rhashtable_walk_stop and then resumed with rhashtable_walk_start, there
> currently is no way to get back to the current object and thus revisit
> the object rhashtable_walk_next has previously returned.
> 
> This functionality is useful when dumping an rhashtable via the seq file
> interface: seq_read will convert one object after the other.  When an
> object doesn't fit in the remaining buffer space anymore, user-space
> will be returned all objects that have been fully converted so far.
> Upon the next read from user-space, the object that didn't fit
> previously will be revisited.
> 
> Signed-off-by: Andreas Gruenbacher 

Doesn't the helper that Tom Herbert just added do exactly this?

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
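
The behaviour under discussion, revisiting the object most recently returned after a stop/start cycle, can be modelled with a trivial array walker. This is a sketch only: struct walker, walk_next() and walk_curr() are hypothetical names and say nothing about how rhashtable itself implements or names the helper.

```c
#include <stddef.h>

/* Hypothetical walker over a plain array; pos is the index of the
 * next item to hand out. */
struct walker {
	const int *items;
	size_t n;
	size_t pos;
};

/* Return the next item and advance, or NULL at the end. */
static const int *walk_next(struct walker *w)
{
	return w->pos < w->n ? &w->items[w->pos++] : NULL;
}

/* Revisit the item most recently returned by walk_next() without
 * advancing, or NULL if nothing has been returned yet. */
static const int *walk_curr(const struct walker *w)
{
	return w->pos > 0 ? &w->items[w->pos - 1] : NULL;
}
```

A seq_file style dumper that ran out of buffer space while converting an item would call the "curr" operation on the next read to retry that same item before moving on.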


[PATCH] bpf: make function xdp_do_generic_redirect_map() static

2017-12-18 Thread Xiongwei Song
The function xdp_do_generic_redirect_map() is only used in this file, so
make it static.

Clean up sparse warning:
net/core/filter.c:2687:5: warning: no previous prototype
for 'xdp_do_generic_redirect_map' [-Wmissing-prototypes]

Signed-off-by: Xiongwei Song 
---
 net/core/filter.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 754abe1041b7..130b842c3a15 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2684,8 +2684,9 @@ static int __xdp_generic_ok_fwd_dev(struct sk_buff *skb, 
struct net_device *fwd)
return 0;
 }
 
-int xdp_do_generic_redirect_map(struct net_device *dev, struct sk_buff *skb,
-   struct bpf_prog *xdp_prog)
+static int xdp_do_generic_redirect_map(struct net_device *dev,
+  struct sk_buff *skb,
+  struct bpf_prog *xdp_prog)
 {
struct redirect_info *ri = this_cpu_ptr(&redirect_info);
unsigned long map_owner = ri->map_owner;
-- 
2.15.1


