[PATCH net-next] chcr/cxgb4i/cxgbit/RDMA/cxgb4: Allocate resources dynamically for all cxgb4 ULD's

2016-09-16 Thread Hariprasad Shenai
Allocate resources dynamically for cxgb4's Upper Layer Drivers (ULDs),
i.e. cxgbit, iw_cxgb4 and cxgb4i. Allocate resources when they register
with the cxgb4 driver and free them when they unregister. All the queues
and the interrupts for them are allocated during ULD probe only and
freed during remove.
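
For reference, a minimal sketch of what registration looks like under the
new scheme, mirroring the chcr conversion in the diff below (the my_*
names are placeholders, not part of the patch):

static struct cxgb4_uld_info my_uld_info = {
	.name         = "my_uld",
	.nrxq         = MAX_ULD_QSETS,	/* Rx queues, now allocated by cxgb4 */
	.rxq_size     = 1024,
	.add          = my_add,
	.rx_handler   = my_rx_handler,
	.state_change = my_state_change,
};

	/* cxgb4 allocates the queues/interrupts for this ULD here ... */
	ret = cxgb4_register_uld(CXGB4_ULD_CRYPTO, &my_uld_info);
	/* ... and frees them again on cxgb4_unregister_uld(CXGB4_ULD_CRYPTO) */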

Signed-off-by: Hariprasad Shenai 
---
 drivers/crypto/chelsio/chcr_core.c |   10 +-
 drivers/infiniband/hw/cxgb4/device.c   |4 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h |   47 +-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c |  127 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c|  613 +---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.c |  223 ++--
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h |   31 +-
 drivers/net/ethernet/chelsio/cxgb4/sge.c   |   18 +-
 drivers/scsi/cxgbi/cxgb4i/cxgb4i.c |3 +
 drivers/target/iscsi/cxgbit/cxgbit_main.c  |3 +
 10 files changed, 385 insertions(+), 694 deletions(-)

diff --git a/drivers/crypto/chelsio/chcr_core.c 
b/drivers/crypto/chelsio/chcr_core.c
index 2f6156b..fb5f9bb 100644
--- a/drivers/crypto/chelsio/chcr_core.c
+++ b/drivers/crypto/chelsio/chcr_core.c
@@ -39,12 +39,10 @@ static chcr_handler_func work_handlers[NUM_CPL_CMDS] = {
[CPL_FW6_PLD] = cpl_fw6_pld_handler,
 };
 
-static struct cxgb4_pci_uld_info chcr_uld_info = {
+static struct cxgb4_uld_info chcr_uld_info = {
.name = DRV_MODULE_NAME,
-   .nrxq = 4,
+   .nrxq = MAX_ULD_QSETS,
.rxq_size = 1024,
-   .nciq = 0,
-   .ciq_size = 0,
.add = chcr_uld_add,
.state_change = chcr_uld_state_change,
.rx_handler = chcr_uld_rx_handler,
@@ -205,7 +203,7 @@ static int chcr_uld_state_change(void *handle, enum 
cxgb4_state state)
 
 static int __init chcr_crypto_init(void)
 {
-   if (cxgb4_register_pci_uld(CXGB4_PCI_ULD1, &chcr_uld_info)) {
+   if (cxgb4_register_uld(CXGB4_ULD_CRYPTO, &chcr_uld_info)) {
pr_err("ULD register fail: No chcr crypto support in cxgb4");
return -1;
}
@@ -228,7 +226,7 @@ static void __exit chcr_crypto_exit(void)
kfree(u_ctx);
}
mutex_unlock(&dev_mutex);
-   cxgb4_unregister_pci_uld(CXGB4_PCI_ULD1);
+   cxgb4_unregister_uld(CXGB4_ULD_CRYPTO);
 }
 
 module_init(chcr_crypto_init);
diff --git a/drivers/infiniband/hw/cxgb4/device.c 
b/drivers/infiniband/hw/cxgb4/device.c
index 071d733..f170b63 100644
--- a/drivers/infiniband/hw/cxgb4/device.c
+++ b/drivers/infiniband/hw/cxgb4/device.c
@@ -1475,6 +1475,10 @@ static int c4iw_uld_control(void *handle, enum 
cxgb4_control control, ...)
 
 static struct cxgb4_uld_info c4iw_uld_info = {
.name = DRV_NAME,
+   .nrxq = MAX_ULD_QSETS,
+   .rxq_size = 511,
+   .ciq = true,
+   .lro = false,
.add = c4iw_uld_add,
.rx_handler = c4iw_uld_rx_handler,
.state_change = c4iw_uld_state_change,
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 4595569..1f9867d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -437,11 +437,6 @@ enum {
MAX_ETH_QSETS = 32,   /* # of Ethernet Tx/Rx queue sets */
MAX_OFLD_QSETS = 16,  /* # of offload Tx, iscsi Rx queue sets */
MAX_CTRL_QUEUES = NCHAN,  /* # of control Tx queues */
-   MAX_RDMA_QUEUES = NCHAN,  /* # of streaming RDMA Rx queues */
-   MAX_RDMA_CIQS = 32,/* # of  RDMA concentrator IQs */
-
-   /* # of streaming iSCSIT Rx queues */
-   MAX_ISCSIT_QUEUES = MAX_OFLD_QSETS,
 };
 
 enum {
@@ -458,8 +453,7 @@ enum {
 enum {
INGQ_EXTRAS = 2,/* firmware event queue and */
/*   forwarded interrupts */
-   MAX_INGQ = MAX_ETH_QSETS + MAX_OFLD_QSETS + MAX_RDMA_QUEUES +
-  MAX_RDMA_CIQS + MAX_ISCSIT_QUEUES + INGQ_EXTRAS,
+   MAX_INGQ = MAX_ETH_QSETS + INGQ_EXTRAS,
 };
 
 struct adapter;
@@ -704,10 +698,6 @@ struct sge {
struct sge_ctrl_txq ctrlq[MAX_CTRL_QUEUES];
 
struct sge_eth_rxq ethrxq[MAX_ETH_QSETS];
-   struct sge_ofld_rxq iscsirxq[MAX_OFLD_QSETS];
-   struct sge_ofld_rxq iscsitrxq[MAX_ISCSIT_QUEUES];
-   struct sge_ofld_rxq rdmarxq[MAX_RDMA_QUEUES];
-   struct sge_ofld_rxq rdmaciq[MAX_RDMA_CIQS];
struct sge_rspq fw_evtq cacheline_aligned_in_smp;
struct sge_uld_rxq_info **uld_rxq_info;
 
@@ -717,15 +707,8 @@ struct sge {
u16 max_ethqsets;   /* # of available Ethernet queue sets */
u16 ethqsets;   /* # of active Ethernet queue sets */
u16 ethtxq_rover;   /* Tx queue to clean up next */
-   u16 iscsiqsets;  /* # of active iSCSI queue sets */
-   u16 niscsitq;   /* # of available iSCST Rx queues */
-   u16 rdmaqs;

Re: [PATCHv5 net-next 05/15] bpf: expose internal verifier structures

2016-09-16 Thread Daniel Borkmann

On 09/16/2016 11:36 AM, Jakub Kicinski wrote:

Move verifier's internal structures to a header file and
prefix their names with bpf_ to avoid potential namespace
conflicts.  Those structures will soon be used by external
analyzers.

Signed-off-by: Jakub Kicinski 
Acked-by: Alexei Starovoitov 


Acked-by: Daniel Borkmann 


Re: [PATCHv5 net-next 06/15] bpf: enable non-core use of the verifier

2016-09-16 Thread Daniel Borkmann

On 09/16/2016 11:36 AM, Jakub Kicinski wrote:

Advanced JIT compilers and translators may want to use
eBPF verifier as a base for parsers or to perform custom
checks and validations.

Add the ability for external users to invoke the verifier
and provide callbacks to be invoked for every instruction
checked.  For now only the most basic callback, for
per-instruction pre-interpretation checks, is added.  More
advanced users may also like to have a per-instruction post
callback and a state comparison callback.
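
A minimal sketch of an external user, based only on the API added below
(my_insn_hook and my_priv are hypothetical):

static int my_insn_hook(struct bpf_verifier_env *env,
			int insn_idx, int prev_insn_idx)
{
	/* invoked before each instruction is pre-interpreted;
	 * returning non-zero aborts verification */
	return 0;
}

static const struct bpf_ext_analyzer_ops my_ops = {
	.insn_hook = my_insn_hook,
};

	ret = bpf_analyzer(prog, &my_ops, my_priv);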

Signed-off-by: Jakub Kicinski 
Acked-by: Alexei Starovoitov 
---
v4:
  - separate from the header split patch.
---
  include/linux/bpf_verifier.h | 11 +++
  kernel/bpf/verifier.c| 68 
  2 files changed, 79 insertions(+)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 9056117b4a81..925359e1d9a1 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -59,6 +59,12 @@ struct bpf_insn_aux_data {

  #define MAX_USED_MAPS 64 /* max number of maps accessed by one eBPF program */

+struct bpf_verifier_env;
+struct bpf_ext_analyzer_ops {
+   int (*insn_hook)(struct bpf_verifier_env *env,
+int insn_idx, int prev_insn_idx);
+};
+
  /* single container for all structs
   * one verifier_env per bpf_check() call
   */
@@ -68,6 +74,8 @@ struct bpf_verifier_env {
int stack_size; /* number of states to be processed */
struct bpf_verifier_state cur_state; /* current verifier state */
struct bpf_verifier_state_list **explored_states; /* search pruning 
optimization */
+   const struct bpf_ext_analyzer_ops *analyzer_ops; /* external analyzer 
ops */
+   void *analyzer_priv; /* pointer to external analyzer's private data */
struct bpf_map *used_maps[MAX_USED_MAPS]; /* array of map's used by 
eBPF program */
u32 used_map_cnt;   /* number of used maps */
u32 id_gen; /* used to generate unique reg IDs */
@@ -75,4 +83,7 @@ struct bpf_verifier_env {
struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
  };

+int bpf_analyzer(struct bpf_prog *prog, const struct bpf_ext_analyzer_ops *ops,
+void *priv);
+
  #endif /* _LINUX_BPF_VERIFIER_H */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 6e126a417290..d93e78331b90 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -624,6 +624,10 @@ static int check_packet_access(struct bpf_verifier_env 
*env, u32 regno, int off,
  static int check_ctx_access(struct bpf_verifier_env *env, int off, int size,
enum bpf_access_type t, enum bpf_reg_type *reg_type)
  {
+   /* for analyzer ctx accesses are already validated and converted */
+   if (env->analyzer_ops)
+   return 0;
+
if (env->prog->aux->ops->is_valid_access &&
env->prog->aux->ops->is_valid_access(off, size, t, reg_type)) {
/* remember the offset of last byte accessed in ctx */
@@ -2222,6 +2226,15 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
return 0;
  }

+static int ext_analyzer_insn_hook(struct bpf_verifier_env *env,
+ int insn_idx, int prev_insn_idx)
+{
+   if (!env->analyzer_ops || !env->analyzer_ops->insn_hook)
+   return 0;
+
+   return env->analyzer_ops->insn_hook(env, insn_idx, prev_insn_idx);
+}
+
  static int do_check(struct bpf_verifier_env *env)
  {
struct bpf_verifier_state *state = &env->cur_state;
@@ -2280,6 +2293,10 @@ static int do_check(struct bpf_verifier_env *env)
print_bpf_insn(insn);
}

+   err = ext_analyzer_insn_hook(env, insn_idx, prev_insn_idx);
+   if (err)
+   return err;
+


Looking at this and the nfp code translator (patch 8/15): Did you check this
also with JIT hardening enabled? Presumably nfp sees this after it got JITed
through the normal load via syscall. Then, when this gets rewritten using the
BPF_REG_AX helper before the image gets locked, and you later on push this
through bpf_analyzer() again, where the hook is called before re-verification
of insns, it's still assumed to be MAX_BPF_REG in your hook callbacks, right?
So, I was wondering wrt out of bounds in nfp_verify_insn() -> 
nfp_bpf_check_ctx_ptr()
for things like BPF_STX? If that's the case, it would make sense to just reject
any prog with reg that is BPF_REG_AX in nfp_verify_insn() upfront. Alternative
would be to use MAX_BPF_JIT_REG in nfp and let bpf_analyzer() fail this via
check_reg_arg() test.


if (class == BPF_ALU || class == BPF_ALU64) {
err = check_alu_op(env, insn);
if (err)
@@ -2829,3 +2846,54 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr 
*attr)

Re: Modification to skb->queue_mapping affecting performance

2016-09-16 Thread Michael Ma
2016-09-16 12:53 GMT-07:00 Eric Dumazet :
> On Fri, 2016-09-16 at 10:57 -0700, Michael Ma wrote:
>
>> This is actually the problem - if flows from different RX queues are
>> switched to the same RX queue in IFB, they'll use different processor
>> context with the same tasklet, and the processor context of different
>> tasklets might be the same. So multiple tasklets in IFB competes for
>> the same core when queue is switched.
>>
>> The following simple fix proved this - with this change even switching
>> the queue won't affect small packet bandwidth/latency anymore:
>>
>> in ifb.c:
>>
>> -   struct ifb_q_private *txp = dp->tx_private + 
>> skb_get_queue_mapping(skb);
>> +   struct ifb_q_private *txp = dp->tx_private +
>> (smp_processor_id() % dev->num_tx_queues);
>>
>> This should be more efficient since we're not sending the task to a
>> different processor, instead we try to queue the packet to an
>> appropriate tasklet based on the processor ID. Will this cause any
>> packet out-of-order problem? If packets from the same flow are queued
>> to the same RX queue due to RSS, and processor affinity is set for RX
>> queues, I assume packets from the same flow will end up in the same
>> core when tasklet is scheduled. But I might have missed some uncommon
>> cases here... Would appreciate if anyone can provide more insights.
>
> Wait, don't you have proper smp affinity for the RX queues on your NIC ?
>
> ( Documentation/networking/scaling.txt RSS IRQ Configuration )
>
Yes - what I was trying to say is that this change will be more
efficient than using smp_call_function_single() to schedule the
tasklet to a different processor.

RSS IRQ should be set properly already. The issue here is that I'll
need to switch the queue mapping for NIC RX to a different TXQ on IFB,
which allows me to classify the flows at the IFB TXQ layer and avoid
qdisc lock contention.

When that switch happens, ideally the processor core shouldn't be
switched, because none of the thread context changes. The work in the
tasklet should be scheduled to the same processor as well. That's why
I tried this change. Also, conceptually IFB is a software device which
should be able to schedule its workload independently of how the NIC
is configured for interrupt handling.

> A driver ndo_start_xmit() MUST use skb_get_queue_mapping(skb), because
> the driver queue is locked before ndo_start_xmit())  (for non
> NETIF_F_LLTX drivers at least)
>

Thanks a lot for pointing out this! I was expecting this kind of
guidance... Then the options would be:

1. Use smp_call_function_single() to schedule the tasklet on a core
statically mapped to the IFB TXQ, which is very similar to how TX/RX
IRQs are configured (a rough sketch follows this list).
2. As you suggested below, add some additional action to do the
rescheduling before entering IFB - for example, when receiving the
packet we could just use RSS to redirect to the desired RXQ. However,
this doesn't seem easy, especially compared with the way mqprio
chooses the queue. The challenge is that IFB queue selection is based
on queue_mapping when the skb arrives at IFB, while core selection is
based on the RXQ of the NIC, and so is also based on queue_mapping
when the skb arrives at the NIC. These two queue_mappings must then be
the same so that two TXQs of IFB never contend for one core, which
essentially means changing the queue mapping of the NIC on the
receiver side - and that can't be achieved using TC.
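
A rough sketch of option 1 (txq_to_cpu[] is a hypothetical static map;
ifb_tasklet is the per-queue tasklet in struct ifb_q_private):

static void ifb_remote_kick(void *info)
{
	struct ifb_q_private *txp = info;

	/* runs on the target CPU via IPI; the tasklet is then
	 * scheduled, and will keep running, on that CPU */
	tasklet_schedule(&txp->ifb_tasklet);
}

	smp_call_function_single(txq_to_cpu[queue], ifb_remote_kick, txp, 0);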

> In case of ifb, __skb_queue_tail(&txp->rq, skb); could corrupt the skb
> list.
>
> In any case, you could have an action to do this before reaching IFB.
>
>
>


[PATCH net-next 2/2] bnx2x: allocate mac filtering pending list in PAGE_SIZE increments

2016-09-16 Thread Jason Baron
Currently, we can have high order page allocations that specify
GFP_ATOMIC when configuring multicast MAC address filters.

For example, we have seen order 2 page allocation failures with
~500 multicast addresses configured.

Convert the allocation for the pending list to be done in PAGE_SIZE
increments.
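
The scheme in generic form (all names here are illustrative, not the
driver's):

/* Elements live inside PAGE_SIZE "group" allocations chained on a
 * list, so no single allocation ever exceeds order 0. */
struct elem_group {
	struct list_head link;
	u64 elems[];			/* as many as fit in one page */
};

#define ELEMS_PER_PG \
	((PAGE_SIZE - sizeof(struct elem_group)) / sizeof(u64))

static struct elem_group *alloc_elem_group(struct list_head *groups,
					   int remaining)
{
	size_t sz = remaining < ELEMS_PER_PG ?
		    sizeof(struct elem_group) + remaining * sizeof(u64) :
		    PAGE_SIZE;
	struct elem_group *g = kmalloc(sz, GFP_ATOMIC);

	if (g)
		list_add(&g->link, groups);
	return g;
}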

Signed-off-by: Jason Baron 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c | 131 ++---
 1 file changed, 94 insertions(+), 37 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
index d468380c2a23..6db8dd252d7c 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
@@ -2606,8 +2606,23 @@ struct bnx2x_mcast_bin_elem {
int type; /* BNX2X_MCAST_CMD_SET_{ADD, DEL} */
 };
 
+union bnx2x_mcast_elem {
+   struct bnx2x_mcast_bin_elem bin_elem;
+   struct bnx2x_mcast_mac_elem mac_elem;
+};
+
+struct bnx2x_mcast_elem_group {
+   struct list_head mcast_group_link;
+   union bnx2x_mcast_elem mcast_elems[];
+};
+
+#define MCAST_MAC_ELEMS_PER_PG \
+   ((PAGE_SIZE - sizeof(struct bnx2x_mcast_elem_group)) / \
+   sizeof(union bnx2x_mcast_elem))
+
 struct bnx2x_pending_mcast_cmd {
struct list_head link;
+   struct list_head group_head;
int type; /* BNX2X_MCAST_CMD_X */
union {
struct list_head macs_head;
@@ -2638,16 +2653,30 @@ static int bnx2x_mcast_wait(struct bnx2x *bp,
return 0;
 }
 
+static void bnx2x_free_groups(struct list_head *mcast_group_list)
+{
+   struct bnx2x_mcast_elem_group *current_mcast_group;
+
+   while (!list_empty(mcast_group_list)) {
+   current_mcast_group = list_first_entry(mcast_group_list,
+ struct bnx2x_mcast_elem_group,
+ mcast_group_link);
+   list_del(&current_mcast_group->mcast_group_link);
+   kfree(current_mcast_group);
+   }
+}
+
 static int bnx2x_mcast_enqueue_cmd(struct bnx2x *bp,
   struct bnx2x_mcast_obj *o,
   struct bnx2x_mcast_ramrod_params *p,
   enum bnx2x_mcast_cmd cmd)
 {
-   int total_sz;
struct bnx2x_pending_mcast_cmd *new_cmd;
-   struct bnx2x_mcast_mac_elem *cur_mac = NULL;
struct bnx2x_mcast_list_elem *pos;
-   int macs_list_len = 0, macs_list_len_size;
+   struct bnx2x_mcast_elem_group *elem_group;
+   struct bnx2x_mcast_mac_elem *mac_elem;
+   int i = 0, offset = 0, macs_list_len = 0;
+   int total_elems, alloc_size;
 
/* When adding MACs we'll need to store their values */
if (cmd == BNX2X_MCAST_CMD_ADD || cmd == BNX2X_MCAST_CMD_SET)
@@ -2657,50 +2686,68 @@ static int bnx2x_mcast_enqueue_cmd(struct bnx2x *bp,
if (!p->mcast_list_len)
return 0;
 
-   /* For a set command, we need to allocate sufficient memory for all
-* the bins, since we can't analyze at this point how much memory would
-* be required.
-*/
-   macs_list_len_size = macs_list_len *
-sizeof(struct bnx2x_mcast_mac_elem);
-   if (cmd == BNX2X_MCAST_CMD_SET) {
-   int bin_size = BNX2X_MCAST_BINS_NUM *
-  sizeof(struct bnx2x_mcast_bin_elem);
-
-   if (bin_size > macs_list_len_size)
-   macs_list_len_size = bin_size;
-   }
-   total_sz = sizeof(*new_cmd) + macs_list_len_size;
-
/* Add mcast is called under spin_lock, thus calling with GFP_ATOMIC */
-   new_cmd = kzalloc(total_sz, GFP_ATOMIC);
-
+   new_cmd = kzalloc(sizeof(*new_cmd), GFP_ATOMIC);
if (!new_cmd)
return -ENOMEM;
 
-   DP(BNX2X_MSG_SP, "About to enqueue a new %d command. 
macs_list_len=%d\n",
-  cmd, macs_list_len);
-
INIT_LIST_HEAD(&new_cmd->data.macs_head);
-
+   INIT_LIST_HEAD(&new_cmd->group_head);
new_cmd->type = cmd;
new_cmd->done = false;
 
+   DP(BNX2X_MSG_SP, "About to enqueue a new %d command. 
macs_list_len=%d\n",
+  cmd, macs_list_len);
+
switch (cmd) {
case BNX2X_MCAST_CMD_ADD:
case BNX2X_MCAST_CMD_SET:
-   cur_mac = (struct bnx2x_mcast_mac_elem *)
- ((u8 *)new_cmd + sizeof(*new_cmd));
-
-   /* Push the MACs of the current command into the pending command
-* MACs list: FIFO
+   /* For a set command, we need to allocate sufficient memory for
+* all the bins, since we can't analyze at this point how much
+* memory would be required.
 */
+   total_elems = macs_list_len;
+   if (cmd == BNX2X_MCAST_CMD_SET) {
+   if (total_elems < BNX2X_MCAST_BINS_NUM)
+   

Re: [PATCH net-next 07/14] tcp: export data delivery rate

2016-09-16 Thread kbuild test robot
Hi Yuchung,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Neal-Cardwell/tcp-BBR-congestion-control-algorithm/20160917-025323
config: arm-simpad_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=arm 

All errors (new ones prefixed by >>):

   In file included from include/linux/kernel.h:142:0,
                    from include/linux/crypto.h:21,
                    from include/crypto/hash.h:16,
                    from net/ipv4/tcp.c:250:
   net/ipv4/tcp.c: In function 'tcp_get_info':
>> arch/arm/include/asm/div64.h:59:36: error: passing argument 1 of '__div64_32' from incompatible pointer type [-Werror=incompatible-pointer-types]
    #define do_div(n, base) __div64_32(&(n), base)
                                       ^
   net/ipv4/tcp.c:2794:3: note: in expansion of macro 'do_div'
      do_div(rate, intv);
      ^~
   arch/arm/include/asm/div64.h:32:24: note: expected 'uint64_t * {aka long long unsigned int *}' but argument is of type 'u32 * {aka unsigned int *}'
    static inline uint32_t __div64_32(uint64_t *n, uint32_t base)
                           ^~
   cc1: some warnings being treated as errors

vim +/__div64_32 +59 arch/arm/include/asm/div64.h

040b323b5 arch/arm/include/asm/div64.h Nicolas Pitre 2015-11-02  43  	: "=r" (__rem), "=r" (__res)
040b323b5 arch/arm/include/asm/div64.h Nicolas Pitre 2015-11-02  44  	: "r" (__n), "r" (__base)
040b323b5 arch/arm/include/asm/div64.h Nicolas Pitre 2015-11-02  45  	: "ip", "lr", "cc");
040b323b5 arch/arm/include/asm/div64.h Nicolas Pitre 2015-11-02  46  	*n = __res;
040b323b5 arch/arm/include/asm/div64.h Nicolas Pitre 2015-11-02  47  	return __rem;
040b323b5 arch/arm/include/asm/div64.h Nicolas Pitre 2015-11-02  48  }
040b323b5 arch/arm/include/asm/div64.h Nicolas Pitre 2015-11-02  49  #define __div64_32 __div64_32
040b323b5 arch/arm/include/asm/div64.h Nicolas Pitre 2015-11-02  50  
040b323b5 arch/arm/include/asm/div64.h Nicolas Pitre 2015-11-02  51  #if !defined(CONFIG_AEABI)
fa4adc614 include/asm-arm/div64.h      Nicolas Pitre 2006-12-06  52  
fa4adc614 include/asm-arm/div64.h      Nicolas Pitre 2006-12-06  53  /*
040b323b5 arch/arm/include/asm/div64.h Nicolas Pitre 2015-11-02  54   * In OABI configurations, some uses of the do_div function
040b323b5 arch/arm/include/asm/div64.h Nicolas Pitre 2015-11-02  55   * cause gcc to run out of registers. To work around that,
040b323b5 arch/arm/include/asm/div64.h Nicolas Pitre 2015-11-02  56   * we can force the use of the out-of-line version for
040b323b5 arch/arm/include/asm/div64.h Nicolas Pitre 2015-11-02  57   * configurations that build a OABI kernel.
fa4adc614 include/asm-arm/div64.h      Nicolas Pitre 2006-12-06  58   */
040b323b5 arch/arm/include/asm/div64.h Nicolas Pitre 2015-11-02 @59  #define do_div(n, base) __div64_32(&(n), base)
fa4adc614 include/asm-arm/div64.h      Nicolas Pitre 2006-12-06  60  
040b323b5 arch/arm/include/asm/div64.h Nicolas Pitre 2015-11-02  61  #else
fa4adc614 include/asm-arm/div64.h      Nicolas Pitre 2006-12-06  62  
fa4adc614 include/asm-arm/div64.h      Nicolas Pitre 2006-12-06  63  /*
040b323b5 arch/arm/include/asm/div64.h Nicolas Pitre 2015-11-02  64   * gcc versions earlier than 4.0 are simply too problematic for the
040b323b5 arch/arm/include/asm/div64.h Nicolas Pitre 2015-11-02  65   * __div64_const32() code in asm-generic/div64.h. First there is
040b323b5 arch/arm/include/asm/div64.h Nicolas Pitre 2015-11-02  66   * gcc PR 15089 that tend to trig on more complex constructs, spurious
040b323b5 arch/arm/include/asm/div64.h Nicolas Pitre 2015-11-02  67   * .global __udivsi3 are inserted even if none of those symbols are

:: The code at line 59 was first introduced by commit
:: 040b323b5012b5503561ec7fe15cccd6a4bcaec2 ARM: asm/div64.h: adjust to generic codde

:: TO: Nicolas Pitre 
:: CC: Nicolas Pitre 

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH net-next 05/14] tcp: track data delivery rate for a TCP connection

2016-09-16 Thread kbuild test robot
Hi Yuchung,

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Neal-Cardwell/tcp-BBR-congestion-control-algorithm/20160917-025323
config: cris-etrax-100lx_v2_defconfig (attached as .config)
compiler: cris-linux-gcc (GCC) 6.2.0
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=cris 

All warnings (new ones prefixed by >>):

   In file included from net/ipv4/route.c:103:0:
>> include/net/tcp.h:769:11: warning: 'packed' attribute ignored for field of type 'struct skb_mstamp' [-Wattributes]
    struct skb_mstamp first_tx_mstamp __packed;
                      ^~
   include/net/tcp.h:771:11: warning: 'packed' attribute ignored for field of type 'struct skb_mstamp' [-Wattributes]
    struct skb_mstamp delivered_mstamp __packed;
                      ^~

vim +769 include/net/tcp.h

   753	#define TCPCB_TAGBITS		0x07	/* All tag bits */
   754	#define TCPCB_REPAIRED		0x10	/* SKB repaired (no skb_mstamp) */
   755	#define TCPCB_EVER_RETRANS	0x80	/* Ever retransmitted frame */
   756	#define TCPCB_RETRANS		(TCPCB_SACKED_RETRANS|TCPCB_EVER_RETRANS| \
   757				TCPCB_REPAIRED)
   758	
   759	__u8		ip_dsfield;	/* IPv4 tos or IPv6 dsfield */
   760	__u8		txstamp_ack:1,	/* Record TX timestamp for ack? */
   761			eor:1,		/* Is skb MSG_EOR marked? */
   762			unused:6;
   763	__u32		ack_seq;	/* Sequence number ACK'd */
   764	union {
   765		struct {
   766			/* There is space for up to 24 bytes */
   767			__u32 in_flight;/* Bytes in flight when packet sent */
   768			/* start of send pipeline phase */
 > 769			struct skb_mstamp first_tx_mstamp __packed;
   770			/* when we reached the "delivered" count */
   771			struct skb_mstamp delivered_mstamp __packed;
   772			/* pkts S/ACKed so far upon tx of skb, incl retrans: */
   773			__u32 delivered;
   774		} tx;	/* only used for outgoing skbs */
   775		union {
   776			struct inet_skb_parm	h4;
   777	#if IS_ENABLED(CONFIG_IPV6)

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCHv5 net-next 07/15] bpf: recognize 64bit immediate loads as consts

2016-09-16 Thread Daniel Borkmann

On 09/16/2016 11:36 AM, Jakub Kicinski wrote:

When running as a parser, interpret BPF_LD | BPF_IMM | BPF_DW
instructions as loading CONST_IMM with the value stored
in imm.  The verifier will continue not recognizing those
due to concerns about search space/program complexity
increase.

Signed-off-by: Jakub Kicinski 
Acked-by: Alexei Starovoitov 


Acked-by: Daniel Borkmann 


[PATCH net-next 1/2] bnx2x: allocate mac filtering 'mcast_list' in PAGE_SIZE increments

2016-09-16 Thread Jason Baron
Currently, we can have high order page allocations that specify
GFP_ATOMIC when configuring multicast MAC address filters.

For example, we have seen order 2 page allocation failures with
~500 multicast addresses configured.

Convert the allocation for 'mcast_list' to be done in PAGE_SIZE
increments.

Signed-off-by: Jason Baron 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 85 
 1 file changed, 57 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index dab61a81a3ba..9f5b2d94e4df 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -12563,43 +12563,70 @@ static int bnx2x_close(struct net_device *dev)
return 0;
 }
 
-static int bnx2x_init_mcast_macs_list(struct bnx2x *bp,
- struct bnx2x_mcast_ramrod_params *p)
+struct bnx2x_mcast_list_elem_group
 {
-   int mc_count = netdev_mc_count(bp->dev);
-   struct bnx2x_mcast_list_elem *mc_mac =
-   kcalloc(mc_count, sizeof(*mc_mac), GFP_ATOMIC);
-   struct netdev_hw_addr *ha;
+   struct list_head mcast_group_link;
+   struct bnx2x_mcast_list_elem mcast_elems[];
+};
 
-   if (!mc_mac) {
-   BNX2X_ERR("Failed to allocate mc MAC list\n");
-   return -ENOMEM;
+#define MCAST_ELEMS_PER_PG \
+   ((PAGE_SIZE - sizeof(struct bnx2x_mcast_list_elem_group)) / \
+   sizeof(struct bnx2x_mcast_list_elem))
+
+static void bnx2x_free_mcast_macs_list(struct list_head *mcast_group_list)
+{
+   struct bnx2x_mcast_list_elem_group *current_mcast_group;
+
+   while (!list_empty(mcast_group_list)) {
+   current_mcast_group = list_first_entry(mcast_group_list,
+ struct bnx2x_mcast_list_elem_group,
+ mcast_group_link);
+   list_del(&current_mcast_group->mcast_group_link);
+   kfree(current_mcast_group);
}
+}
 
-   INIT_LIST_HEAD(&p->mcast_list);
+static int bnx2x_init_mcast_macs_list(struct bnx2x *bp,
+ struct bnx2x_mcast_ramrod_params *p,
+ struct list_head *mcast_group_list)
+{
+   struct bnx2x_mcast_list_elem *mc_mac;
+   struct netdev_hw_addr *ha;
+   struct bnx2x_mcast_list_elem_group *current_mcast_group = NULL;
+   int mc_count = netdev_mc_count(bp->dev);
+   int i = 0, offset = 0, alloc_size;
 
+   INIT_LIST_HEAD(&p->mcast_list);
netdev_for_each_mc_addr(ha, bp->dev) {
+   if (!offset) {
+   if ((mc_count - i) < MCAST_ELEMS_PER_PG)
+   alloc_size = sizeof(struct list_head) +
+   (sizeof(struct bnx2x_mcast_list_elem) *
+   (mc_count - i));
+   else
+   alloc_size = PAGE_SIZE;
+   current_mcast_group = kmalloc(alloc_size, GFP_ATOMIC);
+   if (!current_mcast_group) {
+   bnx2x_free_mcast_macs_list(mcast_group_list);
+   BNX2X_ERR("Failed to allocate mc MAC list\n");
+   return -ENOMEM;
+   }
+   list_add(&current_mcast_group->mcast_group_link,
+mcast_group_list);
+   }
+   mc_mac = &current_mcast_group->mcast_elems[offset];
mc_mac->mac = bnx2x_mc_addr(ha);
list_add_tail(&mc_mac->link, &p->mcast_list);
-   mc_mac++;
+   offset++;
+   if (offset == MCAST_ELEMS_PER_PG) {
+   i += offset;
+   offset = 0;
+   }
}
-
p->mcast_list_len = mc_count;
-
return 0;
 }
 
-static void bnx2x_free_mcast_macs_list(
-   struct bnx2x_mcast_ramrod_params *p)
-{
-   struct bnx2x_mcast_list_elem *mc_mac =
-   list_first_entry(&p->mcast_list, struct bnx2x_mcast_list_elem,
-link);
-
-   WARN_ON(!mc_mac);
-   kfree(mc_mac);
-}
-
 /**
  * bnx2x_set_uc_list - configure a new unicast MACs list.
  *
@@ -12647,6 +12674,7 @@ static int bnx2x_set_uc_list(struct bnx2x *bp)
 
 static int bnx2x_set_mc_list_e1x(struct bnx2x *bp)
 {
+   LIST_HEAD(mcast_group_list);
struct net_device *dev = bp->dev;
struct bnx2x_mcast_ramrod_params rparam = {NULL};
int rc = 0;
@@ -12662,7 +12690,7 @@ static int bnx2x_set_mc_list_e1x(struct bnx2x *bp)
 
/* then, configure a new MACs list */
if (netdev_mc_count(dev)) {
-   rc = bnx2x_init_mcast_macs_list(bp, &rparam);
+   rc = bnx2x_init_mcast_macs_list(bp, &rparam, &mcast_group_list);
if (rc)
return rc;
 
@@ -12673,7 

[PATCH net-next 0/2] bnx2x: page allocation failure

2016-09-16 Thread Jason Baron
Hi,

While configuring ~500 multicast addrs, we ran into high order
page allocation failures. They don't need to be high order, and
thus I'm proposing to split them into at most PAGE_SIZE allocations.

Below is a sample failure.

Thanks,

-Jason

[1201902.617882] bnx2x: [bnx2x_set_mc_list:12374(eth0)]Failed to create 
multicast MACs list: -12
[1207325.695021] kworker/1:0: page allocation failure: order:2, mode:0xc020
[1207325.702059] CPU: 1 PID: 15805 Comm: kworker/1:0 Tainted: GW
[1207325.712940] Hardware name: SYNNEX CORPORATION 1x8-X4i SSD 10GE/S5512LE, 
BIOS V8.810 05/16/2013
[1207325.722284] Workqueue: events bnx2x_sp_rtnl_task [bnx2x]
[1207325.728206]   88012d873a78 8267f7c7 
c020
[1207325.736754]   88012d873b08 8212f8e0 
fffc0003
[1207325.745301]  88041ffecd80 88040030 0002 
c0206800da13
[1207325.753846] Call Trace:
[1207325.756789]  [] dump_stack+0x4d/0x63
[1207325.762426]  [] warn_alloc_failed+0xe0/0x130
[1207325.768756]  [] ? wakeup_kswapd+0x48/0x140
[1207325.774914]  [] __alloc_pages_nodemask+0x2bc/0x970
[1207325.781761]  [] alloc_pages_current+0x91/0x100
[1207325.788260]  [] alloc_kmem_pages+0xe/0x10
[1207325.794329]  [] kmalloc_order+0x18/0x50
[1207325.800227]  [] kmalloc_order_trace+0x26/0xb0
[1207325.806642]  [] ? _xfer_secondary_pool+0xa8/0x1a0
[1207325.813404]  [] __kmalloc+0x19a/0x1b0
[1207325.819142]  [] bnx2x_set_rx_mode_inner+0x3d5/0x590 
[bnx2x]
[1207325.827000]  [] bnx2x_sp_rtnl_task+0x28d/0x760 [bnx2x]
[1207325.834197]  [] process_one_work+0x134/0x3c0
[1207325.840522]  [] worker_thread+0x121/0x460
[1207325.846585]  [] ? process_one_work+0x3c0/0x3c0
[1207325.853089]  [] kthread+0xc9/0xe0
[1207325.858459]  [] ? notify_die+0x10/0x40
[1207325.864263]  [] ? kthread_create_on_node+0x180/0x180
[1207325.871288]  [] ret_from_fork+0x42/0x70
[1207325.877183]  [] ? kthread_create_on_node+0x180/0x180


Jason Baron (2):
  bnx2x: allocate mac filtering 'mcast_list' in PAGE_SIZE increments
  bnx2x: allocate mac filtering pending list in PAGE_SIZE increments

 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |  85 ++-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c   | 131 ---
 2 files changed, 151 insertions(+), 65 deletions(-)

-- 
2.6.1



Re: [PATCHv5 net-next 04/15] bpf: don't (ab)use instructions to store state

2016-09-16 Thread Daniel Borkmann

On 09/16/2016 11:36 AM, Jakub Kicinski wrote:

Storing state in reserved fields of instructions makes
it impossible to run verifier on programs already
marked as read-only. Allocate and use an array of
per-instruction state instead.

While touching the error path rename and move existing
jump target.

Suggested-by: Alexei Starovoitov 
Signed-off-by: Jakub Kicinski 
Acked-by: Alexei Starovoitov 


LGTM

Acked-by: Daniel Borkmann 


[PATCH resend] sctp: Remove some redundant code

2016-09-16 Thread Christophe JAILLET
In commit 311b21774f13 ("sctp: simplify sk_receive_queue locking"), a call
to 'skb_queue_splice_tail_init()' was made explicit. Previously it was
hidden in 'sctp_skb_list_tail()'.

Now the code around it is redundant: the '_init()' part of
'skb_queue_splice_tail_init()' already does the same thing.
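
For reference, the helper as defined in include/linux/skbuff.h:

static inline void skb_queue_splice_tail_init(struct sk_buff_head *list,
					      struct sk_buff_head *head)
{
	if (!skb_queue_empty(list)) {
		__skb_queue_splice(list, head->prev, (struct sk_buff *)head);
		head->qlen += list->qlen;
		__skb_queue_head_init(list);	/* the '_init' part */
	}
}

The final __skb_queue_head_init() already re-initializes the source
queue, which is exactly what the removed lines were doing again by hand.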

Signed-off-by: Christophe JAILLET 
Acked-by: Marcelo Ricardo Leitner 
Acked-by: Neil Horman 
---
Resent because netdev@ was not in copy
Acked-by tags added
---
 net/sctp/ulpqueue.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/sctp/ulpqueue.c b/net/sctp/ulpqueue.c
index 877e55066f89..84d0fdaf7de9 100644
--- a/net/sctp/ulpqueue.c
+++ b/net/sctp/ulpqueue.c
@@ -140,11 +140,8 @@ int sctp_clear_pd(struct sock *sk, struct sctp_association 
*asoc)
 * we can go ahead and clear out the lobby in one shot
 */
if (!skb_queue_empty(&sp->pd_lobby)) {
-   struct list_head *list;
skb_queue_splice_tail_init(&sp->pd_lobby,
   &sk->sk_receive_queue);
-   list = (struct list_head *)&sctp_sk(sk)->pd_lobby;
-   INIT_LIST_HEAD(list);
return 1;
}
} else {
-- 
2.7.4



Re: [net PATCH] mlx4: fix XDP_TX is acting like XDP_PASS on TX ring full

2016-09-16 Thread Brenden Blanco
On Fri, Sep 16, 2016 at 10:36:12PM +0200, Jesper Dangaard Brouer wrote:
> The XDP_TX action can fail transmitting the frame in case the TX ring
> is full or port is down.  In case of TX failure it should drop the
> frame, and not as now call 'break' which is the same as XDP_PASS.
> 
> Fixes: 9ecc2d86171a ("net/mlx4_en: add xdp forwarding and data write support")
> Signed-off-by: Jesper Dangaard Brouer 

You could in theory have also tried to recycle the page instead of
dropping it, but that's probably not worth optimizing when tx is backed
up, as you'll only save a handful of page_put's. The code to do so
wouldn't have been pretty.

Reviewed-by: Brenden Blanco 


Re: [net PATCH] mlx4: fix XDP_TX is acting like XDP_PASS on TX ring full

2016-09-16 Thread Jesper Dangaard Brouer
On Fri, 16 Sep 2016 22:36:12 +0200
Jesper Dangaard Brouer  wrote:

> The XDP_TX action can fail transmitting the frame in case the TX ring
> is full or port is down.  In case of TX failure it should drop the
> frame, and not as now call 'break' which is the same as XDP_PASS.

Oops, forgot to add the V2 subject tag... Dave, let me know if I should
resend with V2 in the subject.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


[net PATCH] mlx4: fix XDP_TX is acting like XDP_PASS on TX ring full

2016-09-16 Thread Jesper Dangaard Brouer
The XDP_TX action can fail transmitting the frame in case the TX ring
is full or port is down.  In case of TX failure it should drop the
frame, and not as now call 'break' which is the same as XDP_PASS.

Fixes: 9ecc2d86171a ("net/mlx4_en: add xdp forwarding and data write support")
Signed-off-by: Jesper Dangaard Brouer 

---
Note, this fix has nothing to do with the page-refcnt bug I just reported.

 drivers/net/ethernet/mellanox/mlx4/en_rx.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 2040dad8611d..adcd55c655db 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -906,7 +906,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct 
mlx4_en_cq *cq, int bud
length, tx_index,
&doorbell_pending))
goto consumed;
-   break;
+   goto next; /* Drop on xmit failure */
default:
bpf_warn_invalid_xdp_action(act);
case XDP_ABORTED:



[v2] net: ipv6: fallback to full lookup if table lookup is unsuitable

2016-09-16 Thread Vincent Bernat
Commit 8c14586fc320 ("net: ipv6: Use passed in table for nexthop
lookups") introduced a regression: insertion of an IPv6 route in a table
not containing the appropriate connected route for the gateway but which
contained a non-connected route (like a default gateway) fails while it
was previously working:

$ ip link add eth0 type dummy
$ ip link set up dev eth0
$ ip addr add 2001:db8::1/64 dev eth0
$ ip route add ::/0 via 2001:db8::5 dev eth0 table 20
$ ip route add 2001:db8:cafe::1/128 via 2001:db8::6 dev eth0 table 20
RTNETLINK answers: No route to host
$ ip -6 route show table 20
default via 2001:db8::5 dev eth0  metric 1024  pref medium

After this patch, we get:

$ ip route add 2001:db8:cafe::1/128 via 2001:db8::6 dev eth0 table 20
$ ip -6 route show table 20
2001:db8:cafe::1 via 2001:db8::6 dev eth0  metric 1024  pref medium
default via 2001:db8::5 dev eth0  metric 1024  pref medium

Signed-off-by: Vincent Bernat 
---
 net/ipv6/route.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index ad4a7ff301fc..2c6c7257ff75 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1994,6 +1994,14 @@ static struct rt6_info *ip6_route_info_create(struct 
fib6_config *cfg)
if (cfg->fc_table)
grt = ip6_nh_lookup_table(net, cfg, gw_addr);
 
+   if (grt) {
+   if (grt->rt6i_flags & RTF_GATEWAY ||
+   (dev && dev != grt->dst.dev)) {
+   ip6_rt_put(grt);
+   grt = NULL;
+   }
+   }
+
if (!grt)
grt = rt6_lookup(net, gw_addr, NULL,
 cfg->fc_ifindex, 1);
-- 
2.9.3



Re: [net PATCH] mlx4: fix XDP_TX is acting like XDP_PASS on TX ring full

2016-09-16 Thread Jesper Dangaard Brouer
On Fri, 16 Sep 2016 13:00:50 -0700
Eric Dumazet  wrote:

> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
> > b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > index 2040dad8611d..d414c67dfd12 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > @@ -906,6 +906,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, 
> > struct mlx4_en_cq *cq, int bud
> > length, tx_index,
> > &doorbell_pending))
> > goto consumed;
> > +   goto next;
> > break;  
> 
> Why keeping this break; then ? ;)

I'll send a V2


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


[PATCH v4 net 1/1] net sched actions: fix GETing actions

2016-09-16 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

With the batch changes that translated transient actions into
a temporary list, what was lost in the translation was the fact that
tcf_action_destroy() will eventually delete the action from
its permanent location if the refcount is zero.
Example of what broke:
...add a gact action to drop
sudo $TC actions add action drop index 10
...now retrieve it, looks good
sudo $TC actions get action gact index 10
...retrieve it again and find it is gone!
sudo $TC actions get action gact index 10

Fixes:
commit 22dc13c837c3 ("net_sched: convert tcf_exts from list to pointer array"),
commit 824a7e8863b3 ("net_sched: remove an unnecessary list_del()")
commit f07fed82ad79 ("net_sched: remove the leftover cleanup_a()")

Signed-off-by: Jamal Hadi Salim 
Acked-by: Cong Wang 
---
 net/sched/act_api.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index d09d068..e1c0ce1 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -592,6 +592,16 @@ err_out:
return ERR_PTR(err);
 }
 
+static void cleanup_a(struct list_head *actions, int ovr)
+{
+   struct tc_action *a;
+
+   list_for_each_entry(a, actions, list) {
+   if (ovr)
+   a->tcfa_refcnt--;
+   }
+}
+
 int tcf_action_init(struct net *net, struct nlattr *nla,
  struct nlattr *est, char *name, int ovr,
  int bind, struct list_head *actions)
@@ -612,8 +622,15 @@ int tcf_action_init(struct net *net, struct nlattr *nla,
goto err;
}
act->order = i;
+   if (ovr)
+   act->tcfa_refcnt++;
list_add_tail(&act->list, actions);
}
+
+   /* Remove the temp refcnt which was necessary to protect against
+* destroying an existing action which was being replaced
+*/
+   cleanup_a(actions, ovr);
return 0;
 
 err:
@@ -883,6 +900,8 @@ tca_action_gd(struct net *net, struct nlattr *nla, struct 
nlmsghdr *n,
goto err;
}
act->order = i;
+   if (event == RTM_GETACTION)
+   act->tcfa_refcnt++;
list_add_tail(&act->list, &actions);
}
 
-- 
1.9.1



Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev

2016-09-16 Thread Tejun Heo
Hello,

On Tue, Sep 13, 2016 at 08:14:40PM +0200, Jiri Slaby wrote:
> I assume Dmitry sees the same what I am still seeing, so I reported this
> some time ago:
> https://lkml.org/lkml/2016/3/21/492
> 
> This warning is trigerred there and still occurs with "HEAD":
>   (pwq != wq->dfl_pwq) && (pwq->refcnt > 1)
> and the state dump is in the log empty too:
> destroy_workqueue: name='hci0' pwq=88006b5c8f00
> wq->dfl_pwq=88006b5c9b00 pwq->refcnt=2 pwq->nr_active=0 delayed_works:
>   pwq 13:
>  cpus=2-3 node=1 flags=0x4 nice=-20 active=0/1
> in-flight: 2669:wq_barrier_func

Hmmm... I think it could be from rescuer holding reference on the pwq.
Both cases have WQ_MEM_RECLAIM and it could be that rescuer was still
in flight (even without work items pending) when the sanity checks
were done.  The following patch moves the sanity checks after rescuer
destruction.  Dmitry, Jiri, can you please see whether the warning
goes away with this patch?

Thanks.

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 984f6ff..e8046a1 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4042,8 +4042,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
}
}
 
-   if (WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1)) ||
-   WARN_ON(pwq->nr_active) ||
+   if (WARN_ON(pwq->nr_active) ||
WARN_ON(!list_empty(&pwq->delayed_works))) {
mutex_unlock(&wq->mutex);
show_workqueue_state();
@@ -4080,6 +4079,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
for_each_node(node) {
pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
RCU_INIT_POINTER(wq->numa_pwq_tbl[node], NULL);
+   WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt != 1));
put_pwq_unlocked(pwq);
}
 
@@ -4089,6 +4089,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
 */
pwq = wq->dfl_pwq;
wq->dfl_pwq = NULL;
+   WARN_ON(pwq->refcnt != 1);
put_pwq_unlocked(pwq);
}
 }


Re: [PATCH net-next 07/14] tcp: export data delivery rate

2016-09-16 Thread Eric Dumazet
On Fri, Sep 16, 2016 at 1:03 PM, Neal Cardwell  wrote:

>
> Looks like 'rate' should be 'rate64'. I will include this fix in the
> next version of the patch series.
>
> neal


Oh, right you are !


Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets

2016-09-16 Thread Cyrill Gorcunov
On Fri, Sep 16, 2016 at 01:55:42PM -0600, David Ahern wrote:
> >> Since the display is showing sockets in addition to IPPROTO_RAW:
> >>
> >> $ ss -A raw
> >> State  Recv-Q Send-QLocal Address:Port 
> >> Peer Address:Port
> >> UNCONN 0  0*%eth0:icmp 
> >>*:*
> >>
> >> It is going to be confusing if only ipproto-255 sockets can be killed.
> > 
> > OK, gimme some time to implement it. Hopefully on the weekend or monday.
> > Thanks a huge for feedback!
> > 
> 
> It may well be a ss bug / problem. As I mentioned I am always seeing 255 for 
> the protocol which

It is rather not addressed in ss. I mean, look, when we send out a diag
packet the kernel looks up a handler, which for the raw protocol we
register as
static const struct inet_diag_handler raw_diag_handler = {
.dump= raw_diag_dump,
.dump_one= raw_diag_dump_one,
.idiag_get_info= raw_diag_get_info,
.idiag_type= IPPROTO_RAW,
.idiag_info_size= 0,
#ifdef CONFIG_INET_DIAG_DESTROY
.destroy= raw_diag_destroy,
#endif
};

so if we patch ss and ask for IPPROTO_ICMP in the netlink packet the
kernel simply won't find anything. Thus I think we need (well, I need)
to extend the patch and register an IPPROTO_ICMP diag type, then
extend ss as well. (If only I didn't miss something obvious.)
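
Roughly like this (a sketch only, reusing the raw handlers quoted above;
whether a second handler is the right interface is exactly the open
question):

static const struct inet_diag_handler raw_icmp_diag_handler = {
	.dump		= raw_diag_dump,
	.dump_one	= raw_diag_dump_one,
	.idiag_get_info	= raw_diag_get_info,
	.idiag_type	= IPPROTO_ICMP,
	.idiag_info_size = 0,
};

	err = inet_diag_register(&raw_icmp_diag_handler);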

> is odd since ss does a dump and takes the matches and invokes the kill. 
> Thanks for taking
> the time to do the kill piece.

Sure!


Re: [PATCH net-next 07/14] tcp: export data delivery rate

2016-09-16 Thread Neal Cardwell
On Fri, Sep 16, 2016 at 11:56 PM, kbuild test robot  wrote:
>
>>> net/ipv4/tcp.c:2794:3: note: in expansion of macro 'do_div'
>   do_div(rate, intv);
>   ^~
>In file included from arch/arm/include/asm/div64.h:126:0,
> from include/linux/kernel.h:142,
> from include/linux/crypto.h:21,
> from include/crypto/hash.h:16,
> from net/ipv4/tcp.c:250:
>>> include/asm-generic/div64.h:224:22: error: passing argument 1 of '__div64_32' from incompatible pointer type [-Werror=incompatible-pointer-types]
>   __rem = __div64_32(&(n), __base); \
...
> > 2794  do_div(rate, intv);

Looks like 'rate' should be 'rate64'. I will include this fix in the
next version of the patch series.

neal
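
The likely shape of the fix (sketch only; do_div() modifies its first
argument in place and requires it to be u64):

	u64 rate64 = rate;	/* 'rate' was u32, hence the build error */

	do_div(rate64, intv);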


Re: [PATCH net-next 07/14] tcp: export data delivery rate

2016-09-16 Thread Eric Dumazet
On Sat, 2016-09-17 at 11:56 +0800, kbuild test robot wrote:
> Hi Yuchung,
> 
> [auto build test ERROR on net-next/master]
> 
> url:
> https://github.com/0day-ci/linux/commits/Neal-Cardwell/tcp-BBR-congestion-control-algorithm/20160917-025323
> config: arm-nhk8815_defconfig (attached as .config)
> compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
> reproduce:
> wget 
> https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
>  -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=arm 

Right, we need to include   for some arches.




Re: [net PATCH] mlx4: fix XDP_TX is acting like XDP_PASS on TX ring full

2016-09-16 Thread Eric Dumazet
On Fri, 2016-09-16 at 21:47 +0200, Jesper Dangaard Brouer wrote:
> The XDP_TX action can fail transmitting the frame in case the TX ring
> is full or port is down.  In case of TX failure it should drop the
> frame, and not as now call 'break' which is the same as XDP_PASS.
> 
> Fixes: 9ecc2d86171a ("net/mlx4_en: add xdp forwarding and data write support")
> Signed-off-by: Jesper Dangaard Brouer 
> 
> ---
> Note, this fix has nothing to do with the page-refcnt bug I just reported.

Yeah, the e1000 driver proposal patch had the same issue.

> 
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c |1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
> b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 2040dad8611d..d414c67dfd12 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -906,6 +906,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct 
> mlx4_en_cq *cq, int bud
>   length, tx_index,
>   _pending))
>   goto consumed;
> + goto next;
>   break;

Why keeping this break; then ? ;)

>   default:
>   bpf_warn_invalid_xdp_action(act);
> 




[PATCHv4 next 2/3] net: Add _nf_(un)register_hooks symbols

2016-09-16 Thread Mahesh Bandewar
From: Mahesh Bandewar 

Add _nf_register_hooks() and _nf_unregister_hooks() calls which allow
the caller to hold the RTNL mutex.
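
The intended calling convention (sketch; my_ops[] is a hypothetical
array of nf_hook_ops):

	rtnl_lock();
	err = _nf_register_hooks(my_ops, ARRAY_SIZE(my_ops));
	if (!err) {
		/* ... setup that must stay inside the same RTNL section ... */
	}
	rtnl_unlock();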

Signed-off-by: Mahesh Bandewar 
CC: Pablo Neira Ayuso 
---
 include/linux/netfilter.h |  2 ++
 net/netfilter/core.c  | 51 ++-
 2 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 9230f9aee896..e82b76781bf6 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -133,6 +133,8 @@ int nf_register_hook(struct nf_hook_ops *reg);
 void nf_unregister_hook(struct nf_hook_ops *reg);
 int nf_register_hooks(struct nf_hook_ops *reg, unsigned int n);
 void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n);
+int _nf_register_hooks(struct nf_hook_ops *reg, unsigned int n);
+void _nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n);
 
 /* Functions to register get/setsockopt ranges (non-inclusive).  You
need to check permissions yourself! */
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index f39276d1c2d7..2c5327e43a88 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -188,19 +188,17 @@ EXPORT_SYMBOL(nf_unregister_net_hooks);
 
 static LIST_HEAD(nf_hook_list);
 
-int nf_register_hook(struct nf_hook_ops *reg)
+static int _nf_register_hook(struct nf_hook_ops *reg)
 {
struct net *net, *last;
int ret;
 
-   rtnl_lock();
for_each_net(net) {
ret = nf_register_net_hook(net, reg);
if (ret && ret != -ENOENT)
goto rollback;
}
list_add_tail(&reg->list, &nf_hook_list);
-   rtnl_unlock();
 
return 0;
 rollback:
@@ -210,19 +208,34 @@ rollback:
break;
nf_unregister_net_hook(net, reg);
}
+   return ret;
+}
+
+int nf_register_hook(struct nf_hook_ops *reg)
+{
+   int ret;
+
+   rtnl_lock();
+   ret = _nf_register_hook(reg);
rtnl_unlock();
+
return ret;
 }
 EXPORT_SYMBOL(nf_register_hook);
 
-void nf_unregister_hook(struct nf_hook_ops *reg)
+static void _nf_unregister_hook(struct nf_hook_ops *reg)
 {
struct net *net;
 
-   rtnl_lock();
list_del(&reg->list);
for_each_net(net)
nf_unregister_net_hook(net, reg);
+}
+
+void nf_unregister_hook(struct nf_hook_ops *reg)
+{
+   rtnl_lock();
+   _nf_unregister_hook(reg);
rtnl_unlock();
 }
 EXPORT_SYMBOL(nf_unregister_hook);
@@ -246,6 +259,26 @@ err:
 }
 EXPORT_SYMBOL(nf_register_hooks);
 
+/* Caller MUST take rtnl_lock() */
+int _nf_register_hooks(struct nf_hook_ops *reg, unsigned int n)
+{
+   unsigned int i;
+   int err = 0;
+
+   for (i = 0; i < n; i++) {
+   err = _nf_register_hook(&reg[i]);
+   if (err)
+   goto err;
+   }
+   return err;
+
+err:
+   if (i > 0)
+   _nf_unregister_hooks(reg, i);
+   return err;
+}
+EXPORT_SYMBOL(_nf_register_hooks);
+
 void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n)
 {
while (n-- > 0)
@@ -253,6 +286,14 @@ void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned 
int n)
 }
 EXPORT_SYMBOL(nf_unregister_hooks);
 
+/* Caller MUST take rtnl_lock */
+void _nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n)
+{
+   while (n-- > 0)
+   _nf_unregister_hook(&reg[n]);
+}
+EXPORT_SYMBOL(_nf_unregister_hooks);
+
 unsigned int nf_iterate(struct list_head *head,
struct sk_buff *skb,
struct nf_hook_state *state,
-- 
2.8.0.rc3.226.g39d4020



[PATCHv4 next 3/3] ipvlan: Introduce l3s mode

2016-09-16 Thread Mahesh Bandewar
From: Mahesh Bandewar 

In a typical IPvlan L3 setup the master is in the default-ns and
each slave is in a different (slave) ns. In this setup, egress
packet processing for traffic originating from a slave-ns hits all
NF_HOOKs in the slave-ns as well as the default-ns. However, the same
is not true for ingress processing: there the NF_HOOKs are hit only in
the slave-ns, skipping the default-ns entirely. IPvlan in L3 mode is
restrictive, and if admins want to deploy iptables rules in the
default-ns, this asymmetric data path makes it impossible to do so.

This patch makes use of l3_rcv() (added as part of the l3mdev
enhancements) to perform an input route lookup on RX packets without
changing the skb->dev, and then uses an nf_hook at NF_INET_LOCAL_IN
to change the skb->dev just before handing the skb over to L4.
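
In outline, the LOCAL_IN attachment point looks like this (a sketch
based on the declarations in the diff below; the ops table is abridged
and the IPv6 entry is omitted):

static const struct nf_hook_ops ipvl_nfops[] = {
	{
		.hook     = ipvlan_nf_input,	/* rewrites skb->dev to the slave */
		.pf       = NFPROTO_IPV4,
		.hooknum  = NF_INET_LOCAL_IN,
		.priority = INT_MAX,
	},
	/* ... plus an equivalent NFPROTO_IPV6 entry ... */
};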

Signed-off-by: Mahesh Bandewar 
CC: David Ahern 
---
 Documentation/networking/ipvlan.txt |  7 ++-
 drivers/net/Kconfig |  1 +
 drivers/net/ipvlan/ipvlan.h |  6 +++
 drivers/net/ipvlan/ipvlan_core.c| 94 +
 drivers/net/ipvlan/ipvlan_main.c| 87 +++---
 include/uapi/linux/if_link.h|  1 +
 6 files changed, 188 insertions(+), 8 deletions(-)

diff --git a/Documentation/networking/ipvlan.txt 
b/Documentation/networking/ipvlan.txt
index 14422f8fcdc4..24196cef7c91 100644
--- a/Documentation/networking/ipvlan.txt
+++ b/Documentation/networking/ipvlan.txt
@@ -22,7 +22,7 @@ The driver can be built into the kernel (CONFIG_IPVLAN=y) or 
as a module
There are no module parameters for this driver and it can be configured
 using IProute2/ip utility.
 
-   ip link add link <master> <slave> type ipvlan mode { l2 | L3 }
+   ip link add link <master> <slave> type ipvlan mode { l2 | l3 | l3s }
 
e.g. ip link add link ipvl0 eth0 type ipvlan mode l2
 
@@ -48,6 +48,11 @@ master device for the L2 processing and routing from that 
instance will be
 used before packets are queued on the outbound device. In this mode the slaves
 will not receive nor can send multicast / broadcast traffic.
 
+4.3 L3S mode:
+   This is very similar to the L3 mode except that iptables (conn-tracking)
+works in this mode and hence it is L3-symmetric (L3s). This will have slightly
+less performance but that shouldn't matter since you are choosing this mode
+over plain-L3 mode to make conn-tracking work.
 
 5. What to choose (macvlan vs. ipvlan)?
These two devices are very similar in many regards and the specific use
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 0c5415b05ea9..8768a625350d 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -149,6 +149,7 @@ config IPVLAN
 tristate "IP-VLAN support"
 depends on INET
 depends on IPV6
+depends on NET_L3_MASTER_DEV
 ---help---
   This allows one to create virtual devices off of a main interface
   and packets will be delivered based on the dest L3 (IPv6/IPv4 addr)
diff --git a/drivers/net/ipvlan/ipvlan.h b/drivers/net/ipvlan/ipvlan.h
index 695a5dc9ace3..7e0732f5ea07 100644
--- a/drivers/net/ipvlan/ipvlan.h
+++ b/drivers/net/ipvlan/ipvlan.h
@@ -23,11 +23,13 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 
 #define IPVLAN_DRV "ipvlan"
 #define IPV_DRV_VER"0.1"
@@ -124,4 +126,8 @@ struct ipvl_addr *ipvlan_find_addr(const struct ipvl_dev 
*ipvlan,
   const void *iaddr, bool is_v6);
 bool ipvlan_addr_busy(struct ipvl_port *port, void *iaddr, bool is_v6);
 void ipvlan_ht_addr_del(struct ipvl_addr *addr);
+struct sk_buff *ipvlan_l3_rcv(struct net_device *dev, struct sk_buff *skb,
+ u16 proto);
+unsigned int ipvlan_nf_input(void *priv, struct sk_buff *skb,
+const struct nf_hook_state *state);
 #endif /* __IPVLAN_H */
diff --git a/drivers/net/ipvlan/ipvlan_core.c b/drivers/net/ipvlan/ipvlan_core.c
index b5f9511d819e..b4e990743e1d 100644
--- a/drivers/net/ipvlan/ipvlan_core.c
+++ b/drivers/net/ipvlan/ipvlan_core.c
@@ -560,6 +560,7 @@ int ipvlan_queue_xmit(struct sk_buff *skb, struct 
net_device *dev)
case IPVLAN_MODE_L2:
return ipvlan_xmit_mode_l2(skb, dev);
case IPVLAN_MODE_L3:
+   case IPVLAN_MODE_L3S:
return ipvlan_xmit_mode_l3(skb, dev);
}
 
@@ -664,6 +665,8 @@ rx_handler_result_t ipvlan_handle_frame(struct sk_buff 
**pskb)
return ipvlan_handle_mode_l2(pskb, port);
case IPVLAN_MODE_L3:
return ipvlan_handle_mode_l3(pskb, port);
+   case IPVLAN_MODE_L3S:
+   return RX_HANDLER_PASS;
}
 
/* Should not reach here */
@@ -672,3 +675,94 @@ rx_handler_result_t ipvlan_handle_frame(struct sk_buff 
**pskb)
kfree_skb(skb);
return RX_HANDLER_CONSUMED;
 }
+
+static struct ipvl_addr *ipvlan_skb_to_addr(struct sk_buff *skb,

[PATCHv4 next 0/3] IPvlan introduce l3s mode

2016-09-16 Thread Mahesh Bandewar
From: Mahesh Bandewar 

Same old problem, new approach, based especially on suggestions from
the earlier patch series.

The first thing is that this is introduced as a new mode rather than
by modifying the old (L3) mode. So the behavior of the existing modes
is preserved as it is, and the new L3s mode obeys iptables so that the
intended conn-tracking can work.

To do this, the code uses the newly added l3mdev_rcv() handler and an
iptables hook: l3mdev_rcv() performs an inbound route lookup with the
correct (IPvlan slave) interface, and then the iptables hook at
LOCAL_INPUT changes the input device from the master to the slave to
complete the formality.

The supporting stack changes are trivial: one exports a symbol so that
IPv6 gains the equivalent of what IPv4 already exposes, and another
allows the netfilter hook registration code to be called with RTNL
held. Please look into the individual patches for details.

Mahesh Bandewar (3):
  ipv6: Export ip6_route_input_lookup symbol
  net: Add _nf_(un)register_hooks symbols
  ipvlan: Introduce l3s mode

 Documentation/networking/ipvlan.txt |  7 ++-
 drivers/net/Kconfig |  1 +
 drivers/net/ipvlan/ipvlan.h |  6 +++
 drivers/net/ipvlan/ipvlan_core.c| 94 +
 drivers/net/ipvlan/ipvlan_main.c| 87 +++---
 include/linux/netfilter.h   |  2 +
 include/net/ip6_route.h |  3 ++
 include/uapi/linux/if_link.h|  1 +
 net/ipv6/route.c|  7 +--
 net/netfilter/core.c| 51 ++--
 10 files changed, 243 insertions(+), 16 deletions(-)

v1: Initial post
v2: Text correction and config changed from "select" to "depends on"
v3: separated nf_hook registration logic and made it independent of port
as nf_hook registration is independant of how many IPvlan ports are
present in the system.
v4: Eliminated need to have "hooks_attached" per port and rely just on
the mode. Also change BUG_ON to WARN_ON

-- 
2.8.0.rc3.226.g39d4020



[PATCHv4 next 1/3] ipv6: Export ip6_route_input_lookup symbol

2016-09-16 Thread Mahesh Bandewar
From: Mahesh Bandewar 

Make ip6_route_input_lookup() available outside of the ipv6 module,
similar to ip_route_input_noref() in the IPv4 world.

Signed-off-by: Mahesh Bandewar 
---
 include/net/ip6_route.h | 3 +++
 net/ipv6/route.c| 7 ---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index d97305d0e71f..e0cd318d5103 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -64,6 +64,9 @@ static inline bool rt6_need_strict(const struct in6_addr 
*daddr)
 }
 
 void ip6_route_input(struct sk_buff *skb);
+struct dst_entry *ip6_route_input_lookup(struct net *net,
+struct net_device *dev,
+struct flowi6 *fl6, int flags);
 
 struct dst_entry *ip6_route_output_flags(struct net *net, const struct sock 
*sk,
 struct flowi6 *fl6, int flags);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index ad4a7ff301fc..4dab585f7642 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1147,15 +1147,16 @@ static struct rt6_info *ip6_pol_route_input(struct net 
*net, struct fib6_table *
return ip6_pol_route(net, table, fl6->flowi6_iif, fl6, flags);
 }
 
-static struct dst_entry *ip6_route_input_lookup(struct net *net,
-   struct net_device *dev,
-   struct flowi6 *fl6, int flags)
+struct dst_entry *ip6_route_input_lookup(struct net *net,
+struct net_device *dev,
+struct flowi6 *fl6, int flags)
 {
	if (rt6_need_strict(&fl6->daddr) && dev->type != ARPHRD_PIMREG)
flags |= RT6_LOOKUP_F_IFACE;
 
return fib6_rule_lookup(net, fl6, flags, ip6_pol_route_input);
 }
+EXPORT_SYMBOL_GPL(ip6_route_input_lookup);
 
 void ip6_route_input(struct sk_buff *skb)
 {
-- 
2.8.0.rc3.226.g39d4020



Re: [PATCH v5 0/6] Add eBPF hooks for cgroups

2016-09-16 Thread Sargun Dhillon
On Wed, Sep 14, 2016 at 01:13:16PM +0200, Daniel Mack wrote:
> Hi Pablo,
> 
> On 09/13/2016 07:24 PM, Pablo Neira Ayuso wrote:
> > On Tue, Sep 13, 2016 at 03:31:20PM +0200, Daniel Mack wrote:
> >> On 09/13/2016 01:56 PM, Pablo Neira Ayuso wrote:
> >>> On Mon, Sep 12, 2016 at 06:12:09PM +0200, Daniel Mack wrote:
>  This is v5 of the patch set to allow eBPF programs for network
>  filtering and accounting to be attached to cgroups, so that they apply
>  to all sockets of all tasks placed in that cgroup. The logic also
>  allows it to be extended for other cgroup-based eBPF logic.
> >>>
> >>> 1) This infrastructure can only be useful to systemd, or any similar
> >>>orchestration daemon. Look, you can only apply filtering policies
> >>>to processes that are launched by systemd, so this only works
> >>>for server processes.
> >>
> >> Sorry, but both statements aren't true. The eBPF policies apply to every
> >> process that is placed in a cgroup, and my example program in 6/6 shows
> >> how that can be done from the command line.
> > 
> > Then you have to explain me how can anyone else than systemd use this
> > infrastructure?
> 
> I have no idea what makes you think this is limited to systemd. As I
> said, I provided an example for userspace that works from the command
> line. The same limitation apply as for all other users of cgroups.
> 
So, at least in my work, we have Mesos, but on nearly every machine that Mesos 
runs, people also have systemd. There has recently been a bit of a battle over 
ownership of things like cgroups on these machines. We can usually solve it by 
nesting under systemd cgroups, and thus so far we've avoided making too many 
systemd-specific concessions.

The reason this works (mostly), is because everything we touch has a sense of 
nesting, where we can apply policy at a place lower in the hierarchy, and yet 
systemd's monitoring and policy still stays in place. 

Now, with this patch, we don't have that, but I think we can reasonably add 
some flag like "no override" when applying policies, or alternatively something 
like "no new privileges", to prevent children from applying policies that 
override top-level policy. I realize there is a speed concern as well, but I 
think for people who want nested policy, we're willing to make the tradeoff. 
The cost of traversing a few extra pointers is still well below the overhead 
of network namespaces, iptables, etc., for many of us. 

What do you think Daniel?

> > My main point is that those processes *need* to be launched by the
> > orchestrator, which is what I was referring to as 'server processes'.
> 
> Yes, that's right. But as I said, this rule applies to many other kernel
> concepts, so I don't see any real issue.
>
Also, cgroups have become such a big part of how applications are managed
that many of us have solved this problem.

> >> That's a limitation that applies to many more control mechanisms in the
> >> kernel, and it's something that can easily be solved with fork+exec.
> > 
> > As long as you have control to launch the processes yes, but this
> > will not work in other scenarios. Just like cgroup net_cls and friends
> > are broken for filtering for things that you have no control to
> > fork+exec.
> 
> Probably, but that's only solvable with rules that store the full cgroup
> path then, and do a string comparison (!) for each packet flying by.
>
> >> That's just as transparent as SO_ATTACH_FILTER. What kind of
> >> introspection mechanism do you have in mind?
> > 
> > SO_ATTACH_FILTER is called from the process itself, so this is a local
> > filtering policy that you apply to your own process.
> 
> Not necessarily. You can as well do it the inetd way, and pass the
> socket to a process that is launched on demand, but do SO_ATTACH_FILTER
> + SO_LOCK_FILTER  in the middle. What happens with payload on the socket
> is not transparent to the launched binary at all. The proposed cgroup
> eBPF solution implements a very similar behavior in that regard.
> 
It would be nice to be able to see whether or not a filter is attached to a 
cgroup, but given this is going through syscalls, at least introspection
is possible as opposed to something like netlink.

> >> It's about filtering outgoing network packets of applications, and
> >> providing them with L2 information for filtering purposes. I don't think
> >> that's a very specific use-case.
> >>
> >> When the feature is not used at all, the added costs on the output path
> >> are close to zero, due to the use of static branches.
> > 
> > *You're proposing a socket filtering facility that hooks layer 2
> > output path*!
> 
> As I said, I'm open to discussing that. In order to make it work for L3,
> the LL_OFF issues need to be solved, as Daniel explained. Daniel,
> Alexei, any idea how much work that would be?
> 
> > That is only a rough ~30 lines kernel patchset to support this in
> > netfilter and only one extra input hook, with potential access to
> > conntrack and 

Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets

2016-09-16 Thread David Ahern
On 9/16/16 1:52 PM, Cyrill Gorcunov wrote:
> On Fri, Sep 16, 2016 at 01:47:57PM -0600, David Ahern wrote:

 I'm guessing you passed IPPROTO_RAW (255) as the protocol to socket(). If 
 you pass something
 else (IPPROTO_ICMP for example) it won't work.
>>>
>>> True. To support IPPROTO_ICMP it needs an enhancement. I thought I'd start
>>> with plain _RAW first and then extend it to support _ICMP.
>>
>> I thought raw in this case was SOCK_RAW as in the socket type.
>>
>> Since the display is showing sockets in addition to IPPROTO_RAW:
>>
>> $ ss -A raw
>> State  Recv-Q Send-QLocal Address:Port 
>> Peer Address:Port
>> UNCONN 0  0*%eth0:icmp   
>>  *:*
>>
>> It is going to be confusing if only ipproto-255 sockets can be killed.
> 
> OK, gimme some time to implement it. Hopefully on the weekend or Monday.
> Thanks a lot for the feedback!
> 

It may well be an ss bug / problem. As I mentioned, I am always seeing 255 for 
the protocol, which is odd since ss does a dump, takes the matches and 
invokes the kill. Thanks for taking the time to do the kill piece.


Re: Modification to skb->queue_mapping affecting performance

2016-09-16 Thread Eric Dumazet
On Fri, 2016-09-16 at 10:57 -0700, Michael Ma wrote:

> This is actually the problem - if flows from different RX queues are
> switched to the same RX queue in IFB, they'll use different processor
> context with the same tasklet, and the processor context of different
> tasklets might be the same. So multiple tasklets in IFB compete for
> the same core when the queue is switched.
> 
> The following simple fix proved this - with this change even switching
> the queue won't affect small packet bandwidth/latency anymore:
> 
> in ifb.c:
> 
> -   struct ifb_q_private *txp = dp->tx_private + 
> skb_get_queue_mapping(skb);
> +   struct ifb_q_private *txp = dp->tx_private +
> (smp_processor_id() % dev->num_tx_queues);
> 
> This should be more efficient since we're not sending the task to a
> different processor, instead we try to queue the packet to an
> appropriate tasklet based on the processor ID. Will this cause any
> packet out-of-order problem? If packets from the same flow are queued
> to the same RX queue due to RSS, and processor affinity is set for RX
> queues, I assume packets from the same flow will end up in the same
> core when tasklet is scheduled. But I might have missed some uncommon
> cases here... Would appreciate if anyone can provide more insights.

Wait, don't you have proper smp affinity for the RX queues on your NIC ?

( Documentation/networking/scaling.txt RSS IRQ Configuration )

A driver ndo_start_xmit() MUST use skb_get_queue_mapping(skb), because
the driver queue is locked before ndo_start_xmit() is called (for non
NETIF_F_LLTX drivers at least).

In case of ifb, __skb_queue_tail(&txp->rq, skb); could corrupt the skb
list.
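
For reference, this is (roughly) why: the core picks and locks the TX
queue from skb->queue_mapping *before* the driver sees the skb, so any
other ring index the driver computes is not covered by that lock. A
condensed view of the transmit path in net/core/dev.c:

    txq = netdev_pick_tx(dev, skb, accel_priv);    /* sets queue_mapping */
    HARD_TX_LOCK(dev, txq, cpu);                   /* locks *that* queue */
    skb = dev_hard_start_xmit(skb, dev, txq, &rc); /* -> ndo_start_xmit  */
    HARD_TX_UNLOCK(dev, txq);

If ifb's ndo_start_xmit instead indexes dp->tx_private by
smp_processor_id(), two CPUs holding different queue locks can end up
appending to the same ring's skb list.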

In any case, you could have an action to do this before reaching IFB.





Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets

2016-09-16 Thread Cyrill Gorcunov
On Fri, Sep 16, 2016 at 01:47:57PM -0600, David Ahern wrote:
> >>
> >> I'm guessing you passed IPPROTO_RAW (255) as the protocol to socket(). If 
> >> you pass something
> >> else (IPPROTO_ICMP for example) it won't work.
> > 
> > True. To support IPPROTO_ICMP it needs an enhancement. I thought I'd start
> > with plain _RAW first and then extend it to support _ICMP.
> 
> I thought raw in this case was SOCK_RAW as in the socket type.
> 
> Since the display is showing sockets in addition to IPPROTO_RAW:
> 
> $ ss -A raw
> State  Recv-Q Send-QLocal Address:Port 
> Peer Address:Port
> UNCONN 0  0*%eth0:icmp
> *:*
> 
> It is going to be confusing if only ipproto-255 sockets can be killed.

OK, gimme some time to implement it. Hopefully on the weekend or Monday.
Thanks a lot for the feedback!


Re: [PATCH net-next 07/14] tcp: export data delivery rate

2016-09-16 Thread kbuild test robot
Hi Yuchung,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Neal-Cardwell/tcp-BBR-congestion-control-algorithm/20160917-025323
config: arm-nhk8815_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=arm 

All error/warnings (new ones prefixed by >>):

   In file included from arch/arm/include/asm/div64.h:126:0,
from include/linux/kernel.h:142,
from include/linux/crypto.h:21,
from include/crypto/hash.h:16,
from net/ipv4/tcp.c:250:
   net/ipv4/tcp.c: In function 'tcp_get_info':
   include/asm-generic/div64.h:207:28: warning: comparison of distinct pointer 
types lacks a cast
 (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
   ^
>> net/ipv4/tcp.c:2794:3: note: in expansion of macro 'do_div'
  do_div(rate, intv);
  ^~
   In file included from arch/arm/include/asm/atomic.h:14:0,
from include/linux/atomic.h:4,
from include/linux/crypto.h:20,
from include/crypto/hash.h:16,
from net/ipv4/tcp.c:250:
   include/asm-generic/div64.h:220:25: warning: right shift count >= width of 
type [-Wshift-count-overflow]
 } else if (likely(((n) >> 32) == 0)) {  \
^
   include/linux/compiler.h:167:40: note: in definition of macro 'likely'
# define likely(x) __builtin_expect(!!(x), 1)
   ^
>> net/ipv4/tcp.c:2794:3: note: in expansion of macro 'do_div'
  do_div(rate, intv);
  ^~
   In file included from arch/arm/include/asm/div64.h:126:0,
from include/linux/kernel.h:142,
from include/linux/crypto.h:21,
from include/crypto/hash.h:16,
from net/ipv4/tcp.c:250:
>> include/asm-generic/div64.h:224:22: error: passing argument 1 of 
>> '__div64_32' from incompatible pointer type 
>> [-Werror=incompatible-pointer-types]
  __rem = __div64_32(&(n), __base); \
 ^
>> net/ipv4/tcp.c:2794:3: note: in expansion of macro 'do_div'
  do_div(rate, intv);
  ^~
   In file included from include/linux/kernel.h:142:0,
from include/linux/crypto.h:21,
from include/crypto/hash.h:16,
from net/ipv4/tcp.c:250:
   arch/arm/include/asm/div64.h:32:24: note: expected 'uint64_t * {aka long 
long unsigned int *}' but argument is of type 'u32 * {aka unsigned int *}'
static inline uint32_t __div64_32(uint64_t *n, uint32_t base)
   ^~
   cc1: some warnings being treated as errors

vim +/do_div +2794 net/ipv4/tcp.c

  2778  } while (u64_stats_fetch_retry_irq(&tp->syncp, start));
  2779  info->tcpi_segs_out = tp->segs_out;
  2780  info->tcpi_segs_in = tp->segs_in;
  2781  
  2782  notsent_bytes = READ_ONCE(tp->write_seq) - 
READ_ONCE(tp->snd_nxt);
  2783  info->tcpi_notsent_bytes = max(0, notsent_bytes);
  2784  
  2785  info->tcpi_min_rtt = tcp_min_rtt(tp);
  2786  info->tcpi_data_segs_in = tp->data_segs_in;
  2787  info->tcpi_data_segs_out = tp->data_segs_out;
  2788  
  2789  info->tcpi_delivery_rate_app_limited = tp->rate_app_limited ? 1 
: 0;
  2790  rate = READ_ONCE(tp->rate_delivered);
  2791  intv = READ_ONCE(tp->rate_interval_us);
  2792  if (rate && intv) {
  2793  rate = rate * tp->mss_cache * USEC_PER_SEC;
> 2794  do_div(rate, intv);
  2795  put_unaligned(rate, &info->tcpi_delivery_rate);
  2796  }
  2797  }
  2798  EXPORT_SYMBOL_GPL(tcp_get_info);
  2799  
  2800  static int do_tcp_getsockopt(struct sock *sk, int level,
  2801  int optname, char __user *optval, int __user *optlen)
  2802  {

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets

2016-09-16 Thread David Ahern
On 9/16/16 1:39 PM, Cyrill Gorcunov wrote:
> On Fri, Sep 16, 2016 at 01:30:28PM -0600, David Ahern wrote:
>>> [root@pcs7 iproute2]# misc/ss -A raw
>>> State  Recv-Q Send-QLocal Address:Port  
>>>Peer Address:Port
>>> 
>>> ESTAB  0  0 
>>> 127.0.0.1:ipproto-255
>>> 127.0.0.10:ipproto-9090 
>>> UNCONN 0  0
>>> 127.0.0.10:ipproto-255 
>>> *:*
>>> UNCONN 0  0
>>> :::ipv6-icmp  :::*  
>>>   
>>> UNCONN 0  0
>>> :::ipv6-icmp  :::*  
>>>   
>>> ESTAB  0  0   
>>> ::1:ipproto-255   
>>> ::1:ipproto-9091 
>>>
>>> so it gets zapped out. Is there some other way to test it?
>>>
>>
>> I'm guessing you passed IPPROTO_RAW (255) as the protocol to socket(). If 
>> you pass something
>> else (IPPROTO_ICMP for example) it won't work.
> 
> True. To support IPPROTO_ICMP it need enhancement. I thought start with
> plain _RAW first and then extend to support _ICMP.

I thought raw in this case was SOCK_RAW as in the socket type.

Since the display is showing sockets in addition to IPPROTO_RAW:

$ ss -A raw
State  Recv-Q Send-QLocal Address:Port Peer 
Address:Port
UNCONN 0  0*%eth0:icmp  
  *:*

It is going to be confusing if only ipproto-255 sockets can be killed.


[net PATCH] mlx4: fix XDP_TX is acting like XDP_PASS on TX ring full

2016-09-16 Thread Jesper Dangaard Brouer
The XDP_TX action can fail to transmit the frame in case the TX ring
is full or the port is down.  In case of TX failure it should drop the
frame, and not, as now, hit the 'break', which behaves the same as XDP_PASS.

Fixes: 9ecc2d86171a ("net/mlx4_en: add xdp forwarding and data write support")
Signed-off-by: Jesper Dangaard Brouer 

---
Note, this fix has nothing to do with the page-refcnt bug I just reported.

 drivers/net/ethernet/mellanox/mlx4/en_rx.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 2040dad8611d..d414c67dfd12 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -906,6 +906,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct 
mlx4_en_cq *cq, int bud
length, tx_index,
				&doorbell_pending))
goto consumed;
+   goto next;
break;
default:
bpf_warn_invalid_xdp_action(act);



Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets

2016-09-16 Thread Cyrill Gorcunov
On Fri, Sep 16, 2016 at 01:30:28PM -0600, David Ahern wrote:
> > [root@pcs7 iproute2]# misc/ss -A raw
> > State  Recv-Q Send-QLocal Address:Port  
> >Peer Address:Port
> > 
> > ESTAB  0  0 
> > 127.0.0.1:ipproto-255
> > 127.0.0.10:ipproto-9090 
> > UNCONN 0  0
> > 127.0.0.10:ipproto-255 
> > *:*
> > UNCONN 0  0
> > :::ipv6-icmp  :::*  
> >   
> > UNCONN 0  0
> > :::ipv6-icmp  :::*  
> >   
> > ESTAB  0  0   
> > ::1:ipproto-255   
> > ::1:ipproto-9091 
> > 
> > so it gets zapped out. Is there some other way to test it?
> > 
> 
> I'm guessing you passed IPPROTO_RAW (255) as the protocol to socket(). If you 
> pass something
> else (IPPROTO_ICMP for example) it won't work.

True. To support IPPROTO_ICMP it needs an enhancement. I thought I'd start
with plain _RAW first and then extend it to support _ICMP.

Cyrill


Re: [PATCH] net: ipv6: fallback to full lookup if table lookup is unsuitable

2016-09-16 Thread David Ahern
On 9/16/16 1:15 PM, Vincent Bernat wrote:
>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> index ad4a7ff301fc..48bae2ee2e18 100644
>> --- a/net/ipv6/route.c
>> +++ b/net/ipv6/route.c
>> @@ -1991,9 +1991,19 @@ static struct rt6_info *ip6_route_info_create(struct 
>> fib6_config *cfg)
>> if (!(gwa_type & IPV6_ADDR_UNICAST))
>> goto out;
>>
>> -   if (cfg->fc_table)
>> +   if (cfg->fc_table) {
>> grt = ip6_nh_lookup_table(net, cfg, gw_addr);
>>
>> +   /* a nexthop lookup can not go through a gw.
>> +* if this happens on a table based lookup
>> +* then fallback to a full lookup
>> +*/
>> +   if (grt && grt->rt6i_flags & RTF_GATEWAY) {
>> +   ip6_rt_put(grt);
>> +   grt = NULL;
>> +   }
>> +   }
>> +
>> if (!grt)
>> grt = rt6_lookup(net, gw_addr, NULL,
>>  cfg->fc_ifindex, 1);
> 
> OK. Should the dev check be dismissed or do we add "dev && dev !=
> grt->dst.dev" just as a safety net (this would be a convulated setup,
> but the correct direct route could be in an ip rule with higher priority
> while the one in this table is incorrect)?
> 

yes. So the validity check becomes:

grt = ip6_nh_lookup_table(net, cfg, gw_addr);
if (grt) {
if (grt->rt6i_flags & RTF_GATEWAY ||
dev && dev != grt->dst.dev) {
ip6_rt_put(grt);
                grt = NULL;   <--- causes the full rt6_lookup
}
}


Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets

2016-09-16 Thread David Ahern
On 9/16/16 1:00 PM, Cyrill Gorcunov wrote:
> I created a veth pair and bound a raw socket to it.
> 
> [root@pcs7 iproute2]# misc/ss -A raw
> State  Recv-Q Send-QLocal Address:Port
>  Peer Address:Port
> ESTAB  0  0 
> 127.0.0.1:ipproto-255
> 127.0.0.10:ipproto-9090 
> UNCONN 0  0
> 127.0.0.10:ipproto-255 
> *:*
> UNCONN 0  0
> :::ipv6-icmp  :::*
> 
> UNCONN 0  0
> :::ipv6-icmp  :::*
> 
> ESTAB  0  0   
> ::1:ipproto-255   
> ::1:ipproto-9091 
> UNCONN 0  0   
> ::1%vm1:ipproto-255:::*   
>  
> [root@pcs7 iproute2]# 
> 
> [root@pcs7 iproute2]# misc/ss -aKw 'dev == vm1'
> State  Recv-Q Send-QLocal Address:Port
>  Peer Address:Port
> UNCONN 0  0   
> ::1%vm1:ipproto-255:::*   
>  
> 
> [root@pcs7 iproute2]# misc/ss -A raw
> State  Recv-Q Send-QLocal Address:Port
>  Peer Address:Port
> ESTAB  0  0 
> 127.0.0.1:ipproto-255
> 127.0.0.10:ipproto-9090 
> UNCONN 0  0
> 127.0.0.10:ipproto-255 
> *:*
> UNCONN 0  0
> :::ipv6-icmp  :::*
> 
> UNCONN 0  0
> :::ipv6-icmp  :::*
> 
> ESTAB  0  0   
> ::1:ipproto-255   
> ::1:ipproto-9091 
> 
> so it gets zapped out. Is there some other way to test it?
> 

I'm guessing you passed IPPROTO_RAW (255) as the protocol to socket(). If you 
pass something else (IPPROTO_ICMP for example) it won't work.


Re: [PATCH net-next 02/14] tcp: use windowed min filter library for TCP min_rtt estimation

2016-09-16 Thread Neal Cardwell
On Fri, Sep 16, 2016 at 3:21 PM, kbuild test robot  wrote:
> All errors (new ones prefixed by >>):
>
>>> net/ipv4/tcp_cdg.c:59:8: error: redefinition of 'struct minmax'
> struct minmax {
>^~
>In file included from include/linux/tcp.h:22:0,
> from include/net/tcp.h:24,
> from net/ipv4/tcp_cdg.c:30:
>include/linux/win_minmax.h:17:8: note: originally defined here
> struct minmax {
>^~
>
> vim +59 net/ipv4/tcp_cdg.c

Sorry about that. I will fix that and re-post.

neal


Re: XDP_TX bug report on mlx4

2016-09-16 Thread Jesper Dangaard Brouer
On Fri, 16 Sep 2016 12:17:27 -0700
Brenden Blanco  wrote:

> On Fri, Sep 16, 2016 at 09:03:40PM +0200, Jesper Dangaard Brouer wrote:
> > Hi Brenden,
> > 
> > I've discovered a bug with XDP_TX recycling of pages in the mlx4 driver.
> > 
> > If I increase the number of RX and TX queues/channels via ethtool cmd:
> >  ethtool -L mlx4p1 rx 10 tx 10
> > 
> > Then when running the xdp2 program, which does XDP_TX, the kernel will
> > crash with page errors, because the page refcnt goes to zero or even
> > minus.  I've noticed pages delivered to mlx4_en_rx_recycle() can have
> > a page refcnt of zero, which is wrong, they should always have 1 (for
> > XDP).
> > 
> > Debugging it further, I find that this can happen when mlx4_en_rx_recycle()
> > is called from mlx4_en_recycle_tx_desc().  This is the TX cleanup function,
> > associated with TX ring queues used for XDP_TX only. No others than the
> > XDP_TX action should be able to place packets into these TX rings
> > which call mlx4_en_recycle_tx_desc().  
> 
> Sounds pretty straightforward, let me look into it.

Here is some debug info I instrumented my kernel with, and I've
attached my minicom output with a warning and a panic.

Enable some driver debug printks via::
 ethtool -s mlx4p1 msglvl drv on

Debug normal situation::

 $ grep recycle_ring minicom_capturefile.log08
 [  520.746610] mlx4_en: mlx4p1: Set tx_ring[56]->recycle_ring = rx_ring[0]
 [  520.747042] mlx4_en: mlx4p1: Set tx_ring[57]->recycle_ring = rx_ring[1]
 [  520.747470] mlx4_en: mlx4p1: Set tx_ring[58]->recycle_ring = rx_ring[2]
 [  520.747918] mlx4_en: mlx4p1: Set tx_ring[59]->recycle_ring = rx_ring[3]
 [  520.748330] mlx4_en: mlx4p1: Set tx_ring[60]->recycle_ring = rx_ring[4]
 [  520.748749] mlx4_en: mlx4p1: Set tx_ring[61]->recycle_ring = rx_ring[5]
 [  520.749181] mlx4_en: mlx4p1: Set tx_ring[62]->recycle_ring = rx_ring[6]
 [  520.749620] mlx4_en: mlx4p1: Set tx_ring[63]->recycle_ring = rx_ring[7]

Change $ ethtool -L mlx4p1 rx 9 tx 9 ::

 [  911.594692] mlx4_en: mlx4p1: Set tx_ring[56]->recycle_ring = rx_ring[0]
 [  911.608345] mlx4_en: mlx4p1: Set tx_ring[57]->recycle_ring = rx_ring[1]
 [  911.622008] mlx4_en: mlx4p1: Set tx_ring[58]->recycle_ring = rx_ring[2]
 [  911.636364] mlx4_en: mlx4p1: Set tx_ring[59]->recycle_ring = rx_ring[3]
 [  911.650015] mlx4_en: mlx4p1: Set tx_ring[60]->recycle_ring = rx_ring[4]
 [  911.663690] mlx4_en: mlx4p1: Set tx_ring[61]->recycle_ring = rx_ring[5]
 [  911.677356] mlx4_en: mlx4p1: Set tx_ring[62]->recycle_ring = rx_ring[6]
 [  911.690924] mlx4_en: mlx4p1: Set tx_ring[63]->recycle_ring = rx_ring[7]
 [  911.704544] mlx4_en: mlx4p1: Set tx_ring[64]->recycle_ring = rx_ring[8]
 [  911.718171] mlx4_en: mlx4p1: Set tx_ring[65]->recycle_ring = rx_ring[9]
 [  911.731772] mlx4_en: mlx4p1: Set tx_ring[66]->recycle_ring = rx_ring[10]
 [  911.745438] mlx4_en: mlx4p1: Set tx_ring[67]->recycle_ring = rx_ring[11]
 [  911.759063] mlx4_en: mlx4p1: Set tx_ring[68]->recycle_ring = rx_ring[12]
 [  911.772741] mlx4_en: mlx4p1: Set tx_ring[69]->recycle_ring = rx_ring[13]
 [  911.786415] mlx4_en: mlx4p1: Set tx_ring[70]->recycle_ring = rx_ring[14]
 [  911.800070] mlx4_en: mlx4p1: Set tx_ring[71]->recycle_ring = rx_ring[15]

Change $ ethtool -L mlx4p1 rx 10 tx 10::

 netif_set_real_num_tx_queues() setting dev->real_num_tx_queues(now:80) = 64
 mlx4_en: mlx4p1:   frag:0 - size:1522 prefix:0 stride:4096
 mlx4_en_init_recycle_ring() Set tx_ring[64]->recycle_ring = rx_ring[0]
 mlx4_en_init_recycle_ring() Set tx_ring[65]->recycle_ring = rx_ring[1]
 mlx4_en_init_recycle_ring() Set tx_ring[66]->recycle_ring = rx_ring[2]
 mlx4_en_init_recycle_ring() Set tx_ring[67]->recycle_ring = rx_ring[3]
 mlx4_en_init_recycle_ring() Set tx_ring[68]->recycle_ring = rx_ring[4]
 mlx4_en_init_recycle_ring() Set tx_ring[69]->recycle_ring = rx_ring[5]
 mlx4_en_init_recycle_ring() Set tx_ring[70]->recycle_ring = rx_ring[6]
 mlx4_en_init_recycle_ring() Set tx_ring[71]->recycle_ring = rx_ring[7]
 mlx4_en_init_recycle_ring() Set tx_ring[72]->recycle_ring = rx_ring[8]
 mlx4_en_init_recycle_ring() Set tx_ring[73]->recycle_ring = rx_ring[9]
 mlx4_en_init_recycle_ring() Set tx_ring[74]->recycle_ring = rx_ring[10]
 mlx4_en_init_recycle_ring() Set tx_ring[75]->recycle_ring = rx_ring[11]
 mlx4_en_init_recycle_ring() Set tx_ring[76]->recycle_ring = rx_ring[12]
 mlx4_en_init_recycle_ring() Set tx_ring[77]->recycle_ring = rx_ring[13]
 mlx4_en_init_recycle_ring() Set tx_ring[78]->recycle_ring = rx_ring[14]
 mlx4_en_init_recycle_ring() Set tx_ring[79]->recycle_ring = rx_ring[15]

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

[   95.777366] systemd[1]: Started Session c1 of user jbrouer.
[   95.783108] systemd[1]: Starting Session c1 of user jbrouer.
[  102.577674] XXX: netif_set_real_num_tx_queues() setting 
dev->real_num_tx_queues(now:64) = 80
[  102.586160] mlx4_en: 

Re: [PATCH net-next 02/14] tcp: use windowed min filter library for TCP min_rtt estimation

2016-09-16 Thread kbuild test robot
Hi Neal,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Neal-Cardwell/tcp-BBR-congestion-control-algorithm/20160917-025323
config: x86_64-randconfig-x006-201637 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

>> net/ipv4/tcp_cdg.c:59:8: error: redefinition of 'struct minmax'
struct minmax {
   ^~
   In file included from include/linux/tcp.h:22:0,
from include/net/tcp.h:24,
from net/ipv4/tcp_cdg.c:30:
   include/linux/win_minmax.h:17:8: note: originally defined here
struct minmax {
   ^~

vim +59 net/ipv4/tcp_cdg.c

2b0a8c9e Kenneth Klette Jonassen 2015-06-10  43  module_param(window, int, 
0444);
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  44  MODULE_PARM_DESC(window, 
"gradient window size (power of two <= 256)");
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  45  module_param(backoff_beta, 
uint, 0644);
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  46  MODULE_PARM_DESC(backoff_beta, 
"backoff beta (0-1024)");
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  47  module_param(backoff_factor, 
uint, 0644);
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  48  
MODULE_PARM_DESC(backoff_factor, "backoff probability scale factor");
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  49  module_param(hystart_detect, 
uint, 0644);
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  50  
MODULE_PARM_DESC(hystart_detect, "use Hybrid Slow start "
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  51  "(0: disabled, 
1: ACK train, 2: delay threshold, 3: both)");
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  52  module_param(use_ineff, uint, 
0644);
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  53  MODULE_PARM_DESC(use_ineff, 
"use ineffectual backoff detection (threshold)");
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  54  module_param(use_shadow, bool, 
0644);
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  55  MODULE_PARM_DESC(use_shadow, 
"use shadow window heuristic");
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  56  module_param(use_tolerance, 
bool, 0644);
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  57  
MODULE_PARM_DESC(use_tolerance, "use loss tolerance heuristic");
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  58  
2b0a8c9e Kenneth Klette Jonassen 2015-06-10 @59  struct minmax {
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  60 union {
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  61 struct {
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  62 s32 min;
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  63 s32 max;
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  64 };
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  65 u64 v64;
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  66 };
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  67  };

:: The code at line 59 was first introduced by commit
:: 2b0a8c9eee81882fc0001ccf6d9af62cdc682f9e tcp: add CDG congestion control

:: TO: Kenneth Klette Jonassen 
:: CC: David S. Miller 

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: XDP_TX bug report on mlx4

2016-09-16 Thread Brenden Blanco
On Fri, Sep 16, 2016 at 09:03:40PM +0200, Jesper Dangaard Brouer wrote:
> Hi Brenden,
> 
> I've discovered a bug with XDP_TX recycling of pages in the mlx4 driver.
> 
> If I increase the number of RX and TX queues/channels via ethtool cmd:
>  ethtool -L mlx4p1 rx 10 tx 10
> 
> Then when running the xdp2 program, which does XDP_TX, the kernel will
> crash with page errors, because the page refcnt goes to zero or even
> minus.  I've noticed pages delivered to mlx4_en_rx_recycle() can have
> a page refcnt of zero, which is wrong, they should always have 1 (for
> XDP).
> 
> Debugging it further, I find that this can happen when mlx4_en_rx_recycle()
> is called from mlx4_en_recycle_tx_desc().  This is the TX cleanup function,
> associated with TX ring queues used for XDP_TX only. No others than the
> XDP_TX action should be able to place packets into these TX rings
> which call mlx4_en_recycle_tx_desc().

Sounds pretty straightforward, let me look into it.
> 
> Do you have any idea of what could be going wrong in this case?
> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer
> 
> 


Re: [PATCH] net: ipv6: fallback to full lookup if table lookup is unsuitable

2016-09-16 Thread Vincent Bernat
 ❦ 16 September 2016 20:36 CEST, David Ahern  :

>> contained a non-connected route (like a default gateway) fails while it
>> was previously working:
>> 
>> $ ip link add eth0 type dummy
>> $ ip link set up dev eth0
>> $ ip addr add 2001:db8::1/64 dev eth0
>> $ ip route add ::/0 via 2001:db8::5 dev eth0 table 20
>> $ ip route add 2001:db8:cafe::1/128 via 2001:db8::6 dev eth0 table 20
>> RTNETLINK answers: No route to host
>> $ ip -6 route show table 20
>> default via 2001:db8::5 dev eth0  metric 1024  pref medium
>
> so your table 20 is not complete in that it lacks a connected route to
> resolve 2001:db8::6 as a nexthop, so you are relying on a fallback to
> other tables (main in this case).

Yes.

>> @@ -1991,33 +2015,15 @@ static struct rt6_info *ip6_route_info_create(struct 
>> fib6_config *cfg)
>>  if (!(gwa_type & IPV6_ADDR_UNICAST))
>>  goto out;
>>  
>> +err = -EHOSTUNREACH;
>>  if (cfg->fc_table)
>>  grt = ip6_nh_lookup_table(net, cfg, gw_addr);
>
> -8<-
>
>> -if (!(grt->rt6i_flags & RTF_GATEWAY))
>> -err = 0;
>
> This is the check that is failing for your use
> case. ip6_nh_lookup_table is returning the default route and nexthops
> can not rely on a gateway. Given that a simpler and more direct change
> is (whitespace mangled on paste):
>
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index ad4a7ff301fc..48bae2ee2e18 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1991,9 +1991,19 @@ static struct rt6_info *ip6_route_info_create(struct 
> fib6_config *cfg)
> if (!(gwa_type & IPV6_ADDR_UNICAST))
> goto out;
>
> -   if (cfg->fc_table)
> +   if (cfg->fc_table) {
> grt = ip6_nh_lookup_table(net, cfg, gw_addr);
>
> +   /* a nexthop lookup can not go through a gw.
> +* if this happens on a table based lookup
> +* then fallback to a full lookup
> +*/
> +   if (grt && grt->rt6i_flags & RTF_GATEWAY) {
> +   ip6_rt_put(grt);
> +   grt = NULL;
> +   }
> +   }
> +
> if (!grt)
> grt = rt6_lookup(net, gw_addr, NULL,
>  cfg->fc_ifindex, 1);

OK. Should the dev check be dismissed or do we add "dev && dev !=
grt->dst.dev" just as a safety net (this would be a convoluted setup,
but the correct direct route could be in an ip rule with higher priority
while the one in this table is incorrect)?
-- 
"... an experienced, industrious, ambitious, and often quite often
picturesque liar."
-- Mark Twain


XDP_TX bug report on mlx4

2016-09-16 Thread Jesper Dangaard Brouer
Hi Brenden,

I've discovered a bug with XDP_TX recycling of pages in the mlx4 driver.

If I increase the number of RX and TX queues/channels via ethtool cmd:
 ethtool -L mlx4p1 rx 10 tx 10

Then when running the xdp2 program, which does XDP_TX, the kernel will
crash with page errors, because the page refcnt goes to zero or even
minus.  I've noticed pages delivered to mlx4_en_rx_recycle() can have
a page refcnt of zero, which is wrong, they should always have 1 (for
XDP).

Debugging it further, I find that this can happen when mlx4_en_rx_recycle()
is called from mlx4_en_recycle_tx_desc().  This is the TX cleanup function,
associated with TX ring queues used for XDP_TX only. No others than the
XDP_TX action should be able to place packets into these TX rings
which call mlx4_en_recycle_tx_desc().

Do you have any idea of what could be going wrong in this case?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer




Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets

2016-09-16 Thread Cyrill Gorcunov
On Fri, Sep 16, 2016 at 10:06:23AM +0300, Cyrill Gorcunov wrote:
> On Thu, Sep 15, 2016 at 05:45:02PM -0600, David Ahern wrote:
> > > 
> > > Try to be selective in the -K , do not kill tcp sockets ?
> > 
> > I am running
> >ss -aKw 'dev == red'
> > 
> > to kill raw sockets bound to device named 'red'.
> 
> Thanks David, Eric! I'll play with this option today and report the results.

I created a veth pair and bound a raw socket to it.

[root@pcs7 iproute2]# misc/ss -A raw
State  Recv-Q Send-QLocal Address:Port  
   Peer Address:Port
ESTAB  0  0 
127.0.0.1:ipproto-255
127.0.0.10:ipproto-9090 
UNCONN 0  0
127.0.0.10:ipproto-255 *:*  
  
UNCONN 0  0:::ipv6-icmp 
 :::*
UNCONN 0  0:::ipv6-icmp 
 :::*
ESTAB  0  0   
::1:ipproto-255   
::1:ipproto-9091 
UNCONN 0  0   
::1%vm1:ipproto-255:::* 
   
[root@pcs7 iproute2]# 

[root@pcs7 iproute2]# misc/ss -aKw 'dev == vm1'
State  Recv-Q Send-QLocal Address:Port  
   Peer Address:Port
UNCONN 0  0   
::1%vm1:ipproto-255:::* 
   

[root@pcs7 iproute2]# misc/ss -A raw
State  Recv-Q Send-QLocal Address:Port  
   Peer Address:Port
ESTAB  0  0 
127.0.0.1:ipproto-255
127.0.0.10:ipproto-9090 
UNCONN 0  0
127.0.0.10:ipproto-255 *:*  
  
UNCONN 0  0:::ipv6-icmp 
 :::*
UNCONN 0  0:::ipv6-icmp 
 :::*
ESTAB  0  0   
::1:ipproto-255   
::1:ipproto-9091 

so it gets zapped out. Is there some other way to test it?


[PATCH net-next 13/14] tcp: increase ICSK_CA_PRIV_SIZE from 64 bytes to 88

2016-09-16 Thread Neal Cardwell
The TCP CUBIC module already uses 64 bytes.
The upcoming TCP BBR module uses 88 bytes.
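
(For context, icsk_ca_priv[] is the per-connection scratch space a CC
module reaches via inet_csk_ca(). A minimal sketch of the usual
pattern, with 'struct mycc' standing in for a module's private state:

    struct mycc {
            u32 example_state;      /* per-connection CC state */
    };

    static void mycc_init(struct sock *sk)
    {
            struct mycc *ca = inet_csk_ca(sk);  /* -> icsk_ca_priv[] */

            ca->example_state = 0;
    }

and, typically, a compile-time guard in the module's init function:

    BUILD_BUG_ON(sizeof(struct mycc) > ICSK_CA_PRIV_SIZE);

which is why the array size and the define must stay in sync.)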

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/net/inet_connection_sock.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/net/inet_connection_sock.h 
b/include/net/inet_connection_sock.h
index 49dcad4..197a30d 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -134,8 +134,8 @@ struct inet_connection_sock {
} icsk_mtup;
u32   icsk_user_timeout;
 
-   u64   icsk_ca_priv[64 / sizeof(u64)];
-#define ICSK_CA_PRIV_SIZE  (8 * sizeof(u64))
+   u64   icsk_ca_priv[88 / sizeof(u64)];
+#define ICSK_CA_PRIV_SIZE  (11 * sizeof(u64))
 };
 
 #define ICSK_TIME_RETRANS  1   /* Retransmit timer */
-- 
2.8.0.rc3.226.g39d4020



[PATCH net-next 09/14] tcp: export tcp_tso_autosize() and parameterize minimum number of TSO segments

2016-09-16 Thread Neal Cardwell
To allow congestion control modules to use the default TSO auto-sizing
algorithm as one of the ingredients in their own decision about TSO sizing:

1) Export tcp_tso_autosize() so that CC modules can use it.

2) Change tcp_tso_autosize() to allow callers to specify a minimum
   number of segments per TSO skb, in case the congestion control
   module has a different notion of the best floor for TSO skbs for
   the connection right now. For very low-rate paths or policed
   connections it can be appropriate to use smaller TSO skbs.
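A sketch of how a CC module can combine this export with the existing
tso_segs_goal hook (the module and function names here are
hypothetical, and the floor of 4 segments is just an example value):

    static u32 mycc_tso_segs_goal(struct sock *sk)
    {
            /* Use the default autosizing, but never go below
             * 4 segments per TSO skb for this connection.
             */
            return tcp_tso_autosize(sk, tcp_current_mss(sk), 4);
    }

    static struct tcp_congestion_ops mycc __read_mostly = {
            /* ... other required ops ... */
            .tso_segs_goal = mycc_tso_segs_goal,
    };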

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/net/tcp.h | 2 ++
 net/ipv4/tcp_output.c | 9 ++---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 4d85cd7..8805c65 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -533,6 +533,8 @@ __u32 cookie_v6_init_sequence(const struct sk_buff *skb, 
__u16 *mss);
 #endif
 /* tcp_output.c */
 
+u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now,
+int min_tso_segs);
 void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
   int nonagle);
 bool tcp_may_send_now(struct sock *sk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0137956..0bf3d48 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1549,7 +1549,8 @@ static bool tcp_nagle_check(bool partial, const struct 
tcp_sock *tp,
 /* Return how many segs we'd like on a TSO packet,
  * to send one TSO packet per ms
  */
-static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now)
+u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now,
+int min_tso_segs)
 {
u32 bytes, segs;
 
@@ -1561,10 +1562,11 @@ static u32 tcp_tso_autosize(const struct sock *sk, 
unsigned int mss_now)
 * This preserves ACK clocking and is consistent
 * with tcp_tso_should_defer() heuristic.
 */
-   segs = max_t(u32, bytes / mss_now, sysctl_tcp_min_tso_segs);
+   segs = max_t(u32, bytes / mss_now, min_tso_segs);
 
return min_t(u32, segs, sk->sk_gso_max_segs);
 }
+EXPORT_SYMBOL(tcp_tso_autosize);
 
 /* Return the number of segments we want in the skb we are transmitting.
  * See if congestion control module wants to decide; otherwise, autosize.
@@ -1574,7 +1576,8 @@ static u32 tcp_tso_segs(struct sock *sk, unsigned int 
mss_now)
const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops;
u32 tso_segs = ca_ops->tso_segs_goal ? ca_ops->tso_segs_goal(sk) : 0;
 
-   return tso_segs ? : tcp_tso_autosize(sk, mss_now);
+   return tso_segs ? :
+   tcp_tso_autosize(sk, mss_now, sysctl_tcp_min_tso_segs);
 }
 
 /* Returns the portion of skb which can be sent right away */
-- 
2.8.0.rc3.226.g39d4020



[PATCH net-next 12/14] tcp: new CC hook to set sending rate with rate_sample in any CA state

2016-09-16 Thread Neal Cardwell
From: Yuchung Cheng 

This commit introduces an optional new "omnipotent" hook,
cong_control(), for congestion control modules. The cong_control()
function is called at the end of processing an ACK (i.e., after
updating sequence numbers, the SACK scoreboard, and loss
detection). At that moment we have precise delivery rate information
the congestion control module can use to control the sending behavior
(using cwnd, TSO skb size, and pacing rate) in any CA state.

This function can also be used by a congestion control that prefers
not to use the default cwnd reduction approach (i.e., the PRR
algorithm) during CA_Recovery to control the cwnd and sending rate
during loss recovery.

We take advantage of the fact that recent changes defer the
retransmission or transmission of new data (e.g. by F-RTO) in recovery
until the new tcp_cong_control() function is run.

With this commit, we only run tcp_update_pacing_rate() if the
congestion control is not using this new API. New congestion controls
which use the new API do not want the TCP stack to run the default
pacing rate calculation and overwrite whatever pacing rate they have
chosen at initialization time.
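
A minimal sketch of a module using the new hook (names hypothetical;
note that, per the tcp_cong.c hunk below, such a module may omit
.cong_avoid as long as it provides .cong_control):

    static u32 mycc_ssthresh(struct sock *sk)
    {
            return tcp_sk(sk)->snd_ssthresh;  /* placeholder; op required */
    }

    static void mycc_cong_control(struct sock *sk,
                                  const struct rate_sample *rs)
    {
            /* Set cwnd, pacing rate and TSO sizing from the rate
             * sample (rs->delivered, rs->interval_us, ...), in any
             * CA state.
             */
    }

    static struct tcp_congestion_ops mycc __read_mostly = {
            .ssthresh     = mycc_ssthresh,
            .cong_control = mycc_cong_control,
            .name         = "mycc",
    };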

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/net/tcp.h|  4 
 net/ipv4/tcp_cong.c  |  2 +-
 net/ipv4/tcp_input.c | 17 ++---
 3 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index c4d2e46..35ec286 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -919,6 +919,10 @@ struct tcp_congestion_ops {
u32 (*tso_segs_goal)(struct sock *sk);
/* returns the multiplier used in tcp_sndbuf_expand (optional) */
u32 (*sndbuf_expand)(struct sock *sk);
+   /* call when packets are delivered to update cwnd and pacing rate,
+* after all the ca_state processing. (optional)
+*/
+   void (*cong_control)(struct sock *sk, const struct rate_sample *rs);
/* get info for inet_diag (optional) */
size_t (*get_info)(struct sock *sk, u32 ext, int *attr,
   union tcp_cc_info *info);
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 882caa4..1294af4 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -69,7 +69,7 @@ int tcp_register_congestion_control(struct tcp_congestion_ops 
*ca)
int ret = 0;
 
/* all algorithms must implement ssthresh and cong_avoid ops */
-   if (!ca->ssthresh || !ca->cong_avoid) {
+   if (!ca->ssthresh || !(ca->cong_avoid || ca->cong_control)) {
pr_err("%s does not implement required ops\n", ca->name);
return -EINVAL;
}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a134e66..931fe32 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2536,6 +2536,9 @@ static inline void tcp_end_cwnd_reduction(struct sock *sk)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
+   if (inet_csk(sk)->icsk_ca_ops->cong_control)
+   return;
+
/* Reset cwnd to ssthresh in CWR or Recovery (unless it's undone) */
if (inet_csk(sk)->icsk_ca_state == TCP_CA_CWR ||
(tp->undo_marker && tp->snd_ssthresh < TCP_INFINITE_SSTHRESH)) {
@@ -3312,8 +3315,15 @@ static inline bool tcp_may_raise_cwnd(const struct sock 
*sk, const int flag)
  * information. All transmission or retransmission are delayed afterwards.
  */
 static void tcp_cong_control(struct sock *sk, u32 ack, u32 acked_sacked,
-int flag)
+int flag, const struct rate_sample *rs)
 {
+   const struct inet_connection_sock *icsk = inet_csk(sk);
+
+   if (icsk->icsk_ca_ops->cong_control) {
+   icsk->icsk_ca_ops->cong_control(sk, rs);
+   return;
+   }
+
if (tcp_in_cwnd_reduction(sk)) {
/* Reduce cwnd if state mandates */
tcp_cwnd_reduction(sk, acked_sacked, flag);
@@ -3683,7 +3693,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff 
*skb, int flag)
delivered = tp->delivered - delivered;  /* freshly ACKed or SACKed */
lost = tp->lost - lost; /* freshly marked lost */
	tcp_rate_gen(sk, delivered, lost, &now, &rs);
-   tcp_cong_control(sk, ack, delivered, flag);
+   tcp_cong_control(sk, ack, delivered, flag, &rs);
tcp_xmit_recovery(sk, rexmit);
return 1;
 
@@ -5981,7 +5991,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff 
*skb)
} else
tcp_init_metrics(sk);
 
-   tcp_update_pacing_rate(sk);
+   if (!inet_csk(sk)->icsk_ca_ops->cong_control)
+   

[PATCH net-next 06/14] tcp: track application-limited rate samples

2016-09-16 Thread Neal Cardwell
From: Soheil Hassas Yeganeh 

This commit adds code to track whether the delivery rate represented
by each rate_sample was limited by the application.

Upon each transmit, we store in the is_app_limited field in the skb a
boolean bit indicating whether there is a known "bubble in the pipe":
a point in the rate sample interval where the sender was
application-limited, and did not transmit even though the cwnd and
pacing rate allowed it.

This logic marks the flow app-limited on a write if *all* of the
following are true:

  1) There is less than 1 MSS of unsent data in the write queue
 available to transmit.

  2) There is no packet in the sender's queues (e.g. in fq or the NIC
 tx queue).

  3) The connection is not limited by cwnd.

  4) There are no lost packets to retransmit.

The tcp_rate_check_app_limited() code in tcp_rate.c determines whether
the connection is application-limited at the moment. If the flow is
application-limited, it sets the tp->app_limited field. If the flow is
application-limited then that means there is effectively a "bubble" of
silence in the pipe now, and this silence will be reflected in a lower
bandwidth sample for any rate samples from now until we get an ACK
indicating this bubble has exited the pipe: specifically, until we get
an ACK for the next packet we transmit.
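
The check itself is compact; roughly (condensed from the tcp_rate.c
hunk of this patch):

    void tcp_rate_check_app_limited(struct sock *sk)
    {
            struct tcp_sock *tp = tcp_sk(sk);

            if (/* less than 1 MSS of unsent data in the write queue */
                tp->write_seq - tp->snd_nxt < tp->mss_cache &&
                /* nothing in the sender's qdisc or NIC tx queues */
                sk_wmem_alloc_get(sk) < SKB_TRUESIZE(1) &&
                /* not limited by cwnd */
                tcp_packets_in_flight(tp) < tp->snd_cwnd &&
                /* no lost packets left to retransmit */
                tp->lost_out <= tp->retrans_out)
                    tp->app_limited =
                            (tp->delivered + tcp_packets_in_flight(tp)) ? : 1;
    }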

When we send every skb we record in scb->tx.is_app_limited whether the
resulting rate sample will be application-limited.

The code in tcp_rate_gen() checks to see when it is safe to mark all
known application-limited bubbles of silence as having exited the
pipe. It does this by checking to see when the delivered count moves
past the tp->app_limited marker. At this point it zeroes the
tp->app_limited marker, as all known bubbles are out of the pipe.

We make room for the tx.is_app_limited bit in the skb by borrowing a
bit from the in_flight field used by NV to record the number of bytes
in flight. The receive window in the TCP header is 16 bits, and the
max receive window scaling shift factor is 14 (RFC 1323). So the max
receive window offered by the TCP protocol is 2^(16+14) = 2^30. So we
only need 30 bits for the tx.in_flight used by NV.

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/linux/tcp.h  |  1 +
 include/net/tcp.h|  6 +-
 net/ipv4/tcp.c   |  8 
 net/ipv4/tcp_minisocks.c |  3 +++
 net/ipv4/tcp_rate.c  | 29 -
 5 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index c50e6ae..fdcd00f 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -268,6 +268,7 @@ struct tcp_sock {
u32 prr_out;/* Total number of pkts sent during Recovery. */
u32 delivered;  /* Total data packets delivered incl. rexmits */
u32 lost;   /* Total data packets lost incl. rexmits */
+   u32 app_limited;/* limited until "delivered" reaches this val */
struct skb_mstamp first_tx_mstamp;  /* start of window send phase */
struct skb_mstamp delivered_mstamp; /* time we reached "delivered" */
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 4a94f64..9829aa7 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -764,7 +764,9 @@ struct tcp_skb_cb {
union {
struct {
/* There is space for up to 24 bytes */
-   __u32 in_flight;/* Bytes in flight when packet sent */
+   __u32 in_flight:30,/* Bytes in flight at transmit */
+ is_app_limited:1, /* cwnd not fully used? */
+ unused:1;
/* start of send pipeline phase */
struct skb_mstamp first_tx_mstamp __packed;
/* when we reached the "delivered" count */
@@ -883,6 +885,7 @@ struct rate_sample {
int  losses;/* number of packets marked lost upon ACK */
u32  acked_sacked;  /* number of packets newly (S)ACKed upon ACK */
u32  prior_in_flight;   /* in flight before this ACK */
+   bool is_app_limited;/* is sample from packet with bubble in pipe? */
bool is_retrans;/* is sample from retransmission? */
 };
 
@@ -978,6 +981,7 @@ void tcp_rate_skb_delivered(struct sock *sk, struct sk_buff 
*skb,
struct rate_sample *rs);
 void tcp_rate_gen(struct sock *sk, u32 delivered, u32 lost,
  struct skb_mstamp *now, struct rate_sample *rs);
+void tcp_rate_check_app_limited(struct sock *sk);
 
 /* These functions determine how the current flow behaves in respect of SACK
  * handling. 

[PATCH net-next 05/14] tcp: track data delivery rate for a TCP connection

2016-09-16 Thread Neal Cardwell
From: Yuchung Cheng 

This patch generates data delivery rate (throughput) samples on a
per-ACK basis. These rate samples can be used by congestion control
modules, and specifically will be used by TCP BBR in later patches in
this series.

Key state:

tp->delivered: Tracks the total number of data packets (original or not)
   delivered so far. This is an already-existing field.

tp->delivered_mstamp: the last time tp->delivered was updated.

Algorithm:

A rate sample is calculated as (d1 - d0)/(t1 - t0) on a per-ACK basis:

  d1: the current tp->delivered after processing the ACK
  t1: the current time after processing the ACK

  d0: the prior tp->delivered when the acked skb was transmitted
  t0: the prior tp->delivered_mstamp when the acked skb was transmitted

When an skb is transmitted, we snapshot d0 and t0 in its control
block in tcp_rate_skb_sent().

When an ACK arrives, it may SACK and ACK some skbs. For each SACKed
or ACKed skb, tcp_rate_skb_delivered() updates the rate_sample struct
to reflect the latest (d0, t0).

Finally, tcp_rate_gen() generates a rate sample by storing
(d1 - d0) in rs->delivered and (t1 - t0) in rs->interval_us.

One caveat: if an skb was sent with no packets in flight, then
tp->delivered_mstamp may be either invalid (if the connection is
starting) or outdated (if the connection was idle). In that case,
we'll re-stamp tp->delivered_mstamp.

At first glance it seems t0 should always be the time when an skb was
transmitted, but actually this could over-estimate the rate due to
phase mismatch between transmit and ACK events. To track the delivery
rate, we ensure that if packets are in flight then t0 and t1 are
times at which packets were marked delivered.

If the initial and final RTTs are different then one may be corrupted
by some sort of noise. The noise we see most often is sending gaps
caused by delayed, compressed, or stretched acks. This either affects
both RTTs equally or artificially reduces the final RTT. We approach
this by recording the info we need to compute the initial RTT
(duration of the "send phase" of the window) when we recorded the
associated inflight. Then, for a filter to avoid bandwidth
overestimates, we generalize the per-sample bandwidth computation
from:

bw = delivered / ack_phase_rtt

to the following:

bw = delivered / max(send_phase_rtt, ack_phase_rtt)

In large-scale experiments, this filtering approach incorporating
send_phase_rtt is effective at avoiding bandwidth overestimates due to
ACK compression or stretched ACKs.
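
As a concrete (illustrative) numeric example: if an ACK advances
tp->delivered by 10 packets of 1448 bytes each and
max(send_phase_rtt, ack_phase_rtt) = 10 ms, the sample is
bw = 14480 B / 10 ms ~= 1.45 MB/s (~11.6 Mbit/s). A compressed ACK run
that shrank ack_phase_rtt to 2 ms would not inflate the sample, since
the 10 ms send_phase_rtt still dominates the denominator.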

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/linux/tcp.h   |   2 +
 include/net/tcp.h |  35 +++-
 net/ipv4/Makefile |   2 +-
 net/ipv4/tcp.c|   5 ++
 net/ipv4/tcp_input.c  |  46 +++-
 net/ipv4/tcp_output.c |   4 ++
 net/ipv4/tcp_rate.c   | 149 ++
 7 files changed, 227 insertions(+), 16 deletions(-)
 create mode 100644 net/ipv4/tcp_rate.c

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 38590fb..c50e6ae 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -268,6 +268,8 @@ struct tcp_sock {
u32 prr_out;/* Total number of pkts sent during Recovery. */
u32 delivered;  /* Total data packets delivered incl. rexmits */
u32 lost;   /* Total data packets lost incl. rexmits */
+   struct skb_mstamp first_tx_mstamp;  /* start of window send phase */
+   struct skb_mstamp delivered_mstamp; /* time we reached "delivered" */
 
u32 rcv_wnd;/* Current receiver window  */
u32 write_seq;  /* Tail(+1) of data held in tcp send buffer */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 2f1648a..4a94f64 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -763,8 +763,14 @@ struct tcp_skb_cb {
__u32   ack_seq;/* Sequence number ACK'd*/
union {
struct {
-   /* There is space for up to 20 bytes */
+   /* There is space for up to 24 bytes */
__u32 in_flight;/* Bytes in flight when packet sent */
+   /* start of send pipeline phase */
+   struct skb_mstamp first_tx_mstamp __packed;
+   /* when we reached the "delivered" count */
+   struct skb_mstamp delivered_mstamp __packed;
+   /* pkts S/ACKed so far upon tx of skb, incl retrans: */
+   __u32 delivered;
} tx;   /* only used for outgoing skbs */
union {
struct 

[PATCH net-next 03/14] net_sched: sch_fq: add low_rate_threshold parameter

2016-09-16 Thread Neal Cardwell
From: Eric Dumazet 

This commit adds to the fq module a low_rate_threshold parameter to
insert a delay after all packets if the socket requests a pacing rate
below the threshold.

This helps achieve more precise control of the sending rate with
low-rate paths, especially policers. The basic issue is that if a
congestion control module detects a policer at a certain rate, it may
want fq to be able to shape to that policed rate. That way the sender
can avoid policer drops by having the packets arrive at the policer at
or just under the policed rate.

The default threshold of 550Kbps was chosen analytically so that for
policers or links at 500Kbps or 512Kbps fq would very likely invoke
this mechanism, even if the pacing rate was briefly slightly above the
available bandwidth. This value was then empirically validated with
two years of production testing on YouTube video servers.
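
Assuming the matching iproute2 support (the tc parameter name below
mirrors the new netlink attribute), usage would look like:

    tc qdisc replace dev eth0 root fq low_rate_threshold 550kbit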

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/uapi/linux/pkt_sched.h |  2 ++
 net/sched/sch_fq.c | 22 +++---
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 2382eed..f8e39db 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -792,6 +792,8 @@ enum {
 
TCA_FQ_ORPHAN_MASK, /* mask applied to orphaned skb hashes */
 
+   TCA_FQ_LOW_RATE_THRESHOLD, /* per packet delay under this rate */
+
__TCA_FQ_MAX
 };
 
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index e5458b9..40ad4fc 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -94,6 +94,7 @@ struct fq_sched_data {
u32 flow_max_rate;  /* optional max rate per flow */
u32 flow_plimit;/* max packets per flow */
u32 orphan_mask;/* mask for orphaned skb */
+   u32 low_rate_threshold;
struct rb_root  *fq_root;
u8  rate_enable;
u8  fq_trees_log;
@@ -433,7 +434,7 @@ static struct sk_buff *fq_dequeue(struct Qdisc *sch)
struct fq_flow_head *head;
struct sk_buff *skb;
struct fq_flow *f;
-   u32 rate;
+   u32 rate, plen;
 
	skb = fq_dequeue_head(sch, &q->internal);
if (skb)
@@ -482,7 +483,7 @@ begin:
	prefetch(&skb->end);
f->credit -= qdisc_pkt_len(skb);
 
-   if (f->credit > 0 || !q->rate_enable)
+   if (!q->rate_enable)
goto out;
 
/* Do not pace locally generated ack packets */
@@ -493,8 +494,15 @@ begin:
if (skb->sk)
rate = min(skb->sk->sk_pacing_rate, rate);
 
+   if (rate <= q->low_rate_threshold) {
+   f->credit = 0;
+   plen = qdisc_pkt_len(skb);
+   } else {
+   plen = max(qdisc_pkt_len(skb), q->quantum);
+   if (f->credit > 0)
+   goto out;
+   }
if (rate != ~0U) {
-   u32 plen = max(qdisc_pkt_len(skb), q->quantum);
u64 len = (u64)plen * NSEC_PER_SEC;
 
if (likely(rate))
@@ -662,6 +670,7 @@ static const struct nla_policy fq_policy[TCA_FQ_MAX + 1] = {
[TCA_FQ_FLOW_MAX_RATE]  = { .type = NLA_U32 },
[TCA_FQ_BUCKETS_LOG]= { .type = NLA_U32 },
[TCA_FQ_FLOW_REFILL_DELAY]  = { .type = NLA_U32 },
+   [TCA_FQ_LOW_RATE_THRESHOLD] = { .type = NLA_U32 },
 };
 
 static int fq_change(struct Qdisc *sch, struct nlattr *opt)
@@ -716,6 +725,10 @@ static int fq_change(struct Qdisc *sch, struct nlattr *opt)
if (tb[TCA_FQ_FLOW_MAX_RATE])
q->flow_max_rate = nla_get_u32(tb[TCA_FQ_FLOW_MAX_RATE]);
 
+   if (tb[TCA_FQ_LOW_RATE_THRESHOLD])
+   q->low_rate_threshold =
+   nla_get_u32(tb[TCA_FQ_LOW_RATE_THRESHOLD]);
+
if (tb[TCA_FQ_RATE_ENABLE]) {
u32 enable = nla_get_u32(tb[TCA_FQ_RATE_ENABLE]);
 
@@ -781,6 +794,7 @@ static int fq_init(struct Qdisc *sch, struct nlattr *opt)
q->fq_root  = NULL;
q->fq_trees_log = ilog2(1024);
q->orphan_mask  = 1024 - 1;
+   q->low_rate_threshold   = 550000 / 8;
	qdisc_watchdog_init(&q->watchdog, sch);
 
if (opt)
@@ -811,6 +825,8 @@ static int fq_dump(struct Qdisc *sch, struct sk_buff *skb)
nla_put_u32(skb, TCA_FQ_FLOW_REFILL_DELAY,
jiffies_to_usecs(q->flow_refill_delay)) ||
nla_put_u32(skb, TCA_FQ_ORPHAN_MASK, q->orphan_mask) ||
+   nla_put_u32(skb, TCA_FQ_LOW_RATE_THRESHOLD,
+   q->low_rate_threshold) ||
nla_put_u32(skb, TCA_FQ_BUCKETS_LOG, 

[PATCH net-next 14/14] tcp_bbr: add BBR congestion control

2016-09-16 Thread Neal Cardwell
This commit implements a new TCP congestion control algorithm: BBR
(Bottleneck Bandwidth and RTT). A detailed description of BBR will be
published in ACM Queue, Vol. 14 No. 5, September-October 2016, as
"BBR: Congestion-Based Congestion Control".

BBR has significantly increased throughput and reduced latency for
connections on Google's internal backbone networks and google.com and
YouTube Web servers.

BBR requires only changes on the sender side, not in the network or
the receiver side. Thus it can be incrementally deployed on today's
Internet, or in datacenters.

The Internet has predominantly used loss-based congestion control
(largely Reno or CUBIC) since the 1980s, relying on packet loss as the
signal to slow down. While this worked well for many years, loss-based
congestion control is unfortunately out-dated in today's networks. On
today's Internet, loss-based congestion control causes the infamous
bufferbloat problem, often causing seconds of needless queuing delay,
since it fills the bloated buffers in many last-mile links. On today's
high-speed long-haul links using commodity switches with shallow
buffers, loss-based congestion control has abysmal throughput because
it over-reacts to losses caused by transient traffic bursts.

In 1981 Kleinrock and Gale showed that the optimal operating point for
a network maximizes delivered bandwidth while minimizing delay and
loss, not only for single connections but for the network as a
whole. Finding that optimal operating point has been elusive, since
any single network measurement is ambiguous: network measurements are
the result of both bandwidth and propagation delay, and those two
cannot be measured simultaneously.

While it is impossible to disambiguate any single bandwidth or RTT
measurement, a connection's behavior over time tells a clearer
story. BBR uses a measurement strategy designed to resolve this
ambiguity. It combines these measurements with a robust servo loop
using recent control systems advances to implement a distributed
congestion control algorithm that reacts to actual congestion, not
packet loss or transient queue delay, and is designed to converge with
high probability to a point near the optimal operating point.

In a nutshell, BBR creates an explicit model of the network pipe by
sequentially probing the bottleneck bandwidth and RTT. On the arrival
of each ACK, BBR derives the current delivery rate of the last round
trip, and feeds it through a windowed max-filter to estimate the
bottleneck bandwidth. Conversely it uses a windowed min-filter to
estimate the round trip propagation delay. The max-filtered bandwidth
and min-filtered RTT estimates form BBR's model of the network pipe.

Using its model, BBR sets control parameters to govern sending
behavior. The primary control is the pacing rate: BBR applies a gain
multiplier to transmit faster or slower than the observed bottleneck
bandwidth. The conventional congestion window (cwnd) is now the
secondary control; the cwnd is set to a small multiple of the
estimated BDP (bandwidth-delay product) in order to allow full
utilization and bandwidth probing while bounding the potential amount
of queue at the bottleneck.

When a BBR connection starts, it enters STARTUP mode and applies a
high gain to perform an exponential search to quickly probe the
bottleneck bandwidth (doubling its sending rate each round trip, like
slow start). However, instead of continuing until it fills up the
buffer (i.e. a loss), or until delay or ACK spacing reaches some
threshold (like Hystart), it uses its model of the pipe to estimate
when that pipe is full: it estimates the pipe is full when it notices
the estimated bandwidth has stopped growing. At that point it exits
STARTUP and enters DRAIN mode, where it reduces its pacing rate to
drain the queue it estimates it has created.

Then BBR enters steady state. In steady state, PROBE_BW mode cycles
between first pacing faster to probe for more bandwidth, then pacing
slower to drain any queue it created if no more bandwidth was
available, and then cruising at the estimated bandwidth to utilize the
pipe without creating excess queue. Occasionally, on an as-needed
basis, it sends significantly slower to probe for RTT (PROBE_RTT
mode).

Our long-term goal is to improve the congestion control algorithms
used on the Internet. We are hopeful that BBR can help advance the
efforts toward this goal, and motivate the community to do further
research.

Test results, performance evaluations, feedback, and BBR-related
discussions are very welcome in the public e-mail list for BBR:

  https://groups.google.com/forum/#!forum/bbr-dev

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/uapi/linux/inet_diag.h |  13 

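To make the model described above concrete, here is a rough standalone
sketch of the two control computations: pacing rate as a gain applied to
the max-filtered bandwidth, and cwnd as a multiple of the estimated BDP.
This is not tcp_bbr.c; every name below is invented for illustration:

    #include <stdint.h>
    #include <stdio.h>

    struct bbr_model {
        uint64_t max_bw;     /* windowed-max delivery rate, bytes/sec */
        uint32_t min_rtt_us; /* windowed-min RTT, microseconds */
    };

    static uint64_t pacing_rate(const struct bbr_model *m, double gain)
    {
        return (uint64_t)(gain * (double)m->max_bw);
    }

    static uint64_t cwnd_bytes(const struct bbr_model *m, double cwnd_gain)
    {
        /* BDP in bytes = bw (bytes/sec) * min_rtt (sec) */
        uint64_t bdp = m->max_bw * m->min_rtt_us / 1000000;

        return (uint64_t)(cwnd_gain * (double)bdp);
    }

    int main(void)
    {
        /* 100 Mbit/s bottleneck and 40 ms RTT give a 500 kB BDP */
        struct bbr_model m = { .max_bw = 12500000, .min_rtt_us = 40000 };

        printf("pace %llu B/s, cwnd %llu B\n",
               (unsigned long long)pacing_rate(&m, 1.0),
               (unsigned long long)cwnd_bytes(&m, 2.0));
        return 0;
    }
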
[PATCH net-next 08/14] tcp: allow congestion control module to request TSO skb segment count

2016-09-16 Thread Neal Cardwell
Add the tso_segs_goal() function in tcp_congestion_ops to allow the
congestion control module to specify the number of segments that
should be in a TSO skb sent by tcp_write_xmit() and
tcp_xmit_retransmit_queue(). The congestion control module can either
request a particular number of segments in TSO skb that we transmit,
or return 0 if it doesn't care.

This allows the upcoming BBR congestion control module to select small
TSO skb sizes if the module detects that the bottleneck bandwidth is
very low, or that the connection is policed to a low rate.

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/net/tcp.h |  2 ++
 net/ipv4/tcp_output.c | 15 +--
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9829aa7..4d85cd7 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -913,6 +913,8 @@ struct tcp_congestion_ops {
u32  (*undo_cwnd)(struct sock *sk);
/* hook for packet ack accounting (optional) */
void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);
+   /* suggest number of segments for each skb to transmit (optional) */
+   u32 (*tso_segs_goal)(struct sock *sk);
/* get info for inet_diag (optional) */
size_t (*get_info)(struct sock *sk, u32 ext, int *attr,
   union tcp_cc_info *info);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e02c8eb..0137956 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1566,6 +1566,17 @@ static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now)
return min_t(u32, segs, sk->sk_gso_max_segs);
 }
 
+/* Return the number of segments we want in the skb we are transmitting.
+ * See if congestion control module wants to decide; otherwise, autosize.
+ */
+static u32 tcp_tso_segs(struct sock *sk, unsigned int mss_now)
+{
+   const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops;
+   u32 tso_segs = ca_ops->tso_segs_goal ? ca_ops->tso_segs_goal(sk) : 0;
+
+   return tso_segs ? : tcp_tso_autosize(sk, mss_now);
+}
+
 /* Returns the portion of skb which can be sent right away */
 static unsigned int tcp_mss_split_point(const struct sock *sk,
const struct sk_buff *skb,
@@ -2061,7 +2072,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
}
}
 
-   max_segs = tcp_tso_autosize(sk, mss_now);
+   max_segs = tcp_tso_segs(sk, mss_now);
while ((skb = tcp_send_head(sk))) {
unsigned int limit;
 
@@ -2778,7 +2789,7 @@ void tcp_xmit_retransmit_queue(struct sock *sk)
last_lost = tp->snd_una;
}
 
-   max_segs = tcp_tso_autosize(sk, tcp_current_mss(sk));
+   max_segs = tcp_tso_segs(sk, tcp_current_mss(sk));
tcp_for_write_queue_from(skb, sk) {
__u8 sacked;
int segs;
-- 
2.8.0.rc3.226.g39d4020



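For reference, a congestion control module opts in by filling the new
field in its tcp_congestion_ops. Below is a minimal hedged sketch, not
code from this series; the module name and the fixed goal of 2 segments
are invented, and a real module would derive the goal from its rate
estimate (returning 0 means "let tcp_tso_autosize() decide"):

    #include <linux/module.h>
    #include <net/tcp.h>

    static u32 example_tso_segs_goal(struct sock *sk)
    {
        return 2; /* always ask for 2-segment TSO skbs */
    }

    static struct tcp_congestion_ops example_cc __read_mostly = {
        .name          = "example",
        .owner         = THIS_MODULE,
        .ssthresh      = tcp_reno_ssthresh,
        .cong_avoid    = tcp_reno_cong_avoid,
        .tso_segs_goal = example_tso_segs_goal,
    };

    static int __init example_cc_init(void)
    {
        return tcp_register_congestion_control(&example_cc);
    }

    static void __exit example_cc_exit(void)
    {
        tcp_unregister_congestion_control(&example_cc);
    }

    module_init(example_cc_init);
    module_exit(example_cc_exit);
    MODULE_LICENSE("GPL");
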
[PATCH net-next 07/14] tcp: export data delivery rate

2016-09-16 Thread Neal Cardwell
From: Yuchung Cheng 

This commit exports two new fields in struct tcp_info:

  tcpi_delivery_rate: The most recent goodput, as measured by
tcp_rate_gen(). If the socket is limited by the sending
application (e.g., no data to send), it reports the highest
measurement instead of the most recent. The unit is bytes per
second (like other rate fields in tcp_info).

  tcpi_delivery_rate_app_limited: A boolean indicating if the goodput
was measured when the socket's throughput was limited by the
sending application.

This delivery rate information can be useful for applications that
want to know the current throughput the TCP connection is seeing,
e.g. adaptive bitrate video streaming. It can also be very useful for
debugging or troubleshooting.

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/linux/tcp.h  |  5 -
 include/uapi/linux/tcp.h |  3 +++
 net/ipv4/tcp.c   | 11 ++-
 net/ipv4/tcp_rate.c  | 12 +++-
 4 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index fdcd00f..a17ae7b 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -213,7 +213,8 @@ struct tcp_sock {
u8 reord;/* reordering detected */
} rack;
u16 advmss; /* Advertised MSS   */
-   u8  unused;
+   u8  rate_app_limited:1,  /* rate_{delivered,interval_us} limited? */
+   unused:7;
u8  nonagle : 4,/* Disable Nagle algorithm? */
thin_lto: 1,/* Use linear timeouts for thin streams */
thin_dupack : 1,/* Fast retransmit on first dupack  */
@@ -271,6 +272,8 @@ struct tcp_sock {
u32 app_limited;/* limited until "delivered" reaches this val */
struct skb_mstamp first_tx_mstamp;  /* start of window send phase */
struct skb_mstamp delivered_mstamp; /* time we reached "delivered" */
+   u32 rate_delivered;/* saved rate sample: packets delivered */
+   u32 rate_interval_us;  /* saved rate sample: time elapsed */
 
u32 rcv_wnd;/* Current receiver window  */
u32 write_seq;  /* Tail(+1) of data held in tcp send buffer */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 482898f..73ac0db 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -167,6 +167,7 @@ struct tcp_info {
__u8tcpi_backoff;
__u8tcpi_options;
__u8tcpi_snd_wscale : 4, tcpi_rcv_wscale : 4;
+   __u8tcpi_delivery_rate_app_limited:1;
 
__u32   tcpi_rto;
__u32   tcpi_ato;
@@ -211,6 +212,8 @@ struct tcp_info {
__u32   tcpi_min_rtt;
__u32   tcpi_data_segs_in;  /* RFC4898 tcpEStatsDataSegsIn */
__u32   tcpi_data_segs_out; /* RFC4898 tcpEStatsDataSegsOut */
+
+   __u64   tcpi_delivery_rate;
 };
 
 /* for TCP_MD5SIG socket option */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 6c7a6fc..7358101 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2695,7 +2695,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
 {
const struct tcp_sock *tp = tcp_sk(sk); /* iff sk_type == SOCK_STREAM */
const struct inet_connection_sock *icsk = inet_csk(sk);
-   u32 now = tcp_time_stamp;
+   u32 now = tcp_time_stamp, intv;
unsigned int start;
int notsent_bytes;
u64 rate64;
@@ -2785,6 +2785,15 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
info->tcpi_min_rtt = tcp_min_rtt(tp);
info->tcpi_data_segs_in = tp->data_segs_in;
info->tcpi_data_segs_out = tp->data_segs_out;
+
+   info->tcpi_delivery_rate_app_limited = tp->rate_app_limited ? 1 : 0;
+   rate = READ_ONCE(tp->rate_delivered);
+   intv = READ_ONCE(tp->rate_interval_us);
+   if (rate && intv) {
+   rate64 = (u64)rate * tp->mss_cache * USEC_PER_SEC;
+   do_div(rate64, intv);
+   put_unaligned(rate64, &info->tcpi_delivery_rate);
+   }
 }
 EXPORT_SYMBOL_GPL(tcp_get_info);
 
diff --git a/net/ipv4/tcp_rate.c b/net/ipv4/tcp_rate.c
index 52ff84b..9be1581 100644
--- a/net/ipv4/tcp_rate.c
+++ b/net/ipv4/tcp_rate.c
@@ -149,12 +149,22 @@ void tcp_rate_gen(struct sock *sk, u32 delivered, u32 lost,
 * for connections suffer heavy or prolonged losses.
 */
if (unlikely(rs->interval_us < tcp_min_rtt(tp))) {
-   rs->interval_us = -1;
if (!rs->is_retrans)
pr_debug("tcp rate: %ld %d %u %u %u\n",
 rs->interval_us, rs->delivered,
   

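From user space the new fields are read through the usual TCP_INFO
getsockopt. A hedged sketch follows; it assumes a kernel with this patch
and a libc tcp.h that already carries the two new tcp_info members:

    #include <stdio.h>
    #include <string.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Print the delivery rate of a connected TCP socket. The rate is
     * in bytes per second, like the other rate fields in tcp_info.
     */
    static void print_delivery_rate(int fd)
    {
        struct tcp_info ti;
        socklen_t len = sizeof(ti);

        memset(&ti, 0, sizeof(ti));
        if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) < 0) {
            perror("getsockopt(TCP_INFO)");
            return;
        }
        printf("delivery rate: %llu B/s (app limited: %u)\n",
               (unsigned long long)ti.tcpi_delivery_rate,
               (unsigned)ti.tcpi_delivery_rate_app_limited);
    }
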
[PATCH net-next 02/14] tcp: use windowed min filter library for TCP min_rtt estimation

2016-09-16 Thread Neal Cardwell
Refactor the TCP min_rtt code to reuse the new win_minmax library in
lib/win_minmax.c to simplify the TCP code.

This is a pure refactor: the functionality is exactly the same. We
just moved the windowed min code to make TCP easier to read and
maintain, and to allow other parts of the kernel to use the windowed
min/max filter code.

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/linux/tcp.h  |  5 ++--
 include/net/tcp.h|  2 +-
 net/ipv4/tcp.c   |  2 +-
 net/ipv4/tcp_input.c | 64 
 net/ipv4/tcp_minisocks.c |  2 +-
 5 files changed, 10 insertions(+), 65 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index c723a46..6433cc8 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -19,6 +19,7 @@
 
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -234,9 +235,7 @@ struct tcp_sock {
u32 mdev_max_us;/* maximal mdev for the last rtt period */
u32 rttvar_us;  /* smoothed mdev_max*/
u32 rtt_seq;/* sequence number to update rttvar */
-   struct rtt_meas {
-   u32 rtt, ts;/* RTT in usec and sampling time in jiffies. */
-   } rtt_min[3];
+   struct  minmax rtt_min;
 
u32 packets_out;/* Packets which are "in flight"*/
u32 retrans_out;/* Retransmitted packets out*/
diff --git a/include/net/tcp.h b/include/net/tcp.h
index fdfbedd..2f1648a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -671,7 +671,7 @@ static inline bool tcp_ca_dst_locked(const struct dst_entry *dst)
 /* Minimum RTT in usec. ~0 means not available. */
 static inline u32 tcp_min_rtt(const struct tcp_sock *tp)
 {
-   return tp->rtt_min[0].rtt;
+   return minmax_get(&tp->rtt_min);
 }
 
 /* Compute the actual receive window we are currently advertising.
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index a13fcb3..5b0b49c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -387,7 +387,7 @@ void tcp_init_sock(struct sock *sk)
 
icsk->icsk_rto = TCP_TIMEOUT_INIT;
tp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
-   tp->rtt_min[0].rtt = ~0U;
+   minmax_reset(&tp->rtt_min, tcp_time_stamp, ~0U);
 
/* So many TCP implementations out there (incorrectly) count the
 * initial SYN frame in their delayed-ACK and congestion control
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 70b892d..ac5b38f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2879,67 +2879,13 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
*rexmit = REXMIT_LOST;
 }
 
-/* Kathleen Nichols' algorithm for tracking the minimum value of
- * a data stream over some fixed time interval. (E.g., the minimum
- * RTT over the past five minutes.) It uses constant space and constant
- * time per update yet almost always delivers the same minimum as an
- * implementation that has to keep all the data in the window.
- *
- * The algorithm keeps track of the best, 2nd best & 3rd best min
- * values, maintaining an invariant that the measurement time of the
- * n'th best >= n-1'th best. It also makes sure that the three values
- * are widely separated in the time window since that bounds the worse
- * case error when that data is monotonically increasing over the window.
- *
- * Upon getting a new min, we can forget everything earlier because it
- * has no value - the new min is <= everything else in the window by
- * definition and it's the most recent. So we restart fresh on every new min
- * and overwrites 2nd & 3rd choices. The same property holds for 2nd & 3rd
- * best.
- */
 static void tcp_update_rtt_min(struct sock *sk, u32 rtt_us)
 {
-   const u32 now = tcp_time_stamp, wlen = sysctl_tcp_min_rtt_wlen * HZ;
-   struct rtt_meas *m = tcp_sk(sk)->rtt_min;
-   struct rtt_meas rttm = {
-   .rtt = likely(rtt_us) ? rtt_us : jiffies_to_usecs(1),
-   .ts = now,
-   };
-   u32 elapsed;
-
-   /* Check if the new measurement updates the 1st, 2nd, or 3rd choices */
-   if (unlikely(rttm.rtt <= m[0].rtt))
-   m[0] = m[1] = m[2] = rttm;
-   else if (rttm.rtt <= m[1].rtt)
-   m[1] = m[2] = rttm;
-   else if (rttm.rtt <= m[2].rtt)
-   m[2] = rttm;
-
-   elapsed = now - m[0].ts;
-   if (unlikely(elapsed > wlen)) {
-   /* Passed entire window without a new min so make 2nd choice
-* the new min & 3rd choice the new 2nd. So forth and so on.
-*/
-   m[0] = m[1];
-   m[1] = m[2];
-   m[2] = rttm;
-   if 

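The quoted hunk is cut off above. Since the commit message calls this a
pure refactor, the replacement body plausibly reduces to a single call
into the new library; the following is a reconstruction for readability,
not text quoted from this mail:

    static void tcp_update_rtt_min(struct sock *sk, u32 rtt_us)
    {
        struct tcp_sock *tp = tcp_sk(sk);
        u32 wlen = sysctl_tcp_min_rtt_wlen * HZ;

        /* Feed the sample into the windowed min filter; a zero RTT
         * is clamped to 1 usec, as the removed code did.
         */
        minmax_running_min(&tp->rtt_min, wlen, tcp_time_stamp,
                           rtt_us ? : jiffies_to_usecs(1));
    }
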
[PATCH net-next 04/14] tcp: count packets marked lost for a TCP connection

2016-09-16 Thread Neal Cardwell
Count the number of packets that a TCP connection marks lost.

Congestion control modules can use this loss rate information for more
intelligent decisions about how fast to send.

Specifically, this is used in TCP BBR policer detection. BBR uses a
high packet loss rate as one signal in its policer detection and
policer bandwidth estimation algorithm.

The BBR policer detection algorithm cannot simply track retransmits,
because a retransmit can be (and often is) an indicator of packets
lost long, long ago. This is particularly true in a long CA_Loss
period that repairs the initial massive losses when a policer kicks
in.

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/linux/tcp.h  |  1 +
 net/ipv4/tcp_input.c | 25 -
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 6433cc8..38590fb 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -267,6 +267,7 @@ struct tcp_sock {
 * receiver in Recovery. */
u32 prr_out;/* Total number of pkts sent during Recovery. */
u32 delivered;  /* Total data packets delivered incl. rexmits */
+   u32 lost;   /* Total data packets lost incl. rexmits */
 
u32 rcv_wnd;/* Current receiver window  */
u32 write_seq;  /* Tail(+1) of data held in tcp send buffer */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ac5b38f..024b579 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -899,12 +899,29 @@ static void tcp_verify_retransmit_hint(struct tcp_sock *tp, struct sk_buff *skb)
tp->retransmit_high = TCP_SKB_CB(skb)->end_seq;
 }
 
+/* Sum the number of packets on the wire we have marked as lost.
+ * There are two cases we care about here:
+ * a) Packet hasn't been marked lost (nor retransmitted),
+ *and this is the first loss.
+ * b) Packet has been marked both lost and retransmitted,
+ *and this means we think it was lost again.
+ */
+static void tcp_sum_lost(struct tcp_sock *tp, struct sk_buff *skb)
+{
+   __u8 sacked = TCP_SKB_CB(skb)->sacked;
+
+   if (!(sacked & TCPCB_LOST) ||
+   ((sacked & TCPCB_LOST) && (sacked & TCPCB_SACKED_RETRANS)))
+   tp->lost += tcp_skb_pcount(skb);
+}
+
 static void tcp_skb_mark_lost(struct tcp_sock *tp, struct sk_buff *skb)
 {
if (!(TCP_SKB_CB(skb)->sacked & (TCPCB_LOST|TCPCB_SACKED_ACKED))) {
tcp_verify_retransmit_hint(tp, skb);
 
tp->lost_out += tcp_skb_pcount(skb);
+   tcp_sum_lost(tp, skb);
TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
}
 }
@@ -913,6 +930,7 @@ void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb)
 {
tcp_verify_retransmit_hint(tp, skb);
 
+   tcp_sum_lost(tp, skb);
if (!(TCP_SKB_CB(skb)->sacked & (TCPCB_LOST|TCPCB_SACKED_ACKED))) {
tp->lost_out += tcp_skb_pcount(skb);
TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
@@ -1890,6 +1908,7 @@ void tcp_enter_loss(struct sock *sk)
struct sk_buff *skb;
bool new_recovery = icsk->icsk_ca_state < TCP_CA_Recovery;
bool is_reneg;  /* is receiver reneging on SACKs? */
+   bool mark_lost;
 
/* Reduce ssthresh if it has not yet been made inside this window. */
if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
@@ -1923,8 +1942,12 @@ void tcp_enter_loss(struct sock *sk)
if (skb == tcp_send_head(sk))
break;
 
+   mark_lost = (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) ||
+is_reneg);
+   if (mark_lost)
+   tcp_sum_lost(tp, skb);
TCP_SKB_CB(skb)->sacked &= (~TCPCB_TAGBITS)|TCPCB_SACKED_ACKED;
-   if (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) || is_reneg) {
+   if (mark_lost) {
TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_ACKED;
TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
tp->lost_out += tcp_skb_pcount(skb);
-- 
2.8.0.rc3.226.g39d4020


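A congestion control module can turn the new counter into a per-interval
loss ratio by sampling deltas. The helper below is a hedged sketch
invented for illustration (BBR's real policer heuristic arrives in the
tcp_bbr patch later in the series):

    #include <net/tcp.h>

    struct loss_sample {
        u32 prior_lost;
        u32 prior_delivered;
    };

    /* Loss per mille over the interval since the previous call. */
    static u32 loss_rate_permille(struct loss_sample *s,
                                  const struct tcp_sock *tp)
    {
        u32 lost = tp->lost - s->prior_lost;
        u32 delivered = tp->delivered - s->prior_delivered;

        s->prior_lost = tp->lost;
        s->prior_delivered = tp->delivered;

        return delivered ? lost * 1000 / delivered : 0;
    }
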

[PATCH net-next 10/14] tcp: export tcp_mss_to_mtu() for congestion control modules

2016-09-16 Thread Neal Cardwell
Export tcp_mss_to_mtu(), so that congestion control modules can use
this to help calculate a pacing rate.

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 net/ipv4/tcp_output.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0bf3d48..7d025a7 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1362,6 +1362,7 @@ int tcp_mss_to_mtu(struct sock *sk, int mss)
}
return mtu;
 }
+EXPORT_SYMBOL(tcp_mss_to_mtu);
 
 /* MTU probing init per socket */
 void tcp_mtup_init(struct sock *sk)
-- 
2.8.0.rc3.226.g39d4020


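One hedged example of why a module might want this export: scaling a
payload-based rate estimate up to an on-the-wire pacing rate. The helper
name and the calculation are illustrative, not from this series, and an
established connection with a valid mss_cache is assumed:

    #include <linux/math64.h>
    #include <net/tcp.h>

    /* tcp_mss_to_mtu() maps the current MSS back to the full packet
     * size including headers, so rate * mtu / mss approximates the
     * wire rate for a given payload rate (bytes per second).
     */
    static u64 example_wire_pacing_rate(struct sock *sk, u64 payload_rate)
    {
        u32 mss = tcp_sk(sk)->mss_cache;
        u32 mtu = tcp_mss_to_mtu(sk, mss);

        return div_u64(payload_rate * mtu, mss);
    }
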

[PATCH net-next 11/14] tcp: allow congestion control to expand send buffer differently

2016-09-16 Thread Neal Cardwell
From: Yuchung Cheng 

Currently the TCP send buffer expands to twice cwnd, in order to allow
limited transmits in the CA_Recovery state. This assumes that cwnd
does not increase in the CA_Recovery.

For some congestion control algorithms, like the upcoming BBR module,
if the losses in recovery do not indicate congestion then we may
continue to raise cwnd multiplicatively in recovery. In such cases the
current multiplier will falsely limit the sending rate, much as if it
were limited by the application.

This commit adds an optional congestion control callback to use a
different multiplier to expand the TCP send buffer. For congestion
control modules that do not specify this callback, TCP continues to
use the previous default of 2.

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/net/tcp.h| 2 ++
 net/ipv4/tcp_input.c | 4 +++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 8805c65..c4d2e46 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -917,6 +917,8 @@ struct tcp_congestion_ops {
void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);
/* suggest number of segments for each skb to transmit (optional) */
u32 (*tso_segs_goal)(struct sock *sk);
+   /* returns the multiplier used in tcp_sndbuf_expand (optional) */
+   u32 (*sndbuf_expand)(struct sock *sk);
/* get info for inet_diag (optional) */
size_t (*get_info)(struct sock *sk, u32 ext, int *attr,
   union tcp_cc_info *info);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index df26af0..a134e66 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -289,6 +289,7 @@ static bool tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr
 static void tcp_sndbuf_expand(struct sock *sk)
 {
const struct tcp_sock *tp = tcp_sk(sk);
+   const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops;
int sndmem, per_mss;
u32 nr_segs;
 
@@ -309,7 +310,8 @@ static void tcp_sndbuf_expand(struct sock *sk)
 * Cubic needs 1.7 factor, rounded to 2 to include
 * extra cushion (application might react slowly to POLLOUT)
 */
-   sndmem = 2 * nr_segs * per_mss;
+   sndmem = ca_ops->sndbuf_expand ? ca_ops->sndbuf_expand(sk) : 2;
+   sndmem *= nr_segs * per_mss;
 
if (sk->sk_sndbuf < sndmem)
sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);
-- 
2.8.0.rc3.226.g39d4020


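Wiring this up from a module is the callback plus one assignment; the
factor of 3 below is an invented example (BBR's actual choice comes in a
later patch), hooked in via .sndbuf_expand in the module's
tcp_congestion_ops:

    #include <net/tcp.h>

    /* Ask tcp_sndbuf_expand() for a 3x multiplier instead of the
     * default 2, e.g. because cwnd may keep growing in recovery.
     */
    static u32 example_sndbuf_expand(struct sock *sk)
    {
        return 3;
    }
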

[PATCH net-next 01/14] lib/win_minmax: windowed min or max estimator

2016-09-16 Thread Neal Cardwell
This commit introduces a generic library to estimate either the min or
max value of a time-varying variable over a recent time window. This
is code originally from Kathleen Nichols. The current form of the code
is from Van Jacobson.

A single struct minmax_sample will track the estimated windowed-max
value of the series if you call minmax_running_max() or the estimated
windowed-min value of the series if you call minmax_running_min().

Nearly equivalent code is already in place for minimum RTT estimation
in the TCP stack. This commit extracts that code and generalizes it to
handle both min and max. Moving the code here reduces the footprint
and complexity of the TCP code base and makes the filter generally
available for other parts of the codebase, including an upcoming TCP
congestion control module.

This library works well for time series where the measurements are
smoothly increasing or decreasing.

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/linux/win_minmax.h | 37 +
 lib/Makefile   |  2 +-
 lib/win_minmax.c   | 98 ++
 3 files changed, 136 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/win_minmax.h
 create mode 100644 lib/win_minmax.c

diff --git a/include/linux/win_minmax.h b/include/linux/win_minmax.h
new file mode 100644
index 000..5656960
--- /dev/null
+++ b/include/linux/win_minmax.h
@@ -0,0 +1,37 @@
+/**
+ * lib/minmax.c: windowed min/max tracker by Kathleen Nichols.
+ *
+ */
+#ifndef MINMAX_H
+#define MINMAX_H
+
+#include 
+
+/* A single data point for our parameterized min-max tracker */
+struct minmax_sample {
+   u32 t;  /* time measurement was taken */
+   u32 v;  /* value measured */
+};
+
+/* State for the parameterized min-max tracker */
+struct minmax {
+   struct minmax_sample s[3];
+};
+
+static inline u32 minmax_get(const struct minmax *m)
+{
+   return m->s[0].v;
+}
+
+static inline u32 minmax_reset(struct minmax *m, u32 t, u32 meas)
+{
+   struct minmax_sample val = { .t = t, .v = meas };
+
+   m->s[2] = m->s[1] = m->s[0] = val;
+   return m->s[0].v;
+}
+
+u32 minmax_running_max(struct minmax *m, u32 win, u32 t, u32 meas);
+u32 minmax_running_min(struct minmax *m, u32 win, u32 t, u32 meas);
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 5dc77a8..df747e5 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -22,7 +22,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 sha1.o chacha20.o md5.o irq_regs.o argv_split.o \
 flex_proportions.o ratelimit.o show_mem.o \
 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
-earlycpio.o seq_buf.o nmi_backtrace.o nodemask.o
+earlycpio.o seq_buf.o nmi_backtrace.o nodemask.o win_minmax.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/win_minmax.c b/lib/win_minmax.c
new file mode 100644
index 000..c8420d4
--- /dev/null
+++ b/lib/win_minmax.c
@@ -0,0 +1,98 @@
+/**
+ * lib/minmax.c: windowed min/max tracker
+ *
+ * Kathleen Nichols' algorithm for tracking the minimum (or maximum)
+ * value of a data stream over some fixed time interval.  (E.g.,
+ * the minimum RTT over the past five minutes.) It uses constant
+ * space and constant time per update yet almost always delivers
+ * the same minimum as an implementation that has to keep all the
+ * data in the window.
+ *
+ * The algorithm keeps track of the best, 2nd best & 3rd best min
+ * values, maintaining an invariant that the measurement time of
+ * the n'th best >= n-1'th best. It also makes sure that the three
+ * values are widely separated in the time window since that bounds
+ * the worse case error when that data is monotonically increasing
+ * over the window.
+ *
+ * Upon getting a new min, we can forget everything earlier because
+ * it has no value - the new min is <= everything else in the window
+ * by definition and it's the most recent. So we restart fresh on
+ * every new min and overwrites 2nd & 3rd choices. The same property
+ * holds for 2nd & 3rd best.
+ */
+#include 
+#include 
+
+/* As time advances, update the 1st, 2nd, and 3rd choices. */
+static u32 minmax_subwin_update(struct minmax *m, u32 win,
+   const struct minmax_sample *val)
+{
+   u32 dt = val->t - m->s[0].t;
+
+   if (unlikely(dt > win)) {
+   /*
+* Passed entire window without a new val so make 2nd
+* choice the new val & 3rd choice the new 2nd choice.
+* we may have to iterate this since our 2nd choice
+* may also be outside the window (we checked on entry
+* that the third 

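Usage of the library is one call per measurement. A hedged kernel-side
sketch follows; the names and the 10-second window are invented, and any
monotonically increasing u32 clock works as the time argument (TCP itself
passes tcp_time_stamp):

    #include <linux/jiffies.h>
    #include <linux/win_minmax.h>

    static struct minmax rtt_min;

    static void rtt_min_init(u32 first_rtt_us)
    {
        minmax_reset(&rtt_min, (u32)jiffies, first_rtt_us);
    }

    static void rtt_min_update(u32 rtt_us)
    {
        /* keep the smallest sample seen in the last 10 seconds */
        minmax_running_min(&rtt_min, 10 * HZ, (u32)jiffies, rtt_us);
    }

    static u32 rtt_min_read(void)
    {
        return minmax_get(&rtt_min);
    }
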
[PATCH net-next 00/14] tcp: BBR congestion control algorithm

2016-09-16 Thread Neal Cardwell
tcp: BBR congestion control algorithm

This patch series implements a new TCP congestion control algorithm:
BBR (Bottleneck Bandwidth and RTT). A paper with a detailed
description of BBR will be published in ACM Queue, September-October
2016, as "BBR: Congestion-Based Congestion Control". BBR is widely
deployed in production at Google.

The patch series starts with a set of supporting infrastructure
changes, including a few that extend the congestion control
framework. The last patch adds BBR as a TCP congestion control
module. Please see individual patches for the details.

Eric Dumazet (1):
  net_sched: sch_fq: add low_rate_threshold parameter

Neal Cardwell (8):
  lib/win_minmax: windowed min or max estimator
  tcp: use windowed min filter library for TCP min_rtt estimation
  tcp: count packets marked lost for a TCP connection
  tcp: allow congestion control module to request TSO skb segment count
  tcp: export tcp_tso_autosize() and parameterize minimum number of TSO
segments
  tcp: export tcp_mss_to_mtu() for congestion control modules
  tcp: increase ICSK_CA_PRIV_SIZE from 64 bytes to 88
  tcp_bbr: add BBR congestion control

Soheil Hassas Yeganeh (1):
  tcp: track application-limited rate samples

Yuchung Cheng (4):
  tcp: track data delivery rate for a TCP connection
  tcp: export data delivery rate
  tcp: allow congestion control to expand send buffer differently
  tcp: new CC hook to set sending rate with rate_sample in any CA state

 include/linux/tcp.h|  14 +-
 include/linux/win_minmax.h |  37 ++
 include/net/inet_connection_sock.h |   4 +-
 include/net/tcp.h  |  53 ++-
 include/uapi/linux/inet_diag.h |  13 +
 include/uapi/linux/pkt_sched.h |   2 +
 include/uapi/linux/tcp.h   |   3 +
 lib/Makefile   |   2 +-
 lib/win_minmax.c   |  98 +
 net/ipv4/Kconfig   |  18 +
 net/ipv4/Makefile  |   3 +-
 net/ipv4/tcp.c |  26 +-
 net/ipv4/tcp_bbr.c | 875 +
 net/ipv4/tcp_cong.c|   2 +-
 net/ipv4/tcp_input.c   | 154 +++
 net/ipv4/tcp_minisocks.c   |   5 +-
 net/ipv4/tcp_output.c  |  27 +-
 net/ipv4/tcp_rate.c| 186 
 net/sched/sch_fq.c |  22 +-
 19 files changed, 1445 insertions(+), 99 deletions(-)
 create mode 100644 include/linux/win_minmax.h
 create mode 100644 lib/win_minmax.c
 create mode 100644 net/ipv4/tcp_bbr.c
 create mode 100644 net/ipv4/tcp_rate.c

-- 
2.8.0.rc3.226.g39d4020

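With the series applied and CONFIG_TCP_BBR enabled, a sender selects BBR
per socket through the standard TCP_CONGESTION socket option (or
system-wide via the net.ipv4.tcp_congestion_control sysctl); BBR also
wants the fq pacing qdisc on the egress interface. A hedged user-space
sketch:

    #include <stdio.h>
    #include <string.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Select BBR on one socket, ideally before connecting. */
    static int use_bbr(int fd)
    {
        static const char name[] = "bbr";

        if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
                       name, strlen(name)) < 0) {
            perror("setsockopt(TCP_CONGESTION)");
            return -1;
        }
        return 0;
    }
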


Re: [ethtool PATCH v1] ethtool: Document ethtool advertised speeds for 1G/10G

2016-09-16 Thread John W. Linville
On Tue, Sep 06, 2016 at 04:55:11PM -0700, Vidya Sagar Ravipati wrote:
> From: Vidya Sagar Ravipati 
> 
> Man page update to include updated advertised speeds for
> 1G/10G
> 
> Signed-off-by: Vidya Sagar Ravipati 

Applied, thanks!

-- 
John W. LinvilleSomeday the world will need a hero, and you
linvi...@tuxdriver.com  might be all we have.  Be ready.


Re: [PATCH] net: ipv6: fallback to full lookup if table lookup is unsuitable

2016-09-16 Thread David Ahern
On 9/16/16 6:55 AM, Vincent Bernat wrote:
> Commit 8c14586fc320 ("net: ipv6: Use passed in table for nexthop
> lookups") introduced a regression: insertion of an IPv6 route in a table
> not containing the appropriate connected route for the gateway but which
> contained a non-connected route (like a default gateway) fails while it
> was previously working:
> 
> $ ip link add eth0 type dummy
> $ ip link set up dev eth0
> $ ip addr add 2001:db8::1/64 dev eth0
> $ ip route add ::/0 via 2001:db8::5 dev eth0 table 20
> $ ip route add 2001:db8:cafe::1/128 via 2001:db8::6 dev eth0 table 20
> RTNETLINK answers: No route to host
> $ ip -6 route show table 20
> default via 2001:db8::5 dev eth0  metric 1024  pref medium

So your table 20 is not complete in that it lacks a connected route to resolve
2001:db8::6 as a nexthop, so you are relying on a fallback to other tables
(main in this case).

-8<-

> @@ -1991,33 +2015,15 @@ static struct rt6_info *ip6_route_info_create(struct fib6_config *cfg)
>   if (!(gwa_type & IPV6_ADDR_UNICAST))
>   goto out;
>  
> + err = -EHOSTUNREACH;
>   if (cfg->fc_table)
>   grt = ip6_nh_lookup_table(net, cfg, gw_addr);

-8<-

> - if (!(grt->rt6i_flags & RTF_GATEWAY))
> - err = 0;

This is the check that is failing for your use case: ip6_nh_lookup_table is
returning the default route, and nexthops cannot rely on a gateway. Given
that, a simpler and more direct change is (whitespace mangled on paste):

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index ad4a7ff301fc..48bae2ee2e18 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1991,9 +1991,19 @@ static struct rt6_info *ip6_route_info_create(struct fib6_config *cfg)
if (!(gwa_type & IPV6_ADDR_UNICAST))
goto out;

-   if (cfg->fc_table)
+   if (cfg->fc_table) {
grt = ip6_nh_lookup_table(net, cfg, gw_addr);

+   /* a nexthop lookup can not go through a gw.
+* if this happens on a table based lookup
+* then fallback to a full lookup
+*/
+   if (grt && grt->rt6i_flags & RTF_GATEWAY) {
+   ip6_rt_put(grt);
+   grt = NULL;
+   }
+   }
+
if (!grt)
grt = rt6_lookup(net, gw_addr, NULL,
 cfg->fc_ifindex, 1);


Re: [PATCHv1] sunrpc: fix write space race causing stalls

2016-09-16 Thread Trond Myklebust

> On Sep 16, 2016, at 13:29, David Vrabel  wrote:
> 
> On 16/09/16 18:06, Trond Myklebust wrote:
>> 
>>> On Sep 16, 2016, at 12:41, David Vrabel  wrote:
>>> 
>>> On 16/09/16 17:01, Trond Myklebust wrote:
 
> On Sep 16, 2016, at 08:28, David Vrabel  wrote:
> 
> Write space becoming available may race with putting the task to sleep
> in xprt_wait_for_buffer_space().  The existing mechanism to avoid the
> race does not work.
> 
> This (edited) partial trace illustrates the problem:
> 
> [1] rpc_task_run_action: task:43546@5 ... action=call_transmit
> [2] xs_write_space <-xs_tcp_write_space
> [3] xprt_write_space <-xs_write_space
> [4] rpc_task_sleep: task:43546@5 ...
> [5] xs_write_space <-xs_tcp_write_space
> 
> [1] Task 43546 runs but is out of write space.
> 
> [2] Space becomes available, xs_write_space() clears the
>  SOCKWQ_ASYNC_NOSPACE bit.
> 
> [3] xprt_write_space() attemts to wake xprt->snd_task (== 43546), but
>  this has not yet been queued and the wake up is lost.
> 
> [4] xs_nospace() is called which calls xprt_wait_for_buffer_space()
>  which queues task 43546.
> 
> [5] The call to sk->sk_write_space() at the end of xs_nospace() (which
>  is supposed to handle the above race) does not call
>  xprt_write_space() as the SOCKWQ_ASYNC_NOSPACE bit is clear and
>  thus the task is not woken.
> 
> Fix the race by have xprt_wait_for_buffer_space() check for write
> space after putting the task to sleep.
> 
> Signed-off-by: David Vrabel 
> ---
> include/linux/sunrpc/xprt.h |  1 +
> net/sunrpc/xprt.c   |  4 
> net/sunrpc/xprtsock.c   | 21 +++--
> 3 files changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
> index a16070d..621e74b 100644
> --- a/include/linux/sunrpc/xprt.h
> +++ b/include/linux/sunrpc/xprt.h
> @@ -129,6 +129,7 @@ struct rpc_xprt_ops {
>   void(*connect)(struct rpc_xprt *xprt, struct rpc_task 
> *task);
>   void *  (*buf_alloc)(struct rpc_task *task, size_t size);
>   void(*buf_free)(void *buffer);
> + bool(*have_write_space)(struct rpc_xprt *task);
>   int (*send_request)(struct rpc_task *task);
>   void(*set_retrans_timeout)(struct rpc_task *task);
>   void(*timer)(struct rpc_xprt *xprt, struct rpc_task *task);
> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> index ea244b2..d3c1b1e 100644
> --- a/net/sunrpc/xprt.c
> +++ b/net/sunrpc/xprt.c
> @@ -502,6 +502,10 @@ void xprt_wait_for_buffer_space(struct rpc_task 
> *task, rpc_action action)
> 
>   task->tk_timeout = RPC_IS_SOFT(task) ? req->rq_timeout : 0;
> rpc_sleep_on(&xprt->pending, task, action);
> +
> + /* Write space notification may race with putting task to sleep. */
> + if (xprt->ops->have_write_space(xprt))
> + rpc_wake_up_queued_task(&xprt->pending, task);
> }
> EXPORT_SYMBOL_GPL(xprt_wait_for_buffer_space);
> 
> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> index bf16883..211de5b 100644
> --- a/net/sunrpc/xprtsock.c
> +++ b/net/sunrpc/xprtsock.c
> @@ -472,8 +472,6 @@ static int xs_nospace(struct rpc_task *task)
> 
>   spin_unlock_bh(&xprt->transport_lock);
> 
> - /* Race breaker in case memory is freed before above code is called */
> - sk->sk_write_space(sk);
>   return ret;
> }
 
 Instead of these callbacks, why not just add a call to
 sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk) after queueing the task in
 xs_nospace()? Won’t that fix the existing race breaker?
>>> 
>>> I don't see how that would help.  If sk->sk_write_space was already
>>> called, SOCKWQ_ASYNC_NOSPACE will still be clear and the next call to
>>> sk->sk_write_space will still be a nop.
>> 
>> Sorry. Copy+paste error. I meant SOCKWQ_ASYNC_NOSPACE.
>> 
>>> 
>>> Or did you mean SOCKWQ_ASYNC_NOSPACE here?  It doesn't seem right to set
>>> this bit when we don't know if there's space or not.
>> 
>> Why not?
> 
> I prefer my solution because:
> 
> a) It obviously fixes the race (games with bits are less understandable).
> 
> b) It requires fewer atomic ops.
> 
> c) It doesn't require me to understand what the behaviour of the
> socket-internal SOCKWQ_ASYNC_NOSPACE bit is or should be.
> 
> d) I'm not sure I understand the objection to the additional
> have_write_space method -- it has simple, clear behaviour.
> 

I don’t see the point of adding 24 lines of code over 3 different files if the 
problem can be solved with 1 line of code.




Re: Modification to skb->queue_mapping affecting performance

2016-09-16 Thread Michael Ma
2016-09-15 17:51 GMT-07:00 Michael Ma :
> 2016-09-14 10:46 GMT-07:00 Michael Ma :
>> 2016-09-13 22:22 GMT-07:00 Eric Dumazet :
>>> On Tue, 2016-09-13 at 22:13 -0700, Michael Ma wrote:
>>>
 I don't intend to install multiple qdisc - the only reason that I'm
 doing this now is to leverage MQ to workaround the lock contention,
 and based on the profile this all worked. However to simplify the way
 to setup HTB I wanted to use TXQ to partition HTB classes so that a
 HTB class only belongs to one TXQ, which also requires mapping skb to
 TXQ using some rules (here I'm using priority but I assume it's
 straightforward to use other information such as classid). And the
 problem I found here is that when using priority to infer the TXQ so
 that queue_mapping is changed, bandwidth is affected significantly -
 the only thing I can guess is that due to queue switch, there are more
 cache misses assuming processor cores have a static mapping to all the
 queues. Any suggestion on what to do next for the investigation?

 I would also guess that this should be a common problem if anyone
 wants to use MQ+IFB to workaround the qdisc lock contention on the
 receiver side and classful qdisc is used on IFB, but haven't really
 found a similar thread here...
>>>
>>> But why are you changing the queue ?
>>>
>>> NIC already does the proper RSS thing, meaning all packets of one flow
>>> should land on one RX queue. No need to ' classify yourself and risk
>>> lock contention'
>>>
>>> I use IFB + MQ + netem every day, and it scales to 10 Mpps with no
>>> problem.
>>>
>>> Do you really need to rate limit flows ? Not clear what are your goals,
>>> why for example you use HTB to begin with.
>>>
>> Yes. My goal is to set different min/max bandwidth limits for
>> different processes, so we started with HTB. However with HTB the
>> qdisc root lock contention caused some unintended correlation between
>> flows in different classes. For example if some flows belonging to one
>> class have large amount of small packets, other flows in a different
>> class will get their effective bandwidth reduced because they'll wait
>> longer for the root lock. Using MQ this can be avoided because I'll
>> just put flows belonging to one class to its dedicated TXQ. Then
>> classes within one HTB on a TXQ will still have the lock contention
>> problem but classes in different HTB will use different root locks so
>> the contention doesn't exist.
>>
>> This also means that I'll need to classify packets to different
>> TXQ/HTB based on some skb metadata (essentially similar to what mqprio
>> is doing). So TXQ might need to be switched to achieve this.
>
> My current theory to this problem is that tasklets in IFB might be
> scheduled to the same cpu core if the RXQ happens to be the same for
> two different flows. When queue_mapping is modified and multiple flows
> are concentrated to the same IFB TXQ because they need to be
> controlled by the same HTB, they'll have to use the same tasklet
> because of the way IFB is implemented. So if other flows belonging to
> a different TXQ/tasklet happens to be scheduled on the same core, that
> core can be overloaded and becomes the bottleneck. Without modifying
> the queue_mapping the chance of this contention is much lower.
>
> This is a speculation based on the increased si time in softirqd
> process. I'll try to affinitize each tasklet with a cpu core to verify
> whether this is the problem. I also noticed that in the past there was
> a similar proposal of scheduling the tasklet to a dedicated core which
> was not committed(https://patchwork.ozlabs.org/patch/38486/). I'll try
> something similar to verify this theory.

This is actually the problem - if flows from different RX queues are
switched to the same RX queue in IFB, they'll use different processor
context with the same tasklet, and the processor context of different
tasklets might be the same. So multiple tasklets in IFB compete for
the same core when the queue is switched.

The following simple fix proved this - with this change even switching
the queue won't affect small packet bandwidth/latency anymore:

in ifb.c:

-   struct ifb_q_private *txp = dp->tx_private + skb_get_queue_mapping(skb);
+   struct ifb_q_private *txp = dp->tx_private +
+   (smp_processor_id() % dev->num_tx_queues);

This should be more efficient since we're not sending the task to a
different processor, instead we try to queue the packet to an
appropriate tasklet based on the processor ID. Will this cause any
packet out-of-order problem? If packets from the same flow are queued
to the same RX queue due to RSS, and processor affinity is set for RX
queues, I assume packets from the same flow will end up in the same
core when tasklet is scheduled. But I might have missed some uncommon
cases here... Would appreciate if anyone can provide more insights.


Re: [PATCHv3 next 3/3] ipvlan: Introduce l3s mode

2016-09-16 Thread महेश बंडेवार
On Thu, Sep 15, 2016 at 6:49 PM, David Ahern  wrote:
> On 9/15/16 6:14 PM, Mahesh Bandewar wrote:
>> diff --git a/drivers/net/ipvlan/ipvlan.h b/drivers/net/ipvlan/ipvlan.h
>> index 695a5dc9ace3..371f4548c42d 100644
>> --- a/drivers/net/ipvlan/ipvlan.h
>> +++ b/drivers/net/ipvlan/ipvlan.h
>> @@ -23,11 +23,13 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>
>>  #define IPVLAN_DRV   "ipvlan"
>>  #define IPV_DRV_VER  "0.1"
>> @@ -96,6 +98,7 @@ struct ipvl_port {
>>   struct work_struct  wq;
>>   struct sk_buff_head backlog;
>>   int count;
>> + boolhooks_attached;
>
> With a refcnt on the hook registration you don't need this bool and removing 
> simplifies the set_mode logic.
>
Not sure it simplifies the logic, but the fact that the mode changed can
be used instead of relying on the value of hooks_attached (so it's more
indirect).
>
>> diff --git a/drivers/net/ipvlan/ipvlan_main.c 
>> b/drivers/net/ipvlan/ipvlan_main.c
>> index 18b4e8c7f68a..aca690f41559 100644
>> --- a/drivers/net/ipvlan/ipvlan_main.c
>> +++ b/drivers/net/ipvlan/ipvlan_main.c
>
> 
>
>> +static void ipvlan_unregister_nf_hook(void)
>> +{
>> + BUG_ON(!ipvl_nf_hook_refcnt);
>
> not a panic() worthy issue. just a pr_warn or WARN_ON_ONCE should be ok.
>
Sure, I don't think it would ever hit considering that the RTNL mutex is
protecting all these updates (for now). But I would definitely prefer
something there (probably WARN_ON) as a protection.
>


Re: [PATCHv1] sunrpc: fix write space race causing stalls

2016-09-16 Thread David Vrabel
On 16/09/16 18:06, Trond Myklebust wrote:
> 
>> On Sep 16, 2016, at 12:41, David Vrabel  wrote:
>>
>> On 16/09/16 17:01, Trond Myklebust wrote:
>>>
 On Sep 16, 2016, at 08:28, David Vrabel  wrote:

 Write space becoming available may race with putting the task to sleep
 in xprt_wait_for_buffer_space().  The existing mechanism to avoid the
 race does not work.

 This (edited) partial trace illustrates the problem:

  [1] rpc_task_run_action: task:43546@5 ... action=call_transmit
  [2] xs_write_space <-xs_tcp_write_space
  [3] xprt_write_space <-xs_write_space
  [4] rpc_task_sleep: task:43546@5 ...
  [5] xs_write_space <-xs_tcp_write_space

 [1] Task 43546 runs but is out of write space.

 [2] Space becomes available, xs_write_space() clears the
   SOCKWQ_ASYNC_NOSPACE bit.

 [3] xprt_write_space() attemts to wake xprt->snd_task (== 43546), but
   this has not yet been queued and the wake up is lost.

 [4] xs_nospace() is called which calls xprt_wait_for_buffer_space()
   which queues task 43546.

 [5] The call to sk->sk_write_space() at the end of xs_nospace() (which
   is supposed to handle the above race) does not call
   xprt_write_space() as the SOCKWQ_ASYNC_NOSPACE bit is clear and
   thus the task is not woken.

 Fix the race by have xprt_wait_for_buffer_space() check for write
 space after putting the task to sleep.

 Signed-off-by: David Vrabel 
 ---
 include/linux/sunrpc/xprt.h |  1 +
 net/sunrpc/xprt.c   |  4 
 net/sunrpc/xprtsock.c   | 21 +++--
 3 files changed, 24 insertions(+), 2 deletions(-)

 diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
 index a16070d..621e74b 100644
 --- a/include/linux/sunrpc/xprt.h
 +++ b/include/linux/sunrpc/xprt.h
 @@ -129,6 +129,7 @@ struct rpc_xprt_ops {
void(*connect)(struct rpc_xprt *xprt, struct rpc_task 
 *task);
void *  (*buf_alloc)(struct rpc_task *task, size_t size);
void(*buf_free)(void *buffer);
 +  bool(*have_write_space)(struct rpc_xprt *task);
int (*send_request)(struct rpc_task *task);
void(*set_retrans_timeout)(struct rpc_task *task);
void(*timer)(struct rpc_xprt *xprt, struct rpc_task *task);
 diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
 index ea244b2..d3c1b1e 100644
 --- a/net/sunrpc/xprt.c
 +++ b/net/sunrpc/xprt.c
 @@ -502,6 +502,10 @@ void xprt_wait_for_buffer_space(struct rpc_task 
 *task, rpc_action action)

task->tk_timeout = RPC_IS_SOFT(task) ? req->rq_timeout : 0;
rpc_sleep_on(&xprt->pending, task, action);
 +
 +  /* Write space notification may race with putting task to sleep. */
 +  if (xprt->ops->have_write_space(xprt))
+  rpc_wake_up_queued_task(&xprt->pending, task);
 }
 EXPORT_SYMBOL_GPL(xprt_wait_for_buffer_space);

 diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
 index bf16883..211de5b 100644
 --- a/net/sunrpc/xprtsock.c
 +++ b/net/sunrpc/xprtsock.c
 @@ -472,8 +472,6 @@ static int xs_nospace(struct rpc_task *task)

spin_unlock_bh(&xprt->transport_lock);

 -  /* Race breaker in case memory is freed before above code is called */
 -  sk->sk_write_space(sk);
return ret;
 }
>>>
>>> Instead of these callbacks, why not just add a call to
>>> sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk) after queueing the task in
>>> xs_nospace()? Won’t that fix the existing race breaker?
>>
>> I don't see how that would help.  If sk->sk_write_space was already
>> called, SOCKWQ_ASYNC_NOSPACE will still be clear and the next call to
>> sk->sk_write_space will still be a nop.
> 
> Sorry. Copy+paste error. I meant SOCKWQ_ASYNC_NOSPACE.
> 
>>
>> Or did you mean SOCKWQ_ASYNC_NOSPACE here?  It doesn't seem right to set
>> this bit when we don't know if there's space or not.
> 
> Why not?

I prefer my solution because:

a) It obviously fixes the race (games with bits are less understandable).

b) It requires fewer atomic ops.

c) It doesn't require me to understand what the behaviour of the
socket-internal SOCKWQ_ASYNC_NOSPACE bit is or should be.

d) I'm not sure I understand the objection to the additional
have_write_space method -- it has simple, clear behaviour.

David



Re: [netfilter-core] [lkp] [netfilter] 68263ddb47: WARNING: CPU: 0 PID: 1225 at net/netfilter/nf_conntrack_seqadj.c:232 nf_ct_seq_offset+0x7a/0x9a

2016-09-16 Thread Florian Westphal
Gao Feng  wrote:
> > [   23.465616] [ cut here ]
> > [   23.466477] WARNING: CPU: 0 PID: 1225 at 
> > net/netfilter/nf_conntrack_seqadj.c:232
> > nf_ct_seq_offset+0x7a/0x9a
> > [   23.468458] Missing nfct_seqadj_ext_add() setup call
> >
> 
> It should be that nf_ct_add_synproxy failed and the seqadj extension is
> not added.

Note that nfct_synproxy_ext_add always returns NULL if
CONFIG_NETFILTER_SYNPROXY=n

The warning should also be removed.

> When nf_ct_add_synproxy fails, init_conntrack fails too and returns
> ERR_PTR(-ENOMEM). In this case, the packet should be dropped directly, and
> should not be processed by the subsequent code.

This means the commit breaks conntrack if SYNPROXY=n


Re: [PATCHv1] sunrpc: fix write space race causing stalls

2016-09-16 Thread Trond Myklebust

> On Sep 16, 2016, at 12:41, David Vrabel  wrote:
> 
> On 16/09/16 17:01, Trond Myklebust wrote:
>> 
>>> On Sep 16, 2016, at 08:28, David Vrabel  wrote:
>>> 
>>> Write space becoming available may race with putting the task to sleep
>>> in xprt_wait_for_buffer_space().  The existing mechanism to avoid the
>>> race does not work.
>>> 
>>> This (edited) partial trace illustrates the problem:
>>> 
>>>  [1] rpc_task_run_action: task:43546@5 ... action=call_transmit
>>>  [2] xs_write_space <-xs_tcp_write_space
>>>  [3] xprt_write_space <-xs_write_space
>>>  [4] rpc_task_sleep: task:43546@5 ...
>>>  [5] xs_write_space <-xs_tcp_write_space
>>> 
>>> [1] Task 43546 runs but is out of write space.
>>> 
>>> [2] Space becomes available, xs_write_space() clears the
>>>   SOCKWQ_ASYNC_NOSPACE bit.
>>> 
>>> [3] xprt_write_space() attemts to wake xprt->snd_task (== 43546), but
>>>   this has not yet been queued and the wake up is lost.
>>> 
>>> [4] xs_nospace() is called which calls xprt_wait_for_buffer_space()
>>>   which queues task 43546.
>>> 
>>> [5] The call to sk->sk_write_space() at the end of xs_nospace() (which
>>>   is supposed to handle the above race) does not call
>>>   xprt_write_space() as the SOCKWQ_ASYNC_NOSPACE bit is clear and
>>>   thus the task is not woken.
>>> 
>>> Fix the race by have xprt_wait_for_buffer_space() check for write
>>> space after putting the task to sleep.
>>> 
>>> Signed-off-by: David Vrabel 
>>> ---
>>> include/linux/sunrpc/xprt.h |  1 +
>>> net/sunrpc/xprt.c   |  4 
>>> net/sunrpc/xprtsock.c   | 21 +++--
>>> 3 files changed, 24 insertions(+), 2 deletions(-)
>>> 
>>> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
>>> index a16070d..621e74b 100644
>>> --- a/include/linux/sunrpc/xprt.h
>>> +++ b/include/linux/sunrpc/xprt.h
>>> @@ -129,6 +129,7 @@ struct rpc_xprt_ops {
>>> void(*connect)(struct rpc_xprt *xprt, struct rpc_task 
>>> *task);
>>> void *  (*buf_alloc)(struct rpc_task *task, size_t size);
>>> void(*buf_free)(void *buffer);
>>> +   bool(*have_write_space)(struct rpc_xprt *task);
>>> int (*send_request)(struct rpc_task *task);
>>> void(*set_retrans_timeout)(struct rpc_task *task);
>>> void(*timer)(struct rpc_xprt *xprt, struct rpc_task *task);
>>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>>> index ea244b2..d3c1b1e 100644
>>> --- a/net/sunrpc/xprt.c
>>> +++ b/net/sunrpc/xprt.c
>>> @@ -502,6 +502,10 @@ void xprt_wait_for_buffer_space(struct rpc_task *task, 
>>> rpc_action action)
>>> 
>>> task->tk_timeout = RPC_IS_SOFT(task) ? req->rq_timeout : 0;
>>> rpc_sleep_on(&xprt->pending, task, action);
>>> +
>>> +   /* Write space notification may race with putting task to sleep. */
>>> +   if (xprt->ops->have_write_space(xprt))
>>> +   rpc_wake_up_queued_task(&xprt->pending, task);
>>> }
>>> EXPORT_SYMBOL_GPL(xprt_wait_for_buffer_space);
>>> 
>>> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
>>> index bf16883..211de5b 100644
>>> --- a/net/sunrpc/xprtsock.c
>>> +++ b/net/sunrpc/xprtsock.c
>>> @@ -472,8 +472,6 @@ static int xs_nospace(struct rpc_task *task)
>>> 
>>> spin_unlock_bh(&xprt->transport_lock);
>>> 
>>> -   /* Race breaker in case memory is freed before above code is called */
>>> -   sk->sk_write_space(sk);
>>> return ret;
>>> }
>> 
>> Instead of these callbacks, why not just add a call to
>> sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk) after queueing the task in
>> xs_nospace()? Won’t that fix the existing race breaker?
> 
> I don't see how that would help.  If sk->sk_write_space was already
> called, SOCKWQ_ASYNC_NOSPACE will still be clear and the next call to
> sk->sk_write_space will still be a nop.

Sorry. Copy+paste error. I meant SOCKWQ_ASYNC_NOSPACE.

> 
> Or did you mean SOCKWQ_ASYNC_NOSPACE here?  It doesn't seem right to set
> this bit when we don't know if there's space or not.

Why not?




Re: [PATCH][V2] net: r6040: add in missing white space in error message text

2016-09-16 Thread Florian Fainelli
On 09/16/2016 02:43 AM, Colin King wrote:
> From: Colin Ian King 
> 
> A couple of dev_err messages span two lines and the literal
> string is missing a white space between words. Add the white
> space and join the two lines into one.
> 
> Signed-off-by: Colin Ian King 

Acked-by: Florian Fainelli 

-- 
Florian


RE: [PATCH v4 00/16] Add Paravirtual RDMA Driver

2016-09-16 Thread Woodruff, Robert J
Jason wrote, 
>I should be clearer here. I am *strongly* opposed to anything that changes the 
>license of the existing 4 core libraries away from the
>GPLv2 or OpenIB.org situation we have today. (that includes to other varients 
>of the BSD license)

>I just checked and we appear to be completely OK on this point today.

Ok, thanks




Re: [PATCHv1] sunrpc: fix write space race causing stalls

2016-09-16 Thread David Vrabel
On 16/09/16 17:01, Trond Myklebust wrote:
> 
>> On Sep 16, 2016, at 08:28, David Vrabel  wrote:
>>
>> Write space becoming available may race with putting the task to sleep
>> in xprt_wait_for_buffer_space().  The existing mechanism to avoid the
>> race does not work.
>>
>> This (edited) partial trace illustrates the problem:
>>
>>   [1] rpc_task_run_action: task:43546@5 ... action=call_transmit
>>   [2] xs_write_space <-xs_tcp_write_space
>>   [3] xprt_write_space <-xs_write_space
>>   [4] rpc_task_sleep: task:43546@5 ...
>>   [5] xs_write_space <-xs_tcp_write_space
>>
>> [1] Task 43546 runs but is out of write space.
>>
>> [2] Space becomes available, xs_write_space() clears the
>>SOCKWQ_ASYNC_NOSPACE bit.
>>
>> [3] xprt_write_space() attemts to wake xprt->snd_task (== 43546), but
>>this has not yet been queued and the wake up is lost.
>>
>> [4] xs_nospace() is called which calls xprt_wait_for_buffer_space()
>>which queues task 43546.
>>
>> [5] The call to sk->sk_write_space() at the end of xs_nospace() (which
>>is supposed to handle the above race) does not call
>>xprt_write_space() as the SOCKWQ_ASYNC_NOSPACE bit is clear and
>>thus the task is not woken.
>>
>> Fix the race by have xprt_wait_for_buffer_space() check for write
>> space after putting the task to sleep.
>>
>> Signed-off-by: David Vrabel 
>> ---
>> include/linux/sunrpc/xprt.h |  1 +
>> net/sunrpc/xprt.c   |  4 
>> net/sunrpc/xprtsock.c   | 21 +++--
>> 3 files changed, 24 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
>> index a16070d..621e74b 100644
>> --- a/include/linux/sunrpc/xprt.h
>> +++ b/include/linux/sunrpc/xprt.h
>> @@ -129,6 +129,7 @@ struct rpc_xprt_ops {
>>  void(*connect)(struct rpc_xprt *xprt, struct rpc_task 
>> *task);
>>  void *  (*buf_alloc)(struct rpc_task *task, size_t size);
>>  void(*buf_free)(void *buffer);
>> +bool(*have_write_space)(struct rpc_xprt *task);
>>  int (*send_request)(struct rpc_task *task);
>>  void(*set_retrans_timeout)(struct rpc_task *task);
>>  void(*timer)(struct rpc_xprt *xprt, struct rpc_task *task);
>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>> index ea244b2..d3c1b1e 100644
>> --- a/net/sunrpc/xprt.c
>> +++ b/net/sunrpc/xprt.c
>> @@ -502,6 +502,10 @@ void xprt_wait_for_buffer_space(struct rpc_task *task, 
>> rpc_action action)
>>
>>  task->tk_timeout = RPC_IS_SOFT(task) ? req->rq_timeout : 0;
>>  rpc_sleep_on(&xprt->pending, task, action);
>> +
>> +/* Write space notification may race with putting task to sleep. */
>> +if (xprt->ops->have_write_space(xprt))
>> +rpc_wake_up_queued_task(&xprt->pending, task);
>> }
>> EXPORT_SYMBOL_GPL(xprt_wait_for_buffer_space);
>>
>> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
>> index bf16883..211de5b 100644
>> --- a/net/sunrpc/xprtsock.c
>> +++ b/net/sunrpc/xprtsock.c
>> @@ -472,8 +472,6 @@ static int xs_nospace(struct rpc_task *task)
>>
>>  spin_unlock_bh(&xprt->transport_lock);
>>
>> -/* Race breaker in case memory is freed before above code is called */
>> -sk->sk_write_space(sk);
>>  return ret;
>> }
> 
> Instead of these callbacks, why not just add a call to
> sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk) after queueing the task in
> xs_nospace()? Won’t that fix the existing race breaker?

I don't see how that would help.  If sk->sk_write_space was already
called, SOCKWQ_ASYNC_NOSPACE will still be clear and the next call to
sk->sk_write_space will still be a nop.

Or did you mean SOCKWQ_ASYNC_NOSPACE here?  It doesn't seem right to set
this bit when we don't know if there's space or not.

David
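
The ordering in the fix is the classic lost-wakeup discipline: register the
waiter first, then re-check the condition, so a notification that fired in
between is caught instead of lost. A minimal userspace sketch of the same
idea (an illustration with pthreads, not the sunrpc code itself):

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool write_space; /* analogue of "socket has send buffer space" */

/* Waiter: analogue of xprt_wait_for_buffer_space() after the fix.
 * The condition is re-checked after the waiter is registered, so a
 * wakeup racing with registration is seen instead of lost. */
static void wait_for_space(void)
{
	pthread_mutex_lock(&lock);
	while (!write_space)
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);
}

/* Notifier: analogue of xs_write_space(). */
static void space_available(void)
{
	pthread_mutex_lock(&lock);
	write_space = true;
	pthread_cond_signal(&cond);
	pthread_mutex_unlock(&lock);
}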


Re: [PATCH v4 00/16] Add Paravirtual RDMA Driver

2016-09-16 Thread Jason Gunthorpe
On Wed, Sep 14, 2016 at 04:59:10PM -0600, Jason Gunthorpe wrote:

> > package follows that licensing model for accepting any new code into
> > that combined repo ?
> 
> As with the kernel we'd discourage 're-licensing' existing files.
> 
> However, since this is not an OFA project, I, personally, would not
> turn away a GPLv2 compatible contribution, but I am proposing that the
> 'default' license for the project be OFA compatible.

I should be clearer here. I am *strongly* opposed to anything that
changes the license of the existing 4 core libraries away from the
GPLv2 or OpenIB.org situation we have today. (that includes other
variants of the BSD license)

I just checked and we appear to be completely OK on this point today.

Jason


Re: [PATCH] net: ipv6: Disable forwarding per interface via sysctl

2016-09-16 Thread Mike Manning
On 09/16/2016 04:46 PM, Hannes Frederic Sowa wrote:
> On 16.09.2016 15:39, Eric Dumazet wrote:
>> On Fri, 2016-09-16 at 13:47 +0100, Mike Manning wrote:
>>> Disabling forwarding per interface via sysctl continues to allow
>>> forwarding. This is contrary to the sysctl documentation stating that
>>> the forwarding sysctl is per interface, whereas currently it is only
>>> the sysctl for all interfaces that has an effect on forwarding. The
>>> solution is to drop any received packets instead of forwarding them
>>> if the ingress device has a per-device forwarding sysctl that is unset.
>>
>> Some archaeological research might be needed because
>> Documentation/networking/ip-sysctl.txt states :
>>
>> IPv4 and IPv6 work differently here; e.g. netfilter must be used
>> to control which interfaces may forward packets and which not.
>>
>> If this netfilter requirement is obsolete, then your patch would need to
>> change the doc as well.
>>
>> Hannes can probably comment on this ?
> 
> Yep, thanks.
> 
> This commit breaks a very common setup: people globally enabled
> forwarding but disabled the forwarding knob on one special interface to
> allow this interface to participate in auto configuration from their
> provider while still forwarding packets over this interface.
> 
> I fear this is so common that this would be a uapi violation.
> 
> Thanks,
> Hannes
> 
> 
Thanks for the use-case; I request to withdraw this patch then.
So configuring an interface on a router to be in host mode does not actually
disable forwarding in the kernel, it merely allows SLAAC. Using ip6tables to
disable forwarding on an interface when one wants that interface in host mode
seems a heavyweight workaround. If anyone has any better suggestions, please
let me know.


Re: [PATCHv1] sunrpc: fix write space race causing stalls

2016-09-16 Thread Trond Myklebust

> On Sep 16, 2016, at 08:28, David Vrabel  wrote:
> 
> Write space becoming available may race with putting the task to sleep
> in xprt_wait_for_buffer_space().  The existing mechanism to avoid the
> race does not work.
> 
> This (edited) partial trace illustrates the problem:
> 
>   [1] rpc_task_run_action: task:43546@5 ... action=call_transmit
>   [2] xs_write_space <-xs_tcp_write_space
>   [3] xprt_write_space <-xs_write_space
>   [4] rpc_task_sleep: task:43546@5 ...
>   [5] xs_write_space <-xs_tcp_write_space
> 
> [1] Task 43546 runs but is out of write space.
> 
> [2] Space becomes available, xs_write_space() clears the
>SOCKWQ_ASYNC_NOSPACE bit.
> 
> [3] xprt_write_space() attempts to wake xprt->snd_task (== 43546), but
>this has not yet been queued and the wake up is lost.
> 
> [4] xs_nospace() is called which calls xprt_wait_for_buffer_space()
>which queues task 43546.
> 
> [5] The call to sk->sk_write_space() at the end of xs_nospace() (which
>is supposed to handle the above race) does not call
>xprt_write_space() as the SOCKWQ_ASYNC_NOSPACE bit is clear and
>thus the task is not woken.
> 
> Fix the race by having xprt_wait_for_buffer_space() check for write
> space after putting the task to sleep.
> 
> Signed-off-by: David Vrabel 
> ---
> include/linux/sunrpc/xprt.h |  1 +
> net/sunrpc/xprt.c   |  4 
> net/sunrpc/xprtsock.c   | 21 +++--
> 3 files changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
> index a16070d..621e74b 100644
> --- a/include/linux/sunrpc/xprt.h
> +++ b/include/linux/sunrpc/xprt.h
> @@ -129,6 +129,7 @@ struct rpc_xprt_ops {
>   void(*connect)(struct rpc_xprt *xprt, struct rpc_task 
> *task);
>   void *  (*buf_alloc)(struct rpc_task *task, size_t size);
>   void(*buf_free)(void *buffer);
> + bool(*have_write_space)(struct rpc_xprt *task);
>   int (*send_request)(struct rpc_task *task);
>   void(*set_retrans_timeout)(struct rpc_task *task);
>   void(*timer)(struct rpc_xprt *xprt, struct rpc_task *task);
> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> index ea244b2..d3c1b1e 100644
> --- a/net/sunrpc/xprt.c
> +++ b/net/sunrpc/xprt.c
> @@ -502,6 +502,10 @@ void xprt_wait_for_buffer_space(struct rpc_task *task, 
> rpc_action action)
> 
>   task->tk_timeout = RPC_IS_SOFT(task) ? req->rq_timeout : 0;
>   rpc_sleep_on(&xprt->pending, task, action);
> +
> + /* Write space notification may race with putting task to sleep. */
> + if (xprt->ops->have_write_space(xprt))
> + rpc_wake_up_queued_task(&xprt->pending, task);
> }
> EXPORT_SYMBOL_GPL(xprt_wait_for_buffer_space);
> 
> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> index bf16883..211de5b 100644
> --- a/net/sunrpc/xprtsock.c
> +++ b/net/sunrpc/xprtsock.c
> @@ -472,8 +472,6 @@ static int xs_nospace(struct rpc_task *task)
> 
>   spin_unlock_bh(&xprt->transport_lock);
> 
> - /* Race breaker in case memory is freed before above code is called */
> - sk->sk_write_space(sk);
>   return ret;
> }

Instead of these callbacks, why not just add a call to 
sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk) after queueing the task in xs_nospace()? 
Won’t that fix the existing race breaker?

Cheers
  Trond



Re: [PATCH] net: ipv6: Disable forwarding per interface via sysctl

2016-09-16 Thread Hannes Frederic Sowa
On 16.09.2016 15:39, Eric Dumazet wrote:
> On Fri, 2016-09-16 at 13:47 +0100, Mike Manning wrote:
>> Disabling forwarding per interface via sysctl continues to allow
>> forwarding. This is contrary to the sysctl documentation stating that
>> the forwarding sysctl is per interface, whereas currently it is only
>> the sysctl for all interfaces that has an effect on forwarding. The
>> solution is to drop any received packets instead of forwarding them
>> if the ingress device has a per-device forwarding sysctl that is unset.
> 
> Some archaeological research might be needed because
> Documentation/networking/ip-sysctl.txt states :
> 
> IPv4 and IPv6 work differently here; e.g. netfilter must be used
> to control which interfaces may forward packets and which not.
> 
> If this netfilter requirement is obsolete, then your patch would need to
> change the doc as well.
> 
> Hannes can probably comment on this ?

Yep, thanks.

This commit breaks a very common setup: people globally enabled
forwarding but disabled the forwarding knob on one special interface to
allow this interface to participate in auto configuration from their
provider while still forwarding packets over this interface.

I fear this is so common that this would be a uapi violation.

Thanks,
Hannes




Re: [PATCH v2] iproute2: build nsid-name cache only for commands that need it

2016-09-16 Thread Nicolas Dichtel
On 16/09/2016 at 15:18, Anton Aksola wrote:
[snip]
> Nicolas,
> This seems to be caused by netns_add calling unshare(CLONE_NEWNET).
> If we initialize the socket for nsid after that it doesn't seem to work.
> 
> Unfortunately I'm not an expert in these details. Should we separate the
> socket and cache initialization to different functions and call the
> socket init in the beginning of do_netns() as before? What do you think?
Seems good.

> I made a quick patch and it seems to work in batch mode now.
What is the result of that sequence?

$ ip netns add bar
$ ip netns set bar 5678
$ ip -b test.batch
nsid 1234 (iproute2 netns name: foo)
nsid 5678 (iproute2 netns name: bar)
$


[PATCH 0/2] cxgb4 FR_NSMR_TPTE_WR support

2016-09-16 Thread Steve Wise
This series enables a new work request to optimize small REG_MR
operations.  This is intended for 4.9.  If everyone agrees, I suggest
Doug take both the cxgb4 and iw_cxgb4 patches through his tree.

Thanks,

Steve.

---

Steve Wise (2):
  cxgb4: advertise support for FR_NSMR_TPTE_WR
  iw_cxgb4: add fast-path for small REG_MR operations

 drivers/infiniband/hw/cxgb4/cq.c| 17 +++
 drivers/infiniband/hw/cxgb4/mem.c   |  2 +-
 drivers/infiniband/hw/cxgb4/qp.c| 67 +
 drivers/infiniband/hw/cxgb4/t4.h|  4 +-
 drivers/infiniband/hw/cxgb4/t4fw_ri_api.h   | 12 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h  |  1 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |  7 +++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h  |  1 +
 drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h   |  2 +
 9 files changed, 102 insertions(+), 11 deletions(-)

-- 
2.7.0



[PATCH 1/2] cxgb4: advertise support for FR_NSMR_TPTE_WR

2016-09-16 Thread Steve Wise
Query firmware for the FW_PARAMS_PARAM_DEV_RI_FR_NSMR_TPTE_WR parameter.
If it exists and is 1, then advertise support for FR_NSMR_TPTE_WR to
the ULDs.

Signed-off-by: Steve Wise 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h  | 1 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 7 +++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h  | 1 +
 drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h   | 1 +
 4 files changed, 10 insertions(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 2e2aa9f..65207b3 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -346,6 +346,7 @@ struct adapter_params {
 
unsigned int max_ordird_qp;   /* Max read depth per RDMA QP */
unsigned int max_ird_adapter; /* Max read depth per adapter */
+   bool fr_nsmr_tpte_wr_support; /* FW support for FR_NSMR_TPTE_WR */
 };
 
 /* State needed to monitor the forward progress of SGE Ingress DMA activities
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index c762a8c..37e0c82 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -2517,6 +2517,7 @@ static void uld_attach(struct adapter *adap, unsigned int 
uld)
lli.max_ird_adapter = adap->params.max_ird_adapter;
lli.ulptx_memwrite_dsgl = adap->params.ulptx_memwrite_dsgl;
lli.nodeid = dev_to_node(adap->pdev_dev);
+   lli.fr_nsmr_tpte_wr_support = adap->params.fr_nsmr_tpte_wr_support;
 
handle = ulds[uld].add(&lli);
if (IS_ERR(handle)) {
@@ -4016,6 +4017,12 @@ static int adap_init0(struct adapter *adap)
adap->params.ulptx_memwrite_dsgl = (ret == 0 && val[0] != 0);
}
 
+   /* See if FW supports FW_RI_FR_NSMR_TPTE_WR work request */
+   params[0] = FW_PARAM_DEV(RI_FR_NSMR_TPTE_WR);
+   ret = t4_query_params(adap, adap->mbox, adap->pf, 0,
+ 1, params, val);
+   adap->params.fr_nsmr_tpte_wr_support = (ret == 0 && val[0] != 0);
+
/*
 * Get device capabilities so we can determine what resources we need
 * to manage.
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h
index f3c58aa..42e73f7 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h
@@ -280,6 +280,7 @@ struct cxgb4_lld_info {
unsigned int iscsi_llimit;   /* chip's iscsi region llimit */
void **iscsi_ppm;/* iscsi page pod manager */
int nodeid;  /* device numa node id */
+   bool fr_nsmr_tpte_wr_support;/* FW supports FR_NSMR_TPTE_WR */
 };
 
 struct cxgb4_uld_info {
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h 
b/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h
index a89b307..9164d20 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h
@@ -1119,6 +1119,7 @@ enum fw_params_param_dev {
FW_PARAMS_PARAM_DEV_MAXIRD_ADAPTER = 0x14, /* max supported adap IRD */
FW_PARAMS_PARAM_DEV_ULPTX_MEMWRITE_DSGL = 0x17,
FW_PARAMS_PARAM_DEV_FWCACHE = 0x18,
+   FW_PARAMS_PARAM_DEV_RI_FR_NSMR_TPTE_WR  = 0x1C,
 };
 
 /*
-- 
2.7.0



[PATCH 2/2] iw_cxgb4: add fast-path for small REG_MR operations

2016-09-16 Thread Steve Wise
When processing a REG_MR work request, if fw supports the
FW_RI_NSMR_TPTE_WR work request, and if the page list for this
registration is <= 2 pages, and the current state of the mr is INVALID,
then use FW_RI_NSMR_TPTE_WR to pass down a fully populated TPTE for FW
to write.  This avoids FW having to do an async read of the TPTE blocking
the SQ until the read completes.

To know if the current MR state is INVALID or not, iw_cxgb4 must track the
state of each fastreg MR.  The c4iw_mr struct state is updated as REG_MR
and LOCAL_INV WRs are posted and completed, when a reg_mr is destroyed,
and when RECV completions are processed that include a local invalidation.

This optimization increases small IO IOPS for both iSER and NVMF.

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/cq.c  | 17 +++
 drivers/infiniband/hw/cxgb4/mem.c |  2 +-
 drivers/infiniband/hw/cxgb4/qp.c  | 67 +++
 drivers/infiniband/hw/cxgb4/t4.h  |  4 +-
 drivers/infiniband/hw/cxgb4/t4fw_ri_api.h | 12 +
 drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h |  1 +
 6 files changed, 92 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
index ac926c9..867b8cf 100644
--- a/drivers/infiniband/hw/cxgb4/cq.c
+++ b/drivers/infiniband/hw/cxgb4/cq.c
@@ -666,6 +666,18 @@ skip_cqe:
return ret;
 }
 
+static void invalidate_mr(struct c4iw_dev *rhp, u32 rkey)
+{
+   struct c4iw_mr *mhp;
+   unsigned long flags;
+
+   spin_lock_irqsave(&rhp->lock, flags);
+   mhp = get_mhp(rhp, rkey >> 8);
+   if (mhp)
+   mhp->attr.state = 0;
+   spin_unlock_irqrestore(&rhp->lock, flags);
+}
+
 /*
  * Get one cq entry from c4iw and map it to openib.
  *
@@ -721,6 +733,7 @@ static int c4iw_poll_cq_one(struct c4iw_cq *chp, struct 
ib_wc *wc)
if (CQE_OPCODE(&cqe) == FW_RI_SEND_WITH_INV ||
    CQE_OPCODE(&cqe) == FW_RI_SEND_WITH_SE_INV) {
wc->ex.invalidate_rkey = CQE_WRID_STAG(&cqe);
wc->wc_flags |= IB_WC_WITH_INVALIDATE;
+   invalidate_mr(qhp->rhp, wc->ex.invalidate_rkey);
}
} else {
switch (CQE_OPCODE(&cqe)) {
@@ -746,6 +759,10 @@ static int c4iw_poll_cq_one(struct c4iw_cq *chp, struct 
ib_wc *wc)
break;
case FW_RI_FAST_REGISTER:
wc->opcode = IB_WC_REG_MR;
+
+   /* Invalidate the MR if the fastreg failed */
+   if (CQE_STATUS(&cqe) != T4_ERR_SUCCESS)
+   invalidate_mr(qhp->rhp, CQE_WRID_FR_STAG(&cqe));
break;
default:
printk(KERN_ERR MOD "Unexpected opcode %d "
diff --git a/drivers/infiniband/hw/cxgb4/mem.c 
b/drivers/infiniband/hw/cxgb4/mem.c
index 0b91b0f..80e2774 100644
--- a/drivers/infiniband/hw/cxgb4/mem.c
+++ b/drivers/infiniband/hw/cxgb4/mem.c
@@ -695,7 +695,7 @@ struct ib_mr *c4iw_alloc_mr(struct ib_pd *pd,
mhp->attr.pdid = php->pdid;
mhp->attr.type = FW_RI_STAG_NSMR;
mhp->attr.stag = stag;
-   mhp->attr.state = 1;
+   mhp->attr.state = 0;
mmid = (stag) >> 8;
mhp->ibmr.rkey = mhp->ibmr.lkey = stag;
if (insert_handle(rhp, >mmidr, mhp, mmid)) {
diff --git a/drivers/infiniband/hw/cxgb4/qp.c b/drivers/infiniband/hw/cxgb4/qp.c
index edb1172..3467b90 100644
--- a/drivers/infiniband/hw/cxgb4/qp.c
+++ b/drivers/infiniband/hw/cxgb4/qp.c
@@ -609,10 +609,42 @@ static int build_rdma_recv(struct c4iw_qp *qhp, union 
t4_recv_wr *wqe,
return 0;
 }
 
+static void build_tpte_memreg(struct fw_ri_fr_nsmr_tpte_wr *fr,
+ struct ib_reg_wr *wr, struct c4iw_mr *mhp,
+ u8 *len16)
+{
+   __be64 *p = (__be64 *)fr->pbl;
+
+   fr->r2 = cpu_to_be32(0);
+   fr->stag = cpu_to_be32(mhp->ibmr.rkey);
+
+   fr->tpte.valid_to_pdid = cpu_to_be32(FW_RI_TPTE_VALID_F |
+   FW_RI_TPTE_STAGKEY_V((mhp->ibmr.rkey & FW_RI_TPTE_STAGKEY_M)) |
+   FW_RI_TPTE_STAGSTATE_V(1) |
+   FW_RI_TPTE_STAGTYPE_V(FW_RI_STAG_NSMR) |
+   FW_RI_TPTE_PDID_V(mhp->attr.pdid));
+   fr->tpte.locread_to_qpid = cpu_to_be32(
+   FW_RI_TPTE_PERM_V(c4iw_ib_to_tpt_access(wr->access)) |
+   FW_RI_TPTE_ADDRTYPE_V(FW_RI_VA_BASED_TO) |
+   FW_RI_TPTE_PS_V(ilog2(wr->mr->page_size) - 12));
+   fr->tpte.nosnoop_pbladdr = cpu_to_be32(FW_RI_TPTE_PBLADDR_V(
PBL_OFF(&mhp->rhp->rdev, mhp->attr.pbl_addr) >> 3));
+   fr->tpte.dca_mwbcnt_pstag = cpu_to_be32(0);
+   fr->tpte.len_hi = cpu_to_be32(0);
+   fr->tpte.len_lo = cpu_to_be32(mhp->ibmr.length);
+   fr->tpte.va_hi = cpu_to_be32(mhp->ibmr.iova >> 32);
fr->tpte.va_lo_fbo = cpu_to_be32(mhp->ibmr.iova & 0xffffffff);
+
+   p[0] = cpu_to_be64((u64)mhp->mpl[0]);
p[1] = cpu_to_be64((u64)mhp->mpl[1]);

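The post_send dispatch that chooses between the old FW_RI_FAST_REGISTER
path and the new TPTE write is cut off above; going by the commit message,
the criterion would look roughly like this (a sketch: the field names — the
lldi capability flag, mpl_len, attr.state — are taken from the surrounding
driver code and should be treated as illustrative):

/* Use the fast path only when FW supports it, the page list is small,
 * and the MR is currently in the INVALID state. */
static bool use_tpte_fastpath(struct c4iw_qp *qhp, struct c4iw_mr *mhp)
{
	return qhp->rhp->rdev.lldi.fr_nsmr_tpte_wr_support && /* FW support */
	       mhp->mpl_len <= 2 &&               /* page list <= 2 pages */
	       mhp->attr.state == 0;              /* MR currently INVALID */
}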
[lkp] [netfilter] 68263ddb47: WARNING: CPU: 0 PID: 1225 at net/netfilter/nf_conntrack_seqadj.c:232 nf_ct_seq_offset+0x7a/0x9a

2016-09-16 Thread kernel test robot

FYI, we noticed the following commit:

https://github.com/0day-ci/linux 
fgao-ikuai8-com/netfilter-seqadj-Fix-some-possible-panics-of-seqadj-when-mem-is-exhausted/20160902-095727
commit 68263ddb4777cc996868498e3d56f616851966d2 ("netfilter: seqadj: Fix some 
possible panics of seqadj when mem is exhausted")

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -m 320M

caused below changes:


+------------------------------------------------------------------+------------+------------+
|                                                                  | c73c248490 | 68263ddb47 |
+------------------------------------------------------------------+------------+------------+
| boot_successes                                                   | 7          | 0          |
| boot_failures                                                    | 7          | 14         |
| BUG:kernel_reboot-without-warning_in_test_stage                  | 7          |            |
| WARNING:at_net/netfilter/nf_conntrack_seqadj.c:#nf_ct_seq_offset | 0          | 14         |
| calltrace:SyS_connect                                            | 0          | 14         |
| calltrace:SyS_socketcall                                         | 0          | 14         |
| invoked_oom-killer:gfp_mask=0x                                   | 0          | 1          |
| Mem-Info                                                         | 0          | 1          |
+------------------------------------------------------------------+------------+------------+



[   22.475640] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: 
RX
[   22.602089] Kernel tests: Boot OK!
[   23.465616] [ cut here ]
[   23.466477] WARNING: CPU: 0 PID: 1225 at 
net/netfilter/nf_conntrack_seqadj.c:232 nf_ct_seq_offset+0x7a/0x9a
[   23.468458] Missing nfct_seqadj_ext_add() setup call
[   23.469319] CPU: 0 PID: 1225 Comm: busybox Not tainted 
4.8.0-rc2-00241-g68263dd #1
[   23.470629] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Debian-1.8.2-1 04/01/2014
[   23.472138]   8ca4db58 8ca4db28 8a045c76 8ca4db44 89e4a2a3 00e8 
8a97c7b6
[   23.491476]  0001  0001 8ca4db60 89e4a2e7 0009  
8ca4db58
[   23.493027]  8b2c55b0 8ca4db74 8ca4db80 8a97c7b6 8b2c55da 00e8 8b2c55b0 
8e358000
[   23.494559] Call Trace:
[   23.495013]  [<8a045c76>] dump_stack+0x16/0x18
[   23.495777]  [<89e4a2a3>] __warn+0xaa/0xc1
[   23.496500]  [<8a97c7b6>] ? nf_ct_seq_offset+0x7a/0x9a
[   23.497392]  [<89e4a2e7>] warn_slowpath_fmt+0x2d/0x32
[   23.498270]  [<8a97c7b6>] nf_ct_seq_offset+0x7a/0x9a
[   23.499134]  [<8a97b2a3>] tcp_packet+0x63f/0xb93
[   23.499924]  [<89f39167>] ? cache_alloc_refill+0x203/0x80a
[   23.500896]  [<89e6863e>] ? preempt_count_sub+0x8f/0xd4
[   23.501824]  [<8a976d99>] ? nf_conntrack_in+0x2f9/0x53e
[   23.502747]  [<8a976dc2>] ? nf_conntrack_in+0x322/0x53e
[   23.521594]  [<89e6863e>] ? preempt_count_sub+0x8f/0xd4
[   23.522508]  [<8a976f5b>] nf_conntrack_in+0x4bb/0x53e
[   23.523399]  [<8a9e6a9a>] ipv4_conntrack_local+0x40/0x48
[   23.524338]  [<8a9729c5>] nf_iterate+0x3b/0x8e
[   23.525128]  [<8a972a5f>] nf_hook_slow+0x47/0xc7
[   23.525925]  [<8a9a47f1>] __ip_local_out+0xdb/0xea
[   23.526766]  [<8a9a3750>] ? ip_options_rcv_srr+0x30e/0x30e
[   23.527785]  [<8a9a4818>] ip_local_out+0x18/0x9e
[   23.528600]  [<8a9a4dd8>] ip_queue_xmit+0x395/0x41e
[   23.529469]  [<8a9b93cb>] tcp_transmit_skb+0x701/0x740
[   23.530420]  [<8a9bbcba>] tcp_connect+0x65a/0x6be
[   23.531247]  [<8a9bd9e3>] tcp_v4_connect+0x41d/0x45c
[   23.532115]  [<8a9d1a6c>] __inet_stream_connect+0x77/0x27e
[   23.533107]  [<89e6863e>] ? preempt_count_sub+0x8f/0xd4
[   23.534018]  [<89e4d8e1>] ? __local_bh_enable_ip+0xc4/0xea
[   23.534949]  [<8a9d1c9e>] inet_stream_connect+0x2b/0x3e
[   23.535862]  [<8a935255>] SyS_connect+0x77/0x9d
[   23.536662]  [<89f74959>] ? __fd_install+0x163/0x1c5
[   23.538079]  [<89e6863e>] ? preempt_count_sub+0x8f/0xd4





Thanks,
Xiaolong
#
# Automatically generated file; DO NOT EDIT.
# Linux/i386 4.8.0-rc2 Kernel Configuration
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf32-i386"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/i386_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_BITS_MAX=16
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y

[lkp] [net] 70a8118a03: BUG: workqueue leaked lock or atomic: kworker/0:1/0x00000000/28

2016-09-16 Thread kernel test robot

FYI, we noticed the following commit:

https://github.com/0day-ci/linux 
Christophe-JAILLET/net-inet-diag-Fix-an-error-handling/20160912-140503
commit 70a8118a03243de2aba508d79cc1a042db094191 ("net: inet: diag: Fix an error 
handling")

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -smp 2 -m 512M

caused below changes:


++++
|| 373df3131a | 70a8118a03 |
++++
| boot_successes | 6  | 3  |
| boot_failures  | 17 | 19 |
| BUG:unable_to_handle_kernel| 2  ||
| Oops   | 2  ||
| calltrace:compat_SyS_ipc   | 2  ||
| Kernel_panic-not_syncing:Fatal_exception   | 2  ||
| invoked_oom-killer:gfp_mask=0x | 4  | 2  |
| Mem-Info   | 4  | 2  |
| BUG:kernel_reboot-without-warning_in_test_stage| 11 | 5  |
| Out_of_memory:Kill_process | 1  | 1  |
| BUG:kernel_hang_in_test_stage  | 0  | 2  |
| BUG:workqueue_leaked_lock_or_atomic:kworker| 0  | 11 |
| calltrace:dump_stack   | 0  | 11 |
| INFO:possible_circular_locking_dependency_detected | 0  | 11 |
| calltrace:ret_from_fork| 0  | 11 |
| calltrace:sock_diag_broadcast_destroy_work | 0  | 11 |
| calltrace:lock_acquire | 0  | 11 |
| calltrace:inet_diag_lock_handler   | 0  | 11 |
++++



[   34.367674] init: tty3 main process ended, respawning
[   34.444537] init: tty6 main process (356) terminated with status 1
[   34.445711] init: tty6 main process ended, respawning
[   34.657943] BUG: workqueue leaked lock or atomic: kworker/0:1/0x00000000/28
[   34.657943]  last function: sock_diag_broadcast_destroy_work
[   34.674672] 1 lock held by kworker/0:1/28:
[   34.675402]  #0:  (inet_diag_table_mutex){+.+...}, at: [] 
inet_diag_lock_handler+0x4e/0x6b
[   34.685206] CPU: 0 PID: 28 Comm: kworker/0:1 Not tainted 
4.8.0-rc4-00239-g70a8118 #2
[   34.686489] Workqueue: sock_diag_events sock_diag_broadcast_destroy_work
[   34.688013]   88001bc9bc68 9dcce3fa 
88001bc94740
[   34.689383]  88001d614cc0 88001bc7ab40 88001bc94740 
88001bc9bd48
[   34.690737]  9dac0997 9dac088f 88001bc94740 
e8c02f05
[   34.691995] Call Trace:
[   34.692409]  [] dump_stack+0x89/0xcb
[   34.693268]  [] process_one_work+0x2e4/0x414
[   34.694280]  [] ? process_one_work+0x1dc/0x414
[   34.695115] trinity-main (440) used greatest stack depth: 10480 bytes left
[   34.697086]  [] ? worker_thread+0x53/0x3ea
[   34.698036]  [] worker_thread+0x2bb/0x3ea
[   34.699007]  [] ? _raw_spin_unlock_irqrestore+0x42/0x64
[   34.700143]  [] ? process_one_work+0x414/0x414
[   34.701154]  [] ? process_one_work+0x414/0x414
[   34.702154]  [] ? schedule+0x9f/0xb4
[   34.703184]  [] ? process_one_work+0x414/0x414
[   34.704158]  [] kthread+0xe6/0xee
[   34.704956]  [] ret_from_fork+0x1f/0x40
[   34.705848]  [] ? __init_kthread_worker+0x59/0x59
[   34.742163] 
[   34.746911] ==
[   34.747955] [ INFO: possible circular locking dependency detected ]
[   34.749010] 4.8.0-rc4-00239-g70a8118 #2 Not tainted
[   34.749842] ---
[   34.751043] kworker/0:1/28 is trying to acquire lock:
[   34.751905]  ((>work)){+.+.+.}, at: [] 
process_one_work+0x1dc/0x414
[   34.753383] 
[   34.753383] but task is already holding lock:
[   34.754357]  (inet_diag_table_mutex){+.+...}, at: [] 
inet_diag_lock_handler+0x4e/0x6b
[   34.756018] 
[   34.756018] which lock already depends on the new lock.
[   34.756018] 
[   34.757342] 
[   34.757342] the existing dependency chain (in reverse order) is:
[   34.758710] 
-> #1 (inet_diag_table_mutex){+.+...}:
[   34.759634][] validate_chain+0x5ac/0x6d5
[   34.760629][] __lock_acquire+0x434/0x4e8
[   34.761608][] __lock_release+0x287/0x309
[   34.762616][] lock_release+0x5f/0x93
[   34.763557][] __mutex_unlock_slowpath+0xef/0x175
[   34.764637][] mutex_unlock+0x9/0xb
[   34.765523][] 
sock_diag_broadcast_destroy_work+0xea/0x134
[   34.766836][] process_one_work+0x246/0x414
[   34.767827][] 

drr scheduler [mis]configuration question

2016-09-16 Thread Michal Soltys
Hi,

I have hit a weird issue with drr (probably I am missing some detail).
Originally it was tested between two machines, then I quickly double
checked between namespaces (same behaviour) - the configuration follows:

# setup namespace

ip netns add drrtest
ip li add name left type veth peer name right
ip li set right netns drrtest

# setup 'left' interface with drr

tc qdisc add dev left handle 1:0 root drr
for x in {1..16}; do tc class add dev left classid 1:$x parent 1:0 drr; done
for x in {1..16}; do tc qdisc add dev left handle $((256+x)):0 parent 1:$x 
pfifo_fast; done
tc filter add dev left proto all pref 1 parent 1:0 handle 1 flow map key dst 
divisor 16 baseclass 1:1
ip add add 10.15.0.255/16 dev left
ip li set left up

The above creates a simple drr setup with 16 classes, counted from 1:1 to 1:16.
The flow filter should simply distribute traffic across those classes, and the
relevant piece of code would suggest everything is ok:

if (f->divisor)
classid %= f->divisor;

res->class   = 0;
res->classid = TC_H_MAKE(f->baseclass, f->baseclass + classid);

So 'classid' should be 0-15, and after TC_H_MAKE() the result should be 1:1-1:16

# setup 'right' interface

ip netns exec drrtest bash

ip add add 10.15.1.249/16 dev right
ip li set right up


But with this setup, any transfer from 'left' to 'right' is blackholed.
If I change the ip address of 'right' to 10.15.1.248/16, everything works
again - which would suggest some issue with proper classification.

At this point I'm a bit lost what I'm doing wrong.
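
One thing worth double-checking before blaming the classifier: tc parses
classid minors as hexadecimal, so the shell loop "1:$x" for x in 1..16
creates minors 0x1-0x9 and 0x10-0x16, while the filter maps into minors
0x1-0x10. A quick standalone check of the mapping arithmetic (assuming the
key is the host-order IPv4 destination address, as in flow_get_dst()):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* Classes actually created by "1:$x", x=1..16, hex-parsed by tc:
	 * minors 0x1..0x9 and 0x10..0x16. */
	uint32_t dst[] = { 0x0a0f01f9,   /* 10.15.1.249 (blackholed) */
			   0x0a0f01f8 }; /* 10.15.1.248 (works)      */
	uint32_t baseclass = 0x00010001; /* 1:1 */

	for (int i = 0; i < 2; i++) {
		uint32_t minor = (baseclass + dst[i] % 16) & 0xffff;
		int exists = (minor >= 0x1 && minor <= 0x9) ||
			     (minor >= 0x10 && minor <= 0x16);
		printf("10.15.1.%u -> minor 0x%x (%s)\n",
		       dst[i] & 0xff, minor, exists ? "exists" : "missing");
	}
	return 0;
}

This prints "minor 0xa (missing)" for .249 and "minor 0x9 (exists)" for
.248, which would match the observed blackholing if the hex/decimal
mismatch is indeed the cause.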




Re: [PATCH] net: ipv6: Disable forwarding per interface via sysctl

2016-09-16 Thread Eric Dumazet
On Fri, 2016-09-16 at 13:47 +0100, Mike Manning wrote:
> Disabling forwarding per interface via sysctl continues to allow
> forwarding. This is contrary to the sysctl documentation stating that
> the forwarding sysctl is per interface, whereas currently it is only
> the sysctl for all interfaces that has an effect on forwarding. The
> solution is to drop any received packets instead of forwarding them
> if the ingress device has a per-device forwarding sysctl that is unset.

Some archaeological research might be needed because
Documentation/networking/ip-sysctl.txt states :

IPv4 and IPv6 work differently here; e.g. netfilter must be used
to control which interfaces may forward packets and which not.

If this netfilter requirement is obsolete, then your patch would need to
change the doc as well.

Hannes can probably comment on this ?

Thanks.




[net-next:master 58/374] drivers/net/ethernet/amazon/ena/ena_netdev.c:3026:1-11: Use setup_timer function for function on line 3028.

2016-09-16 Thread Julia Lawall
setup_timer() could be used instead of the call to init_timer() and the
separate initializations of the function and data fields.

julia


tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 
master
head:   84ce3da1bfd6fd12fce3cd06691e405a36f72cde
commit: 1738cd3ed342294360d6a74d4e5884bff854 [58/374] net: ena: Add a 
driver for Amazon Elastic Network Adapters (ENA)
:: branch date: 5 hours ago
:: commit date: 5 weeks ago

>> drivers/net/ethernet/amazon/ena/ena_netdev.c:3026:1-11: Use setup_timer 
>> function for function on line 3028.

git remote add net-next 
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git
git remote update net-next
git checkout 1738cd3ed342294360d6a74d4e5884bff854
vim +3026 drivers/net/ethernet/amazon/ena/ena_netdev.c

1738cd3e Netanel Belgazal 2016-08-10  3020  INIT_WORK(&adapter->suspend_io_task, ena_device_io_suspend);
1738cd3e Netanel Belgazal 2016-08-10  3021  INIT_WORK(&adapter->resume_io_task, ena_device_io_resume);
1738cd3e Netanel Belgazal 2016-08-10  3022  INIT_WORK(&adapter->reset_task, ena_fw_reset_device);
1738cd3e Netanel Belgazal 2016-08-10  3023  
1738cd3e Netanel Belgazal 2016-08-10  3024  adapter->last_keep_alive_jiffies = jiffies;
1738cd3e Netanel Belgazal 2016-08-10  3025  
1738cd3e Netanel Belgazal 2016-08-10 @3026  init_timer(&adapter->timer_service);
1738cd3e Netanel Belgazal 2016-08-10  3027  adapter->timer_service.expires = round_jiffies(jiffies + HZ);
1738cd3e Netanel Belgazal 2016-08-10 @3028  adapter->timer_service.function = ena_timer_service;
1738cd3e Netanel Belgazal 2016-08-10  3029  adapter->timer_service.data = (unsigned long)adapter;
1738cd3e Netanel Belgazal 2016-08-10  3030  
1738cd3e Netanel Belgazal 2016-08-10  3031  add_timer(&adapter->timer_service);
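
The suggested transformation would look roughly like this (a sketch against
the 4.8-era timer API, not a tested patch):

	/* setup_timer() folds init_timer() and the .function/.data
	 * assignments into one call. */
	setup_timer(&adapter->timer_service, ena_timer_service,
		    (unsigned long)adapter);
	adapter->timer_service.expires = round_jiffies(jiffies + HZ);
	add_timer(&adapter->timer_service);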

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


Re: [PATCH v2] iproute2: build nsid-name cache only for commands that need it

2016-09-16 Thread Anton Aksola
On Fri, Sep 16, 2016 at 11:13:11AM +0200, Nicolas Dichtel wrote:
> There is still some differences:
> $ cat test.batch
> netns add foo
> netns set foo 1234
> netns list-id
>
> Before your patch:
> $ ip -b test.batch
> nsid 1234 (iproute2 netns name: foo)
>
> After your patch:
> $ ip -b test.batch
> nsid 1234

Nicolas,
This seems to be caused by netns_add calling unshare(CLONE_NEWNET).
If we initialize the socket for nsid after that it doesn't seem to work.

Unfortunately I'm not an expert in these details. Should we separate the
socket and cache initialization to different functions and call the
socket init in the beginning of do_netns() as before? What do you think?
I made a quick patch and it seems to work in batch mode now.

BR,
Anton
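
A rough sketch of the split being discussed (function and variable names
here are illustrative, not the final iproute2 patch):

static struct rtnl_handle rtnsh = { .fd = -1 };

/* Open the rtnl socket for nsid queries once, in the caller's netns,
 * before any "netns add" gets a chance to unshare(CLONE_NEWNET). */
static int netns_nsid_socket_init(void)
{
	if (rtnsh.fd > -1)
		return 0;
	return rtnl_open(&rtnsh, 0);
}

int do_netns(int argc, char **argv)
{
	netns_nsid_socket_init();

	/* Build the nsid->name cache only for commands that print it. */
	if (argc > 0 && matches(*argv, "list-id") == 0) {
		netns_map_init();
		return netns_list_id(argc - 1, argv + 1);
	}
	/* ... remaining subcommands unchanged ... */
	return 0;
}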


Re: [PATCH net] sctp: fix SSN comparison

2016-09-16 Thread Neil Horman
On Thu, Sep 15, 2016 at 03:02:38PM -0300, Marcelo Ricardo Leitner wrote:
> This function actually operates on u32 yet its parameters were declared
> as u16, causing integer truncation upon calling.
> 
> Note in patch context that ADDIP_SERIAL_SIGN_BIT is already 32 bits.
> 
> Signed-off-by: Marcelo Ricardo Leitner 
> ---
> 
> This issue exists since before git import, so I can't put a Fixes tag.
> Also, that said, probably not worth queueing it to stable.
> Thanks
> 
>  include/net/sctp/sm.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/net/sctp/sm.h b/include/net/sctp/sm.h
> index 
> efc01743b9d641bf6b16a37780ee0df34b4ec698..bafe2a0ab9085f24e17038516c55c00cfddd02f4
>  100644
> --- a/include/net/sctp/sm.h
> +++ b/include/net/sctp/sm.h
> @@ -382,7 +382,7 @@ enum {
>   ADDIP_SERIAL_SIGN_BIT = (1<<31)
>  };
>  
> -static inline int ADDIP_SERIAL_gte(__u16 s, __u16 t)
> +static inline int ADDIP_SERIAL_gte(__u32 s, __u32 t)
>  {
>   return ((s) == (t)) || (((t) - (s)) & ADDIP_SERIAL_SIGN_BIT);
>  }
> -- 
> 2.7.4
> 

Acked-by: Neil Horman 
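
The truncation is easy to demonstrate in userspace (a standalone sketch,
not the kernel code):

#include <stdio.h>
#include <stdint.h>

#define ADDIP_SERIAL_SIGN_BIT (1u << 31)

/* Old prototype: 32-bit serials get truncated to 16 bits at the call. */
static int gte_u16(uint16_t s, uint16_t t)
{
	return (s == t) || (((uint32_t)(t - s)) & ADDIP_SERIAL_SIGN_BIT);
}

/* Fixed prototype. */
static int gte_u32(uint32_t s, uint32_t t)
{
	return (s == t) || (((t) - (s)) & ADDIP_SERIAL_SIGN_BIT);
}

int main(void)
{
	uint32_t s = 0x00000001, t = 0x00010001; /* t is 65536 ahead of s */

	/* Truncated, both serials look like 1, so the u16 version
	 * wrongly reports s >= t while the u32 version does not. */
	printf("u16: %d  u32: %d\n", gte_u16(s, t), gte_u32(s, t));
	return 0;
}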



[patch net-next v10 3/3] mlxsw: spectrum: Implement offload stats ndo and expose HW stats by default

2016-09-16 Thread Jiri Pirko
From: Nogah Frankel 

Change the default statistics ndo to return HW statistics
(like the ones returned by ethtool_ops).
The HW stats are collected into a cache by a delayed work every 1 sec.
Implement the offload stats ndo.
Add a function to get SW statistics, to be called from that ndo.

Signed-off-by: Nogah Frankel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 129 +++--
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |   5 +
 2 files changed, 127 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 27bbcaf..171f8dd 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -819,9 +819,9 @@ err_span_port_mtu_update:
return err;
 }
 
-static struct rtnl_link_stats64 *
-mlxsw_sp_port_get_stats64(struct net_device *dev,
- struct rtnl_link_stats64 *stats)
+int
+mlxsw_sp_port_get_sw_stats64(const struct net_device *dev,
+struct rtnl_link_stats64 *stats)
 {
struct mlxsw_sp_port *mlxsw_sp_port = netdev_priv(dev);
struct mlxsw_sp_port_pcpu_stats *p;
@@ -848,6 +848,107 @@ mlxsw_sp_port_get_stats64(struct net_device *dev,
tx_dropped  += p->tx_dropped;
}
stats->tx_dropped   = tx_dropped;
+   return 0;
+}
+
+bool mlxsw_sp_port_has_offload_stats(int attr_id)
+{
+   switch (attr_id) {
+   case IFLA_OFFLOAD_XSTATS_CPU_HIT:
+   return true;
+   }
+
+   return false;
+}
+
+int mlxsw_sp_port_get_offload_stats(int attr_id, const struct net_device *dev,
+   void *sp)
+{
+   switch (attr_id) {
+   case IFLA_OFFLOAD_XSTATS_CPU_HIT:
+   return mlxsw_sp_port_get_sw_stats64(dev, sp);
+   }
+
+   return -EINVAL;
+}
+
+static int mlxsw_sp_port_get_stats_raw(struct net_device *dev, int grp,
+  int prio, char *ppcnt_pl)
+{
+   struct mlxsw_sp_port *mlxsw_sp_port = netdev_priv(dev);
+   struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
+
+   mlxsw_reg_ppcnt_pack(ppcnt_pl, mlxsw_sp_port->local_port, grp, prio);
+   return mlxsw_reg_query(mlxsw_sp->core, MLXSW_REG(ppcnt), ppcnt_pl);
+}
+
+static int mlxsw_sp_port_get_hw_stats(struct net_device *dev,
+ struct rtnl_link_stats64 *stats)
+{
+   char ppcnt_pl[MLXSW_REG_PPCNT_LEN];
+   int err;
+
+   err = mlxsw_sp_port_get_stats_raw(dev, MLXSW_REG_PPCNT_IEEE_8023_CNT,
+ 0, ppcnt_pl);
+   if (err)
+   goto out;
+
+   stats->tx_packets =
+   mlxsw_reg_ppcnt_a_frames_transmitted_ok_get(ppcnt_pl);
+   stats->rx_packets =
+   mlxsw_reg_ppcnt_a_frames_received_ok_get(ppcnt_pl);
+   stats->tx_bytes =
+   mlxsw_reg_ppcnt_a_octets_transmitted_ok_get(ppcnt_pl);
+   stats->rx_bytes =
+   mlxsw_reg_ppcnt_a_octets_received_ok_get(ppcnt_pl);
+   stats->multicast =
+   mlxsw_reg_ppcnt_a_multicast_frames_received_ok_get(ppcnt_pl);
+
+   stats->rx_crc_errors =
+   mlxsw_reg_ppcnt_a_frame_check_sequence_errors_get(ppcnt_pl);
+   stats->rx_frame_errors =
+   mlxsw_reg_ppcnt_a_alignment_errors_get(ppcnt_pl);
+
+   stats->rx_length_errors = (
+   mlxsw_reg_ppcnt_a_in_range_length_errors_get(ppcnt_pl) +
+   mlxsw_reg_ppcnt_a_out_of_range_length_field_get(ppcnt_pl) +
+   mlxsw_reg_ppcnt_a_frame_too_long_errors_get(ppcnt_pl));
+
+   stats->rx_errors = (stats->rx_crc_errors +
+   stats->rx_frame_errors + stats->rx_length_errors);
+
+out:
+   return err;
+}
+
+static void update_stats_cache(struct work_struct *work)
+{
+   struct mlxsw_sp_port *mlxsw_sp_port =
+   container_of(work, struct mlxsw_sp_port,
+hw_stats.update_dw.work);
+
+   if (!netif_carrier_ok(mlxsw_sp_port->dev))
+   goto out;
+
+   mlxsw_sp_port_get_hw_stats(mlxsw_sp_port->dev,
+  mlxsw_sp_port->hw_stats.cache);
+
+out:
+   mlxsw_core_schedule_dw(&mlxsw_sp_port->hw_stats.update_dw,
+  MLXSW_HW_STATS_UPDATE_TIME);
+}
+
+/* Return the stats from a cache that is updated periodically,
+ * as this function might get called in an atomic context.
+ */
+static struct rtnl_link_stats64 *
+mlxsw_sp_port_get_stats64(struct net_device *dev,
+ struct rtnl_link_stats64 *stats)
+{
+   struct mlxsw_sp_port *mlxsw_sp_port = netdev_priv(dev);
+
+   memcpy(stats, mlxsw_sp_port->hw_stats.cache, sizeof(*stats));
+
return stats;
 }
 
@@ -1209,6 +1310,8 @@ static const struct net_device_ops 
mlxsw_sp_port_netdev_ops = {
  

[patch net-next v10 2/3] net: core: Add offload stats to if_stats_msg

2016-09-16 Thread Jiri Pirko
From: Nogah Frankel 

Add a nested attribute of offload stats to if_stats_msg,
named IFLA_STATS_LINK_OFFLOAD_XSTATS.
Under it, add SW stats, meaning stats counted only for packets that
went via the slowpath to the CPU, named IFLA_OFFLOAD_XSTATS_CPU_HIT.

Signed-off-by: Nogah Frankel 
Signed-off-by: Jiri Pirko 
---
 include/uapi/linux/if_link.h |   9 
 net/core/rtnetlink.c | 111 +--
 2 files changed, 116 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 9bf3aec..2351776 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -826,6 +826,7 @@ enum {
IFLA_STATS_LINK_64,
IFLA_STATS_LINK_XSTATS,
IFLA_STATS_LINK_XSTATS_SLAVE,
+   IFLA_STATS_LINK_OFFLOAD_XSTATS,
__IFLA_STATS_MAX,
 };
 
@@ -845,6 +846,14 @@ enum {
 };
 #define LINK_XSTATS_TYPE_MAX (__LINK_XSTATS_TYPE_MAX - 1)
 
+/* These are stats embedded into IFLA_STATS_LINK_OFFLOAD_XSTATS */
+enum {
+   IFLA_OFFLOAD_XSTATS_UNSPEC,
+   IFLA_OFFLOAD_XSTATS_CPU_HIT, /* struct rtnl_link_stats64 */
+   __IFLA_OFFLOAD_XSTATS_MAX
+};
+#define IFLA_OFFLOAD_XSTATS_MAX (__IFLA_OFFLOAD_XSTATS_MAX - 1)
+
 /* XDP section */
 
 enum {
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 937e459..0dbae42 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3577,6 +3577,91 @@ static bool stats_attr_valid(unsigned int mask, int 
attrid, int idxattr)
   (!idxattr || idxattr == attrid);
 }
 
+#define IFLA_OFFLOAD_XSTATS_FIRST (IFLA_OFFLOAD_XSTATS_UNSPEC + 1)
+static int rtnl_get_offload_stats_attr_size(int attr_id)
+{
+   switch (attr_id) {
+   case IFLA_OFFLOAD_XSTATS_CPU_HIT:
+   return sizeof(struct rtnl_link_stats64);
+   }
+
+   return 0;
+}
+
+static int rtnl_get_offload_stats(struct sk_buff *skb, struct net_device *dev,
+ int *prividx)
+{
+   struct nlattr *attr = NULL;
+   int attr_id, size;
+   void *attr_data;
+   int err;
+
+   if (!(dev->netdev_ops && dev->netdev_ops->ndo_has_offload_stats &&
+ dev->netdev_ops->ndo_get_offload_stats))
+   return -ENODATA;
+
+   for (attr_id = IFLA_OFFLOAD_XSTATS_FIRST;
+attr_id <= IFLA_OFFLOAD_XSTATS_MAX; attr_id++) {
+   if (attr_id < *prividx)
+   continue;
+
+   size = rtnl_get_offload_stats_attr_size(attr_id);
+   if (!size)
+   continue;
+
+   if (!dev->netdev_ops->ndo_has_offload_stats(attr_id))
+   continue;
+
+   attr = nla_reserve_64bit(skb, attr_id, size,
+IFLA_OFFLOAD_XSTATS_UNSPEC);
+   if (!attr)
+   goto nla_put_failure;
+
+   attr_data = nla_data(attr);
+   memset(attr_data, 0, size);
+   err = dev->netdev_ops->ndo_get_offload_stats(attr_id, dev,
+attr_data);
+   if (err)
+   goto get_offload_stats_failure;
+   }
+
+   if (!attr)
+   return -ENODATA;
+
+   *prividx = 0;
+   return 0;
+
+nla_put_failure:
+   err = -EMSGSIZE;
+get_offload_stats_failure:
+   *prividx = attr_id;
+   return err;
+}
+
+static int rtnl_get_offload_stats_size(const struct net_device *dev)
+{
+   int nla_size = 0;
+   int attr_id;
+   int size;
+
+   if (!(dev->netdev_ops && dev->netdev_ops->ndo_has_offload_stats &&
+ dev->netdev_ops->ndo_get_offload_stats))
+   return 0;
+
+   for (attr_id = IFLA_OFFLOAD_XSTATS_FIRST;
+attr_id <= IFLA_OFFLOAD_XSTATS_MAX; attr_id++) {
+   if (!dev->netdev_ops->ndo_has_offload_stats(attr_id))
+   continue;
+   size = rtnl_get_offload_stats_attr_size(attr_id);
+   nla_size += nla_total_size_64bit(size);
+   }
+
+   if (nla_size != 0)
+   nla_size += nla_total_size(0);
+
+   return nla_size;
+}
+
 static int rtnl_fill_statsinfo(struct sk_buff *skb, struct net_device *dev,
   int type, u32 pid, u32 seq, u32 change,
   unsigned int flags, unsigned int filter_mask,
@@ -3586,6 +3671,7 @@ static int rtnl_fill_statsinfo(struct sk_buff *skb, 
struct net_device *dev,
struct nlmsghdr *nlh;
struct nlattr *attr;
int s_prividx = *prividx;
+   int err;
 
ASSERT_RTNL();
 
@@ -3614,8 +3700,6 @@ static int rtnl_fill_statsinfo(struct sk_buff *skb, 
struct net_device *dev,
const struct rtnl_link_ops *ops = dev->rtnl_link_ops;
 
if (ops && ops->fill_linkxstats) {
-   int err;
-
*idxattr = IFLA_STATS_LINK_XSTATS;

[patch net-next v10 0/3] return offloaded stats as default and expose original sw stats

2016-09-16 Thread Jiri Pirko
From: Jiri Pirko 

From: Jiri Pirko 

The problem we try to handle is about offloaded forwarded packets
which are not seen by the kernel. Let me try to draw it:

       port1                       port2 (HW stats are counted here)
          \                       /
           \                     /
            \                   /
             --(A)-- ASIC --(B)--
                      |
                     (C)
                      |
                     CPU (SW stats are counted here)


Now we have a couple of flows for TX and RX (direction does not matter here):

1) port1->A->ASIC->C->CPU

   For this flow, HW and SW stats are equal.

2) port1->A->ASIC->C->CPU->C->ASIC->B->port2

   For this flow, HW and SW stats are equal.

3) port1->A->ASIC->B->port2

   For this flow, SW stats are 0.

The purpose of this patchset is to provide a facility for the user to
find out the difference between flows 1+2 and 3. In other words, the user
will be able to see the statistics for the slow-path (through the kernel).

Also note that HW stats are what someone calls "accumulated" stats.
Every packet counted by SW is also counted by HW. Not the other way around.

As a default the accumulated stats (HW) will be exposed to the user
so the userspace apps can react properly.

This patchset adds the SW stats (flows 1+2) under offload related stats, so
in the future we can expose other offload related stats in a similar way.
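
Given the invariant that every SW-counted packet is also HW-counted, the
hardware-only view for flow 3 falls out by subtraction; a sketch of what a
userspace consumer might do with the two counter sets (field names are from
struct rtnl_link_stats64; treat the helper as an illustration):

#include <linux/if_link.h> /* struct rtnl_link_stats64 */

/* Packets forwarded purely in HW (flow 3 above): accumulated default
 * counters (HW) minus the IFLA_OFFLOAD_XSTATS_CPU_HIT counters (SW). */
static inline __u64 hw_only_rx_packets(const struct rtnl_link_stats64 *hw,
				       const struct rtnl_link_stats64 *sw)
{
	return hw->rx_packets - sw->rx_packets;
}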

---
v9->v10:
- patch 2/3
 - removed unnecessary ()s as pointed out by Nik
v8->v9:
- patch 2/3
 - add using of idxattr and prividx
v7->v8:
- patch 2/3
 - move helping const from uapi to rtnetlink
 - cancel driver xstat nesting if it is empty
v6->v7:
- patch 1/3:
 - ndo interface changed to get the wanted stats type as an input.
 - change commit message.
- patch 2/3:
 - create a nesting for offloaded stat and put SW stats under it.
 - change the ndo call to indicate which offload stats we wants.
 - change commit message.
- patch 3/3:
 - change ndo implementation to match the changes in the previous patches.
 - change commit message.
v5->v6:
- patch 2/4 was dropped as requested by Roopa
- patch 1/3:
 - comment changed to indicate that default stats are combined stats
 - commit massage changed
- patch 2/3: (previously 3/4)
 - SW stats return nothing if there is no SW stats ndo
v4->v5:
- updated cover letter
- patch3/4:
  - using memcpy directly to copy stats as requested by DaveM
v3->v4:
- patch1/4:
  - fixed "return ()" pointed out by EricD
- patch2/4:
  - fixed if_nlmsg_size as pointed out by EricD
v2->v3:
- patch1/4:
  - added dev_have_sw_stats helper
- patch2/4:
  - avoided memcpy as requested by DaveM
- patch3/4:
  - use new dev_have_sw_stats helper
v1->v2:
- patch3/4:
  - fixed NULL initialization

Nogah Frankel (3):
  netdevice: Add offload statistics ndo
  net: core: Add offload stats to if_stats_msg
  mlxsw: spectrum: Implement offload stats ndo and expose HW stats by
default

 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 129 +++--
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |   5 +
 include/linux/netdevice.h  |  12 +++
 include/uapi/linux/if_link.h   |   9 ++
 net/core/rtnetlink.c   | 111 -
 5 files changed, 255 insertions(+), 11 deletions(-)

-- 
2.5.5



[patch net-next v10 1/3] netdevice: Add offload statistics ndo

2016-09-16 Thread Jiri Pirko
From: Nogah Frankel 

Add a new ndo to return statistics for offloaded operations.
Since there can be many different offloaded operations with many
stats types, the ndo gets an attribute id by which it knows which
stats are wanted. The ndo also gets a void pointer to be cast according
to the attribute id.

Signed-off-by: Nogah Frankel 
Signed-off-by: Jiri Pirko 
---
 include/linux/netdevice.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2095b6a..a10d8d1 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -924,6 +924,14 @@ struct netdev_xdp {
  * 3. Update dev->stats asynchronously and atomically, and define
  *neither operation.
  *
+ * bool (*ndo_has_offload_stats)(int attr_id)
+ * Return true if this device supports offload stats of this attr_id.
+ *
+ * int (*ndo_get_offload_stats)(int attr_id, const struct net_device *dev,
+ * void *attr_data)
+ * Get statistics for offload operations by attr_id. Write it into the
+ * attr_data pointer.
+ *
  * int (*ndo_vlan_rx_add_vid)(struct net_device *dev, __be16 proto, u16 vid);
  * If device supports VLAN filtering this function is called when a
  * VLAN id is registered.
@@ -1155,6 +1163,10 @@ struct net_device_ops {
 
struct rtnl_link_stats64* (*ndo_get_stats64)(struct net_device *dev,
 struct rtnl_link_stats64 
*storage);
+   bool(*ndo_has_offload_stats)(int attr_id);
+   int (*ndo_get_offload_stats)(int attr_id,
+const struct 
net_device *dev,
+void *attr_data);
struct net_device_stats* (*ndo_get_stats)(struct net_device *dev);
 
int (*ndo_vlan_rx_add_vid)(struct net_device *dev,
-- 
2.5.5



pull-request: mac80211-next 2016-09-16

2016-09-16 Thread Johannes Berg
Hi Dave,

And here's another set for net-next; it's been a month or so and we have a
reasonably large number of patches (for a change, mostly because I cleaned
up some WEP crypto things and addressed a few static checker warnings).

Let me know if there's any problem.

Thanks,
johannes



The following changes since commit 02154927c115c7599677df57203988e05b576346:

  net: dsa: bcm_sf2: Get VLAN_PORT_MASK from b53_device (2016-09-11 19:37:02 
-0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next.git 
tags/mac80211-next-for-davem-2016-09-16

for you to fetch changes up to fbd05e4a6e82fd573d3aa79e284e424b8d78c149:

  cfg80211: add helper to find an IE that matches a byte-array (2016-09-16 
14:49:52 +0200)


This time we have various things - all across the board:
 * MU-MIMO sniffer support in mac80211
 * a create_singlethread_workqueue() cleanup
 * interface dump filtering that was documented but not implemented
 * support for the new radiotap timestamp field
 * send delBA in two unexpected conditions (as required by the spec)
 * connect keys cleanups - allow only WEP with index 0-3
 * per-station aggregation limit to work around broken APs
 * debugfs improvement for the integrated codel algorithm
and various other small improvements and cleanups.


Aviya Erenfeld (2):
  mac80211: refactor monitor representation in sdata
  mac80211: add support for MU-MIMO air sniffer

Bhaktipriya Shridhar (1):
  cfg80211: Remove deprecated create_singlethread_workqueue

Denis Kenzior (1):
  nl80211: Allow GET_INTERFACE dumps to be filtered

Emmanuel Grumbach (2):
  cfg80211: clarify the requirements of .disconnect()
  mac80211: allow using AP_LINK_PS with mac80211-generated TIM IE

Johannes Berg (21):
  mac80211: add support for radiotap timestamp field
  mac80211: send delBA on unexpected BlockAck data frames
  mac80211: send delBA on unexpected BlockAck Request
  mac80211: simplify TDLS RA lookup
  mac80211: remove useless open_count check
  cfg80211: disallow shared key authentication with key index 4
  nl80211: fix connect keys range check
  nl80211: only allow WEP keys during connect command
  cfg80211: wext: only allow WEP keys to be configured before connected
  cfg80211: validate key index better
  cfg80211: reduce connect key caching struct size
  cfg80211: allow connect keys only with default (TX) key
  mac80211: fix possible out-of-bounds access
  mac80211: fix scan completed tracing
  nl80211: always check nla_nest_start() return value
  nl80211: always check nla_put* return values
  mac80211: remove unused assignment
  mac80211: remove pointless chanctx NULL check
  mac80211: remove sta_remove_debugfs driver callback
  cfg80211: remove unnecessary pointer-of
  mac80211_hwsim: statically initialize hwsim_radios list

Luca Coelho (1):
  cfg80211: add helper to find an IE that matches a byte-array

Maxim Altshul (1):
  mac80211: RX BA support for sta max_rx_aggregation_subframes

Rajkumar Manoharan (1):
  mac80211: allow driver to handle packet-loss mechanism

Toke Høiland-Jørgensen (1):
  mac80211: Re-structure aqm debugfs output and keep CoDel stats per txq

 drivers/net/wireless/mac80211_hwsim.c |   3 +-
 include/net/cfg80211.h|  36 +++-
 include/net/ieee80211_radiotap.h  |  21 +
 include/net/mac80211.h|  33 ++--
 net/mac80211/agg-rx.c |  11 ++-
 net/mac80211/cfg.c|  35 ++--
 net/mac80211/debugfs.c| 152 ++
 net/mac80211/debugfs_netdev.c |  37 -
 net/mac80211/debugfs_sta.c|  56 -
 net/mac80211/driver-ops.c |   2 +-
 net/mac80211/driver-ops.h |  18 +---
 net/mac80211/ieee80211_i.h|  11 ++-
 net/mac80211/iface.c  |  21 +++--
 net/mac80211/main.c   |   3 +
 net/mac80211/mlme.c   |  12 ++-
 net/mac80211/pm.c |   3 +-
 net/mac80211/rx.c |  69 ++-
 net/mac80211/scan.c   |   2 +-
 net/mac80211/sta_info.c   |   5 +-
 net/mac80211/sta_info.h   |   3 +
 net/mac80211/status.c |   8 +-
 net/mac80211/tx.c |  21 ++---
 net/mac80211/util.c   |   3 +-
 net/wireless/core.c   |   2 +-
 net/wireless/core.h   |   6 +-
 net/wireless/ibss.c   |  11 +--
 net/wireless/mlme.c   |   2 +-
 net/wireless/nl80211.c|  85 ---
 net/wireless/scan.c   |  58 ++---
 net/wireless/sme.c|   3 +
 net/wireless/sysfs.c 

[PATCH] net: ipv6: fallback to full lookup if table lookup is unsuitable

2016-09-16 Thread Vincent Bernat
Commit 8c14586fc320 ("net: ipv6: Use passed in table for nexthop
lookups") introduced a regression: insertion of an IPv6 route in a table
not containing the appropriate connected route for the gateway but which
contained a non-connected route (like a default gateway) fails while it
was previously working:

$ ip link add eth0 type dummy
$ ip link set up dev eth0
$ ip addr add 2001:db8::1/64 dev eth0
$ ip route add ::/0 via 2001:db8::5 dev eth0 table 20
$ ip route add 2001:db8:cafe::1/128 via 2001:db8::6 dev eth0 table 20
RTNETLINK answers: No route to host
$ ip -6 route show table 20
default via 2001:db8::5 dev eth0  metric 1024  pref medium

After this patch, we get:

$ ip route add 2001:db8:cafe::1/128 via 2001:db8::6 dev eth0 table 20
$ ip -6 route show table 20
2001:db8:cafe::1 via 2001:db8::6 dev eth0  metric 1024  pref medium
default via 2001:db8::5 dev eth0  metric 1024  pref medium

Signed-off-by: Vincent Bernat 
---
 net/ipv6/route.c | 48 +++-
 1 file changed, 27 insertions(+), 21 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index ad4a7ff301fc..c2aaddcfed9e 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1808,6 +1808,30 @@ static struct rt6_info *ip6_nh_lookup_table(struct net 
*net,
return rt;
 }
 
+static int ip6_nh_valid(struct rt6_info *grt,
+   struct net_device **dev, struct inet6_dev **idev) {
+   int ret = 0;
+
+   if (!grt)
+   goto out;
+   if (grt->rt6i_flags & RTF_GATEWAY)
+   goto out;
+   if (*dev) {
+   if (*dev != grt->dst.dev)
+   goto out;
+   } else {
+   *dev = grt->dst.dev;
+   *idev = grt->rt6i_idev;
+   dev_hold(*dev);
+   in6_dev_hold(*idev);
+   }
+   ret = 1;
+out:
+   if (grt)
+   ip6_rt_put(grt);
+   return ret;
+}
+
 static struct rt6_info *ip6_route_info_create(struct fib6_config *cfg)
 {
struct net *net = cfg->fc_nlinfo.nl_net;
@@ -1991,33 +2015,15 @@ static struct rt6_info *ip6_route_info_create(struct 
fib6_config *cfg)
if (!(gwa_type & IPV6_ADDR_UNICAST))
goto out;
 
+   err = -EHOSTUNREACH;
if (cfg->fc_table)
grt = ip6_nh_lookup_table(net, cfg, gw_addr);
-
-   if (!grt)
+   if (!ip6_nh_valid(grt, &dev, &idev)) {
grt = rt6_lookup(net, gw_addr, NULL,
 cfg->fc_ifindex, 1);
-
-   err = -EHOSTUNREACH;
-   if (!grt)
-   goto out;
-   if (dev) {
-   if (dev != grt->dst.dev) {
-   ip6_rt_put(grt);
+   if (!ip6_nh_valid(grt, &dev, &idev))
goto out;
-   }
-   } else {
-   dev = grt->dst.dev;
-   idev = grt->rt6i_idev;
-   dev_hold(dev);
-   in6_dev_hold(grt->rt6i_idev);
}
-   if (!(grt->rt6i_flags & RTF_GATEWAY))
-   err = 0;
-   ip6_rt_put(grt);
-
-   if (err)
-   goto out;
}
err = -EINVAL;
if (!dev || (dev->flags & IFF_LOOPBACK))
-- 
2.9.3



[PATCH] net: ipv6: Disable forwarding per interface via sysctl

2016-09-16 Thread Mike Manning
Disabling forwarding per interface via sysctl continues to allow
forwarding. This is contrary to the sysctl documentation stating that
the forwarding sysctl is per interface, whereas currently it is only
the sysctl for all interfaces that has an effect on forwarding. The
solution is to drop any received packets instead of forwarding them
if the ingress device has a per-device forwarding sysctl that is unset.

Signed-off-by: Mike Manning 
---
 net/ipv6/ip6_output.c |4 
 1 file changed, 4 insertions(+)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 1dfc402..37cd1d0 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -380,11 +380,15 @@ int ip6_forward(struct sk_buff *skb)
struct ipv6hdr *hdr = ipv6_hdr(skb);
struct inet6_skb_parm *opt = IP6CB(skb);
struct net *net = dev_net(dst->dev);
+   struct inet6_dev *idev = __in6_dev_get(skb->dev);
u32 mtu;
 
if (net->ipv6.devconf_all->forwarding == 0)
goto error;
 
+   if (idev && !idev->cnf.forwarding)
+   goto error;
+
if (skb->pkt_type != PACKET_HOST)
goto drop;
 
-- 
1.7.10.4



pull-request: mac80211 2016-09-16

2016-09-16 Thread Johannes Berg
Hi Dave,

Sorry - I know you only just pulled my tree for the previous fixes,
but we found two more problems in the last few days; it'd be great
to get those fixes in as well.

Let me know if there's any problem.

Thanks,
johannes



The following changes since commit ad5987b47e96a0fb6d13fea250e936aed93c:

  nl80211: validate number of probe response CSA counters (2016-09-13 20:19:27 
+0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git 
tags/mac80211-for-davem-2016-09-16

for you to fetch changes up to 85d5313ed717ad60769491c7c072d23bc0a68e7a:

  mac80211: reject TSPEC TIDs (TSIDs) for aggregation (2016-09-15 10:08:52 
+0200)


Two more fixes:
 * reject aggregation sessions for TSID/TID 8-16 that we
   can never use anyway and which could confuse drivers
 * check return value of skb_linearize()


Johannes Berg (2):
  mac80211: check skb_linearize() return value
  mac80211: reject TSPEC TIDs (TSIDs) for aggregation

 net/mac80211/agg-rx.c | 8 +++-
 net/mac80211/agg-tx.c | 3 +++
 net/mac80211/tx.c | 8 ++--
 3 files changed, 16 insertions(+), 3 deletions(-)


[PATCHv1] sunrpc: fix write space race causing stalls

2016-09-16 Thread David Vrabel
Write space becoming available may race with putting the task to sleep
in xprt_wait_for_buffer_space().  The existing mechanism to avoid the
race does not work.

This (edited) partial trace illustrates the problem:

   [1] rpc_task_run_action: task:43546@5 ... action=call_transmit
   [2] xs_write_space <-xs_tcp_write_space
   [3] xprt_write_space <-xs_write_space
   [4] rpc_task_sleep: task:43546@5 ...
   [5] xs_write_space <-xs_tcp_write_space

[1] Task 43546 runs but is out of write space.

[2] Space becomes available, xs_write_space() clears the
SOCKWQ_ASYNC_NOSPACE bit.

[3] xprt_write_space() attempts to wake xprt->snd_task (== 43546), but
this has not yet been queued and the wake up is lost.

[4] xs_nospace() is called which calls xprt_wait_for_buffer_space()
which queues task 43546.

[5] The call to sk->sk_write_space() at the end of xs_nospace() (which
is supposed to handle the above race) does not call
xprt_write_space() as the SOCKWQ_ASYNC_NOSPACE bit is clear and
thus the task is not woken.

Fix the race by having xprt_wait_for_buffer_space() check for write
space after putting the task to sleep.

Signed-off-by: David Vrabel 
---
 include/linux/sunrpc/xprt.h |  1 +
 net/sunrpc/xprt.c   |  4 
 net/sunrpc/xprtsock.c   | 21 +++--
 3 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index a16070d..621e74b 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -129,6 +129,7 @@ struct rpc_xprt_ops {
void(*connect)(struct rpc_xprt *xprt, struct rpc_task 
*task);
void *  (*buf_alloc)(struct rpc_task *task, size_t size);
void(*buf_free)(void *buffer);
+   bool(*have_write_space)(struct rpc_xprt *task);
int (*send_request)(struct rpc_task *task);
void(*set_retrans_timeout)(struct rpc_task *task);
void(*timer)(struct rpc_xprt *xprt, struct rpc_task *task);
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index ea244b2..d3c1b1e 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -502,6 +502,10 @@ void xprt_wait_for_buffer_space(struct rpc_task *task, rpc_action action)
 
task->tk_timeout = RPC_IS_SOFT(task) ? req->rq_timeout : 0;
rpc_sleep_on(&xprt->pending, task, action);
+
+   /* Write space notification may race with putting task to sleep. */
+   if (xprt->ops->have_write_space(xprt))
+   rpc_wake_up_queued_task(&xprt->pending, task);
 }
 EXPORT_SYMBOL_GPL(xprt_wait_for_buffer_space);
 
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index bf16883..211de5b 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -472,8 +472,6 @@ static int xs_nospace(struct rpc_task *task)
 
spin_unlock_bh(&xprt->transport_lock);
 
-   /* Race breaker in case memory is freed before above code is called */
-   sk->sk_write_space(sk);
return ret;
 }
 
@@ -1679,6 +1677,22 @@ static void xs_tcp_write_space(struct sock *sk)
read_unlock_bh(>sk_callback_lock);
 }
 
+static bool xs_udp_have_write_space(struct rpc_xprt *xprt)
+{
+   struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
+   struct sock *sk = transport->inet;
+
+   return sock_writeable(sk);
+}
+
+static bool xs_tcp_have_write_space(struct rpc_xprt *xprt)
+{
+   struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
+   struct sock *sk = transport->inet;
+
+   return sk_stream_is_writeable(sk);
+}
+
 static void xs_udp_do_set_buffer_size(struct rpc_xprt *xprt)
 {
struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
@@ -2664,6 +2678,7 @@ static struct rpc_xprt_ops xs_local_ops = {
.connect= xs_local_connect,
.buf_alloc  = rpc_malloc,
.buf_free   = rpc_free,
+   .have_write_space   = xs_udp_have_write_space,
.send_request   = xs_local_send_request,
.set_retrans_timeout= xprt_set_retrans_timeout_def,
.close  = xs_close,
@@ -2683,6 +2698,7 @@ static struct rpc_xprt_ops xs_udp_ops = {
.connect= xs_connect,
.buf_alloc  = rpc_malloc,
.buf_free   = rpc_free,
+   .have_write_space   = xs_udp_have_write_space,
.send_request   = xs_udp_send_request,
.set_retrans_timeout= xprt_set_retrans_timeout_rtt,
.timer  = xs_udp_timer,
@@ -2704,6 +2720,7 @@ static struct rpc_xprt_ops xs_tcp_ops = {
.connect= xs_connect,
.buf_alloc  = rpc_malloc,
.buf_free   = rpc_free,
+   .have_write_space   = xs_tcp_have_write_space,
.send_request   = xs_tcp_send_request,
.set_retrans_timeout= xprt_set_retrans_timeout_def,

Re: [PATCH v2] iproute2: build nsid-name cache only for commands that need it

2016-09-16 Thread Anton Aksola
On Fri, Sep 16, 2016 at 02:25:40PM +0300, Vadim Kochan wrote:
> Anton, I looked into the tests just after posting here. I am not
> sure it will be trivial: currently the tests run within a separate
> network namespace by default (which I set up) via the 'unshare'
> tool, and I now see that it would be better to invoke it explicitly
> from each test case. So I am not sure the netns-related tests will
> be valid if they are run after 'unshare -n'; if they are, there is
> no problem, otherwise it needs to be fixed somehow - I will try to
> do this.
>
(please excuse any duplicate emails)

It seems that mounts made after 'unshare -n' will propagate back:

[root@toys iproute2]# ip netns
[root@toys iproute2]# unshare -n
[root@toys iproute2]# ip netns add foo
[root@toys iproute2]# ip netns exec foo ip link add type dummy
[root@toys iproute2]# exit
logout
[root@toys iproute2]# ip netns
foo
[root@toys iproute2]# ip netns exec foo ip link show dev dummy0
6: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT
link/ether ba:cc:6c:ca:2d:41 brd ff:ff:ff:ff:ff:ff

This propagation doesn't happen with 'ip netns exec', as it makes sure
mounts do not propagate back to the parent namespace.
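
Roughly, 'ip netns exec' joins the network namespace and then
isolates its mount namespace before exec'ing the command. A
simplified C sketch of those steps (error handling trimmed; the
slave-mount and /sys remount steps follow iproute2's behaviour as I
understand it):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <sys/mount.h>

/* Plain 'unshare -n' performs none of the mount isolation below,
 * which is why mounts made under it propagate back to the parent. */
static int enter_netns(const char *name, const char *nspath)
{
	int fd = open(nspath, O_RDONLY | O_CLOEXEC); /* /var/run/netns/<name> */

	if (fd < 0 || setns(fd, CLONE_NEWNET) < 0)   /* join the net namespace */
		return -1;
	if (unshare(CLONE_NEWNS) < 0)                /* private mount namespace */
		return -1;
	/* Make mounts slave so changes here do not propagate back. */
	if (mount("", "/", "none", MS_SLAVE | MS_REC, NULL) < 0)
		return -1;
	/* Remount /sys so sysfs reflects the new network namespace. */
	umount2("/sys", MNT_DETACH);
	return mount(name, "/sys", "sysfs", 0, NULL);
}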

BR, Anton


Re: [PATCH] net: ipv6: Failure to disable forwarding per interface via sysctl

2016-09-16 Thread Jiri Pirko
Fri, Sep 16, 2016 at 11:48:10AM CEST, mmann...@brocade.com wrote:
>Disabling forwarding per interface via sysctl continues to allow
>forwarding. This is contrary to the sysctl documentation stating that
>the forwarding sysctl is per interface, whereas currently it is only
>the sysctl for all interfaces that has an effect on forwarding. The
>solution is to drop any received packets instead of forwarding them
>if the ingress device has a per-device forwarding sysctl that is unset.
>
>Signed-off-by: Mike Manning 

The patch looks fine. But the subject is a bit weird:
Subject: [PATCH] net: ipv6: Failure to disable forwarding per interface via sysctl

In the subject of the patch you should say what the patch does.


>---
> net/ipv6/ip6_output.c |4 ++++
> 1 file changed, 4 insertions(+)
>
>diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
>index 1dfc402..37cd1d0 100644
>--- a/net/ipv6/ip6_output.c
>+++ b/net/ipv6/ip6_output.c
>@@ -380,11 +380,15 @@ int ip6_forward(struct sk_buff *skb)
>   struct ipv6hdr *hdr = ipv6_hdr(skb);
>   struct inet6_skb_parm *opt = IP6CB(skb);
>   struct net *net = dev_net(dst->dev);
>+  struct inet6_dev *idev = __in6_dev_get(skb->dev);
>   u32 mtu;
> 
>   if (net->ipv6.devconf_all->forwarding == 0)
>   goto error;
> 
>+  if (idev && !idev->cnf.forwarding)
>+  goto error;
>+
>   if (skb->pkt_type != PACKET_HOST)
>   goto drop;
> 
>-- 
>1.7.10.4
>


Re: [PATCH next] sctp: make use of WORD_TRUNC macro

2016-09-16 Thread 'Marcelo Ricardo Leitner'
On Fri, Sep 16, 2016 at 09:51:56AM +, David Laight wrote:
> From: Marcelo Ricardo Leitner
> > Sent: 15 September 2016 19:13
> > No functional change. Just to avoid the usage of '&~3'.
> ...
> > -   max_data = (asoc->pathmtu -
> > -   sctp_sk(asoc->base.sk)->pf->af->net_header_len -
> > -   sizeof(struct sctphdr) - sizeof(struct sctp_data_chunk)) & ~3;
> > +   max_data = asoc->pathmtu -
> > +  sctp_sk(asoc->base.sk)->pf->af->net_header_len -
> > +  sizeof(struct sctphdr) - sizeof(struct sctp_data_chunk);
> > +   max_data = WORD_TRUNC(max_data);
> 
> Hmmm
> Am I the only person who understands immediately what & ~3 does
> but would have to grovel through the headers to find exactly what
> WORD_TRUNC() does?
> 

ctags & co. can help you with that. :)

> How big is a 'WORD' anyway??

That's pretty much one of the reasons for using the macro: to make
sure it is correctly adapted to a given arch if necessary (even
though it isn't necessary in this case).

  Marcelo
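
For anyone following along, a tiny self-contained check of the
equivalence being discussed, assuming WORD_TRUNC keeps the '& ~3'
semantics it replaces (truncate down to a multiple of 4 bytes,
SCTP's 32-bit word):

#include <assert.h>
#include <stdio.h>

/* Assumed definition, equivalent to the '& ~3' it replaces. */
#define WORD_TRUNC(v) ((v) & ~3)

int main(void)
{
	unsigned int v;

	for (v = 0; v < 64; v++)
		assert(WORD_TRUNC(v) == (v / 4) * 4);

	printf("WORD_TRUNC(1453) = %u\n", WORD_TRUNC(1453u)); /* 1452 */
	return 0;
}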



Re: [PATCH v2] xen-netback: fix error handling on netback_probe()

2016-09-16 Thread Wei Liu
On Thu, Sep 15, 2016 at 05:10:46PM +0200, Filipe Manco wrote:
> In case of error during netback_probe() (e.g. an entry missing on the
> xenstore) netback_remove() is called on the new device, which will set
> the device backend state to XenbusStateClosed by calling
> set_backend_state(). However, the backend state wasn't initialized by
> netback_probe() at this point, which will cause an invalid transition
> and set_backend_state() to BUG().
> 
> Initialize the backend state at the beginning of netback_probe() to
> XenbusStateInitialising, and create two new valid state transitions on
> set_backend_state(), from XenbusStateInitialising to XenbusStateClosed,
> and from XenbusStateInitialising to XenbusStateInitWait.
> 
> Signed-off-by: Filipe Manco 

Acked-by: Wei Liu 
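
For illustration, the shape of the fix described above might look
like the sketch below. This paraphrases the commit message rather
than quoting the actual hunks; backend_switch_state() and the
be->state field are assumed names:

/* 1) Early in netback_probe(), before anything that can fail,
 *    record a known starting state: */
be->state = XenbusStateInitialising;

/* 2) Inside set_backend_state()'s state machine, accept the two new
 *    transitions out of XenbusStateInitialising: */
case XenbusStateInitialising:
	switch (state) {
	case XenbusStateInitWait:	/* normal probe path          */
	case XenbusStateClosed:		/* probe failed, tearing down */
		backend_switch_state(be, state);
		break;
	default:
		BUG();			/* any other target is invalid */
	}
	break;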


Re: [PATCH v2] iproute2: build nsid-name cache only for commands that need it

2016-09-16 Thread Vadim Kochan
On Fri, Sep 16, 2016 at 2:21 PM, Anton Aksola  wrote:
> Ok, I will post a new patch version. Should tests be posted in a separate
> patch?
>
> 2016-09-16 12:44 GMT+03:00 Nicolas Dichtel :
>>
>> On 16/09/2016 at 11:23, Vadim Kochan wrote:
>> [snip]
>> > Would it be useful to add test for this case into testsuite/ ?
>> Yes, it's a good idea.
>>
>> Regards,
>> Nicolas
>
>

Anton, I looked into the tests just after posting here. I am not
sure it will be trivial: currently the tests run within a separate
network namespace by default (which I set up) via the 'unshare'
tool, and I now see that it would be better to invoke it explicitly
from each test case. So I am not sure the netns-related tests will
be valid if they are run after 'unshare -n'; if they are, there is
no problem, otherwise it needs to be fixed somehow - I will try to
do this.

Regards,
Vadim Kochan

