date:20170920

Re: [PATCH,net-next,0/2] Improve code coverage of syzkaller

2017-09-20 Thread David Miller

From: Petar Penkov 
Date: Tue, 19 Sep 2017 21:26:14 -0700

> Furthermore, in a way testing already requires specific kernel
> configuration.  In this particular example, syzkaller prefers
> synchronous operation and therefore needs 4KSTACKS disabled. Other
> features that require rebuilding are KASAN and dbx. From this point
> of view, I still think that having the TUN_NAPI flag has value.

Then I think this path could be enabled/disabled with a runtime flag
just as easily, no?

[PATCH net-next v5 1/4] bpf: add helper bpf_perf_event_read_value for perf event array map

2017-09-20 Thread Yonghong Song

Hardware PMU counters are limited resources. When more PMU-based
perf events are opened than there are available counters, the kernel
multiplexes these events so that each event gets only a percentage
(less than 100%) of the PMU time. When multiplexing happens, the
number of samples or the counter value no longer matches what would
be observed without multiplexing. This makes comparison between
different runs difficult.

Typically, the number of samples or counter value should be
normalized before comparing to other experiments. The typical
normalization is done like:
  normalized_num_samples = num_samples * time_enabled / time_running
  normalized_counter_value = counter_value * time_enabled / time_running
where time_enabled is the time enabled for event and time_running is
the time running for event since last normalization.
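
A minimal sketch of the normalization above, written as plain userspace C
and not taken from the patch. The function name normalize_count is
hypothetical, and the sketch assumes time_enabled and time_running are
expressed in the same unit. A running time of zero is treated as "no
estimate possible".

```c
#include <stdint.h>

/*
 * Scale a raw sample count or counter value by time_enabled / time_running
 * to estimate what it would have been without multiplexing.
 * Returns 0 when the event never ran, since no estimate can be made.
 */
static uint64_t normalize_count(uint64_t raw, uint64_t time_enabled,
                                uint64_t time_running)
{
    if (time_running == 0)
        return 0;

    /*
     * The product raw * time_enabled can exceed 64 bits for realistic
     * inputs, so the intermediate is computed in long double. This trades
     * possible rounding for freedom from integer overflow.
     */
    return (uint64_t)((long double)raw * time_enabled / time_running);
}
```

For example, a raw count of 1000 with time_enabled = 200 and
time_running = 100 scales to 2000.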

This patch adds helper bpf_perf_event_read_value for kprobe based perf
event array map, to read perf counter and enabled/running time.
The enabled/running time is accumulated since the perf event open.
To obtain the scaling factor between two bpf invocations, users
can use cpu_id as the key (which is typical for the perf array usage
model) to remember the previous value and do the calculation inside the
bpf program.
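
The "remember the previous value" idea can be sketched as follows. This is
an illustration only, not code from the patch: struct perf_value merely
mirrors the three fields of struct bpf_perf_event_value, and scaled_delta
is a hypothetical helper. It uses integer arithmetic because BPF programs
cannot use floating point, and it assumes the deltas are small enough that
the intermediate product does not overflow 64 bits.

```c
#include <stdint.h>

/* Same three fields as struct bpf_perf_event_value in the patch. */
struct perf_value {
    uint64_t counter;
    uint64_t enabled;
    uint64_t running;
};

/*
 * Estimate how many events occurred between two readings, scaled by the
 * enabled/running time that elapsed in that same interval.
 * Returns 0 if the event did not run at all during the interval.
 */
static uint64_t scaled_delta(const struct perf_value *prev,
                             const struct perf_value *cur)
{
    uint64_t d_counter = cur->counter - prev->counter;
    uint64_t d_enabled = cur->enabled - prev->enabled;
    uint64_t d_running = cur->running - prev->running;

    if (d_running == 0)
        return 0;

    /* Multiply before dividing to keep precision; may overflow for
     * very large deltas (an accepted limitation of this sketch). */
    return d_counter * d_enabled / d_running;
}
```

In a real program, prev would be looked up in a map keyed by cpu_id and
overwritten with cur after each calculation.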

Signed-off-by: Yonghong Song 
---
 include/linux/perf_event.h |  6 --
 include/uapi/linux/bpf.h   | 19 ++-
 kernel/bpf/arraymap.c  |  2 +-
 kernel/bpf/verifier.c  |  4 +++-
 kernel/events/core.c   | 15 ---
 kernel/trace/bpf_trace.c   | 46 +-
 6 files changed, 79 insertions(+), 13 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 8e22f24..21d8c12 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -884,7 +884,8 @@ perf_event_create_kernel_counter(struct perf_event_attr 
*attr,
void *context);
 extern void perf_pmu_migrate_context(struct pmu *pmu,
int src_cpu, int dst_cpu);
-int perf_event_read_local(struct perf_event *event, u64 *value);
+int perf_event_read_local(struct perf_event *event, u64 *value,
+ u64 *enabled, u64 *running);
 extern u64 perf_event_read_value(struct perf_event *event,
 u64 *enabled, u64 *running);
 
@@ -1286,7 +1287,8 @@ static inline const struct perf_event_attr 
*perf_event_attrs(struct perf_event *
 {
return ERR_PTR(-EINVAL);
 }
-static inline int perf_event_read_local(struct perf_event *event, u64 *value)
+static inline int perf_event_read_local(struct perf_event *event, u64 *value,
+   u64 *enabled, u64 *running)
 {
return -EINVAL;
 }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 43ab5c4..ccfe1b1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -582,6 +582,14 @@ union bpf_attr {
  * @map: pointer to sockmap to update
  * @key: key to insert/update sock in map
  * @flags: same flags as map update elem
+ *
+ * int bpf_perf_event_read_value(map, flags, buf, buf_size)
+ * read perf event counter value and perf event enabled/running time
+ * @map: pointer to perf_event_array map
+ * @flags: index of event in the map or bitmask flags
+ * @buf: buf to fill
+ * @buf_size: size of the buf
+ * Return: 0 on success or negative error code
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -638,6 +646,7 @@ union bpf_attr {
FN(redirect_map),   \
FN(sk_redirect_map),\
FN(sock_map_update),\
+   FN(perf_event_read_value),  \
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -681,7 +690,9 @@ enum bpf_func_id {
 #define BPF_F_ZERO_CSUM_TX (1ULL << 1)
 #define BPF_F_DONT_FRAGMENT(1ULL << 2)
 
-/* BPF_FUNC_perf_event_output and BPF_FUNC_perf_event_read flags. */
+/* BPF_FUNC_perf_event_output, BPF_FUNC_perf_event_read and
+ * BPF_FUNC_perf_event_read_value flags.
+ */
 #define BPF_F_INDEX_MASK   0xffffffffULL
 #define BPF_F_CURRENT_CPU  BPF_F_INDEX_MASK
 /* BPF_FUNC_perf_event_output for sk_buff input context. */
@@ -864,4 +875,10 @@ enum {
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
 #define TCP_BPF_SNDCWND_CLAMP  1002/* Set sndcwnd_clamp */
 
+struct bpf_perf_event_value {
+   __u64 counter;
+   __u64 enabled;
+   __u64 running;
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 98c0f00..68d8666 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -492,7 +492,7 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map 
*map,
 
ee = ERR_PTR(-EOPNOTSUPP);
event = perf_file->private_data;
-   if

[PATCH net-next v5 0/4] bpf: add two helpers to read perf event enabled/running time

2017-09-20 Thread Yonghong Song

Hardware PMU counters are limited resources. When more PMU-based
perf events are opened than there are available counters, the kernel
multiplexes these events so that each event gets only a percentage
(less than 100%) of the PMU time. When multiplexing happens, the
number of samples or the counter value no longer matches what would
be observed without multiplexing. This makes comparison between
different runs difficult.

Typically, the number of samples or counter value should be
normalized before comparing to other experiments. The typical
normalization is done like:
  normalized_num_samples = num_samples * time_enabled / time_running
  normalized_counter_value = counter_value * time_enabled / time_running
where time_enabled is the time enabled for event and time_running is
the time running for event since last normalization.
 
This patch set implements two helper functions.
The helper bpf_perf_event_read_value reads counter/time_enabled/time_running
for a perf event array map. The helper bpf_perf_prog_read_value reads
counter/time_enabled/time_running for a bpf prog with type
BPF_PROG_TYPE_PERF_EVENT.
 
Changelogs:
v4->v5:
  . fix some coding style issues
  . memset the input buffer in case of error for ARG_PTR_TO_UNINIT_MEM
type of argument.
v3->v4:
  . fix a build failure
v2->v3:
  . counters should be read in order to read enabled/running time. This is to
prevent that counters and enabled/running time may be read separately.
v1->v2:
  . reading enabled/running time should be together with reading counters
which contains the logic to ensure the result is valid.

Yonghong Song (4):
  bpf: add helper bpf_perf_event_read_value for perf event array map
  bpf: add a test case for helper bpf_perf_event_read_value
  bpf: add helper bpf_perf_prog_read_value
  bpf: add a test case for helper bpf_perf_prog_read_value

 include/linux/perf_event.h|  7 ++-
 include/uapi/linux/bpf.h  | 27 ++-
 kernel/bpf/arraymap.c |  2 +-
 kernel/bpf/verifier.c |  4 +-
 kernel/events/core.c  | 16 +--
 kernel/trace/bpf_trace.c  | 74 ---
 samples/bpf/trace_event_kern.c| 10 +
 samples/bpf/trace_event_user.c| 13 +++---
 samples/bpf/tracex6_kern.c| 26 +++
 samples/bpf/tracex6_user.c| 13 +-
 tools/include/uapi/linux/bpf.h|  4 +-
 tools/testing/selftests/bpf/bpf_helpers.h |  6 +++
 12 files changed, 182 insertions(+), 20 deletions(-)

-- 
2.9.5

[PATCH net-next v5 3/4] bpf: add helper bpf_perf_prog_read_value

2017-09-20 Thread Yonghong Song

This patch adds helper bpf_perf_prog_read_value for perf event based bpf
programs, to read event counter and enabled/running time.
The enabled/running time is accumulated since the perf event open.

The typical use case for a perf event based bpf program is to attach itself
to a single event. In such cases, if a scaling factor between two bpf
invocations is wanted, users can save the time values in a map, and use
the value from the map and the current value to calculate the scaling
factor.

Signed-off-by: Yonghong Song 
---
 include/linux/perf_event.h |  1 +
 include/uapi/linux/bpf.h   |  8 
 kernel/events/core.c   |  1 +
 kernel/trace/bpf_trace.c   | 28 
 4 files changed, 38 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 21d8c12..79b18a2 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -806,6 +806,7 @@ struct perf_output_handle {
 struct bpf_perf_event_data_kern {
struct pt_regs *regs;
struct perf_sample_data *data;
+   struct perf_event *event;
 };
 
 #ifdef CONFIG_CGROUP_PERF
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ccfe1b1..f3eeae2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -590,6 +590,13 @@ union bpf_attr {
  * @buf: buf to fill
  * @buf_size: size of the buf
  * Return: 0 on success or negative error code
+ *
+ * int bpf_perf_prog_read_value(ctx, buf, buf_size)
+ * read perf prog attached perf event counter and enabled/running time
+ * @ctx: pointer to ctx
+ * @buf: buf to fill
+ * @buf_size: size of the buf
+ * Return : 0 on success or negative error code
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -647,6 +654,7 @@ union bpf_attr {
FN(sk_redirect_map),\
FN(sock_map_update),\
FN(perf_event_read_value),  \
+   FN(perf_prog_read_value),   \
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2d5bbe5..d039086 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8081,6 +8081,7 @@ static void bpf_overflow_handler(struct perf_event *event,
struct bpf_perf_event_data_kern ctx = {
.data = data,
.regs = regs,
+   .event = event,
};
int ret = 0;
 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 686dfa1..c4d617a 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -612,6 +612,32 @@ static const struct bpf_func_proto 
bpf_get_stackid_proto_tp = {
.arg3_type  = ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_perf_prog_read_value_tp, struct bpf_perf_event_data_kern *, ctx,
+  struct bpf_perf_event_value *, buf, u32, size)
+{
+   int err;
+
+   if (unlikely(size != sizeof(struct bpf_perf_event_value)))
+   return -EINVAL;
+
+   err = perf_event_read_local(ctx->event, &buf->counter, &buf->enabled,
+   &buf->running);
+   if (unlikely(err)) {
+   memset(buf, 0, size);
+   return err;
+   }
+   return 0;
+}
+
+static const struct bpf_func_proto bpf_perf_prog_read_value_proto_tp = {
+ .func   = bpf_perf_prog_read_value_tp,
+ .gpl_only   = true,
+ .ret_type   = RET_INTEGER,
+ .arg1_type  = ARG_PTR_TO_CTX,
+ .arg2_type  = ARG_PTR_TO_UNINIT_MEM,
+ .arg3_type  = ARG_CONST_SIZE,
+};
+
 static const struct bpf_func_proto *tp_prog_func_proto(enum bpf_func_id 
func_id)
 {
switch (func_id) {
@@ -619,6 +645,8 @@ static const struct bpf_func_proto *tp_prog_func_proto(enum 
bpf_func_id func_id)
return &bpf_perf_event_output_proto_tp;
case BPF_FUNC_get_stackid:
return &bpf_get_stackid_proto_tp;
+   case BPF_FUNC_perf_prog_read_value:
+   return &bpf_perf_prog_read_value_proto_tp;
default:
return tracing_func_proto(func_id);
}
-- 
2.9.5

[PATCH net-next v5 4/4] bpf: add a test case for helper bpf_perf_prog_read_value

2017-09-20 Thread Yonghong Song

The bpf sample program trace_event is enhanced to use the new
helper to print out enabled/running time.

Signed-off-by: Yonghong Song 
---
 samples/bpf/trace_event_kern.c| 10 ++
 samples/bpf/trace_event_user.c| 13 -
 tools/include/uapi/linux/bpf.h|  3 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |  3 +++
 4 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/samples/bpf/trace_event_kern.c b/samples/bpf/trace_event_kern.c
index 41b6115..a77a583d 100644
--- a/samples/bpf/trace_event_kern.c
+++ b/samples/bpf/trace_event_kern.c
@@ -37,10 +37,14 @@ struct bpf_map_def SEC("maps") stackmap = {
 SEC("perf_event")
 int bpf_prog1(struct bpf_perf_event_data *ctx)
 {
+   char time_fmt1[] = "Time Enabled: %llu, Time Running: %llu";
+   char time_fmt2[] = "Get Time Failed, ErrCode: %d";
char fmt[] = "CPU-%d period %lld ip %llx";
u32 cpu = bpf_get_smp_processor_id();
+   struct bpf_perf_event_value value_buf;
struct key_t key;
u64 *val, one = 1;
+   int ret;
 
if (ctx->sample_period < 1)
/* ignore warmup */
@@ -54,6 +58,12 @@ int bpf_prog1(struct bpf_perf_event_data *ctx)
return 0;
}
 
+   ret = bpf_perf_prog_read_value(ctx, (void *)&value_buf, sizeof(struct 
bpf_perf_event_value));
+   if (!ret)
+ bpf_trace_printk(time_fmt1, sizeof(time_fmt1), value_buf.enabled, 
value_buf.running);
+   else
+ bpf_trace_printk(time_fmt2, sizeof(time_fmt2), ret);
+
val = bpf_map_lookup_elem(&counts, &key);
if (val)
(*val)++;
diff --git a/samples/bpf/trace_event_user.c b/samples/bpf/trace_event_user.c
index 7bd827b..bf4f1b6 100644
--- a/samples/bpf/trace_event_user.c
+++ b/samples/bpf/trace_event_user.c
@@ -127,6 +127,9 @@ static void test_perf_event_all_cpu(struct perf_event_attr 
*attr)
int *pmu_fd = malloc(nr_cpus * sizeof(int));
int i, error = 0;
 
+   /* system wide perf event, no need to inherit */
+   attr->inherit = 0;
+
/* open perf_event on all cpus */
for (i = 0; i < nr_cpus; i++) {
pmu_fd[i] = sys_perf_event_open(attr, -1, i, -1, 0);
@@ -154,6 +157,11 @@ static void test_perf_event_task(struct perf_event_attr 
*attr)
 {
int pmu_fd;
 
+   /* per task perf event, enable inherit so the "dd ..." command can be 
traced properly.
+* Enabling inherit will cause bpf_perf_prog_read_value helper failure.
+*/
+   attr->inherit = 1;
+
/* open task bound event */
pmu_fd = sys_perf_event_open(attr, 0, -1, -1, 0);
if (pmu_fd < 0) {
@@ -175,14 +183,12 @@ static void test_bpf_perf_event(void)
.freq = 1,
.type = PERF_TYPE_HARDWARE,
.config = PERF_COUNT_HW_CPU_CYCLES,
-   .inherit = 1,
};
struct perf_event_attr attr_type_sw = {
.sample_freq = SAMPLE_FREQ,
.freq = 1,
.type = PERF_TYPE_SOFTWARE,
.config = PERF_COUNT_SW_CPU_CLOCK,
-   .inherit = 1,
};
struct perf_event_attr attr_hw_cache_l1d = {
.sample_freq = SAMPLE_FREQ,
@@ -192,7 +198,6 @@ static void test_bpf_perf_event(void)
PERF_COUNT_HW_CACHE_L1D |
(PERF_COUNT_HW_CACHE_OP_READ << 8) |
(PERF_COUNT_HW_CACHE_RESULT_ACCESS << 16),
-   .inherit = 1,
};
struct perf_event_attr attr_hw_cache_branch_miss = {
.sample_freq = SAMPLE_FREQ,
@@ -202,7 +207,6 @@ static void test_bpf_perf_event(void)
PERF_COUNT_HW_CACHE_BPU |
(PERF_COUNT_HW_CACHE_OP_READ << 8) |
(PERF_COUNT_HW_CACHE_RESULT_MISS << 16),
-   .inherit = 1,
};
struct perf_event_attr attr_type_raw = {
.sample_freq = SAMPLE_FREQ,
@@ -210,7 +214,6 @@ static void test_bpf_perf_event(void)
.type = PERF_TYPE_RAW,
/* Intel Instruction Retired */
.config = 0xc0,
-   .inherit = 1,
};
 
printf("Test HW_CPU_CYCLES\n");
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 79eb529..50d2bcd 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -633,7 +633,8 @@ union bpf_attr {
FN(redirect_map),   \
FN(sk_redirect_map),\
FN(sock_map_update),\
-   FN(perf_event_read_value),
+   FN(perf_event_read_value),  \
+   FN(perf_prog_read_value),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h 
b/tools/testing/selftests/bpf/bpf_helpers.h
index 08e6f8c..1d3dcd4 100644
---

[PATCH net-next] cxgb4: add new T5 pci device id's

2017-09-20 Thread Ganesh Goudar

Add 0x50a5, 0x50a6, 0x50a7, 0x50a8 and 0x50a9 T5 device
IDs.

Signed-off-by: Ganesh Goudar 
---
 drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h 
b/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h
index aa28299..37d90d6 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h
@@ -176,6 +176,11 @@ CH_PCI_DEVICE_ID_TABLE_DEFINE_BEGIN
CH_PCI_ID_TABLE_FENTRY(0x50a2), /* Custom T540-KR4 */
CH_PCI_ID_TABLE_FENTRY(0x50a3), /* Custom T580-KR4 */
CH_PCI_ID_TABLE_FENTRY(0x50a4), /* Custom 2x T540-CR */
+   CH_PCI_ID_TABLE_FENTRY(0x50a5), /* Custom T522-BT */
+   CH_PCI_ID_TABLE_FENTRY(0x50a6), /* Custom T522-BT-SO */
+   CH_PCI_ID_TABLE_FENTRY(0x50a7), /* Custom T580-CR */
+   CH_PCI_ID_TABLE_FENTRY(0x50a8), /* Custom T580-KR */
+   CH_PCI_ID_TABLE_FENTRY(0x50a9), /* Custom T580-KR */
 
/* T6 adapters:
 */
-- 
2.1.0

Re: [PATCH net-next 2/4] qed: Add iWARP out of order support

2017-09-20 Thread Kalderon, Michal

From: Leon Romanovsky 
Sent: Tuesday, September 19, 2017 8:45 PM
On Tue, Sep 19, 2017 at 08:26:17PM +0300, Michal Kalderon wrote:
>> iWARP requires OOO support which is already provided by the ll2
>> interface (until now was used only for iSCSI offload).
>> The changes mostly include opening a ll2 dedicated connection for
>> OOO and notifying the FW about the handle id.
>>
>> Signed-off-by: Michal Kalderon 
>> Signed-off-by: Ariel Elior 
>> ---
>>  drivers/net/ethernet/qlogic/qed/qed_iwarp.c | 44 
>> +
>>  drivers/net/ethernet/qlogic/qed/qed_iwarp.h | 11 +++-
>>  drivers/net/ethernet/qlogic/qed/qed_rdma.c  |  7 +++--
>>  3 files changed, 59 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/qlogic/qed/qed_iwarp.c 
>> b/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
>> index 9d989c9..568e985 100644
>> --- a/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
>> +++ b/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
>> @@ -41,6 +41,7 @@
>>  #include "qed_rdma.h"
>>  #include "qed_reg_addr.h"
>>  #include "qed_sp.h"
>> +#include "qed_ooo.h"
>>
>>  #define QED_IWARP_ORD_DEFAULT32
>>  #define QED_IWARP_IRD_DEFAULT32
>> @@ -119,6 +120,13 @@ static void qed_iwarp_cid_cleaned(struct qed_hwfn 
>> *p_hwfn, u32 cid)
>>   spin_unlock_bh(&p_hwfn->p_rdma_info->lock);
>>  }
>>
>> +void qed_iwarp_init_fw_ramrod(struct qed_hwfn *p_hwfn,
>> +   struct iwarp_init_func_params *p_ramrod)
>> +{
>> + p_ramrod->ll2_ooo_q_index = RESC_START(p_hwfn, QED_LL2_QUEUE) +
>> + p_hwfn->p_rdma_info->iwarp.ll2_ooo_handle;
>> +}
>> +
>>  static int qed_iwarp_alloc_cid(struct qed_hwfn *p_hwfn, u32 *cid)
>>  {
>>   int rc;
>> @@ -1876,6 +1884,16 @@ static int qed_iwarp_ll2_stop(struct qed_hwfn 
>> *p_hwfn, struct qed_ptt *p_ptt)
>>   iwarp_info->ll2_syn_handle = QED_IWARP_HANDLE_INVAL;
>>   }
>>
>> + if (iwarp_info->ll2_ooo_handle != QED_IWARP_HANDLE_INVAL) {
>> + rc = qed_ll2_terminate_connection(p_hwfn,
>> +   iwarp_info->ll2_ooo_handle);
>> + if (rc)
>> + DP_INFO(p_hwfn, "Failed to terminate ooo 
>> connection\n");
>
>What exactly will you do with this knowledge? Anyway you are not
>interested in return values of qed_ll2_terminate_connection function in
>this place and other places too.
>
>Why don't you handle EAGAIN returned from the qed_ll2_terminate_connection()?
>
>Thanks
Thanks for pointing this out. You're right that we could have ignored the
return code, as there's not much we can do at this point if it failed.
But I still feel failures are worth knowing about, since they could help
in analysis if they unexpectedly lead to another issue.
As for EAGAIN, it is very unlikely that we'll get this return code. I will
consider adding generic handling for it as a separate patch, as it
currently isn't handled in any of the ll2 flows.
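
To make the "generic handling" idea concrete, one possible shape is a
bounded retry wrapper. This is only a sketch under stated assumptions, not
the qed driver's code: op_fn, retry_on_eagain and flaky_op are hypothetical
names, and a real driver would likely sleep or reschedule between attempts
instead of retrying back to back.

```c
#include <errno.h>

/* Hypothetical operation type: returns 0 or a negative errno value. */
typedef int (*op_fn)(void *arg);

/*
 * Call op until it stops returning -EAGAIN or max_tries is exhausted.
 * Any other return value (success or a different error) ends the loop
 * and is passed back to the caller unchanged.
 */
static int retry_on_eagain(op_fn op, void *arg, int max_tries)
{
    int rc = -EAGAIN;

    for (int i = 0; i < max_tries && rc == -EAGAIN; i++)
        rc = op(arg);

    return rc;
}

/*
 * Demo operation: arg points to an int holding how many more times the
 * call should report -EAGAIN before it finally succeeds.
 */
static int flaky_op(void *arg)
{
    int *left = arg;

    if (*left > 0) {
        (*left)--;
        return -EAGAIN;
    }
    return 0;
}
```

With this shape, the caller still sees the final return code and can log
it, which keeps the failure visible for later analysis.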
thanks,

[PATCH net-next v5 2/4] bpf: add a test case for helper bpf_perf_event_read_value

2017-09-20 Thread Yonghong Song

The bpf sample program tracex6 is enhanced to use the new
helper to read enabled/running time as well.

Signed-off-by: Yonghong Song 
---
 samples/bpf/tracex6_kern.c| 26 ++
 samples/bpf/tracex6_user.c| 13 -
 tools/include/uapi/linux/bpf.h|  3 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |  3 +++
 4 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/samples/bpf/tracex6_kern.c b/samples/bpf/tracex6_kern.c
index e7d1803..46c557a 100644
--- a/samples/bpf/tracex6_kern.c
+++ b/samples/bpf/tracex6_kern.c
@@ -15,6 +15,12 @@ struct bpf_map_def SEC("maps") values = {
.value_size = sizeof(u64),
.max_entries = 64,
 };
+struct bpf_map_def SEC("maps") values2 = {
+   .type = BPF_MAP_TYPE_HASH,
+   .key_size = sizeof(int),
+   .value_size = sizeof(struct bpf_perf_event_value),
+   .max_entries = 64,
+};
 
 SEC("kprobe/htab_map_get_next_key")
 int bpf_prog1(struct pt_regs *ctx)
@@ -37,5 +43,25 @@ int bpf_prog1(struct pt_regs *ctx)
return 0;
 }
 
+SEC("kprobe/htab_map_lookup_elem")
+int bpf_prog2(struct pt_regs *ctx)
+{
+   u32 key = bpf_get_smp_processor_id();
+   struct bpf_perf_event_value *val, buf;
+   int error;
+
+   error = bpf_perf_event_read_value(&counters, key, &buf, sizeof(buf));
+   if (error)
+   return 0;
+
+   val = bpf_map_lookup_elem(&values2, &key);
+   if (val)
+   *val = buf;
+   else
+   bpf_map_update_elem(&values2, &key, &buf, BPF_NOEXIST);
+
+   return 0;
+}
+
 char _license[] SEC("license") = "GPL";
 u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/tracex6_user.c b/samples/bpf/tracex6_user.c
index a05a99a..3341a96 100644
--- a/samples/bpf/tracex6_user.c
+++ b/samples/bpf/tracex6_user.c
@@ -22,6 +22,7 @@
 
 static void check_on_cpu(int cpu, struct perf_event_attr *attr)
 {
+   struct bpf_perf_event_value value2;
int pmu_fd, error = 0;
cpu_set_t set;
__u64 value;
@@ -46,8 +47,18 @@ static void check_on_cpu(int cpu, struct perf_event_attr 
*attr)
fprintf(stderr, "Value missing for CPU %d\n", cpu);
error = 1;
goto on_exit;
+   } else {
+   fprintf(stderr, "CPU %d: %llu\n", cpu, value);
+   }
+   /* The above bpf_map_lookup_elem should trigger the second kprobe */
+   if (bpf_map_lookup_elem(map_fd[2], &cpu, &value2)) {
+   fprintf(stderr, "Value2 missing for CPU %d\n", cpu);
+   error = 1;
+   goto on_exit;
+   } else {
+   fprintf(stderr, "CPU %d: counter: %llu, enabled: %llu, running: 
%llu\n", cpu,
+   value2.counter, value2.enabled, value2.running);
}
-   fprintf(stderr, "CPU %d: %llu\n", cpu, value);
 
 on_exit:
assert(bpf_map_delete_elem(map_fd[0], &cpu) == 0 || error);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 461811e..79eb529 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -632,7 +632,8 @@ union bpf_attr {
FN(skb_adjust_room),\
FN(redirect_map),   \
FN(sk_redirect_map),\
-   FN(sock_map_update),
+   FN(sock_map_update),\
+   FN(perf_event_read_value),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h 
b/tools/testing/selftests/bpf/bpf_helpers.h
index 36fb916..08e6f8c 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -70,6 +70,9 @@ static int (*bpf_sk_redirect_map)(void *map, int key, int 
flags) =
 static int (*bpf_sock_map_update)(void *map, void *key, void *value,
  unsigned long long flags) =
(void *) BPF_FUNC_sock_map_update;
+static int (*bpf_perf_event_read_value)(void *map, unsigned long long flags,
+   void *buf, unsigned int buf_size) =
+   (void *) BPF_FUNC_perf_event_read_value;
 
 
 /* llvm builtin functions that eBPF C program may use to
-- 
2.9.5

Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski


Almost there

Bisecting: 6 revisions left to test after this (roughly 3 steps)
[ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call dst_hold_safe() 
properly




On 2017-09-20 at 13:02, Paweł Staszewski wrote:

OK, resumed, and so far:

Panic:

# bad: [9cc9a5cb176ccb4f2cda5ac34da5a659926f125f] datapath: Avoid 
using stack larger than 1024.

git bisect bad 9cc9a5cb176ccb4f2cda5ac34da5a659926f125f

No panic:

# good: [073cf9e20c333ab29744717a23f9e43ec7512a20] Merge branch 
'udp-reduce-cache-pressure'

git bisect good 073cf9e20c333ab29744717a23f9e43ec7512a20


On 2017-09-20 at 12:22, Paweł Staszewski wrote:

So far bisected and marked:

git bisect start
# bad: [07dd6cc1fff160143e82cf5df78c1db0b6e03355] Linux 4.13.2
git bisect bad 07dd6cc1fff160143e82cf5df78c1db0b6e03355
# good: [5d7d2e03e0f01a992e3521b180c3d3e67905f269] Linux 4.12.13
git bisect good 5d7d2e03e0f01a992e3521b180c3d3e67905f269
# good: [6f7da290413ba713f0cdd9ff1a2a9bb129ef4f6c] Linux 4.12
git bisect good 6f7da290413ba713f0cdd9ff1a2a9bb129ef4f6c
# bad: [ac7b75966c9c86426b55fe1c50ae148aa4571075] Merge tag 
'pinctrl-v4.13-1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

git bisect bad ac7b75966c9c86426b55fe1c50ae148aa4571075
# good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 
'next' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security

git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31
# good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 
'next' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security

git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31
# good: [e24dd9ee5399747b71c1d982a484fc7601795f31] Merge branch 
'next' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security

git bisect good e24dd9ee5399747b71c1d982a484fc7601795f31



On 2017-09-20 at 12:21, Paweł Staszewski wrote:
OK, the kernel crashed with a different panic that I didn't catch while I
was bisecting, and now my bisection is broken :)


git bisect good
Bisecting: 1787 revisions left to test after this (roughly 11 steps)
error: Your local changes to the following files would be 
overwritten by checkout:

    Documentation/00-INDEX
    Documentation/ABI/stable/sysfs-class-udc
    Documentation/ABI/testing/configfs-usb-gadget-uac1
    Documentation/ABI/testing/ima_policy
    Documentation/ABI/testing/sysfs-bus-iio
    Documentation/ABI/testing/sysfs-bus-iio-meas-spec
    Documentation/ABI/testing/sysfs-bus-iio-timer-stm32
    Documentation/ABI/testing/sysfs-class-net
    Documentation/ABI/testing/sysfs-class-power-twl4030
    Documentation/ABI/testing/sysfs-class-typec
    Documentation/DMA-API.txt
    Documentation/IRQ-domain.txt
    Documentation/Makefile
    Documentation/PCI/MSI-HOWTO.txt
    Documentation/RCU/00-INDEX
Documentation/RCU/Design/Requirements/Requirements.html
    Documentation/RCU/checklist.txt
    Documentation/admin-guide/README.rst
    Documentation/admin-guide/devices.txt
    Documentation/admin-guide/index.rst
    Documentation/admin-guide/kernel-parameters.txt
    Documentation/admin-guide/pm/cpufreq.rst
    Documentation/admin-guide/pm/intel_pstate.rst
    Documentation/admin-guide/ras.rst
    Documentation/arm/Atmel/README
    Documentation/block/biodoc.txt
    Documentation/conf.py
    Documentation/core-api/assoc_array.rst
    Documentation/core-api/atomic_ops.rst
    Documentation/core-api/index.rst
    Documentation/crypto/asymmetric-keys.txt
    Documentation/dev-tools/index.rst
    Documentation/dev-tools/sparse.rst
    Documentation/devicetree/bindings/arm/amlogic.txt
    Documentation/devicetree/bindings/arm/atmel-at91.txt
    Documentation/devicetree/bindings/arm/ccn.txt
    Documentation/devicetree/bindings/arm/cpus.txt
    Documentation/devicetree/bindings/arm/gemini.txt
Documentation/devicetree/bindings/arm/hisilicon/hisilicon.txt
Documentation/devicetree/bindings/arm/keystone/keystone.txt
    Documentation/devicetree/bindings/arm/mediatek.txt
    Documentation/devicetree/bindings/arm/rockchip.txt
    Documentation/devicetree/bindings/arm/shmobile.txt
    Documentation/devicetree/bindings/arm/tegra.txt
Documentation/devicetree/bindings/ata/ahci-fsl-qoriq.txt
Documentation/devicetree/bindings/bus/brcm,gisb-arb.txt
Documentation/devicetree/bindings/clock/brcm,iproc-clocks.txt
Documentation/devicetree/bindings/cpufreq/ti-cpufreq.txt
    Documentation/devicetree/bindings/gpio/gpio_atmel.txt
Documentation/devicetree/bindings/iio/adc/amlogic,meson-saradc.txt
Documentation/devicetree/bindings/iio/adc/renesas,gyroadc.txt
Documentation/devicetree/bindings/iio/adc/st,stm32-adc.txt
Documentation/devicetree/bindings/iio/imu/st_lsm6dsx.txt
Documentation/devicetree/bindings/interrupt-controller/allwinner,sunxi-nmi.txt

[net-next] macvlan: code refine to check data before using

2017-09-20 Thread Zhang Shengju

This patch checks data once, in one place, and returns early if it is NULL.

Signed-off-by: Zhang Shengju 
---
 drivers/net/macvlan.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index d2aea96..1ffe77e 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -1231,11 +1231,14 @@ static int macvlan_validate(struct nlattr *tb[], struct 
nlattr *data[],
return -EADDRNOTAVAIL;
}
 
-   if (data && data[IFLA_MACVLAN_FLAGS] &&
+   if (!data)
+   return 0;
+
+   if (data[IFLA_MACVLAN_FLAGS] &&
nla_get_u16(data[IFLA_MACVLAN_FLAGS]) & ~MACVLAN_FLAG_NOPROMISC)
return -EINVAL;
 
-   if (data && data[IFLA_MACVLAN_MODE]) {
+   if (data[IFLA_MACVLAN_MODE]) {
switch (nla_get_u32(data[IFLA_MACVLAN_MODE])) {
case MACVLAN_MODE_PRIVATE:
case MACVLAN_MODE_VEPA:
@@ -1248,7 +1251,7 @@ static int macvlan_validate(struct nlattr *tb[], struct 
nlattr *data[],
}
}
 
-   if (data && data[IFLA_MACVLAN_MACADDR_MODE]) {
+   if (data[IFLA_MACVLAN_MACADDR_MODE]) {
switch (nla_get_u32(data[IFLA_MACVLAN_MACADDR_MODE])) {
case MACVLAN_MACADDR_ADD:
case MACVLAN_MACADDR_DEL:
@@ -1260,7 +1263,7 @@ static int macvlan_validate(struct nlattr *tb[], struct 
nlattr *data[],
}
}
 
-   if (data && data[IFLA_MACVLAN_MACADDR]) {
+   if (data[IFLA_MACVLAN_MACADDR]) {
if (nla_len(data[IFLA_MACVLAN_MACADDR]) != ETH_ALEN)
return -EINVAL;
 
@@ -1268,7 +1271,7 @@ static int macvlan_validate(struct nlattr *tb[], struct 
nlattr *data[],
return -EADDRNOTAVAIL;
}
 
-   if (data && data[IFLA_MACVLAN_MACADDR_COUNT])
+   if (data[IFLA_MACVLAN_MACADDR_COUNT])
return -EINVAL;
 
return 0;
-- 
1.8.3.1

Re: [RFC PATCH 2/3] usbnet: Avoid potential races in usbnet_deferred_kevent()

2017-09-20 Thread Oliver Neukum

On Tuesday, 19.09.2017, at 13:51 -0700, Guenter Roeck wrote:
> On Tue, Sep 19, 2017 at 1:37 PM, Oliver Neukum  wrote:
> > 
> > On Tuesday, 19.09.2017, at 09:15 -0700, Douglas Anderson wrote:
> > > 
[..]
> > > NOTES:
> > > - No known bugs are fixed by this; it's just found by code inspection.
> > 
> > Hi,
> > 
> > unfortunately the patch is wrong. The flags must be cleared only
> > in case the handler is successful. That is not guaranteed.
> > 
> 
> Just out of curiosity, what is the retry mechanism ? Whenever a new,
> possibly unrelated, event is scheduled ?

Hi,

that actually depends on the flag.
Look at the case of fail_lowmem. There we reschedule.
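
The rule stated above (clear a flag only when its handler succeeded,
otherwise leave it set and reschedule) can be sketched in plain C. This is
not the usbnet code: struct dev_state, handle_event, the EVENT_* bits and
both demo handlers are hypothetical names used for illustration only.

```c
#include <stdbool.h>

enum { EVENT_A = 0, EVENT_B = 1 };   /* hypothetical event bits */

struct dev_state {
    unsigned long flags;   /* one bit per pending event */
    bool rescheduled;      /* set when the work must run again */
};

typedef int (*handler_fn)(struct dev_state *dev);

/*
 * Run the handler for one pending event. The bit is cleared only if the
 * handler reports success; on failure it stays set and the work is
 * marked for another run, so the event is not silently lost.
 */
static void handle_event(struct dev_state *dev, int bit, handler_fn handler)
{
    unsigned long mask = 1UL << bit;

    if (!(dev->flags & mask))
        return;

    if (handler(dev) == 0)
        dev->flags &= ~mask;
    else
        dev->rescheduled = true;
}

/* Demo handlers: one always succeeds, one always fails. */
static int ok_handler(struct dev_state *dev)
{
    (void)dev;
    return 0;
}

static int failing_handler(struct dev_state *dev)
{
    (void)dev;
    return -1;
}
```

Clearing the bit before calling the handler would drop the event whenever
the handler fails, which is the problem described in the reply above.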

HTH
Oliver

Re: Regression in throughput between kvm guests over virtual bridge

2017-09-20 Thread Jason Wang




On 2017-09-19 02:11, Matthew Rosato wrote:

On 09/18/2017 03:36 AM, Jason Wang wrote:


On 2017-09-18 11:13, Jason Wang wrote:


On 2017-09-16 03:19, Matthew Rosato wrote:

It looks like vhost is slowed down for some reason which leads to more
idle time on 4.13+VHOST_RX_BATCH=1. Appreciated if you can collect the
perf.diff on host, one for rx and one for tx.


perf data below for the associated vhost threads, baseline=4.12,
delta1=4.13, delta2=4.13+VHOST_RX_BATCH=1

Client vhost:

Baseline   Delta1   Delta2  Shared Object      Symbol
  60.12%  -11.11%  -12.34%  [kernel.vmlinux]   [k] raw_copy_from_user
  13.76%   -1.28%   -0.74%  [kernel.vmlinux]   [k] get_page_from_freelist
   2.00%   +3.69%   +3.54%  [kernel.vmlinux]   [k] __wake_up_sync_key
   1.19%   +0.60%   +0.66%  [kernel.vmlinux]   [k] __alloc_pages_nodemask
   1.12%   +0.76%   +0.86%  [kernel.vmlinux]   [k] copy_page_from_iter
   1.09%   +0.28%   +0.35%  [vhost]            [k] vhost_get_vq_desc
   1.07%   +0.31%   +0.26%  [kernel.vmlinux]   [k] alloc_skb_with_frags
   0.94%   +0.42%   +0.65%  [kernel.vmlinux]   [k] alloc_pages_current
   0.91%   -0.19%   -0.18%  [kernel.vmlinux]   [k] memcpy
   0.88%   +0.26%   +0.30%  [kernel.vmlinux]   [k] __next_zones_zonelist
   0.85%   +0.05%   +0.12%  [kernel.vmlinux]   [k] iov_iter_advance
   0.79%   +0.09%   +0.19%  [vhost]            [k] __vhost_add_used_n
   0.74%                    [kernel.vmlinux]   [k] get_task_policy.part.7
   0.74%   -0.01%   -0.05%  [kernel.vmlinux]   [k] tun_net_xmit
   0.60%   +0.17%   +0.33%  [kernel.vmlinux]   [k] policy_nodemask
   0.58%   -0.15%   -0.12%  [ebtables]         [k] ebt_do_table
   0.52%   -0.25%   -0.22%  [kernel.vmlinux]   [k] __alloc_skb
 ...
   0.42%   +0.58%   +0.59%  [kernel.vmlinux]   [k] eventfd_signal
 ...
   0.32%   +0.96%   +0.93%  [kernel.vmlinux]   [k] finish_task_switch
 ...
           +1.50%   +1.16%  [kernel.vmlinux]   [k] get_task_policy.part.9
           +0.40%   +0.42%  [kernel.vmlinux]   [k] __skb_get_hash_symmetr
           +0.39%   +0.40%  [kernel.vmlinux]   [k] _copy_from_iter_full
           +0.24%   +0.23%  [vhost_net]        [k] vhost_net_buf_peek

Server vhost:

  61.93%  -10.72%  -10.91%  [kernel.vmlinux]  [k] raw_copy_to_user
   9.25%   +0.47%   +0.86%  [kernel.vmlinux]  [k] free_hot_cold_page
   5.16%   +1.41%   +1.57%  [vhost]           [k] vhost_get_vq_desc
   5.12%   -3.81%   -3.78%  [kernel.vmlinux]  [k] skb_release_data
   3.30%   +0.42%   +0.55%  [kernel.vmlinux]  [k] raw_copy_from_user
   1.29%   +2.20%   +2.28%  [kernel.vmlinux]  [k] copy_page_to_iter
   1.24%   +1.65%   +0.45%  [vhost_net]       [k] handle_rx
   1.08%   +3.03%   +2.85%  [kernel.vmlinux]  [k] __wake_up_sync_key
   0.96%   +0.70%   +1.10%  [vhost]           [k] translate_desc
   0.69%   -0.20%   -0.22%  [kernel.vmlinux]  [k] tun_do_read.part.10
   0.69%                    [kernel.vmlinux]  [k] tun_peek_len
   0.67%   +0.75%   +0.78%  [kernel.vmlinux]  [k] eventfd_signal
   0.52%   +0.96%   +0.98%  [kernel.vmlinux]  [k] finish_task_switch
   0.50%   +0.05%   +0.09%  [vhost]           [k] vhost_add_used_n
  ...
           +0.63%   +0.58%  [vhost_net]       [k] vhost_net_buf_peek
           +0.32%   +0.32%  [kernel.vmlinux]  [k] _copy_to_iter
           +0.19%   +0.19%  [kernel.vmlinux]  [k] __skb_get_hash_symmetr
           +0.11%   +0.21%  [vhost]           [k] vhost_umem_interval_tr


It looks like, for some unknown reason, there are more wakeups.

Could you please try the attached patch to see if it solves or mitigates
the issue?

Thanks

My bad, please try this.

Thanks

Thanks Jason.  Built 4.13 + supplied patch, I see some decrease in
wakeups, but there's still quite a bit more compared to 4.12
(baseline=4.12, delta1=4.13, delta2=4.13+patch):

client:
  2.00%   +3.69%   +2.55%  [kernel.vmlinux]   [k] __wake_up_sync_key

server:
  1.08%   +3.03%   +1.85%  [kernel.vmlinux]   [k] __wake_up_sync_key


Throughput was roughly equivalent to base 4.13 (so, still seeing the
regression w/ this patch applied).



This seems to make some progress on wakeup mitigation. The previous patch tries
to reduce unnecessary traversal of the waitqueue during rx. The attached
patch goes further and disables rx polling while processing tx.
Please try it to see if it makes any difference.


And two questions:
- Does the issue exist if you run uperf between 2 VMs (instead of 4 VMs)?
- Does enabling batching in the tap of the sending VM improve performance
(ethtool -C $tap rx-frames 64)?


Thanks
From d57ad96083fc57205336af1b5ea777e5185f1581 Mon Sep 17 00:00:00 2001
From: Jason Wang 
Date: Wed, 20 Sep 2017 11:44:49 +0800
Subject: [PATCH] vhost_net: avoid unnecessary wakeups during tx

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index ed476fa..e7349cf 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -444,8 +444,11 @@ static

[PATCH] netfilter: nf_tables: Release memory obtained by kasprintf

2017-09-20 Thread Arvind Yadav

Free memory region, if nf_tables_set_alloc_name is not successful.

Signed-off-by: Arvind Yadav 
---
 net/netfilter/nf_tables_api.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 9299271..393e37e 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -2741,8 +2741,10 @@ static int nf_tables_set_alloc_name(struct nft_ctx *ctx, struct nft_set *set,
 	list_for_each_entry(i, &ctx->table->sets, list) {
 		if (!nft_is_active_next(ctx->net, i))
 			continue;
-		if (!strcmp(set->name, i->name))
+		if (!strcmp(set->name, i->name)) {
+			kfree(set->name);
 			return -ENFILE;
+		}
 	}
 	return 0;
 }
-- 
1.9.1

Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski


Hi


Will try bisecting tonight



On 2017-09-20 at 05:24, Eric Dumazet wrote:

On Wed, 2017-09-20 at 02:06 +0200, Paweł Staszewski wrote:

Just checked kernel 4.13.2 and it has the same problem.

Just after all 6 BGP sessions start and the kernel begins to learn routes,
it panics.

https://bugzilla.kernel.org/attachment.cgi?id=258509



Unfortunately we have not enough information from these traces.

Can you get a full stack trace ?

Alternatively, can you bisect ?

Thanks.

Re: [RFC PATCH 2/3] usbnet: Avoid potential races in usbnet_deferred_kevent()

2017-09-20 Thread Oliver Neukum

On Tuesday, 2017-09-19 at 13:53 -0700, Doug Anderson wrote:
> Hi,
> 
> On Tue, Sep 19, 2017 at 1:37 PM, Oliver Neukum  wrote:
> > 
> > > On Tuesday, 2017-09-19 at 09:15 -0700, Douglas Anderson wrote:
> > > 
> > > In general when you've got a flag communicating that "something needs
> > > to be done" you want to clear that flag _before_ doing the task.  If
> > > you clear the flag _after_ doing the task you end up with the risk
> > > that this will happen:
> > > 
> > > 1. Requester sets flag saying task A needs to be done.
> > > > 2. Worker comes and starts doing task A.
> > > 3. Worker finishes task A but hasn't yet cleared the flag.
> > > 4. Requester wants to set flag saying task A needs to be done again.
> > > 5. Worker clears the flag without doing anything.
> > > 
> > > Let's make the usbnet codebase consistently clear the flag _before_ it
> > > does the requested work.  That way if there's another request to do
> > > the work while the work is already in progress it won't be lost.
> > > 
> > > NOTES:
> > > - No known bugs are fixed by this; it's just found by code inspection.
> > 
> > Hi,
> > 
> > unfortunately the patch is wrong. The flags must be cleared only
> > in case the handler is successful. That is not guaranteed.
> > 
> > Regards
> > Oliver
> > 
> > NACK
> 
> OK, thanks for reviewing!  I definitely wasn't super confident about
> the patch (hence the RFC).
> 
> Do you think that the races I identified are possible to hit?  In

As far as I can tell, we are safe, but you are right to say that the
driver is not quite clean at that point.

> other words: should I try to rework the patch somehow or just drop it?
>  Originally I had the patch setting the flags back to true in the
> failure cases, but then I convinced myself that wasn't needed.  I can
> certainly go back and try it that way...

Setting the flags again in the error case would certainly be an
improvement. I'd be happy with a patch doing that.

Regards
Oliver
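
The approach agreed on above (clear the flag before doing the work, and set
it again if the handler fails) can be sketched in userspace C as follows.
This is only a minimal illustration using C11 atomics, not usbnet code:
EVT_TASK_A, pending_flags, request_task_a(), handle_task_a() and
worker_run() are all hypothetical names, and the `succeed` argument merely
simulates whether the handler worked.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical event bit; not a real usbnet flag name. */
#define EVT_TASK_A (1u << 0)

static atomic_uint pending_flags;   /* requests waiting for the worker */
static int work_done;               /* counts successful handler runs */

/* Requester side: mark task A as pending. */
static void request_task_a(void)
{
    atomic_fetch_or(&pending_flags, EVT_TASK_A);
}

/* Stand-in for the real handler; `succeed` simulates its outcome. */
static bool handle_task_a(bool succeed)
{
    if (succeed)
        work_done++;
    return succeed;
}

/* Worker side: returns true if the handler was invoked. */
static bool worker_run(bool succeed)
{
    /* Clear first, so a request arriving during the work is not lost. */
    unsigned int old = atomic_fetch_and(&pending_flags, ~EVT_TASK_A);

    if (!(old & EVT_TASK_A))
        return false;               /* nothing was pending */

    if (!handle_task_a(succeed))
        /* Handler failed: set the flag again so a later run retries. */
        atomic_fetch_or(&pending_flags, EVT_TASK_A);

    return true;
}
```

With this ordering, a request that arrives while the handler is running
leaves the bit set for the next worker pass, and a failed handler leaves it
set as well, which matches the "flags stay set unless the handler succeeded"
requirement from the review.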

[PATCH v2 1/2] mac80211: Add rcu read side critical sections

2017-09-20 Thread Ville Syrjala

From: Ville Syrjälä 

I got the following lockdep warning about the rcu_dereference()s in
ieee80211_tx_h_select_key(). After tracing all callers of
ieee80211_tx_h_select_key() I discovered that ieee80211_get_buffered_bc()
and ieee80211_build_data_template() had the rcu_read_lock/unlock() but
three other places did not. So I just blindly added them and made the
read side critical section extend as far as the lifetime of 'tx' which
is where we seem to be stuffing the rcu protected pointers. No real clue
whether this is correct or not.

[  854.573700] ../net/mac80211/tx.c:594 suspicious rcu_dereference_check() 
usage!
[  854.573704]
   other info that might help us debug this:

[  854.573707]
   rcu_scheduler_active = 2, debug_locks = 1
[  854.573712] 6 locks held by kworker/u2:0/2877:
[  854.573715]  #0:  ("%s"wiphy_name(local->hw.wiphy)){.+}, at: 
[] process_one_work+0x127/0x580
[  854.573742]  #1:  ((>work)){+.+.+.}, at: [] 
process_one_work+0x127/0x580
[  854.573758]  #2:  (>mtx){+.+.+.}, at: [] 
ieee80211_sta_work+0x23/0x1c70 [mac80211]
[  854.573902]  #3:  (>sta_mtx){+.+.+.}, at: [] 
__sta_info_flush+0x60/0x160 [mac80211]
[  854.573947]  #4:  (&(>axq_lock)->rlock){+.-...}, at: [] 
ath_tx_node_cleanup+0x5c/0x180 [ath9k]
[  854.573973]  #5:  (&(>lock)->rlock){+.-...}, at: [] 
ieee80211_tx_dequeue+0x24/0xa80 [mac80211]
[  854.574023]
   stack backtrace:
[  854.574028] CPU: 0 PID: 2877 Comm: kworker/u2:0 Not tainted 4.13.0-mgm-ovl+ 
#52
[  854.574032] Hardware name: FUJITSU SIEMENS LIFEBOOK S6120/FJNB16C, BIOS 
Version 1.26  05/10/2004
[  854.574070] Workqueue: phy0 ieee80211_iface_work [mac80211]
[  854.574076] Call Trace:
[  854.574086]  dump_stack+0x16/0x19
[  854.574092]  lockdep_rcu_suspicious+0xcb/0xf0
[  854.574131]  ieee80211_tx_h_select_key+0x1b5/0x500 [mac80211]
[  854.574171]  ieee80211_tx_dequeue+0x283/0xa80 [mac80211]
[  854.574181]  ath_tid_dequeue+0x84/0xf0 [ath9k]
[  854.574189]  ath_tx_node_cleanup+0xb8/0x180 [ath9k]
[  854.574199]  ath9k_sta_state+0x48/0xf0 [ath9k]
[  854.574207]  ? ath9k_del_ps_key.isra.19+0x60/0x60 [ath9k]
[  854.574240]  drv_sta_state+0xaf/0x8c0 [mac80211]
[  854.574275]  __sta_info_destroy_part2+0x10b/0x140 [mac80211]
[  854.574309]  __sta_info_flush+0xd5/0x160 [mac80211]
[  854.574349]  ieee80211_set_disassoc+0xd3/0x570 [mac80211]
[  854.574390]  ieee80211_sta_connection_lost+0x30/0x60 [mac80211]
[  854.574431]  ieee80211_sta_work+0x1ff/0x1c70 [mac80211]
[  854.574436]  ? mark_held_locks+0x62/0x90
[  854.574443]  ? _raw_spin_unlock_irqrestore+0x55/0x70
[  854.574447]  ? trace_hardirqs_on_caller+0x11c/0x1a0
[  854.574452]  ? trace_hardirqs_on+0xb/0x10
[  854.574459]  ? dev_mc_net_exit+0xe/0x20
[  854.574467]  ? skb_dequeue+0x48/0x70
[  854.574504]  ieee80211_iface_work+0x2d8/0x320 [mac80211]
[  854.574509]  process_one_work+0x1d1/0x580
[  854.574513]  ? process_one_work+0x127/0x580
[  854.574519]  worker_thread+0x31/0x380
[  854.574525]  kthread+0xd9/0x110
[  854.574529]  ? process_one_work+0x580/0x580
[  854.574534]  ? kthread_create_on_node+0x30/0x30
[  854.574540]  ret_from_fork+0x19/0x24

[  854.574548] =============================
[  854.574551] WARNING: suspicious RCU usage
[  854.574555] 4.13.0-mgm-ovl+ #52 Not tainted
[  854.574558] -----------------------------
[  854.574561] ../net/mac80211/tx.c:608 suspicious rcu_dereference_check() 
usage!
[  854.574564]
   other info that might help us debug this:

[  854.574568]
   rcu_scheduler_active = 2, debug_locks = 1
[  854.574572] 6 locks held by kworker/u2:0/2877:
[  854.574574]  #0:  ("%s"wiphy_name(local->hw.wiphy)){.+}, at: 
[] process_one_work+0x127/0x580
[  854.574590]  #1:  ((>work)){+.+.+.}, at: [] 
process_one_work+0x127/0x580
[  854.574606]  #2:  (>mtx){+.+.+.}, at: [] 
ieee80211_sta_work+0x23/0x1c70 [mac80211]
[  854.574657]  #3:  (>sta_mtx){+.+.+.}, at: [] 
__sta_info_flush+0x60/0x160 [mac80211]
[  854.574702]  #4:  (&(>axq_lock)->rlock){+.-...}, at: [] 
ath_tx_node_cleanup+0x5c/0x180 [ath9k]
[  854.574721]  #5:  (&(>lock)->rlock){+.-...}, at: [] 
ieee80211_tx_dequeue+0x24/0xa80 [mac80211]
[  854.574771]
   stack backtrace:
[  854.574775] CPU: 0 PID: 2877 Comm: kworker/u2:0 Not tainted 4.13.0-mgm-ovl+ 
#52
[  854.574779] Hardware name: FUJITSU SIEMENS LIFEBOOK S6120/FJNB16C, BIOS 
Version 1.26  05/10/2004
[  854.574814] Workqueue: phy0 ieee80211_iface_work [mac80211]
[  854.574821] Call Trace:
[  854.574825]  dump_stack+0x16/0x19
[  854.574830]  lockdep_rcu_suspicious+0xcb/0xf0
[  854.574869]  ieee80211_tx_h_select_key+0x44e/0x500 [mac80211]
[  854.574908]  ieee80211_tx_dequeue+0x283/0xa80 [mac80211]
[  854.574919]  ath_tid_dequeue+0x84/0xf0 [ath9k]
[  854.574927]  ath_tx_node_cleanup+0xb8/0x180 [ath9k]
[  854.574936]  ath9k_sta_state+0x48/0xf0 [ath9k]
[  854.574945]  ? ath9k_del_ps_key.isra.19+0x60/0x60 [ath9k]
[  854.574978]  drv_sta_state+0xaf/0x8c0 [mac80211]
[  854.575012]

Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski


I tried capturing video from IPMI :)

Here is the result:

https://bugzilla.kernel.org/attachment.cgi?id=258521

I caught two more lines from where the panic starts (this is from 4.13.2).


Now I will try to do some bisection.



On 2017-09-20 at 09:58, Paweł Staszewski wrote:

Hi


Will try bisecting tonight



On 2017-09-20 at 05:24, Eric Dumazet wrote:

On Wed, 2017-09-20 at 02:06 +0200, Paweł Staszewski wrote:

Just checked kernel 4.13.2 and it has the same problem.

Just after all 6 BGP sessions start and the kernel begins to learn routes,
it panics.

https://bugzilla.kernel.org/attachment.cgi?id=258509



Unfortunately we have not enough information from these traces.

Can you get a full stack trace ?

Alternatively, can you bisect ?

Thanks.

Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski


Ok looks like ending bisection


The latest bisected kernel with no kernel panic is 4.12.0+ (from next),
but it shows this warning:


[  309.030019] NETDEV WATCHDOG: enp4s0f0 (ixgbe): transmit queue 0 timed out
[  309.030034] ------------[ cut here ]------------
[  309.030040] WARNING: CPU: 35 PID: 0 at dev_watchdog+0xcf/0x139
[  309.030041] Modules linked in: bonding ipmi_si x86_pkg_temp_thermal
[  309.030045] CPU: 35 PID: 0 Comm: swapper/35 Not tainted 4.12.0+ #5
[  309.030046] task: 88086d98a000 task.stack: c90003378000
[  309.030048] RIP: 0010:dev_watchdog+0xcf/0x139
[  309.030049] RSP: 0018:88087fbc3ea8 EFLAGS: 00010246
[  309.030050] RAX: 003d RBX: 88046b68 RCX: 

[  309.030050] RDX: 88087fbd2f01 RSI:  RDI: 
88087fbcda08
[  309.030051] RBP: 88087fbc3eb8 R08:  R09: 
88087ff80a04
[  309.030051] R10:  R11: 88086d98a001 R12: 

[  309.030052] R13: 88087fbc3ef8 R14: 88086d98a000 R15: 
81c06008
[  309.030053] FS:  () GS:88087fbc() 
knlGS:

[  309.030054] CS:  0010 DS:  ES:  CR0: 80050033
[  309.030054] CR2: 7fba600f6098 CR3: 00086b955000 CR4: 
001406e0

[  309.030055] Call Trace:
[  309.030057]  
[  309.030059]  ? netif_tx_lock+0x79/0x79
[  309.030062]  call_timer_fn.isra.24+0x17/0x77
[  309.030063]  run_timer_softirq+0x118/0x161
[  309.030065]  ? netif_tx_lock+0x79/0x79
[  309.030066]  ? ktime_get+0x2b/0x42
[  309.030070]  ? lapic_next_deadline+0x21/0x27
[  309.030073]  ? clockevents_program_event+0xa8/0xc5
[  309.030076]  __do_softirq+0xa8/0x19d
[  309.030078]  irq_exit+0x5d/0x6b
[  309.030079]  smp_apic_timer_interrupt+0x2a/0x36
[  309.030082]  apic_timer_interrupt+0x89/0x90
[  309.030085] RIP: 0010:mwait_idle+0x4e/0x6a
[  309.030086] RSP: 0018:c9000337be98 EFLAGS: 0246 ORIG_RAX: 
ff10
[  309.030087] RAX:  RBX:  RCX: 

[  309.030087] RDX:  RSI:  RDI: 
88086d98a000
[  309.030088] RBP: c9000337be98 R08: 88046f8279a0 R09: 
88046f827040
[  309.030089] R10: 88086d98a000 R11: 88086d98a000 R12: 

[  309.030089] R13: 88086d98a000 R14: 88086d98a000 R15: 
88086d98a000

[  309.030090]  
[  309.030094]  arch_cpu_idle+0xa/0xc
[  309.030095]  default_idle_call+0x19/0x1b
[  309.030102]  do_idle+0xbc/0x196
[  309.030104]  cpu_startup_entry+0x1d/0x20
[  309.030105]  start_secondary+0xd8/0xdc
[  309.030108]  secondary_startup_64+0x9f/0x9f
[  309.030109] Code: cc 75 bd eb 35 48 89 df c6 05 c3 dc 74 00 01 e8 3a 
62 fe ff 44 89 e1 48 89 de 48 89 c2 48 c7 c7 0f 65 a4 81 31 c0 e8 3d 4c 
b5 ff <0f> ff 48 8b 83 e0 01 00 00 48 89 df ff 50 78 48 8b 05 a0 bc 6a

[  309.030128] ---[ end trace 9102cb25703ae2d9 ]---


I marked it as good, since the warning above is a different problem, and
continued:


git bisect good
Bisecting: 1787 revisions left to test after this (roughly 11 steps)
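
The "roughly 11 steps" figure follows from halving the remaining range at
every step. A minimal sketch of that arithmetic, assuming plain repeated
halving (git's own estimate is computed differently and may not always
agree); bisect_steps() is a hypothetical helper name:

```c
#include <assert.h>

/*
 * Rough number of bisection steps for `revisions` remaining candidates:
 * each step halves the range, so count halvings until nothing is left.
 * This is an approximation, not git's exact estimate.
 */
static unsigned int bisect_steps(unsigned int revisions)
{
    unsigned int steps = 0;

    while (revisions > 0) {
        revisions /= 2;
        steps++;
    }
    return steps;
}
```

For 1787 remaining revisions this counts 11 halvings.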




On 2017-09-20 at 10:44, Paweł Staszewski wrote:

I tried capturing video from IPMI :)

Here is the result:

https://bugzilla.kernel.org/attachment.cgi?id=258521

I caught two more lines from where the panic starts (this is from 4.13.2).


Now I will try to do some bisection.



On 2017-09-20 at 09:58, Paweł Staszewski wrote:

Hi


Will try bisecting tonight



On 2017-09-20 at 05:24, Eric Dumazet wrote:

On Wed, 2017-09-20 at 02:06 +0200, Paweł Staszewski wrote:

Just checked kernel 4.13.2 and it has the same problem.

Just after all 6 BGP sessions start and the kernel begins to learn routes,
it panics.

https://bugzilla.kernel.org/attachment.cgi?id=258509



Unfortunately we have not enough information from these traces.

Can you get a full stack trace ?

Alternatively, can you bisect ?

Thanks.

Re: [lkp-robot] [test_rhashtable] c1bd3689a7: WARNING:at_lib/debugobjects.c:#__debug_object_init

2017-09-20 Thread Florian Westphal

kernel test robot  wrote:
> FYI, we noticed the following commit:
> 
> commit: c1bd3689a70d1ba1a2f7c6781770920087166018 ("test_rhashtable: add test 
> case for rhl_table interface")
> url: 
> https://github.com/0day-ci/linux/commits/Florian-Westphal/test_rhashtable-add-test-case-for-rhl-table/20170919-135550
> 
> 
> in testcase: boot
> 
> on test machine: qemu-system-x86_64 -enable-kvm -smp 2 -m 512M
> 
> caused below changes (please refer to attached dmesg/kmsg for entire 
> log/backtrace):
> 
> 
> [   15.235031] WARNING: CPU: 0 PID: 1 at lib/debugobjects.c:328 
> __debug_object_init+0x794/0x930
[..]

This is with v1 of the patch where the rhltable struct was allocated on
stack, v2 is fine.

Re: [PATCH net-next 1/5] net: add support for noref skb->sk

2017-09-20 Thread Eric Dumazet

On Wed, 2017-09-20 at 18:54 +0200, Paolo Abeni wrote:
> Noref sk do not carry a socket refcount, are valid
> only inside the current RCU section and must be
> explicitly cleared before exiting such section.
> 
> They will be used in a later patch to allow early demux
> without sock refcounting.




> +/* dummy destructor used by noref sockets */
> +void sock_dummyfree(struct sk_buff *skb)
> +{

BUG();

> +}
> +EXPORT_SYMBOL(sock_dummyfree);
> +


I do not see how you ensure we do not leave RCU section with an skb
destructor pointing to this sock_dummyfree()

This patch series looks quite dangerous to me.

Do we really have real applications using connected UDP sockets and
wanting very high pps throughput ?

I am pretty sure the bottleneck is the sender part.

Re: Latest net-next from GIT panic

2017-09-20 Thread Wei Wang

Thanks a lot Eric for the debug patch.

Pawel,

I want to confirm with you about the last good commit when you did bisection.
You mentioned:

> And the last one
>
> git bisect good
> Bisecting: 1 revision left to test after this (roughly 1 step)
> [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for
> insertion into fib6 tree
>
> With this have kernel panic same as always
>
> git bisect bad
> Bisecting: 0 revisions left to test after this (roughly 0 steps)
> [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and
> remove the operation of dst_free()


So it breaks right at:
[b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and
remove the operation of dst_free()
Right?
If you sync the image to one commit before the above one:
[9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() properly
Does it crash?

And could you confirm that your config does not have any IPv6
addresses or routes configured?

Thanks.
Wei


>> On Wed, 2017-09-20 at 16:03 +0200, Paweł Staszewski wrote:
>>>
>>> Not much more after adding this patch
>>>
>>> https://bugzilla.kernel.org/attachment.cgi?id=258529
>>>
>> This is why I suggested to replace the BUG() in another mail
>>
>> So :
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index
>> f535779d9dc1dfe36934c2abba4e43d053ac5d6f..220cd12456754876edf2d3ef13195e82d70d5c74
>> 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -3331,7 +3331,15 @@ void netdev_run_todo(void);
>>*/
>>   static inline void dev_put(struct net_device *dev)
>>   {
>> -   this_cpu_dec(*dev->pcpu_refcnt);
>> +   int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
>> +
>> +   if (!pref) {
>> +   pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle
>> %d\n",
>> +  dev, dev->name, dev->reg_state, dev->dismantle);
>> +   for (;;)
>> +   cpu_relax();
>> +   }
>> +   this_cpu_dec(*pref);
>>   }
>> /**
>>
>>
>>
>
> Full panic
>
> https://bugzilla.kernel.org/attachment.cgi?id=258531
>
>
> I will change the patch and apply it later today, because right now I can't
> use the backup router as a test lab. It is Internet rush hour, and it would
> be bad if something happened while the second router ran a buggy kernel :)
>
>

Re: Latest net-next from GIT panic

2017-09-20 Thread Eric Dumazet

On Wed, 2017-09-20 at 10:50 -0700, Cong Wang wrote:
> On Wed, Sep 20, 2017 at 6:11 AM, Eric Dumazet  wrote:
> > Sorry for top-posting, but this is to give context to Wei, since Pawel
> > used a top posting way to report his bisection.
> >
> > Wei, can you take a look at Pawel report ?
> >
> > Crash happens in dst_destroy() at following :
> >
> > if (dst->dev)
> >  dev_put(dst->dev); <>
> >
> >
> > dst->dev is not NULL, but netdev->pcpu_refcnt is NULL
> >
> > 65 ff 08    decl   %gs:(%rax)   // CRASH since rax = NULL
> >
> >
> >
> > Pawel, please share your netdevices and routing setup  ?
> 
> Looks like a double dev_put() on some dev...
> 
> Pawel, do you have any idea how this is triggered? Does your
> test try to remove some network device? If so which one?
> I noticed you have at least multiple vlan, bond and ixgbe
> devices.

Or a missing dev_hold() somewhere.

usb/net/p54: trying to register non-static key in p54_unregister_leds

2017-09-20 Thread Andrey Konovalov

Hi!

I've got the following report while fuzzing the kernel with syzkaller.

On commit ebb2c2437d8008d46796902ff390653822af6cc4 (Sep 18).

INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 1 PID: 1404 Comm: kworker/1:1 Not tainted
4.14.0-rc1-42251-gebb2c2437d80-dirty #205
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Workqueue: usb_hub_wq hub_event
Call Trace:
 __dump_stack lib/dump_stack.c:16
 dump_stack+0x292/0x395 lib/dump_stack.c:52
 register_lock_class+0x6c4/0x1a00 kernel/locking/lockdep.c:769
 __lock_acquire+0x27e/0x4550 kernel/locking/lockdep.c:3385
 lock_acquire+0x259/0x620 kernel/locking/lockdep.c:4002
 flush_work+0xf0/0x8c0 kernel/workqueue.c:2886
 __cancel_work_timer+0x51d/0x870 kernel/workqueue.c:2961
 cancel_delayed_work_sync+0x1f/0x30 kernel/workqueue.c:3081
 p54_unregister_leds+0x6c/0xc0 drivers/net/wireless/intersil/p54/led.c:160
 p54_unregister_common+0x3d/0xb0 drivers/net/wireless/intersil/p54/main.c:856
 p54u_disconnect+0x86/0x120 drivers/net/wireless/intersil/p54/p54usb.c:1073
 usb_unbind_interface+0x21c/0xa90 drivers/usb/core/driver.c:423
 __device_release_driver drivers/base/dd.c:861
 device_release_driver_internal+0x4f4/0x5c0 drivers/base/dd.c:893
 device_release_driver+0x1e/0x30 drivers/base/dd.c:918
 bus_remove_device+0x2f4/0x4b0 drivers/base/bus.c:565
 device_del+0x5c4/0xab0 drivers/base/core.c:1985
 usb_disable_device+0x1e9/0x680 drivers/usb/core/message.c:1170
 usb_disconnect+0x260/0x7a0 drivers/usb/core/hub.c:2124
 hub_port_connect drivers/usb/core/hub.c:4754
 hub_port_connect_change drivers/usb/core/hub.c:5009
 port_event drivers/usb/core/hub.c:5115
 hub_event+0x1318/0x3740 drivers/usb/core/hub.c:5195
 process_one_work+0xc7f/0x1db0 kernel/workqueue.c:2119
 process_scheduled_works kernel/workqueue.c:2179
 worker_thread+0xb2b/0x1850 kernel/workqueue.c:2255
 kthread+0x3a1/0x470 kernel/kthread.c:231
 ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431

Re: Latest net-next from GIT panic

2017-09-20 Thread Cong Wang

On Wed, Sep 20, 2017 at 11:30 AM, Eric Dumazet  wrote:
> On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote:
>> but dmesg at this time shows nothing about interfaces or flaps.
>>
>> This is very odd.
>>
>> We only free netdevice in free_netdev() and it is only called when
>> we unregister a netdevice. Otherwise pcpu_refcnt is impossible
>> to be NULL.
>
> If there is a missing dev_hold() or one dev_put() in excess,
> this would allow the netdev to be freed too soon.
>
> -> Use after free.
> memory holding netdev could be reallocated-cleared by some other kernel
> user.
>

Sure, but only unregister could trigger a free. If there is no unregister,
like what Pawel claims, then there is no free, the refcnt just goes to
0 but the memory is still there.
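
Both points in this exchange can be illustrated with a toy refcount model.
This is a simplified userspace sketch, not the kernel's netdev code:
struct toy_dev, toy_hold(), toy_put() and toy_unregister() are hypothetical
names, and the model assumes unregistration releases the object as soon as
the count is zero (the real code waits for outstanding references instead).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a refcounted device; not the kernel's struct net_device. */
struct toy_dev {
    int refcnt;        /* outstanding references */
    bool unregistered; /* set once the owner asked for removal */
    bool freed;        /* set when the memory would be released */
};

static void toy_hold(struct toy_dev *dev) { dev->refcnt++; }
static void toy_put(struct toy_dev *dev)  { dev->refcnt--; }

/* Removal only releases the object once no references remain. */
static void toy_unregister(struct toy_dev *dev)
{
    dev->unregistered = true;
    if (dev->refcnt <= 0)
        dev->freed = true;
}
```

In this model one put too many (or one hold too few) drives the count to
zero while a user still holds a pointer, but nothing is released until
toy_unregister() runs. Once it does run, the object is released under that
remaining user, which is the use-after-free scenario described above.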

Re: Latest net-next from GIT panic

2017-09-20 Thread Eric Dumazet

On Wed, 2017-09-20 at 16:03 +0200, Paweł Staszewski wrote:
> Not much more after adding this patch
> 
> https://bugzilla.kernel.org/attachment.cgi?id=258529
> 

This is why I suggested to replace the BUG() in another mail

So :

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f535779d9dc1dfe36934c2abba4e43d053ac5d6f..220cd12456754876edf2d3ef13195e82d70d5c74 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3331,7 +3331,15 @@ void netdev_run_todo(void);
  */
 static inline void dev_put(struct net_device *dev)
 {
-   this_cpu_dec(*dev->pcpu_refcnt);
+   int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
+
+   if (!pref) {
+   pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle %d\n",
+  dev, dev->name, dev->reg_state, dev->dismantle);
+   for (;;)
+   cpu_relax();
+   }
+   this_cpu_dec(*pref);
 }
 
 /**

RFC iproute2 doc files

2017-09-20 Thread Stephen Hemminger

I noticed that the iproute man pages are up to date but the LaTeX documentation
is very out of date; it has rarely been updated since the Linux 2.2 days.

Either someone needs to do a massive editing job on them, or they should just
be dropped. My preference would be to just drop everything in the doc/
directory. The current versions are so old that they can't be helping.

Re: [PATCH net-next 00/10] net/smc: updates 2017-09-20

2017-09-20 Thread Bart Van Assche

On Wed, 2017-09-20 at 13:58 +0200, Ursula Braun wrote:
> here is a collection of small smc-patches built for net-next improving
> the smc code in different areas.

Hello Ursula,

Can you provide us an update for the timeline of the plan to transition from
PF_SMC to PF_INET/PF_INET6 + SOCK_STREAM? See also
https://www.mail-archive.com/netdev@vger.kernel.org/msg166744.html.

Thanks,

Bart.

RE: [PATCH v4 net 2/3] lan78xx: Allow EEPROM write for less than MAX_EEPROM_SIZE

2017-09-20 Thread Nisar.Sayed

Thanks Sergei, I will update it and submit next version.

- Nisar

> Hello!
> 
> On 09/19/2017 01:02 AM, Nisar Sayed wrote:
> 
> > Allow EEPROM write for less than MAX_EEPROM_SIZE
> >
> > Fixes: 55d7de9de6c3 ("Microchip's LAN7800 family USB 2/3 to
> > 10/100/1000 Ethernet device driver")
> > Signed-off-by: Nisar Sayed 
> > ---
> >   drivers/net/usb/lan78xx.c | 9 -
> >   1 file changed, 4 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
> > index fcf85ae37435..3292f56ffe02 100644
> > --- a/drivers/net/usb/lan78xx.c
> > +++ b/drivers/net/usb/lan78xx.c
> > @@ -1290,11 +1290,10 @@ static int lan78xx_ethtool_set_eeprom(struct
> net_device *netdev,
> > if (ret)
> > return ret;
> >
> > -   /* Allow entire eeprom update only */
> > -   if ((ee->magic == LAN78XX_EEPROM_MAGIC) &&
> > -   (ee->offset == 0) &&
> > -   (ee->len == 512) &&
> > -   (data[0] == EEPROM_INDICATOR))
> > +   /* Invalid EEPROM_INDICATOR at offset zero will result in fail to
> 
> s/fail/a failure/.
> 
> > +* load data from EEPROM
> > +*/
> > +   if (ee->magic == LAN78XX_EEPROM_MAGIC)
> > ret = lan78xx_write_raw_eeprom(dev, ee->offset, ee->len,
> data);
> > else if ((ee->magic == LAN78XX_OTP_MAGIC) &&
> >  (ee->offset == 0) &&
> >
> 
> MBR, Sergei

Re: Latest net-next from GIT panic

2017-09-20 Thread Cong Wang

On Wed, Sep 20, 2017 at 6:11 AM, Eric Dumazet  wrote:
> Sorry for top-posting, but this is to give context to Wei, since Pawel
> used a top posting way to report his bisection.
>
> Wei, can you take a look at Pawel report ?
>
> Crash happens in dst_destroy() at following :
>
> if (dst->dev)
>  dev_put(dst->dev); <>
>
>
> dst->dev is not NULL, but netdev->pcpu_refcnt is NULL
>
> 65 ff 08    decl   %gs:(%rax)   // CRASH since rax = NULL
>
>
>
> Pawel, please share your netdevices and routing setup  ?

Looks like a double dev_put() on some dev...

Pawel, do you have any idea how this is triggered? Does your
test try to remove some network device? If so which one?
I noticed you have at least multiple vlan, bond and ixgbe
devices.

Re: [PATCH net-next 12/14] gtp: Configuration for zero UDP checksum

2017-09-20 Thread Tom Herbert

On Mon, Sep 18, 2017 at 9:24 PM, David Miller  wrote:
> From: Tom Herbert 
> Date: Mon, 18 Sep 2017 17:39:02 -0700
>
>> Add configuration to control use of zero checksums on transmit for both
>> IPv4 and IPv6, and control over accepting zero IPv6 checksums on
>> receive.
>>
>> Signed-off-by: Tom Herbert 
>
> I thought we were trying to move away from this special case of allowing
> zero UDP checksums with tunnels, especially for ipv6.

I don't have a strong preference either way. I like consistency with
VXLAN and foo/UDP, but I guess it's not required. Interestingly, since
GTP only carries IP, IPv6 zero checksums are actually safer here than
VXLAN or GRE/UDP.

Tom
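
For background on what a "zero checksum" means on the wire: per RFC 768, a
UDP checksum field of zero means the sender computed no checksum, and a
checksum that happens to compute to zero is sent as all ones instead. A
minimal sketch of that rule, assuming the caller has already assembled the
bytes to be summed (pseudo-header, UDP header with a zeroed checksum field,
and payload); ones_sum() and udp_csum_field() are hypothetical helper names,
not kernel functions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of 16-bit big-endian words (RFC 1071 style). */
static uint16_t ones_sum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (len & 1)
        sum += (uint32_t)buf[len - 1] << 8;  /* pad odd byte with zero */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);  /* fold carries back in */
    return (uint16_t)sum;
}

/*
 * Value to place in the UDP checksum field when checksumming is enabled.
 * A computed zero is replaced by 0xffff, because zero on the wire is
 * reserved for "no checksum was generated".
 */
static uint16_t udp_csum_field(const uint8_t *buf, size_t len)
{
    uint16_t csum = (uint16_t)~ones_sum(buf, len);

    return csum ? csum : 0xffff;
}
```

A sender configured for zero checksums would skip udp_csum_field() and
leave the field at 0; whether the receiver accepts that is what the
configuration options in this patch control.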

Re: [PATCH v5 05/10] dt-bindings: net: dwmac-sun8i: update documentation about integrated PHY

2017-09-20 Thread Corentin Labbe

On Tue, Sep 19, 2017 at 09:49:52PM -0500, Rob Herring wrote:
> On Thu, Sep 14, 2017 at 2:19 PM, Andrew Lunn  wrote:
> >> > Is the MDIO controller "allwinner,sun8i-h3-emac" or "snps,dwmac-mdio"?
> >> > If the latter, then I think the node is fine, but then the mux should be
> >> > a child node of it. IOW, the child of an MDIO controller should either
> >> > be a mux node or slave devices.
> >
> > Hi Rob
> >
> > Up until now, children of an MDIO bus have been MDIO devices. Those
> > MDIO devices are either Ethernet PHYs, Ethernet Switches, or the
> > oddball devices that Broadcom iProc has, like generic PHYs.
> >
> > We have never had MDIO-muxes as MDIO children. A Mux is not an MDIO
> > device, and does not have the properties of an MDIO device. It is not
> > addressable on the MDIO bus. The current MUXes are addressed via GPIOs
> > or MMIO.
> 
> The DT parent/child relationship defines the bus topology. We describe
> MDIO buses in that way and if a mux is sitting between the controller
> and the devices, then the DT hierarchy should reflect that. Now
> sometimes we have 2 options for what interface has the parent/child
> relationship (e.g. an I2C controlled USB hub chip), but in this case
> we don't.
> 

Putting mdio-mux as a child of it (the mdio node) gives me:
[   18.175338] libphy: stmmac: probed
[   18.175379] mdio_bus stmmac-0: /soc/ethernet@1c3/mdio/mdio-mux has 
invalid PHY address
[   18.175408] mdio_bus stmmac-0: scan phy mdio-mux at address 0
[   18.175450] mdio_bus stmmac-0: scan phy mdio-mux at address 1
[   18.175482] mdio_bus stmmac-0: scan phy mdio-mux at address 2
[   18.175513] mdio_bus stmmac-0: scan phy mdio-mux at address 3
[   18.175544] mdio_bus stmmac-0: scan phy mdio-mux at address 4
[   18.175575] mdio_bus stmmac-0: scan phy mdio-mux at address 5
[   18.175607] mdio_bus stmmac-0: scan phy mdio-mux at address 6
[   18.175638] mdio_bus stmmac-0: scan phy mdio-mux at address 7
[   18.175669] mdio_bus stmmac-0: scan phy mdio-mux at address 8
[   18.175700] mdio_bus stmmac-0: scan phy mdio-mux at address 9
[   18.175731] mdio_bus stmmac-0: scan phy mdio-mux at address 10
[   18.175762] mdio_bus stmmac-0: scan phy mdio-mux at address 11
[   18.175795] mdio_bus stmmac-0: scan phy mdio-mux at address 12
[   18.175827] mdio_bus stmmac-0: scan phy mdio-mux at address 13
[   18.175858] mdio_bus stmmac-0: scan phy mdio-mux at address 14
[   18.175889] mdio_bus stmmac-0: scan phy mdio-mux at address 15
[   18.175919] mdio_bus stmmac-0: scan phy mdio-mux at address 16
[   18.175951] mdio_bus stmmac-0: scan phy mdio-mux at address 17
[   18.175982] mdio_bus stmmac-0: scan phy mdio-mux at address 18
[   18.176014] mdio_bus stmmac-0: scan phy mdio-mux at address 19
[   18.176045] mdio_bus stmmac-0: scan phy mdio-mux at address 20
[   18.176076] mdio_bus stmmac-0: scan phy mdio-mux at address 21
[   18.176107] mdio_bus stmmac-0: scan phy mdio-mux at address 22
[   18.176139] mdio_bus stmmac-0: scan phy mdio-mux at address 23
[   18.176170] mdio_bus stmmac-0: scan phy mdio-mux at address 24
[   18.176202] mdio_bus stmmac-0: scan phy mdio-mux at address 25
[   18.176233] mdio_bus stmmac-0: scan phy mdio-mux at address 26
[   18.176271] mdio_bus stmmac-0: scan phy mdio-mux at address 27
[   18.176320] mdio_bus stmmac-0: scan phy mdio-mux at address 28
[   18.176371] mdio_bus stmmac-0: scan phy mdio-mux at address 29
[   18.176420] mdio_bus stmmac-0: scan phy mdio-mux at address 30
[   18.176452] mdio_bus stmmac-0: scan phy mdio-mux at address 31

Adding a fake  to mdio-mux removes it, but I find that a bit ugly.
Another option would be patching of_mdiobus_register() to not scan nodes with
compatible "mdio-mux".

What do you think ?

Regards

Re: Latest net-next from GIT panic

2017-09-20 Thread Eric Dumazet

On Wed, 2017-09-20 at 11:22 -0700, Cong Wang wrote:
> but dmesg at this time shows nothing about interfaces or flaps.
> 
> This is very odd.
> 
> We only free netdevice in free_netdev() and it is only called when
> we unregister a netdevice. Otherwise pcpu_refcnt is impossible
> to be NULL.

If there is a missing dev_hold() or one dev_put() in excess,
this would allow the netdev to be freed too soon.

-> Use after free.
memory holding netdev could be reallocated-cleared by some other kernel
user.
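
To make the failure mode above concrete, here is a minimal userspace sketch. It is not kernel code: the real netdev refcount is per-cpu, and toy_dev, toy_hold(), toy_put() and freed_too_soon() are hypothetical names used only for illustration. It shows that a single missing hold (or one put in excess) lets the count reach zero while the owner still believes its reference is valid.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted object such as a netdevice. */
struct toy_dev {
    int refcnt;   /* outstanding references */
    bool freed;   /* set once the last reference is dropped */
};

static void toy_hold(struct toy_dev *dev)
{
    dev->refcnt++;
}

static void toy_put(struct toy_dev *dev)
{
    if (--dev->refcnt == 0)
        dev->freed = true;   /* the real code would release the memory */
}

/*
 * Two users (A and B) borrow the device while the owner keeps its own
 * reference. If B skips toy_hold() but still calls toy_put(), the
 * object is freed although the owner has not dropped its reference.
 */
static bool freed_too_soon(bool b_takes_reference)
{
    struct toy_dev dev = { .refcnt = 1, .freed = false }; /* owner ref */

    toy_hold(&dev);              /* user A: balanced hold */
    if (b_takes_reference)
        toy_hold(&dev);          /* user B: hold, possibly missing */
    toy_put(&dev);               /* user A: balanced put */
    toy_put(&dev);               /* user B: put, in excess if no hold */

    return dev.freed;            /* owner's reference is still live */
}
```

With balanced calls the object survives; with the missing hold it is marked freed, and any later access through the owner's pointer would be a use after free.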

Re: [PATCH RFC V1 net-next 0/6] Time based packet transmission

2017-09-20 Thread levipearson

> This series is an early RFC that introduces a new socket option
> allowing time based transmission of packets.  This option will be
> useful in implementing various real time protocols over Ethernet,
> including but not limited to P802.1Qbv, which is currently finding
> its way into 802.1Q.
> 
> * Open questions about SO_TXTIME semantics
> 
>   - What should the kernel do if the dialed Tx time is in the past?
> Should the packet be sent ASAP, or should we throw an error?

Based on the i210 and latest NXP/Freescale FEC launch time behavior,
the hardware timestamps work over 1-second windows corresponding to
the time elapsed since the last PTP second began. When considering the
head-of-queue frame, the launch time is compared to the elapsed time
counter and if the elapsed time is between exactly the launch time and
half a second after the launch time, it is launched. If you enqueue a
frame with a scheduled launch time that ends up more than half a second
late, it is considered by the hardware to be scheduled *in the future*
at the offset belonging to the next second after the 1-second window
wraps around.

So *slightly* late (<<.5sec late) frames could be scheduled as normal,
but approaching .5sec late frames would have to either be dropped or 
have their schedule changed to avoid blocking the queue for a large
fraction of a second.

I don't like the idea of changing the scheduled time, and anything that
is close to half a second late is most likely useless. But it is also
reasonable to let barely-late frames go out ASAP--in the case of a Qav-
shaped stream, the bunching would get smoothed out downstream. A timed
launch schedule need not be used as an exact time, but a "don't send
before time X" flag. Both are useful in different circumstances.

A configurable parameter for allowable lateness, with the upper bound
set by the driver based on the hardware capabilities, seems ideal.
Barring that, I would suggest dropping frames with already-missed
launch times.
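
The window rule described above can be sketched in plain C. This is only an illustration of the comparison as described in this mail, not driver or hardware code. The names launch_now() and head_of_queue_wait_ns() are hypothetical, both arguments are assumed to be nanosecond offsets within the current PTP second, and treating "exactly half a second late" as a future launch time is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000u
#define HALF_SEC     (NSEC_PER_SEC / 2)

/* How far past the launch time the counter is, modulo one second.
 * Both inputs must be smaller than NSEC_PER_SEC.
 */
static uint32_t lateness_ns(uint32_t launch_ns, uint32_t elapsed_ns)
{
    return (elapsed_ns + NSEC_PER_SEC - launch_ns) % NSEC_PER_SEC;
}

/* True if the head-of-queue frame would be launched right away. */
static bool launch_now(uint32_t launch_ns, uint32_t elapsed_ns)
{
    return lateness_ns(launch_ns, elapsed_ns) < HALF_SEC;
}

/* Time the frame would sit at the head of the queue. A frame that is
 * more than half a second late is seen as scheduled in the future, so
 * it waits until the window wraps around to its offset again.
 */
static uint32_t head_of_queue_wait_ns(uint32_t launch_ns, uint32_t elapsed_ns)
{
    uint32_t late = lateness_ns(launch_ns, elapsed_ns);

    return late < HALF_SEC ? 0 : NSEC_PER_SEC - late;
}
```

For example, a frame scheduled for the 100 ms offset that reaches the head of the queue at the 700 ms offset is 600 ms late, so under this model it would block the queue for another 400 ms. That is the case a configurable lateness bound would turn into a drop.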

> 
>   - Should the kernel inform the user if it detects a missed deadline,
> via the error queue for example?

I think some sort of counter for mis-scheduled/late-delivered frames
would be in keeping with the general 802.1 error handling strategy.

> 
>   - What should the timescale be for the dialed Tx time?  Should the
> kernel select UTC when using the SW Qdisc and the HW time
> otherwise?  Or should the socket option include a clockid_t?

When I implemented something like this, I left it relative to the HW
time for the sake of simplicity, but I don't have a strong opinion.

> 
> * Things todo
> 
>   - Design a Qdisc for purpose of configuring SO_TXTIME.  There should
> be one option to dial HW offloading or SW best effort.

You seem focused on Qbv, but there is another aspect of the endpoint
requirements for Qav that this would provide a perfect use case for. A
bridge can treat all traffic in a Qav-shaped class equally, but an
endpoint must essentially run one credit-based shaper per active stream
feeding into the class--this is because a stream must adhere to its
frames-per-interval promise in its t-spec, and when the observation
interval is not an even multiple of the sample rate, it will occasionally
have an observation interval with no frame. This leaves extra bandwidth
in the class reservation, but it cannot be used by any other stream if
it would cause more than one frame per interval to be sent!

Even if a stream is not explicitly scheduled in userspace, a per-stream
Qdisc could apply a rough launch time that the class Qdisc (or hardware
shaping) would use to ensure the frames-per-interval aspect of the
reservation for the stream is adhered to. For example, each observation
interval could be assigned a launch time, and all streams would get a
number of frames corresponding to their frames-per-interval reservation
assigned that same launch time before being put into the class queue.
The i210's shaper would then only consider the current interval's set 
of frames ready to launch, and spread them evenly with its hardware
credit-based shaping.

For industrial and automotive control applications, a Qbv Qdisc based on
SO_TXTIME would be very interesting, but pro and automotive media uses
will most likely continue to use SRP + Qav, and these are becoming
increasingly common uses as you can see by the growing support for Qav in
automotive chips.

>   - Implement the SW best effort variant.  Here is my back of the
> napkin sketch.  Each interface has its own timerqueue keeping the
> TXTIME packets in order and a FIFO for all other traffic.  A guard
> window starts at the earliest deadline minus the maximum MTU minus
> a configurable fudge factor.  The Qdisc uses a hrtimer to transmit
> the next packet in the timerqueue.  During the guard window, all
> other traffic is deferred unless the next packet can be transmitted
> before the guard window expires.

This sounds plausible to me.
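
A rough sketch of the guard window check from the quoted proposal, in userspace C. Everything here is an assumption made for illustration: struct guard_cfg, tx_time_ns(), guard_start_ns() and may_send_best_effort() are hypothetical names, "maximum MTU" is read as the wire time of a maximum-size frame, and the window is taken to expire at the deadline minus the fudge factor.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-interface parameters. */
struct guard_cfg {
    uint64_t link_bps;   /* link speed in bits per second, nonzero */
    uint32_t max_mtu;    /* largest frame size in bytes */
    uint64_t fudge_ns;   /* configurable safety margin */
};

/* Wire time of a frame of the given size. */
static uint64_t tx_time_ns(const struct guard_cfg *cfg, uint32_t bytes)
{
    return (uint64_t)bytes * 8 * 1000000000ull / cfg->link_bps;
}

/* Earliest deadline minus max-size frame time minus fudge factor. */
static uint64_t guard_start_ns(const struct guard_cfg *cfg,
                               uint64_t deadline_ns)
{
    uint64_t margin = tx_time_ns(cfg, cfg->max_mtu) + cfg->fudge_ns;

    return deadline_ns > margin ? deadline_ns - margin : 0;
}

/* Outside the guard window, other traffic is always allowed. Inside
 * it, a frame is allowed only if it finishes before the window ends.
 */
static bool may_send_best_effort(const struct guard_cfg *cfg,
                                 uint64_t now_ns, uint64_t deadline_ns,
                                 uint32_t frame_bytes)
{
    if (now_ns < guard_start_ns(cfg, deadline_ns))
        return true;

    return now_ns + tx_time_ns(cfg, frame_bytes) + cfg->fudge_ns
           <= deadline_ns;
}
```

With a 1 Gbit/s link, a 1500 byte maximum frame and a 1000 ns fudge factor, the window opens 13000 ns before the deadline. Inside it, a 64 byte frame may still fit while a full-size frame is deferred.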

> 
> * Current

Re: Latest net-next from GIT panic

2017-09-20 Thread Cong Wang

On Wed, Sep 20, 2017 at 10:55 AM, Paweł Staszewski
 wrote:
>
>
> On 2017-09-20 at 19:50, Cong Wang wrote:
>
> On Wed, Sep 20, 2017 at 6:11 AM, Eric Dumazet 
> wrote:
>
> Sorry for top-posting, but this is to give context to Wei, since Pawel
> used a top posting way to report his bisection.
>
> Wei, can you take a look at Pawel report ?
>
> Crash happens in dst_destroy() at following :
>
> if (dst->dev)
>  dev_put(dst->dev); <>
>
>
> dst->dev is not NULL, but netdev->pcpu_refcnt is NULL
>
> 65 ff 08    decl   %gs:(%rax)   // CRASH since rax = NULL
>
>
>
> Pawel, please share your netdevices and routing setup  ?
>
> Looks like a double dev_put() on some dev...
>
> Pawel, do you have any idea how this is triggered? Does your
> test try to remove some network device? If so which one?
> I noticed you have at least multiple vlan, bond and ixgbe
> devices.
>
> Just after I start the bgp sessions.
> When the host is starting, all bgp sessions to upstreams are shut down.
>
> To trigger the panic I just enable all 6 bgp sessions to upstreams at once,
> and zebra starts to pull prefixes and push them to the kernel.
>
> Then some traffic is generated from test hosts through this backup router
> and the panic occurs, every time 10 to 15 seconds after the bgp sessions
> are connected.
>
> I'm not removing any interface at this time or doing anything with
> interfaces, just waiting.
>
> And yes, there are vlans attached to the bond devices,
> but dmesg at this time shows nothing about interfaces or flaps.

This is very odd.

We only free netdevice in free_netdev() and it is only called when
we unregister a netdevice. Otherwise pcpu_refcnt is impossible
to be NULL.

Re: mac80211: avoid allocating TXQs that won't be used

2017-09-20 Thread Johannes Berg

On Wed, 2017-09-20 at 17:08 +0100, Colin Ian King wrote:
> Johannes,
> 
> Static analysis with CoverityScan on linux-next today detected a null
> pointer dereference issue on commit:
> 
> From 0fc4b3403d215ecd3c05505ec1f0028a227ed319 Mon Sep 17 00:00:00 2001
> From: Johannes Berg 
> Date: Thu, 22 Jun 2017 12:20:29 +0200
> Subject: [PATCH] mac80211: avoid allocating TXQs that won't be used
> 
> Issue: sdata is null when the sdata is dereferenced by:
> 
>    sdata->vif.type != NL80211_IFTYPE_AP_VLAN &&
>    sdata->vif.type != NL80211_IFTYPE_MONITOR)
> 
> note that sdata is assigned a non-null much later with the statement
> sdata = netdev_priv(ndev).

Yeah, umm, that should be checking just 'type'. Thanks, will fix.

johannes

[PATCH net-next 3/5] udp: do not touch socket refcount in early demux

2017-09-20 Thread Paolo Abeni

use noref sockets instead. This gives some small performance
improvements and will allow efficient early demux for unconnected
sockets in a later patch.

Signed-off-by: Paolo Abeni 
---
 net/ipv4/udp.c | 18 ++
 net/ipv6/udp.c | 10 ++
 2 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 784ced0b9150..ba49d5aa9f09 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2050,12 +2050,13 @@ static inline int udp4_csum_init(struct sk_buff *skb, 
struct udphdr *uh,
 int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
   int proto)
 {
-   struct sock *sk;
-   struct udphdr *uh;
-   unsigned short ulen;
+   struct net *net = dev_net(skb->dev);
struct rtable *rt = skb_rtable(skb);
+   unsigned short ulen;
__be32 saddr, daddr;
-   struct net *net = dev_net(skb->dev);
+   struct udphdr *uh;
+   struct sock *sk;
+   bool noref_sk;
 
/*
 *  Validate the packet.
@@ -2081,6 +2082,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table 
*udptable,
if (udp4_csum_init(skb, uh, proto))
goto csum_error;
 
+   noref_sk = skb_has_noref_sk(skb);
sk = skb_steal_sock(skb);
if (sk) {
struct dst_entry *dst = skb_dst(skb);
@@ -2090,7 +2092,8 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table 
*udptable,
udp_sk_rx_dst_set(sk, dst);
 
ret = udp_queue_rcv_skb(sk, skb);
-   sock_put(sk);
+   if (!noref_sk)
+   sock_put(sk);
/* a return value > 0 means to resubmit the input, but
 * it wants the return to be -protocol, or 0
 */
@@ -2261,11 +2264,10 @@ void udp_v4_early_demux(struct sk_buff *skb)
 uh->source, iph->saddr, dif, sdif);
}
 
-   if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt))
+   if (!sk)
return;
 
-   skb->sk = sk;
-   skb->destructor = sock_efree;
+   skb_set_noref_sk(skb, sk);
dst = READ_ONCE(sk->sk_rx_dst);
 
if (dst)
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index e2ecfb137297..8f62392c4c35 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -787,6 +787,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table 
*udptable,
struct net *net = dev_net(skb->dev);
struct udphdr *uh;
struct sock *sk;
+   bool noref_sk;
u32 ulen = 0;
 
if (!pskb_may_pull(skb, sizeof(struct udphdr)))
@@ -823,6 +824,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table 
*udptable,
goto csum_error;
 
/* Check if the socket is already available, e.g. due to early demux */
+   noref_sk = skb_has_noref_sk(skb);
sk = skb_steal_sock(skb);
if (sk) {
struct dst_entry *dst = skb_dst(skb);
@@ -832,7 +834,8 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table 
*udptable,
udp6_sk_rx_dst_set(sk, dst);
 
ret = udpv6_queue_rcv_skb(sk, skb);
-   sock_put(sk);
+   if (!noref_sk)
+   sock_put(sk);
 
/* a return value > 0 means to resubmit the input */
if (ret > 0)
@@ -948,11 +951,10 @@ static void udp_v6_early_demux(struct sk_buff *skb)
else
return;
 
-   if (!sk || !refcount_inc_not_zero(&sk->sk_refcnt))
+   if (!sk)
return;
 
-   skb->sk = sk;
-   skb->destructor = sock_efree;
+   skb_set_noref_sk(skb, sk);
dst = READ_ONCE(sk->sk_rx_dst);
 
if (dst)
-- 
2.13.5

[PATCH net-next 5/5] udp: perform full socket lookup in early demux

2017-09-20 Thread Paolo Abeni

Since UDP early demux lookup fetches noref socket references,
we can safely be optimistic about it and set the sk reference
even if the skb is not going to land on such socket, avoiding
the rx dst cache usage for unconnected unicast sockets.

This avoids a second lookup for unconnected sockets, and cleans
up the whole udp early demux code a bit.

After this change, on hosts not acting as routers, UDP early
demux never negatively affects receive performance,
while before this change UDP early demux caused a measurable
performance impact for unconnected sockets.

Signed-off-by: Paolo Abeni 
---
 include/linux/udp.h |  2 ++
 net/ipv4/udp.c  | 62 +++--
 net/ipv6/udp.c  | 57 
 3 files changed, 38 insertions(+), 83 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index eaea63bc79bb..9c68b57543cc 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -92,6 +92,8 @@ static inline struct udp_sock *udp_sk(const struct sock *sk)
return (struct udp_sock *)sk;
 }
 
+void udp_set_skb_rx_dst(struct sock *sk, struct sk_buff *skb, u32 cookie);
+
 static inline void udp_set_no_check6_tx(struct sock *sk, bool val)
 {
udp_sk(sk)->no_check6_tx = val;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ba49d5aa9f09..5cbbd78024dc 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2043,6 +2043,11 @@ static inline int udp4_csum_init(struct sk_buff *skb, 
struct udphdr *uh,
 inet_compute_pseudo);
 }
 
+static bool udp_use_rx_dst_cache(struct sock *sk, struct sk_buff *skb)
+{
+   return sk->sk_state == TCP_ESTABLISHED || skb->pkt_type != PACKET_HOST;
+}
+
 /*
  * All we need to do is get the socket, and then do a checksum.
  */
@@ -2088,8 +2093,8 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table 
*udptable,
struct dst_entry *dst = skb_dst(skb);
int ret;
 
-   if (unlikely(sk->sk_rx_dst != dst))
-   udp_sk_rx_dst_set(sk, dst);
+   if (udp_use_rx_dst_cache(sk, skb))
+   dst_update(&sk->sk_rx_dst, dst);
 
ret = udp_queue_rcv_skb(sk, skb);
if (!noref_sk)
@@ -2196,42 +2201,28 @@ static struct sock 
*__udp4_lib_mcast_demux_lookup(struct net *net,
return result;
 }
 
-/* For unicast we should only early demux connected sockets or we can
- * break forwarding setups.  The chains here can be long so only check
- * if the first socket is an exact match and if not move on.
- */
-static struct sock *__udp4_lib_demux_lookup(struct net *net,
-   __be16 loc_port, __be32 loc_addr,
-   __be16 rmt_port, __be32 rmt_addr,
-   int dif, int sdif)
+void udp_set_skb_rx_dst(struct sock *sk, struct sk_buff *skb, u32 cookie)
 {
-   unsigned short hnum = ntohs(loc_port);
-   unsigned int hash2 = udp4_portaddr_hash(net, loc_addr, hnum);
-   unsigned int slot2 = hash2 & udp_table.mask;
-   struct udp_hslot *hslot2 = &udp_table.hash2[slot2];
-   INET_ADDR_COOKIE(acookie, rmt_addr, loc_addr);
-   const __portpair ports = INET_COMBINED_PORTS(rmt_port, hnum);
-   struct sock *sk;
+   struct dst_entry *dst = dst_access(&sk->sk_rx_dst, cookie);
 
-   udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
-   if (INET_MATCH(sk, net, acookie, rmt_addr,
-  loc_addr, ports, dif, sdif))
-   return sk;
-   /* Only check first socket in chain */
-   break;
+   if (dst) {
+   /* set noref for now.
+* any place which wants to hold dst has to call
+* dst_hold_safe()
+*/
+   skb_dst_set_noref(skb, dst);
}
-   return NULL;
 }
+EXPORT_SYMBOL_GPL(udp_set_skb_rx_dst);
 
 void udp_v4_early_demux(struct sk_buff *skb)
 {
struct net *net = dev_net(skb->dev);
+   int dif = skb->dev->ifindex;
+   int sdif = inet_sdif(skb);
const struct iphdr *iph;
const struct udphdr *uh;
struct sock *sk = NULL;
-   struct dst_entry *dst;
-   int dif = skb->dev->ifindex;
-   int sdif = inet_sdif(skb);
int ours;
 
/* validate the packet */
@@ -2260,25 +2251,16 @@ void udp_v4_early_demux(struct sk_buff *skb)
   uh->source, iph->saddr,
   dif, sdif);
} else if (skb->pkt_type == PACKET_HOST) {
-   sk = __udp4_lib_demux_lookup(net, uh->dest, iph->daddr,
-uh->source, iph->saddr, dif, sdif);
+   sk = __udp4_lib_lookup(net, iph->saddr, uh->source, iph->daddr,
+

[PATCH net-next 2/5] net: allow early demux to fetch noref socket

2017-09-20 Thread Paolo Abeni

We must be careful to avoid leaking such sockets outside
the RCU section containing the early demux call; we clear
them on nonlocal delivery.

For ipv4 we must take care of local mcast delivery, too,
since udp early demux works also for mcast addresses.

Also update all iptables/nftables extensions that can
run in the input chain and can transmit the skb outside
such path, namely TEE, nft_dup and nfqueue.

Signed-off-by: Paolo Abeni 
---
 net/ipv4/ip_input.c  | 12 
 net/ipv4/ipmr.c  | 18 ++
 net/ipv4/netfilter/nf_dup_ipv4.c |  3 +++
 net/ipv6/ip6_input.c |  7 ++-
 net/ipv6/netfilter/nf_dup_ipv6.c |  3 +++
 net/netfilter/nf_queue.c |  3 +++
 6 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index fa2dc8f692c6..e71abc8b698c 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -349,6 +349,18 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, 
struct sk_buff *skb)
__NET_INC_STATS(net, LINUX_MIB_IPRPFILTER);
goto drop;
}
+
+   /* Since the sk has no reference to the socket, we must
+* clear it before escaping this RCU section.
+* The sk is just an hint and we know we are not going to use
+* it outside the input path.
+*/
+   if (skb_dst(skb)->input != ip_local_deliver
+#ifdef CONFIG_IP_MROUTE
+   && skb_dst(skb)->input != ip_mr_input
+#endif
+   )
+   skb_clear_noref_sk(skb);
}
 
 #ifdef CONFIG_IP_ROUTE_CLASSID
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index c9b3e6e069ae..76642af79038 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1978,11 +1978,12 @@ static struct mr_table *ipmr_rt_fib_lookup(struct net 
*net, struct sk_buff *skb)
  */
 int ip_mr_input(struct sk_buff *skb)
 {
-   struct mfc_cache *cache;
-   struct net *net = dev_net(skb->dev);
int local = skb_rtable(skb)->rt_flags & RTCF_LOCAL;
-   struct mr_table *mrt;
+   struct net *net = dev_net(skb->dev);
+   struct mfc_cache *cache;
struct net_device *dev;
+   struct mr_table *mrt;
+   struct sock *sk;
 
/* skb->dev passed in is the loX master dev for vrfs.
 * As there are no vifs associated with loopback devices,
@@ -2052,6 +2053,9 @@ int ip_mr_input(struct sk_buff *skb)
skb = skb2;
}
 
+   /* avoid leaking the noref sk on forward path */
+   skb_clear_noref_sk(skb);
+
read_lock(&mrt_lock);
vif = ipmr_find_vif(mrt, dev);
if (vif >= 0) {
@@ -2065,12 +2069,18 @@ int ip_mr_input(struct sk_buff *skb)
return -ENODEV;
}
 
+   /* avoid leaking the noref sk on forward path... */
+   sk = skb_clear_noref_sk(skb);
read_lock(&mrt_lock);
ip_mr_forward(net, mrt, dev, skb, cache, local);
read_unlock(&mrt_lock);
 
-   if (local)
+   if (local) {
+   /* ... but preserve it for local delivery */
+   if (sk)
+   skb_set_noref_sk(skb, sk);
return ip_local_deliver(skb);
+   }
 
return 0;
 
diff --git a/net/ipv4/netfilter/nf_dup_ipv4.c b/net/ipv4/netfilter/nf_dup_ipv4.c
index 39895b9ddeb9..bf8b78492fc8 100644
--- a/net/ipv4/netfilter/nf_dup_ipv4.c
+++ b/net/ipv4/netfilter/nf_dup_ipv4.c
@@ -71,6 +71,9 @@ void nf_dup_ipv4(struct net *net, struct sk_buff *skb, 
unsigned int hooknum,
nf_reset(skb);
nf_ct_set(skb, NULL, IP_CT_UNTRACKED);
 #endif
+   /* Avoid leaking noref sk outside the input path */
+   skb_clear_noref_sk(skb);
+
/*
 * If we are in PREROUTING/INPUT, decrease the TTL to mitigate potential
 * loops between two hosts.
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 9ee208a348f5..9aa6baffd4b9 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -65,9 +65,14 @@ int ip6_rcv_finish(struct net *net, struct sock *sk, struct 
sk_buff *skb)
if (ipprot && (edemux = READ_ONCE(ipprot->early_demux)))
edemux(skb);
}
-   if (!skb_valid_dst(skb))
+   if (!skb_valid_dst(skb)) {
ip6_route_input(skb);
 
+   /* see comment on ipv4 edmux */
+   if (skb_dst(skb)->input != ip6_input)
+   skb_clear_noref_sk(skb);
+   }
+
return dst_input(skb);
 }
 
diff --git a/net/ipv6/netfilter/nf_dup_ipv6.c b/net/ipv6/netfilter/nf_dup_ipv6.c
index 4a7ddeddbaab..939f6a2238f9 100644
--- a/net/ipv6/netfilter/nf_dup_ipv6.c
+++ b/net/ipv6/netfilter/nf_dup_ipv6.c
@@ -60,6 +60,9 @@ void nf_dup_ipv6(struct net *net, struct sk_buff *skb, 
unsigned int hooknum,
nf_reset(skb);
nf_ct_set(skb, NULL,

Re: [PATCH net-next v5 1/4] bpf: add helper bpf_perf_event_read_value for perf event array map

2017-09-20 Thread Peter Zijlstra

On Tue, Sep 19, 2017 at 11:09:32PM -0700, Yonghong Song wrote:
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 3e691b7..2d5bbe5 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3684,10 +3684,12 @@ static inline u64 perf_event_count(struct perf_event 
> *event)
>   * will not be local and we cannot read them atomically
>   *   - must not have a pmu::count method
>   */
> -int perf_event_read_local(struct perf_event *event, u64 *value)
> +int perf_event_read_local(struct perf_event *event, u64 *value,
> +   u64 *enabled, u64 *running)
>  {
>   unsigned long flags;
>   int ret = 0;
> + u64 now;
>  
>   /*
>* Disabling interrupts avoids all counter scheduling (context
> @@ -3718,14 +3720,21 @@ int perf_event_read_local(struct perf_event *event, 
> u64 *value)
>   goto out;
>   }
>  
> + now = event->shadow_ctx_time + perf_clock();
> + if (enabled)
> + *enabled = now - event->tstamp_enabled;
>   /*
>* If the event is currently on this CPU, its either a per-task event,
>* or local to this CPU. Furthermore it means its ACTIVE (otherwise
>* oncpu == -1).
>*/
> - if (event->oncpu == smp_processor_id())
> + if (event->oncpu == smp_processor_id()) {
>   event->pmu->read(event);
> -
> + if (running)
> + *running = now - event->tstamp_running;
> + } else if (running) {
> + *running = event->total_time_running;
> + }
>   *value = local64_read(&event->count);
>  out:
>   local_irq_restore(flags);

Yeah, this looks about right.

Dave, could we have this in a topic tree of sorts, because I have a
pending series to rework all the timekeeping and it might be nice to not
have sfr run into all sorts of conflicts.
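
For context, the enabled and running times returned by the patched perf_event_read_local() are what a consumer would use to scale a multiplexed counter, following the normalization given in the patch description (value * time_enabled / time_running). The sketch below is a userspace illustration only. struct perf_sample and scaled_delta() are hypothetical names, and using floating point for the intermediate product (to avoid 64-bit overflow) is a choice made here, not something taken from the patch.

```c
#include <stdint.h>

/* One reading of a counter together with its accumulated times. */
struct perf_sample {
    uint64_t counter;
    uint64_t enabled;   /* ns the event was enabled */
    uint64_t running;   /* ns the event was actually on the PMU */
};

/* Scale the counter delta between two readings to estimate what it
 * would have been if the event had been running the whole time.
 */
static uint64_t scaled_delta(const struct perf_sample *prev,
                             const struct perf_sample *cur)
{
    uint64_t dc = cur->counter - prev->counter;
    uint64_t de = cur->enabled - prev->enabled;
    uint64_t dr = cur->running - prev->running;

    if (dr == 0)
        return 0;   /* event never ran in this interval */

    return (uint64_t)((double)dc * (double)de / (double)dr);
}
```

When the event was not multiplexed, the enabled and running deltas are equal and the raw delta is returned unchanged.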

Re: [PATCH net-next 08/14] gtp: Support encpasulating over IPv6

2017-09-20 Thread Tom Herbert

On Mon, Sep 18, 2017 at 9:19 PM, David Miller  wrote:
> From: Tom Herbert 
> Date: Mon, 18 Sep 2017 17:38:58 -0700
>
>> Allow peers to be specified by IPv6 addresses.
>>
>> Signed-off-by: Tom Herbert 
>
> Hmmm, can you just check the socket family or something like that?

I'm not sure what code you're referring to.

Thanks

Re: [PATCH net-next] bpf: Optimize lpm trie delete

2017-09-20 Thread Craig Gallek

On Wed, Sep 20, 2017 at 12:51 PM, Daniel Mack  wrote:
> Hi Craig,
>
> Thanks, this looks much cleaner already :)
>
> On 09/20/2017 06:22 PM, Craig Gallek wrote:
>> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
>> index 9d58a576b2ae..b5a7d70ec8b5 100644
>> --- a/kernel/bpf/lpm_trie.c
>> +++ b/kernel/bpf/lpm_trie.c
>> @@ -397,7 +397,7 @@ static int trie_delete_elem(struct bpf_map *map, void 
>> *_key)
>>   struct lpm_trie_node __rcu **trim;
>>   struct lpm_trie_node *node;
>>   unsigned long irq_flags;
>> - unsigned int next_bit;
>> + unsigned int next_bit = 0;
>
> This default assignment seems wrong, and I guess you only added it to
> squelch a compiler warning?
Yes, this variable is only initialized after the lookup iterations
below (meaning it will never be initialized in the root-removal
case).

> [...]
>
>> + /* If the node has one child, we may be able to collapse the tree
>> +  * while removing this node if the node's child is in the same
>> +  * 'next bit' slot as this node was in its parent or if the node
>> +  * itself is the root.
>> +  */
>> + if (trim == &trie->root) {
>> + next_bit = node->child[0] ? 0 : 1;
>> + rcu_assign_pointer(trie->root, node->child[next_bit]);
>> + kfree_rcu(node, rcu);
>
> I don't think you should treat this 'root' case special.
>
> Instead, move the 'next_bit' assignment outside of the condition ...
I'm not quite sure I follow.  Are you saying do something like this:

if (trim == &trie->root) {
  next_bit = node->child[0] ? 0 : 1;
}
if (rcu_access_pointer(node->child[next_bit])) {
...

This would save a couple lines of code, but I think the as-is
implementation is slightly easier to understand.  I don't have a
strong opinion either way, though.

Thanks for the pointers,
Craig

Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski




On 2017-09-20 at 19:46, Wei Wang wrote:

This is why I suggested to replace the BUG() in another mail

So :

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index
f535779d9dc1dfe36934c2abba4e43d053ac5d6f..220cd12456754876edf2d3ef13195e82d70d5c74
100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3331,7 +3331,15 @@ void netdev_run_todo(void);
*/
   static inline void dev_put(struct net_device *dev)
   {
-   this_cpu_dec(*dev->pcpu_refcnt);
+   int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
+
+   if (!pref) {
+   pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle
%d\n",
+  dev, dev->name, dev->reg_state, dev->dismantle);
+   for (;;)
+   cpu_relax();
+   }
+   this_cpu_dec(*pref);
   }
 /**


Thanks a lot Eric for the debug patch.

Pawel,

I want to confirm with you about the last good commit when you did bisection.
You mentioned:


And the last one

git bisect good
Bisecting: 1 revision left to test after this (roughly 1 step)
[1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for
insertion into fib6 tree

With this have kernel panic same as always

git bisect bad
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and
remove the operation of dst_free()


So it breaks right at:
[b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and
remove the operation of dst_free()
Right?
If you sync the image to one commit before the above one:
[9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() properly
Does it crash?
Later today I will repeat the last three steps, in about 3 hours, after the
rush hours of internet traffic. Right now I can't touch the backup router :)




And could you confirm that your config does not have any IPv6
addresses or routes configured?

IPv6 is enabled.
And yes, there are some IPv6 addresses.
One interface has IPv6 enabled with one static route,

but there are no IPv6 bgp sessions, so not many IPv6 prefixes, and the IPv6
fib is almost empty.


ip -6 r ls | wc -l
57




Thanks.
Wei


6:03 +0200, Paweł Staszewski wrote:

Not much more after adding this patch

https://bugzilla.kernel.org/attachment.cgi?id=258529


This is why I suggested to replace the BUG() in another mail

So :

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index
f535779d9dc1dfe36934c2abba4e43d053ac5d6f..220cd12456754876edf2d3ef13195e82d70d5c74
100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3331,7 +3331,15 @@ void netdev_run_todo(void);
*/
   static inline void dev_put(struct net_device *dev)
   {
-   this_cpu_dec(*dev->pcpu_refcnt);
+   int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
+
+   if (!pref) {
+   pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle
%d\n",
+  dev, dev->name, dev->reg_state, dev->dismantle);
+   for (;;)
+   cpu_relax();
+   }
+   this_cpu_dec(*pref);
   }
 /**




Full panic

https://bugzilla.kernel.org/attachment.cgi?id=258531


I will change the patch and apply it, but later today, because right now I
can't use the backup router as a testlab. It is Internet rush hour, and if
something happens it will be bad for the second router to have a buggy kernel :)

Re: cross namespace interface notification for tun devices

2017-09-20 Thread Cong Wang

On Tue, Sep 19, 2017 at 2:02 PM, Jason A. Donenfeld  wrote:
> On Tue, Sep 19, 2017 at 10:40 PM, Cong Wang  wrote:
>> By "notification" I assume you mean netlink notification.
>
> Yes, netlink notification.
>
>> The question is why does the process in A still care about
>> the device sitting in B?
>>
>> Also, the process should be able to receive a last notification
>> on IFF_UP|IFF_RUNNING before device is finally moved to B.
>> After this point, it should not have any relation to netns A
>> any more, like the device were completely gone.
>
> That's very clearly not the case with a tun device. Tun devices work
> by letting a userspace process control the inputs (ndo_start_xmit) and
> outputs (netif_rx) of the actual network device. This controlling
> userspace process needs to know when its own interface that it
> controls goes up and down. In the kernel, we can do this by just
> checking dev->flags & IFF_UP, and receive notifications on ndo_open and
> ndo_stop. In userspace, the controlling process loses the ability to
> receive notifications like ndo_open/ndo_stop when the interface is
> moved to a new namespace. After the interface is moved to a namespace,
> the process will still control inputs and outputs (ndo_start_xmit and
> netif_rx), but it will no longer receive netlink notifications for the
> equivalent of ndo_open and ndo_stop. This is problematic.

Sounds like we should set NETIF_F_NETNS_LOCAL for tun
device.

What is your legitimate use case for sending/receiving packets to/from
a tun device in a different netns?

[PATCH] hv_netvsc: fix send buffer failure on MTU change

2017-09-20 Thread Stephen Hemminger

From: Alex Ng 

If the MTU is changed, the host would reject the send buffer change.
This problem is the result of a recent change that allows changing the
send buffer size.

Every time we change the MTU, we store the previous net_device section
count before destroying the buffer, but we don’t store the previous
section size. When we reinitialize the buffer, its size is calculated
by multiplying the previous count and previous size. Since we
continuously increase the MTU, the host returns us a decreasing count
value while the section size is reinitialized to 1728 bytes every
time.

This eventually leads to a condition where the calculated buf_size is
so small that the host rejects it.
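
The shrinking effect can be sketched with a small userspace model. This is an illustration under stated assumptions, not the driver code: buf_size() mirrors the count * size calculation rounded up to a page, 1728 is the section size named above, and the 4096 byte page, the starting count of 256 and the host section size of 6144 are made-up example values. next_count() assumes the host hands back as many of its own sections as fit in the offered buffer, which is a guess at the host behavior.

```c
#include <stdint.h>

#define TOY_PAGE_SIZE        4096u   /* assumed page size */
#define DEFAULT_SECTION_SIZE 1728u   /* size the buggy path falls back to */

/* Section count times section size, rounded up to a whole page. */
static uint32_t buf_size(uint32_t sections, uint32_t section_size)
{
    uint32_t size = sections * section_size;

    return (size + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE * TOY_PAGE_SIZE;
}

/* One MTU change with the bug: the previous count is kept, but the
 * size is reset to the default, and the (modelled) host answers with
 * however many of its own sections fit into that smaller buffer.
 */
static uint32_t next_count(uint32_t prev_count, uint32_t host_section_size)
{
    return buf_size(prev_count, DEFAULT_SECTION_SIZE) / host_section_size;
}
```

In this model the count drops from 256 to 72 to 20 over two MTU changes, while carrying the previous section size along, as the patch does, keeps the product stable.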

Fixes: 8b5327975ae1 ("netvsc: allow controlling send/recv buffer size")
Signed-off-by: Alex Ng 
Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/hyperv_net.h | 2 ++
 drivers/net/hyperv/netvsc.c | 7 ++-
 drivers/net/hyperv/netvsc_drv.c | 8 
 3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index d98cdfb1536b..5176be76ca7d 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -150,6 +150,8 @@ struct netvsc_device_info {
u32  num_chn;
u32  send_sections;
u32  recv_sections;
+   u32  send_section_size;
+   u32  recv_section_size;
 };
 
 enum rndis_device_state {
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index a5511b7326af..8d5077fb0492 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -76,9 +76,6 @@ static struct netvsc_device *alloc_net_device(void)
net_device->max_pkt = RNDIS_MAX_PKT_DEFAULT;
net_device->pkt_align = RNDIS_PKT_ALIGN_DEFAULT;
 
-   net_device->recv_section_size = NETVSC_RECV_SECTION_SIZE;
-   net_device->send_section_size = NETVSC_SEND_SECTION_SIZE;
-
init_completion(&net_device->channel_init_wait);
init_waitqueue_head(&net_device->subchan_open);
INIT_WORK(&net_device->subchan_work, rndis_set_subchannel);
@@ -262,7 +259,7 @@ static int netvsc_init_buf(struct hv_device *device,
int ret = 0;
 
/* Get receive buffer area. */
-   buf_size = device_info->recv_sections * net_device->recv_section_size;
+   buf_size = device_info->recv_sections * device_info->recv_section_size;
buf_size = roundup(buf_size, PAGE_SIZE);
 
net_device->recv_buf = vzalloc(buf_size);
@@ -344,7 +341,7 @@ static int netvsc_init_buf(struct hv_device *device,
goto cleanup;
 
/* Now setup the send buffer. */
-   buf_size = device_info->send_sections * net_device->send_section_size;
+   buf_size = device_info->send_sections * device_info->send_section_size;
buf_size = round_up(buf_size, PAGE_SIZE);
 
net_device->send_buf = vzalloc(buf_size);
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index d4902ee5f260..a32ae02e1b6c 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -848,7 +848,9 @@ static int netvsc_set_channels(struct net_device *net,
device_info.num_chn = count;
device_info.ring_size = ring_size;
device_info.send_sections = nvdev->send_section_cnt;
+   device_info.send_section_size = nvdev->send_section_size;
device_info.recv_sections = nvdev->recv_section_cnt;
+   device_info.recv_section_size = nvdev->recv_section_size;
 
rndis_filter_device_remove(dev, nvdev);
 
@@ -963,7 +965,9 @@ static int netvsc_change_mtu(struct net_device *ndev, int 
mtu)
device_info.ring_size = ring_size;
device_info.num_chn = nvdev->num_chn;
device_info.send_sections = nvdev->send_section_cnt;
+   device_info.send_section_size = nvdev->send_section_size;
device_info.recv_sections = nvdev->recv_section_cnt;
+   device_info.recv_section_size = nvdev->recv_section_size;
 
rndis_filter_device_remove(hdev, nvdev);
 
@@ -1485,7 +1489,9 @@ static int netvsc_set_ringparam(struct net_device *ndev,
device_info.num_chn = nvdev->num_chn;
device_info.ring_size = ring_size;
device_info.send_sections = new_tx;
+   device_info.send_section_size = nvdev->send_section_size;
device_info.recv_sections = new_rx;
+   device_info.recv_section_size = nvdev->recv_section_size;
 
netif_device_detach(ndev);
was_opened = rndis_filter_opened(nvdev);
@@ -1934,7 +1940,9 @@ static int netvsc_probe(struct hv_device *dev,
device_info.ring_size = ring_size;
device_info.num_chn = VRSS_CHANNEL_DEFAULT;
device_info.send_sections = NETVSC_DEFAULT_TX;
+   device_info.send_section_size = NETVSC_SEND_SECTION_SIZE;
device_info.recv_sections = NETVSC_DEFAULT_RX;
+   device_info.recv_section_size = NETVSC_RECV_SECTION_SIZE;
 
nvdev =

Re: [PATCH net-next 09/14] gtp: Allow configuring GTP interface as standalone

2017-09-20 Thread Andreas Schultz


On 19/09/17 02:38, Tom Herbert wrote:

Add new configuration of GTP interfaces that allow specifying a port to
listen on (as opposed to having to get sockets from a userspace control
plane). This allows GTP interfaces to be configured and the data path
tested without requiring a GTP-C daemon.


This would imply that you can have multiple independent GTP sockets on
the same IP address. That is not permitted by the GTP specifications.
3GPP TS 29.281, section 4.3 states clearly that there is "only" one GTP
entity per IP address. A PDP context is defined by the destination IP and
the TEID. The destination port is not part of the identity of a PDP context.


Even the source IP and source port are not part of the tunnel identity.
This makes it possible to send traffic from a new SGSN/SGW during
handover before the control protocol has announced the handover.


At this point the usual response is: THAT IS NOT SAFE. Yes, GTP has been 
designed for cooperative networks only and should not be used on 
hostile/unsecured networks.


On the sending side, using multiple ports is permitted as long as the 
default GTP port is always able to receive incoming messages.


Andreas

[...]

re: mac80211: avoid allocating TXQs that won't be used

2017-09-20 Thread Colin Ian King

Johannes,

Static analysis with CoverityScan on linux-next today detected a null
pointer dereference issue on commit:

From 0fc4b3403d215ecd3c05505ec1f0028a227ed319 Mon Sep 17 00:00:00 2001
From: Johannes Berg 
Date: Thu, 22 Jun 2017 12:20:29 +0200
Subject: [PATCH] mac80211: avoid allocating TXQs that won't be used

Issue: sdata is null when it is dereferenced by:

   sdata->vif.type != NL80211_IFTYPE_AP_VLAN &&
   sdata->vif.type != NL80211_IFTYPE_MONITOR)

Note that sdata is assigned a non-null value much later with the statement
sdata = netdev_priv(ndev).

Detected by CoverityScan CID#1456974 ("Explicit null dereferenced")

Colin

Re: [PATCH net-next 09/14] gtp: Allow configuring GTP interface as standalone

2017-09-20 Thread Andreas Schultz




On 20/09/17 17:57, Tom Herbert wrote:

On Wed, Sep 20, 2017 at 8:27 AM, Andreas Schultz  wrote:

On 19/09/17 02:38, Tom Herbert wrote:


Add new configuration of GTP interfaces that allow specifying a port to
listen on (as opposed to having to get sockets from a userspace control
plane). This allows GTP interfaces to be configured and the data path
tested without requiring a GTP-C daemon.



This would imply that you can have multiple independent GTP sockets on the
same IP address. That is not permitted by the GTP specifications. 3GPP TS
29.281, section 4.3 states clearly that there is "only" one GTP entity per
IP address. A PDP context is defined by the destination IP and the TEID. The
destination port is not part of the identity of a PDP context.


We are in no way trying to change GTP, if someone runs this in a real GTP
network then they need to abide by the specification. However, there
is nothing inconsistent and it breaks nothing if someone wishes to use
different port numbers in their own private network for testing or
development purposes. Every other UDP application that has assigned
port number allows configurable ports, I don't see that GTP is so
special that it should be an exception.


GTP isn't special, I just don't like to have testing-only features in
there when the same goal can be reached without having to add extra 
stuff. Adding code that is not going to be useful in real production 
setups (or in this case would even break production setups when enabled 
accidentally) makes the implementation more complex than it needs to be.


You can always add multiple IPs to your test system and have the same
effect without having to change the ports.


Regards
Andreas



Tom

Re: [PATCH] ipv6_skip_exthdr: use ipv6_authlen for AH hdrlen

2017-09-20 Thread Xiang Gao

Hi David,

Thanks for your time and all your suggestions. I will resend a new patch soon.

Xiang Gao


2017-09-19 18:32 GMT-04:00 David Miller :
> From: Xiang Gao 
> Date: Tue, 19 Sep 2017 08:59:50 -0400
>
>> In ipv6_skip_exthdr, the length of AH header is computed manually
>> as (hp->hdrlen+2)<<2. However, in include/linux/ipv6.h, a macro
>> named ipv6_authlen is already defined for exactly the same job. This
>> commit replaces the manual computation code with the macro.
>
> All patch submissions must have a proper signoff.
>
> Also, please use a proper subsystem prefix in your Subject
> line "[PATCH] ipv6: Use ipv6_authlen for AH hdrlen in ipv6_skip_exthdr()"
> would have been much better as "ipv6: " is the appropriate
> subsystem prefix to use here.
>
> Thanks.

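The equivalence discussed in the quoted commit message can be sketched in
user space. The struct and the names sketch_opt_hdr, sketch_authlen() and
ah_header_len() are hypothetical stand-ins chosen for this illustration;
only the expression (hdrlen + 2) << 2 is taken from the message above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the first two bytes of an extension header. */
struct sketch_opt_hdr {
	uint8_t nexthdr;
	uint8_t hdrlen;	/* for AH: length in 32-bit words, minus 2 */
};

/* Same computation as the manual expression quoted above. */
#define sketch_authlen(p) (((p)->hdrlen + 2) << 2)

/* Length in bytes of an AH header, computed through the macro. */
static int ah_header_len(const struct sketch_opt_hdr *hp)
{
	assert(hp != NULL);
	return sketch_authlen(hp);
}
```

Routing the computation through one macro keeps the AH length rule in a
single place, which appears to be the motivation for the proposed change.
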
[PATCH net-next] bpf: Optimize lpm trie delete

2017-09-20 Thread Craig Gallek

From: Craig Gallek 

Before the delete operator was added, this data structure maintained
an invariant that intermediate nodes were only present when necessary
to build the tree.  This patch updates the delete operation to reinstate
that invariant by removing unnecessary intermediate nodes after a node is
removed and thus keeping the tree structure at a minimal size.
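
The idea can be sketched as a three-way decision on the number of children
of the node being deleted. This is a simplified user-space model, not the
code in this patch: sketch_node, classify_delete() and sketch_delete() are
hypothetical names, there is no RCU or freeing, and the one-child case is
always collapsed here, whereas the patch only collapses under the
conditions spelled out in its comments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum delete_action {
	MARK_INTERMEDIATE,	/* two children: node must stay as a branch */
	REMOVE_LEAF,		/* no children: unlink the node */
	COLLAPSE_INTO_CHILD	/* one child: child takes the node's slot */
};

struct sketch_node {
	struct sketch_node *child[2];
	bool intermediate;
};

static enum delete_action classify_delete(const struct sketch_node *node)
{
	assert(node != NULL);
	if (node->child[0] && node->child[1])
		return MARK_INTERMEDIATE;
	if (!node->child[0] && !node->child[1])
		return REMOVE_LEAF;
	return COLLAPSE_INTO_CHILD;
}

/* slot is the pointer in the parent (or the root) that refers to the node. */
static enum delete_action sketch_delete(struct sketch_node **slot)
{
	struct sketch_node *node = *slot;
	enum delete_action action = classify_delete(node);

	switch (action) {
	case MARK_INTERMEDIATE:
		node->intermediate = true;
		break;
	case REMOVE_LEAF:
		*slot = NULL;
		break;
	case COLLAPSE_INTO_CHILD:
		*slot = node->child[0] ? node->child[0] : node->child[1];
		break;
	}
	return action;
}
```

Tracking the slot rather than the node is what lets a single pointer
update remove or replace the node, mirroring the role of the trim pointer
in the diff below.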

Suggested-by: Daniel Mack 
Signed-off-by: Craig Gallek 
---
 kernel/bpf/lpm_trie.c | 55 +++
 1 file changed, 29 insertions(+), 26 deletions(-)

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 9d58a576b2ae..b5a7d70ec8b5 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -397,7 +397,7 @@ static int trie_delete_elem(struct bpf_map *map, void *_key)
struct lpm_trie_node __rcu **trim;
struct lpm_trie_node *node;
unsigned long irq_flags;
-   unsigned int next_bit;
+   unsigned int next_bit = 0;
size_t matchlen = 0;
int ret = 0;
 
@@ -408,14 +408,12 @@ static int trie_delete_elem(struct bpf_map *map, void 
*_key)
 
/* Walk the tree looking for an exact key/length match and keeping
 * track of where we could begin trimming the tree.  The trim-point
-* is the sub-tree along the walk consisting of only single-child
-* intermediate nodes and ending at a leaf node that we want to
-* remove.
+* is the location of the pointer where we will remove a node from the
+* tree.
 */
trim = &trie->root;
-   node = rcu_dereference_protected(
-   trie->root, lockdep_is_held(&trie->lock));
-   while (node) {
+   while ((node = rcu_dereference_protected(
+  *trim, lockdep_is_held(&trie->lock)))) {
matchlen = longest_prefix_match(trie, node, key);
 
if (node->prefixlen != matchlen ||
@@ -423,15 +421,7 @@ static int trie_delete_elem(struct bpf_map *map, void 
*_key)
break;
 
next_bit = extract_bit(key->data, node->prefixlen);
-   /* If we hit a node that has more than one child or is a valid
-* prefix itself, do not remove it. Reset the root of the trim
-* path to its descendant on our path.
-*/
-   if (!(node->flags & LPM_TREE_NODE_FLAG_IM) ||
-   (node->child[0] && node->child[1]))
-   trim = &node->child[next_bit];
-   node = rcu_dereference_protected(
-   node->child[next_bit], lockdep_is_held(&trie->lock));
+   trim = &node->child[next_bit];
}
 
if (!node || node->prefixlen != key->prefixlen ||
@@ -442,25 +432,38 @@ static int trie_delete_elem(struct bpf_map *map, void 
*_key)
 
trie->n_entries--;
 
-   /* If the node we are removing is not a leaf node, simply mark it
+   /* If the node we are removing has two children, simply mark it
 * as intermediate and we are done.
 */
-   if (rcu_access_pointer(node->child[0]) ||
+   if (rcu_access_pointer(node->child[0]) &&
rcu_access_pointer(node->child[1])) {
node->flags |= LPM_TREE_NODE_FLAG_IM;
goto out;
}
 
-   /* trim should now point to the slot holding the start of a path from
-* zero or more intermediate nodes to our leaf node for deletion.
-*/
-   while ((node = rcu_dereference_protected(
-   *trim, lockdep_is_held(&trie->lock)))) {
+   /* If the node has no children, it can be completely removed */
+   if (!rcu_access_pointer(node->child[0]) &&
+   !rcu_access_pointer(node->child[1])) {
RCU_INIT_POINTER(*trim, NULL);
-   trim = rcu_access_pointer(node->child[0]) ?
-   &node->child[0] :
-   &node->child[1];
kfree_rcu(node, rcu);
+   goto out;
+   }
+
+   /* If the node has one child, we may be able to collapse the tree
+* while removing this node if the node's child is in the same
+* 'next bit' slot as this node was in its parent or if the node
+* itself is the root.
+*/
+   if (trim == &trie->root) {
+   next_bit = node->child[0] ? 0 : 1;
+   rcu_assign_pointer(trie->root, node->child[next_bit]);
+   kfree_rcu(node, rcu);
+   } else if (rcu_access_pointer(node->child[next_bit])) {
+   rcu_assign_pointer(*trim, node->child[next_bit]);
+   kfree_rcu(node, rcu);
+   } else {
+   /* If we can't collapse, just mark this node as intermediate */
+   node->flags |= LPM_TREE_NODE_FLAG_IM;
}
 
 out:
-- 
2.14.1.821.g8fa685d3b7-goog

[patch net-next 01/16] mlxsw: spectrum_switchdev: Change mc_router to mrouter

2017-09-20 Thread Jiri Pirko

From: Nogah Frankel 

Change the naming of mc_router to mrouter to keep consistency.

Signed-off-by: Nogah Frankel 
Signed-off-by: Jiri Pirko 
---
 .../net/ethernet/mellanox/mlxsw/spectrum_switchdev.c   | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
index d39ffbf..22f8d74 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
@@ -699,10 +699,10 @@ static int mlxsw_sp_port_attr_br_vlan_set(struct 
mlxsw_sp_port *mlxsw_sp_port,
return -EINVAL;
 }
 
-static int mlxsw_sp_port_attr_mc_router_set(struct mlxsw_sp_port 
*mlxsw_sp_port,
-   struct switchdev_trans *trans,
-   struct net_device *orig_dev,
-   bool is_port_mc_router)
+static int mlxsw_sp_port_attr_mrouter_set(struct mlxsw_sp_port *mlxsw_sp_port,
+ struct switchdev_trans *trans,
+ struct net_device *orig_dev,
+ bool is_port_mrouter)
 {
struct mlxsw_sp_bridge_port *bridge_port;
int err;
@@ -720,12 +720,12 @@ static int mlxsw_sp_port_attr_mc_router_set(struct 
mlxsw_sp_port *mlxsw_sp_port,
 
err = mlxsw_sp_bridge_port_flood_table_set(mlxsw_sp_port, bridge_port,
   MLXSW_SP_FLOOD_TYPE_MC,
-  is_port_mc_router);
+  is_port_mrouter);
if (err)
return err;
 
 out:
-   bridge_port->mrouter = is_port_mc_router;
+   bridge_port->mrouter = is_port_mrouter;
return 0;
 }
 
@@ -793,9 +793,9 @@ static int mlxsw_sp_port_attr_set(struct net_device *dev,
 attr->u.vlan_filtering);
break;
case SWITCHDEV_ATTR_ID_PORT_MROUTER:
-   err = mlxsw_sp_port_attr_mc_router_set(mlxsw_sp_port, trans,
-  attr->orig_dev,
-  attr->u.mrouter);
+   err = mlxsw_sp_port_attr_mrouter_set(mlxsw_sp_port, trans,
+attr->orig_dev,
+attr->u.mrouter);
break;
case SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED:
err = mlxsw_sp_port_mc_disabled_set(mlxsw_sp_port, trans,
-- 
2.9.5

[patch net-next 07/16] mlxsw: spectrum_switchdev: Break mid deletion into two function

2017-09-20 Thread Jiri Pirko

From: Nogah Frankel 

Break mid deletion into two functions, so it will be possible in the future
to delete a mid entry for reasons other than a switchdev command (like port
deletion).

Signed-off-by: Nogah Frankel 
Signed-off-by: Jiri Pirko 
---
 .../ethernet/mellanox/mlxsw/spectrum_switchdev.c   | 32 ++
 1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
index 9dd05d8..7f622de 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
@@ -1468,6 +1468,25 @@ static int mlxsw_sp_port_vlans_del(struct mlxsw_sp_port 
*mlxsw_sp_port,
return 0;
 }
 
+static int
+__mlxsw_sp_port_mdb_del(struct mlxsw_sp_port *mlxsw_sp_port,
+   struct mlxsw_sp_bridge_port *bridge_port,
+   struct mlxsw_sp_mid *mid)
+{
+   struct net_device *dev = mlxsw_sp_port->dev;
+   int err;
+
+   err = mlxsw_sp_port_smid_set(mlxsw_sp_port, mid->mid, false);
+   if (err)
+   netdev_err(dev, "Unable to remove port from SMID\n");
+
+   err = mlxsw_sp_port_remove_from_mid(mlxsw_sp_port, mid);
+   if (err)
+   netdev_err(dev, "Unable to remove MC SFD\n");
+
+   return err;
+}
+
 static int mlxsw_sp_port_mdb_del(struct mlxsw_sp_port *mlxsw_sp_port,
 const struct switchdev_obj_port_mdb *mdb)
 {
@@ -1479,8 +1498,6 @@ static int mlxsw_sp_port_mdb_del(struct mlxsw_sp_port 
*mlxsw_sp_port,
struct mlxsw_sp_bridge_port *bridge_port;
struct mlxsw_sp_mid *mid;
u16 fid_index;
-   u16 mid_idx;
-   int err = 0;
 
bridge_port = mlxsw_sp_bridge_port_find(mlxsw_sp->bridge, orig_dev);
if (!bridge_port)
@@ -1501,16 +1518,7 @@ static int mlxsw_sp_port_mdb_del(struct mlxsw_sp_port 
*mlxsw_sp_port,
return -EINVAL;
}
 
-   err = mlxsw_sp_port_smid_set(mlxsw_sp_port, mid->mid, false);
-   if (err)
-   netdev_err(dev, "Unable to remove port from SMID\n");
-
-   mid_idx = mid->mid;
-   err = mlxsw_sp_port_remove_from_mid(mlxsw_sp_port, mid);
-   if (err)
-   netdev_err(dev, "Unable to remove MC SFD\n");
-
-   return err;
+   return __mlxsw_sp_port_mdb_del(mlxsw_sp_port, bridge_port, mid);
 }
 
 static int mlxsw_sp_port_obj_del(struct net_device *dev,
-- 
2.9.5

[patch net-next 02/16] mlxsw: spectrum_switchdev: Add a ports bitmap to the mid db

2017-09-20 Thread Jiri Pirko

From: Nogah Frankel 

Add a bitmap of ports to the mid struct to hold the ports that are
registered to this mid.
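
A per-mid port bitmap of this kind can be sketched in user space as
follows. The helper names and the SKETCH_ macros are hypothetical and only
imitate the kernel's BITS_TO_LONGS(), set_bit(), clear_bit() and
test_bit(); unlike the kernel helpers, these are not atomic.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
/* Number of unsigned longs needed to hold nbits bits, rounded up. */
#define SKETCH_BITS_TO_LONGS(nbits) \
	(((nbits) + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG)

/* Allocate a zeroed bitmap with one bit per possible port. */
static unsigned long *ports_bitmap_alloc(size_t max_ports)
{
	return calloc(SKETCH_BITS_TO_LONGS(max_ports), sizeof(unsigned long));
}

static void ports_bitmap_set(unsigned long *map, size_t port)
{
	assert(map != NULL);
	map[port / SKETCH_BITS_PER_LONG] |= 1UL << (port % SKETCH_BITS_PER_LONG);
}

static void ports_bitmap_clear(unsigned long *map, size_t port)
{
	assert(map != NULL);
	map[port / SKETCH_BITS_PER_LONG] &= ~(1UL << (port % SKETCH_BITS_PER_LONG));
}

static bool ports_bitmap_test(const unsigned long *map, size_t port)
{
	assert(map != NULL);
	return (map[port / SKETCH_BITS_PER_LONG] >>
		(port % SKETCH_BITS_PER_LONG)) & 1UL;
}
```

A bitmap keeps the membership test constant-time and costs one bit per
port, which suits a lookup that later patches in the series perform for
every mid when a port leaves the bridge.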

Signed-off-by: Nogah Frankel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h   |  1 +
 .../net/ethernet/mellanox/mlxsw/spectrum_switchdev.c | 20 +---
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
index 7180d8f..0424bee 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
@@ -95,6 +95,7 @@ struct mlxsw_sp_mid {
u16 fid;
u16 mid;
unsigned int ref_count;
+   unsigned long *ports_in_mid; /* bits array */
 };
 
 enum mlxsw_sp_span_type {
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
index 22f8d74..0fde16a 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
@@ -1239,6 +1239,7 @@ static struct mlxsw_sp_mid *__mlxsw_sp_mc_alloc(struct 
mlxsw_sp *mlxsw_sp,
u16 fid)
 {
struct mlxsw_sp_mid *mid;
+   size_t alloc_size;
u16 mid_idx;
 
mid_idx = find_first_zero_bit(mlxsw_sp->bridge->mids_bitmap,
@@ -1250,6 +1251,14 @@ static struct mlxsw_sp_mid *__mlxsw_sp_mc_alloc(struct 
mlxsw_sp *mlxsw_sp,
if (!mid)
return NULL;
 
+   alloc_size = sizeof(unsigned long) *
+BITS_TO_LONGS(mlxsw_core_max_ports(mlxsw_sp->core));
+   mid->ports_in_mid = kzalloc(alloc_size, GFP_KERNEL);
+   if (!mid->ports_in_mid) {
+   kfree(mid);
+   return NULL;
+   }
+
set_bit(mid_idx, mlxsw_sp->bridge->mids_bitmap);
ether_addr_copy(mid->addr, addr);
mid->fid = fid;
@@ -1260,12 +1269,16 @@ static struct mlxsw_sp_mid *__mlxsw_sp_mc_alloc(struct 
mlxsw_sp *mlxsw_sp,
return mid;
 }
 
-static int __mlxsw_sp_mc_dec_ref(struct mlxsw_sp *mlxsw_sp,
+static int __mlxsw_sp_mc_dec_ref(struct mlxsw_sp_port *mlxsw_sp_port,
 struct mlxsw_sp_mid *mid)
 {
+   struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
+
+   clear_bit(mlxsw_sp_port->local_port, mid->ports_in_mid);
if (--mid->ref_count == 0) {
list_del(&mid->list);
clear_bit(mid->mid, mlxsw_sp->bridge->mids_bitmap);
+   kfree(mid->ports_in_mid);
kfree(mid);
return 1;
}
@@ -1311,6 +1324,7 @@ static int mlxsw_sp_port_mdb_add(struct mlxsw_sp_port 
*mlxsw_sp_port,
}
}
mid->ref_count++;
+   set_bit(mlxsw_sp_port->local_port, mid->ports_in_mid);
 
err = mlxsw_sp_port_smid_set(mlxsw_sp_port, mid->mid, true,
 mid->ref_count == 1);
@@ -1331,7 +1345,7 @@ static int mlxsw_sp_port_mdb_add(struct mlxsw_sp_port 
*mlxsw_sp_port,
return 0;
 
 err_out:
-   __mlxsw_sp_mc_dec_ref(mlxsw_sp, mid);
+   __mlxsw_sp_mc_dec_ref(mlxsw_sp_port, mid);
return err;
 }
 
@@ -1437,7 +1451,7 @@ static int mlxsw_sp_port_mdb_del(struct mlxsw_sp_port 
*mlxsw_sp_port,
netdev_err(dev, "Unable to remove port from SMID\n");
 
mid_idx = mid->mid;
-   if (__mlxsw_sp_mc_dec_ref(mlxsw_sp, mid)) {
+   if (__mlxsw_sp_mc_dec_ref(mlxsw_sp_port, mid)) {
err = mlxsw_sp_port_mdb_op(mlxsw_sp, mdb->addr, fid_index,
   mid_idx, false);
if (err)
-- 
2.9.5

[patch net-next 10/16] mlxsw: spectrum_switchdev: Use generic mc flood function

2017-09-20 Thread Jiri Pirko

From: Nogah Frankel 

Use the generic mc flood function to decide whether to flood mc to a port
when mc is being enabled / disabled.
Move this function in the file to avoid forward declaration.
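
The flood decision can be modelled outside the driver as below. The
sketch_ structs are hypothetical reductions of the driver's types, kept to
the two fields the decision reads; the boolean expression is equivalent to
the ternary in mlxsw_sp_mc_flood() shown in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_bridge_device {
	bool multicast_enabled;
};

struct sketch_bridge_port {
	const struct sketch_bridge_device *bridge_device;
	bool mrouter;
};

/*
 * Flood multicast to the port when multicast handling is disabled on the
 * bridge, or when the port is marked as a multicast router port.
 */
static bool sketch_mc_flood(const struct sketch_bridge_port *port)
{
	assert(port != NULL && port->bridge_device != NULL);
	return !port->bridge_device->multicast_enabled || port->mrouter;
}
```

Reading the state from the bridge device, rather than from a local
mc_disabled variable, is what allows the same helper to be shared by every
caller.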

Signed-off-by: Nogah Frankel 
Signed-off-by: Jiri Pirko 
---
 .../net/ethernet/mellanox/mlxsw/spectrum_switchdev.c   | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
index 79806af..19ac206 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
@@ -742,6 +742,14 @@ static int mlxsw_sp_port_attr_mrouter_set(struct 
mlxsw_sp_port *mlxsw_sp_port,
return 0;
 }
 
+static bool mlxsw_sp_mc_flood(const struct mlxsw_sp_bridge_port *bridge_port)
+{
+   const struct mlxsw_sp_bridge_device *bridge_device;
+
+   bridge_device = bridge_port->bridge_device;
+   return !bridge_device->multicast_enabled ? true : bridge_port->mrouter;
+}
+
 static int mlxsw_sp_port_mc_disabled_set(struct mlxsw_sp_port *mlxsw_sp_port,
 struct switchdev_trans *trans,
 struct net_device *orig_dev,
@@ -770,7 +778,7 @@ static int mlxsw_sp_port_mc_disabled_set(struct 
mlxsw_sp_port *mlxsw_sp_port,
 
list_for_each_entry(bridge_port, &bridge_device->ports_list, list) {
enum mlxsw_sp_flood_type packet_type = MLXSW_SP_FLOOD_TYPE_MC;
-   bool member = mc_disabled ? true : bridge_port->mrouter;
+   bool member = mlxsw_sp_mc_flood(bridge_port);
 
err = mlxsw_sp_bridge_port_flood_table_set(mlxsw_sp_port,
   bridge_port,
@@ -829,14 +837,6 @@ static int mlxsw_sp_port_attr_set(struct net_device *dev,
return err;
 }
 
-static bool mlxsw_sp_mc_flood(const struct mlxsw_sp_bridge_port *bridge_port)
-{
-   const struct mlxsw_sp_bridge_device *bridge_device;
-
-   bridge_device = bridge_port->bridge_device;
-   return !bridge_device->multicast_enabled ? true : bridge_port->mrouter;
-}
-
 static int
 mlxsw_sp_port_vlan_fid_join(struct mlxsw_sp_port_vlan *mlxsw_sp_port_vlan,
struct mlxsw_sp_bridge_port *bridge_port)
-- 
2.9.5

[patch net-next 12/16] mlxsw: spectrum_switchdev: Flush the mdb when a port is being removed

2017-09-20 Thread Jiri Pirko

From: Nogah Frankel 

When a port is being removed from a bridge, flush the bridge mdb to remove
the mids of that port.

Signed-off-by: Nogah Frankel 
Signed-off-by: Jiri Pirko 
---
 .../ethernet/mellanox/mlxsw/spectrum_switchdev.c   | 39 --
 1 file changed, 29 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
index 50c4d7c..bc07873 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
@@ -122,6 +122,10 @@ mlxsw_sp_bridge_port_fdb_flush(struct mlxsw_sp *mlxsw_sp,
   u16 fid_index);
 
 static void
+mlxsw_sp_bridge_port_mdb_flush(struct mlxsw_sp_port *mlxsw_sp_port,
+  struct mlxsw_sp_bridge_port *bridge_port);
+
+static void
 mlxsw_sp_bridge_mdb_mc_enable_sync(struct mlxsw_sp_port *mlxsw_sp_port,
   struct mlxsw_sp_bridge_device
   *bridge_device);
@@ -176,17 +180,11 @@ static void
 mlxsw_sp_bridge_device_destroy(struct mlxsw_sp_bridge *bridge,
   struct mlxsw_sp_bridge_device *bridge_device)
 {
-   struct mlxsw_sp_mid *mid, *tmp;
-
list_del(&bridge_device->list);
if (bridge_device->vlan_enabled)
bridge->vlan_enabled_exists = false;
WARN_ON(!list_empty(&bridge_device->ports_list));
-   list_for_each_entry_safe(mid, tmp, &bridge_device->mids_list, list) {
-   list_del(&mid->list);
-   clear_bit(mid->mid, bridge->mids_bitmap);
-   kfree(mid);
-   }
+   WARN_ON(!list_empty(&bridge_device->mids_list));
kfree(bridge_device);
 }
 
@@ -987,24 +985,28 @@ mlxsw_sp_port_vlan_bridge_leave(struct mlxsw_sp_port_vlan 
*mlxsw_sp_port_vlan)
struct mlxsw_sp_bridge_vlan *bridge_vlan;
struct mlxsw_sp_bridge_port *bridge_port;
u16 vid = mlxsw_sp_port_vlan->vid;
-   bool last;
+   bool last_port, last_vlan;
 
if (WARN_ON(mlxsw_sp_fid_type(fid) != MLXSW_SP_FID_TYPE_8021Q &&
mlxsw_sp_fid_type(fid) != MLXSW_SP_FID_TYPE_8021D))
return;
 
bridge_port = mlxsw_sp_port_vlan->bridge_port;
+   last_vlan = list_is_singular(&bridge_port->vlans_list);
bridge_vlan = mlxsw_sp_bridge_vlan_find(bridge_port, vid);
-   last = list_is_singular(&bridge_vlan->port_vlan_list);
+   last_port = list_is_singular(&bridge_vlan->port_vlan_list);
 
list_del(&mlxsw_sp_port_vlan->bridge_vlan_node);
mlxsw_sp_bridge_vlan_put(bridge_vlan);
mlxsw_sp_port_vid_stp_set(mlxsw_sp_port, vid, BR_STATE_DISABLED);
mlxsw_sp_port_vid_learning_set(mlxsw_sp_port, vid, false);
-   if (last)
+   if (last_port)
mlxsw_sp_bridge_port_fdb_flush(mlxsw_sp_port->mlxsw_sp,
   bridge_port,
   mlxsw_sp_fid_index(fid));
+   if (last_vlan)
+   mlxsw_sp_bridge_port_mdb_flush(mlxsw_sp_port, bridge_port);
+
mlxsw_sp_port_vlan_fid_leave(mlxsw_sp_port_vlan);
 
mlxsw_sp_bridge_port_put(mlxsw_sp_port->mlxsw_sp->bridge, bridge_port);
@@ -1580,6 +1582,23 @@ static int mlxsw_sp_port_mdb_del(struct mlxsw_sp_port 
*mlxsw_sp_port,
return __mlxsw_sp_port_mdb_del(mlxsw_sp_port, bridge_port, mid);
 }
 
+static void
+mlxsw_sp_bridge_port_mdb_flush(struct mlxsw_sp_port *mlxsw_sp_port,
+  struct mlxsw_sp_bridge_port *bridge_port)
+{
+   struct mlxsw_sp_bridge_device *bridge_device;
+   struct mlxsw_sp_mid *mid, *tmp;
+
+   bridge_device = bridge_port->bridge_device;
+
+   list_for_each_entry_safe(mid, tmp, &bridge_device->mids_list, list) {
+   if (test_bit(mlxsw_sp_port->local_port, mid->ports_in_mid)) {
+   __mlxsw_sp_port_mdb_del(mlxsw_sp_port, bridge_port,
+   mid);
+   }
+   }
+}
+
 static int mlxsw_sp_port_obj_del(struct net_device *dev,
 const struct switchdev_obj *obj)
 {
-- 
2.9.5

[patch net-next 08/16] mlxsw: spectrum_switchdev: Don't write mids to the HW when mc is disabled

2017-09-20 Thread Jiri Pirko

From: Nogah Frankel 

Don't write multicast related data to the HW when mc is disabled.
Also, don't allocate a mid id for new mids (so the remove function can know
that they were not written to the HW).

Signed-off-by: Nogah Frankel 
Signed-off-by: Jiri Pirko 
---
 .../ethernet/mellanox/mlxsw/spectrum_switchdev.c| 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
index 7f622de..cea257a 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
@@ -1290,6 +1290,9 @@ mlxsw_sp_mc_write_mdb_entry(struct mlxsw_sp *mlxsw_sp,
 static int mlxsw_sp_mc_remove_mdb_entry(struct mlxsw_sp *mlxsw_sp,
struct mlxsw_sp_mid *mid)
 {
+   if (!mid->in_hw)
+   return 0;
+
clear_bit(mid->mid, mlxsw_sp->bridge->mids_bitmap);
mid->in_hw = false;
return mlxsw_sp_port_mdb_op(mlxsw_sp, mid->addr, mid->fid, mid->mid,
@@ -1319,11 +1322,15 @@ mlxsw_sp_mid *__mlxsw_sp_mc_alloc(struct mlxsw_sp 
*mlxsw_sp,
ether_addr_copy(mid->addr, addr);
mid->fid = fid;
mid->in_hw = false;
+
+   if (!bridge_device->multicast_enabled)
+   goto out;
+
if (!mlxsw_sp_mc_write_mdb_entry(mlxsw_sp, mid))
goto err_write_mdb_entry;
 
+out:
list_add_tail(&mid->list, &bridge_device->mids_list);
-
return mid;
 
 err_write_mdb_entry:
@@ -1391,6 +1398,9 @@ static int mlxsw_sp_port_mdb_add(struct mlxsw_sp_port 
*mlxsw_sp_port,
}
set_bit(mlxsw_sp_port->local_port, mid->ports_in_mid);
 
+   if (!bridge_device->multicast_enabled)
+   return 0;
+
err = mlxsw_sp_port_smid_set(mlxsw_sp_port, mid->mid, true);
if (err) {
netdev_err(dev, "Unable to set SMID\n");
@@ -1476,9 +1486,12 @@ __mlxsw_sp_port_mdb_del(struct mlxsw_sp_port 
*mlxsw_sp_port,
struct net_device *dev = mlxsw_sp_port->dev;
int err;
 
-   err = mlxsw_sp_port_smid_set(mlxsw_sp_port, mid->mid, false);
-   if (err)
-   netdev_err(dev, "Unable to remove port from SMID\n");
+   if (bridge_port->bridge_device->multicast_enabled) {
+   err = mlxsw_sp_port_smid_set(mlxsw_sp_port, mid->mid, false);
+
+   if (err)
+   netdev_err(dev, "Unable to remove port from SMID\n");
+   }
 
err = mlxsw_sp_port_remove_from_mid(mlxsw_sp_port, mid);
if (err)
-- 
2.9.5

[patch net-next 05/16] mlxsw: spectrum_switchdev: Break smid write function

2017-09-20 Thread Jiri Pirko

From: Nogah Frankel 

Break the smid write function into two, one that cleans the ports that
might be still written there and one that changes an existing mid entry.

Signed-off-by: Nogah Frankel 
Signed-off-by: Jiri Pirko 
---
 .../ethernet/mellanox/mlxsw/spectrum_switchdev.c   | 42 +++---
 1 file changed, 30 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
index 2ba8a44..09ead97 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
@@ -1190,7 +1190,7 @@ mlxsw_sp_port_fdb_set(struct mlxsw_sp_port *mlxsw_sp_port,
 }
 
 static int mlxsw_sp_port_mdb_op(struct mlxsw_sp *mlxsw_sp, const char *addr,
-   u16 fid, u16 mid, bool adding)
+   u16 fid, u16 mid_idx, bool adding)
 {
char *sfd_pl;
int err;
@@ -1201,16 +1201,16 @@ static int mlxsw_sp_port_mdb_op(struct mlxsw_sp 
*mlxsw_sp, const char *addr,
 
mlxsw_reg_sfd_pack(sfd_pl, mlxsw_sp_sfd_op(adding), 0);
mlxsw_reg_sfd_mc_pack(sfd_pl, 0, addr, fid,
- MLXSW_REG_SFD_REC_ACTION_NOP, mid);
+ MLXSW_REG_SFD_REC_ACTION_NOP, mid_idx);
err = mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(sfd), sfd_pl);
kfree(sfd_pl);
return err;
 }
 
-static int mlxsw_sp_port_smid_set(struct mlxsw_sp_port *mlxsw_sp_port, u16 mid,
- bool add, bool clear_all_ports)
+/* clean the an entry from the HW and write there a full new entry */
+static int mlxsw_sp_port_smid_full_entry(struct mlxsw_sp *mlxsw_sp,
+u16 mid_idx)
 {
-   struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
char *smid_pl;
int err, i;
 
@@ -1218,12 +1218,29 @@ static int mlxsw_sp_port_smid_set(struct mlxsw_sp_port 
*mlxsw_sp_port, u16 mid,
if (!smid_pl)
return -ENOMEM;
 
-   mlxsw_reg_smid_pack(smid_pl, mid, mlxsw_sp_port->local_port, add);
-   if (clear_all_ports) {
-   for (i = 1; i < mlxsw_core_max_ports(mlxsw_sp->core); i++)
-   if (mlxsw_sp->ports[i])
-   mlxsw_reg_smid_port_mask_set(smid_pl, i, 1);
+   mlxsw_reg_smid_pack(smid_pl, mid_idx, 0, false);
+   for (i = 1; i < mlxsw_core_max_ports(mlxsw_sp->core); i++) {
+   if (mlxsw_sp->ports[i])
+   mlxsw_reg_smid_port_mask_set(smid_pl, i, 1);
}
+
+   err = mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(smid), smid_pl);
+   kfree(smid_pl);
+   return err;
+}
+
+static int mlxsw_sp_port_smid_set(struct mlxsw_sp_port *mlxsw_sp_port,
+ u16 mid_idx, bool add)
+{
+   struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
+   char *smid_pl;
+   int err;
+
+   smid_pl = kmalloc(MLXSW_REG_SMID_LEN, GFP_KERNEL);
+   if (!smid_pl)
+   return -ENOMEM;
+
+   mlxsw_reg_smid_pack(smid_pl, mid_idx, mlxsw_sp_port->local_port, add);
err = mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(smid), smid_pl);
kfree(smid_pl);
return err;
@@ -1336,10 +1353,11 @@ static int mlxsw_sp_port_mdb_add(struct mlxsw_sp_port 
*mlxsw_sp_port,
return -ENOMEM;
}
is_new_mid = true;
+   mlxsw_sp_port_smid_full_entry(mlxsw_sp, mid->mid);
}
set_bit(mlxsw_sp_port->local_port, mid->ports_in_mid);
 
-   err = mlxsw_sp_port_smid_set(mlxsw_sp_port, mid->mid, true, is_new_mid);
+   err = mlxsw_sp_port_smid_set(mlxsw_sp_port, mid->mid, true);
if (err) {
netdev_err(dev, "Unable to set SMID\n");
goto err_out;
@@ -1458,7 +1476,7 @@ static int mlxsw_sp_port_mdb_del(struct mlxsw_sp_port 
*mlxsw_sp_port,
return -EINVAL;
}
 
-   err = mlxsw_sp_port_smid_set(mlxsw_sp_port, mid->mid, false, false);
+   err = mlxsw_sp_port_smid_set(mlxsw_sp_port, mid->mid, false);
if (err)
netdev_err(dev, "Unable to remove port from SMID\n");
 
-- 
2.9.5

[patch net-next 06/16] mlxsw: spectrum_switchdev: Attach mid id allocation to HW write

2017-09-20 Thread Jiri Pirko

From: Nogah Frankel 

Tie the allocation and release of the mid id to the HW write / remove, and
add a flag to indicate whether the mid is in the HW. This is done because
the mid id is also the HW index of this mid.
This change allows the following patches to add the ability to have a
mid in the mdb cache but not in the HW, which will be useful for
disabling multicast.
It means that the mdb entry is written to / deleted from the HW in the
mid allocation / removal functions, not after them.
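
The coupling between the index and the HW state can be sketched as
follows. This is a user-space model under stated assumptions: the names
are hypothetical, SKETCH_MID_MAX is an arbitrary small table size, the
"HW write" is reduced to marking a slot as used, and the in_hw check in
the remove path is included here for clarity even though this patch does
not have it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MID_MAX 8u	/* hypothetical table size, kept tiny */

struct sketch_mid {
	uint16_t mid;	/* HW index, only meaningful while in_hw is true */
	bool in_hw;
};

/* One flag per HW index; stands in for the mids bitmap. */
static bool sketch_used[SKETCH_MID_MAX];

/* Take the first free index and mark the entry as present in the HW. */
static bool sketch_write_entry(struct sketch_mid *m)
{
	assert(m != NULL);
	for (uint16_t i = 0; i < SKETCH_MID_MAX; i++) {
		if (!sketch_used[i]) {
			sketch_used[i] = true;
			m->mid = i;
			m->in_hw = true;
			return true;
		}
	}
	return false;	/* table full: the entry stays cache-only */
}

/* Release the index only if the entry actually holds one. */
static void sketch_remove_entry(struct sketch_mid *m)
{
	assert(m != NULL);
	if (!m->in_hw)
		return;
	sketch_used[m->mid] = false;
	m->in_hw = false;
}
```

Because an index is only consumed at write time, an entry can exist in
the cache without occupying a HW slot, which is the property the commit
message says the later patches rely on.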

Signed-off-by: Nogah Frankel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |  1 +
 .../ethernet/mellanox/mlxsw/spectrum_switchdev.c   | 88 ++
 2 files changed, 56 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
index 6fd0afe..e907ec4 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
@@ -94,6 +94,7 @@ struct mlxsw_sp_mid {
unsigned char addr[ETH_ALEN];
u16 fid;
u16 mid;
+   bool in_hw;
unsigned long *ports_in_mid; /* bits array */
 };
 
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
index 09ead97..9dd05d8 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
@@ -1260,6 +1260,42 @@ mlxsw_sp_mid *__mlxsw_sp_mc_get(struct 
mlxsw_sp_bridge_device *bridge_device,
return NULL;
 }
 
+static bool
+mlxsw_sp_mc_write_mdb_entry(struct mlxsw_sp *mlxsw_sp,
+   struct mlxsw_sp_mid *mid)
+{
+   u16 mid_idx;
+   int err;
+
+   mid_idx = find_first_zero_bit(mlxsw_sp->bridge->mids_bitmap,
+ MLXSW_SP_MID_MAX);
+   if (mid_idx == MLXSW_SP_MID_MAX)
+   return false;
+
+   mid->mid = mid_idx;
+   err = mlxsw_sp_port_smid_full_entry(mlxsw_sp, mid_idx);
+   if (err)
+   return false;
+
+   err = mlxsw_sp_port_mdb_op(mlxsw_sp, mid->addr, mid->fid, mid_idx,
+  true);
+   if (err)
+   return false;
+
+   set_bit(mid_idx, mlxsw_sp->bridge->mids_bitmap);
+   mid->in_hw = true;
+   return true;
+}
+
+static int mlxsw_sp_mc_remove_mdb_entry(struct mlxsw_sp *mlxsw_sp,
+   struct mlxsw_sp_mid *mid)
+{
+   clear_bit(mid->mid, mlxsw_sp->bridge->mids_bitmap);
+   mid->in_hw = false;
+   return mlxsw_sp_port_mdb_op(mlxsw_sp, mid->addr, mid->fid, mid->mid,
+   false);
+}
+
 static struct
 mlxsw_sp_mid *__mlxsw_sp_mc_alloc(struct mlxsw_sp *mlxsw_sp,
  struct mlxsw_sp_bridge_device *bridge_device,
@@ -1268,12 +1304,6 @@ mlxsw_sp_mid *__mlxsw_sp_mc_alloc(struct mlxsw_sp 
*mlxsw_sp,
 {
struct mlxsw_sp_mid *mid;
size_t alloc_size;
-   u16 mid_idx;
-
-   mid_idx = find_first_zero_bit(mlxsw_sp->bridge->mids_bitmap,
- MLXSW_SP_MID_MAX);
-   if (mid_idx == MLXSW_SP_MID_MAX)
-   return NULL;
 
mid = kzalloc(sizeof(*mid), GFP_KERNEL);
if (!mid)
@@ -1281,36 +1311,43 @@ mlxsw_sp_mid *__mlxsw_sp_mc_alloc(struct mlxsw_sp 
*mlxsw_sp,
 
alloc_size = sizeof(unsigned long) *
 BITS_TO_LONGS(mlxsw_core_max_ports(mlxsw_sp->core));
+
mid->ports_in_mid = kzalloc(alloc_size, GFP_KERNEL);
-   if (!mid->ports_in_mid) {
-   kfree(mid);
-   return NULL;
-   }
+   if (!mid->ports_in_mid)
+   goto err_ports_in_mid_alloc;
 
-   set_bit(mid_idx, mlxsw_sp->bridge->mids_bitmap);
ether_addr_copy(mid->addr, addr);
mid->fid = fid;
-   mid->mid = mid_idx;
+   mid->in_hw = false;
+   if (!mlxsw_sp_mc_write_mdb_entry(mlxsw_sp, mid))
+   goto err_write_mdb_entry;
+
list_add_tail(&mid->list, &bridge_device->mids_list);
 
return mid;
+
+err_write_mdb_entry:
+   kfree(mid->ports_in_mid);
+err_ports_in_mid_alloc:
+   kfree(mid);
+   return NULL;
 }
 
 static int mlxsw_sp_port_remove_from_mid(struct mlxsw_sp_port *mlxsw_sp_port,
 struct mlxsw_sp_mid *mid)
 {
struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
+   int err = 0;
 
clear_bit(mlxsw_sp_port->local_port, mid->ports_in_mid);
if (bitmap_empty(mid->ports_in_mid,
 mlxsw_core_max_ports(mlxsw_sp->core))) {
+   err = mlxsw_sp_mc_remove_mdb_entry(mlxsw_sp, mid);
list_del(&mid->list);
-   clear_bit(mid->mid, mlxsw_sp->bridge->mids_bitmap);
kfree(mid->ports_in_mid);
kfree(mid);
-   return 1;
}
-

[patch net-next 03/16] mlxsw: spectrum_switchdev: Remove reference count from mid

2017-09-20 Thread Jiri Pirko

From: Nogah Frankel 

Since there is a bitmap for the ports registered to each mid, there is no
need for a ref count, since it will always be the number of set bits in
this bitmap. Any check of the ref count was replaced with checking if the
bitmap is empty.

Signed-off-by: Nogah Frankel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h   |  1 -
 .../net/ethernet/mellanox/mlxsw/spectrum_switchdev.c | 20 ++--
 2 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
index 0424bee..6fd0afe 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
@@ -94,7 +94,6 @@ struct mlxsw_sp_mid {
unsigned char addr[ETH_ALEN];
u16 fid;
u16 mid;
-   unsigned int ref_count;
unsigned long *ports_in_mid; /* bits array */
 };
 
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
index 0fde16a..cb2275ed 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
@@ -1263,19 +1263,19 @@ static struct mlxsw_sp_mid *__mlxsw_sp_mc_alloc(struct 
mlxsw_sp *mlxsw_sp,
ether_addr_copy(mid->addr, addr);
mid->fid = fid;
mid->mid = mid_idx;
-   mid->ref_count = 0;
list_add_tail(&mid->list, &mlxsw_sp->bridge->mids_list);
 
return mid;
 }
 
-static int __mlxsw_sp_mc_dec_ref(struct mlxsw_sp_port *mlxsw_sp_port,
-struct mlxsw_sp_mid *mid)
+static int mlxsw_sp_port_remove_from_mid(struct mlxsw_sp_port *mlxsw_sp_port,
+struct mlxsw_sp_mid *mid)
 {
struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
 
clear_bit(mlxsw_sp_port->local_port, mid->ports_in_mid);
-   if (--mid->ref_count == 0) {
+   if (bitmap_empty(mid->ports_in_mid,
+mlxsw_core_max_ports(mlxsw_sp->core))) {
list_del(&mid->list);
clear_bit(mid->mid, mlxsw_sp->bridge->mids_bitmap);
kfree(mid->ports_in_mid);
@@ -1296,6 +1296,7 @@ static int mlxsw_sp_port_mdb_add(struct mlxsw_sp_port 
*mlxsw_sp_port,
struct mlxsw_sp_bridge_device *bridge_device;
struct mlxsw_sp_bridge_port *bridge_port;
struct mlxsw_sp_mid *mid;
+   bool is_new_mid = false;
u16 fid_index;
int err = 0;
 
@@ -1322,18 +1323,17 @@ static int mlxsw_sp_port_mdb_add(struct mlxsw_sp_port 
*mlxsw_sp_port,
netdev_err(dev, "Unable to allocate MC group\n");
return -ENOMEM;
}
+   is_new_mid = true;
}
-   mid->ref_count++;
set_bit(mlxsw_sp_port->local_port, mid->ports_in_mid);
 
-   err = mlxsw_sp_port_smid_set(mlxsw_sp_port, mid->mid, true,
-mid->ref_count == 1);
+   err = mlxsw_sp_port_smid_set(mlxsw_sp_port, mid->mid, true, is_new_mid);
if (err) {
netdev_err(dev, "Unable to set SMID\n");
goto err_out;
}
 
-   if (mid->ref_count == 1) {
+   if (is_new_mid) {
err = mlxsw_sp_port_mdb_op(mlxsw_sp, mdb->addr, fid_index,
   mid->mid, true);
if (err) {
@@ -1345,7 +1345,7 @@ static int mlxsw_sp_port_mdb_add(struct mlxsw_sp_port 
*mlxsw_sp_port,
return 0;
 
 err_out:
-   __mlxsw_sp_mc_dec_ref(mlxsw_sp_port, mid);
+   mlxsw_sp_port_remove_from_mid(mlxsw_sp_port, mid);
return err;
 }
 
@@ -1451,7 +1451,7 @@ static int mlxsw_sp_port_mdb_del(struct mlxsw_sp_port 
*mlxsw_sp_port,
netdev_err(dev, "Unable to remove port from SMID\n");
 
mid_idx = mid->mid;
-   if (__mlxsw_sp_mc_dec_ref(mlxsw_sp_port, mid)) {
+   if (mlxsw_sp_port_remove_from_mid(mlxsw_sp_port, mid)) {
err = mlxsw_sp_port_mdb_op(mlxsw_sp, mdb->addr, fid_index,
   mid_idx, false);
if (err)
-- 
2.9.5
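
The idea in the patch above (a port bitmap makes a separate ref count
redundant) can be sketched in plain userspace C. This is only an
illustration under stated assumptions: mid_sketch, port_join(),
port_leave(), ports_empty() and MAX_PORTS are hypothetical names that
mimic set_bit(), clear_bit() and bitmap_empty(); none of this is the
driver code.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PORTS 64 /* hypothetical port count, not taken from the driver */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define PORT_WORDS ((MAX_PORTS + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct mid_sketch {
	unsigned long ports_in_mid[PORT_WORDS]; /* one bit per member port */
};

/* True when no port is registered, i.e. the old ref count would be 0. */
static bool ports_empty(const struct mid_sketch *mid)
{
	for (size_t i = 0; i < PORT_WORDS; i++)
		if (mid->ports_in_mid[i])
			return false;
	return true;
}

static void port_join(struct mid_sketch *mid, unsigned int port)
{
	assert(port < MAX_PORTS);
	mid->ports_in_mid[port / BITS_PER_WORD] |= 1UL << (port % BITS_PER_WORD);
}

/* Returns true when the last port left and the group can be freed. */
static bool port_leave(struct mid_sketch *mid, unsigned int port)
{
	assert(port < MAX_PORTS);
	mid->ports_in_mid[port / BITS_PER_WORD] &= ~(1UL << (port % BITS_PER_WORD));
	return ports_empty(mid);
}
```

In this sketch, joining the same port twice is idempotent, so the
"count" cannot drift away from the actual membership the way a
separately maintained counter could.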

[patch net-next 13/16] mlxsw: spectrum_switchdev: Flood all mc packets to mrouter ports

2017-09-20 Thread Jiri Pirko

From: Nogah Frankel 

When mc is enabled, whenever a mc packet doesn't hit any mdb entry it is
flooded to the ports marked as mrouters. However, all mc packets should
be flooded to them even if they match an entry in the mdb.
This patch adds the mrouter ports to every mdb entry that is being written
to the HW.

Signed-off-by: Nogah Frankel 
Signed-off-by: Jiri Pirko 
---
 .../ethernet/mellanox/mlxsw/spectrum_switchdev.c   | 65 --
 1 file changed, 60 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
index bc07873..146beaa 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
@@ -1288,10 +1288,55 @@ mlxsw_sp_mid *__mlxsw_sp_mc_get(struct 
mlxsw_sp_bridge_device *bridge_device,
return NULL;
 }
 
+static void
+mlxsw_sp_bridge_port_get_ports_bitmap(struct mlxsw_sp *mlxsw_sp,
+ struct mlxsw_sp_bridge_port *bridge_port,
+ unsigned long *ports_bitmap)
+{
+   struct mlxsw_sp_port *mlxsw_sp_port;
+   u64 max_lag_members, i;
+   int lag_id;
+
+   if (!bridge_port->lagged) {
+   set_bit(bridge_port->system_port, ports_bitmap);
+   } else {
+   max_lag_members = MLXSW_CORE_RES_GET(mlxsw_sp->core,
+MAX_LAG_MEMBERS);
+   lag_id = bridge_port->lag_id;
+   for (i = 0; i < max_lag_members; i++) {
+   mlxsw_sp_port = mlxsw_sp_port_lagged_get(mlxsw_sp,
+lag_id, i);
+   if (mlxsw_sp_port)
+   set_bit(mlxsw_sp_port->local_port,
+   ports_bitmap);
+   }
+   }
+}
+
+static void
+mlxsw_sp_mc_get_mrouters_bitmap(unsigned long *flood_bitmap,
+   struct mlxsw_sp_bridge_device *bridge_device,
+   struct mlxsw_sp *mlxsw_sp)
+{
+   struct mlxsw_sp_bridge_port *bridge_port;
+
+   list_for_each_entry(bridge_port, &bridge_device->ports_list, list) {
+   if (bridge_port->mrouter) {
+   mlxsw_sp_bridge_port_get_ports_bitmap(mlxsw_sp,
+ bridge_port,
+ flood_bitmap);
+   }
+   }
+}
+
 static bool
 mlxsw_sp_mc_write_mdb_entry(struct mlxsw_sp *mlxsw_sp,
-   struct mlxsw_sp_mid *mid)
+   struct mlxsw_sp_mid *mid,
+   struct mlxsw_sp_bridge_device *bridge_device)
 {
+   long *flood_bitmap;
+   int num_of_ports;
+   int alloc_size;
u16 mid_idx;
int err;
 
@@ -1300,9 +1345,18 @@ mlxsw_sp_mc_write_mdb_entry(struct mlxsw_sp *mlxsw_sp,
if (mid_idx == MLXSW_SP_MID_MAX)
return false;
 
+   num_of_ports = mlxsw_core_max_ports(mlxsw_sp->core);
+   alloc_size = sizeof(long) * BITS_TO_LONGS(num_of_ports);
+   flood_bitmap = kzalloc(alloc_size, GFP_KERNEL);
+   if (!flood_bitmap)
+   return false;
+
+   bitmap_copy(flood_bitmap,  mid->ports_in_mid, num_of_ports);
+   mlxsw_sp_mc_get_mrouters_bitmap(flood_bitmap, bridge_device, mlxsw_sp);
+
mid->mid = mid_idx;
-   err = mlxsw_sp_port_smid_full_entry(mlxsw_sp, mid_idx,
-   mid->ports_in_mid);
+   err = mlxsw_sp_port_smid_full_entry(mlxsw_sp, mid_idx, flood_bitmap);
+   kfree(flood_bitmap);
if (err)
return false;
 
@@ -1355,7 +1409,7 @@ mlxsw_sp_mid *__mlxsw_sp_mc_alloc(struct mlxsw_sp 
*mlxsw_sp,
if (!bridge_device->multicast_enabled)
goto out;
 
-   if (!mlxsw_sp_mc_write_mdb_entry(mlxsw_sp, mid))
+   if (!mlxsw_sp_mc_write_mdb_entry(mlxsw_sp, mid, bridge_device))
goto err_write_mdb_entry;
 
 out:
@@ -1456,7 +1510,8 @@ mlxsw_sp_bridge_mdb_mc_enable_sync(struct mlxsw_sp_port 
*mlxsw_sp_port,
 
list_for_each_entry(mid, &bridge_device->mids_list, list) {
if (mc_enabled)
-   mlxsw_sp_mc_write_mdb_entry(mlxsw_sp, mid);
+   mlxsw_sp_mc_write_mdb_entry(mlxsw_sp, mid,
+   bridge_device);
else
mlxsw_sp_mc_remove_mdb_entry(mlxsw_sp, mid);
}
-- 
2.9.5
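
What the patch above writes to the HW for each mdb entry is, in effect,
the union of the entry's member ports and all mrouter ports. A minimal
userspace sketch of that union follows, assuming at most 64 ports so the
set fits in one uint64_t; bridge_port_sketch and mdb_flood_mask() are
hypothetical names, and the LAG handling done by the patch is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct bridge_port_sketch {
	unsigned int local_port; /* bit position in the mask, must be < 64 */
	bool mrouter;
};

/* Flood set for one mdb entry: its member ports plus every mrouter port. */
static uint64_t mdb_flood_mask(uint64_t ports_in_mid,
			       const struct bridge_port_sketch *ports,
			       size_t n_ports)
{
	uint64_t flood = ports_in_mid; /* work on a copy, members stay untouched */

	for (size_t i = 0; i < n_ports; i++) {
		assert(ports[i].local_port < 64);
		if (ports[i].mrouter)
			flood |= UINT64_C(1) << ports[i].local_port;
	}
	return flood;
}
```

Working on a copy mirrors the temporary flood_bitmap in the patch: the
membership bitmap of the entry itself does not gain the mrouter bits.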

[patch net-next 16/16] mlxsw: spectrum_switchdev: Consider mrouter status for mdb changes

2017-09-20 Thread Jiri Pirko

From: Nogah Frankel 

When a mrouter is registered or leaves a mid, don't update the HW.

Signed-off-by: Nogah Frankel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
index 459cedc..0f9eac5 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
@@ -1491,6 +1491,9 @@ static int mlxsw_sp_port_mdb_add(struct mlxsw_sp_port 
*mlxsw_sp_port,
if (!bridge_device->multicast_enabled)
return 0;
 
+   if (bridge_port->mrouter)
+   return 0;
+
err = mlxsw_sp_port_smid_set(mlxsw_sp_port, mid->mid, true);
if (err) {
netdev_err(dev, "Unable to set SMID\n");
@@ -1613,10 +1616,12 @@ __mlxsw_sp_port_mdb_del(struct mlxsw_sp_port 
*mlxsw_sp_port,
int err;
 
if (bridge_port->bridge_device->multicast_enabled) {
-   err = mlxsw_sp_port_smid_set(mlxsw_sp_port, mid->mid, false);
-
-   if (err)
-   netdev_err(dev, "Unable to remove port from SMID\n");
+   if (bridge_port->bridge_device->multicast_enabled) {
+   err = mlxsw_sp_port_smid_set(mlxsw_sp_port, mid->mid,
+false);
+   if (err)
+   netdev_err(dev, "Unable to remove port from SMID\n");
+   }
}
 
err = mlxsw_sp_port_remove_from_mid(mlxsw_sp_port, mid);
-- 
2.9.5

[patch net-next 11/16] mlxsw: spectrum_switchdev: Flood mc when mc is disabled by user flag

2017-09-20 Thread Jiri Pirko

From: Nogah Frankel 

When multicast is disabled, flood mc packets only to ports that are marked
BR_MCAST_FLOOD (instead of to all ports).

Signed-off-by: Nogah Frankel 
Signed-off-by: Jiri Pirko 
---
 .../ethernet/mellanox/mlxsw/spectrum_switchdev.c| 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
index 19ac206..50c4d7c 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
@@ -262,7 +262,8 @@ mlxsw_sp_bridge_port_create(struct mlxsw_sp_bridge_device 
*bridge_device,
bridge_port->dev = brport_dev;
bridge_port->bridge_device = bridge_device;
bridge_port->stp_state = BR_STATE_DISABLED;
-   bridge_port->flags = BR_LEARNING | BR_FLOOD | BR_LEARNING_SYNC;
+   bridge_port->flags = BR_LEARNING | BR_FLOOD | BR_LEARNING_SYNC |
+BR_MCAST_FLOOD;
INIT_LIST_HEAD(&bridge_port->vlans_list);
list_add(&bridge_port->list, &bridge_device->ports_list);
bridge_port->ref_count = 1;
@@ -468,7 +469,8 @@ static int mlxsw_sp_port_attr_get(struct net_device *dev,
   &attr->u.brport_flags);
break;
case SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS_SUPPORT:
-   attr->u.brport_flags_support = BR_LEARNING | BR_FLOOD;
+   attr->u.brport_flags_support = BR_LEARNING | BR_FLOOD |
+  BR_MCAST_FLOOD;
break;
default:
return -EOPNOTSUPP;
@@ -653,8 +655,18 @@ static int mlxsw_sp_port_attr_br_flags_set(struct 
mlxsw_sp_port *mlxsw_sp_port,
if (err)
return err;
 
-   memcpy(&bridge_port->flags, &brport_flags, sizeof(brport_flags));
+   if (bridge_port->bridge_device->multicast_enabled)
+   goto out;
 
+   err = mlxsw_sp_bridge_port_flood_table_set(mlxsw_sp_port, bridge_port,
+  MLXSW_SP_FLOOD_TYPE_MC,
+  brport_flags &
+  BR_MCAST_FLOOD);
+   if (err)
+   return err;
+
+out:
+   memcpy(&bridge_port->flags, &brport_flags, sizeof(brport_flags));
return 0;
 }
 
@@ -747,7 +759,8 @@ static bool mlxsw_sp_mc_flood(const struct 
mlxsw_sp_bridge_port *bridge_port)
const struct mlxsw_sp_bridge_device *bridge_device;
 
bridge_device = bridge_port->bridge_device;
-   return !bridge_device->multicast_enabled ? true : bridge_port->mrouter;
+   return bridge_device->multicast_enabled ? bridge_port->mrouter :
+   bridge_port->flags & BR_MCAST_FLOOD;
 }
 
 static int mlxsw_sp_port_mc_disabled_set(struct mlxsw_sp_port *mlxsw_sp_port,
-- 
2.9.5
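
The flood decision that mlxsw_sp_mc_flood() makes after the patch above
can be written out as a small truth table in userspace C. This is a
sketch only: port_sketch and mc_flood() are hypothetical names, and
SKETCH_MCAST_FLOOD is a placeholder bit, not the kernel's
BR_MCAST_FLOOD value.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MCAST_FLOOD (1UL << 0) /* placeholder bit for illustration */

struct port_sketch {
	bool bridge_mc_enabled; /* multicast state of the owning bridge */
	bool mrouter;           /* port is marked as a multicast router port */
	unsigned long flags;    /* per-port bridge flags */
};

/* Should unregistered multicast be flooded to this port? */
static bool mc_flood(const struct port_sketch *p)
{
	assert(p);
	if (p->bridge_mc_enabled)
		return p->mrouter; /* mc enabled: only mrouter ports */
	return (p->flags & SKETCH_MCAST_FLOOD) != 0; /* mc disabled: user flag */
}
```

With multicast enabled the per-port flag is ignored, and with multicast
disabled the mrouter marking is ignored, which matches the ternary in
the last hunk of the patch.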

Re: [PATCH net-next] net: dsa: Utilize dsa_slave_dev_check()

2017-09-20 Thread Vivien Didelot

Hi Florian,

Florian Fainelli  writes:

> Instead of open coding the check.
>
> Signed-off-by: Florian Fainelli 

If we do need to use it outside one day, we may think about renaming
netdev_uses_dsa() to netdev_is_dsa_master() and renaming
dsa_slave_dev_check() to netdev_is_dsa_slave().

In the meantime, looks good!

Reviewed-by: Vivien Didelot

[PATCH v5 net 2/3] lan78xx: Allow EEPROM write for less than MAX_EEPROM_SIZE

2017-09-20 Thread Nisar Sayed

Allow EEPROM write for less than MAX_EEPROM_SIZE

Fixes: 55d7de9de6c3 ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 Ethernet device driver")
Signed-off-by: Nisar Sayed 
---
 drivers/net/usb/lan78xx.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index fcf85ae37435..f8c63eec8353 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -1290,11 +1290,10 @@ static int lan78xx_ethtool_set_eeprom(struct net_device 
*netdev,
if (ret)
return ret;
 
-   /* Allow entire eeprom update only */
-   if ((ee->magic == LAN78XX_EEPROM_MAGIC) &&
-   (ee->offset == 0) &&
-   (ee->len == 512) &&
-   (data[0] == EEPROM_INDICATOR))
+   /* Invalid EEPROM_INDICATOR at offset zero will result in a failure
+* to load data from EEPROM
+*/
+   if (ee->magic == LAN78XX_EEPROM_MAGIC)
ret = lan78xx_write_raw_eeprom(dev, ee->offset, ee->len, data);
else if ((ee->magic == LAN78XX_OTP_MAGIC) &&
 (ee->offset == 0) &&
-- 
2.14.1

[PATCH v5 net 0/3] lan78xx: This series of patches are for lan78xx driver.

2017-09-20 Thread Nisar Sayed

This series of patches are for lan78xx driver.

These patches fix potential issues associated with lan78xx driver.

v5
- Updated changes as per comments

v4
- Updated changes to handle return values as per comments
- Updated EEPROM write handling as per comments

v3
- Updated changes as per comments

v2
- Added patch version information
- Added fixes tag
- Updated patch description
- Updated changes as per comments

v1
- Split patches as per comments
- Dropped "fixed_phy device support" and "Fix for system suspend" changes

Nisar Sayed (3):
  lan78xx: Fix for eeprom read/write when device auto suspend
  lan78xx: Allow EEPROM write for less than MAX_EEPROM_SIZE
  lan78xx: Use default values loaded from EEPROM/OTP after reset

 drivers/net/usb/lan78xx.c | 34 --
 1 file changed, 24 insertions(+), 10 deletions(-)

-- 
2.14.1

[PATCH v5 net 1/3] lan78xx: Fix for eeprom read/write when device auto suspend

2017-09-20 Thread Nisar Sayed

Fix for eeprom read/write when device auto suspend

Fixes: 55d7de9de6c3 ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 Ethernet device driver")
Signed-off-by: Nisar Sayed 
---
 drivers/net/usb/lan78xx.c | 24 
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index b99a7fb09f8e..fcf85ae37435 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -1265,30 +1265,46 @@ static int lan78xx_ethtool_get_eeprom(struct net_device 
*netdev,
  struct ethtool_eeprom *ee, u8 *data)
 {
struct lan78xx_net *dev = netdev_priv(netdev);
+   int ret;
+
+   ret = usb_autopm_get_interface(dev->intf);
+   if (ret)
+   return ret;
 
ee->magic = LAN78XX_EEPROM_MAGIC;
 
-   return lan78xx_read_raw_eeprom(dev, ee->offset, ee->len, data);
+   ret = lan78xx_read_raw_eeprom(dev, ee->offset, ee->len, data);
+
+   usb_autopm_put_interface(dev->intf);
+
+   return ret;
 }
 
 static int lan78xx_ethtool_set_eeprom(struct net_device *netdev,
  struct ethtool_eeprom *ee, u8 *data)
 {
struct lan78xx_net *dev = netdev_priv(netdev);
+   int ret;
+
+   ret = usb_autopm_get_interface(dev->intf);
+   if (ret)
+   return ret;
 
/* Allow entire eeprom update only */
if ((ee->magic == LAN78XX_EEPROM_MAGIC) &&
(ee->offset == 0) &&
(ee->len == 512) &&
(data[0] == EEPROM_INDICATOR))
-   return lan78xx_write_raw_eeprom(dev, ee->offset, ee->len, data);
+   ret = lan78xx_write_raw_eeprom(dev, ee->offset, ee->len, data);
else if ((ee->magic == LAN78XX_OTP_MAGIC) &&
 (ee->offset == 0) &&
 (ee->len == 512) &&
 (data[0] == OTP_INDICATOR_1))
-   return lan78xx_write_raw_otp(dev, ee->offset, ee->len, data);
+   ret = lan78xx_write_raw_otp(dev, ee->offset, ee->len, data);
 
-   return -EINVAL;
+   usb_autopm_put_interface(dev->intf);
+
+   return ret;
 }
 
 static void lan78xx_get_strings(struct net_device *netdev, u32 stringset,
-- 
2.14.1
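
The shape of the change above (take a runtime-PM reference, do the
access, drop the reference on every path, return one result) can be
sketched in userspace C. Everything here is hypothetical: dev_sketch,
pm_get(), pm_put(), op_ok() and guarded_access() only stand in for
usb_autopm_get_interface(), usb_autopm_put_interface() and the EEPROM
accessors. The explicit -EINVAL default is a choice made in this sketch;
in the patch as posted, ret is 0 after a successful get, so a request
that matches neither branch appears to return 0 instead of the previous
-EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct dev_sketch {
	int pm_refs;    /* outstanding "keep awake" references */
	int resume_err; /* error the fake resume should report, 0 for success */
};

static int pm_get(struct dev_sketch *d)
{
	if (d->resume_err)
		return d->resume_err;
	d->pm_refs++;
	return 0;
}

static void pm_put(struct dev_sketch *d)
{
	assert(d->pm_refs > 0);
	d->pm_refs--;
}

/* Example operation: succeeds only while the device is held awake. */
static int op_ok(struct dev_sketch *d)
{
	return d->pm_refs > 0 ? 0 : -EIO;
}

/* op may be NULL to model a request that matches no supported operation. */
static int guarded_access(struct dev_sketch *d, int (*op)(struct dev_sketch *))
{
	int ret = pm_get(d);

	if (ret)
		return ret; /* nothing to release on this path */

	ret = -EINVAL; /* default when no operation applies */
	if (op)
		ret = op(d);

	pm_put(d); /* released on every path after a successful get */
	return ret;
}
```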

[PATCH net-next] net: avoid a full fib lookup when rp_filter is disabled.

2017-09-20 Thread Paolo Abeni

Since commit 1dced6a85482 ("ipv4: Restore accept_local behaviour
in fib_validate_source()") a full fib lookup is needed even if
the rp_filter is disabled, if accept_local is false - which is
the default.

What we really need in the above scenario is just checking
that the source IP address is not local, and in most cases we
can do that in a cheaper way by looking up the ifaddr hash table.

This commit adds a helper for such lookup, and uses it to
validate the src address when rp_filter is disabled and no
'local' routes are created by the user space in the relevant
namespace.

A new ipv4 netns flag is added to account for such routes.
We need that to preserve the same behavior we had before this
patch.

It also drops the checks to bail early from __fib_validate_source,
added by the commit 1dced6a85482 ("ipv4: Restore accept_local
behaviour in fib_validate_source()"): they do not give any
measurable performance improvement, since if we do the lookup we are
already on a slower path.

This improves UDP performance for unconnected sockets
when rp_filter is disabled by 5% and also gives a small but
measurable performance improvement for TCP flood scenarios.

v1 -> v2:
 - use the ifaddr lookup helper in __ip_dev_find(), as suggested
   by Eric
 - fall-back to full lookup if custom local routes are present

Signed-off-by: Paolo Abeni 
---
 include/linux/inetdevice.h |  1 +
 include/net/netns/ipv4.h   |  1 +
 net/ipv4/devinet.c | 30 ++
 net/ipv4/fib_frontend.c| 22 +-
 4 files changed, 37 insertions(+), 17 deletions(-)

diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index fb3f809e34e4..751d051f0bc7 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -179,6 +179,7 @@ __be32 inet_confirm_addr(struct net *net, struct in_device 
*in_dev, __be32 dst,
 __be32 local, int scope);
 struct in_ifaddr *inet_ifa_byprefix(struct in_device *in_dev, __be32 prefix,
__be32 mask);
+struct in_ifaddr *inet_lookup_ifaddr_rcu(struct net *net, __be32 addr);
 static __inline__ bool inet_ifa_match(__be32 addr, struct in_ifaddr *ifa)
 {
return !((addr^ifa->ifa_address)&ifa->ifa_mask);
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 20d061c805e3..20720721da4b 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -49,6 +49,7 @@ struct netns_ipv4 {
 #ifdef CONFIG_IP_MULTIPLE_TABLES
struct fib_rules_ops*rules_ops;
boolfib_has_custom_rules;
+   boolfib_has_custom_local_routes;
struct fib_table __rcu  *fib_main;
struct fib_table __rcu  *fib_default;
 #endif
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index d7adc0616599..7ce22a2c07ce 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -137,22 +137,12 @@ static void inet_hash_remove(struct in_ifaddr *ifa)
  */
 struct net_device *__ip_dev_find(struct net *net, __be32 addr, bool devref)
 {
-   u32 hash = inet_addr_hash(net, addr);
struct net_device *result = NULL;
struct in_ifaddr *ifa;
 
rcu_read_lock();
-   hlist_for_each_entry_rcu(ifa, &inet_addr_lst[hash], hash) {
-   if (ifa->ifa_local == addr) {
-   struct net_device *dev = ifa->ifa_dev->dev;
-
-   if (!net_eq(dev_net(dev), net))
-   continue;
-   result = dev;
-   break;
-   }
-   }
-   if (!result) {
+   ifa = inet_lookup_ifaddr_rcu(net, addr);
+   if (!ifa) {
struct flowi4 fl4 = { .daddr = addr };
struct fib_result res = { 0 };
struct fib_table *local;
@@ -165,6 +155,8 @@ struct net_device *__ip_dev_find(struct net *net, __be32 
addr, bool devref)
!fib_table_lookup(local, &fl4, &res, FIB_LOOKUP_NOREF) &&
res.type == RTN_LOCAL)
result = FIB_RES_DEV(res);
+   } else {
+   result = ifa->ifa_dev->dev;
}
if (result && devref)
dev_hold(result);
@@ -173,6 +165,20 @@ struct net_device *__ip_dev_find(struct net *net, __be32 
addr, bool devref)
 }
 EXPORT_SYMBOL(__ip_dev_find);
 
+/* called under RCU lock */
+struct in_ifaddr *inet_lookup_ifaddr_rcu(struct net *net, __be32 addr)
+{
+   u32 hash = inet_addr_hash(net, addr);
+   struct in_ifaddr *ifa;
+
+   hlist_for_each_entry_rcu(ifa, &inet_addr_lst[hash], hash)
+   if (ifa->ifa_local == addr &&
+   net_eq(dev_net(ifa->ifa_dev->dev), net))
+   return ifa;
+
+   return NULL;
+}
+
 static void rtmsg_ifa(int event, struct in_ifaddr *, struct nlmsghdr *, u32);
 
 static BLOCKING_NOTIFIER_HEAD(inetaddr_chain);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 37819ab4cc74..f02819134ba2 100644
---

Re: [Patch net] net_sched: remove cls_flower idr on failure

2017-09-20 Thread Jiri Pirko

Wed, Sep 20, 2017 at 06:18:45PM CEST, xiyou.wangc...@gmail.com wrote:
>Fixes: c15ab236d69d ("net/sched: Change cls_flower to use IDR")
>Cc: Chris Mi 
>Cc: Jiri Pirko 
>Signed-off-by: Cong Wang 

Looks fine.
Acked-by: Jiri Pirko

Re: Latest net-next from GIT panic

2017-09-20 Thread Paweł Staszewski


Not much more after adding this patch

https://bugzilla.kernel.org/attachment.cgi?id=258529



On 2017-09-20 at 15:44, Eric Dumazet wrote:

On Wed, 2017-09-20 at 15:39 +0200, Paweł Staszewski wrote:

On 2017-09-20 at 15:34, Eric Dumazet wrote:

Could you try this debug patch ?

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f535779d9dc1dfe36934c2abba4e43d053ac5d6f..1eaa3553a724dc8c048f67b556337072d5addc82 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3331,7 +3331,14 @@ void netdev_run_todo(void);
*/
   static inline void dev_put(struct net_device *dev)
   {
-   this_cpu_dec(*dev->pcpu_refcnt);
+   int __percpu *pref = READ_ONCE(dev->pcpu_refcnt);
+
+   if (!pref) {
+   pr_err("no pcpu_refcnt on dev %p(%s) state %d dismantle %d\n",
+  dev, dev->name, dev->reg_state, dev->dismantle);
+   BUG();
+   }
+   this_cpu_dec(*pref);
   }
   
   /**





Which kernel version do you want me to apply this patch to?
I have just run git bisect reset, so I am currently on mainline stable.


Simply use the latest net-next as mentioned in the thread title, thanks.

Re: [PATCH] VSOCK: fix uapi/linux/vm_sockets.h incomplete types

2017-09-20 Thread Stefan Hajnoczi

On Tue, Sep 19, 2017 at 10:38:40AM -0700, David Miller wrote:
> From: Stefan Hajnoczi 
> Date: Mon, 18 Sep 2017 16:21:00 +0100
> 
> > On Fri, Sep 15, 2017 at 02:14:32PM -0700, David Miller wrote:
> >> > diff --git a/include/uapi/linux/vm_sockets.h 
> >> > b/include/uapi/linux/vm_sockets.h
> >> > index b4ed5d895699..4ae5c625ac56 100644
> >> > --- a/include/uapi/linux/vm_sockets.h
> >> > +++ b/include/uapi/linux/vm_sockets.h
> >> > @@ -18,6 +18,10 @@
> >> >  
> >> >  #include <linux/socket.h>
> >> >  
> >> > +#ifndef __KERNEL__
> >> > +#include <sys/socket.h> /* struct sockaddr */
> >> > +#endif
> >> > +
> >> 
> >> There is no precedent whatsoever to include sys/socket.h in _any_ UAPI
> >> header file provided by the kernel.
> > 
> > <linux/if.h> does it for the same reason:
> > 
> > include/uapi/linux/if.h:#include <sys/socket.h> /* for struct sockaddr. */
> 
> You don't need it for struct sockaddr, you need it for sa_family_t,
> the comment is very misleading.
> 
> Please do as I have instructed and it will fix this problem.

No, you really cannot rely on struct sockaddr from <linux/socket.h> in
uapi headers.  You can check this yourself:

  $ cd /tmp && gcc -o a.o -c /usr/include/linux/vm_sockets.h
  /usr/include/linux/vm_sockets.h:148:32: error: invalid application of 
‘sizeof’ to incomplete type ‘struct sockaddr’
  unsigned char svm_zero[sizeof(struct sockaddr) -
^~

The weird situation is:

1. When compiling the kernel, <linux/socket.h> brings in struct sockaddr
   because the compiler finds include/linux/socket.h first before
   include/uapi/linux/socket.h.

2. When compiling a userspace application, <linux/socket.h> does not
   bring in struct sockaddr because include/uapi/linux/socket.h is
   found.

This is why I added the #include <sys/socket.h> when !__KERNEL__.  Sorry
that the commit description wasn't clear on this.

Am I misunderstanding something?

Stefan
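
The incomplete-type problem described above can be reproduced in a few
lines of userspace C: sizeof(struct sockaddr) only compiles once a
header has supplied the complete type. The struct below is a
hypothetical stand-in shaped like a padded socket address, not a copy of
struct sockaddr_vm, and the size check assumes Linux with glibc.

```c
#include <assert.h>
#include <sys/socket.h> /* provides the complete struct sockaddr in userspace */

/* Hypothetical padded address; the field names are illustrative only. */
struct sockaddr_pad_sketch {
	sa_family_t family;
	unsigned short reserved;
	unsigned int port;
	unsigned int cid;
	/* sizeof(struct sockaddr) needs the complete type to be visible here */
	unsigned char zero[sizeof(struct sockaddr) - sizeof(sa_family_t) -
			   sizeof(unsigned short) - 2 * sizeof(unsigned int)];
};

/* The padding exists so that the sketch is exactly as large as sockaddr. */
static void check_layout(void)
{
	assert(sizeof(struct sockaddr_pad_sketch) == sizeof(struct sockaddr));
}
```

Removing the #include <sys/socket.h> line makes the array size
expression fail to compile, with the same kind of "incomplete type"
error as the gcc output quoted above.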

Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-20 Thread Hannes Frederic Sowa

Sridhar Samudrala  writes:

> This patch introduces a new socket option SO_SYMMETRIC_QUEUES that can be used
> to enable symmetric tx and rx queues on a socket.
>
> This option is specifically useful for epoll based multi threaded workloads
> where each thread handles packets received on a single RX queue . In this 
> model,
> we have noticed that it helps to send the packets on the same TX queue
> corresponding to the queue-pair associated with the RX queue specifically when
> busy poll is enabled with epoll().
>
> Two new fields are added to struct sock_common to cache the last rx ifindex 
> and
> the rx queue in the receive path of an SKB. __netdev_pick_tx() returns the 
> cached
> rx queue when this option is enabled and the TX is happening on the same 
> device.

Would it help to make the rx and tx skb hashes symmetric
(skb_get_hash_symmetric) on request?
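
For readers unfamiliar with the term, a "symmetric" flow hash gives the
same value for both directions of a connection. A toy userspace sketch
of the property follows; flow_sketch, mix32() and flow_hash_symmetric()
are hypothetical, and this is not how skb_get_hash_symmetric() is
implemented in the kernel.

```c
#include <assert.h>
#include <stdint.h>

struct flow_sketch {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

static uint32_t mix32(uint32_t h, uint32_t v)
{
	h ^= v;
	h *= 0x9e3779b1u; /* arbitrary odd multiplier, chosen for the sketch */
	return h ^ (h >> 15);
}

static uint32_t flow_hash_symmetric(struct flow_sketch f)
{
	uint32_t h;

	/* Canonical order: the endpoint that compares lower goes first. */
	if (f.saddr > f.daddr || (f.saddr == f.daddr && f.sport > f.dport)) {
		uint32_t addr = f.saddr;
		uint16_t port = f.sport;

		f.saddr = f.daddr;
		f.sport = f.dport;
		f.daddr = addr;
		f.dport = port;
	}
	h = mix32(0, f.saddr);
	h = mix32(h, f.daddr);
	return mix32(h, ((uint32_t)f.sport << 16) | f.dport);
}

/* Debug helper: both directions of a flow must hash alike. */
static void assert_symmetric(struct flow_sketch f)
{
	struct flow_sketch r = { f.daddr, f.saddr, f.dport, f.sport };

	assert(flow_hash_symmetric(f) == flow_hash_symmetric(r));
}
```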

Re: [PATCH net-next 03/14] gtp: Call common functions to get tunnel routes and add dst_cache

2017-09-20 Thread Andreas Schultz




On 19/09/17 14:09, Harald Welte wrote:

Hi Dave,

On Mon, Sep 18, 2017 at 09:17:51PM -0700, David Miller wrote:

This and the new dst caching code ignores any source address selection
done by ip_route_output_key() or the new tunnel route lookup helpers.

Either source address selection should be respected, or if saddr will
never be modified by a route lookup for some specific reason here,
that should be documented.


The IP source address is fixed by signaling on the GTP-C control plane
and nothing that the kernel can unilaterally decide to change.  Such a
change of address would have to be decided by and first be signaled on
GTP-C to the peer by the userspace daemon, which would then update the
PDP context in the kernel.


I think we had this discussion before. The sending IP and port are not
part of the identity of the PDP context. So IMHO the sender is permitted
to change the source IP at random.
Regards
Andreas



So I guess you're asking us to document that rationale as form of a
source code comment ?

Re: [PATCH net-next] bpf: Optimize lpm trie delete

2017-09-20 Thread Daniel Mack

Hi Craig,

Thanks, this looks much cleaner already :)

On 09/20/2017 06:22 PM, Craig Gallek wrote:
> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
> index 9d58a576b2ae..b5a7d70ec8b5 100644
> --- a/kernel/bpf/lpm_trie.c
> +++ b/kernel/bpf/lpm_trie.c
> @@ -397,7 +397,7 @@ static int trie_delete_elem(struct bpf_map *map, void 
> *_key)
>   struct lpm_trie_node __rcu **trim;
>   struct lpm_trie_node *node;
>   unsigned long irq_flags;
> - unsigned int next_bit;
> + unsigned int next_bit = 0;

This default assignment seems wrong, and I guess you only added it to
squelch a compiler warning?

[...]

> + /* If the node has one child, we may be able to collapse the tree
> +  * while removing this node if the node's child is in the same
> +  * 'next bit' slot as this node was in its parent or if the node
> +  * itself is the root.
> +  */
> + if (trim == &trie->root) {
> + next_bit = node->child[0] ? 0 : 1;
> + rcu_assign_pointer(trie->root, node->child[next_bit]);
> + kfree_rcu(node, rcu);

I don't think you should treat this 'root' case special.

Instead, move the 'next_bit' assignment outside of the condition ...

> + } else if (rcu_access_pointer(node->child[next_bit])) {
> + rcu_assign_pointer(*trim, node->child[next_bit]);
> + kfree_rcu(node, rcu);

... and then this branch would handle the case just fine. Correct?

Otherwise, looks good to me!



Thanks,
Daniel
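
One reading of the suggestion above, as a userspace sketch: compute
next_bit once, for root and inner nodes alike, and let a single
assignment through the trim slot do the collapse. node_sketch,
new_node() and unlink_node() are hypothetical names; the RCU and locking
parts of the real trie are deliberately left out.

```c
#include <assert.h>
#include <stdlib.h>

struct node_sketch {
	struct node_sketch *child[2];
	int value;
};

static struct node_sketch *new_node(int value)
{
	struct node_sketch *n = calloc(1, sizeof(*n));

	assert(n);
	n->value = value;
	return n;
}

/*
 * Unlink *trim, which must have at most one child, and hook that child
 * (or NULL) into the slot that pointed at the removed node. The slot
 * may be the root pointer or a child[] entry of the parent.
 */
static void unlink_node(struct node_sketch **trim)
{
	struct node_sketch *node = *trim;
	unsigned int next_bit;

	assert(node && !(node->child[0] && node->child[1]));
	next_bit = node->child[0] ? 0 : 1; /* same rule for root and inner nodes */
	*trim = node->child[next_bit];     /* NULL when the node was a leaf */
	free(node);
}
```

Because the caller passes the address of whichever pointer leads to the
node, the root needs no separate branch in this sketch.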

Re: [PATCHv3 iproute2 1/2] lib/libnetlink: re malloc buff if size is not enough

2017-09-20 Thread Stephen Hemminger

On Wed, 20 Sep 2017 09:43:39 +0800
Hangbin Liu  wrote:

Thanks for keeping up on this.

> +realloc:
> + bufp = realloc(buf, buf_len);
> +
> + if (bufp == NULL) {

Minor personal style issue:
To me, blank lines are like paragraphs in writing.
Code reads better when the assignment and the condition check are next to
each other.

> +recv:
> + len = recvmsg(fd, msg, flag);
> +
> + if (len < 0) {
> + if (errno == EINTR || errno == EAGAIN)
> + goto recv;
> + fprintf(stderr, "netlink receive error %s (%d)\n",
> + strerror(errno), errno);
> + free(buf);
> + return -errno;
> + }
> +
> + if (len == 0) {
> + fprintf(stderr, "EOF on netlink\n");
> + free(buf);
> + return -ENODATA;
> + }
> +
> + if (len > buf_len) {
> + buf_len = len;
> + flag = 0;
> + goto realloc;
> + }
> +
> + if (flag != 0) {
> + flag = 0;
> + goto recv;

Although I programmed in BASIC years ago, I never liked code
with loops via goto. To me it indicates the logic is not well thought
through.  Not sure exactly how to rearrange the control flow, but it
should be possible to rewrite this so that it reads cleaner.

Still think this needs to go through a few more review cycles
before applying.
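
One possible way to express the control flow without goto loops is
sketched below: a small retry helper plus a "peek the size, allocate,
then read" function. This is only an illustration under stated
assumptions, not the code that was eventually applied: recv_retry() and
recv_whole_datagram() are hypothetical names, recv() is used instead of
recvmsg() for brevity, and the sketch relies on the Linux behaviour that
MSG_PEEK | MSG_TRUNC returns the real datagram length.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Retry the receive while it is interrupted or would block, as the patch does. */
static ssize_t recv_retry(int fd, void *buf, size_t len, int flags)
{
	ssize_t n;

	do {
		n = recv(fd, buf, len, flags);
	} while (n < 0 && (errno == EINTR || errno == EAGAIN));
	return n;
}

/* On success returns the datagram length and stores a malloc()ed copy in *out. */
static ssize_t recv_whole_datagram(int fd, char **out)
{
	char probe;
	char *buf;
	ssize_t len;

	assert(out);
	/* Linux: MSG_PEEK | MSG_TRUNC reports the full size without dequeuing. */
	len = recv_retry(fd, &probe, sizeof(probe), MSG_PEEK | MSG_TRUNC);
	if (len < 0)
		return -errno;
	if (len == 0)
		return -ENODATA;

	buf = malloc((size_t)len);
	if (!buf)
		return -ENOMEM;

	len = recv_retry(fd, buf, (size_t)len, 0);
	if (len < 0) {
		int err = errno; /* save before free() can change it */

		free(buf);
		return -err;
	}
	*out = buf;
	return len;
}
```

The caller owns the returned buffer and frees it; the two loops from the
patch collapse into the do/while in recv_retry().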

Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-20 Thread Tom Herbert

On Tue, Sep 19, 2017 at 10:13 PM, Eric Dumazet  wrote:
> On Tue, 2017-09-19 at 21:59 -0700, Samudrala, Sridhar wrote:
>> On 9/19/2017 5:48 PM, Tom Herbert wrote:
>> > On Tue, Sep 19, 2017 at 5:34 PM, Samudrala, Sridhar
>> >  wrote:
>> > > On 9/12/2017 3:53 PM, Tom Herbert wrote:
>> > > > On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar
>> > > >  wrote:
>> > > > >
>> > > > > On 9/12/2017 8:47 AM, Eric Dumazet wrote:
>> > > > > > On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:
>> > > > > > > On 9/11/2017 8:53 PM, Eric Dumazet wrote:
>> > > > > > > > On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:
>> > > > > > > >
>> > > > > > > > > Two ints in sock_common for this purpose is quite expensive 
>> > > > > > > > > and the
>> > > > > > > > > use case for this is limited-- even if a RX->TX queue 
>> > > > > > > > > mapping were
>> > > > > > > > > introduced to eliminate the queue pair assumption this still 
>> > > > > > > > > won't
>> > > > > > > > > help if the receive and transmit interfaces are different 
>> > > > > > > > > for the
>> > > > > > > > > connection. I think we really need to see some very 
>> > > > > > > > > compelling
>> > > > > > > > > results
>> > > > > > > > > to be able to justify this.
>> > > > > > > Will try to collect and post some perf data with symmetric queue
>> > > > > > > configuration.
>> > >
>> > > Here is some performance data i collected with memcached workload over
>> > > ixgbe 10Gb NIC with mcblaster benchmark.
>> > > ixgbe is configured with 16 queues and rx-usecs is set to 1000 for a very
>> > > low
>> > > interrupt rate.
>> > >   ethtool -L p1p1 combined 16
>> > >   ethtool -C p1p1 rx-usecs 1000
>> > > and busy poll is set to 1000usecs
>> > >   sysctl net.core.busy_poll = 1000
>> > >
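>> > > 16 threads  800K requests/sec
>> > > =============================
>> > >                    rtt(min/avg/max)usecs  intr/sec  contextswitch/sec
>> > > ---------------------------------------------------------------------
>> > > Default            2/182/10641            23391     61163
>> > > Symmetric Queues   2/50/6311              20457     32843
>> > >
>> > > 32 threads  800K requests/sec
>> > > =============================
>> > >                    rtt(min/avg/max)usecs  intr/sec  contextswitch/sec
>> > > ---------------------------------------------------------------------
>> > > Default            2/162/6390             32168     69450
>> > > Symmetric Queues   2/50/3853              35044     35847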
>> > >
>> > No idea what "Default" configuration is. Please report how xps_cpus is
>> > being set, how many RSS queues there are, and what the mapping is
>> > between RSS queues and CPUs and shared caches. Also, whether any
>> > threads are pinned.
>> Default is linux 4.13 with the settings i listed above.
>> ethtool -L p1p1 combined 16
>> ethtool -C p1p1 rx-usecs 1000
>> sysctl net.core.busy_poll = 1000
>>
>> # ethtool -x p1p1
>> RX flow hash indirection table for p1p1 with 16 RX ring(s):
>>     0:      0     1     2     3     4     5     6     7
>>     8:      8     9    10    11    12    13    14    15
>>    16:      0     1     2     3     4     5     6     7
>>    24:      8     9    10    11    12    13    14    15
>>    32:      0     1     2     3     4     5     6     7
>>    40:      8     9    10    11    12    13    14    15
>>    48:      0     1     2     3     4     5     6     7
>>    56:      8     9    10    11    12    13    14    15
>>    64:      0     1     2     3     4     5     6     7
>>    72:      8     9    10    11    12    13    14    15
>>    80:      0     1     2     3     4     5     6     7
>>    88:      8     9    10    11    12    13    14    15
>>    96:      0     1     2     3     4     5     6     7
>>   104:      8     9    10    11    12    13    14    15
>>   112:      0     1     2     3     4     5     6     7
>>   120:      8     9    10    11    12    13    14    15
>>
>> smp_affinity for the 16 queuepairs
>> 141 p1p1-TxRx-0 ,0001
>> 142 p1p1-TxRx-1 ,0002
>> 143 p1p1-TxRx-2 ,0004
>> 144 p1p1-TxRx-3 ,0008
>> 145 p1p1-TxRx-4 ,0010
>> 146 p1p1-TxRx-5 ,0020
>> 147 p1p1-TxRx-6 ,0040
>> 148 p1p1-TxRx-7 ,0080
>> 149 p1p1-TxRx-8 ,0100
>> 150 p1p1-TxRx-9 ,0200
>> 151 p1p1-TxRx-10 ,0400
>> 152 p1p1-TxRx-11 ,0800
>> 153 p1p1-TxRx-12 ,1000
>> 154 p1p1-TxRx-13 ,2000
>> 155 p1p1-TxRx-14 ,4000
>> 156 p1p1-TxRx-15 ,8000
>> xps_cpus for the 16 Tx queues
>> ,0001
>> ,0002
>> ,0004
>> ,0008
>> ,0010
>> ,0020
>> ,0040
>> ,0080
>>
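Reading the quoted configuration: the indirection table spreads its 128 slots evenly over the 16 rings, and each queue pair's IRQ is pinned to the CPU with the same index. A small sketch of that mapping follows. The function names are illustrative, and selecting the slot as `hash % 128` is an assumption about how the NIC indexes the table, not something stated in the thread.

```c
#include <assert.h>
#include <stdint.h>

#define INDIR_SIZE 128	/* slots shown by "ethtool -x" above */
#define NUM_QUEUES 16	/* "ethtool -L p1p1 combined 16" */

/* Ring stored in a given indirection slot for the equal-weight table above. */
static unsigned int indir_entry(unsigned int slot)
{
	assert(slot < INDIR_SIZE);
	return slot % NUM_QUEUES;
}

/* Assumed slot selection: low bits of the RSS hash index the table. */
static unsigned int rx_queue_for_hash(uint32_t hash)
{
	return indir_entry(hash % INDIR_SIZE);
}

/* One CPU per queue pair, matching the smp_affinity masks listed above. */
static uint32_t irq_affinity_mask(unsigned int queue)
{
	assert(queue < NUM_QUEUES);
	return UINT32_C(1) << queue;
}
```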

Re: [PATCH,net-next,0/2] Improve code coverage of syzkaller

2017-09-20 Thread Willem de Bruijn

On Wed, Sep 20, 2017 at 2:08 AM, David Miller  wrote:
> From: Petar Penkov 
> Date: Tue, 19 Sep 2017 21:26:14 -0700
>
>> Furthermore, in a way testing already requires specific kernel
>> configuration.  In this particular example, syzkaller prefers
>> synchronous operation and therefore needs 4KSTACKS disabled. Other
>> features that require rebuilding are KASAN and dbx. From this point
>> of view, I still think that having the TUN_NAPI flag has value.
>
> Then I think this path could be enabled/disabled with a runtime flag
> just as easily, no?

I think that the compile time option was chosen because of the ns_capable
check, so that with user namespaces unprivileged processes can control this
path. Perhaps we can require capable() only to set IFF_NAPI_FRAGS.

Then we can convert the napi_gro_receive path to be conditional on a new
IFF_NAPI flag instead of this compile time option.

[PATCH v4 0/4] Add cross-compilation support to eBPF samples

2017-09-20 Thread Joel Fernandes

These patches fix issues seen when cross-compiling eBPF samples on arm64.
Compared to [1], I dropped the controversial inline-asm patch and am exploring
other options to fix it. However, these patches are a step in the right
direction and I look forward to getting them into -next and the merge window.

Changes since v3:
- just a repost with acks

[1] https://lkml.org/lkml/2017/8/7/417

Joel Fernandes (4):
  samples/bpf: Use getppid instead of getpgrp for array map stress
  samples/bpf: Enable cross compiler support
  samples/bpf: Fix pt_regs issues when cross-compiling
  samples/bpf: Add documentation on cross compilation

 samples/bpf/Makefile  |  7 +++-
 samples/bpf/README.rst| 10 ++
 samples/bpf/map_perf_test_kern.c  |  2 +-
 samples/bpf/map_perf_test_user.c  |  2 +-
 tools/testing/selftests/bpf/bpf_helpers.h | 56 +++
 5 files changed, 67 insertions(+), 10 deletions(-)

-- 
2.14.1.821.g8fa685d3b7-goog

Re: [PATCH net-next 08/10] net/smc: introduce a delay

2017-09-20 Thread Ursula Braun

On 09/20/2017 04:03 PM, Leon Romanovsky wrote:
> On Wed, Sep 20, 2017 at 01:58:11PM +0200, Ursula Braun wrote:
>> The number of outstanding work requests is limited. If all work
>> requests are in use, tx processing is postponed to another scheduling
>> of the tx worker. Switch to a delayed worker to have a gap for tx
>> completion queue events before the next retry.
>>
> 
> How will delay prevent and protect the resource exhausting?
> 
> Thanks
> 

SMC runs with a fixed number of in-flight work requests per QP (constant
SMC_WR_BUF_CNT) to prevent resource exhaustion. If all work requests are
currently in use, sending another work request has to wait until some
outstanding work request is confirmed via the send completion queue. If
sending is done in a context which is not allowed to wait, the tx_worker is
scheduled instead.
With this patch a small delay is added to avoid too many unsuccessful send
retries while the "all work requests in use" condition persists.
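A user-space sketch of the behaviour described above, under stated assumptions: `WR_SLOTS` stands in for SMC_WR_BUF_CNT, `RETRY_DELAY_MS` is a placeholder value, and all names are hypothetical and do not come from the SMC code. When no slot is free, the sender reports a delay for the next attempt and does not retry immediately.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define WR_SLOTS 16		/* assumption: stands in for SMC_WR_BUF_CNT */
#define RETRY_DELAY_MS 1	/* assumption: placeholder retry delay */

struct wr_pool {
	bool in_use[WR_SLOTS];
};

/* Claim a free slot; returns its index, or -EBUSY if all are in flight. */
static int wr_get_slot(struct wr_pool *pool)
{
	int i;

	assert(pool != NULL);
	for (i = 0; i < WR_SLOTS; i++) {
		if (!pool->in_use[i]) {
			pool->in_use[i] = true;
			return i;
		}
	}
	return -EBUSY;
}

/* Release a slot, as a send completion event would. */
static void wr_put_slot(struct wr_pool *pool, int slot)
{
	assert(pool != NULL && slot >= 0 && slot < WR_SLOTS);
	pool->in_use[slot] = false;
}

/*
 * Non-blocking send attempt. Returns 0 if a work request was posted, or
 * the delay in milliseconds after which the tx worker should retry.
 */
static int tx_try_send(struct wr_pool *pool)
{
	int slot = wr_get_slot(pool);

	if (slot == -EBUSY)
		return RETRY_DELAY_MS;
	/* ... the work request would be posted with 'slot' here ... */
	return 0;
}
```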

[PATCH v5 net 3/3] lan78xx: Use default values loaded from EEPROM/OTP after reset

2017-09-20 Thread Nisar Sayed

Use default value of auto duplex and auto speed values loaded
from EEPROM/OTP after reset. The LAN78xx allows platform
configurations to be loaded from EEPROM/OTP.
Ex: When external phy is connected, the MAC can be configured to
have correct auto speed, auto duplex, auto polarity configured
from the EEPROM/OTP.

Fixes: 55d7de9de6c3 ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 Ethernet device driver")
Signed-off-by: Nisar Sayed 
---
 drivers/net/usb/lan78xx.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index f8c63eec8353..0161f77641fa 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -2449,7 +2449,6 @@ static int lan78xx_reset(struct lan78xx_net *dev)
/* LAN7801 only has RGMII mode */
if (dev->chipid == ID_REV_CHIP_ID_7801_)
buf &= ~MAC_CR_GMII_EN_;
-   buf |= MAC_CR_AUTO_DUPLEX_ | MAC_CR_AUTO_SPEED_;
ret = lan78xx_write_reg(dev, MAC_CR, buf);
 
ret = lan78xx_read_reg(dev, MAC_TX, &buf);
-- 
2.14.1

Re: [PATCH net-next 09/14] gtp: Allow configuring GTP interface as standalone

2017-09-20 Thread Tom Herbert

On Wed, Sep 20, 2017 at 8:27 AM, Andreas Schultz  wrote:
> On 19/09/17 02:38, Tom Herbert wrote:
>>
>> Add new configuration of GTP interfaces that allow specifying a port to
>> listen on (as opposed to having to get sockets from a userspace control
>> plane). This allows GTP interfaces to be configured and the data path
>> tested without requiring a GTP-C daemon.
>
>
> This would imply that you can have multiple independent GTP sockets on the
> same IP address. That is not permitted by the GTP specifications. 3GPP TS
> 29.281, section 4.3 states clearly that there is "only" one GTP entity per
> IP address. A PDP context is defined by the destination IP and the TEID. The
> destination port is not part of the identity of a PDP context.
>
We are in no way trying to change GTP; if someone runs this in a real GTP
network then they need to abide by the specification. However, there
is nothing inconsistent and it breaks nothing if someone wishes to use
different port numbers in their own private network for testing or
development purposes. Every other UDP application that has an assigned
port number allows configurable ports; I don't see that GTP is so
special that it should be an exception.

Tom
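To illustrate the "configurable port with a well-known default" pattern argued for above, here is a user-space sketch. It is not the GTP driver's configuration path; `gtp_listen_port` and `open_udp_listener` are hypothetical names, and binding to loopback is an arbitrary choice for a test setup. 2152 is the well-known GTP-U UDP port.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define GTP1U_DEFAULT_PORT 2152	/* well-known GTP-U UDP port */

/* Use the configured port if one was given, else the well-known one. */
static uint16_t gtp_listen_port(int configured)
{
	assert(configured <= 65535);
	return configured > 0 ? (uint16_t)configured : GTP1U_DEFAULT_PORT;
}

/* Open a UDP socket bound to 127.0.0.1:port; returns the fd or -1. */
static int open_udp_listener(uint16_t port)
{
	struct sockaddr_in addr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = htons(port);

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A test setup would call `open_udp_listener(gtp_listen_port(cfg))`, so a deployment that leaves the port unset still listens on the standard one.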

[PATCH v4 2/4] samples/bpf: Enable cross compiler support

2017-09-20 Thread Joel Fernandes

When cross compiling, bpf samples use HOSTCC for compiling the non-BPF part of
the sample, however what we really want is to use the cross compiler to build
for the cross target since that is what will load and run the BPF sample.
Detect this and compile samples correctly.

Acked-by: Alexei Starovoitov 
Signed-off-by: Joel Fernandes 
---
 samples/bpf/Makefile | 5 +
 1 file changed, 5 insertions(+)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index cf17c7932a6e..13f74b67ca44 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -177,6 +177,11 @@ HOSTLOADLIBES_syscall_tp += -lelf
 LLC ?= llc
 CLANG ?= clang
 
+# Detect that we're cross compiling and use the cross compiler
+ifdef CROSS_COMPILE
+HOSTCC = $(CROSS_COMPILE)gcc
+endif
+
 # Trick to allow make to be run from this directory
 all:
$(MAKE) -C ../../ $(CURDIR)/
-- 
2.14.1.821.g8fa685d3b7-goog

[PATCH v4 1/4] samples/bpf: Use getppid instead of getpgrp for array map stress

2017-09-20 Thread Joel Fernandes

When cross-compiling the bpf sample map_perf_test for aarch64, I find that
__NR_getpgrp is undefined. This causes build errors. This syscall is deprecated
and requires defining __ARCH_WANT_SYSCALL_DEPRECATED. To avoid having to define
that, just use a different syscall (getppid) for the array map stress test.

Acked-by: Alexei Starovoitov 
Signed-off-by: Joel Fernandes 
---
 samples/bpf/map_perf_test_kern.c | 2 +-
 samples/bpf/map_perf_test_user.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/samples/bpf/map_perf_test_kern.c b/samples/bpf/map_perf_test_kern.c
index 098c857f1eda..2b2ffb97018b 100644
--- a/samples/bpf/map_perf_test_kern.c
+++ b/samples/bpf/map_perf_test_kern.c
@@ -266,7 +266,7 @@ int stress_hash_map_lookup(struct pt_regs *ctx)
return 0;
 }
 
-SEC("kprobe/sys_getpgrp")
+SEC("kprobe/sys_getppid")
 int stress_array_map_lookup(struct pt_regs *ctx)
 {
u32 key = 1, i;
diff --git a/samples/bpf/map_perf_test_user.c b/samples/bpf/map_perf_test_user.c
index f388254896f6..a0310fc70057 100644
--- a/samples/bpf/map_perf_test_user.c
+++ b/samples/bpf/map_perf_test_user.c
@@ -282,7 +282,7 @@ static void test_array_lookup(int cpu)
 
start_time = time_get_ns();
for (i = 0; i < max_cnt; i++)
-   syscall(__NR_getpgrp, 0);
+   syscall(__NR_getppid, 0);
printf("%d:array_lookup %lld lookups per sec\n",
   cpu, max_cnt * 10ll * 64 / (time_get_ns() - start_time));
 }
-- 
2.14.1.821.g8fa685d3b7-goog
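For reference, a small user-space sketch of what the stress loop in the patch above relies on: invoking getppid by syscall number, which should be available on arm64 as well as x86. `raw_getppid` is an illustrative name, and the sketch assumes Linux with glibc's syscall() wrapper.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Enter the kernel through the raw syscall number, as the sample does. */
static pid_t raw_getppid(void)
{
	long ret = syscall(__NR_getppid);

	assert(ret >= 0);	/* getppid cannot fail */
	return (pid_t)ret;
}
```

Each call enters the kernel's getppid handler, which is what the sys_getppid kprobe in the sample attaches to.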

[PATCH v4 3/4] samples/bpf: Fix pt_regs issues when cross-compiling

2017-09-20 Thread Joel Fernandes

BPF samples fail to build when cross-compiling for ARM64 because of incorrect
pt_regs param selection. This is because clang defines __x86_64__ and
bpf_headers thinks we're building for x86. Since clang is building for the BPF
target, it shouldn't make assumptions about what target the BPF program is
going to run on. To fix this, let's pass ARCH so the header knows which target
the BPF program is being compiled for and can use the correct pt_regs code.

Acked-by: Alexei Starovoitov 
Signed-off-by: Joel Fernandes 
---
 samples/bpf/Makefile  |  2 +-
 tools/testing/selftests/bpf/bpf_helpers.h | 56 +++
 2 files changed, 50 insertions(+), 8 deletions(-)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 13f74b67ca44..ebc2ad69b62c 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -230,7 +230,7 @@ $(obj)/%.o: $(src)/%.c
$(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) -I$(obj) \
-I$(srctree)/tools/testing/selftests/bpf/ \
-D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \
-   -Wno-compare-distinct-pointer-types \
+   -D__TARGET_ARCH_$(ARCH) -Wno-compare-distinct-pointer-types \
-Wno-gnu-variable-sized-type-not-at-end \
-Wno-address-of-packed-member -Wno-tautological-compare \
-Wno-unknown-warning-option \
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 36fb9161b34a..4875395b0b52 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -109,7 +109,47 @@ static int (*bpf_skb_under_cgroup)(void *ctx, void *map, int index) =
 static int (*bpf_skb_change_head)(void *, int len, int flags) =
(void *) BPF_FUNC_skb_change_head;
 
+/* Scan the ARCH passed in from ARCH env variable (see Makefile) */
+#if defined(__TARGET_ARCH_x86)
+   #define bpf_target_x86
+   #define bpf_target_defined
+#elif defined(__TARGET_ARCH_s390x)
+   #define bpf_target_s390x
+   #define bpf_target_defined
+#elif defined(__TARGET_ARCH_arm64)
+   #define bpf_target_arm64
+   #define bpf_target_defined
+#elif defined(__TARGET_ARCH_mips)
+   #define bpf_target_mips
+   #define bpf_target_defined
+#elif defined(__TARGET_ARCH_powerpc)
+   #define bpf_target_powerpc
+   #define bpf_target_defined
+#elif defined(__TARGET_ARCH_sparc)
+   #define bpf_target_sparc
+   #define bpf_target_defined
+#else
+   #undef bpf_target_defined
+#endif
+
+/* Fall back to what the compiler says */
+#ifndef bpf_target_defined
 #if defined(__x86_64__)
+   #define bpf_target_x86
+#elif defined(__s390x__)
+   #define bpf_target_s390x
+#elif defined(__aarch64__)
+   #define bpf_target_arm64
+#elif defined(__mips__)
+   #define bpf_target_mips
+#elif defined(__powerpc__)
+   #define bpf_target_powerpc
+#elif defined(__sparc__)
+   #define bpf_target_sparc
+#endif
+#endif
+
+#if defined(bpf_target_x86)
 
 #define PT_REGS_PARM1(x) ((x)->di)
 #define PT_REGS_PARM2(x) ((x)->si)
@@ -122,7 +162,7 @@ static int (*bpf_skb_change_head)(void *, int len, int flags) =
 #define PT_REGS_SP(x) ((x)->sp)
 #define PT_REGS_IP(x) ((x)->ip)
 
-#elif defined(__s390x__)
+#elif defined(bpf_target_s390x)
 
 #define PT_REGS_PARM1(x) ((x)->gprs[2])
 #define PT_REGS_PARM2(x) ((x)->gprs[3])
@@ -135,7 +175,7 @@ static int (*bpf_skb_change_head)(void *, int len, int flags) =
 #define PT_REGS_SP(x) ((x)->gprs[15])
 #define PT_REGS_IP(x) ((x)->psw.addr)
 
-#elif defined(__aarch64__)
+#elif defined(bpf_target_arm64)
 
 #define PT_REGS_PARM1(x) ((x)->regs[0])
 #define PT_REGS_PARM2(x) ((x)->regs[1])
@@ -148,7 +188,7 @@ static int (*bpf_skb_change_head)(void *, int len, int flags) =
 #define PT_REGS_SP(x) ((x)->sp)
 #define PT_REGS_IP(x) ((x)->pc)
 
-#elif defined(__mips__)
+#elif defined(bpf_target_mips)
 
 #define PT_REGS_PARM1(x) ((x)->regs[4])
 #define PT_REGS_PARM2(x) ((x)->regs[5])
@@ -161,7 +201,7 @@ static int (*bpf_skb_change_head)(void *, int len, int flags) =
 #define PT_REGS_SP(x) ((x)->regs[29])
 #define PT_REGS_IP(x) ((x)->cp0_epc)
 
-#elif defined(__powerpc__)
+#elif defined(bpf_target_powerpc)
 
 #define PT_REGS_PARM1(x) ((x)->gpr[3])
 #define PT_REGS_PARM2(x) ((x)->gpr[4])
@@ -172,7 +212,7 @@ static int (*bpf_skb_change_head)(void *, int len, int flags) =
 #define PT_REGS_SP(x) ((x)->sp)
 #define PT_REGS_IP(x) ((x)->nip)
 
-#elif defined(__sparc__)
+#elif defined(bpf_target_sparc)
 
 #define PT_REGS_PARM1(x) ((x)->u_regs[UREG_I0])
 #define PT_REGS_PARM2(x) ((x)->u_regs[UREG_I1])
@@ -182,6 +222,8 @@ static int (*bpf_skb_change_head)(void *, int len, int flags) =
 #define PT_REGS_RET(x) ((x)->u_regs[UREG_I7])
 #define PT_REGS_RC(x) ((x)->u_regs[UREG_I0])
 #define PT_REGS_SP(x) ((x)->u_regs[UREG_FP])
+
+/* Should this also be a bpf_target check for the sparc
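The selection scheme in the patch above can be tried outside the kernel tree with a reduced sketch. Here `__TARGET_ARCH_arm64` is defined by hand to simulate what `-D__TARGET_ARCH_$(ARCH)` would pass for ARCH=arm64, and the register structs are mock layouts invented for the example, not the real pt_regs definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simulates -D__TARGET_ARCH_arm64 from the Makefile change. */
#define __TARGET_ARCH_arm64

/* Step 1: map the explicit target define onto an internal selector. */
#if defined(__TARGET_ARCH_x86)
	#define bpf_target_x86
#elif defined(__TARGET_ARCH_arm64)
	#define bpf_target_arm64
#endif

/* Mock register layouts, illustrative only. */
struct mock_x86_regs {
	unsigned long di;
};

struct mock_arm64_regs {
	unsigned long regs[31];
};

/* Step 2: pick the accessor from the selector, not from compiler macros. */
#if defined(bpf_target_x86)
	#define PT_REGS_PARM1(x) ((x)->di)
	typedef struct mock_x86_regs mock_pt_regs;
#elif defined(bpf_target_arm64)
	#define PT_REGS_PARM1(x) ((x)->regs[0])
	typedef struct mock_arm64_regs mock_pt_regs;
#else
	#error "no target architecture selected"
#endif

/* First function argument as seen through the selected accessor. */
static unsigned long first_arg(const mock_pt_regs *ctx)
{
	assert(ctx != NULL);
	return PT_REGS_PARM1(ctx);
}
```

Even when this is built by an x86-64 compiler that defines `__x86_64__`, the arm64 accessor is chosen, which is the behaviour the patch is after.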

[PATCH v4 4/4] samples/bpf: Add documentation on cross compilation

2017-09-20 Thread Joel Fernandes

Acked-by: Alexei Starovoitov 
Signed-off-by: Joel Fernandes 
---
 samples/bpf/README.rst | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/samples/bpf/README.rst b/samples/bpf/README.rst
index 79f9a58f1872..2b906127ef54 100644
--- a/samples/bpf/README.rst
+++ b/samples/bpf/README.rst
@@ -64,3 +64,13 @@ It is also possible to point make to the newly compiled 'llc' or
 'clang' command via redefining LLC or CLANG on the make command line::
 
  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
+
+Cross compiling samples
+-----------------------
+In order to cross-compile, say for arm64 targets, export CROSS_COMPILE and ARCH
+environment variables before calling make. This will direct make to build
+samples for the cross target.
+
+export ARCH=arm64
+export CROSS_COMPILE="aarch64-linux-gnu-"
+make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
-- 
2.14.1.821.g8fa685d3b7-goog

[PATCH net-next] net: dsa: use dedicated CPU port

2017-09-20 Thread Vivien Didelot

Each port in DSA has its own dedicated CPU port currently available in
its parent switch's ds->ports[port].cpu_dp. Use it instead of getting
the unique tree CPU port, which will be deprecated soon.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/b53/b53_common.c | 4 ++--
 drivers/net/dsa/bcm_sf2.c| 6 +++---
 drivers/net/dsa/mt7530.c | 4 ++--
 drivers/net/dsa/mv88e6060.c  | 2 +-
 drivers/net/dsa/qca8k.c  | 2 +-
 5 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index a9f2a5b55a5e..d4ce092def83 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -1336,7 +1336,7 @@ EXPORT_SYMBOL(b53_fdb_dump);
 int b53_br_join(struct dsa_switch *ds, int port, struct net_device *br)
 {
struct b53_device *dev = ds->priv;
-   s8 cpu_port = ds->dst->cpu_dp->index;
+   s8 cpu_port = ds->ports[port].cpu_dp->index;
u16 pvlan, reg;
unsigned int i;
 
@@ -1382,7 +1382,7 @@ void b53_br_leave(struct dsa_switch *ds, int port, struct net_device *br)
 {
struct b53_device *dev = ds->priv;
struct b53_vlan *vl = &dev->vlans[0];
-   s8 cpu_port = ds->dst->cpu_dp->index;
+   s8 cpu_port = ds->ports[port].cpu_dp->index;
unsigned int i;
u16 pvlan, reg, pvid;
 
diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index 0072a959db5b..898d5642b516 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -661,7 +661,7 @@ static int bcm_sf2_sw_resume(struct dsa_switch *ds)
 static void bcm_sf2_sw_get_wol(struct dsa_switch *ds, int port,
   struct ethtool_wolinfo *wol)
 {
-   struct net_device *p = ds->dst->cpu_dp->netdev;
+   struct net_device *p = ds->ports[port].cpu_dp->netdev;
struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
struct ethtool_wolinfo pwol;
 
@@ -684,9 +684,9 @@ static void bcm_sf2_sw_get_wol(struct dsa_switch *ds, int port,
 static int bcm_sf2_sw_set_wol(struct dsa_switch *ds, int port,
  struct ethtool_wolinfo *wol)
 {
-   struct net_device *p = ds->dst->cpu_dp->netdev;
+   struct net_device *p = ds->ports[port].cpu_dp->netdev;
struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
-   s8 cpu_port = ds->dst->cpu_dp->index;
+   s8 cpu_port = ds->ports[port].cpu_dp->index;
struct ethtool_wolinfo pwol;
 
p->ethtool_ops->get_wol(p, &pwol);
diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c
index c142b97add2c..faa3b88d2206 100644
--- a/drivers/net/dsa/mt7530.c
+++ b/drivers/net/dsa/mt7530.c
@@ -928,11 +928,11 @@ mt7530_setup(struct dsa_switch *ds)
struct device_node *dn;
struct mt7530_dummy_poll p;
 
-   /* The parent node of cpu_dp->netdev which holds the common system
+   /* The parent node of master netdev which holds the common system
 * controller also is the container for two GMACs nodes representing
 * as two netdev instances.
 */
-   dn = ds->dst->cpu_dp->netdev->dev.of_node->parent;
+   dn = ds->ports[MT7530_CPU_PORT].netdev->dev.of_node->parent;
priv->ethernet = syscon_node_to_regmap(dn);
if (IS_ERR(priv->ethernet))
return PTR_ERR(priv->ethernet);
diff --git a/drivers/net/dsa/mv88e6060.c b/drivers/net/dsa/mv88e6060.c
index dce7fa57eb55..621cdc46ad81 100644
--- a/drivers/net/dsa/mv88e6060.c
+++ b/drivers/net/dsa/mv88e6060.c
@@ -176,7 +176,7 @@ static int mv88e6060_setup_port(struct dsa_switch *ds, int p)
  ((p & 0xf) << PORT_VLAN_MAP_DBNUM_SHIFT) |
   (dsa_is_cpu_port(ds, p) ?
ds->enabled_port_mask :
-   BIT(ds->dst->cpu_dp->index)));
+   BIT(ds->ports[p].cpu_dp->index)));
 
/* Port Association Vector: when learning source addresses
 * of packets, add the address to the address database using
diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
index 5ada7a41449c..82f09711ac1a 100644
--- a/drivers/net/dsa/qca8k.c
+++ b/drivers/net/dsa/qca8k.c
@@ -506,7 +506,7 @@ qca8k_setup(struct dsa_switch *ds)
pr_warn("regmap initialization failed");
 
/* Initialize CPU port pad mode (xMII type, delays...) */
-   phy_mode = of_get_phy_mode(ds->dst->cpu_dp->dn);
+   phy_mode = of_get_phy_mode(ds->ports[QCA8K_CPU_PORT].dn);
if (phy_mode < 0) {
pr_err("Can't find phy-mode for master device\n");
return phy_mode;
-- 
2.14.1

Re: [PATCH net-next 00/14] gtp: Additional feature support

2017-09-20 Thread Andreas Schultz


Hi Harald,

On 20/09/17 01:19, Harald Welte wrote:

Hi Tom,

On Tue, Sep 19, 2017 at 08:59:28AM -0700, Tom Herbert wrote:

On Tue, Sep 19, 2017 at 5:43 AM, Harald Welte 
wrote:

On Mon, Sep 18, 2017 at 05:38:50PM -0700, Tom Herbert wrote:

   - IPv6 support


see my detailed comments in other mails.  It's unfortunately only
support for the already "deprecated" IPv6-only PDP contexts, not the
more modern v4v6 type.  In order to interoperate with old and new
approach, all three cases (v4, v6 and v4v6) should be supported from
one code base.


It sounds like something that can be subsequently added.


Not entirely, at least on the netlink (and any other configuration
interface) you will have to reflect this from the very beginning.  You
have to have an explicit PDP type and cannot rely on the address type to
specify the type of PDP context.  Whatever interfaces are introduced
now will have to remain compatible to any future change.

My strategy to avoid any such possible 'road blocks' from being
introduced would be to simply add v4v6 and v6 support in one go.  The
differences are marginal (having both an IPv6 prefix and a v4 address in
parallel, rather than mutually exclusive only).


Do you have a reference to the spec?


See http://osmocom.org/issues/2418#note-7 which lists Section 11.2.1.3.2
of 3GPP TS 29.061 in combination with RFC3314, RFC7066, RFC6459 and
3GPP TS 23.060 9.2.1 as well as a summary of my understanding of it some
months ago.


   - Configurable networking interfaces so that GTP kernel can be
   used and tested without needing GSN network emulation (i.e. no
   user space daemon needed).


We have some pretty decent userspace utilities for configuring the
GTP interfaces and tunnels in the libgtpnl repository, but if it
helps people to have another way of configuration, I won't be
against it.


AFAIK those userspace utilities don't support IPv6.


Of course not [yet]. libgtpnl and the command line tools have been
implemented specifically for the in-kernel GTP driver, and you have to
make sure to add related support on both the kernel and the userspace
side (libgtpnl). So there's little point in adding features on either
side before the other side.  There would be no way to test...


Being able to configure GTP like any other encapsulation will
facilitate development of IPv6 and other features.


That may very well be the case, but adding "IPv6 support" to kernel GTP
in a way that is not in line with the existing userspace libraries and
control-plane implementations means that you're developing those
features in an artificial environment that doesn't resemble real 3GPP
interoperable networks out there.

As indicated, I'm not against adding additional interfaces, but we have
to make sure that we add IPv6 support (or any new feature support) to at
least libgtpnl, and to make sure we test interoperability with existing
3GPP network equipment such as real IPv6 capable phones and SGSNs.


I'm not sure if this is a useful feature.  GTP is used only in
operator-controlled networks and only on standard ports.  It's not
possible to negotiate any non-standard ports on the signaling plane
either.


Bear in mind that we're not required to do everything the GTP spec
says.


Yes, we are, at least as long as it affects interoperability with other
implementations out there.

GTP uses well-known port numbers on *both* sides of the tunnel, and you
cannot deviate from that.


Actually, the well-known port is only mandatory for the receiving side.
The sending side can choose any port it wishes as long as it is prepared
to receive possible error indication on the well-known port.

Of course, it makes the implementation simple to use only one port, but 
for scalability it might be a good idea to support per PDP context 
sending ports.


Regards
Andreas


There's no point in having all kinds of features in the GTP user plane
which are not interoperable with other implementations, and which are
completely outside of the information model / architecture of GTP.

In the real world, GTP-U is only used in combination with GTP-C.  And in
GTP-C you can only negotiate the IP address of both sides of GTP-U, and
not the port number information.  As a result, the port numbers are
static on both sides.


My impression is GTP designers probably didn't think in terms of
getting best performance. But we can ;-)


I think it's wasted efforts if it's about "random udp ports" as no
standards-compliant implementation out there with which you will have to
interoperate will be able to support it.

GTP is used between home and roaming operator.  If you want to introduce
changes to how it works, you will have to have control over both sides
of the implementation of both the GTP-C and the GTP-u plane, which is
very unlikely and rather the exception in the hundreds of operators you
interoperate with.  Also keep in mind that there often are various
"middleboxes" that will suddenly have to reflect your changes.  That

Re: [PATCH net-next] net: avoid a full fib lookup when rp_filter is disabled.

2017-09-20 Thread Paolo Abeni

Dumb me and dumb my scripts. 

This is actually a v2, v1 was at:

https://patchwork.ozlabs.org/project/netdev/list/?series=3835

David, please let me know if you prefer I'll repost with a more
appropriate subject line.

Sorry for the noise,

Paolo

vhost_net: VM looses network when using vhost over time

2017-09-20 Thread Bernd Naumann

Hi @all,

We have encountered a bug that recurs fairly regularly, but we do not know
exactly how to trigger it or how to debug it in the first place.


# Background

In our setup we have a Ganeti cluster (KVM) with currently ~60 nodes running
~500 VMs. We use tap interfaces on L2 bridges, L3 routed tap interfaces, and
tap interfaces on a bridge with a VTEP attached to it. (For the VXLAN setup we
have a home-grown daemon to maintain the FDB.)


# The issue

On some VMs we lose network connectivity under unknown circumstances.
"Losing" means that the VM is not reachable and therefore cannot reach any
other host in the network.

However, with `tcpdump` on the host (physical NIC + bridge) we can see the
traffic going in, but with `tcpdump` on the VM we only see ARP coming in and
nothing going out. Manually setting the ARP entry does not help at all, or
helps only for a moment, as with `ip link set $DEV arp off; ip link set $DEV
arp on`. The only way we found to "fix" it is rebooting the VM or running
`modprobe -r virtio_net; modprobe virtio_net`, but this does not seem to be
the best workaround either, and the issue can recur shortly afterwards. It is
also difficult to determine when the issue kicks in. Counting 'FAILED'
neighbors is an indicator, but nothing to rely on.

The frequency of the issue ranges from once in a few days to multiple times
per day, or even a few minutes after boot. We see the most impact on VMs with
higher network traffic, such as our gateway VMs (multiple NICs in different
networks, IPsec, iptables, ...) and HAProxy VMs (similar to our gateways), but
also (with reduced frequency) on /normal/ application VMs.

From what we have found so far, it looks similar to:
* https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/997978 -- Bug #997978 
“KVM images lose connectivity with bridged network” : Bugs : qemu-kvm package : 
Ubuntu
* https://bugs.centos.org/view.php?id=5526 -- 0005526: KVM Guest with virtio 
network loses network connectivity - CentOS Bug Tracker

Via `rtmon` we can observe that it starts with some "FAILED" neighbor entries
and that they increase over time. Since this is only a consequence of ARP
replies not being sent to the requester, or of ARP requests going unanswered
(because the packet does not leave the VM), the increasing count of 'FAILED'
neighbors is /normal/. BUT: this can start on any interface (bridged tap
interface for WAN, bridged tap in VXLAN, routed tap); it does not matter and
is not directly linked to the "kind" of interface.


# General overview of the setup

* Ganeti cluster with ~60 nodes
* each node has 2 x 50G (mlnx5 dual-port) connected to 2 x MLNX SN2700 switches
* each node runs `bird` with OSPF and ECMP (and OSPF with ECMP on SN2700 too)
* each VM has one or more vNICs in a bridged or routed network
* networks: bridged tap in WAN; bridged tap with attached VTEP; routed tap
* host OS: Ubuntu 16.04.3 with Ubuntu Kernel 4.12.13; first tested with
qemu-kvm 1:2.5+dfsg-5ubuntu10.15, and later upgraded to qemu-kvm
2.10~rc3+dfsg-0ubuntu1, same issue; guest OS Ubuntu 14.04, Ubuntu 16.04 and
Ubuntu 16.04 with latest Ubuntu mainline kernel PPA


# So far we can "verify" it is 'vhost'

Without "vhost=on" for the kvm process we cannot observe this issue. While
using "vhost=on", an affected VM can be "fixed" by `rmmod` and `insmod
virtio_net`, but a reboot seems to provide a "fix" for a "longer" period. (But
as you may know, virtio does not have the performance we expect.)


So we have some questions:

* How can we debug the main issue to provide a meaningful bug report? Debug
flags on the kernel, but where to attach gdb? Sadly we are not kernel hackers
:/, but we can compile our own kernel and qemu-kvm to also test release
candidates and/or put patches in place.
* Has anyone seen this too and can provide a better workaround, a patch, or
anything else?
* Where to file/reopen this issue? qemu, netdev?
* Is qemu-kvm even the right place to look for answers?

We are happy to provide more information or collect debug information if 
someone wants to investigate.

Thanks for your time!
Best,
Bernd Naumann

Spreadshirt 
Bernd Naumann 
Systems Engineer, Networking & Operations 
bernd.naum...@spreadshirt.net 

http://www.spreadshirt.com 

sprd.net AG 
Gießerstraße 27 
D-04229 Leipzig 

Fon: +49 341 594 00 - 5900 
Fax: +49 341 594 00 - 5149 

Vorstand / executive board: Philip Rooke (CEO/Vorsitzender) · Tobias Schaugg 
Aufsichtsratsvorsitzender / chairman of the supervisory board: Lukasz Gadowski 
Handelsregister / trade register: Amtsgericht Leipzig, HRB 22478 
Umsatzsteuer-IdentNummer / VAT-ID: DE 8138 7149 4

[Patch net] net_sched: remove cls_flower idr on failure

2017-09-20 Thread Cong Wang

Fixes: c15ab236d69d ("net/sched: Change cls_flower to use IDR")
Cc: Chris Mi 
Cc: Jiri Pirko 
Signed-off-by: Cong Wang 
---
 net/sched/cls_flower.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 1a267e77c6de..d230cb4c8094 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -922,28 +922,28 @@ static int fl_change(struct net *net, struct sk_buff 
*in_skb,
 
if (!tc_flags_valid(fnew->flags)) {
err = -EINVAL;
-   goto errout;
+   goto errout_idr;
}
}
 
err = fl_set_parms(net, tp, fnew, &mask, base, tb, tca[TCA_RATE], ovr);
if (err)
-   goto errout;
+   goto errout_idr;
 
err = fl_check_assign_mask(head, &mask);
if (err)
-   goto errout;
+   goto errout_idr;
 
if (!tc_skip_sw(fnew->flags)) {
if (!fold && fl_lookup(head, &fnew->mkey)) {
err = -EEXIST;
-   goto errout;
+   goto errout_idr;
}
 
err = rhashtable_insert_fast(&head->ht, &fnew->ht_node,
 head->ht_params);
if (err)
-   goto errout;
+   goto errout_idr;
}
 
if (!tc_skip_hw(fnew->flags)) {
@@ -952,7 +952,7 @@ static int fl_change(struct net *net, struct sk_buff 
*in_skb,
   &mask.key,
   fnew);
if (err)
-   goto errout;
+   goto errout_idr;
}
 
if (!tc_in_hw(fnew->flags))
@@ -981,6 +981,9 @@ static int fl_change(struct net *net, struct sk_buff 
*in_skb,
kfree(tb);
return 0;
 
+errout_idr:
+   if (fnew->handle)
+   idr_remove_ext(&head->handle_idr, fnew->handle);
 errout:
tcf_exts_destroy(&fnew->exts);
kfree(fnew);
-- 
2.13.0

[PATCH] ipv6: Use ipv6_authlen for len in ipv6_skip_exthdr

2017-09-20 Thread Xiang Gao

In ipv6_skip_exthdr, the length of AH header is computed manually
as (hp->hdrlen+2)<<2. However, in include/linux/ipv6.h, a macro
named ipv6_authlen is already defined for exactly the same job. This
commit replaces the manual computation code with the macro.

Signed-off-by: Xiang Gao 
---
 net/ipv6/exthdrs_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/exthdrs_core.c b/net/ipv6/exthdrs_core.c
index 305e2ed730bf..115d60919f72 100644
--- a/net/ipv6/exthdrs_core.c
+++ b/net/ipv6/exthdrs_core.c
@@ -99,7 +99,7 @@ int ipv6_skip_exthdr(const struct sk_buff *skb, int start, u8 
*nexthdrp,
break;
hdrlen = 8;
} else if (nexthdr == NEXTHDR_AUTH)
-   hdrlen = (hp->hdrlen+2)<<2;
+   hdrlen = ipv6_authlen(hp);
else
hdrlen = ipv6_optlen(hp);
 
-- 
2.14.1
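
The equivalence this patch relies on can be checked outside the kernel. The following is a minimal userspace sketch, not kernel code: `struct opt_hdr` is a hypothetical stand-in for the first two bytes of an extension header, and `authlen`/`optlen` are local macros written to mirror what the commit message describes for ipv6_authlen and what the unchanged context line uses for ipv6_optlen. The unit sizes in the comments follow the AH and IPv6 header definitions (4-octet units minus 2 for AH, 8-octet units not counting the first 8 octets otherwise).

```c
#include <stdint.h>

/* Hypothetical stand-in for the leading bytes of an IPv6 extension header. */
struct opt_hdr {
	uint8_t nexthdr;
	uint8_t hdrlen;
};

/* AH: hdrlen counts 4-octet units, minus 2. */
#define authlen(p) (((p)->hdrlen + 2) << 2)
/* Other extension headers: hdrlen counts 8-octet units beyond the first 8. */
#define optlen(p)  (((p)->hdrlen + 1) << 3)

/* The open-coded computation that the patch removes. */
static int ah_len_manual(const struct opt_hdr *hp)
{
	return (hp->hdrlen + 2) << 2;
}

/* The macro-based computation that the patch introduces. */
static int ah_len_macro(const struct opt_hdr *hp)
{
	return authlen(hp);
}
```

Both helpers return the same byte count for any hdrlen value, which is why the change is purely cosmetic.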

Re: [PATCH net-next 09/14] gtp: Allow configuring GTP interface as standalone

2017-09-20 Thread Tom Herbert

On Wed, Sep 20, 2017 at 9:07 AM, Andreas Schultz  wrote:
>
>
> On 20/09/17 17:57, Tom Herbert wrote:
>>
>> On Wed, Sep 20, 2017 at 8:27 AM, Andreas Schultz 
>> wrote:
>>>
>>> On 19/09/17 02:38, Tom Herbert wrote:


>>>> Add new configuration of GTP interfaces that allow specifying a port to
>>>> listen on (as opposed to having to get sockets from a userspace control
>>>> plane). This allows GTP interfaces to be configured and the data path
>>>> tested without requiring a GTP-C daemon.
>>>
>>>
>>>
>>> This would imply that you can have multiple independent GTP sockets on
>>> the same IP address. That is not permitted by the GTP specifications.
>>> 3GPP TS 29.281, section 4.3 states clearly that there is "only" one GTP
>>> entity per IP address. A PDP context is defined by the destination IP
>>> and the TEID. The destination port is not part of the identity of a PDP
>>> context.
>>>
>> We are in no way trying to change GTP, if someone runs this in a real GTP
>> network then they need to abide by the specification. However, there
>> is nothing inconsistent and it breaks nothing if someone wishes to use
>> different port numbers in their own private network for testing or
>> development purposes. Every other UDP application that has assigned
>> port number allows configurable ports, I don't see that GTP is so
>> special that it should be an exception.
>
>
> GTP isn't special, I just don't like to have testing only features in there
> when the same goal can be reached without having to add extra stuff. Adding
> code that is not going to be useful in real production setups (or in this
> case would even break production setups when enabled accidentally) makes the
> implementation more complex than it needs to be.

Well, you could make the same argument that allowing GTP to be configured
as a standalone interface is a problem since GTP is only allowed to be
used with GTP-C. But then we have something in the kernel that
the community is expected to support, but requires jumping through a
whole bunch of hoops just to run a simple netperf. The more that
patches and features look like other things in the kernel that are
already well established, the better the chances we can accept them
and support them. It's probably a natural consequence of any large
open source project, so sometimes it's worth the effort to add a few
lines of complexity to get the benefits of community contribution and
support.

Tom

Re: [PATCH v4 4/4] samples/bpf: Add documentation on cross compilation

2017-09-20 Thread Randy Dunlap

On 09/20/17 09:11, Joel Fernandes wrote:
> Acked-by: Alexei Starovoitov 
> Signed-off-by: Joel Fernandes 
> ---
>  samples/bpf/README.rst | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/samples/bpf/README.rst b/samples/bpf/README.rst
> index 79f9a58f1872..2b906127ef54 100644
> --- a/samples/bpf/README.rst
> +++ b/samples/bpf/README.rst
> @@ -64,3 +64,13 @@ It is also possible to point make to the newly compiled 
> 'llc' or
>  'clang' command via redefining LLC or CLANG on the make command line::
>  
>   make samples/bpf/ LLC=~/git/llvm/build/bin/llc 
> CLANG=~/git/llvm/build/bin/clang
> +
> +Cross compiling samples
> +-----------------------
> +Inorder to cross-compile, say for arm64 targets, export CROSS_COMPILE and 
> ARCH

   In order to

> +environment variables before calling make. This will direct make to build
> +samples for the cross target.
> +
> +export ARCH=arm64
> +export CROSS_COMPILE="aarch64-linux-gnu-"
> +make samples/bpf/ LLC=~/git/llvm/build/bin/llc 
> CLANG=~/git/llvm/build/bin/clang
> 


-- 
~Randy

Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-20 Thread Samudrala, Sridhar




On 9/20/2017 7:18 AM, Tom Herbert wrote:

On Tue, Sep 19, 2017 at 10:13 PM, Eric Dumazet  wrote:

On Tue, 2017-09-19 at 21:59 -0700, Samudrala, Sridhar wrote:

On 9/19/2017 5:48 PM, Tom Herbert wrote:

On Tue, Sep 19, 2017 at 5:34 PM, Samudrala, Sridhar
 wrote:

On 9/12/2017 3:53 PM, Tom Herbert wrote:

On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar
 wrote:

On 9/12/2017 8:47 AM, Eric Dumazet wrote:

On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:

On 9/11/2017 8:53 PM, Eric Dumazet wrote:

On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:


Two ints in sock_common for this purpose is quite expensive and the
use case for this is limited-- even if a RX->TX queue mapping were
introduced to eliminate the queue pair assumption this still won't
help if the receive and transmit interfaces are different for the
connection. I think we really need to see some very compelling
results
to be able to justify this.

Will try to collect and post some perf data with symmetric queue
configuration.

Here is some performance data I collected with memcached workload over
ixgbe 10Gb NIC with mcblaster benchmark.
ixgbe is configured with 16 queues and rx-usecs is set to 1000 for a very
low
interrupt rate.
   ethtool -L p1p1 combined 16
   ethtool -C p1p1 rx-usecs 1000
and busy poll is set to 1000usecs
   sysctl net.core.busy_poll = 1000

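16 threads  800K requests/sec
=============================
                  rtt(min/avg/max)usecs  intr/sec  contextswitch/sec
---------------------------------------------------------------------
Default           2/182/10641            23391     61163
Symmetric Queues  2/50/6311              20457     32843

32 threads  800K requests/sec
=============================
                  rtt(min/avg/max)usecs  intr/sec  contextswitch/sec
---------------------------------------------------------------------
Default           2/162/6390             32168     69450
Symmetric Queues  2/50/3853              35044     35847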

No idea what "Default" configuration is. Please report how xps_cpus is
being set, how many RSS queues there are, and what the mapping is
between RSS queues and CPUs and shared caches. Also, whether any
threads are pinned.

Default is linux 4.13 with the settings I listed above.
 ethtool -L p1p1 combined 16
 ethtool -C p1p1 rx-usecs 1000
 sysctl net.core.busy_poll = 1000

# ethtool -x p1p1
RX flow hash indirection table for p1p1 with 16 RX ring(s):
    0:      0     1     2     3     4     5     6     7
    8:      8     9    10    11    12    13    14    15
   16:      0     1     2     3     4     5     6     7
   24:      8     9    10    11    12    13    14    15
   32:      0     1     2     3     4     5     6     7
   40:      8     9    10    11    12    13    14    15
   48:      0     1     2     3     4     5     6     7
   56:      8     9    10    11    12    13    14    15
   64:      0     1     2     3     4     5     6     7
   72:      8     9    10    11    12    13    14    15
   80:      0     1     2     3     4     5     6     7
   88:      8     9    10    11    12    13    14    15
   96:      0     1     2     3     4     5     6     7
  104:      8     9    10    11    12    13    14    15
  112:      0     1     2     3     4     5     6     7
  120:      8     9    10    11    12    13    14    15

smp_affinity for the 16 queuepairs
 141 p1p1-TxRx-0 ,0001
 142 p1p1-TxRx-1 ,0002
 143 p1p1-TxRx-2 ,0004
 144 p1p1-TxRx-3 ,0008
 145 p1p1-TxRx-4 ,0010
 146 p1p1-TxRx-5 ,0020
 147 p1p1-TxRx-6 ,0040
 148 p1p1-TxRx-7 ,0080
 149 p1p1-TxRx-8 ,0100
 150 p1p1-TxRx-9 ,0200
 151 p1p1-TxRx-10 ,0400
 152 p1p1-TxRx-11 ,0800
 153 p1p1-TxRx-12 ,1000
 154 p1p1-TxRx-13 ,2000
 155 p1p1-TxRx-14 ,4000
 156 p1p1-TxRx-15 ,8000
xps_cpus for the 16 Tx queues
 ,0001
 ,0002
 ,0004
 ,0008
 ,0010
 ,0020
 ,0040
 ,0080
 ,0100
 ,0200
 ,0400
 ,0800
 ,1000
 ,2000
 ,4000
 ,8000
memcached threads are not pinned.


...

I urge you to take the time to properly tune this host.

linux kernel does not do automagic configuration. This is user policy.

Documentation/networking/scaling.txt has everything you need.


Yes, tuning a system for optimal performance is difficult. Even if you
find a performance benefit for a configuration on one system, that
might not translate to another. In other words, if you've produced
some code

[PATCH net-next 1/5] net: add support for noref skb->sk

2017-09-20 Thread Paolo Abeni

Noref sk do not carry a socket refcount, are valid
only inside the current RCU section and must be
explicitly cleared before exiting such section.

They will be used in a later patch to allow early demux
without sock refcounting.

Signed-off-by: Paolo Abeni 
---
 include/linux/skbuff.h | 30 ++
 net/core/sock.c|  6 ++
 2 files changed, 36 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 72299ef00061..459a5672811d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -922,6 +922,36 @@ static inline struct rtable *skb_rtable(const struct 
sk_buff *skb)
return (struct rtable *)skb_dst(skb);
 }
 
+void sock_dummyfree(struct sk_buff *skb);
+
+/* only early demux can set noref socks
+ * noref socks do not carry any refcount and must be
+ * cleared before exiting the current RCU section
+ */
+static inline void skb_set_noref_sk(struct sk_buff *skb, struct sock *sk)
+{
+   skb->sk = sk;
+   skb->destructor = sock_dummyfree;
+}
+
+static inline bool skb_has_noref_sk(struct sk_buff *skb)
+{
+   return skb->destructor == sock_dummyfree;
+}
+
+static inline struct sock *skb_clear_noref_sk(struct sk_buff *skb)
+{
+   struct sock *ret;
+
+   if (!skb_has_noref_sk(skb))
+   return NULL;
+
+   ret = skb->sk;
+   skb->sk = NULL;
+   skb->destructor = NULL;
+   return ret;
+}
+
 /* For mangling skb->pkt_type from user space side from applications
  * such as nft, tc, etc, we only allow a conservative subset of
  * possible pkt_types to be set.
diff --git a/net/core/sock.c b/net/core/sock.c
index 9b7b6bbb2a23..3aa4950639bb 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1893,6 +1893,12 @@ void sock_efree(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(sock_efree);
 
+/* dummy destructor used by noref sockets */
+void sock_dummyfree(struct sk_buff *skb)
+{
+}
+EXPORT_SYMBOL(sock_dummyfree);
+
 kuid_t sock_i_uid(struct sock *sk)
 {
kuid_t uid;
-- 
2.13.5

[PATCH net-next 0/5] net: introduce noref sk

2017-09-20 Thread Paolo Abeni

This series introduces the infrastructure to store inside the skb a socket
pointer without carrying a refcount to the socket.

Such infrastructure is then used in the network receive path - and
specifically the early demux operation.

This allows the UDP early demux to perform a full lookup for UDP sockets,
with many benefits:

- the UDP early demux code is now much simpler
- the early demux does not hit any performance penalties in case of UDP hash
  table collision - previously the early demux performed a partial, unsuccessful,
  lookup
- early demux is now operational also for unconnected sockets.

This infrastructure will be used in follow-up series to allow dst caching for
unconnected UDP sockets, and then to extend the same features to TCP listening
sockets.

Paolo Abeni (5):
  net: add support for noref skb->sk
  net: allow early demux to fetch noref socket
  udp: do not touch socket refcount in early demux
  net: add simple socket-like dst cache helpers
  udp: perform full socket lookup in early demux

 include/linux/skbuff.h   | 30 +++
 include/linux/udp.h  |  2 +
 include/net/dst.h| 12 ++
 net/core/dst.c   | 16 
 net/core/sock.c  |  6 +++
 net/ipv4/ip_input.c  | 12 ++
 net/ipv4/ipmr.c  | 18 +++--
 net/ipv4/netfilter/nf_dup_ipv4.c |  3 ++
 net/ipv4/udp.c   | 80 
 net/ipv6/ip6_input.c |  7 +++-
 net/ipv6/netfilter/nf_dup_ipv6.c |  3 ++
 net/ipv6/udp.c   | 67 ++---
 net/netfilter/nf_queue.c |  3 ++
 13 files changed, 159 insertions(+), 100 deletions(-)

-- 
2.13.5

[PATCH net-next 4/5] net: add simple socket-like dst cache helpers

2017-09-20 Thread Paolo Abeni

It will be used by later patches to reduce code duplication.

Signed-off-by: Paolo Abeni 
---
 include/net/dst.h | 12 
 net/core/dst.c| 16 
 2 files changed, 28 insertions(+)

diff --git a/include/net/dst.h b/include/net/dst.h
index 93568bd0a352..a6a39357f19a 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -485,6 +485,18 @@ static inline struct dst_entry *dst_check(struct dst_entry 
*dst, u32 cookie)
return dst;
 }
 
+bool dst_update(struct dst_entry **cache, struct dst_entry *dst);
+static inline struct dst_entry *dst_access(struct dst_entry **cache,
+ u32 cookie)
+{
+   struct dst_entry *dst = READ_ONCE(*cache);
+
+   if (!dst)
+   return NULL;
+
+   return dst_check(dst, cookie);
+}
+
 /* Flags for xfrm_lookup flags argument. */
 enum {
XFRM_LOOKUP_ICMP = 1 << 0,
diff --git a/net/core/dst.c b/net/core/dst.c
index a6c47da7d0f8..6aff0a3e7ba3 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -205,6 +205,22 @@ void dst_release_immediate(struct dst_entry *dst)
 }
 EXPORT_SYMBOL(dst_release_immediate);
 
+/* 'dst' is not refcounted */
+bool dst_update(struct dst_entry **cache, struct dst_entry *dst)
+{
+   if (likely(*cache == dst))
+   return false;
+
+   if (dst_hold_safe(dst)) {
+   struct dst_entry *old = xchg(cache, dst);
+
+   dst_release(old);
+   return old != dst;
+   }
+   return false;
+}
+EXPORT_SYMBOL_GPL(dst_update);
+
 u32 *dst_cow_metrics_generic(struct dst_entry *dst, unsigned long old)
 {
struct dst_metrics *p = kmalloc(sizeof(*p), GFP_ATOMIC);
-- 
2.13.5
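
The update pattern in dst_update() (cheap pointer comparison first, then take a reference only if the count has not already dropped to zero, then atomically swap the cached pointer and drop the old reference) can be illustrated in userspace with C11 atomics. This is a sketch under assumptions, not the kernel implementation: `struct entry`, `entry_hold_safe`, `entry_release` and `cache_update` are hypothetical names, the release helper only decrements a counter instead of freeing anything, and the caller is assumed to pass a non-NULL entry.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for a dst entry. */
struct entry {
	atomic_int refcnt;
};

/* Take a reference unless the count already reached zero. */
static bool entry_hold_safe(struct entry *e)
{
	int old = atomic_load(&e->refcnt);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&e->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; NULL is tolerated so an empty cache needs no special case. */
static void entry_release(struct entry *e)
{
	if (e)
		atomic_fetch_sub(&e->refcnt, 1);
}

/* Returns true only if the cached pointer actually changed. 'e' must not be NULL. */
static bool cache_update(struct entry *_Atomic *cache, struct entry *e)
{
	if (atomic_load(cache) == e)
		return false;

	if (entry_hold_safe(e)) {
		struct entry *old = atomic_exchange(cache, e);

		entry_release(old);
		return old != e;
	}
	return false;
}
```

If two updaters race with the same entry, the second exchange returns that entry as `old`, so the extra reference it took is dropped again and it reports no change.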

Re: [PATCH net-next 08/14] gtp: Support encpasulating over IPv6

2017-09-20 Thread David Miller

From: Tom Herbert 
Date: Wed, 20 Sep 2017 11:03:52 -0700

> On Mon, Sep 18, 2017 at 9:19 PM, David Miller  wrote:
>> From: Tom Herbert 
>> Date: Mon, 18 Sep 2017 17:38:58 -0700
>>
>>> Allow peers to be specified by IPv6 addresses.
>>>
>>> Signed-off-by: Tom Herbert 
>>
>> Hmmm, can you just check the socket family or something like that?
> 
> I'm not sure what code you're referring to.

There is a socket associated with the tunnel to do the encapsulation
and it has an address family, right?

[PATCH] [RESEND][for 4.14] net: qcom/emac: add software control for pause frame mode

2017-09-20 Thread Timur Tabi

The EMAC has the option of sending only a single pause frame when
flow control is enabled and the RX queue is full.  Although sending
only one pause frame has little value, this would allow admins to
enable automatic flow control without having to worry about the EMAC
flooding nearby switches with pause frames if the kernel hangs.

The option is enabled by using the single-pause-mode private flag.

Signed-off-by: Timur Tabi 
---
 drivers/net/ethernet/qualcomm/emac/emac-ethtool.c | 30 +++
 drivers/net/ethernet/qualcomm/emac/emac-mac.c | 22 +
 drivers/net/ethernet/qualcomm/emac/emac.c |  3 +++
 drivers/net/ethernet/qualcomm/emac/emac.h |  3 +++
 4 files changed, 58 insertions(+)

diff --git a/drivers/net/ethernet/qualcomm/emac/emac-ethtool.c 
b/drivers/net/ethernet/qualcomm/emac/emac-ethtool.c
index bbe24639aa5a..c8c6231b87f3 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-ethtool.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac-ethtool.c
@@ -88,6 +88,8 @@ static void emac_set_msglevel(struct net_device *netdev, u32 
data)
 static int emac_get_sset_count(struct net_device *netdev, int sset)
 {
switch (sset) {
+   case ETH_SS_PRIV_FLAGS:
+   return 1;
case ETH_SS_STATS:
return EMAC_STATS_LEN;
default:
@@ -100,6 +102,10 @@ static void emac_get_strings(struct net_device *netdev, 
u32 stringset, u8 *data)
unsigned int i;
 
switch (stringset) {
+   case ETH_SS_PRIV_FLAGS:
+   strcpy(data, "single-pause-mode");
+   break;
+
case ETH_SS_STATS:
for (i = 0; i < EMAC_STATS_LEN; i++) {
strlcpy(data, emac_ethtool_stat_strings[i],
@@ -230,6 +236,27 @@ static int emac_get_regs_len(struct net_device *netdev)
return EMAC_MAX_REG_SIZE * sizeof(u32);
 }
 
+#define EMAC_PRIV_ENABLE_SINGLE_PAUSE  BIT(0)
+
+static int emac_set_priv_flags(struct net_device *netdev, u32 flags)
+{
+   struct emac_adapter *adpt = netdev_priv(netdev);
+
+   adpt->single_pause_mode = !!(flags & EMAC_PRIV_ENABLE_SINGLE_PAUSE);
+
+   if (netif_running(netdev))
+   return emac_reinit_locked(adpt);
+
+   return 0;
+}
+
+static u32 emac_get_priv_flags(struct net_device *netdev)
+{
+   struct emac_adapter *adpt = netdev_priv(netdev);
+
+   return adpt->single_pause_mode ? EMAC_PRIV_ENABLE_SINGLE_PAUSE : 0;
+}
+
 static const struct ethtool_ops emac_ethtool_ops = {
.get_link_ksettings = phy_ethtool_get_link_ksettings,
.set_link_ksettings = phy_ethtool_set_link_ksettings,
@@ -253,6 +280,9 @@ static int emac_get_regs_len(struct net_device *netdev)
 
.get_regs_len= emac_get_regs_len,
.get_regs= emac_get_regs,
+
+   .set_priv_flags = emac_set_priv_flags,
+   .get_priv_flags = emac_get_priv_flags,
 };
 
 void emac_set_ethtool_ops(struct net_device *netdev)
diff --git a/drivers/net/ethernet/qualcomm/emac/emac-mac.c 
b/drivers/net/ethernet/qualcomm/emac/emac-mac.c
index bcd4708b3745..0ea3ca09c689 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-mac.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac-mac.c
@@ -551,6 +551,28 @@ static void emac_mac_start(struct emac_adapter *adpt)
mac &= ~(HUGEN | VLAN_STRIP | TPAUSE | SIMR | HUGE | MULTI_ALL |
 DEBUG_MODE | SINGLE_PAUSE_MODE);
 
+   /* Enable single-pause-frame mode if requested.
+*
+* If enabled, the EMAC will send a single pause frame when the RX
+* queue is full.  This normally leads to packet loss because
+* the pause frame disables the remote MAC only for 33ms (the quanta),
+* and then the remote MAC continues sending packets even though
+* the RX queue is still full.
+*
+* If disabled, the EMAC sends a pause frame every 31ms until the RX
+* queue is no longer full.  Normally, this is the preferred
+* method of operation.  However, when the system is hung (e.g.
+* cores are halted), the EMAC interrupt handler is never called
+* and so the RX queue fills up quickly and stays full.  The resulting
+* non-stop "flood" of pause frames sometimes has the effect of
+* disabling nearby switches.  In some cases, other nearby switches
+* are also affected, shutting down the entire network.
+*
+* The user can enable or disable single-pause-frame mode
+* via ethtool.
+*/
+   mac |= adpt->single_pause_mode ? SINGLE_PAUSE_MODE : 0;
+
writel_relaxed(csr1, adpt->csr + EMAC_EMAC_WRAPPER_CSR1);
 
writel_relaxed(mac, adpt->base + EMAC_MAC_CTRL);
diff --git a/drivers/net/ethernet/qualcomm/emac/emac.c 
b/drivers/net/ethernet/qualcomm/emac/emac.c
index 60850bfa3d32..759543512117 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac.c
@@ -443,6 +443,9 @@ static void
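
For reference, the 33ms figure quoted in the emac-mac.c comment is consistent with a maximum-length pause on a gigabit link. A rough sketch of the arithmetic follows, assuming the pause frame carries the maximum pause_time of 0xFFFF quanta and that the link runs at 1000 Mb/s; neither value is stated in the patch, and `pause_duration_us` is a hypothetical helper, not part of the driver. One pause quantum is 512 bit times.

```c
#include <stdint.h>

/* One pause quantum is 512 bit times on the link. */
#define PAUSE_QUANTUM_BITS 512u

/*
 * Duration, in microseconds, for which a pause frame with the given
 * pause_time silences the remote MAC. Bits divided by Mb/s gives
 * microseconds directly.
 */
static double pause_duration_us(uint16_t quanta, uint32_t link_mbps)
{
	return (double)quanta * PAUSE_QUANTUM_BITS / (double)link_mbps;
}
```

With 0xFFFF quanta at 1000 Mb/s this gives about 33.5 ms, so a single pause frame can only hold off the sender for that long before traffic resumes.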

Re: [PATCH net-next 08/14] gtp: Support encpasulating over IPv6

2017-09-20 Thread Tom Herbert

On Wed, Sep 20, 2017 at 12:45 PM, David Miller  wrote:
> From: Tom Herbert 
> Date: Wed, 20 Sep 2017 11:03:52 -0700
>
>> On Mon, Sep 18, 2017 at 9:19 PM, David Miller  wrote:
>>> From: Tom Herbert 
>>> Date: Mon, 18 Sep 2017 17:38:58 -0700
>>>
>>>> Allow peers to be specified by IPv6 addresses.
>>>>
>>>> Signed-off-by: Tom Herbert 
>>>
>>> Hmmm, can you just check the socket family or something like that?
>>
>> I'm not sure what code you're referring to.
>
> There is a socket associated with the tunnel to do the encapsulation
> and it has an address family, right?

If fd's are set from userspace for the sockets then we could derive
the address family from them. I'll change that. Although, looking at it
now, I am wondering why we're passing fds into GTP instead of just
having the kernel create the UDP port like is done for other encaps.

Tom

[PATCH v3 22/31] sctp: Copy struct sctp_sock.autoclose to userspace using put_user()

2017-09-20 Thread Kees Cook

From: David Windsor 

The autoclose field can be copied with put_user(), so there is no need to
use copy_to_user(). In both cases, hardened usercopy is being bypassed
since the size is constant, and not open to runtime manipulation.

This patch is verbatim from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.

Signed-off-by: David Windsor 
[kees: adjust commit log]
Cc: Vlad Yasevich 
Cc: Neil Horman 
Cc: "David S. Miller" 
Cc: linux-s...@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook 
---
 net/sctp/socket.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index aa4f86d64545..e070c0934638 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -4893,7 +4893,7 @@ static int sctp_getsockopt_autoclose(struct sock *sk, int 
len, char __user *optv
len = sizeof(int);
if (put_user(len, optlen))
return -EFAULT;
-   if (copy_to_user(optval, &sctp_sk(sk)->autoclose, sizeof(int)))
+   if (put_user(sctp_sk(sk)->autoclose, (int __user *)optval))
return -EFAULT;
return 0;
 }
-- 
2.7.4
