date:20151012

Re: [PATCH iproute2 v2] bridge: add batch command support

2015-10-12 Thread Christophe Gouault

2015-10-11 23:03 GMT+02:00 Roopa Prabhu :
> From: Wilson Kok 
>
> This patch adds support to batch bridge commands.
> Follows ip batch code.
>
> Signed-off-by: Wilson Kok 
> Signed-off-by: Roopa Prabhu 
> Acked-by: Christophe Gouault 
> ---
> v2 - change tab to space in usage as pointed out by Christophe Gouault

No more comments on this patch, looks good to me.

Best Regards,
Christophe
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2] mISDN: use kstrdup() in dsp_pipeline_build

2015-10-12 Thread Geliang Tang

Use kstrdup instead of strlen-kmalloc-strcpy. Remove unneeded NULL
test, it will be tested inside kstrdup. Remove 0 length string test,
it has been tested in the caller of dsp_pipeline_build.

Signed-off-by: Geliang Tang 
---
Changes in v2:
  - Remove unneeded NULL test.
---
 drivers/isdn/mISDN/dsp_pipeline.c | 12 ++--
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/drivers/isdn/mISDN/dsp_pipeline.c b/drivers/isdn/mISDN/dsp_pipeline.c
index 8b1a66c..e72b4e7 100644
--- a/drivers/isdn/mISDN/dsp_pipeline.c
+++ b/drivers/isdn/mISDN/dsp_pipeline.c
@@ -235,7 +235,7 @@ void dsp_pipeline_destroy(struct dsp_pipeline *pipeline)
 
 int dsp_pipeline_build(struct dsp_pipeline *pipeline, const char *cfg)
 {
-   int len, incomplete = 0, found = 0;
+   int incomplete = 0, found = 0;
char *dup, *tok, *name, *args;
struct dsp_element_entry *entry, *n;
struct dsp_pipeline_entry *pipeline_entry;
@@ -247,17 +247,9 @@ int dsp_pipeline_build(struct dsp_pipeline *pipeline, const char *cfg)
if (!list_empty(&pipeline->list))
_dsp_pipeline_destroy(pipeline);
 
-   if (!cfg)
-   return 0;
-
-   len = strlen(cfg);
-   if (!len)
-   return 0;
-
-   dup = kmalloc(len + 1, GFP_ATOMIC);
+   dup = kstrdup(cfg, GFP_ATOMIC);
if (!dup)
return 0;
-   strcpy(dup, cfg);
while ((tok = strsep(&dup, "|"))) {
if (!strlen(tok))
continue;
-- 
1.9.1
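
Note on the patch above: the following is a minimal userspace sketch of the same simplification, not the driver code. It assumes glibc's strdup() and strsep() as stand-ins for kstrdup() and the kernel's strsep(); the function names dup_by_hand, dup_in_one_call and count_elements are made up for illustration. One difference is worth keeping in mind: kstrdup() returns NULL for a NULL argument, while strdup(NULL) is undefined, so the sketch keeps an explicit NULL test.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* exposes strdup() and strsep() on glibc */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Old pattern: measure, allocate, copy (strlen-malloc-strcpy). */
char *dup_by_hand(const char *s)
{
	size_t len;
	char *copy;

	if (!s)
		return NULL;
	len = strlen(s);
	copy = malloc(len + 1);
	if (!copy)
		return NULL;
	strcpy(copy, s);
	return copy;
}

/* New pattern: a single duplication call. The NULL test stays here
 * only because userspace strdup() does not accept NULL.
 */
char *dup_in_one_call(const char *s)
{
	return s ? strdup(s) : NULL;
}

/* Count the non-empty '|'-separated tokens, looping the way the
 * patched function does. Returns 0 for NULL, empty or failed input.
 */
int count_elements(const char *cfg)
{
	char *dup, *cursor, *tok;
	int n = 0;

	dup = dup_in_one_call(cfg);
	if (!dup)
		return 0;

	cursor = dup;		/* strsep() advances cursor; dup is kept for free() */
	while ((tok = strsep(&cursor, "|"))) {
		if (!strlen(tok))
			continue;
		n++;
	}
	assert(cursor == NULL);	/* strsep() leaves NULL once input is exhausted */

	free(dup);
	return n;
}
```

An empty string needs no special case in this sketch: strsep() yields one empty token, which the loop skips.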



Re: [net-next PATCH] driver: net: cpsw: add no_bd_ram dt parsing

2015-10-12 Thread Mugunthan V N

On Friday 09 October 2015 03:36 PM, Mugunthan V N wrote:
> cpdma is capable of placing the dma descriptors in ddr using
> dma_alloc_coherent() when the internal bd ram size is not enough.
> To utilize this feature pass the DT parameter "no_bd_ram" and
> increase bd_ram_size and number of rx descriptors.
> 
> Signed-off-by: Mugunthan V N 

Dave

Please drop this patch as it is not working on AM437x platform. Will
send a v2 after fixing it.

Regards
Mugunthan V N


[PATCH 2/5] be2net: release mcc-lock in a failure case in be_cmd_notify_wait()

2015-10-12 Thread Sathya Perla

From: Suresh Reddy 

The mcc/mbox lock is not being released when be_cmd_copy() returns
an error.

Signed-off-by: Suresh Reddy 
Signed-off-by: Sathya Perla 
---
 drivers/net/ethernet/emulex/benet/be_cmds.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_cmds.c b/drivers/net/ethernet/emulex/benet/be_cmds.c
index eb32391..9dc5ce1 100644
--- a/drivers/net/ethernet/emulex/benet/be_cmds.c
+++ b/drivers/net/ethernet/emulex/benet/be_cmds.c
@@ -851,8 +851,10 @@ static int be_cmd_notify_wait(struct be_adapter *adapter,
return status;
 
dest_wrb = be_cmd_copy(adapter, wrb);
-   if (!dest_wrb)
-   return -EBUSY;
+   if (!dest_wrb) {
+   status = -EBUSY;
+   goto unlock;
+   }
 
if (use_mcc(adapter))
status = be_mcc_notify_wait(adapter);
@@ -862,6 +864,7 @@ static int be_cmd_notify_wait(struct be_adapter *adapter,
if (!status)
memcpy(wrb, dest_wrb, sizeof(*wrb));
 
+unlock:
be_cmd_unlock(adapter);
return status;
 }
-- 
2.4.1
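
Note on the patch above: a minimal userspace sketch of the single-exit unlock pattern the fix applies. Everything here is a stand-in chosen for illustration (struct fake_adapter, fake_lock, fake_unlock, fake_copy, notify_wait); the lock is modelled as a plain depth counter so a leaked lock is easy to observe, and the real command submission is replaced by a constant.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the adapter and its mcc/mbox lock. */
struct fake_adapter {
	int lock_depth;		/* 0 means unlocked */
	bool copy_fails;	/* force the copy step to fail */
};

int fake_lock(struct fake_adapter *a)
{
	a->lock_depth++;
	return 0;
}

void fake_unlock(struct fake_adapter *a)
{
	assert(a->lock_depth > 0);	/* unlocking an unheld lock is a bug */
	a->lock_depth--;
}

/* Stand-in for the copy step: returns NULL on failure. */
int *fake_copy(struct fake_adapter *a, int *wrb)
{
	return a->copy_fails ? NULL : wrb;
}

int notify_wait(struct fake_adapter *a, int *wrb)
{
	int status;
	int *dest;

	status = fake_lock(a);
	if (status)
		return status;	/* lock not taken, nothing to release */

	dest = fake_copy(a, wrb);
	if (!dest) {
		status = -EBUSY;
		goto unlock;	/* a bare return here would leak the lock */
	}

	status = 0;		/* the real driver issues the command here */

unlock:
	fake_unlock(a);
	return status;
}
```

Routing every post-lock exit through one label keeps the lock balanced even when later error paths are added.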


[PATCH 4/5] be2net: set pci_func_num while issuing GET_PROFILE_CONFIG cmd

2015-10-12 Thread Sathya Perla

From: Somnath Kotur 

The FW requires the pf_num field in the cmd hdr to be set for it to return
the specific function's descriptors in the GET_PROFILE_CONFIG cmd. If not
set, the FW returns the descriptors of all the functions on the device.
If the first descriptor is not what is being queried for, the driver will
read wrong data. This patch fixes this issue by using the GET_CNTL_ATTRIB
cmd to query the real pci_func_num of a function and then uses it in the
GET_PROFILE_CONFIG cmd.

Signed-off-by: Somnath Kotur 
Signed-off-by: Sathya Perla 
---
 drivers/net/ethernet/emulex/benet/be.h  |  1 +
 drivers/net/ethernet/emulex/benet/be_cmds.c | 14 +++---
 drivers/net/ethernet/emulex/benet/be_cmds.h | 10 --
 drivers/net/ethernet/emulex/benet/be_main.c |  9 +
 4 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be.h b/drivers/net/ethernet/emulex/benet/be.h
index 8215409..d463563 100644
--- a/drivers/net/ethernet/emulex/benet/be.h
+++ b/drivers/net/ethernet/emulex/benet/be.h
@@ -592,6 +592,7 @@ struct be_adapter {
int be_get_temp_freq;
struct be_hwmon hwmon_info;
u8 pf_number;
+   u8 pci_func_num;
struct rss_info rss_info;
/* Filters for packets that need to be sent to BMC */
u32 bmc_filt_mask;
diff --git a/drivers/net/ethernet/emulex/benet/be_cmds.c b/drivers/net/ethernet/emulex/benet/be_cmds.c
index 9dc5ce1..790284d 100644
--- a/drivers/net/ethernet/emulex/benet/be_cmds.c
+++ b/drivers/net/ethernet/emulex/benet/be_cmds.c
@@ -2890,6 +2890,7 @@ int be_cmd_get_cntl_attributes(struct be_adapter *adapter)
if (!status) {
attribs = attribs_cmd.va + sizeof(struct be_cmd_resp_hdr);
adapter->hba_port_num = attribs->hba_attribs.phy_port;
+   adapter->pci_func_num = attribs->pci_func_num;
serial_num = attribs->hba_attribs.controller_serial_number;
for (i = 0; i < CNTL_SERIAL_NUM_WORDS; i++)
adapter->serial_num[i] = le32_to_cpu(serial_num[i]) &
@@ -3712,7 +3713,6 @@ int be_cmd_get_func_config(struct be_adapter *adapter, struct be_resources *res)
status = -EINVAL;
goto err;
}
-
adapter->pf_number = desc->pf_num;
be_copy_nic_desc(res, desc);
}
@@ -3724,7 +3724,10 @@ err:
return status;
 }
 
-/* Will use MBOX only if MCCQ has not been created */
+/* Will use MBOX only if MCCQ has not been created
+ * non-zero domain => a PF is querying this on behalf of a VF
+ * zero domain => a PF or a VF is querying this for itself
+ */
 int be_cmd_get_profile_config(struct be_adapter *adapter,
  struct be_resources *res, u8 query, u8 domain)
 {
@@ -3751,10 +3754,15 @@ int be_cmd_get_profile_config(struct be_adapter *adapter,
   OPCODE_COMMON_GET_PROFILE_CONFIG,
   cmd.size, &wrb, &cmd);
 
-   req->hdr.domain = domain;
if (!lancer_chip(adapter))
req->hdr.version = 1;
req->type = ACTIVE_PROFILE_TYPE;
+   /* When a function is querying profile information relating to
+* itself hdr.pf_number must be set to it's pci_func_num + 1
+*/
+   req->hdr.domain = domain;
+   if (domain == 0)
+   req->hdr.pf_num = adapter->pci_func_num + 1;
 
/* When QUERY_MODIFIABLE_FIELDS_TYPE bit is set, cmd returns the
 * descriptors with all bits set to "1" for the fields which can be
diff --git a/drivers/net/ethernet/emulex/benet/be_cmds.h b/drivers/net/ethernet/emulex/benet/be_cmds.h
index 7d178bd..91155ea 100644
--- a/drivers/net/ethernet/emulex/benet/be_cmds.h
+++ b/drivers/net/ethernet/emulex/benet/be_cmds.h
@@ -289,7 +289,9 @@ struct be_cmd_req_hdr {
u32 timeout;/* dword 1 */
u32 request_length; /* dword 2 */
u8 version; /* dword 3 */
-   u8 rsvd[3]; /* dword 3 */
+   u8 rsvd1;   /* dword 3 */
+   u8 pf_num;  /* dword 3 */
+   u8 rsvd2;   /* dword 3 */
 };
 
 #define RESP_HDR_INFO_OPCODE_SHIFT 0   /* bits 0 - 7 */
@@ -1652,7 +1654,11 @@ struct mgmt_hba_attribs {
 
 struct mgmt_controller_attrib {
struct mgmt_hba_attribs hba_attribs;
-   u32 rsvd0[10];
+   u32 rsvd0[2];
+   u16 rsvd1;
+   u8 pci_func_num;
+   u8 rsvd2;
+   u32 rsvd3[7];
 } __packed;
 
 struct be_cmd_req_cntl_attribs {
diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index 821e014..eb48a97 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -4206,10 +4206,6 @@ static int be_get_config(struct be_adapter *adapter)
int status, level;

[PATCH 3/5] be2net: pad skb to meet minimum TX pkt size in BE3

2015-10-12 Thread Sathya Perla

From: Suresh Reddy 

On BE3 chips in SRIOV configs, the TX path stalls when a packet of 32 bytes
or less is received from the host. A workaround to pad such packets
already exists for the Skyhawk and Lancer chips. Use the same workaround
for BE3 chips too.

Signed-off-by: Suresh Reddy 
Signed-off-by: Sathya Perla 
---
 drivers/net/ethernet/emulex/benet/be_main.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index 86eed47..821e014 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -1123,11 +1123,12 @@ static struct sk_buff *be_xmit_workarounds(struct be_adapter *adapter,
   struct sk_buff *skb,
   struct be_wrb_params *wrb_params)
 {
-   /* Lancer, SH-R ASICs have a bug wherein Packets that are 32 bytes or
-* less may cause a transmit stall on that port. So the work-around is
-* to pad short packets (<= 32 bytes) to a 36-byte length.
+   /* Lancer, SH and BE3 in SRIOV mode have a bug wherein
+* packets that are 32b or less may cause a transmit stall
+* on that port. The workaround is to pad such packets
+* (len <= 32 bytes) to a minimum length of 36b.
 */
-   if (unlikely(!BEx_chip(adapter) && skb->len <= 32)) {
+   if (skb->len <= 32) {
if (skb_put_padto(skb, 36))
return NULL;
}
-- 
2.4.1
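
Note on the patch above: a minimal userspace sketch of the padding rule (lengths of 32 bytes or less are padded to 36). It works on a plain byte buffer rather than an skb; pad_short_packet and both constants are names invented for this sketch, and the zero return on insufficient capacity only loosely mirrors the failure path of the padding helper used in the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SHORT_PKT_MAX	32	/* lengths up to this trigger the workaround */
#define PADDED_LEN	36	/* minimum length after padding */

/* Zero-pad a short packet in place.
 * Returns the resulting length, or 0 if the buffer cannot hold the padding.
 */
size_t pad_short_packet(unsigned char *buf, size_t len, size_t cap)
{
	assert(buf != NULL && len <= cap);

	if (len > SHORT_PKT_MAX)
		return len;		/* long enough, leave untouched */
	if (cap < PADDED_LEN)
		return 0;		/* no room for the padding */

	memset(buf + len, 0, PADDED_LEN - len);
	return PADDED_LEN;
}
```

A packet of exactly 32 bytes is still padded, matching the "<= 32" comparison in the diff.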


[PATCH 5/5] be2net: remove vlan promisc capability from VF's profile descriptors

2015-10-12 Thread Sathya Perla

From: Kalesh AP 

The commit 435452aa8847 ("Prevent VFs from enabling VLAN promiscuous mode")
fixed the PF driver to not include the VLAN promisc capability while
provisioning the interface for a VF. But the fix did not remove this
capability from the profile descriptor of the VF. This causes the VF
driver to request this capability when it tries to create its interface
at probe time. This could potentially cause the VF probe to fail if the
FW enforces strict checking of the flags based on what was provisioned
by the PF. This strict checking is not being done by FW currently but
will be fixed in a future version. This patch fixes this issue by updating
the VF's profile descriptor so that it matches the interface capability
flags provisioned by the PF.

Fixes: 435452aa8847 ("Prevent VFs from enabling VLAN promiscuous mode")
Signed-off-by: Kalesh AP 
Signed-off-by: Sathya Perla 
---
 drivers/net/ethernet/emulex/benet/be_cmds.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_cmds.c b/drivers/net/ethernet/emulex/benet/be_cmds.c
index 790284d..1795c93 100644
--- a/drivers/net/ethernet/emulex/benet/be_cmds.c
+++ b/drivers/net/ethernet/emulex/benet/be_cmds.c
@@ -1987,6 +1987,8 @@ int be_cmd_rx_filter(struct be_adapter *adapter, u32 flags, u32 value)
 be_if_cap_flags(adapter));
}
flags &= be_if_cap_flags(adapter);
+   if (!flags)
+   return -ENOTSUPP;
 
return __be_cmd_rx_filter(adapter, flags, value);
 }
@@ -3932,12 +3934,16 @@ static void be_fill_vf_res_template(struct be_adapter *adapter,
vf_if_cap_flags &= ~(BE_IF_FLAGS_RSS |
 BE_IF_FLAGS_DEFQ_RSS);
}
-
-   nic_vft->cap_flags = cpu_to_le32(vf_if_cap_flags);
} else {
num_vf_qs = 1;
}
 
+   if (res_mod.vf_if_cap_flags & BE_IF_FLAGS_VLAN_PROMISCUOUS) {
+   nic_vft->flags |= BIT(IF_CAPS_FLAGS_VALID_SHIFT);
+   vf_if_cap_flags &= ~BE_IF_FLAGS_VLAN_PROMISCUOUS;
+   }
+
+   nic_vft->cap_flags = cpu_to_le32(vf_if_cap_flags);
nic_vft->rq_count = cpu_to_le16(num_vf_qs);
nic_vft->txq_count = cpu_to_le16(num_vf_qs);
nic_vft->rssq_count = cpu_to_le16(num_vf_qs);
-- 
2.4.1
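
Note on the patch above: a minimal userspace sketch of the two flag manipulations in the diff. The bit values, struct vf_template and both function names are illustrative assumptions, not the driver's definitions; the sketch only shows clearing one capability bit from a template and rejecting a filter request that has no supported bits left.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit values, not the driver's definitions. */
#define IF_FLAG_UNTAGGED	(1u << 0)
#define IF_FLAG_VLAN_PROMISC	(1u << 1)

struct vf_template {
	uint32_t cap_flags;
	int caps_valid;		/* asks the firmware to apply cap_flags */
};

/* Fill the VF template, dropping VLAN promisc when that field is
 * reported as modifiable.
 */
void fill_vf_template(struct vf_template *t, uint32_t vf_caps,
		      uint32_t modifiable)
{
	assert(t != NULL);

	t->caps_valid = 0;
	if (modifiable & IF_FLAG_VLAN_PROMISC) {
		t->caps_valid = 1;
		vf_caps &= ~IF_FLAG_VLAN_PROMISC;
	}
	t->cap_flags = vf_caps;
}

/* Mask a filter request by the interface capabilities.
 * Returns 0 if something supported remains, -1 otherwise.
 */
int filter_request(uint32_t requested, uint32_t caps)
{
	requested &= caps;
	return requested ? 0 : -1;
}
```

With the capability cleared from the template, a later request for VLAN promisc alone masks down to zero and is refused up front.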


[PATCH 0/5] be2net: patch set

2015-10-12 Thread Sathya Perla

Patch 1 fixes a FW image compatibility check in the driver that
prevents certain FW images from being flashed on BE3 (not BE3-R)
adapters.

Patch 2 fixes a spin_lock not being released in a failure case in
be_cmd_notify_wait().

Patch 3 extends to BE3 the workaround that pads packets of 32 bytes or less.
Until now this workaround was applied only to Skyhawk and Lancer chips.
Such packets cause BE3's TX path to stall in an SR-IOV config.

Patch 4 fixes the be_cmd_get_profile_config() routine to set the pf_num
field in the cmd request. The FW requires this field to be set for it to
return the specific function's descriptors. If not set, the FW returns
the descriptors of all the functions on the device. If the first descriptor
is not what is being queried for, the driver will read wrong data.
This patch fixes this issue by using the GET_CNTL_ATTRIB cmd to query the
real pci_func_num of a function and then uses it in the GET_PROFILE_CONFIG
cmd.

Patch 5 completes an earlier fix that removed the vlan promisc capability
for VFs. The earlier fix did not remove this capability from
the profile descriptor of the VF. This causes the VF driver to request this
capability when it tries to create its interface at probe time. This could
potentially cause the VF probe to fail if the FW enforces strict checking of
the flags based on what was provisioned by the PF. This strict checking is
not being done by FW currently but will be fixed in a future version. This
patch fixes this issue by updating the VF's profile descriptor so that it
matches the interface capability flags provisioned by the PF.

Pls consider adding these patches to the net tree. Thanks!

Kalesh AP (2):
  be2net: fix BE3-R FW download compatibility check
  be2net: remove vlan promisc capability from VF's profile descriptors

Somnath Kotur (1):
  be2net: set pci_func_num while issuing GET_PROFILE_CONFIG cmd

Suresh Reddy (2):
  be2net: release mcc-lock in a failure case in be_cmd_notify_wait()
  be2net: pad skb to meet minimum TX pkt size in BE3

 drivers/net/ethernet/emulex/benet/be.h  |  1 +
 drivers/net/ethernet/emulex/benet/be_cmds.c | 31 ++---
 drivers/net/ethernet/emulex/benet/be_cmds.h | 10 --
 drivers/net/ethernet/emulex/benet/be_main.c | 28 +-
 4 files changed, 52 insertions(+), 18 deletions(-)

-- 
2.4.1
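
Note on patch 4 of this series: a minimal userspace sketch of the header rule it describes (a zero domain means a function is querying for itself, and pf_num is then set to pci_func_num + 1). struct req_hdr and fill_profile_query are names invented for this sketch, and clearing pf_num in the non-zero-domain case is an assumption made here to keep the example deterministic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced, hypothetical view of the command header fields involved. */
struct req_hdr {
	uint8_t domain;
	uint8_t pf_num;
};

/* domain == 0: a PF or a VF is querying for itself.
 * domain != 0: a PF is querying on behalf of a VF.
 */
void fill_profile_query(struct req_hdr *hdr, uint8_t domain,
			uint8_t pci_func_num)
{
	assert(hdr != NULL);

	hdr->domain = domain;
	hdr->pf_num = 0;	/* assumption: left clear when not self-querying */
	if (domain == 0)
		hdr->pf_num = pci_func_num + 1;
}
```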


[PATCH 1/5] be2net: fix BE3-R FW download compatibility check

2015-10-12 Thread Sathya Perla

From: Kalesh AP 

In the BE3 FW image, unlike Skyhawk's, the "asic_type_rev" field doesn't
track the asic_rev of chip it is compatible with. When asic_type_rev
is 0 the image is compatible only with pre-BE3-R chips (asic_rev < 0x10).
Fix the current compatibility check to take care of this.
We hit this issue when we try to flash old BE3 images (used prior to the
release of BE3-R) on pre-BE3-R adapters.

Fixes: a6e6ff6eee12f3e ("be2net: simplify UFI compatibility checking")
Signed-off-by: Kalesh AP 
Signed-off-by: Sathya Perla 
---
 drivers/net/ethernet/emulex/benet/be_main.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index 7bf51a1..86eed47 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -4999,7 +4999,15 @@ static bool be_check_ufi_compatibility(struct be_adapter *adapter,
return false;
}
 
-   return (fhdr->asic_type_rev >= adapter->asic_rev);
+   /* In BE3 FW images the "asic_type_rev" field doesn't track the
+* asic_rev of the chips it is compatible with.
+* When asic_type_rev is 0 the image is compatible only with
+* pre-BE3-R chips (asic_rev < 0x10)
+*/
+   if (BEx_chip(adapter) && fhdr->asic_type_rev == 0)
+   return adapter->asic_rev < 0x10;
+   else
+   return (fhdr->asic_type_rev >= adapter->asic_rev);
 }
 
 static int be_fw_download(struct be_adapter *adapter, const struct firmware* fw)
-- 
2.4.1
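
Note on the patch above: a minimal userspace sketch comparing the old and new compatibility checks. The function names and the sample revision value 0x01 are assumptions made for illustration; only the 0x10 threshold and the two comparisons come from the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BE3R_MIN_ASIC_REV 0x10	/* per the patch: pre-BE3-R chips are below this */

/* Check as it was before the patch. */
bool ufi_compatible_old(uint8_t image_type_rev, uint8_t chip_rev)
{
	return image_type_rev >= chip_rev;
}

/* Check after the patch: a BEx image with asic_type_rev 0 is
 * accepted only on pre-BE3-R chips.
 */
bool ufi_compatible_new(bool is_bex_chip, uint8_t image_type_rev,
			uint8_t chip_rev)
{
	if (is_bex_chip && image_type_rev == 0)
		return chip_rev < BE3R_MIN_ASIC_REV;
	return image_type_rev >= chip_rev;
}

/* Old BE3 image (type_rev 0) on a pre-BE3-R chip with a non-zero
 * revision (0x01 is an arbitrary sample): the old check refuses it,
 * the new one accepts it.
 */
void check_examples(void)
{
	assert(!ufi_compatible_old(0, 0x01));
	assert(ufi_compatible_new(true, 0, 0x01));
}
```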


Re: [4.1.3-rt8] [report][cpuhotplug] BUG: spinlock bad magic on CPU#0, sh/137

2015-10-12 Thread Thomas Gleixner

On Fri, 9 Oct 2015, Grygorii Strashko wrote:
> I can constantly see below error report with 4.1 RT-kernel on TI ARM dra7-evm 
> if I'm trying to unplug cpu1:
> 
> [   57.737589] CPU1: shutdown
> [   57.767537] BUG: spinlock bad magic on CPU#0, sh/137
> [   57.767546]  lock: 0xee994730, .magic: , .owner: /-1, 
> .owner_cpu: 0

> It looks like this backtrace was introduced by 
> 
> commit 91df05da13a6c6c358e71182e80f19f3c48d1615
> net: Use skbufhead with raw lock
>
> I see the potential fix for this issue as below: 
> 
> index 4969c0d..f8c23de 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -7217,7 +7217,7 @@ static int dev_cpu_callback(struct notifier_block *nfb,
> netif_rx_ni(skb);
> input_queue_head_incr(oldsd);
> }
> -   while ((skb = skb_dequeue(&oldsd->input_pkt_queue))) {
> +   while ((skb = __skb_dequeue(&oldsd->input_pkt_queue))) {

Your patch is white space damaged 

> netif_rx_ni(skb);
> input_queue_head_incr(oldsd);
> }
> 
> input_pkt_queue is per-cpu queue and at this moment cpu is dead already,
> so no one should touch it. But I'm not sure if my assumption is correct.

It is. Picking it up for the next release

Thanks,

tglx

RE: e1000e: hard system lockup on Linux 4.2

2015-10-12 Thread Avargil, Raanan

Hi Jason,
Your analysis is correct. 
The issue was initially reported by Valdis Kletnieks (valdis.kletni...@vt.edu)
http://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20150615/000992.html

Commit 37b12910dd11d9ab969f2c310dc9160b7f3e3405 fixes the lockup issue, and 
according to my last check it is in 4.3-rc5. 

Thanks,
Raanan

-Original Message-
From: Jason A. Donenfeld [mailto:ja...@zx2c4.com] 
Sent: Friday, October 09, 2015 22:22
To: Kirsher, Jeffrey T; Avargil, Raanan; yanirx.lubet...@intel.com; Greg Kroah-Hartman
Cc: intel-wired-...@lists.osuosl.org; Netdev; linux-ker...@vger.kernel.org
Subject: e1000e: hard system lockup on Linux 4.2

Hi Jeffrey & Raanan & Yanirx,

I have a Thinkpad W530 with a 82579LM inside of it, which uses the e1000e 
driver. Every few hours, my system does a hard lockup, and I am unable to do 
anything at all with it except power it off. There isn't a panic or oops, as 
nothing is written to /sys/fs/pstore after. But, I did enable the lockup 
detection debug option, and was able to gain a few stack traces. And all of the 
time, the culprit is the e1000e driver.

The funny thing is that I'm not actually using the Ethernet card -- nothing is 
plugged into the jack, as I'm mostly on wifi these days.
Nevertheless, it receives power from my laptop and thus the driver is partaking 
in some form of communication with it.

The stack traces follow below. You'll notice that some time after the initial 
e1000e lockup, there's a soft lockup in the bpf paths. I believe this is 
due to the fact that at the time of the hard lockup in e1000e, I was trying to 
open a new Chrome tab process, which makes use of seccomp-bpf. I do not believe 
the bug is in the bpf code, however.

Briefly looking at the stack trace myself shows quite a bit of activity in 
`e1000e_cyclecounter_read`. Running `git log drivers/net/ethernet/intel/e1000e` 
indicates a recent change from Raanan -- 
37b12910dd11d9ab969f2c310dc9160b7f3e3405 -- "e1000e: Fix tight loop 
implementation of systime read algorithm". In this change, a loop is entirely 
removed.

Investigating the origin of that loop reveals this commit from Yanirx
-- 83129b37ef35bb6a7f01c060129736a8db5d31c4. This commit appears to be present 
in 4.2, but not in 4.1. This leads me to think it may be the cause of the bug, 
with the aforementioned
37b12910dd11d9ab969f2c310dc9160b7f3e3405 being the fix for it.

I would therefore recommend that -- if this analysis is correct --
37b12910dd11d9ab969f2c310dc9160b7f3e3405 be backported to the 4.2 stable 
releases (thus, CCing Greg).

If my very brief and preliminary investigation is not correct, please let me 
know if there is any additional information or debugging steps I can apply, so 
that we can fix this regression.

In the meantime while I wait to hear back, I'll try backporting that commit to 
4.2 myself, and seeing the stability of my laptop over the next 24 hours.

Thanks,
Jason

=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~

[14469.274866] [ cut here ]
[14469.274874] WARNING: CPU: 1 PID: 12032 at kernel/watchdog.c:311 watchdog_overflow_callback+0x74/0xa0()
[14469.274875] Watchdog detected hard LOCKUP on cpu 1
[14469.274877] Modules linked in:
[14469.274878]  bnep hid_generic usbhid cdc_acm af_packet pl2303 usbserial btusb btbcm btintel bluetooth uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core v4l2_common videodev ip6table_filter iptable_filter ip6_tables ip_tables x_tables mmc_block snd_hda_codec_realtek snd_hda_codec_generic iwldvm coretemp x86_pkg_temp_thermal intel_powerclamp mac80211 kvm_intel snd_hda_intel sdhci_pci snd_hda_codec kvm iwlwifi snd_hwdep snd_hda_core joydev xhci_pci ehci_pci sdhci xhci_hcd ehci_hcd snd_pcm mousedev microcode cfg80211 mmc_core usbcore snd_timer usb_common thinkpad_acpi thermal snd soundcore hwmon rfkill ac battery evdev processor ipv6
[14469.274912] CPU: 1 PID: 12032 Comm: kworker/1:0 Not tainted 4.2.3-gentoo #3
[14469.274913] Hardware name: LENOVO 2436CTO/2436CTO, BIOS G5ETA2WW (2.62 ) 03/31/2015
[14469.274918] Workqueue: events e1000e_systim_overflow_work
[14469.274920]   81a0eb69 81683f33 88081e245ba0
[14469.274922]  810ce4b8 8807fa6f 88081e245c80
[14469.274924]  88081e245ef8  810ce535 81a0a3f8
[14469.274926] Call Trace:
[14469.274928]  [] ? dump_stack+0x40/0x50
[14469.274935]  [] ? warn_slowpath_common+0x78/0xb0
[14469.274937]  [] ? warn_slowpath_fmt+0x45/0x50
[14469.274939]  [] ? watchdog_overflow_callback+0x74/0xa0
[14469.274941]  [] ? __perf_event_overflow+0x86/0x1c0
[14469.274944]  [] ? intel_pmu_handle_irq+0x1c9/0x3f0
[14469.274948]  [] ? perf_event_nmi_handler+0x25/0x40
[14469.274951]  [] ? nmi_handle+0x7c/0x100
[14469.274952]  [] ? do_nmi+0x1dd/0x360
[14469.274956]  []

[PATCH v2 2/2] net/fsl_pq_mdio: fix computed address for the TBI register

2015-10-12 Thread Gerlando Falauto

commit afae5ad78b342f401c28b0bb1adb3cd494cb125a
  "net/fsl_pq_mdio: streamline probing of MDIO nodes"

added support for different types of MDIO devices:
1) Gianfar MDIO nodes that only map the MII registers
2) Gianfar MDIO nodes that map the full MDIO register set
3) eTSEC2 MDIO nodes (which map the full MDIO register set)
4) QE MDIO nodes (which map only the MII registers)

However, the implementation for types 1 and 4 would mistakenly assume
a mapping of the full MDIO register set, thereby computing the address
for the TBI register starting from the containing structure.
The TBI register would therefore be accessed at a wrong (much bigger)
address, not giving the expected result at all.
This patch restores the correct behavior we had prior to the above one.

The consequences of this bug are apparent when trying to access a PHY
with the same address as the value contained in the initial value of
the TBI register (normally 0); in that case you'll get answers from the
internal TBI device (even though MDIO/MDC pins are actually *also*
toggling on the physical bus!).
Beware that you also need to add a fake tbi node to your device tree
with an unused address.

Notice how this fix is related to commit
220669495bf8b68130a8218607147c7b74c28d2b
  "powerpc: Add TBI PHY node to first MDIO bus"

which fixed the behavior in kernel 3.3, which was later broken by the
above commit on kernel 3.7.

Signed-off-by: Gerlando Falauto 
Cc: Timur Tabi 
Cc: David S. Miller 
Cc: Kumar Gala 
---
 drivers/net/ethernet/freescale/fsl_pq_mdio.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fsl_pq_mdio.c b/drivers/net/ethernet/freescale/fsl_pq_mdio.c
index 5333d0a..55c3623 100644
--- a/drivers/net/ethernet/freescale/fsl_pq_mdio.c
+++ b/drivers/net/ethernet/freescale/fsl_pq_mdio.c
@@ -198,11 +198,13 @@ static int fsl_pq_mdio_reset(struct mii_bus *bus)
 
 #if defined(CONFIG_GIANFAR) || defined(CONFIG_GIANFAR_MODULE)
 /*
+ * Return the TBIPA address, starting from the address
+ * of the mapped GFAR MDIO registers (struct gfar)
  * This is mildly evil, but so is our hardware for doing this.
  * Also, we have to cast back to struct gfar because of
  * definition weirdness done in gianfar.h.
  */
-static uint32_t __iomem *get_gfar_tbipa(void __iomem *p)
+static uint32_t __iomem *get_gfar_tbipa_from_mdio(void __iomem *p)
 {
struct gfar __iomem *enet_regs = p;
 
@@ -210,6 +212,15 @@ static uint32_t __iomem *get_gfar_tbipa(void __iomem *p)
 }
 
 /*
+ * Return the TBIPA address, starting from the address
+ * of the mapped GFAR MII registers (gfar_mii_regs[] within struct gfar)
+ */
+static uint32_t __iomem *get_gfar_tbipa_from_mii(void __iomem *p)
+{
+   return get_gfar_tbipa_from_mdio(container_of(p, struct gfar, gfar_mii_regs));
+}
+
+/*
  * Return the TBIPAR address for an eTSEC2 node
  */
 static uint32_t __iomem *get_etsec_tbipa(void __iomem *p)
@@ -220,11 +231,12 @@ static uint32_t __iomem *get_etsec_tbipa(void __iomem *p)
 
 #if defined(CONFIG_UCC_GETH) || defined(CONFIG_UCC_GETH_MODULE)
 /*
- * Return the TBIPAR address for a QE MDIO node
+ * Return the TBIPAR address for a QE MDIO node, starting from the address
+ * of the mapped MII registers (struct fsl_pq_mii)
  */
 static uint32_t __iomem *get_ucc_tbipa(void __iomem *p)
 {
-   struct fsl_pq_mdio __iomem *mdio = p;
+   struct fsl_pq_mdio __iomem *mdio = container_of(p, struct fsl_pq_mdio, mii);
 
return &mdio->utbipar;
 }
@@ -300,14 +312,14 @@ static const struct of_device_id fsl_pq_mdio_match[] = {
.compatible = "fsl,gianfar-tbi",
.data = &(struct fsl_pq_mdio_data) {
.mii_offset = 0,
-   .get_tbipa = get_gfar_tbipa,
+   .get_tbipa = get_gfar_tbipa_from_mii,
},
},
{
.compatible = "fsl,gianfar-mdio",
.data = &(struct fsl_pq_mdio_data) {
.mii_offset = 0,
-   .get_tbipa = get_gfar_tbipa,
+   .get_tbipa = get_gfar_tbipa_from_mii,
},
},
{
@@ -315,7 +327,7 @@ static const struct of_device_id fsl_pq_mdio_match[] = {
.compatible = "gianfar",
.data = &(struct fsl_pq_mdio_data) {
.mii_offset = offsetof(struct fsl_pq_mdio, mii),
-   .get_tbipa = get_gfar_tbipa,
+   .get_tbipa = get_gfar_tbipa_from_mdio,
},
},
{
-- 
1.8.0.1
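
Note on the patch above: a minimal userspace sketch of the address computation it corrects. The register layout (struct full_regs, struct mii_regs) is hypothetical and much smaller than the real one, and container_of here is a simplified macro without the kernel's type checking. The point shown is that a pointer to the embedded MII block must be stepped back to its containing structure before the TBIPA member is addressed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace container_of(); the kernel macro adds type checks. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical layout: the MII block is embedded in a larger register set. */
struct mii_regs {
	uint32_t miimcfg;
	uint32_t miimcom;
};

struct full_regs {
	uint32_t tbipa;
	uint32_t pad[3];
	struct mii_regs mii;
};

/* The caller mapped the full register set. */
uint32_t *tbipa_from_full(void *p)
{
	struct full_regs *regs = p;

	assert(p != NULL);
	return &regs->tbipa;
}

/* The caller mapped only the MII block: recover the containing
 * structure first, then address the TBIPA member.
 */
uint32_t *tbipa_from_mii(void *p)
{
	return tbipa_from_full(container_of(p, struct full_regs, mii));
}
```

Passing the MII pointer straight to tbipa_from_full() reproduces the kind of mistake described in the commit message: the result is computed from the wrong base.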


[PATCH v2 1/2] net/fsl_pq_mdio: check TBI address for consistency with mapped range

2015-10-12 Thread Gerlando Falauto

When configuring the MDIO subsystem it is also necessary to configure
the TBI register. Make sure the TBI is contained within the mapped
register range in order to:
a) make sure the address is computed correctly
b) make users aware that we're actually accessing that register

In case of error, print a message but continue anyway.

Signed-off-by: Gerlando Falauto 
Cc: Timur Tabi 
Cc: David S. Miller 
Cc: Kumar Gala 
---
Changes from v1:
- Added type cast & fixed range
- removed freescale recipients

 drivers/net/ethernet/freescale/fsl_pq_mdio.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/net/ethernet/freescale/fsl_pq_mdio.c b/drivers/net/ethernet/freescale/fsl_pq_mdio.c
index 3c40f6b..5333d0a 100644
--- a/drivers/net/ethernet/freescale/fsl_pq_mdio.c
+++ b/drivers/net/ethernet/freescale/fsl_pq_mdio.c
@@ -445,6 +445,16 @@ static int fsl_pq_mdio_probe(struct platform_device *pdev)
 
tbipa = data->get_tbipa(priv->map);
 
+   /*
+* Add consistency check to make sure TBI is contained
+* within the mapped range (not because we would get a
+* segfault, rather to catch bugs in computing TBI
+* address). Print error message but continue anyway.
+*/
+   if ((void *)tbipa > priv->map + resource_size(&res) - 4)
+   dev_err(&pdev->dev, "invalid register map (should be at least 0x%04x to contain TBI address)\n",
+   ((void *)tbipa - priv->map) + 4);
+
iowrite32be(be32_to_cpup(prop), tbipa);
}
}
-- 
1.8.0.1
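
Note on the patch above: a minimal userspace sketch of the consistency check. reg_in_mapped_range and min_map_size are names invented here. Unlike the diff, which tests only the upper bound, this sketch also rejects addresses below the start of the mapping; that lower-bound test is an addition of the sketch, not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when a 32-bit register at reg lies entirely inside
 * [map, map + size).
 */
bool reg_in_mapped_range(const void *map, size_t size, const void *reg)
{
	uintptr_t base = (uintptr_t)map;
	uintptr_t addr = (uintptr_t)reg;

	assert(map != NULL);

	if (size < sizeof(uint32_t) || addr < base)
		return false;
	return addr - base <= size - sizeof(uint32_t);
}

/* Smallest mapping size that would contain the register; this is the
 * value the patch reports in its error message.
 */
size_t min_map_size(const void *map, const void *reg)
{
	assert((uintptr_t)reg >= (uintptr_t)map);
	return (size_t)((uintptr_t)reg - (uintptr_t)map) + sizeof(uint32_t);
}
```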


Issue with /proc/sys/net/ipv4/tcp_mem

2015-10-12 Thread wangyufen

Hi,

I tried on linux-4.1:
linux:~# cat /proc/sys/net/ipv4/tcp_mem 
8388608 12582912 16777216
linux:~# echo 1234 >/proc/sys/net/ipv4/tcp_mem 
-bash: echo: write error: Invalid argument
linux:~# cat /proc/sys/net/ipv4/tcp_mem 
1234 12582912 16777216

The echo operation returned an error, but the value was already written to tcp_mem.

I checked, patch f594d63199688ad568fb caused the issue.


thanks,
Wang


[RFC PATCH 0/2] bpf: enable/disable events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling

2015-10-12 Thread Kaixu Xia

In some scenarios we don't want to output trace data when perf sampling
in order to reduce overhead. For example, perf can be run as a daemon that
dumps trace data only when necessary, such as when system performance drops.

This patchset adds the helpers bpf_perf_event_sample_enable/disable() to
implement this function. With these helpers, we can enable/disable trace
data output for events stored in PERF_EVENT_ARRAY maps and get only the
samples we are most interested in.

We also need the perf user side to be able to add normal PMU events
from the perf cmdline to PERF_EVENT_ARRAY maps. My colleague He Kuang is doing
this work. In the following example, the cycles event will be stored in the
PERF_EVENT_ARRAY maps.

Before this patch,
   $ ./perf record -e cycles -a sleep 1
   $ ./perf report --stdio
# To display the perf.data header info, please use --header/--header-only option
#
#
# Total Lost Samples: 0
#
# Samples: 655  of event 'cycles'
# Event count (approx.): 129323548
...

After this patch,
   $ ./perf record -e pmux=cycles --event perf-bpf.o/my_cycles_map=pmux/ -a sleep 1
   $ ./perf report --stdio
# To display the perf.data header info, please use --header/--header-only option
#
#
# Total Lost Samples: 0
#
# Samples: 23  of event 'cycles'
# Event count (approx.): 2064170
...

The bpf program example:

  struct bpf_map_def SEC("maps") my_cycles_map = {
          .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
          .key_size = sizeof(int),
          .value_size = sizeof(u32),
          .max_entries = 32,
  };

  SEC("enter=sys_write")
  int bpf_prog_1(struct pt_regs *ctx)
  {
          bpf_perf_event_sample_enable(&my_cycles_map);
          return 0;
  }

  SEC("exit=sys_write%return")
  int bpf_prog_2(struct pt_regs *ctx)
  {
          bpf_perf_event_sample_disable(&my_cycles_map);
          return 0;
  }
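
For illustration only, the gating that the two patches implement can be
modelled in user space as below. This is a sketch, not kernel code: the
fake_* names and overflow_emits_sample() are hypothetical, and only the
perf_sample_disable / sample_disable fields mirror the patches.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bpf_map: one flag shared by every
 * event stored in the map. */
struct fake_map {
	atomic_int perf_sample_disable;
};

/* Hypothetical stand-in for struct perf_event. */
struct fake_event {
	atomic_int *sample_disable;	/* NULL if the event is not in such a map */
};

static void fake_map_init(struct fake_map *map)
{
	/* Output starts disabled, as in array_map_alloc() in patch 1/2. */
	atomic_init(&map->perf_sample_disable, 1);
}

static void fake_map_attach(struct fake_map *map, struct fake_event *event)
{
	event->sample_disable = &map->perf_sample_disable;
}

/* Counterparts of the two proposed helpers. */
static void fake_sample_enable(struct fake_map *map)
{
	atomic_store(&map->perf_sample_disable, 0);
}

static void fake_sample_disable(struct fake_map *map)
{
	atomic_store(&map->perf_sample_disable, 1);
}

/* Mirrors the check added to __perf_event_overflow(): skip the output
 * only when the event has a flag and the flag is set. */
static bool overflow_emits_sample(const struct fake_event *event)
{
	if (event->sample_disable && atomic_load(event->sample_disable))
		return false;
	return true;
}
```

Events that are not stored in a PERF_EVENT_ARRAY map keep a NULL pointer
and are therefore unaffected in this model.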


Kaixu Xia (2):
  perf: Add the flag sample_disable not to output data on samples
  bpf: Implement bpf_perf_event_sample_enable/disable() helpers

 include/linux/bpf.h|  3 +++
 include/linux/perf_event.h |  2 ++
 include/uapi/linux/bpf.h   |  2 ++
 kernel/bpf/arraymap.c  |  5 +
 kernel/bpf/verifier.c  |  4 +++-
 kernel/events/core.c   |  3 +++
 kernel/trace/bpf_trace.c   | 34 ++
 7 files changed, 52 insertions(+), 1 deletion(-)

-- 
1.8.3.4


Re: [PATCH v2] sunrpc: fix waitqueue_active without memory barrier in sunrpc

2015-10-12 Thread Kosuke Tatsukawa

J. Bruce Fields wrote:
> On Fri, Oct 09, 2015 at 06:29:44AM +, Kosuke Tatsukawa wrote:
>> Neil Brown wrote:
>> > Kosuke Tatsukawa  writes:
>> > 
>> >> There are several places in net/sunrpc/svcsock.c which call
>> >> waitqueue_active() without calling a memory barrier.  Add a memory
>> >> barrier just as in wq_has_sleeper().
>> >>
>> >> I found this issue when I was looking through the linux source code
>> >> for places calling waitqueue_active() before wake_up*(), but without
>> >> preceding memory barriers, after sending a patch to fix a similar
>> >> issue in drivers/tty/n_tty.c  (Details about the original issue can be
>> >> found here: https://lkml.org/lkml/2015/9/28/849).
>> > 
>> > hi,
>> > this feels like the wrong approach to the problem.  It requires extra
>> > 'smp_mb's to be spread around which are hard to understand and easy to
>> > forget.
>> > 
>> > A quick look seems to suggest that (nearly) every waitqueue_active()
>> > will need an smp_mb.  Could we just put the smp_mb() inside
>> > waitqueue_active()??
>> 
>> 
>> There are around 200 occurrences of waitqueue_active() in the kernel
>> source, and most of the places which use it before wake_up are either
>> protected by some spin lock, or already have a memory barrier or some
>> kind of atomic operation before it.
>> 
>> Simply adding smp_mb() to waitqueue_active() would incur extra cost in
>> many cases and won't be a good idea.
>> 
>> Another way to solve this problem is to remove the waitqueue_active(),
>> making the code look like this:
>>         if (wq)
>>                 wake_up_interruptible(wq);
>> This also fixes the problem because the spinlock in the wake_up*() acts
>> as a memory barrier and prevents the code from being reordered by the
>> CPU (and it also makes the resulting code much simpler).
> 
> I might not care which we did, except I don't have the means to test
> this quickly, and I guess this is some of our most frequently called
> code.
> 
> I suppose your patch is the most conservative approach, as the
> alternative is a spinlock/unlock in wake_up_interruptible, which I
> assume is necessarily more expensive than an smp_mb().
> 
> As far as I can tell it's been this way since forever.  (Well, since a
> 2002 patch "NFSD: TCP: rationalise locking in RPC server routines" which
> removed some spinlocks from the data_ready routines.)
> 
> I don't understand what the actual race is yet (which code exactly is
> missing the wakeup in this case?  nfsd threads seem to instead get
> woken up by the wake_up_process() in svc_xprt_do_enqueue().)

Thank you for the reply.  I tried looking into this.

The callbacks in net/sunrpc/svcsock.c are set up in svc_tcp_init() and
svc_udp_init(), which are both called from svc_setup_socket().
svc_setup_socket() is called (indirectly) from lockd, nfsd, and nfsv4
callback port related code.

Maybe I'm wrong, but there might not be any kernel code that is using
the socket's wait queue in this case.
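
For reference, the ordering problem that the barrier addresses can be
sketched in user space with C11 atomics. This is only an illustration
under stated assumptions: struct chan and both function names are
hypothetical, and atomic_thread_fence(memory_order_seq_cst) stands in
for smp_mb().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: data_ready is the wakeup condition and has_waiter
 * plays the role of a non-empty wait queue (waitqueue_active()). */
struct chan {
	atomic_bool data_ready;
	atomic_bool has_waiter;
};

/* Waker side: publish the condition, then check for a waiter. */
static bool waker_should_wake(struct chan *c)
{
	atomic_store_explicit(&c->data_ready, true, memory_order_relaxed);
	/* Full fence: the store above may not be reordered after the load
	 * below. Without it the waker could read a stale has_waiter. */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&c->has_waiter, memory_order_relaxed);
}

/* Sleeper side: register as a waiter, then re-check the condition. */
static bool sleeper_must_sleep(struct chan *c)
{
	atomic_store_explicit(&c->has_waiter, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return !atomic_load_explicit(&c->data_ready, memory_order_relaxed);
}
```

With the fence on both sides, the two threads cannot both miss each
other's store: either the waker sees the waiter, or the sleeper sees the
data and does not sleep.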

Best regards.
---
Kosuke TATSUKAWA  | 3rd IT Platform Department
  | IT Platform Division, NEC Corporation
  | ta...@ab.jp.nec.com

Use-after-free in ep_remove_wait_queue

2015-10-12 Thread Dmitry Vyukov

Hello,

The following program causes a use-after-free in the kernel:

// autogenerated by syzkaller (http://github.com/google/syzkaller)
#include 
#include 
#include 

int main()
{
long r0 = syscall(SYS_mmap, 0x20001000ul, 0x1000ul, 0x3ul,
0x32ul, 0xul, 0x0ul);
long r1 = syscall(SYS_mmap, 0x2000ul, 0x1000ul, 0x3ul,
0x32ul, 0xul, 0x0ul);
long r2 = syscall(SYS_socketpair, 0x1ul, 0x3ul, 0x1ul, 0x2ffcul);
long r3 = -1;
if (r2 != -1)
r3 = *(uint32_t*)0x2ffc;
long r4 = -1;
if (r2 != -1)
r4 = *(uint32_t*)0x20001000;
long r5 = syscall(SYS_epoll_create, 0x1ul);
long r6 = syscall(SYS_mmap, 0x20003000ul, 0x1000ul, 0x3ul,
0x32ul, 0xul, 0x0ul);
long r7 = syscall(SYS_dup3, r4, r3, 0x8ul);
*(uint32_t*)0x20003000 = 0x6;
*(uint32_t*)0x20003004 = 0x2;
*(uint64_t*)0x20003008 = 0x6;
long r11 = syscall(SYS_epoll_ctl, r5, 0x1ul, r3, 0x20003000ul);
long r12 = syscall(SYS_mmap, 0x20002000ul, 0x1000ul, 0x3ul,
0x32ul, 0xul, 0x0ul);
memcpy((void*)0x20002000, "\x00", 1);
long r14 = syscall(SYS_write, r7, 0x20002000ul, 0x1ul);
*(uint64_t*)0x20001a4d = 0x5;
long r16 = syscall(SYS_epoll_pwait, r5, 0x20004000ul, 0x1ul,
0x1ul, 0x20001a4dul, 0x8ul);
return 0;
}


on commit bcee19f424a0d8c26ecf2607b73c690802658b29
(git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git)

BUG: KASan: use after free in remove_wait_queue+0xfb/0x120 at addr
88002db3cf50
Write of size 8 by task syzkaller_execu/10568

BUG UNIX (Not tainted): kasan: bad access detected
-

Call Trace:
 [< inline >] __list_del include/linux/list.h:89
 [< inline >] list_del include/linux/list.h:107
 [< inline >] __remove_wait_queue include/linux/wait.h:145
 [] remove_wait_queue+0xfb/0x120 kernel/sched/wait.c:50
 [< inline >] ep_remove_wait_queue fs/eventpoll.c:524
 [] ep_unregister_pollwait.isra.7+0x10b/0x1c0
fs/eventpoll.c:542
 [] ep_free+0x97/0x190 fs/eventpoll.c:759 (discriminator 3)
 [] ep_eventpoll_release+0x44/0x60 fs/eventpoll.c:791
 [] __fput+0x21d/0x6e0 fs/file_table.c:208
 [] fput+0x15/0x20 fs/file_table.c:244
 [] task_work_run+0x164/0x1f0 kernel/task_work.c:115
(discriminator 1)
 [< inline >] exit_task_work include/linux/task_work.h:21
 [] do_exit+0xa4e/0x2d40 kernel/exit.c:746
 [] do_group_exit+0xf6/0x340 kernel/exit.c:874
 [< inline >] SYSC_exit_group kernel/exit.c:885
 [] SyS_exit_group+0x1d/0x20 kernel/exit.c:883
 [] entry_SYSCALL_64_fastpath+0x31/0x95
arch/x86/entry/entry_64.S:187

INFO: Allocated in sk_prot_alloc+0x69/0x340 age=6 cpu=1 pid=10568
[<  none  >] __slab_alloc+0x426/0x470 mm/slub.c:2402
[< inline >] slab_alloc_node mm/slub.c:2470
[< inline >] slab_alloc mm/slub.c:2512
[<  none  >] kmem_cache_alloc+0x10d/0x140 mm/slub.c:2517
[<  none  >] sk_prot_alloc+0x69/0x340 net/core/sock.c:1329
[<  none  >] sk_alloc+0x33/0x280 net/core/sock.c:1404
[<  none  >] unix_create1+0x5e/0x3d0 net/unix/af_unix.c:648
[<  none  >] unix_create+0x14d/0x1f0 net/unix/af_unix.c:707
[<  none  >] __sock_create+0x1f1/0x4c0 net/socket.c:1169
[< inline >] sock_create net/socket.c:1209
[< inline >] SYSC_socketpair net/socket.c:1281
[<  none  >] SyS_socketpair+0x10f/0x4d0 net/socket.c:1260
[<  none  >] entry_SYSCALL_64_fastpath+0x31/0x95
arch/x86/entry/entry_64.S:187

INFO: Freed in sk_destruct+0x2e9/0x400 age=9 cpu=1 pid=10568
[<  none  >] __slab_free+0x12d/0x280 mm/slub.c:2587 (discriminator 1)
[< inline >] slab_free mm/slub.c:2736
[<  none  >] kmem_cache_free+0x161/0x180 mm/slub.c:2745
[< inline >] sk_prot_free net/core/sock.c:1374
[<  none  >] sk_destruct+0x2e9/0x400 net/core/sock.c:1452
[<  none  >] __sk_free+0x57/0x200 net/core/sock.c:1460
[<  none  >] sk_free+0x30/0x40 net/core/sock.c:1471
[< inline >] sock_put include/net/sock.h:1593
[<  none  >] unix_dgram_sendmsg+0xeaf/0x1100 net/unix/af_unix.c:1571
[< inline >] sock_sendmsg_nosec net/socket.c:610
[<  none  >] sock_sendmsg+0xca/0x110 net/socket.c:620
[<  none  >] sock_write_iter+0x216/0x3a0 net/socket.c:819
[< inline >] new_sync_write fs/read_write.c:478
[<  none  >] __vfs_write+0x2ed/0x3d0 fs/read_write.c:491
[<  none  >] vfs_write+0x173/0x4a0 fs/read_write.c:538
[< inline >] SYSC_write fs/read_write.c:585
[<  none  >] SyS_write+0x108/0x220 fs/read_write.c:577
[<  none  >] entry_SYSCALL_64_fastpath+0x31/0x95
arch/x86/entry/entry_64.S:187

INFO: Slab 0xeab6ce00 objects=26 used=4 fp=0x88002db3cc00

[RFC PATCH 1/2] perf: Add the flag sample_disable not to output data on samples

2015-10-12 Thread Kaixu Xia

In some scenarios we don't want to output trace data when sampling
to reduce overhead. This patch adds the flag sample_disable to
implement this function. By setting this flag and integrating with
ebpf, we can control the data output process and get the samples we
are most interested in.

Signed-off-by: Kaixu Xia 
---
 include/linux/bpf.h| 1 +
 include/linux/perf_event.h | 2 ++
 kernel/bpf/arraymap.c  | 5 +
 kernel/events/core.c   | 3 +++
 4 files changed, 11 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f57d7fe..25e073d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -39,6 +39,7 @@ struct bpf_map {
u32 max_entries;
const struct bpf_map_ops *ops;
struct work_struct work;
+   atomic_t perf_sample_disable;
 };
 
 struct bpf_map_type_list {
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 092a0e8..0606d1d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -483,6 +483,8 @@ struct perf_event {
perf_overflow_handler_t overflow_handler;
void*overflow_handler_context;
 
+   atomic_t*sample_disable;
+
 #ifdef CONFIG_EVENT_TRACING
struct trace_event_call *tp_event;
struct event_filter *filter;
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 29ace10..4ae82c9 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -51,6 +51,9 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 
array->elem_size = elem_size;
 
+   if (attr->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY)
+   atomic_set(&array->map.perf_sample_disable, 1);
+
return &array->map;
 }
 
@@ -298,6 +301,8 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map *map, int fd)
perf_event_release_kernel(event);
return ERR_PTR(-EINVAL);
}
+
+   event->sample_disable = &map->perf_sample_disable;
return event;
 }
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b11756f..f6ef45c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6337,6 +6337,9 @@ static int __perf_event_overflow(struct perf_event *event,
irq_work_queue(&event->pending);
}
 
+   if ((event->sample_disable) && atomic_read(event->sample_disable))
+   return ret;
+
if (event->overflow_handler)
event->overflow_handler(event, data, regs);
else
-- 
1.8.3.4


[RFC PATCH 2/2] bpf: Implement bpf_perf_event_sample_enable/disable() helpers

2015-10-12 Thread Kaixu Xia

The functions bpf_perf_event_sample_enable/disable() can set the
flag sample_disable to enable/disable output trace data on samples.

Signed-off-by: Kaixu Xia 
---
 include/linux/bpf.h  |  2 ++
 include/uapi/linux/bpf.h |  2 ++
 kernel/bpf/verifier.c|  4 +++-
 kernel/trace/bpf_trace.c | 34 ++
 4 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 25e073d..09148ff 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -192,6 +192,8 @@ extern const struct bpf_func_proto bpf_map_update_elem_proto;
 extern const struct bpf_func_proto bpf_map_delete_elem_proto;
 
 extern const struct bpf_func_proto bpf_perf_event_read_proto;
+extern const struct bpf_func_proto bpf_perf_event_sample_enable_proto;
+extern const struct bpf_func_proto bpf_perf_event_sample_disable_proto;
 extern const struct bpf_func_proto bpf_get_prandom_u32_proto;
 extern const struct bpf_func_proto bpf_get_smp_processor_id_proto;
 extern const struct bpf_func_proto bpf_tail_call_proto;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 92a48e2..5229c550 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -272,6 +272,8 @@ enum bpf_func_id {
BPF_FUNC_skb_get_tunnel_key,
BPF_FUNC_skb_set_tunnel_key,
BPF_FUNC_perf_event_read,   /* u64 bpf_perf_event_read(&map, index) */
+   BPF_FUNC_perf_event_sample_enable,  /* u64 bpf_perf_event_enable() */
+   BPF_FUNC_perf_event_sample_disable, /* u64 bpf_perf_event_disable() */
__BPF_FUNC_MAX_ID,
 };
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b074b23..6428daf 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -244,6 +244,8 @@ static const struct {
 } func_limit[] = {
{BPF_MAP_TYPE_PROG_ARRAY, BPF_FUNC_tail_call},
{BPF_MAP_TYPE_PERF_EVENT_ARRAY, BPF_FUNC_perf_event_read},
+   {BPF_MAP_TYPE_PERF_EVENT_ARRAY, BPF_FUNC_perf_event_sample_enable},
+   {BPF_MAP_TYPE_PERF_EVENT_ARRAY, BPF_FUNC_perf_event_sample_disable},
 };
 
 static void print_verifier_state(struct verifier_env *env)
@@ -860,7 +862,7 @@ static int check_map_func_compatibility(struct bpf_map *map, int func_id)
 * don't allow any other map type to be passed into
 * the special func;
 */
-   if (bool_map != bool_func)
+   if (bool_func && bool_map != bool_func)
return -EINVAL;
}
 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 0fe96c7..abe943a 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -215,6 +215,36 @@ const struct bpf_func_proto bpf_perf_event_read_proto = {
.arg2_type  = ARG_ANYTHING,
 };
 
+static u64 bpf_perf_event_sample_enable(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+   struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
+
+   atomic_set(&map->perf_sample_disable, 0);
+   return 0;
+}
+
+const struct bpf_func_proto bpf_perf_event_sample_enable_proto = {
+   .func   = bpf_perf_event_sample_enable,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_CONST_MAP_PTR,
+};
+
+static u64 bpf_perf_event_sample_disable(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+   struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
+
+   atomic_set(&map->perf_sample_disable, 1);
+   return 0;
+}
+
+const struct bpf_func_proto bpf_perf_event_sample_disable_proto = {
+   .func   = bpf_perf_event_sample_disable,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_CONST_MAP_PTR,
+};
+
 static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func_id)
 {
switch (func_id) {
@@ -242,6 +272,10 @@ static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func
return &bpf_get_smp_processor_id_proto;
case BPF_FUNC_perf_event_read:
return &bpf_perf_event_read_proto;
+   case BPF_FUNC_perf_event_sample_enable:
+   return &bpf_perf_event_sample_enable_proto;
+   case BPF_FUNC_perf_event_sample_disable:
+   return &bpf_perf_event_sample_disable_proto;
default:
return NULL;
}
-- 
1.8.3.4


GPF in rt6_uncached_list_flush_dev

2015-10-12 Thread Dmitry Vyukov

Hello,

The following program causes episodic crashes:

// autogenerated by syzkaller (http://github.com/google/syzkaller)
#include 
#define CLONE_NEWNET 0x4000
int main(void)
{
unshare(CLONE_NEWNET);
}

On commit dd36d7393d6310b0c1adefb22fba79c3cf8a577c
(git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git)

general protection fault:  [#1] SMP KASAN
Modules linked in:
CPU: 0 PID: 1058 Comm: kworker/u8:1 Not tainted 4.3.0-rc2+ #12
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Workqueue: netns cleanup_net
task: 880051c71a00 ti: 8800514f8000 task.ti: 8800514f8000
RIP: 0010:[]  [] rt6_ifdown+0x481/0x740
RSP: 0018:8800514ffaa0  EFLAGS: 00010246
RAX: dc59 RBX: 88005107c580 RCX: 0002
RDX:  RSI: 000f RDI: 880052a1f340
RBP: 8800514ffb78 R08:  R09: 8800514ffb10
R10: 88002d5b7dc0 R11: 88002ec07600 R12: 880051c11140
R13: 88005144af40 R14:  R15: dc00
FS:  () GS:88002f00() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 00648056 CR3: 0361 CR4: 06f0
Stack:
 02c8 11000a29ff5e dc59 00022d5b61c0
 880052a1f340 880051c11140 880052a1f348 88005107c6d8
 88005107c598  41b58ab3 83471ca6
Call Trace:
 [] fib6_net_exit+0x20/0x100 net/ipv6/ip6_fib.c:1847
 [] ops_exit_list.isra.6+0xae/0x150
net/core/net_namespace.c:134
 [] cleanup_net+0x3cd/0x730
net/core/net_namespace.c:431 (discriminator 3)
 [] process_one_work+0x6d1/0x1370 kernel/workqueue.c:2030
 [] worker_thread+0xe3/0x1300 kernel/workqueue.c:2162
 [] kthread+0x1e7/0x260 kernel/kthread.c:209
 [] ret_from_fork+0x3f/0x70 arch/x86/entry/entry_64.S:475
Code: 89 95 50 ff ff ff e8 6f 41 9f fe 48 8b 95 50 ff ff ff 48 39 95
70 ff ff ff 0f 84 d5 fe ff ff e8 56 41 9f fe 48 8b 85 38 ff ff ff <80>
38 00 0f 85 9b 01 00 00 48 8b 85 70 ff ff ff 48 8b 90 c8 02
RIP  [< inline >] __read_once_size include/linux/compiler.h:207
RIP  [< inline >] in6_dev_get include/net/addrconf.h:281
RIP  [< inline >] rt6_uncached_list_flush_dev net/ipv6/route.c:156
RIP  [] rt6_ifdown+0x481/0x740 net/ipv6/route.c:2621
 RSP 
---[ end trace 113e678e9b762d96 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception in interrupt


The crash happens because loopback_dev is NULL in
rt6_uncached_list_flush_dev(). The crash happens only if there is an
uncached route when the interface is destroyed.

I've tried to run the program with the following patch applied:

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index dc7d970..fd7e88d 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -144,6 +144,8 @@ static int loopback_dev_init(struct net_device *dev)

 static void loopback_dev_free(struct net_device *dev)
 {
+   pr_err("loopback_dev_free %p = %p",
&dev_net(dev)->loopback_dev, dev_net(dev)->loopback_dev);
+   WARN_ON(1);
dev_net(dev)->loopback_dev = NULL;
free_percpu(dev->lstats);
free_netdev(dev);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f204089..fd558a4 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -142,6 +142,8 @@ static void rt6_uncached_list_flush_dev(struct net
*net, struct net_device *dev)
struct net_device *loopback_dev = net->loopback_dev;
int cpu;

+   pr_err("rt6_uncached_list_flush_dev %p = %p",
&net->loopback_dev, net->loopback_dev);
+   WARN_ON(1);
for_each_possible_cpu(cpu) {
struct uncached_list *ul = per_cpu_ptr(&rt6_uncached_list, cpu);
struct rt6_info *rt;


And it shows that the loopback device is destroyed before
rt6_uncached_list_flush_dev is executed, while
rt6_uncached_list_flush_dev seems to assume that loopback_dev is alive
when it is called:

[  197.812174] loopback_dev_free 88003d288150 = 88003e1d67c0
[  197.812890] [ cut here ]
[  197.813389] WARNING: CPU: 2 PID: 1044 at drivers/net/loopback.c:148
loopback_dev_free+0x3c/0x70()
[  197.814186] Modules linked in:
[  197.814478] CPU: 2 PID: 1044 Comm: kworker/u8:1 Tainted: GW
  4.3.0-rc3+ #45
[  197.815186] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[  197.815886] Workqueue: netns cleanup_net
[  197.816256]  81c27c67 88003d923c50 812fe8d6

[  197.816949]  88003d923c88 81051ff1 88003e1d67c0
88003e1d6bd0
[  197.817662]  fffe70d4 fffe70d4 03e8
88003d923c98
[  197.818367] Call Trace:
[  197.818589]  [] dump_stack+0x44/0x5e
[  197.819048]  [] warn_slowpath_common+0x81/0xc0
[  197.819573]  [] warn_slowpath_null+0x15/0x20
[  197.820088]  [] loopback_dev_free+0x3c/0x70
[  197.820588]

Re: [PATCH 4/9] net/can: can_dropped_invalid_skb can be boolean

2015-10-12 Thread Marc Kleine-Budde

On 10/09/2015 04:25 PM, Yaowei Bai wrote:
>> Yaowei, feel free to send the CAN patch as part of your series directly
>> to David.
> 
> OK, i'll do that and sorry for disturbing you. :)

Putting me on Cc was 100% correct, but IMHO no need to split up the
series when David can apply it in one go.

regards,
Marc

-- 
Pengutronix e.K.  | Marc Kleine-Budde   |
Industrial Linux Solutions| Phone: +49-231-2826-924 |
Vertretung West/Dortmund  | Fax:   +49-5121-206917- |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |



signature.asc
Description: OpenPGP digital signature

Infinite loop in ip6_fragment

2015-10-12 Thread Dmitry Vyukov

Hello,

The following program causes infinite loop in ip6_fragment function:

// autogenerated by syzkaller (http://github.com/google/syzkaller)
#include 
#include 
#include 

int main()
{
long r0 = syscall(SYS_socket, 0xaul, 0x3ul, 0x53cul);
long r1 = syscall(SYS_mmap, 0x2000ul, 0x1000ul, 0x3ul,
0x32ul, 0xul, 0x0ul);
*(uint64_t*)0x2fc0 = 0xa;
*(uint64_t*)0x2fc8 = 0x;
*(uint64_t*)0x2fd0 = 0x0;
*(uint64_t*)0x2fd8 = 0xa5;
*(uint64_t*)0x2fe0 = 0x1;
*(uint64_t*)0x2fe8 = 0x9;
*(uint64_t*)0x2ff0 = 0x8;
*(uint64_t*)0x2ff8 = 0x4cd;
long r10 = syscall(SYS_connect, r0, 0x2fc0ul, 0x40ul);
long r11 = syscall(SYS_mmap, 0x20001000ul, 0x1000ul, 0x3ul,
0x32ul, 0xul, 0x0ul);
long r12 = syscall(SYS_mmap, 0x20002000ul, 0x1000ul, 0x3ul,
0x32ul, 0xul, 0x0ul);
long r13 = syscall(SYS_mmap, 0x20003000ul, 0x1000ul, 0x3ul,
0x32ul, 0xul, 0x0ul);
*(uint64_t*)0x200010b3 = 0x200012e6;
*(uint64_t*)0x200010bb = 0xf;
*(uint64_t*)0x200010c3 = 0x20002000;
*(uint64_t*)0x200010cb = 0x1000;
long r20 = syscall(SYS_writev, r0, 0x200010b3ul, 0x2ul);
return 0;
}

On commit dd36d7393d6310b0c1adefb22fba79c3cf8a577c
(git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git)

INFO: rcu_sched self-detected stall on CPU
0: (20822 ticks this GP) idle=94b/141/0
softirq=2353/2355 fqs=6717
 (t=21000 jiffies g=1103 c=1102 q=171)
Task dump for CPU 0:
a.out   R  running task13472  2613   2610 0x0008
 81e40900 88083fc03bb0 810788b3 
 81e40900 88083fc03bc8 8107ab84 0001
 88083fc03bf8 8109fc35 88083fc15e00 81e40900
Call Trace:
   [] sched_show_task+0xc3/0x120 kernel/sched/core.c:4874
 [] dump_cpu_task+0x34/0x40 kernel/sched/core.c:8563
 [] rcu_dump_cpu_stacks+0x85/0xc0 kernel/rcu/tree.c:1199
 [< inline >] print_cpu_stall kernel/rcu/tree.c:1306
 [< inline >] check_cpu_stall kernel/rcu/tree.c:1370
 [< inline >] __rcu_pending kernel/rcu/tree.c:3601
 [< inline >] rcu_pending kernel/rcu/tree.c:3665
 [] rcu_check_callbacks+0x45c/0x740 kernel/rcu/tree.c:2764
 [] update_process_times+0x34/0x60 kernel/time/timer.c:1397
 [] tick_sched_handle.isra.15+0x31/0x40
kernel/time/tick-sched.c:151
 [] tick_sched_timer+0x3b/0x70 kernel/time/tick-sched.c:1070
 [< inline >] __run_hrtimer kernel/time/hrtimer.c:1229
 [] __hrtimer_run_queues+0xda/0x1f0 kernel/time/hrtimer.c:1293
 [] hrtimer_interrupt+0xa3/0x190 kernel/time/hrtimer.c:1327
 [] local_apic_timer_interrupt+0x30/0x60
arch/x86/kernel/apic/apic.c:901
 [] smp_apic_timer_interrupt+0x38/0x50
arch/x86/kernel/apic/apic.c:925
 [] apic_timer_interrupt+0x7f/0x90
arch/x86/entry/entry_64.S:683
 [< inline >] napi_poll net/core/dev.c:4755
 [] net_rx_action+0x134/0x300 net/core/dev.c:4820
 [] __do_softirq+0xc7/0x240 kernel/softirq.c:273
 [] do_softirq_own_stack+0x1c/0x30
arch/x86/entry/entry_64.S:871
   [] do_softirq+0x2c/0x40 kernel/softirq.c:317
 [] __local_bh_enable_ip+0x73/0x80 kernel/softirq.c:170
 [< inline >] local_bh_enable include/linux/bottom_half.h:31
 [< inline >] rcu_read_unlock_bh include/linux/rcupdate.h:954
 [] ip6_finish_output2+0x16c/0x480 net/ipv6/ip6_output.c:114
 [] ip6_fragment+0x37e/0x9d0 net/ipv6/ip6_output.c:805
 [] ip6_finish_output+0xcd/0xe0 net/ipv6/ip6_output.c:130
 [< inline >] NF_HOOK_COND include/linux/netfilter.h:236
 [] ip6_output+0x3f/0xe0 net/ipv6/ip6_output.c:146
 [< inline >] dst_output_sk include/net/dst.h:459
 [] ip6_local_out_sk+0x28/0x30 net/ipv6/output_core.c:167
 [] ip6_local_out+0x10/0x20 net/ipv6/output_core.c:175
 [] ip6_send_skb+0x18/0x60 net/ipv6/ip6_output.c:1683
 [] ip6_push_pending_frames+0x34/0x40
net/ipv6/ip6_output.c:1703
 [< inline >] rawv6_push_pending_frames net/ipv6/raw.c:607
 [] rawv6_sendmsg+0x871/0xb30 net/ipv6/raw.c:901
 [] inet_sendmsg+0x62/0xa0 net/ipv4/af_inet.c:737
 [< inline >] sock_sendmsg_nosec net/socket.c:610
 [] sock_sendmsg+0x33/0x40 net/socket.c:620
 [] sock_write_iter+0x73/0xd0 net/socket.c:819
 [< inline >] do_iter_readv_writev fs/read_write.c:664
 [] do_readv_writev+0x1bd/0x270 fs/read_write.c:808
 [] vfs_writev+0x34/0x40 fs/read_write.c:847
 [< inline >] SYSC_writev fs/read_write.c:880
 [] SyS_writev+0x45/0xc0 fs/read_write.c:872
 [] entry_SYSCALL_64_fastpath+0x12/0x6a
arch/x86/entry/entry_64.S:185

ip6_fragment computes the mtu value as 4, which is then rounded down to a
multiple of 8 and becomes 0. This causes an infinite loop that sends 0 bytes
at a time. The initial mtu value is 1500, but here it becomes 4:

mtu -= hlen + sizeof(struct frag_hdr);

sizeof(struct frag_hdr) = 8, hlen = 1488.
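
The arithmetic can be checked with a small user-space sketch. The helper
name frag_payload_len() and the FRAG_HDR_LEN macro are made up for
illustration, and the rounding step is a reading of the description
above rather than a copy of the kernel code.

```c
/* sizeof(struct frag_hdr), as stated in the report. */
#define FRAG_HDR_LEN 8u

/* Payload bytes that fit in one fragment for a given link MTU and
 * unfragmentable header length. Assumes hlen + FRAG_HDR_LEN <= mtu;
 * otherwise the unsigned subtraction would wrap. */
static unsigned int frag_payload_len(unsigned int mtu, unsigned int hlen)
{
	/* mtu -= hlen + sizeof(struct frag_hdr); */
	unsigned int len = mtu - (hlen + FRAG_HDR_LEN);

	/* Non-final fragments carry a multiple of 8 bytes, so round down. */
	return len & ~7u;
}
```

With mtu = 1500 and hlen = 1488 this gives 1500 - 1496 = 4, which rounds
down to 0, so each iteration of the send loop makes no progress.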

I use plain defconfig/kvmconfig.

Found with syzkaller fuzzer.
Re: [patch net-next 0/7] switchdev: change locking

2015-10-12 Thread Jiri Pirko

Sun, Oct 11, 2015 at 05:21:04PM CEST, j...@resnulli.us wrote:
>From: Jiri Pirko 
>
>This is something which I'm currently struggling with.
>Callers of attr_set and obj_add/del often hold not only RTNL, but also
>spinlock (bridge). So in that case, the driver implementing the op cannot 
>sleep.
>
>The way rocker is dealing with this now is just to invoke driver operation
>and go out, without any checking or reporting of the operation status.
>
>Since it would be nice to at least put a warning in case the operation fails,
>it makes sense to do this in delayed work directly in switchdev core
>instead of implementing this in separate drivers. And that is what this 
>patchset
>is introducing.
>
>So from now on, the locking of switchdev mod ops is consistent. Caller either
>holds rtnl mutex or in case it does not, caller sets defer flag, telling
>switchdev core to process the op later in delayed work.
>
>Flush function for switchdev deferred ops can be called by op
>caller in appropriate location, for example after it releases
>spin lock, to force switchdev core to process pending ops.

Dave, if you are going to apply
"[PATCH net-next v3 0/4] switchdev: push bridge ageing_time attribute down"
first, I have to rebase this patchset on top of that. Just say so.

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: switchdev and VLAN ranges

2015-10-12 Thread Nikolay Aleksandrov

On 10/12/2015 07:14 AM, Scott Feldman wrote:
> On Sun, Oct 11, 2015 at 5:13 PM, Nikolay Aleksandrov
>  wrote:
>> On 10/12/2015 12:41 AM, Vivien Didelot wrote:
>>> On Oct. Sunday 11 (41) 09:12 AM, Jiri Pirko wrote:
 Sat, Oct 10, 2015 at 12:36:26PM CEST, niko...@cumulusnetworks.com wrote:
> On 10/10/2015 09:49 AM, Elad Raz wrote:
>>
>>> On Oct 10, 2015, at 2:30 AM, Vivien Didelot 
>>>  wrote:
>>>
>>> I have two concerns in mind:
>>>
>>> a) if we imagine that drivers like Rocker allocate memory in the prepare
>>> phase for each VID, preparing a range like 100-4000 would definitely not
>>> be recommended.
>>>
>>> b) imagine that you have two Linux bridges on a switch, one using the
>>> hardware VLAN 100. If you request the VLAN range 99-101 for the other
>>> bridge members, it is not possible for the driver to say "I can
>>> accelerate VLAN 99 and 101, but not 100". It must return OPNOTSUPP for
>>> the whole range.
>>
>> Another concern I have with vid_begin..vid_end range is the “flags”. 
>> Where flags can be BRIDGE_VLAN_INFO_PVID.
>> There is no sense having more than one VLAN as a PVID.
>> This leaves the HW vendor the choice which VLAN id they will use as the 
>> PVID.
>>
>
> iproute2 doesn't allow to do it but I can see that someone can actually 
> make it
> so the flags for the range have it and it doesn't look correct. Perhaps 
> we need
> something like the patch below to enforce this from kernel-side.
>
>
> diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
> index d78b4429505a..02b17b53e9a6 100644
> --- a/net/bridge/br_netlink.c
> +++ b/net/bridge/br_netlink.c
> @@ -524,6 +524,9 @@ static int br_afspec(struct net_bridge *br,
> if (vinfo_start)
> return -EINVAL;
> vinfo_start = vinfo;
> +   /* don't allow range of pvids */
> +   if (vinfo_start->flags & BRIDGE_VLAN_INFO_PVID)
> +   return -EINVAL;
> continue;
> }
>

 Looks correct to me. Could you please submit this properly? Thanks!
>>>
>>> The above patch is correct, but we only solve part of the problem, since
>>> the range and bridge flags are exposed by switchdev_obj_port_vlan as is.
>>>
>>> Thanks,
>>> -v
>>>
>>
>> Yes, the above fixes the bridge side. About the switchdev side it seems like 
>> it's
>> up to the switchdev driver to do the right thing in its switchdev_ops. I 
>> took a
>> quick look at DSA and it seems correct, the flag isn't saved and on dump 
>> request
>> the flags are generated so it shouldn't be possible to export multiple pvids.
>> But switchdev_port_br_afspec() seems problematic, in fact I don't even see a 
>> vlan
>> id check, i.e. ==0 || >= VLAN_N_MASK.
>> Of course, I might be totally off point as I'm not that familiar with 
>> switchdev and
>> it's very late. :-)
>> But maybe it needs something like:
>>
>> diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
>> index 6e4a4f9ad927..3dd52a53867f 100644
>> --- a/net/switchdev/switchdev.c
>> +++ b/net/switchdev/switchdev.c
>> @@ -16,6 +16,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -716,10 +717,14 @@ static int switchdev_port_br_afspec(struct net_device 
>> *dev,
>> return -EINVAL;
>> vinfo = nla_data(attr);
>> vlan.flags = vinfo->flags;
>> +   if (!vinfo->vid || vinfo->vid >= VLAN_VID_MASK)
>> +return -EINVAL;
>> if (vinfo->flags & BRIDGE_VLAN_INFO_RANGE_BEGIN) {
>> if (vlan.vid_begin)
>> return -EINVAL;
>> vlan.vid_begin = vinfo->vid;
>> +   if (vlan.flags & BRIDGE_VLAN_INFO_PVID)
>> +   return -EINVAL;
>> } else if (vinfo->flags & BRIDGE_VLAN_INFO_RANGE_END) {
>> if (!vlan.vid_begin)
>> return -EINVAL;
> 
> This (and your other patch) seem right to me, if we're going to block
> setting PVID when specifying a vlan range.  Would you mind combining
> and resending both patches as one as a proper patch?
> 

Thanks for the review, I'll prepare a small set as I'd like to keep these
separate since they touch two different subsystems and will re-post.
I'll target net-next with the pvid range change and -net with the vlan
range check patch. Does this sound okay ?
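
As a rough user-space sketch of the two checks discussed above (not the
final patches): check_vlan_entry() is a hypothetical helper, and the
macro values are local copies that I assume match the uapi headers.

```c
#include <errno.h>
#include <stdint.h>

/* Local copies for illustration; values assumed to mirror
 * linux/if_vlan.h and uapi/linux/if_bridge.h. */
#define VLAN_VID_MASK			0x0fff
#define BRIDGE_VLAN_INFO_PVID		(1 << 1)
#define BRIDGE_VLAN_INFO_RANGE_BEGIN	(1 << 3)
#define BRIDGE_VLAN_INFO_RANGE_END	(1 << 4)

/* Hypothetical helper: validate one bridge_vlan_info-like entry. */
static int check_vlan_entry(uint16_t flags, uint16_t vid)
{
	/* VLAN id range check: 0 and anything >= VLAN_VID_MASK are rejected. */
	if (!vid || vid >= VLAN_VID_MASK)
		return -EINVAL;

	/* Don't allow a range of pvids: a port has exactly one PVID. */
	if ((flags & BRIDGE_VLAN_INFO_RANGE_BEGIN) &&
	    (flags & BRIDGE_VLAN_INFO_PVID))
		return -EINVAL;

	return 0;
}
```

A single-VLAN entry with the PVID flag still passes; only the
combination of RANGE_BEGIN and PVID is refused.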

Thanks,
 Nik


Re: [PATCH nf-next 1/2] ipvs: Remove possibly unused variable from ip_vs_out

2015-10-12 Thread Pablo Neira Ayuso

On Wed, Oct 07, 2015 at 04:58:47PM +0300, Sergei Shtylyov wrote:
> Hello.
> 
> On 10/7/2015 8:23 AM, Simon Horman wrote:
> 
> >From: David Ahern 
> >
> >Eric's net namespace changes in 1b75097dd7a26 leaves net unreferenced if
> >CONFIG_IP_VS_IPV6 is not enabled:
> >
> >../net/netfilter/ipvs/ip_vs_core.c: In function ‘ip_vs_out’:
> >../net/netfilter/ipvs/ip_vs_core.c:1177:14: warning: unused variable ‘net’ 
> >[-Wunused-variable]
> >
> >After the net refactoring there is only 1 user; push the reference to the
> >1 user. While the line length slightly exceeds 80 it seems to be the
> >best change.
> >
> >Fixes: 1b75097dd7a26("ipvs: Pass ipvs into ip_vs_out")
> 
>Minor nit: missing space before (. Could probably be fixed while 
> applying...

I have pulled this anyway, I don't want to keep this back for this
minor style problem.

But please Simon, run checkpatch.pl before submitting next time.

Thanks!

Re: [PATCH v2] bridge/netfilter: avoid unused label warning

2015-10-12 Thread Pablo Neira Ayuso

On Thu, Oct 08, 2015 at 02:51:05PM +0200, Nikolay Aleksandrov wrote:
> On 10/08/2015 02:30 PM, Arnd Bergmann wrote:
> > With the ARM mini2440_defconfig, the bridge netfilter code gets
> > built with both CONFIG_NF_DEFRAG_IPV4 and CONFIG_NF_DEFRAG_IPV6
> > disabled, which leads to a harmless gcc warning:
> > 
> > net/bridge/br_netfilter_hooks.c: In function 'br_nf_dev_queue_xmit':
> > net/bridge/br_netfilter_hooks.c:792:2: warning: label 'drop' defined but not used [-Wunused-label]
> > 
> > This gets rid of the warning by cleaning up the code to avoid
> > the respective #ifdefs causing this problem, and replacing them
> > with if(IS_ENABLED()) checks. I have verified that the resulting
> > object code is unchanged, and an additional advantage is that
> > we now get compile coverage of the unused functions in more
> > configurations.
> > 
> > Signed-off-by: Arnd Bergmann 
> > Fixes: dd302b59bde0 ("netfilter: bridge: don't leak skb in error paths")
> > ---
> > Version 2:
> > 
> > Rebased to git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
> > 
> 
> Looks good to me,
> 
> Reviewed-by: Nikolay Aleksandrov 

Nice that we got rid of that many ifdefs. Thanks!

BTW, just for the record in case someone searches for this on the
Internet: I have renamed the subject to "netfilter: bridge: avoid
unused label warning" for consistency with other existing subject
lines in the repo.
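
For anyone who finds this thread by searching for the warning, here is a minimal userspace sketch of the pattern the commit message describes. It is not the code from the patch: FEATURE_DEFRAG_IPV4_SKETCH is a hypothetical 0/1 macro standing in for IS_ENABLED(CONFIG_NF_DEFRAG_IPV4), and the function and its parameters are invented for illustration.

```c
/* Hypothetical stand-in for IS_ENABLED(CONFIG_NF_DEFRAG_IPV4): it is
 * always defined, as 0 or 1, so it can appear in an ordinary C condition.
 */
#define FEATURE_DEFRAG_IPV4_SKETCH 0

/* Returns 0 when the packet is passed on, -1 when it is dropped. */
int queue_xmit_sketch(int is_ipv4, int header_ok)
{
	/* This branch is always compiled, so the 'drop' label is always
	 * referenced and -Wunused-label has nothing to complain about.
	 * With the macro set to 0 the compiler can discard the branch
	 * as dead code.
	 */
	if (FEATURE_DEFRAG_IPV4_SKETCH && is_ipv4 && !header_ok)
		goto drop;

	return 0;

drop:
	return -1;
}
```

With an #ifdef around the branch instead, the goto would vanish from the source in the disabled configuration and the label would be left unreferenced, which is the situation the warning reports.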

[PATCH net] ipv6: Don't call with rt6_uncached_list_flush_dev

2015-10-12 Thread Eric W. Biederman


As originally written, rt6_uncached_list_flush_dev makes no sense when
called with dev == NULL, as it attempts to flush all uncached routes
regardless of network namespace.  That is simply incorrect behavior.

Furthermore at the point rt6_ifdown is called with dev == NULL no more
network devices exist in the network namespace so even if the code in
rt6_uncached_list_flush_dev were to attempt something sensible it
would be meaningless.

Therefore remove support in rt6_uncached_list_flush_dev for handling
network devices where dev == NULL, and only call rt6_uncached_list_flush_dev
when rt6_ifdown is called with a network device.

Fixes: 8d0b94afdca8 ("ipv6: Keep track of DST_NOCACHE routes in case of iface down/unregister")
Signed-off-by: "Eric W. Biederman" 
---
 net/ipv6/route.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index b8f85f143b69..1c45d7d90718 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -143,6 +143,9 @@ static void rt6_uncached_list_flush_dev(struct net *net, 
struct net_device *dev)
struct net_device *loopback_dev = net->loopback_dev;
int cpu;
 
+   if (dev == loopback_dev)
+   return;
+
for_each_possible_cpu(cpu) {
struct uncached_list *ul = per_cpu_ptr(&rt6_uncached_list, cpu);
struct rt6_info *rt;
@@ -152,14 +155,12 @@ static void rt6_uncached_list_flush_dev(struct net *net, 
struct net_device *dev)
struct inet6_dev *rt_idev = rt->rt6i_idev;
struct net_device *rt_dev = rt->dst.dev;
 
-   if (rt_idev && (rt_idev->dev == dev || !dev) &&
-   rt_idev->dev != loopback_dev) {
+   if (rt_idev->dev == dev) {
rt->rt6i_idev = in6_dev_get(loopback_dev);
in6_dev_put(rt_idev);
}
 
-   if (rt_dev && (rt_dev == dev || !dev) &&
-   rt_dev != loopback_dev) {
+   if (rt_dev == dev) {
rt->dst.dev = loopback_dev;
dev_hold(rt->dst.dev);
dev_put(rt_dev);
@@ -2600,7 +2601,8 @@ void rt6_ifdown(struct net *net, struct net_device *dev)
 
fib6_clean_all(net, fib6_ifdown, );
icmp6_clean_all(fib6_ifdown, );
-   rt6_uncached_list_flush_dev(net, dev);
+   if (dev)
+   rt6_uncached_list_flush_dev(net, dev);
 }
 
 struct rt6_mtu_change_arg {
-- 
2.2.1
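
The behavior the commit message above argues for can be sketched in userspace C. This is only a rough model under stated assumptions: the struct and function names are hypothetical, a flat array stands in for the per-cpu uncached lists, and a plain counter stands in for the dev_hold()/dev_put() reference counting.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for net_device and rt6_info. */
struct dev_sketch {
	const char *name;
	int refcnt;
};

struct route_sketch {
	struct dev_sketch *dev;
};

/* Re-point every route that uses 'dev' at 'loopback'.
 * 'dev' must not be NULL; the caller is responsible for that check.
 */
void flush_dev_sketch(struct route_sketch *routes, size_t n,
		      struct dev_sketch *dev, struct dev_sketch *loopback)
{
	size_t i;

	if (dev == loopback)
		return;

	for (i = 0; i < n; i++) {
		if (routes[i].dev == dev) {
			routes[i].dev = loopback;
			loopback->refcnt++;	/* models dev_hold() */
			dev->refcnt--;		/* models dev_put() */
		}
	}
}

/* Models the caller: only flush when a specific device goes down. */
void ifdown_sketch(struct route_sketch *routes, size_t n,
		   struct dev_sketch *dev, struct dev_sketch *loopback)
{
	if (dev)
		flush_dev_sketch(routes, n, dev, loopback);
}
```

Calling ifdown_sketch() with a NULL device leaves every route untouched, which is the point of the change: there is no "flush everything" mode any more.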


[PATCH net-next v5] net: ipv6: Make address flushing on ifdown optional

2015-10-12 Thread David Ahern

Currently, all ipv6 addresses are flushed when the interface is configured
down, including global, static addresses:

$ ip -6 addr add dev eth1 2000:11:1:1::1/64
$ ip addr show dev eth1
3: eth1:  mtu 1500 qdisc noop state DOWN group default 
qlen 1000
link/ether 02:04:11:22:33:01 brd ff:ff:ff:ff:ff:ff
inet6 2000:11:1:1::1/64 scope global tentative
   valid_lft forever preferred_lft forever
$ ip link set dev eth1 up
$ ip link set dev eth1 down
$ ip addr show dev eth1
3: eth1:  mtu 1500 qdisc pfifo_fast state DOWN group 
default qlen 1000
link/ether 02:04:11:22:33:01 brd ff:ff:ff:ff:ff:ff

Add a new sysctl to make this behavior optional. The new setting defaults to
flushing all addresses to maintain backwards compatibility. When the setting is
reset (set to 0), global addresses with no expiry time are not flushed:

$ echo 0 > /proc/sys/net/ipv6/conf/eth1/flush_addr_on_down
$ ip -6 addr add dev eth1 2000:11:1:1::1/64
$ ip addr show dev eth1
3: eth1:  mtu 1500 qdisc pfifo_fast state DOWN group 
default qlen 1000
link/ether 02:04:11:22:33:01 brd ff:ff:ff:ff:ff:ff
inet6 2000:11:1:1::1/64 scope global tentative
   valid_lft forever preferred_lft forever
$ ip link set dev eth1 up
$ ip link set dev eth1 down
$ ip addr show dev eth1
3: eth1:  mtu 1500 qdisc pfifo_fast state DOWN group 
default qlen 1000
link/ether 02:04:11:22:33:01 brd ff:ff:ff:ff:ff:ff
inet6 2000:11:1:1::1/64 scope global
   valid_lft forever preferred_lft forever
inet6 fe80::4:11ff:fe22:3301/64 scope link
   valid_lft forever preferred_lft forever
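
The rule for global addresses, as worded in the sysctl description in this patch, can be written down as a small predicate. This is a sketch, not the addrconf.c logic: the struct, the "forever" marker and the function name are hypothetical, and link-local addresses are deliberately not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical marker for "no expiration time". */
#define LIFETIME_FOREVER_SKETCH UINT32_MAX

/* Hypothetical, simplified view of a global IPv6 address. */
struct addr_sketch {
	bool user_managed;	/* statically configured by the admin */
	uint32_t valid_lft;	/* valid lifetime in seconds */
};

/* Returns true if a global address is flushed when the link goes down. */
bool global_addr_flushed_on_down_sketch(const struct addr_sketch *a,
					bool flush_addr_on_down)
{
	/* Default behavior: everything is flushed. */
	if (flush_addr_on_down)
		return true;

	/* Setting disabled: keep static addresses that never expire. */
	return !(a->user_managed &&
		 a->valid_lft == LIFETIME_FOREVER_SKETCH);
}
```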

Signed-off-by: David Ahern 
---
v5:
- renamed managed to user_managed as requested by Hannes
- handle addrconf_dst_alloc failure and cleanup ifp as noted by Dave
  -- tested by faking allocation failure
- minor ordering changes in addrconf_ifdown() to handle changes under lock

v4:
- rebased to top of tree
- updated to clear all routes on admin down and re-added on admin up
- verified the route tables (main and local) on a link down have *no*
  remnants of the configured, global address. On a link up all routes
  are restored -- multicast, linklocal, local routes and connected.

v3:
- fix local variable ordering and comment style per Dave's comment
- consistency in DEVCONF naming per Brian Haley's comment
- added entry to Documentation/networking/ip-sysctl.txt

v2:
- only keep static addresses as suggested by Hannes
- added new managed flag to track configured addresses
- on ifdown do not remove configured addresses from inet6_addr_lst
- on ifdown reset the TENTATIVE flag and set state to DAD so that DAD is
  redone when link is brought up again

 Documentation/networking/ip-sysctl.txt |   6 ++
 include/linux/ipv6.h   |   1 +
 include/net/if_inet6.h |   1 +
 include/uapi/linux/ipv6.h  |   1 +
 net/ipv6/addrconf.c| 124 +
 5 files changed, 118 insertions(+), 15 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index ebe94f2cab98..51c60f58f7ec 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1432,6 +1432,12 @@ dad_transmits - INTEGER
The amount of Duplicate Address Detection probes to send.
Default: 1
 
+flush_addr_on_down - BOOLEAN
+   Flush all IPv6 addresses on an interface down event. If disabled
+   static global addresses with no expiration time are not flushed.
+
+   Default: enabled
+
 forwarding - INTEGER
Configure interface-specific Host/Router behaviour.
 
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 0ef2a97ccdb5..112a18940ab2 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -60,6 +60,7 @@ struct ipv6_devconf {
struct in6_addr secret;
} stable_secret;
__s32   use_oif_addrs_only;
+   __s32   flush_addr_on_down;
void*sysctl;
 };
 
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index 1c8b6820b694..01ba6a286a4b 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -72,6 +72,7 @@ struct inet6_ifaddr {
int regen_count;
 
booltokenized;
+   booluser_managed;
 
struct rcu_head rcu;
struct in6_addr peer_addr;
diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h
index 38b4fef20219..7c514f7cd209 100644
--- a/include/uapi/linux/ipv6.h
+++ b/include/uapi/linux/ipv6.h
@@ -174,6 +174,7 @@ enum {
DEVCONF_USE_OIF_ADDRS_ONLY,
DEVCONF_ACCEPT_RA_MIN_HOP_LIMIT,
DEVCONF_IGNORE_ROUTES_WITH_LINKDOWN,
+   DEVCONF_FLUSH_ADDR_ON_DOWN,

Re: [PATCH net-next 0/3] net: Pass net into defragmentation

2015-10-12 Thread Nicolas Dichtel


On 09/10/2015 20:42, Eric W. Biederman wrote:


This is the next installment of my work to pass struct net through the
output path so the code does not need to guess how to figure out which
network namespace it is in, and ultimately routes can have output
devices in another network namespace.

In netfilter and af_packet we defragment packets in the output path,
and there is the usual amount of confusion about how to compute which
net we are processing the packets in.  This patchset clears that
confusion up by explicitly passing in struct net in ip_defrag,
ip_check_defrag, and nf_ct_frag6_gather.


Acked-by: Nicolas Dichtel 

Re: [PATCH 0/2] atm: iphase: Fix misleading indention and return -ENOMEM on error

2015-10-12 Thread Charles (Chas) Williams

On Sat, 2015-10-10 at 21:47 +0200, Tillmann Heidsieck wrote:
> this series fixes two of them. The if(); warning would require
> restructuring the code to a larger extent. Beyond this there remains a
> whopping number of > 2k checkpatch.pl warnings and errors each. Those
> can be grouped into 
...
> Generally I would not mind cleaning all this up for those who have to
> make functional changes to the driver. However, I would like to know
> from the maintainers if such an effort would be welcome or not.

It doesn't bother me if you do this.  I can review it.



Re: [PATCH v4] net/bonding: send arp in interval if no active slave

2015-10-12 Thread Jarod Wilson


Jay Vosburgh wrote:

Jarod Wilson  wrote:


Jarod Wilson wrote:
...

As Andy already stated I'm not a fan of such workarounds either but it's
necessary sometimes so if this is going to be actually considered then a
few things need to be fixed. Please make this a proper bonding option
which can be changed at runtime and not only via a module parameter.

Is there any particular userspace tool that would need some updating, or
is adding the sysfs knobs sufficient here? I think I've got all the sysfs
stuff thrown together now, but still need to test.


Most (all?) bonding options should be configurable via iproute
(netlink) now.


D'oh, of course. I've done the kernel-side netlink bits now too, and 
started looking at the iproute source. However...




Now, I saw that you've only tested with 500 ms, can't this be fixed by
using
a different interval ? This seems like a very specific problem to have a
whole new option for.

...I'll wait until we've heard confirmation from Uwe that intervals
other than 500ms don't fix things.

Okay, so I believe the "only tested with 500ms" was in reference to
testing with Uwe's initial patch. I do have supporting evidence in a
bugzilla report that shows upwards of 5000ms still experience the problem
here.


I did set up some switches and attempt to reproduce this
yesterday; I daisy-chained three switches (two Cisco and an HP) together
and connected the bonded interfaces to the "end" switches.  I tried
various ARP targets (the switch, hosts on various points of the switch)
and varying arp_intervals and was unable to reproduce the problem.

As I understand it, the working theory is something like this:

- host with two bonded interfaces, A and B.  For active-backup
mode, the interfaces have been assigned the same MAC address.

- switch has MAC for B in its forwarding table

- bonding goes from down to up, and thinks all its slaves are
down, and starts the "curr_arp_slave" search for an active
arp_ip_target.  In this case, it starts with A, and sends an ARP from A.

As an aside, I'm not 100% clear on what exactly is going on in
the "bonding goes from down to up" transition; this seems to be key in
reproducing the issue.

- switch sees source mac coming from port A, starts to update
its forwarding table

- meanwhile, switch forwards ARP request, and receives ARP
reply, which it forwards to port B.  Bonding drops this, as the slave is
inactive.

- switch finishes updating forwarding table, MAC is now assigned
to port A.

- bonding now tries sending on port B, and the cycle repeats.

If this is what's taking place, then the arp_interval itself is
irrelevant, the race is between the switch table update and the
generation of the ARP reply.
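
To make the working theory above concrete, here is a toy userspace model of it. Everything in it is an assumption made for illustration: two slaves sharing one MAC, a switch that forwards the reply using whatever port its table holds when the reply arrives, and bonding rotating to the other slave after each failed probe. It is not bonding code.

```c
#include <stdbool.h>

/* Returns the 1-based probe number of the first ARP reply that reaches
 * the probing slave, or 0 if no reply is accepted within 'rounds' probes.
 *
 * reply_beats_table_update: true models the suspected race, where the
 * ARP reply is forwarded before the switch has finished moving the MAC
 * to the port the probe came from.
 */
int first_accepted_probe_sketch(int rounds, bool reply_beats_table_update)
{
	int fdb_port = 1;	/* switch currently maps the MAC to port B */
	int probe_slave = 0;	/* bonding starts probing on slave A */
	int i;

	for (i = 1; i <= rounds; i++) {
		/* The reply follows the old table entry if it wins the race. */
		int reply_port = reply_beats_table_update ? fdb_port
							  : probe_slave;

		/* The switch finishes learning: MAC is now on the probe port. */
		fdb_port = probe_slave;

		if (reply_port == probe_slave)
			return i;	/* reply seen on the probing slave */

		/* Reply landed on the inactive slave and was dropped;
		 * bonding tries the other slave next.
		 */
		probe_slave = 1 - probe_slave;
	}

	return 0;
}
```

Note that the model has no interval parameter at all. If the theory holds, only the ordering of "reply forwarded" versus "table updated" matters, which matches the remark that arp_interval is irrelevant.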

Also, presuming the above is what's going on, we could modify
the ARP "curr_arp_slave" logic a bit to resolve this without requiring
any magic knobs.


I really like this idea. Still trying to grasp exactly how we get into 
this situation and what everything looks like as we hop through the 
various bond_ab_arp_* functions though.



For example, we could change the "drop on inactive" logic to
recognise the "curr_arp_slave" search and accept the unicast ARP reply,
and perhaps make that receiving slave the next curr_arp_slave
automatically.


Nothing ever actually getting picked as curr_arp_slave does appear to be 
the problem, so that does sound like it could do the trick.



I also wonder if the fail_over_mac option would affect this
behavior, as it would cause the slaves to keep their MAC address for the
duration, so the switch would not see the MAC move from port to port.


Not sure if that's an option for the particular environment, but we 
could certainly ask Uwe to give it a try.



Another thought would be to have the curr_arp_slave cycle
through the slaves in random order, but that could create
non-deterministic results even when things are working correctly.


I'd say avoid this route if at all possible, would rather not make 
things less predictable.


--
Jarod Wilson
ja...@redhat.com



Re: Issue with /proc/sys/net/ipv4/tcp_mem

2015-10-12 Thread Eric W. Biederman

wangyufen  writes:

> Hi,
>
> I tried on linux-4.1:
> linux:~# cat /proc/sys/net/ipv4/tcp_mem 
> 8388608   12582912   16777216
> linux:~# echo 1234 >/proc/sys/net/ipv4/tcp_mem 
> -bash: echo: write error: Invalid argument
> linux:~# cat /proc/sys/net/ipv4/tcp_mem 
> 1234   12582912   16777216
>
> The echo operation returned an error, but the value had already been
> written to tcp_mem.
>
> I checked; patch f594d63199688ad568fb caused the issue.

Use the cgroup interface if you want per cgroup control.

Eric

[patch net-next v3 3/7] switchdev: allow caller to explicitly request attr_set as deferred

2015-10-12 Thread Jiri Pirko

From: Jiri Pirko 

Caller should know if he can call attr_set directly (when holding RTNL)
or if he has to defer the attr_set processing for later.

This also allows drivers to sleep inside attr_set and report operation
status back to switchdev core. Switchdev core then warns if status is
not ok, instead of silent errors happening in drivers.
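
The status reporting relies on the two-phase prepare/commit model described in the comments of the patch below. A minimal userspace sketch of that model follows; the port structure, the resource counter and all names ending in _sketch are hypothetical and only illustrate the control flow.

```c
#include <stdbool.h>
#include <errno.h>

struct trans_sketch {
	bool ph_prepare;	/* true in the prepare phase */
};

/* Hypothetical device state. */
struct port_sketch {
	int free_slots;		/* resources still available */
	int reserved;		/* resources reserved by prepare */
	int stp_state;		/* the attribute being set */
};

/* Models a driver op that is called once per phase. */
static int port_attr_set_sketch(struct port_sketch *p, int state,
				struct trans_sketch *trans)
{
	if (trans->ph_prepare) {
		/* Phase I: fail here if commit could not succeed. */
		if (p->free_slots == 0)
			return -ENOSPC;
		p->free_slots--;
		p->reserved++;
		return 0;
	}

	/* Phase II: use the reservation; this is not expected to fail. */
	p->reserved--;
	p->stp_state = state;
	return 0;
}

/* Models the core: prepare first, commit only if prepare succeeded. */
int attr_set_now_sketch(struct port_sketch *p, int state)
{
	struct trans_sketch trans;
	int err;

	trans.ph_prepare = true;
	err = port_attr_set_sketch(p, state, &trans);
	if (err)
		return err;	/* nothing was committed */

	trans.ph_prepare = false;
	return port_attr_set_sketch(p, state, &trans);
}
```

A failed prepare leaves the attribute untouched, so the caller gets an error code instead of a half-applied change.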

Signed-off-by: Jiri Pirko 
---
 include/net/switchdev.h   |   1 +
 net/bridge/br_stp.c   |   3 +-
 net/switchdev/switchdev.c | 108 --
 3 files changed, 60 insertions(+), 52 deletions(-)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index d2879f2..6b109e4 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -17,6 +17,7 @@
 
 #define SWITCHDEV_F_NO_RECURSE BIT(0)
 #define SWITCHDEV_F_SKIP_EOPNOTSUPPBIT(1)
+#define SWITCHDEV_F_DEFER  BIT(2)
 
 struct switchdev_trans_item {
struct list_head list;
diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
index db6d243de..80c34d7 100644
--- a/net/bridge/br_stp.c
+++ b/net/bridge/br_stp.c
@@ -41,13 +41,14 @@ void br_set_state(struct net_bridge_port *p, unsigned int 
state)
 {
struct switchdev_attr attr = {
.id = SWITCHDEV_ATTR_ID_PORT_STP_STATE,
+   .flags = SWITCHDEV_F_DEFER,
.u.stp_state = state,
};
int err;
 
p->state = state;
err = switchdev_port_attr_set(p->dev, &attr);
-   if (err && err != -EOPNOTSUPP)
+   if (err)
br_warn(p->br, "error setting offload STP state on port 
%u(%s)\n",
(unsigned int) p->port_no, p->dev->name);
 }
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index f782eba..5a92d08 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -173,6 +173,51 @@ done:
return err;
 }
 
+static int switchdev_port_attr_set_now(struct net_device *dev,
+  struct switchdev_attr *attr)
+{
+   struct switchdev_trans trans;
+   int err;
+
+   ASSERT_RTNL();
+
+   switchdev_trans_init(&trans);
+
+   /* Phase I: prepare for attr set. Driver/device should fail
+* here if there are going to be issues in the commit phase,
+* such as lack of resources or support.  The driver/device
+* should reserve resources needed for the commit phase here,
+* but should not commit the attr.
+*/
+
+   trans.ph_prepare = true;
+   err = __switchdev_port_attr_set(dev, attr, &trans);
+   if (err) {
+   /* Prepare phase failed: abort the transaction.  Any
+* resources reserved in the prepare phase are
+* released.
+*/
+
+   if (err != -EOPNOTSUPP)
+   switchdev_trans_items_destroy(&trans);
+
+   return err;
+   }
+
+   /* Phase II: commit attr set.  This cannot fail as a fault
+* of driver/device.  If it does, it's a bug in the driver/device
+* because the driver said everythings was OK in phase I.
+*/
+
+   trans.ph_prepare = false;
+   err = __switchdev_port_attr_set(dev, attr, &trans);
+   WARN(err, "%s: Commit of attribute (id=%d) failed.\n",
+dev->name, attr->id);
+   switchdev_trans_items_warn_destroy(dev, &trans);
+
+   return err;
+}
+
 struct switchdev_attr_set_work {
struct work_struct work;
struct net_device *dev;
@@ -183,14 +228,17 @@ static void switchdev_port_attr_set_work(struct 
work_struct *work)
 {
struct switchdev_attr_set_work *asw =
container_of(work, struct switchdev_attr_set_work, work);
+   bool rtnl_locked = rtnl_is_locked();
int err;
 
-   rtnl_lock();
-   err = switchdev_port_attr_set(asw->dev, &asw->attr);
+   if (!rtnl_locked)
+   rtnl_lock();
+   err = switchdev_port_attr_set_now(asw->dev, &asw->attr);
if (err && err != -EOPNOTSUPP)
netdev_err(asw->dev, "failed (err=%d) to set attribute 
(id=%d)\n",
   err, asw->attr.id);
-   rtnl_unlock();
+   if (!rtnl_locked)
+   rtnl_unlock();
 
dev_put(asw->dev);
kfree(work);
@@ -211,7 +259,7 @@ static int switchdev_port_attr_set_defer(struct net_device 
*dev,
asw->dev = dev;
memcpy(&asw->attr, attr, sizeof(asw->attr));
 
-   schedule_work(&asw->work);
+   queue_work(switchdev_wq, &asw->work);
 
return 0;
 }
@@ -225,57 +273,15 @@ static int switchdev_port_attr_set_defer(struct 
net_device *dev,
  * Use a 2-phase prepare-commit transaction model to ensure
  * system is not left in a partially updated state due to
  * failure from driver/device.
+ *
+ * rtnl_lock must be held and must not be in atomic section,
+ * in case SWITCHDEV_F_DEFER flag is not set.
  */
 int switchdev_port_attr_set(struct net_device *dev, struct

[patch net-next v3 1/7] switchdev: assert rtnl in switchdev_port_obj_del

2015-10-12 Thread Jiri Pirko

From: Jiri Pirko 

RTNL mutex needs to be held for this function.
Safe usage of netdev_for_each_lower_dev requires that.

Signed-off-by: Jiri Pirko 
---
 net/switchdev/switchdev.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 7a9ab90..a79ee44 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -366,6 +366,8 @@ EXPORT_SYMBOL_GPL(switchdev_port_obj_add);
  * @dev: port device
  * @id: object ID
  * @obj: object to delete
+ *
+ * rtnl_lock must be held.
  */
 int switchdev_port_obj_del(struct net_device *dev,
   const struct switchdev_obj *obj)
@@ -375,6 +377,8 @@ int switchdev_port_obj_del(struct net_device *dev,
struct list_head *iter;
int err = -EOPNOTSUPP;
 
+   ASSERT_RTNL();
+
if (ops && ops->switchdev_port_obj_del)
return ops->switchdev_port_obj_del(dev, obj);
 
-- 
1.9.3


[patch net-next v3 0/7] switchdev: change locking

2015-10-12 Thread Jiri Pirko

From: Jiri Pirko 

This is something which I'm currently struggling with.
Callers of attr_set and obj_add/del often hold not only RTNL, but also
spinlock (bridge). So in that case, the driver implementing the op cannot sleep.

The way rocker is dealing with this now is just to invoke driver operation
and go out, without any checking or reporting of the operation status.

Since it would be nice to at least put a warning in case the operation fails,
it makes sense to do this in delayed work directly in switchdev core
instead of implementing this in separate drivers. And that is what this patchset
is introducing.

So from now on, the locking of switchdev mod ops is consistent. Caller either
holds rtnl mutex or in case it does not, caller sets defer flag, telling
switchdev core to process the op later in delayed work.

Flush function for switchdev deferred ops can be called by op
caller in appropriate location, for example after it releases
spin lock, to force switchdev core to process pending ops.
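
The calling convention described above can be sketched as follows. This is a userspace toy, not the switchdev implementation: the flag value, the fixed-size array used instead of a workqueue, and every name ending in _sketch are assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical flag, mirroring the idea of a "defer" request. */
#define DEFER_FLAG_SKETCH	0x4u
#define MAX_DEFERRED_SKETCH	16

typedef int (*op_fn_sketch)(int arg);

struct deferred_op_sketch {
	op_fn_sketch fn;
	int arg;
};

static struct deferred_op_sketch queue_sketch[MAX_DEFERRED_SKETCH];
static size_t queue_len_sketch;

/* Example op: fails for negative arguments. */
int sample_op_sketch(int arg)
{
	return arg < 0 ? -1 : 0;
}

/* Run the op now, or queue it when the caller asked for deferral. */
int op_call_sketch(op_fn_sketch fn, int arg, unsigned int flags)
{
	if (!(flags & DEFER_FLAG_SKETCH))
		return fn(arg);	/* caller gets the status directly */

	if (queue_len_sketch == MAX_DEFERRED_SKETCH)
		return -1;	/* queue full */

	queue_sketch[queue_len_sketch].fn = fn;
	queue_sketch[queue_len_sketch].arg = arg;
	queue_len_sketch++;
	return 0;		/* real status is only known later */
}

/* Process all pending ops; returns how many of them failed. */
int flush_deferred_sketch(void)
{
	size_t i;
	int failed = 0;

	for (i = 0; i < queue_len_sketch; i++) {
		if (queue_sketch[i].fn(queue_sketch[i].arg) != 0)
			failed++;	/* the core would warn here */
	}
	queue_len_sketch = 0;

	return failed;
}
```

A caller that holds a spinlock would pass the defer flag, release the lock, and then call the flush function so that failures are reported rather than silently lost.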

v1->v2:
- rebased on current net-next head (including Scott's ageing patchset)
v2->v3:
- fixed comment s/of/or/ typo suggested by Nik

Jiri Pirko (7):
  switchdev: assert rtnl in switchdev_port_obj_del
  switchdev: introduce switchdev workqueue
  switchdev: allow caller to explicitly request attr_set as deferred
  switchdev: remove pointers from switchdev objects
  switchdev: introduce possibility to defer obj_add/del
  bridge: defer switchdev fdb del call in fdb_del_external_learn
  rocker: remove nowait from switchdev callbacks.

 drivers/net/ethernet/rocker/rocker.c |  13 +-
 include/net/switchdev.h  |  14 +-
 net/bridge/br_fdb.c  |   7 +-
 net/bridge/br_if.c   |   3 +
 net/bridge/br_stp.c  |   3 +-
 net/dsa/slave.c  |   2 +-
 net/switchdev/switchdev.c| 276 +--
 7 files changed, 222 insertions(+), 96 deletions(-)

-- 
1.9.3


[patch net-next v3 6/7] bridge: defer switchdev fdb del call in fdb_del_external_learn

2015-10-12 Thread Jiri Pirko

From: Jiri Pirko 

Since spinlock is held here, defer the switchdev operation.

Signed-off-by: Jiri Pirko 
---
 net/bridge/br_fdb.c | 5 -
 net/bridge/br_if.c  | 3 +++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index f5e7da0..c88bd8e 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -134,7 +134,10 @@ static void fdb_del_hw_addr(struct net_bridge *br, const 
unsigned char *addr)
 static void fdb_del_external_learn(struct net_bridge_fdb_entry *f)
 {
struct switchdev_obj_port_fdb fdb = {
-   .obj.id = SWITCHDEV_OBJ_ID_PORT_FDB,
+   .obj = {
+   .id = SWITCHDEV_OBJ_ID_PORT_FDB,
+   .flags = SWITCHDEV_F_DEFER,
+   },
.vid = f->vlan_id,
};
 
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 934cae9..09147cb 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "br_private.h"
 
@@ -249,6 +250,8 @@ static void del_nbp(struct net_bridge_port *p)
list_del_rcu(&p->list);
 
br_fdb_delete_by_port(br, p, 0, 1);
+   switchdev_flush_deferred();
+
nbp_update_port_count(br);
 
netdev_upper_dev_unlink(dev, br->dev);
-- 
1.9.3


[patch net-next v3 7/7] rocker: remove nowait from switchdev callbacks.

2015-10-12 Thread Jiri Pirko

From: Jiri Pirko 

No need to avoid sleeping in switchdev callbacks now, as the switchdev
core allows it.

Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/rocker/rocker.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/rocker/rocker.c 
b/drivers/net/ethernet/rocker/rocker.c
index bb956a5..9629c5b5 100644
--- a/drivers/net/ethernet/rocker/rocker.c
+++ b/drivers/net/ethernet/rocker/rocker.c
@@ -3672,7 +3672,7 @@ static int rocker_port_fdb_flush(struct rocker_port 
*rocker_port,
rocker_port->stp_state == BR_STATE_FORWARDING)
return 0;
 
-   flags |= ROCKER_OP_FLAG_REMOVE;
+   flags |= ROCKER_OP_FLAG_NOWAIT | ROCKER_OP_FLAG_REMOVE;
 
spin_lock_irqsave(>fdb_tbl_lock, lock_flags);
 
@@ -4382,8 +4382,7 @@ static int rocker_port_attr_set(struct net_device *dev,
 
switch (attr->id) {
case SWITCHDEV_ATTR_ID_PORT_STP_STATE:
-   err = rocker_port_stp_update(rocker_port, trans,
-ROCKER_OP_FLAG_NOWAIT,
+   err = rocker_port_stp_update(rocker_port, trans, 0,
 attr->u.stp_state);
break;
case SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS:
@@ -4517,7 +4516,7 @@ static int rocker_port_fdb_del(struct rocker_port 
*rocker_port,
   const struct switchdev_obj_port_fdb *fdb)
 {
__be16 vlan_id = rocker_port_vid_to_vlan(rocker_port, fdb->vid, NULL);
-   int flags = ROCKER_OP_FLAG_NOWAIT | ROCKER_OP_FLAG_REMOVE;
+   int flags = ROCKER_OP_FLAG_REMOVE;
 
if (!rocker_port_is_bridged(rocker_port))
return -EINVAL;
-- 
1.9.3


[patch net-next v3 4/7] switchdev: remove pointers from switchdev objects

2015-10-12 Thread Jiri Pirko

From: Jiri Pirko 

When an object is used in deferred work, we cannot use pointers in
switchdev object structures because the memory they point at may already be
used by someone else. So make a local copy of the value instead.
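
The difference can be shown with a small userspace sketch. The struct names are hypothetical, and memcpy() is used here as a stand-in for the kernel's ether_addr_copy():

```c
#include <string.h>

#define ETH_ALEN_SKETCH 6

/* Before: the object only borrows the caller's buffer. */
struct fdb_by_pointer_sketch {
	const unsigned char *addr;
};

/* After: the object owns a copy of the address. */
struct fdb_by_value_sketch {
	unsigned char addr[ETH_ALEN_SKETCH];
};

/* Copy the address into the object so it stays valid even if the
 * caller's buffer is reused before the deferred work runs.
 */
void fdb_fill_by_value_sketch(struct fdb_by_value_sketch *fdb,
			      const unsigned char *addr)
{
	memcpy(fdb->addr, addr, ETH_ALEN_SKETCH);
}
```

If the caller overwrites its buffer after queueing the work, the pointer variant sees the new bytes while the by-value variant still holds the address that was originally passed in.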

Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/rocker/rocker.c |  6 +++---
 include/net/switchdev.h  |  7 +++
 net/bridge/br_fdb.c  |  2 +-
 net/dsa/slave.c  |  2 +-
 net/switchdev/switchdev.c| 11 +++
 5 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/rocker/rocker.c 
b/drivers/net/ethernet/rocker/rocker.c
index eafa907..bb956a5 100644
--- a/drivers/net/ethernet/rocker/rocker.c
+++ b/drivers/net/ethernet/rocker/rocker.c
@@ -4469,7 +4469,7 @@ static int rocker_port_obj_add(struct net_device *dev,
fib4 = SWITCHDEV_OBJ_IPV4_FIB(obj);
err = rocker_port_fib_ipv4(rocker_port, trans,
   htonl(fib4->dst), fib4->dst_len,
-  fib4->fi, fib4->tb_id, 0);
+  &fib4->fi, fib4->tb_id, 0);
break;
case SWITCHDEV_OBJ_ID_PORT_FDB:
err = rocker_port_fdb_add(rocker_port, trans,
@@ -4541,7 +4541,7 @@ static int rocker_port_obj_del(struct net_device *dev,
fib4 = SWITCHDEV_OBJ_IPV4_FIB(obj);
err = rocker_port_fib_ipv4(rocker_port, NULL,
   htonl(fib4->dst), fib4->dst_len,
-  fib4->fi, fib4->tb_id,
+  &fib4->fi, fib4->tb_id,
   ROCKER_OP_FLAG_REMOVE);
break;
case SWITCHDEV_OBJ_ID_PORT_FDB:
@@ -4571,7 +4571,7 @@ static int rocker_port_fdb_dump(const struct rocker_port 
*rocker_port,
hash_for_each_safe(rocker->fdb_tbl, bkt, tmp, found, entry) {
if (found->key.rocker_port != rocker_port)
continue;
-   fdb->addr = found->key.addr;
+   ether_addr_copy(fdb->addr, found->key.addr);
fdb->ndm_state = NUD_REACHABLE;
fdb->vid = rocker_port_vlan_to_vid(rocker_port,
   found->key.vlan_id);
diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index 6b109e4..767d516 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define SWITCHDEV_F_NO_RECURSE BIT(0)
 #define SWITCHDEV_F_SKIP_EOPNOTSUPPBIT(1)
@@ -59,8 +60,6 @@ struct switchdev_attr {
} u;
 };
 
-struct fib_info;
-
 enum switchdev_obj_id {
SWITCHDEV_OBJ_ID_UNDEFINED,
SWITCHDEV_OBJ_ID_PORT_VLAN,
@@ -88,7 +87,7 @@ struct switchdev_obj_ipv4_fib {
struct switchdev_obj obj;
u32 dst;
int dst_len;
-   struct fib_info *fi;
+   struct fib_info fi;
u8 tos;
u8 type;
u32 nlflags;
@@ -101,7 +100,7 @@ struct switchdev_obj_ipv4_fib {
 /* SWITCHDEV_OBJ_ID_PORT_FDB */
 struct switchdev_obj_port_fdb {
struct switchdev_obj obj;
-   const unsigned char *addr;
+   unsigned char addr[ETH_ALEN];
u16 vid;
u16 ndm_state;
 };
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index f43ce05..f5e7da0 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -135,10 +135,10 @@ static void fdb_del_external_learn(struct 
net_bridge_fdb_entry *f)
 {
struct switchdev_obj_port_fdb fdb = {
.obj.id = SWITCHDEV_OBJ_ID_PORT_FDB,
-   .addr = f->addr.addr,
.vid = f->vlan_id,
};
 
+   ether_addr_copy(fdb.addr, f->addr.addr);
switchdev_port_obj_del(f->dst->dev, &fdb.obj);
 }
 
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index bb2bd3b..2ad4e0e 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -393,7 +393,7 @@ static int dsa_slave_port_fdb_dump(struct net_device *dev,
if (ret < 0)
break;
 
-   fdb->addr = addr;
+   ether_addr_copy(fdb->addr, addr);
fdb->vid = vid;
fdb->ndm_state = is_static ? NUD_NOARP : NUD_REACHABLE;
 
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 5a92d08..77b616b 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -842,10 +843,10 @@ int switchdev_port_fdb_add(struct ndmsg *ndm, struct 
nlattr *tb[],
 {
struct switchdev_obj_port_fdb fdb = {
.obj.id = SWITCHDEV_OBJ_ID_PORT_FDB,
-   .addr = addr,
.vid = vid,
};
 
+   ether_addr_copy(fdb.addr, addr);
return switchdev_port_obj_add(dev, &fdb.obj);
 }

[patch net-next v3 5/7] switchdev: introduce possibility to defer obj_add/del

2015-10-12 Thread Jiri Pirko

From: Jiri Pirko 

Similar to the attr usecase, the caller knows if he is holding RTNL and is
in atomic section. So let the caller decide the correct call variant.

This allows drivers to sleep inside their ops and wait for hw to get the
operation status. Then the status is propagated into switchdev core.
This avoids silent errors in drivers.

Signed-off-by: Jiri Pirko 
---
 include/net/switchdev.h   |   1 +
 net/switchdev/switchdev.c | 137 --
 2 files changed, 110 insertions(+), 28 deletions(-)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index 767d516..14e2595 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -69,6 +69,7 @@ enum switchdev_obj_id {
 
 struct switchdev_obj {
enum switchdev_obj_id id;
+   u32 flags;
 };
 
 /* SWITCHDEV_OBJ_ID_PORT_VLAN */
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 77b616b..abb23b5 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -312,21 +312,8 @@ static int __switchdev_port_obj_add(struct net_device *dev,
return err;
 }
 
-/**
- * switchdev_port_obj_add - Add port object
- *
- * @dev: port device
- * @id: object ID
- * @obj: object to add
- *
- * Use a 2-phase prepare-commit transaction model to ensure
- * system is not left in a partially updated state due to
- * failure from driver/device.
- *
- * rtnl_lock must be held.
- */
-int switchdev_port_obj_add(struct net_device *dev,
-  const struct switchdev_obj *obj)
+static int switchdev_port_obj_add_now(struct net_device *dev,
+ const struct switchdev_obj *obj)
 {
struct switchdev_trans trans;
int err;
@@ -368,19 +355,9 @@ int switchdev_port_obj_add(struct net_device *dev,
 
return err;
 }
-EXPORT_SYMBOL_GPL(switchdev_port_obj_add);
 
-/**
- * switchdev_port_obj_del - Delete port object
- *
- * @dev: port device
- * @id: object ID
- * @obj: object to delete
- *
- * rtnl_lock must be held.
- */
-int switchdev_port_obj_del(struct net_device *dev,
-  const struct switchdev_obj *obj)
+static int switchdev_port_obj_del_now(struct net_device *dev,
+ const struct switchdev_obj *obj)
 {
const struct switchdev_ops *ops = dev->switchdev_ops;
struct net_device *lower_dev;
@@ -398,13 +375,117 @@ int switchdev_port_obj_del(struct net_device *dev,
 */
 
netdev_for_each_lower_dev(dev, lower_dev, iter) {
-   err = switchdev_port_obj_del(lower_dev, obj);
+   err = switchdev_port_obj_del_now(lower_dev, obj);
if (err)
break;
}
 
return err;
 }
+
+struct switchdev_obj_work {
+   struct work_struct work;
+   struct net_device *dev;
+   struct switchdev_obj obj;
+   bool add; /* add or del */
+};
+
+static void switchdev_port_obj_work(struct work_struct *work)
+{
+   struct switchdev_obj_work *ow =
+   container_of(work, struct switchdev_obj_work, work);
+   bool rtnl_locked = rtnl_is_locked();
+   int err;
+
+   if (!rtnl_locked)
+   rtnl_lock();
+   if (ow->add)
+   err = switchdev_port_obj_add_now(ow->dev, &ow->obj);
+   else
+   err = switchdev_port_obj_del_now(ow->dev, &ow->obj);
+   if (err && err != -EOPNOTSUPP)
+   netdev_err(ow->dev, "failed (err=%d) to %s object (id=%d)\n",
+  err, ow->add ? "add" : "del", ow->obj.id);
+   if (!rtnl_locked)
+   rtnl_unlock();
+
+   dev_put(ow->dev);
+   kfree(ow);
+}
+
+static int switchdev_port_obj_work_schedule(struct net_device *dev,
+   const struct switchdev_obj *obj,
+   bool add)
+{
+   struct switchdev_obj_work *ow;
+
+   ow = kmalloc(sizeof(*ow), GFP_ATOMIC);
+   if (!ow)
+   return -ENOMEM;
+
+   INIT_WORK(&ow->work, switchdev_port_obj_work);
+
+   dev_hold(dev);
+   ow->dev = dev;
+   memcpy(&ow->obj, obj, sizeof(ow->obj));
+   ow->add = add;
+
+   queue_work(switchdev_wq, &ow->work);
+   return 0;
+}
+
+static int switchdev_port_obj_add_defer(struct net_device *dev,
+   const struct switchdev_obj *obj)
+{
+   return switchdev_port_obj_work_schedule(dev, obj, true);
+}
+
+/**
+ * switchdev_port_obj_add - Add port object
+ *
+ * @dev: port device
+ * @id: object ID
+ * @obj: object to add
+ *
+ * Use a 2-phase prepare-commit transaction model to ensure
+ * system is not left in a partially updated state due to
+ * failure from driver/device.
+ *
+ * rtnl_lock must be held and must not be in atomic section,
+ * in case SWITCHDEV_F_DEFER flag is not set.
+

[patch net-next v3 2/7] switchdev: introduce switchdev workqueue

2015-10-12 Thread Jiri Pirko

From: Jiri Pirko 

This is going to be used for deferred operations.

Signed-off-by: Jiri Pirko 
---
 include/net/switchdev.h   |  5 +
 net/switchdev/switchdev.c | 20 
 2 files changed, 25 insertions(+)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index 1ce7083..d2879f2 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -205,6 +205,7 @@ int switchdev_port_fdb_dump(struct sk_buff *skb, struct 
netlink_callback *cb,
 void switchdev_port_fwd_mark_set(struct net_device *dev,
 struct net_device *group_dev,
 bool joining);
+void switchdev_flush_deferred(void);
 
 #else
 
@@ -326,6 +327,10 @@ static inline void switchdev_port_fwd_mark_set(struct 
net_device *dev,
 {
 }
 
+static inline void switchdev_flush_deferred(void)
+{
+}
+
 #endif
 
 #endif /* _LINUX_SWITCHDEV_H_ */
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index a79ee44..f782eba 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -17,9 +17,12 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
+static struct workqueue_struct *switchdev_wq;
+
 /**
  * switchdev_trans_item_enqueue - Enqueue data item to transaction queue
  *
@@ -1221,3 +1224,20 @@ void switchdev_port_fwd_mark_set(struct net_device *dev,
dev->offload_fwd_mark = mark;
 }
 EXPORT_SYMBOL_GPL(switchdev_port_fwd_mark_set);
+
+void switchdev_flush_deferred(void)
+{
+   flush_workqueue(switchdev_wq);
+}
+
+EXPORT_SYMBOL_GPL(switchdev_flush_deferred);
+
+static int __init switchdev_init(void)
+{
+   switchdev_wq = create_workqueue("switchdev");
+   if (!switchdev_wq)
+   return -ENOMEM;
+   return 0;
+}
+
+subsys_initcall(switchdev_init);
-- 
1.9.3


Re: [patch net-next v3 0/7] switchdev: change locking

2015-10-12 Thread Jiri Pirko

Damn, wrong branch, please ignore, will send v4 shortly.

Mon, Oct 12, 2015 at 05:45:43PM CEST, j...@resnulli.us wrote:
>From: Jiri Pirko 
>
>This is something which I'm currently struggling with.
>Callers of attr_set and obj_add/del often hold not only RTNL, but also
>spinlock (bridge). So in that case, the driver implementing the op cannot 
>sleep.
>
>The way rocker is dealing with this now is just to invoke driver operation
>and go out, without any checking or reporting of the operation status.
>
>Since it would be nice to at least put a warning in case the operation fails,
>it makes sense to do this in delayed work directly in switchdev core
>instead of implementing this in separate drivers. And that is what this 
>patchset
>is introducing.
>
>So from now on, the locking of switchdev mod ops is consistent. Caller either
>holds rtnl mutex or in case it does not, caller sets defer flag, telling
>switchdev core to process the op later in delayed work.
>
>Flush function for switchdev deferred ops can be called by op
>caller in appropriate location, for example after it releases
>spin lock, to force switchdev core to process pending ops.
>
>v1->v2:
>- rebased on current net-next head (including Scott's ageing patchset)
>v2->v3:
>- fixed comment s/of/or/ typo suggested by Nik
>
>Jiri Pirko (7):
>  switchdev: assert rtnl in switchdev_port_obj_del
>  switchdev: introduce switchdev workqueue
>  switchdev: allow caller to explicitly request attr_set as deferred
>  switchdev: remove pointers from switchdev objects
>  switchdev: introduce possibility to defer obj_add/del
>  bridge: defer switchdev fdb del call in fdb_del_external_learn
>  rocker: remove nowait from switchdev callbacks.
>
> drivers/net/ethernet/rocker/rocker.c |  13 +-
> include/net/switchdev.h  |  14 +-
> net/bridge/br_fdb.c  |   7 +-
> net/bridge/br_if.c   |   3 +
> net/bridge/br_stp.c  |   3 +-
> net/dsa/slave.c  |   2 +-
> net/switchdev/switchdev.c| 276 +--
> 7 files changed, 222 insertions(+), 96 deletions(-)
>
>-- 
>1.9.3
>
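
The defer-then-flush scheme described in the quoted cover letter can be modelled in userspace as a plain FIFO of pending operations. This is only a single-threaded sketch under stated assumptions: struct deferred_queue, defer_op(), driver_apply() and flush_deferred() are hypothetical names, there is no workqueue or locking here, and the rule that the driver fails for negative ids is invented purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* One pending operation, copied by value when it is deferred. */
struct deferred_op {
	struct deferred_op *next;
	int obj_id;
	bool add;		/* add or del */
};

struct deferred_queue {
	struct deferred_op *head, *tail;
	int processed;		/* ops run so far */
	int failed;		/* ops whose driver call reported an error */
};

/* Called from the "atomic" context: only allocate and enqueue, never
 * call into the driver. Returns 0 on success, -1 if allocation fails.
 */
int defer_op(struct deferred_queue *q, int obj_id, bool add)
{
	struct deferred_op *op = malloc(sizeof(*op));

	if (!op)
		return -1;
	op->next = NULL;
	op->obj_id = obj_id;
	op->add = add;
	if (q->tail)
		q->tail->next = op;
	else
		q->head = op;
	q->tail = op;
	return 0;
}

/* Hypothetical driver op that is allowed to sleep; fails for negative ids. */
int driver_apply(int obj_id, bool add)
{
	(void)add;
	return obj_id < 0 ? -1 : 0;
}

/* Run every pending op in order, record failures instead of dropping them
 * silently, and return the number of ops processed by this call.
 */
int flush_deferred(struct deferred_queue *q)
{
	int n = 0;

	while (q->head) {
		struct deferred_op *op = q->head;

		q->head = op->next;
		if (!q->head)
			q->tail = NULL;
		if (driver_apply(op->obj_id, op->add) != 0)
			q->failed++;
		q->processed++;
		free(op);
		n++;
	}
	return n;
}
```

In this model the caller queues ops while it holds its spinlock and calls flush_deferred() once the lock is released, which is the role the cover letter gives to the switchdev flush function.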

Re: [PATCH] ip neigh: Add support for filtering dumps by master device

2015-10-12 Thread Stephen Hemminger

On Fri,  2 Oct 2015 09:42:27 -0700
David Ahern  wrote:

> Add support for filtering neighbor dumps by master device. Kernel side
> support provided by commit 21fdd092acc7. Since the feature is not
> available in older kernels the user is given a warning message if the
> kernel does not support the request.
> 
> Signed-off-by: David Ahern 

Applied, to net-next branch of iproute2.


Re: [PATCH iproute2 -next] m_bpf: don't require default opcode on ebpf actions

2015-10-12 Thread Stephen Hemminger

On Thu,  8 Oct 2015 15:22:05 +0200
Daniel Borkmann  wrote:

> After the patch, the most minimal command to load an eBPF action
> for late binding with auto index selection through tc is:
> 
>   tc actions add action bpf obj prog.o
> 
> We already set TC_ACT_PIPE in tc as default opcode, so if nothing
> further has been specified, just use it. Also, allow "ok" next to
> "pass" for matching cmdline on TC_ACT_OK.
> 
> Signed-off-by: Daniel Borkmann 

Applied to net-next

[PATCH net-next] bridge: fix gc_timer mod/del race condition

2015-10-12 Thread Nikolay Aleksandrov

From: Nikolay Aleksandrov 

commit c62987bbd8a1 ("bridge: push bridge setting ageing_time down to
switchdev") introduced a timer race condition because the gc_timer can
get rearmed after it's supposedly stopped and flushed in br_dev_delete()
leading to a use of freed memory. So take rtnl to sync with bridge
destruction when setting ageing_timer.
Here's the trace reproduced with these two commands running in parallel:
while :; do echo 1 > /sys/class/net/br0/bridge/ageing_timer; done;
while :; do brctl addbr br0; ip l set br0 up; ip l set br0 down;
brctl delbr br0; done;

[  300.29] BUG: unable to handle kernel paging request at
811c59d3
[  300.000263] IP: [] __internal_add_timer+0x2e/0xd0
[  300.000422] PGD 1a0f067 PUD 1a10063 PMD 10001e1
[  300.000639] Oops: 0003 [#1] SMP
[  300.000793] Modules linked in: bridge stp llc nfsd auth_rpcgss
oid_registry nfs_acl nfs lockd grace fscache sunrpc crct10dif_pclmul
crc32_pclmul crc32c_intel ghash_clmulni_intel ppdev aesni_intel
aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd
snd_hda_codec_generic qxl drm_kms_helper psmouse pcspkr ttm
snd_hda_intel 9pnet_virtio evdev serio_raw joydev snd_hda_codec 9pnet
virtio_balloon drm snd_hwdep virtio_console snd_hda_core pvpanic snd_pcm
i2c_piix4 snd_timer acpi_cpufreq parport_pc snd parport soundcore button
processor i2c_core ipv6 autofs4 hid_generic usbhid hid ext4 crc16
mbcache jbd2 sg sr_mod cdrom ata_generic virtio_blk virtio_net e1000
ehci_pci uhci_hcd ehci_hcd usbcore usb_common floppy ata_piix libata
virtio_pci virtio_ring virtio scsi_mod
[  300.004008] CPU: 1 PID: 1169 Comm: bash Not tainted 4.3.0-rc3+ #46
[  300.004008] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  300.004008] task: 880035be2200 ti: 88003795c000 task.ti:
88003795c000
[  300.004008] RIP: 0010:[]  []
__internal_add_timer+0x2e/0xd0
[  300.004008] RSP: 0018:88003fd03e78  EFLAGS: 00010046
[  300.004008] RAX: 88003fd0ef60 RBX: 840fc78949c08548 RCX:
0001
[  300.004008] RDX:  RSI: 811c59d3 RDI:
88003fd0df00
[  300.004008] RBP: 88003fd03e78 R08:  R09:

[  300.004008] R10:  R11:  R12:
88003fd0df00
[  300.004008] R13:  R14: 0001 R15:
816032e0
[  300.004008] FS:  7fcbdd609700() GS:88003fd0()
knlGS:
[  300.004008] CS:  0010 DS:  ES:  CR0: 80050033
[  300.004008] CR2: 811c59d3 CR3: 37879000 CR4:
000406e0
[  300.004008] Stack:
[  300.004008]  88003fd03ea8 810f1775 88003c8cb958
88003fd0df00
[  300.004008]   0001 88003fd03f18
810f28c4
[  300.004008]  88003fd0eb68 88003fd0e968 88003fd0e768
88003fd0df68
[  300.004008] Call Trace:
[  300.004008]  
[  300.004008]  [] cascade+0x45/0x70
[  300.004008]  [] run_timer_softirq+0x2f4/0x340
[  300.004008]  [] __do_softirq+0xd0/0x440
[  300.004008]  [] irq_exit+0xb3/0xc0
[  300.004008]  [] smp_apic_timer_interrupt+0x42/0x50
[  300.004008]  [] apic_timer_interrupt+0x87/0x90
[  300.004008]  
[  300.004008]  [] ? create_object+0x13c/0x2e0
[  300.004008]  [] ? __kernel_text_address+0x4e/0x70
[  300.004008]  [] ? __kernel_text_address+0x4e/0x70
[  300.004008]  [] print_context_stack+0x7f/0xf0
[  300.004008]  [] dump_trace+0x11b/0x300
[  300.004008]  [] save_stack_trace+0x2b/0x50
[  300.004008]  [] create_object+0x13c/0x2e0
[  300.004008]  [] kmemleak_alloc+0x4e/0xb0
[  300.004008]  [] kmem_cache_alloc_trace+0x18d/0x2f0
[  300.004008]  [] kernfs_fop_open+0xc9/0x380
[  300.004008]  [] do_dentry_open+0x1ff/0x2f0
[  300.004008]  [] ? kernfs_fop_release+0x70/0x70
[  300.004008]  [] vfs_open+0x59/0x60
[  300.004008]  [] path_openat+0x1ce/0x1260
[  300.004008]  [] do_filp_open+0x7e/0xe0
[  300.004008]  [] ? __alloc_fd+0xaf/0x180
[  300.004008]  [] do_sys_open+0x12b/0x210
[  300.004008]  [] SyS_open+0x1e/0x20
[  300.004008]  [] entry_SYSCALL_64_fastpath+0x16/0x7a
[  300.004008] Code: 66 90 48 8b 46 10 48 8b 4f 40 55 48 89 c2 48 89 e5
48 29 ca 48 81 fa ff 00 00 00 77 20 0f b6 c0 48 8d 44 c7 68 48 8b 10 48
85 d2 <48> 89 16 74 04 48 89 72 08 48 89 30 48 89 46 08 5d c3 48 81 fa
[  300.004008] RIP  [] __internal_add_timer+0x2e/0xd0
[  300.004008]  RSP 
[  300.004008] CR2: 811c59d3

Fixes: c62987bbd8a1 ("bridge: push bridge setting ageing_time down to 
switchdev")
Signed-off-by: Nikolay Aleksandrov 
---
 net/bridge/br_sysfs_br.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
index 04ef1926ee7e..8365bd53c421 100644
--- a/net/bridge/br_sysfs_br.c
+++ b/net/bridge/br_sysfs_br.c
@@ -102,7 +102,15 @@ static ssize_t ageing_time_show(struct device *d,
 
 static int set_ageing_time(struct net_bridge *br, unsigned long val)
 {
-   return br_set_ageing_time(br, val);

Re: GPF in rt6_uncached_list_flush_dev

2015-10-12 Thread Eric W. Biederman

Eric Dumazet  writes:

> On Mon, 2015-10-12 at 11:34 +0200, Dmitry Vyukov wrote:
>> Hello,
>> 
>> The following program causes episodic crashes:
>> 
>> // autogenerated by syzkaller (http://github.com/google/syzkaller)
>> #include 
>> #define CLONE_NEWNET 0x4000
>> int main(void)
>> {
>> unshare(CLONE_NEWNET);
>> }
>> 
>> On commit dd36d7393d6310b0c1adefb22fba79c3cf8a577c
>> (git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git)
>> 
>> general protection fault:  [#1] SMP KASAN
>> Modules linked in:
>> CPU: 0 PID: 1058 Comm: kworker/u8:1 Not tainted 4.3.0-rc2+ #12
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>> Workqueue: netns cleanup_net
>> task: 880051c71a00 ti: 8800514f8000 task.ti: 8800514f8000
>> RIP: 0010:[]  [] rt6_ifdown+0x481/0x740
>> RSP: 0018:8800514ffaa0  EFLAGS: 00010246
>> RAX: dc59 RBX: 88005107c580 RCX: 0002
>> RDX:  RSI: 000f RDI: 880052a1f340
>> RBP: 8800514ffb78 R08:  R09: 8800514ffb10
>> R10: 88002d5b7dc0 R11: 88002ec07600 R12: 880051c11140
>> R13: 88005144af40 R14:  R15: dc00
>> FS:  () GS:88002f00() knlGS:
>> CS:  0010 DS:  ES:  CR0: 8005003b
>> CR2: 00648056 CR3: 0361 CR4: 06f0
>> Stack:
>>  02c8 11000a29ff5e dc59 00022d5b61c0
>>  880052a1f340 880051c11140 880052a1f348 88005107c6d8
>>  88005107c598  41b58ab3 83471ca6
>> Call Trace:
>>  [] fib6_net_exit+0x20/0x100 net/ipv6/ip6_fib.c:1847
>>  [] ops_exit_list.isra.6+0xae/0x150
>> net/core/net_namespace.c:134
>>  [] cleanup_net+0x3cd/0x730
>> net/core/net_namespace.c:431 (discriminator 3)
>>  [] process_one_work+0x6d1/0x1370 kernel/workqueue.c:2030
>>  [] worker_thread+0xe3/0x1300 kernel/workqueue.c:2162
>>  [] kthread+0x1e7/0x260 kernel/kthread.c:209
>>  [] ret_from_fork+0x3f/0x70 arch/x86/entry/entry_64.S:475
>> Code: 89 95 50 ff ff ff e8 6f 41 9f fe 48 8b 95 50 ff ff ff 48 39 95
>> 70 ff ff ff 0f 84 d5 fe ff ff e8 56 41 9f fe 48 8b 85 38 ff ff ff <80>
>> 38 00 0f 85 9b 01 00 00 48 8b 85 70 ff ff ff 48 8b 90 c8 02
>> RIP  [< inline >] __read_once_size include/linux/compiler.h:207
>> RIP  [< inline >] in6_dev_get include/net/addrconf.h:281
>> RIP  [< inline >] rt6_uncached_list_flush_dev net/ipv6/route.c:156
>> RIP  [] rt6_ifdown+0x481/0x740 net/ipv6/route.c:2621
>>  RSP 
>> ---[ end trace 113e678e9b762d96 ]---
>> Kernel panic - not syncing: Fatal exception in interrupt
>> Kernel Offset: disabled
>> ---[ end Kernel panic - not syncing: Fatal exception in interrupt
>> 
>> 
>> The crash happens because loopback_dev is NULL in
>> rt6_uncached_list_flush_dev(). The crash happens only if there is an
>> uncached route when the interface is destroyed.
>> 
>> I've tried to run the program with the following patch applied:
>> 
>> diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
>> index dc7d970..fd7e88d 100644
>> --- a/drivers/net/loopback.c
>> +++ b/drivers/net/loopback.c
>> @@ -144,6 +144,8 @@ static int loopback_dev_init(struct net_device *dev)
>> 
>>  static void loopback_dev_free(struct net_device *dev)
>>  {
>> +   pr_err("loopback_dev_free %p = %p",
>> &dev_net(dev)->loopback_dev, dev_net(dev)->loopback_dev);
>> +   WARN_ON(1);
>> dev_net(dev)->loopback_dev = NULL;
>> free_percpu(dev->lstats);
>> free_netdev(dev);
>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> index f204089..fd558a4 100644
>> --- a/net/ipv6/route.c
>> +++ b/net/ipv6/route.c
>> @@ -142,6 +142,8 @@ static void rt6_uncached_list_flush_dev(struct net
>> *net, struct net_device *dev)
>> struct net_device *loopback_dev = net->loopback_dev;
>> int cpu;
>> 
>> +   pr_err("rt6_uncached_list_flush_dev %p = %p",
>> &net->loopback_dev, net->loopback_dev);
>> +   WARN_ON(1);
>> for_each_possible_cpu(cpu) {
>> struct uncached_list *ul = per_cpu_ptr(&rt6_uncached_list, 
>> cpu);
>> struct rt6_info *rt;
>> 
>> 
>> And it shows that the loopback device is destroyed before
>> rt6_uncached_list_flush_dev is executed, while
>> rt6_uncached_list_flush_dev seems to assume that loopback_dev is alive
>> when it is called:
>> 
>> [  197.812174] loopback_dev_free 88003d288150 = 88003e1d67c0
>> [  197.812890] [ cut here ]
>> [  197.813389] WARNING: CPU: 2 PID: 1044 at drivers/net/loopback.c:148
>> loopback_dev_free+0x3c/0x70()
>> [  197.814186] Modules linked in:
>> [  197.814478] CPU: 2 PID: 1044 Comm: kworker/u8:1 Tainted: GW
>>   4.3.0-rc3+ #45
>> [  197.815186] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>> BIOS Bochs 01/01/2011
>> [  197.815886] Workqueue: netns cleanup_net
>> [  197.816256]  81c27c67

Re: [PATCH iproute2 v2] bridge: add batch command support

2015-10-12 Thread Stephen Hemminger

On Sun, 11 Oct 2015 14:03:03 -0700
Roopa Prabhu  wrote:

> From: Wilson Kok  
> 
> This patch adds support to batch bridge commands.
> Follows ip batch code.
> 
> Signed-off-by: Wilson Kok 
> Signed-off-by: Roopa Prabhu 
> Acked-by: Christophe Gouault 

Applied, thanks

Re: [PATCH net-next] iproute2: document -timestamp option

2015-10-12 Thread Stephen Hemminger

On Tue,  6 Oct 2015 14:16:29 -0700
Roopa Prabhu  wrote:

> From: Satish Ashok 
> 
> This patch documents bridge and ip -timestamp option
> 
> Signed-off-by: Satish Ashok 

Applied, required some defuzzing.


Re: [PATCH] ip neigh: Add ifindex to request when filtering dumps by device

2015-10-12 Thread Stephen Hemminger

On Wed,  7 Oct 2015 10:23:24 -0700
David Ahern  wrote:

> Add ifindex to dump request when filtering by device. If the kernel
> supports it adding the index to the request limits the amount of data
> the kernel pushes to userspace.
> 
> The feature exists in userspace already, so no need to warn the user
> if kernel side support does not exist. Using the kernel side filter
> makes the request more efficient.
> 
> Signed-off-by: David Ahern 

Applied to net-next branch

Re: [PATCH iproute2 -next] f_bpf: allow for optional classid and add flags

2015-10-12 Thread Stephen Hemminger

On Fri, 25 Sep 2015 12:32:41 +0200
Daniel Borkmann  wrote:

> With an optional classid, the most minimal command can be something
> like:
> 
>   tc filter add dev foo parent X: bpf obj prog.o
> 
> Therefore, adapt the code so that a next argument will not be
> enforced, as is currently the case.
> 
> Also, minor cleanup on the classid, where we should rather
> have used addattr32(), and add flags for exec configuration,
> for example (using short notation):
> 
>   tc filter add dev foo parent X: bpf da obj prog.o
> 
> Signed-off-by: Daniel Borkmann 

Applied to net-next branch of iproute2

Re: [iproute PATCH 0/3] improve MACVLAN/MACVTAP support

2015-10-12 Thread Stephen Hemminger

On Fri, 25 Sep 2015 14:09:48 +0200
Phil Sutter  wrote:

> While implementing support for MACVLAN_FLAG_NOPROMISC, I realized how similar
> iplink_macvlan.c and iplink_macvtap.c are and that the main differences (apart
> from substituting "macvlan" for "macvtap") were fixes and enhancements which
> weren't applied to both files. To prevent this in the future and to share
> common code, this series merges the files. In addition, it implements support
> for MACVLAN_FLAG_NOPROMISC and finally documents these interface types in
> ip-link.8.in.
> 
> Phil Sutter (3):
>   ip: link: consolidate macvlan and macvtap
>   ip: macvlan: support MACVLAN_FLAG_NOPROMISC flag
>   man: ip-link: document MACVLAN/MACVTAP interface types
> 
>  ip/Makefile   |   2 +-
>  ip/iplink_macvlan.c   |  68 
>  ip/iplink_macvtap.c   | 105 
> --
>  man/man8/ip-link.8.in |  50 
>  4 files changed, 104 insertions(+), 121 deletions(-)
>  delete mode 100644 ip/iplink_macvtap.c
> 

Looks good, applied. Always love to see more code deleted.

Re: [PATCH net-next v2 1/2] hisilicon net: removes the once HANDEL_TX_MSG macro

2015-10-12 Thread Joe Perches

On Mon, 2015-10-12 at 13:59 +0200, Arnd Bergmann wrote:
> On Monday 12 October 2015 11:23:44 huangdaode wrote:
> > +   s += sprintf(s,
> > +   "\t\ttx_ring on 
> > %p:%u,%u,%u,%u,%u,%llu,%llu\n",
> > +   h->qs[i]->tx_ring.io_base,
[]
> There is actually a more significant problem with this code, which I
> failed to notice when doing the original bugfix: 
> 
> You have a sysfs interface here that exports internal data of the
> device that should not be visible like this. One problem is that
> the io_base is a kernel pointer that must not be visible to non-root
> users (so we don't easily create an attack surface for exploits).

Using %pK might have been appropriate.

> It would probably be better to completely remove that sysfs interface, and
> to use the ethtool netlink interface to export them.

But this would be better.



[PATCH net-next 4/4] bridge: vlan: combine (br|nbp)_vlan_flush into one

2015-10-12 Thread Nikolay Aleksandrov

From: Nikolay Aleksandrov 

As Ido Schimmel pointed out the vlan_vid_del() loop in nbp_vlan_flush is
unnecessary (and is actually a remnant of the old vlan code) so we can
remove it and combine both br/nbp vlan_flush functions into one.

Signed-off-by: Nikolay Aleksandrov 
---
 net/bridge/br_if.c  |  8 +---
 net/bridge/br_private.h |  9 ++---
 net/bridge/br_vlan.c| 16 +---
 3 files changed, 8 insertions(+), 25 deletions(-)

diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 74a03c0a4e5f..ed431cc80b3d 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -233,6 +233,7 @@ static void destroy_nbp_rcu(struct rcu_head *head)
  */
 static void del_nbp(struct net_bridge_port *p)
 {
+   struct net_bridge_vlan_group *vg;
struct net_bridge *br = p->br;
struct net_device *dev = p->dev;
 
@@ -249,7 +250,8 @@ static void del_nbp(struct net_bridge_port *p)
list_del_rcu(&p->list);
 
/* vlan_flush phase I: remove vlans */
-   nbp_vlan_flush(p, false);
+   vg = nbp_vlan_group(p);
+   br_vlan_flush(vg, false);
br_fdb_delete_by_port(br, p, 0, 1);
nbp_update_port_count(br);
 
@@ -261,7 +263,7 @@ static void del_nbp(struct net_bridge_port *p)
/* use the synchronize_rcu done by netdev_rx_handler_unregister
 * vlan_flush phase II: free rht and vlgrp
 */
-   nbp_vlan_flush(p, true);
+   br_vlan_flush(vg, true);
 
br_multicast_del_port(p);
 
@@ -286,7 +288,7 @@ void br_dev_delete(struct net_device *dev, struct list_head 
*head)
br_fdb_delete_by_port(br, NULL, 0, 1);
 
/* vlan_flush execute both phases (see del_nbp) */
-   br_vlan_flush(br, true);
+   br_vlan_flush(br_vlan_group(br), true);
br_multicast_dev_del(br);
del_timer_sync(&br->gc_timer);
 
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 3938a976417f..73ee71c0a960 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -682,7 +682,7 @@ struct sk_buff *br_handle_vlan(struct net_bridge *br,
   struct sk_buff *skb);
 int br_vlan_add(struct net_bridge *br, u16 vid, u16 flags);
 int br_vlan_delete(struct net_bridge *br, u16 vid);
-void br_vlan_flush(struct net_bridge *br, bool free_rht);
+void br_vlan_flush(struct net_bridge_vlan_group *vg, bool free_rht);
 struct net_bridge_vlan *br_vlan_find(struct net_bridge_vlan_group *vg, u16 
vid);
 void br_recalculate_fwd_mask(struct net_bridge *br);
 int __br_vlan_filter_toggle(struct net_bridge *br, unsigned long val);
@@ -694,7 +694,6 @@ int br_vlan_set_default_pvid(struct net_bridge *br, 
unsigned long val);
 int __br_vlan_set_default_pvid(struct net_bridge *br, u16 pvid);
 int nbp_vlan_add(struct net_bridge_port *port, u16 vid, u16 flags);
 int nbp_vlan_delete(struct net_bridge_port *port, u16 vid);
-void nbp_vlan_flush(struct net_bridge_port *port, bool free_rht);
 int nbp_vlan_init(struct net_bridge_port *port);
 int nbp_get_num_vlan_infos(struct net_bridge_port *p, u32 filter_mask);
 
@@ -790,7 +789,7 @@ static inline int br_vlan_delete(struct net_bridge *br, u16 
vid)
return -EOPNOTSUPP;
 }
 
-static inline void br_vlan_flush(struct net_bridge *br, bool free_rht)
+static inline void br_vlan_flush(struct net_bridge_vlan_group *vg, bool 
free_rht)
 {
 }
 
@@ -813,10 +812,6 @@ static inline int nbp_vlan_delete(struct net_bridge_port 
*port, u16 vid)
return -EOPNOTSUPP;
 }
 
-static inline void nbp_vlan_flush(struct net_bridge_port *port, bool free_rht)
-{
-}
-
 static inline struct net_bridge_vlan *br_vlan_find(struct 
net_bridge_vlan_group *vg,
   u16 vid)
 {
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index 4fb9b23c9838..11ac14f60206 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -572,13 +572,6 @@ int br_vlan_delete(struct net_bridge *br, u16 vid)
return __vlan_del(v);
 }
 
-void br_vlan_flush(struct net_bridge *br, bool free_rht)
-{
-   ASSERT_RTNL();
-
-   __vlan_flush(br_vlan_group(br), free_rht);
-}
-
 struct net_bridge_vlan *br_vlan_find(struct net_bridge_vlan_group *vg, u16 vid)
 {
if (!vg)
@@ -960,16 +953,9 @@ int nbp_vlan_delete(struct net_bridge_port *port, u16 vid)
return __vlan_del(v);
 }
 
-void nbp_vlan_flush(struct net_bridge_port *port, bool free_rht)
+void br_vlan_flush(struct net_bridge_vlan_group *vg, bool free_rht)
 {
-   struct net_bridge_vlan_group *vg;
-   struct net_bridge_vlan *vlan;
-
ASSERT_RTNL();
 
-   vg = nbp_vlan_group(port);
-   list_for_each_entry(vlan, &vg->vlan_list, vlist)
-   vlan_vid_del(port->dev, port->br->vlan_proto, vlan->vid);
-
__vlan_flush(vg, free_rht);
 }
-- 
2.4.3


[PATCH net-next 2/4] bridge: vlan: use rcu for vlan_list traversal in br_fill_ifinfo

2015-10-12 Thread Nikolay Aleksandrov

From: Nikolay Aleksandrov 

br_fill_ifinfo is called by br_ifinfo_notify which can be called from
many contexts with different locks held, sometimes it relies upon
bridge's spinlock only which is a problem for the vlan code, so use
explicitly rcu for that to avoid problems.

Signed-off-by: Nikolay Aleksandrov 
---
 net/bridge/br_netlink.c | 21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index edee48e9aa8f..e27bde2642cc 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -253,7 +253,7 @@ static int br_fill_ifvlaninfo_compressed(struct sk_buff 
*skb,
 * if vlaninfo represents a range
 */
pvid = br_get_pvid(vg);
-   list_for_each_entry(v, &vg->vlan_list, vlist) {
+   list_for_each_entry_rcu(v, &vg->vlan_list, vlist) {
flags = 0;
if (!br_vlan_should_use(v))
continue;
@@ -303,7 +303,7 @@ static int br_fill_ifvlaninfo(struct sk_buff *skb,
u16 pvid;
 
pvid = br_get_pvid(vg);
-   list_for_each_entry(v, &vg->vlan_list, vlist) {
+   list_for_each_entry_rcu(v, &vg->vlan_list, vlist) {
if (!br_vlan_should_use(v))
continue;
 
@@ -386,22 +386,27 @@ static int br_fill_ifinfo(struct sk_buff *skb,
struct nlattr *af;
int err;
 
+   /* RCU needed because of the VLAN locking rules (rcu || rtnl) */
+   rcu_read_lock();
if (port)
-   vg = nbp_vlan_group(port);
+   vg = nbp_vlan_group_rcu(port);
else
-   vg = br_vlan_group(br);
+   vg = br_vlan_group_rcu(br);
 
-   if (!vg || !vg->num_vlans)
+   if (!vg || !vg->num_vlans) {
+   rcu_read_unlock();
goto done;
-
+   }
af = nla_nest_start(skb, IFLA_AF_SPEC);
-   if (!af)
+   if (!af) {
+   rcu_read_unlock();
goto nla_put_failure;
-
+   }
if (filter_mask & RTEXT_FILTER_BRVLAN_COMPRESSED)
err = br_fill_ifvlaninfo_compressed(skb, vg);
else
err = br_fill_ifvlaninfo(skb, vg);
+   rcu_read_unlock();
if (err)
goto nla_put_failure;
nla_nest_end(skb, af);
-- 
2.4.3


[PATCH net-next 0/4] bridge: vlan: cleanups & fixes (part 3)

2015-10-12 Thread Nikolay Aleksandrov

From: Nikolay Aleksandrov 

Hi,
Patch 01 converts the vlgrp member to use rcu as it was already used in a
similar way so better to make it official and use all the available RCU
instrumentation. Patch 02 fixes a bug where the vlan_list can be traversed
without rtnl or rcu held which could lead to using freed entries.
I'm not initializing vlgrp to null before kfree_rcu because Ido has a patch
for that which fixes a warning from kasan.
Patch 03 fixes a bug reported by Ido Schimmel about the vlan_flush order
and switchdevs, and patch 04 refactors (br|nbp)_vlan_flush and combines
them into a single function.

Thank you,
 Nik

Nikolay Aleksandrov (4):
  bridge: vlan: use proper rcu for the vlgrp member
  bridge: vlan: use rcu for vlan_list traversal in br_fill_ifinfo
  bridge: vlan: break vlan_flush in two phases to keep old order
  bridge: vlan: combine (br|nbp)_vlan_flush into one

 net/bridge/br_device.c  |   2 +-
 net/bridge/br_forward.c |   6 +--
 net/bridge/br_if.c  |  13 +++--
 net/bridge/br_input.c   |   4 +-
 net/bridge/br_netlink.c |  25 ++
 net/bridge/br_private.h |  43 -
 net/bridge/br_vlan.c| 124 +++-
 7 files changed, 132 insertions(+), 85 deletions(-)

-- 
2.4.3


[PATCH net-next 3/4] bridge: vlan: break vlan_flush in two phases to keep old order

2015-10-12 Thread Nikolay Aleksandrov

From: Nikolay Aleksandrov 

Ido Schimmel reported a problem with switchdev devices because of the
order change of del_nbp operations, more specifically the move of
nbp_vlan_flush() which deletes all vlans and frees vlgrp after the
rx_handler has been unregistered. To fix this, break vlan_flush into
two phases:
1. delete all of vlan_group's vlans
2. destroy rhtable and free vlgrp
We execute phase I (free_rht == false) in the same place as before so the
vlans can be cleared, and phase II (free_rht == true) frees the vlgrp after
the rx_handler has been unregistered.
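
A minimal userspace sketch of the two-phase idea described above, not the
kernel code itself. The names vlan_sketch, vlgrp_sketch, vlgrp_add and
vlgrp_flush are hypothetical stand-ins, and plain malloc/free replaces the
rhashtable and kfree_rcu used by the real patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins for net_bridge_vlan{,_group}. */
struct vlan_sketch {
	struct vlan_sketch *next;
	unsigned short vid;
};

struct vlgrp_sketch {
	struct vlan_sketch *vlans;
	int num_vlans;
};

/* Add one vlan to the group; returns 0 on success, -1 on allocation failure. */
static int vlgrp_add(struct vlgrp_sketch *vg, unsigned short vid)
{
	struct vlan_sketch *v = malloc(sizeof(*v));

	if (!v)
		return -1;
	v->vid = vid;
	v->next = vg->vlans;
	vg->vlans = v;
	vg->num_vlans++;
	return 0;
}

/*
 * Phase I always runs: delete every vlan in the group.
 * Phase II runs only when free_group is true: release the group itself.
 * Returns the group if it is still allocated, NULL once it has been freed.
 */
static struct vlgrp_sketch *vlgrp_flush(struct vlgrp_sketch *vg, bool free_group)
{
	assert(vg != NULL);

	while (vg->vlans) {
		struct vlan_sketch *v = vg->vlans;

		vg->vlans = v->next;
		free(v);
		vg->num_vlans--;
	}

	if (!free_group)
		return vg;

	free(vg);
	return NULL;
}
```

In this model a caller would invoke vlgrp_flush(vg, false) early, while
readers may still look at the (now empty) group, and vlgrp_flush(vg, true)
only once no reader can reach it any more.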

Reported-by: Ido Schimmel 
Signed-off-by: Nikolay Aleksandrov 
---
Ido: I hope this fixes it for your case, a test would be much appreciated.

 net/bridge/br_if.c  | 11 ---
 net/bridge/br_private.h |  8 
 net/bridge/br_vlan.c| 17 ++---
 3 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 934cae9fa317..74a03c0a4e5f 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -248,6 +248,8 @@ static void del_nbp(struct net_bridge_port *p)
 
list_del_rcu(>list);
 
+   /* vlan_flush phase I: remove vlans */
+   nbp_vlan_flush(p, false);
br_fdb_delete_by_port(br, p, 0, 1);
nbp_update_port_count(br);
 
@@ -256,8 +258,10 @@ static void del_nbp(struct net_bridge_port *p)
dev->priv_flags &= ~IFF_BRIDGE_PORT;
 
netdev_rx_handler_unregister(dev);
-   /* use the synchronize_rcu done by netdev_rx_handler_unregister */
-   nbp_vlan_flush(p);
+   /* use the synchronize_rcu done by netdev_rx_handler_unregister
+* vlan_flush phase II: free rht and vlgrp
+*/
+   nbp_vlan_flush(p, true);
 
br_multicast_del_port(p);
 
@@ -281,7 +285,8 @@ void br_dev_delete(struct net_device *dev, struct list_head 
*head)
 
br_fdb_delete_by_port(br, NULL, 0, 1);
 
-   br_vlan_flush(br);
+   /* vlan_flush execute both phases (see del_nbp) */
+   br_vlan_flush(br, true);
br_multicast_dev_del(br);
del_timer_sync(>gc_timer);
 
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 7d14ba93bba4..3938a976417f 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -682,7 +682,7 @@ struct sk_buff *br_handle_vlan(struct net_bridge *br,
   struct sk_buff *skb);
 int br_vlan_add(struct net_bridge *br, u16 vid, u16 flags);
 int br_vlan_delete(struct net_bridge *br, u16 vid);
-void br_vlan_flush(struct net_bridge *br);
+void br_vlan_flush(struct net_bridge *br, bool free_rht);
 struct net_bridge_vlan *br_vlan_find(struct net_bridge_vlan_group *vg, u16 
vid);
 void br_recalculate_fwd_mask(struct net_bridge *br);
 int __br_vlan_filter_toggle(struct net_bridge *br, unsigned long val);
@@ -694,7 +694,7 @@ int br_vlan_set_default_pvid(struct net_bridge *br, 
unsigned long val);
 int __br_vlan_set_default_pvid(struct net_bridge *br, u16 pvid);
 int nbp_vlan_add(struct net_bridge_port *port, u16 vid, u16 flags);
 int nbp_vlan_delete(struct net_bridge_port *port, u16 vid);
-void nbp_vlan_flush(struct net_bridge_port *port);
+void nbp_vlan_flush(struct net_bridge_port *port, bool free_rht);
 int nbp_vlan_init(struct net_bridge_port *port);
 int nbp_get_num_vlan_infos(struct net_bridge_port *p, u32 filter_mask);
 
@@ -790,7 +790,7 @@ static inline int br_vlan_delete(struct net_bridge *br, u16 
vid)
return -EOPNOTSUPP;
 }
 
-static inline void br_vlan_flush(struct net_bridge *br)
+static inline void br_vlan_flush(struct net_bridge *br, bool free_rht)
 {
 }
 
@@ -813,7 +813,7 @@ static inline int nbp_vlan_delete(struct net_bridge_port 
*port, u16 vid)
return -EOPNOTSUPP;
 }
 
-static inline void nbp_vlan_flush(struct net_bridge_port *port)
+static inline void nbp_vlan_flush(struct net_bridge_port *port, bool free_rht)
 {
 }
 
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index a212979257b1..4fb9b23c9838 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -307,15 +307,18 @@ out:
return err;
 }
 
-static void __vlan_flush(struct net_bridge_vlan_group *vlgrp)
+static void __vlan_flush(struct net_bridge_vlan_group *vlgrp, bool free_rht)
 {
struct net_bridge_vlan *vlan, *tmp;
 
__vlan_delete_pvid(vlgrp, vlgrp->pvid);
list_for_each_entry_safe(vlan, tmp, >vlan_list, vlist)
__vlan_del(vlan);
-   rhashtable_destroy(>vlan_hash);
-   kfree_rcu(vlgrp, rcu);
+
+   if (free_rht) {
+   rhashtable_destroy(>vlan_hash);
+   kfree_rcu(vlgrp, rcu);
+   }
 }
 
 struct sk_buff *br_handle_vlan(struct net_bridge *br,
@@ -569,11 +572,11 @@ int br_vlan_delete(struct net_bridge *br, u16 vid)
return __vlan_del(v);
 }
 
-void br_vlan_flush(struct net_bridge *br)
+void br_vlan_flush(struct net_bridge *br, bool free_rht)
 {
ASSERT_RTNL();

Re: [PATCH net] tcp: change type of alive from int to bool

2015-10-12 Thread David Miller

From: Richard Sailer 
Date: Fri,  9 Oct 2015 02:41:37 +0200

> The alive parameter of tcp_orphan_retries indicates
> whether the connection is assumed alive or not.
> In the function and all places calling it, it is used as a boolean value.
> 
> Therefore this changes the type of alive to bool in the function
> definition and all calling locations.
> 
> Since tcp_orphan_retries is a tcp_timer.c local function no change in
> any other file or header is necessary.
> 
> Signed-off-by: Richard Sailer 

Applied to net-next, thanks.

Re: [PATCH net-next v3 0/4] switchdev: push bridge ageing_time attribute down

2015-10-12 Thread David Miller

From: sfel...@gmail.com
Date: Thu,  8 Oct 2015 19:23:16 -0700

> From: Scott Feldman 
> 
> Push bridge-level attributes down to switchdev drivers.  This patchset
> adds the infrastructure and then pushes, as an example, ageing_time attribute
> down from bridge to switchdev (rocker) driver.  Add some range-checking
> for ageing_time.
> 
> # ip link set dev br0 type bridge ageing_time 1000
> 
> # ip link set dev br0 type bridge ageing_time 999
> RTNETLINK answers: Numerical result out of range
> 
> Up until now, switchdev attrs were port-level attrs, so the netdev used in
> switchdev_attr_set() would be a switch port or bond of switch ports.  With
> bridge-level attrs, the netdev passed to switchdev_attr_set() is the bridge
> netdev.  The same recursive algo is used to visit the leaves of the stacked
> drivers to set the attr, it's just in this case we start one layer higher in
> the stack.  One note is not all ports in the bridge may support setting a
> bridge-level attribute, so rather than failing the entire set, we'll skip over
> those ports returning -EOPNOTSUPP.
> 
> v2->v3: Per Jiri review: push only ageing_time attr down at this time, and
> don't pass raw bridge IFLA_BR_* values; rather use new switchdev attr ID for
> ageing_time.
> 
> v1->v2: rebase w/ net-next

Series applied, thanks.

Re: Use-after-free in ep_remove_wait_queue

2015-10-12 Thread Eric Dumazet

On Mon, 2015-10-12 at 14:02 +0200, Michal Kubecek wrote:

> Probably the issue discussed in
> 
>   http://thread.gmane.org/gmane.linux.kernel/2057497/
> 
> and previous related threads.
> 

Same issue, but Dmitry apparently did not trust me.



Re: [PATCH] pcnet32: fix a logic error with pci_set_dma_mask

2015-10-12 Thread David Miller

From: Geliang Tang 
Date: Fri,  9 Oct 2015 03:45:39 -0700

> pcnet32 has recently stopped working on my machine. It says "architecture
> does not support 32bit PCI busmaster DMA". There is a logic error
> in it: pci_set_dma_mask() returning 0 means success.
> 
> Signed-off-by: Geliang Tang 

This driver doesn't call pci_set_dma_mask() in any of my tree(s).

Re: [PATCH net-next v3] bridge: allow adding of fdb entries pointing to the bridge device

2015-10-12 Thread David Miller

From: Roopa Prabhu 
Date: Thu,  8 Oct 2015 10:38:52 -0700

> From: Roopa Prabhu 
> 
> This patch enables adding of fdb entries pointing to the bridge device.
> This can be used to propagate mac address of vlan interfaces
> configured on top of the vlan filtering bridge.
> 
> Before:
> $bridge fdb add 44:38:39:00:27:9f dev bridge
> RTNETLINK answers: Invalid argument
> 
> After:
> $bridge fdb add 44:38:39:00:27:9f dev bridge
> 
> Signed-off-by: Roopa Prabhu 
> Reviewed-by: Nikolay Aleksandrov 
> ---
> v1 - v2 : fix kbuild warnings
> v2 - v3 : address review comments from Nikolay (use of br_vlan_should_use)

Applied, thanks.

Re: [RFC PATCH 1/2] perf: Add the flag sample_disable not to output data on samples

2015-10-12 Thread Peter Zijlstra

On Mon, Oct 12, 2015 at 09:02:42AM +, Kaixu Xia wrote:
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -483,6 +483,8 @@ struct perf_event {
>   perf_overflow_handler_t overflow_handler;
>   void*overflow_handler_context;
>  
> + atomic_t*sample_disable;
> +
>  #ifdef CONFIG_EVENT_TRACING
>   struct trace_event_call *tp_event;
>   struct event_filter *filter;

> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index b11756f..f6ef45c 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6337,6 +6337,9 @@ static int __perf_event_overflow(struct perf_event 
> *event,
>   irq_work_queue(>pending);
>   }
>  
> + if ((event->sample_disable) && atomic_read(event->sample_disable))
> + return ret;
> +
>   if (event->overflow_handler)
>   event->overflow_handler(event, data, regs);
>   else

Try and guarantee sample_disable lives in the same cacheline as
overflow_handler.

I think we should at the very least replace the kzalloc() currently used
with a cacheline aligned alloc, and check the structure layout to verify
these two do in fact share a cacheline.
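
A minimal sketch of such a layout check, under stated assumptions: a
64-byte cache line, an object that starts on a cache line boundary, and a
hypothetical stand-in struct (event_sketch) rather than the real
struct perf_event. The helper names are illustrative only:

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines (the kernel would use L1_CACHE_BYTES). */
#define CACHELINE_BYTES 64

/* Hypothetical stand-in for the relevant part of struct perf_event. */
struct event_sketch {
	char   other_fields[200];          /* placeholder for earlier members */
	void (*overflow_handler)(void);
	void  *overflow_handler_context;
	int   *sample_disable;             /* atomic_t * in the RFC patch */
};

/* Index of the cache line holding byte `offset` of a line-aligned object. */
static size_t cacheline_of(size_t offset)
{
	return offset / CACHELINE_BYTES;
}

/* Nonzero when both hot members fall into the same cache line. */
static int handler_and_flag_share_line(void)
{
	return cacheline_of(offsetof(struct event_sketch, overflow_handler)) ==
	       cacheline_of(offsetof(struct event_sketch, sample_disable));
}

/* Debug-time check that the layout assumption still holds. */
static void assert_hot_members_share_line(void)
{
	assert(handler_and_flag_share_line());
}
```

The result only means something if the allocation itself is cache line
aligned, which is why the aligned alloc and the layout check go together.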

Re: Use-after-free in ep_remove_wait_queue

2015-10-12 Thread Michal Kubecek

On Mon, Oct 12, 2015 at 01:07:55PM +0200, Dmitry Vyukov wrote:
> Hello,
> 
> The following program causes a use-after-free in the kernel:
> 
...
> long r0 = syscall(SYS_mmap, 0x20001000ul, 0x1000ul, 0x3ul,
> 0x32ul, 0xul, 0x0ul);
> long r1 = syscall(SYS_mmap, 0x2000ul, 0x1000ul, 0x3ul,
> 0x32ul, 0xul, 0x0ul);
> long r2 = syscall(SYS_socketpair, 0x1ul, 0x3ul, 0x1ul, 0x2ffcul);
> long r3 = -1;
> if (r2 != -1)
> r3 = *(uint32_t*)0x2ffc;
> long r4 = -1;
> if (r2 != -1)
> r4 = *(uint32_t*)0x20001000;
> long r5 = syscall(SYS_epoll_create, 0x1ul);
> long r6 = syscall(SYS_mmap, 0x20003000ul, 0x1000ul, 0x3ul,
> 0x32ul, 0xul, 0x0ul);
> long r7 = syscall(SYS_dup3, r4, r3, 0x8ul);
> *(uint32_t*)0x20003000 = 0x6;
> *(uint32_t*)0x20003004 = 0x2;
> *(uint64_t*)0x20003008 = 0x6;
> long r11 = syscall(SYS_epoll_ctl, r5, 0x1ul, r3, 0x20003000ul);
> long r12 = syscall(SYS_mmap, 0x20002000ul, 0x1000ul, 0x3ul,
> 0x32ul, 0xul, 0x0ul);
> memcpy((void*)0x20002000, "\x00", 1);
> long r14 = syscall(SYS_write, r7, 0x20002000ul, 0x1ul);
> *(uint64_t*)0x20001a4d = 0x5;
> long r16 = syscall(SYS_epoll_pwait, r5, 0x20004000ul, 0x1ul,
> 0x1ul, 0x20001a4dul, 0x8ul);
> return 0;
...
> Call Trace:
>  [< inline >] __list_del include/linux/list.h:89
>  [< inline >] list_del include/linux/list.h:107
>  [< inline >] __remove_wait_queue include/linux/wait.h:145
>  [] remove_wait_queue+0xfb/0x120 kernel/sched/wait.c:50
>  [< inline >] ep_remove_wait_queue fs/eventpoll.c:524
>  [] ep_unregister_pollwait.isra.7+0x10b/0x1c0
> fs/eventpoll.c:542
>  [] ep_free+0x97/0x190 fs/eventpoll.c:759 (discriminator 3)
>  [] ep_eventpoll_release+0x44/0x60 fs/eventpoll.c:791
>  [] __fput+0x21d/0x6e0 fs/file_table.c:208
>  [] fput+0x15/0x20 fs/file_table.c:244
>  [] task_work_run+0x164/0x1f0 kernel/task_work.c:115
> (discriminator 1)
>  [< inline >] exit_task_work include/linux/task_work.h:21
>  [] do_exit+0xa4e/0x2d40 kernel/exit.c:746
>  [] do_group_exit+0xf6/0x340 kernel/exit.c:874
>  [< inline >] SYSC_exit_group kernel/exit.c:885
>  [] SyS_exit_group+0x1d/0x20 kernel/exit.c:883
>  [] entry_SYSCALL_64_fastpath+0x31/0x95
> arch/x86/entry/entry_64.S:187
> 
> INFO: Allocated in sk_prot_alloc+0x69/0x340 age=6 cpu=1 pid=10568
> [<  none  >] __slab_alloc+0x426/0x470 mm/slub.c:2402
> [< inline >] slab_alloc_node mm/slub.c:2470
> [< inline >] slab_alloc mm/slub.c:2512
> [<  none  >] kmem_cache_alloc+0x10d/0x140 mm/slub.c:2517
> [<  none  >] sk_prot_alloc+0x69/0x340 net/core/sock.c:1329
> [<  none  >] sk_alloc+0x33/0x280 net/core/sock.c:1404
> [<  none  >] unix_create1+0x5e/0x3d0 net/unix/af_unix.c:648
> [<  none  >] unix_create+0x14d/0x1f0 net/unix/af_unix.c:707
> [<  none  >] __sock_create+0x1f1/0x4c0 net/socket.c:1169
> [< inline >] sock_create net/socket.c:1209
> [< inline >] SYSC_socketpair net/socket.c:1281
> [<  none  >] SyS_socketpair+0x10f/0x4d0 net/socket.c:1260
> [<  none  >] entry_SYSCALL_64_fastpath+0x31/0x95
> arch/x86/entry/entry_64.S:187
> 
> INFO: Freed in sk_destruct+0x2e9/0x400 age=9 cpu=1 pid=10568
> [<  none  >] __slab_free+0x12d/0x280 mm/slub.c:2587 (discriminator 1)
> [< inline >] slab_free mm/slub.c:2736
> [<  none  >] kmem_cache_free+0x161/0x180 mm/slub.c:2745
> [< inline >] sk_prot_free net/core/sock.c:1374
> [<  none  >] sk_destruct+0x2e9/0x400 net/core/sock.c:1452
> [<  none  >] __sk_free+0x57/0x200 net/core/sock.c:1460
> [<  none  >] sk_free+0x30/0x40 net/core/sock.c:1471
> [< inline >] sock_put include/net/sock.h:1593
> [<  none  >] unix_dgram_sendmsg+0xeaf/0x1100 net/unix/af_unix.c:1571
> [< inline >] sock_sendmsg_nosec net/socket.c:610
> [<  none  >] sock_sendmsg+0xca/0x110 net/socket.c:620
> [<  none  >] sock_write_iter+0x216/0x3a0 net/socket.c:819
> [< inline >] new_sync_write fs/read_write.c:478
> [<  none  >] __vfs_write+0x2ed/0x3d0 fs/read_write.c:491
> [<  none  >] vfs_write+0x173/0x4a0 fs/read_write.c:538
> [< inline >] SYSC_write fs/read_write.c:585
> [<  none  >] SyS_write+0x108/0x220 fs/read_write.c:577
> [<  none  >] entry_SYSCALL_64_fastpath+0x31/0x95
> arch/x86/entry/entry_64.S:187

Probably the issue discussed in

  http://thread.gmane.org/gmane.linux.kernel/2057497/

and previous related threads.

Michal Kubecek


Re: [PATCH net-next 1/1] sfc: replace spinlocks with bit ops for busy poll locking

2015-10-12 Thread David Miller

From: Shradha Shah 
Date: Fri, 9 Oct 2015 10:18:56 +0100

>  static void efx_remove_port(struct efx_nic *efx);
> -static void efx_init_napi_channel(struct efx_channel *channel);
> +static int efx_init_napi_channel(struct efx_channel *channel);

The changes to modify the call chain to return a status code are
completely unnecessary, and distract from the core of what this
patch is actually doing.

Nothing signals an error, all of these code paths return '0',
so it's pointless to return a status or check such statuses.

Re: GPF in rt6_uncached_list_flush_dev

2015-10-12 Thread Eric Dumazet

On Mon, 2015-10-12 at 11:34 +0200, Dmitry Vyukov wrote:
> Hello,
> 
> The following program causes episodic crashes:
> 
> // autogenerated by syzkaller (http://github.com/google/syzkaller)
> #include 
> #define CLONE_NEWNET 0x4000
> int main(void)
> {
> unshare(CLONE_NEWNET);
> }
> 
> On commit dd36d7393d6310b0c1adefb22fba79c3cf8a577c
> (git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git)
> 
> general protection fault:  [#1] SMP KASAN
> Modules linked in:
> CPU: 0 PID: 1058 Comm: kworker/u8:1 Not tainted 4.3.0-rc2+ #12
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Workqueue: netns cleanup_net
> task: 880051c71a00 ti: 8800514f8000 task.ti: 8800514f8000
> RIP: 0010:[]  [] rt6_ifdown+0x481/0x740
> RSP: 0018:8800514ffaa0  EFLAGS: 00010246
> RAX: dc59 RBX: 88005107c580 RCX: 0002
> RDX:  RSI: 000f RDI: 880052a1f340
> RBP: 8800514ffb78 R08:  R09: 8800514ffb10
> R10: 88002d5b7dc0 R11: 88002ec07600 R12: 880051c11140
> R13: 88005144af40 R14:  R15: dc00
> FS:  () GS:88002f00() knlGS:
> CS:  0010 DS:  ES:  CR0: 8005003b
> CR2: 00648056 CR3: 0361 CR4: 06f0
> Stack:
>  02c8 11000a29ff5e dc59 00022d5b61c0
>  880052a1f340 880051c11140 880052a1f348 88005107c6d8
>  88005107c598  41b58ab3 83471ca6
> Call Trace:
>  [] fib6_net_exit+0x20/0x100 net/ipv6/ip6_fib.c:1847
>  [] ops_exit_list.isra.6+0xae/0x150
> net/core/net_namespace.c:134
>  [] cleanup_net+0x3cd/0x730
> net/core/net_namespace.c:431 (discriminator 3)
>  [] process_one_work+0x6d1/0x1370 kernel/workqueue.c:2030
>  [] worker_thread+0xe3/0x1300 kernel/workqueue.c:2162
>  [] kthread+0x1e7/0x260 kernel/kthread.c:209
>  [] ret_from_fork+0x3f/0x70 arch/x86/entry/entry_64.S:475
> Code: 89 95 50 ff ff ff e8 6f 41 9f fe 48 8b 95 50 ff ff ff 48 39 95
> 70 ff ff ff 0f 84 d5 fe ff ff e8 56 41 9f fe 48 8b 85 38 ff ff ff <80>
> 38 00 0f 85 9b 01 00 00 48 8b 85 70 ff ff ff 48 8b 90 c8 02
> RIP  [< inline >] __read_once_size include/linux/compiler.h:207
> RIP  [< inline >] in6_dev_get include/net/addrconf.h:281
> RIP  [< inline >] rt6_uncached_list_flush_dev net/ipv6/route.c:156
> RIP  [] rt6_ifdown+0x481/0x740 net/ipv6/route.c:2621
>  RSP 
> ---[ end trace 113e678e9b762d96 ]---
> Kernel panic - not syncing: Fatal exception in interrupt
> Kernel Offset: disabled
> ---[ end Kernel panic - not syncing: Fatal exception in interrupt
> 
> 
> The crash happens because loopback_dev is NULL in
> rt6_uncached_list_flush_dev(). The crash happens only if there is an
> uncached route when the interface is destroyed.
> 
> I've tried to run the program with the following patch applied:
> 
> diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
> index dc7d970..fd7e88d 100644
> --- a/drivers/net/loopback.c
> +++ b/drivers/net/loopback.c
> @@ -144,6 +144,8 @@ static int loopback_dev_init(struct net_device *dev)
> 
>  static void loopback_dev_free(struct net_device *dev)
>  {
> +   pr_err("loopback_dev_free %p = %p",
> _net(dev)->loopback_dev, dev_net(dev)->loopback_dev);
> +   WARN_ON(1);
> dev_net(dev)->loopback_dev = NULL;
> free_percpu(dev->lstats);
> free_netdev(dev);
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index f204089..fd558a4 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -142,6 +142,8 @@ static void rt6_uncached_list_flush_dev(struct net
> *net, struct net_device *dev)
> struct net_device *loopback_dev = net->loopback_dev;
> int cpu;
> 
> +   pr_err("rt6_uncached_list_flush_dev %p = %p",
> >loopback_dev, net->loopback_dev);
> +   WARN_ON(1);
> for_each_possible_cpu(cpu) {
> struct uncached_list *ul = per_cpu_ptr(_uncached_list, 
> cpu);
> struct rt6_info *rt;
> 
> 
> And it shows that the loopback device is destroyed before
> rt6_uncached_list_flush_dev is executed, while
> rt6_uncached_list_flush_dev seems to assume that loopback_dev is alive
> when it is called:
> 
> [  197.812174] loopback_dev_free 88003d288150 = 88003e1d67c0
> [  197.812890] [ cut here ]
> [  197.813389] WARNING: CPU: 2 PID: 1044 at drivers/net/loopback.c:148
> loopback_dev_free+0x3c/0x70()
> [  197.814186] Modules linked in:
> [  197.814478] CPU: 2 PID: 1044 Comm: kworker/u8:1 Tainted: GW
>   4.3.0-rc3+ #45
> [  197.815186] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS Bochs 01/01/2011
> [  197.815886] Workqueue: netns cleanup_net
> [  197.816256]  81c27c67 88003d923c50 812fe8d6
> 
> [  197.816949]  88003d923c88 81051ff1 88003e1d67c0
> 88003e1d6bd0
> [  197.817662]

Re: [PATCH v4 0/3] net: unix: fix use-after-free

2015-10-12 Thread Rainer Weikusat

David Miller  writes:
> From: Jason Baron 
> Date: Fri,  9 Oct 2015 00:15:59 -0400
>
>> These patches are against mainline, I can re-base to net-next, please
>> let me know.
>> 
>> They have been tested against: https://lkml.org/lkml/2015/9/13/195,
>> which causes the use-after-free quite quickly and here:
>> https://lkml.org/lkml/2015/10/2/693.
>
> I'd like to understand how patches that don't even compile can be
> "tested"?
>
> net/unix/af_unix.c: In function ‘unix_dgram_writable’:
> net/unix/af_unix.c:2480:3: error: ‘other_full’ undeclared (first use in this 
> function)
> net/unix/af_unix.c:2480:3: note: each undeclared identifier is reported only 
> once for each function it appears in
>
> Could you explain how that works, I'm having a hard time understanding
> this?

This is basically a workaround for the problem that it's not possible
to tell epoll to let go of a certain wait queue: Instead of registering
the peer_wait queue via sock_poll_wait, a wait_queue_t under control of
the af_unix.c code is linked onto it which relays a wake up on the
peer_wait queue to the 'ordinary' wait queue associated with the polled
socket via a custom wake function. But (at least in the code I looked at) it
enqueues a unix socket on connect which has certain side effects (in
particular, /dev/log will have a seriously large wait queue of entirely
uninterested peers) and in many cases, this is simply not necessary, as
the additional peer_wait event is only interesting in case a peer of a
fan-in socket (like /dev/log) happens to be waiting for writeability via
poll/select/epoll/...

Since the wait queue handling code is now under control of the af_unix.c
code, it can remove itself from the peer_wait queue prior to dropping
its reference to a peer on disconnect or on detecting a dead peer in
unix_dgram_sendmsg.
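
The relay mechanism described above can be modelled in userspace roughly
as follows. This is only a sketch under loud assumptions: struct waiter,
struct wait_queue and the queue_* helpers are hypothetical and merely mimic
the shape of the kernel wait queue API, with no locking at all:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, lock-free model of a wait queue entry and its queue. */
struct waiter {
	struct waiter *next;
	void (*func)(struct waiter *self);   /* custom wake function */
	void *priv;
};

struct wait_queue {
	struct waiter *head;
};

static void queue_add(struct wait_queue *q, struct waiter *w)
{
	assert(w->func != NULL);
	w->next = q->head;
	q->head = w;
}

/* Unlink w from q if it is queued there; otherwise do nothing. */
static void queue_remove(struct wait_queue *q, struct waiter *w)
{
	struct waiter **pp = &q->head;

	while (*pp && *pp != w)
		pp = &(*pp)->next;
	if (*pp)
		*pp = w->next;
}

/* Call every queued wake function; returns how many were called. */
static int queue_wake(struct wait_queue *q)
{
	int n = 0;

	for (struct waiter *w = q->head; w; w = w->next) {
		w->func(w);
		n++;
	}
	return n;
}

/* Stands in for a poller: counts the wake-ups it receives. */
static void count_wakeup(struct waiter *self)
{
	++*(int *)self->priv;
}

/* The relay: a wake-up on the peer's queue is forwarded to our own queue. */
static void relay_wakeup(struct waiter *self)
{
	queue_wake(self->priv);
}
```

Because the relay entry is owned by the socket code rather than by the
poller, it can be taken off the peer's queue with queue_remove() before the
reference to the peer is dropped, so nothing is left pointing into freed
memory.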

Re: Use-after-free in ep_remove_wait_queue

2015-10-12 Thread Dmitry Vyukov

On Mon, Oct 12, 2015 at 2:14 PM, Eric Dumazet  wrote:
> On Mon, 2015-10-12 at 14:02 +0200, Michal Kubecek wrote:
>
>> Probably the issue discussed in
>>
>>   http://thread.gmane.org/gmane.linux.kernel/2057497/
>>
>> and previous related threads.
>>
>
> Same issue, but Dmitry apparently did not trust me.

Just wanted to make sure. Let's consider this as fixed.

Re: [PATCH net-next 1/1] sfc: fully reset if MC_REBOOT event received without warm_boot_count increment

2015-10-12 Thread David Miller

From: Shradha Shah 
Date: Fri, 9 Oct 2015 10:40:35 +0100

> From: Daniel Pieczko 
> 
> On EF10, MC_CMD_VPORT_RECONFIGURE can cause a CODE_MC_REBOOT event
> to be sent to a function without incrementing the (adapter-wide)
> warm_boot_count.  In this case, the reboot is not detected by the
> loop on efx_mcdi_poll_reboot(), so prepare for recovery from an MC
> reboot anyway.  When this codepath is run, the MC has always just
> rebooted, so this recovery is valid.
> 
> The loop on efx_mcdi_poll_reboot() is still required for other MC
> reboot cases, so that actions in response to an MC reboot are
> performed, such as clearing locally calculated statistics.
> Siena NICs are unaffected by this change as the above scenario
> does not apply.
> 
> Signed-off-by: Shradha Shah 

Applied, thanks.

[PATCH net-next] switchdev: enforce no pvid flag in vlan ranges

2015-10-12 Thread Nikolay Aleksandrov

From: Nikolay Aleksandrov 

We shouldn't allow BRIDGE_VLAN_INFO_PVID flag in VLAN ranges.

Signed-off-by: Nikolay Aleksandrov 
---
 net/switchdev/switchdev.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 6e4a4f9ad927..256c596de896 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -720,6 +720,9 @@ static int switchdev_port_br_afspec(struct net_device *dev,
if (vlan.vid_begin)
return -EINVAL;
vlan.vid_begin = vinfo->vid;
+   /* don't allow range of pvids */
+   if (vlan.flags & BRIDGE_VLAN_INFO_PVID)
+   return -EINVAL;
} else if (vinfo->flags & BRIDGE_VLAN_INFO_RANGE_END) {
if (!vlan.vid_begin)
return -EINVAL;
-- 
2.4.3


[patch net-next v2 1/7] switchdev: introduce switchdev workqueue

2015-10-12 Thread Jiri Pirko

From: Jiri Pirko 

This is going to be used for deferred operations.

Signed-off-by: Jiri Pirko 
---
 include/net/switchdev.h   |  5 +
 net/switchdev/switchdev.c | 19 +++
 2 files changed, 24 insertions(+)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index 1ce7083..d2879f2 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -205,6 +205,7 @@ int switchdev_port_fdb_dump(struct sk_buff *skb, struct 
netlink_callback *cb,
 void switchdev_port_fwd_mark_set(struct net_device *dev,
 struct net_device *group_dev,
 bool joining);
+void switchdev_flush_deferred(void);
 
 #else
 
@@ -326,6 +327,10 @@ static inline void switchdev_port_fwd_mark_set(struct 
net_device *dev,
 {
 }
 
+static inline void switchdev_flush_deferred(void)
+{
+}
+
 #endif
 
 #endif /* _LINUX_SWITCHDEV_H_ */
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 7a9ab90..d119e9c 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -17,9 +17,12 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
+static struct workqueue_struct *switchdev_wq;
+
 /**
  * switchdev_trans_item_enqueue - Enqueue data item to transaction queue
  *
@@ -1217,3 +1220,19 @@ void switchdev_port_fwd_mark_set(struct net_device *dev,
dev->offload_fwd_mark = mark;
 }
 EXPORT_SYMBOL_GPL(switchdev_port_fwd_mark_set);
+
+void switchdev_flush_deferred(void)
+{
+   flush_workqueue(switchdev_wq);
+}
+EXPORT_SYMBOL_GPL(switchdev_flush_deferred);
+
+static int __init switchdev_init(void)
+{
+   switchdev_wq = create_workqueue("switchdev");
+   if (!switchdev_wq)
+   return -ENOMEM;
+   return 0;
+}
+
+subsys_initcall(switchdev_init);
-- 
1.9.3


[PATCH net-next 1/4] bridge: vlan: use proper rcu for the vlgrp member

2015-10-12 Thread Nikolay Aleksandrov

From: Nikolay Aleksandrov 

The bridge and port's vlgrp member is already used in an RCU way. Currently
we rely on the fact that it cannot disappear while the port exists but
that is error-prone and we might miss places with improper locking
(either RCU or RTNL must be held to walk the vlan_list). So make it
official and use RCU for vlgrp to catch offenders. Introduce proper vlgrp
accessors and use them consistently throughout the code.

Signed-off-by: Nikolay Aleksandrov 
---
 net/bridge/br_device.c  |   2 +-
 net/bridge/br_forward.c |   6 +--
 net/bridge/br_input.c   |   4 +-
 net/bridge/br_netlink.c |   4 +-
 net/bridge/br_private.h |  34 +--
 net/bridge/br_vlan.c| 107 +---
 6 files changed, 104 insertions(+), 53 deletions(-)

diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index bdfb9544ca03..5e88d3e17546 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -56,7 +56,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct 
net_device *dev)
skb_reset_mac_header(skb);
skb_pull(skb, ETH_HLEN);
 
-   if (!br_allowed_ingress(br, br_vlan_group(br), skb, ))
+   if (!br_allowed_ingress(br, br_vlan_group_rcu(br), skb, ))
goto out;
 
if (is_broadcast_ether_addr(dest))
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 6d5ed795c3e2..a9d424e20229 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -32,7 +32,7 @@ static inline int should_deliver(const struct net_bridge_port 
*p,
 {
struct net_bridge_vlan_group *vg;
 
-   vg = nbp_vlan_group(p);
+   vg = nbp_vlan_group_rcu(p);
return ((p->flags & BR_HAIRPIN_MODE) || skb->dev != p->dev) &&
br_allowed_egress(vg, skb) && p->state == BR_STATE_FORWARDING;
 }
@@ -80,7 +80,7 @@ static void __br_deliver(const struct net_bridge_port *to, 
struct sk_buff *skb)
 {
struct net_bridge_vlan_group *vg;
 
-   vg = nbp_vlan_group(to);
+   vg = nbp_vlan_group_rcu(to);
skb = br_handle_vlan(to->br, vg, skb);
if (!skb)
return;
@@ -112,7 +112,7 @@ static void __br_forward(const struct net_bridge_port *to, 
struct sk_buff *skb)
return;
}
 
-   vg = nbp_vlan_group(to);
+   vg = nbp_vlan_group_rcu(to);
skb = br_handle_vlan(to->br, vg, skb);
if (!skb)
return;
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index f5c5a4500e2f..f7fba74108a9 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -44,7 +44,7 @@ static int br_pass_frame_up(struct sk_buff *skb)
brstats->rx_bytes += skb->len;
u64_stats_update_end(>syncp);
 
-   vg = br_vlan_group(br);
+   vg = br_vlan_group_rcu(br);
/* Bridge is just like any other port.  Make sure the
 * packet is allowed except in promisc modue when someone
 * may be running packet capture.
@@ -140,7 +140,7 @@ int br_handle_frame_finish(struct net *net, struct sock 
*sk, struct sk_buff *skb
if (!p || p->state == BR_STATE_DISABLED)
goto drop;
 
-   if (!br_allowed_ingress(p->br, nbp_vlan_group(p), skb, ))
+   if (!br_allowed_ingress(p->br, nbp_vlan_group_rcu(p), skb, ))
goto out;
 
/* insert into forwarding database after filtering to avoid spoofing */
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index d78b4429505a..edee48e9aa8f 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -102,10 +102,10 @@ static size_t br_get_link_af_size_filtered(const struct 
net_device *dev,
rcu_read_lock();
if (br_port_exists(dev)) {
p = br_port_get_rcu(dev);
-   vg = nbp_vlan_group(p);
+   vg = nbp_vlan_group_rcu(p);
} else if (dev->priv_flags & IFF_EBRIDGE) {
br = netdev_priv(dev);
-   vg = br_vlan_group(br);
+   vg = br_vlan_group_rcu(br);
}
num_vlan_infos = br_get_num_vlan_infos(vg, filter_mask);
rcu_read_unlock();
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 09d3ecbcb4f0..7d14ba93bba4 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -132,6 +132,7 @@ struct net_bridge_vlan_group {
struct list_headvlan_list;
u16 num_vlans;
u16 pvid;
+   struct rcu_head rcu;
 };
 
 struct net_bridge_fdb_entry
@@ -229,7 +230,7 @@ struct net_bridge_port
struct netpoll  *np;
 #endif
 #ifdef CONFIG_BRIDGE_VLAN_FILTERING
-   struct net_bridge_vlan_group*vlgrp;
+   struct net_bridge_vlan_group__rcu *vlgrp;
 #endif
 };
 
@@ -337,7 +338,7 @@ struct net_bridge
struct kobject  *ifobj;
u32

Re: [PATCH net-next v2 1/2] hisilicon net: removes the once HANDEL_TX_MSG macro

2015-10-12 Thread Arnd Bergmann

On Monday 12 October 2015 11:38:24 huangdaode wrote:
> On 2015/10/12 11:24, Joe Perches wrote:
> > Hello Huang.
> >
> > On Mon, 2015-10-12 at 11:23 +0800, huangdaode wrote:
> >> This patch changes the code style to make the code more simple.
> >> also removes the once used HNADEL_TX_MSG macro, according to the
> > HANDEL_TX_MSG typo
> >
> >> review comments from Joe Perches.
> >>
> >> Signed-off-by: huangdaode 
> >> Reviewed-by: Joe Perches 
> > I didn't review this.
> > '
> Hi Joe
> please refer to http://lists.openwall.net/netdev/2015/10/11/61

You should only add the "Reviewed-by" tag to a commit message
if the person you are adding replied with this explicitly.
Don't add it just because someone gave feedback.

Arnd

Re: [RFC PATCH 1/2] perf: Add the flag sample_disable not to output data on samples

2015-10-12 Thread Peter Zijlstra

On Mon, Oct 12, 2015 at 08:05:20PM +0800, Wangnan (F) wrote:
> 
> 
> On 2015/10/12 20:02, Peter Zijlstra wrote:
> >On Mon, Oct 12, 2015 at 09:02:42AM +, Kaixu Xia wrote:
> >>--- a/include/linux/perf_event.h
> >>+++ b/include/linux/perf_event.h
> >>@@ -483,6 +483,8 @@ struct perf_event {
> >>perf_overflow_handler_t overflow_handler;
> >>void *overflow_handler_context;
> >>+   atomic_t *sample_disable;
> >>+
> >>  #ifdef CONFIG_EVENT_TRACING
> >>struct trace_event_call *tp_event;
> >>struct event_filter *filter;
> >>diff --git a/kernel/events/core.c b/kernel/events/core.c
> >>index b11756f..f6ef45c 100644
> >>--- a/kernel/events/core.c
> >>+++ b/kernel/events/core.c
> >>@@ -6337,6 +6337,9 @@ static int __perf_event_overflow(struct perf_event *event,
> >>irq_work_queue(&event->pending);
> >>}
> >>+   if ((event->sample_disable) && atomic_read(event->sample_disable))
> >>+   return ret;
> >>+
> >>if (event->overflow_handler)
> >>event->overflow_handler(event, data, regs);
> >>else
> >Try and guarantee sample_disable lives in the same cacheline as
> >overflow_handler.
> 
> Could you please explain why we need them to be in a same cacheline?

Because otherwise you've just added a cacheline miss to this relatively
hot path.
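
As a rough illustration of the layout check being asked for, here is a minimal userspace sketch. The struct demo_event, the helper names and the 64-byte line size are all assumptions made for the example, not the real struct perf_event layout, and the check only holds if the structure itself starts on a cacheline boundary.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed cacheline size; the real value is architecture dependent. */
#define DEMO_CACHELINE_SIZE 64

/* Hypothetical stand-in for the relevant fields of struct perf_event. */
struct demo_event {
	void (*overflow_handler)(void);
	void *overflow_handler_context;
	int *sample_disable;
};

/* True when two byte offsets fall into the same cacheline-sized block,
 * assuming the enclosing object is itself cacheline aligned.
 */
static bool same_cacheline(size_t off_a, size_t off_b, size_t line)
{
	return (off_a / line) == (off_b / line);
}

/* Check the two fields discussed in the thread, on the demo struct. */
static bool demo_fields_share_line(void)
{
	return same_cacheline(offsetof(struct demo_event, overflow_handler),
			      offsetof(struct demo_event, sample_disable),
			      DEMO_CACHELINE_SIZE);
}
```

In kernel code the same idea would more likely be expressed as a build-time check on the real structure; the helper above only shows the arithmetic.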

[patch net-next v2 4/7] switchdev: introduce possibility to defer obj_add/del

2015-10-12 Thread Jiri Pirko

From: Jiri Pirko 

Similar to the attr use case, the caller knows whether he is holding RTNL and is
in an atomic section. So let the caller decide the correct call variant.

This allows drivers to sleep inside their ops and wait for hw to get the
operation status. Then the status is propagated into switchdev core.
This avoids silent errors in drivers.

Signed-off-by: Jiri Pirko 
---
 include/net/switchdev.h   |   1 +
 net/switchdev/switchdev.c | 137 +-
 2 files changed, 112 insertions(+), 26 deletions(-)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index 767d516..14e2595 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -69,6 +69,7 @@ enum switchdev_obj_id {
 
 struct switchdev_obj {
enum switchdev_obj_id id;
+   u32 flags;
 };
 
 /* SWITCHDEV_OBJ_ID_PORT_VLAN */
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index c23dd31..5d0d731 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -311,21 +311,8 @@ static int __switchdev_port_obj_add(struct net_device *dev,
return err;
 }
 
-/**
- * switchdev_port_obj_add - Add port object
- *
- * @dev: port device
- * @id: object ID
- * @obj: object to add
- *
- * Use a 2-phase prepare-commit transaction model to ensure
- * system is not left in a partially updated state due to
- * failure from driver/device.
- *
- * rtnl_lock must be held.
- */
-int switchdev_port_obj_add(struct net_device *dev,
-  const struct switchdev_obj *obj)
+static int switchdev_port_obj_add_now(struct net_device *dev,
+ const struct switchdev_obj *obj)
 {
struct switchdev_trans trans;
int err;
@@ -367,17 +354,9 @@ int switchdev_port_obj_add(struct net_device *dev,
 
return err;
 }
-EXPORT_SYMBOL_GPL(switchdev_port_obj_add);
 
-/**
- * switchdev_port_obj_del - Delete port object
- *
- * @dev: port device
- * @id: object ID
- * @obj: object to delete
- */
-int switchdev_port_obj_del(struct net_device *dev,
-  const struct switchdev_obj *obj)
+static int switchdev_port_obj_del_now(struct net_device *dev,
+ const struct switchdev_obj *obj)
 {
const struct switchdev_ops *ops = dev->switchdev_ops;
struct net_device *lower_dev;
@@ -393,13 +372,119 @@ int switchdev_port_obj_del(struct net_device *dev,
 */
 
netdev_for_each_lower_dev(dev, lower_dev, iter) {
-   err = switchdev_port_obj_del(lower_dev, obj);
+   err = switchdev_port_obj_del_now(lower_dev, obj);
if (err)
break;
}
 
return err;
 }
+
+struct switchdev_obj_work {
+   struct work_struct work;
+   struct net_device *dev;
+   struct switchdev_obj obj;
+   bool add; /* add of del */
+};
+
+static void switchdev_port_obj_work(struct work_struct *work)
+{
+   struct switchdev_obj_work *ow =
+   container_of(work, struct switchdev_obj_work, work);
+   bool rtnl_locked = rtnl_is_locked();
+   int err;
+
+   if (!rtnl_locked)
+   rtnl_lock();
+   if (ow->add)
+   err = switchdev_port_obj_add_now(ow->dev, &ow->obj);
+   else
+   err = switchdev_port_obj_del_now(ow->dev, &ow->obj);
+   if (err && err != -EOPNOTSUPP)
+   netdev_err(ow->dev, "failed (err=%d) to %s object (id=%d)\n",
+  err, ow->add ? "add" : "del", ow->obj.id);
+   if (!rtnl_locked)
+   rtnl_unlock();
+
+   dev_put(ow->dev);
+   kfree(ow);
+}
+
+static int switchdev_port_obj_work_schedule(struct net_device *dev,
+   const struct switchdev_obj *obj,
+   bool add)
+{
+   struct switchdev_obj_work *ow;
+
+   ow = kmalloc(sizeof(*ow), GFP_ATOMIC);
+   if (!ow)
+   return -ENOMEM;
+
+   INIT_WORK(&ow->work, switchdev_port_obj_work);
+
+   dev_hold(dev);
+   ow->dev = dev;
+   memcpy(&ow->obj, obj, sizeof(ow->obj));
+   ow->add = add;
+
+   queue_work(switchdev_wq, &ow->work);
+   return 0;
+}
+
+static int switchdev_port_obj_add_defer(struct net_device *dev,
+   const struct switchdev_obj *obj)
+{
+   return switchdev_port_obj_work_schedule(dev, obj, true);
+}
+
+/**
+ * switchdev_port_obj_add - Add port object
+ *
+ * @dev: port device
+ * @id: object ID
+ * @obj: object to add
+ *
+ * Use a 2-phase prepare-commit transaction model to ensure
+ * system is not left in a partially updated state due to
+ * failure from driver/device.
+ *
+ * rtnl_lock must be held and must not be in atomic section,
+ * in case SWITCHDEV_F_DEFER flag is not set.
+ */
+int

[patch net-next v2 2/7] switchdev: allow caller to explicitly request attr_set as deferred

2015-10-12 Thread Jiri Pirko

From: Jiri Pirko 

Caller should know if he can call attr_set directly (when holding RTNL)
or if he has to defer the attr_set processing for later.

This also allows drivers to sleep inside attr_set and report operation
status back to switchdev core. Switchdev core then warns if status is
not ok, instead of silent errors happening in drivers.

Signed-off-by: Jiri Pirko 
---
 include/net/switchdev.h   |   1 +
 net/bridge/br_stp.c   |   3 +-
 net/switchdev/switchdev.c | 107 --
 3 files changed, 59 insertions(+), 52 deletions(-)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index d2879f2..6b109e4 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -17,6 +17,7 @@
 
 #define SWITCHDEV_F_NO_RECURSE BIT(0)
 #define SWITCHDEV_F_SKIP_EOPNOTSUPP BIT(1)
+#define SWITCHDEV_F_DEFER  BIT(2)
 
 struct switchdev_trans_item {
struct list_head list;
diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c
index db6d243de..80c34d7 100644
--- a/net/bridge/br_stp.c
+++ b/net/bridge/br_stp.c
@@ -41,13 +41,14 @@ void br_set_state(struct net_bridge_port *p, unsigned int state)
 {
struct switchdev_attr attr = {
.id = SWITCHDEV_ATTR_ID_PORT_STP_STATE,
+   .flags = SWITCHDEV_F_DEFER,
.u.stp_state = state,
};
int err;
 
p->state = state;
err = switchdev_port_attr_set(p->dev, &attr);
-   if (err && err != -EOPNOTSUPP)
+   if (err)
br_warn(p->br, "error setting offload STP state on port %u(%s)\n",
(unsigned int) p->port_no, p->dev->name);
 }
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index d119e9c..9f94272 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -173,6 +173,49 @@ done:
return err;
 }
 
+static int switchdev_port_attr_set_now(struct net_device *dev,
+  struct switchdev_attr *attr)
+{
+   struct switchdev_trans trans;
+   int err;
+
+   switchdev_trans_init(&trans);
+
+   /* Phase I: prepare for attr set. Driver/device should fail
+* here if there are going to be issues in the commit phase,
+* such as lack of resources or support.  The driver/device
+* should reserve resources needed for the commit phase here,
+* but should not commit the attr.
+*/
+
+   trans.ph_prepare = true;
+   err = __switchdev_port_attr_set(dev, attr, &trans);
+   if (err) {
+   /* Prepare phase failed: abort the transaction.  Any
+* resources reserved in the prepare phase are
+* released.
+*/
+
+   if (err != -EOPNOTSUPP)
+   switchdev_trans_items_destroy(&trans);
+
+   return err;
+   }
+
+   /* Phase II: commit attr set.  This cannot fail as a fault
+* of driver/device.  If it does, it's a bug in the driver/device
+* because the driver said everythings was OK in phase I.
+*/
+
+   trans.ph_prepare = false;
+   err = __switchdev_port_attr_set(dev, attr, &trans);
+   WARN(err, "%s: Commit of attribute (id=%d) failed.\n",
+dev->name, attr->id);
+   switchdev_trans_items_warn_destroy(dev, &trans);
+
+   return err;
+}
+
 struct switchdev_attr_set_work {
struct work_struct work;
struct net_device *dev;
@@ -183,14 +226,17 @@ static void switchdev_port_attr_set_work(struct work_struct *work)
 {
struct switchdev_attr_set_work *asw =
container_of(work, struct switchdev_attr_set_work, work);
+   bool rtnl_locked = rtnl_is_locked();
int err;
 
-   rtnl_lock();
-   err = switchdev_port_attr_set(asw->dev, &asw->attr);
+   if (!rtnl_locked)
+   rtnl_lock();
+   err = switchdev_port_attr_set_now(asw->dev, &asw->attr);
if (err && err != -EOPNOTSUPP)
netdev_err(asw->dev, "failed (err=%d) to set attribute (id=%d)\n",
   err, asw->attr.id);
-   rtnl_unlock();
+   if (!rtnl_locked)
+   rtnl_unlock();
 
dev_put(asw->dev);
kfree(work);
@@ -211,7 +257,7 @@ static int switchdev_port_attr_set_defer(struct net_device *dev,
asw->dev = dev;
memcpy(&asw->attr, attr, sizeof(asw->attr));
 
-   schedule_work(&asw->work);
+   queue_work(switchdev_wq, &asw->work);
 
return 0;
 }
@@ -225,57 +271,16 @@ static int switchdev_port_attr_set_defer(struct 
net_device *dev,
  * Use a 2-phase prepare-commit transaction model to ensure
  * system is not left in a partially updated state due to
  * failure from driver/device.
+ *
+ * rtnl_lock must be held and must not be in atomic section,
+ * in case SWITCHDEV_F_DEFER flag is not set.
  */
 int switchdev_port_attr_set(struct net_device *dev, struct switchdev_attr 
*attr)
 {
-

[patch net-next v2 5/7] bridge: defer switchdev fdb del call in fdb_del_external_learn

2015-10-12 Thread Jiri Pirko

From: Jiri Pirko 

Since spinlock is held here, defer the switchdev operation.

Signed-off-by: Jiri Pirko 
---
 net/bridge/br_fdb.c | 5 -
 net/bridge/br_if.c  | 3 +++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index f5e7da0..c88bd8e 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -134,7 +134,10 @@ static void fdb_del_hw_addr(struct net_bridge *br, const 
unsigned char *addr)
 static void fdb_del_external_learn(struct net_bridge_fdb_entry *f)
 {
struct switchdev_obj_port_fdb fdb = {
-   .obj.id = SWITCHDEV_OBJ_ID_PORT_FDB,
+   .obj = {
+   .id = SWITCHDEV_OBJ_ID_PORT_FDB,
+   .flags = SWITCHDEV_F_DEFER,
+   },
.vid = f->vlan_id,
};
 
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 934cae9..09147cb 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "br_private.h"
 
@@ -249,6 +250,8 @@ static void del_nbp(struct net_bridge_port *p)
list_del_rcu(&p->list);
 
br_fdb_delete_by_port(br, p, 0, 1);
+   switchdev_flush_deferred();
+
nbp_update_port_count(br);
 
netdev_upper_dev_unlink(dev, br->dev);
-- 
1.9.3


[patch net-next v2 6/7] rocker: remove nowait from switchdev callbacks.

2015-10-12 Thread Jiri Pirko

From: Jiri Pirko 

No need to avoid sleeping in switchdev callbacks now, as the switchdev
core allows it.

Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/rocker/rocker.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/rocker/rocker.c 
b/drivers/net/ethernet/rocker/rocker.c
index bb956a5..9629c5b5 100644
--- a/drivers/net/ethernet/rocker/rocker.c
+++ b/drivers/net/ethernet/rocker/rocker.c
@@ -3672,7 +3672,7 @@ static int rocker_port_fdb_flush(struct rocker_port 
*rocker_port,
rocker_port->stp_state == BR_STATE_FORWARDING)
return 0;
 
-   flags |= ROCKER_OP_FLAG_REMOVE;
+   flags |= ROCKER_OP_FLAG_NOWAIT | ROCKER_OP_FLAG_REMOVE;
 
spin_lock_irqsave(&rocker->fdb_tbl_lock, lock_flags);
 
@@ -4382,8 +4382,7 @@ static int rocker_port_attr_set(struct net_device *dev,
 
switch (attr->id) {
case SWITCHDEV_ATTR_ID_PORT_STP_STATE:
-   err = rocker_port_stp_update(rocker_port, trans,
-ROCKER_OP_FLAG_NOWAIT,
+   err = rocker_port_stp_update(rocker_port, trans, 0,
 attr->u.stp_state);
break;
case SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS:
@@ -4517,7 +4516,7 @@ static int rocker_port_fdb_del(struct rocker_port 
*rocker_port,
   const struct switchdev_obj_port_fdb *fdb)
 {
__be16 vlan_id = rocker_port_vid_to_vlan(rocker_port, fdb->vid, NULL);
-   int flags = ROCKER_OP_FLAG_NOWAIT | ROCKER_OP_FLAG_REMOVE;
+   int flags = ROCKER_OP_FLAG_REMOVE;
 
if (!rocker_port_is_bridged(rocker_port))
return -EINVAL;
-- 
1.9.3


[patch net-next v2 7/7] switchdev: assert rtnl mutex when going over lower netdevs

2015-10-12 Thread Jiri Pirko

From: Jiri Pirko 

netdev_for_each_lower_dev has to be called with rtnl mutex held. So
better enforce it in switchdev functions.

Signed-off-by: Jiri Pirko 
---
 net/switchdev/switchdev.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 5d0d731..11e0291 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -494,6 +494,8 @@ EXPORT_SYMBOL_GPL(switchdev_port_obj_del);
  * @id: object ID
  * @obj: object to dump
  * @cb: function to call with a filled object
+ *
+ * rtnl_lock must be held.
  */
 int switchdev_port_obj_dump(struct net_device *dev, struct switchdev_obj *obj,
switchdev_obj_dump_cb_t *cb)
@@ -503,6 +505,8 @@ int switchdev_port_obj_dump(struct net_device *dev, struct 
switchdev_obj *obj,
struct list_head *iter;
int err = -EOPNOTSUPP;
 
+   ASSERT_RTNL();
+
if (ops && ops->switchdev_port_obj_dump)
return ops->switchdev_port_obj_dump(dev, obj, cb);
 
@@ -1068,6 +1072,8 @@ static struct net_device *switchdev_get_dev_by_nhs(struct 
fib_info *fi)
struct net_device *dev = NULL;
int nhsel;
 
+   ASSERT_RTNL();
+
/* For this route, all nexthop devs must be on the same switch. */
 
for (nhsel = 0; nhsel < fi->fib_nhs; nhsel++) {
@@ -1298,10 +1304,11 @@ void switchdev_port_fwd_mark_set(struct net_device *dev,
u32 mark = dev->ifindex;
u32 reset_mark = 0;
 
-   if (group_dev && joining) {
-   mark = switchdev_port_fwd_mark_get(dev, group_dev);
-   } else if (group_dev && !joining) {
-   if (dev->offload_fwd_mark == mark)
+   if (group_dev) {
+   ASSERT_RTNL();
+   if (joining)
+   mark = switchdev_port_fwd_mark_get(dev, group_dev);
+   else if (dev->offload_fwd_mark == mark)
/* Ohoh, this port was the mark reference port,
 * but it's leaving the group, so reset the
 * mark for the remaining ports in the group.
-- 
1.9.3


[patch net-next v2 3/7] switchdev: remove pointers from switchdev objects

2015-10-12 Thread Jiri Pirko

From: Jiri Pirko 

When object is used in deferred work, we cannot use pointers in
switchdev object structures because the memory they point at may be already
used by someone else. So rather do local copy of the value.
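
To make the reasoning concrete, here is a minimal userspace sketch of storing the address by value. The names demo_fdb_copy, demo_fdb_fill and DEMO_ETH_ALEN are hypothetical stand-ins, and memcpy is used where the kernel patch uses ether_addr_copy().

```c
#include <string.h>

#define DEMO_ETH_ALEN 6 /* length of an Ethernet MAC address in bytes */

/* Hypothetical object that owns its own copy of the address, so it stays
 * valid even if the caller's buffer is reused before deferred work runs.
 */
struct demo_fdb_copy {
	unsigned char addr[DEMO_ETH_ALEN];
	unsigned short vid;
};

/* Fill the object by copying the caller's address bytes into it. */
static void demo_fdb_fill(struct demo_fdb_copy *fdb,
			  const unsigned char *addr, unsigned short vid)
{
	memcpy(fdb->addr, addr, DEMO_ETH_ALEN);
	fdb->vid = vid;
}
```

With a pointer member instead, the deferred handler would read whatever the caller's memory holds by the time the work item runs.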

Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/rocker/rocker.c |  6 +++---
 include/net/switchdev.h  |  7 +++
 net/bridge/br_fdb.c  |  2 +-
 net/dsa/slave.c  |  2 +-
 net/switchdev/switchdev.c| 11 +++
 5 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/rocker/rocker.c 
b/drivers/net/ethernet/rocker/rocker.c
index eafa907..bb956a5 100644
--- a/drivers/net/ethernet/rocker/rocker.c
+++ b/drivers/net/ethernet/rocker/rocker.c
@@ -4469,7 +4469,7 @@ static int rocker_port_obj_add(struct net_device *dev,
fib4 = SWITCHDEV_OBJ_IPV4_FIB(obj);
err = rocker_port_fib_ipv4(rocker_port, trans,
   htonl(fib4->dst), fib4->dst_len,
-  fib4->fi, fib4->tb_id, 0);
+  &fib4->fi, fib4->tb_id, 0);
break;
case SWITCHDEV_OBJ_ID_PORT_FDB:
err = rocker_port_fdb_add(rocker_port, trans,
@@ -4541,7 +4541,7 @@ static int rocker_port_obj_del(struct net_device *dev,
fib4 = SWITCHDEV_OBJ_IPV4_FIB(obj);
err = rocker_port_fib_ipv4(rocker_port, NULL,
   htonl(fib4->dst), fib4->dst_len,
-  fib4->fi, fib4->tb_id,
+  &fib4->fi, fib4->tb_id,
   ROCKER_OP_FLAG_REMOVE);
break;
case SWITCHDEV_OBJ_ID_PORT_FDB:
@@ -4571,7 +4571,7 @@ static int rocker_port_fdb_dump(const struct rocker_port 
*rocker_port,
hash_for_each_safe(rocker->fdb_tbl, bkt, tmp, found, entry) {
if (found->key.rocker_port != rocker_port)
continue;
-   fdb->addr = found->key.addr;
+   ether_addr_copy(fdb->addr, found->key.addr);
fdb->ndm_state = NUD_REACHABLE;
fdb->vid = rocker_port_vlan_to_vid(rocker_port,
   found->key.vlan_id);
diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index 6b109e4..767d516 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define SWITCHDEV_F_NO_RECURSE BIT(0)
 #define SWITCHDEV_F_SKIP_EOPNOTSUPP BIT(1)
@@ -59,8 +60,6 @@ struct switchdev_attr {
} u;
 };
 
-struct fib_info;
-
 enum switchdev_obj_id {
SWITCHDEV_OBJ_ID_UNDEFINED,
SWITCHDEV_OBJ_ID_PORT_VLAN,
@@ -88,7 +87,7 @@ struct switchdev_obj_ipv4_fib {
struct switchdev_obj obj;
u32 dst;
int dst_len;
-   struct fib_info *fi;
+   struct fib_info fi;
u8 tos;
u8 type;
u32 nlflags;
@@ -101,7 +100,7 @@ struct switchdev_obj_ipv4_fib {
 /* SWITCHDEV_OBJ_ID_PORT_FDB */
 struct switchdev_obj_port_fdb {
struct switchdev_obj obj;
-   const unsigned char *addr;
+   unsigned char addr[ETH_ALEN];
u16 vid;
u16 ndm_state;
 };
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index f43ce05..f5e7da0 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -135,10 +135,10 @@ static void fdb_del_external_learn(struct 
net_bridge_fdb_entry *f)
 {
struct switchdev_obj_port_fdb fdb = {
.obj.id = SWITCHDEV_OBJ_ID_PORT_FDB,
-   .addr = f->addr.addr,
.vid = f->vlan_id,
};
 
+   ether_addr_copy(fdb.addr, f->addr.addr);
switchdev_port_obj_del(f->dst->dev, &fdb.obj);
 }
 
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index bb2bd3b..2ad4e0e 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -393,7 +393,7 @@ static int dsa_slave_port_fdb_dump(struct net_device *dev,
if (ret < 0)
break;
 
-   fdb->addr = addr;
+   ether_addr_copy(fdb->addr, addr);
fdb->vid = vid;
fdb->ndm_state = is_static ? NUD_NOARP : NUD_REACHABLE;
 
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 9f94272..c23dd31 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -837,10 +838,10 @@ int switchdev_port_fdb_add(struct ndmsg *ndm, struct 
nlattr *tb[],
 {
struct switchdev_obj_port_fdb fdb = {
.obj.id = SWITCHDEV_OBJ_ID_PORT_FDB,
-   .addr = addr,
.vid = vid,
};
 
+   ether_addr_copy(fdb.addr, addr);
return switchdev_port_obj_add(dev, &fdb.obj);
 }

[patch net-next v2 0/7] switchdev: change locking

2015-10-12 Thread Jiri Pirko

From: Jiri Pirko 

This is something which I'm currently struggling with.
Callers of attr_set and obj_add/del often hold not only RTNL, but also
spinlock (bridge). So in that case, the driver implementing the op cannot sleep.

The way rocker is dealing with this now is just to invoke driver operation
and go out, without any checking or reporting of the operation status.

Since it would be nice to at least put a warning in case the operation fails,
it makes sense to do this in delayed work directly in switchdev core
instead of implementing this in separate drivers. And that is what this patchset
is introducing.

So from now on, the locking of switchdev mod ops is consistent. Caller either
holds rtnl mutex or in case it does not, caller sets defer flag, telling
switchdev core to process the op later in delayed work.

Flush function for switchdev deferred ops can be called by op
caller in appropriate location, for example after it releases
spin lock, to force switchdev core to process pending ops.
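
The calling convention described above can be sketched in plain userspace C. Everything here is an assumption made for illustration: the demo_ names, the fixed-size array standing in for the workqueue, and the failure counter standing in for netdev_err(). Only the idea of a defer flag plus an explicit flush comes from the series.

```c
#include <stddef.h>

#define DEMO_F_DEFER (1u << 2) /* same bit position as SWITCHDEV_F_DEFER */
#define DEMO_QUEUE_LEN 16      /* hypothetical queue capacity */

struct demo_obj {
	int id;
	unsigned int flags;
};

typedef int (*demo_op_fn)(const struct demo_obj *obj);

struct demo_work {
	struct demo_obj obj; /* copied by value, not referenced */
	demo_op_fn op;
};

static struct demo_work demo_queue[DEMO_QUEUE_LEN];
static size_t demo_pending;
static int demo_failures; /* stands in for logging a warning */

/* Run the op immediately, or queue it if the caller set DEMO_F_DEFER. */
static int demo_obj_op(const struct demo_obj *obj, demo_op_fn op)
{
	if (!(obj->flags & DEMO_F_DEFER))
		return op(obj);
	if (demo_pending == DEMO_QUEUE_LEN)
		return -1; /* no room, comparable to an allocation failure */
	demo_queue[demo_pending].obj = *obj;
	demo_queue[demo_pending].op = op;
	demo_pending++;
	return 0;
}

/* Process every pending op and record failures instead of dropping them. */
static void demo_flush_deferred(void)
{
	size_t i;

	for (i = 0; i < demo_pending; i++)
		if (demo_queue[i].op(&demo_queue[i].obj))
			demo_failures++;
	demo_pending = 0;
}

/* Example op used for demonstration: negative ids are rejected. */
static int demo_op_check(const struct demo_obj *obj)
{
	return obj->id < 0 ? -1 : 0;
}
```

A caller holding a spinlock would set the defer flag, return, and call the flush helper once the lock is released; a caller that can sleep would leave the flag clear and get the status back directly.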

v1->v2:
- rebased on current net-next head (including Scott's ageing patchset)

Jiri Pirko (7):
  switchdev: introduce switchdev workqueue
  switchdev: allow caller to explicitly request attr_set as deferred
  switchdev: remove pointers from switchdev objects
  switchdev: introduce possibility to defer obj_add/del
  bridge: defer switchdev fdb del call in fdb_del_external_learn
  rocker: remove nowait from switchdev callbacks.
  switchdev: assert rtnl mutex when going over lower netdevs

 drivers/net/ethernet/rocker/rocker.c |  13 +-
 include/net/switchdev.h  |  14 +-
 net/bridge/br_fdb.c  |   7 +-
 net/bridge/br_if.c   |   3 +
 net/bridge/br_stp.c  |   3 +-
 net/dsa/slave.c  |   2 +-
 net/switchdev/switchdev.c| 289 ---
 7 files changed, 231 insertions(+), 100 deletions(-)

-- 
1.9.3


Re: [PATCH net-next v2 1/2] hisilicon net: removes the once HANDEL_TX_MSG macro

2015-10-12 Thread Arnd Bergmann

On Monday 12 October 2015 11:23:44 huangdaode wrote:
> +   s += sprintf(s,
> +   "\t\ttx_ring on %p:%u,%u,%u,%u,%u,%llu,%llu\n",
> +   h->qs[i]->tx_ring.io_base,
> +   h->qs[i]->tx_ring.buf_size,
> +   h->qs[i]->tx_ring.desc_num,
> +   h->qs[i]->tx_ring.max_desc_num_per_pkt,
> +   h->qs[i]->tx_ring.max_raw_data_sz_per_desc,
> +   h->qs[i]->tx_ring.max_pkt_size,
> +   h->qs[i]->tx_ring.stats.sw_err_cnt,
> +   h->qs[i]->tx_ring.stats.io_err_cnt);

There is actually a more significant problem with this code, which I
failed to notice when doing the original bugfix: 

You have a sysfs interface here that exports internal data of the
device that should not be visible like this. One problem is that
the io_base is a kernel pointer that must not be visible to non-root
users (so we don't easily create an attack surface for exploits).
Another problem is that the format is not documented in Documentation/ABI/
and that you have multiple values in one sysfs file here.

It would probably be better to completely remove that sysfs interface, and
to use the ethtool netlink interface to export them.

Arnd

Re: [RFC PATCH 1/2] perf: Add the flag sample_disable not to output data on samples

2015-10-12 Thread Wangnan (F)




On 2015/10/12 20:02, Peter Zijlstra wrote:
> On Mon, Oct 12, 2015 at 09:02:42AM +, Kaixu Xia wrote:
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -483,6 +483,8 @@ struct perf_event {
>> perf_overflow_handler_t overflow_handler;
>> void *overflow_handler_context;
>> +   atomic_t *sample_disable;
>> +
>>  #ifdef CONFIG_EVENT_TRACING
>> struct trace_event_call *tp_event;
>> struct event_filter *filter;
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index b11756f..f6ef45c 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -6337,6 +6337,9 @@ static int __perf_event_overflow(struct perf_event *event,
>> irq_work_queue(&event->pending);
>> }
>> +   if ((event->sample_disable) && atomic_read(event->sample_disable))
>> +   return ret;
>> +
>> if (event->overflow_handler)
>> event->overflow_handler(event, data, regs);
>> else
>
> Try and guarantee sample_disable lives in the same cacheline as
> overflow_handler.

Could you please explain why we need them to be in a same cacheline?

Thank you.

> I think we should at the very least replace the kzalloc() currently used
> with a cacheline aligned alloc, and check the structure layout to verify
> these two do in fact share a cacheline.




Re: [PATCH net-next] switchdev: enforce no pvid flag in vlan ranges

2015-10-12 Thread Jiri Pirko

Mon, Oct 12, 2015 at 02:01:39PM CEST, ra...@blackwall.org wrote:
>From: Nikolay Aleksandrov 
>
>We shouldn't allow BRIDGE_VLAN_INFO_PVID flag in VLAN ranges.
>
>Signed-off-by: Nikolay Aleksandrov 

Acked-by: Jiri Pirko 

Re: [PATCH net-next] switchdev: enforce no pvid flag in vlan ranges

2015-10-12 Thread Elad Raz


> On Oct 12, 2015, at 3:01 PM, Nikolay Aleksandrov  wrote:
> 
> From: Nikolay Aleksandrov 
> 
> We shouldn't allow BRIDGE_VLAN_INFO_PVID flag in VLAN ranges.
> 
> Signed-off-by: Nikolay Aleksandrov 
> ---
> net/switchdev/switchdev.c | 3 +++
> 1 file changed, 3 insertions(+)
> 
> diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
> index 6e4a4f9ad927..256c596de896 100644
> --- a/net/switchdev/switchdev.c
> +++ b/net/switchdev/switchdev.c
> @@ -720,6 +720,9 @@ static int switchdev_port_br_afspec(struct net_device 
> *dev,
>   if (vlan.vid_begin)
>   return -EINVAL;
>   vlan.vid_begin = vinfo->vid;
> + /* don't allow range of pvids */
> + if (vlan.flags & BRIDGE_VLAN_INFO_PVID)
> + return -EINVAL;
>   } else if (vinfo->flags & BRIDGE_VLAN_INFO_RANGE_END) {
>   if (!vlan.vid_begin)
>   return -EINVAL;

Acked-by: Elad Raz 

> -- 
> 2.4.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net] switchdev: check if the vlan id is in the proper vlan range

2015-10-12 Thread Nikolay Aleksandrov

From: Nikolay Aleksandrov 

VLANs 0 and 4095 are reserved and shouldn't be used, add checks to
switchdev similar to the bridge. Also make sure ids above 4095 cannot
be passed either.
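
The check described above can be sketched as a small standalone predicate. The helper name and the DEMO_ prefix are hypothetical; the value 0x0fff matches the kernel's VLAN_VID_MASK.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_VLAN_VID_MASK 0x0fff /* 4095, same value as VLAN_VID_MASK */

/* VLAN ids 0 and 4095 are reserved, and anything above 4095 does not fit
 * into the 12-bit VID field, so only 1..4094 is accepted.
 */
static bool demo_vid_valid(uint16_t vid)
{
	return vid != 0 && vid < DEMO_VLAN_VID_MASK;
}
```

The patch expresses the same condition inverted, returning -EINVAL when the id is zero or not below the mask.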

Fixes: 47f8328bb1a4 ("switchdev: add new switchdev bridge setlink")
Signed-off-by: Nikolay Aleksandrov 
---
 net/switchdev/switchdev.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index fda38f830a10..77f5d17e2612 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -634,6 +635,8 @@ static int switchdev_port_br_afspec(struct net_device *dev,
if (nla_len(attr) != sizeof(struct bridge_vlan_info))
return -EINVAL;
vinfo = nla_data(attr);
+   if (!vinfo->vid || vinfo->vid >= VLAN_VID_MASK)
+   return -EINVAL;
vlan->flags = vinfo->flags;
if (vinfo->flags & BRIDGE_VLAN_INFO_RANGE_BEGIN) {
if (vlan->vid_begin)
-- 
2.4.3


Re: [PATCH v4 0/3] net: unix: fix use-after-free

2015-10-12 Thread Eric Dumazet

On Mon, 2015-10-12 at 13:54 +0100, Rainer Weikusat wrote:
> David Miller  writes:
> > From: Jason Baron 
> > Date: Fri,  9 Oct 2015 00:15:59 -0400
> >
> >> These patches are against mainline, I can re-base to net-next, please
> >> let me know.
> >> 
> >> They have been tested against: https://lkml.org/lkml/2015/9/13/195,
> >> which causes the use-after-free quite quickly and here:
> >> https://lkml.org/lkml/2015/10/2/693.
> >
> > I'd like to understand how patches that don't even compile can be
> > "tested"?
> >
> > net/unix/af_unix.c: In function ‘unix_dgram_writable’:
> > net/unix/af_unix.c:2480:3: error: ‘other_full’ undeclared (first use in 
> > this function)
> > net/unix/af_unix.c:2480:3: note: each undeclared identifier is reported 
> > only once for each function it appears in
> >
> > Could you explain how that works, I'm having a hard time understanding
> > this?
> 
> This is basically a workaround for the problem that it's not possible
> to tell epoll to let go of a certain wait queue: Instead of registering
> the peer_wait queue via sock_poll_wait, a wait_queue_t under control of
> the af_unix.c code is linked onto it which relays a wake up on the
> peer_wait queue to the 'ordinary' wait queue associated with the polled
> socket via a custom wake function. But (at least the code I looked at) it
> enqueues a unix socket on connect which has certain side effects (in
> particular, /dev/log will have a seriously large wait queue of entirely
> uninterested peers) and in many cases, this is simply not necessary, as
> the additional peer_wait event is only interesting in case a peer of a
> fan-in socket (like /dev/log) happens to be waiting for writability via
> poll/ select/ epoll/ ...
> 
> Since the wait queue handling code is now under control of the af_unix.c
> code, it can remove itself from the peer_wait queue prior to dropping
> its reference to a peer on disconnect or on detecting a dead peer in
> unix_dgram_sendmsg.
> --

Okay, but David was asking how the patch was supposed to be tested, and
applied, if it does not compile.

A patch is not only showing the idea, but must be ready for inclusion.

Please ?



Re: [patch net-next v2 4/7] switchdev: introduce possibility to defer obj_add/del

2015-10-12 Thread Jiri Pirko

Mon, Oct 12, 2015 at 04:34:25PM CEST, niko...@cumulusnetworks.com wrote:
>On 10/12/2015 03:15 PM, Jiri Pirko wrote:
>> From: Jiri Pirko 
>> 
>> Similar to the attr use case, the caller knows whether he is holding RTNL and is
>> in an atomic section. So let the caller decide the correct call variant.
>> 
>> This allows drivers to sleep inside their ops and wait for hw to get the
>> operation status. Then the status is propagated into switchdev core.
>> This avoids silent errors in drivers.
>> 
>> Signed-off-by: Jiri Pirko 
>> ---
>>  include/net/switchdev.h   |   1 +
>>  net/switchdev/switchdev.c | 137 
>> +-
>>  2 files changed, 112 insertions(+), 26 deletions(-)
>> 
>[snip]
>> +
>> +struct switchdev_obj_work {
>> +struct work_struct work;
>> +struct net_device *dev;
>> +struct switchdev_obj obj;
>> +bool add; /* add of del */
>s/of/or/ ? :-)

will fix, thanks.


>
>> +};
>> +
>> +static void switchdev_port_obj_work(struct work_struct *work)
>> +{
>> +struct switchdev_obj_work *ow =
>> +container_of(work, struct switchdev_obj_work, work);
>> +bool rtnl_locked = rtnl_is_locked();
>> +int err;
>> +
>> +if (!rtnl_locked)
>> +rtnl_lock();
>> +if (ow->add)
>> +err = switchdev_port_obj_add_now(ow->dev, &ow->obj);
>> +else
>> +err = switchdev_port_obj_del_now(ow->dev, &ow->obj);
>> +if (err && err != -EOPNOTSUPP)
>> +netdev_err(ow->dev, "failed (err=%d) to %s object (id=%d)\n",
>> +   err, ow->add ? "add" : "del", ow->obj.id);
>> +if (!rtnl_locked)
>> +rtnl_unlock();
>> +
>> +dev_put(ow->dev);
>> +kfree(ow);
>> +}
>> +
>> +static int switchdev_port_obj_work_schedule(struct net_device *dev,
>> +const struct switchdev_obj *obj,
>> +bool add)
>> +{
>> +struct switchdev_obj_work *ow;
>> +
>> +ow = kmalloc(sizeof(*ow), GFP_ATOMIC);
>> +if (!ow)
>> +return -ENOMEM;
>> +
>> +INIT_WORK(&ow->work, switchdev_port_obj_work);
>> +
>This can be called without rtnl, what stops the device from disappearing
>between the above and the hold below ?

You are right. I will have to figure that out. Btw the same issue
already exists for attr_set deferred work.


>
>> +dev_hold(dev);
>> +ow->dev = dev;
>> +memcpy(&ow->obj, obj, sizeof(ow->obj));
>> +ow->add = add;
>> +
>> +queue_work(switchdev_wq, &ow->work);
>> +return 0;
>> +}
>> +
>[snip]
>

Re: [patch net-next v2 4/7] switchdev: introduce possibility to defer obj_add/del

2015-10-12 Thread Nikolay Aleksandrov

On 10/12/2015 04:34 PM, Nikolay Aleksandrov wrote:
> On 10/12/2015 03:15 PM, Jiri Pirko wrote:
>> From: Jiri Pirko 
>>
>> Similar to the attr usecase, the caller knows if he is holding RTNL and is
>> in atomic section. So let the caller decide the correct call variant.
>>
>> This allows drivers to sleep inside their ops and wait for hw to get the
>> operation status. Then the status is propagated into switchdev core.
>> This avoids silent errors in drivers.
>>
>> Signed-off-by: Jiri Pirko 
>> ---
>>  include/net/switchdev.h   |   1 +
>>  net/switchdev/switchdev.c | 137 
>> +-
>>  2 files changed, 112 insertions(+), 26 deletions(-)
>>
> [snip]
>> +
>> +struct switchdev_obj_work {
>> +struct work_struct work;
>> +struct net_device *dev;
>> +struct switchdev_obj obj;
>> +bool add; /* add of del */
> s/of/or/ ? :-)
> 
>> +};
>> +
>> +static void switchdev_port_obj_work(struct work_struct *work)
>> +{
>> +struct switchdev_obj_work *ow =
>> +container_of(work, struct switchdev_obj_work, work);
>> +bool rtnl_locked = rtnl_is_locked();
>> +int err;
>> +
>> +if (!rtnl_locked)
>> +rtnl_lock();
>> +if (ow->add)
>> +err = switchdev_port_obj_add_now(ow->dev, &ow->obj);
>> +else
>> +err = switchdev_port_obj_del_now(ow->dev, &ow->obj);
>> +if (err && err != -EOPNOTSUPP)
>> +netdev_err(ow->dev, "failed (err=%d) to %s object (id=%d)\n",
>> +   err, ow->add ? "add" : "del", ow->obj.id);
>> +if (!rtnl_locked)
>> +rtnl_unlock();
>> +
>> +dev_put(ow->dev);
>> +kfree(ow);
>> +}
>> +
>> +static int switchdev_port_obj_work_schedule(struct net_device *dev,
>> +const struct switchdev_obj *obj,
>> +bool add)
>> +{
>> +struct switchdev_obj_work *ow;
>> +
>> +ow = kmalloc(sizeof(*ow), GFP_ATOMIC);
>> +if (!ow)
>> +return -ENOMEM;
>> +
>> +INIT_WORK(&ow->work, switchdev_port_obj_work);
>> +
> This can be called without rtnl, what stops the device from disappearing
> between the above and the hold below ?
> 
Never mind this question, got it.

Cheers,
 Nik


Re: [patch net-next v2 4/7] switchdev: introduce possibility to defer obj_add/del

2015-10-12 Thread Nikolay Aleksandrov

On 10/12/2015 04:44 PM, Jiri Pirko wrote:
> Mon, Oct 12, 2015 at 04:34:25PM CEST, niko...@cumulusnetworks.com wrote:
>> On 10/12/2015 03:15 PM, Jiri Pirko wrote:
>>> From: Jiri Pirko 
>>>
>>> Similar to the attr usecase, the caller knows if he is holding RTNL and is
>>> in atomic section. So let the caller decide the correct call variant.
>>>
>>> This allows drivers to sleep inside their ops and wait for hw to get the
>>> operation status. Then the status is propagated into switchdev core.
>>> This avoids silent errors in drivers.
>>>
>>> Signed-off-by: Jiri Pirko 
>>> ---
>>>  include/net/switchdev.h   |   1 +
>>>  net/switchdev/switchdev.c | 137 
>>> +-
>>>  2 files changed, 112 insertions(+), 26 deletions(-)
>>>
>> [snip]
>>> +
>>> +struct switchdev_obj_work {
>>> +   struct work_struct work;
>>> +   struct net_device *dev;
>>> +   struct switchdev_obj obj;
>>> +   bool add; /* add of del */
>> s/of/or/ ? :-)
> 
> will fix, thanks.
> 
> 
>>
>>> +};
>>> +
>>> +static void switchdev_port_obj_work(struct work_struct *work)
>>> +{
>>> +   struct switchdev_obj_work *ow =
>>> +   container_of(work, struct switchdev_obj_work, work);
>>> +   bool rtnl_locked = rtnl_is_locked();
>>> +   int err;
>>> +
>>> +   if (!rtnl_locked)
>>> +   rtnl_lock();
>>> +   if (ow->add)
>>> +   err = switchdev_port_obj_add_now(ow->dev, &ow->obj);
>>> +   else
>>> +   err = switchdev_port_obj_del_now(ow->dev, &ow->obj);
>>> +   if (err && err != -EOPNOTSUPP)
>>> +   netdev_err(ow->dev, "failed (err=%d) to %s object (id=%d)\n",
>>> +  err, ow->add ? "add" : "del", ow->obj.id);
>>> +   if (!rtnl_locked)
>>> +   rtnl_unlock();
>>> +
>>> +   dev_put(ow->dev);
>>> +   kfree(ow);
>>> +}
>>> +
>>> +static int switchdev_port_obj_work_schedule(struct net_device *dev,
>>> +   const struct switchdev_obj *obj,
>>> +   bool add)
>>> +{
>>> +   struct switchdev_obj_work *ow;
>>> +
>>> +   ow = kmalloc(sizeof(*ow), GFP_ATOMIC);
>>> +   if (!ow)
>>> +   return -ENOMEM;
>>> +
>>> +   INIT_WORK(&ow->work, switchdev_port_obj_work);
>>> +
>> This can be called without rtnl, what stops the device from disappearing
>> between the above and the hold below ?
> 
> You are right. I will have to figure that out. Btw the same issue
> already exists for attr_set deferred work.
> 
> 

I have to say there are a few users now that need delayed RTNL execution,
bonding being a heavy one; teaming, I think, also has some rtnl delays.
Maybe it's time we do a generic delayed rtnl execution so it can be re-used
by all.




Re: [patch net-next v2 4/7] switchdev: introduce possibility to defer obj_add/del

2015-10-12 Thread Nikolay Aleksandrov

On 10/12/2015 03:15 PM, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> Similar to the attr usecase, the caller knows if he is holding RTNL and is
> in atomic section. So let the caller decide the correct call variant.
> 
> This allows drivers to sleep inside their ops and wait for hw to get the
> operation status. Then the status is propagated into switchdev core.
> This avoids silent errors in drivers.
> 
> Signed-off-by: Jiri Pirko 
> ---
>  include/net/switchdev.h   |   1 +
>  net/switchdev/switchdev.c | 137 
> +-
>  2 files changed, 112 insertions(+), 26 deletions(-)
> 
[snip]
> +
> +struct switchdev_obj_work {
> + struct work_struct work;
> + struct net_device *dev;
> + struct switchdev_obj obj;
> + bool add; /* add of del */
s/of/or/ ? :-)

> +};
> +
> +static void switchdev_port_obj_work(struct work_struct *work)
> +{
> + struct switchdev_obj_work *ow =
> + container_of(work, struct switchdev_obj_work, work);
> + bool rtnl_locked = rtnl_is_locked();
> + int err;
> +
> + if (!rtnl_locked)
> + rtnl_lock();
> + if (ow->add)
> + err = switchdev_port_obj_add_now(ow->dev, &ow->obj);
> + else
> + err = switchdev_port_obj_del_now(ow->dev, &ow->obj);
> + if (err && err != -EOPNOTSUPP)
> + netdev_err(ow->dev, "failed (err=%d) to %s object (id=%d)\n",
> +err, ow->add ? "add" : "del", ow->obj.id);
> + if (!rtnl_locked)
> + rtnl_unlock();
> +
> + dev_put(ow->dev);
> + kfree(ow);
> +}
> +
> +static int switchdev_port_obj_work_schedule(struct net_device *dev,
> + const struct switchdev_obj *obj,
> + bool add)
> +{
> + struct switchdev_obj_work *ow;
> +
> + ow = kmalloc(sizeof(*ow), GFP_ATOMIC);
> + if (!ow)
> + return -ENOMEM;
> +
> + INIT_WORK(&ow->work, switchdev_port_obj_work);
> +
This can be called without rtnl, what stops the device from disappearing
between the above and the hold below ?

> + dev_hold(dev);
> + ow->dev = dev;
> + memcpy(&ow->obj, obj, sizeof(ow->obj));
> + ow->add = add;
> +
> + queue_work(switchdev_wq, &ow->work);
> + return 0;
> +}
> +
[snip]


Re: [patch net-next v2 4/7] switchdev: introduce possibility to defer obj_add/del

2015-10-12 Thread Jiri Pirko

Mon, Oct 12, 2015 at 04:42:12PM CEST, niko...@cumulusnetworks.com wrote:
>On 10/12/2015 04:34 PM, Nikolay Aleksandrov wrote:
>> On 10/12/2015 03:15 PM, Jiri Pirko wrote:
>>> From: Jiri Pirko 
>>>
>>> Similar to the attr usecase, the caller knows if he is holding RTNL and is
>>> in atomic section. So let the caller decide the correct call variant.
>>>
>>> This allows drivers to sleep inside their ops and wait for hw to get the
>>> operation status. Then the status is propagated into switchdev core.
>>> This avoids silent errors in drivers.
>>>
>>> Signed-off-by: Jiri Pirko 
>>> ---
>>>  include/net/switchdev.h   |   1 +
>>>  net/switchdev/switchdev.c | 137 
>>> +-
>>>  2 files changed, 112 insertions(+), 26 deletions(-)
>>>
>> [snip]
>>> +
>>> +struct switchdev_obj_work {
>>> +   struct work_struct work;
>>> +   struct net_device *dev;
>>> +   struct switchdev_obj obj;
>>> +   bool add; /* add of del */
>> s/of/or/ ? :-)
>> 
>>> +};
>>> +
>>> +static void switchdev_port_obj_work(struct work_struct *work)
>>> +{
>>> +   struct switchdev_obj_work *ow =
>>> +   container_of(work, struct switchdev_obj_work, work);
>>> +   bool rtnl_locked = rtnl_is_locked();
>>> +   int err;
>>> +
>>> +   if (!rtnl_locked)
>>> +   rtnl_lock();
>>> +   if (ow->add)
>>> +   err = switchdev_port_obj_add_now(ow->dev, &ow->obj);
>>> +   else
>>> +   err = switchdev_port_obj_del_now(ow->dev, &ow->obj);
>>> +   if (err && err != -EOPNOTSUPP)
>>> +   netdev_err(ow->dev, "failed (err=%d) to %s object (id=%d)\n",
>>> +  err, ow->add ? "add" : "del", ow->obj.id);
>>> +   if (!rtnl_locked)
>>> +   rtnl_unlock();
>>> +
>>> +   dev_put(ow->dev);
>>> +   kfree(ow);
>>> +}
>>> +
>>> +static int switchdev_port_obj_work_schedule(struct net_device *dev,
>>> +   const struct switchdev_obj *obj,
>>> +   bool add)
>>> +{
>>> +   struct switchdev_obj_work *ow;
>>> +
>>> +   ow = kmalloc(sizeof(*ow), GFP_ATOMIC);
>>> +   if (!ow)
>>> +   return -ENOMEM;
>>> +
>>> +   INIT_WORK(&ow->work, switchdev_port_obj_work);
>>> +
>> This can be called without rtnl, what stops the device from disappearing
>> between the above and the hold below ?
>> 
>Never mind this question, got it.

Well anyone without rtnl should hold dev in the first place. So this
should be ok.

Re: [PATCH net-next V15 2/3] Check for vlan ethernet types for 8021.q or 802.1ad

2015-10-12 Thread Sergei Shtylyov


Hello.

On 10/11/2015 2:40 AM, Thomas F Herbert wrote:


Signed-off-by: Thomas F Herbert 
---
  include/linux/if_vlan.h | 17 +
  1 file changed, 17 insertions(+)

diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
index 67ce5bd..88d1be4 100644
--- a/include/linux/if_vlan.h
+++ b/include/linux/if_vlan.h
@@ -627,6 +627,23 @@ static inline netdev_features_t vlan_features_check(const 
struct sk_buff *skb,

return features;
  }
+/**
+ * eth_type_vlan - check for valid vlan ether type.
+ * @ethertype: ether type to check
+ *
+ * Returns true if the ether type is a vlan ether type.
+ */
+static inline bool eth_type_vlan(__be16 ethertype)
+{
+   switch (ethertype) {
+   case (htons(ETH_P_8021Q)):
+   return true;
+   case (htons(ETH_P_8021AD)):
+   return true;


   I'm not sure if I've already suggested that or not but why not merge these 
2 cases?



+   default:
+   return false;
+   }
+}


[...]

MBR, Sergei
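
For reference, merging the two cases as suggested above could look roughly
like the following. This is only a userspace sketch, not the kernel patch:
the SKETCH_ETH_P_* constants are local stand-ins for ETH_P_8021Q and
ETH_P_8021AD, and the switch is done on the host-order value via ntohs() so
that the case labels are plain integer constants.

```c
#include <stdbool.h>
#include <stdint.h>
#include <arpa/inet.h>

/* Local stand-ins for the kernel's ETH_P_8021Q / ETH_P_8021AD. */
#define SKETCH_ETH_P_8021Q  0x8100
#define SKETCH_ETH_P_8021AD 0x88A8

/* Takes the ether type in network byte order, like the kernel helper. */
static bool eth_type_vlan(uint16_t ethertype_be)
{
	switch (ntohs(ethertype_be)) {
	case SKETCH_ETH_P_8021Q:
	case SKETCH_ETH_P_8021AD:	/* both labels share one return */
		return true;
	default:
		return false;
	}
}
```

Stacking the labels keeps a single "return true" and makes it obvious that
both ether types are treated identically.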


Re: [PATCH 18/23] spear13xx_pcie_gadget: use per-attribute show and store methods

2015-10-12 Thread Felipe Balbi

Christoph Hellwig  writes:

> On Fri, Oct 09, 2015 at 04:05:17PM -0500, Felipe Balbi wrote:
>> Pratyush Anand  writes:
>> 
>> > On Sat, Oct 3, 2015 at 7:02 PM, Christoph Hellwig  wrote:
>> >> Signed-off-by: Christoph Hellwig 
>> >
>> > Acked-by: Pratyush Anand 
>> 
>> I don't seem to have the actual patch, care to resend?
>
> The whole series was sent to all the recipients in the To and Cc lists,
> so check your spam folder or one of the list archives.

I don't have it. If you want it to reach upstream, please resend.

-- 
balbi



Re: [RFC PATCH 1/2] perf: Add the flag sample_disable not to output data on samples

2015-10-12 Thread kbuild test robot

Hi Kaixu,

[auto build test ERROR on tip/perf/core -- if it's inappropriate base, please 
suggest rules for selecting the more suitable base]

url:
https://github.com/0day-ci/linux/commits/Kaixu-Xia/bpf-enable-disable-events-stored-in-PERF_EVENT_ARRAY-maps-trace-data-output-when-perf-sampling/20151012-170616
config: m68k-allyesconfig (attached as .config)
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=m68k

All errors (new ones prefixed by >>):

   kernel/bpf/arraymap.c: In function 'perf_event_fd_array_get_ptr':
>> kernel/bpf/arraymap.c:305:7: error: 'struct perf_event' has no member named 'sample_disable'
 event->sample_disable = >perf_sample_disable;
  ^

vim +305 kernel/bpf/arraymap.c

   299  if (attr->type != PERF_TYPE_RAW &&
   300  attr->type != PERF_TYPE_HARDWARE) {
   301  perf_event_release_kernel(event);
   302  return ERR_PTR(-EINVAL);
   303  }
   304  
 > 305  event->sample_disable = >perf_sample_disable;
   306  return event;
   307  }
   308  

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


.config.gz
Description: Binary data

[PATCH v2 net] raw: increment correct SNMP counters for ICMP messages

2015-10-12 Thread Ben Cartwright-Cox

Sending ICMP packets with raw sockets makes the SNMP counters log the
first byte of the IPv4 header as the ICMP type, rather than the type
field of the ICMP header. This is fixed by adding the IP header length
before casting to an icmphdr struct.

Signed-off-by: Ben Cartwright-Cox 
Acked-by: Eric Dumazet 
---
 net/ipv4/raw.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 561cd4b..ef3c9ba 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -406,10 +406,12 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 
*fl4,
ip_select_ident(net, skb, NULL);
 
iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
+   skb->transport_header += iphlen;
+   if (iph->protocol == IPPROTO_ICMP &&
+   length >= iphlen + sizeof(struct icmphdr))
+   icmp_out_count(net, ((struct icmphdr *)
+   skb_transport_header(skb))->type);
}
-   if (iph->protocol == IPPROTO_ICMP)
-   icmp_out_count(net, ((struct icmphdr *)
-   skb_transport_header(skb))->type);
 
err = NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_OUT, sk, skb,
  NULL, rt->dst.dev, dst_output_sk);
-- 
1.9.1
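
To illustrate the offset issue the patch addresses, here is a small
userspace sketch (not the kernel code) that reads the ICMP type from a raw
IPv4 packet. The function name and the byte-buffer interface are
assumptions made for the example; the point is that the ICMP header starts
ihl * 4 bytes into the packet, and that the length has to cover it before
it is read.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the ICMP type of a raw IPv4 packet, or -1 if the packet is not
 * IPv4/ICMP or is too short. Hypothetical helper for illustration only.
 */
static int icmp_type_from_ipv4(const uint8_t *pkt, size_t len)
{
	size_t iphlen;

	if (len < 20 || (pkt[0] >> 4) != 4)
		return -1;

	iphlen = (size_t)(pkt[0] & 0x0f) * 4;	/* IHL is in 32-bit words */
	if (iphlen < 20)
		return -1;

	if (pkt[9] != 1)			/* protocol field, 1 = ICMP */
		return -1;

	if (len < iphlen + 8)			/* 8 = basic ICMP header */
		return -1;

	/* Reading pkt[0] here instead would be the bug described above. */
	return pkt[iphlen];
}
```

The length test mirrors the "length >= iphlen + sizeof(struct icmphdr)"
condition in the patch.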


Re: "ss -p" segfaults (updated to 4.2)

2015-10-12 Thread Stephen Hemminger

Applied, and did some editing on commit msg

Re: [PATCH net-next 3/4] bridge: vlan: break vlan_flush in two phases to keep old order

2015-10-12 Thread Nikolay Aleksandrov

On 10/12/2015 07:39 PM, Ido Schimmel wrote:
> Mon, Oct 12, 2015 at 02:41:08PM IDT, ra...@blackwall.org wrote:
>> From: Nikolay Aleksandrov 
>>
> Hi,
> 
>> Ido Schimmel reported a problem with switchdev devices because of the
>> order change of del_nbp operations, more specifically the move of
>> nbp_vlan_flush() which deletes all vlans and frees vlgrp after the
>> rx_handler has been unregistered. So in order to fix this break
>> vlan_flush in two phases:
>> 1. delete all of vlan_group's vlans
>> 2. destroy rhtable and free vlgrp
>> We execute phase I (free_rht == false) in the same place as before so the
>> vlans can be cleared and free the vlgrp after the rx_handler has been
>> unregistered in phase II (free_rht == true).
> I don't fully understand the reason for the two-phase cleanup. Please
> see below.
>>
>> Reported-by: Ido Schimmel 
>> Signed-off-by: Nikolay Aleksandrov 
>> ---
>> Ido: I hope this fixes it for your case, a test would be much appreciated.
> This indeed fixes our switchdev issue. Thanks for the fix!
>>
[snip]
>>
>> -static void __vlan_flush(struct net_bridge_vlan_group *vlgrp)
>> +static void __vlan_flush(struct net_bridge_vlan_group *vlgrp, bool free_rht)
>> {
>>  struct net_bridge_vlan *vlan, *tmp;
>>
>>  __vlan_delete_pvid(vlgrp, vlgrp->pvid);
>>  list_for_each_entry_safe(vlan, tmp, &vlgrp->vlan_list, vlist)
>>  __vlan_del(vlan);
>> -rhashtable_destroy(&vlgrp->vlan_hash);
>> -kfree_rcu(vlgrp, rcu);
>> +
> Why not just issue a synchronize_rcu here and remove the if statement? I
> believe that would also be better for when we remove the bridge device
> itself. It's fully symmetric with init that way.
Hi,
I considered that, but I don't want to issue a second synchronize_rcu() for each
port when deleting them. With this change we avoid a second synchronize_rcu()
and use the rx_handler unregister one. In complex setups with lots of ports
this is a considerable overhead.
For the bridge device del case the call is the same; there are no two phases
there.

Cheers,
 Nik

Re: Issue with /proc/sys/net/ipv4/tcp_mem

2015-10-12 Thread Eric W. Biederman

wangyufen  writes:

> Hi,
>
> I tried on linux-4.1:
> linux:~# cat /proc/sys/net/ipv4/tcp_mem 
> 8388608	12582912	16777216
> linux:~# echo 1234 >/proc/sys/net/ipv4/tcp_mem 
> -bash: echo: write error: Invalid argument
> linux:~# cat /proc/sys/net/ipv4/tcp_mem 
> 1234	12582912	16777216
>
> The echo operation returned an error, but the value was already written to tcp_mem.
>
> I checked, patch f594d63199688ad568fb caused the issue.


If your problem is that you can not write a single value and instead
have to write all three values I don't know what to tell you.  I don't
see how that could have ever worked.

Certainly the commit you pointed at did not change that behavior.

Eric

Re: [PATCH net-next] switchdev: enforce no pvid flag in vlan ranges

2015-10-12 Thread Vivien Didelot

Hi guys,

On Oct. Monday 12 (42) 02:01 PM, Nikolay Aleksandrov wrote:
> From: Nikolay Aleksandrov 
> 
> We shouldn't allow BRIDGE_VLAN_INFO_PVID flag in VLAN ranges.
> 
> Signed-off-by: Nikolay Aleksandrov 
> ---
>  net/switchdev/switchdev.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
> index 6e4a4f9ad927..256c596de896 100644
> --- a/net/switchdev/switchdev.c
> +++ b/net/switchdev/switchdev.c
> @@ -720,6 +720,9 @@ static int switchdev_port_br_afspec(struct net_device 
> *dev,
>   if (vlan.vid_begin)
>   return -EINVAL;
>   vlan.vid_begin = vinfo->vid;
> + /* don't allow range of pvids */
> + if (vlan.flags & BRIDGE_VLAN_INFO_PVID)
> + return -EINVAL;
>   } else if (vinfo->flags & BRIDGE_VLAN_INFO_RANGE_END) {
>   if (!vlan.vid_begin)
>   return -EINVAL;
> -- 
> 2.4.3
> 

Yes, the patch looks good, though it is a minor check. I hope the
subject of this thread is making sense.

VLAN ranges seem to have been included for a UX purpose (so commands
look like Cisco IOS). We don't want to change any existing interface, so
we pushed that down to drivers, with the only valid reason being that,
maybe one day, some hardware will be capable of programming a range on a
per-port basis.

So what happens is that we'll add some code to fix and check nonsense
(e.g. range + PVID) in switchdev, bridge, and I'm sure we are missing
other spots.

Sorry for being insistent, but this still doesn't look right to me.

It seems like we are bloating bridge, switchdev and drivers for the only
reason to maintain a kernel support for something like:

# for i in $(seq 100 4000); do bridge vlan add vid $i dev swp0; done

Thanks,
-v
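
For readers following the quoted hunk, the range handling it extends can be
sketched in userspace roughly as below. This is an illustration under
assumptions, not the switchdev code: the VLAN_INFO_* bits are local
stand-ins for the BRIDGE_VLAN_INFO_* flags, the struct and function names
are made up, and the "end must be greater than begin" test is an assumed
extra check that does not appear in the quoted context.

```c
#include <errno.h>
#include <stdint.h>

/* Local stand-ins for the BRIDGE_VLAN_INFO_* flag bits. */
#define VLAN_INFO_PVID        (1 << 1)
#define VLAN_INFO_RANGE_BEGIN (1 << 3)
#define VLAN_INFO_RANGE_END   (1 << 4)

struct vlan_info  { uint16_t flags; uint16_t vid; };
struct vlan_range { uint16_t vid_begin; uint16_t vid_end; };

/*
 * Feed one vlan_info entry into the range being assembled.
 * Returns 1 when *r holds a complete range (or single VID),
 * 0 when a RANGE_END is still expected, -EINVAL on nonsense input.
 */
static int vlan_range_feed(struct vlan_range *r, const struct vlan_info *vi)
{
	if (vi->flags & VLAN_INFO_RANGE_BEGIN) {
		if (r->vid_begin)
			return -EINVAL;		/* range already open */
		if (vi->flags & VLAN_INFO_PVID)
			return -EINVAL;		/* no range of PVIDs */
		r->vid_begin = vi->vid;
		return 0;
	}
	if (vi->flags & VLAN_INFO_RANGE_END) {
		if (!r->vid_begin)
			return -EINVAL;		/* end without begin */
		if (vi->vid <= r->vid_begin)
			return -EINVAL;		/* assumed sanity check */
		r->vid_end = vi->vid;
		return 1;
	}
	if (r->vid_begin)
		return -EINVAL;			/* single VID inside open range */
	r->vid_begin = vi->vid;
	r->vid_end = vi->vid;
	return 1;
}
```

A caller would reset the vlan_range to zero after each completed range
before feeding the next entry.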

[patch net-next v4 1/7] switchdev: introduce switchdev workqueue

2015-10-12 Thread Jiri Pirko

From: Jiri Pirko 

This is going to be used for deferred operations.

Signed-off-by: Jiri Pirko 
---
 include/net/switchdev.h   |  5 +
 net/switchdev/switchdev.c | 19 +++
 2 files changed, 24 insertions(+)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index 1ce7083..d2879f2 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -205,6 +205,7 @@ int switchdev_port_fdb_dump(struct sk_buff *skb, struct 
netlink_callback *cb,
 void switchdev_port_fwd_mark_set(struct net_device *dev,
 struct net_device *group_dev,
 bool joining);
+void switchdev_flush_deferred(void);
 
 #else
 
@@ -326,6 +327,10 @@ static inline void switchdev_port_fwd_mark_set(struct 
net_device *dev,
 {
 }
 
+static inline void switchdev_flush_deferred(void)
+{
+}
+
 #endif
 
 #endif /* _LINUX_SWITCHDEV_H_ */
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 7a9ab90..d119e9c 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -17,9 +17,12 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
+static struct workqueue_struct *switchdev_wq;
+
 /**
  * switchdev_trans_item_enqueue - Enqueue data item to transaction queue
  *
@@ -1217,3 +1220,19 @@ void switchdev_port_fwd_mark_set(struct net_device *dev,
dev->offload_fwd_mark = mark;
 }
 EXPORT_SYMBOL_GPL(switchdev_port_fwd_mark_set);
+
+void switchdev_flush_deferred(void)
+{
+   flush_workqueue(switchdev_wq);
+}
+EXPORT_SYMBOL_GPL(switchdev_flush_deferred);
+
+static int __init switchdev_init(void)
+{
+   switchdev_wq = create_workqueue("switchdev");
+   if (!switchdev_wq)
+   return -ENOMEM;
+   return 0;
+}
+
+subsys_initcall(switchdev_init);
-- 
1.9.3


[patch net-next v4 0/7] switchdev: change locking

2015-10-12 Thread Jiri Pirko

From: Jiri Pirko 

This is something which I'm currently struggling with.
Callers of attr_set and obj_add/del often hold not only RTNL, but also
spinlock (bridge). So in that case, the driver implementing the op cannot sleep.

The way rocker is dealing with this now is just to invoke driver operation
and go out, without any checking or reporting of the operation status.

Since it would be nice to at least put a warning in case the operation fails,
it makes sense to do this in delayed work directly in switchdev core
instead of implementing this in separate drivers. And that is what this patchset
is introducing.

So from now on, the locking of switchdev mod ops is consistent. Caller either
holds rtnl mutex or in case it does not, caller sets defer flag, telling
switchdev core to process the op later in delayed work.

Flush function for switchdev deferred ops can be called by op
caller in appropriate location, for example after it releases
spin lock, to force switchdev core to process pending ops.

v1->v2:
- rebased on current net-next head (including Scott's ageing patchset)
v2->v3:
- fixed comment s/of/or/ typo suggested by Nik
v3->v4:
- the actual patchset is sent instead of the different branch I sent in v3 :/

Jiri Pirko (7):
  switchdev: introduce switchdev workqueue
  switchdev: allow caller to explicitly request attr_set as deferred
  switchdev: remove pointers from switchdev objects
  switchdev: introduce possibility to defer obj_add/del
  bridge: defer switchdev fdb del call in fdb_del_external_learn
  rocker: remove nowait from switchdev callbacks.
  switchdev: assert rtnl mutex when going over lower netdevs

 drivers/net/ethernet/rocker/rocker.c |  13 +-
 include/net/switchdev.h  |  14 +-
 net/bridge/br_fdb.c  |   7 +-
 net/bridge/br_if.c   |   3 +
 net/bridge/br_stp.c  |   3 +-
 net/dsa/slave.c  |   2 +-
 net/switchdev/switchdev.c| 289 ---
 7 files changed, 231 insertions(+), 100 deletions(-)

-- 
1.9.3
