[PATCH 08/13] ARM: mvebu: enable SRAM support in mvebu_v7_defconfig
Signed-off-by: Marcin Wojtas --- arch/arm/configs/mvebu_v7_defconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/arm/configs/mvebu_v7_defconfig b/arch/arm/configs/mvebu_v7_defconfig index c6729bf..fe57e20 100644 --- a/arch/arm/configs/mvebu_v7_defconfig +++ b/arch/arm/configs/mvebu_v7_defconfig @@ -58,6 +58,7 @@ CONFIG_MTD_M25P80=y CONFIG_MTD_NAND=y CONFIG_MTD_NAND_PXA3xx=y CONFIG_MTD_SPI_NOR=y +CONFIG_SRAM=y CONFIG_EEPROM_AT24=y CONFIG_BLK_DEV_SD=y CONFIG_ATA=y -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 06/13] net: mvneta: enable mixed egress processing using HR timer
Mixed approach allows using higher interrupt threshold (increased back to 15 packets), useful in high throughput. In case of small amount of data or very short TX queues HR timer ensures releasing buffers with small latency. Along with existing tx_done processing by coalescing interrupts this commit enables triggering HR timer each time the packets are sent. Time threshold can also be configured, using ethtool. Signed-off-by: Marcin Wojtas Signed-off-by: Simon Guinot --- drivers/net/ethernet/marvell/mvneta.c | 89 +-- 1 file changed, 85 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c index 9c9e858..f5acaf6 100644 --- a/drivers/net/ethernet/marvell/mvneta.c +++ b/drivers/net/ethernet/marvell/mvneta.c @@ -21,6 +21,8 @@ #include #include #include +#include +#include #include #include #include @@ -226,7 +228,8 @@ /* Various constants */ /* Coalescing */ -#define MVNETA_TXDONE_COAL_PKTS1 +#define MVNETA_TXDONE_COAL_PKTS15 +#define MVNETA_TXDONE_COAL_USEC100 #define MVNETA_RX_COAL_PKTS32 #define MVNETA_RX_COAL_USEC100 @@ -356,6 +359,11 @@ struct mvneta_port { struct net_device *dev; struct notifier_block cpu_notifier; + /* Egress finalization */ + struct tasklet_struct tx_done_tasklet; + struct hrtimer tx_done_timer; + bool timer_scheduled; + /* Core clock */ struct clk *clk; u8 mcast_count[256]; @@ -481,6 +489,7 @@ struct mvneta_tx_queue { int txq_get_index; u32 done_pkts_coal; + u32 done_time_coal; /* Virtual address of the TX DMA descriptors array */ struct mvneta_tx_desc *descs; @@ -1791,6 +1800,30 @@ error: return -ENOMEM; } +/* Trigger HR timer for TX processing */ +static void mvneta_timer_set(struct mvneta_port *pp) +{ + ktime_t interval; + + if (!pp->timer_scheduled) { + pp->timer_scheduled = true; + interval = ktime_set(0, pp->txqs[0].done_time_coal * 1000); + hrtimer_start(&pp->tx_done_timer, interval, + HRTIMER_MODE_REL_PINNED); + } +} + +/* TX processing HR timer callback */ +static enum hrtimer_restart mvneta_hr_timer_cb(struct hrtimer *timer) +{ + struct mvneta_port *pp = container_of(timer, struct mvneta_port, + tx_done_timer); + + tasklet_schedule(&pp->tx_done_tasklet); + + return HRTIMER_NORESTART; +} + /* Main tx processing */ static int mvneta_tx(struct sk_buff *skb, struct net_device *dev) { @@ -1862,10 +1895,13 @@ out: if (txq->count >= txq->tx_stop_threshold) netif_tx_stop_queue(nq); - if (!skb->xmit_more || netif_xmit_stopped(nq)) + if (!skb->xmit_more || netif_xmit_stopped(nq)) { mvneta_txq_pend_desc_add(pp, txq, frags); - else + if (txq->done_time_coal && !pp->timer_scheduled) + mvneta_timer_set(pp); + } else { txq->pending += frags; + } u64_stats_update_begin(&stats->syncp); stats->tx_packets++; @@ -1902,6 +1938,7 @@ static void mvneta_tx_done_gbe(struct mvneta_port *pp, u32 cause_tx_done) { struct mvneta_tx_queue *txq; struct netdev_queue *nq; + unsigned int tx_todo = 0; while (cause_tx_done) { txq = mvneta_tx_done_policy(pp, cause_tx_done); @@ -1909,12 +1946,40 @@ static void mvneta_tx_done_gbe(struct mvneta_port *pp, u32 cause_tx_done) nq = netdev_get_tx_queue(pp->dev, txq->id); __netif_tx_lock(nq, smp_processor_id()); - if (txq->count) + if (txq->count) { mvneta_txq_done(pp, txq); + tx_todo += txq->count; + } __netif_tx_unlock(nq); cause_tx_done &= ~((1 << txq->id)); } + + if (!pp->txqs[0].done_time_coal) + return; + + /* Set the timer in case not all the packets were +* processed. Otherwise attempt to cancel timer. +*/ + if (tx_todo) + mvneta_timer_set(pp); + else if (pp->timer_scheduled) + hrtimer_cancel(&pp->tx_done_timer); +} + +/* TX done processing tasklet */ +static void mvneta_tx_done_proc(unsigned long data) +{ + struct net_device *dev = (struct net_device *)data; + struct mvneta_port *pp = netdev_priv(dev); + + if (!netif_running(dev)) + return; + + pp->timer_scheduled = false; + + /* Process all Tx queues */ + mvneta_tx_done_gbe(pp, (1 << txq_number) - 1); } /* Compute crc8 of the specified address, using a unique al
[PATCH 10/13] ARM: mvebu: add buffer manager nodes to armada-38x.dtsi
Armada 38x network controller supports hardware buffer management (BM). Since it is now enabled in mvneta driver, appropriate nodes can be added to armada-38x.dtsi - for the actual common BM unit (bm@c8000) and its internal SRAM (bm-bppi), which is used for indirect access to buffer pointer ring residing in DRAM. Pools - ports mapping, bm-bppi entry in 'soc' node's ranges and optional parameters are supposed to be set in board files. Signed-off-by: Marcin Wojtas --- arch/arm/boot/dts/armada-38x.dtsi | 18 ++ 1 file changed, 18 insertions(+) diff --git a/arch/arm/boot/dts/armada-38x.dtsi b/arch/arm/boot/dts/armada-38x.dtsi index b7868b2..b9f4ce2 100644 --- a/arch/arm/boot/dts/armada-38x.dtsi +++ b/arch/arm/boot/dts/armada-38x.dtsi @@ -539,6 +539,14 @@ status = "disabled"; }; + bm: bm@c8000 { + compatible = "marvell,armada-380-neta-bm"; + reg = <0xc8000 0xac>; + clocks = <&gateclk 13>; + internal-mem = <&bm_bppi>; + status = "disabled"; + }; + sata@e { compatible = "marvell,armada-380-ahci"; reg = <0xe 0x2000>; @@ -617,6 +625,16 @@ #size-cells = <1>; ranges = <0 MBUS_ID(0x09, 0x15) 0 0x800>; }; + + bm_bppi: bm-bppi { + compatible = "mmio-sram"; + reg = ; + ranges = <0 MBUS_ID(0x0c, 0x04) 0 0x10>; + #address-cells = <1>; + #size-cells = <1>; + clocks = <&gateclk 13>; + status = "disabled"; + }; }; clocks { -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 07/13] bus: mvebu-mbus: provide api for obtaining IO and DRAM window information
This commit enables finding appropriate mbus window and obtaining its target id and attribute for given physical address in two separate routines, both for IO and DRAM windows. This functionality is needed for Armada XP/38x Network Controller's Buffer Manager and PnC configuration. Signed-off-by: Marcin Wojtas [DRAM window information reference in LKv3.10] Signed-off-by: Evan Wang --- drivers/bus/mvebu-mbus.c | 51 include/linux/mbus.h | 3 +++ 2 files changed, 54 insertions(+) diff --git a/drivers/bus/mvebu-mbus.c b/drivers/bus/mvebu-mbus.c index c43c3d2..3d1c0c3 100644 --- a/drivers/bus/mvebu-mbus.c +++ b/drivers/bus/mvebu-mbus.c @@ -948,6 +948,57 @@ void mvebu_mbus_get_pcie_io_aperture(struct resource *res) *res = mbus_state.pcie_io_aperture; } +int mvebu_mbus_get_dram_win_info(phys_addr_t phyaddr, u8 *target, u8 *attr) +{ + const struct mbus_dram_target_info *dram; + int i; + + /* Get dram info */ + dram = mv_mbus_dram_info(); + if (!dram) { + pr_err("missing DRAM information\n"); + return -ENODEV; + } + + /* Try to find matching DRAM window for phyaddr */ + for (i = 0; i < dram->num_cs; i++) { + const struct mbus_dram_window *cs = dram->cs + i; + + if (cs->base <= phyaddr && phyaddr <= (cs->base + cs->size)) { + *target = dram->mbus_dram_target_id; + *attr = cs->mbus_attr; + return 0; + } + } + + pr_err("invalid dram address 0x%x\n", phyaddr); + return -EINVAL; +} +EXPORT_SYMBOL_GPL(mvebu_mbus_get_dram_win_info); + +int mvebu_mbus_get_io_win_info(phys_addr_t phyaddr, u32 *size, u8 *target, + u8 *attr) +{ + int win; + + for (win = 0; win < mbus_state.soc->num_wins; win++) { + u64 wbase; + int enabled; + + mvebu_mbus_read_window(&mbus_state, win, &enabled, &wbase, + size, target, attr, NULL); + + if (!enabled) + continue; + + if (wbase <= phyaddr && phyaddr <= wbase + *size) + return win; + } + + return -EINVAL; +} +EXPORT_SYMBOL_GPL(mvebu_mbus_get_io_win_info); + static __init int mvebu_mbus_debugfs_init(void) { struct mvebu_mbus_state *s = &mbus_state; diff --git a/include/linux/mbus.h b/include/linux/mbus.h index 1f7bc63..ea34a86 100644 --- a/include/linux/mbus.h +++ b/include/linux/mbus.h @@ -69,6 +69,9 @@ static inline const struct mbus_dram_target_info *mv_mbus_dram_info_nooverlap(vo int mvebu_mbus_save_cpu_target(u32 *store_addr); void mvebu_mbus_get_pcie_mem_aperture(struct resource *res); void mvebu_mbus_get_pcie_io_aperture(struct resource *res); +int mvebu_mbus_get_dram_win_info(phys_addr_t phyaddr, u8 *target, u8 *attr); +int mvebu_mbus_get_io_win_info(phys_addr_t phyaddr, u32 *size, u8 *target, + u8 *attr); int mvebu_mbus_add_window_remap_by_id(unsigned int target, unsigned int attribute, phys_addr_t base, size_t size, -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 01/13] net: mvneta: add configuration for MBUS windows access protection
This commit adds missing configuration of MBUS windows access protection in mvneta_conf_mbus_windows function - a dedicated variable for that purpose remained there unused since v3.8 initial mvneta support. Because of that the register contents were inherited from the bootloader. Signed-off-by: Marcin Wojtas Cc: # v3.8+ --- drivers/net/ethernet/marvell/mvneta.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c index e84c7f2..0f30aaa 100644 --- a/drivers/net/ethernet/marvell/mvneta.c +++ b/drivers/net/ethernet/marvell/mvneta.c @@ -62,6 +62,7 @@ #define MVNETA_WIN_SIZE(w) (0x2204 + ((w) << 3)) #define MVNETA_WIN_REMAP(w) (0x2280 + ((w) << 2)) #define MVNETA_BASE_ADDR_ENABLE 0x2290 +#define MVNETA_ACCESS_PROTECT_ENABLE 0x2294 #define MVNETA_PORT_CONFIG 0x2400 #define MVNETA_UNI_PROMISC_MODEBIT(0) #define MVNETA_DEF_RXQ(q) ((q) << 1) @@ -3188,6 +3189,8 @@ static void mvneta_conf_mbus_windows(struct mvneta_port *pp, win_enable &= ~(1 << i); win_protect |= 3 << (2 * i); + + mvreg_write(pp, MVNETA_ACCESS_PROTECT_ENABLE, win_protect); } mvreg_write(pp, MVNETA_BASE_ADDR_ENABLE, win_enable); -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 05/13] net: mvneta: add xmit_more support
From: Simon Guinot Basing on xmit_more flag of the skb, TX descriptors can be concatenated before flushing. This commit delay Tx descriptor flush if the queue is running and if there is more skb's to send. Signed-off-by: Simon Guinot --- drivers/net/ethernet/marvell/mvneta.c | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c index f079b13..9c9e858 100644 --- a/drivers/net/ethernet/marvell/mvneta.c +++ b/drivers/net/ethernet/marvell/mvneta.c @@ -467,6 +467,7 @@ struct mvneta_tx_queue { * descriptor ring */ int count; + int pending; int tx_stop_threshold; int tx_wake_threshold; @@ -751,8 +752,9 @@ static void mvneta_txq_pend_desc_add(struct mvneta_port *pp, /* Only 255 descriptors can be added at once ; Assume caller * process TX desriptors in quanta less than 256 */ - val = pend_desc; + val = pend_desc + txq->pending; mvreg_write(pp, MVNETA_TXQ_UPDATE_REG(txq->id), val); + txq->pending = 0; } /* Get pointer to next TX descriptor to be processed (send) by HW */ @@ -1857,11 +1859,14 @@ out: struct netdev_queue *nq = netdev_get_tx_queue(dev, txq_id); txq->count += frags; - mvneta_txq_pend_desc_add(pp, txq, frags); - if (txq->count >= txq->tx_stop_threshold) netif_tx_stop_queue(nq); + if (!skb->xmit_more || netif_xmit_stopped(nq)) + mvneta_txq_pend_desc_add(pp, txq, frags); + else + txq->pending += frags; + u64_stats_update_begin(&stats->syncp); stats->tx_packets++; stats->tx_bytes += len; -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 13/13] ARM: mvebu: enable buffer manager support on Armada XP boards
Since mvneta driver supports using hardware buffer management (BM), in order to use it, board files have to be adjusted accordingly. This commit enables BM on AXP-DB and AXP-GP in same manner - because number of ports on those boards is the same as number of possible pools, each port is supposed to use single pool for all kind of packets. Moreover appropriate entry is added to 'soc' node ranges, as well as "okay" status for 'bm' and 'bm-bppi' (internal SRAM) nodes. Signed-off-by: Marcin Wojtas --- arch/arm/boot/dts/armada-xp-db.dts | 19 ++- arch/arm/boot/dts/armada-xp-gp.dts | 19 ++- 2 files changed, 36 insertions(+), 2 deletions(-) diff --git a/arch/arm/boot/dts/armada-xp-db.dts b/arch/arm/boot/dts/armada-xp-db.dts index f774101..3065730 100644 --- a/arch/arm/boot/dts/armada-xp-db.dts +++ b/arch/arm/boot/dts/armada-xp-db.dts @@ -77,7 +77,8 @@ MBUS_ID(0x01, 0x1d) 0 0 0xfff0 0x10 MBUS_ID(0x01, 0x2f) 0 0 0xf000 0x100 MBUS_ID(0x09, 0x09) 0 0 0xf810 0x1 - MBUS_ID(0x09, 0x05) 0 0 0xf811 0x1>; + MBUS_ID(0x09, 0x05) 0 0 0xf811 0x1 + MBUS_ID(0x0c, 0x04) 0 0 0xf120 0x10>; devbus-bootcs { status = "okay"; @@ -181,21 +182,33 @@ status = "okay"; phy = <&phy0>; phy-mode = "rgmii-id"; + buffer-manager = <&bm>; + bm,pool-long = <0>; }; ethernet@74000 { status = "okay"; phy = <&phy1>; phy-mode = "rgmii-id"; + buffer-manager = <&bm>; + bm,pool-long = <1>; }; ethernet@3 { status = "okay"; phy = <&phy2>; phy-mode = "sgmii"; + buffer-manager = <&bm>; + bm,pool-long = <2>; }; ethernet@34000 { status = "okay"; phy = <&phy3>; phy-mode = "sgmii"; + buffer-manager = <&bm>; + bm,pool-long = <3>; + }; + + bm@c { + status = "okay"; }; mvsdio@d4000 { @@ -230,5 +243,9 @@ }; }; }; + + bm-bppi { + status = "okay"; + }; }; }; diff --git a/arch/arm/boot/dts/armada-xp-gp.dts b/arch/arm/boot/dts/armada-xp-gp.dts index 4878d73..a1ded01 100644 --- a/arch/arm/boot/dts/armada-xp-gp.dts +++ b/arch/arm/boot/dts/armada-xp-gp.dts @@ -96,7 +96,8 @@ MBUS_ID(0x01, 0x1d) 0 0 0xfff0 0x10 MBUS_ID(0x01, 0x2f) 0 0 0xf000 0x100 MBUS_ID(0x09, 0x09) 0 0 0xf810 0x1 - MBUS_ID(0x09, 0x05) 0 0 0xf811 0x1>; + MBUS_ID(0x09, 0x05) 0 0 0xf811 0x1 + MBUS_ID(0x0c, 0x04) 0 0 0xf120 0x10>; devbus-bootcs { status = "okay"; @@ -196,21 +197,29 @@ status = "okay"; phy = <&phy0>; phy-mode = "qsgmii"; + buffer-manager = <&bm>; + bm,pool-long = <0>; }; ethernet@74000 { status = "okay"; phy = <&phy1>; phy-mode = "qsgmii"; + buffer-manager = <&bm>; + bm,pool-long = <1>; }; ethernet@3 { status = "okay"; phy = <&phy2>; phy-mode = "qsgmii"; + buffer-manager = <&bm>; + bm,pool-long = <2>; }; ethernet@34000 { status = "okay"; phy = <&phy3>; phy-mode = "qsgmii"; + buffer-manager = <&bm>; + bm,pool-long = <3>
[PATCH 11/13] ARM: mvebu: enable buffer manager support on Armada 38x boards
Since mvneta driver supports using hardware buffer management (BM), in order to use it, board files have to be adjusted accordingly. This commit enables BM on: * A385-DB-AP - each port has its own pool for long and common pool for short packets, * A388-DB - to each port unique 'short' and 'long' pools are mapped, * A388-GP - same as above. Moreover appropriate entry is added to 'soc' node ranges, as well as "okay" status for 'bm' and 'bm-bppi' (internal SRAM) nodes. Signed-off-by: Marcin Wojtas --- arch/arm/boot/dts/armada-385-db-ap.dts | 20 +++- arch/arm/boot/dts/armada-388-db.dts| 17 - arch/arm/boot/dts/armada-388-gp.dts| 17 - 3 files changed, 51 insertions(+), 3 deletions(-) diff --git a/arch/arm/boot/dts/armada-385-db-ap.dts b/arch/arm/boot/dts/armada-385-db-ap.dts index acd5b15..5f9451b 100644 --- a/arch/arm/boot/dts/armada-385-db-ap.dts +++ b/arch/arm/boot/dts/armada-385-db-ap.dts @@ -61,7 +61,8 @@ ranges = ; + MBUS_ID(0x09, 0x15) 0 0xf111 0x1 + MBUS_ID(0x0c, 0x04) 0 0xf120 0x10>; internal-regs { spi1: spi@10680 { @@ -138,12 +139,18 @@ status = "okay"; phy = <&phy2>; phy-mode = "sgmii"; + buffer-manager = <&bm>; + bm,pool-long = <1>; + bm,pool-short = <3>; }; ethernet@34000 { status = "okay"; phy = <&phy1>; phy-mode = "sgmii"; + buffer-manager = <&bm>; + bm,pool-long = <2>; + bm,pool-short = <3>; }; ethernet@7 { @@ -157,6 +164,13 @@ status = "okay"; phy = <&phy0>; phy-mode = "rgmii-id"; + buffer-manager = <&bm>; + bm,pool-long = <0>; + bm,pool-short = <3>; + }; + + bm@c8000 { + status = "okay"; }; nfc: flash@d { @@ -178,6 +192,10 @@ }; }; + bm-bppi { + status = "okay"; + }; + pcie-controller { status = "okay"; diff --git a/arch/arm/boot/dts/armada-388-db.dts b/arch/arm/boot/dts/armada-388-db.dts index ff47af5..ea93ed7 100644 --- a/arch/arm/boot/dts/armada-388-db.dts +++ b/arch/arm/boot/dts/armada-388-db.dts @@ -66,7 +66,8 @@ ranges = ; + MBUS_ID(0x09, 0x15) 0 0xf111 0x1 + MBUS_ID(0x0c, 0x04) 0 0xf120 0x10>; internal-regs { spi@10600 { @@ -99,6 +100,9 @@ status = "okay"; phy = <&phy1>; phy-mode = "rgmii-id"; + buffer-manager = <&bm>; + bm,pool-long = <2>; + bm,pool-short = <3>; }; usb@58000 { @@ -109,6 +113,9 @@ status = "okay"; phy = <&phy0>; phy-mode = "rgmii-id"; + buffer-manager = <&bm>; + bm,pool-long = <0>; + bm,pool-short = <1>; }; mdio@72004 { @@ -129,6 +136,10 @@ status = "okay"; }; + bm@c8000 { + status = "okay"; + }; + flash@d { status = "okay"; num-cs = <1>; @@ -169,6 +180,10 @@ }; }; + bm-bppi { + status = "okay"; + }; + pcie-controller { status = "okay"; /* diff --git a/arch/arm/boot/dts/armada-388-gp.dts b/arch/arm/boot/dts/armada-388-gp.dts index a633be3..0a3bd7f 100644 --- a/arch/arm/boot/dts/armada-388-gp.dts +++ b/arch/arm/boot/dts/armada-388-gp.dts @@ -60,7 +60,8 @@ ranges = ; + MBUS_ID(0x09, 0x15) 0 0xf111 0x1 + MBUS_ID(0x0c, 0x04) 0 0xf120 0x10>;
[PATCH 09/13] net: mvneta: bm: add support for hardware buffer management
Buffer manager (BM) is a dedicated hardware unit that can be used by all ethernet ports of Armada XP and 38x SoC's. It allows to offload CPU on RX path by sparing DRAM access on refilling buffer pool, hardware-based filling of descriptor ring data and better memory utilization due to HW arbitration for using 'short' pools for small packets. Tests performed with A388 SoC working as a network bridge between two packet generators showed increase of maximum processed 64B packets by ~20k (~555k packets with BM enabled vs ~535 packets without BM). Also when pushing 1500B-packets with a line rate achieved, CPU load decreased from around 25% without BM to 20% with BM. BM comprise up to 4 buffer pointers' (BP) rings kept in DRAM, which are called external BP pools - BPPE. Allocating and releasing buffer pointers (BP) to/from BPPE is performed indirectly by write/read access to a dedicated internal SRAM, where internal BP pools (BPPI) are placed. BM hardware controls status of BPPE automatically, as well as assigning proper buffers to RX descriptors. For more details please refer to Functional Specification of Armada XP or 38x SoC. In order to enable support for a separate hardware block, common for all ports, a new driver has to be implemented ('mvneta_bm'). It provides initialization sequence of address space, clocks, registers, SRAM, empty pools' structures and also obtaining optional configuration from DT (please refer to device tree binding documentation). mvneta_bm exposes also a necessary API to mvneta driver, as well as a dedicated structure with BM information (bm_priv), whose presence is used as a flag notifying of BM usage by port. It has to be ensured that mvneta_bm probe is executed prior to the ones in ports' driver. In case BM is not used or its probe fails, mvneta falls back to use software buffer management. A sequence executed in mvneta_probe function is modified in order to have an access to needed resources before possible port's BM initialization is done. According to port-pools mapping provided by DT appropriate registers are configured and the buffer pools are filled. RX path is modified accordingly. Becaues the hardware allows a wide variety of configuration options, following assumptions are made: * using BM mechanisms can be selectively disabled/enabled basing on DT configuration among the ports * 'long' pool's single buffer size is tied to port's MTU * using 'long' pool by port is obligatory and it cannot be shared * using 'short' pool for smaller packets is optional * one 'short' pool can be shared among all ports This commit enables hardware buffer management operation cooperating with existing mvneta driver. New device tree binding documentation is added and the one of mvneta is updated accordingly. Signed-off-by: Marcin Wojtas --- .../bindings/net/marvell-armada-370-neta.txt | 19 +- .../devicetree/bindings/net/marvell-neta-bm.txt| 49 ++ drivers/net/ethernet/marvell/Kconfig | 14 + drivers/net/ethernet/marvell/Makefile | 1 + drivers/net/ethernet/marvell/mvneta.c | 493 ++-- drivers/net/ethernet/marvell/mvneta_bm.c | 642 + drivers/net/ethernet/marvell/mvneta_bm.h | 171 ++ 7 files changed, 1335 insertions(+), 54 deletions(-) create mode 100644 Documentation/devicetree/bindings/net/marvell-neta-bm.txt create mode 100644 drivers/net/ethernet/marvell/mvneta_bm.c create mode 100644 drivers/net/ethernet/marvell/mvneta_bm.h diff --git a/Documentation/devicetree/bindings/net/marvell-armada-370-neta.txt b/Documentation/devicetree/bindings/net/marvell-armada-370-neta.txt index f5a8ca2..ea1aae2 100644 --- a/Documentation/devicetree/bindings/net/marvell-armada-370-neta.txt +++ b/Documentation/devicetree/bindings/net/marvell-armada-370-neta.txt @@ -8,14 +8,29 @@ Required properties: - phy-mode: See ethernet.txt file in the same directory - clocks: a pointer to the reference clock for this device. +Optional properties (valid only for Armada XP/38x): + +- buffer-manager: a phandle to a buffer manager node. Please refer to + Documentation/devicetree/bindings/net/marvell-neta-bm.txt +- bm,pool-long: ID of a pool, that will accept all packets of a size + higher than 'short' pool's threshold (if set) and up to MTU value. + Obligatory, when the port is supposed to use hardware + buffer management. +- bm,pool-short: ID of a pool, that will be used for accepting + packets of a size lower than given threshold. If not set, the port + will use a single 'long' pool for all packets, as defined above. + Example: -ethernet@d007 { +ethernet@7 { compatible = "marvell,armada-370-neta"; - reg = <0xd007 0x2500>; + reg = <0x7 0x2500>; interrupts = <8>; clocks = <&gate_clk 4>; status = "okay"; phy = <&phy0>; phy-mode = "rgmii-id"; + buffer-manager = <&bm>; + bm,pool-long = <0>; + bm,pool-s
[PATCH 12/13] ARM: mvebu: add buffer manager nodes to armada-xp.dtsi
Armada XP network controller supports hardware buffer management (BM). Since it is now enabled in mvneta driver, appropriate nodes can be added to armada-xp.dtsi - for the actual common BM unit (bm@c) and its internal SRAM (bm-bppi), which is used for indirect access to buffer pointer ring residing in DRAM. Pools - ports mapping, bm-bppi entry in 'soc' node's ranges and optional parameters are supposed to be set in board files. Signed-off-by: Marcin Wojtas --- arch/arm/boot/dts/armada-xp.dtsi | 18 ++ 1 file changed, 18 insertions(+) diff --git a/arch/arm/boot/dts/armada-xp.dtsi b/arch/arm/boot/dts/armada-xp.dtsi index be23196..bd45936 100644 --- a/arch/arm/boot/dts/armada-xp.dtsi +++ b/arch/arm/boot/dts/armada-xp.dtsi @@ -253,6 +253,14 @@ marvell,crypto-sram-size = <0x800>; }; + bm: bm@c { + compatible = "marvell,armada-380-neta-bm"; + reg = <0xc 0xac>; + clocks = <&gateclk 13>; + internal-mem = <&bm_bppi>; + status = "disabled"; + }; + xor@f0900 { compatible = "marvell,orion-xor"; reg = <0xF0900 0x100 @@ -291,6 +299,16 @@ #size-cells = <1>; ranges = <0 MBUS_ID(0x09, 0x05) 0 0x800>; }; + + bm_bppi: bm-bppi { + compatible = "mmio-sram"; + reg = ; + ranges = <0 MBUS_ID(0x0c, 0x04) 0 0x10>; + #address-cells = <1>; + #size-cells = <1>; + clocks = <&gateclk 13>; + status = "disabled"; + }; }; clocks { -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 04/13] net: mvneta: enable suspend/resume support
This commit introduces suspend/resume routines used for both in 'standby' and 'mem' modes. For the latter, in which registers' contents are lost, following steps are performed: * in suspend - update port statistics and, if interface is running, detach netif, clean the queues, disable cpu notifier, shutdown interface and reset port's link status; * in resume, for all interfaces, set default configuration of the port and MBUS windows; * in resume, in case the interface is running, enable accepting packets in legacy parser, power up the port, register cpu notifier and attach netif. Signed-off-by: Marcin Wojtas --- drivers/net/ethernet/marvell/mvneta.c | 70 +++ 1 file changed, 70 insertions(+) diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c index d12b8c6..f079b13 100644 --- a/drivers/net/ethernet/marvell/mvneta.c +++ b/drivers/net/ethernet/marvell/mvneta.c @@ -3442,6 +3442,72 @@ static int mvneta_remove(struct platform_device *pdev) return 0; } +#ifdef CONFIG_PM_SLEEP +static int mvneta_suspend(struct platform_device *pdev, pm_message_t state) +{ + struct net_device *dev = platform_get_drvdata(pdev); + struct mvneta_port *pp = netdev_priv(dev); + + mvneta_ethtool_update_stats(pp); + + if (!netif_running(dev)) + return 0; + + netif_device_detach(dev); + + mvneta_stop_dev(pp); + unregister_cpu_notifier(&pp->cpu_notifier); + mvneta_cleanup_rxqs(pp); + mvneta_cleanup_txqs(pp); + + /* Reset link status */ + pp->link = 0; + pp->duplex = -1; + pp->speed = 0; + + return 0; +} + +static int mvneta_resume(struct platform_device *pdev) +{ + const struct mbus_dram_target_info *dram_target_info; + struct net_device *dev = platform_get_drvdata(pdev); + struct mvneta_port *pp = netdev_priv(dev); + int ret; + + mvneta_defaults_set(pp); + mvneta_port_power_up(pp, pp->phy_interface); + + dram_target_info = mv_mbus_dram_info(); + if (dram_target_info) + mvneta_conf_mbus_windows(pp, dram_target_info); + + if (!netif_running(dev)) + return 0; + + ret = mvneta_setup_rxqs(pp); + if (ret) { + netdev_err(dev, "unable to setup rxqs after resume\n"); + return ret; + } + + ret = mvneta_setup_txqs(pp); + if (ret) { + netdev_err(dev, "unable to setup txqs after resume\n"); + return ret; + } + + mvneta_set_rx_mode(dev); + mvneta_percpu_elect(pp); + register_cpu_notifier(&pp->cpu_notifier); + mvneta_start_dev(pp); + + netif_device_attach(dev); + + return 0; +} +#endif /* CONFIG_PM_SLEEP */ + static const struct of_device_id mvneta_match[] = { { .compatible = "marvell,armada-370-neta" }, { .compatible = "marvell,armada-xp-neta" }, @@ -3452,6 +3518,10 @@ MODULE_DEVICE_TABLE(of, mvneta_match); static struct platform_driver mvneta_driver = { .probe = mvneta_probe, .remove = mvneta_remove, +#ifdef CONFIG_PM_SLEEP + .suspend = mvneta_suspend, + .resume = mvneta_resume, +#endif .driver = { .name = MVNETA_DRIVER_NAME, .of_match_table = mvneta_match, -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 00/13] mvneta Buffer Management and enhancements
Hi, Hereby I submit a patchset that introduces various fixes and support for new features and enhancements to the mvneta driver: 1. First three patches are minimal fixes, stable-CC'ed. 2. Suspend to ram ('s2ram') support. Due to some stability problems Thomas Petazzoni's patches did not get merged yet, but I used them for verification. Contrary to wfi mode ('standby' - linux does not differentiate between them, so same routines are used) all registers' contents are lost due to power down, so the configuration has to be fully reconstructed during resume. 3. Optimisations - concatenating TX descriptors' flush, basing on xmit_more support and combined approach for finalizing egress processing. Thanks to HR timer buffers can be released with small latency, which is good for low transfer and small queues. Along with the timer, coalescing irqs are used, whose threshold could be increased back to 15. 4. Buffer manager (BM) support with two preparatory commits. As it is a separate block, common for all network ports, a new driver is introduced, which configures it and exposes API to the main network driver. It is throughly described in binding documentation and commit log. Please note, that enabling per-port BM usage is done using phandle and the data passed in mvneta_bm_probe. It is designed for usage of on-demand device probe and dev_set/get_drvdata, however it's awaiting merge to linux-next. Therefore, deferring probe is not used - if something goes wrong (same in case of errors during changing MTU or suspend/resume cycle) mvneta driver falls back to software buffer management and works in a regular way. Known issues: - problems with obtaining all mapped buffers from internal SRAM, when destroying the buffer pointer pool - problems with unmapping chunk of SRAM during driver removal Above do not have an impact on the operation, as they are called during driver removal or in error path. 5. Enable BM on Armada XP and 38X development boards - those ones and A370 I could check on my own. In all cases they survived night-long linerate iperf. Also tests were performed with A388 SoC working as a network bridge between two packet generators. They showed increase of maximum processed 64B packets by ~20k (~555k packets with BM enabled vs ~535 packets without BM). Also when pushing 1500B-packets with a line rate achieved, CPU load decreased from around 25% without BM vs 18-20% with BM. I'm looking forward to any remarks and comments. Best regards, Marcin Wojtas Marcin Wojtas (12): net: mvneta: add configuration for MBUS windows access protection net: mvneta: enable IP checksum with jumbo frames for Armada 38x on Port0 net: mvneta: fix bit assignment in MVNETA_RXQ_CONFIG_REG net: mvneta: enable suspend/resume support net: mvneta: enable mixed egress processing using HR timer bus: mvebu-mbus: provide api for obtaining IO and DRAM window information ARM: mvebu: enable SRAM support in mvebu_v7_defconfig net: mvneta: bm: add support for hardware buffer management ARM: mvebu: add buffer manager nodes to armada-38x.dtsi ARM: mvebu: enable buffer manager support on Armada 38x boards ARM: mvebu: add buffer manager nodes to armada-xp.dtsi ARM: mvebu: enable buffer manager support on Armada XP boards Simon Guinot (1): net: mvneta: add xmit_more support .../bindings/net/marvell-armada-370-neta.txt | 19 +- .../devicetree/bindings/net/marvell-neta-bm.txt| 49 ++ arch/arm/boot/dts/armada-385-db-ap.dts | 20 +- arch/arm/boot/dts/armada-388-db.dts| 17 +- arch/arm/boot/dts/armada-388-gp.dts| 17 +- arch/arm/boot/dts/armada-38x.dtsi | 20 +- arch/arm/boot/dts/armada-xp-db.dts | 19 +- arch/arm/boot/dts/armada-xp-gp.dts | 19 +- arch/arm/boot/dts/armada-xp.dtsi | 18 + arch/arm/configs/mvebu_v7_defconfig| 1 + drivers/bus/mvebu-mbus.c | 51 ++ drivers/net/ethernet/marvell/Kconfig | 14 + drivers/net/ethernet/marvell/Makefile | 1 + drivers/net/ethernet/marvell/mvneta.c | 660 +++-- drivers/net/ethernet/marvell/mvneta_bm.c | 642 drivers/net/ethernet/marvell/mvneta_bm.h | 171 ++ include/linux/mbus.h | 3 + 17 files changed, 1677 insertions(+), 64 deletions(-) create mode 100644 Documentation/devicetree/bindings/net/marvell-neta-bm.txt create mode 100644 drivers/net/ethernet/marvell/mvneta_bm.c create mode 100644 drivers/net/ethernet/marvell/mvneta_bm.h -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] tipc: avoid packets leaking on socket receive queue
Even if we drain receive queue thoroughly in tipc_release() after tipc socket is removed from rhashtable, it is possible that some packets are in flight because some CPU runs receiver and did rhashtable lookup before we removed socket. They will achieve receive queue, but nobody delete them at all. To avoid this leak, we register a private socket destructor to purge receive queue, meaning releasing packets pending on receive queue will be delayed until the last reference of tipc socket will be released. Signed-off-by: Ying Xue --- net/tipc/socket.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/net/tipc/socket.c b/net/tipc/socket.c index 552dbab..b53246f 100644 --- a/net/tipc/socket.c +++ b/net/tipc/socket.c @@ -105,6 +105,7 @@ struct tipc_sock { static int tipc_backlog_rcv(struct sock *sk, struct sk_buff *skb); static void tipc_data_ready(struct sock *sk); static void tipc_write_space(struct sock *sk); +static void tipc_sock_destruct(struct sock *sk); static int tipc_release(struct socket *sock); static int tipc_accept(struct socket *sock, struct socket *new_sock, int flags); static int tipc_wait_for_sndmsg(struct socket *sock, long *timeo_p); @@ -381,6 +382,7 @@ static int tipc_sk_create(struct net *net, struct socket *sock, sk->sk_rcvbuf = sysctl_tipc_rmem[1]; sk->sk_data_ready = tipc_data_ready; sk->sk_write_space = tipc_write_space; + sk->sk_destruct = tipc_sock_destruct; tsk->conn_timeout = CONN_TIMEOUT_DEFAULT; tsk->sent_unacked = 0; atomic_set(&tsk->dupl_rcvcnt, 0); @@ -470,9 +472,6 @@ static int tipc_release(struct socket *sock) tipc_node_remove_conn(net, dnode, tsk->portid); } - /* Discard any remaining (connection-based) messages in receive queue */ - __skb_queue_purge(&sk->sk_receive_queue); - /* Reject any messages that accumulated in backlog queue */ sock->state = SS_DISCONNECTING; release_sock(sk); @@ -1515,6 +1514,11 @@ static void tipc_data_ready(struct sock *sk) rcu_read_unlock(); } +static void tipc_sock_destruct(struct sock *sk) +{ + __skb_queue_purge(&sk->sk_receive_queue); +} + /** * filter_connect - Handle all incoming messages for a connection-based socket * @tsk: TIPC socket -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] tipc: fix error handling of expanding buffer headroom
Coverity says: *** CID 1338065: Error handling issues (CHECKED_RETURN) /net/tipc/udp_media.c: 162 in tipc_udp_send_msg() 156 struct udp_media_addr *dst = (struct udp_media_addr *)&dest->value; 157 struct udp_media_addr *src = (struct udp_media_addr *)&b->addr.value; 158 struct sk_buff *clone; 159 struct rtable *rt; 160 161 if (skb_headroom(skb) < UDP_MIN_HEADROOM) >>> CID 1338065: Error handling issues (CHECKED_RETURN) >>> Calling "pskb_expand_head" without checking return value (as is done >>> elsewhere 51 out of 56 times). 162 pskb_expand_head(skb, UDP_MIN_HEADROOM, 0, GFP_ATOMIC); 163 164 clone = skb_clone(skb, GFP_ATOMIC); 165 skb_set_inner_protocol(clone, htons(ETH_P_TIPC)); 166 ub = rcu_dereference_rtnl(b->media_ptr); 167 if (!ub) { When expanding buffer headroom over udp tunnel with pskb_expand_head(), it's unfortunate that we don't check its return value. As a result, if the function returns an error code due to the lack of memory, it may cause unpredictable consequence as we unconditionally consider that it's always successful. Fixes: e53567948f82 ("tipc: conditionally expand buffer headroom over udp tunnel") Reported-by: Cc: Stephen Hemminger Signed-off-by: Ying Xue --- net/tipc/udp_media.c |7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c index ad2719a..f01ff16 100644 --- a/net/tipc/udp_media.c +++ b/net/tipc/udp_media.c @@ -158,8 +158,11 @@ static int tipc_udp_send_msg(struct net *net, struct sk_buff *skb, struct udp_media_addr *src = (struct udp_media_addr *)&b->addr.value; struct rtable *rt; - if (skb_headroom(skb) < UDP_MIN_HEADROOM) - pskb_expand_head(skb, UDP_MIN_HEADROOM, 0, GFP_ATOMIC); + if (skb_headroom(skb) < UDP_MIN_HEADROOM) { + err = pskb_expand_head(skb, UDP_MIN_HEADROOM, 0, GFP_ATOMIC); + if (!err) + goto tx_error; + } skb_set_inner_protocol(skb, htons(ETH_P_TIPC)); ub = rcu_dereference_rtnl(b->media_ptr); -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel 4.1.12 crash
On 11/21/2015 12:16 AM, Andrew wrote: Memory corruption, if happens, IMHO shouldn't be a hardware-related - almost all of these boxes, except H61M-based box from 1st log, works for a long time with uptime more than year; and only software was changed on it; H61M-based box runs memtest86 for a tens of hours w/o any error. If it was caused by hardware - they should crash even earlier. I wasn't saying it was hardware related. My thought is that it could be some sort of use after free or double free type issue. Basically what you end up with is the memory getting corrupted by software that is accessing regions it shouldn't be. Rarely on different servers I saw 'zram decompression error' messages (in this case I've got such message on H61M-based box). Also, other people that uses accel-ppp as BRAS software, have different kernel panics/bugs/oopses on fresh kernels. I'll try to apply these patches, and I'll try to switch back to kernels that were stable on some boxes. If you could bisect this it would be useful. Basically we just need to determine where in the git history these issues started popping up so that we can then narrow down on the root cause. - Alex -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next] bpf: add show_fdinfo handler for maps
On Fri, Nov 20, 2015 at 11:30:13AM +0100, Hannes Frederic Sowa wrote: > Hi Alexei, > > > If user space can be see both 'count' and 'max_entries', it can be very > > tempting to start assuming 'full' and 'empty' state of the map which will > > lead to race conditions and bad design. > > bpf programs and maps are inherently multi-thread and concurrent. > > If userapp wants to do the counting of elements it needs to do so on its > > own > > and shoot itself in the foot eventually. > > For the same reason I don't want to see BPF_MAP_GET_COUNT command. > > Hmmm... I don't understand your argument. This is the same with memory > management in general and we still report memory statistics to user > space. I really would find it helpful to have a feeling if a map is > nearly full or nearly empty. memory is not the same, since it's a shared resource and knowledge about consumption by the process gives no insight whether next malloc() will succeed or not. > We can also count collisions or the load in the buckets, but some > evidence what is going on would be nice, wouldn't it? reporting collisions may be ok, since it's probably hard to exploit such stats, but security may become a concern in some use cases, so may not be such a good idea at the end. In general when user space passed kernel some numbers (like type, element size, max_entries) it's ok to report it back via fdinfo. Anything else I'd rather keep private. debugging in general should be done by debugging tools. Like I often use kprobe+bpf to debug networking+bpf :) Unfortunately kprobe+bpf doesn't work to debug itself, but then regular tracing and kprobe comes to rescue. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] broadcom: fix PHY_ID_BCM5481 entry in the id table
Commit fcb26ec5b18d ("broadcom: move all PHY_ID's to header") updated broadcom_tbl to use PHY_IDs, but incorrectly replaced 0x0143bca0 with PHY_ID_BCM5482 (making a duplicate entry, and completely omitting the original). Fix that. Fixes: fcb26ec5b18d ("broadcom: move all PHY_ID's to header") Signed-off-by: Aaro Koskinen --- drivers/net/phy/broadcom.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c index 07a6119..3ce5d95 100644 --- a/drivers/net/phy/broadcom.c +++ b/drivers/net/phy/broadcom.c @@ -614,7 +614,7 @@ static struct mdio_device_id __maybe_unused broadcom_tbl[] = { { PHY_ID_BCM5461, 0xfff0 }, { PHY_ID_BCM54616S, 0xfff0 }, { PHY_ID_BCM5464, 0xfff0 }, - { PHY_ID_BCM5482, 0xfff0 }, + { PHY_ID_BCM5481, 0xfff0 }, { PHY_ID_BCM5482, 0xfff0 }, { PHY_ID_BCM50610, 0xfff0 }, { PHY_ID_BCM50610M, 0xfff0 }, -- 2.4.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net] vrf: fix double free and memory corruption on register_netdevice failure
On 11/21/15 11:46 AM, Nikolay Aleksandrov wrote: From: Nikolay Aleksandrov When vrf's ->newlink is called, if register_netdevice() fails then it does free_netdev(), but that's also done by rtnl_newlink() so a second free happens and memory gets corrupted, to reproduce execute the following line a couple of times (1 - 5 usually is enough): $ for i in `seq 1 5`; do ip link add vrf: type vrf table 1; done; This works because we fail in register_netdevice() because of the wrong name "vrf:". Acked-by: David Ahern Thanks, Nik. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: r8169 regression: UDP packets dropped intermittantly
Francois Romieu : [...] If you can crash your system at will, you may apply the patch below to da78dbff2e05630921c551dbbc70a4b7981a8fff ("r8169: remove work from irq handler.") parent (aka 1e874e041fc7c222cbd85b20c4406070be1f687a) and build it in a current tree (say 4.2). If it works it will be easier for me to compare behaviors in current tree (especially as none of my current test boxes wants to boot with my own ga311). How much memory and CPU may I rely on in your test computer ? --- r8169.c 2015-11-21 23:02:10.435275753 +0100 +++ r8169.c 2015-11-21 23:21:49.429554012 +0100 @@ -29,7 +29,6 @@ #include #include -#include #include #include @@ -1616,7 +1615,7 @@ static int rtl8169_set_features(struct n else tp->cp_cmd &= ~RxChkSum; - if (dev->features & NETIF_F_HW_VLAN_RX) + if (dev->features & NETIF_F_HW_VLAN_CTAG_RX) tp->cp_cmd |= RxVlan; else tp->cp_cmd &= ~RxVlan; @@ -1632,8 +1631,8 @@ static int rtl8169_set_features(struct n static inline u32 rtl8169_tx_vlan_tag(struct rtl8169_private *tp, struct sk_buff *skb) { - return (vlan_tx_tag_present(skb)) ? - TxVlanTag | swab16(vlan_tx_tag_get(skb)) : 0x00; + return (skb_vlan_tag_present(skb)) ? + TxVlanTag | swab16(skb_vlan_tag_get(skb)) : 0x00; } static void rtl8169_rx_vlan_tag(struct RxDesc *desc, struct sk_buff *skb) @@ -1641,7 +1640,7 @@ static void rtl8169_rx_vlan_tag(struct R u32 opts2 = le32_to_cpu(desc->opts2); if (opts2 & RxVlanTag) - __vlan_hwaccel_put_tag(skb, swab16(opts2 & 0x)); + __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), swab16(opts2 & 0x)); desc->opts2 = 0; } @@ -3508,7 +3507,7 @@ static const struct net_device_ops rtl81 }; -static void __devinit rtl_init_mdio_ops(struct rtl8169_private *tp) +static void rtl_init_mdio_ops(struct rtl8169_private *tp) { struct mdio_ops *ops = &tp->mdio_ops; @@ -3725,7 +3724,7 @@ static void rtl_pll_power_up(struct rtl8 rtl_generic_op(tp, tp->pll_power_ops.up); } -static void __devinit rtl_init_pll_power_ops(struct rtl8169_private *tp) +static void rtl_init_pll_power_ops(struct rtl8169_private *tp) { struct pll_power_ops *ops = &tp->pll_power_ops; @@ -3905,7 +3904,7 @@ static void r8168b_1_hw_jumbo_disable(st RTL_W8(Config4, RTL_R8(Config4) & ~(1 << 0)); } -static void __devinit rtl_init_jumbo_ops(struct rtl8169_private *tp) +static void rtl_init_jumbo_ops(struct rtl8169_private *tp) { struct jumbo_ops *ops = &tp->jumbo_ops; @@ -3971,7 +3970,7 @@ static void rtl_hw_reset(struct rtl8169_ } } -static int __devinit +static int rtl8169_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) { const struct rtl_cfg_info *cfg = rtl_cfg_infos + ent->driver_data; @@ -4137,7 +4136,7 @@ rtl8169_init_one(struct pci_dev *pdev, c dev->dev_addr[i] = RTL_R8(MAC0 + i); memcpy(dev->perm_addr, dev->dev_addr, dev->addr_len); - SET_ETHTOOL_OPS(dev, &rtl8169_ethtool_ops); + dev->ethtool_ops = &rtl8169_ethtool_ops; dev->watchdog_timeo = RTL8169_TX_TIMEOUT; dev->irq = pdev->irq; dev->base_addr = (unsigned long) ioaddr; @@ -4147,16 +4146,16 @@ rtl8169_init_one(struct pci_dev *pdev, c /* don't enable SG, IP_CSUM and TSO by default - it might not work * properly for all devices */ dev->features |= NETIF_F_RXCSUM | - NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX; + NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX; dev->hw_features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO | - NETIF_F_RXCSUM | NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX; + NETIF_F_RXCSUM | NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX; dev->vlan_features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO | NETIF_F_HIGHDMA; if (tp->mac_version == RTL_GIGA_MAC_VER_05) /* 8110SCd requires hardware Rx VLAN - disallow toggling */ - dev->hw_features &= ~NETIF_F_HW_VLAN_RX; + dev->hw_features &= ~NETIF_F_HW_VLAN_CTAG_RX; tp->intr_mask = 0x; tp->hw_start = cfg->hw_start; @@ -4217,7 +4216,7 @@ err_out_free_dev_1: goto out; } -static void __devexit rtl8169_remove_one(struct pci_dev *pdev) +static void rtl8169_remove_one(struct pci_dev *pdev) { struct net_device *dev = pci_get_drvdata(pdev); struct rtl8169_private *tp = netdev_priv(dev); @@ -6218,20 +6217,9 @@ static struct pci_driver rtl8169_pci_dri .name = MODULENAME, .id_table = rtl8169_pci_tbl, .probe = rtl8169_init_one, - .remove = __devexit_p(rtl8169_remove_one), + .remove = rtl8169_remove_one, .shutdown = rtl_shutdown, .d
Re: [B.A.T.M.A.N.] [PATCH 1/3] batman-adv: Delete an unnecessary check before the function call "batadv_softif_vlan_free_ref"
On Tuesday, November 03, 2015 21:52:58 SF Markus Elfring wrote: > From: Markus Elfring > Date: Tue, 3 Nov 2015 19:20:34 +0100 > > The batadv_softif_vlan_free_ref() function tests whether its argument is > NULL and then returns immediately. Thus the test around the call is not > needed. > > This issue was detected by using the Coccinelle software. > > Signed-off-by: Markus Elfring > --- > net/batman-adv/translation-table.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) Applied in revision bbcbe0f. Thanks, Marek signature.asc Description: This is a digitally signed message part.
Re: [B.A.T.M.A.N.] [PATCH 1/2] batman-adv: Delete unnecessary checks before the function call "kfree_skb"
On Sunday, November 15, 2015 09:43:26 SF Markus Elfring wrote: > From: Markus Elfring > Date: Sun, 15 Nov 2015 08:04:43 +0100 > > The kfree_skb() function tests whether its argument is NULL and then > returns immediately. Thus the test around the calls is not needed. > > This issue was detected by using the Coccinelle software. > > Signed-off-by: Markus Elfring > --- > net/batman-adv/main.c | 2 +- > net/batman-adv/network-coding.c | 4 +--- > net/batman-adv/send.c | 3 +-- > 3 files changed, 3 insertions(+), 6 deletions(-) Applied in revision 77d84a6. Thanks, Marek signature.asc Description: This is a digitally signed message part.
Re: [B.A.T.M.A.N.] [PATCH 2/2] batman-adv: Less checks in batadv_tvlv_unicast_send()
On Sunday, November 15, 2015 09:45:51 SF Markus Elfring wrote: > From: Markus Elfring > Date: Sun, 15 Nov 2015 09:00:42 +0100 > > * Let us return directly if a call of the batadv_orig_hash_find() function > returned a null pointer. > > * Omit the initialisation for the variable "skb" at the beginning. > > * Replace an assignment by a call of the kfree_skb() function > and delete the affected variable "ret" then. > > Signed-off-by: Markus Elfring > --- > net/batman-adv/main.c | 15 +-- > 1 file changed, 5 insertions(+), 10 deletions(-) Applied in revision 5a878b8. Thanks, Marek signature.asc Description: This is a digitally signed message part.
Re: [B.A.T.M.A.N.] [PATCH 2/3] batman-adv: Split a condition check
On Tuesday, November 03, 2015 21:54:35 SF Markus Elfring wrote: > From: Markus Elfring > Date: Tue, 3 Nov 2015 20:41:02 +0100 > > Let us split a check for a condition at the beginning of the > batadv_is_ap_isolated() function so that a direct return can be performed > in this function if the variable "vlan" contained a null pointer. > > Signed-off-by: Markus Elfring > --- > net/batman-adv/translation-table.c | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) Applied in revision b1199c6. Thanks, Marek signature.asc Description: This is a digitally signed message part.
Re: [PATCH] packet: Allow packets with only a header (but no payload)
Hello. On 11/21/2015 3:50 AM, Martin Blumenstingl wrote: 9c70776 added validation for the packet size in packet_snd. This change Please run your patch thru scripts/checkpatch.pl -- it now enforces certain commit citing style. enforced that every packet needs a header with at least hard_header_len bytes and at least one byte payload. This fixes PPPoE connections which do not have a "Service" or "Host-Uniq" configured (which is violating the spec, but is still widely used in real-world setups). Those are currently failing with the following message: "pppd: packet size is too short (24 <= 24)" Signed-off-by: Martin Blumenstingl [...] MBR, Sergei -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 9/9] netfilter: implement xt_cgroup cgroup2 path match
On Saturday 2015-11-21 19:54, Florian Westphal wrote: > >The only other question I have is wheter PATH_MAX might be a possible >ABI breaker in future. It would have to be guaranteed that this is the >same size forever, else you'd get strange errors on rule insertion if >the sizes of the kernel and userspace version differs. > The same goes for IFNAMSIZ. But, so far, nobody changed it in the kernel, even though there are voices that 15 characters + '\0' were a tight choice. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 9/9] netfilter: implement xt_cgroup cgroup2 path match
Tejun Heo wrote: > On Sat, Nov 21, 2015 at 05:56:06PM +0100, Florian Westphal wrote: > > > +struct xt_cgroup_info_v1 { > > > + __u8has_path; > > > + __u8has_classid; > > > + __u8invert_path; > > > + __u8invert_classid; > > > + charpath[PATH_MAX]; > > > + __u32 classid; > > > + > > > + /* kernel internal data */ > > > + void*priv __attribute__((aligned(8))); > > > +}; > > > > Ahem. Am I reading this right? This struct is > 4k in size? > > If so -- Ugh. Does sizeof(path) really have to be PATH_MAX? > > Hmmm... yeap but would this be an acutual problem? Since rule blob can be allocated via vmalloc i guess "no", its not really a problem unless someone needs realy insane amount of such rules. I don't have any better suggestion, so I guess its necessary evil. The only other question I have is wheter PATH_MAX might be a possible ABI breaker in future. It would have to be guaranteed that this is the same size forever, else you'd get strange errors on rule insertion if the sizes of the kernel and userspace version differs. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net] vrf: fix double free and memory corruption on register_netdevice failure
From: Nikolay Aleksandrov When vrf's ->newlink is called, if register_netdevice() fails then it does free_netdev(), but that's also done by rtnl_newlink() so a second free happens and memory gets corrupted, to reproduce execute the following line a couple of times (1 - 5 usually is enough): $ for i in `seq 1 5`; do ip link add vrf: type vrf table 1; done; This works because we fail in register_netdevice() because of the wrong name "vrf:". And here's a trace of one crash: [ 28.792157] [ cut here ] [ 28.792407] kernel BUG at fs/namei.c:246! [ 28.792608] invalid opcode: [#1] SMP [ 28.793240] Modules linked in: vrf nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace sunrpc crct10dif_pclmul crc32_pclmul crc32c_intel qxl drm_kms_helper ttm drm aesni_intel aes_x86_64 psmouse glue_helper lrw evdev gf128mul i2c_piix4 ablk_helper cryptd ppdev parport_pc parport serio_raw pcspkr virtio_balloon virtio_console i2c_core acpi_cpufreq button 9pnet_virtio 9p 9pnet fscache ipv6 autofs4 ext4 crc16 mbcache jbd2 virtio_blk virtio_net sg sr_mod cdrom ata_generic ehci_pci uhci_hcd ehci_hcd e1000 usbcore usb_common ata_piix libata virtio_pci virtio_ring virtio scsi_mod floppy [ 28.796016] CPU: 0 PID: 1148 Comm: ld-linux-x86-64 Not tainted 4.4.0-rc1+ #24 [ 28.796016] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014 [ 28.796016] task: 8800352561c0 ti: 88003592c000 task.ti: 88003592c000 [ 28.796016] RIP: 0010:[] [] putname+0x43/0x60 [ 28.796016] RSP: 0018:88003592fe88 EFLAGS: 00010246 [ 28.796016] RAX: RBX: 8800352561c0 RCX: 0001 [ 28.796016] RDX: RSI: RDI: 88003784f000 [ 28.796016] RBP: 88003592ff08 R08: 0001 R09: [ 28.796016] R10: R11: 0001 R12: [ 28.796016] R13: 047c R14: 88003784f000 R15: 8800358c4a00 [ 28.796016] FS: () GS:88003fc0() knlGS: [ 28.796016] CS: 0010 DS: ES: CR0: 80050033 [ 28.796016] CR2: 7ffd583bc2d9 CR3: 35a99000 CR4: 000406f0 [ 28.796016] Stack: [ 28.796016] 8121045d 812102d3 8800352561c0 880035a91660 [ 28.796016] 888a9880 81a49940 00ff81218684 [ 28.796016] 8800352561c0 047c 880035b36d80 [ 28.796016] Call Trace: [ 28.796016] [] ? do_execveat_common.isra.34+0x74d/0x930 [ 28.796016] [] ? do_execveat_common.isra.34+0x5c3/0x930 [ 28.796016] [] do_execve+0x2c/0x30 [ 28.796016] [] call_usermodehelper_exec_async+0xf0/0x140 [ 28.796016] [] ? umh_complete+0x40/0x40 [ 28.796016] [] ret_from_fork+0x3f/0x70 [ 28.796016] Code: 48 8d 47 1c 48 89 e5 53 48 8b 37 48 89 fb 48 39 c6 74 1a 48 8b 3d 7e e9 8f 00 e8 49 fa fc ff 48 89 df e8 f1 01 fd ff 5b 5d f3 c3 <0f> 0b 48 89 fe 48 8b 3d 61 e9 8f 00 e8 2c fa fc ff 5b 5d eb e9 [ 28.796016] RIP [] putname+0x43/0x60 [ 28.796016] RSP Fixes: 193125dbd8eb ("net: Introduce VRF device driver") Signed-off-by: Nikolay Aleksandrov --- drivers/net/vrf.c | 11 +-- 1 file changed, 1 insertion(+), 10 deletions(-) diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c index 92fa3e1ea65c..4f9748457f5a 100644 --- a/drivers/net/vrf.c +++ b/drivers/net/vrf.c @@ -907,7 +907,6 @@ static int vrf_newlink(struct net *src_net, struct net_device *dev, struct nlattr *tb[], struct nlattr *data[]) { struct net_vrf *vrf = netdev_priv(dev); - int err; if (!data || !data[IFLA_VRF_TABLE]) return -EINVAL; @@ -916,15 +915,7 @@ static int vrf_newlink(struct net *src_net, struct net_device *dev, dev->priv_flags |= IFF_L3MDEV_MASTER; - err = register_netdevice(dev); - if (err < 0) - goto out_fail; - - return 0; - -out_fail: - free_netdev(dev); - return err; + return register_netdevice(dev); } static size_t vrf_nl_getsize(const struct net_device *dev) -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] VSOCK: constify vmci_transport_notify_ops structures
The vmci_transport_notify_ops structures are never modified, so declare them as const. Done with the help of Coccinelle. Signed-off-by: Julia Lawall --- net/vmw_vsock/vmci_transport.h |2 +- net/vmw_vsock/vmci_transport_notify.c|2 +- net/vmw_vsock/vmci_transport_notify.h|5 +++-- net/vmw_vsock/vmci_transport_notify_qstate.c |2 +- 4 files changed, 6 insertions(+), 5 deletions(-) diff --git a/net/vmw_vsock/vmci_transport.h b/net/vmw_vsock/vmci_transport.h index 2ad46f3..1820e74 100644 --- a/net/vmw_vsock/vmci_transport.h +++ b/net/vmw_vsock/vmci_transport.h @@ -121,7 +121,7 @@ struct vmci_transport { u64 queue_pair_max_size; u32 detach_sub_id; union vmci_transport_notify notify; - struct vmci_transport_notify_ops *notify_ops; + const struct vmci_transport_notify_ops *notify_ops; struct list_head elem; struct sock *sk; spinlock_t lock; /* protects sk. */ diff --git a/net/vmw_vsock/vmci_transport_notify.c b/net/vmw_vsock/vmci_transport_notify.c index 9b7f207..fd8cf02 100644 --- a/net/vmw_vsock/vmci_transport_notify.c +++ b/net/vmw_vsock/vmci_transport_notify.c @@ -661,7 +661,7 @@ static void vmci_transport_notify_pkt_process_negotiate(struct sock *sk) } /* Socket control packet based operations. */ -struct vmci_transport_notify_ops vmci_transport_notify_pkt_ops = { +const struct vmci_transport_notify_ops vmci_transport_notify_pkt_ops = { vmci_transport_notify_pkt_socket_init, vmci_transport_notify_pkt_socket_destruct, vmci_transport_notify_pkt_poll_in, diff --git a/net/vmw_vsock/vmci_transport_notify.h b/net/vmw_vsock/vmci_transport_notify.h index 7df7932..3c464d3 100644 --- a/net/vmw_vsock/vmci_transport_notify.h +++ b/net/vmw_vsock/vmci_transport_notify.h @@ -77,7 +77,8 @@ struct vmci_transport_notify_ops { void (*process_negotiate) (struct sock *sk); }; -extern struct vmci_transport_notify_ops vmci_transport_notify_pkt_ops; -extern struct vmci_transport_notify_ops vmci_transport_notify_pkt_q_state_ops; +extern const struct vmci_transport_notify_ops vmci_transport_notify_pkt_ops; +extern const +struct vmci_transport_notify_ops vmci_transport_notify_pkt_q_state_ops; #endif /* __VMCI_TRANSPORT_NOTIFY_H__ */ diff --git a/net/vmw_vsock/vmci_transport_notify_qstate.c b/net/vmw_vsock/vmci_transport_notify_qstate.c index dc9c792..21e591d 100644 --- a/net/vmw_vsock/vmci_transport_notify_qstate.c +++ b/net/vmw_vsock/vmci_transport_notify_qstate.c @@ -419,7 +419,7 @@ vmci_transport_notify_pkt_send_pre_enqueue( } /* Socket always on control packet based operations. */ -struct vmci_transport_notify_ops vmci_transport_notify_pkt_q_state_ops = { +const struct vmci_transport_notify_ops vmci_transport_notify_pkt_q_state_ops = { vmci_transport_notify_pkt_socket_init, vmci_transport_notify_pkt_socket_destruct, vmci_transport_notify_pkt_poll_in, -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] net: atm: constify in_cache_ops and eg_cache_ops structures
The in_cache_ops and eg_cache_ops structures are never modified, so declare them as const. Done with the help of Coccinelle. Signed-off-by: Julia Lawall --- net/atm/mpc.h |4 ++-- net/atm/mpoa_caches.c |4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/net/atm/mpc.h b/net/atm/mpc.h index 0919a88..cfc7b74 100644 --- a/net/atm/mpc.h +++ b/net/atm/mpc.h @@ -21,11 +21,11 @@ struct mpoa_client { uint8_t our_ctrl_addr[ATM_ESA_LEN]; /* MPC's control ATM address */ rwlock_t ingress_lock; - struct in_cache_ops *in_ops; /* ingress cache operations*/ + const struct in_cache_ops *in_ops; /* ingress cache operations */ in_cache_entry *in_cache;/* the ingress cache of this MPC */ rwlock_t egress_lock; - struct eg_cache_ops *eg_ops; /* egress cache operations */ + const struct eg_cache_ops *eg_ops; /* egress cache operations */ eg_cache_entry *eg_cache;/* the egress cache of this MPC */ uint8_t *mps_macs; /* array of MPS MAC addresses, >=1 */ diff --git a/net/atm/mpoa_caches.c b/net/atm/mpoa_caches.c index d1b2d9a..9e60e74 100644 --- a/net/atm/mpoa_caches.c +++ b/net/atm/mpoa_caches.c @@ -534,7 +534,7 @@ static void eg_destroy_cache(struct mpoa_client *mpc) } -static struct in_cache_ops ingress_ops = { +static const struct in_cache_ops ingress_ops = { in_cache_add_entry, /* add_entry */ in_cache_get, /* get */ in_cache_get_with_mask, /* get_with_mask */ @@ -548,7 +548,7 @@ static struct in_cache_ops ingress_ops = { in_destroy_cache /* destroy_cache */ }; -static struct eg_cache_ops egress_ops = { +static const struct eg_cache_ops egress_ops = { eg_cache_add_entry, /* add_entry*/ eg_cache_get_by_cache_id, /* get_by_cache_id */ eg_cache_get_by_tag, /* get_by_tag */ -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 9/9] netfilter: implement xt_cgroup cgroup2 path match
Hello, On Sat, Nov 21, 2015 at 05:56:06PM +0100, Florian Westphal wrote: > > +struct xt_cgroup_info_v1 { > > + __u8has_path; > > + __u8has_classid; > > + __u8invert_path; > > + __u8invert_classid; > > + charpath[PATH_MAX]; > > + __u32 classid; > > + > > + /* kernel internal data */ > > + void*priv __attribute__((aligned(8))); > > +}; > > Ahem. Am I reading this right? This struct is > 4k in size? > If so -- Ugh. Does sizeof(path) really have to be PATH_MAX? Hmmm... yeap but would this be an acutual problem? We can try to make it shorter but idk it ultimately is a path. Another solution would be trying to pass inode around but that is problematic with showing and printing rules as the only way to reverse-map inode to path is walking the tree and the cgroup may already be gone at that point. While >4k struct isn't pretty, this looks like the path of least resistance. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 9/9] netfilter: implement xt_cgroup cgroup2 path match
Tejun Heo wrote: > This patch implements xt_cgroup path match which matches cgroup2 > membership of the associated socket. The match is recursive and > invertible. > > For rationales on introducing another cgroup based match, please refer > to a preceding commit "sock, cgroup: add sock->sk_cgroup". > > v3: Folded into xt_cgroup as a new revision interface as suggested by > Pablo. > > v2: Included linux/limits.h from xt_cgroup2.h for PATH_MAX. Added > explicit alignment to the priv field. Both suggested by Jan. > > Signed-off-by: Tejun Heo > Cc: Daniel Borkmann > Cc: Daniel Wagner > CC: Neil Horman > Cc: Jan Engelhardt > Cc: Pablo Neira Ayuso > --- > include/uapi/linux/netfilter/xt_cgroup.h | 13 ++ > net/netfilter/xt_cgroup.c| 69 > > 2 files changed, 82 insertions(+) > > diff --git a/include/uapi/linux/netfilter/xt_cgroup.h > b/include/uapi/linux/netfilter/xt_cgroup.h > index 577c9e0..1e4b37b 100644 > --- a/include/uapi/linux/netfilter/xt_cgroup.h > +++ b/include/uapi/linux/netfilter/xt_cgroup.h > @@ -2,10 +2,23 @@ > #define _UAPI_XT_CGROUP_H > > #include > +#include > > struct xt_cgroup_info_v0 { > __u32 id; > __u32 invert; > }; > > +struct xt_cgroup_info_v1 { > + __u8has_path; > + __u8has_classid; > + __u8invert_path; > + __u8invert_classid; > + charpath[PATH_MAX]; > + __u32 classid; > + > + /* kernel internal data */ > + void*priv __attribute__((aligned(8))); > +}; Ahem. Am I reading this right? This struct is > 4k in size? If so -- Ugh. Does sizeof(path) really have to be PATH_MAX? Thanks! -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2 iptables] libxt_cgroup: add support for cgroup2 path matching
This patch updates xt_cgroup so that it supports revision 1 interface which includes cgroup2 path based matching. v3: Folded into xt_cgroup as a new revision interface as suggested by Pablo. v2: cgroup2_match->userspacesize and ->save and man page updated as per Jan. Signed-off-by: Tejun Heo Cc: Daniel Borkmann Cc: Jan Engelhardt Cc: Pablo Neira Ayuso --- extensions/libxt_cgroup.c | 86 extensions/libxt_cgroup.man | 33 - include/linux/netfilter/xt_cgroup.h | 13 + 3 files changed, 119 insertions(+), 13 deletions(-) --- a/extensions/libxt_cgroup.c +++ b/extensions/libxt_cgroup.c @@ -4,6 +4,7 @@ enum { O_CLASSID = 0, + O_PATH = 1, }; static void cgroup_help_v0(void) @@ -13,6 +14,14 @@ static void cgroup_help_v0(void) "[!] --cgroup classidMatch cgroup classid\n"); } +static void cgroup_help_v1(void) +{ + printf( +"cgroup match options:\n" +"[!] --path path Recursively match path relative to cgroup2 root\n" +"[!] --cgroup claasidMatch cgroup classid, can't be used with --path\n"); +} + static const struct xt_option_entry cgroup_opts_v0[] = { { .name = "cgroup", @@ -24,6 +33,24 @@ static const struct xt_option_entry cgro XTOPT_TABLEEND, }; +static const struct xt_option_entry cgroup_opts_v1[] = { + { + .name = "path", + .id = O_PATH, + .type = XTTYPE_STRING, + .flags = XTOPT_INVERT | XTOPT_PUT, + XTOPT_POINTER(struct xt_cgroup_info_v1, path) + }, + { + .name = "cgroup", + .id = O_CLASSID, + .type = XTTYPE_UINT32, + .flags = XTOPT_INVERT | XTOPT_PUT, + XTOPT_POINTER(struct xt_cgroup_info_v1, classid) + }, + XTOPT_TABLEEND, +}; + static void cgroup_parse_v0(struct xt_option_call *cb) { struct xt_cgroup_info_v0 *cgroupinfo = cb->data; @@ -33,6 +60,26 @@ static void cgroup_parse_v0(struct xt_op cgroupinfo->invert = true; } +static void cgroup_parse_v1(struct xt_option_call *cb) +{ + struct xt_cgroup_info_v1 *info = cb->data; + + xtables_option_parse(cb); + + switch (cb->entry->id) { + case O_PATH: + info->has_path = true; + if (cb->invert) + info->invert_path = true; + break; + case O_CLASSID: + info->has_classid = true; + if (cb->invert) + info->invert_classid = true; + break; + } +} + static void cgroup_print_v0(const void *ip, const struct xt_entry_match *match, int numeric) { @@ -48,6 +95,32 @@ static void cgroup_save_v0(const void *i printf("%s --cgroup %u", info->invert ? " !" : "", info->id); } +static void +cgroup_print_v1(const void *ip, const struct xt_entry_match *match, int numeric) +{ + const struct xt_cgroup_info_v1 *info = (void *)match->data; + + printf(" cgroup"); + if (info->has_path) + printf(" %s%s", info->invert_path ? "! ":"", info->path); + if (info->has_classid) + printf(" %s%u", info->invert_classid ? "! ":"", info->classid); +} + +static void cgroup_save_v1(const void *ip, const struct xt_entry_match *match) +{ + const struct xt_cgroup_info_v1 *info = (void *)match->data; + + if (info->has_path) { + printf("%s --path", info->invert_path ? " !" : ""); + xtables_save_string(info->path); + } + + if (info->has_classid) + printf("%s --cgroup %u", info->invert_classid ? " !" : "", + info->classid); +} + static struct xtables_match cgroup_match[] = { { .family = NFPROTO_UNSPEC, @@ -62,6 +135,19 @@ static struct xtables_match cgroup_match .x6_parse = cgroup_parse_v0, .x6_options = cgroup_opts_v0, }, + { + .family = NFPROTO_UNSPEC, + .revision = 1, + .name = "cgroup", + .version= XTABLES_VERSION, + .size = XT_ALIGN(sizeof(struct xt_cgroup_info_v1)), + .userspacesize = offsetof(struct xt_cgroup_info_v1, priv), + .help = cgroup_help_v1, + .print = cgroup_print_v1, + .save = cgroup_save_v1, + .x6_parse = cgroup_parse_v1, + .x6_options = cgroup_opts_v1, + }, }; void _init(void) --- a/extensions/libxt_cgroup.man +++ b/extensions/libxt_cgroup.man @@ -1,23 +1,30 @@ .TP -[\fB!\fP] \fB\-\-cgroup\fP \fIfwid\fP -Match corresponding cgroup for this packet. +[\fB!\fP] \fB\-\-path\fP \fIpath\fP +Match cgroup2 membership. -Can be used in the OUTPUT chain to assign particula
Re: [PATCHSET v3] netfilter, cgroup: implement cgroup2 path match in xt_cgroup
Oops, made a copy & paste error on Neil Horman's address. Sorry, Neil. The thread can be found at http://lkml.kernel.org/g/1448122441-9335-1-git-send-email...@kernel.org Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/9] cgroup: record ancestor IDs and reimplement cgroup_is_descendant() using it
cgroup_is_descendant() currently walks up the hierarchy and compares each ancestor to the cgroup in question. While enough for cgroup core usages, this can't be used in hot paths to test cgroup membership. This patch adds cgroup->ancestor_ids[] which records the IDs of all ancestors including self and cgroup->level for the nesting level. This allows testing whether a given cgroup is a descendant of another in three finite steps - testing whether the two belong to the same hierarchy, whether the descendant candidate is at the same or a higher level than the ancestor and comparing the recorded ancestor_id at the matching level. cgroup_is_descendant() is accordingly reimplmented and made inline. Signed-off-by: Tejun Heo --- include/linux/cgroup-defs.h | 14 ++ include/linux/cgroup.h | 18 +- kernel/cgroup.c | 32 ++-- 3 files changed, 41 insertions(+), 23 deletions(-) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 60d44b2..504d859 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -235,6 +235,14 @@ struct cgroup { int id; /* +* The depth this cgroup is at. The root is at depth zero and each +* step down the hierarchy increments the level. This along with +* ancestor_ids[] can determine whether a given cgroup is a +* descendant of another without traversing the hierarchy. +*/ + int level; + + /* * Each non-empty css_set associated with this cgroup contributes * one to populated_cnt. All children with non-zero popuplated_cnt * of their own contribute one. The count is zero iff there's no @@ -289,6 +297,9 @@ struct cgroup { /* used to schedule release agent */ struct work_struct release_agent_work; + + /* ids of the ancestors at each level including self */ + int ancestor_ids[]; }; /* @@ -308,6 +319,9 @@ struct cgroup_root { /* The root cgroup. Root is destroyed on its release. */ struct cgroup cgrp; + /* for cgrp->ancestor_ids[0] */ + int cgrp_ancestor_id_storage; + /* Number of cgroups in the hierarchy, used only for /proc/cgroups */ atomic_t nr_cgrps; diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 22e3754..b5ee2c4 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -81,7 +81,6 @@ struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgroup, struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry, struct cgroup_subsys *ss); -bool cgroup_is_descendant(struct cgroup *cgrp, struct cgroup *ancestor); int cgroup_attach_task_all(struct task_struct *from, struct task_struct *); int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from); @@ -459,6 +458,23 @@ static inline struct cgroup *task_cgroup(struct task_struct *task, return task_css(task, subsys_id)->cgroup; } +/** + * cgroup_is_descendant - test ancestry + * @cgrp: the cgroup to be tested + * @ancestor: possible ancestor of @cgrp + * + * Test whether @cgrp is a descendant of @ancestor. It also returns %true + * if @cgrp == @ancestor. This function is safe to call as long as @cgrp + * and @ancestor are accessible. + */ +static inline bool cgroup_is_descendant(struct cgroup *cgrp, + struct cgroup *ancestor) +{ + if (cgrp->root != ancestor->root || cgrp->level < ancestor->level) + return false; + return cgrp->ancestor_ids[ancestor->level] == ancestor->id; +} + /* no synchronization, the result can only be used as a hint */ static inline bool cgroup_is_populated(struct cgroup *cgrp) { diff --git a/kernel/cgroup.c b/kernel/cgroup.c index f1603c1..3190040 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -459,25 +459,6 @@ struct cgroup_subsys_state *of_css(struct kernfs_open_file *of) } EXPORT_SYMBOL_GPL(of_css); -/** - * cgroup_is_descendant - test ancestry - * @cgrp: the cgroup to be tested - * @ancestor: possible ancestor of @cgrp - * - * Test whether @cgrp is a descendant of @ancestor. It also returns %true - * if @cgrp == @ancestor. This function is safe to call as long as @cgrp - * and @ancestor are accessible. - */ -bool cgroup_is_descendant(struct cgroup *cgrp, struct cgroup *ancestor) -{ - while (cgrp) { - if (cgrp == ancestor) - return true; - cgrp = cgroup_parent(cgrp); - } - return false; -} - static int notify_on_release(const struct cgroup *cgrp) { return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags); @@ -1903,6 +1884,7 @@ static int cgroup_setup_root(struct cgroup_root *root, unsigned long ss_mask) if (ret < 0) goto out; root_cgrp->id = ret; + root_cgrp->ancestor_ids[0] = ret; ret = percpu
[PATCH 3/9] cgroup: implement cgroup_get_from_path() and expose cgroup_put()
Implement cgroup_get_from_path() using kernfs_walk_and_get() which obtains a default hierarchy cgroup from its path. This will be used to allow cgroup path based matching from outside cgroup proper - e.g. networking and perf. v2: Add EXPORT_SYMBOL_GPL(cgroup_get_from_path). Signed-off-by: Tejun Heo --- include/linux/cgroup.h | 7 +++ kernel/cgroup.c| 39 ++- 2 files changed, 41 insertions(+), 5 deletions(-) diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index b5ee2c4..4c3ffab 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -81,6 +81,8 @@ struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgroup, struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry, struct cgroup_subsys *ss); +struct cgroup *cgroup_get_from_path(const char *path); + int cgroup_attach_task_all(struct task_struct *from, struct task_struct *); int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from); @@ -351,6 +353,11 @@ static inline void css_put_many(struct cgroup_subsys_state *css, unsigned int n) percpu_ref_put_many(&css->refcnt, n); } +static inline void cgroup_put(struct cgroup *cgrp) +{ + css_put(&cgrp->self); +} + /** * task_css_set_check - obtain a task's css_set with extra access conditions * @task: the task to obtain css_set for diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 3190040..3db5e8f 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -434,11 +434,6 @@ static bool cgroup_tryget(struct cgroup *cgrp) return css_tryget(&cgrp->self); } -static void cgroup_put(struct cgroup *cgrp) -{ - css_put(&cgrp->self); -} - struct cgroup_subsys_state *of_css(struct kernfs_open_file *of) { struct cgroup *cgrp = of->kn->parent->priv; @@ -5753,6 +5748,40 @@ struct cgroup_subsys_state *css_from_id(int id, struct cgroup_subsys *ss) return id > 0 ? idr_find(&ss->css_idr, id) : NULL; } +/** + * cgroup_get_from_path - lookup and get a cgroup from its default hierarchy path + * @path: path on the default hierarchy + * + * Find the cgroup at @path on the default hierarchy, increment its + * reference count and return it. Returns pointer to the found cgroup on + * success, ERR_PTR(-ENOENT) if @path doens't exist and ERR_PTR(-ENOTDIR) + * if @path points to a non-directory. + */ +struct cgroup *cgroup_get_from_path(const char *path) +{ + struct kernfs_node *kn; + struct cgroup *cgrp; + + mutex_lock(&cgroup_mutex); + + kn = kernfs_walk_and_get(cgrp_dfl_root.cgrp.kn, path); + if (kn) { + if (kernfs_type(kn) == KERNFS_DIR) { + cgrp = kn->priv; + cgroup_get(cgrp); + } else { + cgrp = ERR_PTR(-ENOTDIR); + } + kernfs_put(kn); + } else { + cgrp = ERR_PTR(-ENOENT); + } + + mutex_unlock(&cgroup_mutex); + return cgrp; +} +EXPORT_SYMBOL_GPL(cgroup_get_from_path); + #ifdef CONFIG_CGROUP_DEBUG static struct cgroup_subsys_state * debug_css_alloc(struct cgroup_subsys_state *parent_css) -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/9] netprio_cgroup: limit the maximum css->id to USHRT_MAX
netprio builds per-netdev contiguous priomap array which is indexed by css->id. The array is allocated using kzalloc() effectively limiting the maximum ID supported to some thousand range. This patch caps the maximum supported css->id to USHRT_MAX which should be way above what is actually useable. This allows reducing sock->sk_cgrp_prioidx to u16 from u32. The freed up part will be used to overload the cgroup related fields. sock->sk_cgrp_prioidx's position is swapped with sk_mark so that the two cgroup related fields are adjacent. Signed-off-by: Tejun Heo Acked-by: Daniel Wagner Cc: Daniel Borkmann CC: Neil Horman --- include/net/sock.h| 10 +- net/core/netprio_cgroup.c | 9 + 2 files changed, 14 insertions(+), 5 deletions(-) diff --git a/include/net/sock.h b/include/net/sock.h index bbf7c2c..b517351 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -288,7 +288,6 @@ struct cg_proto; *@sk_ack_backlog: current listen backlog *@sk_max_ack_backlog: listen backlog set in listen() *@sk_priority: %SO_PRIORITY setting - *@sk_cgrp_prioidx: socket group's priority map index *@sk_type: socket type (%SOCK_STREAM, etc) *@sk_protocol: which protocol this socket belongs in this network family *@sk_peer_pid: &struct pid for this socket's peer @@ -309,6 +308,7 @@ struct cg_proto; *@sk_send_head: front of stuff to transmit *@sk_security: used by security modules *@sk_mark: generic packet mark + *@sk_cgrp_prioidx: socket group's priority map index *@sk_classid: this socket's cgroup classid *@sk_cgrp: this socket's cgroup-specific proto data *@sk_write_pending: a write to stream socket waits to start @@ -423,9 +423,7 @@ struct sock { u32 sk_ack_backlog; u32 sk_max_ack_backlog; __u32 sk_priority; -#if IS_ENABLED(CONFIG_CGROUP_NET_PRIO) - __u32 sk_cgrp_prioidx; -#endif + __u32 sk_mark; struct pid *sk_peer_pid; const struct cred *sk_peer_cred; longsk_rcvtimeo; @@ -443,7 +441,9 @@ struct sock { #ifdef CONFIG_SECURITY void*sk_security; #endif - __u32 sk_mark; +#if IS_ENABLED(CONFIG_CGROUP_NET_PRIO) + u16 sk_cgrp_prioidx; +#endif #ifdef CONFIG_CGROUP_NET_CLASSID u32 sk_classid; #endif diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c index cbd0a19..2b9159b 100644 --- a/net/core/netprio_cgroup.c +++ b/net/core/netprio_cgroup.c @@ -27,6 +27,12 @@ #include +/* + * netprio allocates per-net_device priomap array which is indexed by + * css->id. Limiting css ID to 16bits doesn't lose anything. + */ +#define NETPRIO_ID_MAX USHRT_MAX + #define PRIOMAP_MIN_SZ 128 /* @@ -144,6 +150,9 @@ static int cgrp_css_online(struct cgroup_subsys_state *css) struct net_device *dev; int ret = 0; + if (css->id > NETPRIO_ID_MAX) + return -ENOSPC; + if (!parent_css) return 0; -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/9] cgroups: Allow dynamically changing net_classid
From: Nina Schiff The classid of a process is changed either when a process is moved to or from a cgroup or when the net_cls.classid file is updated. Previously net_cls only supported propogating these changes to the cgroup's related sockets when a process was added or removed from the cgroup. This means it was neccessary to remove and re-add all processes to a cgroup in order to update its classid. This change introduces support for doing this dynamically - i.e. when the value is changed in the net_cls_classid file, this will also trigger an update to the classid associated with all sockets controlled by the cgroup. This mimics the behaviour of other cgroup subsystems. net_prio circumvents this issue by storing an index into a table with each socket (and so any updates to the table, don't require updating the value associated with the socket). net_cls, however, passes the socket the classid directly, and so this additional step is needed. Signed-off-by: Nina Schiff Signed-off-by: Tejun Heo --- net/core/netclassid_cgroup.c | 26 ++ 1 file changed, 18 insertions(+), 8 deletions(-) diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c index 6441f47..2e4df84 100644 --- a/net/core/netclassid_cgroup.c +++ b/net/core/netclassid_cgroup.c @@ -56,7 +56,7 @@ static void cgrp_css_free(struct cgroup_subsys_state *css) kfree(css_cls_state(css)); } -static int update_classid(const void *v, struct file *file, unsigned n) +static int update_classid_sock(const void *v, struct file *file, unsigned n) { int err; struct socket *sock = sock_from_file(file, &err); @@ -67,18 +67,25 @@ static int update_classid(const void *v, struct file *file, unsigned n) return 0; } -static void cgrp_attach(struct cgroup_subsys_state *css, - struct cgroup_taskset *tset) +static void update_classid(struct cgroup_subsys_state *css, void *v) { - struct cgroup_cls_state *cs = css_cls_state(css); - void *v = (void *)(unsigned long)cs->classid; + struct css_task_iter it; struct task_struct *p; - cgroup_taskset_for_each(p, tset) { + css_task_iter_start(css, &it); + while ((p = css_task_iter_next(&it))) { task_lock(p); - iterate_fd(p->files, 0, update_classid, v); + iterate_fd(p->files, 0, update_classid_sock, v); task_unlock(p); } + css_task_iter_end(&it); +} + +static void cgrp_attach(struct cgroup_subsys_state *css, + struct cgroup_taskset *tset) +{ + update_classid(css, + (void *)(unsigned long)css_cls_state(css)->classid); } static u64 read_classid(struct cgroup_subsys_state *css, struct cftype *cft) @@ -89,8 +96,11 @@ static u64 read_classid(struct cgroup_subsys_state *css, struct cftype *cft) static int write_classid(struct cgroup_subsys_state *css, struct cftype *cft, u64 value) { - css_cls_state(css)->classid = (u32) value; + struct cgroup_cls_state *cs = css_cls_state(css); + + cs->classid = (u32)value; + update_classid(css, (void *)(unsigned long)cs->classid); return 0; } -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 7/9] sock, cgroup: add sock->sk_cgroup
In cgroup v1, dealing with cgroup membership was difficult because the number of membership associations was unbound. As a result, cgroup v1 grew several controllers whose primary purpose is either tagging membership or pull in configuration knobs from other subsystems so that cgroup membership test can be avoided. net_cls and net_prio controllers are examples of the latter. They allow configuring network-specific attributes from cgroup side so that network subsystem can avoid testing cgroup membership; unfortunately, these are not only cumbersome but also problematic. Both net_cls and net_prio aren't properly hierarchical. Both inherit configuration from the parent on creation but there's no interaction afterwards. An ancestor doesn't restrict the behavior in its subtree in anyway and configuration changes aren't propagated downwards. Especially when combined with cgroup delegation, this is problematic because delegatees can mess up whatever network configuration implemented at the system level. net_prio would allow the delegatees to set whatever priority value regardless of CAP_NET_ADMIN and net_cls the same for classid. While it is possible to solve these issues from controller side by implementing hierarchical allowable ranges in both controllers, it would involve quite a bit of complexity in the controllers and further obfuscate network configuration as it becomes even more difficult to tell what's actually being configured looking from the network side. While not much can be done for v1 at this point, as membership handling is sane on cgroup v2, it'd be better to make cgroup matching behave like other network matches and classifiers than introducing further complications. In preparation, this patch updates sock->sk_cgrp_data handling so that it points to the v2 cgroup that sock was created in until either net_prio or net_cls is used. Once either of the two is used, sock->sk_cgrp_data reverts to its previous role of carrying prioidx and classid. This is to avoid adding yet another cgroup related field to struct sock. As the mode switching can happen at most once per boot, the switching mechanism is aimed at lowering hot path overhead. It may leak a finite, likely small, number of cgroup refs and report spurious prioidx or classid on switching; however, dynamic updates of prioidx and classid have always been racy and lossy - socks between creation and fd installation are never updated, config changes don't update existing sockets at all, and prioidx may index with dead and recycled cgroup IDs. Non-critical inaccuracies from small race windows won't make any noticeable difference. This patch doesn't make use of the pointer yet. The following patch will implement netfilter match for cgroup2 membership. v2: Use sock_cgroup_data to avoid inflating struct sock w/ another cgroup specific field. v3: Add comments explaining why sock_data_prioidx() and sock_data_classid() use different fallback values. Signed-off-by: Tejun Heo Cc: Daniel Borkmann Cc: Daniel Wagner CC: Neil Horman --- include/linux/cgroup-defs.h | 88 +--- include/linux/cgroup.h | 41 + kernel/cgroup.c | 55 ++- net/core/netclassid_cgroup.c | 7 +++- net/core/netprio_cgroup.c| 7 +++- net/core/sock.c | 2 + 6 files changed, 191 insertions(+), 9 deletions(-) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index ed128fed..9dc2263 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -544,31 +544,107 @@ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) {} #ifdef CONFIG_SOCK_CGROUP_DATA +/* + * sock_cgroup_data is embedded at sock->sk_cgrp_data and contains + * per-socket cgroup information except for memcg association. + * + * On legacy hierarchies, net_prio and net_cls controllers directly set + * attributes on each sock which can then be tested by the network layer. + * On the default hierarchy, each sock is associated with the cgroup it was + * created in and the networking layer can match the cgroup directly. + * + * To avoid carrying all three cgroup related fields separately in sock, + * sock_cgroup_data overloads (prioidx, classid) and the cgroup pointer. + * On boot, sock_cgroup_data records the cgroup that the sock was created + * in so that cgroup2 matches can be made; however, once either net_prio or + * net_cls starts being used, the area is overriden to carry prioidx and/or + * classid. The two modes are distinguished by whether the lowest bit is + * set. Clear bit indicates cgroup pointer while set bit prioidx and + * classid. + * + * While userland may start using net_prio or net_cls at any time, once + * either is used, cgroup2 matching no longer works. There is no reason to + * mix the two and this is in line with how legacy and v2 compatibility is + * handled. On mode switch, cgroup references whi
[PATCH 1/2 iptables] libxt_cgroup: prepare for multi revisions
libxt_cgroup will grow cgroup2 path based match. Postfix existing symbols with _v0 and prepare for multi revision registration. While at it, rename O_CGROUP to O_CLASSID and fwid to classid. Signed-off-by: Tejun Heo Cc: Daniel Borkmann Cc: Jan Engelhardt Cc: Pablo Neira Ayuso --- extensions/libxt_cgroup.c | 51 +++- include/linux/netfilter/xt_cgroup.h |2 - 2 files changed, 28 insertions(+), 25 deletions(-) --- a/extensions/libxt_cgroup.c +++ b/extensions/libxt_cgroup.c @@ -3,30 +3,30 @@ #include enum { - O_CGROUP = 0, + O_CLASSID = 0, }; -static void cgroup_help(void) +static void cgroup_help_v0(void) { printf( "cgroup match options:\n" -"[!] --cgroup fwid Match cgroup fwid\n"); +"[!] --cgroup classidMatch cgroup classid\n"); } -static const struct xt_option_entry cgroup_opts[] = { +static const struct xt_option_entry cgroup_opts_v0[] = { { .name = "cgroup", - .id = O_CGROUP, + .id = O_CLASSID, .type = XTTYPE_UINT32, .flags = XTOPT_INVERT | XTOPT_MAND | XTOPT_PUT, - XTOPT_POINTER(struct xt_cgroup_info, id) + XTOPT_POINTER(struct xt_cgroup_info_v0, id) }, XTOPT_TABLEEND, }; -static void cgroup_parse(struct xt_option_call *cb) +static void cgroup_parse_v0(struct xt_option_call *cb) { - struct xt_cgroup_info *cgroupinfo = cb->data; + struct xt_cgroup_info_v0 *cgroupinfo = cb->data; xtables_option_parse(cb); if (cb->invert) @@ -34,34 +34,37 @@ static void cgroup_parse(struct xt_optio } static void -cgroup_print(const void *ip, const struct xt_entry_match *match, int numeric) +cgroup_print_v0(const void *ip, const struct xt_entry_match *match, int numeric) { - const struct xt_cgroup_info *info = (void *) match->data; + const struct xt_cgroup_info_v0 *info = (void *) match->data; printf(" cgroup %s%u", info->invert ? "! ":"", info->id); } -static void cgroup_save(const void *ip, const struct xt_entry_match *match) +static void cgroup_save_v0(const void *ip, const struct xt_entry_match *match) { - const struct xt_cgroup_info *info = (void *) match->data; + const struct xt_cgroup_info_v0 *info = (void *) match->data; printf("%s --cgroup %u", info->invert ? " !" : "", info->id); } -static struct xtables_match cgroup_match = { - .family = NFPROTO_UNSPEC, - .name = "cgroup", - .version= XTABLES_VERSION, - .size = XT_ALIGN(sizeof(struct xt_cgroup_info)), - .userspacesize = XT_ALIGN(sizeof(struct xt_cgroup_info)), - .help = cgroup_help, - .print = cgroup_print, - .save = cgroup_save, - .x6_parse = cgroup_parse, - .x6_options = cgroup_opts, +static struct xtables_match cgroup_match[] = { + { + .family = NFPROTO_UNSPEC, + .revision = 0, + .name = "cgroup", + .version= XTABLES_VERSION, + .size = XT_ALIGN(sizeof(struct xt_cgroup_info_v0)), + .userspacesize = XT_ALIGN(sizeof(struct xt_cgroup_info_v0)), + .help = cgroup_help_v0, + .print = cgroup_print_v0, + .save = cgroup_save_v0, + .x6_parse = cgroup_parse_v0, + .x6_options = cgroup_opts_v0, + }, }; void _init(void) { - xtables_register_match(&cgroup_match); + xtables_register_matches(cgroup_match, ARRAY_SIZE(cgroup_match)); } --- a/include/linux/netfilter/xt_cgroup.h +++ b/include/linux/netfilter/xt_cgroup.h @@ -3,7 +3,7 @@ #include -struct xt_cgroup_info { +struct xt_cgroup_info_v0 { __u32 id; __u32 invert; }; -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 9/9] netfilter: implement xt_cgroup cgroup2 path match
This patch implements xt_cgroup path match which matches cgroup2 membership of the associated socket. The match is recursive and invertible. For rationales on introducing another cgroup based match, please refer to a preceding commit "sock, cgroup: add sock->sk_cgroup". v3: Folded into xt_cgroup as a new revision interface as suggested by Pablo. v2: Included linux/limits.h from xt_cgroup2.h for PATH_MAX. Added explicit alignment to the priv field. Both suggested by Jan. Signed-off-by: Tejun Heo Cc: Daniel Borkmann Cc: Daniel Wagner CC: Neil Horman Cc: Jan Engelhardt Cc: Pablo Neira Ayuso --- include/uapi/linux/netfilter/xt_cgroup.h | 13 ++ net/netfilter/xt_cgroup.c| 69 2 files changed, 82 insertions(+) diff --git a/include/uapi/linux/netfilter/xt_cgroup.h b/include/uapi/linux/netfilter/xt_cgroup.h index 577c9e0..1e4b37b 100644 --- a/include/uapi/linux/netfilter/xt_cgroup.h +++ b/include/uapi/linux/netfilter/xt_cgroup.h @@ -2,10 +2,23 @@ #define _UAPI_XT_CGROUP_H #include +#include struct xt_cgroup_info_v0 { __u32 id; __u32 invert; }; +struct xt_cgroup_info_v1 { + __u8has_path; + __u8has_classid; + __u8invert_path; + __u8invert_classid; + charpath[PATH_MAX]; + __u32 classid; + + /* kernel internal data */ + void*priv __attribute__((aligned(8))); +}; + #endif /* _UAPI_XT_CGROUP_H */ diff --git a/net/netfilter/xt_cgroup.c b/net/netfilter/xt_cgroup.c index 1730025..a086a91 100644 --- a/net/netfilter/xt_cgroup.c +++ b/net/netfilter/xt_cgroup.c @@ -34,6 +34,37 @@ static int cgroup_mt_check_v0(const struct xt_mtchk_param *par) return 0; } +static int cgroup_mt_check_v1(const struct xt_mtchk_param *par) +{ + struct xt_cgroup_info_v1 *info = par->matchinfo; + struct cgroup *cgrp; + + if ((info->invert_path & ~1) || (info->invert_classid & ~1)) + return -EINVAL; + + if (!info->has_path && !info->has_classid) { + pr_info("xt_cgroup: no path or classid specified\n"); + return -EINVAL; + } + + if (info->has_path && info->has_classid) { + pr_info("xt_cgroup: both path and classid specified\n"); + return -EINVAL; + } + + if (info->has_path) { + cgrp = cgroup_get_from_path(info->path); + if (IS_ERR(cgrp)) { + pr_info("xt_cgroup: invalid path, errno=%ld\n", + PTR_ERR(cgrp)); + return -EINVAL; + } + info->priv = cgrp; + } + + return 0; +} + static bool cgroup_mt_v0(const struct sk_buff *skb, struct xt_action_param *par) { @@ -46,6 +77,31 @@ cgroup_mt_v0(const struct sk_buff *skb, struct xt_action_param *par) info->invert; } +static bool cgroup_mt_v1(const struct sk_buff *skb, struct xt_action_param *par) +{ + const struct xt_cgroup_info_v1 *info = par->matchinfo; + struct sock_cgroup_data *skcd = &skb->sk->sk_cgrp_data; + struct cgroup *ancestor = info->priv; + + if (!skb->sk || !sk_fullsock(skb->sk)) + return false; + + if (ancestor) + return cgroup_is_descendant(sock_cgroup_ptr(skcd), ancestor) ^ + info->invert_path; + else + return (info->classid == sock_cgroup_classid(skcd)) ^ + info->invert_classid; +} + +static void cgroup_mt_destroy_v1(const struct xt_mtdtor_param *par) +{ + struct xt_cgroup_info_v1 *info = par->matchinfo; + + if (info->priv) + cgroup_put(info->priv); +} + static struct xt_match cgroup_mt_reg[] __read_mostly = { { .name = "cgroup", @@ -59,6 +115,19 @@ static struct xt_match cgroup_mt_reg[] __read_mostly = { (1 << NF_INET_POST_ROUTING) | (1 << NF_INET_LOCAL_IN), }, + { + .name = "cgroup", + .revision = 1, + .family = NFPROTO_UNSPEC, + .checkentry = cgroup_mt_check_v1, + .match = cgroup_mt_v1, + .matchsize = sizeof(struct xt_cgroup_info_v1), + .destroy= cgroup_mt_destroy_v1, + .me = THIS_MODULE, + .hooks = (1 << NF_INET_LOCAL_OUT) | + (1 << NF_INET_POST_ROUTING) | + (1 << NF_INET_LOCAL_IN), + }, }; static int __init cgroup_mt_init(void) -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/9] net: wrap sock->sk_cgrp_prioidx and ->sk_classid inside a struct
Introduce sock->sk_cgrp_data which is a struct sock_cgroup_data. ->sk_cgroup_prioidx and ->sk_classid are moved into it. The struct and its accessors are defined in cgroup-defs.h. This is to prepare for overloading the fields with a cgroup pointer. This patch mostly performs equivalent conversions but the followings are noteworthy. * Equality test before updating classid is removed from sock_update_classid(). This shouldn't make any noticeable difference and a similar test will be implemented on the helper side later. * sock_update_netprioidx() now takes struct sock_cgroup_data and can be moved to netprio_cgroup.h without causing include dependency loop. Moved. * The dummy version of sock_update_netprioidx() converted to a static inline function while at it. Signed-off-by: Tejun Heo --- include/linux/cgroup-defs.h | 36 include/net/cls_cgroup.h | 11 +-- include/net/netprio_cgroup.h | 16 +--- include/net/sock.h | 11 +++ net/Kconfig | 6 ++ net/core/dev.c | 3 ++- net/core/netclassid_cgroup.c | 4 ++-- net/core/netprio_cgroup.c| 3 ++- net/core/scm.c | 4 ++-- net/core/sock.c | 15 ++- net/netfilter/nft_meta.c | 2 +- net/netfilter/xt_cgroup.c| 3 ++- 12 files changed, 76 insertions(+), 38 deletions(-) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 504d859..ed128fed 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -542,4 +542,40 @@ static inline void cgroup_threadgroup_change_end(struct task_struct *tsk) {} #endif /* CONFIG_CGROUPS */ +#ifdef CONFIG_SOCK_CGROUP_DATA + +struct sock_cgroup_data { + u16 prioidx; + u32 classid; +}; + +static inline u16 sock_cgroup_prioidx(struct sock_cgroup_data *skcd) +{ + return skcd->prioidx; +} + +static inline u32 sock_cgroup_classid(struct sock_cgroup_data *skcd) +{ + return skcd->classid; +} + +static inline void sock_cgroup_set_prioidx(struct sock_cgroup_data *skcd, + u16 prioidx) +{ + skcd->prioidx = prioidx; +} + +static inline void sock_cgroup_set_classid(struct sock_cgroup_data *skcd, + u32 classid) +{ + skcd->classid = classid; +} + +#else /* CONFIG_SOCK_CGROUP_DATA */ + +struct sock_cgroup_data { +}; + +#endif /* CONFIG_SOCK_CGROUP_DATA */ + #endif /* _LINUX_CGROUP_DEFS_H */ diff --git a/include/net/cls_cgroup.h b/include/net/cls_cgroup.h index ccd6d8b..c0a92e2 100644 --- a/include/net/cls_cgroup.h +++ b/include/net/cls_cgroup.h @@ -41,13 +41,12 @@ static inline u32 task_cls_classid(struct task_struct *p) return classid; } -static inline void sock_update_classid(struct sock *sk) +static inline void sock_update_classid(struct sock_cgroup_data *skcd) { u32 classid; classid = task_cls_classid(current); - if (classid != sk->sk_classid) - sk->sk_classid = classid; + sock_cgroup_set_classid(skcd, classid); } static inline u32 task_get_classid(const struct sk_buff *skb) @@ -64,17 +63,17 @@ static inline u32 task_get_classid(const struct sk_buff *skb) * softirqs always disables bh. */ if (in_serving_softirq()) { - /* If there is an sk_classid we'll use that. */ + /* If there is an sock_cgroup_classid we'll use that. */ if (!skb->sk) return 0; - classid = skb->sk->sk_classid; + classid = sock_cgroup_classid(&skb->sk->sk_cgrp_data); } return classid; } #else /* !CONFIG_CGROUP_NET_CLASSID */ -static inline void sock_update_classid(struct sock *sk) +static inline void sock_update_classid(struct sock_cgroup_data *skcd) { } diff --git a/include/net/netprio_cgroup.h b/include/net/netprio_cgroup.h index f2a9597..6041905 100644 --- a/include/net/netprio_cgroup.h +++ b/include/net/netprio_cgroup.h @@ -25,8 +25,6 @@ struct netprio_map { u32 priomap[]; }; -void sock_update_netprioidx(struct sock *sk); - static inline u32 task_netprioidx(struct task_struct *p) { struct cgroup_subsys_state *css; @@ -38,13 +36,25 @@ static inline u32 task_netprioidx(struct task_struct *p) rcu_read_unlock(); return idx; } + +static inline void sock_update_netprioidx(struct sock_cgroup_data *skcd) +{ + if (in_interrupt()) + return; + + sock_cgroup_set_prioidx(skcd, task_netprioidx(current)); +} + #else /* !CONFIG_CGROUP_NET_PRIO */ + static inline u32 task_netprioidx(struct task_struct *p) { return 0; } -#define sock_update_netprioidx(sk) +static inline void sock_update_netprioidx(struct sock_cgroup_data *skcd) +{ +} #endif /* CONFIG_CGROUP_NET_PRIO */ #endif /* _NET_CLS_CGROUP_H */ diff --git a/include/net/sock.h b/include/net/sock.h
[PATCH 8/9] netfilter: prepare xt_cgroup for multi revisions
xt_cgroup will grow cgroup2 path based match. Postfix existing symbols with _v0 and prepare for multi revision registration. Signed-off-by: Tejun Heo Cc: Daniel Borkmann Cc: Daniel Wagner CC: Neil Horman Cc: Jan Engelhardt Cc: Pablo Neira Ayuso --- include/uapi/linux/netfilter/xt_cgroup.h | 2 +- net/netfilter/xt_cgroup.c| 36 +--- 2 files changed, 20 insertions(+), 18 deletions(-) diff --git a/include/uapi/linux/netfilter/xt_cgroup.h b/include/uapi/linux/netfilter/xt_cgroup.h index 43acb7e..577c9e0 100644 --- a/include/uapi/linux/netfilter/xt_cgroup.h +++ b/include/uapi/linux/netfilter/xt_cgroup.h @@ -3,7 +3,7 @@ #include -struct xt_cgroup_info { +struct xt_cgroup_info_v0 { __u32 id; __u32 invert; }; diff --git a/net/netfilter/xt_cgroup.c b/net/netfilter/xt_cgroup.c index 54eaeb4..1730025 100644 --- a/net/netfilter/xt_cgroup.c +++ b/net/netfilter/xt_cgroup.c @@ -24,9 +24,9 @@ MODULE_DESCRIPTION("Xtables: process control group matching"); MODULE_ALIAS("ipt_cgroup"); MODULE_ALIAS("ip6t_cgroup"); -static int cgroup_mt_check(const struct xt_mtchk_param *par) +static int cgroup_mt_check_v0(const struct xt_mtchk_param *par) { - struct xt_cgroup_info *info = par->matchinfo; + struct xt_cgroup_info_v0 *info = par->matchinfo; if (info->invert & ~1) return -EINVAL; @@ -35,9 +35,9 @@ static int cgroup_mt_check(const struct xt_mtchk_param *par) } static bool -cgroup_mt(const struct sk_buff *skb, struct xt_action_param *par) +cgroup_mt_v0(const struct sk_buff *skb, struct xt_action_param *par) { - const struct xt_cgroup_info *info = par->matchinfo; + const struct xt_cgroup_info_v0 *info = par->matchinfo; if (skb->sk == NULL || !sk_fullsock(skb->sk)) return false; @@ -46,27 +46,29 @@ cgroup_mt(const struct sk_buff *skb, struct xt_action_param *par) info->invert; } -static struct xt_match cgroup_mt_reg __read_mostly = { - .name = "cgroup", - .revision = 0, - .family = NFPROTO_UNSPEC, - .checkentry = cgroup_mt_check, - .match = cgroup_mt, - .matchsize = sizeof(struct xt_cgroup_info), - .me = THIS_MODULE, - .hooks = (1 << NF_INET_LOCAL_OUT) | - (1 << NF_INET_POST_ROUTING) | - (1 << NF_INET_LOCAL_IN), +static struct xt_match cgroup_mt_reg[] __read_mostly = { + { + .name = "cgroup", + .revision = 0, + .family = NFPROTO_UNSPEC, + .checkentry = cgroup_mt_check_v0, + .match = cgroup_mt_v0, + .matchsize = sizeof(struct xt_cgroup_info_v0), + .me = THIS_MODULE, + .hooks = (1 << NF_INET_LOCAL_OUT) | + (1 << NF_INET_POST_ROUTING) | + (1 << NF_INET_LOCAL_IN), + }, }; static int __init cgroup_mt_init(void) { - return xt_register_match(&cgroup_mt_reg); + return xt_register_matches(cgroup_mt_reg, ARRAY_SIZE(cgroup_mt_reg)); } static void __exit cgroup_mt_exit(void) { - xt_unregister_match(&cgroup_mt_reg); + xt_unregister_matches(cgroup_mt_reg, ARRAY_SIZE(cgroup_mt_reg)); } module_init(cgroup_mt_init); -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCHSET v3] netfilter, cgroup: implement cgroup2 path match in xt_cgroup
Hello, This is v3 of the xt_cgroup2 patchset. Changes from the last take are * Folded cgroup2 path matching into xt_cgroup as a new revision rather than a separate xt_cgroup2 match as suggested by Pablo. * Refreshed on top of Nina's net_cls dynamic config update fix patch. I included the fix patch as part of this series to ease reviewing. The changes from v1 to v2 are * Instead of adding sock->sk_cgroup separately, sock->sk_cgrp_data now carries either (prioidx, classid) pair or cgroup2 pointer. This avoids inflating struct sock with yet another cgroup related field. Unfortunately, this does add some complexity but that's the trade-off and the complexity is contained in cgroup proper. * Various small updats as per David and Jan's reviews. In cgroup v1, dealing with cgroup membership was difficult because the number of membership associations was unbound. As a result, cgroup v1 grew several controllers whose primary purpose is either tagging membership or pull in configuration knobs from other subsystems so that cgroup membership test can be avoided. net_cls and net_prio controllers are examples of the latter. They allow configuring network-specific attributes from cgroup side so that network subsystem can avoid testing cgroup membership; unfortunately, these are not only cumbersome but also problematic. Both net_cls and net_prio aren't properly hierarchical. Both inherit configuration from the parent on creation but there's no interaction afterwards. An ancestor doesn't restrict the behavior in its subtree in anyway and configuration changes aren't propagated downwards. Especially when combined with cgroup delegation, this is problematic because delegatees can mess up whatever network configuration implemented at the system level. net_prio would allow the delegatees to set whatever priority value regardless of CAP_NET_ADMIN and net_cls the same for classid. While it is possible to solve these issues from controller side by implementing hierarchical allowable ranges in both controllers, it would involve quite a bit of complexity in the controllers and further obfuscate network configuration as it becomes even more difficult to tell what's actually being configured looking from the network side. While not much can be done for v1 at this point, as membership handling is sane on cgroup v2, it'd be better to make cgroup matching behave like other network matches and classifiers than introducing further complications. This patchset includes the following nine patches. 0001-cgroup-record-ancestor-IDs-and-reimplement-cgroup_is.patch 0002-kernfs-implement-kernfs_walk_and_get.patch 0003-cgroup-implement-cgroup_get_from_path-and-expose-cgr.patch 0004-cgroups-Allow-dynamically-changing-net_classid.patch 0005-netprio_cgroup-limit-the-maximum-css-id-to-USHRT_MAX.patch 0006-net-wrap-sock-sk_cgrp_prioidx-and-sk_classid-inside-.patch 0007-sock-cgroup-add-sock-sk_cgroup.patch 0008-netfilter-prepare-xt_cgroup-for-multi-revisions.patch 0009-netfilter-implement-xt_cgroup-cgroup2-path-match.patch 0001-0003 are prepatory patches in kernfs and cgroup. These patches are available in the following branch which will stay stable. git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-4.5-ancestor-test 0004 is the following net_cls config update fix patch included in this series to ease reviewing as it causes a conflict with a later patch in this series. http://lkml.kernel.org/g/1448051499-1885574-1-git-send-email-nin...@fb.com 0005-0007 consolidate two cgroup related fields in struct sock into cgroup_sock_data and update it so that it can alternatively carry a cgroup pointer. 0008-0009 implement cgroup2 patch matching in xt_cgroup. This patchset is on top of v4.4-rc1 and also available in the following git branch. git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-xt_cgroup2 I'll post iptables extension as a reply. diffstat follows. Thanks. fs/kernfs/dir.c | 46 +++ include/linux/cgroup-defs.h | 126 +++ include/linux/cgroup.h | 66 +++- include/linux/kernfs.h | 12 ++ include/net/cls_cgroup.h | 11 +- include/net/netprio_cgroup.h | 16 +++ include/net/sock.h | 13 --- include/uapi/linux/netfilter/xt_cgroup.h | 15 +++ kernel/cgroup.c | 126 --- net/Kconfig |6 + net/core/dev.c |3 net/core/netclassid_cgroup.c | 37 ++--- net/core/netprio_cgroup.c| 19 net/core/scm.c |4 net/core/sock.c | 17 net/netfilter/nft_meta.c |2 net/netfilter/xt_cgroup.c| 108 ++ 17 files changed, 531 insertions(+
[PATCH 2/9] kernfs: implement kernfs_walk_and_get()
Implement kernfs_walk_and_get() which is similar to kernfs_find_and_get() but can walk a path instead of just a name. v2: Use strlcpy() instead of strlen() + memcpy() as suggested by David. Signed-off-by: Tejun Heo Acked-by: Greg Kroah-Hartman Cc: David Miller --- fs/kernfs/dir.c| 46 ++ include/linux/kernfs.h | 12 2 files changed, 58 insertions(+) diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index 91e0045..742bf4a 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -694,6 +694,29 @@ static struct kernfs_node *kernfs_find_ns(struct kernfs_node *parent, return NULL; } +static struct kernfs_node *kernfs_walk_ns(struct kernfs_node *parent, + const unsigned char *path, + const void *ns) +{ + static char path_buf[PATH_MAX]; /* protected by kernfs_mutex */ + size_t len = strlcpy(path_buf, path, PATH_MAX); + char *p = path_buf; + char *name; + + lockdep_assert_held(&kernfs_mutex); + + if (len >= PATH_MAX) + return NULL; + + while ((name = strsep(&p, "/")) && parent) { + if (*name == '\0') + continue; + parent = kernfs_find_ns(parent, name, ns); + } + + return parent; +} + /** * kernfs_find_and_get_ns - find and get kernfs_node with the given name * @parent: kernfs_node to search under @@ -719,6 +742,29 @@ struct kernfs_node *kernfs_find_and_get_ns(struct kernfs_node *parent, EXPORT_SYMBOL_GPL(kernfs_find_and_get_ns); /** + * kernfs_walk_and_get_ns - find and get kernfs_node with the given path + * @parent: kernfs_node to search under + * @path: path to look for + * @ns: the namespace tag to use + * + * Look for kernfs_node with path @path under @parent and get a reference + * if found. This function may sleep and returns pointer to the found + * kernfs_node on success, %NULL on failure. + */ +struct kernfs_node *kernfs_walk_and_get_ns(struct kernfs_node *parent, + const char *path, const void *ns) +{ + struct kernfs_node *kn; + + mutex_lock(&kernfs_mutex); + kn = kernfs_walk_ns(parent, path, ns); + kernfs_get(kn); + mutex_unlock(&kernfs_mutex); + + return kn; +} + +/** * kernfs_create_root - create a new kernfs hierarchy * @scops: optional syscall operations for the hierarchy * @flags: KERNFS_ROOT_* flags diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h index 5d4e9c4..af51df3 100644 --- a/include/linux/kernfs.h +++ b/include/linux/kernfs.h @@ -274,6 +274,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn); struct kernfs_node *kernfs_get_parent(struct kernfs_node *kn); struct kernfs_node *kernfs_find_and_get_ns(struct kernfs_node *parent, const char *name, const void *ns); +struct kernfs_node *kernfs_walk_and_get_ns(struct kernfs_node *parent, + const char *path, const void *ns); void kernfs_get(struct kernfs_node *kn); void kernfs_put(struct kernfs_node *kn); @@ -350,6 +352,10 @@ static inline struct kernfs_node * kernfs_find_and_get_ns(struct kernfs_node *parent, const char *name, const void *ns) { return NULL; } +static inline struct kernfs_node * +kernfs_walk_and_get_ns(struct kernfs_node *parent, const char *path, + const void *ns) +{ return NULL; } static inline void kernfs_get(struct kernfs_node *kn) { } static inline void kernfs_put(struct kernfs_node *kn) { } @@ -431,6 +437,12 @@ kernfs_find_and_get(struct kernfs_node *kn, const char *name) } static inline struct kernfs_node * +kernfs_walk_and_get(struct kernfs_node *kn, const char *path) +{ + return kernfs_walk_and_get_ns(kn, path, NULL); +} + +static inline struct kernfs_node * kernfs_create_dir(struct kernfs_node *parent, const char *name, umode_t mode, void *priv) { -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next v2 3/9] net: ipmr: remove some pimsm ifdefs and simplify
From: Nikolay Aleksandrov Add the helper pimsm_enabled() which replaces the old CONFIG_IP_PIMSM define and is used to check if any version of PIM-SM has been enabled. Use a single if defined(CONFIG_IP_PIMSM_V1) || defined(CONFIG_IP_PIMSM_V2) for the pim-sm shared code. This is okay w.r.t IGMPMSG_WHOLEPKT because only a VIFF_REGISTER device can send such packet, and it can't be created if pimsm_enabled() is false. Signed-off-by: Nikolay Aleksandrov --- net/ipv4/ipmr.c | 180 ++-- 1 file changed, 84 insertions(+), 96 deletions(-) diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index dd2462f70d34..e153ab7b17a1 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -67,10 +67,6 @@ #include #include -#if defined(CONFIG_IP_PIMSM_V1) || defined(CONFIG_IP_PIMSM_V2) -#define CONFIG_IP_PIMSM1 -#endif - struct mr_table { struct list_headlist; possible_net_t net; @@ -95,6 +91,11 @@ struct ipmr_result { struct mr_table *mrt; }; +static inline bool pimsm_enabled(void) +{ + return IS_BUILTIN(CONFIG_IP_PIMSM_V1) || IS_BUILTIN(CONFIG_IP_PIMSM_V2); +} + /* Big lock, protecting vif table, mrt cache and mroute socket state. * Note that the changes are semaphored via rtnl_lock. */ @@ -454,8 +455,7 @@ failure: return NULL; } -#ifdef CONFIG_IP_PIMSM - +#if defined(CONFIG_IP_PIMSM_V1) || defined(CONFIG_IP_PIMSM_V2) static netdev_tx_t reg_vif_xmit(struct sk_buff *skb, struct net_device *dev) { struct net *net = dev_net(dev); @@ -552,6 +552,51 @@ failure: unregister_netdevice(dev); return NULL; } + +/* called with rcu_read_lock() */ +static int __pim_rcv(struct mr_table *mrt, struct sk_buff *skb, +unsigned int pimlen) +{ + struct net_device *reg_dev = NULL; + struct iphdr *encap; + + encap = (struct iphdr *)(skb_transport_header(skb) + pimlen); + /* +* Check that: +* a. packet is really sent to a multicast group +* b. packet is not a NULL-REGISTER +* c. packet is not truncated +*/ + if (!ipv4_is_multicast(encap->daddr) || + encap->tot_len == 0 || + ntohs(encap->tot_len) + pimlen > skb->len) + return 1; + + read_lock(&mrt_lock); + if (mrt->mroute_reg_vif_num >= 0) + reg_dev = mrt->vif_table[mrt->mroute_reg_vif_num].dev; + read_unlock(&mrt_lock); + + if (!reg_dev) + return 1; + + skb->mac_header = skb->network_header; + skb_pull(skb, (u8 *)encap - skb->data); + skb_reset_network_header(skb); + skb->protocol = htons(ETH_P_IP); + skb->ip_summed = CHECKSUM_NONE; + + skb_tunnel_rx(skb, reg_dev, dev_net(reg_dev)); + + netif_rx(skb); + + return NET_RX_SUCCESS; +} +#else +static struct net_device *ipmr_reg_vif(struct net *net, struct mr_table *mrt) +{ + return NULL; +} #endif /** @@ -734,10 +779,10 @@ static int vif_add(struct net *net, struct mr_table *mrt, return -EADDRINUSE; switch (vifc->vifc_flags) { -#ifdef CONFIG_IP_PIMSM case VIFF_REGISTER: - /* -* Special Purpose VIF in PIM + if (!pimsm_enabled()) + return -EINVAL; + /* Special Purpose VIF in PIM * All the packets will be sent to the daemon */ if (mrt->mroute_reg_vif_num >= 0) @@ -752,7 +797,6 @@ static int vif_add(struct net *net, struct mr_table *mrt, return err; } break; -#endif case VIFF_TUNNEL: dev = ipmr_new_tunnel(net, vifc); if (!dev) @@ -942,34 +986,29 @@ static void ipmr_cache_resolve(struct net *net, struct mr_table *mrt, } } -/* - * Bounce a cache query up to mrouted. We could use netlink for this but mrouted - * expects the following bizarre scheme. +/* Bounce a cache query up to mrouted. We could use netlink for this but mrouted + * expects the following bizarre scheme. * - * Called under mrt_lock. + * Called under mrt_lock. */ - static int ipmr_cache_report(struct mr_table *mrt, struct sk_buff *pkt, vifi_t vifi, int assert) { - struct sk_buff *skb; const int ihl = ip_hdrlen(pkt); + struct sock *mroute_sk; struct igmphdr *igmp; struct igmpmsg *msg; - struct sock *mroute_sk; + struct sk_buff *skb; int ret; -#ifdef CONFIG_IP_PIMSM if (assert == IGMPMSG_WHOLEPKT) skb = skb_realloc_headroom(pkt, sizeof(struct iphdr)); else -#endif skb = alloc_skb(128, GFP_ATOMIC); if (!skb) return -ENOBUFS; -#ifdef CONFIG_IP_PIMSM if (assert == IGMPMSG_WHOLEPKT) { /* Ugly, but we have no choice with this interface.
[PATCH net-next v2 0/9] net: ipmr: cleanups and minor improvements
From: Nikolay Aleksandrov Hi, Since I'll have to work with ipmr, I decided to clean it up and do some minor improvements. Functionally there're almost no changes except the SLAB_PANIC removal. Most of the patches just re-design some functions to be clearer and more concise and try to remove the ifdef web that was inside. There's more information in each commit. This is the first set, the end goal is to introduce complete netlink support and control over the mfc and vif devices. I've tried to test all of the setsockopt/getsockopt options, and also made builds with various ipmr kconfig options turned on and off. v2: change patch 7 to keep SLAB_PANIC and just drop the unnecessary null check Thank you, Nik Nikolay Aleksandrov (9): net: ipmr: move the tbl id check in ipmr_new_table net: ipmr: always define mroute_reg_vif_num net: ipmr: remove some pimsm ifdefs and simplify net: ipmr: fix code and comment style net: ipmr: make ip_mroute_getsockopt more understandable net: ipmr: drop an instance of CONFIG_IP_MROUTE_MULTIPLE_TABLES net: ipmr: drop ip_mr_init() mrt_cachep null check as we'll panic if it fails net: ipmr: rearrange and cleanup setsockopt net: ipmr: factor out common vif init code include/uapi/linux/mroute.h | 59 ++--- net/ipv4/ipmr.c | 597 2 files changed, 284 insertions(+), 372 deletions(-) -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next v2 5/9] net: ipmr: make ip_mroute_getsockopt more understandable
From: Nikolay Aleksandrov Use a switch to determine if optname is correct and set val accordingly. This produces a much more straight-forward and readable code. Signed-off-by: Nikolay Aleksandrov --- net/ipv4/ipmr.c | 28 ++-- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index 286ede3716ee..694fecf7838e 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -1443,29 +1443,29 @@ int ip_mroute_getsockopt(struct sock *sk, int optname, char __user *optval, int if (!mrt) return -ENOENT; - if (optname != MRT_VERSION && - optname != MRT_PIM && - optname != MRT_ASSERT) + switch (optname) { + case MRT_VERSION: + val = 0x0305; + break; + case MRT_PIM: + if (!pimsm_enabled()) + return -ENOPROTOOPT; + val = mrt->mroute_do_pim; + break; + case MRT_ASSERT: + val = mrt->mroute_do_assert; + break; + default: return -ENOPROTOOPT; + } if (get_user(olr, optlen)) return -EFAULT; - olr = min_t(unsigned int, olr, sizeof(int)); if (olr < 0) return -EINVAL; - if (put_user(olr, optlen)) return -EFAULT; - if (optname == MRT_VERSION) { - val = 0x0305; - } else if (optname == MRT_PIM) { - if (!pimsm_enabled()) - return -ENOPROTOOPT; - val = mrt->mroute_do_pim; - } else { - val = mrt->mroute_do_assert; - } if (copy_to_user(optval, &val, olr)) return -EFAULT; return 0; -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next v2 1/9] net: ipmr: move the tbl id check in ipmr_new_table
From: Nikolay Aleksandrov Move the table id check in ipmr_new_table and make it return error pointer. We need this change for the upcoming netlink table manipulation support in order to avoid code duplication and a race condition. Signed-off-by: Nikolay Aleksandrov --- net/ipv4/ipmr.c | 28 +--- 1 file changed, 17 insertions(+), 11 deletions(-) diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index 92dd4b74d513..5271e2eee110 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -252,8 +252,8 @@ static int __net_init ipmr_rules_init(struct net *net) INIT_LIST_HEAD(&net->ipv4.mr_tables); mrt = ipmr_new_table(net, RT_TABLE_DEFAULT); - if (!mrt) { - err = -ENOMEM; + if (IS_ERR(mrt)) { + err = PTR_ERR(mrt); goto err1; } @@ -301,8 +301,13 @@ static int ipmr_fib_lookup(struct net *net, struct flowi4 *flp4, static int __net_init ipmr_rules_init(struct net *net) { - net->ipv4.mrt = ipmr_new_table(net, RT_TABLE_DEFAULT); - return net->ipv4.mrt ? 0 : -ENOMEM; + struct mr_table *mrt; + + mrt = ipmr_new_table(net, RT_TABLE_DEFAULT); + if (IS_ERR(mrt)) + return PTR_ERR(mrt); + net->ipv4.mrt = mrt; + return 0; } static void __net_exit ipmr_rules_exit(struct net *net) @@ -319,13 +324,17 @@ static struct mr_table *ipmr_new_table(struct net *net, u32 id) struct mr_table *mrt; unsigned int i; + /* "pimreg%u" should not exceed 16 bytes (IFNAMSIZ) */ + if (id != RT_TABLE_DEFAULT && id >= 10) + return ERR_PTR(-EINVAL); + mrt = ipmr_get_table(net, id); if (mrt) return mrt; mrt = kzalloc(sizeof(*mrt), GFP_KERNEL); if (!mrt) - return NULL; + return ERR_PTR(-ENOMEM); write_pnet(&mrt->net, net); mrt->id = id; @@ -1407,17 +1416,14 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi if (get_user(v, (u32 __user *)optval)) return -EFAULT; - /* "pimreg%u" should not exceed 16 bytes (IFNAMSIZ) */ - if (v != RT_TABLE_DEFAULT && v >= 10) - return -EINVAL; - rtnl_lock(); ret = 0; if (sk == rtnl_dereference(mrt->mroute_sk)) { ret = -EBUSY; } else { - if (!ipmr_new_table(net, v)) - ret = -ENOMEM; + mrt = ipmr_new_table(net, v); + if (IS_ERR(mrt)) + ret = PTR_ERR(mrt); else raw_sk(sk)->ipmr_table = v; } -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next v2 2/9] net: ipmr: always define mroute_reg_vif_num
From: Nikolay Aleksandrov Before mroute_reg_vif_num was defined only if any of the CONFIG_PIMSM_ options were set, but that's not really necessary as the size of the struct is the same in both cases (checked with pahole, both cases size is 3256 bytes) and we can remove some unnecessary ifdefs to simplify the code. Signed-off-by: Nikolay Aleksandrov --- net/ipv4/ipmr.c | 8 1 file changed, 8 deletions(-) diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index 5271e2eee110..dd2462f70d34 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -84,9 +84,7 @@ struct mr_table { atomic_tcache_resolve_queue_len; boolmroute_do_assert; boolmroute_do_pim; -#if defined(CONFIG_IP_PIMSM_V1) || defined(CONFIG_IP_PIMSM_V2) int mroute_reg_vif_num; -#endif }; struct ipmr_rule { @@ -347,9 +345,7 @@ static struct mr_table *ipmr_new_table(struct net *net, u32 id) setup_timer(&mrt->ipmr_expire_timer, ipmr_expire_process, (unsigned long)mrt); -#ifdef CONFIG_IP_PIMSM mrt->mroute_reg_vif_num = -1; -#endif #ifdef CONFIG_IP_MROUTE_MULTIPLE_TABLES list_add_tail_rcu(&mrt->list, &net->ipv4.mr_tables); #endif @@ -584,10 +580,8 @@ static int vif_delete(struct mr_table *mrt, int vifi, int notify, return -EADDRNOTAVAIL; } -#ifdef CONFIG_IP_PIMSM if (vifi == mrt->mroute_reg_vif_num) mrt->mroute_reg_vif_num = -1; -#endif if (vifi + 1 == mrt->maxvif) { int tmp; @@ -824,10 +818,8 @@ static int vif_add(struct net *net, struct mr_table *mrt, /* And finish update writing critical data */ write_lock_bh(&mrt_lock); v->dev = dev; -#ifdef CONFIG_IP_PIMSM if (v->flags & VIFF_REGISTER) mrt->mroute_reg_vif_num = vifi; -#endif if (vifi+1 > mrt->maxvif) mrt->maxvif = vifi+1; write_unlock_bh(&mrt_lock); -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next v2 6/9] net: ipmr: drop an instance of CONFIG_IP_MROUTE_MULTIPLE_TABLES
From: Nikolay Aleksandrov Trivial replace of ifdef with IS_BUILTIN(). Signed-off-by: Nikolay Aleksandrov --- net/ipv4/ipmr.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index 694fecf7838e..a006d96d6cd9 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -1396,11 +1396,12 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi rtnl_unlock(); return ret; } -#ifdef CONFIG_IP_MROUTE_MULTIPLE_TABLES case MRT_TABLE: { u32 v; + if (!IS_BUILTIN(CONFIG_IP_MROUTE_MULTIPLE_TABLES)) + return -ENOPROTOOPT; if (optlen != sizeof(u32)) return -EINVAL; if (get_user(v, (u32 __user *)optval)) @@ -1420,7 +1421,6 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi rtnl_unlock(); return ret; } -#endif /* Spurious command, or MRT_VERSION which you cannot set. */ default: return -ENOPROTOOPT; -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next v2 8/9] net: ipmr: rearrange and cleanup setsockopt
From: Nikolay Aleksandrov Take rtnl in the beginning unconditionally as most options already need it (one exception - MRT_DONE, see the comment inside), make the lock/unlock places central and move out the switch() local variables. Signed-off-by: Nikolay Aleksandrov --- net/ipv4/ipmr.c | 191 +++- 1 file changed, 107 insertions(+), 84 deletions(-) diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index 50aec313119d..e384f39202cb 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -1276,38 +1276,45 @@ static void mrtsock_destruct(struct sock *sk) * MOSPF/PIM router set up we can clean this up. */ -int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsigned int optlen) +int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, +unsigned int optlen) { - int ret, parent = 0; - struct vifctl vif; - struct mfcctl mfc; struct net *net = sock_net(sk); + int val, ret = 0, parent = 0; struct mr_table *mrt; + struct vifctl vif; + struct mfcctl mfc; + u32 uval; + /* There's one exception to the lock - MRT_DONE which needs to unlock */ + rtnl_lock(); if (sk->sk_type != SOCK_RAW || - inet_sk(sk)->inet_num != IPPROTO_IGMP) - return -EOPNOTSUPP; + inet_sk(sk)->inet_num != IPPROTO_IGMP) { + ret = -EOPNOTSUPP; + goto out_unlock; + } mrt = ipmr_get_table(net, raw_sk(sk)->ipmr_table ? : RT_TABLE_DEFAULT); - if (!mrt) - return -ENOENT; - + if (!mrt) { + ret = -ENOENT; + goto out_unlock; + } if (optname != MRT_INIT) { if (sk != rcu_access_pointer(mrt->mroute_sk) && - !ns_capable(net->user_ns, CAP_NET_ADMIN)) - return -EACCES; + !ns_capable(net->user_ns, CAP_NET_ADMIN)) { + ret = -EACCES; + goto out_unlock; + } } switch (optname) { case MRT_INIT: if (optlen != sizeof(int)) - return -EINVAL; - - rtnl_lock(); - if (rtnl_dereference(mrt->mroute_sk)) { - rtnl_unlock(); - return -EADDRINUSE; - } + ret = -EINVAL; + if (rtnl_dereference(mrt->mroute_sk)) + ret = -EADDRINUSE; + if (ret) + break; ret = ip_ra_control(sk, 1, mrtsock_destruct); if (ret == 0) { @@ -1317,30 +1324,41 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi NETCONFA_IFINDEX_ALL, net->ipv4.devconf_all); } - rtnl_unlock(); - return ret; + break; case MRT_DONE: - if (sk != rcu_access_pointer(mrt->mroute_sk)) - return -EACCES; - return ip_ra_control(sk, 0, NULL); + if (sk != rcu_access_pointer(mrt->mroute_sk)) { + ret = -EACCES; + } else { + /* We need to unlock here because mrtsock_destruct takes +* care of rtnl itself and we can't change that due to +* the IP_ROUTER_ALERT setsockopt which runs without it. +*/ + rtnl_unlock(); + ret = ip_ra_control(sk, 0, NULL); + goto out; + } + break; case MRT_ADD_VIF: case MRT_DEL_VIF: - if (optlen != sizeof(vif)) - return -EINVAL; - if (copy_from_user(&vif, optval, sizeof(vif))) - return -EFAULT; - if (vif.vifc_vifi >= MAXVIFS) - return -ENFILE; - rtnl_lock(); + if (optlen != sizeof(vif)) { + ret = -EINVAL; + break; + } + if (copy_from_user(&vif, optval, sizeof(vif))) { + ret = -EFAULT; + break; + } + if (vif.vifc_vifi >= MAXVIFS) { + ret = -ENFILE; + break; + } if (optname == MRT_ADD_VIF) { ret = vif_add(net, mrt, &vif, sk == rtnl_dereference(mrt->mroute_sk)); } else { ret = vif_delete(mrt, vif.vifc_vifi, 0, NULL); } - rtnl_unlock(); - return ret; - + break; /* Manipula
[PATCH net-next v2 7/9] net: ipmr: drop ip_mr_init() mrt_cachep null check as we'll panic if it fails
From: Nikolay Aleksandrov It's not necessary to check for null as SLAB_PANIC is used and we'll panic if the alloc fails, so just drop it. Signed-off-by: Nikolay Aleksandrov --- v2: new patch, keep SLAB_PANIC and drop the unnecessary null check net/ipv4/ipmr.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index a006d96d6cd9..50aec313119d 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -2677,8 +2677,6 @@ int __init ip_mr_init(void) sizeof(struct mfc_cache), 0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL); - if (!mrt_cachep) - return -ENOMEM; err = register_pernet_subsys(&ipmr_net_ops); if (err) -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next v2 9/9] net: ipmr: factor out common vif init code
From: Nikolay Aleksandrov Factor out common vif init code used in both tunnel and pimreg initialization and create ipmr_init_vif_indev() function. Signed-off-by: Nikolay Aleksandrov --- net/ipv4/ipmr.c | 40 +++- 1 file changed, 19 insertions(+), 21 deletions(-) diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index e384f39202cb..a2d248d9c35c 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -391,6 +391,23 @@ static void ipmr_del_tunnel(struct net_device *dev, struct vifctl *v) } } +/* Initialize ipmr pimreg/tunnel in_device */ +static bool ipmr_init_vif_indev(const struct net_device *dev) +{ + struct in_device *in_dev; + + ASSERT_RTNL(); + + in_dev = __in_dev_get_rtnl(dev); + if (!in_dev) + return false; + ipv4_devconf_setall(in_dev); + neigh_parms_data_state_setall(in_dev->arp_parms); + IPV4_DEVCONF(in_dev->cnf, RP_FILTER) = 0; + + return true; +} + static struct net_device *ipmr_new_tunnel(struct net *net, struct vifctl *v) { struct net_device *dev; @@ -402,7 +419,6 @@ static struct net_device *ipmr_new_tunnel(struct net *net, struct vifctl *v) int err; struct ifreq ifr; struct ip_tunnel_parm p; - struct in_device *in_dev; memset(&p, 0, sizeof(p)); p.iph.daddr = v->vifc_rmt_addr.s_addr; @@ -427,15 +443,8 @@ static struct net_device *ipmr_new_tunnel(struct net *net, struct vifctl *v) if (err == 0 && (dev = __dev_get_by_name(net, p.name)) != NULL) { dev->flags |= IFF_MULTICAST; - - in_dev = __in_dev_get_rtnl(dev); - if (!in_dev) + if (!ipmr_init_vif_indev(dev)) goto failure; - - ipv4_devconf_setall(in_dev); - neigh_parms_data_state_setall(in_dev->arp_parms); - IPV4_DEVCONF(in_dev->cnf, RP_FILTER) = 0; - if (dev_open(dev)) goto failure; dev_hold(dev); @@ -502,7 +511,6 @@ static void reg_vif_setup(struct net_device *dev) static struct net_device *ipmr_reg_vif(struct net *net, struct mr_table *mrt) { struct net_device *dev; - struct in_device *in_dev; char name[IFNAMSIZ]; if (mrt->id == RT_TABLE_DEFAULT) @@ -522,18 +530,8 @@ static struct net_device *ipmr_reg_vif(struct net *net, struct mr_table *mrt) return NULL; } - rcu_read_lock(); - in_dev = __in_dev_get_rcu(dev); - if (!in_dev) { - rcu_read_unlock(); + if (!ipmr_init_vif_indev(dev)) goto failure; - } - - ipv4_devconf_setall(in_dev); - neigh_parms_data_state_setall(in_dev->arp_parms); - IPV4_DEVCONF(in_dev->cnf, RP_FILTER) = 0; - rcu_read_unlock(); - if (dev_open(dev)) goto failure; -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next v2 4/9] net: ipmr: fix code and comment style
From: Nikolay Aleksandrov Trivial code and comment style fixes, also removed some extra newlines, spaces and tabs. Signed-off-by: Nikolay Aleksandrov --- include/uapi/linux/mroute.h | 59 ++ net/ipv4/ipmr.c | 142 2 files changed, 54 insertions(+), 147 deletions(-) diff --git a/include/uapi/linux/mroute.h b/include/uapi/linux/mroute.h index a382d2c04a42..cf943016930f 100644 --- a/include/uapi/linux/mroute.h +++ b/include/uapi/linux/mroute.h @@ -4,15 +4,13 @@ #include #include -/* - * Based on the MROUTING 3.5 defines primarily to keep - * source compatibility with BSD. +/* Based on the MROUTING 3.5 defines primarily to keep + * source compatibility with BSD. * - * See the mrouted code for the original history. - * - * Protocol Independent Multicast (PIM) data structures included - * Carlos Picoto (c...@di.fc.ul.pt) + * See the mrouted code for the original history. * + * Protocol Independent Multicast (PIM) data structures included + * Carlos Picoto (c...@di.fc.ul.pt) */ #define MRT_BASE 200 @@ -34,15 +32,13 @@ #define SIOCGETSGCNT (SIOCPROTOPRIVATE+1) #define SIOCGETRPF (SIOCPROTOPRIVATE+2) -#define MAXVIFS32 +#define MAXVIFS32 typedef unsigned long vifbitmap_t; /* User mode code depends on this lot */ typedef unsigned short vifi_t; #define ALL_VIFS ((vifi_t)(-1)) -/* - * Same idea as select - */ - +/* Same idea as select */ + #define VIFM_SET(n,m) ((m)|=(1<<(n))) #define VIFM_CLR(n,m) ((m)&=~(1<<(n))) #define VIFM_ISSET(n,m)((m)&(1<<(n))) @@ -50,11 +46,9 @@ typedef unsigned short vifi_t; #define VIFM_COPY(mfrom,mto) ((mto)=(mfrom)) #define VIFM_SAME(m1,m2) ((m1)==(m2)) -/* - * Passed by mrouted for an MRT_ADD_VIF - again we use the - * mrouted 3.6 structures for compatibility +/* Passed by mrouted for an MRT_ADD_VIF - again we use the + * mrouted 3.6 structures for compatibility */ - struct vifctl { vifi_t vifc_vifi; /* Index of VIF */ unsigned char vifc_flags; /* VIFF_ flags */ @@ -73,10 +67,7 @@ struct vifctl { #define VIFF_USE_IFINDEX 0x8 /* use vifc_lcl_ifindex instead of vifc_lcl_addr to find an interface */ -/* - * Cache manipulation structures for mrouted and PIMd - */ - +/* Cache manipulation structures for mrouted and PIMd */ struct mfcctl { struct in_addr mfcc_origin; /* Origin of mcast */ struct in_addr mfcc_mcastgrp; /* Group in question*/ @@ -88,10 +79,7 @@ struct mfcctl { int mfcc_expire; }; -/* - * Group count retrieval for mrouted - */ - +/* Group count retrieval for mrouted */ struct sioc_sg_req { struct in_addr src; struct in_addr grp; @@ -100,10 +88,7 @@ struct sioc_sg_req { unsigned long wrong_if; }; -/* - * To get vif packet counts - */ - +/* To get vif packet counts */ struct sioc_vif_req { vifi_t vifi; /* Which iface */ unsigned long icount; /* In packets */ @@ -112,11 +97,9 @@ struct sioc_vif_req { unsigned long obytes; /* Out bytes */ }; -/* - * This is the format the mroute daemon expects to see IGMP control - * data. Magically happens to be like an IP packet as per the original +/* This is the format the mroute daemon expects to see IGMP control + * data. Magically happens to be like an IP packet as per the original */ - struct igmpmsg { __u32 unused1,unused2; unsigned char im_msgtype; /* What is this */ @@ -126,21 +109,13 @@ struct igmpmsg { struct in_addr im_src,im_dst; }; -/* - * That's all usermode folks - */ - - +/* That's all usermode folks */ #define MFC_ASSERT_THRESH (3*HZ) /* Maximal freq. of asserts */ -/* - * Pseudo messages used by mrouted - */ - +/* Pseudo messages used by mrouted */ #define IGMPMSG_NOCACHE1 /* Kern cache fill request to mrouted */ #define IGMPMSG_WRONGVIF 2 /* For PIM assert processing (unused) */ #define IGMPMSG_WHOLEPKT 3 /* For PIM Register processing */ - #endif /* _UAPI__LINUX_MROUTE_H */ diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index e153ab7b17a1..286ede3716ee 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -102,9 +102,7 @@ static inline bool pimsm_enabled(void) static DEFINE_RWLOCK(mrt_lock); -/* - * Multicast router control variables - */ +/* Multicast router control variables */ #define VIF_EXISTS(_mrt, _idx) ((_mrt)->vif_table[_idx].dev != NULL) @@ -393,8 +391,7 @@ static void ipmr_del_tunnel(struct net_device *dev, struct vifctl *v) } } -static -struct net_device *ipmr_new_tunnel(struct net *net, struct vifctl *v) +static struct net_device *ipmr_new_tunnel(struct net *net, stru
Re: [PATCH net-next 7/9] net: ipmr: remove SLAB_PANIC
On 11/21/2015 03:23 PM, Eric Dumazet wrote: > On Sat, 2015-11-21 at 14:01 +0100, Nikolay Aleksandrov wrote: >> From: Nikolay Aleksandrov >> >> It's not necessary to panic upon allocation failure, returning an error >> at that point is okay because user-space won't be able to use any of the >> ops since they didn't get registered and the default table is null. >> >> Signed-off-by: Nikolay Aleksandrov >> --- >> net/ipv4/ipmr.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c >> index a006d96d6cd9..2c7fa584a274 100644 >> --- a/net/ipv4/ipmr.c >> +++ b/net/ipv4/ipmr.c >> @@ -2675,7 +2675,7 @@ int __init ip_mr_init(void) >> >> mrt_cachep = kmem_cache_create("ip_mrt_cache", >> sizeof(struct mfc_cache), >> - 0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, >> + 0, SLAB_HWCACHE_ALIGN, >> NULL); >> if (!mrt_cachep) >> return -ENOMEM; > > This runs at boot time. > > I very much prefer a panic instead of having to deal with a host with a > probable bug in mm layer at this point. > Right, I was unsure about this one, I tried to rely on the ipmr_get_table() returning NULL but I guess it's better to be safe than sorry. I'll respin with the panic in place and removed null check. > For IPv6, it is a different matter as it might be a module, thus > ip6_mr_init() does not use SLAB_PANIC Yep, I saw. It wasn't the reason I removed this one. Thanks for the review! -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 7/9] net: ipmr: remove SLAB_PANIC
On Sat, 2015-11-21 at 14:01 +0100, Nikolay Aleksandrov wrote: > From: Nikolay Aleksandrov > > It's not necessary to panic upon allocation failure, returning an error > at that point is okay because user-space won't be able to use any of the > ops since they didn't get registered and the default table is null. > > Signed-off-by: Nikolay Aleksandrov > --- > net/ipv4/ipmr.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c > index a006d96d6cd9..2c7fa584a274 100644 > --- a/net/ipv4/ipmr.c > +++ b/net/ipv4/ipmr.c > @@ -2675,7 +2675,7 @@ int __init ip_mr_init(void) > > mrt_cachep = kmem_cache_create("ip_mrt_cache", > sizeof(struct mfc_cache), > -0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, > +0, SLAB_HWCACHE_ALIGN, > NULL); > if (!mrt_cachep) > return -ENOMEM; This runs at boot time. I very much prefer a panic instead of having to deal with a host with a probable bug in mm layer at this point. For IPv6, it is a different matter as it might be a module, thus ip6_mr_init() does not use SLAB_PANIC -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 06/27] brcm80211: move under broadcom vendor directory
On 11/20/2015 10:53 PM, Arend van Spriel wrote: > On 11/19/2015 08:48 AM, Kalle Valo wrote: >> Hauke Mehrtens writes: >> >>> On 11/18/2015 03:45 PM, Kalle Valo wrote: Part of reorganising wireless drivers directory and Kconfig. Note that I had to edit Makefiles from subdirectories to use the new location. Signed-off-by: Kalle Valo --- >>> >>> I would prefer to remove the brcm80211 directory in this process and >>> create: >>> drivers/net/wireless/broadcom/brcmfmac >>> drivers/net/wireless/broadcom/brcmsmac >>> drivers/net/wireless/broadcom/brcmutil >>> drivers/net/wireless/broadcom/include >>> >>> This way we have one directory less. >> >> I think this could be done separately. This patchset is big enough >> already, I would not like to make it anymore complicated. >> >> And I actually like the brcm80211 directory, I would not mind keeping it >> still. > > I prefer to keep it as brcmsmac and brcmfmac rely on brcmutil module so > I want to keep them together under brcm80211. > > So does this patch go in before or after the patches I submitted before > the merge window. I hope after :-p Ok, then leave it like Kalle proposed. backports should work with both versions. Hauke -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
yet another uninterruptable hang in sendfile
Hello, On commit 8005c49d9aea74d382f474ce11afbbc7d7130bec (Nov 15). The program is: // autogenerated by syzkaller (http://github.com/google/syzkaller) #include #include #include int main() { long r0 = syscall(SYS_socket, 0x10ul, 0x2ul, 0x0ul, 0, 0, 0); long r1 = syscall(SYS_mmap, 0x2000ul, 0x1000ul, 0x3ul, 0x32ul, 0xul, 0x0ul); long r2 = syscall(SYS_mmap, 0x20001000ul, 0x1000ul, 0x3ul, 0x32ul, 0xul, 0x0ul); *(uint64_t*)0x2000153f = 0x20001f99; *(uint64_t*)0x20001547 = 0x67; *(uint64_t*)0x2000154f = 0x20001fa5; *(uint64_t*)0x20001557 = 0x5b; *(uint64_t*)0x2000155f = 0x20001000; *(uint64_t*)0x20001567 = 0x6; long r9 = syscall(SYS_readv, r0, 0x2000153ful, 0x3ul, 0, 0, 0); long r10 = syscall(SYS_mmap, 0x20002000ul, 0x1000ul, 0x3ul, 0x32ul, 0xul, 0x0ul); memcpy((void*)0x20002000, "\x65\x74\x68\x31\x00", 5); long r12 = syscall(SYS_memfd_create, 0x20002000ul, 0x1ul, 0, 0, 0, 0); long r13 = syscall(SYS_fallocate, r12, 0x0ul, 0x5616e07ul, 0x1ul, 0, 0); memcpy((void*)0x2da2, "\x02\xbe\x98\x59\x88\xb1\x7b\xfd\xe6\x27\x95\xdc\x18\x4e\x04\x87\x28\x1a\xd0\x30\x52\xcd\xa5\xee\x09\x7f\xfa\x7a\x9b\x72\x17\xfa\x2a\xa1\xe1\x60\x09\xbb\xaf\xdd\x0b\x5c\xa8\x18\x81\x4b\x6d\x42\x11\x20\x4a\xd7\x9e\x86\x8b\x63\xd2\x36\xbf\x5f\xb0\x36\x13\x82\x79\xc8\x31\x3b\x3b\x1e", 70); memcpy((void*)0x28b7, "\x0a\x00\x33\xe8\x3d\xe7\x4a\xcc\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xbf\xce\xa1\x60", 28); long r16 = syscall(SYS_sendto, r0, 0x2da2ul, 0x46ul, 0x8000ul, 0x28b7ul, 0x1cul); long r17 = syscall(SYS_sendfile, r0, r12, 0x2000ul, 0x4785d2c1ul, 0, 0); return 0; } It hangs in unkillable state. It is probably similar issue to the other reported issues related to sendfile: https://groups.google.com/forum/#!topic/syzkaller/zfuHHRXL7Zg https://groups.google.com/forum/#!topic/syzkaller/sjA9DrBQviw However this one also blankets dmesg with zillions of: [ 1682.801412] SELinux: unrecognized netlink message: protocol=0 nlmsg_type=0 sclass=netlink_route_socket [ 1682.803565] SELinux: unrecognized netlink message: protocol=0 nlmsg_type=0 sclass=netlink_route_socket [ 1682.804991] SELinux: unrecognized netlink message: protocol=0 nlmsg_type=0 sclass=netlink_route_socket The program should be killable. Thank you -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next 1/9] net: ipmr: move the tbl id check in ipmr_new_table
From: Nikolay Aleksandrov Move the table id check in ipmr_new_table and make it return error pointer. We need this change for the upcoming netlink table manipulation support in order to avoid code duplication and a race condition. Signed-off-by: Nikolay Aleksandrov --- net/ipv4/ipmr.c | 28 +--- 1 file changed, 17 insertions(+), 11 deletions(-) diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index 92dd4b74d513..5271e2eee110 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -252,8 +252,8 @@ static int __net_init ipmr_rules_init(struct net *net) INIT_LIST_HEAD(&net->ipv4.mr_tables); mrt = ipmr_new_table(net, RT_TABLE_DEFAULT); - if (!mrt) { - err = -ENOMEM; + if (IS_ERR(mrt)) { + err = PTR_ERR(mrt); goto err1; } @@ -301,8 +301,13 @@ static int ipmr_fib_lookup(struct net *net, struct flowi4 *flp4, static int __net_init ipmr_rules_init(struct net *net) { - net->ipv4.mrt = ipmr_new_table(net, RT_TABLE_DEFAULT); - return net->ipv4.mrt ? 0 : -ENOMEM; + struct mr_table *mrt; + + mrt = ipmr_new_table(net, RT_TABLE_DEFAULT); + if (IS_ERR(mrt)) + return PTR_ERR(mrt); + net->ipv4.mrt = mrt; + return 0; } static void __net_exit ipmr_rules_exit(struct net *net) @@ -319,13 +324,17 @@ static struct mr_table *ipmr_new_table(struct net *net, u32 id) struct mr_table *mrt; unsigned int i; + /* "pimreg%u" should not exceed 16 bytes (IFNAMSIZ) */ + if (id != RT_TABLE_DEFAULT && id >= 10) + return ERR_PTR(-EINVAL); + mrt = ipmr_get_table(net, id); if (mrt) return mrt; mrt = kzalloc(sizeof(*mrt), GFP_KERNEL); if (!mrt) - return NULL; + return ERR_PTR(-ENOMEM); write_pnet(&mrt->net, net); mrt->id = id; @@ -1407,17 +1416,14 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi if (get_user(v, (u32 __user *)optval)) return -EFAULT; - /* "pimreg%u" should not exceed 16 bytes (IFNAMSIZ) */ - if (v != RT_TABLE_DEFAULT && v >= 10) - return -EINVAL; - rtnl_lock(); ret = 0; if (sk == rtnl_dereference(mrt->mroute_sk)) { ret = -EBUSY; } else { - if (!ipmr_new_table(net, v)) - ret = -ENOMEM; + mrt = ipmr_new_table(net, v); + if (IS_ERR(mrt)) + ret = PTR_ERR(mrt); else raw_sk(sk)->ipmr_table = v; } -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next 6/9] net: ipmr: drop an instance of CONFIG_IP_MROUTE_MULTIPLE_TABLES
From: Nikolay Aleksandrov Trivial replace of ifdef with IS_BUILTIN(). Signed-off-by: Nikolay Aleksandrov --- net/ipv4/ipmr.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index 694fecf7838e..a006d96d6cd9 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -1396,11 +1396,12 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi rtnl_unlock(); return ret; } -#ifdef CONFIG_IP_MROUTE_MULTIPLE_TABLES case MRT_TABLE: { u32 v; + if (!IS_BUILTIN(CONFIG_IP_MROUTE_MULTIPLE_TABLES)) + return -ENOPROTOOPT; if (optlen != sizeof(u32)) return -EINVAL; if (get_user(v, (u32 __user *)optval)) @@ -1420,7 +1421,6 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi rtnl_unlock(); return ret; } -#endif /* Spurious command, or MRT_VERSION which you cannot set. */ default: return -ENOPROTOOPT; -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next 3/9] net: ipmr: remove some pimsm ifdefs and simplify
From: Nikolay Aleksandrov Add the helper pimsm_enabled() which replaces the old CONFIG_IP_PIMSM define and is used to check if any version of PIM-SM has been enabled. Use a single if defined(CONFIG_IP_PIMSM_V1) || defined(CONFIG_IP_PIMSM_V2) for the pim-sm shared code. This is okay w.r.t IGMPMSG_WHOLEPKT because only a VIFF_REGISTER device can send such packet, and it can't be created if pimsm_enabled() is false. Signed-off-by: Nikolay Aleksandrov --- net/ipv4/ipmr.c | 180 ++-- 1 file changed, 84 insertions(+), 96 deletions(-) diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index dd2462f70d34..e153ab7b17a1 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -67,10 +67,6 @@ #include #include -#if defined(CONFIG_IP_PIMSM_V1) || defined(CONFIG_IP_PIMSM_V2) -#define CONFIG_IP_PIMSM1 -#endif - struct mr_table { struct list_headlist; possible_net_t net; @@ -95,6 +91,11 @@ struct ipmr_result { struct mr_table *mrt; }; +static inline bool pimsm_enabled(void) +{ + return IS_BUILTIN(CONFIG_IP_PIMSM_V1) || IS_BUILTIN(CONFIG_IP_PIMSM_V2); +} + /* Big lock, protecting vif table, mrt cache and mroute socket state. * Note that the changes are semaphored via rtnl_lock. */ @@ -454,8 +455,7 @@ failure: return NULL; } -#ifdef CONFIG_IP_PIMSM - +#if defined(CONFIG_IP_PIMSM_V1) || defined(CONFIG_IP_PIMSM_V2) static netdev_tx_t reg_vif_xmit(struct sk_buff *skb, struct net_device *dev) { struct net *net = dev_net(dev); @@ -552,6 +552,51 @@ failure: unregister_netdevice(dev); return NULL; } + +/* called with rcu_read_lock() */ +static int __pim_rcv(struct mr_table *mrt, struct sk_buff *skb, +unsigned int pimlen) +{ + struct net_device *reg_dev = NULL; + struct iphdr *encap; + + encap = (struct iphdr *)(skb_transport_header(skb) + pimlen); + /* +* Check that: +* a. packet is really sent to a multicast group +* b. packet is not a NULL-REGISTER +* c. packet is not truncated +*/ + if (!ipv4_is_multicast(encap->daddr) || + encap->tot_len == 0 || + ntohs(encap->tot_len) + pimlen > skb->len) + return 1; + + read_lock(&mrt_lock); + if (mrt->mroute_reg_vif_num >= 0) + reg_dev = mrt->vif_table[mrt->mroute_reg_vif_num].dev; + read_unlock(&mrt_lock); + + if (!reg_dev) + return 1; + + skb->mac_header = skb->network_header; + skb_pull(skb, (u8 *)encap - skb->data); + skb_reset_network_header(skb); + skb->protocol = htons(ETH_P_IP); + skb->ip_summed = CHECKSUM_NONE; + + skb_tunnel_rx(skb, reg_dev, dev_net(reg_dev)); + + netif_rx(skb); + + return NET_RX_SUCCESS; +} +#else +static struct net_device *ipmr_reg_vif(struct net *net, struct mr_table *mrt) +{ + return NULL; +} #endif /** @@ -734,10 +779,10 @@ static int vif_add(struct net *net, struct mr_table *mrt, return -EADDRINUSE; switch (vifc->vifc_flags) { -#ifdef CONFIG_IP_PIMSM case VIFF_REGISTER: - /* -* Special Purpose VIF in PIM + if (!pimsm_enabled()) + return -EINVAL; + /* Special Purpose VIF in PIM * All the packets will be sent to the daemon */ if (mrt->mroute_reg_vif_num >= 0) @@ -752,7 +797,6 @@ static int vif_add(struct net *net, struct mr_table *mrt, return err; } break; -#endif case VIFF_TUNNEL: dev = ipmr_new_tunnel(net, vifc); if (!dev) @@ -942,34 +986,29 @@ static void ipmr_cache_resolve(struct net *net, struct mr_table *mrt, } } -/* - * Bounce a cache query up to mrouted. We could use netlink for this but mrouted - * expects the following bizarre scheme. +/* Bounce a cache query up to mrouted. We could use netlink for this but mrouted + * expects the following bizarre scheme. * - * Called under mrt_lock. + * Called under mrt_lock. */ - static int ipmr_cache_report(struct mr_table *mrt, struct sk_buff *pkt, vifi_t vifi, int assert) { - struct sk_buff *skb; const int ihl = ip_hdrlen(pkt); + struct sock *mroute_sk; struct igmphdr *igmp; struct igmpmsg *msg; - struct sock *mroute_sk; + struct sk_buff *skb; int ret; -#ifdef CONFIG_IP_PIMSM if (assert == IGMPMSG_WHOLEPKT) skb = skb_realloc_headroom(pkt, sizeof(struct iphdr)); else -#endif skb = alloc_skb(128, GFP_ATOMIC); if (!skb) return -ENOBUFS; -#ifdef CONFIG_IP_PIMSM if (assert == IGMPMSG_WHOLEPKT) { /* Ugly, but we have no choice with this interface.
[PATCH net-next 4/9] net: ipmr: fix code and comment style
From: Nikolay Aleksandrov Trivial code and comment style fixes, also removed some extra newlines, spaces and tabs. Signed-off-by: Nikolay Aleksandrov --- include/uapi/linux/mroute.h | 59 ++ net/ipv4/ipmr.c | 142 2 files changed, 54 insertions(+), 147 deletions(-) diff --git a/include/uapi/linux/mroute.h b/include/uapi/linux/mroute.h index a382d2c04a42..cf943016930f 100644 --- a/include/uapi/linux/mroute.h +++ b/include/uapi/linux/mroute.h @@ -4,15 +4,13 @@ #include #include -/* - * Based on the MROUTING 3.5 defines primarily to keep - * source compatibility with BSD. +/* Based on the MROUTING 3.5 defines primarily to keep + * source compatibility with BSD. * - * See the mrouted code for the original history. - * - * Protocol Independent Multicast (PIM) data structures included - * Carlos Picoto (c...@di.fc.ul.pt) + * See the mrouted code for the original history. * + * Protocol Independent Multicast (PIM) data structures included + * Carlos Picoto (c...@di.fc.ul.pt) */ #define MRT_BASE 200 @@ -34,15 +32,13 @@ #define SIOCGETSGCNT (SIOCPROTOPRIVATE+1) #define SIOCGETRPF (SIOCPROTOPRIVATE+2) -#define MAXVIFS32 +#define MAXVIFS32 typedef unsigned long vifbitmap_t; /* User mode code depends on this lot */ typedef unsigned short vifi_t; #define ALL_VIFS ((vifi_t)(-1)) -/* - * Same idea as select - */ - +/* Same idea as select */ + #define VIFM_SET(n,m) ((m)|=(1<<(n))) #define VIFM_CLR(n,m) ((m)&=~(1<<(n))) #define VIFM_ISSET(n,m)((m)&(1<<(n))) @@ -50,11 +46,9 @@ typedef unsigned short vifi_t; #define VIFM_COPY(mfrom,mto) ((mto)=(mfrom)) #define VIFM_SAME(m1,m2) ((m1)==(m2)) -/* - * Passed by mrouted for an MRT_ADD_VIF - again we use the - * mrouted 3.6 structures for compatibility +/* Passed by mrouted for an MRT_ADD_VIF - again we use the + * mrouted 3.6 structures for compatibility */ - struct vifctl { vifi_t vifc_vifi; /* Index of VIF */ unsigned char vifc_flags; /* VIFF_ flags */ @@ -73,10 +67,7 @@ struct vifctl { #define VIFF_USE_IFINDEX 0x8 /* use vifc_lcl_ifindex instead of vifc_lcl_addr to find an interface */ -/* - * Cache manipulation structures for mrouted and PIMd - */ - +/* Cache manipulation structures for mrouted and PIMd */ struct mfcctl { struct in_addr mfcc_origin; /* Origin of mcast */ struct in_addr mfcc_mcastgrp; /* Group in question*/ @@ -88,10 +79,7 @@ struct mfcctl { int mfcc_expire; }; -/* - * Group count retrieval for mrouted - */ - +/* Group count retrieval for mrouted */ struct sioc_sg_req { struct in_addr src; struct in_addr grp; @@ -100,10 +88,7 @@ struct sioc_sg_req { unsigned long wrong_if; }; -/* - * To get vif packet counts - */ - +/* To get vif packet counts */ struct sioc_vif_req { vifi_t vifi; /* Which iface */ unsigned long icount; /* In packets */ @@ -112,11 +97,9 @@ struct sioc_vif_req { unsigned long obytes; /* Out bytes */ }; -/* - * This is the format the mroute daemon expects to see IGMP control - * data. Magically happens to be like an IP packet as per the original +/* This is the format the mroute daemon expects to see IGMP control + * data. Magically happens to be like an IP packet as per the original */ - struct igmpmsg { __u32 unused1,unused2; unsigned char im_msgtype; /* What is this */ @@ -126,21 +109,13 @@ struct igmpmsg { struct in_addr im_src,im_dst; }; -/* - * That's all usermode folks - */ - - +/* That's all usermode folks */ #define MFC_ASSERT_THRESH (3*HZ) /* Maximal freq. of asserts */ -/* - * Pseudo messages used by mrouted - */ - +/* Pseudo messages used by mrouted */ #define IGMPMSG_NOCACHE1 /* Kern cache fill request to mrouted */ #define IGMPMSG_WRONGVIF 2 /* For PIM assert processing (unused) */ #define IGMPMSG_WHOLEPKT 3 /* For PIM Register processing */ - #endif /* _UAPI__LINUX_MROUTE_H */ diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index e153ab7b17a1..286ede3716ee 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -102,9 +102,7 @@ static inline bool pimsm_enabled(void) static DEFINE_RWLOCK(mrt_lock); -/* - * Multicast router control variables - */ +/* Multicast router control variables */ #define VIF_EXISTS(_mrt, _idx) ((_mrt)->vif_table[_idx].dev != NULL) @@ -393,8 +391,7 @@ static void ipmr_del_tunnel(struct net_device *dev, struct vifctl *v) } } -static -struct net_device *ipmr_new_tunnel(struct net *net, struct vifctl *v) +static struct net_device *ipmr_new_tunnel(struct net *net, stru
[PATCH net-next 8/9] net: ipmr: rearrange and cleanup setsockopt
From: Nikolay Aleksandrov Take rtnl in the beginning unconditionally as most options already need it (one exception - MRT_DONE, see the comment inside), make the lock/unlock places central and move out the switch() local variables. Signed-off-by: Nikolay Aleksandrov --- net/ipv4/ipmr.c | 191 +++- 1 file changed, 107 insertions(+), 84 deletions(-) diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index 2c7fa584a274..f2ea97fbb8a8 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -1276,38 +1276,45 @@ static void mrtsock_destruct(struct sock *sk) * MOSPF/PIM router set up we can clean this up. */ -int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsigned int optlen) +int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, +unsigned int optlen) { - int ret, parent = 0; - struct vifctl vif; - struct mfcctl mfc; struct net *net = sock_net(sk); + int val, ret = 0, parent = 0; struct mr_table *mrt; + struct vifctl vif; + struct mfcctl mfc; + u32 uval; + /* There's one exception to the lock - MRT_DONE which needs to unlock */ + rtnl_lock(); if (sk->sk_type != SOCK_RAW || - inet_sk(sk)->inet_num != IPPROTO_IGMP) - return -EOPNOTSUPP; + inet_sk(sk)->inet_num != IPPROTO_IGMP) { + ret = -EOPNOTSUPP; + goto out_unlock; + } mrt = ipmr_get_table(net, raw_sk(sk)->ipmr_table ? : RT_TABLE_DEFAULT); - if (!mrt) - return -ENOENT; - + if (!mrt) { + ret = -ENOENT; + goto out_unlock; + } if (optname != MRT_INIT) { if (sk != rcu_access_pointer(mrt->mroute_sk) && - !ns_capable(net->user_ns, CAP_NET_ADMIN)) - return -EACCES; + !ns_capable(net->user_ns, CAP_NET_ADMIN)) { + ret = -EACCES; + goto out_unlock; + } } switch (optname) { case MRT_INIT: if (optlen != sizeof(int)) - return -EINVAL; - - rtnl_lock(); - if (rtnl_dereference(mrt->mroute_sk)) { - rtnl_unlock(); - return -EADDRINUSE; - } + ret = -EINVAL; + if (rtnl_dereference(mrt->mroute_sk)) + ret = -EADDRINUSE; + if (ret) + break; ret = ip_ra_control(sk, 1, mrtsock_destruct); if (ret == 0) { @@ -1317,30 +1324,41 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, unsi NETCONFA_IFINDEX_ALL, net->ipv4.devconf_all); } - rtnl_unlock(); - return ret; + break; case MRT_DONE: - if (sk != rcu_access_pointer(mrt->mroute_sk)) - return -EACCES; - return ip_ra_control(sk, 0, NULL); + if (sk != rcu_access_pointer(mrt->mroute_sk)) { + ret = -EACCES; + } else { + /* We need to unlock here because mrtsock_destruct takes +* care of rtnl itself and we can't change that due to +* the IP_ROUTER_ALERT setsockopt which runs without it. +*/ + rtnl_unlock(); + ret = ip_ra_control(sk, 0, NULL); + goto out; + } + break; case MRT_ADD_VIF: case MRT_DEL_VIF: - if (optlen != sizeof(vif)) - return -EINVAL; - if (copy_from_user(&vif, optval, sizeof(vif))) - return -EFAULT; - if (vif.vifc_vifi >= MAXVIFS) - return -ENFILE; - rtnl_lock(); + if (optlen != sizeof(vif)) { + ret = -EINVAL; + break; + } + if (copy_from_user(&vif, optval, sizeof(vif))) { + ret = -EFAULT; + break; + } + if (vif.vifc_vifi >= MAXVIFS) { + ret = -ENFILE; + break; + } if (optname == MRT_ADD_VIF) { ret = vif_add(net, mrt, &vif, sk == rtnl_dereference(mrt->mroute_sk)); } else { ret = vif_delete(mrt, vif.vifc_vifi, 0, NULL); } - rtnl_unlock(); - return ret; - + break; /* Manipula
[PATCH net-next 9/9] net: ipmr: factor out common vif init code
From: Nikolay Aleksandrov Factor out common vif init code used in both tunnel and pimreg initialization and create ipmr_init_vif_indev() function. Signed-off-by: Nikolay Aleksandrov --- net/ipv4/ipmr.c | 40 +++- 1 file changed, 19 insertions(+), 21 deletions(-) diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index f2ea97fbb8a8..73361c908e5e 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -391,6 +391,23 @@ static void ipmr_del_tunnel(struct net_device *dev, struct vifctl *v) } } +/* Initialize ipmr pimreg/tunnel in_device */ +static bool ipmr_init_vif_indev(const struct net_device *dev) +{ + struct in_device *in_dev; + + ASSERT_RTNL(); + + in_dev = __in_dev_get_rtnl(dev); + if (!in_dev) + return false; + ipv4_devconf_setall(in_dev); + neigh_parms_data_state_setall(in_dev->arp_parms); + IPV4_DEVCONF(in_dev->cnf, RP_FILTER) = 0; + + return true; +} + static struct net_device *ipmr_new_tunnel(struct net *net, struct vifctl *v) { struct net_device *dev; @@ -402,7 +419,6 @@ static struct net_device *ipmr_new_tunnel(struct net *net, struct vifctl *v) int err; struct ifreq ifr; struct ip_tunnel_parm p; - struct in_device *in_dev; memset(&p, 0, sizeof(p)); p.iph.daddr = v->vifc_rmt_addr.s_addr; @@ -427,15 +443,8 @@ static struct net_device *ipmr_new_tunnel(struct net *net, struct vifctl *v) if (err == 0 && (dev = __dev_get_by_name(net, p.name)) != NULL) { dev->flags |= IFF_MULTICAST; - - in_dev = __in_dev_get_rtnl(dev); - if (!in_dev) + if (!ipmr_init_vif_indev(dev)) goto failure; - - ipv4_devconf_setall(in_dev); - neigh_parms_data_state_setall(in_dev->arp_parms); - IPV4_DEVCONF(in_dev->cnf, RP_FILTER) = 0; - if (dev_open(dev)) goto failure; dev_hold(dev); @@ -502,7 +511,6 @@ static void reg_vif_setup(struct net_device *dev) static struct net_device *ipmr_reg_vif(struct net *net, struct mr_table *mrt) { struct net_device *dev; - struct in_device *in_dev; char name[IFNAMSIZ]; if (mrt->id == RT_TABLE_DEFAULT) @@ -522,18 +530,8 @@ static struct net_device *ipmr_reg_vif(struct net *net, struct mr_table *mrt) return NULL; } - rcu_read_lock(); - in_dev = __in_dev_get_rcu(dev); - if (!in_dev) { - rcu_read_unlock(); + if (!ipmr_init_vif_indev(dev)) goto failure; - } - - ipv4_devconf_setall(in_dev); - neigh_parms_data_state_setall(in_dev->arp_parms); - IPV4_DEVCONF(in_dev->cnf, RP_FILTER) = 0; - rcu_read_unlock(); - if (dev_open(dev)) goto failure; -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next 7/9] net: ipmr: remove SLAB_PANIC
From: Nikolay Aleksandrov It's not necessary to panic upon allocation failure, returning an error at that point is okay because user-space won't be able to use any of the ops since they didn't get registered and the default table is null. Signed-off-by: Nikolay Aleksandrov --- net/ipv4/ipmr.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index a006d96d6cd9..2c7fa584a274 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -2675,7 +2675,7 @@ int __init ip_mr_init(void) mrt_cachep = kmem_cache_create("ip_mrt_cache", sizeof(struct mfc_cache), - 0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, + 0, SLAB_HWCACHE_ALIGN, NULL); if (!mrt_cachep) return -ENOMEM; -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next 2/9] net: ipmr: always define mroute_reg_vif_num
From: Nikolay Aleksandrov Before mroute_reg_vif_num was defined only if any of the CONFIG_PIMSM_ options were set, but that's not really necessary as the size of the struct is the same in both cases (checked with pahole, both cases size is 3256 bytes) and we can remove some unnecessary ifdefs to simplify the code. Signed-off-by: Nikolay Aleksandrov --- net/ipv4/ipmr.c | 8 1 file changed, 8 deletions(-) diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index 5271e2eee110..dd2462f70d34 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -84,9 +84,7 @@ struct mr_table { atomic_tcache_resolve_queue_len; boolmroute_do_assert; boolmroute_do_pim; -#if defined(CONFIG_IP_PIMSM_V1) || defined(CONFIG_IP_PIMSM_V2) int mroute_reg_vif_num; -#endif }; struct ipmr_rule { @@ -347,9 +345,7 @@ static struct mr_table *ipmr_new_table(struct net *net, u32 id) setup_timer(&mrt->ipmr_expire_timer, ipmr_expire_process, (unsigned long)mrt); -#ifdef CONFIG_IP_PIMSM mrt->mroute_reg_vif_num = -1; -#endif #ifdef CONFIG_IP_MROUTE_MULTIPLE_TABLES list_add_tail_rcu(&mrt->list, &net->ipv4.mr_tables); #endif @@ -584,10 +580,8 @@ static int vif_delete(struct mr_table *mrt, int vifi, int notify, return -EADDRNOTAVAIL; } -#ifdef CONFIG_IP_PIMSM if (vifi == mrt->mroute_reg_vif_num) mrt->mroute_reg_vif_num = -1; -#endif if (vifi + 1 == mrt->maxvif) { int tmp; @@ -824,10 +818,8 @@ static int vif_add(struct net *net, struct mr_table *mrt, /* And finish update writing critical data */ write_lock_bh(&mrt_lock); v->dev = dev; -#ifdef CONFIG_IP_PIMSM if (v->flags & VIFF_REGISTER) mrt->mroute_reg_vif_num = vifi; -#endif if (vifi+1 > mrt->maxvif) mrt->maxvif = vifi+1; write_unlock_bh(&mrt_lock); -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next 5/9] net: ipmr: make ip_mroute_getsockopt more understandable
From: Nikolay Aleksandrov Use a switch to determine if optname is correct and set val accordingly. This produces a much more straight-forward and readable code. Signed-off-by: Nikolay Aleksandrov --- net/ipv4/ipmr.c | 28 ++-- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index 286ede3716ee..694fecf7838e 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -1443,29 +1443,29 @@ int ip_mroute_getsockopt(struct sock *sk, int optname, char __user *optval, int if (!mrt) return -ENOENT; - if (optname != MRT_VERSION && - optname != MRT_PIM && - optname != MRT_ASSERT) + switch (optname) { + case MRT_VERSION: + val = 0x0305; + break; + case MRT_PIM: + if (!pimsm_enabled()) + return -ENOPROTOOPT; + val = mrt->mroute_do_pim; + break; + case MRT_ASSERT: + val = mrt->mroute_do_assert; + break; + default: return -ENOPROTOOPT; + } if (get_user(olr, optlen)) return -EFAULT; - olr = min_t(unsigned int, olr, sizeof(int)); if (olr < 0) return -EINVAL; - if (put_user(olr, optlen)) return -EFAULT; - if (optname == MRT_VERSION) { - val = 0x0305; - } else if (optname == MRT_PIM) { - if (!pimsm_enabled()) - return -ENOPROTOOPT; - val = mrt->mroute_do_pim; - } else { - val = mrt->mroute_do_assert; - } if (copy_to_user(optval, &val, olr)) return -EFAULT; return 0; -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next 0/9] net: ipmr: cleanups and minor improvements
From: Nikolay Aleksandrov Hi, Since I'll have to work with ipmr, I decided to clean it up and do some minor improvements. Functionally there're almost no changes except the SLAB_PANIC removal. Most of the patches just re-design some functions to be clearer and more concise and try to remove the ifdef web that was inside. There's more information in each commit. This is the first set, the end goal is to introduce complete netlink support and control over the mfc and vif devices. I've tried to test all of the setsockopt/getsockopt options, and also made builds with various ipmr kconfig options turned on and off. Thank you, Nik Nikolay Aleksandrov (9): net: ipmr: move the tbl id check in ipmr_new_table net: ipmr: always define mroute_reg_vif_num net: ipmr: remove some pimsm ifdefs and simplify net: ipmr: fix code and comment style net: ipmr: make ip_mroute_getsockopt more understandable net: ipmr: drop an instance of CONFIG_IP_MROUTE_MULTIPLE_TABLES net: ipmr: remove SLAB_PANIC net: ipmr: rearrange and cleanup setsockopt net: ipmr: factor out common vif init code include/uapi/linux/mroute.h | 59 ++--- net/ipv4/ipmr.c | 597 2 files changed, 285 insertions(+), 371 deletions(-) -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Linux 3.16.7: WARNING at kernfs_get+0x2a/0x30() and ida_remove+0xdd/0x120() in loop
Last week I have a problem with kernel update my PPPoE BRAS server from Linux 3.2.0-4-686-pae #1 SMP Debian 3.2.68-1+deb7u1 i686 GNU/Linux to 3.16.0-4-686-pae #1 Debian 3.16.7-ckt11-1+deb8u5. At any time, for no reason, at any load server downed to oops with following trace http://spec.oborona.net/bras_panic Server software is: rp-pppoe + pppoe + tc htb shapers on ppp interfaces called from ip-ip script. No special settings are not using. I tried to install new debian kernel 4.2.6, but the problem is not solved, everything became worse, server worked with 4.2.6 about 15 minutes and downed to panic (no trace - network rsyslog is empty). After that, i decide to try stable kernel from kernel.org - 4.2.3, and saw the same panic with no network logs as with 4.2.6. Many technical staff from Russia and UA ISP confirmed these problems with ppp in all new kernels (accell-ppp and rp-pppoe). https://translate.google.ru/translate?hl=ru&sl=ru&tl=en&u=http%3A%2F%2Fforum.nag.ru%2Fforum%2Findex.php%3Fshowtopic%3D45266%26st%3D6220 This is the same our problem from UA colleague http://www.spinics.net/lists/netdev/msg352992.html Last kernel with no problem, for me - 3.2.68-1+deb7u1 i686 GNU/Linux. With 3.2 kerel server uptime is infinite. Can i help with additional information? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch] net/hsr: fix a warning message
WARN_ON_ONCE() takes a condition, it doesn't take an error message. I have converted this to WARN() instead. Signed-off-by: Dan Carpenter diff --git a/net/hsr/hsr_device.c b/net/hsr/hsr_device.c index 35a9788..c7d1adc 100644 --- a/net/hsr/hsr_device.c +++ b/net/hsr/hsr_device.c @@ -312,7 +312,7 @@ static void send_hsr_supervision_frame(struct hsr_port *master, u8 type) return; out: - WARN_ON_ONCE("HSR: Could not send supervision frame\n"); + WARN_ONCE(1, "HSR: Could not send supervision frame\n"); kfree_skb(skb); } -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel 4.1.12 crash
Memory corruption, if happens, IMHO shouldn't be a hardware-related - almost all of these boxes, except H61M-based box from 1st log, works for a long time with uptime more than year; and only software was changed on it; H61M-based box runs memtest86 for a tens of hours w/o any error. If it was caused by hardware - they should crash even earlier. Rarely on different servers I saw 'zram decompression error' messages (in this case I've got such message on H61M-based box). Also, other people that uses accel-ppp as BRAS software, have different kernel panics/bugs/oopses on fresh kernels. I'll try to apply these patches, and I'll try to switch back to kernels that were stable on some boxes. 21.11.2015 01:13, Alexander Duyck пишет: On 11/20/2015 05:58 AM, Andrew wrote: Hi all. Today some BRASes on 4.1.12 kernel were crashed. Here's crash traces: http://pastebin.com/p68hNS8R http://pastebin.com/36ieRAM2 http://pastebin.com/3BRTVEB6 On 3.2 kernel same hardware works OK, troubles were noticed after kernel upgrade. What additional info is needed? Looking over the traces there seem to be two areas called out. The first is the fib_trie resize BUG_ON that was triggered due to the parent and child not being associated. I think that might be due to memory corruption as I cannot find any spots where we are resizing without correctly setting up the parent-child relationship of the nodes first. The other spot that is showing up is ppp_shutdown_interface and it's related path. It looks like there are a couple of patches you could try back-porting to see if it resolves the issue. If they do then perhaps they should be considered candidates for stable: 8cb775bc0a3 ("ppp: fix device unregistration upon netns deletion") 58a89ecaca5 ("ppp: fix lockdep splat in ppp_dev_uninit()") - Alex -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html