Re: pktgen
Hello! Seems you found a race when rmmod is done before pktgen is fully started. Try:

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 733d86d..ac0b4b1 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -160,7 +160,7 @@
 #include <asm/div64.h>		/* do_div */
 #include <asm/timex.h>
 
-#define VERSION "pktgen v2.68: Packet Generator for packet performance testing.\n"
+#define VERSION "pktgen v2.69: Packet Generator for packet performance testing.\n"
 
 /* #define PG_DEBUG(a) a */
 #define PG_DEBUG(a)
@@ -3673,6 +3673,8 @@ static void __exit pg_cleanup(void)
 	struct list_head *q, *n;
 	wait_queue_head_t queue;
 	init_waitqueue_head(&queue);
+
+	schedule_timeout_interruptible(msecs_to_jiffies(125));
 
 	/* Stop all interfaces & threads */

	for i in 1 2 3 4 5 ; do modprobe pktgen ; rmmod pktgen ; done

In dmesg:

pktgen v2.69: Packet Generator for packet performance testing.
pktgen v2.69: Packet Generator for packet performance testing.
pktgen v2.69: Packet Generator for packet performance testing.
pktgen v2.69: Packet Generator for packet performance testing.
pktgen v2.69: Packet Generator for packet performance testing.

Cheers.
						--ro

Alexey Dobriyan writes:

On 11/30/06, David Miller [EMAIL PROTECTED] wrote:

From: Alexey Dobriyan [EMAIL PROTECTED]
Date: Wed, 29 Nov 2006 23:04:37 +0300

Looks like the worker thread strategically clears it if scheduled at the wrong moment.

--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -3292,7 +3292,6 @@ static void pktgen_thread_worker(struct
 	init_waitqueue_head(&t->queue);
 
-	t->control &= ~(T_TERMINATE);
 	t->control &= ~(T_RUN);
 	t->control &= ~(T_STOP);
 	t->control &= ~(T_REMDEVALL);

Good catch Alexey. Did you rerun the load/unload test with this fix applied? If it fixes things, I'll merge it.

Well, yes, it fixes things, except that Ctrl+C getting you out of the modprobe/rmmod loop will spit a backtrace again. The same goes for the other flags, T_RUN and T_STOP: their clearance is not needed (the structure is kzalloc'ed) and creates bugs as demonstrated. Give me some time.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC] [PATCH V2 0/3] bonding support for operation over IPoIB
This patch series is a second version (see below for the link to V1) of the suggested changes to the bonding driver such that it would be able to support non-ARPHRD_ETHER netdevices for its High-Availability (active-backup) mode. The motivation is to enable the bonding driver in its HA mode to work with the IP over InfiniBand (IPoIB) driver. With these patches I was able to enslave IPoIB netdevices and run TCP, UDP, IP (UDP) multicast and ICMP traffic with fail-over and fail-back working fine. My working env was the net-2.6.20 git.

Moreover, as IPoIB is also the IB ARP provider for the RDMA CM driver, which is used by native IB ULPs whose addressing scheme is based on IP (eg iSER, SDP, Lustre, NFSoRDMA, RDS), bonding support for IPoIB devices **enables** HA for these ULPs. This holds because when the ULP is informed by the IB HW of the failure of the current IB connection, it just needs to reconnect, at which point the bonding device will issue the IB ARP over the active IPoIB slave.

The first patch changes some of the bond netdevice attributes and functions to be those of the active slave for the case of the enslaved device not being of ARPHRD_ETHER type. Basically it overrides the settings done by ether_setup(), which are netdevice **type** dependent and hence might not be appropriate for devices of other types. It also enforces mutual exclusion on bonding slaves of dissimilar ether types, as was concluded over the V1 discussion.

An IPoIB MAC address (see Documentation/infiniband/ipoib.txt) is made of a 3-byte IB QP (Queue Pair) number and the 16-byte IB port GID (Global ID) of the port this IPoIB device is bound to. The QP is a resource created by the IB HW and the GID is an identifier burned into the HCA (I have omitted some details which are not important for the bonding RFC). Basically the IPoIB spec and implementation do not allow setting the MAC address of an IPoIB device, and this work was done under that assumption.
Hence, the second patch allows for enslaving netdevices which do not support the set_mac_address() function. In that case the bond MAC address is that of the active slave, where remote peers are notified of the MAC address (neighbour) change by a gratuitous ARP sent by bonding when fail-over occurs (this is already done by the bonding code).

Normally, the bonding driver is UP before any enslavement takes place. Once a netdevice is UP, the network stack acts to have it join some multicast groups (eg the all-hosts group 224.0.0.1). Now, since ether_setup() has set the bonding device type to be ARPHRD_ETHER and its address length to be ETH_ALEN, the net core code computes a wrong multicast link address: ip_eth_mc_map() is called, whereas for multicast joins taking place **after** the enslavement the proper ip_xxx_mc_map() is called (eg ip_ib_mc_map() when the bond type is ARPHRD_INFINIBAND).

The third patch handles this problem by allowing devices to be enslaved while the bonding device is not UP. Over the discussion held on the previous post this seemed to be the cleanest way to go, and it is not expected to cause instabilities.

These patches are not enough for configuration of IPoIB bonding through the tools (eg /sbin/ifenslave and /sbin/ifup) provided by packages such as sysconfig and initscripts, specifically since these tools set the bonding device to be UP before enslaving anything. Once this patchset gets positive feedback, the next step would be to look at how to enhance the tools/packages so it would be possible to bond/enslave with the modified code. As suggested by the bonding maintainer, this step can potentially involve converting ifenslave into a script based on the bonding sysfs infrastructure rather than on the somewhat obsolete Documentation/networking/ifenslave.c.

For the ease of potential testers, I will post an example bonding sysfs script which can be used to set bonding to work with patches 1-3 (let me know!)

Or.
changes from V1 (the links point to V1 0-3/3):

http://marc.theaimsgroup.com/?l=linux-netdev&m=115926582209736&w=2
http://marc.theaimsgroup.com/?l=linux-netdev&m=115926599515568&w=2
http://marc.theaimsgroup.com/?l=linux-netdev&m=115926599430055&w=2
http://marc.theaimsgroup.com/?l=linux-netdev&m=115926599415729&w=2

+ enforce mutual exclusion on the slaves' ether types
+ don't attempt to set the bond mtu when enslaving a non-ARPHRD_ETHER device
+ rather than hack the bond device ether type through module params, allow enslavement when the bond device is not up
[RFC][PATCH V2 1/3] enable bonding to enslave non ARPHRD_ETHER netdevices
Signed-off-by: Or Gerlitz [EMAIL PROTECTED]

Index: net-2.6.20/drivers/net/bonding/bond_main.c
===================================================================
--- net-2.6.20.orig/drivers/net/bonding/bond_main.c	2006-11-30 10:54:23.000000000 +0200
+++ net-2.6.20/drivers/net/bonding/bond_main.c	2006-11-30 11:53:06.000000000 +0200
@@ -1252,6 +1252,24 @@ static int bond_compute_features(struct
 	return 0;
 }
 
+static void bond_setup_by_slave(struct net_device *bond_dev,
+				struct net_device *slave_dev)
+{
+	bond_dev->hard_header = slave_dev->hard_header;
+	bond_dev->rebuild_header = slave_dev->rebuild_header;
+	bond_dev->hard_header_cache = slave_dev->hard_header_cache;
+	bond_dev->header_cache_update = slave_dev->header_cache_update;
+	bond_dev->hard_header_parse = slave_dev->hard_header_parse;
+
+	bond_dev->type = slave_dev->type;
+	bond_dev->hard_header_len = slave_dev->hard_header_len;
+	bond_dev->addr_len = slave_dev->addr_len;
+
+	memcpy(bond_dev->broadcast, slave_dev->broadcast,
+	       slave_dev->addr_len);
+}
+
 /* enslave device <slave> to bond device <master> */
 int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 {
@@ -1326,6 +1344,24 @@ int bond_enslave(struct net_device *bond
 		goto err_undo_flags;
 	}
 
+	/* set bonding device ether type by slave - bonding netdevices are
+	 * created with ether_setup, so when the slave type is not ARPHRD_ETHER
+	 * there is a need to override some of the type dependent attribs/funcs.
+	 *
+	 * bond ether type mutual exclusion - don't allow slaves of dissimilar
+	 * ether type (eg ARPHRD_ETHER and ARPHRD_INFINIBAND) share the same bond
+	 */
+	if (bond->slave_cnt == 0) {
+		if (slave_dev->type != ARPHRD_ETHER)
+			bond_setup_by_slave(bond_dev, slave_dev);
+	} else if (bond_dev->type != slave_dev->type) {
+		printk(KERN_ERR DRV_NAME ": %s ether type (%d) is different from "
+			"other slaves (%d), can not enslave it.\n", slave_dev->name,
+			slave_dev->type, bond_dev->type);
+		res = -EINVAL;
+		goto err_undo_flags;
+	}
+
 	if (slave_dev->set_mac_address == NULL) {
 		printk(KERN_ERR DRV_NAME
 			": %s: Error: The slave device you specified does "

Index: net-2.6.20/drivers/net/bonding/bonding.h
===================================================================
--- net-2.6.20.orig/drivers/net/bonding/bonding.h	2006-11-30 10:54:23.000000000 +0200
+++ net-2.6.20/drivers/net/bonding/bonding.h	2006-11-30 10:58:10.000000000 +0200
@@ -201,6 +201,7 @@ struct bonding {
 	struct   list_head vlan_list;
 	struct   vlan_group *vlgrp;
 	struct   packet_type arp_mon_pt;
+	s8       do_set_mac_addr;
 };
 
 /**
[RFC][PATCH V2 2/3] enable bonding to enslave netdevices not supporting set_mac_address()
Signed-off-by: Or Gerlitz [EMAIL PROTECTED]

Index: net-2.6.20/drivers/net/bonding/bond_main.c
===================================================================
--- net-2.6.20.orig/drivers/net/bonding/bond_main.c	2006-11-30 11:53:06.000000000 +0200
+++ net-2.6.20/drivers/net/bonding/bond_main.c	2006-11-30 11:53:16.000000000 +0200
@@ -1103,6 +1103,14 @@ void bond_change_active_slave(struct bon
 		if (new_active) {
 			bond_set_slave_active_flags(new_active);
 		}
+
+		/* when bonding does not set the slave MAC address, the bond MAC
+		 * address is the one of the active slave.
+		 */
+		if (new_active && !bond->do_set_mac_addr)
+			memcpy(bond->dev->dev_addr, new_active->dev->dev_addr,
+				new_active->dev->addr_len);
+
 		bond_send_gratuitous_arp(bond);
 	}
 }
@@ -1363,14 +1371,23 @@ int bond_enslave(struct net_device *bond
 	}
 
 	if (slave_dev->set_mac_address == NULL) {
-		printk(KERN_ERR DRV_NAME
-			": %s: Error: The slave device you specified does "
-			"not support setting the MAC address. "
-			"Your kernel likely does not support slave "
-			"devices.\n", bond_dev->name);
-		res = -EOPNOTSUPP;
-		goto err_undo_flags;
-	}
+		if (bond->slave_cnt == 0) {
+			printk(KERN_WARNING DRV_NAME
+				": %s: Warning: The first slave device you "
+				"specified does not support setting the MAC "
+				"address. This bond MAC address would be that "
+				"of the active slave.\n", bond_dev->name);
+			bond->do_set_mac_addr = 0;
+		} else if (bond->do_set_mac_addr) {
+			printk(KERN_ERR DRV_NAME
+				": %s: Error: The slave device you specified "
+				"does not support setting the MAC address, "
+				"but this bond uses this practice.\n",
+				bond_dev->name);
+			res = -EOPNOTSUPP;
+			goto err_undo_flags;
+		}
+	}
 
 	new_slave = kmalloc(sizeof(struct slave), GFP_KERNEL);
 	if (!new_slave) {
@@ -1392,16 +1409,18 @@ int bond_enslave(struct net_device *bond
 	 */
 	memcpy(new_slave->perm_hwaddr, slave_dev->dev_addr, ETH_ALEN);
 
-	/*
-	 * Set slave to master's mac address.  The application already
-	 * set the master's mac address to that of the first slave
-	 */
-	memcpy(addr.sa_data, bond_dev->dev_addr, bond_dev->addr_len);
-	addr.sa_family = slave_dev->type;
-	res = dev_set_mac_address(slave_dev, &addr);
-	if (res) {
-		dprintk("Error %d calling set_mac_address\n", res);
-		goto err_free;
+	if (bond->do_set_mac_addr) {
+		/*
+		 * Set slave to master's mac address.  The application already
+		 * set the master's mac address to that of the first slave
+		 */
+		memcpy(addr.sa_data, bond_dev->dev_addr, bond_dev->addr_len);
+		addr.sa_family = slave_dev->type;
+		res = dev_set_mac_address(slave_dev, &addr);
+		if (res) {
+			dprintk("Error %d calling set_mac_address\n", res);
+			goto err_free;
+		}
 	}
 
 	/* open the slave since the application closed it */
@@ -1627,9 +1646,11 @@ err_close:
 	dev_close(slave_dev);
 
 err_restore_mac:
-	memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN);
-	addr.sa_family = slave_dev->type;
-	dev_set_mac_address(slave_dev, &addr);
+	if (bond->do_set_mac_addr) {
+		memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN);
+		addr.sa_family = slave_dev->type;
+		dev_set_mac_address(slave_dev, &addr);
+	}
 
 err_free:
 	kfree(new_slave);
@@ -1807,10 +1828,12 @@ int bond_release(struct net_device *bond
 	/* close slave before restoring its mac address */
 	dev_close(slave_dev);
 
-	/* restore original ("permanent") mac address */
-	memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN);
-	addr.sa_family = slave_dev->type;
-	dev_set_mac_address(slave_dev, &addr);
+	if (bond->do_set_mac_addr) {
+		/* restore original ("permanent") mac address */
+		memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN);
+		addr.sa_family = slave_dev->type;
+		dev_set_mac_address(slave_dev, &addr);
+	}
 
 	slave_dev->priv_flags &= ~(IFF_MASTER_8023AD | IFF_MASTER_ALB |
 				   IFF_SLAVE_INACTIVE | IFF_BONDING |
@@ -1897,10 +1920,12 @@ static int
[RFC] [PATCH V2 3/3] enable IP multicast for bonding IPoIB devices - allow not UP enslavement
Signed-off-by: Or Gerlitz [EMAIL PROTECTED]

Index: net-2.6.20/drivers/net/bonding/bond_sysfs.c
===================================================================
--- net-2.6.20.orig/drivers/net/bonding/bond_sysfs.c	2006-11-30 10:45:53.000000000 +0200
+++ net-2.6.20/drivers/net/bonding/bond_sysfs.c	2006-11-30 10:48:13.000000000 +0200
@@ -265,11 +265,9 @@ static ssize_t bonding_store_slaves(stru
 
 	/* Quick sanity check -- is the bond interface up? */
 	if (!(bond->dev->flags & IFF_UP)) {
-		printk(KERN_ERR DRV_NAME
-		       ": %s: Unable to update slaves because interface is down.\n",
+		printk(KERN_WARNING DRV_NAME
+		       ": %s: doing slave updates when interface is down.\n",
 		       bond->dev->name);
-		ret = -EPERM;
-		goto out;
 	}
 
 	/* Note: We can't hold bond->lock here, as bond_create grabs it. */

Index: net-2.6.20/drivers/net/bonding/bond_main.c
===================================================================
--- net-2.6.20.orig/drivers/net/bonding/bond_main.c	2006-11-30 10:46:57.000000000 +0200
+++ net-2.6.20/drivers/net/bonding/bond_main.c	2006-11-30 10:48:13.000000000 +0200
@@ -1298,8 +1298,8 @@ int bond_enslave(struct net_device *bond
 
 	/* bond must be initialized by bond_open() before enslaving */
 	if (!(bond_dev->flags & IFF_UP)) {
-		dprintk("Error, master_dev is not up\n");
-		return -EPERM;
+		printk(KERN_WARNING DRV_NAME
+		       ": %s: master_dev is not up in bond_enslave\n", bond_dev->name);
 	}
 
 	/* already enslaved */
Re: Please pull from 'upstream-fixes' branch of wireless-2.6
John W. Linville wrote:

The following changes since commit 0579e303553655245e8a6616bd8b4428b07d63a2:

  Linus Torvalds:
        Merge branch 'for-linus' of git://git.kernel.org/.../drzeus/mmc

are found in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git upstream-fixes

John W. Linville:
      Revert "zd1211rw: Removed unneeded packed attributes"

Michael Buesch:
      softmac: remove netif_tx_disable when scanning

Ulrich Kunitz:
      zd1211rw: Fix of a locking bug

Zhu Yi:
      ieee80211: Fix kernel panic when QoS is enabled

pulled
Re: [PATCH] r8169: Fix iteration variable sign
Francois Romieu wrote:

This changes the type of variable i in rtl8169_init_one() from unsigned int to int. i is checked for < 0 later, which can never happen for unsigned. This results in broken error handling.

Signed-off-by: Michael Buesch [EMAIL PROTECTED]
Signed-off-by: Francois Romieu [EMAIL PROTECTED]

ACK, but it doesn't seem to apply to 2.6.19? Should this go into #upstream rather than #upstream-fixes?
[PKT_SCHED] act_gact: division by zero
tc qdisc add dev eth1 handle : ingress
tc filter add dev eth1 protocol ip parent : pref 99 basic \
	flowid 1:1 action pass random determ drop 0
	                                         ^
the above causes a division by zero in the kernel on the first packet in.

Signed-off-by: Kim Nordlund [EMAIL PROTECTED]

diff -rub linux-2.6.19-orig/net/sched/act_gact.c linux/net/sched/act_gact.c
--- linux-2.6.19-orig/net/sched/act_gact.c	2006-11-29 23:57:37.000000000 +0200
+++ linux/net/sched/act_gact.c	2006-11-30 13:22:37.000000000 +0200
@@ -111,7 +111,7 @@
 	if (tb[TCA_GACT_PROB-1] != NULL) {
 		struct tc_gact_p *p_parm = RTA_DATA(tb[TCA_GACT_PROB-1]);
 		gact->tcfg_paction = p_parm->paction;
-		gact->tcfg_pval    = p_parm->pval;
+		gact->tcfg_pval    = p_parm->pval ? : 1;
 		gact->tcfg_ptype   = p_parm->ptype;
 	}
 #endif
Re: [RFC][PATCH 6/6] net: vm deadlock avoidance core
Oops, it seems I missed some chunks. New patch attached.

---
Subject: net: vm deadlock avoidance core

In order to provide robust networked block devices there must be a guarantee of progress. That is, the block device must never stall because of (physical) OOM, because the device itself might be needed to get out of it (reclaim). This means that the device queue must always be unpluggable, which in turn means that it must always find enough memory to build/send packets over the network _and_ receive (level 7) ACKs for those packets.

The network stack has a huge capacity for buffering packets while waiting for user-space to read them. There is a practical limit imposed to avoid DoS scenarios. These two things make for a deadlock: what if the receive limit is reached and all packets are buffered in non-critical sockets (those not serving the network block device waiting for an ACK to free a page)? Memory pressure will add to that: what if there is simply no memory left to receive packets in?

This patch provides a service to register sockets as critical; SOCK_VMIO is a promise that the socket will never block on receive. Along with a memory reserve that will service a limited number of packets, this can guarantee a limited service to these critical sockets. When we make sure that packets allocated from the reserve will only service critical sockets, we will not lose the memory and can guarantee progress.

(Note on the name SOCK_VMIO: the basic problem is a circular dependency between the network and virtual memory subsystems which needs to be broken. This does make VM network IO - and only VM network IO - special; it does not generalize.)

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/skbuff.h     |   13 +++-
 include/net/sock.h         |   36 ++++++
 net/core/dev.c             |   40 ++++++-
 net/core/skbuff.c          |   51 ++++++--
 net/core/sock.c            |  121 +++++++++++++++++
 net/ipv4/ip_fragment.c     |    1
 net/ipv4/ipmr.c            |    4 +
 net/ipv4/route.c           |   15 ++
 net/ipv4/sysctl_net_ipv4.c |   14 +-
 net/ipv4/tcp_ipv4.c        |   27 +++-
 net/ipv6/reassembly.c      |    1
 net/ipv6/route.c           |   15 ++
 net/ipv6/sysctl_net_ipv6.c |    6 +-
 net/ipv6/tcp_ipv6.c        |   27 +++-
 net/netfilter/core.c       |    5 +
 security/selinux/avc.c     |    2
 16 files changed, 355 insertions(+), 23 deletions(-)

Index: linux-2.6-git/include/linux/skbuff.h
===================================================================
--- linux-2.6-git.orig/include/linux/skbuff.h	2006-11-30 10:56:33.000000000 +0100
+++ linux-2.6-git/include/linux/skbuff.h	2006-11-30 11:37:51.000000000 +0100
@@ -283,7 +283,8 @@ struct sk_buff {
 				nfctinfo:3;
 	__u8			pkt_type:3,
 				fclone:2,
-				ipvs_property:1;
+				ipvs_property:1,
+				emergency:1;
 	__be16			protocol;
 
 	void			(*destructor)(struct sk_buff *skb);
@@ -328,10 +329,13 @@ struct sk_buff {
 
 #include <asm/system.h>
 
+#define SKB_ALLOC_FCLONE	0x01
+#define SKB_ALLOC_RX		0x02
+
 extern void kfree_skb(struct sk_buff *skb);
 extern void __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
-				   gfp_t priority, int fclone);
+				   gfp_t priority, int flags);
 static inline struct sk_buff *alloc_skb(unsigned int size,
 					gfp_t priority)
 {
@@ -341,7 +345,7 @@ static inline struct sk_buff *alloc_skb(
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 					       gfp_t priority)
 {
-	return __alloc_skb(size, priority, 1);
+	return __alloc_skb(size, priority, SKB_ALLOC_FCLONE);
 }
 
 extern struct sk_buff *alloc_skb_from_cache(kmem_cache_t *cp,
@@ -1102,7 +1106,8 @@ static inline void __skb_queue_purge(str
 static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
 					      gfp_t gfp_mask)
 {
-	struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+	struct sk_buff *skb =
+		__alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX);
 	if (likely(skb))
 		skb_reserve(skb, NET_SKB_PAD);
 	return skb;

Index: linux-2.6-git/include/net/sock.h
===================================================================
--- linux-2.6-git.orig/include/net/sock.h	2006-11-30 10:56:33.000000000 +0100
+++ linux-2.6-git/include/net/sock.h	2006-11-30 11:37:51.000000000 +0100
@@ -391,6 +391,7 @@ enum sock_flags {
 	SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
 	SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
Re: [PATCH][NET_SCHED] sch_cbq: deactivating when grafting, purging etc.
Jarek Poplawski wrote:

[NET_SCHED] sch_cbq: [PATCH 2.6.19-rc6 with "Fix endless loops" set of patches]

- P. McHardy's "Fix endless loops" patch supplement (cbq_graft, cbq_qlen_notify, cbq_delete, cbq_class_ops)
- deactivating of active classes when q.qlen drops to zero (cbq_drop)
- a redundant instruction removed from cbq_deactivate_class

PS: probably htb_deactivate in htb_delete and cbq_deactivate_class in cbq_delete are also redundant now.

This looks good, thanks.
Re: [PATCH][NET_SCHED] sch_htb: turn intermediate classes into leaves
Jarek Poplawski wrote:

[NET_SCHED] sch_htb: [PATCH 2.6.19-rc6 with "Fix endless loops" set of patches]

- turn intermediate classes into leaves again when their last child is deleted (struct htb_class changed)

Looks good to me too, but it still seems to be missing class level adjustment after deletion. The classification function refuses to queue packets to classes with level > 0.
Re: [PATCH][NET_SCHED] sch_htb: turn intermediate classes into leaves
Jarek Poplawski wrote:

On Thu, Nov 30, 2006 at 01:26:34PM +0100, Patrick McHardy wrote:

Jarek Poplawski wrote:

[NET_SCHED] sch_htb: [PATCH 2.6.19-rc6 with "Fix endless loops" set of patches]

- turn intermediate classes into leaves again when their last child is deleted (struct htb_class changed)

Looks good to me too, but it still seems to be missing class level adjustment after deletion. The classification function refuses to queue packets to classes with level > 0.

+static void htb_parent_to_leaf(struct htb_class *cl, struct Qdisc *new_q)
+{
+	struct htb_class *parent = cl->parent;
+
+	BUG_TRAP(!cl->level && cl->un.leaf.q && !cl->prio_activity);
+
+	parent->level = 0;

I've thought this is enough, but probably you mean something else?

I missed that, thanks.
[PATCH 2.6.18] declance: Fix RX ownership handover
The change for PMAD support introduced a bug, where the ownership of RX descriptors was given back to the LANCE in the wrong way. Occasional lockups would happen as a result. This is a fix for this problem.

Signed-off-by: Maciej W. Rozycki [EMAIL PROTECTED]
---
Tested with the onboard LANCE of a DECstation 5000/133. Please apply.

  Maciej

patch-mips-2.6.18-20060920-pmax-lance-rx-fix-0
diff -up --recursive --new-file linux-mips-2.6.18-20060920.macro/drivers/net/declance.c linux-mips-2.6.18-20060920/drivers/net/declance.c
--- linux-mips-2.6.18-20060920.macro/drivers/net/declance.c	2006-11-23 02:55:34.000000000 +0000
+++ linux-mips-2.6.18-20060920/drivers/net/declance.c	2006-11-30 02:26:34.000000000 +0000
@@ -628,7 +628,6 @@ static int lance_rx(struct net_device *d
 			/* Return the packet to the pool */
 			*rds_ptr(rd, mblength, lp->type) = 0;
 			*rds_ptr(rd, length, lp->type) = -RX_BUFF_SIZE | 0xf000;
-			*rds_ptr(rd, rmd1, lp->type) = LE_R1_OWN;
 			*rds_ptr(rd, rmd1, lp->type) =
 				((lp->rx_buf_ptr_lnc[entry] >> 16) & 0xff) | LE_R1_OWN;
 
 			lp->rx_new = (entry + 1) & RX_RING_MOD_MASK;
Re: [PATCH REPOST 1/2] NET: Accurate packet scheduling for ATM/ADSL (kernel)
First, sorry for letting you wait so long ..

Russell Stuart wrote:

On Tue, 2006-10-24 at 18:19 +0200, Patrick McHardy wrote:

No, my patch works for qdiscs with and without RTABs, this is where they overlap.

Could you explain how this works? I didn't see how qdiscs that used RTAB to measure rates of transmission could use your STAB to do the same thing. At least not without substantial modifications to your patch.

Qdiscs don't use RTABs to measure rates but to calculate transmission times. Transmission time is always related to the length; the difference between our patches is that you modify the RTABs in advance to include the overhead in the calculation, while my patch changes the length used to look up the transmission time. Which works with or without RTABs.

No, as we already discussed, SFQ uses the packet size for calculating remaining quanta, and fairness would increase if the real transmission size (and time) were used. RED uses the backlog size to calculate the drop probability (and supports attaching inner qdiscs nowadays), so keeping accurate backlog statistics seems to be a win as well (besides their use for estimators). It is also possible to specify the maximum latency for TBF instead of a byte limit (which is passed as the max. backlog value to the inner bfifo qdisc); this would also need accurate backlog statistics.

This is all beside the point if you can show how your patch gets rid of RTAB - currently I am acting under the assumption it doesn't. If it does, you get all you describe for free. Why? Otherwise - yes, you are correct. The ATM patch does not introduce accurate packet lengths into the kernel, which is what is required to give the benefits you describe. But that was never the ATM patch's goal. The ATM patch gives accurate rate calculations for ATM links, nothing more. Accurate packet length calculation is apparently the goal of your patch, and I wish you luck with it.

Again, it's not rate calculations but transmission time calculations, which _are a function of the length_.

Ethernet, VLAN, tunnels, ... it's especially useful for tunnels if you also shape on the underlying device, since the qdisc on the tunnel device and the qdisc on the underlying device should ideally be in sync (otherwise no accurate bandwidth reservation is possible).

Hmmm - not as far as I am aware. In all those cases the IP layer breaks up the data into MTU-sized packets before they get to the scheduler. ATM is the only technology I know of where setting the MTU to be bigger than what the end-to-end link can support is normal.

That's not the point. If I want to do scheduling on the ipip device and on the underlying device at the same time, I need to reserve the amount of bandwidth given to the ipip device plus the bandwidth used for encapsulation on the underlying device. The easy way to do this is to use the same amount of bandwidth on both devices and make the scheduler on the ipip device aware that some overhead is going to be added. The hard way is to calculate the worst case (bandwidth / minimum packet size * overhead per packet) and add that on the underlying device.

Either you or Jesper pointed to this code in iproute:

	for (i=0; i<256; i++) {
		unsigned sz = (i<<cell_log);
		...
		rtab[i] = tc_core_usec2tick(1000000*((double)sz/bps));
	}

which tends to underestimate the transmission time by using the smallest possible size for each cell.

Firstly, yes, you are correct. It will under some circumstances underestimate the number of cells it takes to send a packet. The reason is that the whole aim of the ATM patch was to make maximum use of the ATM link, while at the same time keeping control of scheduling decisions. To keep control of scheduling decisions, we must _never_ overestimate the speed of the link. If we do, the ISP will take control of the scheduling.

Underestimating the transmission time is equivalent to overestimating the rate.

At first sight this seems a minor issue. It's not, because the error can be large. An example of overestimating the link speed would be where one RTAB entry covers both the 2- and 3-cell cases. If we say the IP packet is going to use 2 cells, and in fact it uses 3, then the error is 50%. This is a huge error, and in fact eliminating this error is the whole point of the ATM patch. As an example of its impact, I was trying to make VOIP work over a shared link. If the ISP starts making the scheduling decisions then VOIP packets start being dropped or delayed, rendering VOIP unusable. So in order to use VOIP on the link I would have to understate the link capacity by 50%. As it happens, VOIP generates a stream of packets in the 2-3 cell size range, the actual size depending on the codec negotiated by the end points.

Jesper in his thesis gives perhaps a more important example of what happens if you overestimate the link speed. It turns out it interacts with TCP's flow
Re: [PATCH][NET_SCHED] sch_htb: turn intermediate classes into leaves
On Thu, Nov 30, 2006 at 01:26:34PM +0100, Patrick McHardy wrote:

Jarek Poplawski wrote:

[NET_SCHED] sch_htb: [PATCH 2.6.19-rc6 with "Fix endless loops" set of patches]

- turn intermediate classes into leaves again when their last child is deleted (struct htb_class changed)

Looks good to me too, but it still seems to be missing class level adjustment after deletion. The classification function refuses to queue packets to classes with level > 0.

+static void htb_parent_to_leaf(struct htb_class *cl, struct Qdisc *new_q)
+{
+	struct htb_class *parent = cl->parent;
+
+	BUG_TRAP(!cl->level && cl->un.leaf.q && !cl->prio_activity);
+
+	parent->level = 0;

I've thought this is enough, but probably you mean something else?

Jarek P.
Re: [PKT_SCHED] act_gact: division by zero
Nordlund Kim (Nokia-NET/Helsinki) wrote:

tc qdisc add dev eth1 handle : ingress
tc filter add dev eth1 protocol ip parent : pref 99 basic \
	flowid 1:1 action pass random determ drop 0

the above causes a division by zero in the kernel on the first packet in.

Signed-off-by: Kim Nordlund [EMAIL PROTECTED]

diff -rub linux-2.6.19-orig/net/sched/act_gact.c linux/net/sched/act_gact.c
--- linux-2.6.19-orig/net/sched/act_gact.c	2006-11-29 23:57:37.000000000 +0200
+++ linux/net/sched/act_gact.c	2006-11-30 13:22:37.000000000 +0200
@@ -111,7 +111,7 @@
 	if (tb[TCA_GACT_PROB-1] != NULL) {
 		struct tc_gact_p *p_parm = RTA_DATA(tb[TCA_GACT_PROB-1]);
 		gact->tcfg_paction = p_parm->paction;
-		gact->tcfg_pval    = p_parm->pval;
+		gact->tcfg_pval    = p_parm->pval ? : 1;

I think it should reject an invalid configuration or handle the zero case correctly by not dividing.
[PATCH 2.6.18] declance: Support the I/O ASIC LANCE w/o TURBOchannel
The onboard LANCE of I/O ASIC systems is not a TURBOchannel device, at least from the software point of view. Therefore it does not rely on any kernel TURBOchannel bus services and can be supported even if support for TURBOchannel has not been enabled in the configuration.

Signed-off-by: Maciej W. Rozycki [EMAIL PROTECTED]
---
Tested with the onboard LANCE of a DECstation 5000/133. Please apply.

  Maciej

patch-mips-2.6.18-20060920-declance-tc-0
diff -up --recursive --new-file linux-mips-2.6.18-20060920.macro/drivers/net/declance.c linux-mips-2.6.18-20060920/drivers/net/declance.c
--- linux-mips-2.6.18-20060920.macro/drivers/net/declance.c	2006-11-23 02:55:34.000000000 +0000
+++ linux-mips-2.6.18-20060920/drivers/net/declance.c	2006-11-30 02:31:19.000000000 +0000
@@ -1068,7 +1068,6 @@ static int __init dec_lance_init(const i
 	lp->type = type;
 	lp->slot = slot;
 	switch (type) {
-#ifdef CONFIG_TC
 	case ASIC_LANCE:
 		dev->base_addr = CKSEG1ADDR(dec_kn_slot_base + IOASIC_LANCE);
 
@@ -1112,7 +1111,7 @@ static int __init dec_lance_init(const i
 			  CPHYSADDR(dev->mem_start) << 3);
 		break;
-
+#ifdef CONFIG_TC
 	case PMAD_LANCE:
 		claim_tc_card(slot);
 
@@ -1143,7 +1142,6 @@ static int __init dec_lance_init(const i
 		break;
 
 #endif
-
 	case PMAX_LANCE:
 		dev->irq = dec_interrupt[DEC_IRQ_LANCE];
 		dev->base_addr = CKSEG1ADDR(KN01_SLOT_BASE + KN01_LANCE);
@@ -1298,10 +1296,8 @@ static int __init dec_lance_probe(void)
 	/* Then handle onboard devices. */
 	if (dec_interrupt[DEC_IRQ_LANCE] >= 0) {
 		if (dec_interrupt[DEC_IRQ_LANCE_MERR] >= 0) {
-#ifdef CONFIG_TC
 			if (dec_lance_init(ASIC_LANCE, -1) >= 0)
 				count++;
-#endif
 		} else if (!TURBOCHANNEL) {
 			if (dec_lance_init(PMAX_LANCE, -1) >= 0)
 				count++;
Re: [RFC] [PATCH V2 0/3] bonding support for operation over IPoIB - example config script
Below is an example script I was using to configure bonding in my testing. Or. --- /dev/null 2006-10-30 17:30:04.465997856 +0200 +++ bond_init_sysfs.bash 2006-11-30 12:51:05.109565889 +0200 @@ -0,0 +1,25 @@ +#!/bin/bash + +BOND_IP=192.168.10.118 +BOND_NETMASK=255.255.255.0 + +SLAVE_A=ib0 +SLAVE_B=ib1 + +modprobe bonding + +echo active-backup > /sys/class/net/bond0/bonding/mode +echo 100 > /sys/class/net/bond0/bonding/miimon + +modprobe ib_ipoib + +## this is some debug info that can enable seeing behind the scenes... +## to learn more see Documentation/infiniband/ipoib.txt + +#echo 1 > /sys/module/ib_ipoib/parameters/mcast_debug_level +#echo 1 > /sys/module/ib_ipoib/parameters/debug_level + +echo +$SLAVE_A > /sys/class/net/bond0/bonding/slaves +echo +$SLAVE_B > /sys/class/net/bond0/bonding/slaves + +ifconfig bond0 $BOND_IP netmask $BOND_NETMASK up
Re: [RFC][PATCH V2 1/3] enable bonding to enslave non ARPHRD_ETHER netdevices
Or Gerlitz wrote: Index: net-2.6.20/drivers/net/bonding/bonding.h === --- net-2.6.20.orig/drivers/net/bonding/bonding.h 2006-11-30 10:54:23.0 +0200 +++ net-2.6.20/drivers/net/bonding/bonding.h2006-11-30 10:58:10.0 +0200 @@ -201,6 +201,7 @@ struct bonding { struct list_head vlan_list; struct vlan_group *vlgrp; struct packet_type arp_mon_pt; + s8 do_set_mac_addr; }; /** oops - this piece belongs to the second patch, which actually uses the added field, sorry for that. I will fix it for the next version. Or. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PKT_SCHED] act_gact: division by zero
On Thu, 30 Nov 2006, ext Patrick McHardy wrote: I think it should reject an invalid configuration or handle the zero case correctly by not dividing. You are correct. Not returning -EINVAL, because someone might want to use the value zero in some future gact_prob algorithm? Signed-off-by: Kim Nordlund [EMAIL PROTECTED] diff -rub linux-2.6.19-orig/net/sched/act_gact.c linux/net/sched/act_gact.c --- linux-2.6.19-orig/net/sched/act_gact.c 2006-11-29 23:57:37.0 +0200 +++ linux/net/sched/act_gact.c 2006-11-30 15:33:12.0 +0200 @@ -48,14 +48,14 @@ #ifdef CONFIG_GACT_PROB static int gact_net_rand(struct tcf_gact *gact) { - if (net_random() % gact->tcfg_pval) + if (!gact->tcfg_pval || net_random() % gact->tcfg_pval) return gact->tcf_action; return gact->tcfg_paction; } static int gact_determ(struct tcf_gact *gact) { - if (gact->tcf_bstats.packets % gact->tcfg_pval) + if (!gact->tcfg_pval || gact->tcf_bstats.packets % gact->tcfg_pval) return gact->tcf_action; return gact->tcfg_paction; }
Re: [PKT_SCHED] act_gact: division by zero
Nordlund Kim (Nokia-NET/Helsinki) wrote: On Thu, 30 Nov 2006, ext Patrick McHardy wrote: I think it should reject an invalid configuration or handle the zero case correctly by not dividing. You are correct. Not returning -EINVAL, because someone might want to use the value zero in some future gact_prob algorithm? That looks good, thanks.
Re: [PATCH] r8169: Fix iteration variable sign
On Thursday 30 November 2006 12:20, Jeff Garzik wrote: Francois Romieu wrote: This changes the type of variable i in rtl8169_init_one() from unsigned int to int. i is checked for < 0 later, which can never be true for an unsigned type. This results in broken error handling. Signed-off-by: Michael Buesch [EMAIL PROTECTED] Signed-off-by: Francois Romieu [EMAIL PROTECTED] ACK, but doesn't seem to apply to 2.6.19? Should this go into #upstream rather than #upstream-fixes? Hm, I did this against the latest Linus tree. -- Greetings Michael.
[PATCH] bonding: change spinlocks and remove timers in favor of workqueues
The main purpose of this patch is to clean-up the bonding code so that several important operations are not done in the incorrect (softirq) context. Whenever a kernel is compiled with CONFIG_DEBUG_SPINLOCK_SLEEP all sorts of backtraces are spewed to the log since might_sleep will kindly remind us we are doing something in a atomic context when we probably should not. In order to resolve this, the spin_[un]lock_bh needed to be converted to spin_[un]lock and to do that the timers needed to be dropped in favor of workqueues. Since there isn't the chance that this work will be done in a softirq context, the bh-locks aren't needed since we should not be preempted to service the workqueue. Both of those changes are included in this patch. I've done a bit of testing switching between modes and changing some of the important values through sysfs, so I feel that creating and canceling the work is working fine. This code could use some quick testing with 802.3ad since I didn't have access to a switch with that capability, so if someone can verify it I would appreciate it. Signed-off-by: Andy Gospodarek [EMAIL PROTECTED] --- bond_3ad.c |9 +- bond_3ad.h |2 bond_alb.c | 17 +++- bond_alb.h |2 bond_main.c | 215 ++- bond_sysfs.c | 78 ++--- bonding.h| 21 +++-- 7 files changed, 212 insertions(+), 132 deletions(-) diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c index 3fb354d..e65ca19 100644 --- a/drivers/net/bonding/bond_3ad.c +++ b/drivers/net/bonding/bond_3ad.c @@ -2097,8 +2097,10 @@ void bond_3ad_unbind_slave(struct slave * times out, and it selects an aggregator for the ports that are yet not * related to any aggregator, and selects the active aggregator for a bond. 
*/ -void bond_3ad_state_machine_handler(struct bonding *bond) +void bond_3ad_state_machine_handler(void *work_data) { + struct net_device *bond_dev = (struct net_device *)work_data; + struct bonding *bond = bond_dev-priv; struct port *port; struct aggregator *aggregator; @@ -2149,7 +2151,10 @@ void bond_3ad_state_machine_handler(stru } re_arm: - mod_timer((BOND_AD_INFO(bond).ad_timer), jiffies + ad_delta_in_ticks); + bond_work_create(bond_dev, + bond_3ad_state_machine_handler, + bond-ad_work, + ad_delta_in_ticks); out: read_unlock(bond-lock); } diff --git a/drivers/net/bonding/bond_3ad.h b/drivers/net/bonding/bond_3ad.h index 6ad5ad6..4fa16a9 100644 --- a/drivers/net/bonding/bond_3ad.h +++ b/drivers/net/bonding/bond_3ad.h @@ -276,7 +276,7 @@ struct ad_slave_info { void bond_3ad_initialize(struct bonding *bond, u16 tick_resolution, int lacp_fast); int bond_3ad_bind_slave(struct slave *slave); void bond_3ad_unbind_slave(struct slave *slave); -void bond_3ad_state_machine_handler(struct bonding *bond); +void bond_3ad_state_machine_handler(void *work_data); void bond_3ad_adapter_speed_changed(struct slave *slave); void bond_3ad_adapter_duplex_changed(struct slave *slave); void bond_3ad_handle_link_change(struct slave *slave, char link); diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c index 3292316..a163e3d 100644 --- a/drivers/net/bonding/bond_alb.c +++ b/drivers/net/bonding/bond_alb.c @@ -1367,8 +1367,10 @@ out: return 0; } -void bond_alb_monitor(struct bonding *bond) +void bond_alb_monitor(void *work_data) { + struct net_device *bond_dev = (struct net_device *)work_data; + struct bonding *bond = bond_dev-priv; struct alb_bond_info *bond_info = (BOND_ALB_INFO(bond)); struct slave *slave; int i; @@ -1433,7 +1435,7 @@ void bond_alb_monitor(struct bonding *bo * write lock to protect from other code that also * sets the promiscuity. 
*/ - write_lock_bh(bond-curr_slave_lock); + write_lock(bond-curr_slave_lock); if (bond_info-primary_is_promisc (++bond_info-rlb_promisc_timeout_counter = RLB_PROMISC_TIMEOUT)) { @@ -1448,7 +1450,7 @@ void bond_alb_monitor(struct bonding *bo bond_info-primary_is_promisc = 0; } - write_unlock_bh(bond-curr_slave_lock); + write_unlock(bond-curr_slave_lock); if (bond_info-rlb_rebalance) { bond_info-rlb_rebalance = 0; @@ -1471,7 +1473,10 @@ void bond_alb_monitor(struct bonding *bo } re_arm: - mod_timer((bond_info-alb_timer), jiffies + alb_delta_in_ticks); + bond_work_create(bond_dev, + bond_alb_monitor, + bond-alb_work, + alb_delta_in_ticks); out: read_unlock(bond-lock); } @@ -1492,11 +1497,11 @@ int bond_alb_init_slave(struct bonding * /* caller must hold the bond lock for write since the
Re: [Devel] Re: Network virtualization/isolation
Daniel Lezcano wrote: Brian Haley wrote: Eric W. Biederman wrote: I think for cases across network socket namespaces it should be a matter for the rules, to decide if the connection should happen and what error code to return if the connection does not happen. There is a potential in this to have an ambiguous case where two applications can be listening for connections on the same socket on the same port and both will allow the connection. If that is the case I believe the proper definition is the first socket that we find that will accept the connection gets the connection. No. If you try to connect, the destination IP address is assigned to a network namespace. This network namespace is used to leave the listening socket ambiguity. Wouldn't you want to catch this at bind() and/or configuration time and fail? Having overlapping namespaces/rules seems undesirable, since as Herbert said, can get you unexpected behaviour. Overlapping is not a problem, you can have several sockets binded on the same INADDR_ANY/port without ambiguity because the network namespace pointer is added as a new key for sockets lookup, (src addr, src port, dst addr, dst port, net ns pointer). The bind should not be forced to a specific address because you will not be able to connect via 127.0.0.1. So, all this leads to me ask, how to handle 127.0.0.1? For L2 it seems easy. Each namespace gets a tagged lo device. How do you propose to do it for L3, because disabling access to loopback is not a valid option, IMO. I agree that adding a namespace to the (using generic terms) TCB lookup solves the conflict issue. -vlad - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] skge: restore device multicast membership after link down/up
Yukon hardware will lose multicast membership data and promiscuous mode information if a link is disconnected and reconnected without taking the interface down. A call to yukon_reset in yukon_link_down will clear the hardware's multicast list, so it needs to be added back on link up. It does not appear that a similar patch is needed for Genesis hardware, since it does not seem to clear multicast membership when taking the link down. Signed-off-by: Andy Gospodarek [EMAIL PROTECTED] --- skge.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/net/skge.c b/drivers/net/skge.c index 3f1b72e..c02e1f1 100644 --- a/drivers/net/skge.c +++ b/drivers/net/skge.c @@ -1922,6 +1922,10 @@ static void yukon_link_up(struct skge_po gma_write16(hw, port, GM_GP_CTRL, reg); gm_phy_write(hw, port, PHY_MARV_INT_MASK, PHY_M_IS_DEF_MSK); + + /* reset multicast list */ + yukon_set_multicast(skge->netdev); + skge_link_up(skge); }
Re: [Devel] Re: Network virtualization/isolation
Vlad Yasevich wrote: Daniel Lezcano wrote: Brian Haley wrote: Eric W. Biederman wrote: I think for cases across network socket namespaces it should be a matter for the rules, to decide if the connection should happen and what error code to return if the connection does not happen. There is a potential in this to have an ambiguous case where two applications can be listening for connections on the same socket on the same port and both will allow the connection. If that is the case I believe the proper definition is the first socket that we find that will accept the connection gets the connection. No. If you try to connect, the destination IP address is assigned to a network namespace. This network namespace is used to leave the listening socket ambiguity. Wouldn't you want to catch this at bind() and/or configuration time and fail? Having overlapping namespaces/rules seems undesirable, since as Herbert said, can get you unexpected behaviour. Overlapping is not a problem, you can have several sockets binded on the same INADDR_ANY/port without ambiguity because the network namespace pointer is added as a new key for sockets lookup, (src addr, src port, dst addr, dst port, net ns pointer). The bind should not be forced to a specific address because you will not be able to connect via 127.0.0.1. So, all this leads to me ask, how to handle 127.0.0.1? For L2 it seems easy. Each namespace gets a tagged lo device. How do you propose to do it for L3, because disabling access to loopback is not a valid option, IMO. There are 2 options: 1 - Dmitry Mishin proposed to use the l2 mechanism and reinstantiate a new loopback device, I didn't tested that yet, perhaps there are issues with non-127.0.0.1 loopback traffic and routes creation, I don't know. 2 - add the pointer of the network namespace who has originated the packet into the skbuff when the traffic is for 127.0.0.1, so when the packet arrive to IP, it has the namespace destination information because source == destination. 
I tested it and it works fine without noticeable overhead and this can be done with a very few lines of code. -- Daniel
Re: [Devel] Re: Network virtualization/isolation
On Thu, Nov 30, 2006 at 05:38:16PM +0100, Daniel Lezcano wrote: Vlad Yasevich wrote: Daniel Lezcano wrote: Brian Haley wrote: Eric W. Biederman wrote: I think for cases across network socket namespaces it should be a matter for the rules, to decide if the connection should happen and what error code to return if the connection does not happen. There is a potential in this to have an ambiguous case where two applications can be listening for connections on the same socket on the same port and both will allow the connection. If that is the case I believe the proper definition is the first socket that we find that will accept the connection gets the connection. No. If you try to connect, the destination IP address is assigned to a network namespace. This network namespace is used to leave the listening socket ambiguity. Wouldn't you want to catch this at bind() and/or configuration time and fail? Having overlapping namespaces/rules seems undesirable, since as Herbert said, can get you unexpected behaviour. Overlapping is not a problem, you can have several sockets binded on the same INADDR_ANY/port without ambiguity because the network namespace pointer is added as a new key for sockets lookup, (src addr, src port, dst addr, dst port, net ns pointer). The bind should not be forced to a specific address because you will not be able to connect via 127.0.0.1. So, all this leads to me ask, how to handle 127.0.0.1? For L2 it seems easy. Each namespace gets a tagged lo device. How do you propose to do it for L3, because disabling access to loopback is not a valid option, IMO. There are 2 options: 1 - Dmitry Mishin proposed to use the l2 mechanism and reinstantiate a new loopback device, I didn't tested that yet, perhaps there are issues with non-127.0.0.1 loopback traffic and routes creation, I don't know. 
2 - add the pointer of the network namespace who has originated the packet into the skbuff when the traffic is for 127.0.0.1, so when the packet arrive to IP, it has the namespace destination information because source == destination. I tested it and it works fine without noticeable overhead and this can be done with a very few lines of code. there is a third option, which is a little 'hacky' but works quite fine too: use different loopback addresses for each 'guest' e.g. 127.x.y.z and 'map' them to 127.0.0.1 (or the other way round) whenever appropriate advantages: - doesn't require any skb tagging - doesn't change the routing in any way - allows isolated loopback connections disadvantages: - blocks those special addresses (127.x.y.z) - requires the mapping at bind/receive best, Herbert -- Daniel ___ Containers mailing list [EMAIL PROTECTED] https://lists.osdl.org/mailman/listinfo/containers - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: pktgen
Robert Olsson wrote: Hello! Seems you found a race when rmmod is done before it's fully started Try: diff --git a/net/core/pktgen.c b/net/core/pktgen.c index 733d86d..ac0b4b1 100644 --- a/net/core/pktgen.c +++ b/net/core/pktgen.c @@ -160,7 +160,7 @@ #include <asm/div64.h> /* do_div */ #include <asm/timex.h> -#define VERSION "pktgen v2.68: Packet Generator for packet performance testing.\n" +#define VERSION "pktgen v2.69: Packet Generator for packet performance testing.\n" /* #define PG_DEBUG(a) a */ #define PG_DEBUG(a) @@ -3673,6 +3673,8 @@ static void __exit pg_cleanup(void) struct list_head *q, *n; wait_queue_head_t queue; init_waitqueue_head(&queue); + + schedule_timeout_interruptible(msecs_to_jiffies(125)); /* Stop all interfaces threads */ That strikes me as a hack... surely there is a better method than just adding a sleep? Ben -- Ben Greear [EMAIL PROTECTED] Candela Technologies Inc http://www.candelatech.com
Re: [patch 3/6] 2.6.18: sb1250-mac: Phylib IRQ handling fixes
On Mon, 23 Oct 2006, Maciej W. Rozycki wrote: I'm not too enthusiastic about requiring the ethernet drivers to call phy_disconnect in a separate thread after close is called. Assuming there's not some sort of squash work queue function that can be invoked with rtnl_lock held, I think phy_disconnect should schedule itself to flush the queue. This would also require that mdiobus_unregister hold off on freeing phydevs if any of the phys were still waiting for pending flush_pending calls to finish. Which would, in turn, require mdiobus_unregister to schedule cleaning up memory for some later time. This could work, indeed. I'm not enthusiastic about that implementation, either, but it maintains the abstractions I consider important for this code. The ethernet driver should not need to know what structures the PHY lib uses to implement its interrupt handling, and how to work around their failings, IMHO. Agreed. So what's the plan? Here's a new version of the patch that addresses your other concerns. Maciej patch-mips-2.6.18-20060920-phy-irq-18 diff -up --recursive --new-file linux-mips-2.6.18-20060920.macro/drivers/net/phy/phy.c linux-mips-2.6.18-20060920/drivers/net/phy/phy.c --- linux-mips-2.6.18-20060920.macro/drivers/net/phy/phy.c 2006-08-05 04:58:46.0 + +++ linux-mips-2.6.18-20060920/drivers/net/phy/phy.c2006-11-30 17:58:37.0 + @@ -7,6 +7,7 @@ * Author: Andy Fleming * * Copyright (c) 2004 Freescale Semiconductor, Inc. + * Copyright (c) 2006 Maciej W. Rozycki * * This program is free software; you can redistribute it and/or modify it * under the terms of the GNU General Public License as published by the @@ -32,6 +33,8 @@ #include linux/mii.h #include linux/ethtool.h #include linux/phy.h +#include linux/timer.h +#include linux/workqueue.h #include asm/io.h #include asm/irq.h @@ -484,6 +487,9 @@ static irqreturn_t phy_interrupt(int irq { struct phy_device *phydev = phy_dat; + if (PHY_HALTED == phydev-state) + return IRQ_NONE;/* It can't be ours. 
*/ + /* The MDIO bus is not allowed to be written in interrupt * context, so we need to disable the irq here. A work * queue will write the PHY to disable and clear the @@ -577,6 +583,13 @@ int phy_stop_interrupts(struct phy_devic if (err) phy_error(phydev); + /* +* Finish any pending work; we might have been scheduled +* to be called from keventd ourselves, though. +*/ + if (!current_is_keventd()) + flush_scheduled_work(); + free_irq(phydev-irq, phydev); return err; @@ -596,14 +609,17 @@ static void phy_change(void *data) goto phy_err; spin_lock(phydev-lock); + if ((PHY_RUNNING == phydev-state) || (PHY_NOLINK == phydev-state)) phydev-state = PHY_CHANGELINK; - spin_unlock(phydev-lock); enable_irq(phydev-irq); /* Reenable interrupts */ - err = phy_config_interrupt(phydev, PHY_INTERRUPT_ENABLED); + if (PHY_HALTED != phydev-state) + err = phy_config_interrupt(phydev, PHY_INTERRUPT_ENABLED); + + spin_unlock(phydev-lock); if (err) goto irq_enable_err; @@ -624,15 +640,15 @@ void phy_stop(struct phy_device *phydev) if (PHY_HALTED == phydev-state) goto out_unlock; - if (phydev-irq != PHY_POLL) { - /* Clear any pending interrupts */ - phy_clear_interrupt(phydev); + phydev-state = PHY_HALTED; + if (phydev-irq != PHY_POLL) { /* Disable PHY Interrupts */ phy_config_interrupt(phydev, PHY_INTERRUPT_DISABLED); - } - phydev-state = PHY_HALTED; + /* Clear any pending interrupts */ + phy_clear_interrupt(phydev); + } out_unlock: spin_unlock(phydev-lock); - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/5] NetXen: 1G/10G Ethernet Driver updates
The first patch sent by Amit on 29 Nov applied, but the following three patches did not apply to Jeff's #upstream tree. Here are the corrected 2nd, 3rd, and 4th patches, with a repeat of the 1st for completeness. There is a 5th patch which fixes a bug caused by casting a 16-bit variable into a 32-bit one and using the address. -- Don Fry [EMAIL PROTECTED]
[PATCH 1/5] NetXen: Fixed /sys mapping between device and driver
Signed-off-by: Amit S. Kale [EMAIL PROTECTED] diff -Nupr netdev-2.6/drivers/net/netxen.orig/netxen_nic_main.c netdev-2.6/drivers/net/netxen/netxen_nic_main.c --- netdev-2.6/drivers/net/netxen.orig/netxen_nic_main.c 2006-11-29 12:13:58.0 -0800 +++ netdev-2.6/drivers/net/netxen/netxen_nic_main.c 2006-11-30 09:17:51.0 -0800 @@ -273,6 +273,7 @@ netxen_nic_probe(struct pci_dev *pdev, c } SET_MODULE_OWNER(netdev); + SET_NETDEV_DEV(netdev, &pdev->dev); port = netdev_priv(netdev); port->netdev = netdev; @@ -1043,7 +1044,7 @@ static int netxen_nic_poll(struct net_de netxen_nic_enable_int(adapter); } - return (done ? 0 : 1); + return !done; } #ifdef CONFIG_NET_POLL_CONTROLLER -- Don Fry [EMAIL PROTECTED]
[PATCH 3/5] NetXen: 64-bit memory fixes and driver cleanup
NetXen: 1G/10G Ethernet Driver updates - These fixes take care of driver on machines with 4G memory - Driver cleanup Signed-off-by: Amit S. Kale [EMAIL PROTECTED] Signed-off-by: Don Fry [EMAIL PROTECTED] diff -Nupr netdev-2.6/drivers/net/netxen.two/netxen_nic_ethtool.c netdev-2.6/drivers/net/netxen/netxen_nic_ethtool.c --- netdev-2.6/drivers/net/netxen.two/netxen_nic_ethtool.c 2006-11-30 09:40:47.0 -0800 +++ netdev-2.6/drivers/net/netxen/netxen_nic_ethtool.c 2006-11-30 09:46:16.0 -0800 @@ -6,12 +6,12 @@ * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. - * + * * This program is distributed in the hope that it will be useful, but * WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. - * + * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 59 Temple Place - Suite 330, Boston, @@ -118,7 +118,7 @@ netxen_nic_get_drvinfo(struct net_device u32 fw_minor = 0; u32 fw_build = 0; - strncpy(drvinfo-driver, netxen_nic, 32); + strncpy(drvinfo-driver, netxen_nic_driver_name, 32); strncpy(drvinfo-version, NETXEN_NIC_LINUX_VERSIONID, 32); fw_major = readl(NETXEN_CRB_NORMALIZE(adapter, NETXEN_FW_VERSION_MAJOR)); @@ -211,7 +211,6 @@ netxen_nic_get_settings(struct net_devic printk(ERROR: Unsupported board model %d\n, (netxen_brdtype_t) boardinfo-board_type); return -EIO; - } return 0; @@ -461,20 +460,22 @@ netxen_nic_get_ringparam(struct net_devi { struct netxen_port *port = netdev_priv(dev); struct netxen_adapter *adapter = port-adapter; - int i, j; + int i; ring-rx_pending = 0; + ring-rx_jumbo_pending = 0; for (i = 0; i MAX_RCV_CTX; ++i) { - for (j = 0; j NUM_RCV_DESC_RINGS; j++) - ring-rx_pending += - 
adapter-recv_ctx[i].rcv_desc[j].rcv_pending; + ring-rx_pending += adapter-recv_ctx[i]. + rcv_desc[RCV_DESC_NORMAL_CTXID].rcv_pending; + ring-rx_jumbo_pending += adapter-recv_ctx[i]. + rcv_desc[RCV_DESC_JUMBO_CTXID].rcv_pending; } ring-rx_max_pending = adapter-max_rx_desc_count; ring-tx_max_pending = adapter-max_tx_desc_count; + ring-rx_jumbo_max_pending = adapter-max_jumbo_rx_desc_count; ring-rx_mini_max_pending = 0; ring-rx_mini_pending = 0; - ring-rx_jumbo_max_pending = 0; ring-rx_jumbo_pending = 0; } diff -Nupr netdev-2.6/drivers/net/netxen.two/netxen_nic.h netdev-2.6/drivers/net/netxen/netxen_nic.h --- netdev-2.6/drivers/net/netxen.two/netxen_nic.h 2006-11-30 09:40:47.0 -0800 +++ netdev-2.6/drivers/net/netxen/netxen_nic.h 2006-11-30 09:46:16.0 -0800 @@ -6,12 +6,12 @@ * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. - * + * * This program is distributed in the hope that it will be useful, but * WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. 
- * + * * You should have received a copy of the GNU General Public License * along with this program; if not, write to the Free Software * Foundation, Inc., 59 Temple Place - Suite 330, Boston, @@ -89,8 +89,8 @@ * normalize a 64MB crb address to 32MB PCI window * To use NETXEN_CRB_NORMALIZE, window _must_ be set to 1 */ -#define NETXEN_CRB_NORMAL(reg)\ - (reg) - NETXEN_CRB_PCIX_HOST2 + NETXEN_CRB_PCIX_HOST +#define NETXEN_CRB_NORMAL(reg) \ + ((reg) - NETXEN_CRB_PCIX_HOST2 + NETXEN_CRB_PCIX_HOST) #define NETXEN_CRB_NORMALIZE(adapter, reg) \ pci_base_offset(adapter, NETXEN_CRB_NORMAL(reg)) @@ -164,7 +164,7 @@ enum { #define MAX_CMD_DESCRIPTORS1024 #define MAX_RCV_DESCRIPTORS32768 -#define MAX_JUMBO_RCV_DESCRIPTORS 1024 +#define MAX_JUMBO_RCV_DESCRIPTORS 4096 #define MAX_RCVSTATUS_DESCRIPTORS MAX_RCV_DESCRIPTORS #define MAX_JUMBO_RCV_DESC MAX_JUMBO_RCV_DESCRIPTORS #define MAX_RCV_DESC MAX_RCV_DESCRIPTORS @@ -592,6 +592,16 @@ struct netxen_skb_frag { u32 length; }; +/* Bounce buffer index */ +struct bounce_index { + /* Index of a buffer */ +
[PATCH 5/5] NetXen: Fix cast error
Fix for pointer casting error. Signed-off-by: Don Fry [EMAIL PROTECTED] diff -Nupr netdev-2.6/drivers/net/netxen.four/netxen_nic_hw.c netdev-2.6/drivers/net/netxen/netxen_nic_hw.c --- netdev-2.6/drivers/net/netxen.four/netxen_nic_hw.c 2006-11-30 10:06:24.0 -0800 +++ netdev-2.6/drivers/net/netxen/netxen_nic_hw.c 2006-11-30 10:31:00.0 -0800 @@ -867,7 +867,7 @@ void netxen_nic_set_link_parameters(stru { struct netxen_adapter *adapter = port->adapter; __le32 status; - u16 autoneg; + __le32 autoneg = 0; __le32 mode; netxen_nic_read_w0(adapter, NETXEN_NIU_MODE, &mode); @@ -907,7 +907,7 @@ void netxen_nic_set_link_parameters(stru adapter->phy_read(adapter, port->portnum, NETXEN_NIU_GB_MII_MGMT_ADDR_AUTONEG, - (__le32 *) &autoneg) != 0) + &autoneg) != 0) port->link_autoneg = autoneg; } else goto link_down; -- Don Fry [EMAIL PROTECTED]
Re: [RFC][PATCH 5/6] slab: kmem_cache_objs_to_pages()
On Thu, 30 Nov 2006, Peter Zijlstra wrote: +unsigned int kmem_cache_objs_to_pages(struct kmem_cache *cachep, int nr) +{ + return ((nr + cachep->num - 1) / cachep->num) << cachep->gfporder; cachep->num refers to the number of objects in a slab of gfporder. Thus return (nr + cachep->num - 1) / cachep->num; But then this is a very optimistic estimate that assumes a single node and no free objects in between.
[PATCH 0/5 addendum] NetXen
The NetXen patches fix many problems in the current #upstream version of the driver. It has warts and probably lots of bugs still, but it is better than what is queued for mainline inclusion at this time. Please apply to 2.6.20. -- Don Fry [EMAIL PROTECTED]
Re: [RFC][PATCH 5/6] slab: kmem_cache_objs_to_pages()
On Thu, 2006-11-30 at 10:55 -0800, Christoph Lameter wrote: On Thu, 30 Nov 2006, Peter Zijlstra wrote: +unsigned int kmem_cache_objs_to_pages(struct kmem_cache *cachep, int nr) +{ + return ((nr + cachep->num - 1) / cachep->num) << cachep->gfporder; cachep->num refers to the number of objects in a slab of gfporder. Ah, my bad, thanks! Thus return (nr + cachep->num - 1) / cachep->num; But then this is a very optimistic estimate that assumes a single node and no free objects in between. Right, perhaps my bad in wording the intent; the needed information is how many more pages would I need to grow the slab with in order to store so many new objects.
Re: [RFC][PATCH 5/6] slab: kmem_cache_objs_to_pages()
On Thu, 30 Nov 2006, Peter Zijlstra wrote: Right, perhaps my bad in wording the intent; the needed information is how many more pages would I need to grow the slab with in order to store so many new objects. Would you not have to take objects currently available in caches into account? If you are short on memory then a flushing of all the caches may give you the memory you need (especially on a system with a large number of processors).
Re: [RFC][PATCH 1/6] mm: slab allocation fairness
On Thu, 2006-11-30 at 10:52 -0800, Christoph Lameter wrote: On Thu, 30 Nov 2006, Peter Zijlstra wrote: The slab has some unfairness wrt gfp flags; when the slab is grown the gfp flags are used to allocate more memory, however when there is slab space available, gfp flags are ignored. Thus it is possible for less critical slab allocations to succeed and gobble up precious memory. The gfp flags are ignored if there are 1) objects in the per cpu, shared or alien caches 2) objects in partial or free slabs in the per node queues. Yeah, basically as long as free objects can be found, no matter how 'hard' it was to obtain these objects. This patch avoids this by keeping track of the allocation hardness when growing. This is then compared to the current slab alloc's gfp flags. The approach is to force the allocation of an additional slab to increase the number of free slabs? The next free will drop the number of free slabs back again to the allowed amount. No, the forced allocation is to test the allocation hardness at that point in time. I could not think of another way to test that than to actually do an allocation.
Re: [RFC][PATCH 5/6] slab: kmem_cache_objs_to_pages()
On Thu, 2006-11-30 at 11:06 -0800, Christoph Lameter wrote: On Thu, 30 Nov 2006, Peter Zijlstra wrote: Right, perhaps my bad in wording the intent; the needed information is how many more pages would I need to grow the slab with in order to store so many new objects. Would you not have to take objects currently available in caches into account? If you are short on memory then a flushing of all the caches may give you the memory you need (especially on a system with a large number of processors). Sure, but this gives a safe upper bound.
Re: [RFC][PATCH 1/6] mm: slab allocation fairness
On Thu, 2006-11-30 at 10:52 -0800, Christoph Lameter wrote: I would think that one would need a rank with each cached object and free slab in order to do this the right way. Allocation hardness is a temporal attribute, i.e. it changes over time. Hence I do it per slab.
Re: [RFC][PATCH 1/6] mm: slab allocation fairness
On Thu, 30 Nov 2006, Peter Zijlstra wrote: No, the forced allocation is to test the allocation hardness at that point in time. I could not think of another way to test that than to actually do an allocation. Typically we do this by checking the number of free pages in a zone compared to the high/low limits. See mmzone.h.
Re: [RFC][PATCH 1/6] mm: slab allocation fairness
On Thu, 30 Nov 2006, Peter Zijlstra wrote: On Thu, 2006-11-30 at 10:52 -0800, Christoph Lameter wrote: I would think that one would need a rank with each cached object and free slab in order to do this the right way. Allocation hardness is a temporal attribute, ie. it changes over time. Hence I do it per slab. cached objects are also temporal and change over time.
Re: [PATCH] bonding: change spinlocks and remove timers in favor of workqueues
Andy Gospodarek [EMAIL PROTECTED] wrote: The main purpose of this patch is to clean up the bonding code so that several important operations are not done in the incorrect (softirq) context. Whenever a kernel is compiled with CONFIG_DEBUG_SPINLOCK_SLEEP all sorts of backtraces are spewed to the log, since might_sleep will kindly remind us we are doing something in an atomic context when we probably should not. [...] I'll look at the patch in detail in a bit (and I have 802.3ad switches to test on), but on first glance, does this not still hold a lock during failover operations in balance-alb mode? I.e., this doesn't change the locking model, it just moves the timers to workqueues and relaxes the _bh locking. The really problematic case calls dev_set_mac_address() with a lock held, and I don't see that this patch changes that behavior. Do you still get the lock warnings during link fail / recovery in balance-alb mode? Also, on a CONFIG_PREEMPT kernel, you'll still get the sleep warnings, since in_atomic() will trip __might_sleep() for any lock (if I'm reading things correctly). Don't get me wrong, this (switching to workqueues, etc.) needs to be done, but I don't think this patch really resolves the underlying problem that causes the warnings. Let me see if I can dust off the extensive patch that does change the locking model; I'll see if I can bring it up to the current git and post it. -J --- -Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]
Re: [RFC][PATCH 1/6] mm: slab allocation fairness
On Thu, 2006-11-30 at 11:33 -0800, Christoph Lameter wrote: On Thu, 30 Nov 2006, Peter Zijlstra wrote: No, the forced allocation is to test the allocation hardness at that point in time. I could not think of another way to test that than to actually do an allocation. Typically we do this by checking the number of free pages in a zone compared to the high/low limits. See mmzone.h. True, I did think about that and started out that way, but saw myself duplicating a lot of the page allocation code. I'll give it another try... see if I can factor out the common parts without too much duplication.
Re: [RFC][PATCH 1/6] mm: slab allocation fairness
On Thu, 2006-11-30 at 11:37 -0800, Christoph Lameter wrote: On Thu, 30 Nov 2006, Peter Zijlstra wrote: On Thu, 2006-11-30 at 10:52 -0800, Christoph Lameter wrote: I would think that one would need a rank with each cached object and free slab in order to do this the right way. Allocation hardness is a temporal attribute, ie. it changes over time. Hence I do it per slab. cached objects are also temporal and change over time. Sure, but there is nothing wrong with using a slab page with a lower allocation rank when there is memory aplenty. I'm just not seeing how keeping all the individual page ranks would make this better. The only thing that matters is the actual free pages limit, not that of a few allocations ago. The stored rank is a safe shortcut, for it allows harder allocations to use easily obtainable free space, not the other way around.
Re: [RFC][PATCH 1/6] mm: slab allocation fairness
On Thu, 30 Nov 2006, Peter Zijlstra wrote: Sure, but there is nothing wrong with using a slab page with a lower allocation rank when there is memory aplenty. What does a slab page with a lower allocation rank mean? Slab pages have no allocation ranks that I am aware of.
Re: [PATCH] bonding: change spinlocks and remove timers in favor of workqueues
On 11/30/06, Jay Vosburgh [EMAIL PROTECTED] wrote: Andy Gospodarek [EMAIL PROTECTED] wrote: The main purpose of this patch is to clean up the bonding code so that several important operations are not done in the incorrect (softirq) context. Whenever a kernel is compiled with CONFIG_DEBUG_SPINLOCK_SLEEP all sorts of backtraces are spewed to the log, since might_sleep will kindly remind us we are doing something in an atomic context when we probably should not. [...] I'll look at the patch in detail in a bit (and I have 802.3ad switches to test on), but on first glance, does this not still hold a lock during failover operations in balance-alb mode? I.e., this doesn't change the locking model, it just moves the timers to workqueues and relaxes the _bh locking. Jay, Thanks for the response. You are correct. This patch really doesn't change functionality -- in fact that was one of the goals of this patch. I wanted to simply start the conversion since it seemed like 'the right way' to do things going forward. The really problematic case calls dev_set_mac_address() with a lock held, and I don't see that this patch changes that behavior. Do you still get the lock warnings during link fail / recovery in balance-alb mode? I no longer get lock warnings indicating that I'm taking a lock in an invalid context, but lately I've been seeing rtnl lock assertion failures in balance-alb mode whenever a call to dev_set_mac_address is made. It seems to be expected that the rtnl lock is taken, and that isn't the case anymore. Also, on a CONFIG_PREEMPT kernel, you'll still get the sleep warnings, since in_atomic() will trip __might_sleep() for any lock (if I'm reading things correctly). Based on my reading, you will still only get these warnings if CONFIG_DEBUG_SPINLOCK_SLEEP=y and CONFIG_PREEMPT=y. Since most people never test with CONFIG_DEBUG_SPINLOCK_SLEEP=y, they don't see these.
Don't get me wrong, this (switching to workqueues, etc) needs to be done, but I don't think this patch really resolves the underlying problem that causes the warnings. Agreed. I didn't want to tackle too many of the issues with one giant patch. Doing them in smallish steps seemed like a better way to go. Let me see if I can dust off the extensive patch that does change the locking model; I'll see if I can bring it up to the current git and post it. It would seem ideal if we could combine the two into one big patch. -andy -J --- -Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]
Re: [RFC][PATCH 1/6] mm: slab allocation fairness
On Thu, 2006-11-30 at 12:11 -0800, Christoph Lameter wrote: On Thu, 30 Nov 2006, Peter Zijlstra wrote: Sure, but there is nothing wrong with using a slab page with a lower allocation rank when there is memory aplenty. What does a slab page with a lower allocation rank mean? Slab pages have no allocation ranks that I am aware of. I just added allocation rank and didn't you suggest tracking it for all slab pages instead of per slab? The rank is an expression of how hard it was to get that page, with 0 being the hardest allocation (ALLOC_NO_WATERMARK) and 16 the easiest (ALLOC_WMARK_HIGH). I store the rank of the last allocated page and retest the rank when a gfp flag indicates a higher rank, that is when the current slab allocation would have failed to grow the slab under the conditions of the previous allocation.
Re: [RFC][PATCH 1/6] mm: slab allocation fairness
On Thu, 30 Nov 2006, Peter Zijlstra wrote: Sure, but there is nothing wrong with using a slab page with a lower allocation rank when there is memory aplenty. What does a slab page with a lower allocation rank mean? Slab pages have no allocation ranks that I am aware of. I just added allocation rank and didn't you suggest tracking it for all slab pages instead of per slab? Yes, but that is not in place, so I was wondering what you were talking about. It would help to have some longer text describing what you intend to do and how rank would work throughout the VM.
Re: Linux 2.6.19
On Thursday 30 November 2006 03:15, David Miller wrote: From: Phil Oester [EMAIL PROTECTED] Date: Wed, 29 Nov 2006 17:49:04 -0800 Getting an oops on boot here, caused by commit e81c73596704793e73e6dbb478f41686f15a4b34 titled [NET]: Fix MAX_HEADER setting. Reverting that patch fixes things up for me. Dave? I suspect that it might be because I removed the IPV6 ifdef from the list, but I can't imagine why that would matter other than due to a bug in the IPV6 stack. Indeed. Looking at ndisc_send_rs() I wonder if it miscalculates 'len' or similar and the old MAX_HEADER setting was merely papering over this bug. In fact it does; the NDISC code is using MAX_HEADER incorrectly. It needs to explicitly allocate space for the struct ipv6hdr in 'len'. Luckily the TCP ipv6 code was doing it right. What a horrible bug; this patch should fix it. Let me know if it doesn't, thanks: I also encountered this bug (it wasn't there in -rc6). The patch also fixes it for me. regards -- --- Malte Schröder [EMAIL PROTECTED] ICQ# 68121508 ---
Re: Broken commit: [NETFILTER]: ipt_REJECT: remove largely duplicate route_reverse function
On Wed, Nov 29, 2006 at 04:16:06PM +0100, Krzysztof Halasa wrote: Krzysztof Halasa [EMAIL PROTECTED] writes: I wound't care less btw. s/wound/couldn/, eh those foreign languages... So, you say, you don't care about David Miller's credits? It isn't nice. He could be very disappointed... Jarek P.
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
On Wed, Nov 29, 2006 at 07:56:58PM -0600, Wenji Wu wrote: Yes, when CONFIG_PREEMPT is disabled, the problem won't happen. That is why I put for 2.6 desktop, low-latency desktop in the uploaded paper. This problem happens in the 2.6 Desktop and Low-latency Desktop. CONFIG_PREEMPT is only for people that are in it for the feeling. There is no real world advantage to it and we should probably remove it again.
[PATCH] mv643xx add missing brackets
Hello,

This patch adds missing brackets.

Signed-off-by: Mariusz Kozlowski [EMAIL PROTECTED]

 include/linux/mv643xx.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- linux-2.6.19-rc6-mm2-a/include/linux/mv643xx.h	2006-11-16 05:03:40.0 +0100
+++ linux-2.6.19-rc6-mm2-b/include/linux/mv643xx.h	2006-11-30 01:10:53.0 +0100
@@ -724,7 +724,7 @@
 #define MV643XX_ETH_RX_FIFO_URGENT_THRESHOLD_REG(port)  (0x2470 + (port<<10))
 #define MV643XX_ETH_TX_FIFO_URGENT_THRESHOLD_REG(port)  (0x2474 + (port<<10))
 #define MV643XX_ETH_RX_MINIMAL_FRAME_SIZE_REG(port)     (0x247c + (port<<10))
-#define MV643XX_ETH_RX_DISCARDED_FRAMES_COUNTER(port)   (0x2484 + (port<<10)
+#define MV643XX_ETH_RX_DISCARDED_FRAMES_COUNTER(port)   (0x2484 + (port<<10))
 #define MV643XX_ETH_PORT_DEBUG_0_REG(port)              (0x248c + (port<<10))
 #define MV643XX_ETH_PORT_DEBUG_1_REG(port)              (0x2490 + (port<<10))
 #define MV643XX_ETH_PORT_INTERNAL_ADDR_ERROR_REG(port)  (0x2494 + (port<<10))
@@ -1135,7 +1135,7 @@ struct mv64xxx_i2c_pdata {
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_1 (1<<19)
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_2 (1<<20)
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_3 ((1<<20) | (1<<19))
-#define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_4 ((1<<21)
+#define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_4 ((1<<21))
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_5 ((1<<21) | (1<<19))
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_6 ((1<<21) | (1<<20))
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_7 ((1<<21) | (1<<20) | (1<<19))

-- 
Regards,
Mariusz Kozlowski
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
On Thu, Nov 30, 2006 at 08:35:04AM +0100, Ingo Molnar ([EMAIL PROTECTED]) wrote: what was observed here were the effects of completely throttling TCP processing for a given socket. I think such throttling can in fact be desirable: there is a /reason/ why the process context was preempted: in that load scenario there was 10 times more processing requested from the CPU than it can possibly service. It's a serious overload situation and it's the scheduler's task to prioritize between workloads! normally such kind of throttling of the TCP stack for this particular socket does not happen. Note that there's no performance lost: we dont do TCP processing because there are /9 other tasks for this CPU to run/, and the scheduler has a tough choice. Now i agree that there are more intelligent ways to throttle and less intelligent ways to throttle, but the notion to allow a given workload 'steal' CPU time from other workloads by allowing it to push its processing into a softirq is i think unfair. (and this issue is partially addressed by my softirq threading patches in -rt :-) Isn't the provided solution just an in-kernel variant of SCHED_FIFO set from userspace? Why should the kernel be able to mark some users as having higher priority? What if the workload of the system is targeted not at maximum TCP performance, but at maximum other-task performance, which will be broken with the provided patch? Ingo -- Evgeniy Polyakov
Re: [PATCH] mv643xx add missing brackets
On Thu, Nov 30, 2006 at 10:35:37AM +0100, Mariusz Kozlowski wrote:

 Hello,

 This patch adds missing brackets.

 Signed-off-by: Mariusz Kozlowski [EMAIL PROTECTED]

  include/linux/mv643xx.h | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

 --- linux-2.6.19-rc6-mm2-a/include/linux/mv643xx.h	2006-11-16 05:03:40.0 +0100
 +++ linux-2.6.19-rc6-mm2-b/include/linux/mv643xx.h	2006-11-30 01:10:53.0 +0100
 @@ -724,7 +724,7 @@
  #define MV643XX_ETH_RX_FIFO_URGENT_THRESHOLD_REG(port)  (0x2470 + (port<<10))
  #define MV643XX_ETH_TX_FIFO_URGENT_THRESHOLD_REG(port)  (0x2474 + (port<<10))
  #define MV643XX_ETH_RX_MINIMAL_FRAME_SIZE_REG(port)     (0x247c + (port<<10))
 -#define MV643XX_ETH_RX_DISCARDED_FRAMES_COUNTER(port)   (0x2484 + (port<<10)
 +#define MV643XX_ETH_RX_DISCARDED_FRAMES_COUNTER(port)   (0x2484 + (port<<10))

Good. Thanks.

  #define MV643XX_ETH_PORT_DEBUG_0_REG(port)              (0x248c + (port<<10))
  #define MV643XX_ETH_PORT_DEBUG_1_REG(port)              (0x2490 + (port<<10))
  #define MV643XX_ETH_PORT_INTERNAL_ADDR_ERROR_REG(port)  (0x2494 + (port<<10))
 @@ -1135,7 +1135,7 @@ struct mv64xxx_i2c_pdata {
  #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_1 (1<<19)
  #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_2 (1<<20)
  #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_3 ((1<<20) | (1<<19))
 -#define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_4 ((1<<21)
 +#define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_4 ((1<<21))

Mariusz, please remove the extra parenthesis instead of adding an extra one, like:

#define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_4 (1<<21)

and resubmit.

Thanks,
-Dale
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
Evgeniy Polyakov wrote: On Thu, Nov 30, 2006 at 08:35:04AM +0100, Ingo Molnar ([EMAIL PROTECTED]) wrote: Isn't the provided solution just an in-kernel variant of SCHED_FIFO set from userspace? Why should the kernel be able to mark some users as having higher priority? What if the workload of the system is targeted not at maximum TCP performance, but at maximum other-task performance, which will be broken with the provided patch? David's line of thinking for a solution sounds better to me. This patch does not prevent the process from being preempted (for potentially a long time), by any means. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
On Thu, Nov 30, 2006 at 09:07:42PM +1100, Nick Piggin ([EMAIL PROTECTED]) wrote: Isn't the provided solution just an in-kernel variant of SCHED_FIFO set from userspace? Why should the kernel be able to mark some users as having higher priority? What if the workload of the system is targeted not at maximum TCP performance, but at maximum other-task performance, which will be broken with the provided patch. David's line of thinking for a solution sounds better to me. This patch does not prevent the process from being preempted (for potentially a long time), by any means. It steals timeslices from other processes to complete the tcp_recvmsg() task, and only when it does so for too long will it be preempted. Processing the backlog queue on behalf of need_resched() will break fairness too - the processing itself can take a lot of time, so the process can be scheduled away in that part as well. -- Evgeniy Polyakov
Re: [PATCH] mv643xx add missing brackets
 +#define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_4 ((1<<21))

 Mariusz, please remove the extra parenthesis instead of adding an extra one, like:

 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_4 (1<<21)

 and resubmit.

Sure. Here it goes. Second try:

This patch fixes some mv643xx macros.

Signed-off-by: Mariusz Kozlowski [EMAIL PROTECTED]

 include/linux/mv643xx.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- linux-2.6.19-rc6-mm2-a/include/linux/mv643xx.h	2006-11-16 05:03:40.0 +0100
+++ linux-2.6.19-rc6-mm2-b/include/linux/mv643xx.h	2006-11-30 11:30:14.0 +0100
@@ -724,7 +724,7 @@
 #define MV643XX_ETH_RX_FIFO_URGENT_THRESHOLD_REG(port)  (0x2470 + (port<<10))
 #define MV643XX_ETH_TX_FIFO_URGENT_THRESHOLD_REG(port)  (0x2474 + (port<<10))
 #define MV643XX_ETH_RX_MINIMAL_FRAME_SIZE_REG(port)     (0x247c + (port<<10))
-#define MV643XX_ETH_RX_DISCARDED_FRAMES_COUNTER(port)   (0x2484 + (port<<10)
+#define MV643XX_ETH_RX_DISCARDED_FRAMES_COUNTER(port)   (0x2484 + (port<<10))
 #define MV643XX_ETH_PORT_DEBUG_0_REG(port)              (0x248c + (port<<10))
 #define MV643XX_ETH_PORT_DEBUG_1_REG(port)              (0x2490 + (port<<10))
 #define MV643XX_ETH_PORT_INTERNAL_ADDR_ERROR_REG(port)  (0x2494 + (port<<10))
@@ -1135,7 +1135,7 @@ struct mv64xxx_i2c_pdata {
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_1 (1<<19)
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_2 (1<<20)
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_3 ((1<<20) | (1<<19))
-#define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_4 ((1<<21)
+#define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_4 (1<<21)
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_5 ((1<<21) | (1<<19))
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_6 ((1<<21) | (1<<20))
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_7 ((1<<21) | (1<<20) | (1<<19))

-- 
Regards,
Mariusz Kozlowski
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
* Evgeniy Polyakov [EMAIL PROTECTED] wrote: David's line of thinking for a solution sounds better to me. This patch does not prevent the process from being preempted (for potentially a long time), by any means. It steals timeslices from other processes to complete the tcp_recvmsg() task, and only when it does so for too long will it be preempted. Processing the backlog queue on behalf of need_resched() will break fairness too - the processing itself can take a lot of time, so the process can be scheduled away in that part too. correct - it's just the wrong thing to do. The '10% performance win' that was measured was against _9 other tasks who contended for the same CPU resource_. I.e. it's /not/ an absolute 'performance win' AFAICS, it's a simple shift in CPU cycles away from the other 9 tasks and towards the task that does TCP receive. Note that even without the change the TCP receiving task is already getting a disproportionate share of cycles due to softirq processing! Under a load of 10.0 it went from 500 mbits to 74 mbits, while the 'fair' share would be 50 mbits. So the TCP receiver /already/ has an unfair advantage. The patch only deepens that unfairness. The solution is really simple and needs no kernel change at all: if you want the TCP receiver to get a larger share of timeslices then either renice it to -20 or renice the other tasks to +19. The other disadvantage, even ignoring that it's the wrong thing to do, is the crudeness of preempt_disable() that i mentioned in the other post: -- independently of the issue at hand, in general the explicit use of preempt_disable() in non-infrastructure code is quite a heavy tool. Its effects are heavy and global: it disables /all/ preemption (even on PREEMPT_RT).
Furthermore, when preempt_disable() is used for per-CPU data structures then [unlike, for example, a spin-lock] the connection between the 'data' and the 'lock' is not explicit - causing all kinds of grief when trying to convert such code to a different preemption model. (such as PREEMPT_RT :-) So my plan is to remove all open-coded use of preempt_disable() [and raw use of local_irq_save/restore] from the kernel and replace it with some facility that connects data and lock. (Note that this will not result in any actual changes on the instruction level because internally every such facility still maps to preempt_disable() on non-PREEMPT_RT kernels, so on non-PREEMPT_RT kernels such code will still be the same as before.) Ingo
[PATCH] mv643xx_eth: fix unbalanced parentheses in macros
From: Mariusz Kozlowski [EMAIL PROTECTED]

Signed-off-by: Mariusz Kozlowski [EMAIL PROTECTED]
Signed-off-by: Dale Farnsworth [EMAIL PROTECTED]
---
 include/linux/mv643xx.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- linux-2.6.19-rc6-mm2-a/include/linux/mv643xx.h	2006-11-16 05:03:40.0 +0100
+++ linux-2.6.19-rc6-mm2-b/include/linux/mv643xx.h	2006-11-30 11:30:14.0 +0100
@@ -724,7 +724,7 @@
 #define MV643XX_ETH_RX_FIFO_URGENT_THRESHOLD_REG(port)  (0x2470 + (port<<10))
 #define MV643XX_ETH_TX_FIFO_URGENT_THRESHOLD_REG(port)  (0x2474 + (port<<10))
 #define MV643XX_ETH_RX_MINIMAL_FRAME_SIZE_REG(port)     (0x247c + (port<<10))
-#define MV643XX_ETH_RX_DISCARDED_FRAMES_COUNTER(port)   (0x2484 + (port<<10)
+#define MV643XX_ETH_RX_DISCARDED_FRAMES_COUNTER(port)   (0x2484 + (port<<10))
 #define MV643XX_ETH_PORT_DEBUG_0_REG(port)              (0x248c + (port<<10))
 #define MV643XX_ETH_PORT_DEBUG_1_REG(port)              (0x2490 + (port<<10))
 #define MV643XX_ETH_PORT_INTERNAL_ADDR_ERROR_REG(port)  (0x2494 + (port<<10))
@@ -1135,7 +1135,7 @@ struct mv64xxx_i2c_pdata {
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_1 (1<<19)
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_2 (1<<20)
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_3 ((1<<20) | (1<<19))
-#define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_4 ((1<<21)
+#define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_4 (1<<21)
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_5 ((1<<21) | (1<<19))
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_6 ((1<<21) | (1<<20))
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_7 ((1<<21) | (1<<20) | (1<<19))
RE: [patch 1/4] - Potential performance bottleneck for Linux TCP
We can make explicit preemption checks in the main loop of tcp_recvmsg(), and release the socket and run the backlog if need_resched() is TRUE. This is the simplest and most elegant solution to this problem. I am not sure whether this approach will work. How can you make the explicit preemption checks? For the Desktop case, yes, you can make explicit preemption checks at some points to see whether need_resched() is true. But when need_resched() is true, you cannot decide whether it was triggered by higher-priority processes becoming runnable, or by the process within tcp_recvmsg() expiring. If higher-priority processes become runnable (e.g., an interactive process), you had better yield the CPU instead of continuing this process. If it is the case that the process within tcp_recvmsg() is expiring, then you can continue the process and go ahead to process the backlog. For the Low-latency Desktop case, I believe it is very hard to make the checks. We do not know when the process is going to expire, or when a higher-priority process will become runnable. The process could expire at any moment, or a higher-priority process could become runnable at any moment. If we do not want to trade off system responsiveness, where do you want to make the check? If you just make the check and need_resched() becomes TRUE, what are you going to do in this case? wenji
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
On Thu, 2006-11-30 at 09:33 +0000, Christoph Hellwig wrote: On Wed, Nov 29, 2006 at 07:56:58PM -0600, Wenji Wu wrote: Yes, when CONFIG_PREEMPT is disabled, the problem won't happen. That is why I put for 2.6 desktop, low-latency desktop in the uploaded paper. This problem happens in the 2.6 Desktop and Low-latency Desktop. CONFIG_PREEMPT is only for people that are in it for the feeling. There is no real world advantage to it and we should probably remove it again. There certainly is a real world advantage for many applications. Of course it would be better if the latency requirements could be met without kernel preemption, but that's not the case now. Lee
RE: [patch 1/4] - Potential performance bottleneck for Linux TCP
The solution is really simple and needs no kernel change at all: if you want the TCP receiver to get a larger share of timeslices then either renice it to -20 or renice the other tasks to +19. Simply giving a larger share of timeslices to the TCP receiver won't solve the problem. No matter what the timeslice is, if the TCP receiving process has packets within the backlog, and the process expires and is moved to the expired array, an RTO might happen in the TCP sender. The solution does not look that simple. wenji
[take26 7/8] kevent: Signal notifications.
Signal notifications. This type of notifications allows to deliver signals through kevent queue. One can find example application signal.c on project homepage. If KEVENT_SIGNAL_NOMASK bit is set in raw_u64 id then signal will be delivered only through queue, otherwise both delivery types are used - old through update of mask of pending signals and through queue. If signal is delivered only through kevent queue mask of pending signals is not updated at all, which is equal to putting signal into blocked mask, but with delivery of that signal through kevent queue. Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] diff --git a/include/linux/sched.h b/include/linux/sched.h index fc4a987..ef38a3c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -80,6 +80,7 @@ struct sched_param { #include linux/resource.h #include linux/timer.h #include linux/hrtimer.h +#include linux/kevent_storage.h #include asm/processor.h @@ -1013,6 +1014,10 @@ struct task_struct { #ifdef CONFIG_TASK_DELAY_ACCT struct task_delay_info *delays; #endif +#ifdef CONFIG_KEVENT_SIGNAL + struct kevent_storage st; + u32 kevent_signals; +#endif }; static inline pid_t process_group(struct task_struct *tsk) diff --git a/kernel/fork.c b/kernel/fork.c index 1c999f3..e5b5b14 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -46,6 +46,7 @@ #include linux/delayacct.h #include linux/taskstats_kern.h #include linux/random.h +#include linux/kevent.h #include asm/pgtable.h #include asm/pgalloc.h @@ -115,6 +116,9 @@ void __put_task_struct(struct task_struc WARN_ON(atomic_read(tsk-usage)); WARN_ON(tsk == current); +#ifdef CONFIG_KEVENT_SIGNAL + kevent_storage_fini(tsk-st); +#endif security_task_free(tsk); free_uid(tsk-user); put_group_info(tsk-group_info); @@ -1121,6 +1125,10 @@ static struct task_struct *copy_process( if (retval) goto bad_fork_cleanup_namespace; +#ifdef CONFIG_KEVENT_SIGNAL + kevent_storage_init(p, p-st); +#endif + p-set_child_tid = (clone_flags CLONE_CHILD_SETTID) ? 
child_tidptr : NULL; /* * Clear TID on mm_release()? diff --git a/kernel/kevent/kevent_signal.c b/kernel/kevent/kevent_signal.c new file mode 100644 index 000..0edd2e4 --- /dev/null +++ b/kernel/kevent/kevent_signal.c @@ -0,0 +1,92 @@ +/* + * kevent_signal.c + * + * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED] + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include linux/kernel.h +#include linux/types.h +#include linux/slab.h +#include linux/spinlock.h +#include linux/file.h +#include linux/fs.h +#include linux/kevent.h + +static int kevent_signal_callback(struct kevent *k) +{ + struct task_struct *tsk = k-st-origin; + int sig = k-event.id.raw[0]; + int ret = 0; + + if (sig == tsk-kevent_signals) + ret = 1; + + if (ret (k-event.id.raw_u64 KEVENT_SIGNAL_NOMASK)) + tsk-kevent_signals |= 0x8000; + + return ret; +} + +int kevent_signal_enqueue(struct kevent *k) +{ + int err; + + err = kevent_storage_enqueue(current-st, k); + if (err) + goto err_out_exit; + + if (k-event.req_flags KEVENT_REQ_ALWAYS_QUEUE) { + kevent_requeue(k); + err = 0; + } else { + err = k-callbacks.callback(k); + if (err) + goto err_out_dequeue; + } + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k-st, k); +err_out_exit: + return err; +} + +int kevent_signal_dequeue(struct 
kevent *k) +{ + kevent_storage_dequeue(k-st, k); + return 0; +} + +int kevent_signal_notify(struct task_struct *tsk, int sig) +{ + tsk-kevent_signals = sig; + kevent_storage_ready(tsk-st, NULL, KEVENT_SIGNAL_DELIVERY); + return (tsk-kevent_signals 0x8000); +} + +static int __init kevent_init_signal(void) +{ + struct kevent_callbacks sc = { + .callback = kevent_signal_callback, + .enqueue = kevent_signal_enqueue, + .dequeue = kevent_signal_dequeue}; + + return kevent_add_callbacks(sc, KEVENT_SIGNAL); +} +module_init(kevent_init_signal); diff --git
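The KEVENT_SIGNAL_NOMASK decision in the patch above boils down to a single bit test on the event's id field: if the event matched and the NOMASK bit was requested, the signal goes only through the kevent queue and the pending-signal mask is left untouched. A minimal userspace sketch of that logic (the flag value and function name here are assumptions for illustration, not the kernel's definitions):

```c
#include <stdint.h>

/* Assumed stand-in value -- the real KEVENT_SIGNAL_NOMASK bit is defined in
 * the kevent headers; 0x1ULL here is only for illustration. */
#define KEVENT_SIGNAL_NOMASK 0x1ULL

/* Mirrors the decision made across kevent_signal_callback() and
 * kevent_signal_notify(): deliver only through the queue (and skip updating
 * the pending-signal mask) when the event matched and NOMASK was set. */
int deliver_through_queue_only(uint64_t raw_u64, int matched)
{
    return matched && (raw_u64 & KEVENT_SIGNAL_NOMASK) ? 1 : 0;
}
```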
[take26 6/8] kevent: Pipe notifications.
Pipe notifications. diff --git a/fs/pipe.c b/fs/pipe.c index f3b6f71..aeaee9c 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -16,6 +16,7 @@ #include linux/uio.h #include linux/highmem.h #include linux/pagemap.h +#include linux/kevent.h #include asm/uaccess.h #include asm/ioctls.h @@ -312,6 +313,7 @@ redo: break; } if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_SEND); wake_up_interruptible_sync(pipe-wait); kill_fasync(pipe-fasync_writers, SIGIO, POLL_OUT); } @@ -321,6 +323,7 @@ redo: /* Signal writers asynchronously that there is more room. */ if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_SEND); wake_up_interruptible(pipe-wait); kill_fasync(pipe-fasync_writers, SIGIO, POLL_OUT); } @@ -490,6 +493,7 @@ redo2: break; } if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_RECV); wake_up_interruptible_sync(pipe-wait); kill_fasync(pipe-fasync_readers, SIGIO, POLL_IN); do_wakeup = 0; @@ -501,6 +505,7 @@ redo2: out: mutex_unlock(inode-i_mutex); if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_RECV); wake_up_interruptible(pipe-wait); kill_fasync(pipe-fasync_readers, SIGIO, POLL_IN); } @@ -605,6 +610,7 @@ pipe_release(struct inode *inode, int de free_pipe_info(inode); } else { wake_up_interruptible(pipe-wait); + kevent_pipe_notify(inode, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); kill_fasync(pipe-fasync_readers, SIGIO, POLL_IN); kill_fasync(pipe-fasync_writers, SIGIO, POLL_OUT); } diff --git a/kernel/kevent/kevent_pipe.c b/kernel/kevent/kevent_pipe.c new file mode 100644 index 000..d529fa9 --- /dev/null +++ b/kernel/kevent/kevent_pipe.c @@ -0,0 +1,121 @@ +/* + * kevent_pipe.c + * + * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED] + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include linux/kernel.h +#include linux/types.h +#include linux/slab.h +#include linux/spinlock.h +#include linux/file.h +#include linux/fs.h +#include linux/kevent.h +#include linux/pipe_fs_i.h + +static int kevent_pipe_callback(struct kevent *k) +{ + struct inode *inode = k-st-origin; + struct pipe_inode_info *pipe = inode-i_pipe; + int nrbufs = pipe-nrbufs; + + if (k-event.event KEVENT_SOCKET_RECV nrbufs 0) { + if (!pipe-writers) + return -1; + return 1; + } + + if (k-event.event KEVENT_SOCKET_SEND nrbufs PIPE_BUFFERS) { + if (!pipe-readers) + return -1; + return 1; + } + + return 0; +} + +int kevent_pipe_enqueue(struct kevent *k) +{ + struct file *pipe; + int err = -EBADF; + struct inode *inode; + + pipe = fget(k-event.id.raw[0]); + if (!pipe) + goto err_out_exit; + + inode = igrab(pipe-f_dentry-d_inode); + if (!inode) + goto err_out_fput; + + err = -EINVAL; + if (!S_ISFIFO(inode-i_mode)) + goto err_out_iput; + + err = kevent_storage_enqueue(inode-st, k); + if (err) + goto err_out_iput; + + if (k-event.req_flags KEVENT_REQ_ALWAYS_QUEUE) { + kevent_requeue(k); + err = 0; + } else { + err = k-callbacks.callback(k); + if (err) + goto err_out_dequeue; + } + + fput(pipe); + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k-st, k); +err_out_iput: + iput(inode); +err_out_fput: + fput(pipe); +err_out_exit: + return err; +} + +int kevent_pipe_dequeue(struct kevent *k) +{ + struct inode *inode = k-st-origin; + + kevent_storage_dequeue(k-st, k); + iput(inode); + + return 0; +} + +void 
kevent_pipe_notify(struct inode *inode, u32
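The readiness rule implemented by kevent_pipe_callback() above can be mirrored in a small userspace sketch: the pipe is readable when it holds buffered data, writable when at least one buffer slot is free, and the callback reports an error (-1) when the matching peer side is gone. Constants and the 1/-1/0 return convention are taken from the patch; PIPE_BUFFERS was 16 in kernels of this era, and the event-mask values below are illustrative stand-ins:

```c
#define PIPE_BUFFERS 16   /* pipe buffer slots in kernels of this era */
#define WANT_RECV 1       /* stand-in for KEVENT_SOCKET_RECV */
#define WANT_SEND 2       /* stand-in for KEVENT_SOCKET_SEND */

/* Returns 1 = ready, -1 = peer side gone (error), 0 = not ready,
 * mirroring kevent_pipe_callback() in the patch. */
int pipe_event_ready(int events, int nrbufs, int readers, int writers)
{
    if ((events & WANT_RECV) && nrbufs > 0)
        return writers ? 1 : -1;
    if ((events & WANT_SEND) && nrbufs < PIPE_BUFFERS)
        return readers ? 1 : -1;
    return 0;
}
```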
[take26 4/8] kevent: Socket notifications.
Socket notifications. This patch includes socket send/recv/accept notifications. Using trivial web server based on kevent and this features instead of epoll it's performance increased more than noticebly. More details about various benchmarks and server itself (evserver_kevent.c) can be found on project's homepage. Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] diff --git a/fs/inode.c b/fs/inode.c index ada7643..2740617 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -21,6 +21,7 @@ #include linux/cdev.h #include linux/bootmem.h #include linux/inotify.h +#include linux/kevent.h #include linux/mount.h /* @@ -164,12 +165,18 @@ static struct inode *alloc_inode(struct } inode-i_private = 0; inode-i_mapping = mapping; +#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE + kevent_storage_init(inode, inode-st); +#endif } return inode; } void destroy_inode(struct inode *inode) { +#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE + kevent_storage_fini(inode-st); +#endif BUG_ON(inode_has_buffers(inode)); security_inode_free(inode); if (inode-i_sb-s_op-destroy_inode) diff --git a/include/net/sock.h b/include/net/sock.h index edd4d73..d48ded8 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -48,6 +48,7 @@ #include linux/netdevice.h #include linux/skbuff.h /* struct sk_buff */ #include linux/security.h +#include linux/kevent.h #include linux/filter.h @@ -450,6 +451,21 @@ static inline int sk_stream_memory_free( extern void sk_stream_rfree(struct sk_buff *skb); +struct socket_alloc { + struct socket socket; + struct inode vfs_inode; +}; + +static inline struct socket *SOCKET_I(struct inode *inode) +{ + return container_of(inode, struct socket_alloc, vfs_inode)-socket; +} + +static inline struct inode *SOCK_INODE(struct socket *socket) +{ + return container_of(socket, struct socket_alloc, socket)-vfs_inode; +} + static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk) { skb-sk = sk; @@ -477,6 +493,7 @@ static inline void 
sk_add_backlog(struct sk-sk_backlog.tail = skb; } skb-next = NULL; + kevent_socket_notify(sk, KEVENT_SOCKET_RECV); } #define sk_wait_event(__sk, __timeo, __condition) \ @@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio return si-kiocb; } -struct socket_alloc { - struct socket socket; - struct inode vfs_inode; -}; - -static inline struct socket *SOCKET_I(struct inode *inode) -{ - return container_of(inode, struct socket_alloc, vfs_inode)-socket; -} - -static inline struct inode *SOCK_INODE(struct socket *socket) -{ - return container_of(socket, struct socket_alloc, socket)-vfs_inode; -} - extern void __sk_stream_mem_reclaim(struct sock *sk); extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind); diff --git a/include/net/tcp.h b/include/net/tcp.h index 7a093d0..69f4ad2 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -857,6 +857,7 @@ static inline int tcp_prequeue(struct so tp-ucopy.memory = 0; } else if (skb_queue_len(tp-ucopy.prequeue) == 1) { wake_up_interruptible(sk-sk_sleep); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); if (!inet_csk_ack_scheduled(sk)) inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK, (3 * TCP_RTO_MIN) / 4, diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c new file mode 100644 index 000..9c24b5b --- /dev/null +++ b/kernel/kevent/kevent_socket.c @@ -0,0 +1,142 @@ +/* + * kevent_socket.c + * + * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED] + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include linux/kernel.h +#include linux/types.h +#include linux/list.h +#include linux/slab.h +#include linux/spinlock.h +#include linux/timer.h +#include linux/file.h +#include linux/tcp.h +#include linux/kevent.h + +#include net/sock.h +#include net/request_sock.h +#include
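The SOCKET_I()/SOCK_INODE() helpers this patch moves up in include/net/sock.h are plain container_of() arithmetic: given a pointer to one member of struct socket_alloc, recover the enclosing allocation and return the sibling member. A userspace re-creation with stand-in struct types (all names below are illustrative, not the kernel's):

```c
#include <stddef.h>

/* Same trick as the kernel's container_of(): subtract the member offset
 * from the member pointer to get the enclosing struct. */
#define my_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct fake_socket { int state; };
struct fake_inode  { int mode; };

/* Stand-in for struct socket_alloc: socket and inode allocated together. */
struct socket_alloc_sketch {
    struct fake_socket socket;
    struct fake_inode  vfs_inode;
};

struct fake_socket *SOCKET_I_sketch(struct fake_inode *inode)
{
    return &my_container_of(inode, struct socket_alloc_sketch, vfs_inode)->socket;
}

struct fake_inode *SOCK_INODE_sketch(struct fake_socket *socket)
{
    return &my_container_of(socket, struct socket_alloc_sketch, socket)->vfs_inode;
}
```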
[take26 5/8] kevent: Timer notifications.
Timer notifications. Timer notifications can be used for fine grained per-process time management, since interval timers are very inconvenient to use, and they are limited. This subsystem uses high-resolution timers. id.raw[0] is used as number of seconds id.raw[1] is used as number of nanoseconds Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c new file mode 100644 index 000..df93049 --- /dev/null +++ b/kernel/kevent/kevent_timer.c @@ -0,0 +1,112 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED] + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include linux/kernel.h +#include linux/types.h +#include linux/list.h +#include linux/slab.h +#include linux/spinlock.h +#include linux/hrtimer.h +#include linux/jiffies.h +#include linux/kevent.h + +struct kevent_timer +{ + struct hrtimer ktimer; + struct kevent_storage ktimer_storage; + struct kevent *ktimer_event; +}; + +static int kevent_timer_func(struct hrtimer *timer) +{ + struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer); + struct kevent *k = t-ktimer_event; + + kevent_storage_ready(t-ktimer_storage, NULL, KEVENT_MASK_ALL); + hrtimer_forward(timer, timer-base-softirq_time, + ktime_set(k-event.id.raw[0], k-event.id.raw[1])); + return HRTIMER_RESTART; +} + +static struct lock_class_key kevent_timer_key; + +static int kevent_timer_enqueue(struct kevent *k) +{ + int err; + struct kevent_timer *t; + + t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL); + if (!t) + return -ENOMEM; + + hrtimer_init(t-ktimer, CLOCK_MONOTONIC, HRTIMER_REL); + t-ktimer.expires = ktime_set(k-event.id.raw[0], k-event.id.raw[1]); + t-ktimer.function = kevent_timer_func; + t-ktimer_event = k; + + err = kevent_storage_init(t-ktimer, t-ktimer_storage); + if (err) + goto err_out_free; + lockdep_set_class(t-ktimer_storage.lock, kevent_timer_key); + + err = kevent_storage_enqueue(t-ktimer_storage, k); + if (err) + goto err_out_st_fini; + + hrtimer_start(t-ktimer, t-ktimer.expires, HRTIMER_REL); + + return 0; + +err_out_st_fini: + kevent_storage_fini(t-ktimer_storage); +err_out_free: + kfree(t); + + return err; +} + +static int kevent_timer_dequeue(struct kevent *k) +{ + struct kevent_storage *st = k-st; + struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage); + + hrtimer_cancel(t-ktimer); + kevent_storage_dequeue(st, k); + 
kfree(t); + + return 0; +} + +static int kevent_timer_callback(struct kevent *k) +{ + k-event.ret_data[0] = jiffies_to_msecs(jiffies); + return 1; +} + +static int __init kevent_init_timer(void) +{ + struct kevent_callbacks tc = { + .callback = kevent_timer_callback, + .enqueue = kevent_timer_enqueue, + .dequeue = kevent_timer_dequeue}; + + return kevent_add_callbacks(tc, KEVENT_TIMER); +} +module_init(kevent_init_timer); + - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
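kevent_timer_func() in the patch above re-arms the timer by adding id.raw[0] seconds and id.raw[1] nanoseconds to the current expiry via hrtimer_forward()/ktime_set(). The carry normalization those helpers perform internally can be sketched in userspace C (struct and names are illustrative only):

```c
#define NSEC_PER_SEC 1000000000L

struct ktime_sketch { long sec; long nsec; };

/* Add an interval of (sec, nsec) to an expiry time, carrying nanosecond
 * overflow into seconds so nsec stays in [0, NSEC_PER_SEC). This is the
 * arithmetic hrtimer_forward()/ktime_set() take care of in the kernel. */
struct ktime_sketch ktime_add_sketch(struct ktime_sketch t, long sec, long nsec)
{
    t.sec  += sec;
    t.nsec += nsec;
    if (t.nsec >= NSEC_PER_SEC) {
        t.sec  += t.nsec / NSEC_PER_SEC;
        t.nsec %= NSEC_PER_SEC;
    }
    return t;
}
```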
[take26 8/8] kevent: Kevent posix timer notifications.
Kevent posix timer notifications. Simple extensions to POSIX timers which allows to deliver notification of the timer expiration through kevent queue. Example application posix_timer.c can be found in archive on project homepage. Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] diff --git a/include/asm-generic/siginfo.h b/include/asm-generic/siginfo.h index 8786e01..3768746 100644 --- a/include/asm-generic/siginfo.h +++ b/include/asm-generic/siginfo.h @@ -235,6 +235,7 @@ typedef struct siginfo { #define SIGEV_NONE 1 /* other notification: meaningless */ #define SIGEV_THREAD 2 /* deliver via thread creation */ #define SIGEV_THREAD_ID 4 /* deliver to thread */ +#define SIGEV_KEVENT 8 /* deliver through kevent queue */ /* * This works because the alignment is ok on all current architectures @@ -260,6 +261,8 @@ typedef struct sigevent { void (*_function)(sigval_t); void *_attribute; /* really pthread_attr_t */ } _sigev_thread; + + int kevent_fd; } _sigev_un; } sigevent_t; diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h index a7dd38f..4b9deb4 100644 --- a/include/linux/posix-timers.h +++ b/include/linux/posix-timers.h @@ -4,6 +4,7 @@ #include linux/spinlock.h #include linux/list.h #include linux/sched.h +#include linux/kevent_storage.h union cpu_time_count { cputime_t cpu; @@ -49,6 +50,9 @@ struct k_itimer { sigval_t it_sigev_value;/* value word of sigevent struct */ struct task_struct *it_process; /* process to send signal to */ struct sigqueue *sigq; /* signal queue entry. */ +#ifdef CONFIG_KEVENT_TIMER + struct kevent_storage st; +#endif union { struct { struct hrtimer timer; diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c index e5ebcc1..8d0e7a3 100644 --- a/kernel/posix-timers.c +++ b/kernel/posix-timers.c @@ -48,6 +48,8 @@ #include linux/wait.h #include linux/workqueue.h #include linux/module.h +#include linux/kevent.h +#include linux/file.h /* * Management arrays for POSIX timers. 
Timers are kept in slab memory @@ -224,6 +226,99 @@ static int posix_ktime_get_ts(clockid_t return 0; } +#ifdef CONFIG_KEVENT_TIMER +static int posix_kevent_enqueue(struct kevent *k) +{ + /* +* It is not ugly - there is no pointer in the id field union, +* but its size is 64bits, which is ok for any known pointer size. +*/ + struct k_itimer *tmr = (struct k_itimer *)(unsigned long)k-event.id.raw_u64; + return kevent_storage_enqueue(tmr-st, k); +} +static int posix_kevent_dequeue(struct kevent *k) +{ + struct k_itimer *tmr = (struct k_itimer *)(unsigned long)k-event.id.raw_u64; + kevent_storage_dequeue(tmr-st, k); + return 0; +} +static int posix_kevent_callback(struct kevent *k) +{ + return 1; +} +static int posix_kevent_init(void) +{ + struct kevent_callbacks tc = { + .callback = posix_kevent_callback, + .enqueue = posix_kevent_enqueue, + .dequeue = posix_kevent_dequeue}; + + return kevent_add_callbacks(tc, KEVENT_POSIX_TIMER); +} + +extern struct file_operations kevent_user_fops; + +static int posix_kevent_init_timer(struct k_itimer *tmr, int fd) +{ + struct ukevent uk; + struct file *file; + struct kevent_user *u; + int err; + + file = fget(fd); + if (!file) { + err = -EBADF; + goto err_out; + } + + if (file-f_op != kevent_user_fops) { + err = -EINVAL; + goto err_out_fput; + } + + u = file-private_data; + + memset(uk, 0, sizeof(struct ukevent)); + + uk.event = KEVENT_MASK_ALL; + uk.type = KEVENT_POSIX_TIMER; + uk.id.raw_u64 = (unsigned long)(tmr); /* Just cast to something unique */ + uk.req_flags = KEVENT_REQ_ONESHOT | KEVENT_REQ_ALWAYS_QUEUE; + uk.ptr = tmr-it_sigev_value.sival_ptr; + + err = kevent_user_add_ukevent(uk, u); + if (err) + goto err_out_fput; + + fput(file); + + return 0; + +err_out_fput: + fput(file); +err_out: + return err; +} + +static void posix_kevent_fini_timer(struct k_itimer *tmr) +{ + kevent_storage_fini(tmr-st); +} +#else +static int posix_kevent_init_timer(struct k_itimer *tmr, int fd) +{ + return -ENOSYS; +} +static int 
posix_kevent_init(void) +{ + return 0; +} +static void posix_kevent_fini_timer(struct k_itimer *tmr) +{ +} +#endif + + /* * Initialize everything, well, just everything in Posix clocks/timers ;) */ @@ -241,6 +336,11 @@ static __init int init_posix_timers(void register_posix_clock(CLOCK_REALTIME, clock_realtime); register_posix_clock(CLOCK_MONOTONIC, clock_monotonic); + if (posix_kevent_init()) { +
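posix_kevent_enqueue() above recovers the k_itimer pointer from the 64-bit id field; as the comment in the patch notes, the union carries no pointer member, but 64 bits is wide enough for any known pointer size. The round trip is just two casts through uintptr_t (stand-in struct, userspace sketch):

```c
#include <stdint.h>

struct k_itimer_sketch { int dummy; };   /* stand-in for struct k_itimer */

/* Pack the timer pointer into the 64-bit id field, as the timer-creation
 * path does with uk.id.raw_u64. */
uint64_t timer_to_id(struct k_itimer_sketch *tmr)
{
    return (uint64_t)(uintptr_t)tmr;
}

/* Unpack it again, as posix_kevent_enqueue()/posix_kevent_dequeue() do. */
struct k_itimer_sketch *id_to_timer(uint64_t id)
{
    return (struct k_itimer_sketch *)(uintptr_t)id;
}
```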
[take26 0/8] kevent: Generic event handling mechanism.
Generic event handling mechanism.

Kevent is a generic subsystem for handling event notifications. It supports both level- and edge-triggered events. It is similar to poll/epoll in some cases, but it is more scalable, it is faster, and it can work with essentially any kind of event. Events are fed into the kernel through a control syscall and can be read back through a ring buffer or through the usual syscalls. Kevent updates (i.e. readiness switching) happen directly from the internals of the state machine of the underlying subsystem (network, filesystem, timer or any other).

Homepage: http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent
Documentation page (will update Dec 1): http://linux-net.osdl.org/index.php/Kevent

I installed a slightly used, but still functional (bought on ebay) remote mind reader, and set it up to read Ulrich's alpha brain waves (I hope he agrees that it was a good decision), which took me the whole week. So I think the last ring buffer implementation is what we all wanted. Details in the documentation part. It seems the setup was correct and we finally found what we wanted from the interface part.

Changes from 'take25' patchset:
* use timespec as the timeout parameter.
* added a high-resolution timer to handle absolute timeouts.
* added flags to the waiting and initialization syscalls.
* kevent_commit() has a new_uidx parameter.
* kevent_wait() has an old_uidx parameter which, if not equal to u->uidx, results in immediate wakeup (useful for the case when entries are added asynchronously from the kernel (not supported for now)).
* added an interface to mark any event as ready.
* event POSIX timers support.
* return -ENOSYS if there is no registered event type.
* the provided file descriptor must be checked for fifo type (spotted by Eric Dumazet).
* documentation update.
* lighttpd patch updated (the latest benchmarks with the lighttpd patch can be found in the blog).
Changes from 'take24' patchset:
* new (old (new)) ring buffer implementation with kernel and user indexes.
* added an initialization syscall instead of opening /dev/kevent
* kevent_commit() syscall to commit ring buffer entries
* changed the KEVENT_REQ_WAKEUP_ONE flag to KEVENT_REQ_WAKEUP_ALL; when that flag is not set, kevent always wakes only the first thread
* KEVENT_REQ_ALWAYS_QUEUE flag. If set, a kevent that is ready immediately at addition time will be queued into the ready queue instead of being copied straight back to userspace.
* lighttpd patch (Hail! Although nothing really outstanding compared to epoll)

Changes from 'take23' patchset:
* kevent PIPE notifications
* KEVENT_REQ_LAST_CHECK flag, which allows a last check to be performed at dequeueing time
* fixed poll/select notifications (were broken due to tree manipulations)
* made Documentation/kevent.txt look nice in an 80-column terminal
* fix for copy_to_user() failure report for the first kevent (Andrew Morton)
* minor function renames

Changes from 'take22' patchset:
* new ring buffer implementation in the process' memory
* wakeup-one-thread flag
* edge-triggered behaviour

Changes from 'take21' patchset:
* minor cleanups (different return values, removed unneeded variables, whitespace and so on)
* fixed a bug in kevent removal for the case when the kevent being removed is the same as the overflow_kevent (spotted by Eric Dumazet)

Changes from 'take20' patchset:
* new ring buffer implementation
* removed the artificial limit on the possible number of kevents

Changes from 'take19' patchset:
* use __init instead of __devinit
* removed 'default N' from config for user statistic
* removed kevent_user_fini() since kevent can not be unloaded
* use KERN_INFO for statistic output

Changes from 'take18' patchset:
* use __init instead of __devinit
* removed 'default N' from config for user statistic
* removed kevent_user_fini() since kevent can not be unloaded
* use KERN_INFO for statistic output

Changes from 'take17' patchset:
* Use an RB tree instead of a hash table.
At least for a web server, the frequency of addition/deletion of new kevents is comparable with the number of search accesses, i.e. most of the time events are added, accessed only a couple of times and then removed, which justifies RB tree usage over an AVL tree, since the latter has much slower deletion (max O(log(N)) compared to 3 ops), although faster search (1.44*O(log(N)) vs. 2*O(log(N))). So for kevents I use an RB tree for now; later, when my AVL tree implementation is ready, it will be possible to compare them.
* Changed the readiness check for socket notifications.

With both of the above changes it is possible to achieve more than 3380 req/second, compared to 2200 (sometimes 2500) req/second for epoll(), with a trivial web server and an httperf client on the same hardware. It is possible that the above kevent limit is due to the maximum number of kevents allowed per time limit, which is 4096 events.

Changes from 'take16' patchset:
* misc cleanups
[take26 1/8] kevent: Description.
Description.

diff --git a/Documentation/kevent.txt b/Documentation/kevent.txt
new file mode 100644
index 000..2e03a3f
--- /dev/null
+++ b/Documentation/kevent.txt
@@ -0,0 +1,240 @@
+Description.
+
+int kevent_init(struct kevent_ring *ring, unsigned int ring_size,
+		unsigned int flags);
+
+ring_size - size of the ring buffer in events
+ring - pointer to the allocated ring buffer
+flags - various flags, see KEVENT_FLAGS_* definitions.
+
+Return value: kevent control file descriptor or negative error value.
+
+ struct kevent_ring
+ {
+	unsigned int ring_kidx, ring_over;
+	struct ukevent event[0];
+ }
+
+ring_kidx - index in the ring buffer where the kernel will put new events
+	when kevent_wait() or kevent_get_events() is called
+ring_over - number of overflows of ring_uidx that have happened since the start.
+	The overflow counter is used to prevent the situation where two threads
+	are going to free the same events, but one of them was scheduled
+	away for too long and the ring indexes wrapped, so that when that
+	thread is awakened, it would free different events from the ones
+	it was supposed to free.
+
+Example userspace code (ring_buffer.c) can be found on the project's homepage.
+
+Each kevent syscall can be a so-called cancellation point in glibc, i.e. when
+a thread has been cancelled in a kevent syscall, the thread can be safely removed
+and no events will be lost, since each syscall (kevent_wait() or
+kevent_get_events()) will copy the event into a special ring buffer, accessible
+from other threads or even processes (if shared memory is used).
+
+When a kevent is removed (not dequeued when it is ready, but just removed),
+even if it was ready, it is not copied into the ring buffer: if it is
+removed, no one cares about it (otherwise the user would have waited until it
+became ready and fetched it the usual way via kevent_get_events() or kevent_wait()),
+and thus there is no need to copy it to the ring buffer.
+ +--- + + +int kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent *arg); + +fd - is the file descriptor referring to the kevent queue to manipulate. +It is created by opening /dev/kevent char device, which is created with +dynamic minor number and major number assigned for misc devices. + +cmd - is the requested operation. It can be one of the following: +KEVENT_CTL_ADD - add event notification +KEVENT_CTL_REMOVE - remove event notification +KEVENT_CTL_MODIFY - modify existing notification +KEVENT_CTL_READY - mark existing events as ready, if number of events is zero, + it just wakes up parked in syscall thread + +num - number of struct ukevent in the array pointed to by arg +arg - array of struct ukevent + +Return value: + number of events processed or negative error value. + +When called, kevent_ctl will carry out the operation specified in the +cmd parameter. +--- + + int kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr, + struct timespec timeout, struct ukevent *buf, unsigned flags); + +ctl_fd - file descriptor referring to the kevent queue +min_nr - minimum number of completed events that kevent_get_events will block +waiting for +max_nr - number of struct ukevent in buf +timeout - time to wait before returning less than min_nr + events. If this is -1, then wait forever. +buf - pointer to an array of struct ukevent. +flags - various flags, see KEVENT_FLAGS_* definitions. + +Return value: + number of events copied or negative error value. + +kevent_get_events will wait timeout milliseconds for at least min_nr completed +events, copying completed struct ukevents to buf and deleting any +KEVENT_REQ_ONESHOT event requests. In nonblocking mode it returns as many +events as possible, but not more than max_nr. In blocking mode it waits until +timeout or if at least min_nr events are ready. 
+
+This function copies events into the ring buffer if it was initialized; if the
+ring buffer is full, the KEVENT_RET_COPY_FAILED flag is set in the ret_flags field.
+---
+
+ int kevent_wait(int ctl_fd, unsigned int num, unsigned int old_uidx,
+		struct timespec timeout, unsigned int flags);
+
+ctl_fd - file descriptor referring to the kevent queue
+num - number of processed kevents
+old_uidx - the last index the user is aware of
+timeout - time to wait until there is free space in the kevent queue
+flags - various flags, see KEVENT_FLAGS_* definitions.
+
+Return value:
+ number of events copied into the ring buffer or negative error value.
+
+This syscall waits until either the timeout expires or at least one event becomes
+ready. It also copies events into the special ring buffer. If the ring buffer is
+full, it waits until there are ready events and then returns.
+If kevent is one-shot kevent it is
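The ring_kidx/ring_over pair documented earlier in this file amounts to a wrapping index plus a wrap counter, which lets a consumer that was scheduled away for too long detect that the indexes it saved are stale. A userspace sketch under assumed names (ring size chosen arbitrarily):

```c
unsigned int ring_size_sketch = 8;   /* assumed ring size, in events */

/* Advance the kernel index by one slot; on wrap-around, bump the
 * overflow counter -- one more full trip around the ring. */
void ring_advance(unsigned int *kidx, unsigned int *over)
{
    if (++*kidx == ring_size_sketch) {
        *kidx = 0;
        ++*over;
    }
}

/* A consumer's saved (index, overflow) snapshot is stale once the
 * overflow counter has moved on -- the slots it remembers were reused. */
int snapshot_stale(unsigned int saved_over, unsigned int cur_over)
{
    return saved_over != cur_over;
}
```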
[take26 3/8] kevent: poll/select() notifications.
poll/select() notifications. This patch includes generic poll/select notifications. kevent_poll works simialr to epoll and has the same issues (callback is invoked not from internal state machine of the caller, but through process awake, a lot of allocations and so on). Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] diff --git a/fs/file_table.c b/fs/file_table.c index bc35a40..0805547 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -20,6 +20,7 @@ #include linux/cdev.h #include linux/fsnotify.h #include linux/sysctl.h +#include linux/kevent.h #include linux/percpu_counter.h #include asm/atomic.h @@ -119,6 +120,7 @@ struct file *get_empty_filp(void) f-f_uid = tsk-fsuid; f-f_gid = tsk-fsgid; eventpoll_init_file(f); + kevent_init_file(f); /* f-f_version: 0 */ return f; @@ -164,6 +166,7 @@ void fastcall __fput(struct file *file) * in the file cleanup chain. */ eventpoll_release(file); + kevent_cleanup_file(file); locks_remove_flock(file); if (file-f_op file-f_op-release) diff --git a/include/linux/fs.h b/include/linux/fs.h index 5baf3a1..8bbf3a5 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -276,6 +276,7 @@ extern int dir_notify_enable; #include linux/init.h #include linux/sched.h #include linux/mutex.h +#include linux/kevent_storage.h #include asm/atomic.h #include asm/semaphore.h @@ -586,6 +587,10 @@ struct inode { struct mutexinotify_mutex; /* protects the watches list */ #endif +#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE + struct kevent_storage st; +#endif + unsigned long i_state; unsigned long dirtied_when; /* jiffies of first dirtying */ @@ -739,6 +744,9 @@ struct file { struct list_headf_ep_links; spinlock_t f_ep_lock; #endif /* #ifdef CONFIG_EPOLL */ +#ifdef CONFIG_KEVENT_POLL + struct kevent_storage st; +#endif struct address_space*f_mapping; }; extern spinlock_t files_lock; diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c new file mode 100644 index 000..11dbe25 --- /dev/null +++ 
b/kernel/kevent/kevent_poll.c @@ -0,0 +1,232 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED] + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/kevent.h> +#include <linux/poll.h> +#include <linux/fs.h> + +static kmem_cache_t *kevent_poll_container_cache; +static kmem_cache_t *kevent_poll_priv_cache; + +struct kevent_poll_ctl +{ + struct poll_table_struct pt; + struct kevent *k; +}; + +struct kevent_poll_wait_container +{ + struct list_head container_entry; + wait_queue_head_t *whead; + wait_queue_t wait; + struct kevent *k; +}; + +struct kevent_poll_private +{ + struct list_head container_list; + spinlock_t container_lock; +}; + +static int kevent_poll_enqueue(struct kevent *k); +static int kevent_poll_dequeue(struct kevent *k); +static int kevent_poll_callback(struct kevent *k); + +static int kevent_poll_wait_callback(wait_queue_t *wait, + unsigned mode, int sync, void *key) +{ + struct kevent_poll_wait_container *cont = + container_of(wait, struct kevent_poll_wait_container, wait); + struct kevent *k = cont->k; + + kevent_storage_ready(&k->st, NULL, KEVENT_MASK_ALL); + return 0; +} + +static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead, + struct poll_table_struct *poll_table) +{ + struct kevent *k = + container_of(poll_table, struct kevent_poll_ctl, pt)->k; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *cont; + unsigned long flags; + + cont = kmem_cache_alloc(kevent_poll_container_cache, GFP_KERNEL); + if (!cont) { + kevent_break(k); + return; + } + + cont->k = k; + init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback); + cont->whead = whead; + + spin_lock_irqsave(&priv->container_lock, flags); +
Re: e100 breakage located
sorry for the delay, your mail got marked as spam. In the future please copy networking issues to netdev@vger.kernel.org, and be sure to copy the maintainers of the driver you're having problems with (they are in the MAINTAINERS file) On 11/22/06, Amin Azez [EMAIL PROTECTED] wrote: I notice a patch in 2005 from Michael O'Donnell to the e100.c driver has stopped auto-crossover working on some e100 devices we use. On one system the auto-negotiation was restored by commenting out: (nic->mac == mac_82551_10) in function e100_phy_init where the MDI/MDI-X is disabled. are you sure that patch did that? What version of e100 are you using? we've since enabled MDI-X on most parts with this patch: http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=60ffa478759f39a2eb3be1ed179bc3764804b2c8;hp=09e590e5d5a93f2eaa748a89c623258e6bad1648 Please try the latest kernel or the latest e100 available from e1000.sf.net if that doesn't work we'll need to know what kernel are you using? lspci reports: 01:04.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 10) 01:04.0 Class 0200: 8086:1229 (rev 10) and on another device 01:05.0 Ethernet controller: Intel Corp. 82559ER (rev 10) 01:01.0 Class 0200: 8086:1209 (rev 10) So it is true that we are revision 10, but 82557/9 not 82551. you're getting confused between decimal and hex. 82551 is rev 16 (0x10) I must confess that having gotten this far, I am lost. Of course I can fix the driver for our hardware but I am not sure how to contrive a general fix. Maybe the actual damage is done in e100_get_defaults(struct nic *nic) where nic->mac is set to nic->rev_id ? But it generally seems to be a failure to take into account the actual hardware type, and only consider the revision.
the only relevant way to tell e100 parts apart is the revision id - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
From: Wenji Wu [EMAIL PROTECTED] Date: Thu, 30 Nov 2006 10:08:22 -0600 If the higher priority processes become runnable (e.g., interactive process), you better yield the CPU, instead of continuing this process. If it is the case that the process within tcp_recvmsg() is expiring, then, you can continue the process to go ahead to process backlog. Yes, I understand this, and I made that point in one of my replies to Ingo Molnar last night. The only seemingly remaining possibility is to find a way to allow input packet processing, at least enough to emit ACKs, during tcp_recvmsg() processing.
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
From: Evgeniy Polyakov [EMAIL PROTECTED] Date: Thu, 30 Nov 2006 13:22:06 +0300 It steals timeslices from other processes to complete tcp_recvmsg() task, and only when it does it for too long, it will be preempted. Processing backlog queue on behalf of need_resched() will break fairness too - processing itself can take a lot of time, so process can be scheduled away in that part too. Yes, at this point I agree with this analysis. Currently I am therefore advocating some way to allow full input packet handling even amidst tcp_recvmsg() processing.
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
From: Ingo Molnar [EMAIL PROTECTED] Date: Thu, 30 Nov 2006 11:32:40 +0100 Note that even without the change the TCP receiving task is already getting a disproportionate share of cycles due to softirq processing! Under a load of 10.0 it went from 500 mbits to 74 mbits, while the 'fair' share would be 50 mbits. So the TCP receiver /already/ has an unfair advantage. The patch only deepens that unfairness. I want to point out something which is slightly misleading about this kind of analysis. Your disk I/O speed doesn't go down by a factor of 10 just because 9 other non disk I/O tasks are running, yet for TCP that's seemingly OK :-) Not looking at input TCP packets enough to send out the ACKs is the same as forgetting to queue some I/O requests that can go to the controller right now. That's the problem, TCP performance is intimately tied to ACK feedback. So we should find a way to make sure ACK feedback goes out, in preference to other tcp_recvmsg() processing. What really should pace the TCP sender in this kind of situation is the advertised window, not the lack of ACKs. Lack of an ACK means the packet didn't get there, which is the wrong signal in this kind of situation, whereas a closing window means application can't keep up with the data rate, hold on... and is the proper flow control signal in this high load scenario. If you don't send ACKs, packets are retransmitted when there is no reason for it, and that borders on illegal. :-)
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
* Wenji Wu [EMAIL PROTECTED] wrote: The solution is really simple and needs no kernel change at all: if you want the TCP receiver to get a larger share of timeslices then either renice it to -20 or renice the other tasks to +19. Simply giving a larger share of timeslices to the TCP receiver won't solve the problem. No matter what the timeslice is, if the TCP receiving process has packets within backlog, and the process is expired and moved to the expired array, RTO might happen in the TCP sender. if you still have the test-setup, could you nevertheless try setting the priority of the receiving TCP task to nice -20 and see what kind of performance you get? Ingo
Re: 2.6.19-rc6-mm2: uli526x only works after reload
On Thursday, 30 November 2006 02:04, Rafael J. Wysocki wrote: On Thursday, 30 November 2006 00:26, Andrew Morton wrote: On Thu, 30 Nov 2006 00:08:21 +0100 Rafael J. Wysocki [EMAIL PROTECTED] wrote: On Wednesday, 29 November 2006 22:31, Rafael J. Wysocki wrote: On Wednesday, 29 November 2006 22:30, Andrew Morton wrote: On Wed, 29 Nov 2006 21:08:00 +0100 Rafael J. Wysocki [EMAIL PROTECTED] wrote: On Wednesday, 29 November 2006 20:54, Rafael J. Wysocki wrote: On Tuesday, 28 November 2006 11:02, Andrew Morton wrote: Temporarily at http://userweb.kernel.org/~akpm/2.6.19-rc6-mm2/ Will appear eventually at ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.19-rc6/2.6.19-rc6-mm2/ A minor issue: on one of my (x86-64) test boxes the uli526x driver doesn't work when it's first loaded. I have to rmmod and modprobe it to make it work. That isn't a minor issue. It worked just fine on -mm1, so something must have happened to it recently. Sorry, I was wrong. The driver doesn't work at all, even after reload. tulip-dmfe-carrier-detection-fix.patch was added in rc6-mm2. But you're not using that (correct?) git-netdev-all changes drivers/net/tulip/de2104x.c, but you're not using that either. git-powerpc(!) alters drivers/net/tulip/de4x5.c, but you're not using that. Beats me, sorry. Perhaps it's due to changes in networking core. It's presumably a showstopper for statically-linked-uli526x users. If you could bisect it, please? I'd start with git-netdev-all, then tulip-*. OK, but it'll take some time. OK, done. It's one of these (the first one alone doesn't compile): git-netdev-all.patch git-netdev-all-fixup.patch libphy-dont-do-that.patch Hm, all of these patches are the same as in -mm1 which hasn't caused any problems to appear on this box. So, it seems there's another change between -mm1 and -mm2 that causes this to happen.
Greetings, Rafael
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
* David Miller [EMAIL PROTECTED] wrote: I want to point out something which is slightly misleading about this kind of analysis. Your disk I/O speed doesn't go down by a factor of 10 just because 9 other non disk I/O tasks are running, yet for TCP that's seemingly OK :-) disk I/O is typically not CPU bound, and i believe these TCP tests /are/ CPU-bound. Otherwise there would be no expiry of the timeslice to begin with and the TCP receiver task would always be boosted to 'interactive' status by the scheduler and would happily chug along at 500 mbits ... (and i grant you, if a disk IO test is 20% CPU bound in process context and system load is 10, then the scheduler will throttle that task quite effectively.) Ingo
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
From: Ingo Molnar [EMAIL PROTECTED] Date: Thu, 30 Nov 2006 21:30:26 +0100 disk I/O is typically not CPU bound, and i believe these TCP tests /are/ CPU-bound. Otherwise there would be no expiry of the timeslice to begin with and the TCP receiver task would always be boosted to 'interactive' status by the scheduler and would happily chug along at 500 mbits ... It's about the prioritization of the work. If all disk I/O were shut off and frozen while we copy file data into userspace, you'd see the same problem for disk I/O.
RE: [patch 1/4] - Potential performance bottleneck for Linux TCP
It steals timeslices from other processes to complete tcp_recvmsg() task, and only when it does it for too long, it will be preempted. Processing backlog queue on behalf of need_resched() will break fairness too - processing itself can take a lot of time, so process can be scheduled away in that part too. It does steal timeslices from other processes to complete tcp_recvmsg() task. But I do not think it will take long. When processing backlog, the processed packets will go to the receive buffer, the TCP flow control will take effect to slow down the sender. The data receiving process might be preempted by higher priority processes. As long as the data receiving process stays in the active array, the problem is not that bad because the process might resume its execution soon. The worst case is that it expires and is moved to the expired array with packets within the backlog queue. wenji
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
* David Miller [EMAIL PROTECTED] wrote: disk I/O is typically not CPU bound, and i believe these TCP tests /are/ CPU-bound. Otherwise there would be no expiry of the timeslice to begin with and the TCP receiver task would always be boosted to 'interactive' status by the scheduler and would happily chug along at 500 mbits ... It's about the prioritization of the work. If all disk I/O were shut off and frozen while we copy file data into userspace, you'd see the same problem for disk I/O. well, it's an issue of how much processing is done in non-prioritized contexts. TCP is a bit more sensitive to process context being throttled - but disk I/O is not immune either: if nothing submits new IO, or if the task does shorts reads+writes then any process level throttling immediately shows up in IO throughput. but in the general sense it is /unfair/ that certain processing such as disk and network IO can get a disproportionate amount of CPU time from the system - just because they happen to have some of their processing in IRQ and softirq context (which is essentially prioritized to SCHED_FIFO 100). A system can easily spend 80% CPU time in softirq context. (and that is easily visible in something like an -rt kernel where various softirq contexts are separate threads and you can see 30% net-rx and 20% net-tx CPU utilization in 'top'). How is this kind of processing different from purely process-context based subsystems? so i agree with you that by tweaking the TCP stack to be less sensitive to process throttling you /will/ improve the relative performance of the TCP receiver task - but in general system design and scheduler design terms it's not a win. i'd also agree with the notion that the current 'throttling' of process contexts can be abrupt and uncooperative, and hence the TCP stack could get more out of the same amount of CPU time if it used it in a smarter way. 
As i pointed it out in the first mail i'd support the TCP stack getting the ability to query how much timeslices it has - or even the scheduler notifying the TCP stack via some downcall if current->timeslice reaches 1 (or something like that). So i don't support the scheme proposed here, the blatant bending of the priority scale towards the TCP workload. Instead what i'd like to see is more TCP performance (and a nicer over-the-wire behavior - no retransmits for example) /with the same 10% CPU time used/. Are we in rough agreement? Ingo
RE: [patch 1/4] - Potential performance bottleneck for Linux TCP
if you still have the test-setup, could you nevertheless try setting the priority of the receiving TCP task to nice -20 and see what kind of performance you get? A process with nice of -20 can easily get the interactivity status. When it expires, it still goes back to the active array. It just hides the TCP problem, instead of solving it. For a process with nice value of -20, it will have the following advantages over other processes: (1) its timeslice is 800ms, the timeslice of a process with a nice value of 0 is 100ms (2) it has higher priority than other processes (3) it is easier to gain the interactivity status. The chances that the process expires and moves to the expired array with packets within backlog are much reduced, but the chance remains. wenji
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
* Ingo Molnar [EMAIL PROTECTED] wrote: [...] Instead what i'd like to see is more TCP performance (and a nicer over-the-wire behavior - no retransmits for example) /with the same 10% CPU time used/. Are we in rough agreement? put in another way: i'd like to see the TCP bytes transferred per CPU time spent by the TCP stack ratio to be maximized in a load-independent way (part of which is the sender host too: to not cause unnecessary retransmits is important as well). In a high-load scenario this means that any measure that purely improves TCP throughput by giving it more cycles is not a real improvement. So the focus should be on throttling intelligently and without causing extra work on the sender side either - not on trying to circumvent throttling measures. Ingo
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
From: Ingo Molnar [EMAIL PROTECTED] Date: Thu, 30 Nov 2006 21:49:08 +0100 So i don't support the scheme proposed here, the blatant bending of the priority scale towards the TCP workload. I don't support this scheme either ;-) That's why my proposal is to find a way to allow input packet processing even during tcp_recvmsg() work. It is a solution that would give the TCP task exactly its time slice, no more, no less, without the erroneous behavior of sleeping with packets held in the socket backlog.
Re: 2.6.19-rc6-mm2: uli526x only works after reload
On Thu, 30 Nov 2006 21:21:27 +0100 Rafael J. Wysocki [EMAIL PROTECTED] wrote: On Thursday, 30 November 2006 02:04, Rafael J. Wysocki wrote: On Thursday, 30 November 2006 00:26, Andrew Morton wrote: On Thu, 30 Nov 2006 00:08:21 +0100 Rafael J. Wysocki [EMAIL PROTECTED] wrote: On Wednesday, 29 November 2006 22:31, Rafael J. Wysocki wrote: On Wednesday, 29 November 2006 22:30, Andrew Morton wrote: On Wed, 29 Nov 2006 21:08:00 +0100 Rafael J. Wysocki [EMAIL PROTECTED] wrote: On Wednesday, 29 November 2006 20:54, Rafael J. Wysocki wrote: On Tuesday, 28 November 2006 11:02, Andrew Morton wrote: Temporarily at http://userweb.kernel.org/~akpm/2.6.19-rc6-mm2/ Will appear eventually at ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.19-rc6/2.6.19-rc6-mm2/ A minor issue: on one of my (x86-64) test boxes the uli526x driver doesn't work when it's first loaded. I have to rmmod and modprobe it to make it work. That isn't a minor issue. It worked just fine on -mm1, so something must have happened to it recently. Sorry, I was wrong. The driver doesn't work at all, even after reload. tulip-dmfe-carrier-detection-fix.patch was added in rc6-mm2. But you're not using that (correct?) git-netdev-all changes drivers/net/tulip/de2104x.c, but you're not using that either. git-powerpc(!) alters drivers/net/tulip/de4x5.c, but you're not using that. Beats me, sorry. Perhaps it's due to changes in networking core. It's presumably a showstopper for statically-linked-uli526x users. If you could bisect it, please? I'd start with git-netdev-all, then tulip-*. OK, but it'll take some time. OK, done. It's one of these (the first one alone doesn't compile): git-netdev-all.patch git-netdev-all-fixup.patch libphy-dont-do-that.patch Hm, all of these patches are the same as in -mm1 which hasn't caused any problems to appear on this box. So, it seems there's another change between -mm1 and -mm2 that causes this to happen.
It would be nice to eliminate libphy-dont-do-that.patch if poss - that was a rogue akpm patch aimed at some incomprehensible gobbledigook in the netdev tree (and to fix the current_is_keventd-not-exported-to-modules bug). I have a feeling that your bug will be cheerily merged into mainline soon. That might of course mean that someone will hit it more firmly and it'll get fixed.
[PATCH 1/1] additional change to ipsec audit
Sorry! Sign off included this time. This patch disables auditing in ipsec when CONFIG_AUDITSYSCALL is disabled in the kernel. This patch also includes a bug fix for xfrm_state.c as a result of original ipsec audit patch. regards, Joy Signed-off-by: Joy Latten [EMAIL PROTECTED] --- diff -urpN linux-2.6.18-patch/include/net/xfrm.h linux-2.6.18-patch.2/include/net/xfrm.h --- linux-2.6.18-patch/include/net/xfrm.h 2006-11-27 12:29:11.0 -0600 +++ linux-2.6.18-patch.2/include/net/xfrm.h 2006-11-28 13:26:49.0 -0600 @@ -395,8 +395,13 @@ struct xfrm_audit uid_t loginuid; u32 secid; }; -void xfrm_audit_log(uid_t auid, u32 secid, int type, int result, + +#ifdef CONFIG_AUDITSYSCALL +extern void xfrm_audit_log(uid_t auid, u32 secid, int type, int result, struct xfrm_policy *xp, struct xfrm_state *x); +#else +#define xfrm_audit_log(a,s,t,r,p,x) do { ; } while (0) +#endif /* CONFIG_AUDITSYSCALL */ static inline void xfrm_pol_hold(struct xfrm_policy *policy) { diff -urpN linux-2.6.18-patch/net/xfrm/xfrm_policy.c linux-2.6.18-patch.2/net/xfrm/xfrm_policy.c --- linux-2.6.18-patch/net/xfrm/xfrm_policy.c 2006-11-27 12:29:33.0 -0600 +++ linux-2.6.18-patch.2/net/xfrm/xfrm_policy.c 2006-11-28 14:51:09.0 -0600 @@ -1955,6 +1955,7 @@ int xfrm_bundle_ok(struct xfrm_policy *p EXPORT_SYMBOL(xfrm_bundle_ok); +#ifdef CONFIG_AUDITSYSCALL /* Audit addition and deletion of SAs and ipsec policy */ void xfrm_audit_log(uid_t auid, u32 sid, int type, int result, @@ -2063,6 +2064,7 @@ void xfrm_audit_log(uid_t auid, u32 sid, } EXPORT_SYMBOL(xfrm_audit_log); +#endif /* CONFIG_AUDITSYSCALL */ int xfrm_policy_register_afinfo(struct xfrm_policy_afinfo *afinfo) { diff -urpN linux-2.6.18-patch/net/xfrm/xfrm_state.c linux-2.6.18-patch.2/net/xfrm/xfrm_state.c --- linux-2.6.18-patch/net/xfrm/xfrm_state.c 2006-11-27 12:29:33.0 -0600 +++ linux-2.6.18-patch.2/net/xfrm/xfrm_state.c 2006-11-28 12:58:56.0 -0600 @@ -407,7 +407,6 @@ restart: xfrm_state_hold(x); spin_unlock_bh(&xfrm_state_lock); - xfrm_state_delete(x); + err = xfrm_state_delete(x); xfrm_audit_log(audit_info->loginuid, audit_info->secid,
Re: [PATCH 1/1] additional ipsec audit patch
On Wed, 2006-11-29 at 19:32 -0500, James Morris wrote: On Wed, 29 Nov 2006, James Morris wrote: On Wed, 29 Nov 2006, Joy Latten wrote: This patch disables auditing in ipsec when CONFIG_AUDITSYSCALL is disabled in the kernel. This patch also includes a bug fix for xfrm_state.c as a result of original ipsec audit patch. Let me know if it looks ok. Also, the last patch contains no Signed-off-by: line, please resend. And, what is the testing status of these patches? I ran a stress test overnight using labeled ipsec on a patched lspp55 kernel using racoon last week. The additional patch to xfrm_state.c was my fault when rebasing to 2.6.19-rc6 to send upstream. I plan to run an ipv4 and ipv6 stress test tonight and tomorrow using labeled ipsec with auditing enabled on the lspp56 kernel, which contains ipsec audit patch, to ensure no regression has occurred. I can also run ipv4 and ipv6 stress tests with regular ipsec over the weekend for further assurance. I compiled and did unit test with SELINUX disabled, AUDITSYSCALL disabled, and with both enabled. regards, Joy
Re: [PATCH 1/1] additional ipsec audit patch
On Thu, 30 Nov 2006, Joy Latten wrote: I ran a stress test overnight using labeled ipsec on a patched lspp55 kernel using racoon last week. The additional patch to xfrm_state.c was my fault when rebasing to 2.6.19-rc6 to send upstream. I plan to run an ipv4 and ipv6 stress test tonight and tomorrow using labeled ipsec with auditing enabled on the lspp56 kernel, which contains ipsec audit patch, to ensure no regression has occurred. I can also run ipv4 and ipv6 stress tests with regular ipsec over the weekend for further assurance. I compiled and did unit test with SELINUX disabled, AUDITSYSCALL disabled, and with both enabled. Thanks, applied to git://git.infradead.org/~jmorris/selinux-net-2.6.20#for-akpm might be worth having it in -mm for a bit. -- James Morris [EMAIL PROTECTED]
Re: [PATCH 2/5] NetXen: temp monitoring, newer firmware support, mm footprint reduction
Don Fry [EMAIL PROTECTED] : NetXen: 1G/10G Ethernet Driver updates - Temperature monitoring and device control - Memory footprint reduction - Driver changes to support newer version of firmware Signed-off-by: Amit S. Kale [EMAIL PROTECTED] Signed-off-by: Don Fry [EMAIL PROTECTED] diff -Nupr netdev-2.6/drivers/net/netxen.one/netxen_nic_ethtool.c netdev-2.6/drivers/net/netxen/netxen_nic_ethtool.c --- netdev-2.6/drivers/net/netxen.one/netxen_nic_ethtool.c 2006-11-30 09:16:23.0 -0800 +++ netdev-2.6/drivers/net/netxen/netxen_nic_ethtool.c 2006-11-30 09:22:41.0 -0800 @@ -53,6 +53,9 @@ struct netxen_nic_stats { #define NETXEN_NIC_STAT(m) sizeof(((struct netxen_port *)0)->m), \ offsetof(struct netxen_port, m) +#define NETXEN_NIC_PORT_WINDOW 0x1 +#define NETXEN_NIC_INVALID_DATA 0xDEADBEEF + static const struct netxen_nic_stats netxen_nic_gstrings_stats[] = { {"rcvd_bad_skb", NETXEN_NIC_STAT(stats.rcvdbadskb)}, {"xmit_called", NETXEN_NIC_STAT(stats.xmitcalled)}, @@ -111,9 +114,9 @@ netxen_nic_get_drvinfo(struct net_device { struct netxen_port *port = netdev_priv(dev); struct netxen_adapter *adapter = port->adapter; - uint32_t fw_major = 0; - uint32_t fw_minor = 0; - uint32_t fw_build = 0; + u32 fw_major = 0; + u32 fw_minor = 0; + u32 fw_build = 0; The description of the patch did not announce (welcome) cleanup. There are a few ones. [...] strncpy(drvinfo->driver, "netxen_nic", 32); strncpy(drvinfo->version, NETXEN_NIC_LINUX_VERSIONID, 32); @@ -136,6 +139,8 @@ netxen_nic_get_settings(struct net_devic { struct netxen_port *port = netdev_priv(dev); struct netxen_adapter *adapter = port->adapter; + struct netxen_board_info *boardinfo; + boardinfo = &adapter->ahw.boardcfg; Missing separating line or merge the two lines. [...] @@ -182,13 +174,47 @@ netxen_nic_get_settings(struct net_devic ecmd->speed = SPEED_1; ecmd->duplex = DUPLEX_FULL; - ecmd->phy_address = port->portnum; - ecmd->transceiver = XCVR_EXTERNAL; ecmd->autoneg = AUTONEG_DISABLE; - return 0; + } else + return -EIO; + + ecmd->phy_address = port->portnum; + ecmd->transceiver = XCVR_EXTERNAL; + + switch ((netxen_brdtype_t) boardinfo->board_type) { + case NETXEN_BRDTYPE_P2_SB35_4G: + case NETXEN_BRDTYPE_P2_SB31_2G: + ecmd->supported |= SUPPORTED_Autoneg; + ecmd->advertising |= ADVERTISED_Autoneg; + case NETXEN_BRDTYPE_P2_SB31_10G_CX4: + ecmd->supported |= SUPPORTED_TP; + ecmd->advertising |= ADVERTISED_TP; + ecmd->port = PORT_TP; + ecmd->autoneg = (boardinfo->board_type == + NETXEN_BRDTYPE_P2_SB31_10G_CX4) ? + (AUTONEG_DISABLE) : (port->link_autoneg); + break; + case NETXEN_BRDTYPE_P2_SB31_10G_HMEZ: + case NETXEN_BRDTYPE_P2_SB31_10G_IMEZ: + ecmd->supported |= SUPPORTED_MII; + ecmd->advertising |= ADVERTISED_MII; + ecmd->port = PORT_FIBRE; + ecmd->autoneg = AUTONEG_DISABLE; + break; + case NETXEN_BRDTYPE_P2_SB31_10G: + ecmd->supported |= SUPPORTED_FIBRE; + ecmd->advertising |= ADVERTISED_FIBRE; + ecmd->port = PORT_FIBRE; + ecmd->autoneg = AUTONEG_DISABLE; + break; + default: + printk("ERROR: Unsupported board model %d\n", + (netxen_brdtype_t) boardinfo->board_type); Missing KERN_ERR [...] diff -Nupr netdev-2.6/drivers/net/netxen.one/netxen_nic.h netdev-2.6/drivers/net/netxen/netxen_nic.h --- netdev-2.6/drivers/net/netxen.one/netxen_nic.h 2006-11-30 09:16:23.0 -0800 +++ netdev-2.6/drivers/net/netxen/netxen_nic.h 2006-11-30 09:22:41.0 -0800 [...] @@ -328,6 +343,7 @@ typedef enum { NETXEN_BRDTYPE_P2_SB31_10G_HMEZ = 0x000e, NETXEN_BRDTYPE_P2_SB31_10G_CX4 = 0x000f } netxen_brdtype_t; +#define NUM_SUPPORTED_BOARDS (sizeof(netxen_boards)/sizeof(netxen_brdinfo_t)) typedef enum { NETXEN_BRDMFG_INVENTEC = 1 [...] @@ -869,7 +937,10 @@ static inline void netxen_nic_disable_in /* * ISR_INT_MASK: Can be read from window 0 or 1. */ - writel(0x7ff, (void __iomem *)(adapter->ahw.pci_base + ISR_INT_MASK)); + writel(0x7ff, + (void __iomem + *)(PCI_OFFSET_SECOND_RANGE(adapter, ISR_INT_MASK))); + Yuck. [...] @@ -888,13 +959,83 @@ static inline void netxen_nic_enable_int break; } - writel(mask, (void __iomem *)(adapter->ahw.pci_base + ISR_INT_MASK)); + writel(mask, + (void __iomem + *)(PCI_OFFSET_SECOND_RANGE(adapter, ISR_INT_MASK))); if (!(adapter->flags
Re: [PATCH 0/5 addendum] NetXen
Don Fry wrote: The NetXen patches fix many problems in the current #upstream version of the driver. It has warts and probably lots of bugs still, but it is better than what is queued for mainline inclusion at this time. Please apply to 2.6.20. Please resync with netdev#upstream, and update for comments on netdev... Jeff
Re: [PATCH][IPSEC][1/7] inter address family ipsec tunnel
From: Kazunori MIYAZAWA [EMAIL PROTECTED] Date: Fri, 24 Nov 2006 14:38:07 +0900 This patch adds encapsulation family. Signed-off-by: Miika Komu [EMAIL PROTECTED] Signed-off-by: Diego Beltrami [EMAIL PROTECTED] Signed-off-by: Kazunori Miyazawa [EMAIL PROTECTED] Applied to net-2.6.20, thanks.
Re: [PATCH][IPSEC][2/7] inter address family ipsec tunnel
From: Kazunori MIYAZAWA [EMAIL PROTECTED] Date: Fri, 24 Nov 2006 14:38:17 +0900 This patch adds netlink interface of the family Signed-off-by: Miika Komu [EMAIL PROTECTED] Signed-off-by: Diego Beltrami [EMAIL PROTECTED] Signed-off-by: Kazunori Miyazawa [EMAIL PROTECTED] Applied to net-2.6.20, thanks.
Re: [PATCH][IPSEC][3/7] inter address family ipsec tunnel
From: Kazunori MIYAZAWA [EMAIL PROTECTED] Date: Thu, 30 Nov 2006 10:54:26 +0900 Hello, I found a bug in my previous patch for af_key. The patch breaks transport mode. This is a fixed version. Signed-off-by: Miika Komu [EMAIL PROTECTED] Signed-off-by: Diego Beltrami [EMAIL PROTECTED] Signed-off-by: Kazunori Miyazawa [EMAIL PROTECTED] Applied to net-2.6.20, thanks a lot.
Re: [PATCH][IPSEC][4/7] inter address family ipsec tunnel
From: Kazunori MIYAZAWA [EMAIL PROTECTED] Date: Fri, 24 Nov 2006 14:38:39 +0900 What is going on here? + /* Without this, the atomic inc below segfaults */ + if (encap_family == AF_INET6) { + rt->peer = NULL; + rt_bind_peer(rt,1); + } ... - dst_prev->output = xfrm4_output; + if (dst_prev->xfrm->props.family == AF_INET) + dst_prev->output = xfrm4_output; +#if defined(CONFIG_IPV6) || defined (CONFIG_IPV6_MODULE) + else + dst_prev->output = xfrm6_output; +#endif if (rt->peer) atomic_inc(&rt->peer->refcnt); If it's non-NULL and you get a segfault for atomic_inc() that means there is garbage here, and it seems that if you're setting it to NULL explicitly then it's just a workaround for whatever problem is causing it to be non-NULL to begin with. What is putting a non-valid pointer value there? Is this an IPV6 or IPSEC dst route by chance? If so, that makes this change really wrong, and we are corrupting the route by running rt_bind_peer() on it. rt_bind_peer() is only valid on ipv4 route entries.
Re: [PATCH][IPSEC][5/7] inter address family ipsec tunnel
From: Kazunori MIYAZAWA [EMAIL PROTECTED] Date: Fri, 24 Nov 2006 14:38:52 +0900 +static inline void ip6ip_ecn_decapsulate(struct sk_buff *skb) +{ + if (INET_ECN_is_ce(ipv6_get_dsfield(skb->nh.ipv6h))) + IP_ECN_set_ce(skb->h.ipiph); +} + Please fix this extra tab indentation :-) Thank you.
[PATCH] zd1211rw: zd_mac_rx isn't always called in IRQ context
e.g.

usb 1-7: rx_urb_complete() *** first fragment ***
usb 1-7: rx_urb_complete() *** second fragment ***
drivers/net/wireless/zd1211rw/zd_mac.c:1063 ASSERT(((current_thread_info()->preempt_count) & (((1UL << (12)) - 1) << ((0 + 8) + 8))) VIOLATED!
 [<f0299448>] zd_mac_rx+0x3e7/0x47a [zd1211rw]
 [<f029badc>] rx_urb_complete+0x22d/0x24a [zd1211rw]
 [<b028a22f>] urb_destroy+0x0/0x5
 [<b01f0930>] kref_put+0x65/0x72
 [<b0288cdf>] usb_hcd_giveback_urb+0x28/0x57
 [<b02950c4>] qh_completions+0x296/0x2f6
 [<b0294b21>] ehci_urb_done+0x70/0x7a
 [<b0294ea1>] qh_completions+0x73/0x2f6
 [<b02951bc>] ehci_work+0x98/0x538

Remove the bogus assertion, and use dev_kfree_skb_any() as pointed out by Ulrich Kunitz.

Signed-off-by: Daniel Drake [EMAIL PROTECTED]

Index: linux-2.6/drivers/net/wireless/zd1211rw/zd_mac.c
===================================================================
--- linux-2.6.orig/drivers/net/wireless/zd1211rw/zd_mac.c
+++ linux-2.6/drivers/net/wireless/zd1211rw/zd_mac.c
@@ -1059,10 +1059,8 @@ int zd_mac_rx(struct zd_mac *mac, const
 	memcpy(skb_put(skb, length), buffer, length);
 
 	r = ieee80211_rx(ieee, skb, stats);
-	if (!r) {
-		ZD_ASSERT(in_irq());
-		dev_kfree_skb_irq(skb);
-	}
+	if (!r)
+		dev_kfree_skb_any(skb);
 	return 0;
 }
[PATCH] zd1211rw: Fill enc_capa in GIWRANGE handler
This is needed for NetworkManager users to connect to WPA networks. Pointed out by Matthew Campbell.

Signed-off-by: Daniel Drake [EMAIL PROTECTED]
---
 zd_mac.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

Index: linux-2.6/drivers/net/wireless/zd1211rw/zd_mac.c
===================================================================
--- linux-2.6.orig/drivers/net/wireless/zd1211rw/zd_mac.c
+++ linux-2.6/drivers/net/wireless/zd1211rw/zd_mac.c
@@ -615,6 +615,9 @@ int zd_mac_get_range(struct zd_mac *mac,
 	range->we_version_compiled = WIRELESS_EXT;
 	range->we_version_source = 20;
 
+	range->enc_capa = IW_ENC_CAPA_WPA | IW_ENC_CAPA_WPA2 |
+			  IW_ENC_CAPA_CIPHER_TKIP | IW_ENC_CAPA_CIPHER_CCMP;
+
 	ZD_ASSERT(!irqs_disabled());
 	spin_lock_irq(&mac->lock);
 	regdomain = mac->regdomain;
[PATCH] zd1211rw: Support for multicast addresses
From: Ulrich Kunitz [EMAIL PROTECTED]

Support for multicast addresses is implemented by supporting the set_multicast_list() function of the network device. Address filtering is supported by a group hash table in the device. This is based on earlier work by Benoit Papillaut. Fixes multicast packet reception and ipv6 connectivity:

http://bugzilla.kernel.org/show_bug.cgi?id=7424
http://bugzilla.kernel.org/show_bug.cgi?id=7425

Signed-off-by: Ulrich Kunitz [EMAIL PROTECTED]
Signed-off-by: Daniel Drake [EMAIL PROTECTED]
---
 zd_chip.c   |   13 +
 zd_chip.h   |   43 ++-
 zd_mac.c    |   44 +++-
 zd_mac.h    |    3 +++
 zd_netdev.c |    2 +-
 5 files changed, 102 insertions(+), 3 deletions(-)

Index: linux-2.6/drivers/net/wireless/zd1211rw/zd_chip.c
===================================================================
--- linux-2.6.orig/drivers/net/wireless/zd1211rw/zd_chip.c
+++ linux-2.6/drivers/net/wireless/zd1211rw/zd_chip.c
@@ -1673,3 +1673,16 @@ int zd_rfwritev_cr_locked(struct zd_chip
 	return 0;
 }
+
+int zd_chip_set_multicast_hash(struct zd_chip *chip,
+			       struct zd_mc_hash *hash)
+{
+	struct zd_ioreq32 ioreqs[] = {
+		{ CR_GROUP_HASH_P1, hash->low },
+		{ CR_GROUP_HASH_P2, hash->high },
+	};
+
+	dev_dbg_f(zd_chip_dev(chip), "hash l 0x%08x h 0x%08x\n",
+		ioreqs[0].value, ioreqs[1].value);
+	return zd_iowrite32a(chip, ioreqs, ARRAY_SIZE(ioreqs));
+}

Index: linux-2.6/drivers/net/wireless/zd1211rw/zd_chip.h
===================================================================
--- linux-2.6.orig/drivers/net/wireless/zd1211rw/zd_chip.h
+++ linux-2.6/drivers/net/wireless/zd1211rw/zd_chip.h
@@ -390,10 +390,19 @@
 #define CR_BSSID_P1		CTL_REG(0x0618)
 #define CR_BSSID_P2		CTL_REG(0x061C)
 #define CR_BCN_PLCP_CFG		CTL_REG(0x0620)
+
+/* Group hash table for filtering incoming packets.
+ *
+ * The group hash table is 64 bit large and split over two parts. The first
+ * part is the lower part. The upper 6 bits of the last byte of the target
+ * address are used as index. Packets are received if the hash table bit is
+ * set. This is used for multicast handling, but for broadcasts (address
+ * ff:ff:ff:ff:ff:ff) the highest bit in the second table must also be set.
+ */
 #define CR_GROUP_HASH_P1	CTL_REG(0x0624)
 #define CR_GROUP_HASH_P2	CTL_REG(0x0628)
-#define CR_RX_TIMEOUT		CTL_REG(0x062C)
+#define CR_RX_TIMEOUT		CTL_REG(0x062C)
 /* Basic rates supported by the BSS. When producing ACK or CTS messages, the
  * device will use a rate in this table that is less than or equal to the rate
  * of the incoming frame which prompted the response */
@@ -864,4 +873,36 @@
 u8 zd_rx_strength_percent(u8 rssi);
 u16 zd_rx_rate(const void *rx_frame, const struct rx_status *status);
 
+struct zd_mc_hash {
+	u32 low;
+	u32 high;
+};
+
+static inline void zd_mc_clear(struct zd_mc_hash *hash)
+{
+	hash->low = 0;
+	/* The interfaces must always receive broadcasts.
+	 * The hash of the broadcast address ff:ff:ff:ff:ff:ff is 63.
+	 */
+	hash->high = 0x80000000;
+}
+
+static inline void zd_mc_add_all(struct zd_mc_hash *hash)
+{
+	hash->low = hash->high = 0xffffffff;
+}
+
+static inline void zd_mc_add_addr(struct zd_mc_hash *hash, u8 *addr)
+{
+	unsigned int i = addr[5] >> 2;
+	if (i < 32) {
+		hash->low |= 1 << i;
+	} else {
+		hash->high |= 1 << (i - 32);
+	}
+}
+
+int zd_chip_set_multicast_hash(struct zd_chip *chip,
+			       struct zd_mc_hash *hash);
+
 #endif /* _ZD_CHIP_H */

Index: linux-2.6/drivers/net/wireless/zd1211rw/zd_mac.c
===================================================================
--- linux-2.6.orig/drivers/net/wireless/zd1211rw/zd_mac.c
+++ linux-2.6/drivers/net/wireless/zd1211rw/zd_mac.c
@@ -39,6 +39,8 @@ static void housekeeping_init(struct zd_
 static void housekeeping_enable(struct zd_mac *mac);
 static void housekeeping_disable(struct zd_mac *mac);
 
+static void set_multicast_hash_handler(void *mac_ptr);
+
 int zd_mac_init(struct zd_mac *mac,
 	struct net_device *netdev,
 	struct usb_interface *intf)
@@ -55,6 +57,8 @@ int zd_mac_init(struct zd_mac *mac,
 	softmac_init(ieee80211_priv(netdev));
 	zd_chip_init(&mac->chip, netdev, intf);
 	housekeeping_init(mac);
+	INIT_WORK(&mac->set_multicast_hash_work, set_multicast_hash_handler,
+		  mac);
 	return 0;
 }
@@ -136,6 +140,7 @@ out:
 void zd_mac_clear(struct zd_mac *mac)
 {
+	flush_workqueue(zd_workqueue);
 	zd_chip_clear(&mac->chip);
 	ZD_ASSERT(!spin_is_locked(&mac->lock));
 	ZD_MEMCLEAR(mac, sizeof(struct zd_mac));
@@ -256,6 +261,42 @@ int zd_mac_set_mac_address(struct
Re: [PATCH][IPSEC][6/7] inter address family ipsec tunnel
From: Kazunori MIYAZAWA [EMAIL PROTECTED]
Date: Fri, 24 Nov 2006 14:39:01 +0900

This patch fixes the mtu calculation for IPv4. ip_append_data() should refer to the mtu of dst, not path. If dst is stacked, path is the actual dst_entry in the routing table; the mtu of path therefore equals the link mtu, which depends only on the device and ignores the header and trailer lengths. dst carries the mtu to use when creating packets.

Signed-off-by: Miika Komu [EMAIL PROTECTED]
Signed-off-by: Diego Beltrami [EMAIL PROTECTED]
Signed-off-by: Kazunori Miyazawa [EMAIL PROTECTED]

I'm not sure about this change. If you look at the code in this function, mtu is always used with adjustments via 'exthdrlen' (which is set to rt->u.dst.header_len). So it seems the encapsulation is taken into account. Perhaps the problem you are seeing is some artifact of the ipv6-in-ipv4 tunnel implementation. Otherwise we'd have other reports of this problem, wouldn't we?
Re: [PATCH][IPSEC][7/7] inter address family ipsec tunnel
From: Kazunori MIYAZAWA [EMAIL PROTECTED]
Date: Fri, 24 Nov 2006 14:39:17 +0900

ip6_append_data() should refer to the mtu of dst, for the same reason as the previous patch.

My comments on the ipv4 side of this change also apply here.