Re: pktgen

2006-11-30 Thread Robert Olsson


Hello!

Seems you found a race when rmmod is done before the module is fully started.

Try:

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 733d86d..ac0b4b1 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -160,7 +160,7 @@
 #include <asm/div64.h>		/* do_div */
 #include <asm/timex.h>
 
-#define VERSION  "pktgen v2.68: Packet Generator for packet performance testing.\n"
+#define VERSION  "pktgen v2.69: Packet Generator for packet performance testing.\n"
 
 /* #define PG_DEBUG(a) a */
 #define PG_DEBUG(a)
@@ -3673,6 +3673,8 @@ static void __exit pg_cleanup(void)
 	struct list_head *q, *n;
 	wait_queue_head_t queue;
 	init_waitqueue_head(&queue);
+
+	schedule_timeout_interruptible(msecs_to_jiffies(125));
 
 	/* Stop all interfaces & threads */
 


for i in 1 2 3 4 5 ; do modprobe pktgen ; rmmod pktgen ; done

In dmesg
pktgen v2.69: Packet Generator for packet performance testing.
pktgen v2.69: Packet Generator for packet performance testing.
pktgen v2.69: Packet Generator for packet performance testing.
pktgen v2.69: Packet Generator for packet performance testing.
pktgen v2.69: Packet Generator for packet performance testing.

Cheers.
--ro



Alexey Dobriyan writes:
  On 11/30/06, David Miller [EMAIL PROTECTED] wrote:
   From: Alexey Dobriyan [EMAIL PROTECTED]
   Date: Wed, 29 Nov 2006 23:04:37 +0300
  
Looks like worker thread strategically clears it if scheduled at wrong
moment.
   
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -3292,7 +3292,6 @@ static void pktgen_thread_worker(struct
   
    init_waitqueue_head(&t->queue);
   
-   t->control &= ~(T_TERMINATE);
    t->control &= ~(T_RUN);
    t->control &= ~(T_STOP);
    t->control &= ~(T_REMDEVALL);
  
   Good catch Alexey.  Did you rerun the load/unload test with
   this fix applied?  If it fixes things, I'll merge it.
  
  Well, yes, it fixes things, except that using Ctrl+C to get out of the
  modprobe/rmmod loop will spit out a backtrace again. As for the other
  flags, T_RUN and T_STOP: clearing them is not needed because of kzalloc,
  and it creates bugs, as demonstrated.
  
  Give me some time.
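
The lost-flag race discussed above can be sketched in user space roughly as
follows. This is only an illustration under stated assumptions: the flag
values and every helper name here are hypothetical, and the "fixed" variant
drops all of the clearing, which is what the follow-up suggests rather than
what the posted one-line patch does.

```c
#include <stdint.h>

/* Illustrative flag values only; the real ones live in net/core/pktgen.c. */
#define T_TERMINATE (1u << 0)
#define T_STOP      (1u << 1)
#define T_RUN       (1u << 2)

/* rmmod side (hypothetical helper): ask the worker to exit. */
void request_terminate(uint32_t *control)
{
	*control |= T_TERMINATE;
}

/* Worker start-up before the fix: it wipes a request that is already pending. */
void worker_init_buggy(uint32_t *control)
{
	*control &= ~T_TERMINATE;
	*control &= ~T_RUN;
	*control &= ~T_STOP;
}

/* Worker start-up with the clearing dropped; the memory is zeroed at allocation. */
void worker_init_fixed(uint32_t *control)
{
	(void)control;	/* nothing to clear */
}

/* Returns nonzero if a terminate request issued before start-up survives. */
int terminate_survives(void (*init)(uint32_t *))
{
	uint32_t control = 0;		/* as if freshly zero-allocated */

	request_terminate(&control);	/* rmmod runs first */
	init(&control);			/* worker gets scheduled late */
	return (control & T_TERMINATE) != 0;
}
```

With the buggy start-up the request is lost and the worker never sees it; with
the clearing removed the request is still set when the worker starts.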
  -
  To unsubscribe from this list: send the line unsubscribe netdev in
  the body of a message to [EMAIL PROTECTED]
  More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] [PATCH V2 0/3] bonding support for operation over IPoIB

2006-11-30 Thread Or Gerlitz
This patch series is a second version (see the links to V1 below) of the
suggested changes to the bonding driver such that it would be able to support
non ARPHRD_ETHER netdevices for its High-Availability (active-backup) mode.

The motivation is to enable the bonding driver on its HA mode to work with the
IP over Infiniband (IPoIB) driver. With these patches I was able to enslave
IPoIB netdevices and run TCP, UDP, IP (UDP) Multicast and ICMP traffic with
fail-over and fail-back working fine. My working env was the net-2.6.20 git.

Moreover, as IPoIB is also the IB ARP provider for the RDMA CM driver which
is used by native IB ULPs whose addressing scheme is based on IP (eg iSER, SDP,
Lustre, NFSoRDMA, RDS), bonding support for IPoIB devices **enables** HA for
these ULPs. This holds because when the ULP is informed by the IB HW of the
failure of the current IB connection, it just needs to reconnect, and the
bonding device will then issue the IB ARP over the active IPoIB slave.

The first patch changes some of the bond netdevice attributes and functions
to be those of the active slave when the enslaved device is not of
ARPHRD_ETHER type. Basically it overrides the settings done by ether_setup(),
which are netdevice **type** dependent and hence might not be appropriate for
devices of other types. It also enforces mutual exclusion on bonding slaves
of dissimilar ether types, as was concluded in the V1 discussion.

The IPoIB (see Documentation/infiniband/ipoib.txt) MAC address is made of a
3-byte IB QP (Queue Pair) number and the 16-byte IB port GID (Global ID) of the
port this IPoIB device is bound to. The QP is a resource created by the IB HW
and the GID is an identifier burned into the HCA (I have omitted here some
details which are not important for the bonding RFC).
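
As a rough illustration of the address layout described above, here is a
user-space sketch that packs a QP number and a GID into one hardware address.
The function name is hypothetical, and the leading byte (which makes the
address 20 bytes long) is an assumption standing in for the details left out
above.

```c
#include <stdint.h>
#include <string.h>

/* Assumed total length: 1 leading byte + 3-byte QPN + 16-byte GID. */
#define IPOIB_HW_ADDR_LEN 20

/* Hypothetical helper: build the address from a QP number and a port GID. */
void ipoib_build_hw_addr(uint8_t out[IPOIB_HW_ADDR_LEN], uint32_t qpn,
			 const uint8_t gid[16])
{
	out[0] = 0;			/* leading byte, left zero in this sketch */
	out[1] = (qpn >> 16) & 0xff;	/* 3-byte QP number, most significant first */
	out[2] = (qpn >> 8) & 0xff;
	out[3] = qpn & 0xff;
	memcpy(out + 4, gid, 16);	/* 16-byte port GID */
}
```

Since both parts come from the IB HW, there is nothing in such an address that
a set_mac_address() call could meaningfully overwrite.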

Basically the IPoIB spec and impl. do not allow for setting the MAC address of
an IPoIB device and this work was made under this assumption.

Hence, the second patch allows for enslaving netdevices which do not support
the set_mac_address() function. In that case the bond mac address is the one
of the active slave, where remote peers are notified on the mac address
(neighbour) change by Gratuitous ARP sent by bonding when fail-over occurs
(this is already done by the bonding code).

Normally, the bonding driver is UP before any enslavement takes place.
Once a netdevice is UP, the network stack acts to have it join some multicast
groups (eg the all-hosts 224.0.0.1). Now, since ether_setup() has set the
bonding device type to be ARPHRD_ETHER and the address len to be ETH_ALEN, the
net core code computes a wrong multicast link address. This is because
ip_eth_mc_map() is called, whereas for mcast joins taking place **after** the
enslavement another ip_xxx_mc_map() is called (eg ip_ib_mc_map() when the bond
type is ARPHRD_INFINIBAND).
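
To show why the Ethernet mapping is wrong for an IPoIB bond, here is a
user-space sketch of the standard IPv4-to-Ethernet multicast mapping (fixed
01:00:5e prefix plus the low 23 bits of the group). The function name is
hypothetical and the group is taken in host byte order for simplicity; the
InfiniBand mapping is not shown.

```c
#include <stdint.h>

/* Map a host-order IPv4 multicast group to a 6-byte Ethernet address. */
void eth_mc_map(uint32_t group, uint8_t mac[6])
{
	mac[0] = 0x01;
	mac[1] = 0x00;
	mac[2] = 0x5e;
	mac[3] = (group >> 16) & 0x7f;	/* only the low 23 bits are kept */
	mac[4] = (group >> 8) & 0xff;
	mac[5] = group & 0xff;
}
```

For the all-hosts group 224.0.0.1 this yields 01:00:5e:00:00:01, a 6-byte
address that cannot be a valid link address on a device whose addr_len is that
of an IPoIB slave.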

The third patch handles this problem by allowing to enslave devices when the
bonding device is not up. Over the discussion held at the previous post this
seemed to be the most clean way to go, where it is not expected to cause
instabilities.

These patches are not enough for configuration of IPoIB bonding through tools
(eg /sbin/ifenslave and /sbin/ifup) provided by packages such as sysconfig and
initscripts, specifically since these tools set the bonding device to be UP
before enslaving anything. Once this patchset gets positive feedback, the next
step would be to look at how to enhance the tools/packages so it would be
possible to bond/enslave with the modified code. As suggested by the bonding
maintainer, this step can potentially involve converting ifenslave to be a
script based on the bonding sysfs infrastructure rather than on the somewhat
obsolete Documentation/networking/ifenslave.c

For the ease of potential testers, I will post an example bonding sysfs script
which can be used to set bonding to work with patches 1-3 (let me know!)

Or.

changes from V1 (the links point to V1 0-3/3)

http://marc.theaimsgroup.com/?l=linux-netdev&m=115926582209736&w=2
http://marc.theaimsgroup.com/?l=linux-netdev&m=115926599515568&w=2
http://marc.theaimsgroup.com/?l=linux-netdev&m=115926599430055&w=2
http://marc.theaimsgroup.com/?l=linux-netdev&m=115926599415729&w=2

+ enforce mutual exclusion on the slaves ether types
+ don't attempt to set the bond mtu when enslaving a non ARPHRD_ETHER device
+ rather than hack the bond device ether type through mod params allow
  enslavement when the bond device is not up


[RFC][PATCH V2 1/3] enable bonding to enslave non ARPHRD_ETHER netdevices

2006-11-30 Thread Or Gerlitz
Signed-off-by: Or Gerlitz [EMAIL PROTECTED]

Index: net-2.6.20/drivers/net/bonding/bond_main.c
===
--- net-2.6.20.orig/drivers/net/bonding/bond_main.c 2006-11-30 
10:54:23.0 +0200
+++ net-2.6.20/drivers/net/bonding/bond_main.c  2006-11-30 11:53:06.0 
+0200
@@ -1252,6 +1252,24 @@ static int bond_compute_features(struct
return 0;
 }

+
+static void bond_setup_by_slave(struct net_device *bond_dev,
+   struct net_device *slave_dev)
+{
+   bond_dev->hard_header         = slave_dev->hard_header;
+   bond_dev->rebuild_header      = slave_dev->rebuild_header;
+   bond_dev->hard_header_cache   = slave_dev->hard_header_cache;
+   bond_dev->header_cache_update = slave_dev->header_cache_update;
+   bond_dev->hard_header_parse   = slave_dev->hard_header_parse;
+
+   bond_dev->type            = slave_dev->type;
+   bond_dev->hard_header_len = slave_dev->hard_header_len;
+   bond_dev->addr_len        = slave_dev->addr_len;
+
+   memcpy(bond_dev->broadcast, slave_dev->broadcast,
+       slave_dev->addr_len);
+}
+
 /* enslave device slave to bond device master */
 int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 {
@@ -1326,6 +1344,24 @@ int bond_enslave(struct net_device *bond
goto err_undo_flags;
}

+   /* set bonding device ether type by slave - bonding netdevices are
+    * created with ether_setup, so when the slave type is not ARPHRD_ETHER
+    * there is a need to override some of the type dependent attribs/funcs.
+    *
+    * bond ether type mutual exclusion - don't allow slaves of dissimilar
+    * ether type (eg ARPHRD_ETHER and ARPHRD_INFINIBAND) share the same bond
+    */
+   if (bond->slave_cnt == 0) {
+       if (slave_dev->type != ARPHRD_ETHER)
+           bond_setup_by_slave(bond_dev, slave_dev);
+   } else if (bond_dev->type != slave_dev->type) {
+       printk(KERN_ERR DRV_NAME ": %s ether type (%d) is different from "
+           "other slaves (%d), can not enslave it.\n", slave_dev->name,
+           slave_dev->type, bond_dev->type);
+       res = -EINVAL;
+       goto err_undo_flags;
+   }
+
    if (slave_dev->set_mac_address == NULL) {
        printk(KERN_ERR DRV_NAME
            ": %s: Error: The slave device you specified does "
Index: net-2.6.20/drivers/net/bonding/bonding.h
===
--- net-2.6.20.orig/drivers/net/bonding/bonding.h   2006-11-30 
10:54:23.0 +0200
+++ net-2.6.20/drivers/net/bonding/bonding.h2006-11-30 10:58:10.0 
+0200
@@ -201,6 +201,7 @@ struct bonding {
struct   list_head vlan_list;
struct   vlan_group *vlgrp;
struct   packet_type arp_mon_pt;
+   s8   do_set_mac_addr;
 };

 /**



[RFC][PATCH V2 2/3] enable bonding to enslave netdevices not supporting set_mac_address()

2006-11-30 Thread Or Gerlitz
Signed-off-by: Or Gerlitz [EMAIL PROTECTED]

Index: net-2.6.20/drivers/net/bonding/bond_main.c
===
--- net-2.6.20.orig/drivers/net/bonding/bond_main.c 2006-11-30 
11:53:06.0 +0200
+++ net-2.6.20/drivers/net/bonding/bond_main.c  2006-11-30 11:53:16.0 
+0200
@@ -1103,6 +1103,14 @@ void bond_change_active_slave(struct bon
if (new_active) {
bond_set_slave_active_flags(new_active);
}
+
+   /* when bonding does not set the slave MAC address, the bond MAC
+* address is the one of the active slave.
+*/
+   if (new_active && !bond->do_set_mac_addr)
+       memcpy(bond->dev->dev_addr,  new_active->dev->dev_addr,
+           new_active->dev->addr_len);
+
bond_send_gratuitous_arp(bond);
}
 }
@@ -1363,14 +1371,23 @@ int bond_enslave(struct net_device *bond
}

    if (slave_dev->set_mac_address == NULL) {
-       printk(KERN_ERR DRV_NAME
-           ": %s: Error: The slave device you specified does "
-           "not support setting the MAC address. "
-           "Your kernel likely does not support slave "
-           "devices.\n", bond_dev->name);
-       res = -EOPNOTSUPP;
-       goto err_undo_flags;
-   }
+       if (bond->slave_cnt == 0) {
+           printk(KERN_WARNING DRV_NAME
+               ": %s: Warning: The first slave device you "
+               "specified does not support setting the MAC "
+               "address. This bond MAC address would be that "
+               "of the active slave.\n", bond_dev->name);
+           bond->do_set_mac_addr = 0;
+       } else if (bond->do_set_mac_addr) {
+           printk(KERN_ERR DRV_NAME
+               ": %s: Error: The slave device you specified "
+               "does not support setting the MAC address, "
+               "but this bond uses this practice.\n",
+               bond_dev->name);
+           res = -EOPNOTSUPP;
+           goto err_undo_flags;
+       }
+   }

new_slave = kmalloc(sizeof(struct slave), GFP_KERNEL);
if (!new_slave) {
@@ -1392,16 +1409,18 @@ int bond_enslave(struct net_device *bond
 */
    memcpy(new_slave->perm_hwaddr, slave_dev->dev_addr, ETH_ALEN);

-   /*
-    * Set slave to master's mac address.  The application already
-    * set the master's mac address to that of the first slave
-    */
-   memcpy(addr.sa_data, bond_dev->dev_addr, bond_dev->addr_len);
-   addr.sa_family = slave_dev->type;
-   res = dev_set_mac_address(slave_dev, &addr);
-   if (res) {
-       dprintk("Error %d calling set_mac_address\n", res);
-       goto err_free;
+   if (bond->do_set_mac_addr) {
+       /*
+        * Set slave to master's mac address.  The application already
+        * set the master's mac address to that of the first slave
+        */
+       memcpy(addr.sa_data, bond_dev->dev_addr, bond_dev->addr_len);
+       addr.sa_family = slave_dev->type;
+       res = dev_set_mac_address(slave_dev, &addr);
+       if (res) {
+           dprintk("Error %d calling set_mac_address\n", res);
+           goto err_free;
+       }
    }

/* open the slave since the application closed it */
@@ -1627,9 +1646,11 @@ err_close:
dev_close(slave_dev);

 err_restore_mac:
-   memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN);
-   addr.sa_family = slave_dev->type;
-   dev_set_mac_address(slave_dev, &addr);
+   if (bond->do_set_mac_addr) {
+       memcpy(addr.sa_data, new_slave->perm_hwaddr, ETH_ALEN);
+       addr.sa_family = slave_dev->type;
+       dev_set_mac_address(slave_dev, &addr);
+   }

 err_free:
kfree(new_slave);
@@ -1807,10 +1828,12 @@ int bond_release(struct net_device *bond
/* close slave before restoring its mac address */
dev_close(slave_dev);

-   /* restore original (permanent) mac address */
-   memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN);
-   addr.sa_family = slave_dev->type;
-   dev_set_mac_address(slave_dev, &addr);
+   if (bond->do_set_mac_addr) {
+       /* restore original (permanent) mac address */
+       memcpy(addr.sa_data, slave->perm_hwaddr, ETH_ALEN);
+       addr.sa_family = slave_dev->type;
+       dev_set_mac_address(slave_dev, &addr);
+   }

    slave_dev->priv_flags &= ~(IFF_MASTER_8023AD | IFF_MASTER_ALB |
                   IFF_SLAVE_INACTIVE | IFF_BONDING |
@@ -1897,10 +1920,12 @@ static int 

[RFC] [PATCH V2 3/3] enable IP multicast for bonding IPoIB devices - allow not UP enslavement

2006-11-30 Thread Or Gerlitz
Signed-off-by: Or Gerlitz [EMAIL PROTECTED]

Index: net-2.6.20/drivers/net/bonding/bond_sysfs.c
===
--- net-2.6.20.orig/drivers/net/bonding/bond_sysfs.c2006-11-30 
10:45:53.0 +0200
+++ net-2.6.20/drivers/net/bonding/bond_sysfs.c 2006-11-30 10:48:13.0 
+0200
@@ -265,11 +265,9 @@ static ssize_t bonding_store_slaves(stru

/* Quick sanity check -- is the bond interface up? */
    if (!(bond->dev->flags & IFF_UP)) {
-       printk(KERN_ERR DRV_NAME
-              ": %s: Unable to update slaves because interface is down.\n",
+       printk(KERN_WARNING DRV_NAME
+              ": %s: doing slave updates when interface is down.\n",
               bond->dev->name);
-       ret = -EPERM;
-       goto out;
    }

    /* Note:  We can't hold bond->lock here, as bond_create grabs it. */
Index: net-2.6.20/drivers/net/bonding/bond_main.c
===
--- net-2.6.20.orig/drivers/net/bonding/bond_main.c 2006-11-30 
10:46:57.0 +0200
+++ net-2.6.20/drivers/net/bonding/bond_main.c  2006-11-30 10:48:13.0 
+0200
@@ -1298,8 +1298,8 @@ int bond_enslave(struct net_device *bond

/* bond must be initialized by bond_open() before enslaving */
    if (!(bond_dev->flags & IFF_UP)) {
-       dprintk("Error, master_dev is not up\n");
-       return -EPERM;
+       printk(KERN_WARNING DRV_NAME
+           " %s: master_dev is not up in bond_enslave\n",
+           bond_dev->name);
}

/* already enslaved */



Re: Please pull from 'upstream-fixes' branch of wireless-2.6

2006-11-30 Thread Jeff Garzik

John W. Linville wrote:

The following changes since commit 0579e303553655245e8a6616bd8b4428b07d63a2:
  Linus Torvalds:
Merge branch 'for-linus' of git://git.kernel.org/.../drzeus/mmc

are found in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git upstream-fixes

John W. Linville:
  Revert "zd1211rw: Removed unneeded packed attributes"

Michael Buesch:
  softmac: remove netif_tx_disable when scanning

Ulrich Kunitz:
  zd1211rw: Fix of a locking bug

Zhu Yi:
  ieee80211: Fix kernel panic when QoS is enabled



pulled



Re: [PATCH] r8169: Fix iteration variable sign

2006-11-30 Thread Jeff Garzik

Francois Romieu wrote:

This changes the type of variable i in rtl8169_init_one()
from unsigned int to int. i is checked for < 0 later,
which can never happen for unsigned. This results in broken
error handling.
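
The effect can be reproduced outside the driver with a small sketch like the
one below. The function names are hypothetical and this is not the r8169 code;
it only shows why a negative return value stored in an unsigned variable slips
past a "< 0" check.

```c
/* Error check with an unsigned iteration variable: the test never fires. */
int failed_with_unsigned(int ret)
{
	unsigned int i = (unsigned int)ret;	/* -5 wraps to a huge positive value */

	return i < 0;	/* always 0; compilers typically warn about this */
}

/* Same check with a signed variable: negative return values are caught. */
int failed_with_signed(int ret)
{
	int i = ret;

	return i < 0;
}
```

Calling both with a negative errno-style value (say -5) shows the unsigned
version reporting success while the signed one reports the failure.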

Signed-off-by: Michael Buesch [EMAIL PROTECTED]

Signed-off-by: Francois Romieu [EMAIL PROTECTED]


ACK but doesn't seem to apply to 2.6.19?

should this go into #upstream rather than #upstream-fixes?




[PKT_SCHED] act_gact: division by zero

2006-11-30 Thread Nordlund Kim (Nokia-NET/Helsinki)

tc qdisc add dev eth1 handle ffff: ingress
tc filter add dev eth1 protocol ip parent ffff: pref 99 basic \
   flowid 1:1 action pass random determ drop 0
                                             ^
The above causes a division by zero in the kernel on the first
packet in.

Signed-off-by: Kim Nordlund [EMAIL PROTECTED]

diff -rub linux-2.6.19-orig/net/sched/act_gact.c linux/net/sched/act_gact.c
--- linux-2.6.19-orig/net/sched/act_gact.c  2006-11-29 23:57:37.0 
+0200
+++ linux/net/sched/act_gact.c  2006-11-30 13:22:37.0 +0200
@@ -111,7 +111,7 @@
    if (tb[TCA_GACT_PROB-1] != NULL) {
        struct tc_gact_p *p_parm = RTA_DATA(tb[TCA_GACT_PROB-1]);
        gact->tcfg_paction = p_parm->paction;
-       gact->tcfg_pval    = p_parm->pval;
+       gact->tcfg_pval    = p_parm->pval ? : 1;
        gact->tcfg_ptype   = p_parm->ptype;
    }
 #endif



Re: [RFC][PATCH 6/6] net: vm deadlock avoidance core

2006-11-30 Thread Peter Zijlstra
Oops, it seems I missed some chunks.
New patch attached.

---
Subject: net: vm deadlock avoidance core

In order to provide robust networked block devices there must be a guarantee
of progress. That is, the block device must never stall because of (physical)
OOM, because the device itself might be needed to get out of it (reclaim).

This means that the device queue must always be unpluggable; this in turn means
that it must always find enough memory to build/send packets over the network
_and_ receive (level 7) ACKs for those packets.

The network stack has a huge capacity for buffering packets while waiting for
user-space to read them. There is a practical limit imposed to avoid DoS
scenarios. These two things make for a deadlock: what if the receive limit is
reached and all packets are buffered in non-critical sockets (those not serving
the network block device waiting for an ACK to free a page)?

Memory pressure will add to that: what if there is simply no memory left to
receive packets in?

This patch provides a service to register sockets as critical; SOCK_VMIO
is a promise the socket will never block on receive. Along with a memory
reserve that will service a limited number of packets this can guarantee a
limited service to these critical sockets.

When we make sure that packets allocated from the reserve will only service
critical sockets we will not lose the memory and can guarantee progress.
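
The rule stated above can be sketched as a single predicate. This is only an
illustration: the struct and function names are hypothetical and much simpler
than the sk_buff and sock changes in the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the 'emergency' skb bit and the SOCK_VMIO flag. */
struct pkt {
	bool emergency;		/* allocated from the memory reserve */
};

struct sock_s {
	bool vmio;		/* socket registered as critical */
};

/* A packet from the reserve may only be queued to a critical socket. */
bool may_deliver(const struct pkt *p, const struct sock_s *sk)
{
	return !p->emergency || sk->vmio;
}
```

Dropping emergency packets destined for ordinary sockets is what returns the
reserve memory promptly, so it stays available for the critical sockets.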

(Note on the name SOCK_VMIO; the basic problem is a circular dependency between
the network and virtual memory subsystems which needs to be broken. This does
make VM network IO - and only VM network IO - special, it does not generalize)

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/skbuff.h |   13 +++-
 include/net/sock.h |   36 +
 net/core/dev.c |   40 +-
 net/core/skbuff.c  |   51 --
 net/core/sock.c|  121 +
 net/ipv4/ip_fragment.c |1 
 net/ipv4/ipmr.c|4 +
 net/ipv4/route.c   |   15 +
 net/ipv4/sysctl_net_ipv4.c |   14 -
 net/ipv4/tcp_ipv4.c|   27 +-
 net/ipv6/reassembly.c  |1 
 net/ipv6/route.c   |   15 +
 net/ipv6/sysctl_net_ipv6.c |6 +-
 net/ipv6/tcp_ipv6.c|   27 +-
 net/netfilter/core.c   |5 +
 security/selinux/avc.c |2 
 16 files changed, 355 insertions(+), 23 deletions(-)

Index: linux-2.6-git/include/linux/skbuff.h
===
--- linux-2.6-git.orig/include/linux/skbuff.h   2006-11-30 10:56:33.0 
+0100
+++ linux-2.6-git/include/linux/skbuff.h2006-11-30 11:37:51.0 
+0100
@@ -283,7 +283,8 @@ struct sk_buff {
nfctinfo:3;
__u8pkt_type:3,
fclone:2,
-   ipvs_property:1;
+   ipvs_property:1,
+   emergency:1;
__be16  protocol;
 
void(*destructor)(struct sk_buff *skb);
@@ -328,10 +329,13 @@ struct sk_buff {
 
 #include <asm/system.h>
 
+#define SKB_ALLOC_FCLONE   0x01
+#define SKB_ALLOC_RX   0x02
+
 extern void kfree_skb(struct sk_buff *skb);
 extern void   __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
-  gfp_t priority, int fclone);
+  gfp_t priority, int flags);
 static inline struct sk_buff *alloc_skb(unsigned int size,
gfp_t priority)
 {
@@ -341,7 +345,7 @@ static inline struct sk_buff *alloc_skb(
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
   gfp_t priority)
 {
-   return __alloc_skb(size, priority, 1);
+   return __alloc_skb(size, priority, SKB_ALLOC_FCLONE);
 }
 
 extern struct sk_buff *alloc_skb_from_cache(kmem_cache_t *cp,
@@ -1102,7 +1106,8 @@ static inline void __skb_queue_purge(str
 static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
  gfp_t gfp_mask)
 {
-   struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
+   struct sk_buff *skb =
+   __alloc_skb(length + NET_SKB_PAD, gfp_mask, SKB_ALLOC_RX);
if (likely(skb))
skb_reserve(skb, NET_SKB_PAD);
return skb;
Index: linux-2.6-git/include/net/sock.h
===
--- linux-2.6-git.orig/include/net/sock.h   2006-11-30 10:56:33.0 
+0100
+++ linux-2.6-git/include/net/sock.h2006-11-30 11:37:51.0 +0100
@@ -391,6 +391,7 @@ enum sock_flags {
SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */

Re: [PATCH][NET_SCHED] sch_cbq: deactivating when grafting, purging etc.

2006-11-30 Thread Patrick McHardy
Jarek Poplawski wrote:
 [NET_SCHED] sch_cbq:
 
 [PATCH 2.6.19-rc6 with Fix endless loops set of patches]
 
 - P. McHardy's Fix endless loops patch supplement
   (cbq_graft, cbq_qlen_notify, cbq_delete, cbq_class_ops)
 
 - deactivating of active classes when q.qlen drops to zero
   (cbq_drop)
 
 - a redundant instruction removed from cbq_deactivate_class
 
 PS: probably htb_deactivate in htb_delete and
 cbq_deactivate_class in cbq_delete are also
 redundant now.

This looks good, thanks.



Re: [PATCH][NET_SCHED] sch_htb: turn intermediate classes into leaves

2006-11-30 Thread Patrick McHardy
Jarek Poplawski wrote:
 [NET_SCHED] sch_htb:
 
 [PATCH 2.6.19-rc6 with Fix endless loops set of patches]
 
 - turn intermediate classes into leaves again when their
   last child is deleted (struct htb_class changed)

Looks good to me too, but it still seems to be missing
class level adjustment after deletion. The classification
function refuses to queue packets to classes with level > 0.


Re: [PATCH][NET_SCHED] sch_htb: turn intermediate classes into leaves

2006-11-30 Thread Patrick McHardy
Jarek Poplawski wrote:
 On Thu, Nov 30, 2006 at 01:26:34PM +0100, Patrick McHardy wrote:
 
Jarek Poplawski wrote:

[NET_SCHED] sch_htb:

[PATCH 2.6.19-rc6 with Fix endless loops set of patches]

- turn intermediate classes into leaves again when their
  last child is deleted (struct htb_class changed)

Looks good to me too, but it still seems to be missing
class level adjustment after deletion. The classification
function refuses to queue packets to classes with level > 0.
 
 
 +static void htb_parent_to_leaf(struct htb_class *cl, struct Qdisc *new_q)
 +{
 + struct htb_class *parent = cl->parent;
 +
 + BUG_TRAP(!cl->level && cl->un.leaf.q && !cl->prio_activity);
 +
 + parent->level = 0;
 
 I've thought this is enough, but probably you mean something
 else? 

I missed that, thanks.



[PATCH 2.6.18] declance: Fix RX ownership handover

2006-11-30 Thread Maciej W. Rozycki
 The change for PMAD support introduced a bug, where the ownership of RX 
descriptors was given back to the LANCE in the wrong way.  Occasional 
lockups would happen as a result.  This is a fix for this problem.
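
The handover issue can be illustrated with a small sketch that computes the
descriptor word in one go. The function name is hypothetical and the value of
LE_R1_OWN below is an assumption for illustration; the point is only that the
ownership bit and the high address bits belong in the same store.

```c
#include <stdint.h>

#define LE_R1_OWN 0x8000u	/* assumed value of the ownership bit */

/*
 * Value to store in rmd1 when giving an RX descriptor back to the LANCE:
 * bits 16..23 of the buffer address together with the ownership bit.
 * Writing LE_R1_OWN alone first would hand over the descriptor while its
 * address bits are still zero.
 */
uint16_t rx_rmd1_handover(uint32_t buf_addr)
{
	return (uint16_t)(((buf_addr >> 16) & 0xff) | LE_R1_OWN);
}
```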

Signed-off-by: Maciej W. Rozycki [EMAIL PROTECTED]
---
 Tested with the onboard LANCE of a DECstation 5000/133.

 Please apply.

  Maciej

patch-mips-2.6.18-20060920-pmax-lance-rx-fix-0
diff -up --recursive --new-file 
linux-mips-2.6.18-20060920.macro/drivers/net/declance.c 
linux-mips-2.6.18-20060920/drivers/net/declance.c
--- linux-mips-2.6.18-20060920.macro/drivers/net/declance.c 2006-11-23 
02:55:34.0 +
+++ linux-mips-2.6.18-20060920/drivers/net/declance.c   2006-11-30 
02:26:34.0 +
@@ -628,7 +628,6 @@ static int lance_rx(struct net_device *d
        /* Return the packet to the pool */
        *rds_ptr(rd, mblength, lp->type) = 0;
        *rds_ptr(rd, length, lp->type) = -RX_BUFF_SIZE | 0xf000;
-       *rds_ptr(rd, rmd1, lp->type) = LE_R1_OWN;
        *rds_ptr(rd, rmd1, lp->type) =
            ((lp->rx_buf_ptr_lnc[entry] >> 16) & 0xff) | LE_R1_OWN;
        lp->rx_new = (entry + 1) & RX_RING_MOD_MASK;


Re: [PATCH REPOST 1/2] NET: Accurate packet scheduling for ATM/ADSL (kernel)

2006-11-30 Thread Patrick McHardy
First, sorry for keeping you waiting so long.

Russell Stuart wrote:
 On Tue, 2006-10-24 at 18:19 +0200, Patrick McHardy wrote:
 
No, my patch works for qdiscs with and without RTABs, this
is where they overlap.
 
 
 Could you explain how this works?  I didn't see how
 qdiscs that used RTAB to measure rates of transmission 
 could use your STAB to do the same thing.  At least not
 without substantial modifications to your patch.

Qdiscs don't use RTABs to measure rates but to calculate
transmission times. Transmission time is always related
to the length, the difference between our patches is that
you modify the RTABs in advance to include the overhead
in the calculation, my patch changes the length used to
look up the transmission time. Which works with or
without RTABs.

No, as we already discussed, SFQ uses the packet size for
calculating remaining quanta, and fairness would increase
if the real transmission size (and time) were used. RED
uses the backlog size to calculate the drop probability
(and supports attaching inner qdiscs nowadays), so keeping
accurate backlog statistics seems to be a win as well
(besides their use for estimators). It is also possible
to specify the maximum latency for TBF instead of a byte
limit (which is passed as max. backlog value to the inner
bfifo qdisc), this would also need accurate backlog statistics.
 
 
 This is all beside the point if you can show how
 your patch gets rid of RTAB - currently I am acting
 under the assumption it doesn't.  If it does you
 get all you describe for free.

Why?

 Otherwise - yes, you are correct.  The ATM patch does
 not introduce accurate packet lengths into the kernel,
 which is what is required to give the benefits you
 describe.  But that was never the ATM patch's goal.
 The ATM patch gives accurate rate calculations for ATM
 links, nothing more.  Accurate packet length calculations
 are apparently the goal of your patch, and I wish you 
 luck with it.

Again, it's not rate calculations but transmission time
calculations, which _are a function of the length_.

Ethernet, VLAN, Tunnels, ... it's especially useful for tunnels
if you also shape on the underlying device since the qdisc
on the tunnel device and the qdisc on the underlying device
should ideally be in sync (otherwise no accurate bandwidth
reservation is possible).
 
 
 Hmmm - not as far as I am aware.  In all those cases
 the IP layer breaks up the data into MTU sized packets
 before they get to the scheduler.  ATM is the only
 technology I know of where setting the MTU to be
 bigger than the end to end link can support is normal.

That's not the point. If I want to do scheduling on the
ipip device and on the underlying device at the same
time I need to reserve the amount of bandwidth given to
the ipip device + the bandwidth used for encapsulation
on the underlying device. The easy way to do this is
to use the same amount of bandwidth on both devices
and make the scheduler on the ipip device aware that
some overhead is going to be added. The hard way is
to calculate the worst case (bandwidth / minimum packet
size * overhead per packet) and add that on the
underlying device.
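
The "hard way" arithmetic above can be written out as a short sketch. The
function name is hypothetical and the figures in the usage note are made-up
examples, not measurements.

```c
#include <stdint.h>

/*
 * Worst-case extra bandwidth (bytes/s) needed on the underlying device:
 * the tunnel rate divided by the minimum packet size gives the highest
 * possible packet rate, and each packet carries 'overhead' extra bytes.
 */
uint64_t worst_case_overhead(uint64_t rate, uint32_t min_pkt, uint32_t overhead)
{
	return rate / min_pkt * overhead;
}
```

For example, a 1,000,000 bytes/s tunnel carrying 40-byte packets with 20 bytes
of encapsulation per packet would need an extra 500,000 bytes/s reserved on the
underlying device, which shows how pessimistic the worst case can get.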

Either you or Jesper pointed to this code in iproute:

for (i=0; i<256; i++) {
unsigned sz = (i<<cell_log);
...
rtab[i] = tc_core_usec2tick(1000000*((double)sz/bps));

which tends to underestimate the transmission time by using
the smallest possible size for each cell.
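
The underestimation can be seen with a small sketch of how a length falls
into a table slot. The helper names are hypothetical, and indexing the table
by len >> cell_log is an assumption about the lookup side.

```c
/* Slot a packet of 'len' bytes falls into (assumed lookup: len >> cell_log). */
unsigned rtab_slot(unsigned len, unsigned cell_log)
{
	return len >> cell_log;
}

/* Size the slot's transmission time was computed for: the smallest in it. */
unsigned rtab_slot_size(unsigned slot, unsigned cell_log)
{
	return slot << cell_log;
}
```

With cell_log = 3, a 15-byte packet lands in slot 1, whose time was computed
for 8 bytes, so its transmission time is underestimated by almost a factor
of two.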
 
 
 Firstly, yes you are correct.  It will under some
 circumstances underestimate the number of cells it
 takes to send a packet.  The reason is because the 
 whole aim of the ATM patch was to make maximum use 
 of the ATM link, while at the same time keeping 
 control of scheduling decisions.  To keep control of
 scheduling decisions, we must _never_ overestimate 
 the speed of the link.  If we do the ISP will take 
 control of the scheduling.

Underestimating the transmission time is equivalent to
overestimating the rate.

 At first sight this seems a minor issue.  It's not, because
 the error can be large.  An example of overestimating the
 link speed would be where one RTAB entry covers both the
 2 and 3 cell cases.  If we say the IP packet is going to
 use 2 cells, and in fact it uses 3, then the error is 50%.
 This is a huge error, and in fact eliminating this error
 is the whole point of the ATM patch.
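
The 2-cell versus 3-cell example can be checked with a short sketch. The
function names are hypothetical, and the calculation deliberately ignores
AAL5 and encapsulation overhead; it only uses the fact that a 53-byte ATM
cell carries 48 bytes of payload.

```c
#define ATM_CELL_PAYLOAD 48u	/* payload bytes per cell */
#define ATM_CELL_SIZE    53u	/* bytes on the wire per cell */

/* Number of cells needed to carry 'len' bytes (rounded up). */
unsigned atm_cells(unsigned len)
{
	return (len + ATM_CELL_PAYLOAD - 1) / ATM_CELL_PAYLOAD;
}

/* Bytes actually sent on the wire for 'len' bytes of payload. */
unsigned atm_wire_bytes(unsigned len)
{
	return atm_cells(len) * ATM_CELL_SIZE;
}
```

A 96-byte payload fits in 2 cells (106 bytes on the wire) while 97 bytes need
3 cells (159 bytes), so treating the second case as 2 cells is off by 50%.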
 
 As an example of its impact, I was trying to make VOIP
 work over a shared link.  If the ISP starts making the
 scheduling decisions then VOIP packets start being
 dropped or delayed, rendering VOIP unusable.  So in
 order to use VOIP on the link I have to understate the
 link capacity by 50%.  As it happens, VOIP generates a
 stream of packets in the 2-3 cell size range, the actual
 size depending on the codec negotiated by the end points.
 
 Jesper in his thesis gives perhaps a more important
 example of what happens if you overestimate the link speed.
 It turns out it interacts with TCP's flow 

Re: [PATCH][NET_SCHED] sch_htb: turn intermediate classes into leaves

2006-11-30 Thread Jarek Poplawski
On Thu, Nov 30, 2006 at 01:26:34PM +0100, Patrick McHardy wrote:
 Jarek Poplawski wrote:
  [NET_SCHED] sch_htb:
  
  [PATCH 2.6.19-rc6 with Fix endless loops set of patches]
  
  - turn intermediate classes into leaves again when their
last child is deleted (struct htb_class changed)
 
 Looks good to me too, but it still seems to be missing
 class level adjustment after deletion. The classification
 function refuses to queue packets to classes with level > 0.

+static void htb_parent_to_leaf(struct htb_class *cl, struct Qdisc *new_q)
+{
+   struct htb_class *parent = cl->parent;
+
+   BUG_TRAP(!cl->level && cl->un.leaf.q && !cl->prio_activity);
+
+   parent->level = 0;

I've thought this is enough, but probably you mean something
else? 

Jarek P.


Re: [PKT_SCHED] act_gact: division by zero

2006-11-30 Thread Patrick McHardy
Nordlund Kim (Nokia-NET/Helsinki) wrote:
 tc qdisc add dev eth1 handle : ingress
 tc filter add dev eth1 protocol ip parent : pref 99 basic \
flowid 1:1 action pass random determ drop 0
  ^
 the above cause a division by zero in the kernel on the first
 packet in.
 
 Signed-off-by: Kim Nordlund [EMAIL PROTECTED]
 
 diff -rub linux-2.6.19-orig/net/sched/act_gact.c linux/net/sched/act_gact.c
 --- linux-2.6.19-orig/net/sched/act_gact.c2006-11-29 23:57:37.0 
 +0200
 +++ linux/net/sched/act_gact.c2006-11-30 13:22:37.0 +0200
 @@ -111,7 +111,7 @@
   if (tb[TCA_GACT_PROB-1] != NULL) {
   struct tc_gact_p *p_parm = RTA_DATA(tb[TCA_GACT_PROB-1]);
   gact->tcfg_paction = p_parm->paction;
 - gact->tcfg_pval= p_parm->pval;
 + gact->tcfg_pval= p_parm->pval ? : 1;


I think it should reject an invalid configuration or handle
the zero case correctly by not dividing.


[PATCH 2.6.18] declance: Support the I/O ASIC LANCE w/o TURBOchannel

2006-11-30 Thread Maciej W. Rozycki
 The onboard LANCE of I/O ASIC systems is not a TURBOchannel device, at 
least from the software point of view.  Therefore it does not rely on any 
kernel TURBOchannel bus services and can be supported even if support for 
TURBOchannel has not been enabled in the configuration.

Signed-off-by: Maciej W. Rozycki [EMAIL PROTECTED]
---
 Tested with the onboard LANCE of a DECstation 5000/133.

 Please apply.

  Maciej

patch-mips-2.6.18-20060920-declance-tc-0
diff -up --recursive --new-file 
linux-mips-2.6.18-20060920.macro/drivers/net/declance.c 
linux-mips-2.6.18-20060920/drivers/net/declance.c
--- linux-mips-2.6.18-20060920.macro/drivers/net/declance.c 2006-11-23 
02:55:34.0 +
+++ linux-mips-2.6.18-20060920/drivers/net/declance.c   2006-11-30 
02:31:19.0 +
@@ -1068,7 +1068,6 @@ static int __init dec_lance_init(const i
lp->type = type;
lp->slot = slot;
switch (type) {
-#ifdef CONFIG_TC
case ASIC_LANCE:
dev->base_addr = CKSEG1ADDR(dec_kn_slot_base + IOASIC_LANCE);
 
@@ -1112,7 +,7 @@ static int __init dec_lance_init(const i
 CPHYSADDR(dev-mem_start)  3);
 
break;
-
+#ifdef CONFIG_TC
case PMAD_LANCE:
claim_tc_card(slot);
 
@@ -1143,7 +1142,6 @@ static int __init dec_lance_init(const i
 
break;
 #endif
-
case PMAX_LANCE:
dev->irq = dec_interrupt[DEC_IRQ_LANCE];
dev->base_addr = CKSEG1ADDR(KN01_SLOT_BASE + KN01_LANCE);
@@ -1298,10 +1296,8 @@ static int __init dec_lance_probe(void)
/* Then handle onboard devices. */
if (dec_interrupt[DEC_IRQ_LANCE] >= 0) {
if (dec_interrupt[DEC_IRQ_LANCE_MERR] >= 0) {
-#ifdef CONFIG_TC
if (dec_lance_init(ASIC_LANCE, -1) >= 0)
count++;
-#endif
} else if (!TURBOCHANNEL) {
if (dec_lance_init(PMAX_LANCE, -1) >= 0)
count++;


re: [RFC] [PATCH V2 0/3] bonding support for operation over IPoIB - example config script

2006-11-30 Thread Or Gerlitz
Below is an example script I was using to configure bonding in my testing.

Or.

--- /dev/null   2006-10-30 17:30:04.465997856 +0200
+++ bond_init_sysfs.bash2006-11-30 12:51:05.109565889 +0200
@@ -0,0 +1,25 @@
+#!/bin/bash
+
+BOND_IP=192.168.10.118
+BOND_NETMASK=255.255.255.0
+
+SLAVE_A=ib0
+SLAVE_B=ib1
+
+modprobe bonding
+
+echo active-backup > /sys/class/net/bond0/bonding/mode
+echo 100 > /sys/class/net/bond0/bonding/miimon
+
+modprobe ib_ipoib
+
+## this is some debug info that can enable seeing below the scenes...
+## to learn more see Documentation/infiniband/ipoib.txt
+
+#echo 1 > /sys/module/ib_ipoib/parameters/mcast_debug_level
+#echo 1 > /sys/module/ib_ipoib/parameters/debug_level
+
+echo +$SLAVE_A > /sys/class/net/bond0/bonding/slaves
+echo +$SLAVE_B > /sys/class/net/bond0/bonding/slaves
+
+ifconfig bond0 $BOND_IP netmask $BOND_NETMASK up


Re: [RFC][PATCH V2 1/3] enable bonding to enslave non ARPHRD_ETHER netdevices

2006-11-30 Thread Or Gerlitz

Or Gerlitz wrote:

Index: net-2.6.20/drivers/net/bonding/bonding.h
===
--- net-2.6.20.orig/drivers/net/bonding/bonding.h   2006-11-30 
10:54:23.0 +0200
+++ net-2.6.20/drivers/net/bonding/bonding.h2006-11-30 10:58:10.0 
+0200
@@ -201,6 +201,7 @@ struct bonding {
struct   list_head vlan_list;
struct   vlan_group *vlgrp;
struct   packet_type arp_mon_pt;
+   s8   do_set_mac_addr;
 };

 /**


oops - this piece belongs to the second patch, which actually uses the 
added field, sorry for that. I will fix it for the next version.


Or.



Re: [PKT_SCHED] act_gact: division by zero

2006-11-30 Thread Nordlund Kim (Nokia-NET/Helsinki)

On Thu, 30 Nov 2006, ext Patrick McHardy wrote:
 I think it should reject an invalid configuration or handle
 the zero case correctly by not dividing.

You are correct. Not returning -EINVAL, because someone might
want to use the value zero in some future gact_prob algorithm?

Signed-off-by: Kim Nordlund [EMAIL PROTECTED]

diff -rub linux-2.6.19-orig/net/sched/act_gact.c linux/net/sched/act_gact.c
--- linux-2.6.19-orig/net/sched/act_gact.c  2006-11-29 23:57:37.0 
+0200
+++ linux/net/sched/act_gact.c  2006-11-30 15:33:12.0 +0200
@@ -48,14 +48,14 @@
 #ifdef CONFIG_GACT_PROB
 static int gact_net_rand(struct tcf_gact *gact)
 {
-   if (net_random() % gact->tcfg_pval)
+   if (!gact->tcfg_pval || net_random() % gact->tcfg_pval)
return gact->tcf_action;
return gact->tcfg_paction;
 }
 
 static int gact_determ(struct tcf_gact *gact)
 {
-   if (gact->tcf_bstats.packets % gact->tcfg_pval)
+   if (!gact->tcfg_pval || gact->tcf_bstats.packets % gact->tcfg_pval)
return gact->tcf_action;
return gact->tcfg_paction;
 }
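
The guard in the patch relies on short-circuit evaluation: the modulo is only reached when the divisor is non-zero. A minimal user-space sketch of the deterministic case follows. `struct gact_sketch` and `gact_determ_sketch` are hypothetical stand-ins for illustration, not the kernel's `struct tcf_gact` or `gact_determ`.

```c
/* Hypothetical stand-in for the relevant fields of struct tcf_gact. */
struct gact_sketch {
    unsigned int pval;      /* divisor from the configuration; may be 0 */
    int action;             /* normal action */
    int paction;            /* action taken on every pval-th packet */
    unsigned long packets;  /* packets seen so far */
};

int gact_determ_sketch(const struct gact_sketch *g)
{
    /* Short-circuit: the modulo is evaluated only when pval is non-zero,
     * so a configured value of 0 simply means "always the normal action". */
    if (!g->pval || g->packets % g->pval)
        return g->action;
    return g->paction;
}
```

With `pval` set to 0 the function returns `action` for every packet instead of faulting; with `pval` set to 4 it returns `paction` whenever the packet count is a multiple of 4.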



Re: [PKT_SCHED] act_gact: division by zero

2006-11-30 Thread Patrick McHardy
Nordlund Kim (Nokia-NET/Helsinki) wrote:
 On Thu, 30 Nov 2006, ext Patrick McHardy wrote:
 
I think it should reject an invalid configuration or handle
the zero case correctly by not dividing.
 
 
 You are correct. Not returning -EINVAL, because someone might
 want to use the value zero in some future gact_prob algorithm?

That looks good, thanks.


Re: [PATCH] r8169: Fix iteration variable sign

2006-11-30 Thread Michael Buesch
On Thursday 30 November 2006 12:20, Jeff Garzik wrote:
 Francois Romieu wrote:
  This changes the type of variable i in rtl8169_init_one()
  from unsigned int to int. i is checked for < 0 later,
  which can never happen for unsigned. This results in broken
  error handling.
  
  Signed-off-by: Michael Buesch [EMAIL PROTECTED]
  Signed-off-by: Francois Romieu [EMAIL PROTECTED]
 
 ACK but doesn't seem to apply to 2.6.19?
 
 should this go into #upstream rather than #upstream-fixes?

Hm, I did this against latest linus' tree.
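
The sign problem described in the quoted patch can be shown in isolation. This is only an illustrative sketch with hypothetical function names, not code from r8169: a negative error code stored in an unsigned variable wraps to a large positive value, so a later "less than zero" test cannot succeed.

```c
/* Error check with a signed variable: a negative return code is seen. */
int error_seen_signed(int rc)
{
    int i = rc;

    return i < 0;
}

/* Same value stored in an unsigned variable: the negative code wraps to
 * a large positive number, so the value held by i is never below zero. */
int error_seen_unsigned(int rc)
{
    unsigned int i = (unsigned int)rc;
    long long widened = i;  /* exact value held by i, always >= 0 */

    return widened < 0;
}
```

For a return code such as -19, `error_seen_signed` reports an error while `error_seen_unsigned` does not, which is the broken error handling the patch description refers to.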

-- 
Greetings Michael.


[PATCH] bonding: change spinlocks and remove timers in favor of workqueues

2006-11-30 Thread Andy Gospodarek

The main purpose of this patch is to clean up the bonding code so that
several important operations are not done in the incorrect (softirq)
context. Whenever a kernel is compiled with CONFIG_DEBUG_SPINLOCK_SLEEP
all sorts of backtraces are spewed to the log since might_sleep will
kindly remind us we are doing something in an atomic context when we
probably should not. 

In order to resolve this, the spin_[un]lock_bh needed to be converted to
spin_[un]lock and to do that the timers needed to be dropped in favor of
workqueues.  Since there isn't the chance that this work will be done in
a softirq context, the bh-locks aren't needed since we should not be
preempted to service the workqueue.  Both of those changes are included
in this patch.

I've done a bit of testing switching between modes and changing some of
the important values through sysfs, so I feel that creating and
canceling the work is working fine.  This code could use some quick
testing with 802.3ad since I didn't have access to a switch with that
capability, so if someone can verify it I would appreciate it.

Signed-off-by: Andy Gospodarek [EMAIL PROTECTED]
---

 bond_3ad.c   |9 +-
 bond_3ad.h   |2 
 bond_alb.c   |   17 +++-
 bond_alb.h   |2 
 bond_main.c  |  215 ++-
 bond_sysfs.c |   78 ++---
 bonding.h|   21 +++--
 7 files changed, 212 insertions(+), 132 deletions(-)

diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
index 3fb354d..e65ca19 100644
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -2097,8 +2097,10 @@ void bond_3ad_unbind_slave(struct slave 
  * times out, and it selects an aggregator for the ports that are yet not
  * related to any aggregator, and selects the active aggregator for a bond.
  */
-void bond_3ad_state_machine_handler(struct bonding *bond)
+void bond_3ad_state_machine_handler(void *work_data)
 {
+   struct net_device *bond_dev = (struct net_device *)work_data;
+   struct bonding *bond = bond_dev->priv;
struct port *port;
struct aggregator *aggregator;
 
@@ -2149,7 +2151,10 @@ void bond_3ad_state_machine_handler(stru
}
 
 re_arm:
-   mod_timer(&(BOND_AD_INFO(bond).ad_timer), jiffies + ad_delta_in_ticks);
+   bond_work_create(bond_dev,
+   bond_3ad_state_machine_handler,
+   &bond->ad_work,
+   ad_delta_in_ticks);
 out:
read_unlock(&bond->lock);
 }
diff --git a/drivers/net/bonding/bond_3ad.h b/drivers/net/bonding/bond_3ad.h
index 6ad5ad6..4fa16a9 100644
--- a/drivers/net/bonding/bond_3ad.h
+++ b/drivers/net/bonding/bond_3ad.h
@@ -276,7 +276,7 @@ struct ad_slave_info {
 void bond_3ad_initialize(struct bonding *bond, u16 tick_resolution, int 
lacp_fast);
 int  bond_3ad_bind_slave(struct slave *slave);
 void bond_3ad_unbind_slave(struct slave *slave);
-void bond_3ad_state_machine_handler(struct bonding *bond);
+void bond_3ad_state_machine_handler(void *work_data);
 void bond_3ad_adapter_speed_changed(struct slave *slave);
 void bond_3ad_adapter_duplex_changed(struct slave *slave);
 void bond_3ad_handle_link_change(struct slave *slave, char link);
diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index 3292316..a163e3d 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -1367,8 +1367,10 @@ out:
return 0;
 }
 
-void bond_alb_monitor(struct bonding *bond)
+void bond_alb_monitor(void *work_data)
 {
+   struct net_device *bond_dev = (struct net_device *)work_data;
+   struct bonding *bond = bond_dev->priv;
struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
struct slave *slave;
int i;
@@ -1433,7 +1435,7 @@ void bond_alb_monitor(struct bonding *bo
 * write lock to protect from other code that also
 * sets the promiscuity.
 */
-   write_lock_bh(&bond->curr_slave_lock);
+   write_lock(&bond->curr_slave_lock);
 
if (bond_info->primary_is_promisc &&
(++bond_info->rlb_promisc_timeout_counter >= 
RLB_PROMISC_TIMEOUT)) {
@@ -1448,7 +1450,7 @@ void bond_alb_monitor(struct bonding *bo
bond_info->primary_is_promisc = 0;
}
 
-   write_unlock_bh(&bond->curr_slave_lock);
+   write_unlock(&bond->curr_slave_lock);
 
if (bond_info-rlb_rebalance) {
bond_info-rlb_rebalance = 0;
@@ -1471,7 +1473,10 @@ void bond_alb_monitor(struct bonding *bo
}
 
 re_arm:
-   mod_timer(&(bond_info->alb_timer), jiffies + alb_delta_in_ticks);
+   bond_work_create(bond_dev,
+   bond_alb_monitor,
+   &bond->alb_work,
+   alb_delta_in_ticks);
 out:
read_unlock(&bond->lock);
 }
@@ -1492,11 +1497,11 @@ int bond_alb_init_slave(struct bonding *
/* caller must hold the bond lock for write since the 

Re: [Devel] Re: Network virtualization/isolation

2006-11-30 Thread Vlad Yasevich
Daniel Lezcano wrote:
 Brian Haley wrote:
 Eric W. Biederman wrote:
 I think for cases across network socket namespaces it should
 be a matter for the rules, to decide if the connection should
 happen and what error code to return if the connection does not
 happen.

 There is a potential in this to have an ambiguous case where two
 applications can be listening for connections on the same socket
 on the same port and both will allow the connection.  If that
 is the case I believe the proper definition is the first socket
 that we find that will accept the connection gets the connection.
 No. If you try to connect, the destination IP address is assigned to a
 network namespace. This network namespace is used to leave the listening
 socket ambiguity.

 Wouldn't you want to catch this at bind() and/or configuration time and
 fail?  Having overlapping namespaces/rules seems undesirable, since as
 Herbert said, can get you unexpected behaviour.
 
 Overlapping is not a problem, you can have several sockets binded on the
 same INADDR_ANY/port without ambiguity because the network namespace
 pointer is added as a new key for sockets lookup, (src addr, src port,
 dst addr, dst port, net ns pointer). The bind should not be forced to a
 specific address because you will not be able to connect via 127.0.0.1.

So, all this leads me to ask, how to handle 127.0.0.1?

For L2 it seems easy.  Each namespace gets a tagged lo device.
How do you propose to do it for L3, because disabling access to loopback is
not a valid option, IMO.

I agree that adding a namespace to the (using generic terms) TCB lookup 
solves the conflict issue.
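
As a rough illustration of why the extra key removes the ambiguity, here is a toy lookup over listening sockets. It is a sketch only: the table, the integer `ns_id` (standing in for the namespace pointer) and the function name are assumptions for illustration, not the proposed kernel code.

```c
#include <stddef.h>
#include <stdint.h>

struct listener {
    uint32_t addr;  /* bound address, 0 for INADDR_ANY */
    uint16_t port;  /* bound port */
    int ns_id;      /* stand-in for the network namespace pointer */
};

/* Return the listener matching (daddr, dport) inside namespace ns_id. */
const struct listener *find_listener(const struct listener *tbl, size_t n,
                                     uint32_t daddr, uint16_t dport,
                                     int ns_id)
{
    for (size_t i = 0; i < n; i++) {
        const struct listener *l = &tbl[i];

        if (l->ns_id != ns_id || l->port != dport)
            continue;                 /* other namespace or other port */
        if (l->addr == 0 || l->addr == daddr)
            return l;                 /* wildcard or exact address match */
    }
    return NULL;
}
```

Two sockets bound to INADDR_ANY on port 80 in namespaces 1 and 2 no longer collide, because a lookup for namespace 2 can only ever return the second one.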

-vlad


[PATCH] skge: restore device multicast membership after link down/up

2006-11-30 Thread Andy Gospodarek

Yukon hardware will lose multicast membership data and promiscuous mode 
information if a link is disconnected and reconnected without taking the
interface down.  A call to yukon_reset in yukon_link_down will clear the
hardware's multicast list, so it needs to be added back on link_up.

It does not appear that Genesis hardware needs a similar patch,
since it does not seem to clear multicast membership when taking the
link down.

Signed-off-by: Andy Gospodarek [EMAIL PROTECTED]
---

 skge.c |4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/skge.c b/drivers/net/skge.c
index 3f1b72e..c02e1f1 100644
--- a/drivers/net/skge.c
+++ b/drivers/net/skge.c
@@ -1922,6 +1922,10 @@ static void yukon_link_up(struct skge_po
gma_write16(hw, port, GM_GP_CTRL, reg);
 
gm_phy_write(hw, port, PHY_MARV_INT_MASK, PHY_M_IS_DEF_MSK);
+
+   /* reset multicast list */
+   yukon_set_multicast(skge->netdev);
+
skge_link_up(skge);
 }
 


Re: [Devel] Re: Network virtualization/isolation

2006-11-30 Thread Daniel Lezcano

Vlad Yasevich wrote:

Daniel Lezcano wrote:

Brian Haley wrote:

Eric W. Biederman wrote:

I think for cases across network socket namespaces it should
be a matter for the rules, to decide if the connection should
happen and what error code to return if the connection does not
happen.

There is a potential in this to have an ambiguous case where two
applications can be listening for connections on the same socket
on the same port and both will allow the connection.  If that
is the case I believe the proper definition is the first socket
that we find that will accept the connection gets the connection.

No. If you try to connect, the destination IP address is assigned to a
network namespace. This network namespace is used to leave the listening
socket ambiguity.

Wouldn't you want to catch this at bind() and/or configuration time and
fail?  Having overlapping namespaces/rules seems undesirable, since as
Herbert said, can get you unexpected behaviour.

Overlapping is not a problem, you can have several sockets binded on the
same INADDR_ANY/port without ambiguity because the network namespace
pointer is added as a new key for sockets lookup, (src addr, src port,
dst addr, dst port, net ns pointer). The bind should not be forced to a
specific address because you will not be able to connect via 127.0.0.1.


So, all this leads to me ask, how to handle 127.0.0.1?

For L2 it seems easy.  Each namespace gets a tagged lo device.
How do you propose to do it for L3, because disabling access to loopback is
not a valid option, IMO.


There are 2 options:

1 - Dmitry Mishin proposed to use the l2 mechanism and reinstantiate a 
new loopback device. I haven't tested that yet; perhaps there are issues 
with non-127.0.0.1 loopback traffic and route creation, I don't know.


2 - add a pointer to the network namespace that originated the 
packet to the skbuff when the traffic is for 127.0.0.1, so when the 
packet arrives at IP, it carries the destination namespace information 
because source == destination. I tested it and it works fine without 
noticeable overhead, and it can be done with very few lines of code.


  -- Daniel



Re: [Devel] Re: Network virtualization/isolation

2006-11-30 Thread Herbert Poetzl
On Thu, Nov 30, 2006 at 05:38:16PM +0100, Daniel Lezcano wrote:
 Vlad Yasevich wrote:
  Daniel Lezcano wrote:
  Brian Haley wrote:
  Eric W. Biederman wrote:
  I think for cases across network socket namespaces it should
  be a matter for the rules, to decide if the connection should
  happen and what error code to return if the connection does not
  happen.
 
  There is a potential in this to have an ambiguous case where two
  applications can be listening for connections on the same socket
  on the same port and both will allow the connection.  If that
  is the case I believe the proper definition is the first socket
  that we find that will accept the connection gets the connection.
  No. If you try to connect, the destination IP address is assigned to a
  network namespace. This network namespace is used to leave the listening
  socket ambiguity.
  Wouldn't you want to catch this at bind() and/or configuration time and
  fail?  Having overlapping namespaces/rules seems undesirable, since as
  Herbert said, can get you unexpected behaviour.
  Overlapping is not a problem, you can have several sockets binded on the
  same INADDR_ANY/port without ambiguity because the network namespace
  pointer is added as a new key for sockets lookup, (src addr, src port,
  dst addr, dst port, net ns pointer). The bind should not be forced to a
  specific address because you will not be able to connect via 127.0.0.1.
  
  So, all this leads to me ask, how to handle 127.0.0.1?
  
  For L2 it seems easy.  Each namespace gets a tagged lo device.
  How do you propose to do it for L3, because disabling access to loopback is
  not a valid option, IMO.
 
 There are 2 options:
 
 1 - Dmitry Mishin proposed to use the l2 mechanism and reinstantiate a 
 new loopback device, I didn't tested that yet, perhaps there are issues 
 with non-127.0.0.1 loopback traffic and routes creation, I don't know.
 
 2 - add the pointer of the network namespace who has originated the 
 packet into the skbuff when the traffic is for 127.0.0.1, so when the 
 packet arrive to IP, it has the namespace destination information 
 because source == destination. I tested it and it works fine without 
 noticeable overhead and this can be done with a very few lines of code.

there is a third option, which is a little 'hacky' but
works quite fine too:

 use different loopback addresses for each 'guest' e.g.
 127.x.y.z and 'map' them to 127.0.0.1 (or the other
 way round) whenever appropriate

advantages:
 - doesn't require any skb tagging
 - doesn't change the routing in any way
 - allows isolated loopback connections

disadvantages:
 - blocks those special addresses (127.x.y.z)
 - requires the mapping at bind/receive
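
The address mapping in this third option could look roughly like the sketch below. The numbering scheme (guest N gets 127.0.0.1 + N, with ids starting at 1), the host-byte-order addresses and the function names are all assumptions made for illustration; the mail does not specify them.

```c
#include <stdint.h>

#define LOOPBACK 0x7f000001u  /* 127.0.0.1 in host byte order */

/* Private loopback address for a guest; ids start at 1 (assumed scheme). */
uint32_t guest_loopback(uint32_t guest_id)
{
    return LOOPBACK + guest_id;  /* guest 1 -> 127.0.0.2, and so on */
}

/* At bind/connect time: 127.0.0.1 becomes the guest's private address. */
uint32_t map_outgoing(uint32_t addr, uint32_t guest_id)
{
    return addr == LOOPBACK ? guest_loopback(guest_id) : addr;
}

/* At receive time: the private address is shown to the guest as 127.0.0.1. */
uint32_t map_incoming(uint32_t addr, uint32_t guest_id)
{
    return addr == guest_loopback(guest_id) ? LOOPBACK : addr;
}
```

Non-loopback addresses pass through unchanged, and another guest's private address is not rewritten, which is what keeps the loopback connections isolated without tagging the skb.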

best,
Herbert
 
-- Daniel
 
 ___
 Containers mailing list
 [EMAIL PROTECTED]
 https://lists.osdl.org/mailman/listinfo/containers


Re: pktgen

2006-11-30 Thread Ben Greear

Robert Olsson wrote:

Hello!

Seems you found a race when rmmod is done before it's fully started

Try:

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 733d86d..ac0b4b1 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -160,7 +160,7 @@
 #include <asm/div64.h>   /* do_div */
 #include <asm/timex.h>
 
-#define VERSION  "pktgen v2.68: Packet Generator for packet performance testing.\n"
+#define VERSION  "pktgen v2.69: Packet Generator for packet performance testing.\n"
 
 /* #define PG_DEBUG(a) a */
 #define PG_DEBUG(a)
@@ -3673,6 +3673,8 @@ static void __exit pg_cleanup(void)
struct list_head *q, *n;
wait_queue_head_t queue;
init_waitqueue_head(&queue);
+   
+   schedule_timeout_interruptible(msecs_to_jiffies(125));
 
 	/* Stop all interfaces & threads */
 
  

That strikes me as a hack..surely there is a better method than just adding
a sleep??

Ben

--
Ben Greear [EMAIL PROTECTED] 
Candela Technologies Inc  http://www.candelatech.com





Re: [patch 3/6] 2.6.18: sb1250-mac: Phylib IRQ handling fixes

2006-11-30 Thread Maciej W. Rozycki
On Mon, 23 Oct 2006, Maciej W. Rozycki wrote:

  I'm not too enthusiastic about requiring the ethernet drivers to call
  phy_disconnect in a separate thread after close is called.  Assuming 
  there's
  not some sort of squash work queue function that can be invoked with
  rtnl_lock held, I think phy_disconnect should schedule itself to flush the
  queue.  This would also require that mdiobus_unregister hold off on freeing
  phydevs if any of the phys were still waiting for pending flush_pending 
  calls
  to finish.  Which would, in turn, require mdiobus_unregister to schedule
  cleaning up memory for some later time.
 
  This could work, indeed.
 
  I'm not enthusiastic about that implementation, either, but it maintains the
  abstractions I consider important for this code.  The ethernet driver should
  not need to know what structures the PHY lib uses to implement its interrupt
  handling, and how to work around their failings, IMHO.
 
  Agreed.

 So what's the plan?

 Here's a new version of the patch that addresses your other concerns.

  Maciej

patch-mips-2.6.18-20060920-phy-irq-18
diff -up --recursive --new-file 
linux-mips-2.6.18-20060920.macro/drivers/net/phy/phy.c 
linux-mips-2.6.18-20060920/drivers/net/phy/phy.c
--- linux-mips-2.6.18-20060920.macro/drivers/net/phy/phy.c  2006-08-05 
04:58:46.0 +
+++ linux-mips-2.6.18-20060920/drivers/net/phy/phy.c2006-11-30 
17:58:37.0 +
@@ -7,6 +7,7 @@
  * Author: Andy Fleming
  *
  * Copyright (c) 2004 Freescale Semiconductor, Inc.
+ * Copyright (c) 2006  Maciej W. Rozycki
  *
  * This program is free software; you can redistribute  it and/or modify it
  * under  the terms of  the GNU General  Public License as published by the
@@ -32,6 +33,8 @@
 #include <linux/mii.h>
 #include <linux/ethtool.h>
 #include <linux/phy.h>
+#include <linux/timer.h>
+#include <linux/workqueue.h>
 
 #include <asm/io.h>
 #include <asm/irq.h>
@@ -484,6 +487,9 @@ static irqreturn_t phy_interrupt(int irq
 {
struct phy_device *phydev = phy_dat;
 
+   if (PHY_HALTED == phydev->state)
+   return IRQ_NONE; /* It can't be ours.  */
+
/* The MDIO bus is not allowed to be written in interrupt
 * context, so we need to disable the irq here.  A work
 * queue will write the PHY to disable and clear the
@@ -577,6 +583,13 @@ int phy_stop_interrupts(struct phy_devic
if (err)
phy_error(phydev);
 
+   /*
+* Finish any pending work; we might have been scheduled
+* to be called from keventd ourselves, though.
+*/
+   if (!current_is_keventd())
+   flush_scheduled_work();
+
free_irq(phydev->irq, phydev);
 
return err;
@@ -596,14 +609,17 @@ static void phy_change(void *data)
goto phy_err;
 
spin_lock(&phydev->lock);
+
if ((PHY_RUNNING == phydev->state) || (PHY_NOLINK == phydev->state))
phydev->state = PHY_CHANGELINK;
-   spin_unlock(&phydev->lock);
 
enable_irq(phydev->irq);
 
/* Reenable interrupts */
-   err = phy_config_interrupt(phydev, PHY_INTERRUPT_ENABLED);
+   if (PHY_HALTED != phydev->state)
+   err = phy_config_interrupt(phydev, PHY_INTERRUPT_ENABLED);
+
+   spin_unlock(&phydev->lock);
 
if (err)
goto irq_enable_err;
@@ -624,15 +640,15 @@ void phy_stop(struct phy_device *phydev)
if (PHY_HALTED == phydev->state)
goto out_unlock;
 
-   if (phydev->irq != PHY_POLL) {
-   /* Clear any pending interrupts */
-   phy_clear_interrupt(phydev);
+   phydev->state = PHY_HALTED;
 
+   if (phydev->irq != PHY_POLL) {
/* Disable PHY Interrupts */
phy_config_interrupt(phydev, PHY_INTERRUPT_DISABLED);
-   }
 
-   phydev->state = PHY_HALTED;
+   /* Clear any pending interrupts */
+   phy_clear_interrupt(phydev);
+   }
 
 out_unlock:
spin_unlock(&phydev->lock);


[PATCH 0/5] NetXen: 1G/10G Ethernet Driver updates

2006-11-30 Thread Don Fry
The first patch sent by Amit on 29 Nov applied, but the following three
patches did not apply to Jeff's #upstream tree.  Here are the corrected
2nd, 3rd, and 4th patches, with a repeat of the 1st for completeness.
There is a 5th patch which fixes a bug caused by casting a 16-bit
variable into a 32-bit one and using the address.

-- 
Don Fry
[EMAIL PROTECTED]


[PATCH 1/5] NetXen: Fixed /sys mapping between device and driver

2006-11-30 Thread Don Fry

Signed-off-by: Amit S. Kale [EMAIL PROTECTED]

diff -Nupr netdev-2.6/drivers/net/netxen.orig/netxen_nic_main.c 
netdev-2.6/drivers/net/netxen/netxen_nic_main.c
--- netdev-2.6/drivers/net/netxen.orig/netxen_nic_main.c2006-11-29 
12:13:58.0 -0800
+++ netdev-2.6/drivers/net/netxen/netxen_nic_main.c 2006-11-30 
09:17:51.0 -0800
@@ -273,6 +273,7 @@ netxen_nic_probe(struct pci_dev *pdev, c
}
 
SET_MODULE_OWNER(netdev);
+   SET_NETDEV_DEV(netdev, &pdev->dev);
 
port = netdev_priv(netdev);
port->netdev = netdev;
@@ -1043,7 +1044,7 @@ static int netxen_nic_poll(struct net_de
netxen_nic_enable_int(adapter);
}
 
-   return (done ? 0 : 1);
+   return !done;
 }
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
-- 
Don Fry
[EMAIL PROTECTED]


[PATCH 3/5] NetXen: 64-bit memory fixes and driver cleanup

2006-11-30 Thread Don Fry
NetXen: 1G/10G Ethernet Driver updates
- These fixes take care of the driver on machines with >4G memory
- Driver cleanup

Signed-off-by: Amit S. Kale [EMAIL PROTECTED]
Signed-off-by: Don Fry [EMAIL PROTECTED]

diff -Nupr netdev-2.6/drivers/net/netxen.two/netxen_nic_ethtool.c 
netdev-2.6/drivers/net/netxen/netxen_nic_ethtool.c
--- netdev-2.6/drivers/net/netxen.two/netxen_nic_ethtool.c  2006-11-30 
09:40:47.0 -0800
+++ netdev-2.6/drivers/net/netxen/netxen_nic_ethtool.c  2006-11-30 
09:46:16.0 -0800
@@ -6,12 +6,12 @@
  * modify it under the terms of the GNU General Public License
  * as published by the Free Software Foundation; either version 2
  * of the License, or (at your option) any later version.
- *
+ *
  * This program is distributed in the hope that it will be useful, but
  * WITHOUT ANY WARRANTY; without even the implied warranty of
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  * GNU General Public License for more details.
- *   
+ *
  * You should have received a copy of the GNU General Public License
  * along with this program; if not, write to the Free Software
  * Foundation, Inc., 59 Temple Place - Suite 330, Boston,
@@ -118,7 +118,7 @@ netxen_nic_get_drvinfo(struct net_device
u32 fw_minor = 0;
u32 fw_build = 0;
 
-   strncpy(drvinfo->driver, "netxen_nic", 32);
+   strncpy(drvinfo->driver, netxen_nic_driver_name, 32);
strncpy(drvinfo->version, NETXEN_NIC_LINUX_VERSIONID, 32);
fw_major = readl(NETXEN_CRB_NORMALIZE(adapter,
  NETXEN_FW_VERSION_MAJOR));
@@ -211,7 +211,6 @@ netxen_nic_get_settings(struct net_devic
printk("ERROR: Unsupported board model %d\n",
   (netxen_brdtype_t) boardinfo->board_type);
return -EIO;
-
}
 
return 0;
@@ -461,20 +460,22 @@ netxen_nic_get_ringparam(struct net_devi
 {
struct netxen_port *port = netdev_priv(dev);
struct netxen_adapter *adapter = port->adapter;
-   int i, j;
+   int i;
 
ring->rx_pending = 0;
+   ring->rx_jumbo_pending = 0;
for (i = 0; i < MAX_RCV_CTX; ++i) {
-   for (j = 0; j < NUM_RCV_DESC_RINGS; j++)
-   ring->rx_pending +=
-   adapter->recv_ctx[i].rcv_desc[j].rcv_pending;
+   ring->rx_pending += adapter->recv_ctx[i].
+   rcv_desc[RCV_DESC_NORMAL_CTXID].rcv_pending;
+   ring->rx_jumbo_pending += adapter->recv_ctx[i].
+   rcv_desc[RCV_DESC_JUMBO_CTXID].rcv_pending;
}
 
ring->rx_max_pending = adapter->max_rx_desc_count;
ring->tx_max_pending = adapter->max_tx_desc_count;
+   ring->rx_jumbo_max_pending = adapter->max_jumbo_rx_desc_count;
ring->rx_mini_max_pending = 0;
ring->rx_mini_pending = 0;
-   ring->rx_jumbo_max_pending = 0;
ring->rx_jumbo_pending = 0;
 }
 
diff -Nupr netdev-2.6/drivers/net/netxen.two/netxen_nic.h 
netdev-2.6/drivers/net/netxen/netxen_nic.h
--- netdev-2.6/drivers/net/netxen.two/netxen_nic.h  2006-11-30 
09:40:47.0 -0800
+++ netdev-2.6/drivers/net/netxen/netxen_nic.h  2006-11-30 09:46:16.0 
-0800
@@ -6,12 +6,12 @@
  * modify it under the terms of the GNU General Public License
  * as published by the Free Software Foundation; either version 2
  * of the License, or (at your option) any later version.
- *
+ *
  * This program is distributed in the hope that it will be useful, but
  * WITHOUT ANY WARRANTY; without even the implied warranty of
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  * GNU General Public License for more details.
- *   
+ *
  * You should have received a copy of the GNU General Public License
  * along with this program; if not, write to the Free Software
  * Foundation, Inc., 59 Temple Place - Suite 330, Boston,
@@ -89,8 +89,8 @@
  * normalize a 64MB crb address to 32MB PCI window 
  * To use NETXEN_CRB_NORMALIZE, window _must_ be set to 1
  */
-#define NETXEN_CRB_NORMAL(reg)\
-   (reg) - NETXEN_CRB_PCIX_HOST2 + NETXEN_CRB_PCIX_HOST
+#define NETXEN_CRB_NORMAL(reg) \
+   ((reg) - NETXEN_CRB_PCIX_HOST2 + NETXEN_CRB_PCIX_HOST)
 
 #define NETXEN_CRB_NORMALIZE(adapter, reg) \
pci_base_offset(adapter, NETXEN_CRB_NORMAL(reg))
@@ -164,7 +164,7 @@ enum {
 
 #define MAX_CMD_DESCRIPTORS1024
 #define MAX_RCV_DESCRIPTORS32768
-#define MAX_JUMBO_RCV_DESCRIPTORS  1024
+#define MAX_JUMBO_RCV_DESCRIPTORS  4096
 #define MAX_RCVSTATUS_DESCRIPTORS  MAX_RCV_DESCRIPTORS
 #define MAX_JUMBO_RCV_DESC MAX_JUMBO_RCV_DESCRIPTORS
 #define MAX_RCV_DESC   MAX_RCV_DESCRIPTORS
@@ -592,6 +592,16 @@ struct netxen_skb_frag {
u32 length;
 };
 
+/* Bounce buffer index */
+struct bounce_index {
+   /* Index of a buffer */
+
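
The NETXEN_CRB_NORMAL hunk above wraps the whole macro body in parentheses. A minimal sketch of why that matters, using made-up offsets (BASE_A and BASE_B are hypothetical and are not the driver's constants): without the outer parentheses, an operator next to the macro call binds to only part of the expansion.

```c
/* Hypothetical base offsets, chosen only to make the arithmetic visible. */
#define BASE_A 0x100
#define BASE_B 0x040

/* Unparenthesized body, shaped like the "-" lines of the hunk. */
#define NORMAL_UNSAFE(reg) (reg) - BASE_A + BASE_B
/* Fully parenthesized body, shaped like the "+" lines of the hunk. */
#define NORMAL_SAFE(reg)   ((reg) - BASE_A + BASE_B)

/* Expands to 2 * (reg) - BASE_A + BASE_B: only reg is doubled. */
static int scaled_unsafe(int reg) { return 2 * NORMAL_UNSAFE(reg); }

/* Expands to 2 * ((reg) - BASE_A + BASE_B): the whole offset is doubled. */
static int scaled_safe(int reg)   { return 2 * NORMAL_SAFE(reg); }
```

When the macro is only ever passed as a complete function argument, both forms expand to the same value; the parentheses guard against later uses inside a larger expression.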

[PATCH 5/5] NetXen: Fix cast error

2006-11-30 Thread Don Fry
Fix for pointer casting error.
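
For readers unfamiliar with this class of bug, here is a minimal userspace sketch. It assumes the read helper stores a full 32-bit value through its pointer argument; phy_read_sketch is a hypothetical stand-in, and uint32_t is used in place of the kernel's __le32. Passing the address of a 16-bit variable through a cast would let such a callee write four bytes into a two-byte object, so the destination is declared at the callee's width instead.

```c
#include <stdint.h>

/* Hypothetical stand-in for a PHY read helper that stores a 32-bit result. */
static int phy_read_sketch(uint32_t *out)
{
    *out = 0x00001234u;   /* writes sizeof(uint32_t) bytes through the pointer */
    return 0;
}

/* Correct pattern: the destination object is as wide as the callee's store. */
static uint32_t read_autoneg(void)
{
    uint32_t autoneg = 0;   /* 32-bit destination, so no cast is needed */

    if (phy_read_sketch(&autoneg) != 0)
        return 0;
    return autoneg;
}
```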

Signed-off-by:  Don Fry [EMAIL PROTECTED]

diff -Nupr netdev-2.6/drivers/net/netxen.four/netxen_nic_hw.c 
netdev-2.6/drivers/net/netxen/netxen_nic_hw.c
--- netdev-2.6/drivers/net/netxen.four/netxen_nic_hw.c  2006-11-30 
10:06:24.0 -0800
+++ netdev-2.6/drivers/net/netxen/netxen_nic_hw.c   2006-11-30 
10:31:00.0 -0800
@@ -867,7 +867,7 @@ void netxen_nic_set_link_parameters(stru
 {
        struct netxen_adapter *adapter = port->adapter;
        __le32 status;
-       u16 autoneg;
+       __le32 autoneg = 0;
        __le32 mode;
 
        netxen_nic_read_w0(adapter, NETXEN_NIU_MODE, &mode);
@@ -907,7 +907,7 @@ void netxen_nic_set_link_parameters(stru
                    && adapter->
                    phy_read(adapter, port->portnum,
                             NETXEN_NIU_GB_MII_MGMT_ADDR_AUTONEG,
-                            (__le32 *) & autoneg) != 0)
+                            &autoneg) != 0)
                        port->link_autoneg = autoneg;
                } else
                        goto link_down;
-- 
Don Fry
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 5/6] slab: kmem_cache_objs_to_pages()

2006-11-30 Thread Christoph Lameter
On Thu, 30 Nov 2006, Peter Zijlstra wrote:

 +unsigned int kmem_cache_objs_to_pages(struct kmem_cache *cachep, int nr)
 +{
 + return ((nr + cachep->num - 1) / cachep->num) << cachep->gfporder;

cachep->num refers to the number of objects in a slab of gfporder.

thus

return (nr + cachep->num - 1) / cachep->num;

But then this is a very optimistic estimate that assumes a single node and 
no free objects in between.
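
Both expressions in this exchange share the same round-up division; they differ only in whether the result is then scaled by the slab order. A minimal sketch of the two steps, kept separate so each can be read on its own (the function and parameter names are illustrative and are not taken from the patch):

```c
/* Round-up division: slabs needed to hold nr objects at objs_per_slab each. */
static unsigned int objs_to_slabs(unsigned int nr, unsigned int objs_per_slab)
{
    return (nr + objs_per_slab - 1) / objs_per_slab;
}

/* A slab of order `order` spans 2^order pages. */
static unsigned int slabs_to_pages(unsigned int slabs, unsigned int order)
{
    return slabs << order;
}
```

For example, 10 objects at 4 objects per slab round up to 3 slabs; at order 1 those 3 slabs cover 6 pages.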


[PATCH 0/5 addendum] NetXen

2006-11-30 Thread Don Fry
The NetXen patches fix many problems in the current #upstream version of
the driver.  It has warts and probably lots of bugs still, but it is
better than what is queued for mainline inclusion at this time.  Please
apply to 2.6.20.

-- 
Don Fry
[EMAIL PROTECTED]


Re: [RFC][PATCH 5/6] slab: kmem_cache_objs_to_pages()

2006-11-30 Thread Peter Zijlstra
On Thu, 2006-11-30 at 10:55 -0800, Christoph Lameter wrote:
 On Thu, 30 Nov 2006, Peter Zijlstra wrote:
 
  +unsigned int kmem_cache_objs_to_pages(struct kmem_cache *cachep, int nr)
  +{
  +   return ((nr + cachep->num - 1) / cachep->num) << cachep->gfporder;
 
 cachep->num refers to the number of objects in a slab of gfporder.

Ah, my bad, thanks!

 thus
 
 return (nr + cachep->num - 1) / cachep->num;
 
 But then this is a very optimistic estimate that assumes a single node and 
 no free objects in between.

Right, perhaps my bad in wording the intent; the needed information is
how many more pages would I need to grow the slab with in order to store
so many new objects.



Re: [RFC][PATCH 5/6] slab: kmem_cache_objs_to_pages()

2006-11-30 Thread Christoph Lameter
On Thu, 30 Nov 2006, Peter Zijlstra wrote:

 Right, perhaps my bad in wording the intent; the needed information is
 how many more pages would I need to grow the slab with in order to store
 so many new objects.

Would you not have to take objects currently available in 
caches into account? If you are short on memory then a flushing of all the 
caches may give you the memory you need (especially on a system with a 
large number of processors).



Re: [RFC][PATCH 1/6] mm: slab allocation fairness

2006-11-30 Thread Peter Zijlstra
On Thu, 2006-11-30 at 10:52 -0800, Christoph Lameter wrote:
 On Thu, 30 Nov 2006, Peter Zijlstra wrote:
 
  The slab has some unfairness wrt gfp flags; when the slab is grown the gfp 
  flags are used to allocate more memory, however when there is slab space 
  available, gfp flags are ignored. Thus it is possible for less critical 
  slab allocations to succeed and gobble up precious memory.
 
 The gfpflags are ignored if there are
 
 1) objects in the per cpu, shared or alien caches
 
 2) objects are in partial or free slabs in the per node queues.

Yeah, basically as long as free objects can be found. No matter how
'hard' it was to obtain these objects.

  This patch avoids this by keeping track of the allocation hardness when 
  growing. This is then compared to the current slab alloc's gfp flags.
 
 The approach is to force the allocation of additional slab to increase the 
 number of free slabs? The next free will drop the number of free slabs 
 back again to the allowed amount.

No, the forced allocation is to test the allocation hardness at that
point in time. I could not think of another way to test that than to
actually do an allocation.



Re: [RFC][PATCH 5/6] slab: kmem_cache_objs_to_pages()

2006-11-30 Thread Peter Zijlstra
On Thu, 2006-11-30 at 11:06 -0800, Christoph Lameter wrote:
 On Thu, 30 Nov 2006, Peter Zijlstra wrote:
 
  Right, perhaps my bad in wording the intent; the needed information is
  how many more pages would I need to grow the slab with in order to store
  so many new objects.
 
 Would you not have to take objects currently available in 
 caches into account? If you are short on memory then a flushing of all the 
 caches may give you the memory you need (especially on a system with a 
 large number of processors).

Sure, but this gives a safe upper bound.



Re: [RFC][PATCH 1/6] mm: slab allocation fairness

2006-11-30 Thread Peter Zijlstra
On Thu, 2006-11-30 at 10:52 -0800, Christoph Lameter wrote:

 I would think that one would need a rank with each cached object and 
 free slab in order to do this the right way.

Allocation hardness is a temporal attribute, ie. it changes over time.
Hence I do it per slab.



Re: [RFC][PATCH 1/6] mm: slab allocation fairness

2006-11-30 Thread Christoph Lameter
On Thu, 30 Nov 2006, Peter Zijlstra wrote:

 No, the forced allocation is to test the allocation hardness at that
 point in time. I could not think of another way to test that than to
 actually do an allocation.

Typically we do this by checking the number of free pages in a zone 
compared to the high and low limits. See mmzone.h.
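
A minimal userspace sketch of that kind of check, assuming a zone exposes a free-page count plus a low and a high limit. The struct and field names here are hypothetical and only model the idea; they are not the kernel's struct zone.

```c
/* Hypothetical model of one zone's counters; not the kernel's struct zone. */
struct zone_sketch {
    unsigned long free_pages;
    unsigned long limit_low;
    unsigned long limit_high;
};

enum pressure { PRESSURE_NONE, PRESSURE_SOME, PRESSURE_HIGH };

/* Classify memory pressure from the free-page count alone, without allocating. */
static enum pressure zone_pressure(const struct zone_sketch *z)
{
    if (z->free_pages > z->limit_high)
        return PRESSURE_NONE;   /* above the high limit: memory is plentiful */
    if (z->free_pages > z->limit_low)
        return PRESSURE_SOME;   /* between the two limits */
    return PRESSURE_HIGH;       /* at or below the low limit */
}
```

The point of the comparison is that it reads counters only, so no test allocation is needed to learn how hard an allocation would be.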
 


Re: [RFC][PATCH 1/6] mm: slab allocation fairness

2006-11-30 Thread Christoph Lameter
On Thu, 30 Nov 2006, Peter Zijlstra wrote:

 On Thu, 2006-11-30 at 10:52 -0800, Christoph Lameter wrote:
 
  I would think that one would need a rank with each cached object and 
  free slab in order to do this the right way.
 
 Allocation hardness is a temporal attribute, ie. it changes over time.
 Hence I do it per slab.

cached objects are also temporal and change over time.



Re: [PATCH] bonding: change spinlocks and remove timers in favor of workqueues

2006-11-30 Thread Jay Vosburgh

Andy Gospodarek [EMAIL PROTECTED] wrote:
The main purpose of this patch is to clean-up the bonding code so that
several important operations are not done in the incorrect (softirq)
context. Whenever a kernel is compiled with CONFIG_DEBUG_SPINLOCK_SLEEP
all sorts of backtraces are spewed to the log since might_sleep will
kindly remind us we are doing something in an atomic context when we
probably should not.
[...]

I'll look at the patch in detail in a bit (and I have 802.3ad
switches to test on), but on first glance, does this not still hold a
lock during failover operations in balance-alb mode?  I.e., this doesn't
change the locking model, it just moves the timers to workqueues and
relaxes the _bh locking.

The really problematic case calls dev_set_mac_address() with a
lock held, and I don't see that this patch changes that behavior.  Do
you still get the lock warnings during link fail / recovery in
balance-alb mode?

Also, on a CONFIG_PREEMPT kernel, it'll still get the sleep
warnings, since in_atomic() will trip __might_sleep() for any lock (if
I'm reading things correctly).

Don't get me wrong, this (switching to workqueues, etc) needs to
be done, but I don't think this patch really resolves the underlying
problem that causes the warnings.

Let me see if I can dust off the extensive patch that does
change the locking model; I'll see if I can bring it up to the current
git and post it.

-J

---
-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]


Re: [RFC][PATCH 1/6] mm: slab allocation fairness

2006-11-30 Thread Peter Zijlstra
On Thu, 2006-11-30 at 11:33 -0800, Christoph Lameter wrote:
 On Thu, 30 Nov 2006, Peter Zijlstra wrote:
 
  No, the forced allocation is to test the allocation hardness at that
  point in time. I could not think of another way to test that than to
  actually do an allocation.
 
 Typically we do this by checking the number of free pages in a zone 
 compared to the high and low limits. See mmzone.h.

True, I did think about that and started out that way but saw myself
duplicating a lot of the page allocation code. I'll give it another
try... see if I can factor out the common parts without too much
duplication.



Re: [RFC][PATCH 1/6] mm: slab allocation fairness

2006-11-30 Thread Peter Zijlstra
On Thu, 2006-11-30 at 11:37 -0800, Christoph Lameter wrote:
 On Thu, 30 Nov 2006, Peter Zijlstra wrote:
 
  On Thu, 2006-11-30 at 10:52 -0800, Christoph Lameter wrote:
  
   I would think that one would need a rank with each cached object and 
   free slab in order to do this the right way.
  
  Allocation hardness is a temporal attribute, ie. it changes over time.
  Hence I do it per slab.
 
 cached objects are also temporal and change over time.

Sure, but there is nothing wrong with using a slab page with a lower
allocation rank when there is memory aplenty. 

I'm just not seeing how keeping all individual page ranks would make
this better.

The only thing that matters is the actual free pages limit, not that of
a few allocations ago. The stored rank is a safe shortcut, for it allows
harder allocations to use easily obtainable free space, not the other way
around.



Re: [RFC][PATCH 1/6] mm: slab allocation fairness

2006-11-30 Thread Christoph Lameter
On Thu, 30 Nov 2006, Peter Zijlstra wrote:

 Sure, but there is nothing wrong with using a slab page with a lower
 allocation rank when there is memory aplenty. 

What does a slab page with a lower allocation rank mean? Slab pages have 
no allocation ranks that I am aware of.


Re: [PATCH] bonding: change spinlocks and remove timers in favor of workqueues

2006-11-30 Thread Andy Gospodarek

On 11/30/06, Jay Vosburgh [EMAIL PROTECTED] wrote:
>
> Andy Gospodarek [EMAIL PROTECTED] wrote:
> > The main purpose of this patch is to clean-up the bonding code so that
> > several important operations are not done in the incorrect (softirq)
> > context. Whenever a kernel is compiled with CONFIG_DEBUG_SPINLOCK_SLEEP
> > all sorts of backtraces are spewed to the log since might_sleep will
> > kindly remind us we are doing something in an atomic context when we
> > probably should not.
> > [...]
>
> I'll look at the patch in detail in a bit (and I have 802.3ad
> switches to test on), but on first glance, does this not still hold a
> lock during failover operations in balance-alb mode?  I.e., this doesn't
> change the locking model, it just moves the timers to workqueues and
> relaxes the _bh locking.

Jay,

Thanks for the response.  You are correct.  This patch really doesn't
change functionality -- in fact that was one of the goals of this patch.
I wanted to simply start the conversion since it seemed like 'the
right way' to do things going forward.

> The really problematic case calls dev_set_mac_address() with a
> lock held, and I don't see that this patch changes that behavior.  Do
> you still get the lock warnings during link fail / recovery in
> balance-alb mode?

I no longer get lock warnings indicating that I'm taking a lock in an
invalid context, but lately I've been seeing rtnl lock assertion
failures when in balance-alb mode and whenever a call to
dev_set_mac_address is made.  It seems to be expected that the rtnl
lock is taken and that isn't the case anymore.

> Also, on a CONFIG_PREEMPT kernel, it'll still get the sleep
> warnings, since in_atomic() will trip __might_sleep() for any lock (if
> I'm reading things correctly).

Based on my reading you will still only get these warnings if
CONFIG_DEBUG_SPINLOCK_SLEEP=y and CONFIG_PREEMPT=y.  Since most never
try with CONFIG_DEBUG_SPINLOCK_SLEEP=y they don't see these.

> Don't get me wrong, this (switching to workqueues, etc) needs to
> be done, but I don't think this patch really resolves the underlying
> problem that causes the warnings.

Agreed.  I didn't want to tackle too many of the issues with one giant
patch.  Doing them in smallish steps seemed like a better way to go.

> Let me see if I can dust off the extensive patch that does
> change the locking model; I'll see if I can bring it up to the current
> git and post it.

It would seem ideal if we could combine the two into one big patch.

-andy

> -J
>
> ---
> -Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]




Re: [RFC][PATCH 1/6] mm: slab allocation fairness

2006-11-30 Thread Peter Zijlstra
On Thu, 2006-11-30 at 12:11 -0800, Christoph Lameter wrote:
 On Thu, 30 Nov 2006, Peter Zijlstra wrote:
 
  Sure, but there is nothing wrong with using a slab page with a lower
  allocation rank when there is memory aplenty. 
 
 What does a slab page with a lower allocation rank mean? Slab pages have 
 no allocation ranks that I am aware of.

I just added allocation rank and didn't you suggest tracking it for all
slab pages instead of per slab?

The rank is an expression of how hard it was to get that page, with 0
being the hardest allocation (ALLOC_NO_WATERMARK) and 16 the easiest
(ALLOC_WMARK_HIGH).

I store the rank of the last allocated page and retest the rank when a
gfp flag indicates a higher rank, that is when the current slab
allocation would have failed to grow the slab under the conditions of
the previous allocation.
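
The rule described above can be sketched in a few lines of userspace C. This is only an illustration of the comparison: the struct, field, and function names are hypothetical and do not come from the patch, and the rank scale (0 hardest, 16 easiest) is taken from the text.

```c
#include <stdbool.h>

/* Rank scale from the post: 0 is the hardest allocation, 16 the easiest. */
enum { RANK_HARDEST = 0, RANK_EASIEST = 16 };

/* Hypothetical per-cache state; the real patch's field names are not shown here. */
struct cache_sketch {
    int last_rank;   /* rank of the page most recently used to grow the slab */
};

/* True when the request ranks higher (easier) than the stored rank,
 * so it must be retested with a real page allocation. */
static bool needs_rank_retest(const struct cache_sketch *c, int request_rank)
{
    return request_rank > c->last_rank;
}

/* Record the rank observed when the slab was last grown. */
static void record_growth(struct cache_sketch *c, int observed_rank)
{
    c->last_rank = observed_rank;
}
```

Under this sketch a harder request (lower rank) may always use free objects obtained under easier conditions, while an easier request is retested before it can consume memory obtained the hard way.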



Re: [RFC][PATCH 1/6] mm: slab allocation fairness

2006-11-30 Thread Christoph Lameter
On Thu, 30 Nov 2006, Peter Zijlstra wrote:

   Sure, but there is nothing wrong with using a slab page with a lower
   allocation rank when there is memory aplenty. 
  What does a slab page with a lower allocation rank mean? Slab pages have 
  no allocation ranks that I am aware of.
 I just added allocation rank and didn't you suggest tracking it for all
 slab pages instead of per slab?

Yes but that is not in place so I was wondering what you were talking 
about. It would help to have some longer text describing what you intend 
to do and how rank would work throughout the VM.


Re: Linux 2.6.19

2006-11-30 Thread Malte Schröder
On Thursday 30 November 2006 03:15, David Miller wrote:
 From: Phil Oester [EMAIL PROTECTED]
 Date: Wed, 29 Nov 2006 17:49:04 -0800

  Getting an oops on boot here, caused by commit
  e81c73596704793e73e6dbb478f41686f15a4b34 titled
  [NET]: Fix MAX_HEADER setting.
 
  Reverting that patch fixes things up for me.  Dave?

 I suspect that it might be because I removed the IPV6
 ifdef from the list,  but I can't imagine why that would
 matter other than due to a bug in the IPV6 stack

 Indeed.

 Looking at ndisc_send_rs() I wonder if it miscalculates
 'len' or similar and the old MAX_HEADER setting was
 merely papering around this bug

 In fact it does, the NDISC code is using MAX_HEADER incorrectly.  It
 needs to explicitly allocate space for the struct ipv6hdr in 'len'.
 Luckily the TCP ipv6 code was doing it right.

 What a horrible bug, this patch should fix it.  Let me know
 if it doesn't, thanks:

I also encountered this bug (wasn't there in -rc6). The patch also fixes it 
for me.

regards
-- 
---
Malte Schröder
[EMAIL PROTECTED]
ICQ# 68121508
---





Re: Broken commit: [NETFILTER]: ipt_REJECT: remove largely duplicate route_reverse function

2006-11-30 Thread Jarek Poplawski
On Wed, Nov 29, 2006 at 04:16:06PM +0100, Krzysztof Halasa wrote:
 Krzysztof Halasa [EMAIL PROTECTED] writes:
 
  I wound't care less btw.
 
 s/wound/couldn/, eh those foreign languages...

So, you say, you don't care about David Miller's credits?
It isn't nice. He could be very disappointed...

Jarek P. 


Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP

2006-11-30 Thread Christoph Hellwig
On Wed, Nov 29, 2006 at 07:56:58PM -0600, Wenji Wu wrote:
 Yes, when CONFIG_PREEMPT is disabled, the problem won't happen. That is why 
 I put for 2.6 desktop, low-latency desktop in the uploaded paper. This 
 problem happens in the 2.6 Desktop and Low-latency Desktop.

CONFIG_PREEMPT is only for people that are in for the feeling.  There is no
real world advantage to it and we should probably remove it again.



[PATCH] mv643xx add missing brackets

2006-11-30 Thread Mariusz Kozlowski
Hello,

This patch adds missing brackets.

Signed-off-by: Mariusz Kozlowski [EMAIL PROTECTED]

 include/linux/mv643xx.h |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- linux-2.6.19-rc6-mm2-a/include/linux/mv643xx.h  2006-11-16 05:03:40.0 +0100
+++ linux-2.6.19-rc6-mm2-b/include/linux/mv643xx.h  2006-11-30 01:10:53.0 +0100
@@ -724,7 +724,7 @@
 #define MV643XX_ETH_RX_FIFO_URGENT_THRESHOLD_REG(port) (0x2470 + (port<<10))
 #define MV643XX_ETH_TX_FIFO_URGENT_THRESHOLD_REG(port) (0x2474 + (port<<10))
 #define MV643XX_ETH_RX_MINIMAL_FRAME_SIZE_REG(port)    (0x247c + (port<<10))
-#define MV643XX_ETH_RX_DISCARDED_FRAMES_COUNTER(port)  (0x2484 + (port<<10)
+#define MV643XX_ETH_RX_DISCARDED_FRAMES_COUNTER(port)  (0x2484 + (port<<10))
 #define MV643XX_ETH_PORT_DEBUG_0_REG(port)             (0x248c + (port<<10))
 #define MV643XX_ETH_PORT_DEBUG_1_REG(port)             (0x2490 + (port<<10))
 #define MV643XX_ETH_PORT_INTERNAL_ADDR_ERROR_REG(port) (0x2494 + (port<<10))
@@ -1135,7 +1135,7 @@ struct mv64xxx_i2c_pdata {
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_1 (1<<19)
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_2 (1<<20)
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_3 ((1<<20) | (1<<19))
-#define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_4 ((1<<21)
+#define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_4 ((1<<21))
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_5 ((1<<21) | (1<<19))
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_6 ((1<<21) | (1<<20))
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_7 ((1<<21) | (1<<20) | (1<<19))


-- 
Regards,

Mariusz Kozlowski


Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP

2006-11-30 Thread Evgeniy Polyakov
On Thu, Nov 30, 2006 at 08:35:04AM +0100, Ingo Molnar ([EMAIL PROTECTED]) wrote:
 what was observed here were the effects of completely throttling TCP 
 processing for a given socket. I think such throttling can in fact be 
 desirable: there is a /reason/ why the process context was preempted: in 
 that load scenario there was 10 times more processing requested from the 
 CPU than it can possibly service. It's a serious overload situation and 
 it's the scheduler's task to prioritize between workloads!
 
 normally such kind of throttling of the TCP stack for this particular 
 socket does not happen. Note that there's no performance lost: we dont 
 do TCP processing because there are /9 other tasks for this CPU to run/, 
 and the scheduler has a tough choice.
 
 Now i agree that there are more intelligent ways to throttle and less 
 intelligent ways to throttle, but the notion to allow a given workload 
 'steal' CPU time from other workloads by allowing it to push its 
 processing into a softirq is i think unfair. (and this issue is 
 partially addressed by my softirq threading patches in -rt :-)

Isn't the provided solution just an in-kernel variant of
SCHED_FIFO set from userspace? Why should the kernel be able to mark some
users as having higher priority?
What if the workload of the system is targeted not at maximum TCP
performance but at maximum other-task performance, which would be broken
by the provided patch?

   Ingo

-- 
Evgeniy Polyakov


Re: [PATCH] mv643xx add missing brackets

2006-11-30 Thread Dale Farnsworth
On Thu, Nov 30, 2006 at 10:35:37AM +0100, Mariusz Kozlowski wrote:
 Hello,
 
   This patch adds missing brackets.
 
 Signed-off-by: Mariusz Kozlowski [EMAIL PROTECTED]
 
  include/linux/mv643xx.h |4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)
 
 --- linux-2.6.19-rc6-mm2-a/include/linux/mv643xx.h  2006-11-16 05:03:40.0 +0100
 +++ linux-2.6.19-rc6-mm2-b/include/linux/mv643xx.h  2006-11-30 01:10:53.0 +0100
 @@ -724,7 +724,7 @@
  #define MV643XX_ETH_RX_FIFO_URGENT_THRESHOLD_REG(port) (0x2470 + (port<<10))
  #define MV643XX_ETH_TX_FIFO_URGENT_THRESHOLD_REG(port) (0x2474 + (port<<10))
  #define MV643XX_ETH_RX_MINIMAL_FRAME_SIZE_REG(port)    (0x247c + (port<<10))
 -#define MV643XX_ETH_RX_DISCARDED_FRAMES_COUNTER(port)  (0x2484 + (port<<10)
 +#define MV643XX_ETH_RX_DISCARDED_FRAMES_COUNTER(port)  (0x2484 + (port<<10))

Good.  Thanks.

  #define MV643XX_ETH_PORT_DEBUG_0_REG(port)             (0x248c + (port<<10))
  #define MV643XX_ETH_PORT_DEBUG_1_REG(port)             (0x2490 + (port<<10))
  #define MV643XX_ETH_PORT_INTERNAL_ADDR_ERROR_REG(port) (0x2494 + (port<<10))
 @@ -1135,7 +1135,7 @@ struct mv64xxx_i2c_pdata {
  #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_1   (1<<19)
  #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_2   (1<<20)
  #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_3   ((1<<20) | (1<<19))
 -#define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_4   ((1<<21)
 +#define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_4   ((1<<21))

Mariusz, please remove the extra parenthesis instead of adding
an extra one, like:
#define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_4  (1<<21)
and resubmit.

Thanks,
-Dale


Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP

2006-11-30 Thread Nick Piggin

Evgeniy Polyakov wrote:

On Thu, Nov 30, 2006 at 08:35:04AM +0100, Ingo Molnar ([EMAIL PROTECTED]) wrote:



Isn't the provided solution just an in-kernel variant of
SCHED_FIFO set from userspace? Why should the kernel be able to mark some
users as having higher priority?
What if the workload of the system is targeted not at maximum TCP
performance but at maximum other-task performance, which would be broken
by the provided patch?


David's line of thinking for a solution sounds better to me. This patch
does not prevent the process from being preempted (for potentially a long
time), by any means.

--
SUSE Labs, Novell Inc.




Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP

2006-11-30 Thread Evgeniy Polyakov
On Thu, Nov 30, 2006 at 09:07:42PM +1100, Nick Piggin ([EMAIL PROTECTED]) wrote:
 Isn't the provided solution just an in-kernel variant of
 SCHED_FIFO set from userspace? Why should the kernel be able to mark some
 users as having higher priority?
 What if the workload of the system is targeted not at maximum TCP
 performance but at maximum other-task performance, which would be broken
 by the provided patch?
 
 David's line of thinking for a solution sounds better to me. This patch
 does not prevent the process from being preempted (for potentially a long
 time), by any means.

It steals timeslices from other processes to complete tcp_recvmsg()
task, and only when it does it for too long, it will be preempted.
Processing backlog queue on behalf of need_resched() will break fairness
too - processing itself can take a lot of time, so process can be
scheduled away in that part too.

 -- 
 SUSE Labs, Novell Inc.

-- 
Evgeniy Polyakov


Re: [PATCH] mv643xx add missing brackets

2006-11-30 Thread Mariusz Kozlowski
  +#define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_4 ((1<<21))
 
 Mariusz, please remove the extra parenthesis instead of adding
 an extra one, like:
   #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_4  (1<<21)
 and resubmit.

Sure. Here it goes. Second try:

This patch fixes some mv643xx macros.

Signed-off-by: Mariusz Kozlowski [EMAIL PROTECTED]

 include/linux/mv643xx.h |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- linux-2.6.19-rc6-mm2-a/include/linux/mv643xx.h  2006-11-16 05:03:40.0 +0100
+++ linux-2.6.19-rc6-mm2-b/include/linux/mv643xx.h  2006-11-30 11:30:14.0 +0100
@@ -724,7 +724,7 @@
 #define MV643XX_ETH_RX_FIFO_URGENT_THRESHOLD_REG(port) (0x2470 + (port<<10))
 #define MV643XX_ETH_TX_FIFO_URGENT_THRESHOLD_REG(port) (0x2474 + (port<<10))
 #define MV643XX_ETH_RX_MINIMAL_FRAME_SIZE_REG(port)    (0x247c + (port<<10))
-#define MV643XX_ETH_RX_DISCARDED_FRAMES_COUNTER(port)  (0x2484 + (port<<10)
+#define MV643XX_ETH_RX_DISCARDED_FRAMES_COUNTER(port)  (0x2484 + (port<<10))
 #define MV643XX_ETH_PORT_DEBUG_0_REG(port)             (0x248c + (port<<10))
 #define MV643XX_ETH_PORT_DEBUG_1_REG(port)             (0x2490 + (port<<10))
 #define MV643XX_ETH_PORT_INTERNAL_ADDR_ERROR_REG(port) (0x2494 + (port<<10))
@@ -1135,7 +1135,7 @@ struct mv64xxx_i2c_pdata {
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_1 (1<<19)
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_2 (1<<20)
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_3 ((1<<20) | (1<<19))
-#define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_4 ((1<<21)
+#define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_4 (1<<21)
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_5 ((1<<21) | (1<<19))
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_6 ((1<<21) | (1<<20))
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_7 ((1<<21) | (1<<20) | (1<<19))


-- 
Regards,

Mariusz Kozlowski


Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP

2006-11-30 Thread Ingo Molnar

* Evgeniy Polyakov [EMAIL PROTECTED] wrote:

  David's line of thinking for a solution sounds better to me. This 
  patch does not prevent the process from being preempted (for 
  potentially a long time), by any means.
 
 It steals timeslices from other processes to complete tcp_recvmsg() 
 task, and only when it does it for too long, it will be preempted. 
 Processing backlog queue on behalf of need_resched() will break 
 fairness too - processing itself can take a lot of time, so process 
 can be scheduled away in that part too.

correct - it's just the wrong thing to do. The '10% performance win' 
that was measured was against _9 other tasks who contended for the same 
CPU resource_. I.e. it's /not/ an absolute 'performance win' AFAICS, 
it's a simple shift in CPU cycles away from the other 9 tasks and 
towards the task that does TCP receive.

Note that even without the change the TCP receiving task is already 
getting a disproportionate share of cycles due to softirq processing! 
Under a load of 10.0 it went from 500 mbits to 74 mbits, while the 
'fair' share would be 50 mbits. So the TCP receiver /already/ has an 
unfair advantage. The patch only deepens that unfairness.

The solution is really simple and needs no kernel change at all: if you 
want the TCP receiver to get a larger share of timeslices then either 
renice it to -20 or renice the other tasks to +19.
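
For reference, the same adjustment can be made from inside a program with the POSIX setpriority() call, which is what renice uses underneath. A minimal sketch follows; the helper names are illustrative. It lowers the calling process to nice +19, since raising the nice value needs no privilege, while setting -20 normally requires root.

```c
#include <errno.h>
#include <sys/resource.h>

/* Set the calling process's nice value; returns 0 on success, -1 on error. */
static int renice_self(int nice_value)
{
    return setpriority(PRIO_PROCESS, 0, nice_value);
}

/* Read back the calling process's nice value; returns 0 on success. */
static int current_nice(int *out)
{
    int value;

    errno = 0;   /* getpriority() can legitimately return -1, so check errno */
    value = getpriority(PRIO_PROCESS, 0);
    if (value == -1 && errno != 0)
        return -1;
    *out = value;
    return 0;
}
```

Calling renice_self(19) in each competing task corresponds to the "renice the other tasks to +19" option.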

The other disadvantage, even ignoring that it's the wrong thing to do, 
is the crudeness of preempt_disable() that i mentioned in the other 
post:

--

independently of the issue at hand, in general the explicit use of 
preempt_disable() in non-infrastructure code is quite a heavy tool. Its 
effects are heavy and global: it disables /all/ preemption (even on 
PREEMPT_RT). Furthermore, when preempt_disable() is used for per-CPU 
data structures then [unlike for example to a spin-lock] the connection 
between the 'data' and the 'lock' is not explicit - causing all kinds of 
grief when trying to convert such code to a different preemption model. 
(such as PREEMPT_RT :-)

So my plan is to remove all open-coded use of preempt_disable() [and 
raw use of local_irq_save/restore] from the kernel and replace it with 
some facility that connects data and lock. (Note that this will not 
result in any actual changes on the instruction level because internally 
every such facility still maps to preempt_disable() on non-PREEMPT_RT 
kernels, so on non-PREEMPT_RT kernels such code will still be the same 
as before.)

Ingo


[PATCH] mv643xx_eth: fix unbalanced parentheses in macros

2006-11-30 Thread Dale Farnsworth
From: Mariusz Kozlowski [EMAIL PROTECTED]

Signed-off-by: Mariusz Kozlowski [EMAIL PROTECTED]
Signed-off-by: Dale Farnsworth [EMAIL PROTECTED]

---
 include/linux/mv643xx.h |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- linux-2.6.19-rc6-mm2-a/include/linux/mv643xx.h  2006-11-16 05:03:40.0 +0100
+++ linux-2.6.19-rc6-mm2-b/include/linux/mv643xx.h  2006-11-30 11:30:14.0 +0100
@@ -724,7 +724,7 @@
 #define MV643XX_ETH_RX_FIFO_URGENT_THRESHOLD_REG(port) (0x2470 + (port<<10))
 #define MV643XX_ETH_TX_FIFO_URGENT_THRESHOLD_REG(port) (0x2474 + (port<<10))
 #define MV643XX_ETH_RX_MINIMAL_FRAME_SIZE_REG(port)    (0x247c + (port<<10))
-#define MV643XX_ETH_RX_DISCARDED_FRAMES_COUNTER(port)  (0x2484 + (port<<10)
+#define MV643XX_ETH_RX_DISCARDED_FRAMES_COUNTER(port)  (0x2484 + (port<<10))
 #define MV643XX_ETH_PORT_DEBUG_0_REG(port)             (0x248c + (port<<10))
 #define MV643XX_ETH_PORT_DEBUG_1_REG(port)             (0x2490 + (port<<10))
 #define MV643XX_ETH_PORT_INTERNAL_ADDR_ERROR_REG(port) (0x2494 + (port<<10))
@@ -1135,7 +1135,7 @@ struct mv64xxx_i2c_pdata {
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_1 (1<<19)
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_2 (1<<20)
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_3 ((1<<20) | (1<<19))
-#define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_4 ((1<<21)
+#define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_4 (1<<21)
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_5 ((1<<21) | (1<<19))
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_6 ((1<<21) | (1<<20))
 #define MV643XX_ETH_DEFAULT_RX_UDP_QUEUE_7 ((1<<21) | (1<<20) | (1<<19))


RE: [patch 1/4] - Potential performance bottleneck for Linxu TCP

2006-11-30 Thread Wenji Wu


> We can make explicit preemption checks in the main loop of
> tcp_recvmsg(), and release the socket and run the backlog if
> need_resched() is TRUE.
>
> This is the simplest and most elegant solution to this problem.


I am not sure whether this approach will work. How can you make the explicit
preemption checks?



For Desktop case, yes, you can make explicit preemption checks at some
points to see whether need_resched() is true. But when need_resched() is true, you
cannot tell whether it was triggered by a higher priority process becoming
runnable, or by the process within tcp_recvmsg() expiring.


If a higher priority process becomes runnable (e.g., an interactive
process), you had better yield the CPU instead of continuing this process. If
it is the case that the process within tcp_recvmsg() is expiring, then you
can let the process go ahead and process the backlog.

For Low-latency Desktop case, I believe it is very hard to make the checks.
We do not know when the process is going to expire, or when a higher priority
process will become runnable. The process could expire at any moment, or a
higher priority process could become runnable at any moment. If we do not
want to trade off system responsiveness, where do you want to make the check?
If you just make the check and need_resched() becomes TRUE, what are you
going to do in this case?

wenji






Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP

2006-11-30 Thread Lee Revell
On Thu, 2006-11-30 at 09:33 +, Christoph Hellwig wrote:
> On Wed, Nov 29, 2006 at 07:56:58PM -0600, Wenji Wu wrote:
> > Yes, when CONFIG_PREEMPT is disabled, the problem won't happen. That is
> > why I put "for 2.6 desktop, low-latency desktop" in the uploaded paper.
> > This problem happens in the 2.6 Desktop and Low-latency Desktop.
>
> CONFIG_PREEMPT is only for people that are in for the feeling.  There is no
> real world advantage to it and we should probably remove it again.

There certainly is a real world advantage for many applications.  Of
course it would be better if the latency requirements could be met
without kernel preemption but that's not the case now.

Lee



RE: [patch 1/4] - Potential performance bottleneck for Linxu TCP

2006-11-30 Thread Wenji Wu

> The solution is really simple and needs no kernel change at all: if you
> want the TCP receiver to get a larger share of timeslices then either
> renice it to -20 or renice the other tasks to +19.

Simply giving a larger share of timeslices to the TCP receiver won't solve the
problem.  No matter what the timeslice is, if the TCP receiving process has
packets in the backlog when it expires and is moved to the expired
array, an RTO might happen in the TCP sender.

The solution does not look that simple.

wenji






[take26 7/8] kevent: Signal notifications.

2006-11-30 Thread Evgeniy Polyakov

Signal notifications.

This type of notification allows signals to be delivered through the kevent queue.
An example application, signal.c, can be found on the project homepage.

If the KEVENT_SIGNAL_NOMASK bit is set in the raw_u64 id, the signal will be
delivered only through the queue; otherwise both delivery types are used: the old
one, through an update of the mask of pending signals, and the queue.

If a signal is delivered only through the kevent queue, the mask of pending signals
is not updated at all, which is equivalent to putting the signal into the blocked mask,
but with delivery of that signal through the kevent queue.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/include/linux/sched.h b/include/linux/sched.h
index fc4a987..ef38a3c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -80,6 +80,7 @@ struct sched_param {
 #include <linux/resource.h>
 #include <linux/timer.h>
 #include <linux/hrtimer.h>
+#include <linux/kevent_storage.h>
 
 #include <asm/processor.h>
 
@@ -1013,6 +1014,10 @@ struct task_struct {
 #ifdef CONFIG_TASK_DELAY_ACCT
 	struct task_delay_info *delays;
 #endif
+#ifdef CONFIG_KEVENT_SIGNAL
+	struct kevent_storage st;
+	u32 kevent_signals;
+#endif
 };
 
 static inline pid_t process_group(struct task_struct *tsk)
diff --git a/kernel/fork.c b/kernel/fork.c
index 1c999f3..e5b5b14 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -46,6 +46,7 @@
 #include <linux/delayacct.h>
 #include <linux/taskstats_kern.h>
 #include <linux/random.h>
+#include <linux/kevent.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -115,6 +116,9 @@ void __put_task_struct(struct task_struc
 	WARN_ON(atomic_read(&tsk->usage));
 	WARN_ON(tsk == current);
 
+#ifdef CONFIG_KEVENT_SIGNAL
+	kevent_storage_fini(&tsk->st);
+#endif
 	security_task_free(tsk);
 	free_uid(tsk->user);
 	put_group_info(tsk->group_info);
@@ -1121,6 +1125,10 @@ static struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_cleanup_namespace;
 
+#ifdef CONFIG_KEVENT_SIGNAL
+	kevent_storage_init(p, &p->st);
+#endif
+
 	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
 	/*
 	 * Clear TID on mm_release()?
diff --git a/kernel/kevent/kevent_signal.c b/kernel/kevent/kevent_signal.c
new file mode 100644
index 000..0edd2e4
--- /dev/null
+++ b/kernel/kevent/kevent_signal.c
@@ -0,0 +1,92 @@
+/*
+ * kevent_signal.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/kevent.h>
+
+static int kevent_signal_callback(struct kevent *k)
+{
+	struct task_struct *tsk = k->st->origin;
+	int sig = k->event.id.raw[0];
+	int ret = 0;
+
+	if (sig == tsk->kevent_signals)
+		ret = 1;
+
+	if (ret && (k->event.id.raw_u64 & KEVENT_SIGNAL_NOMASK))
+		tsk->kevent_signals |= 0x8000;
+
+	return ret;
+}
+
+int kevent_signal_enqueue(struct kevent *k)
+{
+	int err;
+
+	err = kevent_storage_enqueue(&current->st, k);
+	if (err)
+		goto err_out_exit;
+
+	if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE) {
+		kevent_requeue(k);
+		err = 0;
+	} else {
+		err = k->callbacks.callback(k);
+		if (err)
+			goto err_out_dequeue;
+	}
+
+	return err;
+
+err_out_dequeue:
+	kevent_storage_dequeue(k->st, k);
+err_out_exit:
+	return err;
+}
+
+int kevent_signal_dequeue(struct kevent *k)
+{
+	kevent_storage_dequeue(k->st, k);
+	return 0;
+}
+
+int kevent_signal_notify(struct task_struct *tsk, int sig)
+{
+	tsk->kevent_signals = sig;
+	kevent_storage_ready(&tsk->st, NULL, KEVENT_SIGNAL_DELIVERY);
+	return (tsk->kevent_signals & 0x8000);
+}
+
+static int __init kevent_init_signal(void)
+{
+	struct kevent_callbacks sc = {
+		.callback = &kevent_signal_callback,
+		.enqueue = &kevent_signal_enqueue,
+		.dequeue = &kevent_signal_dequeue};
+
+	return kevent_add_callbacks(&sc, KEVENT_SIGNAL);
+}
+module_init(kevent_init_signal);
diff --git 

[take26 6/8] kevent: Pipe notifications.

2006-11-30 Thread Evgeniy Polyakov

Pipe notifications.


diff --git a/fs/pipe.c b/fs/pipe.c
index f3b6f71..aeaee9c 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -16,6 +16,7 @@
 #include <linux/uio.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
+#include <linux/kevent.h>
 
 #include <asm/uaccess.h>
 #include <asm/ioctls.h>
@@ -312,6 +313,7 @@ redo:
 			break;
 		}
 		if (do_wakeup) {
+			kevent_pipe_notify(inode, KEVENT_SOCKET_SEND);
 			wake_up_interruptible_sync(&pipe->wait);
 			kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
 		}
@@ -321,6 +323,7 @@ redo:
 
 	/* Signal writers asynchronously that there is more room. */
 	if (do_wakeup) {
+		kevent_pipe_notify(inode, KEVENT_SOCKET_SEND);
 		wake_up_interruptible(&pipe->wait);
 		kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
 	}
@@ -490,6 +493,7 @@ redo2:
 			break;
 		}
 		if (do_wakeup) {
+			kevent_pipe_notify(inode, KEVENT_SOCKET_RECV);
 			wake_up_interruptible_sync(&pipe->wait);
 			kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 			do_wakeup = 0;
@@ -501,6 +505,7 @@ redo2:
 out:
 	mutex_unlock(&inode->i_mutex);
 	if (do_wakeup) {
+		kevent_pipe_notify(inode, KEVENT_SOCKET_RECV);
 		wake_up_interruptible(&pipe->wait);
 		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 	}
@@ -605,6 +610,7 @@ pipe_release(struct inode *inode, int de
 		free_pipe_info(inode);
 	} else {
 		wake_up_interruptible(&pipe->wait);
+		kevent_pipe_notify(inode, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV);
 		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 		kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
 	}
diff --git a/kernel/kevent/kevent_pipe.c b/kernel/kevent/kevent_pipe.c
new file mode 100644
index 000..d529fa9
--- /dev/null
+++ b/kernel/kevent/kevent_pipe.c
@@ -0,0 +1,121 @@
+/*
+ * kevent_pipe.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/kevent.h>
+#include <linux/pipe_fs_i.h>
+
+static int kevent_pipe_callback(struct kevent *k)
+{
+	struct inode *inode = k->st->origin;
+	struct pipe_inode_info *pipe = inode->i_pipe;
+	int nrbufs = pipe->nrbufs;
+
+	if (k->event.event & KEVENT_SOCKET_RECV && nrbufs > 0) {
+		if (!pipe->writers)
+			return -1;
+		return 1;
+	}
+
+	if (k->event.event & KEVENT_SOCKET_SEND && nrbufs < PIPE_BUFFERS) {
+		if (!pipe->readers)
+			return -1;
+		return 1;
+	}
+
+	return 0;
+}
+
+int kevent_pipe_enqueue(struct kevent *k)
+{
+	struct file *pipe;
+	int err = -EBADF;
+	struct inode *inode;
+
+	pipe = fget(k->event.id.raw[0]);
+	if (!pipe)
+		goto err_out_exit;
+
+	inode = igrab(pipe->f_dentry->d_inode);
+	if (!inode)
+		goto err_out_fput;
+
+	err = -EINVAL;
+	if (!S_ISFIFO(inode->i_mode))
+		goto err_out_iput;
+
+	err = kevent_storage_enqueue(&inode->st, k);
+	if (err)
+		goto err_out_iput;
+
+	if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE) {
+		kevent_requeue(k);
+		err = 0;
+	} else {
+		err = k->callbacks.callback(k);
+		if (err)
+			goto err_out_dequeue;
+	}
+
+	fput(pipe);
+
+	return err;
+
+err_out_dequeue:
+	kevent_storage_dequeue(k->st, k);
+err_out_iput:
+	iput(inode);
+err_out_fput:
+	fput(pipe);
+err_out_exit:
+	return err;
+}
+
+int kevent_pipe_dequeue(struct kevent *k)
+{
+	struct inode *inode = k->st->origin;
+
+	kevent_storage_dequeue(k->st, k);
+	iput(inode);
+
+	return 0;
+}
+
+void kevent_pipe_notify(struct inode *inode, u32 

[take26 4/8] kevent: Socket notifications.

2006-11-30 Thread Evgeniy Polyakov

Socket notifications.

This patch includes socket send/recv/accept notifications.
With a trivial web server based on kevent and these features
instead of epoll, its performance increased more than noticeably.
More details about various benchmarks and the server itself
(evserver_kevent.c) can be found on the project's homepage.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/fs/inode.c b/fs/inode.c
index ada7643..2740617 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@
 #include <linux/cdev.h>
 #include <linux/bootmem.h>
 #include <linux/inotify.h>
+#include <linux/kevent.h>
 #include <linux/mount.h>
 
 /*
@@ -164,12 +165,18 @@ static struct inode *alloc_inode(struct
 		}
 		inode->i_private = 0;
 		inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+		kevent_storage_init(inode, &inode->st);
+#endif
 	}
 	return inode;
 }
 
 void destroy_inode(struct inode *inode)
 {
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+	kevent_storage_fini(&inode->st);
+#endif
 	BUG_ON(inode_has_buffers(inode));
 	security_inode_free(inode);
 	if (inode->i_sb->s_op->destroy_inode)
diff --git a/include/net/sock.h b/include/net/sock.h
index edd4d73..d48ded8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -48,6 +48,7 @@
 #include <linux/netdevice.h>
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/security.h>
+#include <linux/kevent.h>
 
 #include <linux/filter.h>
 
@@ -450,6 +451,21 @@ static inline int sk_stream_memory_free(
 
 extern void sk_stream_rfree(struct sk_buff *skb);
 
+struct socket_alloc {
+	struct socket socket;
+	struct inode vfs_inode;
+};
+
+static inline struct socket *SOCKET_I(struct inode *inode)
+{
+	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
+}
+
+static inline struct inode *SOCK_INODE(struct socket *socket)
+{
+	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
+}
+
 static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk)
 {
 	skb->sk = sk;
@@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct
 		sk->sk_backlog.tail = skb;
 	}
 	skb->next = NULL;
+	kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
 }
 
 #define sk_wait_event(__sk, __timeo, __condition)		\
@@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio
 	return si->kiocb;
 }
 
-struct socket_alloc {
-	struct socket socket;
-	struct inode vfs_inode;
-};
-
-static inline struct socket *SOCKET_I(struct inode *inode)
-{
-	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
-}
-
-static inline struct inode *SOCK_INODE(struct socket *socket)
-{
-	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
-}
-
 extern void __sk_stream_mem_reclaim(struct sock *sk);
 extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7a093d0..69f4ad2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -857,6 +857,7 @@ static inline int tcp_prequeue(struct so
 		tp->ucopy.memory = 0;
 	} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
 		wake_up_interruptible(sk->sk_sleep);
+		kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
 		if (!inet_csk_ack_scheduled(sk))
 			inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
 						  (3 * TCP_RTO_MIN) / 4,
diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c
new file mode 100644
index 000..9c24b5b
--- /dev/null
+++ b/kernel/kevent/kevent_socket.c
@@ -0,0 +1,142 @@
+/*
+ * kevent_socket.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/tcp.h>
+#include <linux/kevent.h>
+
+#include <net/sock.h>
+#include <net/request_sock.h>
+#include 

[take26 5/8] kevent: Timer notifications.

2006-11-30 Thread Evgeniy Polyakov

Timer notifications.

Timer notifications can be used for fine grained per-process time 
management, since interval timers are very inconvenient to use, 
and they are limited.

This subsystem uses high-resolution timers.
id.raw[0] is used as number of seconds
id.raw[1] is used as number of nanoseconds

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
new file mode 100644
index 000..df93049
--- /dev/null
+++ b/kernel/kevent/kevent_timer.c
@@ -0,0 +1,112 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/hrtimer.h>
+#include <linux/jiffies.h>
+#include <linux/kevent.h>
+
+struct kevent_timer
+{
+	struct hrtimer		ktimer;
+	struct kevent_storage	ktimer_storage;
+	struct kevent		*ktimer_event;
+};
+
+static int kevent_timer_func(struct hrtimer *timer)
+{
+	struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer);
+	struct kevent *k = t->ktimer_event;
+
+	kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL);
+	hrtimer_forward(timer, timer->base->softirq_time,
+			ktime_set(k->event.id.raw[0], k->event.id.raw[1]));
+	return HRTIMER_RESTART;
+}
+
+static struct lock_class_key kevent_timer_key;
+
+static int kevent_timer_enqueue(struct kevent *k)
+{
+	int err;
+	struct kevent_timer *t;
+
+	t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL);
+	if (!t)
+		return -ENOMEM;
+
+	hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL);
+	t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]);
+	t->ktimer.function = kevent_timer_func;
+	t->ktimer_event = k;
+
+	err = kevent_storage_init(&t->ktimer, &t->ktimer_storage);
+	if (err)
+		goto err_out_free;
+	lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
+
+	err = kevent_storage_enqueue(&t->ktimer_storage, k);
+	if (err)
+		goto err_out_st_fini;
+
+	hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL);
+
+	return 0;
+
+err_out_st_fini:
+	kevent_storage_fini(&t->ktimer_storage);
+err_out_free:
+	kfree(t);
+
+	return err;
+}
+
+static int kevent_timer_dequeue(struct kevent *k)
+{
+	struct kevent_storage *st = k->st;
+	struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage);
+
+	hrtimer_cancel(&t->ktimer);
+	kevent_storage_dequeue(st, k);
+	kfree(t);
+
+	return 0;
+}
+
+static int kevent_timer_callback(struct kevent *k)
+{
+	k->event.ret_data[0] = jiffies_to_msecs(jiffies);
+	return 1;
+}
+
+static int __init kevent_init_timer(void)
+{
+	struct kevent_callbacks tc = {
+		.callback = &kevent_timer_callback,
+		.enqueue = &kevent_timer_enqueue,
+		.dequeue = &kevent_timer_dequeue};
+
+	return kevent_add_callbacks(&tc, KEVENT_TIMER);
+}
+module_init(kevent_init_timer);
+



[take26 8/8] kevent: Kevent posix timer notifications.

2006-11-30 Thread Evgeniy Polyakov

Kevent posix timer notifications.

Simple extension to POSIX timers which allows
notification of timer expiration to be delivered
through the kevent queue.

An example application, posix_timer.c, can be found
in the archive on the project homepage.

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]


diff --git a/include/asm-generic/siginfo.h b/include/asm-generic/siginfo.h
index 8786e01..3768746 100644
--- a/include/asm-generic/siginfo.h
+++ b/include/asm-generic/siginfo.h
@@ -235,6 +235,7 @@ typedef struct siginfo {
 #define SIGEV_NONE 1   /* other notification: meaningless */
 #define SIGEV_THREAD   2   /* deliver via thread creation */
 #define SIGEV_THREAD_ID 4  /* deliver to thread */
+#define SIGEV_KEVENT   8   /* deliver through kevent queue */
 
 /*
  * This works because the alignment is ok on all current architectures
@@ -260,6 +261,8 @@ typedef struct sigevent {
void (*_function)(sigval_t);
void *_attribute;   /* really pthread_attr_t */
} _sigev_thread;
+
+   int kevent_fd;
} _sigev_un;
 } sigevent_t;
 
diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index a7dd38f..4b9deb4 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -4,6 +4,7 @@
 #include <linux/spinlock.h>
 #include <linux/list.h>
 #include <linux/sched.h>
+#include <linux/kevent_storage.h>
 
 union cpu_time_count {
cputime_t cpu;
@@ -49,6 +50,9 @@ struct k_itimer {
sigval_t it_sigev_value;/* value word of sigevent struct */
struct task_struct *it_process; /* process to send signal to */
struct sigqueue *sigq;  /* signal queue entry. */
+#ifdef CONFIG_KEVENT_TIMER
+   struct kevent_storage st;
+#endif
union {
struct {
struct hrtimer timer;
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index e5ebcc1..8d0e7a3 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -48,6 +48,8 @@
 #include <linux/wait.h>
 #include <linux/workqueue.h>
 #include <linux/module.h>
+#include <linux/kevent.h>
+#include <linux/file.h>
 
 /*
  * Management arrays for POSIX timers.  Timers are kept in slab memory
@@ -224,6 +226,99 @@ static int posix_ktime_get_ts(clockid_t
return 0;
 }
 
+#ifdef CONFIG_KEVENT_TIMER
+static int posix_kevent_enqueue(struct kevent *k)
+{
+	/*
+	 * It is not ugly - there is no pointer in the id field union,
+	 * but its size is 64bits, which is ok for any known pointer size.
+	 */
+	struct k_itimer *tmr = (struct k_itimer *)(unsigned long)k->event.id.raw_u64;
+	return kevent_storage_enqueue(&tmr->st, k);
+}
+static int posix_kevent_dequeue(struct kevent *k)
+{
+	struct k_itimer *tmr = (struct k_itimer *)(unsigned long)k->event.id.raw_u64;
+	kevent_storage_dequeue(&tmr->st, k);
+	return 0;
+}
+static int posix_kevent_callback(struct kevent *k)
+{
+	return 1;
+}
+static int posix_kevent_init(void)
+{
+	struct kevent_callbacks tc = {
+		.callback = &posix_kevent_callback,
+		.enqueue = &posix_kevent_enqueue,
+		.dequeue = &posix_kevent_dequeue};
+
+	return kevent_add_callbacks(&tc, KEVENT_POSIX_TIMER);
+}
+
+extern struct file_operations kevent_user_fops;
+
+static int posix_kevent_init_timer(struct k_itimer *tmr, int fd)
+{
+	struct ukevent uk;
+	struct file *file;
+	struct kevent_user *u;
+	int err;
+
+	file = fget(fd);
+	if (!file) {
+		err = -EBADF;
+		goto err_out;
+	}
+
+	if (file->f_op != &kevent_user_fops) {
+		err = -EINVAL;
+		goto err_out_fput;
+	}
+
+	u = file->private_data;
+
+	memset(&uk, 0, sizeof(struct ukevent));
+
+	uk.event = KEVENT_MASK_ALL;
+	uk.type = KEVENT_POSIX_TIMER;
+	uk.id.raw_u64 = (unsigned long)(tmr); /* Just cast to something unique */
+	uk.req_flags = KEVENT_REQ_ONESHOT | KEVENT_REQ_ALWAYS_QUEUE;
+	uk.ptr = tmr->it_sigev_value.sival_ptr;
+
+	err = kevent_user_add_ukevent(&uk, u);
+	if (err)
+		goto err_out_fput;
+
+	fput(file);
+
+	return 0;
+
+err_out_fput:
+	fput(file);
+err_out:
+	return err;
+}
+
+static void posix_kevent_fini_timer(struct k_itimer *tmr)
+{
+	kevent_storage_fini(&tmr->st);
+}
+#else
+static int posix_kevent_init_timer(struct k_itimer *tmr, int fd)
+{
+	return -ENOSYS;
+}
+static int posix_kevent_init(void)
+{
+	return 0;
+}
+static void posix_kevent_fini_timer(struct k_itimer *tmr)
+{
+}
+#endif
+
+
+
 /*
  * Initialize everything, well, just everything in Posix clocks/timers ;)
  */
@@ -241,6 +336,11 @@ static __init int init_posix_timers(void
 	register_posix_clock(CLOCK_REALTIME, &clock_realtime);
 	register_posix_clock(CLOCK_MONOTONIC, &clock_monotonic);
 
+   if (posix_kevent_init()) {
+   

[take26 0/8] kevent: Generic event handling mechanism.

2006-11-30 Thread Evgeniy Polyakov

Generic event handling mechanism.

Kevent is a generic subsystem which allows handling of event notifications.
It supports both level and edge triggered events. It is similar to
poll/epoll in some cases, but it is more scalable, it is faster and
allows work with essentially any kind of events.

Events are provided into kernel through control syscall and can be read
back through ring buffer or using usual syscalls.
Kevent update (i.e. readiness switching) happens directly from internals
of the appropriate state machine of the underlying subsystem (like
network, filesystem, timer or any other).

Homepage:
http://tservice.net.ru/~s0mbre/old/?section=projectsitem=kevent

Documentation page (will update Dec 1):
http://linux-net.osdl.org/index.php/Kevent

I installed slightly used, but still functional (bought on ebay) remote 
mind reader, and set it up to read Ulrich's alpha brain waves (I hope he 
agrees that it is a good decision), which took me the whole week.
So I think the last ring buffer implementation is what we all wanted.
Details in documentation part.

It seems that setup was correct and we finally found what we wanted from
interface part.

Changes from 'take25' patchset:
 * use timespec as timeout parameter.
 * added high-resolution timer to handle absolute timeouts.
 * added flags to waiting and initialization syscalls.
 * kevent_commit() has new_uidx parameter.
 * kevent_wait() has old_uidx parameter, which, if not equal to u->uidx,
   results in immediate wakeup (useful for the case when entries
   are added asynchronously from kernel (not supported for now)).
 * added interface to mark any event as ready.
 * event POSIX timers support.
 * return -ENOSYS if there is no registered event type.
 * provided file descriptor must be checked for fifo type (spotted by Eric Dumazet).
 * documentation update.
 * lighttpd patch updated (the latest benchmarks with lighttpd patch can be found in blog).
Changes from 'take24' patchset:
 * new (old (new)) ring buffer implementation with kernel and user indexes.
 * added initialization syscall instead of opening /dev/kevent
 * kevent_commit() syscall to commit ring buffer entries
 * changed KEVENT_REQ_WAKEUP_ONE flag to KEVENT_REQ_WAKEUP_ALL, kevent wakes
   only first thread always if that flag is not set
 * KEVENT_REQ_ALWAYS_QUEUE flag. If set, kevent will be queued into ready queue
   instead of copying back to userspace when kevent is ready immediately when
   it is added.
 * lighttpd patch (Hail! Although nothing really outstanding compared to epoll)

Changes from 'take23' patchset:
 * kevent PIPE notifications
 * KEVENT_REQ_LAST_CHECK flag, which allows to perform last check at dequeueing time
 * fixed poll/select notifications (were broken due to tree manipulations)
 * made Documentation/kevent.txt look nice in 80-col terminal
 * fix for copy_to_user() failure report for the first kevent (Andrew Morton)
 * minor function renames

Changes from 'take22' patchset:
 * new ring buffer implementation in process' memory
 * wakeup-one-thread flag
 * edge-triggered behaviour

Changes from 'take21' patchset:
 * minor cleanups (different return values, removed unneeded variables, whitespaces and so on)
 * fixed bug in kevent removal in case when kevent being removed
   is the same as overflow_kevent (spotted by Eric Dumazet)

Changes from 'take20' patchset:
 * new ring buffer implementation
 * removed artificial limit on possible number of kevents

Changes from 'take19' patchset:
 * use __init instead of __devinit
 * removed 'default N' from config for user statistic
 * removed kevent_user_fini() since kevent can not be unloaded
 * use KERN_INFO for statistic output

Changes from 'take18' patchset:
 * use __init instead of __devinit
 * removed 'default N' from config for user statistic
 * removed kevent_user_fini() since kevent can not be unloaded
 * use KERN_INFO for statistic output

Changes from 'take17' patchset:
 * Use RB tree instead of hash table.
   At least for a web server, frequency of addition/deletion of new kevent
   is comparable with number of search access, i.e. most of the time events
   are added, accessed only a couple of times and then removed, so it justifies
   RB tree usage over AVL tree, since the latter does have much slower deletion
   time (max O(log(N)) compared to 3 ops),
   although faster search time (1.44*O(log(N)) vs. 2*O(log(N))).
   So for kevents I use RB tree for now and later, when my AVL tree implementation
   is ready, it will be possible to compare them.
 * Changed readiness check for socket notifications.

With both above changes it is possible to achieve more than 3380 req/second compared to 2200,
sometimes 2500 req/second for epoll() for trivial web-server and httperf client on the same
hardware.
It is possible that above kevent limit is due to maximum allowed kevents in a time limit, which is
4096 events.

Changes from 'take16' patchset:
 * misc cleanups 

[take26 1/8] kevent: Description.

2006-11-30 Thread Evgeniy Polyakov

Description.


diff --git a/Documentation/kevent.txt b/Documentation/kevent.txt
new file mode 100644
index 000..2e03a3f
--- /dev/null
+++ b/Documentation/kevent.txt
@@ -0,0 +1,240 @@
+Description.
+
+int kevent_init(struct kevent_ring *ring, unsigned int ring_size, 
+   unsigned int flags);
+
+ring_size - size of the ring buffer in events 
+ring - pointer to allocated ring buffer
+flags - various flags, see KEVENT_FLAGS_* definitions.
+
+Return value: kevent control file descriptor or negative error value.
+
+ struct kevent_ring
+ {
+   unsigned int ring_kidx, ring_over;
+   struct ukevent event[0];
+ }
+
+ring_kidx - index in the ring buffer where kernel will put new events 
+   when kevent_wait() or kevent_get_events() is called 
+ring_over - number of overflows of ring_uidx that have happened since the start.
+   Overflow counter is used to prevent the situation when two threads 
+   are going to free the same events, but one of them was scheduled 
+   away for too long, so ring indexes were wrapped, so when that 
+   thread is awakened, it will free events other than those 
+   it is supposed to free.
+
+Example userspace code (ring_buffer.c) can be found on project's homepage.
+
+Each kevent syscall can be a so-called cancellation point in glibc, i.e. when a 
+thread has been cancelled in a kevent syscall, the thread can be safely removed 
+and no events will be lost, since each syscall (kevent_wait() or 
+kevent_get_events()) will copy event into special ring buffer, accessible 
+from other threads or even processes (if shared memory is used).
+
+When kevent is removed (not dequeued when it is ready, but just removed), 
+even if it was ready, it is not copied into ring buffer, since if it is 
+removed, no one cares about it (otherwise user would wait until it becomes 
+ready and get it the usual way using kevent_get_events() or kevent_wait()) 
+and thus there is no need to copy it to the ring buffer.
+
+---
+
+
+int kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent *arg);
+
+fd - is the file descriptor referring to the kevent queue to manipulate. 
+It is created by opening /dev/kevent char device, which is created with 
+dynamic minor number and major number assigned for misc devices. 
+
+cmd - is the requested operation. It can be one of the following:
+KEVENT_CTL_ADD - add event notification 
+KEVENT_CTL_REMOVE - remove event notification 
+KEVENT_CTL_MODIFY - modify existing notification 
+KEVENT_CTL_READY - mark existing events as ready, if number of events is zero,
+   it just wakes up a thread parked in the syscall
+
+num - number of struct ukevent in the array pointed to by arg 
+arg - array of struct ukevent
+
+Return value: 
+ number of events processed or negative error value.
+
+When called, kevent_ctl will carry out the operation specified in the 
+cmd parameter.
+---
+
+ int kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr, 
+   struct timespec timeout, struct ukevent *buf, unsigned flags);
+
+ctl_fd - file descriptor referring to the kevent queue 
+min_nr - minimum number of completed events that kevent_get_events will block 
+waiting for 
+max_nr - number of struct ukevent in buf 
+timeout - time to wait before returning less than min_nr 
+ events. If this is -1, then wait forever. 
+buf - pointer to an array of struct ukevent. 
+flags - various flags, see KEVENT_FLAGS_* definitions.
+
+Return value:
+ number of events copied or negative error value.
+
+kevent_get_events will wait timeout milliseconds for at least min_nr completed 
+events, copying completed struct ukevents to buf and deleting any 
+KEVENT_REQ_ONESHOT event requests. In nonblocking mode it returns as many 
+events as possible, but not more than max_nr. In blocking mode it waits until 
+the timeout expires or at least min_nr events are ready.
+
+This function copies events into the ring buffer if it was initialized; if the ring buffer
+is full, the KEVENT_RET_COPY_FAILED flag is set in the ret_flags field.
+---
+
+ int kevent_wait(int ctl_fd, unsigned int num, unsigned int old_uidx, 
+   struct timespec timeout, unsigned int flags);
+
+ctl_fd - file descriptor referring to the kevent queue 
+num - number of processed kevents 
+old_uidx - the last index user is aware of
+timeout - time to wait until there is free space in kevent queue
+flags - various flags, see KEVENT_FLAGS_* definitions.
+
+Return value:
+ number of events copied into ring buffer or negative error value.
+
+This syscall waits until either timeout expires or at least one event becomes 
+ready. It also copies events into special ring buffer. If ring buffer is full,
+it waits until there are ready events and then returns.
+If kevent is one-shot kevent it is 
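
The role of the ring_over counter described in the documentation above can be
sketched in userspace C. This is only an illustration under stated
assumptions, not the kevent implementation: struct demo_ring, struct
demo_snapshot, demo_observe() and demo_commit() are hypothetical names, and
only the idea of pairing the user index with a wrap counter comes from the
text.

```c
#include <assert.h>

/* Hypothetical ring state; only the wrap-counter idea comes from the text. */
struct demo_ring {
	unsigned int size;	/* number of slots in the ring */
	unsigned int uidx;	/* next slot the consumer will free */
	unsigned int over;	/* how many times uidx has wrapped */
};

/* What a thread remembers before it decides which events to free. */
struct demo_snapshot {
	unsigned int uidx;
	unsigned int over;
};

static struct demo_snapshot demo_observe(const struct demo_ring *r)
{
	struct demo_snapshot s = { r->uidx, r->over };
	return s;
}

/*
 * Free num events, but only if the ring still looks the way the caller
 * saw it. Returns the number of freed events, or -1 for a stale snapshot.
 */
static int demo_commit(struct demo_ring *r, struct demo_snapshot seen,
		unsigned int num)
{
	assert(r->size > 0 && num <= r->size);

	/* The index alone is not enough: after a full wrap it repeats. */
	if (seen.uidx != r->uidx || seen.over != r->over)
		return -1;

	r->uidx += num;
	if (r->uidx >= r->size) {
		r->uidx -= r->size;
		r->over++;
	}
	return (int)num;
}
```

If a second thread frees a whole ring's worth of events in between, uidx
returns to the value the first thread saw, and only the changed wrap counter
reveals that its snapshot is stale.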

[take26 3/8] kevent: poll/select() notifications.

2006-11-30 Thread Evgeniy Polyakov

poll/select() notifications.

This patch includes generic poll/select notifications.
kevent_poll works similar to epoll and has the same issues (callback
is invoked not from internal state machine of the caller, but through
process awake, a lot of allocations and so on).

Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]

diff --git a/fs/file_table.c b/fs/file_table.c
index bc35a40..0805547 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -20,6 +20,7 @@
 #include <linux/cdev.h>
 #include <linux/fsnotify.h>
 #include <linux/sysctl.h>
+#include <linux/kevent.h>
 #include <linux/percpu_counter.h>
 
 #include <asm/atomic.h>
@@ -119,6 +120,7 @@ struct file *get_empty_filp(void)
 	f->f_uid = tsk->fsuid;
 	f->f_gid = tsk->fsgid;
 	eventpoll_init_file(f);
+	kevent_init_file(f);
 	/* f->f_version: 0 */
 	return f;
 
@@ -164,6 +166,7 @@ void fastcall __fput(struct file *file)
 	 * in the file cleanup chain.
 	 */
 	eventpoll_release(file);
+	kevent_cleanup_file(file);
 	locks_remove_flock(file);
 
 	if (file->f_op && file->f_op->release)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5baf3a1..8bbf3a5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -276,6 +276,7 @@ extern int dir_notify_enable;
 #include <linux/init.h>
 #include <linux/sched.h>
 #include <linux/mutex.h>
+#include <linux/kevent_storage.h>
 
 #include <asm/atomic.h>
 #include <asm/semaphore.h>
@@ -586,6 +587,10 @@ struct inode {
 	struct mutex		inotify_mutex;	/* protects the watches list */
 #endif
 
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+	struct kevent_storage	st;
+#endif
+
 	unsigned long		i_state;
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
 
@@ -739,6 +744,9 @@ struct file {
 	struct list_head	f_ep_links;
 	spinlock_t		f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+	struct kevent_storage	st;
+#endif
 	struct address_space	*f_mapping;
 };
 extern spinlock_t files_lock;
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 000..11dbe25
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,232 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/kevent.h>
+#include <linux/poll.h>
+#include <linux/fs.h>
+
+static kmem_cache_t *kevent_poll_container_cache;
+static kmem_cache_t *kevent_poll_priv_cache;
+
+struct kevent_poll_ctl
+{
+	struct poll_table_struct	pt;
+	struct kevent			*k;
+};
+
+struct kevent_poll_wait_container
+{
+	struct list_head		container_entry;
+	wait_queue_head_t		*whead;
+	wait_queue_t			wait;
+	struct kevent			*k;
+};
+
+struct kevent_poll_private
+{
+	struct list_head		container_list;
+	spinlock_t			container_lock;
+};
+
+static int kevent_poll_enqueue(struct kevent *k);
+static int kevent_poll_dequeue(struct kevent *k);
+static int kevent_poll_callback(struct kevent *k);
+
+static int kevent_poll_wait_callback(wait_queue_t *wait,
+		unsigned mode, int sync, void *key)
+{
+	struct kevent_poll_wait_container *cont =
+		container_of(wait, struct kevent_poll_wait_container, wait);
+	struct kevent *k = cont->k;
+
+	kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL);
+	return 0;
+}
+
+static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead,
+		struct poll_table_struct *poll_table)
+{
+	struct kevent *k =
+		container_of(poll_table, struct kevent_poll_ctl, pt)->k;
+	struct kevent_poll_private *priv = k->priv;
+	struct kevent_poll_wait_container *cont;
+	unsigned long flags;
+
+	cont = kmem_cache_alloc(kevent_poll_container_cache, GFP_KERNEL);
+	if (!cont) {
+		kevent_break(k);
+		return;
+	}
+
+	cont->k = k;
+	init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback);
+	cont->whead = whead;
+
+	spin_lock_irqsave(&priv->container_lock, flags);
+   

Re: e100 breakage located

2006-11-30 Thread Jesse Brandeburg

sorry for the delay, your mail got marked as spam.  In the future
please copy networking issues to netdev@vger.kernel.org, and be sure
to copy the maintainers of the driver you're having problems with
(they are in the MAINTAINERS file)

On 11/22/06, Amin Azez [EMAIL PROTECTED] wrote:

I notice a patch in 2005 from Micahel O'Donnel to the e100.c driver has
stopped auto-crossover working on some e100 devices we use.

On one system the auto-negotiation was restored by commenting out:
(nic->mac == mac_82551_10) in function e100_phy_init where the MDI/MDI-X
is disabled.


are you sure that patch did that?  What version of e100 are you using?
we've since enabled MDI-X on most parts with this patch:
http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=60ffa478759f39a2eb3be1ed179bc3764804b2c8;hp=09e590e5d5a93f2eaa748a89c623258e6bad1648

Please try the latest kernel or the latest e100 available from e1000.sf.net.
If that doesn't work, we'll need to know what kernel you are using.


lspci reports:
 01:04.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100]
(rev 10)
 01:04.0 Class 0200: 8086:1229 (rev 10)

and on another device
 01:05.0 Ethernet controller: Intel Corp. 82559ER (rev 10)
 01:01.0 Class 0200: 8086:1209 (rev 10)

So it is true that we are revision 10, but 82557/9 not 82551.


you're getting confused between decimal and hex.  82551 is rev 16 (0x10)
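
The decimal/hex mix-up can be checked with a few lines of C. This is a
minimal sketch, assuming that lspci prints the revision in hexadecimal and
that mac_82551_10 has the value 16 as stated above; MAC_82551_10,
lspci_rev_to_int() and is_rev_82551_10() are hypothetical names used only
for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed value of mac_82551_10, taken from "82551 is rev 16 (0x10)". */
#define MAC_82551_10 16

/*
 * lspci prints the PCI revision ID in hexadecimal without a 0x prefix,
 * so the text "10" has to be parsed in base 16.
 */
static unsigned int lspci_rev_to_int(const char *rev_text)
{
	assert(rev_text != NULL);
	return (unsigned int)strtoul(rev_text, NULL, 16);
}

/* Returns 1 when the revision printed by lspci matches MAC_82551_10. */
static int is_rev_82551_10(const char *rev_text)
{
	return lspci_rev_to_int(rev_text) == MAC_82551_10;
}
```

With this reading, "rev 10" in the lspci output is revision 16, which is why
both devices listed above take the mac_82551_10 branch.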


I must confess that having gotten this far, I am lost. Of course I can
fix the driver for our hardware but I am not sure how to contrive a
general fix.

Maybe the actual damage is done in
 e100_get_defaults(struct nic *nic)
where nic->mac is set to nic->rev_id ?

But it generally seems to be a failure to take into account the actual
hardware type, and only consider the revision.


the only relevant way to tell e100 parts apart is the revision id
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP

2006-11-30 Thread David Miller
From: Wenji Wu [EMAIL PROTECTED]
Date: Thu, 30 Nov 2006 10:08:22 -0600

 If the higher priority processes become runnable (e.g., interactive
 process), you better yield the CPU, instead of continuing this process. If
 it is the case that the process within tcp_recvmsg() is expiring, then, you
 can continue the process to go ahead to process backlog.

Yes, I understand this, and I made that point in one of my
replies to Ingo Molnar last night.

The only seemingly remaining possibility is to find a way to allow
input packet processing, at least enough to emit ACKs, during
tcp_recvmsg() processing.



Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP

2006-11-30 Thread David Miller
From: Evgeniy Polyakov [EMAIL PROTECTED]
Date: Thu, 30 Nov 2006 13:22:06 +0300

 It steals timeslices from other processes to complete tcp_recvmsg()
 task, and only when it does it for too long, it will be preempted.
 Processing backlog queue on behalf of need_resched() will break
 fairness too - processing itself can take a lot of time, so process
 can be scheduled away in that part too.

Yes, at this point I agree with this analysis.

Currently I am therefore advocating some way to allow
full input packet handling even amidst tcp_recvmsg()
processing.


Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP

2006-11-30 Thread David Miller
From: Ingo Molnar [EMAIL PROTECTED]
Date: Thu, 30 Nov 2006 11:32:40 +0100

 Note that even without the change the TCP receiving task is already 
 getting a disproportionate share of cycles due to softirq processing! 
 Under a load of 10.0 it went from 500 mbits to 74 mbits, while the 
 'fair' share would be 50 mbits. So the TCP receiver /already/ has an 
 unfair advantage. The patch only deepens that unfairness.

I want to point out something which is slightly misleading about this
kind of analysis.

Your disk I/O speed doesn't go down by a factor of 10 just because 9
other non disk I/O tasks are running, yet for TCP that's seemingly OK
:-)

Not looking at input TCP packets enough to send out the ACKs is the
same as forgetting to queue some I/O requests that can go to the
controller right now.

That's the problem, TCP performance is intimately tied to ACK
feedback.  So we should find a way to make sure ACK feedback goes
out, in preference to other tcp_recvmsg() processing.

What really should pace the TCP sender in this kind of situation is
the advertised window, not the lack of ACKs.  Lack of an ACK means the
packet didn't get there, which is the wrong signal in this kind of
situation, whereas a closing window means application can't keep
up with the data rate, hold on... and is the proper flow control
signal in this high load scenario.

If you don't send ACKs, packets are retransmitted when there is no
reason for it, and that borders on illegal. :-)


Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP

2006-11-30 Thread Ingo Molnar

* Wenji Wu [EMAIL PROTECTED] wrote:

 The solution is really simple and needs no kernel change at all: if 
 you want the TCP receiver to get a larger share of timeslices then 
 either renice it to -20 or renice the other tasks to +19.
 
 Simply giving a larger share of timeslices to the TCP receiver won't 
 solve the problem.  No matter what the timeslice is, if the TCP 
 receiving process has packets within backlog, and the process is 
 expired and moved to the expired array, RTO might happen in the TCP 
 sender.

if you still have the test-setup, could you nevertheless try setting the 
priority of the receiving TCP task to nice -20 and see what kind of 
performance you get?

Ingo


Re: 2.6.19-rc6-mm2: uli526x only works after reload

2006-11-30 Thread Rafael J. Wysocki
On Thursday, 30 November 2006 02:04, Rafael J. Wysocki wrote:
 On Thursday, 30 November 2006 00:26, Andrew Morton wrote:
  On Thu, 30 Nov 2006 00:08:21 +0100
  Rafael J. Wysocki [EMAIL PROTECTED] wrote:
  
   On Wednesday, 29 November 2006 22:31, Rafael J. Wysocki wrote:
On Wednesday, 29 November 2006 22:30, Andrew Morton wrote:
 On Wed, 29 Nov 2006 21:08:00 +0100
 Rafael J. Wysocki [EMAIL PROTECTED] wrote:
 
  On Wednesday, 29 November 2006 20:54, Rafael J. Wysocki wrote:
   On Tuesday, 28 November 2006 11:02, Andrew Morton wrote:

Temporarily at

http://userweb.kernel.org/~akpm/2.6.19-rc6-mm2/

Will appear eventually at

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.19-rc6/2.6.19-rc6-mm2/
   
   A minor issue: on one of my (x86-64) test boxes the uli526x 
   driver doesn't
   work when it's first loaded.  I have to rmmod and modprobe it to 
   make it work.
 
 That isn't a minor issue.
 
   It worked just fine on -mm1, so something must have happened to 
   it recently.
  
  Sorry, I was wrong.  The driver doesn't work at all, even after 
  reload.
  
 
 tulip-dmfe-carrier-detection-fix.patch was added in rc6-mm2.  But you're
 not using that (correct?)
 
 git-netdev-all changes drivers/net/tulip/de2104x.c, but you're not 
 using
 that either.
 
 git-powerpc(!) alters drivers/net/tulip/de4x5.c, but you're not using 
 that.
 
 Beats me, sorry.  Perhaps it's due to changes in networking core.  
 It's
 presumably a showstopper for statically-linked-uli526x users.  If you 
 could
 bisect it, please?  I'd start with git-netdev-all, then tulip-*.

OK, but it'll take some time.
   
   OK, done.
   
   It's one of these (the first one alone doesn't compile):
   
   git-netdev-all.patch
   git-netdev-all-fixup.patch
   libphy-dont-do-that.patch

Hm, all of these patches are the same as in -mm1 which hasn't caused any
problems to appear on this box.

So, it seems there's another change between -mm1 and -mm2 that causes this
to happen.

Greetings,
Rafael


Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP

2006-11-30 Thread Ingo Molnar

* David Miller [EMAIL PROTECTED] wrote:

 I want to point out something which is slightly misleading about this 
 kind of analysis.
 
 Your disk I/O speed doesn't go down by a factor of 10 just because 9 
 other non disk I/O tasks are running, yet for TCP that's seemingly OK
 :-)

disk I/O is typically not CPU bound, and i believe these TCP tests /are/ 
CPU-bound. Otherwise there would be no expiry of the timeslice to begin 
with and the TCP receiver task would always be boosted to 'interactive' 
status by the scheduler and would happily chug along at 500 mbits ...

(and i grant you, if a disk IO test is 20% CPU bound in process context 
and system load is 10, then the scheduler will throttle that task quite 
effectively.)

Ingo


Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP

2006-11-30 Thread David Miller
From: Ingo Molnar [EMAIL PROTECTED]
Date: Thu, 30 Nov 2006 21:30:26 +0100

 disk I/O is typically not CPU bound, and i believe these TCP tests /are/ 
 CPU-bound. Otherwise there would be no expiry of the timeslice to begin 
 with and the TCP receiver task would always be boosted to 'interactive' 
 status by the scheduler and would happily chug along at 500 mbits ...

It's about the prioritization of the work.

If all disk I/O were shut off and frozen while we copy file
data into userspace, you'd see the same problem for disk I/O.


RE: [patch 1/4] - Potential performance bottleneck for Linxu TCP

2006-11-30 Thread Wenji Wu
 It steals timeslices from other processes to complete tcp_recvmsg()
 task, and only when it does it for too long, it will be preempted.
 Processing backlog queue on behalf of need_resched() will break
 fairness too - processing itself can take a lot of time, so process
 can be scheduled away in that part too.

It does steal timeslices from other processes to complete tcp_recvmsg()
task. But I do not think it will take long. When processing backlog, the
processed packets will go to the receive buffer, and the TCP flow control
will take effect to slow down the sender.


The data receiving process might be preempted by higher priority processes.
As long as the data receiving process stays in the active array, the problem
is not that bad because the process might resume its execution soon. The worst
case is that it expires and is moved to the expired array with packets within
the backlog queue.


wenji




Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP

2006-11-30 Thread Ingo Molnar

* David Miller [EMAIL PROTECTED] wrote:

  disk I/O is typically not CPU bound, and i believe these TCP tests 
  /are/ CPU-bound. Otherwise there would be no expiry of the timeslice 
  to begin with and the TCP receiver task would always be boosted to 
  'interactive' status by the scheduler and would happily chug along 
  at 500 mbits ...
 
 It's about the prioritization of the work.
 
 If all disk I/O were shut off and frozen while we copy file data into 
 userspace, you'd see the same problem for disk I/O.

well, it's an issue of how much processing is done in non-prioritized 
contexts. TCP is a bit more sensitive to process context being throttled 
- but disk I/O is not immune either: if nothing submits new IO, or if 
the task does short reads+writes then any process level throttling 
immediately shows up in IO throughput.

but in the general sense it is /unfair/ that certain processing such as 
disk and network IO can get a disproportionate amount of CPU time from 
the system - just because they happen to have some of their processing 
in IRQ and softirq context (which is essentially prioritized to 
SCHED_FIFO 100). A system can easily spend 80% CPU time in softirq 
context. (and that is easily visible in something like an -rt kernel 
where various softirq contexts are separate threads and you can see 30% 
net-rx and 20% net-tx CPU utilization in 'top'). How is this kind of 
processing different from purely process-context based subsystems?

so i agree with you that by tweaking the TCP stack to be less sensitive 
to process throttling you /will/ improve the relative performance of the 
TCP receiver task - but in general system design and scheduler design 
terms it's not a win.

i'd also agree with the notion that the current 'throttling' of process 
contexts can be abrupt and uncooperative, and hence the TCP stack could 
get more out of the same amount of CPU time if it used it in a smarter 
way. As i pointed it out in the first mail i'd support the TCP stack 
getting the ability to query how much timeslices it has - or even the 
scheduler notifying the TCP stack via some downcall if 
current-timeslice reaches 1 (or something like that).

So i dont support the scheme proposed here, the blatant bending of the 
priority scale towards the TCP workload. Instead what i'd like to see is 
more TCP performance (and a nicer over-the-wire behavior - no 
retransmits for example) /with the same 10% CPU time used/. Are we in
rough agreement?

Ingo


RE: [patch 1/4] - Potential performance bottleneck for Linxu TCP

2006-11-30 Thread Wenji Wu

if you still have the test-setup, could you nevertheless try setting the
priority of the receiving TCP task to nice -20 and see what kind of
performance you get?

A process with nice of -20 can easily get the interactivity status. When it
expires, it still goes back to the active array. It just hides the TCP problem,
instead of solving it.

For a process with nice value of -20, it will have the following advantages
over other processes:
(1) its timeslice is 800ms, the timeslice of a process with a nice value of
0 is 100ms
(2) it has higher priority than other processes
(3) it is easier to gain the interactivity status.
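
The 800ms and 100ms figures in (1) can be reproduced with a small userspace
sketch of the timeslice scaling. The constants and the scaling formula below
are recalled from the 2.6-era O(1) scheduler (kernel/sched.c), not taken from
this thread, so treat them as assumptions; the kernel computes in jiffies,
while this sketch uses milliseconds, and nice_to_prio(), scale_prio() and
timeslice_ms() are illustrative names.

```c
#include <assert.h>

/* Assumed O(1) scheduler constants, expressed in milliseconds. */
#define MAX_RT_PRIO		100
#define MAX_PRIO		(MAX_RT_PRIO + 40)
#define MAX_USER_PRIO		40
#define DEF_TIMESLICE_MS	100
#define MIN_TIMESLICE_MS	5

/* Map a nice value (-20..19) to a static priority (100..139). */
static int nice_to_prio(int nice)
{
	return MAX_RT_PRIO + nice + 20;
}

/* Scale a base timeslice by priority, never going below the minimum. */
static int scale_prio(int base_ms, int prio)
{
	int ms = base_ms * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2);

	return ms > MIN_TIMESLICE_MS ? ms : MIN_TIMESLICE_MS;
}

/* Negative nice values scale from four times the default timeslice. */
static int timeslice_ms(int nice)
{
	int prio = nice_to_prio(nice);

	assert(nice >= -20 && nice <= 19);
	if (prio < nice_to_prio(0))
		return scale_prio(DEF_TIMESLICE_MS * 4, prio);
	return scale_prio(DEF_TIMESLICE_MS, prio);
}
```

Under these assumptions timeslice_ms(-20) is 800 and timeslice_ms(0) is 100,
matching the numbers above, while timeslice_ms(19) falls to the 5ms minimum.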

The chances that the process expires and moves to the expired array with
packets within backlog are much reduced, but it can still happen.


wenji




Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP

2006-11-30 Thread Ingo Molnar

* Ingo Molnar [EMAIL PROTECTED] wrote:

 [...] Instead what i'd like to see is more TCP performance (and a 
 nicer over-the-wire behavior - no retransmits for example) /with the 
 same 10% CPU time used/. Are we in rough agreement?

put in another way: i'd like to see the TCP bytes transferred per CPU 
time spent by the TCP stack ratio to be maximized in a load-independent 
way (part of which is the sender host too: to not cause unnecessary 
retransmits is important as well). In a high-load scenario this means 
that any measure that purely improves TCP throughput by giving it more 
cycles is not a real improvement. So the focus should be on throttling 
intelligently and without causing extra work on the sender side either - 
not on trying to circumvent throttling measures.

Ingo


Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP

2006-11-30 Thread David Miller
From: Ingo Molnar [EMAIL PROTECTED]
Date: Thu, 30 Nov 2006 21:49:08 +0100

 So i dont support the scheme proposed here, the blatant bending of the 
 priority scale towards the TCP workload.

I don't support this scheme either ;-)

That's why my proposal is to find a way to allow input packet
processing even during tcp_recvmsg() work.  It is a solution that
would give the TCP task exactly its time slice, no more, no less,
without the erroneous behavior of sleeping with packets held in the
socket backlog.


Re: 2.6.19-rc6-mm2: uli526x only works after reload

2006-11-30 Thread Andrew Morton
On Thu, 30 Nov 2006 21:21:27 +0100
Rafael J. Wysocki [EMAIL PROTECTED] wrote:

 On Thursday, 30 November 2006 02:04, Rafael J. Wysocki wrote:
  On Thursday, 30 November 2006 00:26, Andrew Morton wrote:
   On Thu, 30 Nov 2006 00:08:21 +0100
   Rafael J. Wysocki [EMAIL PROTECTED] wrote:
   
On Wednesday, 29 November 2006 22:31, Rafael J. Wysocki wrote:
 On Wednesday, 29 November 2006 22:30, Andrew Morton wrote:
  On Wed, 29 Nov 2006 21:08:00 +0100
  Rafael J. Wysocki [EMAIL PROTECTED] wrote:
  
   On Wednesday, 29 November 2006 20:54, Rafael J. Wysocki wrote:
On Tuesday, 28 November 2006 11:02, Andrew Morton wrote:
 
 Temporarily at
 
 http://userweb.kernel.org/~akpm/2.6.19-rc6-mm2/
 
 Will appear eventually at
 
 ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.19-rc6/2.6.19-rc6-mm2/

A minor issue: on one of my (x86-64) test boxes the uli526x 
driver doesn't
work when it's first loaded.  I have to rmmod and modprobe it 
to make it work.
  
  That isn't a minor issue.
  
It worked just fine on -mm1, so something must have happened to 
it recently.
   
   Sorry, I was wrong.  The driver doesn't work at all, even after 
   reload.
   
  
  tulip-dmfe-carrier-detection-fix.patch was added in rc6-mm2.  But you're
  not using that (correct?)
  
  git-netdev-all changes drivers/net/tulip/de2104x.c, but you're not 
  using
  that either.
  
  git-powerpc(!) alters drivers/net/tulip/de4x5.c, but you're not 
  using that.
  
  Beats me, sorry.  Perhaps it's due to changes in networking core.  
  It's
  presumably a showstopper for statically-linked-uli526x users.  If 
  you could
  bisect it, please?  I'd start with git-netdev-all, then tulip-*.
 
 OK, but it'll take some time.

OK, done.

It's one of these (the first one alone doesn't compile):

git-netdev-all.patch
git-netdev-all-fixup.patch
libphy-dont-do-that.patch
 
 Hm, all of these patches are the same as in -mm1 which hasn't caused any
 problems to appear on this box.
 
 So, it seems there's another change between -mm1 and -mm2 that causes this
 to happen.
 

It would be nice to eliminate libphy-dont-do-that.patch if poss - that was
a rogue akpm patch aimed at some incomprehensible gobbledygook in the
netdev tree (and to fix the current_is_keventd-not-exported-to-modules
bug).

I have a feeling that your bug will be cheerily merged into mainline soon. 
That might of course mean that someone will hit it more firmly and it'll
get fixed.



[PATCH 1/1] additional change to ipsec audit

2006-11-30 Thread Joy Latten
Sorry! Sign off included this time. 

This patch disables auditing in ipsec when CONFIG_AUDITSYSCALL is
disabled in the kernel. 

This patch also includes a bug fix for xfrm_state.c as a result of
original ipsec audit patch.

regards,
Joy

Signed-off-by: Joy Latten [EMAIL PROTECTED]

---
diff -urpN linux-2.6.18-patch/include/net/xfrm.h linux-2.6.18-patch.2/include/net/xfrm.h
--- linux-2.6.18-patch/include/net/xfrm.h   2006-11-27 12:29:11.0 -0600
+++ linux-2.6.18-patch.2/include/net/xfrm.h 2006-11-28 13:26:49.0 -0600
@@ -395,8 +395,13 @@ struct xfrm_audit
uid_t   loginuid;
u32 secid;
 };
-void xfrm_audit_log(uid_t auid, u32 secid, int type, int result,
+
+#ifdef CONFIG_AUDITSYSCALL
+extern void xfrm_audit_log(uid_t auid, u32 secid, int type, int result,
struct xfrm_policy *xp, struct xfrm_state *x);
+#else
+#define xfrm_audit_log(a,s,t,r,p,x) do { ; } while (0)
+#endif /* CONFIG_AUDITSYSCALL */
 
 static inline void xfrm_pol_hold(struct xfrm_policy *policy)
 {
diff -urpN linux-2.6.18-patch/net/xfrm/xfrm_policy.c linux-2.6.18-patch.2/net/xfrm/xfrm_policy.c
--- linux-2.6.18-patch/net/xfrm/xfrm_policy.c   2006-11-27 12:29:33.0 -0600
+++ linux-2.6.18-patch.2/net/xfrm/xfrm_policy.c 2006-11-28 14:51:09.0 -0600
@@ -1955,6 +1955,7 @@ int xfrm_bundle_ok(struct xfrm_policy *p
 
 EXPORT_SYMBOL(xfrm_bundle_ok);
 
+#ifdef CONFIG_AUDITSYSCALL
 /* Audit addition and deletion of SAs and ipsec policy */
 
 void xfrm_audit_log(uid_t auid, u32 sid, int type, int result,
@@ -2063,6 +2064,7 @@ void xfrm_audit_log(uid_t auid, u32 sid,
 }
 
 EXPORT_SYMBOL(xfrm_audit_log);
+#endif /* CONFIG_AUDITSYSCALL */
 
 int xfrm_policy_register_afinfo(struct xfrm_policy_afinfo *afinfo)
 {
diff -urpN linux-2.6.18-patch/net/xfrm/xfrm_state.c linux-2.6.18-patch.2/net/xfrm/xfrm_state.c
--- linux-2.6.18-patch/net/xfrm/xfrm_state.c    2006-11-27 12:29:33.0 -0600
+++ linux-2.6.18-patch.2/net/xfrm/xfrm_state.c  2006-11-28 12:58:56.0 -0600
@@ -407,7 +407,6 @@ restart:
 	xfrm_state_hold(x);
 	spin_unlock_bh(&xfrm_state_lock);
 
-	xfrm_state_delete(x);
 	err = xfrm_state_delete(x);
 	xfrm_audit_log(audit_info->loginuid,
 		       audit_info->secid,
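
The way the patch compiles the audit call away when CONFIG_AUDITSYSCALL is
not set can be illustrated with a standalone sketch. All names here
(DEMO_AUDIT, demo_audit_log(), demo_audit_calls, demo_delete_state()) are
hypothetical stand-ins and not part of the xfrm code; only the shape of the
empty do/while stub follows the patch.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for CONFIG_AUDITSYSCALL. Leave it undefined to
 * build the stub variant, define it to build the real function.
 */
/* #define DEMO_AUDIT */

/* Counts how many audit records the real function would have written. */
static int demo_audit_calls;

#ifdef DEMO_AUDIT
static void demo_audit_log(int type, int result)
{
	(void)type;
	(void)result;
	demo_audit_calls++;
}
#else
/*
 * Same shape as the stub in the patch: the call expands to an empty
 * statement, so its arguments are never evaluated.
 */
#define demo_audit_log(t, r) do { ; } while (0)
#endif

/* Caller code stays identical in both configurations. */
static int demo_delete_state(int err)
{
	assert(err <= 0);
	demo_audit_log(1, err ? 0 : 1);
	return err;
}
```

A static inline function with an empty body is a common alternative to the
macro stub; it keeps type checking of the arguments in both configurations,
at the cost of still evaluating them.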
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
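
For readers following the thread, the compile-out pattern used in the xfrm.h hunk above can be tried in user space. This is only a minimal sketch: DEMO_AUDIT is a hypothetical stand-in for CONFIG_AUDITSYSCALL, and demo_audit_log() and demo_state_delete() are made-up names, not kernel functions.

```c
#include <stdio.h>

/*
 * DEMO_AUDIT is a hypothetical stand-in for CONFIG_AUDITSYSCALL;
 * demo_audit_log() and demo_state_delete() are illustrative names,
 * not kernel functions.
 */
#ifdef DEMO_AUDIT
static void demo_audit_log(int type, int result)
{
	printf("audit: type=%d result=%d\n", type, result);
}
#else
/* Compiled out: the call site stays valid but expands to an empty statement. */
#define demo_audit_log(t, r) do { ; } while (0)
#endif

/* Delete a (fake) state entry exactly once, then report the outcome. */
static int demo_state_delete(int id)
{
	int err = (id < 0) ? -1 : 0;	/* pretend negative ids do not exist */

	demo_audit_log(1, err ? 0 : 1);
	return err;
}
```

Built without -DDEMO_AUDIT, the call in demo_state_delete() disappears entirely. One thing to keep in mind with a macro stub like this is that its arguments are never evaluated or type-checked in the disabled case; an empty static inline function would be an alternative if that matters.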


Re: [PATCH 1/1] additional ipsec audit patch

2006-11-30 Thread Joy Latten
On Wed, 2006-11-29 at 19:32 -0500, James Morris wrote:
 On Wed, 29 Nov 2006, James Morris wrote:
 
  On Wed, 29 Nov 2006, Joy Latten wrote:
  
   This patch disables auditing in ipsec when CONFIG_AUDITSYSCALL is
   disabled in the kernel. 
   
   This patch also includes a bug fix for xfrm_state.c as a result of
   original ipsec audit patch.
   
   Let me know if it looks ok.
  
  
  Also, the last patch contains no Signed-off-by: line, please resend.
 
 And, what is the testing status of these patches?
 
Last week I ran an overnight stress test using labeled ipsec with racoon
on a patched lspp55 kernel.

The additional fix to xfrm_state.c corrects a mistake I made when rebasing
to 2.6.19-rc6 to send upstream. I plan to run ipv4 and ipv6 stress tests
tonight and tomorrow using labeled ipsec with auditing enabled on the
lspp56 kernel, which contains the ipsec audit patch, to ensure no regression
has occurred. I can also run ipv4 and ipv6 stress tests with regular ipsec
over the weekend for further assurance.

I compiled and ran unit tests with SELINUX disabled, with AUDITSYSCALL
disabled, and with both enabled.

regards,
Joy


Re: [PATCH 1/1] additional ipsec audit patch

2006-11-30 Thread James Morris
On Thu, 30 Nov 2006, Joy Latten wrote:

 I ran a stress test overnight using labeled ipsec on a patched lspp55 kernel 
 using racoon last week.
 
 The additional patch to xfrm_state.c was my fault when rebasing to
 2.6.19-rc6 to send upstream. I plan to run an ipv4 and ipv6 stress test
 tonight and tomorrow using labeled ipsec with auditing enabled on the
 lspp56 kernel, which contains ipsec audit patch, to ensure no regression
 has occurred. I can also run ipv4 and ipv6 stress tests
 with regular ipsec over the weekend for further assurance.
 
 I compiled and did unit test with SELINUX disabled, AUDITSYSCALL
 disabled, and with both enabled. 

Thanks, applied to 

git://git.infradead.org/~jmorris/selinux-net-2.6.20#for-akpm

might be worth having it in -mm for a bit.




-- 
James Morris
[EMAIL PROTECTED]


Re: [PATCH 2/5] NetXen: temp monitoring, newer firmware support, mm footprint reduction

2006-11-30 Thread Francois Romieu
Don Fry [EMAIL PROTECTED] :
 NetXen: 1G/10G Ethernet Driver updates
   - Temperature monitoring and device control
   - Memory footprint reduction
   - Driver changes to support newer version of firmware
 
 Signed-off-by: Amit S. Kale [EMAIL PROTECTED]
 Signed-off-by: Don Fry [EMAIL PROTECTED]
 
 diff -Nupr netdev-2.6/drivers/net/netxen.one/netxen_nic_ethtool.c netdev-2.6/drivers/net/netxen/netxen_nic_ethtool.c
 --- netdev-2.6/drivers/net/netxen.one/netxen_nic_ethtool.c    2006-11-30 09:16:23.0 -0800
 +++ netdev-2.6/drivers/net/netxen/netxen_nic_ethtool.c    2006-11-30 09:22:41.0 -0800
 @@ -53,6 +53,9 @@ struct netxen_nic_stats {
  #define NETXEN_NIC_STAT(m) sizeof(((struct netxen_port *)0)->m), \
   offsetof(struct netxen_port, m)
  
 +#define NETXEN_NIC_PORT_WINDOW 0x1
 +#define NETXEN_NIC_INVALID_DATA 0xDEADBEEF
 +
  static const struct netxen_nic_stats netxen_nic_gstrings_stats[] = {
   {"rcvd_bad_skb", NETXEN_NIC_STAT(stats.rcvdbadskb)},
   {"xmit_called", NETXEN_NIC_STAT(stats.xmitcalled)},
 @@ -111,9 +114,9 @@ netxen_nic_get_drvinfo(struct net_device
  {
   struct netxen_port *port = netdev_priv(dev);
   struct netxen_adapter *adapter = port->adapter;
 - uint32_t fw_major = 0;
 - uint32_t fw_minor = 0;
 - uint32_t fw_build = 0;
 + u32 fw_major = 0;
 + u32 fw_minor = 0;
 + u32 fw_build = 0;

The description of the patch did not announce the (welcome) cleanup.

There are a few instances of it.

[...]
   strncpy(drvinfo->driver, "netxen_nic", 32);
   strncpy(drvinfo->version, NETXEN_NIC_LINUX_VERSIONID, 32);
 @@ -136,6 +139,8 @@ netxen_nic_get_settings(struct net_devic
  {
   struct netxen_port *port = netdev_priv(dev);
   struct netxen_adapter *adapter = port->adapter;
 + struct netxen_board_info *boardinfo;
 + boardinfo = &adapter->ahw.boardcfg;

Add a separating blank line after the declarations, or merge the two lines.

[...]
 @@ -182,13 +174,47 @@ netxen_nic_get_settings(struct net_devic
  
   ecmd->speed = SPEED_10000;
   ecmd->duplex = DUPLEX_FULL;
 - ecmd->phy_address = port->portnum;
 - ecmd->transceiver = XCVR_EXTERNAL;
   ecmd->autoneg = AUTONEG_DISABLE;
 - return 0;
 + } else
 + return -EIO;
 +
 + ecmd->phy_address = port->portnum;
 + ecmd->transceiver = XCVR_EXTERNAL;
 +
 + switch ((netxen_brdtype_t) boardinfo->board_type) {
 + case NETXEN_BRDTYPE_P2_SB35_4G:
 + case NETXEN_BRDTYPE_P2_SB31_2G:
 + ecmd->supported |= SUPPORTED_Autoneg;
 + ecmd->advertising |= ADVERTISED_Autoneg;
 + case NETXEN_BRDTYPE_P2_SB31_10G_CX4:
 + ecmd->supported |= SUPPORTED_TP;
 + ecmd->advertising |= ADVERTISED_TP;
 + ecmd->port = PORT_TP;
 + ecmd->autoneg = (boardinfo->board_type ==
 +  NETXEN_BRDTYPE_P2_SB31_10G_CX4) ?
 + (AUTONEG_DISABLE) : (port->link_autoneg);
 + break;
 + case NETXEN_BRDTYPE_P2_SB31_10G_HMEZ:
 + case NETXEN_BRDTYPE_P2_SB31_10G_IMEZ:
 + ecmd->supported |= SUPPORTED_MII;
 + ecmd->advertising |= ADVERTISED_MII;
 + ecmd->port = PORT_FIBRE;
 + ecmd->autoneg = AUTONEG_DISABLE;
 + break;
 + case NETXEN_BRDTYPE_P2_SB31_10G:
 + ecmd->supported |= SUPPORTED_FIBRE;
 + ecmd->advertising |= ADVERTISED_FIBRE;
 + ecmd->port = PORT_FIBRE;
 + ecmd->autoneg = AUTONEG_DISABLE;
 + break;
 + default:
 + printk("ERROR: Unsupported board model %d\n",
 +(netxen_brdtype_t) boardinfo->board_type);

Missing KERN_ERR

[...]
 diff -Nupr netdev-2.6/drivers/net/netxen.one/netxen_nic.h netdev-2.6/drivers/net/netxen/netxen_nic.h
 --- netdev-2.6/drivers/net/netxen.one/netxen_nic.h    2006-11-30 09:16:23.0 -0800
 +++ netdev-2.6/drivers/net/netxen/netxen_nic.h    2006-11-30 09:22:41.0 -0800
[...]
 @@ -328,6 +343,7 @@ typedef enum {
   NETXEN_BRDTYPE_P2_SB31_10G_HMEZ = 0x000e,
   NETXEN_BRDTYPE_P2_SB31_10G_CX4 = 0x000f
  } netxen_brdtype_t;
 +#define NUM_SUPPORTED_BOARDS (sizeof(netxen_boards)/sizeof(netxen_brdinfo_t))
  
  typedef enum {
   NETXEN_BRDMFG_INVENTEC = 1
[...]
 @@ -869,7 +937,10 @@ static inline void netxen_nic_disable_in
   /*
* ISR_INT_MASK: Can be read from window 0 or 1.
*/
 - writel(0x7ff, (void __iomem *)(adapter->ahw.pci_base + ISR_INT_MASK));
 + writel(0x7ff,
 +(void __iomem
 + *)(PCI_OFFSET_SECOND_RANGE(adapter, ISR_INT_MASK)));
 +

Yuck.

[...]
 @@ -888,13 +959,83 @@ static inline void netxen_nic_enable_int
   break;
   }
  
 - writel(mask, (void __iomem *)(adapter->ahw.pci_base + ISR_INT_MASK));
 + writel(mask,
 +(void __iomem
 + *)(PCI_OFFSET_SECOND_RANGE(adapter, ISR_INT_MASK)));
  
   if (!(adapter->flags & 

Re: [PATCH 0/5 addendum] NetXen

2006-11-30 Thread Jeff Garzik

Don Fry wrote:

The NetXen patches fix many problems in the current #upstream version of
the driver.  It has warts and probably lots of bugs still, but it is
better than what is queued for mainline inclusion at this time.  Please
apply to 2.6.20.


Please resync with netdev#upstream, and update for comments on netdev...

Jeff





Re: [PATCH][IPSEC][1/7] inter address family ipsec tunnel

2006-11-30 Thread David Miller
From: Kazunori MIYAZAWA [EMAIL PROTECTED]
Date: Fri, 24 Nov 2006 14:38:07 +0900

 This patch adds encapsulation family.
 
 Signed-off-by: Miika Komu [EMAIL PROTECTED]
 Signed-off-by: Diego Beltrami [EMAIL PROTECTED]
 Signed-off-by: Kazunori Miyazawa [EMAIL PROTECTED]

Applied to net-2.6.20, thanks.


Re: [PATCH][IPSEC][2/7] inter address family ipsec tunnel

2006-11-30 Thread David Miller
From: Kazunori MIYAZAWA [EMAIL PROTECTED]
Date: Fri, 24 Nov 2006 14:38:17 +0900

 This patch adds netlink interface of the family
 
 Signed-off-by: Miika Komu [EMAIL PROTECTED]
 Signed-off-by: Diego Beltrami [EMAIL PROTECTED]
 Signed-off-by: Kazunori Miyazawa [EMAIL PROTECTED]

Applied to net-2.6.20, thanks.


Re: [PATCH][IPSEC][3/7] inter address family ipsec tunnel

2006-11-30 Thread David Miller
From: Kazunori MIYAZAWA [EMAIL PROTECTED]
Date: Thu, 30 Nov 2006 10:54:26 +0900

 Hello,
 
 I found a bug in my previous patch for af_key.
 The patch breaks transport mode.
 This is a fixed version.
 
 Signed-off-by: Miika Komu [EMAIL PROTECTED]
 Signed-off-by: Diego Beltrami [EMAIL PROTECTED]
 Signed-off-by: Kazunori Miyazawa [EMAIL PROTECTED]

Applied to net-2.6.20, thanks a lot.


Re: [PATCH][IPSEC][4/7] inter address family ipsec tunnel

2006-11-30 Thread David Miller
From: Kazunori MIYAZAWA [EMAIL PROTECTED]
Date: Fri, 24 Nov 2006 14:38:39 +0900

What is going on here?

 + /* Without this, the atomic inc below segfaults */
 + if (encap_family == AF_INET6) {
 + rt->peer = NULL;
 + rt_bind_peer(rt,1);
 + }
 ...
 - dst_prev->output= xfrm4_output;
 + if (dst_prev->xfrm->props.family == AF_INET)
 + dst_prev->output = xfrm4_output;
 +#if defined(CONFIG_IPV6) || defined (CONFIG_IPV6_MODULE)
 + else
 + dst_prev->output = xfrm6_output;
 +#endif
   if (rt->peer)
   atomic_inc(&rt->peer->refcnt);

If it's non-NULL and you get a segfault for atomic_inc() that
means there is garbage here, and it seems that if you're
setting it to NULL explicitly then it's just a workaround
for whatever problem is causing it to be non-NULL to begin
with.

What is putting a non-valid pointer value there?  Is this an IPV6 or
IPSEC dst route by chance?  If so, that makes this change really
wrong, and we are corrupting the route by running rt_bind_peer() on
it.  rt_bind_peer() is only valid on ipv4 route entries.


Re: [PATCH][IPSEC][5/7] inter address family ipsec tunnel

2006-11-30 Thread David Miller
From: Kazunori MIYAZAWA [EMAIL PROTECTED]
Date: Fri, 24 Nov 2006 14:38:52 +0900

 +static inline void ip6ip_ecn_decapsulate(struct sk_buff *skb)
 +{
 + if (INET_ECN_is_ce(ipv6_get_dsfield(skb->nh.ipv6h)))
 + IP_ECN_set_ce(skb->h.ipiph);
 +}
 +

Please fix this extra tab indentation :-)

Thank you.



[PATCH] zd1211rw: zd_mac_rx isn't always called in IRQ context

2006-11-30 Thread Daniel Drake
e.g.

usb 1-7: rx_urb_complete() *** first fragment ***
usb 1-7: rx_urb_complete() *** second fragment ***
drivers/net/wireless/zd1211rw/zd_mac.c:1063 ASSERT
(((current_thread_info()->preempt_count) & (((1UL << (12))-1) << ((0 +
8) + 8)))) VIOLATED!
 [<f0299448>] zd_mac_rx+0x3e7/0x47a [zd1211rw]
 [<f029badc>] rx_urb_complete+0x22d/0x24a [zd1211rw]
 [<b028a22f>] urb_destroy+0x0/0x5
 [<b01f0930>] kref_put+0x65/0x72
 [<b0288cdf>] usb_hcd_giveback_urb+0x28/0x57
 [<b02950c4>] qh_completions+0x296/0x2f6
 [<b0294b21>] ehci_urb_done+0x70/0x7a
 [<b0294ea1>] qh_completions+0x73/0x2f6
 [<b02951bc>] ehci_work+0x98/0x538

Remove the bogus assertion, and use dev_kfree_skb_any as pointed out by
Ulrich Kunitz.

Signed-off-by: Daniel Drake [EMAIL PROTECTED]

Index: linux-2.6/drivers/net/wireless/zd1211rw/zd_mac.c
===
--- linux-2.6.orig/drivers/net/wireless/zd1211rw/zd_mac.c
+++ linux-2.6/drivers/net/wireless/zd1211rw/zd_mac.c
@@ -1059,10 +1059,8 @@ int zd_mac_rx(struct zd_mac *mac, const 
memcpy(skb_put(skb, length), buffer, length);
 
r = ieee80211_rx(ieee, skb, stats);
-   if (!r) {
-   ZD_ASSERT(in_irq());
-   dev_kfree_skb_irq(skb);
-   }
+   if (!r)
+   dev_kfree_skb_any(skb);
return 0;
 }
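
To make the failed assertion easier to read: the expanded expression in the trace masks the preempt counter with a 12-bit field shifted up by 8 + 8 bits, and the assertion fired because none of those bits were set when zd_mac_rx() ran. The sketch below recomputes that mask in user space. The field widths are taken from the trace; the DEMO_ names are illustrative and are not the kernel's own macros.

```c
/* Bit-field widths as they appear in the expanded assertion in the trace;
 * the DEMO_ names are illustrative stand-ins for the kernel's macros. */
#define DEMO_PREEMPT_BITS   8
#define DEMO_SOFTIRQ_BITS   8
#define DEMO_HARDIRQ_BITS   12

/* The hardirq field sits above the preempt and softirq fields. */
#define DEMO_HARDIRQ_SHIFT  (DEMO_PREEMPT_BITS + DEMO_SOFTIRQ_BITS)
#define DEMO_HARDIRQ_MASK   (((1UL << DEMO_HARDIRQ_BITS) - 1) << DEMO_HARDIRQ_SHIFT)

/* Nonzero when the counter value has any hardirq bit set. */
static int demo_in_irq(unsigned long preempt_count)
{
	return (preempt_count & DEMO_HARDIRQ_MASK) != 0;
}
```

With these widths the mask works out to 0x0fff0000, so a counter value that only has softirq bits set (bits 8 to 15) does not satisfy the check. That is why a free routine that works in either context is the safer choice here.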
 


[PATCH] zd1211rw: Fill enc_capa in GIWRANGE handler

2006-11-30 Thread Daniel Drake
This is needed for NetworkManager users to connect to WPA networks.
Pointed out by Matthew Campbell.

Signed-off-by: Daniel Drake [EMAIL PROTECTED]
---
 zd_mac.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

Index: linux-2.6/drivers/net/wireless/zd1211rw/zd_mac.c
===
--- linux-2.6.orig/drivers/net/wireless/zd1211rw/zd_mac.c
+++ linux-2.6/drivers/net/wireless/zd1211rw/zd_mac.c
@@ -615,6 +615,9 @@ int zd_mac_get_range(struct zd_mac *mac,
range->we_version_compiled = WIRELESS_EXT;
range->we_version_source = 20;
 
+   range->enc_capa = IW_ENC_CAPA_WPA |  IW_ENC_CAPA_WPA2 |
+ IW_ENC_CAPA_CIPHER_TKIP | IW_ENC_CAPA_CIPHER_CCMP;
+
ZD_ASSERT(!irqs_disabled());
spin_lock_irq(&mac->lock);
regdomain = mac->regdomain;


[PATCH] zd1211rw: Support for multicast addresses

2006-11-30 Thread Daniel Drake
From: Ulrich Kunitz [EMAIL PROTECTED]

Support for multicast addresses is implemented by supporting the
set_multicast_list() function of the network device. Address
filtering is supported by a group hash table in the device.

This is based on earlier work by Benoit Papillaut. Fixes multicast packet
reception and ipv6 connectivity:
http://bugzilla.kernel.org/show_bug.cgi?id=7424
http://bugzilla.kernel.org/show_bug.cgi?id=7425

Signed-off-by: Ulrich Kunitz [EMAIL PROTECTED]
Signed-off-by: Daniel Drake [EMAIL PROTECTED]
---
 zd_chip.c   |   13 +
 zd_chip.h   |   43 ++-
 zd_mac.c|   44 +++-
 zd_mac.h|3 +++
 zd_netdev.c |2 +-
 5 files changed, 102 insertions(+), 3 deletions(-)

Index: linux-2.6/drivers/net/wireless/zd1211rw/zd_chip.c
===
--- linux-2.6.orig/drivers/net/wireless/zd1211rw/zd_chip.c
+++ linux-2.6/drivers/net/wireless/zd1211rw/zd_chip.c
@@ -1673,3 +1673,16 @@ int zd_rfwritev_cr_locked(struct zd_chip
 
return 0;
 }
+
+int zd_chip_set_multicast_hash(struct zd_chip *chip,
+  struct zd_mc_hash *hash)
+{
+   struct zd_ioreq32 ioreqs[] = {
+   { CR_GROUP_HASH_P1, hash->low },
+   { CR_GROUP_HASH_P2, hash->high },
+   };
+
+   dev_dbg_f(zd_chip_dev(chip), "hash l 0x%08x h 0x%08x\n",
+   ioreqs[0].value, ioreqs[1].value);
+   return zd_iowrite32a(chip, ioreqs, ARRAY_SIZE(ioreqs));
+}
Index: linux-2.6/drivers/net/wireless/zd1211rw/zd_chip.h
===
--- linux-2.6.orig/drivers/net/wireless/zd1211rw/zd_chip.h
+++ linux-2.6/drivers/net/wireless/zd1211rw/zd_chip.h
@@ -390,10 +390,19 @@
 #define CR_BSSID_P1    CTL_REG(0x0618)
 #define CR_BSSID_P2    CTL_REG(0x061C)
 #define CR_BCN_PLCP_CFG    CTL_REG(0x0620)
+
+/* Group hash table for filtering incoming packets.
+ *
+ * The group hash table is 64 bit large and split over two parts. The first
+ * part is the lower part. The upper 6 bits of the last byte of the target
+ * address are used as index. Packets are received if the hash table bit is
+ * set. This is used for multicast handling, but for broadcasts (address
+ * ff:ff:ff:ff:ff:ff) the highest bit in the second table must also be set.
+ */
 #define CR_GROUP_HASH_P1   CTL_REG(0x0624)
 #define CR_GROUP_HASH_P2   CTL_REG(0x0628)
-#define CR_RX_TIMEOUT  CTL_REG(0x062C)
 
+#define CR_RX_TIMEOUT  CTL_REG(0x062C)
 /* Basic rates supported by the BSS. When producing ACK or CTS messages, the
  * device will use a rate in this table that is less than or equal to the rate
  * of the incoming frame which prompted the response */
@@ -864,4 +873,36 @@ u8 zd_rx_strength_percent(u8 rssi);
 
 u16 zd_rx_rate(const void *rx_frame, const struct rx_status *status);
 
+struct zd_mc_hash {
+   u32 low;
+   u32 high;
+};
+
+static inline void zd_mc_clear(struct zd_mc_hash *hash)
+{
+   hash->low = 0;
+   /* The interfaces must always receive broadcasts.
+* The hash of the broadcast address ff:ff:ff:ff:ff:ff is 63.
+*/
+   hash->high = 0x80000000;
+}
+
+static inline void zd_mc_add_all(struct zd_mc_hash *hash)
+{
+   hash->low = hash->high = 0xffffffff;
+}
+
+static inline void zd_mc_add_addr(struct zd_mc_hash *hash, u8 *addr)
+{
+   unsigned int i = addr[5] >> 2;
+   if (i < 32) {
+   hash->low |= 1 << i;
+   } else {
+   hash->high |= 1 << (i-32);
+   }
+}
+
+int zd_chip_set_multicast_hash(struct zd_chip *chip,
+  struct zd_mc_hash *hash);
+
 #endif /* _ZD_CHIP_H */
Index: linux-2.6/drivers/net/wireless/zd1211rw/zd_mac.c
===
--- linux-2.6.orig/drivers/net/wireless/zd1211rw/zd_mac.c
+++ linux-2.6/drivers/net/wireless/zd1211rw/zd_mac.c
@@ -39,6 +39,8 @@ static void housekeeping_init(struct zd_
 static void housekeeping_enable(struct zd_mac *mac);
 static void housekeeping_disable(struct zd_mac *mac);
 
+static void set_multicast_hash_handler(void *mac_ptr);
+
 int zd_mac_init(struct zd_mac *mac,
struct net_device *netdev,
struct usb_interface *intf)
@@ -55,6 +57,8 @@ int zd_mac_init(struct zd_mac *mac,
softmac_init(ieee80211_priv(netdev));
zd_chip_init(&mac->chip, netdev, intf);
housekeeping_init(mac);
+   INIT_WORK(&mac->set_multicast_hash_work, set_multicast_hash_handler,
+ mac);
return 0;
 }
 
@@ -136,6 +140,7 @@ out:
 
 void zd_mac_clear(struct zd_mac *mac)
 {
+   flush_workqueue(zd_workqueue);
zd_chip_clear(&mac->chip);
ZD_ASSERT(!spin_is_locked(&mac->lock));
ZD_MEMCLEAR(mac, sizeof(struct zd_mac));
@@ -256,6 +261,42 @@ int zd_mac_set_mac_address(struct 
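
The group hash described in the zd_chip.h comment can be tried out in user space. The following is only a sketch under the assumptions stated in that comment (index is the upper 6 bits of the last address byte, 64 bits split into a low and a high word). All demo_ names are illustrative, demo_mc_accepts() is a helper added here to model the device-side lookup, and the shifts use an unsigned 1 so that setting bit 31 is well defined in portable C.

```c
#include <stdint.h>

/* Illustrative user-space model of the 64-bit group hash table. */
struct demo_mc_hash {
	uint32_t low;	/* hash bits 0..31  */
	uint32_t high;	/* hash bits 32..63 */
};

/* Start from an empty table that still lets broadcasts through (bit 63). */
static void demo_mc_clear(struct demo_mc_hash *hash)
{
	hash->low = 0;
	hash->high = 0x80000000u;
}

/* Index = upper 6 bits of the last byte of the destination address. */
static unsigned int demo_mc_index(const uint8_t *addr)
{
	return addr[5] >> 2;
}

/* Set the table bit that corresponds to one multicast address. */
static void demo_mc_add_addr(struct demo_mc_hash *hash, const uint8_t *addr)
{
	unsigned int i = demo_mc_index(addr);

	if (i < 32)
		hash->low |= (uint32_t)1 << i;
	else
		hash->high |= (uint32_t)1 << (i - 32);
}

/* Hypothetical model of the device-side check: accept if the bit is set. */
static int demo_mc_accepts(const struct demo_mc_hash *hash, const uint8_t *addr)
{
	unsigned int i = demo_mc_index(addr);

	return i < 32 ? (int)((hash->low >> i) & 1)
		      : (int)((hash->high >> (i - 32)) & 1);
}
```

Since only 6 bits of the address select the bucket, any two addresses whose last bytes share their upper 6 bits land on the same bit, so this kind of filter can pass frames for groups that were never added and the stack above it still has to check the full address.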

Re: [PATCH][IPSEC][6/7] inter address family ipsec tunnel

2006-11-30 Thread David Miller
From: Kazunori MIYAZAWA [EMAIL PROTECTED]
Date: Fri, 24 Nov 2006 14:39:01 +0900

 This patch fixes the mtu calculation for IPv4.
 
 ip_append_data should refer to the mtu of dst, not of path.
 If dst is stacked, path is the actual dst_entry in
 the routing table.
 Therefore the mtu of path equals the link mtu, which
 depends on the device, so it ignores the header length
 and the trailer length.
 dst has the mtu to use when creating a packet.
 
 Signed-off-by: Miika Komu [EMAIL PROTECTED]
 Signed-off-by: Diego Beltrami [EMAIL PROTECTED]
 Signed-off-by: Kazunori Miyazawa [EMAIL PROTECTED]

I'm not sure about this change.

If you look at the code in this function, mtu is always used with
adjustments via 'exthdrlen' (which is set to rt->u.dst.header_len).
So it seems the encapsulation is taken into account.

Perhaps any problem you are seeing is some artifact of the ipv6 in
ipv4 tunnel implementation.  Otherwise we'd have other reports of this
problem, wouldn't we?


Re: [PATCH][IPSEC][7/7] inter address family ipsec tunnel

2006-11-30 Thread David Miller
From: Kazunori MIYAZAWA [EMAIL PROTECTED]
Date: Fri, 24 Nov 2006 14:39:17 +0900

 ip6_append_data should refer to the mtu of dst,
 for the same reason as in the previous patch.

My comments on the ipv4 side of this change also apply here.

