Re: [PATCH] enic: Store permanent MAC address during probe()

2017-03-20 Thread PJ Waskiewicz
On Mon, Mar 20, 2017 at 6:49 PM, Govindarajulu Varadarajan
 wrote:
> On Mon, 20 Mar 2017, PJ Waskiewicz wrote:
>
>> On Mon, Mar 20, 2017 at 5:33 PM, Govindarajulu Varadarajan
>>  wrote:
>>>
>>> On Mon, 20 Mar 2017, PJ Waskiewicz wrote:
>>>
>>>> From: PJ Waskiewicz 
>>>>
>>>> The permanent MAC address is useful to store for things like ethtool,
>>>> and when bonding with modes such as active/passive or LACP.
>>>
>>>
>>> Is this patch fixing an issue with the bonding driver on enic?
>>
>>
>> We noticed that running ethtool -P  on an enic, even on 4.9,
>> returned nothing.  This has fallout when using bonding, where LACP or
>> Active/Passive overrides the LAA on one of the slaves, one can't
>> figure out what the physical MAC address is of each slave.  So not a
>> problem with bonding directly, it's more secondary as a result of the
>> driver not reporting the actual permanent address.
>>
>>>
>>>> This follows the model of other Ethernet drivers, such as ixgbe.
>>>>
>>>
>>> While other drivers set netdev->perm_addr, doesn't this actually come
>>> free in register_netdevice().
>>
>>
>> I thought it did as well, but in 4.9 when we tested it wasn't working.
>> Hence the patch.  :-)
>>
>
> Can you try with net-next? In my setup I do not see the issue on net-next
> and on 4.9 kernel. The issue for all drivers was fixed in
> 948b337e62ca9 ("net: init perm_addr in register_netdevice()")

The fix looks like it went in after 4.9 was tagged and released.  4.9
was tagged 12/11/2016, and 948b337e62ca9 was committed 1/8/2017.  That
would explain why I didn't see it in 4.9.

That being said, looks like 4.10 does work as expected without my
patch, so I'm fine carrying the patch internally to our 4.9 tree.  I'm
not sure it's worth sending either this patch or the netdev-level
patch to -stable though, it's a small issue that is already fixed
upstream.

Consider this patch rescinded.

Cheers,
-PJ
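
For readers following the thread, the core behavior being discussed can be sketched in userspace. This is a minimal mock, assuming the behavior named by commit 948b337e62ca9: at registration time, a device whose address was assigned permanently gets dev_addr copied into perm_addr. The names struct mock_netdev, MOCK_ADDR_* and mock_register_netdevice() are hypothetical stand-ins for illustration, not kernel code.

```c
#include <string.h>

#define MOCK_ADDR_LEN 6

/* Illustrative stand-ins for the kernel's address-assignment types. */
enum mock_addr_assign_type { MOCK_ADDR_PERM = 0, MOCK_ADDR_RANDOM = 1 };

struct mock_netdev {
	unsigned char dev_addr[MOCK_ADDR_LEN];   /* current MAC address */
	unsigned char perm_addr[MOCK_ADDR_LEN];  /* what "ethtool -P" would report */
	unsigned char addr_len;
	enum mock_addr_assign_type addr_assign_type;
};

/* Assumed core behavior: only a permanently assigned address is
 * copied into perm_addr; other assignment types leave it untouched. */
static void mock_register_netdevice(struct mock_netdev *dev)
{
	if (dev->addr_assign_type == MOCK_ADDR_PERM)
		memcpy(dev->perm_addr, dev->dev_addr, dev->addr_len);
}
```

With the copy done in the core, a driver such as enic would not need its own perm_addr handling, which matches the conclusion of the thread for 4.10 and later.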


Re: [PATCH] enic: Store permanent MAC address during probe()

2017-03-20 Thread PJ Waskiewicz
On Mon, Mar 20, 2017 at 5:33 PM, Govindarajulu Varadarajan
 wrote:
> On Mon, 20 Mar 2017, PJ Waskiewicz wrote:
>
>> From: PJ Waskiewicz 
>>
>> The permanent MAC address is useful to store for things like ethtool,
>> and when bonding with modes such as active/passive or LACP.
>
>
> Hi Peter,
>
> Is this patch fixing an issue with the bonding driver on enic?

We noticed that running ethtool -P  on an enic, even on 4.9,
returned nothing.  This has fallout when using bonding, where LACP or
Active/Passive overrides the LAA on one of the slaves, one can't
figure out what the physical MAC address is of each slave.  So not a
problem with bonding directly, it's more secondary as a result of the
driver not reporting the actual permanent address.

>
>> This follows the model of other Ethernet drivers, such as ixgbe.
>>
>
> While other drivers set netdev->perm_addr, doesn't this actually come free
> in register_netdevice().

I thought it did as well, but in 4.9 when we tested it wasn't working.
Hence the patch.  :-)

Cheers,
-PJ


[PATCH] enic: Store permanent MAC address during probe()

2017-03-20 Thread PJ Waskiewicz
From: PJ Waskiewicz 

The permanent MAC address is useful to store for things like ethtool,
and when bonding with modes such as active/passive or LACP.  This
follows the model of other Ethernet drivers, such as ixgbe.

This was verified on a C220 chassis with the Cisco VNIC Ethernet device.

Signed-off-by: PJ Waskiewicz 
---
 drivers/net/ethernet/cisco/enic/enic_main.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
index 4b87bee..8bb2114 100644
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -964,6 +964,16 @@ void enic_reset_addr_lists(struct enic *enic)
enic->flags = 0;
 }
 
+static int enic_set_perm_mac_addr(struct net_device *netdev, char *addr)
+{
+   if (!is_valid_ether_addr(addr) && !is_zero_ether_addr(addr))
+   return -EADDRNOTAVAIL;
+
+   memcpy(netdev->perm_addr, addr, netdev->addr_len);
+
+   return 0;
+}
+
 static int enic_set_mac_addr(struct net_device *netdev, char *addr)
 {
struct enic *enic = netdev_priv(netdev);
@@ -2872,6 +2882,14 @@ static int enic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
goto err_out_dev_deinit;
}
 
+   /* Store off permanent MAC address
+*/
+   err = enic_set_perm_mac_addr(netdev, enic->mac_addr);
+   if (err) {
+   dev_err(dev, "Invalid MAC address, aborting\n");
+   goto err_out_dev_deinit;
+   }
+
enic->tx_coalesce_usecs = enic->config.intr_timer_usec;
/* rx coalesce time already got initialized. This gets used
 * if adaptive coal is turned off
-- 
2.10.2
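
The condition in enic_set_perm_mac_addr() above rejects an address only when it is both invalid and non-zero, so an all-zero address is accepted. The userspace sketch below illustrates that truth table. It assumes the usual kernel definition of a valid Ethernet address (neither multicast nor all-zero); the mac_* helpers and perm_mac_rejected() are hypothetical names written for this example.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAC_LEN 6

/* True when every byte of the address is zero. */
static bool mac_is_zero(const unsigned char *a)
{
	for (size_t i = 0; i < MAC_LEN; i++)
		if (a[i] != 0)
			return false;
	return true;
}

/* The low bit of the first byte marks multicast (and broadcast). */
static bool mac_is_multicast(const unsigned char *a)
{
	return (a[0] & 0x01) != 0;
}

/* Assumed definition of "valid": not multicast and not all-zero. */
static bool mac_is_valid(const unsigned char *a)
{
	return !mac_is_multicast(a) && !mac_is_zero(a);
}

/* Same boolean expression as the patch: true means -EADDRNOTAVAIL. */
static bool perm_mac_rejected(const unsigned char *a)
{
	return !mac_is_valid(a) && !mac_is_zero(a);
}
```

Under these assumptions, only multicast and broadcast addresses end up rejected; unicast and all-zero addresses are both stored.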



[PATCH] [IPROUTE2] Update various classifiers' help output for expected CLASSID syntax

2008-02-13 Thread PJ Waskiewicz
update: Fix the spelling of "hexidecimal"

This updates the help output to specify that CLASSID should be hexadecimal.
This makes sure that a user entering "flowid 1:10" gets his flow put into
band 15 (0x10) and knows why.

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 doc/actions/actions-general |3 +++
 tc/f_basic.c|1 +
 tc/f_fw.c   |1 +
 tc/f_route.c|1 +
 tc/f_rsvp.c |1 +
 tc/f_u32.c  |1 +
 6 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/doc/actions/actions-general b/doc/actions/actions-general
index 6561eda..70f7cd6 100644
--- a/doc/actions/actions-general
+++ b/doc/actions/actions-general
@@ -88,6 +88,9 @@ tc filter add dev lo parent : protocol ip prio 8 u32 \
 match ip dst 127.0.0.8/32 flowid 1:12 \
 action ipt -j mark --set-mark 2
 
+NOTE: flowid 1:12 is parsed flowid 0x1:0x12.  Make sure if you want flowid
+decimal 12, then use flowid 1:c.
+
 3) A feature i call pipe
 The motivation is derived from Unix pipe mechanism but applied to packets.
 Essentially take a matching packet and pass it through 
diff --git a/tc/f_basic.c b/tc/f_basic.c
index 19a7edf..aab946b 100644
--- a/tc/f_basic.c
+++ b/tc/f_basic.c
@@ -32,6 +32,7 @@ static void explain(void)
fprintf(stderr, "\n");
fprintf(stderr, "Where: SELECTOR := SAMPLE SAMPLE ...\n");
fprintf(stderr, "   FILTERID := X:Y:Z\n");
+   fprintf(stderr, "\nNOTE: CLASSID is parsed as hexadecimal input.\n");
 }
 
 static int basic_parse_opt(struct filter_util *qu, char *handle,
diff --git a/tc/f_fw.c b/tc/f_fw.c
index 6d1490b..b511735 100644
--- a/tc/f_fw.c
+++ b/tc/f_fw.c
@@ -28,6 +28,7 @@ static void explain(void)
fprintf(stderr, "Usage: ... fw [ classid CLASSID ] [ police POLICE_SPEC ]\n");
fprintf(stderr, "   POLICE_SPEC := ... look at TBF\n");
fprintf(stderr, "   CLASSID := X:Y\n");
+   fprintf(stderr, "\nNOTE: CLASSID is parsed as hexadecimal input.\n");
 }
 
 #define usage() return(-1)
diff --git a/tc/f_route.c b/tc/f_route.c
index a41b9d5..67dd49c 100644
--- a/tc/f_route.c
+++ b/tc/f_route.c
@@ -31,6 +31,7 @@ static void explain(void)
fprintf(stderr, "[ flowid CLASSID ] [ police POLICE_SPEC ]\n");
fprintf(stderr, "   POLICE_SPEC := ... look at TBF\n");
fprintf(stderr, "   CLASSID := X:Y\n");
+   fprintf(stderr, "\nNOTE: CLASSID is parsed as hexadecimal input.\n");
 }
 
 #define usage() return(-1)
diff --git a/tc/f_rsvp.c b/tc/f_rsvp.c
index 13fcf97..7e1e6d9 100644
--- a/tc/f_rsvp.c
+++ b/tc/f_rsvp.c
@@ -34,6 +34,7 @@ static void explain(void)
fprintf(stderr, "u{8|16|32} NUMBER mask MASK at OFFSET}\n");
fprintf(stderr, "   POLICE_SPEC := ... look at TBF\n");
fprintf(stderr, "   FILTERID := X:Y\n");
+   fprintf(stderr, "\nNOTE: CLASSID is parsed as hexadecimal input.\n");
 }
 
 #define usage() return(-1)
diff --git a/tc/f_u32.c b/tc/f_u32.c
index 91f2838..d75e76c 100644
--- a/tc/f_u32.c
+++ b/tc/f_u32.c
@@ -36,6 +36,7 @@ static void explain(void)
fprintf(stderr, "Where: SELECTOR := SAMPLE SAMPLE ...\n");
fprintf(stderr, "   SAMPLE := { ip | ip6 | udp | tcp | icmp | u{32|16|8} | mark } SAMPLE_ARGS [divisor DIVISOR]\n");
fprintf(stderr, "   FILTERID := X:Y:Z\n");
+   fprintf(stderr, "\nNOTE: CLASSID is parsed at hexadecimal input.\n");
 }
 
 #define usage() return(-1)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [NET 2.6.26]: Add per-connection option to set max TSO frame size

2008-02-13 Thread PJ Waskiewicz
Update: changed max_gso_frame_size and sk_gso_max_size from signed to
unsigned - thanks Stephen!

This patch adds the ability for device drivers to control the size of the
TSO frames being sent to them, per TCP connection.  By setting the
netdevice's max_gso_frame_size value, the socket layer will set the GSO
frame size based on that value.  This will propagate into the TCP layer,
and send TSO's of that size to the hardware.

This can be desirable to help tune the bursty nature of TSO on a
per-adapter basis, where one may have 1 GbE and 10 GbE devices coexisting
in a system, one running multiqueue and the other not, etc.

This can also be desirable for devices that cannot support full 64 KB
TSO's, but still want to benefit from some level of segmentation
offloading.
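
One detail may be worth double-checking with the unsigned 16-bit type used in this revision: the diff below still sets the default to 65536, which does not fit in 16 bits. The sketch that follows is a userspace illustration, assuming u16 behaves like uint16_t; store_u16() and xmit_size_goal() are hypothetical helpers that mirror the assignment and the size-goal arithmetic from the tcp_current_mss() hunk.

```c
#include <stdint.h>

/* What a 16-bit unsigned field holds after being assigned `value`
 * (conversion to an unsigned type is reduced modulo 65536). */
static uint16_t store_u16(uint32_t value)
{
	return (uint16_t)value;
}

/* Size-goal arithmetic from the patch, with header lengths passed in.
 * gso_max_size is promoted to int before the subtraction. */
static int xmit_size_goal(uint16_t gso_max_size, int net_hdr_len,
			  int ext_hdr_len, int tcp_hdr_len)
{
	return (gso_max_size - 1) - net_hdr_len - ext_hdr_len - tcp_hdr_len;
}
```

With 20-byte IP and TCP headers and no extension headers, a stored value of 65535 gives a goal of 65494 bytes, while a stored 65536 wraps to 0 and gives a negative goal. A wider type or a default of 65535 would avoid the wrap.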

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 include/linux/netdevice.h |6 ++
 include/net/sock.h|2 ++
 net/core/dev.c|1 +
 net/core/sock.c   |6 --
 net/ipv4/tcp_output.c |4 ++--
 5 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 047d432..853caca 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -616,6 +616,7 @@ struct net_device
 
/* Partially transmitted GSO packet. */
struct sk_buff  *gso_skb;
+   u16 max_gso_frame_size;
 
/* ingress path synchronizer */
spinlock_t  ingress_lock;
@@ -1475,6 +1476,11 @@ static inline int netif_needs_gso(struct net_device *dev, struct sk_buff *skb)
unlikely(skb->ip_summed != CHECKSUM_PARTIAL));
 }
 
+static inline void netif_set_max_gso_size(struct net_device *dev, u16 size)
+{
+   dev->max_gso_frame_size = size;
+}
+
 /* On bonding slaves other than the currently active slave, suppress
  * duplicates except for 802.3ad ETH_P_SLOW, alb non-mcast/bcast, and
  * ARP on active-backup slaves with arp_validate enabled.
diff --git a/include/net/sock.h b/include/net/sock.h
index 8a7889b..2b07af0 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -151,6 +151,7 @@ struct sock_common {
   *@sk_no_check: %SO_NO_CHECK setting, wether or not checkup packets
   *@sk_route_caps: route capabilities (e.g. %NETIF_F_TSO)
   *@sk_gso_type: GSO type (e.g. %SKB_GSO_TCPV4)
+  *@sk_gso_max_size: Maximum GSO segment size to build
   *@sk_lingertime: %SO_LINGER l_linger setting
   *@sk_backlog: always used with the per-socket spinlock held
   *@sk_callback_lock: used with the callbacks in the end of this struct
@@ -236,6 +237,7 @@ struct sock {
gfp_t   sk_allocation;
int sk_route_caps;
int sk_gso_type;
+   __u16   sk_gso_max_size;
int sk_rcvlowat;
unsigned long   sk_flags;
unsigned long   sk_lingertime;
diff --git a/net/core/dev.c b/net/core/dev.c
index 9549417..f635b29 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4022,6 +4022,7 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
}
 
dev->egress_subqueue_count = queue_count;
+   dev->max_gso_frame_size = 65536;
 
dev->get_stats = internal_stats;
netpoll_netdev_init(dev);
diff --git a/net/core/sock.c b/net/core/sock.c
index 433715f..a8b0ae5 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1076,10 +1076,12 @@ void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
if (sk->sk_route_caps & NETIF_F_GSO)
sk->sk_route_caps |= NETIF_F_GSO_SOFTWARE;
if (sk_can_gso(sk)) {
-   if (dst->header_len)
+   if (dst->header_len) {
sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
-   else
+   } else {
sk->sk_route_caps |= NETIF_F_SG | NETIF_F_HW_CSUM;
+   sk->sk_gso_max_size = dst->dev->max_gso_frame_size;
+   }
}
 }
 EXPORT_SYMBOL_GPL(sk_setup_caps);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index ed750f9..8cd128d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -998,7 +998,7 @@ unsigned int tcp_current_mss(struct sock *sk, int large_allowed)
xmit_size_goal = mss_now;
 
if (doing_tso) {
-   xmit_size_goal = (65535 -
+   xmit_size_goal = ((sk->sk_gso_max_size - 1) -
  inet_csk(sk)->icsk_af_ops->net_header_len -
  inet_csk(sk)->icsk_ext_hdr_len -
  tp->tcp_header_len);
@@ -1274,7 +1274,7 @@ static int tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
limit = min(send_win, cong_win);
 
/* If a full-sized TSO skb can be sent, do it. */
-   if (limit >= 65536)
+   if (limit >= sk->sk_gso_max_size)
go

[PATCH] [NET]: Add per-connection option to set max TSO frame size

2008-02-12 Thread PJ Waskiewicz
This patch adds the ability for device drivers to control the size of the
TSO frames being sent to them, per TCP connection.  By setting the
netdevice's max_gso_frame_size value, the socket layer will set the GSO
frame size based on that value.  This will propagate into the TCP layer,
and send TSO's of that size to the hardware.

This can be desirable to help tune the bursty nature of TSO on a
per-adapter basis, where one may have 1 GbE and 10 GbE devices coexisting
in a system, one running multiqueue and the other not, etc.

This can also be desirable for devices that cannot support full 64 KB
TSO's, but still want to benefit from some level of segmentation
offloading.

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 include/linux/netdevice.h |6 ++
 include/net/sock.h|2 ++
 net/core/dev.c|1 +
 net/core/sock.c   |6 --
 net/ipv4/tcp_output.c |4 ++--
 5 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 047d432..ed1cc32 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -616,6 +616,7 @@ struct net_device
 
/* Partially transmitted GSO packet. */
struct sk_buff  *gso_skb;
+   int max_gso_frame_size;
 
/* ingress path synchronizer */
spinlock_t  ingress_lock;
@@ -1475,6 +1476,11 @@ static inline int netif_needs_gso(struct net_device *dev, struct sk_buff *skb)
unlikely(skb->ip_summed != CHECKSUM_PARTIAL));
 }
 
+static inline void netif_set_max_gso_size(struct net_device *dev, int size)
+{
+   dev->max_gso_frame_size = size;
+}
+
 /* On bonding slaves other than the currently active slave, suppress
  * duplicates except for 802.3ad ETH_P_SLOW, alb non-mcast/bcast, and
  * ARP on active-backup slaves with arp_validate enabled.
diff --git a/include/net/sock.h b/include/net/sock.h
index 8a7889b..1977c05 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -151,6 +151,7 @@ struct sock_common {
   *@sk_no_check: %SO_NO_CHECK setting, wether or not checkup packets
   *@sk_route_caps: route capabilities (e.g. %NETIF_F_TSO)
   *@sk_gso_type: GSO type (e.g. %SKB_GSO_TCPV4)
+  *@sk_gso_max_size: Maximum GSO segment size to build
   *@sk_lingertime: %SO_LINGER l_linger setting
   *@sk_backlog: always used with the per-socket spinlock held
   *@sk_callback_lock: used with the callbacks in the end of this struct
@@ -236,6 +237,7 @@ struct sock {
gfp_t   sk_allocation;
int sk_route_caps;
int sk_gso_type;
+   int sk_gso_max_size;
int sk_rcvlowat;
unsigned long   sk_flags;
unsigned long   sk_lingertime;
diff --git a/net/core/dev.c b/net/core/dev.c
index 9549417..f635b29 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4022,6 +4022,7 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
}
 
dev->egress_subqueue_count = queue_count;
+   dev->max_gso_frame_size = 65536;
 
dev->get_stats = internal_stats;
netpoll_netdev_init(dev);
diff --git a/net/core/sock.c b/net/core/sock.c
index 433715f..a8b0ae5 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1076,10 +1076,12 @@ void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
if (sk->sk_route_caps & NETIF_F_GSO)
sk->sk_route_caps |= NETIF_F_GSO_SOFTWARE;
if (sk_can_gso(sk)) {
-   if (dst->header_len)
+   if (dst->header_len) {
sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
-   else
+   } else {
sk->sk_route_caps |= NETIF_F_SG | NETIF_F_HW_CSUM;
+   sk->sk_gso_max_size = dst->dev->max_gso_frame_size;
+   }
}
 }
 EXPORT_SYMBOL_GPL(sk_setup_caps);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index ed750f9..8cd128d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -998,7 +998,7 @@ unsigned int tcp_current_mss(struct sock *sk, int large_allowed)
xmit_size_goal = mss_now;
 
if (doing_tso) {
-   xmit_size_goal = (65535 -
+   xmit_size_goal = ((sk->sk_gso_max_size - 1) -
  inet_csk(sk)->icsk_af_ops->net_header_len -
  inet_csk(sk)->icsk_ext_hdr_len -
  tp->tcp_header_len);
@@ -1274,7 +1274,7 @@ static int tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
limit = min(send_win, cong_win);
 
/* If a full-sized TSO skb can be sent, do it. */
-   if (limit >= 65536)
+   if (limit >= sk->sk_gso_max_size)
goto send_now;
 
if (sysctl_tcp_tso_win_divisor) {


[PATCH] [IPROUTE2] Update various classifiers' help output for expected CLASSID syntax

2008-02-12 Thread PJ Waskiewicz
This updates the help output to specify that CLASSID should be hexadecimal.
This makes sure that a user entering "flowid 1:10" gets his flow put into
band 15 (0x10) and knows why.

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 doc/actions/actions-general |3 +++
 tc/f_basic.c|1 +
 tc/f_fw.c   |1 +
 tc/f_route.c|1 +
 tc/f_rsvp.c |1 +
 tc/f_u32.c  |1 +
 6 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/doc/actions/actions-general b/doc/actions/actions-general
index 6561eda..70f7cd6 100644
--- a/doc/actions/actions-general
+++ b/doc/actions/actions-general
@@ -88,6 +88,9 @@ tc filter add dev lo parent : protocol ip prio 8 u32 \
 match ip dst 127.0.0.8/32 flowid 1:12 \
 action ipt -j mark --set-mark 2
 
+NOTE: flowid 1:12 is parsed flowid 0x1:0x12.  Make sure if you want flowid
+decimal 12, then use flowid 1:c.
+
 3) A feature i call pipe
 The motivation is derived from Unix pipe mechanism but applied to packets.
 Essentially take a matching packet and pass it through 
diff --git a/tc/f_basic.c b/tc/f_basic.c
index 19a7edf..d6d7767 100644
--- a/tc/f_basic.c
+++ b/tc/f_basic.c
@@ -32,6 +32,7 @@ static void explain(void)
fprintf(stderr, "\n");
fprintf(stderr, "Where: SELECTOR := SAMPLE SAMPLE ...\n");
fprintf(stderr, "   FILTERID := X:Y:Z\n");
+   fprintf(stderr, "\nNOTE: CLASSID is parsed as hexidecimal input.\n");
 }
 
 static int basic_parse_opt(struct filter_util *qu, char *handle,
diff --git a/tc/f_fw.c b/tc/f_fw.c
index 6d1490b..9f4ef6e 100644
--- a/tc/f_fw.c
+++ b/tc/f_fw.c
@@ -28,6 +28,7 @@ static void explain(void)
fprintf(stderr, "Usage: ... fw [ classid CLASSID ] [ police POLICE_SPEC ]\n");
fprintf(stderr, "   POLICE_SPEC := ... look at TBF\n");
fprintf(stderr, "   CLASSID := X:Y\n");
+   fprintf(stderr, "\nNOTE: CLASSID is parsed as hexidecimal input.\n");
 }
 
 #define usage() return(-1)
diff --git a/tc/f_route.c b/tc/f_route.c
index a41b9d5..3bb963c 100644
--- a/tc/f_route.c
+++ b/tc/f_route.c
@@ -31,6 +31,7 @@ static void explain(void)
fprintf(stderr, "[ flowid CLASSID ] [ police POLICE_SPEC ]\n");
fprintf(stderr, "   POLICE_SPEC := ... look at TBF\n");
fprintf(stderr, "   CLASSID := X:Y\n");
+   fprintf(stderr, "\nNOTE: CLASSID is parsed as hexidecimal input.\n");
 }
 
 #define usage() return(-1)
diff --git a/tc/f_rsvp.c b/tc/f_rsvp.c
index 13fcf97..9019ee2 100644
--- a/tc/f_rsvp.c
+++ b/tc/f_rsvp.c
@@ -34,6 +34,7 @@ static void explain(void)
fprintf(stderr, "u{8|16|32} NUMBER mask MASK at OFFSET}\n");
fprintf(stderr, "   POLICE_SPEC := ... look at TBF\n");
fprintf(stderr, "   FILTERID := X:Y\n");
+   fprintf(stderr, "\nNOTE: CLASSID is parsed as hexidecimal input.\n");
 }
 
 #define usage() return(-1)
diff --git a/tc/f_u32.c b/tc/f_u32.c
index 91f2838..d38c536 100644
--- a/tc/f_u32.c
+++ b/tc/f_u32.c
@@ -36,6 +36,7 @@ static void explain(void)
fprintf(stderr, "Where: SELECTOR := SAMPLE SAMPLE ...\n");
fprintf(stderr, "   SAMPLE := { ip | ip6 | udp | tcp | icmp | u{32|16|8} | mark } SAMPLE_ARGS [divisor DIVISOR]\n");
fprintf(stderr, "   FILTERID := X:Y:Z\n");
+   fprintf(stderr, "\nNOTE: CLASSID is parsed at hexidecimal input.\n");
 }
 
 #define usage() return(-1)



[PATCH] [TC U32] Fix input parsing to support more than 9 flow id's correctly

2008-02-12 Thread PJ Waskiewicz
From: PJ Waskiewicz <[EMAIL PROTECTED]>

Using strtoul with a base of 16 converts flowid 10 into 0x10, which makes
it flowid 16.  This is interpreted by the kernel incorrectly, and causes
traffic flows above 9 to be classified into band 0 on multiband qdiscs.
This changes the base to 10, which will correctly parse input into the
proper hexadecimal value.
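
The effect of the base argument can be checked in isolation. The sketch below is a simplified userspace parser modeled on the get_tc_classid() hunk in this patch; parse_classid() is a hypothetical helper that takes the base as a parameter and, unlike the real function, does not accept "none" or an empty major number.

```c
#include <stdint.h>
#include <stdlib.h>

/* Parse "MAJ:MIN" in the given base. On success, store
 * (major << 16) | minor in *h and return 0; return -1 on error. */
static int parse_classid(uint32_t *h, const char *str, int base)
{
	char *p;
	unsigned long maj, min;

	maj = strtoul(str, &p, base);
	if (p == str || *p != ':' || maj >= (1UL << 16))
		return -1;

	str = p + 1;
	min = strtoul(str, &p, base);
	if (p == str || *p != '\0' || min >= (1UL << 16))
		return -1;

	*h = (uint32_t)((maj << 16) | min);
	return 0;
}
```

With base 16, "1:10" yields minor 0x10 (decimal 16) and "1:c" yields minor 12. With base 10, "1:10" yields minor 10, and "1:c" fails to parse because "c" is not a decimal digit.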

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 tc/tc_util.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tc/tc_util.c b/tc/tc_util.c
index cdbae42..a277eac 100644
--- a/tc/tc_util.c
+++ b/tc/tc_util.c
@@ -65,7 +65,7 @@ int get_tc_classid(__u32 *h, const char *str)
maj = TC_H_UNSPEC;
if (strcmp(str, "none") == 0)
goto ok;
-   maj = strtoul(str, &p, 16);
+   maj = strtoul(str, &p, 10);
if (p == str) {
maj = 0;
if (*p != ':')
@@ -76,7 +76,7 @@ int get_tc_classid(__u32 *h, const char *str)
return -1;
maj <<= 16;
str = p+1;
-   min = strtoul(str, &p, 16);
+   min = strtoul(str, &p, 10);
if (*p != 0)
return -1;
if (min >= (1<<16))



[PATCH] PATCH 1/2 [SCHED 2.6.24]: Check subqueue status before calling hard_start_xmit

2007-11-13 Thread PJ Waskiewicz
The only qdiscs that check subqueue state before dequeue'ing are PRIO
and RR.  The other qdiscs, including the default pfifo_fast qdisc, will
allow traffic bound for subqueue 0 through to hard_start_xmit.  The check
for netif_queue_stopped() is done above in pkt_sched.h, so it is
unnecessary for qdisc_restart().  However, if the underlying driver is
multiqueue capable, and only sets queue states on subqueues, this will
allow packets to enter the driver when it's currently unable to process
packets, resulting in expensive requeues and driver entries.  This patch
re-adds the check for the subqueue status before calling hard_start_xmit,
so we can try and avoid the driver entry when the queues are stopped.
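
The control flow this patch restores can be sketched outside the kernel. The mock below assumes a device with a small fixed number of subqueues and counts how often the driver transmit path is entered; struct mock_dev, the MOCK_TX_* values and both functions are hypothetical names for illustration only.

```c
#include <stdbool.h>

#define MOCK_NUM_QUEUES 4

/* Illustrative return codes standing in for NETDEV_TX_OK / NETDEV_TX_BUSY. */
enum { MOCK_TX_OK = 0, MOCK_TX_BUSY = 1 };

struct mock_dev {
	bool subqueue_stopped[MOCK_NUM_QUEUES];
	int driver_entries;  /* times the driver transmit path was entered */
};

/* Stand-in for the driver's transmit routine. */
static int mock_hard_start_xmit(struct mock_dev *dev)
{
	dev->driver_entries++;
	return MOCK_TX_OK;
}

/* Mirrors the patched logic: default to "busy" and only enter the
 * driver when the packet's subqueue is running. */
static int mock_restart_xmit(struct mock_dev *dev, unsigned int queue)
{
	int ret = MOCK_TX_BUSY;

	if (queue < MOCK_NUM_QUEUES && !dev->subqueue_stopped[queue])
		ret = mock_hard_start_xmit(dev);

	return ret;
}
```

Initializing ret to the busy code is what lets the caller requeue the packet without the driver having been called at all.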

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 net/sched/sch_generic.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index fa1a6f4..e595e65 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -134,7 +134,7 @@ static inline int qdisc_restart(struct net_device *dev)
 {
struct Qdisc *q = dev->qdisc;
struct sk_buff *skb;
-   int ret;
+   int ret = NETDEV_TX_BUSY;
 
/* Dequeue packet */
if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
@@ -145,7 +145,8 @@ static inline int qdisc_restart(struct net_device *dev)
spin_unlock(&dev->queue_lock);
 
HARD_TX_LOCK(dev, smp_processor_id());
-   ret = dev_hard_start_xmit(skb, dev);
+   if (!netif_subqueue_stopped(dev, skb))
+   ret = dev_hard_start_xmit(skb, dev);
HARD_TX_UNLOCK(dev);
 
spin_lock(&dev->queue_lock);


[PATCH] PATCH 2/2 [SCHED 2.6.23-stable]: Check subqueue state before hard_start_xmit

2007-11-13 Thread PJ Waskiewicz
The only qdiscs that check subqueue state before dequeue'ing are PRIO
and RR.  The other qdiscs, including the default pfifo_fast qdisc, will
allow traffic bound for subqueue 0 through to hard_start_xmit.  The check
for netif_queue_stopped() is done above in pkt_sched.h, so it is
unnecessary for qdisc_restart().  However, if the underlying driver is
multiqueue capable, and only sets queue states on subqueues, this will
allow packets to enter the driver when it's currently unable to process
packets, resulting in expensive requeues and driver entries.  This patch
re-adds the check for the subqueue status before calling hard_start_xmit,
so we can try and avoid the driver entry when the queues are stopped.

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 net/sched/sch_generic.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index c81649c..a35d7ce 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -135,7 +135,7 @@ static inline int qdisc_restart(struct net_device *dev)
struct Qdisc *q = dev->qdisc;
struct sk_buff *skb;
unsigned lockless;
-   int ret;
+   int ret = NETDEV_TX_BUSY;
 
/* Dequeue packet */
if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
@@ -158,7 +158,8 @@ static inline int qdisc_restart(struct net_device *dev)
/* And release queue */
spin_unlock(&dev->queue_lock);
 
-   ret = dev_hard_start_xmit(skb, dev);
+   if (!netif_subqueue_stopped(dev, skb->queue_mapping))
+   ret = dev_hard_start_xmit(skb, dev);
 
if (!lockless)
netif_tx_unlock(dev);


[PATCH] SCHED: Fix unnecessary driver entries when queue is stopped

2007-11-13 Thread PJ Waskiewicz
Dave,

This patch addresses an issue with multiqueue devices and non-multiqueue
qdiscs which is causing performance issues.  This patch should be 
considered for both 2.6.23-stable and 2.6.24 upstream.  Basically, if
a driver is using the netif_*_subqueue() calls, then qdisc_restart() will
happily call hard_start_xmit() even if subqueue 0 is stopped, which is bad.
This re-adds the check for the subqueue state.

Note that this check was removed when qdisc_restart() was rewritten.  At that
time though, we didn't understand the full effect of multiqueue with respect
to the qdiscs and queue management from a driver to kernel perspective.  Since
the driver doesn't know what qdisc capabilities live above it, it needs to
decide to use the queue or subqueue functions ahead of time.  This patch is
just cleaning up a miss from that rewrite.

Patch 1 is for 2.6.24, patch 2 is for 2.6.23 stable.

Thanks,
-- 
PJ Waskiewicz <[EMAIL PROTECTED]>


[PATCH] [AF_PACKET]: Allow multicast traffic to be caught by ORIGDEV when bonded

2007-11-06 Thread PJ Waskiewicz
The socket option for packet sockets to return the original ifindex instead
of the bonded ifindex will not match multicast traffic.  Since this socket
option is the most useful for layer 2 traffic and multicast traffic, make
the option multicast-aware.
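
The before and after behavior of the ifindex selection can be written as two small pure functions. This is a sketch of the condition changed in the diff below, not the kernel code itself; the MOCK_PACKET_* values and both function names are hypothetical.

```c
#include <stdbool.h>

/* Illustrative packet types standing in for PACKET_HOST and friends. */
enum mock_pkt_type {
	MOCK_PACKET_HOST = 0,
	MOCK_PACKET_BROADCAST = 1,
	MOCK_PACKET_MULTICAST = 2
};

/* Before the patch: the original ifindex is reported only for
 * packets addressed to this host. */
static int ifindex_before(bool origdev, enum mock_pkt_type type,
			  int orig_ifindex, int dev_ifindex)
{
	return (origdev && type == MOCK_PACKET_HOST) ? orig_ifindex : dev_ifindex;
}

/* After the patch: the original ifindex is reported for every
 * packet type whenever the socket option is set. */
static int ifindex_after(bool origdev, int orig_ifindex, int dev_ifindex)
{
	return origdev ? orig_ifindex : dev_ifindex;
}
```

For a multicast frame received on a slave with ifindex 2 under a bond with ifindex 5, the old condition reports 5 and the new one reports 2 when the option is enabled.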

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 net/packet/af_packet.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 4cb2dfb..23eef6f 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -515,7 +515,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev, struct packet
sll->sll_hatype = dev->type;
sll->sll_protocol = skb->protocol;
sll->sll_pkttype = skb->pkt_type;
-   if (unlikely(po->origdev) && skb->pkt_type == PACKET_HOST)
+   if (unlikely(po->origdev))
sll->sll_ifindex = orig_dev->ifindex;
else
sll->sll_ifindex = dev->ifindex;
@@ -661,7 +661,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, struct packe
sll->sll_hatype = dev->type;
sll->sll_protocol = skb->protocol;
sll->sll_pkttype = skb->pkt_type;
-   if (unlikely(po->origdev) && skb->pkt_type == PACKET_HOST)
+   if (unlikely(po->origdev))
sll->sll_ifindex = orig_dev->ifindex;
else
sll->sll_ifindex = dev->ifindex;


[PATCH] DOC: Update networking/multiqueue.txt with correct information.

2007-09-07 Thread PJ Waskiewicz
Updated the multiqueue.txt document to call out the correct kernel options
to select to enable multiqueue.

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 Documentation/networking/multiqueue.txt |   10 +++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/multiqueue.txt b/Documentation/networking/multiqueue.txt
index 00b60cc..ea5a42e 100644
--- a/Documentation/networking/multiqueue.txt
+++ b/Documentation/networking/multiqueue.txt
@@ -58,9 +58,13 @@ software, so it's a straight round-robin qdisc.  It uses the same syntax and
 classification priomap that sch_prio uses, so it should be intuitive to
 configure for people who've used sch_prio.
 
-The PRIO qdisc naturally plugs into a multiqueue device.  If PRIO has been
-built with NET_SCH_PRIO_MQ, then upon load, it will make sure the number of
-bands requested is equal to the number of queues on the hardware.  If they
+In order to utilitize the multiqueue features of the qdiscs, the network
+device layer needs to enable multiple queue support.  This can be done by
+selecting NETDEVICES_MULTIQUEUE under Drivers.
+
+The PRIO qdisc naturally plugs into a multiqueue device.  If
+NETDEVICES_MULTIQUEUE is selected, then on qdisc load, the number of
+bands requested is compared to the number of queues on the hardware.  If they
 are equal, it sets a one-to-one mapping up between the queues and bands.  If
 they're not equal, it will not load the qdisc.  This is the same behavior
 for RR.  Once the association is made, any skb that is classified will have


[PATCH 2/2] iproute2: sch_rr support in tc

2007-08-14 Thread PJ Waskiewicz
This patch applies on top of Patrick McHardy's RTNETLINK
patches to add nested compat attributes.  This is needed to maintain
ABI for sch_{rr|prio} in the kernel with respect to tc.  A new option,
namely multiqueue, was added to sch_prio and sch_rr.  This will allow
a user to turn multiqueue support on for sch_prio or sch_rr at loadtime.
Also, tc qdisc ls will display whether or not multiqueue is enabled on
that qdisc.  When in multiqueue mode, a user can specify a value of 0 for
bands, and the number of bands will be created to match the number of
queues on the device.
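
The band-count rule described above can be sketched as a small function. This is an assumption-laden illustration rather than the kernel implementation: resolve_bands() is a hypothetical helper, and the mismatch case follows the wording of the multiqueue.txt update earlier in this archive (the qdisc does not load when bands and queues differ).

```c
/* Returns the number of bands to create, or -1 if the request
 * cannot be satisfied in multiqueue mode. */
static int resolve_bands(int requested, int multiqueue, int dev_queues)
{
	if (!multiqueue)
		return requested;       /* plain mode: use the request as-is */

	if (requested == 0)
		return dev_queues;      /* "bands 0": match the device's queues */

	/* Assumed from multiqueue.txt: bands must equal the queue count. */
	return requested == dev_queues ? requested : -1;
}
```

On a device with 8 queues, a multiqueue request for 0 bands resolves to 8, while a multiqueue request for 4 bands is refused under this reading.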

This patch is to support the new sch_rr (round-robin) qdisc being proposed
in NET for multiqueue network device support in the Linux network stack.
It uses q_prio.c as the template, since the qdiscs are nearly identical,
outside of the ->dequeue() routine.

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 include/linux/pkt_sched.h |9 +++
 tc/q_prio.c   |   24 +++--
 tc/q_rr.c |  127 +
 3 files changed, 155 insertions(+), 5 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index d10f353..4f1531b 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -101,6 +101,15 @@ struct tc_prio_qopt
__u8priomap[TC_PRIO_MAX+1]; /* Map: logical priority -> PRIO band */
 };
 
+enum
+{
+TCA_PRIO_UNSPEC,
+TCA_PRIO_MQ,
+__TCA_PRIO_MAX
+};
+
+#define TCA_PRIO_MAX(__TCA_PRIO_MAX - 1)
+
 /* TBF section */
 
 struct tc_tbf_qopt
diff --git a/tc/q_prio.c b/tc/q_prio.c
index d696e1b..6883edb 100644
--- a/tc/q_prio.c
+++ b/tc/q_prio.c
@@ -29,7 +29,7 @@
 
 static void explain(void)
 {
-   fprintf(stderr, "Usage: ... prio bands NUMBER priomap P1 P2...\n");
+   fprintf(stderr, "Usage: ... prio bands NUMBER priomap P1 P2...[multiqueue]\n");
 }
 
 #define usage() return(-1)
@@ -40,6 +40,8 @@ static int prio_parse_opt(struct qdisc_util *qu, int argc, 
char **argv, struct n
int pmap_mode = 0;
int idx = 0;
struct tc_prio_qopt opt={3,{ 1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 
1, 1 }};
+   struct rtattr *nest;
+   unsigned char mq = 0;
 
while (argc > 0) {
if (strcmp(*argv, "bands") == 0) {
@@ -57,6 +59,8 @@ static int prio_parse_opt(struct qdisc_util *qu, int argc, 
char **argv, struct n
return -1;
}
pmap_mode = 1;
+   } else if (strcmp(*argv, "multiqueue") == 0) {
+   mq = 1;
} else if (strcmp(*argv, "help") == 0) {
explain();
return -1;
@@ -90,7 +94,10 @@ static int prio_parse_opt(struct qdisc_util *qu, int argc, 
char **argv, struct n
opt.priomap[idx] = opt.priomap[TC_PRIO_BESTEFFORT];
}
 */
-   addattr_l(n, 1024, TCA_OPTIONS, &opt, sizeof(opt));
+   nest = addattr_nest_compat(n, 1024, TCA_OPTIONS, &opt, sizeof(opt));
+   if (mq)
+   addattr_l(n, 1024, TCA_PRIO_MQ, NULL, 0);
+   addattr_nest_compat_end(n, nest);
return 0;
 }
 
@@ -98,16 +105,23 @@ int prio_print_opt(struct qdisc_util *qu, FILE *f, struct 
rtattr *opt)
 {
int i;
struct tc_prio_qopt *qopt;
+   struct rtattr *tb[TCA_PRIO_MAX+1];
 
if (opt == NULL)
return 0;
 
-   if (RTA_PAYLOAD(opt)  < sizeof(*qopt))
-   return -1;
-   qopt = RTA_DATA(opt);
+   if (parse_rtattr_nested_compat(tb, TCA_PRIO_MAX, opt, qopt,
+   sizeof(*qopt)))
+return -1;
+
fprintf(f, "bands %u priomap ", qopt->bands);
for (i=0; i<=TC_PRIO_MAX; i++)
fprintf(f, " %d", qopt->priomap[i]);
+
+   if (tb[TCA_PRIO_MQ])
+   fprintf(f, " multiqueue: %s ",
+   *(unsigned char *)RTA_DATA(tb[TCA_PRIO_MQ]) ? "on" : "off");
+
return 0;
 }
 
diff --git a/tc/q_rr.c b/tc/q_rr.c
new file mode 100644
index 0000000..9335c47
--- /dev/null
+++ b/tc/q_rr.c
@@ -0,0 +1,127 @@
+/*
+ * q_rr.c  RR.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:    PJ Waskiewicz, <[EMAIL PROTECTED]>
+ * Original Authors:   Alexey Kuznetsov, <[EMAIL PROTECTED]> (from PRIO)
+ *
+ * Changes:
+ *
+ * Ole Husgaard <[EMAIL PROTECTED]>: 990513: prio2band map was always reset.
+ * J Hadi Salim <[EMAIL PROTECTED]>: 990609: priomap fix.
+ */
+
+#include 
+#include 
+#include 

[PATCH] iproute2: sch_rr support in tc

2007-08-14 Thread PJ Waskiewicz
Stephen,

These patches are resubmissions of patches that were approved, but didn't
get merged.  The first patch is Patrick McHardy's nested compat attribute
patch to the netlink libraries.  The second patch adds multiqueue and sch_rr
functionality to tc.  The multiqueue features have been merged to 2.6.23, so
we'll need these patches to manage the new kernel features.

These patches are unmodified from the version that was approved.  They've been
applied to the latest iproute2 commit.
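
For readers who have not seen the nested compat layout that the first patch introduces, here is a minimal sketch of the byte layout it produces. It is an illustration only: `struct attr_hdr`, `ATTR_ALIGN`, `put_attr()` and `build_compat()` are hypothetical stand-ins for `struct rtattr`, `RTA_ALIGN`, `addattr_l()` and the `addattr_nest_compat()` / `addattr_nest_compat_end()` pair, and 4-byte attribute alignment is assumed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct rtattr: length covers header + payload. */
struct attr_hdr {
	unsigned short len;
	unsigned short type;
};

#define ATTR_ALIGN(n) (((size_t)(n) + 3) & ~(size_t)3)
#define ATTR_HDRLEN   ATTR_ALIGN(sizeof(struct attr_hdr))

/* Append one attribute at buf + off and return the next aligned offset. */
static size_t put_attr(unsigned char *buf, size_t off, unsigned short type,
		       const void *data, size_t dlen)
{
	struct attr_hdr h;

	h.len = (unsigned short)(ATTR_HDRLEN + dlen);
	h.type = type;
	memcpy(buf + off, &h, sizeof(h));
	if (dlen)
		memcpy(buf + off + ATTR_HDRLEN, data, dlen);
	return off + ATTR_ALIGN(h.len);
}

/* Overwrite the length field of the attribute header stored at buf + off. */
static void set_len(unsigned char *buf, size_t off, size_t len)
{
	struct attr_hdr h;

	memcpy(&h, buf + off, sizeof(h));
	h.len = (unsigned short)len;
	memcpy(buf + off, &h, sizeof(h));
}

/*
 * Build [outer hdr][legacy struct][nest hdr][flag attr] and return the
 * total length.  The outer attribute ends up spanning everything, so a
 * reader that only knows the legacy struct still finds it at the start
 * of the payload, while a newer reader can look past it for the nest.
 */
static size_t build_compat(unsigned char *buf, unsigned short outer_type,
			   const void *legacy, size_t legacy_len,
			   unsigned short flag_type)
{
	size_t nest_off, end;

	nest_off = put_attr(buf, 0, outer_type, legacy, legacy_len);
	end = put_attr(buf, nest_off, outer_type, NULL, 0); /* empty nest */
	end = put_attr(buf, end, flag_type, NULL, 0);       /* zero-length flag */

	set_len(buf, nest_off, end - nest_off); /* close the nest */
	set_len(buf, 0, end);                   /* grow the outer attribute */
	return end;
}
```

With a 20-byte legacy struct (the size of `struct tc_prio_qopt` when `int` is 4 bytes), the outer attribute is 32 bytes long and the nest header sits at offset 24.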

-- 
PJ Waskiewicz <[EMAIL PROTECTED]>


[PATCH 1/2] IPROUTE2: RTNETLINK nested attributes

2007-08-14 Thread PJ Waskiewicz
From: Patrick McHardy <[EMAIL PROTECTED]>

This adds capability for iproute2 to send nested attributes to the
kernel, while maintaining backwards compatibility.

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>
---

 include/libnetlink.h |9 +
 lib/libnetlink.c |   46 ++
 2 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/include/libnetlink.h b/include/libnetlink.h
index 49e248e..b67c5a5 100644
--- a/include/libnetlink.h
+++ b/include/libnetlink.h
@@ -39,15 +39,24 @@ extern int rtnl_send(struct rtnl_handle *rth, const char 
*buf, int);
 extern int addattr32(struct nlmsghdr *n, int maxlen, int type, __u32 data);
 extern int addattr_l(struct nlmsghdr *n, int maxlen, int type, const void 
*data, int alen);
 extern int addraw_l(struct nlmsghdr *n, int maxlen, const void *data, int len);
+extern struct rtattr *addattr_nest(struct nlmsghdr *n, int maxlen, int type);
+extern int addattr_nest_end(struct nlmsghdr *n, struct rtattr *nest);
+extern struct rtattr *addattr_nest_compat(struct nlmsghdr *n, int maxlen, int type, const void *data, int len);
+extern int addattr_nest_compat_end(struct nlmsghdr *n, struct rtattr *nest);
 extern int rta_addattr32(struct rtattr *rta, int maxlen, int type, __u32 data);
 extern int rta_addattr_l(struct rtattr *rta, int maxlen, int type, const void 
*data, int alen);
 
 extern int parse_rtattr(struct rtattr *tb[], int max, struct rtattr *rta, int 
len);
 extern int parse_rtattr_byindex(struct rtattr *tb[], int max, struct rtattr 
*rta, int len);
+extern int __parse_rtattr_nested_compat(struct rtattr *tb[], int max, struct rtattr *rta, int len);
 
 #define parse_rtattr_nested(tb, max, rta) \
(parse_rtattr((tb), (max), RTA_DATA(rta), RTA_PAYLOAD(rta)))
 
+#define parse_rtattr_nested_compat(tb, max, rta, data, len) \
+({ data = RTA_PAYLOAD(rta) >= len ? RTA_DATA(rta) : NULL; \
+   __parse_rtattr_nested_compat(tb, max, rta, len); })
+
 extern int rtnl_listen(struct rtnl_handle *, rtnl_filter_t handler,
   void *jarg);
 extern int rtnl_from_file(FILE *, rtnl_filter_t handler,
diff --git a/lib/libnetlink.c b/lib/libnetlink.c
index 555dd5c..12883fe 100644
--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -527,6 +527,39 @@ int addraw_l(struct nlmsghdr *n, int maxlen, const void 
*data, int len)
return 0;
 }
 
+struct rtattr *addattr_nest(struct nlmsghdr *n, int maxlen, int type)
+{
+   struct rtattr *nest = NLMSG_TAIL(n);
+
+   addattr_l(n, maxlen, type, NULL, 0);
+   return nest;
+}
+
+int addattr_nest_end(struct nlmsghdr *n, struct rtattr *nest)
+{
+   nest->rta_len = (void *)NLMSG_TAIL(n) - (void *)nest;
+   return n->nlmsg_len;
+}
+
+struct rtattr *addattr_nest_compat(struct nlmsghdr *n, int maxlen, int type,
+  const void *data, int len)
+{
+   struct rtattr *start = NLMSG_TAIL(n);
+
+   addattr_l(n, maxlen, type, data, len);
+   addattr_nest(n, maxlen, type);
+   return start;
+}
+
+int addattr_nest_compat_end(struct nlmsghdr *n, struct rtattr *start)
+{
+   struct rtattr *nest = (void *)start + NLMSG_ALIGN(start->rta_len);
+
+   start->rta_len = (void *)NLMSG_TAIL(n) - (void *)start;
+   addattr_nest_end(n, nest);
+   return n->nlmsg_len;
+}
+
 int rta_addattr32(struct rtattr *rta, int maxlen, int type, __u32 data)
 {
int len = RTA_LENGTH(4);
@@ -589,3 +622,16 @@ int parse_rtattr_byindex(struct rtattr *tb[], int max, 
struct rtattr *rta, int l
fprintf(stderr, "!!!Deficit %d, rta_len=%d\n", len, 
rta->rta_len);
return i;
 }
+
+int __parse_rtattr_nested_compat(struct rtattr *tb[], int max, struct rtattr *rta,
+                                 int len)
+{
+   if (RTA_PAYLOAD(rta) < len)
+   return -1;
+   if (RTA_PAYLOAD(rta) >= RTA_ALIGN(len) + sizeof(struct rtattr)) {
+   rta = RTA_DATA(rta) + RTA_ALIGN(len);
+   return parse_rtattr_nested(tb, max, rta);
+   }
+   memset(tb, 0, sizeof(struct rtattr *) * max);
+   return 0;
+}


[FIX] NET: Fix sch_api and sch_prio to properly set and detect the root qdisc

2007-07-24 Thread PJ Waskiewicz
This is a patch from Patrick McHardy to fix the sch_api code, which I
went ahead and tested and made a slight fix to.  This also includes
the fix to sch_prio based on Patrick's patch.

The sch->parent handle should contain the parent qdisc ID.  When the
qdisc is the root qdisc (TC_H_ROOT), the parent handle should be the
value TC_H_ROOT.  This fixes sch_api to set this correctly on
qdisc_create() for both ingress and egress qdiscs.

Change this check in prio_tune() so that only the root qdisc can be
multiqueue-enabled; use sch->parent instead of sch->handle.
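
To make the difference between the two checks concrete, here is a small standalone sketch. It is not the kernel code: `struct qdisc_sketch` and `mq_allowed()` are hypothetical names, and only the value of `ROOT_ID` is taken from `TC_H_ROOT` in pkt_sched.h. A root qdisc created with "handle 1:" carries handle 0x00010000, so comparing the handle against TC_H_ROOT rejects it; the parent field is what identifies the root.

```c
#include <errno.h>

#define ROOT_ID 0xFFFFFFFFU /* same value as TC_H_ROOT */

/* Hypothetical, trimmed-down view of the two Qdisc fields involved. */
struct qdisc_sketch {
	unsigned int handle; /* this qdisc's own ID, e.g. 0x00010000 for "1:" */
	unsigned int parent; /* ID of the parent, ROOT_ID for the root qdisc */
};

/* Return 0 if the multiqueue flag is acceptable, -EINVAL otherwise. */
static int mq_allowed(const struct qdisc_sketch *sch, int mq)
{
	if (mq && sch->parent != ROOT_ID)
		return -EINVAL;
	return 0;
}
```

A qdisc with handle 0x00010000 and parent `ROOT_ID` passes; a child qdisc attached under class 1:1 (parent 0x00010001) is rejected when it asks for multiqueue.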

-- 
PJ Waskiewicz <[EMAIL PROTECTED]>


[PATCH 2/2] NET: Fix sch_prio to properly detect the root qdisc on multiqueue

2007-07-24 Thread PJ Waskiewicz
Fix the check in prio_tune() to see if sch->parent is TC_H_ROOT instead of
sch->handle to load or reject the qdisc for multiqueue devices.

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 net/sched/sch_prio.c |6 --
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 2d8c084..06441db 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -239,11 +239,13 @@ static int prio_tune(struct Qdisc *sch, struct rtattr 
*opt)
/* If we're multiqueue, make sure the number of incoming bands
 * matches the number of queues on the device we're associating with.
 * If the number of bands requested is zero, then set q->bands to
-* dev->egress_subqueue_count.
+* dev->egress_subqueue_count.  Also, the root qdisc must be the
+* only one that is enabled for multiqueue, since it's the only one
+* that interacts with the underlying device.
 */
q->mq = RTA_GET_FLAG(tb[TCA_PRIO_MQ - 1]);
if (q->mq) {
-   if (sch->handle != TC_H_ROOT)
+   if (sch->parent != TC_H_ROOT)
return -EINVAL;
if (netif_is_multiqueue(sch->dev)) {
if (q->bands == 0)


[PATCH 1/2] NET: Fix sch_api to properly set sch->parent on the root qdisc

2007-07-24 Thread PJ Waskiewicz
From: Patrick McHardy <[EMAIL PROTECTED]>

Fix sch_api to correctly set sch->parent for both ingress and egress
qdiscs in qdisc_create().

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>
Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 net/sched/sch_api.c |   17 -
 1 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 13c09bc..dee0d5f 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -380,6 +380,10 @@ void qdisc_tree_decrease_qlen(struct Qdisc *sch, unsigned 
int n)
return;
while ((parentid = sch->parent)) {
sch = qdisc_lookup(sch->dev, TC_H_MAJ(parentid));
+   if (sch == NULL) {
+   WARN_ON(parentid != TC_H_ROOT);
+   return;
+   }
cops = sch->ops->cl_ops;
if (cops->qlen_notify) {
cl = cops->get(sch, parentid);
@@ -420,8 +424,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc 
*parent,
unsigned long cl = cops->get(parent, classid);
if (cl) {
err = cops->graft(parent, cl, new, old);
-   if (new)
-   new->parent = classid;
cops->put(parent, cl);
}
}
@@ -436,7 +438,8 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc 
*parent,
  */
 
 static struct Qdisc *
-qdisc_create(struct net_device *dev, u32 handle, struct rtattr **tca, int 
*errp)
+qdisc_create(struct net_device *dev, u32 parent, u32 handle,
+  struct rtattr **tca, int *errp)
 {
int err;
struct rtattr *kind = tca[TCA_KIND-1];
@@ -482,6 +485,8 @@ qdisc_create(struct net_device *dev, u32 handle, struct 
rtattr **tca, int *errp)
goto err_out2;
}
 
+   sch->parent = parent;
+
if (handle == TC_H_INGRESS) {
sch->flags |= TCQ_F_INGRESS;
sch->stats_lock = &dev->ingress_lock;
@@ -758,9 +763,11 @@ create_n_graft:
if (!(n->nlmsg_flags&NLM_F_CREATE))
return -ENOENT;
if (clid == TC_H_INGRESS)
-   q = qdisc_create(dev, tcm->tcm_parent, tca, &err);
+   q = qdisc_create(dev, tcm->tcm_parent, tcm->tcm_parent,
+tca, &err);
else
-   q = qdisc_create(dev, tcm->tcm_handle, tca, &err);
+   q = qdisc_create(dev, tcm->tcm_parent, tcm->tcm_handle,
+tca, &err);
if (q == NULL) {
if (err == -EAGAIN)
goto replay;


[PATCH] NET: Fix sch_prio to detect the root qdisc loading

2007-07-21 Thread PJ Waskiewicz
The sch->parent handle will be NULL for the scheduler that is TC_H_ROOT.
Change this check in prio_tune() so that only the root qdisc can be
multiqueue-enabled.

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 net/sched/sch_prio.c |6 --
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 2d8c084..271051e 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -239,11 +239,13 @@ static int prio_tune(struct Qdisc *sch, struct rtattr 
*opt)
/* If we're multiqueue, make sure the number of incoming bands
 * matches the number of queues on the device we're associating with.
 * If the number of bands requested is zero, then set q->bands to
-* dev->egress_subqueue_count.
+* dev->egress_subqueue_count.  Also, the root qdisc must be the
+* only one that is enabled for multiqueue, since it's the only one
+* that interacts with the underlying device.
 */
q->mq = RTA_GET_FLAG(tb[TCA_PRIO_MQ - 1]);
if (q->mq) {
-   if (sch->handle != TC_H_ROOT)
+   if (sch->parent != NULL)
return -EINVAL;
if (netif_is_multiqueue(sch->dev)) {
if (q->bands == 0)


[PATCH] iproute2: sch_rr support in tc

2007-06-28 Thread PJ Waskiewicz
This patch applies on top of Patrick McHardy's RTNETLINK
patches to add nested compat attributes.  This is needed to maintain
ABI for sch_{rr|prio} in the kernel with respect to tc.  A new option,
namely multiqueue, was added to sch_prio and sch_rr.  This will allow
a user to turn multiqueue support on for sch_prio or sch_rr at loadtime.
Also, tc qdisc ls will display whether or not multiqueue is enabled on
that qdisc.  When in multiqueue mode, a user can specify a value of 0 for
bands, and the number of bands will be created to match the number of
queues on the device.

This patch is to support the new sch_rr (round-robin) qdisc being proposed
in NET for multiqueue network device support in the Linux network stack.
It uses q_prio.c as the template, since the qdiscs are nearly identical,
outside of the ->dequeue() routine.

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 include/linux/pkt_sched.h |9 +++
 tc/Makefile   |1 
 tc/q_prio.c   |   24 +++--
 tc/q_rr.c |  127 +
 4 files changed, 156 insertions(+), 5 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index d10f353..4f1531b 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -101,6 +101,15 @@ struct tc_prio_qopt
        __u8    priomap[TC_PRIO_MAX+1]; /* Map: logical priority -> PRIO band */
 };
 
+enum
+{
+TCA_PRIO_UNSPEC,
+TCA_PRIO_MQ,
+__TCA_PRIO_MAX
+};
+
+#define TCA_PRIO_MAX    (__TCA_PRIO_MAX - 1)
+
 /* TBF section */
 
 struct tc_tbf_qopt
diff --git a/tc/Makefile b/tc/Makefile
index b607b26..cadd6c0 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -9,6 +9,7 @@ TCMODULES += q_fifo.o
 TCMODULES += q_sfq.o
 TCMODULES += q_red.o
 TCMODULES += q_prio.o
+TCMODULES += q_rr.o
 TCMODULES += q_tbf.o
 TCMODULES += q_cbq.o
 TCMODULES += q_netem.so
diff --git a/tc/q_prio.c b/tc/q_prio.c
index d696e1b..6883edb 100644
--- a/tc/q_prio.c
+++ b/tc/q_prio.c
@@ -29,7 +29,7 @@
 
 static void explain(void)
 {
-   fprintf(stderr, "Usage: ... prio bands NUMBER priomap P1 P2...\n");
+   fprintf(stderr, "Usage: ... prio bands NUMBER priomap P1 P2...[multiqueue]\n");
 }
 
 #define usage() return(-1)
@@ -40,6 +40,8 @@ static int prio_parse_opt(struct qdisc_util *qu, int argc, 
char **argv, struct n
int pmap_mode = 0;
int idx = 0;
struct tc_prio_qopt opt={3,{ 1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 
1, 1 }};
+   struct rtattr *nest;
+   unsigned char mq = 0;
 
while (argc > 0) {
if (strcmp(*argv, "bands") == 0) {
@@ -57,6 +59,8 @@ static int prio_parse_opt(struct qdisc_util *qu, int argc, 
char **argv, struct n
return -1;
}
pmap_mode = 1;
+   } else if (strcmp(*argv, "multiqueue") == 0) {
+   mq = 1;
} else if (strcmp(*argv, "help") == 0) {
explain();
return -1;
@@ -90,7 +94,10 @@ static int prio_parse_opt(struct qdisc_util *qu, int argc, 
char **argv, struct n
opt.priomap[idx] = opt.priomap[TC_PRIO_BESTEFFORT];
}
 */
-   addattr_l(n, 1024, TCA_OPTIONS, &opt, sizeof(opt));
+   nest = addattr_nest_compat(n, 1024, TCA_OPTIONS, &opt, sizeof(opt));
+   if (mq)
+   addattr_l(n, 1024, TCA_PRIO_MQ, NULL, 0);
+   addattr_nest_compat_end(n, nest);
return 0;
 }
 
@@ -98,16 +105,23 @@ int prio_print_opt(struct qdisc_util *qu, FILE *f, struct 
rtattr *opt)
 {
int i;
struct tc_prio_qopt *qopt;
+   struct rtattr *tb[TCA_PRIO_MAX+1];
 
if (opt == NULL)
return 0;
 
-   if (RTA_PAYLOAD(opt)  < sizeof(*qopt))
-   return -1;
-   qopt = RTA_DATA(opt);
+   if (parse_rtattr_nested_compat(tb, TCA_PRIO_MAX, opt, qopt,
+   sizeof(*qopt)))
+return -1;
+
fprintf(f, "bands %u priomap ", qopt->bands);
for (i=0; i<=TC_PRIO_MAX; i++)
fprintf(f, " %d", qopt->priomap[i]);
+
+   if (tb[TCA_PRIO_MQ])
+   fprintf(f, " multiqueue: %s ",
+   *(unsigned char *)RTA_DATA(tb[TCA_PRIO_MQ]) ? "on" : "off");
+
return 0;
 }
 
diff --git a/tc/q_rr.c b/tc/q_rr.c
new file mode 100644
index 0000000..9335c47
--- /dev/null
+++ b/tc/q_rr.c
@@ -0,0 +1,127 @@
+/*
+ * q_rr.c  RR.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors: 

[PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API

2007-06-28 Thread PJ Waskiewicz
Updated: Fixed allocation of subqueues in alloc_netdev_mq() to
allocate all subqueues, not num - 1.

Added checks for netif_subqueue_stopped() to netpoll,
pktgen, and software device dev_queue_xmit().  This will ensure
external events to these subsystems will be handled correctly if
a subqueue is shut down.

Add the multiqueue hardware device support API to the core network
stack.  Allow drivers to allocate multiple queues and manage them
at the netdev level if they choose to do so.

Added a new field to sk_buff, namely queue_mapping, for drivers to
know which tx_ring to select based on OS classification of the flow.

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 include/linux/etherdevice.h |3 +-
 include/linux/netdevice.h   |   62 ++-
 include/linux/skbuff.h  |4 ++-
 net/core/dev.c  |   27 +--
 net/core/netpoll.c  |8 +++---
 net/core/pktgen.c   |   10 +--
 net/core/skbuff.c   |3 ++
 net/ethernet/eth.c  |9 +++---
 8 files changed, 104 insertions(+), 22 deletions(-)

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index f48eb89..b3fbb54 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -39,7 +39,8 @@ extern void   eth_header_cache_update(struct hh_cache 
*hh, struct net_device *dev
 extern int eth_header_cache(struct neighbour *neigh,
 struct hh_cache *hh);
 
-extern struct net_device *alloc_etherdev(int sizeof_priv);
+extern struct net_device *alloc_etherdev_mq(int sizeof_priv, int queue_count);
+#define alloc_etherdev(sizeof_priv) alloc_etherdev_mq(sizeof_priv, 1)
 
 /**
  * is_zero_ether_addr - Determine if give Ethernet address is all zeros.
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2c0cc19..7078745 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -108,6 +108,14 @@ struct wireless_dev;
 #define MAX_HEADER (LL_MAX_HEADER + 48)
 #endif
 
+struct net_device_subqueue
+{
+   /* Give a control state for each queue.  This struct may contain
+* per-queue locks in the future.
+*/
+   unsigned long   state;
+};
+
 /*
  * Network device statistics. Akin to the 2.0 ether stats but
  * with byte counters.
@@ -331,6 +339,7 @@ struct net_device
 #define NETIF_F_VLAN_CHALLENGED 1024    /* Device cannot handle VLAN packets */
 #define NETIF_F_GSO             2048    /* Enable software GSO. */
 #define NETIF_F_LLTX            4096    /* LockLess TX */
+#define NETIF_F_MULTI_QUEUE     16384   /* Has multiple TX/RX queues */
 
/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT  16
@@ -557,6 +566,10 @@ struct net_device
 
/* rtnetlink link ops */
const struct rtnl_link_ops *rtnl_link_ops;
+
+   /* The TX queue control structures */
+   int egress_subqueue_count;
+   struct net_device_subqueue  egress_subqueue[0];
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
@@ -719,6 +732,48 @@ static inline int netif_running(const struct net_device 
*dev)
return test_bit(__LINK_STATE_START, &dev->state);
 }
 
+/*
+ * Routines to manage the subqueues on a device.  We only need start,
+ * stop, and a check if it's stopped.  All other device management is
+ * done at the overall netdevice level.
+ * Also test the device if we're multiqueue.
+ */
+static inline void netif_start_subqueue(struct net_device *dev, u16 queue_index)
+{
+   clear_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_stop_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+   if (netpoll_trap())
+   return;
+#endif
+   set_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline int netif_subqueue_stopped(const struct net_device *dev,
+u16 queue_index)
+{
+   return test_bit(__LINK_STATE_XOFF,
+   &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+   if (netpoll_trap())
+   return;
+#endif
+   if (test_and_clear_bit(__LINK_STATE_XOFF,
+  &dev->egress_subqueue[queue_index].state))
+   __netif_schedule(dev);
+}
+
+static inline int netif_is_multiqueue(const struct net_device *dev)
+{
+   return (!!(NETIF_F_MULTI_QUEUE & dev->features));
+}
 
 /* Use this variant when it is known for sure that it
  * is executing from interrupt context.
@@ -1009,8 +1064,11 @@ static inline void netif_tx_disable(struct net_device 
*dev)
 extern void             ether_setup(struct net_device *dev);
 
 /* Support for loadable net-drivers */
-extern struct net_device *alloc_netde

[PATCH 1/3] NET: [DOC] Multiqueue hardware support documentation

2007-06-28 Thread PJ Waskiewicz
Add a brief howto to Documentation/networking for multiqueue.  It
explains how to use the multiqueue API in a driver to support
multiqueue paths from the stack, as well as the qdiscs to use for
feeding a multiqueue device.

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 Documentation/networking/multiqueue.txt |  111 +++
 1 files changed, 111 insertions(+), 0 deletions(-)

diff --git a/Documentation/networking/multiqueue.txt 
b/Documentation/networking/multiqueue.txt
new file mode 100644
index 0000000..00b60cc
--- /dev/null
+++ b/Documentation/networking/multiqueue.txt
@@ -0,0 +1,111 @@
+
+   HOWTO for multiqueue network device support
+   ===========================================
+
+Section 1: Base driver requirements for implementing multiqueue support
+Section 2: Qdisc support for multiqueue devices
+Section 3: Brief howto using PRIO or RR for multiqueue devices
+
+
+Intro: Kernel support for multiqueue devices
+--------------------------------------------
+
+Kernel support for multiqueue devices is only an API that is presented to the
+netdevice layer for base drivers to implement.  This feature is part of the
+core networking stack, and all network devices will be running on the
+multiqueue-aware stack.  If a base driver only has one queue, then these
+changes are transparent to that driver.
+
+
+Section 1: Base driver requirements for implementing multiqueue support
+-----------------------------------------------------------------------
+
+Base drivers are required to use the new alloc_etherdev_mq() or
+alloc_netdev_mq() functions to allocate the subqueues for the device.  The
+underlying kernel API will take care of the allocation and deallocation of
+the subqueue memory, as well as netdev configuration of where the queues
+exist in memory.
+
+The base driver will also need to manage the queues as it does the global
+netdev->queue_lock today.  Therefore base drivers should use the
+netif_{start|stop|wake}_subqueue() functions to manage each queue while the
+device is still operational.  netdev->queue_lock is still used when the device
+comes online or when it's completely shut down (unregister_netdev(), etc.).
+
+Finally, the base driver should indicate that it is a multiqueue device.  The
+feature flag NETIF_F_MULTI_QUEUE should be added to the netdev->features
+bitmap on device initialization.  Below is an example from e1000:
+
+#ifdef CONFIG_E1000_MQ
+   if ( (adapter->hw.mac.type == e1000_82571) ||
+(adapter->hw.mac.type == e1000_82572) ||
+(adapter->hw.mac.type == e1000_80003es2lan))
+   netdev->features |= NETIF_F_MULTI_QUEUE;
+#endif
+
+
+Section 2: Qdisc support for multiqueue devices
+-----------------------------------------------
+
+Currently two qdiscs support multiqueue devices: a new round-robin qdisc,
+sch_rr, and sch_prio.  The qdisc is responsible for classifying the skb's to
+bands and queues, and will store the queue mapping into skb->queue_mapping.
+Use this field in the base driver to determine which queue to send the skb
+to.
+
+sch_rr has been added for hardware that doesn't want scheduling policies from
+software, so it's a straight round-robin qdisc.  It uses the same syntax and
+classification priomap that sch_prio uses, so it should be intuitive to
+configure for people who've used sch_prio.
+
+The PRIO qdisc naturally plugs into a multiqueue device.  If PRIO has been
+built with NET_SCH_MULTIQUEUE, then upon load, it will make sure the number of
+bands requested is equal to the number of queues on the hardware.  If they
+are equal, it sets a one-to-one mapping up between the queues and bands.  If
+they're not equal, it will not load the qdisc.  This is the same behavior
+for RR.  Once the association is made, any skb that is classified will have
+skb->queue_mapping set, which will allow the driver to properly queue skb's
+to multiple queues.
+
+
+Section 3: Brief howto using PRIO and RR for multiqueue devices
+---------------------------------------------------------------
+
+The userspace command 'tc,' part of the iproute2 package, is used to configure
+qdiscs.  To add the PRIO qdisc to your network device, assuming the device is
+called eth0, run the following command:
+
+# tc qdisc add dev eth0 root handle 1: prio bands 4 multiqueue
+
+This will create 4 bands, 0 being highest priority, and associate those bands
+to the queues on your NIC.  Assuming eth0 has 4 Tx queues, the band mapping
+would look like:
+
+band 0 => queue 0
+band 1 => queue 1
+band 2 => queue 2
+band 3 => queue 3
+
+Traffic will begin flowing through each queue if your TOS values are assigning
+traffic across the various bands.  For example, ssh traffic will always try to
+go out band 0 based on TOS -> Linux priority conversion (realtime traffic),
+so it will be sent out queue 0.  ICMP traffic (pings) falls into the "normal"
+traffic classification, which is band 1.  Therefore pings 

[PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-28 Thread PJ Waskiewicz
Updated: Cleaned up Kconfig options for multiqueue.  Cleaned up
sch_rr and sch_prio multiqueue handling.  Added nested compat netlink
attributes for the new options.  Allow a 0 band option for prio and rr when
in multiqueue mode so it defaults to the number of queues on the NIC.

Add the new sch_rr qdisc for multiqueue network device support.
Allow sch_prio and sch_rr to be compiled with or without multiqueue
hardware support.

sch_rr is part of sch_prio, and is referenced from MODULE_ALIAS.  This
was done since sch_prio and sch_rr only differ in their dequeue routine.

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 include/linux/pkt_sched.h |9 +++
 net/sched/Kconfig |   23 +++
 net/sched/sch_prio.c  |  147 +
 3 files changed, 166 insertions(+), 13 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index d10f353..268c515 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -101,6 +101,15 @@ struct tc_prio_qopt
        __u8    priomap[TC_PRIO_MAX+1]; /* Map: logical priority -> PRIO band */
 };
 
+enum
+{
+   TCA_PRIO_UNSPEC,
+   TCA_PRIO_MQ,
+   __TCA_PRIO_MAX
+};
+
+#define TCA_PRIO_MAX    (__TCA_PRIO_MAX - 1)
+
 /* TBF section */
 
 struct tc_tbf_qopt
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 475df84..65ee9e7 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -111,6 +111,29 @@ config NET_SCH_PRIO
  To compile this code as a module, choose M here: the
  module will be called sch_prio.
 
+config NET_SCH_RR
+   tristate "Multi Band Round Robin Queuing (RR)"
+   select NET_SCH_PRIO
+   ---help---
+ Say Y here if you want to use an n-band round robin packet
+ scheduler.
+
+ The module uses sch_prio for its framework and is aliased as
+ sch_rr, so it will load sch_prio, although it is referred
+ to using sch_rr.
+
+config NET_SCH_MULTIQUEUE
+   bool "Multiple hardware queue support"
+   ---help---
+ Say Y here if you want to allow supported qdiscs to assign flows to
+ multiple hardware queues on an ethernet device.  This will
+ still work on devices with 1 queue.
+
+ Current qdiscs supporting this feature are NET_SCH_PRIO and
+ NET_SCH_RR.
+
+ Most people will say N here.
+
 config NET_SCH_RED
tristate "Random Early Detection (RED)"
---help---
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 6d7542c..2ceba92 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -40,9 +40,13 @@
 struct prio_sched_data
 {
int bands;
+   int curband; /* for round-robin */
struct tcf_proto *filter_list;
u8  prio2band[TC_PRIO_MAX+1];
struct Qdisc *queues[TCQ_PRIO_BANDS];
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
+   unsigned char mq;
+#endif
 };
 
 
@@ -70,14 +74,34 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int 
*qerr)
 #endif
if (TC_H_MAJ(band))
band = 0;
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
+   if (q->mq)
+   skb->queue_mapping = 
+   q->prio2band[band&TC_PRIO_MAX];
+   else
+   skb->queue_mapping = 0;
+#endif
return q->queues[q->prio2band[band&TC_PRIO_MAX]];
}
band = res.classid;
}
band = TC_H_MIN(band) - 1;
-   if (band >= q->bands)
+   if (band >= q->bands) {
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
+   if (q->mq)
+   skb->queue_mapping = q->prio2band[0];
+   else
+   skb->queue_mapping = 0;
+#endif
return q->queues[q->prio2band[0]];
+   }
 
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
+   if (q->mq)
+   skb->queue_mapping = band;
+   else
+   skb->queue_mapping = 0;
+#endif
return q->queues[band];
 }
 
@@ -144,17 +168,65 @@ prio_dequeue(struct Qdisc* sch)
struct Qdisc *qdisc;
 
for (prio = 0; prio < q->bands; prio++) {
-   qdisc = q->queues[prio];
-   skb = qdisc->dequeue(qdisc);
-   if (skb) {
-   sch->q.qlen--;
-   return skb;
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
+   /* Check if the target subqueue is available before
+* pulling an skb.  This way we avoid excessive requeues
+* for slower queues.
+*/
+   if (!netif_subqueue_stopped(sch->dev, (q->mq ? prio : 0))) {
+#endif
+   qdisc = q->queues[prio];
+   skb = qdisc->dequeue(qdisc);
+   if (skb) {
+   sch->q.qlen--;
+   return skb;
+   }
+#ifdef C

[PATCH] NET: Multiple queue hardware support

2007-06-28 Thread PJ Waskiewicz
Please consider these patches for 2.6.23 inclusion.

Updates since the last submission:

1. Fixed alloc_netdev_mq() queue_count bug.

2. Fixed the TCA_PRIO_MQ options layout.

3. Protected sch_prio and sch_rr multiqueue code with NET_SCH_MULTIQUEUE.

4. Added RTA_{GET|PUT}_FLAG in place of RTA_DATA for passing multiqueue
   options to and from the qdisc.

5. Allow sch_prio and sch_rr to take 0 bands when in multiqueue mode.  This
   will set q->bands to dev->egress_subqueue_count; added this also to the
   kernel doc.
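
As a rough illustration of item 5, the band handling in multiqueue mode can be sketched as the function below. This is a simplified model, not the prio_tune() code: `resolve_bands()` is a hypothetical helper, `queue_count` stands in for dev->egress_subqueue_count, and the non-multiqueue path is assumed to pass the requested value through without the range validation the real qdisc performs.

```c
#include <errno.h>

/*
 * Return the number of bands to use, or -EINVAL if the request cannot
 * be mapped one-to-one onto the device's queues.
 */
static int resolve_bands(int requested, int queue_count, int mq)
{
	if (!mq)
		return requested;    /* assumption: validated elsewhere */
	if (requested == 0)
		return queue_count;  /* 0 means "match the hardware" */
	if (requested != queue_count)
		return -EINVAL;      /* bands and queues must line up */
	return requested;
}
```

On a device with 4 Tx queues, "bands 0 multiqueue" and "bands 4 multiqueue" both yield 4 bands, while "bands 3 multiqueue" is rejected.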

This patchset is an updated version of previous multiqueue network device
support patches.  The general approach of introducing a new API for multiqueue
network devices to register with the stack has remained.  The changes include
adding a round-robin qdisc, heavily based on sch_prio, which will allow
queueing to hardware with no OS-enforced queuing policy.  sch_prio still has
the multiqueue code in it, but has a Kconfig option to compile it out of the
qdisc.  This allows people with hardware containing scheduling policies to
use sch_rr (round-robin), and others without scheduling policies in hardware
to continue using sch_prio if they wish to have some notion of scheduling
priority.

The patches being sent are split into Documentation, Qdisc changes, and
core stack changes.

The patches to iproute2 for tc will be sent separately, to support sch_rr.

-- 
PJ Waskiewicz <[EMAIL PROTECTED]>
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] iproute2: sch_rr support in tc

2007-06-23 Thread PJ Waskiewicz
Updated: This patch applies on top of Patrick McHardy's RTNETLINK
patches to add nested compat attributes.  This is needed to maintain
ABI for sch_{rr|prio} in the kernel with respect to tc.  A new option,
namely multiqueue, was added to sch_prio and sch_rr.  This will allow
a user to turn multiqueue support on for sch_prio or sch_rr at loadtime.
Also, tc qdisc ls will display whether or not multiqueue is enabled on
that qdisc.

This patch is to support the new sch_rr (round-robin) qdisc being proposed
in NET for multiqueue network device support in the Linux network stack.
It uses q_prio.c as the template, since the qdiscs are nearly identical,
outside of the ->dequeue() routine.

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 include/linux/pkt_sched.h |2 -
 tc/Makefile   |1 
 tc/q_prio.c   |   15 -
 tc/q_rr.c |  126 +
 4 files changed, 138 insertions(+), 6 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index fa0ec53..ec3a9a5 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -104,7 +104,7 @@ struct tc_prio_qopt
 enum
 {
TCA_PRIO_UNSPEC,
-   TCA_PRIO_TEST,
+   TCA_PRIO_MQ,
__TCA_PRIO_MAX
 };
 
diff --git a/tc/Makefile b/tc/Makefile
index 9d618ff..62e2697 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -9,6 +9,7 @@ TCMODULES += q_fifo.o
 TCMODULES += q_sfq.o
 TCMODULES += q_red.o
 TCMODULES += q_prio.o
+TCMODULES += q_rr.o
 TCMODULES += q_tbf.o
 TCMODULES += q_cbq.o
 TCMODULES += f_rsvp.o
diff --git a/tc/q_prio.c b/tc/q_prio.c
index 4934416..b34bc05 100644
--- a/tc/q_prio.c
+++ b/tc/q_prio.c
@@ -29,7 +29,7 @@
 
 static void explain(void)
 {
-   fprintf(stderr, "Usage: ... prio bands NUMBER priomap P1 P2...\n");
+   fprintf(stderr, "Usage: ... prio bands NUMBER priomap P1 P2...[multiqueue]\n");
 }
 
 #define usage() return(-1)
@@ -41,6 +41,7 @@ static int prio_parse_opt(struct qdisc_util *qu, int argc, char **argv, struct n
int idx = 0;
struct tc_prio_qopt opt={3,{ 1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 }};
struct rtattr *nest;
+   unsigned char mq = 0;
 
while (argc > 0) {
if (strcmp(*argv, "bands") == 0) {
@@ -58,6 +59,8 @@ static int prio_parse_opt(struct qdisc_util *qu, int argc, char **argv, struct n
return -1;
}
pmap_mode = 1;
+   } else if (strcmp(*argv, "multiqueue") == 0) {
+   mq = 1;
} else if (strcmp(*argv, "help") == 0) {
explain();
return -1;
@@ -92,7 +95,7 @@ static int prio_parse_opt(struct qdisc_util *qu, int argc, char **argv, struct n
}
 */
nest = addattr_nest_compat(n, 1024, TCA_OPTIONS, &opt, sizeof(opt));
-   addattr32(n, 1024, TCA_PRIO_TEST, 123);
+   addattr32(n, 1024, TCA_PRIO_MQ, mq);
addattr_nest_compat_end(n, nest);
return 0;
 }
@@ -106,15 +109,17 @@ int prio_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
if (opt == NULL)
return 0;
 
-   if (parse_rtattr_nested_compat(tb, TCA_PRIO_MAX, opt, (void *)&qopt, sizeof(*qopt)))
+   if (parse_rtattr_nested_compat(tb, TCA_PRIO_MAX, opt, qopt, sizeof(*qopt)))
return -1;
 
fprintf(f, "bands %u priomap ", qopt->bands);
for (i=0; i<=TC_PRIO_MAX; i++)
fprintf(f, " %d", qopt->priomap[i]);
 
-   if (tb[TCA_PRIO_TEST])
-   fprintf(f, " TCA_PRIO_TEST: %u ", *(__u32 *)RTA_DATA(tb[TCA_PRIO_TEST]));
+   if (tb[TCA_PRIO_MQ])
+   fprintf(f, " multiqueue: %s ",
+   *(unsigned char *)RTA_DATA(tb[TCA_PRIO_MQ]) ? "on" : "off");
+
return 0;
 }
 
diff --git a/tc/q_rr.c b/tc/q_rr.c
new file mode 100644
index 000..f74f4d5
--- /dev/null
+++ b/tc/q_rr.c
@@ -0,0 +1,126 @@
+/*
+ * q_rr.c  RR.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:    PJ Waskiewicz, <[EMAIL PROTECTED]>
+ * Original Authors:   Alexey Kuznetsov, <[EMAIL PROTECTED]> (from PRIO)
+ *
+ * Changes:
+ *
+ * Ole Husgaard <[EMAIL PROTECTED]>: 990513: prio2band map was always reset.
+ * J Hadi Salim <[EMAIL PROTECTED]>: 990609: priomap fix.
+ */
+
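+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <syslog.h>
+#include <fcntl.h>
+#include <sys/socket.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+#include <string.h>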
+
+#include "utils.h"
+#include "tc_util.h"
+
+static voi

[PATCH 1/3] NET: [DOC] Multiqueue hardware support documentation

2007-06-23 Thread PJ Waskiewicz
Add a brief howto to Documentation/networking for multiqueue.  It
explains how to use the multiqueue API in a driver to support
multiqueue paths from the stack, as well as the qdiscs to use for
feeding a multiqueue device.

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 Documentation/networking/multiqueue.txt |  106 +++
 1 files changed, 106 insertions(+), 0 deletions(-)

diff --git a/Documentation/networking/multiqueue.txt b/Documentation/networking/multiqueue.txt
new file mode 100644
index 000..b7ede56
--- /dev/null
+++ b/Documentation/networking/multiqueue.txt
@@ -0,0 +1,106 @@
+
+   HOWTO for multiqueue network device support
+   ===========================================
+
+Section 1: Base driver requirements for implementing multiqueue support
+Section 2: Qdisc support for multiqueue devices
+Section 3: Brief howto using PRIO or RR for multiqueue devices
+
+
+Intro: Kernel support for multiqueue devices
+--------------------------------------------
+
+Kernel support for multiqueue devices is only an API that is presented to the
+netdevice layer for base drivers to implement.  This feature is part of the
+core networking stack, and all network devices will be running on the
+multiqueue-aware stack.  If a base driver only has one queue, then these
+changes are transparent to that driver.
+
+
+Section 1: Base driver requirements for implementing multiqueue support
+-----------------------------------------------------------------------
+
+Base drivers are required to use the new alloc_etherdev_mq() or
+alloc_netdev_mq() functions to allocate the subqueues for the device.  The
+underlying kernel API will take care of the allocation and deallocation of
+the subqueue memory, as well as netdev configuration of where the queues
+exist in memory.
+
+The base driver will also need to manage the queues as it does the global
+netdev->queue_lock today.  Therefore base drivers should use the
+netif_{start|stop|wake}_subqueue() functions to manage each queue while the
+device is still operational.  netdev->queue_lock is still used when the device
+comes online or when it's completely shut down (unregister_netdev(), etc.).
+
+Finally, the base driver should indicate that it is a multiqueue device.  The
+feature flag NETIF_F_MULTI_QUEUE should be added to the netdev->features
+bitmap on device initialization.  Below is an example from e1000:
+
+#ifdef CONFIG_E1000_MQ
+   if ( (adapter->hw.mac.type == e1000_82571) ||
+(adapter->hw.mac.type == e1000_82572) ||
+(adapter->hw.mac.type == e1000_80003es2lan))
+   netdev->features |= NETIF_F_MULTI_QUEUE;
+#endif
+
+
+Section 2: Qdisc support for multiqueue devices
+-----------------------------------------------
+
+Currently two qdiscs support multiqueue devices: a new round-robin qdisc,
+sch_rr, and sch_prio.  The qdisc is responsible for classifying the skb's to
+bands and queues, and will store the queue mapping into skb->queue_mapping.
+Use this field in the base driver to determine which queue to send the skb
+to.
+
+sch_rr has been added for hardware that doesn't want scheduling policies from
+software, so it's a straight round-robin qdisc.  It uses the same syntax and
+classification priomap that sch_prio uses, so it should be intuitive to
+configure for people who've used sch_prio.
+
+The PRIO qdisc naturally plugs into a multiqueue device.  If PRIO has been
+built with NET_SCH_PRIO_MQ, then upon load, it will make sure the number of
+bands requested is equal to the number of queues on the hardware.  If they
+are equal, it sets a one-to-one mapping up between the queues and bands.  If
+they're not equal, it will not load the qdisc.  This is the same behavior
+for RR.  Once the association is made, any skb that is classified will have
+skb->queue_mapping set, which will allow the driver to properly queue skb's
+to multiple queues.
+
+
+Section 3: Brief howto using PRIO and RR for multiqueue devices
+---------------------------------------------------------------
+
+The userspace command 'tc', part of the iproute2 package, is used to configure
+qdiscs.  To add the PRIO qdisc to your network device, assuming the device is
+called eth0, run the following command:
+
+# tc qdisc add dev eth0 root handle 1: prio bands 4 multiqueue
+
+This will create 4 bands, 0 being highest priority, and associate those bands
+to the queues on your NIC.  Assuming eth0 has 4 Tx queues, the band mapping
+would look like:
+
+band 0 => queue 0
+band 1 => queue 1
+band 2 => queue 2
+band 3 => queue 3
+
+Traffic will begin flowing through each queue if your TOS values are assigning
+traffic across the various bands.  For example, ssh traffic will always try to
+go out band 0 based on TOS -> Linux priority conversion (realtime traffic),
+so it will be sent out queue 0.  ICMP traffic (pings) falls into the "normal"
+traffic classification, which is band 1.  Therefore pings 

[PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API

2007-06-23 Thread PJ Waskiewicz
Updated: Added checks for netif_subqueue_stopped() to netpoll,
pktgen, and software device dev_queue_xmit().  This will ensure
external events to these subsystems will be handled correctly if
a subqueue is shut down.

Add the multiqueue hardware device support API to the core network
stack.  Allow drivers to allocate multiple queues and manage them
at the netdev level if they choose to do so.

Added a new field to sk_buff, namely queue_mapping, for drivers to
know which tx_ring to select based on OS classification of the flow.
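
To illustrate how a driver might consume queue_mapping together with the
per-subqueue state, here is a small userspace model.  It is a sketch under
stated assumptions, not kernel code: every model_* name is hypothetical, a
plain bit mask stands in for the kernel's atomic set_bit()/clear_bit()/
test_bit() on __LINK_STATE_XOFF, and a fixed-size array replaces the
egress_subqueue[0] trailing array.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAX_QUEUES 8
#define MODEL_STATE_XOFF 0x1UL	/* models __LINK_STATE_XOFF */

struct model_subqueue {
	unsigned long state;	/* per-queue control state */
};

struct model_mq_dev {
	int egress_subqueue_count;
	struct model_subqueue egress_subqueue[MODEL_MAX_QUEUES];
};

struct model_skb {
	uint16_t queue_mapping;	/* set by the qdisc during classification */
};

/* Models netif_stop_subqueue(): mark one Tx queue as unable to accept skbs. */
void model_stop_subqueue(struct model_mq_dev *dev, uint16_t q)
{
	dev->egress_subqueue[q].state |= MODEL_STATE_XOFF;
}

/* Models netif_start_subqueue(): allow the queue to accept skbs again. */
void model_start_subqueue(struct model_mq_dev *dev, uint16_t q)
{
	dev->egress_subqueue[q].state &= ~MODEL_STATE_XOFF;
}

/* Models netif_subqueue_stopped(). */
bool model_subqueue_stopped(const struct model_mq_dev *dev, uint16_t q)
{
	return (dev->egress_subqueue[q].state & MODEL_STATE_XOFF) != 0;
}

/*
 * Pick the Tx ring for an skb.  Returns the ring index, or -1 when the
 * mapping is out of range or the target subqueue is stopped (the caller
 * would then report a busy condition and requeue).
 */
int model_select_tx_ring(const struct model_mq_dev *dev,
			 const struct model_skb *skb)
{
	uint16_t q = skb->queue_mapping;

	if (q >= dev->egress_subqueue_count)
		return -1;
	if (model_subqueue_stopped(dev, q))
		return -1;
	return q;
}
```

The point of the per-queue state is that stopping queue 2 leaves the other
queues usable, which a single device-wide stop flag cannot express.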

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 include/linux/etherdevice.h |3 +-
 include/linux/netdevice.h   |   62 ++-
 include/linux/skbuff.h  |4 ++-
 net/core/dev.c  |   27 +--
 net/core/netpoll.c  |8 +++---
 net/core/pktgen.c   |   10 +--
 net/core/skbuff.c   |3 ++
 net/ethernet/eth.c  |9 +++---
 8 files changed, 104 insertions(+), 22 deletions(-)

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index f48eb89..b3fbb54 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -39,7 +39,8 @@ extern void   eth_header_cache_update(struct hh_cache *hh, struct net_device *dev
 extern int eth_header_cache(struct neighbour *neigh,
 struct hh_cache *hh);
 
-extern struct net_device *alloc_etherdev(int sizeof_priv);
+extern struct net_device *alloc_etherdev_mq(int sizeof_priv, int queue_count);
+#define alloc_etherdev(sizeof_priv) alloc_etherdev_mq(sizeof_priv, 1)
 
 /**
  * is_zero_ether_addr - Determine if give Ethernet address is all zeros.
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e7913ee..6509eb4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -108,6 +108,14 @@ struct wireless_dev;
 #define MAX_HEADER (LL_MAX_HEADER + 48)
 #endif
 
+struct net_device_subqueue
+{
+   /* Give a control state for each queue.  This struct may contain
+* per-queue locks in the future.
+*/
+   unsigned long   state;
+};
+
 /*
  * Network device statistics. Akin to the 2.0 ether stats but
  * with byte counters.
@@ -325,6 +333,7 @@ struct net_device
 #define NETIF_F_VLAN_CHALLENGED 1024 /* Device cannot handle VLAN packets */
 #define NETIF_F_GSO 2048 /* Enable software GSO. */
 #define NETIF_F_LLTX 4096 /* LockLess TX */
+#define NETIF_F_MULTI_QUEUE 16384 /* Has multiple TX/RX queues */
 
/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT  16
@@ -543,6 +552,10 @@ struct net_device
 
/* rtnetlink link ops */
const struct rtnl_link_ops *rtnl_link_ops;
+
+   /* The TX queue control structures */
+   int egress_subqueue_count;
+   struct net_device_subqueue  egress_subqueue[0];
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
@@ -705,6 +718,48 @@ static inline int netif_running(const struct net_device *dev)
return test_bit(__LINK_STATE_START, &dev->state);
 }
 
+/*
+ * Routines to manage the subqueues on a device.  We only need start
+ * stop, and a check if it's stopped.  All other device management is
+ * done at the overall netdevice level.
+ * Also test the device if we're multiqueue.
+ */
+static inline void netif_start_subqueue(struct net_device *dev, u16 queue_index)
+{
+   clear_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_stop_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+   if (netpoll_trap())
+   return;
+#endif
+   set_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline int netif_subqueue_stopped(const struct net_device *dev,
+ u16 queue_index)
+{
+   return test_bit(__LINK_STATE_XOFF,
+   &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+   if (netpoll_trap())
+   return;
+#endif
+   if (test_and_clear_bit(__LINK_STATE_XOFF,
+  &dev->egress_subqueue[queue_index].state))
+   __netif_schedule(dev);
+}
+
+static inline int netif_is_multiqueue(const struct net_device *dev)
+{
+   return (!!(NETIF_F_MULTI_QUEUE & dev->features));
+}
 
 /* Use this variant when it is known for sure that it
  * is executing from interrupt context.
@@ -995,8 +1050,11 @@ static inline void netif_tx_disable(struct net_device *dev)
 extern void ether_setup(struct net_device *dev);
 
 /* Support for loadable net-drivers */
-extern struct net_device *alloc_netdev(int sizeof_priv, const char *name,
-  void (*setup)(st

[PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-23 Thread PJ Waskiewicz
Updated: This patch applies on top of Patrick McHardy's RTNETLINK
nested compat attribute patches.  These are required to preserve
ABI for iproute2 when working with the multiqueue qdiscs.

Add the new sch_rr qdisc for multiqueue network device support.
Allow sch_prio and sch_rr to be compiled with or without multiqueue hardware
support.

sch_rr is part of sch_prio, and is referenced from MODULE_ALIAS.  This
was done since sch_prio and sch_rr only differ in their dequeue routine.
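
The difference between the two dequeue routines can be modelled in userspace
as below.  This is a sketch, not the dequeue code from the patch: struct
band_model and both functions are hypothetical, per-band packet counters
stand in for the child qdiscs, and the way curband advances is an assumption
about how a round-robin cursor would typically be kept.

```c
#define MODEL_BANDS 4

struct band_model {
	int bands;			/* number of bands in use */
	int curband;			/* round-robin cursor */
	int qlen[MODEL_BANDS];		/* packets waiting in each band */
};

/*
 * Strict priority: always serve the lowest-numbered non-empty band.
 * Returns the band served, or -1 if every band is empty.
 */
int model_prio_dequeue(struct band_model *q)
{
	for (int prio = 0; prio < q->bands; prio++) {
		if (q->qlen[prio] > 0) {
			q->qlen[prio]--;
			return prio;
		}
	}
	return -1;
}

/*
 * Round-robin: start at the cursor, serve the first non-empty band found,
 * and leave the cursor just past it so the next call starts elsewhere.
 * Returns the band served, or -1 if every band is empty.
 */
int model_rr_dequeue(struct band_model *q)
{
	for (int i = 0; i < q->bands; i++) {
		int band = q->curband;

		q->curband = (q->curband + 1) % q->bands;
		if (q->qlen[band] > 0) {
			q->qlen[band]--;
			return band;
		}
	}
	return -1;
}
```

With bands holding 2, 1, 0 and 1 packets, the priority variant drains band 0
completely before touching band 1, whereas the round-robin variant visits
bands 0, 1, 3 and then returns to band 0.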

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 include/linux/pkt_sched.h |4 +-
 net/sched/Kconfig |   30 +
 net/sched/sch_generic.c   |3 +
 net/sched/sch_prio.c  |  106 -
 4 files changed, 129 insertions(+), 14 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index 09808b7..ec3a9a5 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -103,8 +103,8 @@ struct tc_prio_qopt
 
 enum
 {
-   TCA_PRIO_UNPSEC,
-   TCA_PRIO_TEST,
+   TCA_PRIO_UNSPEC,
+   TCA_PRIO_MQ,
__TCA_PRIO_MAX
 };
 
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 475df84..7f14fa6 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -102,8 +102,16 @@ config NET_SCH_ATM
  To compile this code as a module, choose M here: the
  module will be called sch_atm.
 
+config NET_SCH_BANDS
+bool "Multi Band Queueing (PRIO and RR)"
+---help---
+  Say Y here if you want to use n-band multiqueue packet
+  schedulers.  These include a priority-based scheduler and
+  a round-robin scheduler.
+
 config NET_SCH_PRIO
tristate "Multi Band Priority Queueing (PRIO)"
+   depends on NET_SCH_BANDS
---help---
  Say Y here if you want to use an n-band priority queue packet
  scheduler.
@@ -111,6 +119,28 @@ config NET_SCH_PRIO
  To compile this code as a module, choose M here: the
  module will be called sch_prio.
 
+config NET_SCH_RR
+   tristate "Multi Band Round Robin Queuing (RR)"
+   depends on NET_SCH_BANDS
+   select NET_SCH_PRIO
+   ---help---
+ Say Y here if you want to use an n-band round robin packet
+ scheduler.
+
+ The module uses sch_prio for its framework and is aliased as
+ sch_rr, so it will load sch_prio, although it is referred
+ to using sch_rr.
+
+config NET_SCH_BANDS_MQ
+   bool "Multiple hardware queue support"
+   depends on NET_SCH_BANDS
+   ---help---
+ Say Y here if you want to allow the PRIO and RR qdiscs to assign
+ flows to multiple hardware queues on an ethernet device.  This
+ will still work on devices with 1 queue.
+
+ Most people will say N here.
+
 config NET_SCH_RED
tristate "Random Early Detection (RED)"
---help---
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 9461e8a..203d5c4 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -168,7 +168,8 @@ static inline int qdisc_restart(struct net_device *dev)
spin_unlock(&dev->queue_lock);
 
ret = NETDEV_TX_BUSY;
-   if (!netif_queue_stopped(dev))
+   if (!netif_queue_stopped(dev) &&
+   !netif_subqueue_stopped(dev, skb->queue_mapping))
/* churn baby churn .. */
ret = dev_hard_start_xmit(skb, dev);
 
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 40a13e8..8a716f0 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -40,9 +40,11 @@
 struct prio_sched_data
 {
int bands;
+   int curband; /* for round-robin */
struct tcf_proto *filter_list;
u8  prio2band[TC_PRIO_MAX+1];
struct Qdisc *queues[TCQ_PRIO_BANDS];
+   unsigned char mq;
 };
 
 
@@ -70,14 +72,28 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
 #endif
if (TC_H_MAJ(band))
band = 0;
+   if (q->mq)
+   skb->queue_mapping = 
+   q->prio2band[band&TC_PRIO_MAX];
+   else
+   skb->queue_mapping = 0;
return q->queues[q->prio2band[band&TC_PRIO_MAX]];
}
band = res.classid;
}
band = TC_H_MIN(band) - 1;
-   if (band >= q->bands)
+   if (band >= q->bands) {
+   if (q->mq)
+   skb->queue_mapping = q->prio2band[0];
+   else
+   skb->queue_mapping = 0;
return q->queues[q->prio2band[0]];
+   }
 
+   if (q->mq)
+   skb->queue_mapping = band;
+   else
+   skb->queue_mapping = 0;
return q->queues[band];
 }
 
@@ -144,17 +160,57 @@ prio_dequeue(struct Qdisc* sch)
struct Qdisc *qdisc;
 
for (

[PATCH] NET: Multiple queue hardware support

2007-06-23 Thread PJ Waskiewicz
Please consider these patches for 2.6.23 inclusion.

These patches are built against Patrick McHardy's recently submitted
RTNETLINK nested compat attribute patches.  They're needed to preserve
ABI between sch_{rr|prio} and iproute2.

Updates since the last submission:

1. Added checks for netif_subqueue_stopped() to net/core/netpoll.c,
   net/core/pktgen.c, and to software device hard_start_xmit in
   dev_queue_xmit().

2. Removed TCA_PRIO_TEST and added TCA_PRIO_MQ for sch_prio and sch_rr.

3. Fixed dependency issues in net/sched/Kconfig with NET_SCH_RR.

4. Implemented the new nested compat attribute API for MQ in NET_SCH_PRIO
   and NET_SCH_RR.

5. Allow sch_rr and sch_prio to turn multiqueue hardware support on and off
   at loadtime.
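
The effect of the load-time multiqueue switch on classification can be
sketched as follows.  This models only the filterless path of
prio_classify() from the sch_prio patch in this series: the band comes from
the priomap, and queue_mapping follows the band only when multiqueue is on.
The struct and function names are hypothetical, and MODEL_TC_PRIO_MAX is a
local stand-in for TC_PRIO_MAX.

```c
#include <stdint.h>

#define MODEL_TC_PRIO_MAX 15	/* stand-in for TC_PRIO_MAX */

struct classify_model {
	unsigned char mq;	/* multiqueue enabled at load time? */
	uint8_t prio2band[MODEL_TC_PRIO_MAX + 1];
};

/*
 * Map a packet priority to a band and report the hardware queue through
 * *queue_mapping.  With multiqueue off, every packet maps to queue 0.
 */
int model_classify(const struct classify_model *q, uint32_t priority,
		   uint16_t *queue_mapping)
{
	int band = q->prio2band[priority & MODEL_TC_PRIO_MAX];

	*queue_mapping = q->mq ? (uint16_t)band : 0;
	return band;
}
```

Using the default priomap from tc (1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1),
priority 6 lands in band 0 and priority 1 in band 2; only the multiqueue
case carries that band through to queue_mapping.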

This patchset is an updated version of previous multiqueue network device
support patches.  The general approach of introducing a new API for multiqueue
network devices to register with the stack has remained.  The changes include
adding a round-robin qdisc, heavily based on sch_prio, which will allow
queueing to hardware with no OS-enforced queuing policy.  sch_prio still has
the multiqueue code in it, but has a Kconfig option to compile it out of the
qdisc.  This allows people with hardware containing scheduling policies to
use sch_rr (round-robin), and others without scheduling policies in hardware
to continue using sch_prio if they wish to have some notion of scheduling
priority.

The patches being sent are split into Documentation, Qdisc changes, and
core stack changes.  The requested e1000 changes are still being resolved,
and will be sent at a later date.

The patches to iproute2 for tc will be sent separately, to support sch_rr.

-- 
PJ Waskiewicz <[EMAIL PROTECTED]>


[PATCH] iproute2: Added support for RR qdisc (sch_rr)

2007-06-21 Thread PJ Waskiewicz
Add tc support for the sch_rr qdisc.  This qdisc supports multiple queues
on hardware.  The syntax for sch_rr is the same as sch_prio.

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 tc/Makefile |1 +
 tc/q_rr.c   |  113 +++
 2 files changed, 114 insertions(+), 0 deletions(-)

diff --git a/tc/Makefile b/tc/Makefile
index 9d618ff..62e2697 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -9,6 +9,7 @@ TCMODULES += q_fifo.o
 TCMODULES += q_sfq.o
 TCMODULES += q_red.o
 TCMODULES += q_prio.o
+TCMODULES += q_rr.o
 TCMODULES += q_tbf.o
 TCMODULES += q_cbq.o
 TCMODULES += f_rsvp.o
diff --git a/tc/q_rr.c b/tc/q_rr.c
new file mode 100644
index 000..c5c1dc8
--- /dev/null
+++ b/tc/q_rr.c
@@ -0,0 +1,113 @@
+/*
+ * q_rr.c  RR.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:    PJ Waskiewicz, <[EMAIL PROTECTED]>
+ * Original Authors:   Alexey Kuznetsov, <[EMAIL PROTECTED]> (from PRIO)
+ *
+ * Changes:
+ *
+ * Ole Husgaard <[EMAIL PROTECTED]>: 990513: prio2band map was always reset.
+ * J Hadi Salim <[EMAIL PROTECTED]>: 990609: priomap fix.
+ */
+
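+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <syslog.h>
+#include <fcntl.h>
+#include <sys/socket.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+#include <string.h>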
+
+#include "utils.h"
+#include "tc_util.h"
+
+static void explain(void)
+{
+   fprintf(stderr, "Usage: ... rr bands NUMBER priomap P1 P2...\n");
+}
+
+#define usage() return(-1)
+
+static int rr_parse_opt(struct qdisc_util *qu, int argc, char **argv, struct nlmsghdr *n)
+{
+   int ok = 0;
+   int pmap_mode = 0;
+   int idx = 0;
+   struct tc_prio_qopt opt={3,{ 1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 }};
+
+   while (argc > 0) {
+   if (strcmp(*argv, "bands") == 0) {
+   if (pmap_mode)
+   explain();
+   NEXT_ARG();
+   if (get_integer(&opt.bands, *argv, 10)) {
+   fprintf(stderr, "Illegal \"bands\"\n");
+   return -1;
+   }
+   ok++;
+   } else if (strcmp(*argv, "priomap") == 0) {
+   if (pmap_mode) {
+   fprintf(stderr, "Error: duplicate priomap\n");
+   return -1;
+   }
+   pmap_mode = 1;
+   } else if (strcmp(*argv, "help") == 0) {
+   explain();
+   return -1;
+   } else {
+   unsigned band;
+   if (!pmap_mode) {
+   fprintf(stderr, "What is \"%s\"?\n", *argv);
+   explain();
+   return -1;
+   }
+   if (get_unsigned(&band, *argv, 10)) {
+   fprintf(stderr, "Illegal \"priomap\" element\n");
+   return -1;
+   }
+   if (band > opt.bands) {
+   fprintf(stderr, "\"priomap\" element is out of bands\n");
+   return -1;
+   }
+   if (idx > TC_PRIO_MAX) {
+   fprintf(stderr, "\"priomap\" index > TC_RR_MAX=%u\n", TC_PRIO_MAX);
+   return -1;
+   }
+   opt.priomap[idx++] = band;
+   }
+   argc--; argv++;
+   }
+
+   addattr_l(n, 1024, TCA_OPTIONS, &opt, sizeof(opt));
+   return 0;
+}
+
+int rr_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
+{
+   int i;
+   struct tc_prio_qopt *qopt;
+
+   if (opt == NULL)
+   return 0;
+
+   if (RTA_PAYLOAD(opt)  < sizeof(*qopt))
+   return -1;
+   qopt = RTA_DATA(opt);
+   fprintf(f, "bands %u priomap ", qopt->bands);
+   for (i=0; i <= TC_PRIO_MAX; i++)
+   fprintf(f, " %d", qopt->priomap[i]);
+   return 0;
+}
+
+struct qdisc_util rr_qdisc_util = {
+   .id = "rr",
+   .parse_qopt = rr_parse_opt,
+   .print_qopt = rr_print_opt,
+};


[PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-21 Thread PJ Waskiewicz
Add the new sch_rr qdisc for multiqueue network device support.
Allow sch_prio to be compiled with or without multiqueue hardware
support.

sch_rr is part of sch_prio, and is referenced from MODULE_ALIAS.  This
was done since sch_prio and sch_rr only differ in their dequeue routine.

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 net/sched/Kconfig   |   32 
 net/sched/sch_generic.c |3 +
 net/sched/sch_prio.c|  123 ---
 3 files changed, 150 insertions(+), 8 deletions(-)

diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 475df84..ca0b352 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -102,8 +102,16 @@ config NET_SCH_ATM
  To compile this code as a module, choose M here: the
  module will be called sch_atm.
 
+config NET_SCH_BANDS
+bool "Multi Band Queueing (PRIO and RR)"
+---help---
+  Say Y here if you want to use n-band multiqueue packet
+  schedulers.  These include a priority-based scheduler and
+  a round-robin scheduler.
+
 config NET_SCH_PRIO
tristate "Multi Band Priority Queueing (PRIO)"
+   depends on NET_SCH_BANDS
---help---
  Say Y here if you want to use an n-band priority queue packet
  scheduler.
@@ -111,6 +119,30 @@ config NET_SCH_PRIO
  To compile this code as a module, choose M here: the
  module will be called sch_prio.
 
+config NET_SCH_PRIO_MQ
+   bool "Multiple hardware queue support for PRIO"
+   depends on NET_SCH_PRIO
+   ---help---
+ Say Y here if you want to allow the PRIO qdisc to assign
+ flows to multiple hardware queues on an ethernet device.  This
+ will still work on devices with 1 queue.
+
+ Consider this scheduler for devices that do not use
+ hardware-based scheduling policies.  Otherwise, use NET_SCH_RR.
+
+ Most people will say N here.
+
+config NET_SCH_RR
+   bool "Multi Band Round Robin Queuing (RR)"
+   depends on NET_SCH_BANDS && NET_SCH_PRIO
+   ---help---
+ Say Y here if you want to use an n-band round robin packet
+ scheduler.
+
+ The module uses sch_prio for its framework and is aliased as
+ sch_rr, so it will load sch_prio, although it is referred
+ to using sch_rr.
+
 config NET_SCH_RED
tristate "Random Early Detection (RED)"
---help---
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 9461e8a..203d5c4 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -168,7 +168,8 @@ static inline int qdisc_restart(struct net_device *dev)
spin_unlock(&dev->queue_lock);
 
ret = NETDEV_TX_BUSY;
-   if (!netif_queue_stopped(dev))
+   if (!netif_queue_stopped(dev) &&
+   !netif_subqueue_stopped(dev, skb->queue_mapping))
/* churn baby churn .. */
ret = dev_hard_start_xmit(skb, dev);
 
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 6d7542c..4eb3ba5 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -9,6 +9,8 @@
  * Authors:Alexey Kuznetsov, <[EMAIL PROTECTED]>
  * Fixes:   19990609: J Hadi Salim <[EMAIL PROTECTED]>:
  *  Init --  EINVAL when opt undefined
+ * Additions:  Peter P. Waskiewicz Jr. <[EMAIL PROTECTED]>
+ * Added round-robin scheduling for selection at load-time
  */
 
 #include 
@@ -40,9 +42,13 @@
 struct prio_sched_data
 {
int bands;
+#ifdef CONFIG_NET_SCH_RR
+   int curband; /* for round-robin */
+#endif
struct tcf_proto *filter_list;
u8  prio2band[TC_PRIO_MAX+1];
struct Qdisc *queues[TCQ_PRIO_BANDS];
+   u16 band2queue[TC_PRIO_MAX + 1];
 };
 
 
@@ -70,14 +76,19 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
 #endif
if (TC_H_MAJ(band))
band = 0;
+   skb->queue_mapping =
+   q->band2queue[q->prio2band[band&TC_PRIO_MAX]];
return q->queues[q->prio2band[band&TC_PRIO_MAX]];
}
band = res.classid;
}
band = TC_H_MIN(band) - 1;
-   if (band >= q->bands)
+   if (band >= q->bands) {
+   skb->queue_mapping = q->band2queue[q->prio2band[0]];
return q->queues[q->prio2band[0]];
+   }
 
+   skb->queue_mapping = q->band2queue[band];
return q->queues[band];
 }
 
@@ -144,17 +155,59 @@ prio_dequeue(struct Qdisc* sch)
struct Qdisc *qdisc;
 
for (prio = 0; prio < q->bands; prio++) {
-   qdisc = q->queues[prio];
-   skb = qdisc->dequeue(qdisc);
-   if (skb) {
-   sch->q.qlen--;
-   return skb;
+   /* Check if the target subqueue is available before
+* pulling an skb.  This way we avoid excessive req

[PATCH] iproute2: sch_rr support in tc

2007-06-21 Thread PJ Waskiewicz
This patch is to support the new sch_rr (round-robin) qdisc being proposed
in NET for multiqueue network device support in the Linux network stack.
It uses q_prio.c as the template, since the qdiscs are nearly identical,
outside of the ->dequeue() routine.

I'm soliciting feedback for a 2.6.23 multiqueue submission.  Thanks.

-- 
PJ Waskiewicz <[EMAIL PROTECTED]>


[PATCH 1/3] NET: [DOC] Multiqueue hardware support documentation

2007-06-21 Thread PJ Waskiewicz
Add a brief howto to Documentation/networking for multiqueue.  It
explains how to use the multiqueue API in a driver to support
multiqueue paths from the stack, as well as the qdiscs to use for
feeding a multiqueue device.

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 Documentation/networking/multiqueue.txt |  100 +++
 1 files changed, 100 insertions(+), 0 deletions(-)

diff --git a/Documentation/networking/multiqueue.txt b/Documentation/networking/multiqueue.txt
new file mode 100644
index 000..55b2db8
--- /dev/null
+++ b/Documentation/networking/multiqueue.txt
@@ -0,0 +1,100 @@
+
+   HOWTO for multiqueue network device support
+   ===========================================
+
+Section 1: Base driver requirements for implementing multiqueue support
+Section 2: Qdisc support for multiqueue devices
+Section 3: Brief howto using PRIO or RR for multiqueue devices
+
+
+Intro: Kernel support for multiqueue devices
+--------------------------------------------
+
+Kernel support for multiqueue devices is only an API that is presented to the
+netdevice layer for base drivers to implement.  This feature is part of the
+core networking stack, and all network devices will be running on the
+multiqueue-aware stack.  If a base driver only has one queue, then these
+changes are transparent to that driver.
+
+
+Section 1: Base driver requirements for implementing multiqueue support
+-----------------------------------------------------------------------
+
+Base drivers are required to use the new alloc_etherdev_mq() or
+alloc_netdev_mq() functions to allocate the subqueues for the device.  The
+underlying kernel API will take care of the allocation and deallocation of
+the subqueue memory, as well as netdev configuration of where the queues
+exist in memory.
+
+The base driver will also need to manage the queues as it does the global
+netdev->queue_lock today.  Therefore base drivers should use the
+netif_{start|stop|wake}_subqueue() functions to manage each queue while the
+device is still operational.  netdev->queue_lock is still used when the device
+comes online or when it's completely shut down (unregister_netdev(), etc.).
+
+Finally, the base driver should indicate that it is a multiqueue device.  The
+feature flag NETIF_F_MULTI_QUEUE should be added to the netdev->features
+bitmap on device initialization.  Below is an example from e1000:
+
+#ifdef CONFIG_E1000_MQ
+   if ( (adapter->hw.mac.type == e1000_82571) ||
+(adapter->hw.mac.type == e1000_82572) ||
+(adapter->hw.mac.type == e1000_80003es2lan))
+   netdev->features |= NETIF_F_MULTI_QUEUE;
+#endif
+
+
+Section 2: Qdisc support for multiqueue devices
+-----------------------------------------------
+
+Currently two qdiscs support multiqueue devices: a new round-robin qdisc,
+sch_rr, and sch_prio.  The qdisc is responsible for classifying the skb's to
+bands and queues, and will store the queue mapping into skb->queue_mapping.
+Use this field in the base driver to determine which queue to send the skb
+to.
+
+sch_rr has been added for hardware that doesn't want scheduling policies from
+software, so it's a straight round-robin qdisc.  It uses the same syntax and
+classification priomap that sch_prio uses, so it should be intuitive to
+configure for people who've used sch_prio.
+
+The PRIO qdisc naturally plugs into a multiqueue device.  If PRIO has been
+built with NET_SCH_PRIO_MQ, then upon load, it will make sure the number of
+bands requested is equal to the number of queues on the hardware.  If they
+are equal, it sets a one-to-one mapping up between the queues and bands.  If
+they're not equal, it will not load the qdisc.  This is the same behavior
+for RR.  Once the association is made, any skb that is classified will have
+skb->queue_mapping set, which will allow the driver to properly queue skb's
+to multiple queues.
+
+
+Section 3: Brief howto using PRIO and RR for multiqueue devices
+---------------------------------------------------------------
+
+The userspace command 'tc,' part of the iproute2 package, is used to configure
+qdiscs.  To add the PRIO qdisc to your network device, assuming the device is
+called eth0, run the following command:
+
+# tc qdisc add dev eth0 root handle 1: prio bands 4
+
+This will create 4 bands, 0 being highest priority, and associate those bands
+to the queues on your NIC.  Assuming eth0 has 4 Tx queues, the band mapping
+would look like:
+
+band 0 => queue 0
+band 1 => queue 1
+band 2 => queue 2
+band 3 => queue 3
+
+Traffic will begin flowing through each queue if your TOS values are assigning
+traffic across the various bands.  For example, ssh traffic will always try to
+go out band 0 based on TOS -> Linux priority conversion (realtime traffic),
+so it will be sent out queue 0.  ICMP traffic (pings) falls into the "normal"
+traffic classification, which is band 1.  Therefore pings will be sen

[PATCH 2/3] NET: [CORE] Stack changes to add multiqueue hardware support API

2007-06-21 Thread PJ Waskiewicz
Add the multiqueue hardware device support API to the core network
stack.  Allow drivers to allocate multiple queues and manage them
at the netdev level if they choose to do so.

Added a new field to sk_buff, namely queue_mapping, for drivers to
know which tx_ring to select based on OS classification of the flow.

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 include/linux/etherdevice.h |3 +-
 include/linux/netdevice.h   |   62 ++-
 include/linux/skbuff.h  |4 ++-
 net/core/dev.c  |   20 ++
 net/core/skbuff.c   |3 ++
 net/ethernet/eth.c  |9 +++---
 6 files changed, 87 insertions(+), 14 deletions(-)

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index f48eb89..b3fbb54 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -39,7 +39,8 @@ extern void   eth_header_cache_update(struct hh_cache *hh, struct net_device *dev
 extern int eth_header_cache(struct neighbour *neigh,
 struct hh_cache *hh);
 
-extern struct net_device *alloc_etherdev(int sizeof_priv);
+extern struct net_device *alloc_etherdev_mq(int sizeof_priv, int queue_count);
+#define alloc_etherdev(sizeof_priv) alloc_etherdev_mq(sizeof_priv, 1)
 
 /**
  * is_zero_ether_addr - Determine if give Ethernet address is all zeros.
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e7913ee..6509eb4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -108,6 +108,14 @@ struct wireless_dev;
 #define MAX_HEADER (LL_MAX_HEADER + 48)
 #endif
 
+struct net_device_subqueue
+{
+   /* Give a control state for each queue.  This struct may contain
+* per-queue locks in the future.
+*/
+   unsigned long   state;
+};
+
 /*
  * Network device statistics. Akin to the 2.0 ether stats but
  * with byte counters.
@@ -325,6 +333,7 @@ struct net_device
 #define NETIF_F_VLAN_CHALLENGED 1024    /* Device cannot handle VLAN packets */
 #define NETIF_F_GSO             2048    /* Enable software GSO. */
 #define NETIF_F_LLTX            4096    /* LockLess TX */
+#define NETIF_F_MULTI_QUEUE     16384   /* Has multiple TX/RX queues */
 
/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT  16
@@ -543,6 +552,10 @@ struct net_device
 
/* rtnetlink link ops */
const struct rtnl_link_ops *rtnl_link_ops;
+
+   /* The TX queue control structures */
+   int egress_subqueue_count;
+   struct net_device_subqueue  egress_subqueue[0];
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
@@ -705,6 +718,48 @@ static inline int netif_running(const struct net_device *dev)
return test_bit(__LINK_STATE_START, &dev->state);
 }
 
+/*
+ * Routines to manage the subqueues on a device.  We only need start,
+ * stop, and a check if it's stopped.  All other device management is
+ * done at the overall netdevice level.
+ * Also test the device if we're multiqueue.
+ */
+static inline void netif_start_subqueue(struct net_device *dev, u16 queue_index)
+{
+   clear_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_stop_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+   if (netpoll_trap())
+   return;
+#endif
+   set_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline int netif_subqueue_stopped(const struct net_device *dev,
+ u16 queue_index)
+{
+   return test_bit(__LINK_STATE_XOFF,
+   &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+   if (netpoll_trap())
+   return;
+#endif
+   if (test_and_clear_bit(__LINK_STATE_XOFF,
+  &dev->egress_subqueue[queue_index].state))
+   __netif_schedule(dev);
+}
+
+static inline int netif_is_multiqueue(const struct net_device *dev)
+{
+   return (!!(NETIF_F_MULTI_QUEUE & dev->features));
+}
 
 /* Use this variant when it is known for sure that it
  * is executing from interrupt context.
@@ -995,8 +1050,11 @@ static inline void netif_tx_disable(struct net_device *dev)
 extern void ether_setup(struct net_device *dev);
 
 /* Support for loadable net-drivers */
-extern struct net_device *alloc_netdev(int sizeof_priv, const char *name,
-  void (*setup)(struct net_device *));
+extern struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
+ void (*setup)(struct net_device *),
+ int queue_count);
+#define alloc_netdev(sizeof_priv, name, setup) \
+   alloc_netdev_mq(

[PATCH] NET: Multiple queue hardware support

2007-06-21 Thread PJ Waskiewicz
Please consider these patches for 2.6.23 inclusion.

Updates since the last submission:

1. skb->queue_mapping moved into the iif cacheline.  I looked at moving
   iif and queue_mapping, but as far as I could see there wasn't enough room
   anywhere else to logically group these in a different cacheline.  Thanks
   Patrick McHardy.

2. netdev->egress_subqueue is now indexed thanks to Dave Miller.

3. sch_rr is now a MODULE_ALIAS of sch_prio.  Thanks Patrick McHardy.

4. Both sch_rr and multiqueue sch_prio expect the number of bands to
   equal the number of queues on the netdev.
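
To illustrate item 4, here is a minimal user-space sketch of the
"bands must equal queues" rule.  It is not the code from the patch:
mq_bind_bands() and MQ_MAX_BANDS are hypothetical names, and the -1 return
only stands in for the qdisc refusing to load.

```c
#include <assert.h>

#define MQ_MAX_BANDS 16	/* assumption: same size as TCQ_PRIO_BANDS */

/*
 * Hypothetical helper: accept the configuration only when the number of
 * bands equals the number of hardware queues, then map band i to queue i.
 * Returns 0 on success, -1 where the qdisc would refuse to load.
 */
static int mq_bind_bands(int bands, int queue_count,
			 unsigned short band2queue[MQ_MAX_BANDS])
{
	int i;

	assert(band2queue);

	if (bands <= 0 || bands > MQ_MAX_BANDS || bands != queue_count)
		return -1;

	/* one-to-one mapping between bands and queues */
	for (i = 0; i < bands; i++)
		band2queue[i] = (unsigned short)i;

	return 0;
}
```

With 4 bands on a 4-queue device the call succeeds and band 3 maps to
queue 3; with 3 bands on the same device it fails.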

This patchset is an updated version of previous multiqueue network device
support patches.  The general approach of introducing a new API for multiqueue
network devices to register with the stack has remained.  The changes include
adding a round-robin qdisc, heavily based on sch_prio, which will allow
queueing to hardware with no OS-enforced queuing policy.  sch_prio still has
the multiqueue code in it, but has a Kconfig option to compile it out of the
qdisc.  This allows people with hardware containing scheduling policies to
use sch_rr (round-robin), and others without scheduling policies in hardware
to continue using sch_prio if they wish to have some notion of scheduling
priority.
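
The difference between the two qdiscs comes down to ->dequeue().  The toy
model below is only a sketch of the two policies, not the sch_prio or sch_rr
source; struct toy_sched and both functions are made-up names, and each band
is reduced to a packet counter.

```c
#include <assert.h>

#define NBANDS 4

/* Toy model: each band only tracks how many packets it holds. */
struct toy_sched {
	int qlen[NBANDS];
	int curband;	/* next band to try; used by round-robin only */
};

/* PRIO-style: always drain the lowest-numbered non-empty band first. */
static int toy_prio_dequeue(struct toy_sched *s)
{
	int band;

	assert(s);
	for (band = 0; band < NBANDS; band++) {
		if (s->qlen[band] > 0) {
			s->qlen[band]--;
			return band;
		}
	}
	return -1;	/* nothing queued */
}

/* RR-style: resume after the band tried last; no band is favoured. */
static int toy_rr_dequeue(struct toy_sched *s)
{
	int i;

	assert(s);
	for (i = 0; i < NBANDS; i++) {
		int band = s->curband;

		s->curband = (s->curband + 1) % NBANDS;
		if (s->qlen[band] > 0) {
			s->qlen[band]--;
			return band;
		}
	}
	return -1;	/* nothing queued */
}
```

With two packets each in bands 0 and 1, the PRIO variant serves band 0
twice before touching band 1, while the RR variant alternates between them.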

The patches being sent are split into Documentation, Qdisc changes, and
core stack changes.  The requested e1000 changes are still being resolved,
and will be sent at a later date.

I did not modify other users of netif_queue_stopped() in net/core/netpoll.c,
net/core/dev.c, or net/core/pktgen.c, since no classification occurs for
the skb being sent to the device.  Therefore, packets should always be
ending up in queue 0, so there's no need to check the subqueue status either.
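
For reference, the subqueue status mentioned above is a single XOFF bit per
queue.  A user-space model of it might look like the following.  The toy_
names are hypothetical, the queue count is fixed at 4 for brevity, and plain
bit operations replace the atomic set_bit()/clear_bit()/test_bit() the
kernel helpers use.

```c
#include <assert.h>

#define TOY_XOFF	0x1UL	/* stands in for the __LINK_STATE_XOFF bit */
#define TOY_NQUEUES	4

struct toy_subqueue {
	unsigned long state;
};

struct toy_netdev {
	int egress_subqueue_count;
	struct toy_subqueue egress_subqueue[TOY_NQUEUES];
};

/* Mark one Tx queue as stopped; the other queues are unaffected. */
static void toy_stop_subqueue(struct toy_netdev *dev, unsigned short q)
{
	assert(q < dev->egress_subqueue_count);
	dev->egress_subqueue[q].state |= TOY_XOFF;
}

/* Allow transmission on one Tx queue again. */
static void toy_start_subqueue(struct toy_netdev *dev, unsigned short q)
{
	assert(q < dev->egress_subqueue_count);
	dev->egress_subqueue[q].state &= ~TOY_XOFF;
}

/* Non-zero when the given Tx queue is currently stopped. */
static int toy_subqueue_stopped(const struct toy_netdev *dev, unsigned short q)
{
	assert(q < dev->egress_subqueue_count);
	return (dev->egress_subqueue[q].state & TOY_XOFF) != 0;
}
```

Stopping queue 2 leaves queue 0 running, so callers that only ever send on
queue 0 see no change.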

The patches to iproute2 for tc will be sent separately, to support sch_rr.

-- 
PJ Waskiewicz <[EMAIL PROTECTED]>
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] NET: [CORE] Stack changes to add multiqueue hardware support API

2007-06-18 Thread PJ Waskiewicz
Add the multiqueue hardware device support API to the core network
stack.  Allow drivers to allocate multiple queues and manage them
at the netdev level if they choose to do so.

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 include/linux/etherdevice.h |3 +-
 include/linux/netdevice.h   |   62 ++-
 include/linux/skbuff.h  |2 +
 net/core/dev.c  |   27 +++
 net/core/skbuff.c   |3 ++
 net/ethernet/eth.c  |9 +++---
 6 files changed, 94 insertions(+), 12 deletions(-)

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index f48eb89..b3fbb54 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -39,7 +39,8 @@ extern void   eth_header_cache_update(struct hh_cache *hh, struct net_device *dev
 extern int eth_header_cache(struct neighbour *neigh,
 struct hh_cache *hh);
 
-extern struct net_device *alloc_etherdev(int sizeof_priv);
+extern struct net_device *alloc_etherdev_mq(int sizeof_priv, int queue_count);
+#define alloc_etherdev(sizeof_priv) alloc_etherdev_mq(sizeof_priv, 1)
 
 /**
  * is_zero_ether_addr - Determine if give Ethernet address is all zeros.
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e7913ee..bf532a0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -108,6 +108,14 @@ struct wireless_dev;
 #define MAX_HEADER (LL_MAX_HEADER + 48)
 #endif
 
+struct net_device_subqueue
+{
+   /* Give a control state for each queue.  This struct may contain
+* per-queue locks in the future.
+*/
+   unsigned long   state;
+};
+
 /*
  * Network device statistics. Akin to the 2.0 ether stats but
  * with byte counters.
@@ -325,6 +333,7 @@ struct net_device
 #define NETIF_F_VLAN_CHALLENGED 1024    /* Device cannot handle VLAN packets */
 #define NETIF_F_GSO             2048    /* Enable software GSO. */
 #define NETIF_F_LLTX            4096    /* LockLess TX */
+#define NETIF_F_MULTI_QUEUE     16384   /* Has multiple TX/RX queues */
 
/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT  16
@@ -543,6 +552,10 @@ struct net_device
 
/* rtnetlink link ops */
const struct rtnl_link_ops *rtnl_link_ops;
+
+   /* The TX queue control structures */
+   struct net_device_subqueue  *egress_subqueue;
+   int egress_subqueue_count;
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
@@ -705,6 +718,48 @@ static inline int netif_running(const struct net_device *dev)
return test_bit(__LINK_STATE_START, &dev->state);
 }
 
+/*
+ * Routines to manage the subqueues on a device.  We only need start,
+ * stop, and a check if it's stopped.  All other device management is
+ * done at the overall netdevice level.
+ * Also test the device if we're multiqueue.
+ */
+static inline void netif_start_subqueue(struct net_device *dev, u16 queue_index)
+{
+   clear_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_stop_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+   if (netpoll_trap())
+   return;
+#endif
+   set_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline int netif_subqueue_stopped(const struct net_device *dev,
+ u16 queue_index)
+{
+   return test_bit(__LINK_STATE_XOFF,
+   &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+   if (netpoll_trap())
+   return;
+#endif
+   if (test_and_clear_bit(__LINK_STATE_XOFF,
+  &dev->egress_subqueue[queue_index].state))
+   __netif_schedule(dev);
+}
+
+static inline int netif_is_multiqueue(const struct net_device *dev)
+{
+   return (!!(NETIF_F_MULTI_QUEUE & dev->features));
+}
 
 /* Use this variant when it is known for sure that it
  * is executing from interrupt context.
@@ -995,8 +1050,11 @@ static inline void netif_tx_disable(struct net_device *dev)
 extern void ether_setup(struct net_device *dev);
 
 /* Support for loadable net-drivers */
-extern struct net_device *alloc_netdev(int sizeof_priv, const char *name,
-  void (*setup)(struct net_device *));
+extern struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
+ void (*setup)(struct net_device *),
+ int queue_count);
+#define alloc_netdev(sizeof_priv, name, setup) \
+   alloc_netdev_mq(sizeof_priv, name, setup, 1)
 extern int register_netdev(struct net_device *dev);
 extern void unregister_netdev(

[PATCH 1/3] NET: [DOC] Multiqueue hardware support documentation

2007-06-18 Thread PJ Waskiewicz
Add a brief howto to Documentation/networking for multiqueue.  It
explains how to use the multiqueue API in a driver to support
multiqueue paths from the stack, as well as the qdiscs to use for
feeding a multiqueue device.

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 Documentation/networking/multiqueue.txt |   98 +++
 1 files changed, 98 insertions(+), 0 deletions(-)

diff --git a/Documentation/networking/multiqueue.txt b/Documentation/networking/multiqueue.txt
new file mode 100644
index 000..8201767
--- /dev/null
+++ b/Documentation/networking/multiqueue.txt
@@ -0,0 +1,98 @@
+
+   HOWTO for multiqueue network device support
+   ===========================================
+
+Section 1: Base driver requirements for implementing multiqueue support
+Section 2: Qdisc support for multiqueue devices
+Section 3: Brief howto using PRIO for multiqueue devices
+
+
+Intro: Kernel support for multiqueue devices
+--------------------------------------------
+
+Kernel support for multiqueue devices is only an API that is presented to the
+netdevice layer for base drivers to implement.  This feature is part of the
+core networking stack, and all network devices will be running on the
+multiqueue-aware stack.  If a base driver only has one queue, then these
+changes are transparent to that driver.
+
+
+Section 1: Base driver requirements for implementing multiqueue support
+-----------------------------------------------------------------------
+
+Base drivers are required to use the new alloc_etherdev_mq() or
+alloc_netdev_mq() functions to allocate the subqueues for the device.  The
+underlying kernel API will take care of the allocation and deallocation of
+the subqueue memory, as well as netdev configuration of where the queues
+exist in memory.
+
+The base driver will also need to manage the queues as it does the global
+netdev->queue_lock today.  Therefore base drivers should use the
+netif_{start|stop|wake}_subqueue() functions to manage each queue while the
+device is still operational.  netdev->queue_lock is still used when the device
+comes online or when it's completely shut down (unregister_netdev(), etc.).
+
+Finally, the base driver should indicate that it is a multiqueue device.  The
+feature flag NETIF_F_MULTI_QUEUE should be added to the netdev->features
+bitmap on device initialization.  Below is an example from e1000:
+
+#ifdef CONFIG_E1000_MQ
+   if ( (adapter->hw.mac.type == e1000_82571) ||
+(adapter->hw.mac.type == e1000_82572) ||
+(adapter->hw.mac.type == e1000_80003es2lan))
+   netdev->features |= NETIF_F_MULTI_QUEUE;
+#endif
+
+
+Section 2: Qdisc support for multiqueue devices
+-----------------------------------------------
+
+Currently two qdiscs support multiqueue devices: a new round-robin qdisc,
+sch_rr, and sch_prio.  The qdisc is responsible for classifying the skb's to
+bands and queues, and will store the queue mapping into skb->queue_mapping.
+Use this field in the base driver to determine which queue to send the skb
+to.
+
+sch_rr has been added for hardware that doesn't want scheduling policies from
+software, so it's a straight round-robin qdisc.  It uses the same syntax and
+classification priomap that sch_prio uses, so it should be intuitive to
+configure for people who've used sch_prio.
+
+The PRIO qdisc naturally plugs into a multiqueue device.  Upon load of the
+qdisc, PRIO will make a best-effort assignment of queue to PRIO band to evenly
+distribute traffic flows.  The algorithm can be found in prio_tune() in
+net/sched/sch_prio.c.  Once the association is made, any skb that is
+classified will have skb->queue_mapping set, which will allow the driver to
+properly queue skb's to multiple queues.  sch_prio can have these features
+compiled in or out of the module.
+
+
+Section 3: Brief howto using PRIO for multiqueue devices
+--------------------------------------------------------
+
+The userspace command 'tc,' part of the iproute2 package, is used to configure
+qdiscs.  To add the PRIO qdisc to your network device, assuming the device is
+called eth0, run the following command:
+
+# tc qdisc add dev eth0 root handle 1: prio
+
+This will create 3 bands, 0 being highest priority, and associate those bands
+to the queues on your NIC.  Assuming eth0 has 2 Tx queues, the band mapping
+would look like:
+
+band 0 => queue 0
+band 1 => queue 1
+band 2 => queue 1
+
+Traffic will begin flowing through each queue if your TOS values are assigning
+traffic across the various bands.  For example, ssh traffic will always try to
+go out band 0 based on TOS -> Linux priority conversion (realtime traffic),
+so it will be sent out queue 0.  ICMP traffic (pings) falls into the "normal"
+traffic classification, which is band 1.  Therefore pings will be sent out
+queue 1 on the NIC.
+
+The behavior of tc filters remains the same, where it will override TOS priority
+classi

[PATCH 2/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-18 Thread PJ Waskiewicz
 struct Qdisc *qdisc;
 
for (prio = 0; prio < q->bands; prio++) {
-   qdisc = q->queues[prio];
-   skb = qdisc->dequeue(qdisc);
-   if (skb) {
-   sch->q.qlen--;
-   return skb;
+#ifdef CONFIG_NET_SCH_PRIO_MQ
+   /* Check if the target subqueue is available before
+* pulling an skb.  This way we avoid excessive requeues
+* for slower queues.
+*/
+   if (!netif_subqueue_stopped(sch->dev, q->band2queue[prio])) {
+#endif
+   qdisc = q->queues[prio];
+   skb = qdisc->dequeue(qdisc);
+   if (skb) {
+   sch->q.qlen--;
+   return skb;
+   }
+#ifdef CONFIG_NET_SCH_PRIO_MQ
}
+#endif
}
return NULL;
 
@@ -200,6 +222,10 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt)
struct prio_sched_data *q = qdisc_priv(sch);
struct tc_prio_qopt *qopt = RTA_DATA(opt);
int i;
+   int queue;
+   int qmapoffset;
+   int offset;
+   int mod;
 
if (opt->rta_len < RTA_LENGTH(sizeof(*qopt)))
return -EINVAL;
@@ -242,6 +268,32 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt)
}
}
}
+#ifdef CONFIG_NET_SCH_PRIO_MQ
+   /* setup queue to band mapping */
+   if (q->bands < sch->dev->egress_subqueue_count) {
+   qmapoffset = 1;
+   mod = sch->dev->egress_subqueue_count;
+   } else {
+   mod = q->bands % sch->dev->egress_subqueue_count;
+   qmapoffset = q->bands / sch->dev->egress_subqueue_count
+   + ((mod) ? 1 : 0);
+   }
+
+   queue = 0;
+   offset = 0;
+   for (i = 0; i < q->bands; i++) {
+   q->band2queue[i] = queue;
+   if ( ((i + 1) - offset) == qmapoffset) {
+   queue++;
+   offset += qmapoffset;
+   if (mod)
+   mod--;
+   qmapoffset = q->bands /
+   sch->dev->egress_subqueue_count +
+   ((mod) ? 1 : 0);
+   }
+   }
+#endif
return 0;
 }
 
diff --git a/net/sched/sch_rr.c b/net/sched/sch_rr.c
new file mode 100644
index 000..ce9f237
--- /dev/null
+++ b/net/sched/sch_rr.c
@@ -0,0 +1,516 @@
+/*
+ * net/sched/sch_rr.c  Simple n-band round-robin scheduler.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ *     2 of the License, or (at your option) any later version.
+ *
+ * The core part of this qdisc is based on sch_prio.  ->dequeue() is where
+ * this scheduler functionally differs.
+ *
+ * Author: PJ Waskiewicz, <[EMAIL PROTECTED]>
+ *
+ * Original Authors (from PRIO): Alexey Kuznetsov, <[EMAIL PROTECTED]>
+ * Fixes:   19990609: J Hadi Salim <[EMAIL PROTECTED]>:
+ *  Init --  EINVAL when opt undefined
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+
+struct rr_sched_data
+{
+   int bands;
+   int curband;
+   struct tcf_proto *filter_list;
+   u8  prio2band[TC_RR_MAX + 1];
+   struct Qdisc *queues[TCQ_RR_BANDS];
+   u16 band2queue[TC_RR_MAX + 1];
+};
+
+
+static struct Qdisc *rr_classify(struct sk_buff *skb, struct Qdisc *sch,
+int *qerr)
+{
+   struct rr_sched_data *q = qdisc_priv(sch);
+   u32 band = skb->priority;
+   struct tcf_result res;
+
+   *qerr = NET_XMIT_BYPASS;
+   if (TC_H_MAJ(skb->priority) != sch->handle) {
+#ifdef CONFIG_NET_CLS_ACT
+   switch (tc_classify(skb, q->filter_list, &res)) {
+   case TC_ACT_STOLEN:
+   case TC_ACT_QUEUED:
+   *qerr = NET_XMIT_SUCCESS;
+   case TC_ACT_SHOT:
+   return NULL;
+   }
+
+   if (!q->filter_list ) {
+#else
+   if (!q->filter_list || tc_classify(skb, q->filter_list, &res)) {
+#endif
+   if (TC_H_MAJ(band))
+   band = 0;
+   skb->queue_mapping =
+ q->band2queue[q->prio2band[band&TC_RR_MAX]];
+
+   return q->queues[q->

[PATCH] NET: Multiple queue hardware support

2007-06-18 Thread PJ Waskiewicz
Please consider these patches for 2.6.23 inclusion.

This patchset is an updated version of previous multiqueue network device
support patches.  The general approach of introducing a new API for multiqueue
network devices to register with the stack has remained.  The changes include
adding a round-robin qdisc, heavily based on sch_prio, which will allow
queueing to hardware with no OS-enforced queuing policy.  sch_prio still has
the multiqueue code in it, but has a Kconfig option to compile it out of the
qdisc.  This allows people with hardware containing scheduling policies to
use sch_rr (round-robin), and others without scheduling policies in hardware
to continue using sch_prio if they wish to have some notion of scheduling
priority.
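
In this revision, sch_prio's multiqueue code spreads bands over queues in
prio_tune() (see the qdisc patch in this series).  For anyone who wants to
try the mapping outside the kernel, below is a user-space port of that loop.
The function name toy_map_bands() and its signature are made up for this
sketch; the arithmetic follows the hunk in the patch.

```c
#include <assert.h>

/*
 * User-space port of the band-to-queue loop from prio_tune().
 * band2queue must have room for at least 'bands' entries.
 */
static void toy_map_bands(int bands, int nqueues, unsigned short *band2queue)
{
	int i, mod, qmapoffset;
	int queue = 0;
	int offset = 0;

	assert(band2queue && bands > 0 && nqueues > 0);

	if (bands < nqueues) {
		/* fewer bands than queues: one band per queue */
		qmapoffset = 1;
		mod = nqueues;
	} else {
		/* bands per queue, plus one extra while a remainder is left */
		mod = bands % nqueues;
		qmapoffset = bands / nqueues + (mod ? 1 : 0);
	}

	for (i = 0; i < bands; i++) {
		band2queue[i] = (unsigned short)queue;
		if ((i + 1) - offset == qmapoffset) {
			/* this queue has its share; move on to the next */
			queue++;
			offset += qmapoffset;
			if (mod)
				mod--;
			qmapoffset = bands / nqueues + (mod ? 1 : 0);
		}
	}
}
```

Tracing the loop by hand, 3 bands on 2 queues gives bands 0 and 1 on
queue 0 and band 2 on queue 1, and equal band and queue counts give a
one-to-one mapping.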

The patches being sent are split into Documentation, Qdisc changes, and
core stack changes.  The requested e1000 changes are still being resolved,
and will be sent at a later date.

I did not modify other users of netif_queue_stopped() in net/core/netpoll.c,
net/core/dev.c, or net/core/pktgen.c, since no classification occurs for
the skb being sent to the device.  Therefore, packets should always be
ending up in queue 0, so there's no need to check the subqueue status either.
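
On the driver side, the intent is that the transmit routine reads
skb->queue_mapping to pick a Tx ring.  A rough user-space sketch of that
selection follows.  struct toy_skb and toy_select_tx_ring() are hypothetical,
and falling back to ring 0 for an out-of-range value is an assumption of
this sketch rather than something the patches mandate.

```c
#include <assert.h>

/* Toy stand-in for the one sk_buff field this sketch cares about. */
struct toy_skb {
	unsigned short queue_mapping;	/* set by the qdisc; 0 if unclassified */
};

/*
 * Hypothetical ring selection for a multiqueue driver's transmit path.
 * Unclassified skbs carry queue_mapping == 0 and so land on ring 0.
 */
static unsigned short toy_select_tx_ring(const struct toy_skb *skb,
					 unsigned short num_tx_rings)
{
	assert(skb && num_tx_rings > 0);

	/* assumption: treat an out-of-range mapping like an unclassified skb */
	if (skb->queue_mapping >= num_tx_rings)
		return 0;

	return skb->queue_mapping;
}
```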

The patches to iproute2 for tc will be sent separately, to support sch_rr.

-- 
PJ Waskiewicz <[EMAIL PROTECTED]>


[PATCH] iproute2: Added support for RR qdisc (sch_rr)

2007-06-18 Thread PJ Waskiewicz
Add tc support for the sch_rr qdisc.  This qdisc supports multiple queues
on hardware.  The syntax for sch_rr is the same as sch_prio.

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 include/linux/pkt_sched.h |   11 
 tc/Makefile   |1 
 tc/q_rr.c |  113 +
 3 files changed, 125 insertions(+), 0 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index d10f353..907412b 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -22,6 +22,7 @@
 #define TC_PRIO_CONTROL7
 
 #define TC_PRIO_MAX15
+#define TC_RR_MAX  15
 
 /* Generic queue statistics, available for all the elements.
Particular schedulers may have also their private records.
@@ -90,6 +91,16 @@ struct tc_fifo_qopt
__u32   limit;  /* Queue length: bytes for bfifo, packets for pfifo */
 };
 
+/* RR section */
+#define TCQ_RR_BANDS   16
+#define TCQ_MIN_RR_BANDS 2
+
+struct tc_rr_qopt
+{
+   int bands;
+   __u8priomap[TC_RR_MAX + 1];
+};
+
 /* PRIO section */
 
 #define TCQ_PRIO_BANDS 16
diff --git a/tc/Makefile b/tc/Makefile
index 9d618ff..62e2697 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -9,6 +9,7 @@ TCMODULES += q_fifo.o
 TCMODULES += q_sfq.o
 TCMODULES += q_red.o
 TCMODULES += q_prio.o
+TCMODULES += q_rr.o
 TCMODULES += q_tbf.o
 TCMODULES += q_cbq.o
 TCMODULES += f_rsvp.o
diff --git a/tc/q_rr.c b/tc/q_rr.c
new file mode 100644
index 000..8eecac9
--- /dev/null
+++ b/tc/q_rr.c
@@ -0,0 +1,113 @@
+/*
+ * q_rr.c  RR.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:    PJ Waskiewicz, <[EMAIL PROTECTED]>
+ * Original Authors:   Alexey Kuznetsov, <[EMAIL PROTECTED]> (from PRIO)
+ *
+ * Changes:
+ *
+ * Ole Husgaard <[EMAIL PROTECTED]>: 990513: prio2band map was always reset.
+ * J Hadi Salim <[EMAIL PROTECTED]>: 990609: priomap fix.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "utils.h"
+#include "tc_util.h"
+
+static void explain(void)
+{
+   fprintf(stderr, "Usage: ... rr bands NUMBER priomap P1 P2...\n");
+}
+
+#define usage() return(-1)
+
+static int rr_parse_opt(struct qdisc_util *qu, int argc, char **argv, struct nlmsghdr *n)
+{
+   int ok = 0;
+   int pmap_mode = 0;
+   int idx = 0;
+   struct tc_rr_qopt opt={3,{ 1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 }};
+
+   while (argc > 0) {
+   if (strcmp(*argv, "bands") == 0) {
+   if (pmap_mode)
+   explain();
+   NEXT_ARG();
+   if (get_integer(&opt.bands, *argv, 10)) {
+   fprintf(stderr, "Illegal \"bands\"\n");
+   return -1;
+   }
+   ok++;
+   } else if (strcmp(*argv, "priomap") == 0) {
+   if (pmap_mode) {
+   fprintf(stderr, "Error: duplicate priomap\n");
+   return -1;
+   }
+   pmap_mode = 1;
+   } else if (strcmp(*argv, "help") == 0) {
+   explain();
+   return -1;
+   } else {
+   unsigned band;
+   if (!pmap_mode) {
+   fprintf(stderr, "What is \"%s\"?\n", *argv);
+   explain();
+   return -1;
+   }
+   if (get_unsigned(&band, *argv, 10)) {
+   fprintf(stderr, "Illegal \"priomap\" element\n");
+   return -1;
+   }
+   if (band > opt.bands) {
+   fprintf(stderr, "\"priomap\" element is out of bands\n");
+   return -1;
+   }
+   if (idx > TC_RR_MAX) {
+   fprintf(stderr, "\"priomap\" index > TC_RR_MAX=%u\n", TC_RR_MAX);
+   return -1;
+   }
+   opt.priomap[idx++] = band;
+   }
+   argc--; argv++;
+   }
+
+   addattr_l(n, 1024, TCA_OPTIONS, &opt, sizeof(opt));
+   return 0;
+

[PATCH] iproute2: sch_rr support in tc

2007-06-18 Thread PJ Waskiewicz
This patch is to support the new sch_rr (round-robin) qdisc being proposed
in NET for multiqueue network device support in the Linux network stack.
It uses q_prio.c as the template, since the qdiscs are nearly identical,
outside of the ->dequeue() routine.
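
Since the option layout is the same as for prio, a priomap for rr has to
satisfy the same constraint: every entry must name an existing band.  The
check below is a sketch under stated assumptions, not part of the patch.
rr_priomap_valid() is a hypothetical helper, unsigned char stands in for
__u8, and the limits 2 and 16 are TCQ_MIN_RR_BANDS and TCQ_RR_BANDS from
the header change.

```c
#include <assert.h>

#define TC_RR_MAX		15
#define TCQ_RR_BANDS		16
#define TCQ_MIN_RR_BANDS	2

/* Same layout as the tc_rr_qopt added to pkt_sched.h by the patch. */
struct tc_rr_qopt {
	int bands;
	unsigned char priomap[TC_RR_MAX + 1];
};

/*
 * Hypothetical validation helper: returns 1 when the band count is in
 * range and every priomap entry refers to a band below opt->bands,
 * otherwise 0.
 */
static int rr_priomap_valid(const struct tc_rr_qopt *opt)
{
	int i;

	assert(opt);

	if (opt->bands < TCQ_MIN_RR_BANDS || opt->bands > TCQ_RR_BANDS)
		return 0;

	for (i = 0; i <= TC_RR_MAX; i++) {
		if (opt->priomap[i] >= opt->bands)
			return 0;
	}
	return 1;
}
```

The default used by rr_parse_opt(), 3 bands with the priomap
1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1, passes this check; changing any entry
to 3 makes it fail.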

I'm soliciting feedback for a 2.6.23 multiqueue submission.  Thanks.

-- 
PJ Waskiewicz <[EMAIL PROTECTED]>


[PATCH] iproute2: Added support for RR qdisc (sch_rr)

2007-06-04 Thread PJ Waskiewicz
Add tc support for the sch_rr qdisc.  This qdisc supports multiple queues
on hardware.  The syntax for sch_rr is the same as sch_prio.

Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>
---

 include/linux/pkt_sched.h |   11 
 tc/Makefile   |1 
 tc/q_rr.c |  113 +
 3 files changed, 125 insertions(+), 0 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index d10f353..907412b 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -22,6 +22,7 @@
 #define TC_PRIO_CONTROL7
 
 #define TC_PRIO_MAX15
+#define TC_RR_MAX  15
 
 /* Generic queue statistics, available for all the elements.
Particular schedulers may have also their private records.
@@ -90,6 +91,16 @@ struct tc_fifo_qopt
__u32   limit;  /* Queue length: bytes for bfifo, packets for pfifo */
 };
 
+/* RR section */
+#define TCQ_RR_BANDS   16
+#define TCQ_MIN_RR_BANDS 2
+
+struct tc_rr_qopt
+{
+   int bands;
+   __u8priomap[TC_RR_MAX + 1];
+};
+
 /* PRIO section */
 
 #define TCQ_PRIO_BANDS 16
diff --git a/tc/Makefile b/tc/Makefile
index 9d618ff..62e2697 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -9,6 +9,7 @@ TCMODULES += q_fifo.o
 TCMODULES += q_sfq.o
 TCMODULES += q_red.o
 TCMODULES += q_prio.o
+TCMODULES += q_rr.o
 TCMODULES += q_tbf.o
 TCMODULES += q_cbq.o
 TCMODULES += f_rsvp.o
diff --git a/tc/q_rr.c b/tc/q_rr.c
new file mode 100644
index 000..8eecac9
--- /dev/null
+++ b/tc/q_rr.c
@@ -0,0 +1,113 @@
+/*
+ * q_rr.c  RR.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:    PJ Waskiewicz, <[EMAIL PROTECTED]>
+ * Original Authors:   Alexey Kuznetsov, <[EMAIL PROTECTED]> (from PRIO)
+ *
+ * Changes:
+ *
+ * Ole Husgaard <[EMAIL PROTECTED]>: 990513: prio2band map was always reset.
+ * J Hadi Salim <[EMAIL PROTECTED]>: 990609: priomap fix.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "utils.h"
+#include "tc_util.h"
+
+static void explain(void)
+{
+   fprintf(stderr, "Usage: ... rr bands NUMBER priomap P1 P2...\n");
+}
+
+#define usage() return(-1)
+
+static int rr_parse_opt(struct qdisc_util *qu, int argc, char **argv, struct nlmsghdr *n)
+{
+   int ok = 0;
+   int pmap_mode = 0;
+   int idx = 0;
+   struct tc_rr_qopt opt={3,{ 1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 }};
+
+   while (argc > 0) {
+   if (strcmp(*argv, "bands") == 0) {
+   if (pmap_mode)
+   explain();
+   NEXT_ARG();
+   if (get_integer(&opt.bands, *argv, 10)) {
+   fprintf(stderr, "Illegal \"bands\"\n");
+   return -1;
+   }
+   ok++;
+   } else if (strcmp(*argv, "priomap") == 0) {
+   if (pmap_mode) {
+   fprintf(stderr, "Error: duplicate priomap\n");
+   return -1;
+   }
+   pmap_mode = 1;
+   } else if (strcmp(*argv, "help") == 0) {
+   explain();
+   return -1;
+   } else {
+   unsigned band;
+   if (!pmap_mode) {
+   fprintf(stderr, "What is \"%s\"?\n", *argv);
+   explain();
+   return -1;
+   }
+   if (get_unsigned(&band, *argv, 10)) {
+   fprintf(stderr, "Illegal \"priomap\" element\n");
+   return -1;
+   }
+   if (band > opt.bands) {
+   fprintf(stderr, "\"priomap\" element is out of bands\n");
+   return -1;
+   }
+   if (idx > TC_RR_MAX) {
+   fprintf(stderr, "\"priomap\" index > TC_RR_MAX=%u\n", TC_RR_MAX);
+   return -1;
+   }
+   opt.priomap[idx++] = band;
+   }
+   argc--; argv++;
+   }
+
+   addattr_l(n, 1024, TCA_OPTIONS, &opt, sizeof(opt));
+   return 0;
+

[RFC] iproute2: sch_rr support in tc

2007-06-04 Thread PJ Waskiewicz
This patch is to support the new sch_rr (round-robin) qdisc being proposed
in NET for multiqueue network device support in the Linux network stack.
It uses q_prio.c as the template, since the qdiscs are nearly identical,
outside of the ->dequeue() routine.

I'm soliciting feedback for a 2.6.23 multiqueue submission.  Thanks.

-- 
PJ Waskiewicz <[EMAIL PROTECTED]>


[PATCH] NET: Multiqueue network device support.

2007-06-04 Thread PJ Waskiewicz
   /* Release the driver */
if (!nolock) {
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 269a6e1..c78dba4 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -43,6 +43,7 @@ struct prio_sched_data
 	struct tcf_proto *filter_list;
 	u8  prio2band[TC_PRIO_MAX+1];
 	struct Qdisc *queues[TCQ_PRIO_BANDS];
+	u16 band2queue[TC_PRIO_MAX + 1];
 };
 
 
@@ -70,13 +71,26 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
 #endif
 			if (TC_H_MAJ(band))
 				band = 0;
+#ifdef CONFIG_NET_SCH_PRIO_MQ
+			skb->queue_mapping =
+			  q->band2queue[q->prio2band[band&TC_PRIO_MAX]];
+#endif
+
 			return q->queues[q->prio2band[band&TC_PRIO_MAX]];
 		}
 		band = res.classid;
 	}
 	band = TC_H_MIN(band) - 1;
-	if (band > q->bands)
+	if (band > q->bands) {
+#ifdef CONFIG_NET_SCH_PRIO_MQ
+		skb->queue_mapping = q->band2queue[q->prio2band[0]];
+#endif
 		return q->queues[q->prio2band[0]];
+	}
+
+#ifdef CONFIG_NET_SCH_PRIO_MQ
+	skb->queue_mapping = q->band2queue[band];
+#endif
 
 	return q->queues[band];
 }
@@ -144,12 +158,22 @@ prio_dequeue(struct Qdisc* sch)
 	struct Qdisc *qdisc;
 
 	for (prio = 0; prio < q->bands; prio++) {
-		qdisc = q->queues[prio];
-		skb = qdisc->dequeue(qdisc);
-		if (skb) {
-			sch->q.qlen--;
-			return skb;
+#ifdef CONFIG_NET_SCH_PRIO_MQ
+		/* Check if the target subqueue is available before
+		 * pulling an skb.  This way we avoid excessive requeues
+		 * for slower queues.
+		 */
+		if (!netif_subqueue_stopped(sch->dev, q->band2queue[prio])) {
+#endif
+			qdisc = q->queues[prio];
+			skb = qdisc->dequeue(qdisc);
+			if (skb) {
+				sch->q.qlen--;
+				return skb;
+			}
+#ifdef CONFIG_NET_SCH_PRIO_MQ
 		}
+#endif
 	}
 	return NULL;
 
@@ -200,6 +224,10 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt)
 	struct prio_sched_data *q = qdisc_priv(sch);
 	struct tc_prio_qopt *qopt = RTA_DATA(opt);
 	int i;
+	int queue;
+	int qmapoffset;
+	int offset;
+	int mod;
 
 	if (opt->rta_len < RTA_LENGTH(sizeof(*qopt)))
 		return -EINVAL;
@@ -242,6 +270,32 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt)
 			}
 		}
 	}
+#ifdef CONFIG_NET_SCH_PRIO_MQ
+	/* setup queue to band mapping */
+	if (q->bands < sch->dev->egress_subqueue_count) {
+		qmapoffset = 1;
+		mod = sch->dev->egress_subqueue_count;
+	} else {
+		mod = q->bands % sch->dev->egress_subqueue_count;
+		qmapoffset = q->bands / sch->dev->egress_subqueue_count
+				+ ((mod) ? 1 : 0);
+	}
+
+	queue = 0;
+	offset = 0;
+	for (i = 0; i < q->bands; i++) {
+		q->band2queue[i] = queue;
+		if ( ((i + 1) - offset) == qmapoffset) {
+			queue++;
+			offset += qmapoffset;
+			if (mod)
+				mod--;
+			qmapoffset = q->bands /
+				sch->dev->egress_subqueue_count +
+				((mod) ? 1 : 0);
+		}
+	}
+#endif
 	return 0;
 }
 
diff --git a/net/sched/sch_rr.c b/net/sched/sch_rr.c
new file mode 100644
index 0000000..ce9f237
--- /dev/null
+++ b/net/sched/sch_rr.c
@@ -0,0 +1,516 @@
+/*
+ * net/sched/sch_rr.c  Simple n-band round-robin scheduler.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * The core part of this qdisc is based on sch_prio.  ->dequeue() is where
+ * this scheduler functionally differs.
+ *
+ * Author: PJ Waskiewicz, <[EMAIL PROTECTED]>
+ *
+ * Original Authors (from PRIO): Alexey Kuznetsov, <[EMAIL PROTECTED]>
+ * Fixes:   19990609: J Hadi Salim <[EMAIL PROTECTED]>:
+ *  Init --  EINVAL when opt undefined
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#i
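
The band-to-queue mapping that the prio_tune() hunk above sets up can be
tried in userspace with the sketch below.  It follows the same arithmetic as
the patch; sketch_map_bands() and its parameters are hypothetical names, with
nqueues standing in for sch->dev->egress_subqueue_count, and it assumes both
counts are positive.

```c
#include <stdint.h>

/*
 * Fill band2queue[0..bands-1] so that consecutive bands share a queue.
 * When bands >= nqueues, the first (bands % nqueues) queues get one band
 * more than the rest; when bands < nqueues, each band gets its own queue.
 * Assumes bands > 0 and nqueues > 0 (hypothetical helper, not kernel API).
 */
static void sketch_map_bands(uint16_t *band2queue, int bands, int nqueues)
{
	int mod, qmapoffset;
	int queue = 0;
	int offset = 0;

	if (bands < nqueues) {
		qmapoffset = 1;
		mod = nqueues;
	} else {
		mod = bands % nqueues;
		qmapoffset = bands / nqueues + (mod ? 1 : 0);
	}

	for (int i = 0; i < bands; i++) {
		band2queue[i] = (uint16_t)queue;
		/* Current queue has its share of bands: move to the next one. */
		if ((i + 1) - offset == qmapoffset) {
			queue++;
			offset += qmapoffset;
			if (mod)
				mod--;
			qmapoffset = bands / nqueues + (mod ? 1 : 0);
		}
	}
}
```

With 5 bands and 2 queues this yields the mapping 0 0 0 1 1; with 2 bands
and 4 queues it yields 0 1, leaving the remaining queues unused.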

[RFC] NET: Multiple queue hardware support

2007-06-04 Thread PJ Waskiewicz
This patchset is an updated version of previous multiqueue network device
support patches.  The general approach of introducing a new API for multiqueue
network devices to register with the stack has remained.  The changes include
adding a round-robin qdisc, heavily based on sch_prio, which will allow
queueing to hardware with no OS-enforced queueing policy.  sch_prio still has
the multiqueue code in it, but has a Kconfig option to compile it out of the
qdisc.  This allows people with hardware containing scheduling policies to
use sch_rr (round-robin), and others without scheduling policies in hardware
to continue using sch_prio if they wish to have some notion of scheduling
priority.

The patches to iproute2 for tc will be sent separately, to support sch_rr.

I'm soliciting feedback for a 2.6.23 submission.  Thanks.

-- 
PJ Waskiewicz <[EMAIL PROTECTED]>