Re: [PATCH] d80211: remove unused Super AG definitions, purge comment

2006-10-16 Thread Johannes Berg
On Mon, 2006-10-16 at 11:39 -0700, David Kimdon wrote:

> - MODE_ATHEROS_PRIME = 5 /* Atheros Dynamic Turbo mode */,
> - MODE_ATHEROS_PRIMEG = 6 /* Atheros Dynamic Turbo mode G */,
>   NUM_IEEE80211_MODES = 7

You want to adjust that last constant there too, I guess. Why is it an
enum anyway if things are assigned statically?
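
To illustrate the question: if the enumerators are left unassigned, the
compiler numbers them and a trailing count entry follows along on its
own when a mode is removed. This is only a sketch; the EX_ names are
placeholders, not the real d80211 identifiers. Explicit values are
usually kept when the numbers are part of an external interface, and
the thread does not say whether that applies here.

```c
#include <assert.h>

/* Hypothetical stand-in for the mode enum touched by the patch. */
enum ex_phy_mode {
	EX_MODE_A,       /* 0: numbered by the compiler */
	EX_MODE_B,       /* 1 */
	EX_MODE_TURBO,   /* 2 */
	EX_MODE_G,       /* 3 */
	EX_MODE_TURBOG,  /* 4 */
	EX_NUM_MODES     /* 5: always one past the last real mode */
};

/* Tables sized by the trailing constant stay in sync automatically. */
static const char *const ex_mode_names[EX_NUM_MODES] = {
	[EX_MODE_A]      = "a",
	[EX_MODE_B]      = "b",
	[EX_MODE_TURBO]  = "turbo",
	[EX_MODE_G]      = "g",
	[EX_MODE_TURBOG] = "turbo-g",
};

const char *ex_mode_name(enum ex_phy_mode m)
{
	assert((unsigned int)m < EX_NUM_MODES);
	return ex_mode_names[m];
}
```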

johannes
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [take19 1/4] kevent: Core files.

2006-10-16 Thread Chase Venters
On Tuesday 17 October 2006 00:09, Johann Borck wrote:
> Regarding mukevent I'm thinking of an event-type specific struct, that is
> filled by the originating code, and placed into a per-event-type ring
> buffer (which requires modification of kevent_wait).

I'd personally worry about an implementation that used a per-event-type ring 
buffer, because you're still left having to hack around starvation issues in 
user-space. It is of course possible under the current model for anyone who 
wants per-event-type ring buffers to have them - just make separate kevent 
sets.

I haven't thought this through all the way yet, but why not have variable 
length event structures and have the kernel fill in a "next" pointer in each 
one? This could even be used to keep backwards binary compatibility while 
adding additional fields to the structures over time, though no space would 
be wasted on modern programs. You still end up with a question of what to do 
in case of overflow, but I'm thinking the thing to do in that case might be 
to start pushing overflow events onto a linked list which can be written back 
into the ring buffer when space becomes available. The appropriate behavior 
would be to throw new events on the linked list if the linked list had any 
events, so that things are delivered in order, but write to the mapped buffer 
directly otherwise.
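
A rough userspace sketch of that idea, under several assumptions: the
"next" link is stored as a length in each header rather than as a
pointer, the ring is tiny and single-threaded, and every ex_ name is
made up for illustration. It is not kevent code. Events go straight to
the ring while the overflow list is empty, and are parked on the list
otherwise so that ordering is kept.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define EX_RING_BYTES 64u       /* tiny on purpose, so overflow is easy to hit */

struct ex_hdr {                 /* precedes every variable-length payload */
	uint32_t len;           /* header + payload; adding it finds the next event */
	uint32_t type;
};

struct ex_pending {             /* one overflowed event, kept outside the ring */
	struct ex_pending *next;
	struct ex_hdr hdr;
	unsigned char payload[];
};

struct ex_queue {
	unsigned char ring[EX_RING_BYTES];
	uint32_t head, tail;    /* free-running byte counters */
	struct ex_pending *ovf_first, *ovf_last;
};

static void ex_copy_in(struct ex_queue *q, const void *src, uint32_t n)
{
	const unsigned char *s = src;

	while (n--)
		q->ring[q->head++ % EX_RING_BYTES] = *s++;
}

static void ex_copy_out(struct ex_queue *q, void *dst, uint32_t n)
{
	unsigned char *d = dst;

	while (n--)
		*d++ = q->ring[q->tail++ % EX_RING_BYTES];
}

/* Returns 1 if the event was written to the ring, 0 if it did not fit. */
static int ex_ring_put(struct ex_queue *q, const struct ex_hdr *h, const void *payload)
{
	if (EX_RING_BYTES - (q->head - q->tail) < h->len)
		return 0;
	ex_copy_in(q, h, sizeof(*h));
	ex_copy_in(q, payload, h->len - (uint32_t)sizeof(*h));
	return 1;
}

/* Returns 0 if written to the ring, 1 if parked on the overflow list, -1 on error. */
int ex_post(struct ex_queue *q, uint32_t type, const void *payload, uint32_t plen)
{
	struct ex_hdr h = { .len = (uint32_t)sizeof(h) + plen, .type = type };
	struct ex_pending *p;

	if (h.len > EX_RING_BYTES)
		return -1;                      /* could never fit */
	if (!q->ovf_first && ex_ring_put(q, &h, payload))
		return 0;                       /* direct write, list is empty */
	p = malloc(sizeof(*p) + plen);
	if (!p)
		return -1;
	p->next = NULL;
	p->hdr = h;
	if (plen)
		memcpy(p->payload, payload, plen);
	if (q->ovf_last)
		q->ovf_last->next = p;
	else
		q->ovf_first = p;
	q->ovf_last = p;
	return 1;
}

/* Move parked events back into the ring, oldest first, while they fit. */
void ex_drain(struct ex_queue *q)
{
	struct ex_pending *p;

	while ((p = q->ovf_first) && ex_ring_put(q, &p->hdr, p->payload)) {
		q->ovf_first = p->next;
		if (!q->ovf_first)
			q->ovf_last = NULL;
		free(p);
	}
}

/* Pop one event into buf (at least EX_RING_BYTES long); -1 if the ring is empty. */
int ex_consume(struct ex_queue *q, uint32_t *type, void *buf)
{
	struct ex_hdr h;

	if (q->head == q->tail)
		return -1;
	ex_copy_out(q, &h, sizeof(h));
	assert(h.len >= sizeof(h));
	ex_copy_out(q, buf, h.len - (uint32_t)sizeof(h));
	*type = h.type;
	return (int)(h.len - sizeof(h));
}
```

When ex_drain() should run (from a bottom half, or in the next blocking
wait) is exactly the open question in the text; the sketch leaves that
to the caller.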

Deciding when to do that is tricky, and I haven't thought through the 
implications fully when I say this, but what about activating a bottom half 
when more space becomes available, and let that drain overflowed events back 
into the mapped buffer? Or perhaps the time to do it would be in the next 
blocking wait, when the queue emptied? 

I think it is very important to avoid any limits that can not be adjusted on 
the fly at run-time by CAP_SYS_ADMIN or what have you. Doing it this way may 
have other problems I've ignored but at least the big one - compile-time 
capacity limits in the year 2006 - would be largely avoided :P

Nothing real solid yet, just some electrical storms in the grey matter...

Thanks,
Chase


Re: [PATCH] Bound TSO defer time (resend)

2006-10-16 Thread David Miller
From: John Heffner <[EMAIL PROTECTED]>
Date: Tue, 17 Oct 2006 00:18:33 -0400

> Stephen Hemminger wrote:
> > On Mon, 16 Oct 2006 20:53:20 -0400 (EDT)
> > John Heffner <[EMAIL PROTECTED]> wrote:
> 
> >> This patch limits the amount of time you will defer sending a TSO segment
> >> to less than two clock ticks, or the time between two acks, whichever is
> >> longer.
> 
> > 
> > Okay, but doing any timing on clock ticks makes the behavior dependent
> > on the value of HZ which doesn't seem desirable. Should this be based
> > on RTT or a real-time value?
> 
> It would be nice to use a high res clock so you don't depend on HZ, but 
> this is still expensive on most SMP arch's as I understand it.

Right, so we do need to use a jiffies-based solution.

Since HZ is variable, I have a feeling that the thing to do here
is pick some timeout in msec.  Then replace the "2 clock ticks"
with some msecs_to_jiffies() calls, bottoming out at 1 jiffy.

How does that sound?
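
A small userspace sketch of the conversion, for illustration only.
ex_msecs_to_jiffies() is a hypothetical stand-in that takes hz as a
parameter (the kernel helper uses the compile-time HZ), and rounding up
is this sketch's choice. The floor of one tick is the "bottoming out"
described above.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's msecs_to_jiffies(); rounds up. */
static unsigned long ex_msecs_to_jiffies(unsigned int ms, unsigned int hz)
{
	return ((unsigned long)ms * hz + 999) / 1000;
}

/* Timeout given in msec, converted to ticks, never less than one tick. */
unsigned long ex_tso_defer_ticks(unsigned int ms, unsigned int hz)
{
	unsigned long t;

	assert(hz > 0);
	t = ex_msecs_to_jiffies(ms, hz);
	return t ? t : 1;
}
```

With a 2 msec timeout this gives 2 ticks at HZ=1000 but only 1 tick at
HZ=100, so the bound in wall-clock terms no longer scales with HZ above
the one-tick floor.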


Re: [PATCH 7/13] [RFC] [IPV6] Move source address selection into route lookup.

2006-10-16 Thread David Miller
From: Ville Nuorvala <[EMAIL PROTECTED]>
Date: Tue, 17 Oct 2006 03:13:17 +0300

> This patch moves the normal source address selection from
> ip6_dst_lookup() into ip6_pol_route_output(), but shouldn't
> change the routing or source address selection behavior in
> any way.
> 
> Signed-off-by: Ville Nuorvala <[EMAIL PROTECTED]>

Although this conversion is very clean and the next patch
is very logical, I'm going to hold on all patches from 7 onward
so there is some time for some discussion of the RFC'ness
of them :-)

Thanks.


Re: [PATCH 6/13] [IPV6] Always copy rt->u.dst.error when copying a rt6_info.

2006-10-16 Thread David Miller
From: Ville Nuorvala <[EMAIL PROTECTED]>
Date: Tue, 17 Oct 2006 03:10:49 +0300

> Signed-off-by: Ville Nuorvala <[EMAIL PROTECTED]>

Looks good, applied.

Ville, can you fixup Thunderbird to not corrupt your patches?
The specific corruption is that if the patch has a completely
empty line with just a space at the beginning, thunderbird is
killing that space which makes the patch bad (at least in GIT's
eyes, which is all that matters :-)

Thanks!


Re: [PATCH 5/13] [IPV6] Make IPV6_SUBTREES depend on IPV6_MULTIPLE_TABLES.

2006-10-16 Thread David Miller
From: Ville Nuorvala <[EMAIL PROTECTED]>
Date: Tue, 17 Oct 2006 03:08:35 +0300

> As IPV6_SUBTREES can't work without IPV6_MULTIPLE_TABLES have IPV6_SUBTREES
> depend on it.
> 
> Signed-off-by: Ville Nuorvala <[EMAIL PROTECTED]>

Good catch, patch applied, thanks.


Re: [PATCH 4/13] [IPV6] Clean up BACKTRACK().

2006-10-16 Thread David Miller
From: Ville Nuorvala <[EMAIL PROTECTED]>
Date: Tue, 17 Oct 2006 03:06:27 +0300

> The fn check is unnecessary as fn can never be NULL in BACKTRACK().
> 
> Signed-off-by: Ville Nuorvala <[EMAIL PROTECTED]>

Applied, especially valid since we're walking parents up to
the root: we break out on hitting root, and root's parent is
itself :-)

Thanks.


Re: [PATCH 3/13] [IPV6] Make sure error handling is done when calling ip6_route_output().

2006-10-16 Thread David Miller
From: Ville Nuorvala <[EMAIL PROTECTED]>
Date: Tue, 17 Oct 2006 03:04:08 +0300

> As ip6_route_output() never returns NULL, error checking must be done by
> looking at dst->error instead of comparing dst against NULL.
> 
> Signed-off-by: Ville Nuorvala <[EMAIL PROTECTED]>

Good catch, patch applied.

Thanks a lot.
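
The pattern the changelog describes, as a self-contained mock. The
ex_ types and functions are hypothetical stand-ins, not the kernel's
struct dst_entry or ip6_route_output(); the point is only that a lookup
which always returns an object has to be checked through its error
field.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct ex_dst {
	int error;              /* 0 on success, negative errno otherwise */
};

static struct ex_dst ex_ok      = { .error = 0 };
static struct ex_dst ex_unreach = { .error = -ENETUNREACH };

/* Mock lookup: like the function discussed above, it never returns NULL. */
struct ex_dst *ex_route_output(int reachable)
{
	return reachable ? &ex_ok : &ex_unreach;
}

int ex_send(int reachable)
{
	struct ex_dst *dst = ex_route_output(reachable);

	assert(dst != NULL);            /* a NULL test alone would never fire */
	if (dst->error)
		return dst->error;      /* real code would also drop its dst reference */
	return 0;
}
```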


Re: [PATCH 1/13] [IPV6] Remove struct pol_chain.

2006-10-16 Thread David Miller
From: Ville Nuorvala <[EMAIL PROTECTED]>
Date: Tue, 17 Oct 2006 02:54:27 +0300

> Struct pol_chain has existed since at least the 2.2 kernel, but isn't used
> anymore. As the IPv6 policy routing is implemented in a totally different
> way in the current kernel, just get rid of it.
> 
> Signed-off-by: Ville Nuorvala <[EMAIL PROTECTED]>

That's obvious enough, good catch.

Applied, thanks a lot.


Re: [PATCH 2/13] [SCTP] Fix minor typo

2006-10-16 Thread David Miller
From: Ville Nuorvala <[EMAIL PROTECTED]>
Date: Tue, 17 Oct 2006 02:56:55 +0300

> Signed-off-by: Ville Nuorvala <[EMAIL PROTECTED]>

Also applied, thanks.

Please format your changelog headers properly, make
it "[TOPIC]: " instead of "[TOPIC] ".  Thanks.


Re: [take19 1/4] kevent: Core files.

2006-10-16 Thread Johann Borck
Ulrich Drepper wrote:
> Evgeniy Polyakov wrote:
>> Existing design does not allow overflow.
>
> And I've pointed out a number of times that this is not practical at
> best.  There are event sources which can create events which cannot be
> coalesced into one single event as it would be required with your design.
>
> Signals are one example, specifically realtime signals.  If we do not
> want the design to be limited from the start this approach has to be
> thought over.
>
>
>>> So zap mmap() support completely, since it is not usable at all. We
>>> wont discuss on it.
>>
>> Initial implementation did not have it.
>> But I was requested to do it, and it is ready now.
>> No one likes it, but no one provides an alternative implementation.
>> We are stuck.
>
> We need the mapped ring buffer.  The current design (before it was
> removed) was broken but this does not mean it shouldn't be
> implemented.  We just need more time to figure out how to implement it
> correctly.
>
Considering the if-at-all and, if so, the how of the ring buffer
implementation, I'd like to throw in some ideas I had when reading the
discussion and the respective code.

If I understood Ulrich Drepper right, his notion of a generic event
handling interface is that it has to be flexible enough to transport
additional info from origin to userspace, and to support queuing of
events from the same origin, so that additional per-event-occurrence
data doesn't get lost, which would happen when coalescing multiple
events into one until delivery. From what I read, he says the ring
buffer is broken because of insufficient space for additional data
(mukevent) and the limited number of events that can be put into the
ring buffer. Another argument is the missing notification of userspace
about dropped events in case the ring buffer limit is reached. (Is that
right?)

I see no reason why kevent couldn't be modified to fit (all) these
needs. While modifying the server example and writing a client using
kevent I came across the coalescing problem: there were more incoming
connections than accept events, and I had to work around that. In this
case the pure number of coalesced events would suffice, while it
wouldn't for the example of RT signals that Ulrich Drepper gave. So
whether coalescing can be done at all depends on the type of event. The
same goes for additional data delivered with the events. There might be
no panacea for all possible scenarios with one fixed design. Either
performance suffers for 'lightweight' events which don't need
additional data and/or the ring buffer and/or for which coalescing is
not problematic, or kevent is not usable for other types of events. Why
not treat different things differently, and let the (kernel) user
decide?

I don't know if I got all this right, but if so, then the ring buffer
is needed especially for cases where coalescing is not possible and
additional data has to be delivered for each triggered notification (so
the pure number of events is not enough; other reasons? performance?).
To me it doesn't make sense to have kevent fill memory and use
processor time if the buffer is not used at all, which is the case when
using kevent_getevents.

So here are my ideas:

- Make usage of the ring buffer optional; if it is not required for a
  specific event type, it might be chosen by userspace code.
- Make the limit of events in the ring buffer optional and controllable
  from userspace.
- Regarding mukevent, I'm thinking of an event-type-specific struct that
  is filled by the originating code and placed into a per-event-type
  ring buffer (which requires modification of kevent_wait). To my
  limited understanding it seems that alternative or modified versions
  of kevent_storage_ready, (__)kevent_requeue and
  kevent_user_ring_add_event could return a void pointer to the position
  in the buffer, and all kevent has to know about is the size of the
  struct.
- If coalescing doesn't hurt for a specific event type, it might just be
  modified to notify userspace about the number of coalesced events.
  Make it depend on the type of event.
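
A minimal sketch of the per-event-type ring idea, assuming fixed-size
slots: the ring only knows the slot size, hands out a void pointer, and
the originating code fills in its own struct. All ex_ names, including
struct ex_accept_event, are invented for illustration and are not part
of kevent. The missed counter stands in for "notify userspace about the
number of coalesced events".

```c
#include <assert.h>
#include <stddef.h>

struct ex_type_ring {
	unsigned char *mem;     /* slots * slot_size bytes of storage */
	size_t slot_size;       /* size of the event-type-specific struct */
	unsigned int slots;
	unsigned int head;      /* index of the oldest filled slot */
	unsigned int count;     /* number of filled slots */
	unsigned int missed;    /* events that found no free slot */
};

/* Hypothetical payload for one event type. */
struct ex_accept_event {
	int fd;
	unsigned int pending;
};

/* Reserve the next free slot; the caller fills it in. NULL if full. */
void *ex_ring_reserve(struct ex_type_ring *r)
{
	void *slot;

	if (r->count == r->slots) {
		r->missed++;
		return NULL;
	}
	slot = r->mem + (size_t)((r->head + r->count) % r->slots) * r->slot_size;
	r->count++;
	return slot;
}

/* Oldest filled slot, or NULL if the ring is empty. */
void *ex_ring_peek(const struct ex_type_ring *r)
{
	return r->count ? r->mem + (size_t)r->head * r->slot_size : NULL;
}

/* Release the oldest filled slot. */
void ex_ring_pop(struct ex_type_ring *r)
{
	assert(r->count > 0);
	r->head = (r->head + 1) % r->slots;
	r->count--;
}
```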

I know this doesn't address all objections that have been made, and
Evgeniy, big sorry for this being just talk again, and maybe not even
applicable for some reasons I'm overlooking, but maybe it's worth
consideration. I'll gladly try to put that into code, and see where it
leads. I think kevent is great, and if things can be done to increase
its genericity without sacrificing performance, why not.
Sorry for the length of post and repetitions,

Johann


Re: [PATCH 0/14] TIPC updates

2006-10-16 Thread David Miller
From: Per Liden <[EMAIL PROTECTED]>
Date: Fri, 13 Oct 2006 13:37:23 +0200 (CEST)

> This patch set includes a number of TIPC fixes/cleanups. Please see each 
> individual patch for further description.
> 
> Please pull from:
> 
>  git://tipc.cslab.ericsson.net/pub/git/tipc.git
> 
>  (rebased on linux/kernel/git/davem/net-2.6.git)

I applied everything except patch 8/14, you really need to
add proper SKB queue locking to handle that race.  I think
the "performance cost" of taking that lock is much overstated,
you should never have contention on that lock at all.

Secondly, I never pull from your trees because I still have
to make many fixups to your patches:

1) Please add a proper colon to your changeset header lines,
   it should be "[TIPC]: ", not "[TIPC] ".

2) Please check for trailing whitespace added by your patches.
   I've given you the command you can use in another email to
   check this for yourself before submission.

3) Please get full proper signed-off-by lines from patch submitters,
   especially when the patch is more than a trivial 1 or 2 liner.

Thanks.


Re: [PATCH 14/14] [TIPC] Updated TIPC version number to 1.6.2

2006-10-16 Thread David Miller
From: Per Liden <[EMAIL PROTECTED]>
Date: Fri, 13 Oct 2006 13:37:55 +0200

> From: Allan Stephens <[EMAIL PROTECTED]>
> 
> Signed-off-by: Allan Stephens <[EMAIL PROTECTED]>
> Signed-off-by: Per Liden <[EMAIL PROTECTED]>

Applied, thanks.


Re: [PATCH 13/14] [TIPC] Unrecognized configuration command now returns error message

2006-10-16 Thread David Miller
From: Per Liden <[EMAIL PROTECTED]>
Date: Fri, 13 Oct 2006 13:37:54 +0200

> From: Allan Stephens <[EMAIL PROTECTED]>
> 
> This patch causes TIPC to return an error message when it receives
> an unrecognized configuration command.  (Previously, the sender
> received no feedback.)
> 
> Signed-off-by: Allan Stephens <[EMAIL PROTECTED]>
> Signed-off-by: Per Liden <[EMAIL PROTECTED]>

Applied.


Re: [PATCH 12/14] [TIPC] Added subscription cancellation capability

2006-10-16 Thread David Miller
From: Per Liden <[EMAIL PROTECTED]>
Date: Fri, 13 Oct 2006 13:37:53 +0200

> From: Lijun Chen <[EMAIL PROTECTED]>
> 
> This patch allows a TIPC application to cancel an existing
> topology service subscription by re-requesting the subscription
> with the TIPC_SUB_CANCEL filter bit set.  (All other bits of
> the cancel request must match the original subscription request.)
> 
> Signed-off-by: Allan Stephens <[EMAIL PROTECTED]>
> Signed-off-by: Per Liden <[EMAIL PROTECTED]>

Applied, but had some trailing whitespace additions to clean up,
and would you please ask all patch authors to provide proper
signed-off-by lines in the future?  Thanks.


Re: [PATCH 10/14] [TIPC] Fixed slow link reactivation when link tolerance is large

2006-10-16 Thread David Miller
From: Per Liden <[EMAIL PROTECTED]>
Date: Fri, 13 Oct 2006 13:37:51 +0200

> From: Allan Stephens <[EMAIL PROTECTED]>
> 
> This patch corrects an issue wherein a previously failed node could
> not reestablish a link to a non-failing node in the TIPC network
> until the latter node detected the link failure itself (which might
> be configured to take up to 30 seconds).  The non-failing node now
> responds to link setup requests from a previously failed node in at
> most 1 second, allowing it to detect the link failure more quickly.
> 
> Signed-off-by: Allan Stephens <[EMAIL PROTECTED]>
> Signed-off-by: Per Liden <[EMAIL PROTECTED]>

Applied, thanks.


Re: [PATCH 11/14] [TIPC] Can now list multicast link on an isolated network node

2006-10-16 Thread David Miller
From: Per Liden <[EMAIL PROTECTED]>
Date: Fri, 13 Oct 2006 13:37:52 +0200

> From: Allan Stephens <[EMAIL PROTECTED]>
> 
> This patch fixes a minor bug that prevents "tipc-config -l" from
> displaying the multicast link if a TIPC node has never successfully
> established at least one unicast link.
> 
> Signed-off-by: Allan Stephens <[EMAIL PROTECTED]>
> Signed-off-by: Per Liden <[EMAIL PROTECTED]>

Applied, thanks.


Re: [PATCH 9/14] [TIPC] Name publication events now delivered in chronological order

2006-10-16 Thread David Miller
From: Per Liden <[EMAIL PROTECTED]>
Date: Fri, 13 Oct 2006 13:37:50 +0200

> From: Allan Stephens <[EMAIL PROTECTED]>
> 
> This patch trivially re-orders the entries in TIPC's list of local
> publications so that applications will receive publication events
> in the order they were published.
> 
> Signed-off-by: Allan Stephens <[EMAIL PROTECTED]>
> Signed-off-by: Per Liden <[EMAIL PROTECTED]>

Applied, thanks.


Re: [PATCH 8/14] [TIPC] Fix socket receive queue NULL pointer dereference on SMP systems

2006-10-16 Thread David Miller
From: Per Liden <[EMAIL PROTECTED]>
Date: Fri, 13 Oct 2006 13:37:49 +0200

> From: P Litov <[EMAIL PROTECTED]>
> 
> This patch corrects an SMP system-specific race condition which allowed
> TIPC to prematurely dereference the first sk_buff in a socket receive
> queue that was changing from empty to non-empty state.
> 
> Signed-off-by: Allan Stephens <[EMAIL PROTECTED]>
> Signed-off-by: Per Liden <[EMAIL PROTECTED]>

If you are going to access the socket packet without some other kind
of locking that prevents changes to the queue, you must take the skb
queue lock.  You can't dance around it by checking the linked list
pointer instead of the queue length.  Otherwise we'd be doing this all
over the UDP code and other datagram socket layers.  And we don't
because it simply isn't valid.
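
A userspace analogy of the rule, with a pthread mutex playing the role
of the skb queue lock. The ex_ types are hypothetical and this is not
TIPC or sk_buff code; it only shows the first element being read while
the lock that serializes queue changes is held, rather than after an
unlocked pointer test.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct ex_buf {
	struct ex_buf *next;
	int id;
};

struct ex_bufq {
	pthread_mutex_t lock;   /* stands in for the skb queue lock */
	struct ex_buf *first, *last;
	unsigned int qlen;
};

#define EX_BUFQ_INIT { PTHREAD_MUTEX_INITIALIZER, NULL, NULL, 0 }

/* Append b; all list fields change together under the lock. */
void ex_enqueue(struct ex_bufq *q, struct ex_buf *b)
{
	assert(b != NULL);
	b->next = NULL;
	pthread_mutex_lock(&q->lock);
	if (q->last)
		q->last->next = b;
	else
		q->first = b;
	q->last = b;
	q->qlen++;
	pthread_mutex_unlock(&q->lock);
}

/* Copy out the id of the first buffer; returns 0 if the queue is empty. */
int ex_peek_id(struct ex_bufq *q, int *id)
{
	int found = 0;

	pthread_mutex_lock(&q->lock);
	if (q->first) {         /* pointer and contents are stable only while locked */
		*id = q->first->id;
		found = 1;
	}
	pthread_mutex_unlock(&q->lock);
	return found;
}
```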

So I'm not applying this.

Also, this patch is missing a proper signed off line from the
patch author, P Litov.


Re: [PATCH 7/14] [TIPC] Add support for Ethernet VLANs

2006-10-16 Thread David Miller
From: Per Liden <[EMAIL PROTECTED]>
Date: Fri, 13 Oct 2006 13:37:48 +0200

> From: Allan Stephens <[EMAIL PROTECTED]>
> 
> This patch enhances TIPC's Ethernet support to include VLAN interfaces.
> 
> Signed-off-by: Allan Stephens <[EMAIL PROTECTED]>
> Signed-off-by: Per Liden <[EMAIL PROTECTED]>

Applied, more whitespace I had to fixup:

+ git apply --check --whitespace=error-all diff
Adds trailing whitespace.
diff:24: * (in case the message is sent off-node), 
fatal: 1 line adds trailing whitespaces.


Re: [PATCH 6/14] [TIPC] Remove code bloat introduced by print buffer rework

2006-10-16 Thread David Miller
From: Per Liden <[EMAIL PROTECTED]>
Date: Fri, 13 Oct 2006 13:37:47 +0200

> From: Allan Stephens <[EMAIL PROTECTED]>
> 
> This patch allows the compiler to optimize out any code that tries to
> send debugging output to the null print buffer (TIPC_NULL), a capability
> that was unintentionally broken during the recent print buffer rework.
> 
> Signed-off-by: Allan Stephens <[EMAIL PROTECTED]>
> Signed-off-by: Per Liden <[EMAIL PROTECTED]>

Applied, thanks.


Re: [PATCH 5/14] [TIPC] Optimize wakeup logic when socket has no waiting processes

2006-10-16 Thread David Miller
From: Per Liden <[EMAIL PROTECTED]>
Date: Fri, 13 Oct 2006 13:37:46 +0200

> From: Allan Stephens <[EMAIL PROTECTED]>
> 
> This patch adds a simple test so TIPC doesn't try waking up processes
> waiting on a socket if there are none waiting.
> 
> Signed-off-by: Allan Stephens <[EMAIL PROTECTED]>
> Signed-off-by: Per Liden <[EMAIL PROTECTED]>

Applied.


Re: [PATCH 4/14] [TIPC] Added duplicate node address detection capability

2006-10-16 Thread David Miller
From: Per Liden <[EMAIL PROTECTED]>
Date: Fri, 13 Oct 2006 13:37:45 +0200

> From: Allan Stephens <[EMAIL PROTECTED]>
> 
> TIPC now rejects and logs link setup requests from node  if the
> receiving node already has a functional link to that node on the associated
> interface, or if the requestor is using the same  as the receiver.
> 
> Signed-off-by: Allan Stephens <[EMAIL PROTECTED]>
> Signed-off-by: Per Liden <[EMAIL PROTECTED]>

Applied, but more whitespace crap I had to fix up:

[EMAIL PROTECTED]:~/src/GIT/net-2.6$ pcheck diff
+ git apply --check --whitespace=error-all diff
Adds trailing whitespace.
diff:19:tipc_printf(pb, "%s(%s)", m_ptr->name, 
Adds trailing whitespace.
diff:46:static void disc_dupl_alert(struct bearer *b_ptr, u32 node_addr, 
Adds trailing whitespace.
diff:84:spin_unlock_bh(&n_ptr->lock);   
 
fatal: 3 lines add trailing whitespaces.


Re: [PATCH 3/14] [TIPC] Stream socket can now send > 66000 bytes at a time

2006-10-16 Thread David Miller
From: Per Liden <[EMAIL PROTECTED]>
Date: Fri, 13 Oct 2006 13:37:44 +0200

> From: Allan Stephens <[EMAIL PROTECTED]>
> 
> The stream socket send code was not initializing some required fields
> of the temporary msghdr structure it was utilizing; this is now fixed.
> A check has also been added to detect if a user illegally specifies
> a destination address when sending on an established stream connection.
> 
> Signed-off-by: Allan Stephens <[EMAIL PROTECTED]>
> Signed-off-by: Per Liden <[EMAIL PROTECTED]>

Applied, thanks.


Re: [PATCH 2/14] [TIPC] Debug print buffer enhancements and fixes

2006-10-16 Thread David Miller
From: Per Liden <[EMAIL PROTECTED]>
Date: Fri, 13 Oct 2006 13:37:43 +0200

> From: Allan Stephens <[EMAIL PROTECTED]>
> 
> This change modifies TIPC's print buffer code as follows:
> 1) Now supports small print buffers (min. size reduced from 512 bytes to 64)
> 2) Now uses TIPC_NULL print buffer structure to indicate null device
>    instead of NULL pointer (this simplified error handling)
> 3) Fixed misuse of console buffer structure by tipc_dump()
> 4) Added and corrected comments in various places
> 
> Signed-off-by: Allan Stephens <[EMAIL PROTECTED]>
> Signed-off-by: Per Liden <[EMAIL PROTECTED]>

Applied, please run trailing-whitespace checks on your patches,
f.e. using "git apply --check --whitespace=error-all diff".
Because often I have to fix up problems like the following in
your submissions:

[EMAIL PROTECTED]:~/src/GIT/net-2.6$ pcheck diff
+ git apply --check --whitespace=error-all diff
Adds trailing whitespace.
diff:25: * TIPC_LOG: TIPC log buffer 
Adds trailing whitespace.
diff:105: * 
Adds trailing whitespace.
diff:148: * 
Adds trailing whitespace.
diff:334:   printk("\n Start of %s log dump \n\n", 
Adds trailing whitespace.
diff:366:   tipc_printbuf_init(TIPC_LOG, kmalloc(log_size, GFP_ATOMIC), 
Adds trailing whitespace.
diff:393: * @next: used to link print buffers when printing to more than one at a time 
Adds trailing whitespace.
diff:395: 
fatal: 7 lines add trailing whitespaces.

Thanks.
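
For readers who want to see what such a check looks for, here is a
rough approximation in C of the per-line test. It is not git's
implementation: it only looks at added lines of a unified diff, treats
any line starting with "+++" as a file header, and ex_adds_trailing_ws
is a made-up name.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* True if a diff line adds content that ends in a space or tab. */
bool ex_adds_trailing_ws(const char *line)
{
	size_t n;

	assert(line != NULL);
	if (line[0] != '+' || strncmp(line, "+++", 3) == 0)
		return false;           /* context, removal, or file header */
	n = strlen(line);
	while (n > 1 && (line[n - 1] == '\n' || line[n - 1] == '\r'))
		n--;                    /* ignore the line terminator */
	return n > 1 && (line[n - 1] == ' ' || line[n - 1] == '\t');
}
```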


Re: [PATCH 1/14] [TIPC] Add missing unlock in port timeout code.

2006-10-16 Thread David Miller
From: Per Liden <[EMAIL PROTECTED]>
Date: Fri, 13 Oct 2006 13:37:42 +0200

> From: Allan Stephens <[EMAIL PROTECTED]>
> 
> Signed-off-by: Allan Stephens <[EMAIL PROTECTED]>
> Signed-off-by: Per Liden <[EMAIL PROTECTED]>

Applied, thanks.


Re: [PATCH] Bound TSO defer time (resend)

2006-10-16 Thread John Heffner

Stephen Hemminger wrote:
> On Mon, 16 Oct 2006 20:53:20 -0400 (EDT)
> John Heffner <[EMAIL PROTECTED]> wrote:
>
>> This patch limits the amount of time you will defer sending a TSO segment
>> to less than two clock ticks, or the time between two acks, whichever is
>> longer.
>
> Okay, but doing any timing on clock ticks makes the behavior dependent
> on the value of HZ which doesn't seem desirable. Should this be based
> on RTT or a real-time value?

It would be nice to use a high res clock so you don't depend on HZ, but 
this is still expensive on most SMP arch's as I understand it.


  -John



Re: [PATCH] Bound TSO defer time (resend)

2006-10-16 Thread Stephen Hemminger
On Mon, 16 Oct 2006 20:53:20 -0400 (EDT)
John Heffner <[EMAIL PROTECTED]> wrote:

> The original message didn't show up on the list.  I'm assuming it's
> because the filters didn't like the attached postscript.  I posted PDFs of
> the figures on the web:
> 
> http://www.psc.edu/~jheffner/tmp/a.pdf
> http://www.psc.edu/~jheffner/tmp/b.pdf
> http://www.psc.edu/~jheffner/tmp/c.pdf
> 
>   -John
> 
> 
> -- Forwarded message --
> Date: Mon, 16 Oct 2006 15:55:53 -0400 (EDT)
> From: John Heffner <[EMAIL PROTECTED]>
> To: David Miller <[EMAIL PROTECTED]>
> Cc: netdev 
> Subject: [PATCH] Bound TSO defer time
> 
> This patch limits the amount of time you will defer sending a TSO segment
> to less than two clock ticks, or the time between two acks, whichever is
> longer.
> 
> On slow links, deferring causes significant bursts.  See attached plots,
> which show RTT through a 1 Mbps link with a 100 ms RTT and ~100 ms queue
> for (a) non-TSO, (b) current TSO, and (c) patched TSO.  This burstiness
> causes significant jitter, tends to overflow queues early (bad for short
> queues), and makes delay-based congestion control more difficult.
> 
> Deferring by a couple clock ticks I believe will have a relatively small
> impact on performance.
> 
> 
> Signed-off-by: John Heffner <[EMAIL PROTECTED]>

Okay, but doing any timing on clock ticks makes the behavior dependent
on the value of HZ which doesn't seem desirable. Should this be based
on RTT or a real-time value?


Re: socket/IP on Linux

2006-10-16 Thread Jingping Lin

Arnaldo:
Sorry, I have to bother you again with another Linux
socket question.

Suppose that I have a Linux IP socket connected for a
TCP connection and the socket is set as a non-blocking
one with fcntl().

Even if the socket is set as non-blocking, is it really
possible to perform a non-blocking close on this socket?
I.e., can I make the "close(fd)" call a non-blocking
call?

The answer seems No to me based on my study. I am not
totally sure though.

If the answer is Yes, how?

Please help, thanks a lot,
Jingping
  
--- Arnaldo Carvalho de Melo <[EMAIL PROTECTED]> wrote:

> On 10/5/06, Jingping Lin <[EMAIL PROTECTED]> wrote:
> > Hello, Linux Kernel:
> > For a project I will work on for mobile, I am looking
> > into the IP stacks on Linux.
> >
> > I have a few questions to bother you:
>
> No bothering, so far, please see the below answers and try to check
> them all before "bothering" again 8)
>
> > 1. is "socket.c" the file handling the socket
> > interface?
>
> One of them
>
> > 2. which function is for opening a socket?
> > It looks like "sock_map_fd()" is the one for
> > opening/creating a socket? Is that correct?
> > The "Linux IP Stacks Commentary" book suggested the
> > function is "int socket()" which I didn't find in
> > "socket.c" though.
>
> Perhaps it is suggesting that you create the socket in userspace using
> the libc socket(2) function (see 'man socket') and then passing it
> thru some ioctl if you want to use kernel_sendmsg (make tags ; vi -t
> kernel_sendmsg) from kernelspace?
>
> > 3. Do you have documentations discussing in details
> > the implemented socket interfaces?
>
> Humm, I guess you can grep the sources for in kernel socket usage?
>
> > Thanks a lot in advance for your help,
>
> Best Regards,
>
> - Arnaldo




[PATCH] Bound TSO defer time (resend)

2006-10-16 Thread John Heffner
The original message didn't show up on the list.  I'm assuming it's
because the filters didn't like the attached postscript.  I posted PDFs of
the figures on the web:

http://www.psc.edu/~jheffner/tmp/a.pdf
http://www.psc.edu/~jheffner/tmp/b.pdf
http://www.psc.edu/~jheffner/tmp/c.pdf

  -John


-- Forwarded message --
Date: Mon, 16 Oct 2006 15:55:53 -0400 (EDT)
From: John Heffner <[EMAIL PROTECTED]>
To: David Miller <[EMAIL PROTECTED]>
Cc: netdev 
Subject: [PATCH] Bound TSO defer time

This patch limits the amount of time you will defer sending a TSO segment
to less than two clock ticks, or the time between two acks, whichever is
longer.

On slow links, deferring causes significant bursts.  See attached plots,
which show RTT through a 1 Mbps link with a 100 ms RTT and ~100 ms queue
for (a) non-TSO, (b) current TSO, and (c) patched TSO.  This burstiness
causes significant jitter, tends to overflow queues early (bad for short
queues), and makes delay-based congestion control more difficult.

I believe that deferring by only a couple of clock ticks will have a
relatively small impact on performance.


Signed-off-by: John Heffner <[EMAIL PROTECTED]>


diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 0e058a2..27ae4b2 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -341,7 +341,9 @@ #endif
int linger2;

unsigned long last_synq_overflow;
-
+
+   __u32   tso_deferred;
+
 /* Receiver side RTT estimation */
struct {
__u32   rtt;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 9a253fa..3ea8973 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1087,11 +1087,15 @@ static int tcp_tso_should_defer(struct s
u32 send_win, cong_win, limit, in_flight;

if (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN)
-   return 0;
+   goto send_now;

if (icsk->icsk_ca_state != TCP_CA_Open)
-   return 0;
+   goto send_now;

+   /* Defer for less than two clock ticks. */
+   if (tp->tso_deferred && ((jiffies<<1)>>1) - (tp->tso_deferred>>1) > 1)
+   goto send_now;
+
in_flight = tcp_packets_in_flight(tp);

BUG_ON(tcp_skb_pcount(skb) <= 1 ||
@@ -1106,8 +1110,8 @@ static int tcp_tso_should_defer(struct s

/* If a full-sized TSO skb can be sent, do it. */
if (limit >= 65536)
-   return 0;
-
+   goto send_now;
+
if (sysctl_tcp_tso_win_divisor) {
u32 chunk = min(tp->snd_wnd, tp->snd_cwnd * tp->mss_cache);

@@ -1116,7 +1120,7 @@ static int tcp_tso_should_defer(struct s
 */
chunk /= sysctl_tcp_tso_win_divisor;
if (limit >= chunk)
-   return 0;
+   goto send_now;
} else {
/* Different approach, try not to defer past a single
 * ACK.  Receiver should ACK every other full sized
@@ -1124,11 +1128,17 @@ static int tcp_tso_should_defer(struct s
 * then send now.
 */
if (limit > tcp_max_burst(tp) * tp->mss_cache)
-   return 0;
+   goto send_now;
}
-
+
/* Ok, it looks like it is advisable to defer.  */
+   tp->tso_deferred = 1 | (jiffies<<1);
+
return 1;
+
+send_now:
+   tp->tso_deferred = 0;
+   return 0;
 }

 /* Create a new MTU probe if we are ready.
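
The timestamp encoding used above can be tried out in userspace. What
follows is a minimal sketch, not the kernel code: tso_stamp(),
tso_elapsed() and tso_defer_expired() are hypothetical names, the tick
counter is assumed to be a 32-bit value, and the explicit 31-bit mask on
the difference is an addition of this sketch to make wraparound behaviour
obvious. It also assumes that a zero stamp means "not currently
deferring", so the bound only applies once a deferral has started.

```c
#include <stdint.h>

/* Encode "deferral started at now_ticks": bit 0 marks the stamp as valid,
 * bits 1..31 hold the low 31 bits of the tick count. */
uint32_t tso_stamp(uint32_t now_ticks)
{
    return 1u | (now_ticks << 1);
}

/* Ticks elapsed since the stamp was taken, modulo 2^31. */
uint32_t tso_elapsed(uint32_t stamp, uint32_t now_ticks)
{
    uint32_t now31 = (now_ticks << 1) >> 1;   /* keep the low 31 bits */

    return (now31 - (stamp >> 1)) & 0x7fffffffu;
}

/* Returns 1 when a deferral is in progress and has lasted more than one
 * tick, i.e. when the segment should be sent now. */
int tso_defer_expired(uint32_t stamp, uint32_t now_ticks)
{
    return stamp != 0 && tso_elapsed(stamp, now_ticks) > 1;
}
```

Note that the sketch takes a 32-bit tick count. With a wider counter the
(x<<1)>>1 idiom would not truncate to 31 bits, so an explicit cast or mask
would be needed there.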


RE: PATCH zero-copy send completion callback

2006-10-16 Thread Eric Barton
David,

> Also, the correct mailing list to get to the networking developers
> is [EMAIL PROTECTED]  "linux-net" is for users.

Noted.

> Finally, I very much doubt you have much chance getting this
> change in, the infrastructure is implemented in a very ad-hoc
> fashion and it takes into consideration none of the potential
> other users of such a thing.  

Are you referring to the absence of a callback argument other than the
callback descriptor itself?  It seemed natural to me to contain the
descriptor in whatever state the higher-level protocol associates with the
message it's sending, and to derive this from the descriptor address in the
callback.

If this isn't what you mean, could you explain?  I'm not at all religious
about it.

> And these days we're trying to figure
> out how to eliminate skbuff and skb_shared_info struct members
> whereas you're adding 16-bytes of space on 64-bit platforms.

Do you think the general concept of a zero-copy completion callback is
useful?

If so, do you have any ideas about how to do it more economically?  It's 2
pointers rather than 1 to avoid forcing an unnecessary packet boundary
between successive zero-copy sends.  But I guess that might not be hugely
significant since you're generally sending many pages when zero-copy is
needed for performance.  Also, (please correct me if I'm wrong) I didn't
think this would push the allocation over to the next entry in
'malloc_sizes'.

Cheers,
Eric
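
For readers following along, the "derive this from the descriptor address"
idea is the usual embedded-struct pattern. Below is a minimal userspace
sketch; struct zc_completion, struct my_message and my_message_done() are
hypothetical names chosen for illustration and are not taken from the
patch under discussion. The container_of() macro is redefined locally so
the example stands alone.

```c
#include <stddef.h>

/* Hypothetical completion descriptor: it carries only the callback. */
struct zc_completion {
    void (*callback)(struct zc_completion *desc);
};

/* Hypothetical higher-level message state that embeds the descriptor. */
struct my_message {
    int id;
    int completed;
    struct zc_completion zc;
};

/* Recover a pointer to the enclosing structure from a member pointer. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Completion callback: no separate argument is needed because the
 * message state is found from the descriptor's own address. */
void my_message_done(struct zc_completion *desc)
{
    struct my_message *msg = container_of(desc, struct my_message, zc);

    msg->completed = 1;
}
```

The trade-off is that the descriptor must live inside the per-message
state for as long as the send is outstanding.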




[PATCH 9/13] [RFC] [SCTP] Merge IPv4 and IPv6 versions of get_saddr() with their corresponding get_dst().

2006-10-16 Thread Ville Nuorvala
Oops, this patch, almost more than any other in the series, should have been
marked RFC. Sorry about that!

Regards,
Ville


[PATCH 13/13] [RFC] [IPV6] Fix source prefix routing problems when source address undefined.

2006-10-16 Thread Ville Nuorvala

With IPv6 routing subtrees we need to take into account that the
source address is typically not specified at the time of the route
lookup.

There are two separate cases where this can happen. In the typical
case the source address hasn't been selected before the route lookup.
Skipping a source prefix policy rule because of this will lead to
inconsistent routing behavior between for example bound and unbound
sockets.

We avoid this by passing the policy rule source prefix to the lookup
and source address selection functions. For source prefix rules the
source address is selected before the route lookup, otherwise we do it
the other way around. The source address selection algorithm remains
virtually unchanged; the source prefix is just used to verify the
selected address is compatible with the rule. If the source address
doesn't match, the route lookup with the current rule is aborted and
is started again with the next rule in the policy chain.

The less common case is where the unspecified address is actually
used as a valid source address. When the kernel uses the unspecified
address it doesn't touch the routing table. We need to make sure a
userland application using a raw socket can do the same. If the user
includes the IPv6 header we therefore have to bypass the source
address selection even when the source address is unspecified. In
addition, we don't insert any routing cache entry created by such a
lookup.

Signed-off-by: Ville Nuorvala <[EMAIL PROTECTED]>
---
 include/net/addrconf.h |4 +++-
 include/net/ip6_fib.h  |   16 +++-
 net/ipv6/addrconf.c|   13 +++--
 net/ipv6/fib6_rules.c  |   16 ++--
 net/ipv6/ip6_fib.c |2 +-
 net/ipv6/ndisc.c   |2 +-
 net/ipv6/route.c   |   41 +
 7 files changed, 66 insertions(+), 28 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index d075693..7066362 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -67,8 +67,10 @@ #endif
 extern struct inet6_ifaddr *   ipv6_get_ifaddr(struct in6_addr *addr,
struct net_device *dev,
int strict);
-extern int ipv6_get_saddr(int pref_if,
+struct rt6key;
+extern int ipv6_get_saddr(int pref_if,
   struct in6_addr *daddr,
+  struct rt6key *sconstr,
   struct in6_addr *saddr);
 extern int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *);
 extern int ipv6_rcv_saddr_equal(const struct sock *sk,
diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index e4438de..8887b5c 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -21,6 +21,7 @@ #include 
 #include 
 #include 
 #include 
+#include 

 struct rt6_info;

@@ -77,6 +78,18 @@ struct rt6key
int plen;
 };

+struct fib6_rule
+{
+   struct fib_rule common;
+   struct rt6key   src;
+   struct rt6key   dst;
+#ifdef CONFIG_IPV6_ROUTE_FWMARK
+   u32 fwmark;
+   u32 fwmask;
+#endif
+   u8  tclass;
+};
+
 struct fib6_table;

 struct rt6_info
@@ -174,7 +187,8 @@ #define RT6_TABLE_LOCAL RT6_TABLE_MAIN
 #endif

 typedef struct rt6_info *(*pol_lookup_t)(struct fib6_table *,
-struct flowi *, int);
+struct flowi *, int,
+struct fib6_rule *);

 /*
  * exported functions
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 09a22c8..486af76 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -904,7 +904,8 @@ static int inline ipv6_saddr_label(const
return 1;
 }

-int ipv6_get_saddr(int pref_if, struct in6_addr *daddr, struct in6_addr *saddr)
+int ipv6_get_saddr(int pref_if, struct in6_addr *daddr,
+  struct rt6key *sconstr, struct in6_addr *saddr)
 {
struct ipv6_saddr_score hiscore;
struct inet6_ifaddr *ifa_result = NULL;
@@ -1151,7 +1152,15 @@ record_it:

if (!ifa_result)
return -EADDRNOTAVAIL;
-   
+#ifdef CONFIG_IPV6_SUBTREES
+   /* Don't let source address based routing interfere with the
+  address selection, just make sure the selected address
+  matches the routing policy constraints */
+
+   if (sconstr && sconstr->plen > 0 &&
+   !ipv6_prefix_equal(saddr, &sconstr->addr, sconstr->plen))
+   return -EADDRNOTAVAIL;
+#endif
ipv6_addr_copy(saddr, &ifa_result->addr);
in6_ifa_put(ifa_result);
return 0;
diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c
index fc56a19..a5b7803 100644
--- a/net/ipv6/fib6_rules.c
+++ b
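
The constraint described in the commit message (accept a selected source
address only if it falls inside the policy rule's source prefix) can be
illustrated outside the kernel. This is a minimal sketch under stated
assumptions: prefix_equal() is a from-scratch userspace helper, not the
kernel's ipv6_prefix_equal(), and saddr_allowed() is a hypothetical name.
Addresses are plain 16-byte arrays in network byte order.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if the first plen bits of two IPv6 addresses agree. */
int prefix_equal(const uint8_t a[16], const uint8_t b[16], unsigned int plen)
{
    unsigned int whole = plen / 8;   /* fully covered bytes */
    unsigned int rest = plen % 8;    /* leftover bits in the next byte */

    if (plen > 128)
        return 0;
    if (whole && memcmp(a, b, whole) != 0)
        return 0;
    if (rest) {
        uint8_t mask = (uint8_t)(0xff << (8 - rest));

        if ((a[whole] ^ b[whole]) & mask)
            return 0;
    }
    return 1;
}

/* A rule with a zero-length source prefix constrains nothing; otherwise
 * the selected address has to match the rule's prefix. */
int saddr_allowed(const uint8_t saddr[16], const uint8_t rule_prefix[16],
                  unsigned int rule_plen)
{
    return rule_plen == 0 || prefix_equal(saddr, rule_prefix, rule_plen);
}
```

When saddr_allowed() returns 0, the behaviour described above would be to
abandon the lookup for the current rule and move on to the next rule in
the policy chain.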

[PATCH 12/13] [RFC] [IPV6] Make sure route cache entries have a valid source address.

2006-10-16 Thread Ville Nuorvala

Leaving out the source address from routing cache entries when
using routing subtrees causes all kinds of problems. Make sure
this doesn't happen.

Signed-off-by: Ville Nuorvala <[EMAIL PROTECTED]>
---
 net/ipv6/route.c |   31 +--
 1 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 7cd7747..7c3438e 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -594,29 +594,28 @@ static struct rt6_info *rt6_alloc_cow(st

ipv6_addr_copy(&rt->rt6i_dst.addr, daddr);
rt->rt6i_dst.plen = 128;
-   rt->rt6i_flags |= RTF_CACHE;
-   rt->u.dst.flags |= DST_HOST;
-
 #ifdef CONFIG_IPV6_SUBTREES
-   if (rt->rt6i_src.plen && saddr) {
-   ipv6_addr_copy(&rt->rt6i_src.addr, saddr);
-   rt->rt6i_src.plen = 128;
-   }
+   ipv6_addr_copy(&rt->rt6i_src.addr, saddr);
+   rt->rt6i_src.plen = 128;
 #endif
-
+   rt->rt6i_flags |= RTF_CACHE;
+   rt->u.dst.flags |= DST_HOST;
rt->rt6i_nexthop = ndisc_get_neigh(rt->rt6i_dev, &rt->rt6i_gateway);
-
}

return rt;
 }

-static struct rt6_info *rt6_alloc_clone(struct rt6_info *ort, struct in6_addr *daddr)
+static struct rt6_info *rt6_alloc_clone(struct rt6_info *ort, struct in6_addr *daddr, struct in6_addr *saddr)
 {
struct rt6_info *rt = ip6_rt_copy(ort);
if (rt) {
ipv6_addr_copy(&rt->rt6i_dst.addr, daddr);
rt->rt6i_dst.plen = 128;
+#ifdef CONFIG_IPV6_SUBTREES
+   ipv6_addr_copy(&rt->rt6i_src.addr, saddr);
+   rt->rt6i_src.plen = 128;
+#endif
rt->rt6i_flags |= RTF_CACHE;
rt->u.dst.flags |= DST_HOST;
rt->rt6i_nexthop = neigh_clone(ort->rt6i_nexthop);
@@ -654,7 +653,7 @@ restart:
nrt = rt6_alloc_cow(rt, &fl->fl6_dst, &fl->fl6_src);
else {
 #if CLONE_OFFLINK_ROUTE
-   nrt = rt6_alloc_clone(rt, &fl->fl6_dst);
+   nrt = rt6_alloc_clone(rt, &fl->fl6_dst, &fl->fl6_src);
 #else
goto out2;
 #endif
@@ -756,10 +755,10 @@ restart:
ipv6_addr_copy(&fl->fl6_src, &saddr);
}
if (!rt->rt6i_nexthop && !(rt->rt6i_flags & RTF_NONEXTHOP))
-   nrt = rt6_alloc_cow(rt, &fl->fl6_dst, &fl->fl6_src);
+   nrt = rt6_alloc_cow(rt, &fl->fl6_dst, &saddr);
else {
 #if CLONE_OFFLINK_ROUTE
-   nrt = rt6_alloc_clone(rt, &fl->fl6_dst);
+   nrt = rt6_alloc_clone(rt, &fl->fl6_dst, &saddr);
 #else
goto out2;
 #endif
@@ -1429,6 +1428,10 @@ void rt6_redirect(struct in6_addr *dest,

ipv6_addr_copy(&nrt->rt6i_dst.addr, dest);
nrt->rt6i_dst.plen = 128;
+#ifdef CONFIG_IPV6_SUBTREES
+   ipv6_addr_copy(&nrt->rt6i_src.addr, src);
+   nrt->rt6i_src.plen = 128;
+#endif
nrt->u.dst.flags |= DST_HOST;

ipv6_addr_copy(&nrt->rt6i_gateway, (struct in6_addr*)neigh->primary_key);
@@ -1511,7 +1514,7 @@ void rt6_pmtu_discovery(struct in6_addr
if (!rt->rt6i_nexthop && !(rt->rt6i_flags & RTF_NONEXTHOP))
nrt = rt6_alloc_cow(rt, daddr, saddr);
else
-   nrt = rt6_alloc_clone(rt, daddr);
+   nrt = rt6_alloc_clone(rt, daddr, saddr);

if (nrt) {
nrt->u.dst.metrics[RTAX_MTU-1] = pmtu;
-- 
1.4.2.3



[PATCH 11/13] [RFC] [IPV6] Merge ipv6_dev_get_saddr() and ipv6_get_saddr().

2006-10-16 Thread Ville Nuorvala
The split into ipv6_get_saddr() and ipv6_dev_get_saddr() isn't necessary
anymore, so they can be merged into just the function ipv6_get_saddr().

Signed-off-by: Ville Nuorvala <[EMAIL PROTECTED]>
---
 include/net/addrconf.h |5 +
 net/ipv6/addrconf.c|   21 ++---
 net/ipv6/ndisc.c   |2 +-
 net/ipv6/route.c   |5 +++--
 4 files changed, 11 insertions(+), 22 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 44f1b67..d075693 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -67,10 +67,7 @@ #endif
 extern struct inet6_ifaddr *   ipv6_get_ifaddr(struct in6_addr *addr,
struct net_device *dev,
int strict);
-extern int ipv6_get_saddr(struct dst_entry *dst,
-  struct in6_addr *daddr,
-  struct in6_addr *saddr);
-extern int ipv6_dev_get_saddr(struct net_device *dev,
+extern int ipv6_get_saddr(int pref_if,
   struct in6_addr *daddr,
   struct in6_addr *saddr);
 extern int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *);
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index c186763..09a22c8 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -904,8 +904,7 @@ static int inline ipv6_saddr_label(const
return 1;
 }

-int ipv6_dev_get_saddr(struct net_device *daddr_dev,
-  struct in6_addr *daddr, struct in6_addr *saddr)
+int ipv6_get_saddr(int pref_if, struct in6_addr *daddr, struct in6_addr *saddr)
 {
struct ipv6_saddr_score hiscore;
struct inet6_ifaddr *ifa_result = NULL;
@@ -937,7 +936,7 @@ int ipv6_dev_get_saddr(struct net_device
 */
if ((daddr_type & IPV6_ADDR_MULTICAST ||
 daddr_scope <= IPV6_ADDR_SCOPE_LINKLOCAL) &&
-   daddr_dev && dev != daddr_dev)
+   pref_if && dev->ifindex != pref_if)
continue;

idev = __in6_dev_get(dev);
@@ -1062,13 +1061,13 @@ #endif

/* Rule 5: Prefer outgoing interface */
if (hiscore.rule < 5) {
-   if (daddr_dev == NULL ||
-   daddr_dev == ifa_result->idev->dev)
+   if (!pref_if ||
+   pref_if == ifa_result->idev->dev->ifindex)
hiscore.attrs |= IPV6_SADDR_SCORE_OIF;
hiscore.rule++;
}
-   if (daddr_dev == NULL ||
-   daddr_dev == ifa->idev->dev) {
+   if (!pref_if ||
+   pref_if == ifa->idev->dev->ifindex) {
score.attrs |= IPV6_SADDR_SCORE_OIF;
if (!(hiscore.attrs & IPV6_SADDR_SCORE_OIF)) {
score.rule = 5;
@@ -1158,14 +1157,6 @@ record_it:
return 0;
 }

-
-int ipv6_get_saddr(struct dst_entry *dst,
-  struct in6_addr *daddr, struct in6_addr *saddr)
-{
-   return ipv6_dev_get_saddr(dst ? ((struct rt6_info *)dst)->rt6i_idev->dev : NULL, daddr, saddr);
-}
-
-
 int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *addr)
 {
struct inet6_dev *idev;
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 0304b5f..3ac4e12 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -449,7 +449,7 @@ static void ndisc_send_na(struct net_dev
src_addr = solicited_addr;
in6_ifa_put(ifp);
} else {
-   if (ipv6_dev_get_saddr(dev, daddr, &tmpaddr))
+   if (ipv6_get_saddr(dev->ifindex, daddr, &tmpaddr))
return;
src_addr = &tmpaddr;
}
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index b7b8148..7cd7747 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -748,8 +748,9 @@ restart:
read_unlock_bh(&table->tb6_lock);

if (!has_saddr) {
+   int oif = rt->rt6i_dev->ifindex;
/* policy rule doesn't restrict source address */
-   if (ipv6_get_saddr(&rt->u.dst, &fl->fl6_dst, &saddr))
+   if (ipv6_get_saddr(oif, &fl->fl6_dst, &saddr))
goto no_saddr;
has_saddr = RT6_LOOKUP_F_HAS_SADDR;
ipv6_addr_copy(&fl->fl6_src, &saddr);
@@ -2051,7 +2052,7 @@ #endif
NLA_PUT_U32(skb, RTA_IIF, iif);
else if (dst) {
struct in6_addr saddr_buf;
-   if (!ipv6_get_saddr(&rt->u.dst, dst, &saddr_buf))
+   if (!ipv6_get_saddr(rt->rt6i

[PATCH 10/13] [RFC] [IPV6] Don't export ipv6_get_saddr().

2006-10-16 Thread Ville Nuorvala

To make sure the source address selection is done correctly, don't let
users outside the ipv6 module call ipv6_get_saddr() directly. Instead,
have them go through ip6_route_output().

Signed-off-by: Ville Nuorvala <[EMAIL PROTECTED]>
---
 net/ipv6/ipv6_syms.c |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/ipv6_syms.c b/net/ipv6/ipv6_syms.c
index 0e8e067..94a9806 100644
--- a/net/ipv6/ipv6_syms.c
+++ b/net/ipv6/ipv6_syms.c
@@ -25,7 +25,6 @@ EXPORT_SYMBOL(inet6_release);
 EXPORT_SYMBOL(inet6_bind);
 EXPORT_SYMBOL(inet6_getname);
 EXPORT_SYMBOL(inet6_ioctl);
-EXPORT_SYMBOL(ipv6_get_saddr);
 EXPORT_SYMBOL(ipv6_chk_addr);
 EXPORT_SYMBOL(in6_dev_finish_destroy);
 #ifdef CONFIG_XFRM
-- 
1.4.2.3



[PATCH 9/13] [SCTP] Merge IPv4 and IPv6 versions of get_saddr() with their corresponding get_dst().

2006-10-16 Thread Ville Nuorvala

As the IPv6 route lookup now also returns the selected source address
there is no need for a separate source address lookup. In fact, the
source address selection needs to be moved to get_dst() because the
selected IPv6 source address isn't always stored in the route.
Sometimes this makes it impossible to guess the correct address later on.

Signed-off-by: Ville Nuorvala <[EMAIL PROTECTED]>
---
 include/net/sctp/structs.h |7 -
 net/sctp/ipv6.c|  235 +++-
 net/sctp/protocol.c|   56 --
 net/sctp/transport.c   |8 +
 4 files changed, 148 insertions(+), 158 deletions(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index c6d93bb..e0973a3 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -529,15 +529,8 @@ struct sctp_af {
struct dst_entry *(*get_dst)(struct sctp_association *asoc,
 union sctp_addr *daddr,
 union sctp_addr *saddr);
-   void(*get_saddr)(struct sctp_association *asoc,
-struct dst_entry *dst,
-union sctp_addr *daddr,
-union sctp_addr *saddr);
void(*copy_addrlist) (struct list_head *,
  struct net_device *);
-   void(*dst_saddr)(union sctp_addr *saddr,
-struct dst_entry *dst,
-unsigned short port);
int (*cmp_addr) (const union sctp_addr *addr1,
 const union sctp_addr *addr2);
void(*addr_copy)(union sctp_addr *dst,
diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 78071c6..68ead54 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -188,46 +188,6 @@ static int sctp_v6_xmit(struct sk_buff *
return ip6_xmit(sk, skb, &fl, np->opt, ipfragok);
 }

-/* Returns the dst cache entry for the given source and destination ip
- * addresses.
- */
-static struct dst_entry *sctp_v6_get_dst(struct sctp_association *asoc,
-union sctp_addr *daddr,
-union sctp_addr *saddr)
-{
-   struct dst_entry *dst;
-   struct flowi fl;
-
-   memset(&fl, 0, sizeof(fl));
-   ipv6_addr_copy(&fl.fl6_dst, &daddr->v6.sin6_addr);
-   if (ipv6_addr_type(&daddr->v6.sin6_addr) & IPV6_ADDR_LINKLOCAL)
-   fl.oif = daddr->v6.sin6_scope_id;
-   
-
-   SCTP_DEBUG_PRINTK("%s: DST=" NIP6_FMT " ",
- __FUNCTION__, NIP6(fl.fl6_dst));
-
-   if (saddr) {
-   ipv6_addr_copy(&fl.fl6_src, &saddr->v6.sin6_addr);
-   SCTP_DEBUG_PRINTK(
-   "SRC=" NIP6_FMT " - ",
-   NIP6(fl.fl6_src));
-   }
-
-   dst = ip6_route_output(NULL, &fl);
-   if (!dst->error) {
-   struct rt6_info *rt;
-   rt = (struct rt6_info *)dst;
-   SCTP_DEBUG_PRINTK(
-   "rt6_dst:" NIP6_FMT " rt6_src:" NIP6_FMT "\n",
-   NIP6(rt->rt6i_dst.addr), NIP6(rt->rt6i_src.addr));
-   return dst;
-   }
-   SCTP_DEBUG_PRINTK("NO ROUTE\n");
-   dst_release(dst);
-   return NULL;
-}
-
 /* Returns the number of consecutive initial bits that match in the 2 ipv6
  * addresses.
  */
@@ -250,69 +210,6 @@ static inline int sctp_v6_addr_match_len
return (i*32);
 }

-/* Fills in the source address(saddr) based on the destination address(daddr)
- * and asoc's bind address list.
- */
-static void sctp_v6_get_saddr(struct sctp_association *asoc,
- struct dst_entry *dst,
- union sctp_addr *daddr,
- union sctp_addr *saddr)
-{
-   struct sctp_bind_addr *bp;
-   rwlock_t *addr_lock;
-   struct sctp_sockaddr_entry *laddr;
-   struct list_head *pos;
-   sctp_scope_t scope;
-   union sctp_addr *baddr = NULL;
-   __u8 matchlen = 0;
-   __u8 bmatchlen;
-
-   SCTP_DEBUG_PRINTK("%s: asoc:%p dst:%p "
- "daddr:" NIP6_FMT " ",
- __FUNCTION__, asoc, dst, NIP6(daddr->v6.sin6_addr));
-
-   if (!asoc) {
-   ipv6_get_saddr(dst, &daddr->v6.sin6_addr,&saddr->v6.sin6_addr);
-   SCTP_DEBUG_PRINTK("saddr from ipv6_get_saddr: " NIP6_FMT "\n",
- NIP6(saddr->v6.sin6_addr));
-   return;
-   }
-
-   scope = sctp_scope(daddr);
-
-   bp = &asoc->base.bind_addr;
-   addr_lock = &asoc->base.addr_lock;
-
-   /* Go through the bind address list and find the best source address
-* that matches the scope of the destination address.
-*/
- 

[PATCH 8/13] [RFC] [IPV6] Get rid of ipv6_get_saddr() in xfrm6_get_saddr().

2006-10-16 Thread Ville Nuorvala

As the source address is already selected in ip6_pol_route_output()
there is no need to do the source address lookup a second time.

Signed-off-by: Ville Nuorvala <[EMAIL PROTECTED]>
---
 net/ipv6/xfrm6_policy.c |3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index db2d55c..954c9ac 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -48,8 +48,7 @@ static int xfrm6_get_saddr(xfrm_address_
};

if (!xfrm6_dst_lookup((struct xfrm_dst **)&rt, &fl_tunnel)) {
-   ipv6_get_saddr(&rt->u.dst, (struct in6_addr *)&daddr->a6,
-  (struct in6_addr *)&saddr->a6);
+   ipv6_addr_copy((struct in6_addr *)saddr, &fl_tunnel.fl6_src);
dst_release(&rt->u.dst);
return 0;
}
-- 
1.4.2.3



[PATCH 7/13] [RFC] [IPV6] Move source address selection into route lookup.

2006-10-16 Thread Ville Nuorvala

This patch moves the normal source address selection from
ip6_dst_lookup() into ip6_pol_route_output(), but shouldn't
change the routing or source address selection behavior in
any way.

Signed-off-by: Ville Nuorvala <[EMAIL PROTECTED]>
---
 net/ipv6/ip6_output.c |6 --
 net/ipv6/route.c  |   37 ++---
 2 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 6671691..0019007 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -855,12 +855,6 @@ static int ip6_dst_lookup_tail(struct so
if ((err = (*dst)->error))
goto out_err_release;

-   if (ipv6_addr_any(&fl->fl6_src)) {
-   err = ipv6_get_saddr(*dst, &fl->fl6_dst, &fl->fl6_src);
-   if (err)
-   goto out_err_release;
-   }
-
return 0;

 out_err_release:
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index aa96be8..b7b8148 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -536,7 +536,7 @@ struct rt6_info *rt6_lookup(struct in6_a
int flags = strict ? RT6_LOOKUP_F_IFACE : 0;

if (saddr) {
-   memcpy(&fl.fl6_src, saddr, sizeof(*saddr));
+   ipv6_addr_copy(&fl.fl6_src, saddr);
flags |= RT6_LOOKUP_F_HAS_SADDR;
}

@@ -629,13 +629,11 @@ static struct rt6_info *ip6_pol_route_in
 {
struct fib6_node *fn;
struct rt6_info *rt, *nrt;
-   int strict = 0;
+   int strict = flags & RT6_LOOKUP_F_IFACE;
int attempts = 3;
int err;
int reachable = RT6_LOOKUP_F_REACHABLE;

-   strict |= flags & RT6_LOOKUP_F_IFACE;
-
 relookup:
read_lock_bh(&table->tb6_lock);

@@ -726,22 +724,22 @@ static struct rt6_info *ip6_pol_route_ou
 {
struct fib6_node *fn;
struct rt6_info *rt, *nrt;
-   int strict = 0;
-   int attempts = 3;
-   int err;
+   int has_saddr = flags & RT6_LOOKUP_F_HAS_SADDR;
+   int strict = flags & RT6_LOOKUP_F_IFACE;
int reachable = RT6_LOOKUP_F_REACHABLE;
+   int attempts = 3;
+   struct in6_addr saddr;

-   strict |= flags & RT6_LOOKUP_F_IFACE;
-
+   ipv6_addr_copy(&saddr, &fl->fl6_src);
 relookup:
read_lock_bh(&table->tb6_lock);

 restart_2:
-   fn = fib6_lookup(&table->tb6_root, &fl->fl6_dst, &fl->fl6_src);
+   fn = fib6_lookup(&table->tb6_root, &fl->fl6_dst, &saddr);

 restart:
rt = rt6_select(&fn->leaf, fl->oif, strict | reachable);
-   BACKTRACK(&fl->fl6_src);
+   BACKTRACK(&saddr);
if (rt == &ip6_null_entry ||
rt->rt6i_flags & RTF_CACHE)
goto out;
@@ -749,6 +747,13 @@ restart:
dst_hold(&rt->u.dst);
read_unlock_bh(&table->tb6_lock);

+   if (!has_saddr) {
+   /* policy rule doesn't restrict source address */
+   if (ipv6_get_saddr(&rt->u.dst, &fl->fl6_dst, &saddr))
+   goto no_saddr;
+   has_saddr = RT6_LOOKUP_F_HAS_SADDR;
+   ipv6_addr_copy(&fl->fl6_src, &saddr);
+   }
if (!rt->rt6i_nexthop && !(rt->rt6i_flags & RTF_NONEXTHOP))
nrt = rt6_alloc_cow(rt, &fl->fl6_dst, &fl->fl6_src);
else {
@@ -764,8 +769,7 @@ #endif

dst_hold(&rt->u.dst);
if (nrt) {
-   err = ip6_ins_rt(nrt);
-   if (!err)
+   if (!ip6_ins_rt(nrt))
goto out2;
}

@@ -778,7 +782,6 @@ #endif
 */
dst_release(&rt->u.dst);
goto relookup;
-
 out:
if (reachable) {
reachable = 0;
@@ -790,6 +793,10 @@ out2:
rt->u.dst.lastuse = jiffies;
rt->u.dst.__use++;
return rt;
+no_saddr:
+   rt = &ip6_null_entry;
+   dst_hold(&rt->u.dst);
+   goto out2;
 }

 struct dst_entry * ip6_route_output(struct sock *sk, struct flowi *fl)
@@ -2044,7 +2051,7 @@ #endif
NLA_PUT_U32(skb, RTA_IIF, iif);
else if (dst) {
struct in6_addr saddr_buf;
-   if (ipv6_get_saddr(&rt->u.dst, dst, &saddr_buf) == 0)
+   if (!ipv6_get_saddr(&rt->u.dst, dst, &saddr_buf))
NLA_PUT(skb, RTA_PREFSRC, 16, &saddr_buf);
}

-- 
1.4.2.3



[PATCH 6/13] [IPV6] Always copy rt->u.dst.error when copying a rt6_info.

2006-10-16 Thread Ville Nuorvala
Signed-off-by: Ville Nuorvala <[EMAIL PROTECTED]>
---
 net/ipv6/route.c |3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 263c057..aa96be8 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -618,8 +618,6 @@ static struct rt6_info *rt6_alloc_clone(
ipv6_addr_copy(&rt->rt6i_dst.addr, daddr);
rt->rt6i_dst.plen = 128;
rt->rt6i_flags |= RTF_CACHE;
-   if (rt->rt6i_flags & RTF_REJECT)
-   rt->u.dst.error = ort->u.dst.error;
rt->u.dst.flags |= DST_HOST;
rt->rt6i_nexthop = neigh_clone(ort->rt6i_nexthop);
}
@@ -1540,6 +1538,7 @@ static struct rt6_info * ip6_rt_copy(str
rt->u.dst.output = ort->u.dst.output;

memcpy(rt->u.dst.metrics, ort->u.dst.metrics, RTAX_MAX*sizeof(u32));
+   rt->u.dst.error = ort->u.dst.error;
rt->u.dst.dev = ort->u.dst.dev;
if (rt->u.dst.dev)
dev_hold(rt->u.dst.dev);
-- 
1.4.2.3


[PATCH 5/13] [IPV6] Make IPV6_SUBTREES depend on IPV6_MULTIPLE_TABLES.

2006-10-16 Thread Ville Nuorvala

As IPV6_SUBTREES can't work without IPV6_MULTIPLE_TABLES, make
IPV6_SUBTREES depend on it.

Signed-off-by: Ville Nuorvala <[EMAIL PROTECTED]>
---
 net/ipv6/Kconfig |   16 
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig
index a2d211d..5fd2ffd 100644
--- a/net/ipv6/Kconfig
+++ b/net/ipv6/Kconfig
@@ -152,9 +152,16 @@ config IPV6_TUNNEL

  If unsure, say N.

+config IPV6_MULTIPLE_TABLES
+   bool "IPv6: Multiple Routing Tables"
+   depends on IPV6 && EXPERIMENTAL
+   select FIB_RULES
+   ---help---
+ Support multiple routing tables.
+
 config IPV6_SUBTREES
bool "IPv6: source address based routing"
-   depends on IPV6 && EXPERIMENTAL
+   depends on IPV6_MULTIPLE_TABLES
---help---
  Enable routing by source address or prefix.

@@ -166,13 +173,6 @@ config IPV6_SUBTREES

  If unsure, say N.

-config IPV6_MULTIPLE_TABLES
-   bool "IPv6: Multiple Routing Tables"
-   depends on IPV6 && EXPERIMENTAL
-   select FIB_RULES
-   ---help---
- Support multiple routing tables.
-
 config IPV6_ROUTE_FWMARK
bool "IPv6: use netfilter MARK value as routing key"
depends on IPV6_MULTIPLE_TABLES && NETFILTER
-- 
1.4.2.3



[PATCH 4/13] [IPV6] Clean up BACKTRACK().

2006-10-16 Thread Ville Nuorvala

The fn check is unnecessary as fn can never be NULL in BACKTRACK().

Signed-off-by: Ville Nuorvala <[EMAIL PROTECTED]>
---
 net/ipv6/route.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index a1b0f07..263c057 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -484,7 +484,7 @@ #define BACKTRACK(saddr) \
 do { \
if (rt == &ip6_null_entry) { \
struct fib6_node *pn; \
-   while (fn) { \
+   while (1) { \
if (fn->fn_flags & RTN_TL_ROOT) \
goto out; \
pn = fn->parent; \
-- 
1.4.2.3


[PATCH 3/13] [IPV6] Make sure error handling is done when calling ip6_route_output().

2006-10-16 Thread Ville Nuorvala

As ip6_route_output() never returns NULL, error checking must be done by
looking at dst->error instead of comparing dst against NULL.

Signed-off-by: Ville Nuorvala <[EMAIL PROTECTED]>
---
 net/ipv6/xfrm6_policy.c |   12 +++-
 net/sctp/ipv6.c |   10 +-
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index 6a252e2..db2d55c 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -25,12 +25,14 @@ #endif
 static struct dst_ops xfrm6_dst_ops;
 static struct xfrm_policy_afinfo xfrm6_policy_afinfo;

-static int xfrm6_dst_lookup(struct xfrm_dst **dst, struct flowi *fl)
+static int xfrm6_dst_lookup(struct xfrm_dst **xdst, struct flowi *fl)
 {
-   int err = 0;
-   *dst = (struct xfrm_dst*)ip6_route_output(NULL, fl);
-   if (!*dst)
-   err = -ENETUNREACH;
+   struct dst_entry *dst = ip6_route_output(NULL, fl);
+   int err = dst->error;
+   if (!err)
+   *xdst = (struct xfrm_dst *) dst;
+   else
+   dst_release(dst);
return err;
 }

diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 249e503..78071c6 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -215,17 +215,17 @@ static struct dst_entry *sctp_v6_get_dst
}

dst = ip6_route_output(NULL, &fl);
-   if (dst) {
+   if (!dst->error) {
struct rt6_info *rt;
rt = (struct rt6_info *)dst;
SCTP_DEBUG_PRINTK(
"rt6_dst:" NIP6_FMT " rt6_src:" NIP6_FMT "\n",
NIP6(rt->rt6i_dst.addr), NIP6(rt->rt6i_src.addr));
-   } else {
-   SCTP_DEBUG_PRINTK("NO ROUTE\n");
+   return dst;
}
-
-   return dst;
+   SCTP_DEBUG_PRINTK("NO ROUTE\n");
+   dst_release(dst);
+   return NULL;
 }

 /* Returns the number of consecutive initial bits that match in the 2 ipv6
-- 
1.4.2.3
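
The calling convention this patch relies on (the lookup always hands back
a referenced entry, and failure is signalled through its error field) can
be sketched in userspace. Everything here is a stand-in invented for
illustration: struct mock_dst, mock_route_output(), mock_dst_release() and
mock_lookup() are hypothetical and only mimic the shape of the kernel
code above.

```c
#include <stddef.h>
#include <errno.h>

/* Simplified stand-in for a dst entry: only what the sketch needs. */
struct mock_dst {
    int error;    /* 0 for a usable route, negative errno otherwise */
    int refcnt;   /* reference count held by callers */
};

/* Stand-in for the shared "no route" entry. */
struct mock_dst mock_null_entry = { -ENETUNREACH, 1 };

/* Stand-in for the route lookup: never returns NULL and always takes a
 * reference on whatever it returns. */
struct mock_dst *mock_route_output(struct mock_dst *found)
{
    struct mock_dst *dst = found ? found : &mock_null_entry;

    dst->refcnt++;
    return dst;
}

void mock_dst_release(struct mock_dst *dst)
{
    dst->refcnt--;
}

/* Caller pattern: test dst->error, and drop the reference on failure so
 * the error entry is not leaked. */
int mock_lookup(struct mock_dst *found, struct mock_dst **out)
{
    struct mock_dst *dst = mock_route_output(found);
    int err = dst->error;

    if (!err)
        *out = dst;
    else
        mock_dst_release(dst);
    return err;
}
```

A caller that only compared the returned pointer against NULL would treat
the error entry as a valid route and would also keep a reference it never
releases.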


Re: 2.6.18-mm2 boot failure on x86-64

2006-10-16 Thread Andrew Morton
On Mon, 16 Oct 2006 14:16:13 -0400
Vivek Goyal <[EMAIL PROTECTED]> wrote:

> 
> Can you please have a look at the attached patch

Looks like a fine patch to me, although it could benefit from a comment
explaining why all those PAGE_ALIGN()s are in there.

> and include it in -mm.

Does it fix a patch in -mm or is it needed in mainline?




[PATCH 2/13] [SCTP] Fix minor typo

2006-10-16 Thread Ville Nuorvala

Signed-off-by: Ville Nuorvala <[EMAIL PROTECTED]>
---
 net/sctp/socket.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 79c3e07..185d480 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -821,7 +821,7 @@ out:
  * addrs is a pointer to an array of one or more socket addresses. Each
  * address is contained in its appropriate structure (i.e. struct
  * sockaddr_in or struct sockaddr_in6) the family of the address type
- * must be used to distengish the address length (note that this
+ * must be used to distinguish the address length (note that this
  * representation is termed a "packed array" of addresses). The caller
  * specifies the number of addresses in the array with addrcnt.
  *
-- 
1.4.2.3


[PATCH 1/13] [IPV6] Remove struct pol_chain.

2006-10-16 Thread Ville Nuorvala
Struct pol_chain has existed since at least the 2.2 kernel, but isn't used
anymore. As the IPv6 policy routing is implemented in a totally different
way in the current kernel, just get rid of it.

Signed-off-by: Ville Nuorvala <[EMAIL PROTECTED]>
---
 include/net/ip6_route.h |7 ---
 1 files changed, 0 insertions(+), 7 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 6ca6b71..c14b70e 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -36,13 +36,6 @@ #define RT6_LOOKUP_F_IFACE   0x1
 #define RT6_LOOKUP_F_REACHABLE 0x2
 #define RT6_LOOKUP_F_HAS_SADDR 0x4

-struct pol_chain {
-   int type;
-   int priority;
-   struct fib6_node*rules;
-   struct pol_chain*next;
-};
-
 extern struct rt6_info ip6_null_entry;

 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
-- 
1.4.2.3



[PATCH 0/13] [RFC] Fix problems with IPv6 routing subtrees and source address selection

2006-10-16 Thread Ville Nuorvala
Hi,

here are a bunch of more or less related patches having to do with fixing
the IPv6 routing subtrees and source address selection. Most of the code is
a cleaned up version of what I've written earlier for MIPL 2, where it has
worked pretty well for a couple of years now.

The SCTP code, however, turned out to be messier and more difficult to fix
than I had originally thought. As I'm not that familiar with SCTP and don't
really have an opportunity to test the code I'm especially grateful for any
comments regarding those parts of the code.

I've tried to split up the changes into logical parts to help digest them.
Please comment!

Regards,
Ville


[PATCH REPOST 1/2] NET: Accurate packet scheduling for ATM/ADSL (kernel)

2006-10-16 Thread Russell Stuart
The Linux traffic control engine inaccurately calculates
transmission times for packets sent over ADSL links.  For
some packet sizes the error rises to over 50%.  This occurs
because ADSL uses ATM as its link layer transport, and ATM
transmits packets in fixed sized 53 byte cells.

This changes the kernel rate table lookup so that it can look up
packet transmission times over all ATM links, including ADSL,
with perfect accuracy. The accuracy depends on the rate
table that is calculated in userspace by the iproute2 command tc.
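
To make the cell arithmetic concrete, here is a minimal user-space sketch
(not part of the patch). It assumes the standard ATM cell layout of 53
bytes carrying 48 bytes of payload. The function name and the overhead
parameter are made up for illustration, and the right overhead value
depends on the encapsulation in use.

```c
#define EX_ATM_CELL_SIZE    53  /* bytes sent on the wire per cell */
#define EX_ATM_CELL_PAYLOAD 48  /* payload bytes carried per cell */

/*
 * Hypothetical helper: bytes actually sent on an ATM link for a packet of
 * pkt_len bytes plus a caller-supplied per-packet overhead.  The packet is
 * padded up to a whole number of cells.
 */
unsigned int ex_atm_wire_bytes(unsigned int pkt_len, unsigned int overhead)
{
    unsigned int payload = pkt_len + overhead;
    unsigned int cells = (payload + EX_ATM_CELL_PAYLOAD - 1) / EX_ATM_CELL_PAYLOAD;

    return cells * EX_ATM_CELL_SIZE;
}
```

With zero overhead, a 48-byte packet fits in one cell (53 bytes on the
wire) while a 49-byte packet needs two (106 bytes). A per-byte estimate
cannot represent that step.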

A longer presentation of the patch, its rationale, what it
does and how to use it can be found here:
   http://www.stuart.id.au/russell/files/tc/tc-atm/

An earlier version of the patch, and a _detailed_ empirical
investigation of its effects can be found here:
   http://www.adsl-optimizer.dk/

Signed-off-by: Jesper Dangaard Brouer <[EMAIL PROTECTED]>
Signed-off-by: Russell Stuart <[EMAIL PROTECTED]>
---

diff -Nurp kernel-source-2.6.16.orig/include/linux/pkt_sched.h kernel-source-2.6.16/include/linux/pkt_sched.h
--- kernel-source-2.6.16.orig/include/linux/pkt_sched.h 2006-03-20 15:53:29.0 +1000
+++ kernel-source-2.6.16/include/linux/pkt_sched.h  2006-06-13 11:42:12.0 +1000
@@ -77,8 +77,9 @@ struct tc_ratespec
 {
unsigned char   cell_log;
unsigned char   __reserved;
-   unsigned short  feature;
-   short   addend;
+   unsigned short  feature;    /* Always 0 in pre-atm patch kernels */
+   char    cell_align; /* Always 0 in pre-atm patch kernels */
+   unsigned char   __unused;
unsigned short  mpu;
__u32   rate;
 };
diff -Nurp kernel-source-2.6.16.orig/include/net/sch_generic.h kernel-source-2.6.16/include/net/sch_generic.h
--- kernel-source-2.6.16.orig/include/net/sch_generic.h 2006-03-20 15:53:29.0 +1000
+++ kernel-source-2.6.16/include/net/sch_generic.h  2006-06-13 11:42:12.0 +1000
@@ -307,4 +307,18 @@ drop:
return NET_XMIT_DROP;
 }
 
+/* Lookup a qdisc_rate_table to determine how long it will take to send a
+   packet given its size.
+ */
+static inline u32 qdisc_l2t(struct qdisc_rate_table* rtab, int pktlen)
+{
+   int slot = pktlen + rtab->rate.cell_align;
+   if (slot < 0)
+   slot = 0;
+   slot >>= rtab->rate.cell_log;
+   if (slot > 255)
+   return rtab->data[255] + 1;
+   return rtab->data[slot];
+}
+
 #endif
diff -Nurp kernel-source-2.6.16.orig/net/sched/act_police.c kernel-source-2.6.16/net/sched/act_police.c
--- kernel-source-2.6.16.orig/net/sched/act_police.c    2006-03-20 15:53:29.0 +1000
+++ kernel-source-2.6.16/net/sched/act_police.c 2006-06-13 11:42:12.0 +1000
@@ -33,8 +33,8 @@
 #include 
 #include 
 
-#define L2T(p,L)   ((p)->R_tab->data[(L)>>(p)->R_tab->rate.cell_log])
-#define L2T_P(p,L) ((p)->P_tab->data[(L)>>(p)->P_tab->rate.cell_log])
+#define L2T(p,L)   qdisc_l2t((p)->R_tab,L)
+#define L2T_P(p,L) qdisc_l2t((p)->P_tab,L)
 #define PRIV(a) ((struct tcf_police *) (a)->priv)
 
 /* use generic hash table */
diff -Nurp kernel-source-2.6.16.orig/net/sched/sch_cbq.c kernel-source-2.6.16/net/sched/sch_cbq.c
--- kernel-source-2.6.16.orig/net/sched/sch_cbq.c   2006-03-20 15:53:29.0 +1000
+++ kernel-source-2.6.16/net/sched/sch_cbq.c    2006-06-13 11:42:12.0 +1000
@@ -193,7 +193,7 @@ struct cbq_sched_data
 };
 

-#define L2T(cl,len)((cl)->R_tab->data[(len)>>(cl)->R_tab->rate.cell_log])
+#define L2T(cl,len)qdisc_l2t((cl)->R_tab,len)
 

 static __inline__ unsigned cbq_hash(u32 h)
diff -Nurp kernel-source-2.6.16.orig/net/sched/sch_htb.c kernel-source-2.6.16/net/sched/sch_htb.c
--- kernel-source-2.6.16.orig/net/sched/sch_htb.c   2006-03-20 15:53:29.0 +1000
+++ kernel-source-2.6.16/net/sched/sch_htb.c    2006-06-13 11:42:12.0 +1000
@@ -206,12 +206,10 @@ struct htb_class
 static __inline__ long L2T(struct htb_class *cl,struct qdisc_rate_table *rate,
int size)
 { 
-int slot = size >> rate->rate.cell_log;
-if (slot > 255) {
+long result = qdisc_l2t(rate, size);
+if (result > rate->data[255])
cl->xstats.giants++;
-   slot = 255;
-}
-return rate->data[slot];
+return result;
 }
 
 struct htb_sched
diff -Nurp kernel-source-2.6.16.orig/net/sched/sch_tbf.c kernel-source-2.6.16/net/sched/sch_tbf.c
--- kernel-source-2.6.16.orig/net/sched/sch_tbf.c   2006-03-20 15:53:29.0 +1000
+++ kernel-source-2.6.16/net/sched/sch_tbf.c    2006-06-13 11:42:12.0 +1000
@@ -132,8 +132,8 @@ struct tbf_sched_data
struct Qdisc*qdisc; /* Inner qdisc, default - bfifo queue */
 };
 
-#define L2T(q,L)   ((q)->R_tab->data[(L)>>(q)->R_tab->rate.cell_log])
-#define L2T_P(q,L) ((q)->P_tab->data[(L)>>(q)->P_tab->rate.cell_log])
+#define L2T(q,L)   qdisc_l2t((q)->R_tab,L)
+#define L2T_P(q,L) qdisc_l2t((q)->P_tab,L)
 
 static int tbf_enqueue(struct sk_buff *skb, struct Qdisc* sch)
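
For readers who want to experiment with the lookup outside the kernel,
the qdisc_l2t() helper added by this patch can be mirrored in user space
roughly as follows. This is only a sketch: the struct and function names
are invented, the table is reduced to the three fields the lookup
touches, and cell_align is declared as int8_t here (the patch uses plain
char, whose signedness depends on the platform).

```c
#include <stdint.h>

/* Illustrative stand-in for the kernel rate table; not the real layout. */
struct ex_rate_table {
    uint8_t  cell_log;    /* log2 of the size bucket covered by one slot */
    int8_t   cell_align;  /* signed offset applied before bucketing */
    uint32_t data[256];   /* transmission time for each size bucket */
};

/* Same steps as the patch: align, clamp at zero, shift, cap at slot 255. */
uint32_t ex_l2t(const struct ex_rate_table *rtab, int pktlen)
{
    int slot = pktlen + rtab->cell_align;

    if (slot < 0)
        slot = 0;
    slot >>= rtab->cell_log;
    if (slot > 255)
        return rtab->data[255] + 1;  /* oversized packet marker */
    return rtab->data[slot];
}
```

The "+ 1" on the oversized path is what lets the HTB hunk above detect
giants by comparing the result against data[255].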
 

[PATCH 1/2] sky2: multicast pause frame receive

2006-10-16 Thread Stephen Hemminger
When using flow control, the PHY needs to accept multicast pause frames.
Without this fix, these frames were getting discarded by the PHY before
doing any flow control. 

This may be related to http://bugzilla.kernel.org/show_bug.cgi?id=6839

Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>

---
 drivers/net/sky2.c |   24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

--- sky2.orig/drivers/net/sky2.c2006-10-16 08:38:24.0 -0700
+++ sky2/drivers/net/sky2.c 2006-10-16 09:48:28.0 -0700
@@ -2850,6 +2850,14 @@
return 0;
 }
 
+static void inline sky2_add_filter(u8 filter[8], const u8 *addr)
+{
+   u32 bit;
+
+   bit = ether_crc(ETH_ALEN, addr) & 63;
+   filter[bit >> 3] |= 1 << (bit & 7);
+}
+
 static void sky2_set_multicast(struct net_device *dev)
 {
struct sky2_port *sky2 = netdev_priv(dev);
@@ -2858,7 +2866,10 @@
struct dev_mc_list *list = dev->mc_list;
u16 reg;
u8 filter[8];
+   int rx_pause;
+   static const u8 pause_mc_addr[ETH_ALEN] = { 0x1, 0x80, 0xc2, 0x0, 0x0, 0x1 };
 
+   rx_pause = (sky2->flow_status == FC_RX || sky2->flow_status == FC_BOTH);
memset(filter, 0, sizeof(filter));
 
reg = gma_read16(hw, port, GM_RX_CTRL);
@@ -2866,18 +2877,19 @@
 
if (dev->flags & IFF_PROMISC)   /* promiscuous */
reg &= ~(GM_RXCR_UCF_ENA | GM_RXCR_MCF_ENA);
-   else if ((dev->flags & IFF_ALLMULTI) || dev->mc_count > 16) /* all multicast */
+   else if (dev->flags & IFF_ALLMULTI)
memset(filter, 0xff, sizeof(filter));
-   else if (dev->mc_count == 0)/* no multicast */
+   else if (dev->mc_count == 0 && !rx_pause)
reg &= ~GM_RXCR_MCF_ENA;
else {
int i;
reg |= GM_RXCR_MCF_ENA;
 
-   for (i = 0; list && i < dev->mc_count; i++, list = list->next) {
-   u32 bit = ether_crc(ETH_ALEN, list->dmi_addr) & 0x3f;
-   filter[bit / 8] |= 1 << (bit % 8);
-   }
+   if (rx_pause)
+   sky2_add_filter(filter, pause_mc_addr);
+
+   for (i = 0; list && i < dev->mc_count; i++, list = list->next)
+   sky2_add_filter(filter, list->dmi_addr);
}
 
gma_write16(hw, port, GM_MC_ADDR_H1,
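
The 64-bit hash filter that sky2_add_filter() fills in can be tried out
in user space with a sketch like the one below. It is an illustration
only: the ex_* names are invented, and ex_crc32() is a plain bitwise
CRC-32 used as a stand-in hash, so it is not claimed to produce the same
bit positions as the kernel's ether_crc().

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in hash: bitwise CRC-32 with the reflected polynomial 0xEDB88320. */
uint32_t ex_crc32(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Set the filter bit selected by the low 6 bits of the address hash. */
void ex_add_filter(uint8_t filter[8], const uint8_t addr[6])
{
    uint32_t bit = ex_crc32(addr, 6) & 63;

    filter[bit >> 3] |= (uint8_t)(1u << (bit & 7));
}

/* Non-zero if the address hashes to a bit that is set in the filter. */
int ex_filter_match(const uint8_t filter[8], const uint8_t addr[6])
{
    uint32_t bit = ex_crc32(addr, 6) & 63;

    return (filter[bit >> 3] >> (bit & 7)) & 1;
}
```

Adding the pause address 01:80:c2:00:00:01 sets exactly one of the 64
bits, and adding it a second time changes nothing. Because many
addresses share a bit, a match only means "possibly wanted".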


[PATCH 2/2] sky2: don't process pause frames in receiver.

2006-10-16 Thread Stephen Hemminger
This reverts an earlier change that attempted to fix flow control but was
broken.

Device needs to discard pause frames at the receive DMA engine, otherwise
the pause frames get received and passed up the stack!

Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>


---
 drivers/net/sky2.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- sky2.orig/drivers/net/sky2.h2006-10-16 09:44:46.0 -0700
+++ sky2/drivers/net/sky2.h 2006-10-16 09:50:07.0 -0700
@@ -1576,7 +1576,7 @@
 
GMR_FS_ANY_ERR  = GMR_FS_RX_FF_OV | GMR_FS_CRC_ERR |
  GMR_FS_FRAGMENT | GMR_FS_LONG_ERR |
- GMR_FS_MII_ERR | GMR_FS_BAD_FC |
+ GMR_FS_MII_ERR | GMR_FS_GOOD_FC | GMR_FS_BAD_FC |
  GMR_FS_UN_SIZE | GMR_FS_JABBER,
 };
 


Re: [RFC] wrr (weighted round-robin) bonding

2006-10-16 Thread Andy Gospodarek
On Mon, Oct 16, 2006 at 09:07:57PM +0200, Dawid Ciezarkiewicz wrote:
> > 
> > Before getting into the technical bits of the patch, what's the
> > reason for wanting to do this, and why is this rather complex manual
> > weight assignment better than an automatic system based on, e.g., link
> > speed of the slaves?
> 
> In short:
> It was designed as a solution for wireless links bonding - where link quality 
> can change rather quickly in time. By using wrr bonding, userspace tools can 
> measure current bandwidth and change bonding slave weights in realtime.

Since this is so similar to mode 0, it would seem there would be a way
to extend it rather than creating yet another mode that is so similar.
What would be the reason not to enhance that mode?



[PATCH] Fixed a number of bugs in the PHY Layer

2006-10-16 Thread Andy Fleming

* genphy_update_link is now exported
* Added a fix from [EMAIL PROTECTED] which changes forcing so it
  only updates the link.  Otherwise, it never tries the lower
  values, since it is always overwriting the speed/duplex values
  with the current ones, rather than the intended ones.
* Fixed a bug where bringing up a PHY with no link caused it to
  time out and enter forcing mode.  Once in forcing mode,
  plugging in the link didn't autonegotiate.  Now the AN state
  detects the lack of link, and enters the NO_LINK state.  AN
  only times out if the link is up and AN fails.
* Cleaned up the PHY_AN case, reducing one level of indentation
  for the timeout code.
---
 drivers/net/phy/phy.c|   81 --
 drivers/net/phy/phy_device.c |1 +
 2 files changed, 40 insertions(+), 42 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 3af9fcf..c81536d 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -693,60 +693,57 @@ static void phy_timer(unsigned long data
 
break;
case PHY_AN:
+   err = phy_read_status(phydev);
+
+   if (err < 0)
+   break;
+
+   /* If the link is down, give up on
+* negotiation for now */
+   if (!phydev->link) {
+   phydev->state = PHY_NOLINK;
+   netif_carrier_off(phydev->attached_dev);
+   phydev->adjust_link(phydev->attached_dev);
+   break;
+   }
+
/* Check if negotiation is done.  Break
 * if there's an error */
err = phy_aneg_done(phydev);
if (err < 0)
break;
 
-   /* If auto-negotiation is done, we change to
-* either RUNNING, or NOLINK */
+   /* If AN is done, we're running */
if (err > 0) {
-   err = phy_read_status(phydev);
+   phydev->state = PHY_RUNNING;
+   netif_carrier_on(phydev->attached_dev);
+   phydev->adjust_link(phydev->attached_dev);
+
+   } else if (0 == phydev->link_timeout--) {
+   int idx;
 
-   if (err)
+   needs_aneg = 1;
+   /* If we have the magic_aneg bit,
+* we try again */
+   if (phydev->drv->flags & PHY_HAS_MAGICANEG)
break;
 
-   if (phydev->link) {
-   phydev->state = PHY_RUNNING;
-   netif_carrier_on(phydev->attached_dev);
-   } else {
-   phydev->state = PHY_NOLINK;
-   netif_carrier_off(phydev->attached_dev);
-   }
+   /* The timer expired, and we still
+* don't have a setting, so we try
+* forcing it until we find one that
+* works, starting from the fastest speed,
+* and working our way down */
+   idx = phy_find_valid(0, phydev->supported);
 
-   phydev->adjust_link(phydev->attached_dev);
+   phydev->speed = settings[idx].speed;
+   phydev->duplex = settings[idx].duplex;
 
-   } else if (0 == phydev->link_timeout--) {
-   /* The counter expired, so either we
-* switch to forced mode, or the
-* magic_aneg bit exists, and we try aneg
-* again */
-   if (!(phydev->drv->flags & PHY_HAS_MAGICANEG)) {
-   int idx;
-
-   /* We'll start from the
-* fastest speed, and work
-* our way down */
-   idx = phy_find_valid(0,
-   phydev->supported);
-
-   phydev->speed = settings[idx].speed;
-   phydev->duplex = settings[idx].duplex;
-   
-   phydev->autoneg = AUTONEG_DISABLE;
-   
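
The reworked PHY_AN handling described in the changelog can be modelled
as a small transition function. This is a rough user-space sketch, not
the phylib code: all ex_* names and the EX_AN_TIMEOUT value are invented,
and both the move to a forcing state on timeout and the counter reset on
retry are assumptions drawn from the description, since the diff is cut
off before those lines.

```c
#define EX_AN_TIMEOUT 10  /* assumed number of timer ticks; illustrative only */

enum ex_phy_state { EX_PHY_AN, EX_PHY_NOLINK, EX_PHY_RUNNING, EX_PHY_FORCING };

struct ex_phy {
    enum ex_phy_state state;
    int link;            /* 1 if the link is currently up */
    int aneg_done;       /* 1 if autonegotiation has completed */
    int link_timeout;    /* remaining timer ticks before giving up on AN */
    int has_magic_aneg;  /* 1 if the driver prefers to retry AN instead of forcing */
};

/* One timer tick while in the AN state; returns the resulting state. */
enum ex_phy_state ex_phy_an_step(struct ex_phy *phy)
{
    if (!phy->link) {
        /* No link: give up on negotiation for now instead of timing out. */
        phy->state = EX_PHY_NOLINK;
        return phy->state;
    }
    if (phy->aneg_done) {
        phy->state = EX_PHY_RUNNING;
        return phy->state;
    }
    if (phy->link_timeout-- == 0) {
        if (phy->has_magic_aneg) {
            /* Retry autonegotiation with a fresh timeout (assumption). */
            phy->link_timeout = EX_AN_TIMEOUT;
            phy->state = EX_PHY_AN;
        } else {
            /* Link is up but AN failed: fall back to forced settings. */
            phy->state = EX_PHY_FORCING;
        }
    }
    return phy->state;
}
```

The ordering is the point of the sketch: the link check comes before the
timeout check, so an unplugged cable can no longer push the PHY into the
forcing state.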

Re: [PATCH 9/14] [TIPC] Name publication events now delivered in chronological order

2006-10-16 Thread David Miller
From: Per Liden <[EMAIL PROTECTED]>
Date: Mon, 16 Oct 2006 10:50:40 +0200 (CEST)

> I'm fairly sure this is a problem on your side. I received patch 10/14 
> from the netdev list and the two list archives I checked also had it.

I also got 2 copies which means it hit netdev for me too.


Re: [PATCH] NET : Suspicious locking in reqsk_queue_hash_req()

2006-10-16 Thread David Miller
From: Eric Dumazet <[EMAIL PROTECTED]>
Date: Mon, 16 Oct 2006 11:00:22 +0200

> While browsing include/net/request_sock.h I found this suspicious locking 
> protecting the SYN table hash table. I think this patch is necessary.
> 
> Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>

People get tripped up by this one all the time.

We hold a higher level lock which protects other
inserts from happening, namely the listening socket
lock, it works here like the RTNL semaphore does.

We only need to protect the actual change of the hash
head, as lookups can occur asynchronously and we want
linkage seen by lookups to be consistent.

Alexey likes to do this locking trick a lot.

Feel free to add a comment. :-)
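
As a rough illustration of that trick (not the kernel code): inserters
are already serialized by an outer lock, so the new entry can be fully
linked first and only the head update needs to be made visible to
lookups in one step. The kernel takes a write lock around the head
assignment; the user-space sketch below swaps in a C11 release store for
that step. All ex_* names are invented.

```c
#include <stdatomic.h>
#include <stddef.h>

struct ex_req {
    int key;
    struct ex_req *next;
};

struct ex_table {
    _Atomic(struct ex_req *) head;
};

/*
 * Caller must hold the outer lock that serializes inserters (the listening
 * socket lock in the discussion above), so at most one insert runs at a time.
 */
void ex_hash_req(struct ex_table *t, struct ex_req *req)
{
    /* Link the new entry to the current chain before anyone can see it. */
    req->next = atomic_load_explicit(&t->head, memory_order_relaxed);
    /* Publish it: the only step that concurrent lookups can observe. */
    atomic_store_explicit(&t->head, req, memory_order_release);
}

/* Lookups take no outer lock and always see a consistently linked chain. */
struct ex_req *ex_lookup(struct ex_table *t, int key)
{
    struct ex_req *r = atomic_load_explicit(&t->head, memory_order_acquire);

    while (r && r->key != key)
        r = r->next;
    return r;
}
```

Entry removal is deliberately left out, since it needs its own rules for
readers that may still be walking the chain.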



Re: [patch 1/5] d80211: remove bitfields from ieee80211_tx_control

2006-10-16 Thread Michael Buesch
On Monday 16 October 2006 21:34, Simon Barber wrote:
> Removing the bitfields makes the code much harder to read and maintain.
> Here we are working around a problem with the compiler by making the
> code ugly - rather than fixing the compiler. The compilers are getting
> better and better (GCC 4 has much better handling of this type of
> optimization) but the code will remain ugly for ever.

Yeah, that's my opinion on this, too.

But I still like the  unsigned int foo:16; => u16 foo;  type of conversions.

-- 
Greetings Michael.
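
For anyone following the thread without the patch at hand, the two
styles under discussion look roughly like this. It is a sketch only: the
ex_/EX_ names are invented (the patch itself uses IEEE80211_TXCTL_*
definitions), and only a few members are shown.

```c
#include <stdint.h>

/* Style 1: one-bit bitfields; the compiler picks the bit layout. */
struct ex_tx_control_bits {
    unsigned int requeue:1;
    unsigned int first_fragment:1;
    unsigned int power_level:8;
};

/* Style 2: explicit flag masks in one word, fixed-width multi-bit members. */
#define EX_TXCTL_REQUEUE        (1u << 0)
#define EX_TXCTL_FIRST_FRAGMENT (1u << 1)

struct ex_tx_control_flags {
    uint32_t flags;
    uint8_t  power_level;
};

void ex_set_flag(struct ex_tx_control_flags *c, uint32_t flag)
{
    c->flags |= flag;
}

void ex_clear_flag(struct ex_tx_control_flags *c, uint32_t flag)
{
    c->flags &= ~flag;
}

int ex_test_flag(const struct ex_tx_control_flags *c, uint32_t flag)
{
    return (c->flags & flag) != 0;
}
```

Whether either layout is smaller or generates better code depends on the
compiler and ABI. Comparing sizeof on the two real structs and reading
the generated assembly is the way to settle it for a given build.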


Re: poll problem with PF_PACKET when using PACKET_RX_RING

2006-10-16 Thread Joan Raventos
Is this a bug in PF_PACKET? Should the socket queue be
emptied by packet_set_ring (called via setsockopt when
PACKET_RX_RING is used) so the above cannot happen?
Should the user-space app drain the socket queue with
recvfrom prior to (4) -quite unlikely in practice-?
>> 
>>
>>>I guess the best way is not to bind the socket before having
>>>completed setup. We could still flush the queue to make life
>>>easier for userspace, not sure about that ..
> 
> 
>> Even w/o bind, packet_create is doing a dev_add_pack, which I think will 
>> make pkts arrive to that socket (ie. in netif_receive_skb one can see the 
>> loops over the rcu for both ptype_all and type-specific which seem match 
>> whenever !ptype->dev || ptype->dev==skb->dev).
>> 
>> Also the packet_mmap.txt doc does not mention bind, which probably is more a 
>> mechanism to closely specify a dev than to signal socket readiness.

> packet_create only calls dev_add_pack if a protocol is given.
> You can use a protocol number of 0 and then bind the socket
> after setting it up properly.

Currently I'm using ETH_P_ALL on the socket call. If I understand your proposal 
correctly you suggest to pass 0 on the socket call, so dev_add_pack is not 
called, and afterwards use a sockaddr_ll with bind to set the sll_protocol to 
whatever value (ETH_P_ALL in my case). Correct?

> According to your description, you first used setsockopt(...,
> PACKET_RX_RING), then mmap. In that case the receive queue
> should already get flushed by packet_set_ring (about line 1710).

Ok, I see... I guess if mmap has not been issued by the time setsockopt is 
called then po->mapped == 0 and the code you point out is triggered, 
specifically skb_queue_purge.

> How did you verify that the receive queue still contains packets?

You are totally right! A non-blocking recv on the descriptor returns EAGAIN, so the 
queues are empty. After further instrumentation of the ring code, I suspect 
an issue with the ring read index in the user-space app...

Nevertheless the whole ring communication between kernel and user-space seems 
to be based on marking pkts via a flag in each pkt slot in the ring 
(tp_status). I guess it works fine because the assignments are atomic (like the 
one on af_packet.c:671). Correct?
BTW what's the purpose of mb() and why is it exactly needed in that position in 
the code?

Thx again!

Salu2,
J.
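
The setup order suggested in this thread (create the socket with
protocol 0, configure and map the ring, and only then bind with the real
protocol) might look like the sketch below. It is untested illustration:
the ex_* names and the ring geometry are made up, error handling is
minimal, and creating the socket needs CAP_NET_RAW.

```c
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>

#ifndef SOL_PACKET
#define SOL_PACKET 263
#endif

/* Address used for the late bind: all protocols on one interface. */
struct sockaddr_ll ex_bind_addr(int ifindex)
{
    struct sockaddr_ll sll;

    memset(&sll, 0, sizeof(sll));
    sll.sll_family = AF_PACKET;
    sll.sll_protocol = htons(ETH_P_ALL);
    sll.sll_ifindex = ifindex;
    return sll;
}

/* Returns the socket fd, or -1 on failure; *ring and *ring_len describe the map. */
int ex_open_ring(int ifindex, void **ring, size_t *ring_len)
{
    struct sockaddr_ll sll = ex_bind_addr(ifindex);
    struct tpacket_req req;
    unsigned int page = (unsigned int)sysconf(_SC_PAGESIZE);
    int fd;

    /* Illustrative geometry: one page per block, 2048-byte frames. */
    memset(&req, 0, sizeof(req));
    req.tp_block_size = page;
    req.tp_frame_size = 2048;
    req.tp_block_nr = 64;
    req.tp_frame_nr = (page / 2048) * req.tp_block_nr;

    /* Protocol 0: no packets are delivered until bind() below. */
    fd = socket(PF_PACKET, SOCK_RAW, 0);
    if (fd < 0)
        return -1;
    if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req)) < 0)
        goto fail;

    *ring_len = (size_t)req.tp_block_size * req.tp_block_nr;
    *ring = mmap(NULL, *ring_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (*ring == MAP_FAILED)
        goto fail;

    /* Only now start receiving, with the ring already in place. */
    if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
        munmap(*ring, *ring_len);
        goto fail;
    }
    return fd;

fail:
    close(fd);
    return -1;
}
```

The point of the ordering is that nothing can be queued on the socket
before the ring exists, so there is no stale receive queue to flush.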





RE: [patch 1/5] d80211: remove bitfields from ieee80211_tx_control

2006-10-16 Thread Simon Barber
Removing the bitfields makes the code much harder to read and maintain.
Here we are working around a problem with the compiler by making the
code ugly - rather than fixing the compiler. The compilers are getting
better and better (GCC 4 has much better handling of this type of
optimization) but the code will remain ugly for ever.

Simon

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf Of Michael Buesch
Sent: Monday, October 16, 2006 9:07 AM
To: David Kimdon
Cc: netdev@vger.kernel.org; John W. Linville; Jiri Benc
Subject: Re: [patch 1/5] d80211: remove bitfields from
ieee80211_tx_control

On Friday 13 October 2006 21:20, David Kimdon wrote:
> All one-bit bitfields have been subsumed into the new 'flags'
> structure member and the new IEEE80211_TXCTL_* definitions.  The 
> multiple bit members were converted to u8, s8 or u16 as appropriate.

And, eh, did this increase or decrease the struct size?
Does this generate better or worse code?

--
Greetings Michael.


Re: [patch 3/6] 2.6.18: sb1250-mac: Phylib IRQ handling fixes

2006-10-16 Thread Andrew Morton
On Mon, 16 Oct 2006 15:50:55 +0100 (BST)
"Maciej W. Rozycki" <[EMAIL PROTECTED]> wrote:

> Andrew,
> 
> > I don't get it.  If some code does
> > 
> > rtnl_lock();
> > flush_scheduled_work();
> > 
> > and there's some work scheduled which does rtnl_lock() then it'll deadlock.
> > 
> > But it'll deadlock whether or not the caller of flush_scheduled_work() is
> > keventd.
> > 
> > Calling flush_scheduled_work() under locks is generally a bad idea.
> 
>  Indeed -- this is why I avoid it.
> 
> > I'd have thought that was still deadlockable.  Could you describe the
> > deadlock more completely please?
> 
>  The simplest sequence of calls that prevents races here is as follows:
> 
> unregister_netdev();
>   rtnl_lock();
>   unregister_netdevice();
> dev_close();
>   sbmac_close();
> phy_stop();
> phy_disconnect();
>   phy_stop_interrupts();
> phy_disable_interrupts();
> flush_scheduled_work();
> free_irq();
>   phy_detach();
> mdiobus_unregister();
>   rtnl_unlock();
> 
> We want to call flush_scheduled_work() from phy_stop_interrupts(), because 
> there may still be calls to phy_change() waiting for the keventd to 
> process and mdiobus_unregister() frees structures needed by phy_change().  
> This may deadlock because of the call to rtnl_lock() though.
> 
>  So the modified sequence I have implemented is as follows:
> 
> unregister_netdev();
>   rtnl_lock();
>   unregister_netdevice();
> dev_close();
>   sbmac_close();
> phy_stop();
> schedule_work(); [sbmac_phy_disconnect()]
>   rtnl_unlock();
> 
> and then:
> 
> sbmac_phy_disconnect();
>   phy_disconnect();
> phy_stop_interrupts();
>   phy_disable_interrupts();
>   free_irq();
> phy_detach();
>   mdiobus_unregister();
> 
> This guarantees calls to phy_change() for this PHY unit will not be run 
> after mdiobus_unregister(), because any such calls have been placed in the 
> queue before sbmac_phy_disconnect() (phy_stop() prevents the interrupt 
> handler from issuing new calls to phy_change()).
> 
>  We still need flush_scheduled_work() to be called from 
> phy_stop_interrupts() though, in case some other MAC driver calls 
> phy_disconnect() (or phy_stop_interrupts(), depending on the abstraction 
> layer of phylib used) directly rather than using keventd.  This is 
> possible if phy_disconnect() is called from the driver's module_exit() 
> call, which may make sense for devices that are known not to have their 
> MII interface available as an external connector.  Hence the:
> 
> if (!current_is_keventd())
>   flush_scheduled_work();
> 
> sequence in phy_stop_interrupts().  Of course, we can force all drivers 
> using phylib (in the interrupt mode) to call phy_disconnect() through 
> keventd instead.
> 
>  Does it sound clearer?
> 

Vaguely.  Why doesn't it deadlock if !current_is_keventd()?  I mean,
whether or not the caller is keventd, the flush_scheduled_work() caller
will still be dependent upon rtnl_lock() being acquirable.




Re: [RFC] wrr (weighted round-robin) bonding

2006-10-16 Thread Dawid Ciezarkiewicz
On Monday, 16 October 2006 20:50, you wrote:
> 
> Dawid Ciezarkiewicz <[EMAIL PROTECTED]> wrote:
> [...]
> >+weighted-rr or 7
> >+
> >+Weighted round-robin bonding. In this mode bonding
> >+interface will use weights assigned to it's slaves.
> >+
> >+Each slave can have weight assigned via ioctl (ifenslave).
> >+These values will be used at the start of each "cycle".
> >+Each slave will have token counter restored to it's weight.
> >+Then using round-robin mechanism those tokens are "used"
> >+to pay for emitted frames. When all token counters are
> >+zeroed - new "cycle" begins.
> 
>   Before getting into the technical bits of the patch, what's the
> reason for wanting to do this, and why is this rather complex manual
> weight assignment better than an automatic system based on, e.g., link
> speed of the slaves?

In short:
It was designed as a solution for wireless links bonding - where link quality 
can change rather quickly in time. By using wrr bonding, userspace tools can 
measure current bandwidth and change bonding slave weights in realtime.

It was written for Lintrack, and you can read about its usage here:
http://lintrack.org/index.php/about/advantage
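
The token scheme from the quoted documentation can be sketched in user
space as follows. This is one reading of the description, not the code
from the patch: the ex_* names are invented, and the choice to continue
the round-robin scan from the last slave used is an assumption.

```c
struct ex_wrr_slave {
    unsigned int weight;  /* tokens granted at the start of each cycle */
    unsigned int tokens;  /* tokens left in the current cycle */
};

struct ex_wrr_bond {
    struct ex_wrr_slave *slaves;
    int n;     /* number of slaves */
    int last;  /* index of the slave used for the previous frame */
};

/* Round-robin scan for the next slave that still has a token to spend. */
static int ex_wrr_scan(struct ex_wrr_bond *b)
{
    for (int i = 1; i <= b->n; i++) {
        int idx = (b->last + i) % b->n;

        if (b->slaves[idx].tokens > 0) {
            b->slaves[idx].tokens--;
            b->last = idx;
            return idx;
        }
    }
    return -1;
}

/* Pick the slave for the next frame; -1 if every weight is zero. */
int ex_wrr_next(struct ex_wrr_bond *b)
{
    int idx = ex_wrr_scan(b);

    if (idx >= 0)
        return idx;
    /* All counters are zero: start a new cycle by restoring the weights. */
    for (int i = 0; i < b->n; i++)
        b->slaves[i].tokens = b->slaves[i].weight;
    return ex_wrr_scan(b);
}
```

With weights 2 and 1, six consecutive frames are split four to two
between the slaves. A userspace tool that changes a weight takes effect
at the next cycle boundary.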


Re: [e1000]: flow control on by default - good idea really?

2006-10-16 Thread Auke Kok

jamal wrote:
> On Thu, 2006-06-07 at 23:59 -0700, David Miller wrote:
>> It's autonegotiated, check your kernel message logs when the link
>> came up, you'll see this:
>>
>> tg3: eth0: Flow control is on for TX and on for RX.
>
> yikes - yes, this would be it.
>
> I could be wrong and i will double check:
> I think when the e1000 says via ethtool "rx is on" - it means that it
> is _advertising_ flow control as opposed to detecting partner has flow
> control capability.
> Auke, can you also check this as well?

Just found this in my todo box - a bit late :(

yes, that appears to be the correct interpretation: we never read back the
detected FC state from the hardware.


Auke


Re: [RFC] wrr (weighted round-robin) bonding

2006-10-16 Thread Jay Vosburgh

Dawid Ciezarkiewicz <[EMAIL PROTECTED]> wrote:
[...]
>+  weighted-rr or 7
>+
>+  Weighted round-robin bonding. In this mode bonding
>+  interface will use weights assigned to it's slaves.
>+
>+  Each slave can have weight assigned via ioctl (ifenslave).
>+  These values will be used at the start of each "cycle".
>+  Each slave will have token counter restored to it's weight.
>+  Then using round-robin mechanism those tokens are "used"
>+  to pay for emitted frames. When all token counters are
>+  zeroed - new "cycle" begins.

Before getting into the technical bits of the patch, what's the
reason for wanting to do this, and why is this rather complex manual
weight assignment better than an automatic system based on, e.g., link
speed of the slaves?

-J

---
-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]


[PATCH] d80211: remove unused Super AG definitions, purge comment

2006-10-16 Thread David Kimdon
Remove unused Super AG structure members, enums.

In struct ieee80211_tx_status the queue_length and queue_number could
be useful outside the context of Super AG, so remove the comment and
leave the members.

Signed-off-by: David Kimdon <[EMAIL PROTECTED]>

Index: wireless-dev/include/net/d80211.h
===
--- wireless-dev.orig/include/net/d80211.h
+++ wireless-dev/include/net/d80211.h
@@ -159,12 +159,6 @@ struct ieee80211_tx_control {
unsigned int requeue:1;
unsigned int first_fragment:1;  /* This is a first fragment of the
 * frame */
-/* following three flags are only used with Atheros Super A/G */
-   unsigned int compress:1;
-   unsigned int turbo_prime_notify:1; /* notify HostAPd after frame
-   * transmission */
-   unsigned int fast_frame:1;
-
 unsigned int power_level:8; /* per-packet transmit power level, in dBm
 */
unsigned int antenna_sel:4; /* 0 = default/diversity,
@@ -219,7 +213,6 @@ struct ieee80211_tx_status {
int excessive_retries;
int retry_count;
 
-   /* following two fields are only used with Atheros Super A/G */
int queue_length;  /* information about TX queue */
int queue_number;
 };
@@ -265,13 +258,6 @@ struct ieee80211_conf {
 int antenna_def;
 int antenna_mode;
 
-   int atheros_super_ag_compression;
-   int atheros_super_ag_fast_frame;
-   int atheros_super_ag_burst;
-   int atheros_super_ag_wme_ele;
-   int atheros_super_ag_turbo_g;
-   int atheros_super_ag_turbo_prime;
-
/* Following five fields are used for IEEE 802.11H */
unsigned int radar_detect;
unsigned int spect_mgmt;
Index: wireless-dev/net/d80211/hostapd_ioctl.h
===
--- wireless-dev.orig/net/d80211/hostapd_ioctl.h
+++ wireless-dev/net/d80211/hostapd_ioctl.h
@@ -182,10 +182,6 @@ struct prism2_hostapd_param {
u16 aid;
u16 capability;
u8 supp_rates[32];
-   /* atheros_super_ag and enc_flags are only used with
-* IEEE80211_ATHEROS_SUPER_AG
-*/
-   u8 atheros_super_ag;
u8 wds_flags;
 #define IEEE80211_STA_DYNAMIC_ENC BIT(0)
u8 enc_flags;
Index: wireless-dev/include/net/d80211_shared.h
===
--- wireless-dev.orig/include/net/d80211_shared.h
+++ wireless-dev/include/net/d80211_shared.h
@@ -19,8 +19,6 @@ enum {
MODE_ATHEROS_TURBO = 2 /* Atheros Turbo mode (2x.11a at 5 GHz) */,
MODE_IEEE80211G = 3 /* IEEE 802.11g (and 802.11b compatibility) */,
MODE_ATHEROS_TURBOG = 4 /* Atheros Turbo mode (2x.11g at 2.4 GHz) */,
-   MODE_ATHEROS_PRIME = 5 /* Atheros Dynamic Turbo mode */,
-   MODE_ATHEROS_PRIMEG = 6 /* Atheros Dynamic Turbo mode G */,
NUM_IEEE80211_MODES = 7
 };
 

--


Re: [RFC] wrr (weighted round-robin) bonding

2006-10-16 Thread Dawid Ciezarkiewicz
On Monday, 16 October 2006 20:21, Dawid Ciezarkiewicz wrote:
> This patch is little thinner then the previous one.

I'm sorry for that. I've just ... nevermind. Here goes the patch.

Should I post patch for ifenslave here, too?



diff -Nur linux-2.6.17.orig/Documentation/networking/bonding.txt linux-2.6.17/Documentation/networking/bonding.txt
--- linux-2.6.17.orig/Documentation/networking/bonding.txt  2006-06-18 03:49:35.0 +0200
+++ linux-2.6.17/Documentation/networking/bonding.txt   2006-07-28 15:47:55.0 +0200
@@ -398,6 +398,19 @@
swapped with the new curr_active_slave that was
chosen.
 
+   weighted-rr or 7
+
+   Weighted round-robin bonding. In this mode the bonding
+   interface will use the weights assigned to its slaves.
+
+   Each slave can have a weight assigned via ioctl (ifenslave).
+   These values will be used at the start of each "cycle".
+   Each slave will have its token counter restored to its weight.
+   Then, using a round-robin mechanism, those tokens are "used"
+   to pay for emitted frames. When all token counters are
+   zeroed, a new "cycle" begins.
+   
+
 primary
 
A string (eth0, eth2, etc) specifying which slave is the
diff -Nur linux-2.6.17.orig/drivers/net/bonding/bond_main.c linux-2.6.17/drivers/net/bonding/bond_main.c
--- linux-2.6.17.orig/drivers/net/bonding/bond_main.c   2006-06-18 03:49:35.0 +0200
+++ linux-2.6.17/drivers/net/bonding/bond_main.c    2006-07-28 15:31:44.0 +0200
@@ -115,7 +115,7 @@
 MODULE_PARM_DESC(mode, "Mode of operation : 0 for balance-rr, "
   "1 for active-backup, 2 for balance-xor, "
   "3 for broadcast, 4 for 802.3ad, 5 for balance-tlb, "
-  "6 for balance-alb");
+  "6 for balance-alb, 7 for weighted-rr");
 module_param(primary, charp, 0);
 MODULE_PARM_DESC(primary, "Primary network device to use");
 module_param(lacp_rate, charp, 0);
@@ -162,6 +162,7 @@
 {  "802.3ad",  BOND_MODE_8023AD},
 {  "balance-tlb",  BOND_MODE_TLB},
 {  "balance-alb",  BOND_MODE_ALB},
+{  "weighted-rr",  BOND_MODE_WEIGHTED_RR},
 {  NULL,   -1},
 };
 
@@ -194,6 +195,8 @@
return "transmit load balancing";
case BOND_MODE_ALB:
return "adaptive load balancing";
+   case BOND_MODE_WEIGHTED_RR:
+   return "weighted round robin (weighted-rr)";
default:
return "unknown";
}
@@ -1198,6 +1201,24 @@
return 0;
 }
 
+int bond_set_weight(struct net_device *bond_dev, struct net_device *slave_dev,
+   u16 weight)
+{
+   struct slave* slave;
+   slave = bond_get_slave_by_dev(bond_dev->priv, slave_dev);
+   if (!slave) {
+   return -EINVAL;
+   }
+
+   slave->weight = weight;
+
+   if (weight) {
+   slave->link = BOND_LINK_UP;
+   slave->state = BOND_STATE_ACTIVE;
+   }
+   return 0;
+}
+
 #define BOND_INTERSECT_FEATURES \
(NETIF_F_SG|NETIF_F_IP_CSUM|NETIF_F_NO_CSUM|NETIF_F_HW_CSUM|\
NETIF_F_TSO|NETIF_F_UFO)
@@ -1336,6 +1352,9 @@
 */
new_slave->original_flags = slave_dev->flags;
 
+   /* slave default weight = 1 */
+   new_slave->weight = 1;
+
/*
 * Save slave's original ("permanent") mac address for modes
 * that need it, and for restoring it upon release, and then
@@ -3601,7 +3620,10 @@
}
 
down_write(&(bonding_rwsem));
-   slave_dev = dev_get_by_name(ifr->ifr_slave);
+   if (cmd != SIOCBONDSETWEIGHT)
+   slave_dev = dev_get_by_name(ifr->ifr_slave);
+   else
+   slave_dev = dev_get_by_name(ifr->ifr_weight_slave);
 
dprintk("slave_dev=%p: \n", slave_dev);
 
@@ -3626,6 +3648,9 @@
case SIOCBONDCHANGEACTIVE:
res = bond_ioctl_change_active(bond_dev, slave_dev);
break;
+   case SIOCBONDSETWEIGHT:
+   res = bond_set_weight(bond_dev, slave_dev, ifr->ifr_weight_weight);
+   break;
default:
res = -EOPNOTSUPP;
}
@@ -3881,6 +3906,67 @@
return 0;
 }
 
+static int bond_xmit_weighted_rr(struct sk_buff *skb, struct net_device *bond_dev)
+{
+   struct bonding *bond = bond_dev->priv;
+   struct slave *slave, *start_at;
+   int i;
+   int res = 1;
+   int were_weight_tokens_recharged = 0;
+
+   read_lock(&bond->lock);
+
+   if (!BOND_IS_OK(bond)) {
+   goto out;
+   }
+
+   read_lock(&bond->curr_slave_lock);
+   slave = start_at = bond->curr_active_slave;
+   read_unlock(&bond->curr_slave_lock);
+
+   if (!slave) {
+   goto out;
+   }
+
+
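
[The patch is cut off here in the archive. The token scheme described in the
bonding.txt hunk above can be sketched in plain user-space C as follows. This
is only an illustration under stated assumptions: the names wrr_slave,
wrr_bond, wrr_recharge and wrr_pick are hypothetical and are not taken from
the patch, and no locking or link-state handling is modelled.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-slave state: configured weight and remaining tokens. */
struct wrr_slave {
	unsigned int weight;
	unsigned int tokens;
};

/* Hypothetical bond state: slave array plus the round-robin cursor. */
struct wrr_bond {
	struct wrr_slave *slaves;
	size_t nslaves;
	size_t next;	/* index at which the round-robin scan resumes */
};

/* Start of a new "cycle": every token counter is restored to its weight. */
static void wrr_recharge(struct wrr_bond *b)
{
	for (size_t i = 0; i < b->nslaves; i++)
		b->slaves[i].tokens = b->slaves[i].weight;
}

/*
 * Pick the slave that pays for the next frame.  Returns the slave index,
 * or -1 if no slave has a non-zero weight.
 */
static int wrr_pick(struct wrr_bond *b)
{
	for (int pass = 0; pass < 2; pass++) {
		for (size_t n = 0; n < b->nslaves; n++) {
			size_t i = (b->next + n) % b->nslaves;

			if (b->slaves[i].tokens > 0) {
				b->slaves[i].tokens--;
				b->next = (i + 1) % b->nslaves;
				return (int)i;
			}
		}
		/* All counters are zero: begin a new cycle and scan once more. */
		wrr_recharge(b);
	}
	return -1;
}
```

[With weights 2 and 1, six consecutive calls to wrr_pick() hand four frames
to the first slave and two to the second, which is the ratio the weights ask
for.]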

PATCH zero-copy send completion callback

2006-10-16 Thread Eric Barton

This patch has been used with the lustre cluster file system (www.lustre.org)
to give notification when page buffers used to send bulk data via TCP/IP may be
overwritten.  It implements...

  a) A general-purpose callback to inform higher-level protocols when a
 zero-copy send of a set of pages has completed.

  b) tcp_sendpage_zccd(), a variation on tcp_sendpage() that includes a
 completion callback parameter.

How to use it ("you" are a higher-level protocol driver)...

  a) Initialise a zero-copy descriptor with your callback procedure.

  b) Pass this descriptor in all zero-copy sends for an arbitrary set of pages.
 Skbuffs that reference your pages also take a reference on your zero-copy
 callback descriptor.  They release this reference when they release their
 page references.

  c) Release your own reference when you've posted all your pages and you're
 ready for the callback.

  d) The callback occurs when the last reference is dropped.
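
[Steps a) to d) can be sketched in user-space C as below. This is a
simulation under stated assumptions, not the kernel code: C11 atomics stand
in for atomic_t, and struct bulk_send, bulk_send_start(), bulk_send_done()
and skb_released() are hypothetical names for a caller's per-send state.
Only struct zccd and the zccd_init/incref/decref helpers mirror the patch.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct zccd {
	atomic_int zccd_refcount;
	void (*zccd_callback)(struct zccd *);
};

static void zccd_init(struct zccd *d, void (*callback)(struct zccd *))
{
	atomic_init(&d->zccd_refcount, 1);	/* the caller's own reference */
	d->zccd_callback = callback;
}

static void zccd_incref(struct zccd *d)
{
	atomic_fetch_add(&d->zccd_refcount, 1);
}

static void zccd_decref(struct zccd *d)
{
	/* fetch_sub returns the old value: 1 means we dropped the last ref */
	if (atomic_fetch_sub(&d->zccd_refcount, 1) == 1)
		d->zccd_callback(d);
}

/* Hypothetical send state embedding the descriptor, as the patch suggests. */
struct bulk_send {
	struct zccd desc;
	int pages_reusable;
};

/* Completion callback: recover the enclosing struct from the descriptor. */
static void bulk_send_done(struct zccd *d)
{
	struct bulk_send *bs =
		(struct bulk_send *)((char *)d - offsetof(struct bulk_send, desc));

	bs->pages_reusable = 1;
}

static void bulk_send_start(struct bulk_send *bs, int nr_skbs)
{
	bs->pages_reusable = 0;
	zccd_init(&bs->desc, bulk_send_done);	/* step a) */
	for (int i = 0; i < nr_skbs; i++)
		zccd_incref(&bs->desc);		/* step b): one ref per skbuff */
	zccd_decref(&bs->desc);			/* step c): drop our own ref */
}

/* Stands in for an skbuff releasing its page references. */
static void skb_released(struct bulk_send *bs)
{
	zccd_decref(&bs->desc);			/* step d) fires on the last one */
}
```

[Starting the count at one is what keeps the callback from firing while the
caller is still posting pages: it can only run after step c).]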


This patch applies on branch 'master' of
git://kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6


diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 85577a4..4afaef1 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -129,6 +129,36 @@ struct skb_frag_struct {
__u16 size;
 };
 
+/* Zero Copy Callback Descriptor
+ * This struct supports receiving notification when zero-copy network I/O has
+ * completed.  The ZCCD can be embedded in a struct containing the state of a
+ * zero-copy network send.  Every skbuff that references that send's pages also
+ * keeps a reference on the ZCCD.  When they have all been disposed of, the
+ * reference count on the ZCCD drops to zero and the callback is made, telling
+ * the original caller that the pages may now be overwritten. */
+struct zccd 
+{
+   atomic_t zccd_refcount;
+   void   (*zccd_callback)(struct zccd *); 
+};
+
+static inline void zccd_init (struct zccd *d, void (*callback)(struct zccd *))
+{
+   atomic_set (&d->zccd_refcount, 1);
+   d->zccd_callback = callback;
+}
+
+static inline void zccd_incref (struct zccd *d)/* take a reference */
+{
+   atomic_inc (&d->zccd_refcount);
+}
+
+static inline void zccd_decref (struct zccd *d)/* release a reference */
+{
+   if (atomic_dec_and_test (&d->zccd_refcount))
+   (d->zccd_callback)(d);
+}
+
 /* This data is invariant across clones and lives at
  * the end of the header data, ie. at skb->end.
  */
@@ -141,6 +171,11 @@ struct skb_shared_info {
unsigned short  gso_type;
unsigned intip6_frag_id;
struct sk_buff  *frag_list;
+   struct zccd *zccd1;
+   struct zccd *zccd2;
+   /* NB zero-copy data is normally whole pages.  We have 2 zccds in an
+* skbuff so we don't unnecessarily split the packet where pages fall
+* into the same packet. */
skb_frag_t  frags[MAX_SKB_FRAGS];
 };
 
@@ -1311,6 +1346,23 @@ #ifdef CONFIG_HIGHMEM
 #endif
 }
 
+/* This skbuf has dropped its pages: drop refs on any zero-copy callback
+ * descriptors it has. */
+static inline void skb_complete_zccd (struct sk_buff *skb)
+{
+   struct skb_shared_info *info = skb_shinfo(skb);
+   
+   if (info->zccd1 != NULL) {
+   zccd_decref(info->zccd1);
+   info->zccd1 = NULL;
+   }
+
+   if (info->zccd2 != NULL) {
+   zccd_decref(info->zccd2);
+   info->zccd2 = NULL;
+   }
+}
+
 #define skb_queue_walk(queue, skb) \
for (skb = (queue)->next;   \
 prefetch(skb->next), (skb != (struct sk_buff *)(queue));   \
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7a093d0..e02b55f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -278,6 +278,8 @@ extern int  tcp_v4_tw_remember_stam
 extern int tcp_sendmsg(struct kiocb *iocb, struct sock *sk,
struct msghdr *msg, size_t size);
 extern ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset, size_t size, int flags);
+extern ssize_t tcp_sendpage_zccd(struct socket *sock, struct page *page, int offset, size_t size,
+ int flags, struct zccd *zccd);
 
 extern int tcp_ioctl(struct sock *sk, 
  int cmd, 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3c23760..a1d2ed0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -177,6 +177,8 @@ struct sk_buff *__alloc_skb(unsigned int
shinfo->gso_type = 0;
shinfo->ip6_frag_id = 0;
shinfo->frag_list = NULL;
+   shinfo->zccd1 = NULL;
+   shinfo->zccd2 = NULL;
 
if (fclone) {
struct sk_buff *ch

Re: 2.6.18-mm2 boot failure on x86-64

2006-10-16 Thread Vivek Goyal
On Mon, Oct 09, 2006 at 10:53:58AM +0100, Mel Gorman wrote:
> On Fri, 6 Oct 2006, Vivek Goyal wrote:
> 
> >On Fri, Oct 06, 2006 at 01:03:50PM -0500, Steve Fox wrote:
> >>On Fri, 2006-10-06 at 18:11 +0100, Mel Gorman wrote:
> >>>On (06/10/06 11:36), Vivek Goyal didst pronounce:
> Where is bss placed in physical memory? I guess bss_start and bss_stop
> from System.map will tell us. That will confirm that above memset step is
> stomping over bss. Then we have to just find that somewhere probably
> we allocated wrong physical memory area for bootmem allocator map.
> 
> >>>
> >>>BSS is at 0x643000 -> 0x777BC4
> >>>init_bootmem wipes from 0x777000 -> 0x8F7000
> >>>
> >>>So the BSS bytes from 0x777000 -> 0x777BC4 (which looks very suspiciously
> >>>like a page alignment of addr & PAGE_MASK) get set to 0xFF. One possible
> >>>fix is below. It adds a check in bad_addr() to see if the BSS section is
> >>>about to be used for bootmap. It Seems To Work For Me (tm) and illustrates
> >>>the source of the problem even if it's not the 100% correct fix.
> >>
> >>I was able to boot the machine with Mel's patch applied on top of
> >>-git22.
> >
> >
> >Please have a look at the attached patch. Does it make some sense.
> >
> 
> It makes some sense. As you state, it wastes memory but that is better 
> than breaking.
> 
> >Steve, can you please give this patch a try if it fixes the problem?
> >
> 
> I boottested the patch on the same machine as Steve was using and it 
> completed successfully.
>

Hi Andrew,

Can you please have a look at the attached patch and include it in -mm?
This fixes the issue for Steve. It also appears in Adrian Bunk's list
of known regressions.

Subject: oops in xfrm_register_mode
References : http://lkml.org/lkml/2006/10/4/170
Submitter  : Steve Fox <[EMAIL PROTECTED]>
Handled-By : Vivek Goyal <[EMAIL PROTECTED]>
Status : patch available



o Currently some pieces of code assume that the address returned by
  find_e820_area() is page aligned. But it looks like find_e820_area() had no
  such intention, and hence one might end up stomping over some of the data.
  One such case is the bootmem allocator initialization code stomping over bss.

o This patch modifies find_e820_area() to return a page-aligned address. This
  might be a little wasteful of memory, but at the same time it is probably
  easier to handle page-aligned memory.

Signed-off-by: Vivek Goyal <[EMAIL PROTECTED]>
---

 arch/x86_64/kernel/e820.c |   14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff -puN arch/x86_64/kernel/e820.c~x86_64-return-page-aligned-phy-addr-from-find-e820-area arch/x86_64/kernel/e820.c
--- linux-2.6.19-rc1-1M/arch/x86_64/kernel/e820.c~x86_64-return-page-aligned-phy-addr-from-find-e820-area   2006-10-06 15:28:13.0 -0400
+++ linux-2.6.19-rc1-1M-root/arch/x86_64/kernel/e820.c  2006-10-06 15:44:45.0 -0400
@@ -54,13 +54,13 @@ static inline int bad_addr(unsigned long
 
/* various gunk below that needed for SMP startup */
if (addr < 0x8000) { 
-   *addrp = 0x8000;
+   *addrp = PAGE_ALIGN(0x8000);
return 1; 
}
 
/* direct mapping tables of the kernel */
if (last >= table_start<= INITRD_START && 
addr < INITRD_START+INITRD_SIZE) { 
-   *addrp = INITRD_START + INITRD_SIZE; 
+   *addrp = PAGE_ALIGN(INITRD_START + INITRD_SIZE);
return 1;
} 
 #endif
/* kernel code */
-   if (last >= __pa_symbol(&_text) && last < __pa_symbol(&_end)) {
-   *addrp = __pa_symbol(&_end);
+   if (last >= __pa_symbol(&_text) && addr < __pa_symbol(&_end)) {
+   *addrp = PAGE_ALIGN(__pa_symbol(&_end));
return 1;
}
 
if (last >= ebda_addr && addr < ebda_addr + ebda_size) {
-   *addrp = ebda_addr + ebda_size;
+   *addrp = PAGE_ALIGN(ebda_addr + ebda_size);
return 1;
}
 
@@ -152,7 +152,7 @@ unsigned long __init find_e820_area(unsi
continue; 
while (bad_addr(&addr, size) && addr+size <= ei->addr+ei->size)
;
-   last = addr + size;
+   last = PAGE_ALIGN(addr) + size;
if (last > ei->addr + ei->size)
continue;
if (last > end) 
_
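
[The arithmetic behind this thread can be checked with a few lines of
user-space C. The sketch assumes 4 KiB pages; the DEMO_* macros and the
helper names are hypothetical stand-ins for the kernel's PAGE_MASK and
PAGE_ALIGN(), and the addresses are the ones Mel quoted above.]

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  (1UL << DEMO_PAGE_SHIFT)
#define DEMO_PAGE_MASK  (~(DEMO_PAGE_SIZE - 1))

/* What "addr & PAGE_MASK" does: round down to the start of the page. */
static unsigned long page_round_down(unsigned long addr)
{
	return addr & DEMO_PAGE_MASK;
}

/* What PAGE_ALIGN(addr) does: round up to the next page boundary. */
static unsigned long page_round_up(unsigned long addr)
{
	return (addr + DEMO_PAGE_SIZE - 1) & DEMO_PAGE_MASK;
}

/*
 * Number of bytes at the tail of a region ending at region_end that get
 * overwritten if the next user starts writing at next_start.
 */
static unsigned long overlap_bytes(unsigned long region_end,
				   unsigned long next_start)
{
	return next_start < region_end ? region_end - next_start : 0;
}
```

[With the BSS ending at 0x777BC4, rounding down puts the bootmem map at
0x777000 and clobbers the last 0xBC4 bytes of BSS; rounding up puts it at
0x778000 and leaves the BSS alone, at the cost of the unused tail of the
page.]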


[RFC] wrr (weighted round-robin) bonding

2006-10-16 Thread Dawid Ciezarkiewicz
This patch is a little thinner than the previous one.



Re: [PATCH] NET : Suspicious locking in reqsk_queue_hash_req()

2006-10-16 Thread Eric Dumazet
On Monday 16 October 2006 18:56, Eric Dumazet wrote:
> On Monday 16 October 2006 18:16, Arnaldo Carvalho de Melo wrote:
> > On 10/16/06, Eric Dumazet <[EMAIL PROTECTED]> wrote:
> > > (Sorry, patch inlined this time)
> > >
> > > Hi David
> > >
> > > While browsing include/net/request_sock.h I found this suspicious
> > > locking protecting the SYN table hash table. I think this patch is
> > > necessary.
> > >
> > > Thank you
> >
> > Interesting, just checked and it was there before I moved this out of tcp
> > land:
>
> Well, the bug was there before you put your hands on the code (I checked
> linux-2.4.33 & linux-2.4.1 , bug present on both versions)

Well, 'bug' is not the appropriate word, in fact. Overkill, maybe?

The comment from include/net/request_sock.h explains the thing...

 * %syn_wait_lock is necessary only to avoid proc interface having to grab the main
 * lock sock while browsing the listening hash (otherwise it's deadlock prone).
 *
 * This lock is acquired in read mode only from listening_get_next() seq_file
 * op and it's acquired in write mode _only_ from code that is actively
 * changing rskq_accept_head. All readers that are holding the master sock lock
 * don't need to grab this lock in read mode too as rskq_accept_head. writes
 * are always protected from the main sock lock.

I bet more appropriate code (and less prone to misreading by kernel
gurus/newbies) would be:

What do you think?

Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>
--- linux-2.6.19-rc2/include/net/request_sock.h 2006-10-13 18:25:04.0 +0200
+++ linux-2.6.19-rc2-ed/include/net/request_sock.h  2006-10-16 19:34:19.0 +0200
@@ -254,9 +254,13 @@
req->sk = NULL;
req->dl_next = lopt->syn_table[hash];
 
-   write_lock(&queue->syn_wait_lock);
+   /*
+* We want previous writes to be committed before doing this change,
+* so that readers of the chain are not confused.
+*/
+   smp_mb();
+
lopt->syn_table[hash] = req;
-   write_unlock(&queue->syn_wait_lock);
 }
 
 #endif /* _REQUEST_SOCK_H */
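
[The idea in the patch above is "fully initialise the node, then publish
it". A portable user-space analogy, assuming C11 atomics, is sketched below.
It is not the kernel code: a release store plays the role of smp_mb()
followed by the plain store, and struct req, syn_table, hash_req() and
chain_length() are simplified, hypothetical stand-ins. Writers are assumed
to be serialised by some outer lock, as the main sock lock does in the
kernel.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct req {
	int id;
	struct req *dl_next;
};

#define SYN_TABLE_SIZE 4

/* Static storage, so every bucket starts out as a null pointer. */
static _Atomic(struct req *) syn_table[SYN_TABLE_SIZE];

/* Writer side: callers must already be serialised against each other. */
static void hash_req(struct req *r, unsigned int hash)
{
	/* Finish initialising the node before anyone can see it. */
	r->dl_next = atomic_load_explicit(&syn_table[hash],
					  memory_order_relaxed);

	/*
	 * Publish.  The release ordering guarantees that a reader who
	 * observes the new head also observes r->id and r->dl_next.
	 */
	atomic_store_explicit(&syn_table[hash], r, memory_order_release);
}

/* Reader side: walks the chain without taking any lock. */
static int chain_length(unsigned int hash)
{
	int n = 0;
	struct req *r = atomic_load_explicit(&syn_table[hash],
					     memory_order_acquire);

	for (; r != NULL; r = r->dl_next)
		n++;
	return n;
}
```

[This covers insertion only. Removing entries while lockless readers are
walking the chain is a separate problem that neither this sketch nor the
barrier alone addresses.]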



Re: [PATCH] NET : Suspicious locking in reqsk_queue_hash_req()

2006-10-16 Thread Eric Dumazet
On Monday 16 October 2006 18:16, Arnaldo Carvalho de Melo wrote:
> On 10/16/06, Eric Dumazet <[EMAIL PROTECTED]> wrote:
> > (Sorry, patch inlined this time)
> >
> > Hi David
> >
> > While browsing include/net/request_sock.h I found this suspicious locking
> > protecting the SYN table hash table. I think this patch is necessary.
> >
> > Thank you
>
> Interesting, just checked and it was there before I moved this out of tcp
> land:

Well, the bug was there before you put your hands on the code (I checked 
linux-2.4.33 & linux-2.4.1 , bug present on both versions)

:)

Eric


Re: [patch 1/5] d80211: remove bitfields from ieee80211_tx_control

2006-10-16 Thread Michael Buesch
On Friday 13 October 2006 21:20, David Kimdon wrote:
> All one-bit bitfields have been subsumed into the new 'flags'
> structure member and the new IEEE80211_TXCTL_* definitions.  The
> multiple bit members were converted to u8, s8 or u16 as appropriate.

And, eh, did this increase or decrease the struct size?
Does this generate better or worse code?

-- 
Greetings Michael.


Re: [PATCH] NET : Suspicious locking in reqsk_queue_hash_req()

2006-10-16 Thread Arnaldo Carvalho de Melo

On 10/16/06, Eric Dumazet <[EMAIL PROTECTED]> wrote:

> (Sorry, patch inlined this time)
>
> Hi David
>
> While browsing include/net/request_sock.h I found this suspicious locking
> protecting the SYN table hash table. I think this patch is necessary.
>
> Thank you


Interesting, just checked and it was there before I moved this out of tcp land:

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=0e87506fcc734647c7b2497eee4eb81e785c857a

@@ -898,18 +898,10 @@ static struct request_sock *tcp_v4_searc
static void tcp_v4_synq_add(struct sock *sk, struct request_sock *req)
{
 struct tcp_sock *tp = tcp_sk(sk);
-struct tcp_listen_opt *lopt = tp->listen_opt;
+   struct tcp_listen_opt *lopt = tp->accept_queue.listen_opt;
u32 h = tcp_v4_synq_hash(inet_rsk(req)->rmt_addr,
inet_rsk(req)->rmt_port, lopt->hash_rnd);
-req->expires = jiffies + TCP_TIMEOUT_INIT;
-req->retrans = 0;
-req->sk = NULL;
-req->dl_next = lopt->syn_table[h];
-
-write_lock(&tp->syn_wait_lock);
-lopt->syn_table[h] = req;
-write_unlock(&tp->syn_wait_lock);
-
+reqsk_queue_hash_req(&tp->accept_queue, h, req, TCP_TIMEOUT_INIT);
 tcp_synq_added(sk);
}



Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>


--- linux-2.6.18/include/net/request_sock.h.orig    2006-10-16 10:53:11.0 +0200
+++ linux-2.6.18-ed/include/net/request_sock.h  2006-10-16 10:53:24.0 +0200
@@ -251,9 +251,9 @@
req->expires = jiffies + timeout;
req->retrans = 0;
req->sk = NULL;
-   req->dl_next = lopt->syn_table[hash];

write_lock(&queue->syn_wait_lock);
+   req->dl_next = lopt->syn_table[hash];
lopt->syn_table[hash] = req;
write_unlock(&queue->syn_wait_lock);
 }



Re: [patch 3/6] 2.6.18: sb1250-mac: Phylib IRQ handling fixes

2006-10-16 Thread Maciej W. Rozycki
Andrew,

> I don't get it.  If some code does
> 
>   rtnl_lock();
>   flush_scheduled_work();
> 
> and there's some work scheduled which does rtnl_lock() then it'll deadlock.
> 
> But it'll deadlock whether or not the caller of flush_scheduled_work() is
> keventd.
> 
> Calling flush_scheduled_work() under locks is generally a bad idea.

 Indeed -- this is why I avoid it.

> I'd have thought that was still deadlockable.  Could you describe the
> deadlock more completely please?

 The simplest sequence of calls that prevents races here is as follows:

unregister_netdev();
  rtnl_lock();
  unregister_netdevice();
    dev_close();
      sbmac_close();
        phy_stop();
        phy_disconnect();
          phy_stop_interrupts();
            phy_disable_interrupts();
            flush_scheduled_work();
            free_irq();
          phy_detach();
        mdiobus_unregister();
  rtnl_unlock();

We want to call flush_scheduled_work() from phy_stop_interrupts(), because 
there may still be calls to phy_change() waiting for the keventd to 
process and mdiobus_unregister() frees structures needed by phy_change().  
This may deadlock because of the call to rtnl_lock() though.

 So the modified sequence I have implemented is as follows:

unregister_netdev();
  rtnl_lock();
  unregister_netdevice();
    dev_close();
      sbmac_close();
        phy_stop();
        schedule_work(); [sbmac_phy_disconnect()]
  rtnl_unlock();

and then:

sbmac_phy_disconnect();
  phy_disconnect();
    phy_stop_interrupts();
      phy_disable_interrupts();
      free_irq();
    phy_detach();
  mdiobus_unregister();

This guarantees calls to phy_change() for this PHY unit will not be run 
after mdiobus_unregister(), because any such calls have been placed in the 
queue before sbmac_phy_disconnect() (phy_stop() prevents the interrupt 
handler from issuing new calls to phy_change()).
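
[The ordering argument above relies on the queue being processed in FIFO
order by a single worker. A toy single-threaded model in C, under that
assumption, is sketched below. Every name ending in _sim is hypothetical;
nothing here is phylib or keventd code, and "bus_registered" merely stands
in for the structures that mdiobus_unregister() frees.]

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WORK 8

typedef void (*work_fn)(void);

/* A FIFO of pending work items, drained by one worker. */
static work_fn work_queue[MAX_WORK];
static size_t work_head, work_tail;

/* Record of what ran, in order, so the ordering can be inspected. */
static work_fn run_log[MAX_WORK];
static size_t run_count;

static int bus_registered = 1;	/* stands in for the MDIO bus structures */
static int touched_freed_bus;	/* set if phy_change ran after teardown */

static int schedule_work_sim(work_fn fn)
{
	if (work_tail == MAX_WORK)
		return -1;
	work_queue[work_tail++] = fn;
	return 0;
}

/* The worker: runs items strictly in the order they were queued. */
static void run_keventd_sim(void)
{
	while (work_head < work_tail)
		work_queue[work_head++]();
}

static void phy_change_sim(void)
{
	if (!bus_registered)
		touched_freed_bus = 1;
	run_log[run_count++] = phy_change_sim;
}

static void sbmac_phy_disconnect_sim(void)
{
	bus_registered = 0;	/* models mdiobus_unregister() */
	run_log[run_count++] = sbmac_phy_disconnect_sim;
}
```

[If a phy_change item is queued first and the disconnect item second, the
worker runs them in that order and touched_freed_bus stays zero. The model
says nothing about multiple workers or about items queued after the
disconnect, which is what phy_stop() is relied upon to prevent.]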

 We still need flush_scheduled_work() to be called from 
phy_stop_interrupts() though, in case some other MAC driver calls 
phy_disconnect() (or phy_stop_interrupts(), depending on the abstraction 
layer of phylib used) directly rather than using keventd.  This is 
possible if phy_disconnect() is called from the driver's module_exit() 
call, which may make sense for devices that are known not to have their 
MII interface available as an external connector.  Hence the:

if (!current_is_keventd())
  flush_scheduled_work();

sequence in phy_stop_interrupts().  Of course, we can force all drivers 
using phylib (in the interrupt mode) to call phy_disconnect() through 
keventd instead.

 Does it sound clearer?

  Maciej


Re: Hardware bug or kernel bug?

2006-10-16 Thread David Johnson
On Monday 16 October 2006 11:25, Jarek Poplawski wrote:
>
> Was this lock-up effect visible during above 2.6.19-rc1 tests?

No, I've not seen anything in Linux other than the reboots, which are instant 
without any preceding lock-up.

> If not I'd try to continue linux debugging:
> - is 2.6.19-rc1 working with "normal" config (use make oldconfig
> to "upgrade" .config),

With 2.6.19-rc1 and a normal config, I get the reboots as usual.

> - is 2.6.17 working with "minimal" config (use make oldconfig),

Yes.

> - changing one or two options at a time try to find which one makes
> the effect returns (acpi, smp...).

I've found the culprit - CPU Frequency Scaling.
With it enabled I get the reboots, with it disabled I don't. That's the same 
with every kernel version I've tried (2.6.19-rc1+rc2, 2.6.17.13 & CentOS' 
2.6.9). The system was using the p4-clockmod driver and the ondemand governor.

I'm still not sure exactly what the problem is - the reboots only happen in 
the circumstances I've mentioned and are not triggered by changes in clock 
speed alone - but disabling cpufreq seems to make it go away...

Thanks for your help,
David.


Re: [PATCH] bcm43xx-softmac: add PCI-E code

2006-10-16 Thread Michael Buesch
On Monday 16 October 2006 06:18, Larry Finger wrote:
> From: Stefano Brivio <[EMAIL PROTECTED]>
> 
> The current bcm43xx driver does not contain code to handle PCI-E interfaces
> such as the BCM4311 and BCM4312. This patch, originally written by Stefano
> Brivio adds the necessary code to enable these interfaces. 
> 
> Signed-off-by: Stefano Brivio <[EMAIL PROTECTED]>
> Signed-off-by: Larry Finger <[EMAIL PROTECTED]>

This patch should be OK. Please merge for 2.6.19.

-- 
Greetings Michael.


Re: [Bugme-new] [Bug 7366] New: BUG: unable to handle kernel paging request at virtual address d0cb03e0

2006-10-16 Thread Patrick McHardy
Please use reply to _all_. Quoting manually ..

Patrick McHardy wrote:
>> Does it also happen without external patches like ipp2p? Did you
>> load/unload any netfilter modules before?
>
> This happens after loading all the specific ip_conntrack modules, flushing
> all iptables rules, resetting counters, flushing all tables, unloading all
> ip_conntrack modules and then running the command  -j ACCEPT> . Tested also
> with kernel 2.6.18.1 and it works OK. I do not think this has anything to do
> with the ipp2p module, since it is not even used, and none of the commands
> I used specify this module.


Any chance you're also unloading iptables modules? If so this patch
(already in Dave's queue) should fix it ..

[NETFILTER]: fix cut-and-paste error in exit functions

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>

---
commit c7b1507f3c040c02efa1b955f7180a33a232c4d9
tree fd21258deca0e5d8859271bb2c745302ce5a1e2a
parent 26da6cf44bc574d528d715a17e48f54da061c151
author Patrick McHardy <[EMAIL PROTECTED]> Wed, 11 Oct 2006 08:35:50 +0200
committer Patrick McHardy <[EMAIL PROTECTED]> Wed, 11 Oct 2006 08:35:50 +0200

 net/netfilter/xt_NFQUEUE.c  |2 +-
 net/netfilter/xt_connmark.c |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/xt_NFQUEUE.c b/net/netfilter/xt_NFQUEUE.c
index db9b896..39e1175 100644
--- a/net/netfilter/xt_NFQUEUE.c
+++ b/net/netfilter/xt_NFQUEUE.c
@@ -68,7 +68,7 @@ static int __init xt_nfqueue_init(void)
 
 static void __exit xt_nfqueue_fini(void)
 {
-   xt_register_targets(xt_nfqueue_target, ARRAY_SIZE(xt_nfqueue_target));
+   xt_unregister_targets(xt_nfqueue_target, ARRAY_SIZE(xt_nfqueue_target));
 }
 
 module_init(xt_nfqueue_init);
diff --git a/net/netfilter/xt_connmark.c b/net/netfilter/xt_connmark.c
index 92a5726..a8f0305 100644
--- a/net/netfilter/xt_connmark.c
+++ b/net/netfilter/xt_connmark.c
@@ -147,7 +147,7 @@ static int __init xt_connmark_init(void)
 
 static void __exit xt_connmark_fini(void)
 {
-   xt_register_matches(xt_connmark_match, ARRAY_SIZE(xt_connmark_match));
+   xt_unregister_matches(xt_connmark_match, ARRAY_SIZE(xt_connmark_match));
 }
 
 module_init(xt_connmark_init);
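
[Why the cut-and-paste error matters can be shown with a toy registry in
user-space C. This is a sketch under stated assumptions: the registry, the
struct target type and the demo_* functions are hypothetical and far simpler
than the real x_tables code. The point is only that an exit path which
registers again leaves entries behind that would point into unloaded module
memory.]

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TARGETS 4

struct target {
	const char *name;
};

/* Toy registry: a fixed table of pointers to registered targets. */
static const struct target *registry[MAX_TARGETS];

static int register_target(const struct target *t)
{
	for (size_t i = 0; i < MAX_TARGETS; i++) {
		if (registry[i] == NULL) {
			registry[i] = t;
			return 0;
		}
	}
	return -1;	/* table full */
}

/* Removes every entry that points at t. */
static void unregister_target(const struct target *t)
{
	for (size_t i = 0; i < MAX_TARGETS; i++)
		if (registry[i] == t)
			registry[i] = NULL;
}

static size_t registered_count(void)
{
	size_t n = 0;

	for (size_t i = 0; i < MAX_TARGETS; i++)
		if (registry[i] != NULL)
			n++;
	return n;
}

static const struct target demo_target = { "DEMO" };

static int demo_init(void)
{
	return register_target(&demo_target);
}

/* The bug pattern: the exit path registers instead of unregistering. */
static void demo_fini_buggy(void)
{
	(void)register_target(&demo_target);
}

/* The fixed exit path mirrors the init path. */
static void demo_fini_fixed(void)
{
	unregister_target(&demo_target);
}
```

[After demo_init() and demo_fini_buggy() the table holds two entries for a
module that is supposedly gone; with demo_fini_fixed() it holds none.]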


Re: [take19 1/4] kevent: Core files.

2006-10-16 Thread Evgeniy Polyakov
On Mon, Oct 16, 2006 at 03:16:15AM -0700, Ulrich Drepper ([EMAIL PROTECTED]) wrote:
> Evgeniy Polyakov wrote:
> >The whole idea of mmap buffer seems to be broken, since those who asked
> >for creation do not like existing design and do not show theirs...
> 
> What kind of argumentation is that?
> 
>"Because my attempt to implement it doesn't work and nobody right
> away has a better suggestion this means the idea is broken."
> 
> Nonsense.

OK, let me reformulate:
My attempt works, but nobody around likes it, so I am removing it and will
wait until someone else implements it.

> It just means that time should be spend on thinking about this.  You cut 
> all this short by rushing out your attempt without any discussions. 
> Unfortunately nobody else really looked at the approach so it lingered 
> around for some weeks.  Well, now it is clear that it is not the right 
> approach and we can start thinking about it again.

I talked about it in the last 13 releases of kevent, and _no one_
made any comments. And now I get 'it is broken, it does not
work, there are problems, we do not want it' and the like. I tried
hard to show that it does work and that the problems shown cannot happen,
but still no one hears me. Since I do not think this interface is
100% required for correct functionality, I removed it. When there are
better suggestions and an implementation, we can of course return to them.

> >You seem not to have checked the code - each event can be marked as ready 
> >only one time, which means only one copy and so on.
> >It was done _specially_. And it is not a limitation, but a "new" approach.
> 
> I know that it is done deliberately and I tell you that this is wrong 
> and unacceptable.  Realtime signals are one event which need to have 
> more than one event queued.  This is no description of what you have 
> implemented, it's a description of the reality of realtime signals.
> 
> RT signals are queued.  They carry a data value (the sigval_t object) 
> which can be unique for each signal delivery.  Coalescing the signal 
> events therefore leads to information loss.
> 
> Therefore, at the very least for signal we need to have the ability to 
> queue more than one event for each event source.  Not having this 
> functionality means that signals and likely other types of events cannot 
> be implemented using kevent queues.

Well, my point about rt-signals is that they do not deserve to be
resurrected, but that is only my opinion :)
In case they are still used, each signal setup should create an event:
many signals means many events, and since each signal can be sent with
different parameters, each event should correspond to one unique case.
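The queueing behaviour both sides refer to can be sketched in userspace.
This is a minimal sketch assuming Linux/POSIX realtime signal semantics;
queue_and_collect() is a hypothetical helper name, not part of kevent. It
queues several instances of the same RT signal, each with its own sigval
payload, and reads them back one by one. Note that sigqueue() can fail
once the per-user limit on pending signals is reached, which is the
overflow case being argued about.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <unistd.h>

/*
 * Queue n instances of SIGRTMIN to this process, each carrying a
 * distinct integer payload, then collect the payloads into out[].
 * Returns n on success, -1 on any error (including a full queue).
 */
static int queue_and_collect(int n, int *out)
{
    sigset_t set;

    sigemptyset(&set);
    sigaddset(&set, SIGRTMIN);
    /* Block the signal so instances stay pending until we ask for them. */
    if (sigprocmask(SIG_BLOCK, &set, NULL) != 0)
        return -1;

    for (int i = 0; i < n; i++) {
        union sigval v;

        v.sival_int = 100 + i;      /* per-delivery payload */
        if (sigqueue(getpid(), SIGRTMIN, v) != 0)
            return -1;              /* e.g. EAGAIN when the queue is full */
    }

    for (int i = 0; i < n; i++) {
        siginfo_t info;

        if (sigwaitinfo(&set, &info) != SIGRTMIN)
            return -1;
        out[i] = info.si_value.sival_int;
    }
    return n;
}
```

Calling queue_and_collect(3, out) in a single-threaded program should yield
three separate payloads rather than one coalesced notification.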

> >Queue of the same signals or any other events has fundamental flawness
> >(as any other ring buffer implementation, which has queue size)  -
> >it's size of the queue and extremely bad case of the overflow.
> 
> Of course there are additional problems.  Overflows need to be handled. 
>  But this is nothing which is unsolvable.

I strongly disagree that a design which allows overflows is
acceptable. Do we really want rt-signal queue overflow problems in a new
place? Instead, some complex allocation scheme can be created.

> >So, the same event may not be ready several times. Any design which
> >allows to create infinite number of events generated for the same case
> >is broken, since consumer can be in situation, when it can not handle
> >that flow.
> 
> That's complete nonsense.  Again, for RT signals it is very reasonable 
> and not "broken" to have multiple outstanding signals.

The same signal with a different payload is acceptable, but when their
number exceeds the ulimit and they start to be dropped, that is what
I call a broken design.

> >That is why poll() returns only POLLIN when data is ready in
> >network stack, but is not trying to generate some kind of a signal for 
> >each byte/packet/MTU/MSS received.
> 
> It makes no sense to drag poll() into this discussion.  poll() is a very 
> limited interface.  The new event handling is supposed to be the 
> opposite, namely, usable for all kinds of events.  Arguing that because 
> poll() does it like this just means you don't see what big step is 
> needed to get to the goal of a unified event handling.  The shackles of 
> poll() must be left behind.

Kevent is that subsystem, and for now it works quite well.

> >RT signals have design problems, and I will not repeate the same error
> >with similar limits in kevent.
> 
> I don't know what to say.  You claim to be the source of all wisdom is 
> OS design.  Maybe you should design your own OS, from ground up.  I 
> wonder how many people would like that since all your arguments are 
> squarely geared towards optimizing the implementation.  But: the 
> implementation is irrelevant without users.  The functionality users (= 
> programmers) want and need is what must drive the implementation.  And 
> RT signals are definitely heavily used and liked by programmers.  You 
> have

Re: [take19 0/4] kevent: Generic event handling mechanism.

2006-10-16 Thread Evgeniy Polyakov
On Mon, Oct 16, 2006 at 02:59:48AM -0700, Ulrich Drepper ([EMAIL PROTECTED]) 
wrote:
> Evgeniy Polyakov wrote:
> >One can set number of events before the syscall and do not remove them
> >after syscall. It can be updated if there is need for that.
> 
> Nobody doubts that it is possible.  But it is
> 
> a) potentially much expensive
> 
> and
> 
> b) an alien concept
> 
> to have the signal mask to set during the wait call implicitly. 
> Conceptually it doesn't even make sense.  This is no event to wait for. 
>  It a parameter for the specific wait call, just like the timeout.  And 
> I fortunately haven't seen you proposing to pass the timeout value 
> implicitly.

Because the timeout has its meaning for syscall processing, but signals
are completely separate objects. Why do you want to allow queueing
signals _and_ add a 'temporary' signal mask for the syscall? Just use one
way: queue them all.
 
> >>Not good enough?  It does exactly what it is supposed to do.  What can 
> >>there be "not good enough"?
> >
> >Not to move signals into special case of events. If poll() can not work
> >with them it does not mean, that they need to be specified as additional
> >syscall parameter, instead change poll() to work with them, which can be
> >easily done with kevents.
> 
> You still seem to be completely missing the point.  The signal mask is 
> no event to wait for.  It has nothing to do with this that ppoll() takes 
> the signal mask as a parameter.  The signal mask is a parameter for the 
> wait call just like the timeout, not more and not less.

That's where we have different opinions (among other places :) - I do
not agree that signals are parameters for the syscall; I insist that they
are usual events. ppoll() shows us that there is no difference when a
signal is reported as a usual event source: the syscall returns and we
can check whether something changed (a signal was delivered or even
fired), which does not differ from the case when the syscall returns and
we check which event it reports first - a ready signal or some other event.
 
> >Do not mix warm and soft - waiting for some period is not equal to
> >syscall timeout. Waiting is possible with timer kevent user (although
> >only relative timeout, can be changed to support both, not a big
> >problem).
> 
> That's what I'm saying all the time.  Of course it can be supported. 
> But for this the timeout parameter must be a timespec pointer.  Whatever 
> you could possibly mean by "do not mix warm and soft" I cannot possibly 
> imagine.  Fact is that both relative and absolute timeouts are useful. 
> And that for absolute timeouts the change of the clock has to be taken 
> into account.

They are useful for special waiting, but not for waiting when the syscall
is called. The former is supported by timer notifications, the latter
by the syscall parameter. We can add support for absolute timer
notifications as an add-on to the relative ones. But using the timeval
structure there is not acceptable, since it has different sizes on
different arches, so there will be problems with 32/64 arches like x86_64.
Instead it is possible to use a u32/u32 structure for sec/nsec, like what
is used for relative timeouts.
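The size argument can be illustrated with a small sketch. The struct and
function names below are hypothetical, not taken from the kevent patches:
a pair of fixed-width fields has the same layout for 32-bit and 64-bit
userspace, whereas a structure built from long-sized fields does not.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical fixed-layout timeout: two 32-bit fields, 8 bytes everywhere. */
struct fixed_timeout {
    uint32_t sec;
    uint32_t nsec;
};

/* Fails to compile if the layout ever depends on the ABI. */
static_assert(sizeof(struct fixed_timeout) == 8,
              "timeout layout must not depend on the ABI");

/* Convert the fixed-layout timeout to nanoseconds. */
static uint64_t fixed_timeout_to_ns(struct fixed_timeout t)
{
    return (uint64_t)t.sec * 1000000000ULL + t.nsec;
}
```

The trade-off is range: 32 bits of seconds is ample for a relative
timeout, but it is a narrower choice for absolute times.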
 
> >I'm quite sure that absolute timeouts are very usefull, but not as in
> >the case of waiting for syscall completeness. In any way, kevent can be
> >extended to support absolute timeouts in it's timer notifications.
> 
> That's not the same.  If you argue that then the syscall should have no 
> timeout parameter at all.  Fact is that setting up a timer is not for 
> free.  Since the timeout is used all the time having a timeout parameter 
> is the right answer.  And if you do this then do it right just like 
> every other syscall other than poll: use a timespec object.  This gives 
> flexibility without measurable cost.

It does not introduce any flexibility, since the syscall does not have a
parameter to specify whether an absolute or a relative timeout has been
provided. That's one.
I do argue that the syscall must have a timeout parameter, since it is
related to syscall behaviour and not to the events the syscall is working
with - these are completely different things: the syscall must be
interrupted after some time to allow failing the operation or performing
other tasks, while a timer event can fire at any time in the future; the
syscall should not care about the underlying events. That's two.
You say "every other syscall other than poll" - but even aio_suspend()
and friends use relative timeouts (although glibc converts them into
absolute ones to be used with pthread_cond_timedwait), so why do you
propose to use a variable-sized structure (even if it is transferred
almost for free in the syscall) instead of a usual timeout specified in
seconds/nanoseconds/anything? That's three.

> -- 
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, 
> CA ❖

-- 
Evgeniy Polyakov

Re: [PATCH 1/2] [PCI] Check that MWI bit really did get set

2006-10-16 Thread Alan Cox
Ar Sul, 2006-10-15 am 16:44 -0700, ysgrifennodd Andrew Morton:
> Let me restore the words from my earlier email which you removed so that
> you could say that:
> 
>   For you the driver author to make assumptions about what's happening
>   inside pci_set_mwi() is a layering violation.  Maybe the bridge got
>   hot-unplugged.  Maybe the attempt to set MWI caused some synchronous PCI
>   error.  For example, take a look at the various implementations of
>   pci_ops.read() around the place - various of them can fail for various
>   reasons.  

Let me repeat what I said before. As a driver author I do not care. It
doesn't matter if it failed because it is not supported or because a
pink elephant went for a dance on the PCI bus.

>   Now it could be that an appropriate solution is to make pci_set_mwi()
>   return only 0 or 1, and to generate a warning from within pci_set_mwi()
>   if some unexpected error happens.  In which case it is legitimate for
>   callers to not check for errors.

That would be my belief, and ditto for a lot of these other functions -
even the correctly __must_check ones like pci_set_master should do the
error reporting in the set_master() function etc not in every driver.
That gives us a single consistent printk and avoids missing them out or
bloat.

Alan



Re: [PATCH 1/2] [PCI] Check that MWI bit really did get set

2006-10-16 Thread Alan Cox
Ar Sul, 2006-10-15 am 17:16 -0700, ysgrifennodd David Brownell:
> Signed-off-by: David Brownell <[EMAIL PROTECTED]>

Acked-by: Alan Cox <[EMAIL PROTECTED]>
> 
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -499,7 +499,7 @@ int __must_check pci_enable_device_bars(
>  void pci_disable_device(struct pci_dev *dev);
>  void pci_set_master(struct pci_dev *dev);
>  #define HAVE_PCI_SET_MWI
> -int __must_check pci_set_mwi(struct pci_dev *dev);
> +int pci_set_mwi(struct pci_dev *dev);
>  void pci_clear_mwi(struct pci_dev *dev);
>  void pci_intx(struct pci_dev *dev, int enable);
>  int pci_set_dma_mask(struct pci_dev *dev, u64 mask);


Re: [PATCH 1/2] [PCI] Check that MWI bit really did get set

2006-10-16 Thread Alan Cox
Ar Sul, 2006-10-15 am 18:10 -0700, ysgrifennodd Andrew Morton:
> Question is, should pci_set_mwi() ever return -EFOO?  I guess it should, in
> the case where setting the line size didn't work out.

It does no harm, no driver will ever check anything but 0 v !0 because
the handling is no different in either case.



Re: Hardware bug or kernel bug?

2006-10-16 Thread Jarek Poplawski
On Fri, Oct 13, 2006 at 05:24:39PM +0100, David Johnson wrote:
> On Friday 13 October 2006 14:06, Jarek Poplawski wrote:
> >
> > Probably - but only with networking. So I'd try with this debugging
> > like in my first reply plus maybe 2.6.19-rc1 (e1000 - btw. I hope
> > this other tested card was different model - and locking improved)
> > and resend conclusions to [EMAIL PROTECTED]
> >
> 
> OK I built a 2.6.19-rc1 kernel with a minimal config as you describe and I 
> cannot reproduce the reboots with this kernel. My .config:
> http://www.david-web.co.uk/download/config

I've seen more minimal "minimal" configs, but if it works
it is 50% of success.

> The other NIC I tried was a D-Link DL10050-based card which I think uses the 
> dl2k module.
> 
> I tried to reproduce the problem under Windows (2k), which didn't reboot but 
> did still suffer from it I believe. Randomly during an scp transfer (using 
> the PuTTY scp client) Windows will lock-up for about 30 seconds, making an 
> entry in the event log indicating that there was a time-out talking to the 
> IDE controller, then continuing. Could the same thing be happening in Linux? 
> If Linux can't talk to the IDE controller when trying to write to disk, how 
> does it handle that?

Was this lock-up effect visible during the above 2.6.19-rc1 tests?
If not, I'd try to continue Linux debugging:
- is 2.6.19-rc1 working with the "normal" config (use make oldconfig
to "upgrade" .config),
- is 2.6.17 working with the "minimal" config (use make oldconfig),
- changing one or two options at a time, try to find which one makes
the effect return (acpi, smp...).

Regards,
Jarek P.

PS: Sorry for late reply - I was offline.


Re: [take19 1/4] kevent: Core files.

2006-10-16 Thread Ulrich Drepper

Evgeniy Polyakov wrote:
> The whole idea of mmap buffer seems to be broken, since those who asked
> for creation do not like existing design and do not show theirs...

What kind of argumentation is that?

   "Because my attempt to implement it doesn't work and nobody right
away has a better suggestion this means the idea is broken."

Nonsense.

It just means that time should be spent on thinking about this.  You cut
all this short by rushing out your attempt without any discussions.
Unfortunately nobody else really looked at the approach so it lingered
around for some weeks.  Well, now it is clear that it is not the right
approach and we can start thinking about it again.

> You seems to not checked the code - each event can be marked as ready
> only one time, which means only one copy and so on.
> It was done _specially_. And it is not limitation, but "new" approach.

I know that it is done deliberately and I tell you that this is wrong
and unacceptable.  Realtime signals are one event which need to have
more than one event queued.  This is no description of what you have
implemented, it's a description of the reality of realtime signals.

RT signals are queued.  They carry a data value (the sigval_t object)
which can be unique for each signal delivery.  Coalescing the signal
events therefore leads to information loss.

Therefore, at the very least for signal we need to have the ability to
queue more than one event for each event source.  Not having this
functionality means that signals and likely other types of events cannot
be implemented using kevent queues.

> Queue of the same signals or any other events has fundamental flawness
> (as any other ring buffer implementation, which has queue size)  -
> it's size of the queue and extremely bad case of the overflow.

Of course there are additional problems.  Overflows need to be handled.
But this is nothing which is unsolvable.

> So, the same event may not be ready several times. Any design which
> allows to create infinite number of events generated for the same case
> is broken, since consumer can be in situation, when it can not handle
> that flow.

That's complete nonsense.  Again, for RT signals it is very reasonable
and not "broken" to have multiple outstanding signals.

> That is why poll() returns only POLLIN when data is ready in
> network stack, but is not trying to generate some kind of a signal for
> each byte/packet/MTU/MSS received.

It makes no sense to drag poll() into this discussion.  poll() is a very
limited interface.  The new event handling is supposed to be the
opposite, namely, usable for all kinds of events.  Arguing that because
poll() does it like this just means you don't see what big step is
needed to get to the goal of a unified event handling.  The shackles of
poll() must be left behind.

> RT signals have design problems, and I will not repeate the same error
> with similar limits in kevent.

I don't know what to say.  You claim to be the source of all wisdom in
OS design.  Maybe you should design your own OS, from ground up.  I
wonder how many people would like that since all your arguments are
squarely geared towards optimizing the implementation.  But: the
implementation is irrelevant without users.  The functionality users (=
programmers) want and need is what must drive the implementation.  And
RT signals are definitely heavily used and liked by programmers.  You
have to accept that you try to modify an OS which has that functionality
regardless of how much you hate it and want to fight it.

> Mmap implementation can be added separately, since it does not affect
> kevent core.

That I doubt very much and it is why I would not want the kevent stuff
go into any released kernel until that "detail" is resolved.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: [take19 0/4] kevent: Generic event handling mechanism.

2006-10-16 Thread Ulrich Drepper

Evgeniy Polyakov wrote:
> One can set number of events before the syscall and do not remove them
> after syscall. It can be updated if there is need for that.

Nobody doubts that it is possible.  But it is

a) potentially much more expensive

and

b) an alien concept

to have the signal mask to set during the wait call implicitly.
Conceptually it doesn't even make sense.  This is no event to wait for.
It is a parameter for the specific wait call, just like the timeout.  And
I fortunately haven't seen you proposing to pass the timeout value
implicitly.

>> Not good enough?  It does exactly what it is supposed to do.  What can
>> there be "not good enough"?
>
> Not to move signals into special case of events. If poll() can not work
> with them it does not mean, that they need to be specified as additional
> syscall parameter, instead change poll() to work with them, which can be
> easily done with kevents.

You still seem to be completely missing the point.  The signal mask is
no event to wait for.  It has nothing to do with this that ppoll() takes
the signal mask as a parameter.  The signal mask is a parameter for the
wait call just like the timeout, not more and not less.

> Do not mix warm and soft - waiting for some period is not equal to
> syscall timeout. Waiting is possible with timer kevent user (although
> only relative timeout, can be changed to support both, not a big
> problem).

That's what I'm saying all the time.  Of course it can be supported.
But for this the timeout parameter must be a timespec pointer.  Whatever
you could possibly mean by "do not mix warm and soft" I cannot possibly
imagine.  Fact is that both relative and absolute timeouts are useful.
And that for absolute timeouts the change of the clock has to be taken
into account.

> I'm quite sure that absolute timeouts are very usefull, but not as in
> the case of waiting for syscall completeness. In any way, kevent can be
> extended to support absolute timeouts in it's timer notifications.

That's not the same.  If you argue that then the syscall should have no
timeout parameter at all.  Fact is that setting up a timer is not for
free.  Since the timeout is used all the time having a timeout parameter
is the right answer.  And if you do this then do it right just like
every other syscall other than poll: use a timespec object.  This gives
flexibility without measurable cost.


--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


[PATCH] NET : Suspicious locking in reqsk_queue_hash_req()

2006-10-16 Thread Eric Dumazet
(Sorry, patch inlined this time)

Hi David

While browsing include/net/request_sock.h I found this suspicious locking
protecting the SYN table hash table. I think this patch is necessary.

Thank you

Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>
--- linux-2.6.18/include/net/request_sock.h.orig	2006-10-16 10:53:11.0 +0200
+++ linux-2.6.18-ed/include/net/request_sock.h	2006-10-16 10:53:24.0 +0200
@@ -251,9 +251,9 @@
 	req->expires = jiffies + timeout;
 	req->retrans = 0;
 	req->sk = NULL;
-	req->dl_next = lopt->syn_table[hash];
 
 	write_lock(&queue->syn_wait_lock);
+	req->dl_next = lopt->syn_table[hash];
 	lopt->syn_table[hash] = req;
 	write_unlock(&queue->syn_wait_lock);
 }
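The pattern the patch moves to can be sketched in userspace as follows.
This is only an illustration under stated assumptions: the struct and
function names are hypothetical, and a pthread mutex stands in for the
kernel's syn_wait_lock. If two writers could run concurrently, reading
the chain head before taking the lock could link both new entries to the
same old head and lose one of them. Whether that can actually happen in
the kernel depends on what other locks the callers hold, which the
excerpt above does not show.

```c
#include <pthread.h>
#include <stddef.h>

struct req {
    int id;
    struct req *dl_next;
};

struct table {
    pthread_mutex_t lock;   /* stand-in for syn_wait_lock */
    struct req *head;       /* one hash chain */
};

/* Insert at the head of the chain; head is read and written under the lock. */
static void hash_req(struct table *t, struct req *r)
{
    pthread_mutex_lock(&t->lock);
    r->dl_next = t->head;
    t->head = r;
    pthread_mutex_unlock(&t->lock);
}

/* Count the entries on the chain, also under the lock. */
static int chain_length(struct table *t)
{
    int n = 0;

    pthread_mutex_lock(&t->lock);
    for (struct req *r = t->head; r != NULL; r = r->dl_next)
        n++;
    pthread_mutex_unlock(&t->lock);
    return n;
}
```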



Re: [PATCH 9/14] [TIPC] Name publication events now delivered in chronological order

2006-10-16 Thread Per Liden
I'm fairly sure this is a problem on your side. I received patch 10/14 
from the netdev list and the two list archives I checked also had it.

/Per

On Fri, 13 Oct 2006, Bill Fink wrote:

> FYI,
> 
> At least here, I received two copies of patch 9/14 and no copy
> of patch 10/14.
> 
>   -Bill
> 
> 
> 
> On Fri, 13 Oct 2006 13:37:50 +0200, Per Liden wrote:
> 
> > From: Allan Stephens <[EMAIL PROTECTED]>
> > 
> > This patch trivially re-orders the entries in TIPC's list of local
> > publications so that applications will receive publication events
> > in the order they were published.
> > 
> > Signed-off-by: Allan Stephens <[EMAIL PROTECTED]>
> > Signed-off-by: Per Liden <[EMAIL PROTECTED]>
> > ---
> >  net/tipc/name_distr.c |2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c
> > index f0b063b..03bd659 100644
> > --- a/net/tipc/name_distr.c
> > +++ b/net/tipc/name_distr.c
> > @@ -122,7 +122,7 @@ void tipc_named_publish(struct publicati
> >  	struct sk_buff *buf;
> >  	struct distr_item *item;
> >  
> > -	list_add(&publ->local_list, &publ_root);
> > +	list_add_tail(&publ->local_list, &publ_root);
> >  	publ_cnt++;
> >  
> >  	buf = named_prepare_buf(PUBLICATION, ITEM_SIZE, 0);
> > -- 
> > 1.4.1
> 
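Why the one-word change in the quoted patch alters delivery order can be
shown with a small userspace re-implementation of the two list helpers.
This is a sketch that mirrors the semantics of the kernel's list_add()
and list_add_tail(), not the kernel code itself; struct publ and
first_seq() are hypothetical names for the illustration.

```c
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
    h->next = h->prev = h;
}

/* Link n between two known-adjacent nodes. */
static void insert_between(struct list_head *n,
                           struct list_head *prev, struct list_head *next)
{
    next->prev = n;
    n->next = next;
    n->prev = prev;
    prev->next = n;
}

/* Insert right after the head: newest entry is visited first (stack order). */
static void list_add(struct list_head *n, struct list_head *head)
{
    insert_between(n, head, head->next);
}

/* Insert right before the head: oldest entry is visited first (queue order). */
static void list_add_tail(struct list_head *n, struct list_head *head)
{
    insert_between(n, head->prev, head);
}

struct publ {
    int seq;                        /* order of publication */
    struct list_head local_list;
};

/* Sequence number of the first entry reached when walking from head. */
static int first_seq(struct list_head *head)
{
    struct publ *p = (struct publ *)
        ((char *)head->next - offsetof(struct publ, local_list));
    return p->seq;
}
```

With list_add_tail(), a walk from the head meets publications in the
order they were added, which matches the chronological delivery the
patch description asks for.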


Re: [take19 1/4] kevent: Core files.

2006-10-16 Thread Evgeniy Polyakov
On Sun, Oct 15, 2006 at 04:22:45PM -0700, Ulrich Drepper ([EMAIL PROTECTED]) 
wrote:
> Evgeniy Polyakov wrote:
> >Existing design does not allow overflow.
> 
> And I've pointed out a number of times that this is not practical at 
> best.  There are event sources which can create events which cannot be 
> coalesced into one single event as it would be required with your design.
> 
> Signals are one example, specifically realtime signals.  If we do not 
> want the design to be limited from the start this approach has to be 
> thought over.

The whole idea of the mmap buffer seems to be broken, since those who asked
for its creation do not like the existing design and do not show theirs...

Regarding signals and the possibility of overflow in the existing ring
buffer implementation:
You seem not to have checked the code - each event can be marked as ready
only one time, which means only one copy and so on.
It was done _deliberately_. And it is not a limitation, but a "new" approach.
A queue of the same signals or any other events has a fundamental flaw
(like any other ring buffer implementation which has a queue size):
the size of the queue and the extremely bad case of overflow.
So, the same event may not be ready several times. Any design which
allows an infinite number of events to be generated for the same case
is broken, since the consumer can end up in a situation where it cannot
handle that flow. That is why poll() returns only POLLIN when data is
ready in the network stack, and does not try to generate some kind of
signal for each byte/packet/MTU/MSS received.
RT signals have design problems, and I will not repeat the same error
with similar limits in kevent.
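The "ready only once" rule described above can be sketched as follows.
This is a minimal model with hypothetical names (struct event,
struct ready_queue, mark_ready, consume), not the kevent code: because an
event that is already marked ready is never queued a second time, the
queue can never hold more entries than there are registered events, so
sizing it to that number rules out overflow. The cost, which is the other
side of the argument in this thread, is that a second notification for
the same event is coalesced and any per-notification payload is lost.

```c
#include <stdbool.h>
#include <stddef.h>

#define NEVENTS 4   /* assumed maximum number of registered events */

struct event {
    int id;
    bool ready;     /* set while the event sits in the ready queue */
};

struct ready_queue {
    struct event *slot[NEVENTS];
    size_t count;
};

/* Returns true if the event was queued, false if it was coalesced. */
static bool mark_ready(struct ready_queue *q, struct event *e)
{
    if (e->ready)
        return false;           /* already queued: at most one copy */
    e->ready = true;
    q->slot[q->count++] = e;    /* count never exceeds NEVENTS */
    return true;
}

/* Remove and return the oldest ready event, or NULL if none. */
static struct event *consume(struct ready_queue *q)
{
    struct event *e;

    if (q->count == 0)
        return NULL;
    e = q->slot[0];
    for (size_t i = 1; i < q->count; i++)
        q->slot[i - 1] = q->slot[i];
    q->count--;
    e->ready = false;           /* may be marked ready again from now on */
    return e;
}
```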

> >>So zap mmap() support completely, since it is not usable at all. We wont 
> >>discuss on it.
> >
> >Initial implementation did not have it.
> >But I was requested to do it, and it is ready now.
> >No one likes it, but no one provides an alternative implementation.
> >We are stuck.
> 
> We need the mapped ring buffer.  The current design (before it was 
> removed) was broken but this does not mean it shouldn't be implemented. 
>  We just need more time to figure out how to implement it correctly.

In the latest patchset it was removed. I'm waiting for your code.

Mmap implementation can be added separately, since it does not affect
kevent core.

> -- 
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, 
> CA ❖

-- 
Evgeniy Polyakov


Re: [take19 0/4] kevent: Generic event handling mechanism.

2006-10-16 Thread Evgeniy Polyakov
On Sun, Oct 15, 2006 at 03:43:39PM -0700, Ulrich Drepper ([EMAIL PROTECTED]) 
wrote:
> Evgeniy Polyakov wrote:
> >In context you have cut, one updated signal mask between calls to event
> >delivery mechanism (using for example signal()), so it has exactly the
> >same price.
> 
> No, it does not.  If the signal mask is recomputed by the program for 
> each new wait call then you have a lot more work to do when the signal 
> mask is implicitly specified.

One can set up a number of events before the syscall and not remove them
after the syscall. They can be updated if there is a need for that.
 
> >I created it just because I think that POSIX workaround to add signals
> >into the syscall parameters is not good enough.
> 
> Not good enough?  It does exactly what it is supposed to do.  What can 
> there be "not good enough"?

It is better not to turn signals into a special case of events. If
poll() cannot work with them, it does not mean that they need to be
specified as an additional syscall parameter; instead, change poll() to
work with them, which can be easily done with kevents.
 
> >You again cut my explanation on why just pure timeout is used.
> >We start a syscall, which can block forever, so we want to limit it's
> >time, and we add special parameter to show how long this syscall should
> >run. Timeout is not about how long we should sleep (which indeed can be
> >absolute), but how long syscall should run - which is related to the 
> >time syscall started.
> 
> I know very well what a timeout is.  But the way the timeout can be 
> specified can vary.  It is often useful (as for select, poll) to specify 
> relative timeouts.
> 
> But there are equally useful uses where the timeout is needed at a 
> specific point in time.  Without a syscall interface which can have a 
> absolute timeout parameter we'd have to write as a poor approximation at 
> userlevel
> 
> clock_gettime (CLOCK_REALTIME, &ts);
> struct timespec rel;
> rel.tv_sec = abstmo.tv_sec - ts.tv_sec;
> rel.tv_nsec = abstmo.tv_nsec - ts.tv_nsec;
> if (rel.tv_nsec < 0) {
>   rel.tv_nsec += 1000000000;
>   --rel.tv_sec;
> }
> if (rel.tv_sec < 0)
>   inttmo = -1;  // or whatever is used for return immediately
> else
>   inttmo = rel.tv_sec * UINT64_C(1000000000) + rel.tv_nsec;
> 
>  wait(..., inttmo, ...)

Do not mix warm and soft - waiting for some period is not equal to a
syscall timeout. Waiting is possible with the timer kevent user (although
only with a relative timeout; it can be changed to support both, not a
big problem).

> Not only is this much more expensive to do at userlevel, it is also 
> inadequate because calls to settimeofday() do  not cause a recomputation 
> of the timeout.
> 
> See Ingo's RT futex stuff as an example for a kernel interface which 
> does it right.

I'm quite sure that absolute timeouts are very useful, but not in
the case of waiting for syscall completion. In any case, kevent can be
extended to support absolute timeouts in its timer notifications.
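The userlevel conversion quoted earlier in this message can be written as
a self-contained helper. This is a sketch: abs_to_rel_ns() is a
hypothetical name, and the current time is passed in rather than read
with clock_gettime() so the arithmetic can be checked in isolation. It
does not address the objection raised in the thread: if the clock is
stepped after the relative value has been computed, the wait is not
recomputed.

```c
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/*
 * Time remaining from now until deadline, in nanoseconds.
 * Returns -1 if the deadline has already passed.
 */
static int64_t abs_to_rel_ns(const struct timespec *deadline,
                             const struct timespec *now)
{
    struct timespec rel;

    rel.tv_sec = deadline->tv_sec - now->tv_sec;
    rel.tv_nsec = deadline->tv_nsec - now->tv_nsec;
    if (rel.tv_nsec < 0) {
        /* Borrow one second to keep tv_nsec in [0, 1e9). */
        rel.tv_nsec += NSEC_PER_SEC;
        --rel.tv_sec;
    }
    if (rel.tv_sec < 0)
        return -1;
    return (int64_t)rel.tv_sec * NSEC_PER_SEC + rel.tv_nsec;
}
```

In real use, now would come from clock_gettime(CLOCK_REALTIME, ...)
immediately before the wait call.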

> -- 
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, 
> CA ❖

-- 
Evgeniy Polyakov


Re: Patch to remove masq/NAT in description of IP6_NF_IPTABLES in ipv6/netfilter/Kconfig

2006-10-16 Thread Patrick McHardy
Peter Bieringer wrote:
> afaik, NAT (and therefore masquerading also) is left out by design in
> IPv6, looks like a copy&paste issue.
> 
> Patch attached to fix this.

Applied, thanks. But please sign off future patches.


Re: Suppress / delay SYN-ACK

2006-10-16 Thread Lennert Buytenhek
On Thu, Oct 12, 2006 at 10:08:53AM +0200, Martin Schiller wrote:

> I'm searching for a solution to suppress / delay the SYN-ACK packet of a
> listening server (-application) until he has decided (e.g. analysed the
> requesting ip-address or checked if the corresponding other end of a
> connection is available) if he wants to accept the connect request of the
> client. If not, it should be possible to reject the connect request.

I wrote something like this a couple of years ago:

http://marc.theaimsgroup.com/?l=linux-netdev&m=103666165629419&w=2
http://marc.theaimsgroup.com/?l=linux-netdev&m=106089519611631&w=2

There wasn't a whole lot of external interest, and my need for it
disappeared, so I never really finished it, and there are a couple of
unfixed bugs.


cheers,
Lennert


Patch to remove masq/NAT in description of IP6_NF_IPTABLES in ipv6/netfilter/Kconfig

2006-10-16 Thread Peter Bieringer
Hi,

afaik, NAT (and therefore masquerading also) is left out by design in
IPv6, looks like a copy&paste issue.

Patch attached to fix this.

Peter
-- 
Dr. Peter Bieringer http://www.bieringer.de/pb/
GPG/PGP Key 0x958F422D   mailto:[EMAIL PROTECTED]
Deep Space 6 Co-Founder and Core Member  http://www.deepspace6.net/
--- linux-2.6.18.1/net/ipv6/netfilter/Kconfig.orig	2006-10-16 08:56:43.0 +0200
+++ linux-2.6.18.1/net/ipv6/netfilter/Kconfig	2006-10-16 08:56:55.0 +0200
@@ -40,7 +40,7 @@
 	  To compile it as a module, choose M here.  If unsure, say N.
 
 config IP6_NF_IPTABLES
-	tristate "IP6 tables support (required for filtering/masq/NAT)"
+	tristate "IP6 tables support (required for filtering)"
 	depends on NETFILTER_XTABLES
 	help
 	  ip6tables is a general, extensible packet identification framework.