Re: [RFC PATCH 0/7] Share events between metrics

2020-12-15 Thread Andi Kleen
> Or, is the concern more about trying to time-slice the results in a 
> fairly granular way and expecting accurate results then?

Usually the later. It's especially important for divisions. You want
both divisor and dividend to be in the same time slice, otherwise
the result usually doesn't make a lot of sense.

-Andi


Re: perf measure for stalled cycles per instruction on newer Intel processors

2020-10-18 Thread Andi Kleen
> > Don't use it. It's misleading on a out-of-order CPU because you don't
> > know if it's actually limiting anything.
> >
> > If you want useful bottleneck data use --topdown.
> 
> So running again, this time with the below params, I got this output
> where all the right most column is colored red. I wonder what can be
> said on the amount/ratio of stalls for this app - if you can maybe recommend
> some posts of yours to better understand that, I saw some comment in the
> perf-stat man page and some lwn article but wasn't really able to figure it 
> out.

TopDown determines what limits the execution the most.

The application is mostly backend bound (55-72%). This can be either memory
issues (more common), or sometimes also execution issues. Standard perf
doesn't support a further break down beyond these high level categories,
but there are alternative tools that do (e.g. mine is "toplev" in
https://github.com/andikleen/pmu-tools or VTune)

Some references on TopDown:
https://github.com/andikleen/pmu-tools/wiki/toplev-manual
http://bit.ly/tma-ispass14

The tools above would also allow you to sample where the stalls
are occuring.

-Andi



Re: [RESEND] test_verifier #13 fails on arm64: "retval 65507 != -29"

2019-07-10 Thread Andi Kleen
> > Reading csum_partial/csum_fold, seems like after calculation of
> > checksum (so-called unfolded checksum), it is supposed to be passed
> > into csum_fold() to convert it into 16-bit one and invert.

Yes, you always need to fold at the end.

The low level code does fold sometimes, but not always.

-Andi


[PATCH] ipv4 ping: Fix __init* attributes

2019-03-29 Thread Andi Kleen
From: Andi Kleen 

ping_v4_net_ops references init functions, so needs to be __initdata.
ping_proc_exit is then referenced from __initdata, so also needs
to be __init.

Signed-off-by: Andi Kleen 
---
 net/ipv4/ping.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 7ccb5f87f70b..070e078612a0 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -1160,7 +1160,7 @@ static void __net_exit ping_v4_proc_exit_net(struct net 
*net)
remove_proc_entry("icmp", net->proc_net);
 }
 
-static struct pernet_operations ping_v4_net_ops = {
+static __initdata struct pernet_operations ping_v4_net_ops = {
.init = ping_v4_proc_init_net,
.exit = ping_v4_proc_exit_net,
 };
@@ -1170,7 +1170,7 @@ int __init ping_proc_init(void)
return register_pernet_subsys(&ping_v4_net_ops);
 }
 
-void ping_proc_exit(void)
+void __init ping_proc_exit(void)
 {
unregister_pernet_subsys(&ping_v4_net_ops);
 }
-- 
2.20.1



Re: perf measure for stalled cycles per instruction on newer Intel processors

2020-10-15 Thread Andi Kleen
On Thu, Oct 15, 2020 at 05:53:40PM +0300, Or Gerlitz wrote:
> Hi,
> 
> Earlier Intel processors (e.g E5-2650) support the more of classical
> two stall events (for backend and frontend [1]) and then perf shows
> the nice measure of stalled cycles per instruction - e.g here where we
> have IPC of 0.91 and CSPI (see [2]) of 0.68:

Don't use it. It's misleading on a out-of-order CPU because you don't
know if it's actually limiting anything.

If you want useful bottleneck data use --topdown.

-Andi


Re: [RFC PATCH 0/7] Share events between metrics

2020-05-07 Thread Andi Kleen
On Thu, May 07, 2020 at 01:14:29AM -0700, Ian Rogers wrote:
> Metric groups contain metrics. Metrics create groups of events to
> ideally be scheduled together. Often metrics refer to the same events,
> for example, a cache hit and cache miss rate. Using separate event
> groups means these metrics are multiplexed at different times and the
> counts don't sum to 100%. More multiplexing also decreases the
> accuracy of the measurement.
> 
> This change orders metrics from groups or the command line, so that
> the ones with the most events are set up first. Later metrics see if
> groups already provide their events, and reuse them if
> possible. Unnecessary events and groups are eliminated.

Note this actually may make multiplexing errors worse.

For metrics it is often important that all the input values to
the metric run at the same time. 

So e.g. if you have two metrics and they each fit into a group,
but not together, even though you have more multiplexing it
will give more accurate results for each metric.

I think you change can make sense for metrics that don't fit 
into single groups anyways. perf currently doesn't quite know
this but some heuristic could be added. 

But I wouldn't do it for simple metrics that fit into groups.
The result may well be worse.

My toplev tool has some heuristics for this, also some more
sophisticated ones that tracks subexpressions. That would
be far too complicated for perf likely.

-Andi


Re: [RFC PATCH 0/7] Share events between metrics

2020-05-07 Thread Andi Kleen
> > - without this change events within a metric may get scheduled
> >   together, after they may appear as part of a larger group and be
> >   multiplexed at different times, lowering accuracy - however, less
> >   multiplexing may compensate for this.
> 
> I agree the heuristic in this patch set is naive and would welcome to
> improve it from your toplev experience. I think this change is
> progress on TopDownL1 - would you agree?

TopdownL1 in non SMT mode should always fit. Inside a group
deduping always makes sense. 

The problem is SMT mode where it doesn't fit. toplev tries
to group each node and each level together.

> 
> I'm wondering if what is needed are flags to control behavior. For
> example, avoiding the use of groups altogether. For TopDownL1 I see.

Yes the current situation isn't great.

For Topdown your patch clearly is an improvement, I'm not sure
it's for everything though.

Probably the advanced heuristics are only useful for a few
formulas, most are very simple. So maybe it's ok. I guess
would need some testing over the existing formulas.


-Andi


Re: [RFC PATCH v3 00/14] Share events between metrics

2020-05-08 Thread Andi Kleen
On Thu, May 07, 2020 at 10:36:15PM -0700, Ian Rogers wrote:
> Metric groups contain metrics. Metrics create groups of events to
> ideally be scheduled together. Often metrics refer to the same events,
> for example, a cache hit and cache miss rate. Using separate event
> groups means these metrics are multiplexed at different times and the
> counts don't sum to 100%. More multiplexing also decreases the
> accuracy of the measurement.
> 
> This change orders metrics from groups or the command line, so that
> the ones with the most events are set up first. Later metrics see if
> groups already provide their events, and reuse them if
> possible. Unnecessary events and groups are eliminated.

Yes some improvements here are great.

> 
> The option --metric-no-group is added so that metrics aren't placed in
> groups. This affects multiplexing and may increase sharing.
> 
> The option --metric-mo-merge is added and with this option the
> existing grouping behavior is preserved.

Could we also make this a per metric option, like

-M foo:nomerge,... 

or somesuch? Okay i suppose this could be a followon.

Ultimatively like you said we probably want to configure
defaults in the event file.

-Andi


Re: [RFC PATCH v3 01/14] perf parse-events: expand add PMU error/verbose messages

2020-05-08 Thread Andi Kleen


This seems like a independent bug fix and should probably
be pushed independently of the rest.

Perhaps add a Fixes: tag.

Reviewed-by: Andi Kleen 

On Thu, May 07, 2020 at 10:36:16PM -0700, Ian Rogers wrote:
> On a CPU like skylakex an uncore_iio_0 PMU may alias with
> uncore_iio_free_running_0. The latter PMU doesn't support fc_mask
> as a parameter and so pmu_config_term fails. Typically
> parse_events_add_pmu is called in a loop where if one alias succeeds
> errors are ignored, however, if multiple errors occur
> parse_events__handle_error will currently give a WARN_ONCE.
> 
> This change removes the WARN_ONCE in parse_events__handle_error and
> makes it a pr_debug. It adds verbose messages to parse_events_add_pmu
> warning that non-fatal errors may occur, while giving details on the pmu
> and config terms for useful context. pmu_config_term is altered so the
> failing term and pmu are present in the case of the 'unknown term'
> error which makes spotting the free_running case more straightforward.
> 
> Before:
> $ perf --debug verbose=3 stat -M llc_misses.pcie_read sleep 1
> Using CPUID GenuineIntel-6-55-4
> metric expr unc_iio_data_req_of_cpu.mem_read.part0 + 
> unc_iio_data_req_of_cpu.mem_read.part1 + 
> unc_iio_data_req_of_cpu.mem_read.part2 + 
> unc_iio_data_req_of_cpu.mem_read.part3 for LLC_MISSES.PCIE_READ
> found event unc_iio_data_req_of_cpu.mem_read.part0
> found event unc_iio_data_req_of_cpu.mem_read.part1
> found event unc_iio_data_req_of_cpu.mem_read.part2
> found event unc_iio_data_req_of_cpu.mem_read.part3
> metric expr unc_iio_data_req_of_cpu.mem_read.part0 + 
> unc_iio_data_req_of_cpu.mem_read.part1 + 
> unc_iio_data_req_of_cpu.mem_read.part2 + 
> unc_iio_data_req_of_cpu.mem_read.part3 for LLC_MISSES.PCIE_READ
> found event unc_iio_data_req_of_cpu.mem_read.part0
> found event unc_iio_data_req_of_cpu.mem_read.part1
> found event unc_iio_data_req_of_cpu.mem_read.part2
> found event unc_iio_data_req_of_cpu.mem_read.part3
> adding 
> {unc_iio_data_req_of_cpu.mem_read.part0,unc_iio_data_req_of_cpu.mem_read.part1,unc_iio_data_req_of_cpu.mem_read.part2,unc_iio_data_req_of_cpu.mem_read.part3}:W,{unc_iio_data_req_of_cpu.mem_read.part0,unc_iio_data_req_of_cpu.mem_read.part1,unc_iio_data_req_of_cpu.mem_read.part2,unc_iio_data_req_of_cpu.mem_read.part3}:W
> intel_pt default config: tsc,mtc,mtc_period=3,psb_period=3,pt,branch
> WARNING: multiple event parsing errors
> ...
> Invalid event/parameter 'fc_mask'
> ...
> 
> After:
> $ perf --debug verbose=3 stat -M llc_misses.pcie_read sleep 1
> Using CPUID GenuineIntel-6-55-4
> metric expr unc_iio_data_req_of_cpu.mem_read.part0 + 
> unc_iio_data_req_of_cpu.mem_read.part1 + 
> unc_iio_data_req_of_cpu.mem_read.part2 + 
> unc_iio_data_req_of_cpu.mem_read.part3 for LLC_MISSES.PCIE_READ
> found event unc_iio_data_req_of_cpu.mem_read.part0
> found event unc_iio_data_req_of_cpu.mem_read.part1
> found event unc_iio_data_req_of_cpu.mem_read.part2
> found event unc_iio_data_req_of_cpu.mem_read.part3
> metric expr unc_iio_data_req_of_cpu.mem_read.part0 + 
> unc_iio_data_req_of_cpu.mem_read.part1 + 
> unc_iio_data_req_of_cpu.mem_read.part2 + 
> unc_iio_data_req_of_cpu.mem_read.part3 for LLC_MISSES.PCIE_READ
> found event unc_iio_data_req_of_cpu.mem_read.part0
> found event unc_iio_data_req_of_cpu.mem_read.part1
> found event unc_iio_data_req_of_cpu.mem_read.part2
> found event unc_iio_data_req_of_cpu.mem_read.part3
> adding 
> {unc_iio_data_req_of_cpu.mem_read.part0,unc_iio_data_req_of_cpu.mem_read.part1,unc_iio_data_req_of_cpu.mem_read.part2,unc_iio_data_req_of_cpu.mem_read.part3}:W,{unc_iio_data_req_of_cpu.mem_read.part0,unc_iio_data_req_of_cpu.mem_read.part1,unc_iio_data_req_of_cpu.mem_read.part2,unc_iio_data_req_of_cpu.mem_read.part3}:W
> intel_pt default config: tsc,mtc,mtc_period=3,psb_period=3,pt,branch
> Attempting to add event pmu 'uncore_iio_free_running_5' with 
> 'unc_iio_data_req_of_cpu.mem_read.part0,' that may result in non-fatal errors
> After aliases, add event pmu 'uncore_iio_free_running_5' with 
> 'fc_mask,ch_mask,umask,event,' that may result in non-fatal errors
> Attempting to add event pmu 'uncore_iio_free_running_3' with 
> 'unc_iio_data_req_of_cpu.mem_read.part0,' that may result in non-fatal errors
> After aliases, add event pmu 'uncore_iio_free_running_3' with 
> 'fc_mask,ch_mask,umask,event,' that may result in non-fatal errors
> Attempting to add event pmu 'uncore_iio_free_running_1' with 
> 'unc_iio_data_req_of_cpu.mem_read.part0,' that may result in non-fatal errors
> After aliases, add event pmu 'uncore_iio_free_running_1' with 
> 'fc_mask,ch_mask,umask,event,

Re: [RFC PATCH v3 12/14] perf metricgroup: order event groups by size

2020-05-08 Thread Andi Kleen
On Thu, May 07, 2020 at 10:36:27PM -0700, Ian Rogers wrote:
> When adding event groups to the group list, insert them in size order.
> This performs an insertion sort on the group list. By placing the
> largest groups at the front of the group list it is possible to see if a
> larger group contains the same events as a later group. This can make
> the later group redundant - it can reuse the events from the large group.
> A later patch will add this sharing.

I'm not sure if size is that great an heuristic. The dedup algorithm should
work in any case even if you don't order by size, right?

I suppose in theory some kind of topological sort would be better.

-Andi
> 
> Signed-off-by: Ian Rogers 
> ---
>  tools/perf/util/metricgroup.c | 16 +++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
> index 2a6456fa178b..69fbff47089f 100644
> --- a/tools/perf/util/metricgroup.c
> +++ b/tools/perf/util/metricgroup.c
> @@ -520,7 +520,21 @@ static int __metricgroup__add_metric(struct list_head 
> *group_list,
>   return -EINVAL;
>   }
>  
> - list_add_tail(&eg->nd, group_list);
> + if (list_empty(group_list))
> + list_add(&eg->nd, group_list);
> + else {
> + struct list_head *pos;
> +
> + /* Place the largest groups at the front. */
> + list_for_each_prev(pos, group_list) {
> + struct egroup *old = list_entry(pos, struct egroup, nd);
> +
> + if (hashmap__size(&eg->pctx.ids) <=
> + hashmap__size(&old->pctx.ids))
> + break;
> + }
> + list_add(&eg->nd, pos);
> + }
>  
>   return 0;
>  }
> -- 
> 2.26.2.645.ge9eca65c58-goog
> 


Re: [RFC PATCH v3 13/14] perf metricgroup: remove duped metric group events

2020-05-08 Thread Andi Kleen
>  static struct evsel *find_evsel_group(struct evlist *perf_evlist,
> struct expr_parse_ctx *pctx,
> +   bool has_constraint,
> struct evsel **metric_events,
> unsigned long *evlist_used)
>  {
> - struct evsel *ev;
> - bool leader_found;
> - const size_t idnum = hashmap__size(&pctx->ids);
> - size_t i = 0;
> - int j = 0;
> + struct evsel *ev, *current_leader = NULL;
>   double *val_ptr;
> + int i = 0, matched_events = 0, events_to_match;
> + const int idnum = (int)hashmap__size(&pctx->ids);

BTW standard perf data structure would be a rblist or strlist

I think it would be really better to do the deduping in a separate
pass than trying to add it to find_evsel_group. This leads
to very complicated logic.

This will likely make it easier to implement more sophisticated
algorithms too.

-Andi



Re: [RFC PATCH v3 06/14] perf evsel: fix 2 memory leaks

2020-05-08 Thread Andi Kleen
On Thu, May 07, 2020 at 10:36:21PM -0700, Ian Rogers wrote:
> If allocated, perf_pkg_mask and metric_events need freeing.

All these patches at the beginning look like straight forward
bug fixes and are really independent of the new features.

For them

Reviewed-by: Andi Kleen 

> 
> Signed-off-by: Ian Rogers 
> ---
>  tools/perf/util/evsel.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 28683b0eb738..05bb46baad6a 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -1263,6 +1263,8 @@ void evsel__exit(struct evsel *evsel)
>   zfree(&evsel->group_name);
>   zfree(&evsel->name);
>   zfree(&evsel->pmu_name);
> + zfree(&evsel->per_pkg_mask);
> + zfree(&evsel->metric_events);
>   perf_evsel__object.fini(evsel);
>  }
>  
> -- 
> 2.26.2.645.ge9eca65c58-goog
> 


Re: [RFC PATCH v3 12/14] perf metricgroup: order event groups by size

2020-05-08 Thread Andi Kleen
> > I'm not sure if size is that great an heuristic. The dedup algorithm should
> > work in any case even if you don't order by size, right?
> 
> Consider two metrics:
>  - metric 1 with events {A,B}
>  - metric 2 with events {A,B,C,D}
> If the list isn't sorted then as the matching takes the first group
> with all the events, metric 1 will match {A,B} and metric 2 {A,B,C,D}.
> If the order is sorted to {A,B,C,D},{A,B} then metric 1 matches within
> the {A,B,C,D} group as does metric 2. The events in metric 1 aren't
> used and are removed.

Ok. It's better for the longer metric if they stay together.

> 
> The dedup algorithm is very naive :-)

I guess what matters is that it gives reasonable results on the current
metrics. I assume it does?

How much deduping is happening if you run all metrics?

For toplev on my long term todo list was to compare it against
a hopefully better schedule generated by or-tools, but I never
got around to coding that up.

-Andi


Re: CONFIG_DEBUG_INFO_SPLIT impacts on faddr2line

2017-11-13 Thread Andi Kleen
> > It's the "CONFIG_DEBUG_INFO_SPLIT" thing that makes faddr2line unable
> > to see the inlining information,
> > 
> > Using OPTIMIZE_INLINING is fine.
> 
> Good to know that!

It works for me. Perhaps your binutils is too old? It was
added at some point. Can you try upgrading?

% ./linux/scripts/faddr2line obj/vmlinux schedule+10
schedule+10/0x80:
schedule at arch/x86/include/asm/current.h:15

% addr2line --version
GNU addr2line version 2.27-24.fc26

-Andi


Re: CONFIG_DEBUG_INFO_SPLIT impacts on faddr2line

2017-11-13 Thread Andi Kleen
> Put another way: the CONFIG_DEBUG_INFO_SPLIT option is useless. Yes,
> it saves time and disk space, but does so at the expense of making all
> the debug information unavailable to basic tools.

You're right. It works for line information, but strangely not for
inlines. I assume it can be fixed.

-Andi


Re: CONFIG_DEBUG_INFO_SPLIT impacts on faddr2line

2017-11-13 Thread Andi Kleen
On Mon, Nov 13, 2017 at 12:56:31PM -0800, Linus Torvalds wrote:
> On Mon, Nov 13, 2017 at 12:10 PM, Andi Kleen  wrote:
> >
> > You're right. It works for line information, but strangely not for
> > inlines. I assume it can be fixed.
> 
> So I'm not 100% sure it's strictly a addr2line bug.

It seems to be broken for normal programs too

$ cat tinline.c

int i;

static inline int finline(void)
{
i++;
}

main()
{
finline();
}

$ gcc -O2 -gsplit-dwarf tinline.c
$ addr2line -i -e a.out 0x4003b0
/home/ak/tsrc/tinline.c:6
$ gcc -O2  -g tinline.c
$ addr2line -i -e a.out 0x4003b0
/home/ak/tsrc/tinline.c:6
/home/ak/tsrc/tinline.c:12
$

I filed https://sourceware.org/bugzilla/show_bug.cgi?id=22434


-Andi



Re: CONFIG_DEBUG_INFO_SPLIT impacts on faddr2line

2017-11-13 Thread Andi Kleen
> I wonder if there is some way to use the split format for the
> intermediate files, but then for the very final link bring them all in
> and make the end result be a traditional single binary. I'm not
> talking the separate "dwp" package that packs multiple dwo files into
> one, but to actually link them all in at the one.
> 
> Sure, it would lose some of the advantage, but I think a large portion
> of the -gsplit-dwarf advantage is about the intermediate state. No?

Not sure it's worth to do complicated workarounds. I assume
it will be not that difficult to fix binutils (after all
gdb works), and disabling the option is a reasonable workaround.

> 
> I tried to google for it, but couldn't find anything. But apparently
> elfutils doesn't support dwo files either. It seems mainly the linker
> and gdb itself that supports it.

The original design document is

https://gcc.gnu.org/wiki/DebugFission

-Andi


Re: Page allocator bottleneck

2017-09-14 Thread Andi Kleen
Tariq Toukan  writes:
>
> Congestion in this case is very clear.
> When monitored in perf top:
> 85.58% [kernel] [k] queued_spin_lock_slowpath

Please look at the callers. Spinlock profiles without callers
are usually useless because it's just blaming the messenger.

Most likely the PCP lists are too small for your extreme allocation
rate, so it goes back too often to the shared pool.

You can play with the vm.percpu_pagelist_fraction setting.

-Andi


Re: [PATCH 3/4] Make net watchdog timers 1 sec jiffy aligned

2007-05-31 Thread Andi Kleen
Venki Pallipadi <[EMAIL PROTECTED]> writes:
> 
> If this does not work:
> Another option is to use 'deferrable timer' here which will be called at
> same as before time when CPU is busy and on idle CPU it will be delayed until
> CPU comes out of idle due to any other events.

That would sound like a good idea here and at least power wise 
it should be nearly free

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Failover-friendly TCP retransmission

2007-06-04 Thread Andi Kleen
[EMAIL PROTECTED] writes:

> Please note first that I want to address physical failures by
> the failover-capable network devices, which are increasingly
> becoming important as Xen-based VM systems are getting popular.
> Reducing a single-point-of-failure (physical device) is vital on
> such VM systems.

Just you typically still have lots of other single points of failures in 
a single system, some of them quite less reliable than your typical
NIC. But at least it gives impressive demos when pulling ethernet cables @)

> 1. Network device layer detects a failure first and switch to a
>backup device (say, in 20sec).
> 
> 2. TCP layer timeout & retransmission comes next, _hopefully_
>before the application layer timeout.
> 
> 3. Application layer detects a network failure last (by, say,
>30sec timeout) and may trigger a system-level failover.
> 
> It should be noted that the timeouts for #1 and #2 are handled
> independently and there is no relationship between them.

> If TCP retransmission misses the time frame between event #1 and
> #3 in Background above (between 20 and 30sec since network
> failure), a failure causes the system-level failover where the
> network-device-level failover should be enough.

You should probably make sure that the device ends up returning the
right NET_XMIT_* code for such drops to TCP, in particular
NET_XMIT_DROP. This might require slight driver interface
changes. Also right now it only affects the congestion window, I think, 
it  might be reasonable to let it affect the timer backoff too.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Failover-friendly TCP retransmission

2007-06-05 Thread Andi Kleen
> Your suggestion, to utilize NET_XMIT_* code returned from an
> underlying layer, is done in tcp_transmit_skb.
> 
> But my problem is that tcp_transmit_skb is not called during a
> certain period of time.  So I'm suggesting to cap RTO value so
> that tcp_transmit_skb gets called more frequently.

The transmit code controls the transmission timeout. Or at least
it could change it if it really wanted.

What I wanted to say is: if the loss still happens under control
of the sending end device and TCP knows this then it could change
the retransmit timer to fire earlier or even just wait for an 
event from the device that tells it to retransmit early.

I admit I have not thought through all the implications of this,
but it would seem to me a better approach than capping RTO or
doing other intrusive TCP changes.

The problem with capping RTO is that when there is a loss
in the network for some other reasons (and there is no reason
bonding can't be used when talking to the internet) you
might be too aggressive or not aggressive enough anymore
to get the data through.

But if you only change behaviour when you detect a local
loss this cannot happen.

Just using a very short timeout of one jiffie on local loss might work 
(the stack already does this sometimes). Upcalls would be more 
complicated and might have some bad side effects (like not 
interacting well with qdiscs or possibly being unfair if there 
are a lot of sockets). But that might be solvable too. 

In a virtualized environments it might be also needed 
to pass NET_XMIT_* through the paravirtual driver interface.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: a maze of twisty stats, most different

2007-06-29 Thread Andi Kleen
David Stevens <[EMAIL PROTECTED]> writes:

> I think there's a more general problem that's a huge hassle. There are 
> lots of
> new SNMP MIB's, but no conventions (that I'm aware of, at least) to allow 
> for
> changes to the /proc entries that get them to applications. 

Actually /proc/net/snmp and netstat -s are extensible. If you add a new
MIB or field there it should just show up in netstat -s. It won't know about
the text decoding that is done for the early traditional MIBs, but that
is already not there for most of the newer fields.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: a maze of twisty stats, most different

2007-06-29 Thread Andi Kleen
> That works ok for some things, like new global counters, but some
> items really fit best in existing files and the concern there is about
> other uses of them beyond the standard tools.
> Examples:
> -addition of route age in /proc/net/route and /proc/net/ipv6_route

Routing information belongs into netlink imho. The /proc setup
for it already only shows a small subset of it. Netlink is easily
extensible -- just add new headers.

> -per-group data in /proc/net/igmp & igmp6

Don't know.  But you can probably just add more fields there,
multicasting is not exactly something that a lot of people care about
so there are likely not many scripts that might get broken.

> -per-interface MLD MIB info, which ought to go with other per-interface 
> data

ethtool ?  It's also extensible, although you have to change
the userland. But I don't think there is a risk of someone's script
breaking.

> 
> I think everything that uses this kind of interface ought to do
> label matching, so additional columns in a row (anywhere in the row)
> would just be skipped/ignored by things that don't understand them,
> and similarlarly for single-row tagged items. You can do that in scripts
> with awk, but if existing items don't, they'll break.

I would expect shell scripts to generally do netstat -s | ..., which
is easily matchable.


-Andi> 
> 
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Who's allowed to set a skb destructor?

2007-07-05 Thread Andi Kleen
Brice Goglin <[EMAIL PROTECTED]> writes:

> I am trying to understand whether I can setup a skb destructor in my
> code (which is basically a protocol above dev_queue_xmit() and co). From
> what I see in many parts in the current kernel code, the "protocol" (I
> mean, the one who actually creates the skb) may setup a destructor.

The socket layer generally needs it for its own accounting.
Unless you never pass it up you can't use it.

> However, I also see some places where some low-level drivers might be
> using a destructor too , without apparently checking whether an upper
> layer already uses one. For instance, write_ofld_wr() in cxgb3/sge.c.

Likely a bug. Normally that should not slip past code review.

> found some old threads about adding support for multiple destructors but
> I don't see anything like this in the current kernel.
> 
> So, I'd like to have a clear statement about who's allowed to use a
> destructor :)

The traditional standpoint was that having your own large skb pools 
is not recommended because you won't interact well with the 
rest of the system running low on memory and you tieing up 
memory.

Essentially you would recreate all the problems traditional Unix
systems have with fixed size mbuf pools. Linux always used a more
dynamic and flexible allocate-only-as-you-need approach even when it
can have a little more overhead in managing IOMMUs etc.

These days there are shrinker callbacks that would in theory
allow you to handle this, but it would be likely still hard to implement
correctly.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Who's allowed to set a skb destructor?

2007-07-05 Thread Andi Kleen
On Thu, Jul 05, 2007 at 02:28:50PM +0200, Jarek Poplawski wrote:
> I wonder if it's very unsound to think about a one way list
> of destructors. Of course, not owners could only clean their
> private allocations. Woudn't this save some skb clonning,
> copying or adding new fields for private infos?

skb cloning isn't very expensive when you need it. And they
got a little private area you can use for your own stuff 
while you have it queued (skb->cb) 

As a historical note one of the big changes during the Linux 2.0
and 2.1 TCP rewrite was that TCP was changed to always clone for the
retransmit queue. This cleaned up the code greatly and fixed
many problems. Cloning was also especially optimized for this. When TCP 
which is about one of the most performance critical protocols around can 
afford it likely other code can too.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Who's allowed to set a skb destructor?

2007-07-05 Thread Andi Kleen
> The destructor method is set and used for skbs originating from the RDMA 
> driver sitting above cxgb3.

If these skbs never reach the normal sockets based stack it might be ok.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Alternate to Ixia's ANVL test harness for tcp compliance.

2006-07-20 Thread Andi Kleen
On Thursday 20 July 2006 21:49, Piet Delaney wrote:

> Unfortunately Ixia told me they don't have any plains to port the
> new GUI to linux. Instead they are trying to migrate Linux developers,
> us, to using Windows. Yeck!


With some luck it will just work in wine.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: What is RDMA

2006-07-24 Thread Andi Kleen
On Tuesday 25 July 2006 00:34, Rick Jones wrote:
> That TOE/iWARP could end-up being precluded by NAT seems so ironic from a 
> POE2E 
> standpoint.

Yes, it's sad, but reality unfortunately. 

There is even precedent: the VJ stateless TW recycling scheme also
turned out to not work because of NAT considerations.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RDMA will be reverted

2006-07-24 Thread Andi Kleen

> For example, my idea to allow ESTABLISHED TCP socket demux to be done
> before netfilter is flawed.  Connection tracking and NAT can change
> the packet ID and loop it back to us to hit exactly an ESTABLISHED TCP
> socket, therefore we must always hit netfilter first.

Hmm, how does this happen?

I guess either when a connection is masqueraded and an application did a bind()
on a local port that is used by the masquerading engine.  That could be handled
by just disallowing it.

Or when you have a transparent proxy setup with the proxy on the local host.
Perhaps in that case netfilter could be taught to reinject packets
in a way that they hit another ESTABLISHED lookup.

Did I miss a case?

> All the original costs of route, netfilter, TCP socket lookup all
> reappear as we make VJ netchannels fit all the rules of real practical
> systems, eliminating their gains entirely.

At least most of the optimizations from the early demux scheme could
be probably gotten simpler by adding a fast path to iptables/conntrack/etc. 
that checks if all rules only check SYN etc. packets and doesn't walk the
full rules then (or more generalized a fast TCP flag mask check similar 
to what TCP does). With that ESTABLISHED would hit TCP only with relatively
small overhead.

> I will also note in 
> passing that papers on related ideas, such as the Exokernel stuff, are
> very careful to not address the issue of how practical 1) their demux
> engine is and 2) the negative side effects of userspace TCP
> implementations.  For an example of the latter, if you have some 1GB
> JAVA process you do not want to wake that monster up just to do some
> ACK processing or TCP window updates, yet if you don't you violate
> TCP's rules and risk spurious unnecessary retransmits.

I don't quite get why the size of the process matters here - if only
some user space TCP library is called directly then it shouldn't
really matter how big or small the rest of the process is.

Or did you mean migration costs as described below?

But on the other hand full user space TCP seems to me of little gain
compared to a process context implementation.

I somehow like it better to hide these implementation details in 
the kernel.
 
> Furthermore, the VJ netchannel gains can be partially obtained from
> generic stateless facilities that we are going to get anyways.
> Networking chips supporting multiple MSI-X vectors, choosen by hashing
> the flow ID, can move TCP processing to "end nodes" which are cpu
> threads in this case, by having each such MSI-X vector target a
> different cpu thread.

The problem with the scheme is that to do process context processing
effectively you would need to teach the scheduler to aggressively
migrate on wake up (so that the process ends up on the CPU that 
was selected by the hash function in the NIC).

But what do you do when you have lots of different connections
with different target CPU hash values or when this would require
you to move multiple compute intensive processes or a single core?

Without user context TCP, but using softirqs instead, it looks a bit better 
because you can at least use different CPUs to do the ACK processing etc.
and the hash function spreading out connections over your CPUs doesn't harm.

But you still have relatively high cache line transfer costs in handing
over these packet from the softirq CPUs to the final process consumer. I liked
VJ's idea of using arrays-of-something instead of lists for that to avoid
some cache line transfers.  Ok at least it sounds nice in theory - haven't seen 
any 
hard numbers on this scheme compared to a traditional double linked list.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RDMA will be reverted

2006-07-24 Thread Andi Kleen
On Tuesday 25 July 2006 01:22, David Miller wrote:
> From: Andi Kleen <[EMAIL PROTECTED]>
> Date: Tue, 25 Jul 2006 01:10:25 +0200
> 
> > > All the original costs of route, netfilter, TCP socket lookup all
> > > reappear as we make VJ netchannels fit all the rules of real practical
> > > systems, eliminating their gains entirely.
> > 
> > At least most of the optimizations from the early demux scheme could
> > be probably gotten simpler by adding a fast path to iptables/conntrack/etc. 
> > that checks if all rules only check SYN etc. packets and doesn't walk the
> > full rules then (or more generalized a fast TCP flag mask check similar 
> > to what TCP does). With that ESTABLISHED would hit TCP only with relatively
> > small overhead.
> 
> Actually, all is not lost.  Alexey has a more clever idea which
> is basically to run the netfilter hooks in the socket receive
> path.

The gain being that the target CPU does the work instead of 
the softirq one?

Some combined lookup and better handler of ESTABLISHED still
seems like a good idea.

One idea I had at some point was to separate conntrack for local
connection vs routed connections and attach the local conntrack
to the socket (and use its lookup tables). Then at least for
local connections conntrack should be nearly free.

It should also solve the issue we currently have that enabled 
conntrack makes local network performance significantly worse.

> Where does state live in such a huge process?  Usually, it is
> scattered all over it's address space.  Let us say that java
> application just did a lot of churning on it's own data
> structure, swapping out TCP library state objects, we will
> prematurely swap that stuff back in just to spit out an ACK
> or similar.

TCP state is usually multiple cache lines, so you would have
cache misses anyways. Do you worry about the TLBs? 

> > But what do you do when you have lots of different connections
> > with different target CPU hash values or when this would require
> > you to move multiple compute intensive processes or a single core?
> 
> That is why we have scheduler :)

It can't do well if it gets conflicting input.

> Even in a best effort scenerio, things 
> will be generally better than they are currently, plus there is nothing
> precluding the flow demux MSI-X selection from getting more intelligent.

Intelligent = statefull in this case.

AFAIK the only way to do it stateless is hashes and the output
of hashes tends to be unpredictible by definition.


> For example, the demuxer could "notice" that TCPdata transmits for
> flow X tend to happen on cpu X, and update a flow table to record that
> fact.  It could use the same flow table as the one used for LRO.

Hmm, i somewhat doubt that lower end NICs will ever have such flow tables.
Also the flow tables could always thrash (because the on NIC RAM is necessarily
limited) or they or require the NIC to look up state in memory which is 
probably not much faster than the CPUs doing it.

Using hash functions in the hardware to select the MSI-X seems 
more elegant, cheaper and much more scalable to me.

The drawback of hashes is that for processes with multiple
connections you have to move some work back into the softirqs
that run on the MSI-X target CPUs.

So basically doing process context TCP fully will require
much more complex and statefull hardware. 

Or you can keep it only as a fast path for specific situations
(single busy connection per thread) and stay with mostly-softirq
processing for the many connection cases.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RDMA will be reverted

2006-07-24 Thread Andi Kleen

> Even enough bits for 1024 or 2048 CPUs in the single system image? 

MSI-X supports 255 sub interrupts max, most hardware probably much less
(e.g. 8 seems to be a popular number). 

It can be always hashed to the existing CPUs.

It's a nice idea but I think standard hashing + processing in softirq 
would be worth a try first at least.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RDMA will be reverted

2006-07-24 Thread Andi Kleen
On Tuesday 25 July 2006 02:29, Rick Jones wrote:
> This all sounds like the discussions we had within HP-UX between 10.20 and 
> 11.0 
> concerning Inbound Packet Scheduling vs Thread Optimized Packet Scheduling.  

We've also talking about this for many years, just no code so far.
Or rather Linux so far left the job to manual tuning.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: skge error; hangs w/ hardware memory hole

2006-07-24 Thread Andi Kleen
On Sunday 23 July 2006 08:32, Anthony DeRobertis wrote:
> Andreas Kleen wrote:
> 
> > 
> > You need to use iommu=soft swiotlb=force
> > 
> > The standard IOMMU is also broken on VIA, but forced swiotlb should
> > work.
> 
> Didn't work :-(

swiotlb=force is unfortunately broken right now. 

But which this patch it should work. Does it?

-Andi

Test patch only: disable DMA over 4GB

Index: linux-2.6.17-work/arch/x86_64/kernel/pci-dma.c
===
--- linux-2.6.17-work.orig/arch/x86_64/kernel/pci-dma.c
+++ linux-2.6.17-work/arch/x86_64/kernel/pci-dma.c
@@ -202,7 +202,7 @@ int dma_set_mask(struct device *dev, u64
 {
if (!dev->dma_mask || !dma_supported(dev, mask))
return -EIO;
-   *dev->dma_mask = mask;
+   *dev->dma_mask = mask & 0x;
return 0;
 }
 EXPORT_SYMBOL(dma_set_mask);

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RDMA will be reverted

2006-07-25 Thread Andi Kleen

> It may be a hardware interpretation, but doesn't it have non-trivial system 
> implications - where one runs threads/processes etc?

Only if you do process context RX processing. If you chose not to it doesn't 
have much influence.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Regarding offloading IPv6 addrconf and ndisc

2006-07-31 Thread Andi Kleen

> If we process these in sequence in software interrupt, everything
> is fine.  Processing of "A" will add the address, and the test
> ping packet "B" will respond properly.
> 
> If you defer "A", everything breaks and the test packet "B" will
> get processed first and not work.

Playing devil's advocate here: if the packets are processed on
two different CPUs then this could also happen and break the test
case.

So the test is probably a bit fragile.

Currently it is unlikely to happen because of interrupt affinity for a 
single device,  but in future with MSI-X support it might not.

I generally agree it's better to keep this in kernel though.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [IPV6]: Audit all ip6_dst_lookup/ip6_dst_store calls

2006-08-01 Thread Andi Kleen
On Tuesday 01 August 2006 16:41, Andrew Morton wrote:
> On Mon, 31 Jul 2006 19:04:33 +1000
> Herbert Xu <[EMAIL PROTECTED]> wrote:
> 
> > 2) There is something broken in the x86_64 unwind code which is causing
> > it to panic just about everytime somebody calls dump_stack().
> > 
> > Andi, this is the second time I've seen a report where an otherwise
> > harmless dump_stack call (the other one was caused by a WARN_ON) gets
> > turned into a panic by the stack unwind code on x86_64.  This particular
> > report is with 2.6.18-rc3 so it looks like whatever bug is causing it
> > hasn't been fixed yet.
> > 
> > Could you please have a look at it? Thanks.
> 
> Jan thinks this might have been fixed by a patch which he sent Andi a
> couple of days ago.  Andi has sent that patch to Linus

I didn't send that particular patch before, just queued it, because I didn't 
realize
that particular crash, but I have now send it yesterday. So far L. hasn't merged
it unfortunately, but I will resend.

> but I'm not sure 
> which patch it was
 
"entry-more-unwind" was my version, there was another one from Jan

> and I'm not sure whether it has been merged into 
> mainline.
> 
> But yes, -rc3 unwind has problems.

"unwinder stuck" messages are expected and not really fatal because they
don't lose any information. I expect it will need some releases to work
them all out fully, but then we'll hopefully have a much better unwinder
that doesn't generate any false positives anymore.

New crashes during unwinding are fatal though and I plan to fix them.
So far this one was the only known one.

I already got a lot of patches queued for .19 that fix more unwind
information in a lot of assembly files. Still not fully complete though.
Fixing it all properly unfortunately requires undoing some stuff, e.g.
the unwinder cannot deal with separate lock sections, so I was slowly
removing them.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Fix more per-cpu typos

2006-08-01 Thread Andi Kleen

> --- a/arch/x86_64/kernel/smp.c
> +++ b/arch/x86_64/kernel/smp.c
> @@ -203,7 +203,7 @@ int __cpuinit init_smp_flush(void)
>  {
>   int i;
>   for_each_cpu_mask(i, cpu_possible_map) {
> - spin_lock_init(&per_cpu(flush_state.tlbstate_lock, i));
> + spin_lock_init(&per_cpu(flush_state, i).tlbstate_lock);

What advantage does this have over the earlier form?

In general this should be split up into three patches.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] limit rt cache size

2006-08-07 Thread Andi Kleen

> So for now it is probably sufficient to just get rid of the
> HASH_HIGHMEM flag here.  Later we can try changing this multiplier
> of "16" to something like "8" or even "4".

The hash sizing code needs far more tweaks. iirc it can still allocate several 
GB 
hash tables on large memory systems (i've seen that once in the boot log
of a 2TB system).  Even on smaller systems it is usually too much.

IMHO there needs to be a maximum size (maybe related to the sum of 
caches of all CPUs in the system?)

Best would be to fix this for all large system hashes together.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] limit rt cache size

2006-08-07 Thread Andi Kleen

> 
> Whereas it should probably go:
> 
>   if (max == 0) {
>   max = (flags & HASH_HIGHMEM) ? nr_all_pages : nr_kernel_pages;
>   max = (max << PAGE_SHIFT) >> 4;
>   do_div(max, bucketsize);
>   }
> 
> or something like that.

That's still too big. Consider a 2TB machine, with all memory in LOWMEM.

Or even a smaller system. At some point it doesn't make sense
anymore to go to a larger table, even if you have the memory
and it is actually costly to have a larger table because walking
the table (netstat, route etc.) completely will take longer and longer.

It needs some additional limit. Either a maximum one or something dynamic
like CPU cache sizes (but there is currently no portable way to get that)

-Andi

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] limit rt cache size

2006-08-08 Thread Andi Kleen

> 3) should we limit TCP hashe and hashb size the same way?

Yes - or best in fact all hashes handled by alloc_large_system_hash()

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] limit rt cache size

2006-08-08 Thread Andi Kleen

> >
> > IMHO there needs to be a maximum size (maybe related to the sum of
> > caches of all CPUs in the system?)
> >
> > Best would be to fix this for all large system hashes together.
> 
> How about using an algorithm like this: up to a certain "size"
> (memory size, cache size,...), scale the hash tables linearly; 
> but for larger sizes, scale logarithmically (or approximately
> logarithmically)

I don't think it makes any sense to continue scaling at all after
some point - you won't get better shorter hash chains anymore and the 
large hash tables actually cause problems: e.g. there are situations where we 
walk
the complete tables and that takes longer and longer.

Also does a 1TB machine really need bigger hash tables than a 100GB one?

The problem is to find out what a good boundary is.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] limit rt cache size

2006-08-08 Thread Andi Kleen
On Wednesday 09 August 2006 02:11, David Miller wrote:
> From: Andi Kleen <[EMAIL PROTECTED]>
> Date: Wed, 9 Aug 2006 01:23:01 +0200
> 
> > The problem is to find out what a good boundary is.
> 
> The more I think about this the more I lean towards
> two conclusions:
> 
> 1) dynamic table growth is the only reasonable way to
>handle this and not waste memory in all cases

Yes, but even with dynamic growth you still need some upper boundary 
(otherwise a DOS could eat all your memory). And it would need
to be figured out what it is.

BTW does dynamic shrink after a load spike make sense too?

> 2) for cases where we haven't implemented dynamic
>table growth, specifying a proper limit argument
>to the hash table allocation is a sufficient
>solution for the time being

Agreed, just we don't know what the proper limits are. 

I guess it would need someone running quite a lot of benchmarks.
Anyone volunteering? @)

Or do some cheesy default and document the options to change
it clearly and wait for feedback from users on what works for
them?

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] limit rt cache size

2006-08-08 Thread Andi Kleen

> But there's alot of state to update (more or less
> atomically, too) 

Why does it need to be atomic? It might be enough
to just check a flag and poll for it in the readers and then redo the 
lookup.

> in the TCP hashes. Looks tricky to 
> do that without hurting performance, especially since
> you'll probably want to resize the tables when you've
> discovered they're full and busy

There will be some hickup, but as long as the downtime
is limited it shouldn't be too bad.

-Andi

 
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] limit rt cache size

2006-08-09 Thread Andi Kleen

> > There will be some hickup, but as long as the downtime
> > is limited it shouldn't be too bad.
> >
> 
> Benchmarks are in order

One issue I forgot earlier and Kirill pointed out is that the
reallocation would require vmalloc because memory will be too fragmented
to get a big piece of physical memory. So it would add TLB pressure. 

Can't think of a good way around that.

You might have been right with being sceptical.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: skb_shared_info()

2006-08-13 Thread Andi Kleen

> The e1000 issue is just one example of this, another
> would be any attempt to consolidate the TCP retransmit
> queue data management.

Another reason to move it in the sk_buff would be better cache 
coloring? Currently on large/small MTU packets it will be always on 
the same colors.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: skb_shared_info()

2006-08-14 Thread Andi Kleen
On Monday 14 August 2006 09:29, Herbert Xu wrote:
> On Mon, Aug 14, 2006 at 07:04:01AM +0200, Andi Kleen wrote:
> > 
> > Another reason to move it in the sk_buff would be better cache 
> > coloring? Currently on large/small MTU packets it will be always on 
> > the same colors.
> 
> If we went with a fully paged skbs this should be a non-issue, right?

Even for 1.5k MTU? (which is still the most common case after all)

> In a fully paged representation the head would be small which is the
> perfect place to place the skb_shared_info.

If the head is small and allocated with kmalloc then slab should 
color it yes. It won't for big allocations.

-Andi
 
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: skb_shared_info()

2006-08-14 Thread Andi Kleen
On Monday 14 August 2006 09:50, Herbert Xu wrote:
> On Mon, Aug 14, 2006 at 09:45:53AM +0200, Andi Kleen wrote:
> > 
> > Even for 1.5k MTU? (which is still the most common case after all)
> 
> Ideally they would stay in kmalloc memory.  Could you explain the cache
> colouring problem for 1500-byte packets?

kmalloc rounds the 1.5k up to 2k and with that exactly two objects
fit into a 4k page. With that slab doesn't have any space left
to do cache coloring.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] network memory allocator.

2006-08-14 Thread Andi Kleen
Evgeniy Polyakov <[EMAIL PROTECTED]> writes:

> Design notes.
> Original idea was to store meta information used for allocation in an
> AVL tree [1], but since I found a way to use some "unused" fields in struct 
> page,
> tree is unused in the allocator.

But there seems to be still an AVL tree in there?


> Benchmarks with trivial epoll based web server showed noticeble (more
> than 40%) imrovements of the request rates (1600-1800 requests per
> second vs. more than 2300 ones). It can be described by more
> cache-friendly freeing algorithm, by tighter objects packing and thus
> reduced cache line ping-pongs, reduced lookups into higher-layer caches
> and so on.

So what are its drawbacks compared to slab/kmalloc? 

Also if it really performs that much better it might be a good
idea to replace all of kmalloc() with it, but doing that
would require a lot more benchmarks with various workloads
and small and big machines first.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] network memory allocator.

2006-08-15 Thread Andi Kleen
Andrew Morton <[EMAIL PROTECTED]> writes:
> 
> There will be heaps of cacheline pingpong accessing these arrays.  I'd have
> though that
> 
> static struct whatever {
>   avl_t avl_node_id;
>   struct avl_node **avl_node_array;
>   struct list_head *avl_container_array;
>   struct avl_node *avl_root;
>   struct avl_free_list *avl_free_list_head;
>   spinlock_t avl_free_lock;
> } __cacheline_aligned_in_smp whatevers[NR_CPUS];
> 
> would be better.

Or even better per cpu data. New global/static NR_CPUS arrays should be really 
discouraged.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] network memory allocator.

2006-08-16 Thread Andi Kleen
On Wed, 16 Aug 2006 09:48:08 +0100
Christoph Hellwig <[EMAIL PROTECTED]> wrote:

> On Wed, Aug 16, 2006 at 09:35:46AM +0400, Evgeniy Polyakov wrote:
> > On Tue, Aug 15, 2006 at 10:21:22PM +0200, Arnd Bergmann ([EMAIL PROTECTED]) 
> > wrote:
> > > Am Monday 14 August 2006 13:04 schrieb Evgeniy Polyakov:
> > > > ?* full per CPU allocation and freeing (objects are never freed on
> > > > different CPU)
> > > 
> > > Many of your data structures are per cpu, but your underlying allocations
> > > are all using regular kzalloc/__get_free_page/__get_free_pages functions.
> > > Shouldn't these be converted to calls to kmalloc_node and alloc_pages_node
> > > in order to get better locality on NUMA systems?
> > >
> > > OTOH, we have recently experimented with doing the dev_alloc_skb calls
> > > with affinity to the NUMA node that holds the actual network adapter, and
> > > got significant improvements on the Cell blade server. That of course
> > > may be a conflicting goal since it would mean having per-cpu per-node
> > > page pools if any CPU is supposed to be able to allocate pages for use
> > > as DMA buffers on any node.
> > 
> > Doesn't alloc_pages() automatically switches to alloc_pages_node() or
> > alloc_pages_current()?
> 
> That's not what's wanted.  If you have a slow interconnect you always want
> to allocate memory on the node the network device is attached to.

That's not true on all NUMA systems (that they have a slow interconnect)
I think on x86-64 I would prefer if it was distributed evenly or maybe even 
on the CPU who is finally going to process it.

-Andi "not all NUMA is an Altix"
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: UDP-Lite and /proc/net/snmp

2007-11-10 Thread Andi Kleen
Eric Dumazet <[EMAIL PROTECTED]> writes:
>
> I meant a netstat bug of course, sorry :(
>
> It fails to parse /proc/net/netstat , because TcpExt line is bigger
> than 1024 chars.

guilty -- i wrote that code a long time ago.

>
> To correct it, we might enlarge buf1[] and buf2[] from 1024 to 2048 in
> statistics.c, process_fd() function.

Or just split TcpExt: into multiple lines, e.g. with a TcpExt2: 
No reason to have it all on a single line anyways.
I think that would be the better fix.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/05] ipv6: RFC4214 Support

2007-11-10 Thread Andi Kleen
"Templin, Fred L" <[EMAIL PROTECTED]> writes:
>  
> +#if defined(CONFIG_IPV6_ISATAP)
> + /* ISATAP (RFC4214) - router address in daddr */
> + if (!strncmp(parms->name, "isatap", 6)) {

Modern distributions tend to have daemons to automatically rename
network interfaces using SIOCSIFNAME. Not sure they would touch
isatap*, but they or someone else might. I would be likely safer to
not base your user interface on the name only, but use a flag
or number somewhere else.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: possible bug in tcp_probe

2007-11-13 Thread Andi Kleen
Gavin McCullagh <[EMAIL PROTECTED]> writes:

> I'm using linux v2.6.22.6 and tcp_probe with a couple of small
> modifications[1]. 
>
> Even with moderately large numbers of flows (16 on the one machine) and
> increasingly as I monitor more flows than that, I get strange overflow
> problems such as this one:

You probably just need to increase the buffer (bufsize option to the 
module or tcp_probe.bufsize if compiled in) 

In general I think tcp_probe was always more intended a demo of how kprobes
could be used than a real production tool (with no offense to sch) 

-Andi

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] NET : rt_check_expire() can take a long time, add a cond_resched()

2007-11-14 Thread Andi Kleen
Eric Dumazet <[EMAIL PROTECTED]> writes:
>
> Using a "if (need_resched())" test before calling "cond_resched();" is
> necessary to avoid spending too much time doing the resched check.

The only difference between cond_resched() and if (need_resched())
cond_resched() is one function call less and one might_sleep less. If
the might_sleep or the function call are really problems (did you
measure it? -- i doubt it somewhat) then it would be better to fix the
generic code to either inline that or supply a __cond_resched()
without might_sleep.

A cheaper change might have been to just limit the number of buckets
scanned.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] NET : rt_check_expire() can take a long time, add a cond_resched()

2007-11-17 Thread Andi Kleen
Eric Dumazet <[EMAIL PROTECTED]> writes:

> So it may sound unnecessary but in the rt_check_expire() case, with a
> loop potentially doing XXX.XXX iterations, being able to bypass the
> function call is a clear win (in my bench case, 25 ms instead of 88
> ms). Impact on I-cache is irrelevant here as this rt_check_expires()

Measuring what? And really milli-seconds? The number does not sound plausible 
to me.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] [IPV4] UDP: Always checksum even if without socket filter

2007-11-17 Thread Andi Kleen
Wang Chen <[EMAIL PROTECTED]> writes:

> Herbert Xu said the following on 2007-11-16 12:11:
>> Wang Chen <[EMAIL PROTECTED]> wrote:
>>> So, I think the checksum in udp_queue_rcv_skb() actually does
>>> the work, not that in udp_recvmsg() and udp_poll().
>>>
>>> If I am wrong, please point out.
>> 
>> We may have a bug in the accounting area.  Check the recent
>> patch made to UDP/IPv6 which is probably needed here as well.
>> 
>
> Like dave said, decrementing the InDataGrams in this case is an
> option.
> I will check the same place of UDP/IPv6.

And like Benny pointed out it's probably a bad idea because
decrementing counters will be an unexpected ABI change for monitoring 
programs who have no other way to detect overflow.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] NET : rt_check_expire() can take a long time, add a cond_resched()

2007-11-17 Thread Andi Kleen
> The 25.000.000 ns and 88.000.000 ns numbers where on an empty table, but 
> large (16 MB of memory)

This would mean that cond_resched() needs ~4x as much time as checking
an empty bucket. I find that somewhat hard to believe.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] [IPV4] UDP: Always checksum even if without socket filter

2007-11-18 Thread Andi Kleen
> We could defer the increment until we check the checksum,
> but that is likely to break even more things because people
> (as Wang Chen did initially) will send a packet to some
> port with an app that doesn't eat the packets, and expect the
> InDatagrams counter to increase once the stack eats the packet.

Who expects that? Is there really any program who relies on that?

If it's just a human: there are a couple of "non intuitive" behaviours
in the stack. This would be just another one. Not too big a deal.

> But it won't until the application does the read.
> 
> All of our options suck, we just have to choose the least sucking one
> and right now to me that's decrementing the counter as much as I
> empathize with the SNMP application overflow detection issue.

If the SNMP monitor detects an false overflow the error it reports 
will be much worse than a single missing packet. So you would replace 
one error with a worse error.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] [IPV4] UDP: Always checksum even if without socket filter

2007-11-19 Thread Andi Kleen
> > > 
> > > All of our options suck, we just have to choose the least sucking one
> > > and right now to me that's decrementing the counter as much as I
> > > empathize with the SNMP application overflow detection issue.
> > 
> > If the SNMP monitor detects an false overflow the error it reports 
> > will be much worse than a single missing packet. So you would replace 
> > one error with a worse error.
> 
> This can be fixed, the above cannot.

I don't see how, short of breaking the interface
(e.g. reporting 64bit or separate overflow counts)
-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] [IPV4] UDP: Always checksum even if without socket filter

2007-11-20 Thread Andi Kleen
On Tue, Nov 20, 2007 at 12:29:45AM -0500, Bill Fink wrote:
> While I agree with your analysis that it could be worked around,
> who knows how all the various SNMP monitoring applications out there
> would interpret such an unusual event.  I liked Stephen's suggestion
> of a deferred decrement that would insure the counter didn't ever
> run backwards.  But the best approach seems to be just not to count
> it in the first place until tha application has actually received
> the packet, since as Herbert pointed out, that's what the RFC
> actually specifies for the meaning of the udpInDatagrams counter.

Together with another counter that counts "edge datagrams received"
that would be an excellent idea.

Here's a patch.

-Andi

---

Split UDP receive count into UdpInDatagrams and UdpInEarlyDatagrams

UdpInDatagrams can be confusing because it counts packets that 
might be dropped later.

Move UdpInDatagrams into recvmsg() as allowed by the RFC.

Add a new UdpInEarlyDatagrams counter to count datagrams received early,
but which might be dropped later

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

Index: linux-2.6.24-rc1-hack/include/linux/snmp.h
===
--- linux-2.6.24-rc1-hack.orig/include/linux/snmp.h
+++ linux-2.6.24-rc1-hack/include/linux/snmp.h
@@ -138,6 +138,7 @@ enum
UDP_MIB_OUTDATAGRAMS,   /* OutDatagrams */
UDP_MIB_RCVBUFERRORS,   /* RcvbufErrors */
UDP_MIB_SNDBUFERRORS,   /* SndbufErrors */
+   UDP_MIB_INEARLYDATAGRAMS,   /* Early Datagrams Received */
__UDP_MIB_MAX
 };
 
Index: linux-2.6.24-rc1-hack/net/ipv4/udp.c
===
--- linux-2.6.24-rc1-hack.orig/net/ipv4/udp.c
+++ linux-2.6.24-rc1-hack/net/ipv4/udp.c
@@ -873,6 +873,8 @@ try_again:
if (err)
goto out_free;
 
+   UDP_INC_STATS_USER(UDP_MIB_INDATAGRAMS, is_udplite);
+
sock_recv_timestamp(msg, sk, skb);
 
/* Copy the address. */
@@ -967,7 +969,8 @@ int udp_queue_rcv_skb(struct sock * sk, 
 
ret = (*up->encap_rcv)(sk, skb);
if (ret <= 0) {
-   UDP_INC_STATS_BH(UDP_MIB_INDATAGRAMS, 
up->pcflag);
+   UDP_INC_STATS_BH(UDP_MIB_INEARLYDATAGRAMS,
+up->pcflag);
return -ret;
}
}
@@ -1023,7 +1026,7 @@ int udp_queue_rcv_skb(struct sock * sk, 
goto drop;
}
 
-   UDP_INC_STATS_BH(UDP_MIB_INDATAGRAMS, up->pcflag);
+   UDP_INC_STATS_BH(UDP_MIB_INEARLYDATAGRAMS, up->pcflag);
return 0;
 
 drop:
Index: linux-2.6.24-rc1-hack/net/ipv4/proc.c
===
--- linux-2.6.24-rc1-hack.orig/net/ipv4/proc.c
+++ linux-2.6.24-rc1-hack/net/ipv4/proc.c
@@ -173,6 +173,7 @@ static const struct snmp_mib snmp4_udp_l
SNMP_MIB_ITEM("OutDatagrams", UDP_MIB_OUTDATAGRAMS),
SNMP_MIB_ITEM("RcvbufErrors", UDP_MIB_RCVBUFERRORS),
SNMP_MIB_ITEM("SndbufErrors", UDP_MIB_SNDBUFERRORS),
+   SNMP_MIB_ITEM("InEarlyDatagrams", UDP_MIB_INEARLYDATAGRAMS),
SNMP_MIB_SENTINEL
 };
 
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-21 Thread Andi Kleen

There seems to be rough consensus that the kernel currently has too many 
exported symbols. A lot of these exports are generally usable utility 
functions or important driver interfaces; but another large part are functions
intended by only one or two very specific modules for a very specific purpose.
One example is the TCP code. It has most of its internals exported, but 
only for use by tcp_ipv6.c (and now a few more by the TCP/IP congestion 
modules) 
But it doesn't make sense to include these exported for a specific module
functions into a broader "kernel interface".   External modules assume
they can use these functions, but they were never intended for that.

This patch allows to export symbols only for specific modules by 
introducing symbol name spaces. A module name space has a white
list of modules that are allowed to import symbols for it; all others
can't use the symbols.

It adds two new macros: 

MODULE_NAMESPACE_ALLOW(namespace, module);

Allow module to import symbols from namespace. module is the module name without
.ko as displayed by lsmod.  Must be in the same module as the export
(and be duplicated if there are multiple modules exporting symbols
to a namespace).  Multiple allows for the same name space are allowed.

EXPORT_SYMBOL_NS(namespace, symbol);

Export symbol into namespace.  Only modules allowed for the namespace
will be able to use them. EXPORT_SYMBOL_NS implies GPL only
because it is only for "internal" interfaces.

The name spaces only work for module loading. I didn't find
a nice way to make them work inside the main kernel binary. This means
the name space is not enforced for modules that are built in.

The biggest amount of work is of course still open: to go over all the existing
exports and figure for which ones it makes sense to define a namespace.
I did it for TCP and UDP so far, but the kernel right now has nearly 10k
exports (with some dups) that would need to be checked and turned into
name spaces. I would expect any symbol that is only used by one or two
other modules is a strong candidate for a namespace; in some cases even more
with modules that are tightly coupled.

I am optimistic that in the end we will have a much more manageable 
kernel interface.

Caveats: 

Exports need one long word more memory.

I had to add some alignment magic to the existing EXPORT_SYMBOLs
to get the sections right. Tested on i386/x86-64, but I hope it also
still works on architectures with stricter alignment requirements
like ARM. Any testers for that?

---
 arch/arm/kernel/armksyms.c|2 
 include/asm-generic/vmlinux.lds.h |7 +
 include/linux/module.h|   71 +++
 kernel/module.c   |  137 +++---
 4 files changed, 177 insertions(+), 40 deletions(-)

Index: linux/include/linux/module.h
===
--- linux.orig/include/linux/module.h
+++ linux/include/linux/module.h
@@ -34,6 +34,7 @@ struct kernel_symbol
 {
unsigned long value;
const char *name;
+   const char *namespace;
 };
 
 struct modversion_info
@@ -167,49 +168,80 @@ struct notifier_block;
 #ifdef CONFIG_MODULES
 
 /* Get/put a kernel symbol (calls must be symmetric) */
-void *__symbol_get(const char *symbol);
-void *__symbol_get_gpl(const char *symbol);
+extern void *do_symbol_get(const char *symbol, struct module *caller);
+#define __symbol_get(sym) do_symbol_get(sym, THIS_MODULE)
 #define symbol_get(x) ((typeof(&x))(__symbol_get(MODULE_SYMBOL_PREFIX #x)))
 
+struct module_ns {
+   char *name;
+   char *allowed;
+};
+
+#define NS_SEPARATOR "."
+
+/*
+ * Allow module MODULE to reference namespace NS.
+ * MODULE is just the base module name with suffix or path.
+ * This must be declared in the module (or main kernel) as where the
+ * symbols are defined. When multiple modules export symbols from
+ * a single namespace all modules need to contain a full set
+ * of MODULE_NAMESPACE_ALLOWs.
+ */
+#define MODULE_NAMESPACE_ALLOW(ns, module) \
+   static const struct module_ns __knamespace_##module##_##_##ns \
+   asm("__knamespace_" #module NS_SEPARATOR #ns)   \
+   __attribute_used__  \
+   __attribute__((section("__knamespace"), unused))\
+   = { #ns,  #module }
+
 #ifndef __GENKSYMS__
 #ifdef CONFIG_MODVERSIONS
 /* Mark the CRC weak since genksyms apparently decides not to
  * generate a checksums for some symbols */
-#define __CRC_SYMBOL(sym, sec) \
+#define __CRC_SYMBOL(sym, sec, post, post2)\
extern void *__crc_##sym __attribute__((weak)); \
-   static const unsigned long __kcrctab_##sym  \
+   static const unsigned long __kcrctab_##sym##post\
+   asm("__kcrctab_" #sym post2)\
__attribute_used__  \
__attribu

[PATCH RFC] [2/9] Fix duplicate symbol check to also check future gpl and unused symbols

2007-11-21 Thread Andi Kleen

This seems to have been forgotten earlier. Right now it was possible
for a normal symbol to override a future gpl symbol and similar.
I restructured the code a bit to avoid too much duplicated code.

---
 kernel/module.c |   45 -
 1 file changed, 24 insertions(+), 21 deletions(-)

Index: linux/kernel/module.c
===
--- linux.orig/kernel/module.c
+++ linux/kernel/module.c
@@ -1430,33 +1430,36 @@ EXPORT_SYMBOL_GPL(do_symbol_get);
  * Ensure that an exported symbol [global namespace] does not already exist
  * in the kernel or in some other module's exported symbol table.
  */
-static int verify_export_symbols(struct module *mod)
+
+static int check_duplicate(const struct kernel_symbol *syms, int num, struct 
module *owner)
 {
-   const char *name = NULL;
-   unsigned long i, ret = 0;
-   struct module *owner;
+   int i;
const unsigned long *crc;
 
-   for (i = 0; i < mod->num_syms; i++)
-   if (find_symbol(mod->syms[i].name, &owner, &crc, 1, mod)) {
-   name = mod->syms[i].name;
-   ret = -ENOEXEC;
-   goto dup;
-   }
-
-   for (i = 0; i < mod->num_gpl_syms; i++)
-   if (find_symbol(mod->gpl_syms[i].name, &owner, &crc, 1, mod)) {
-   name = mod->gpl_syms[i].name;
-   ret = -ENOEXEC;
-   goto dup;
+   for (i = 0; i < num; i++)
+   if (find_symbol(syms[i].name, &owner, &crc, 1, owner)) {
+   printk(KERN_ERR "%s: exports duplicate symbol %s (owned 
by %s)\n",
+   owner->name, syms[i].name, module_name(owner));
+   return -ENOEXEC;
}
+   return 0;
+}
 
-dup:
+static int verify_export_symbols(struct module *mod)
+{
+   int ret = check_duplicate(mod->syms, mod->num_syms, mod);
if (ret)
-   printk(KERN_ERR "%s: exports duplicate symbol %s (owned by 
%s)\n",
-   mod->name, name, module_name(owner));
-
-   return ret;
+   return ret;
+   ret = check_duplicate(mod->gpl_syms, mod->num_gpl_syms, mod);
+   if (ret)
+   return ret;
+   ret = check_duplicate(mod->unused_syms, mod->num_unused_syms, mod);
+   if (ret)
+   return ret;
+   ret = check_duplicate(mod->unused_gpl_syms, mod->num_unused_gpl_syms, 
mod);
+   if (ret)
+   return ret;
+   return check_duplicate(mod->gpl_future_syms, mod->num_gpl_future_syms, 
mod);
 }
 
 /* Change all symbols so that sh_value encodes the pointer directly. */
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC] [4/9] modpost: Fix format string warnings

2007-11-21 Thread Andi Kleen

Fix wrong format strings in modpost exposed by the previous patch.
Including one missing argument -- some random data was printed instead.

---
 scripts/mod/modpost.c |7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

Index: linux/scripts/mod/modpost.c
===
--- linux.orig/scripts/mod/modpost.c
+++ linux/scripts/mod/modpost.c
@@ -388,7 +388,7 @@ static int parse_elf(struct elf_info *in
 
/* Check if file offset is correct */
if (hdr->e_shoff > info->size) {
-   fatal("section header offset=%u in file '%s' is bigger then 
filesize=%lu\n", hdr->e_shoff, filename, info->size);
+   fatal("section header offset=%lu in file '%s' is bigger then 
filesize=%lu\n", (unsigned long)hdr->e_shoff, filename, info->size);
return 0;
}
 
@@ -409,7 +409,7 @@ static int parse_elf(struct elf_info *in
const char *secname;
 
if (sechdrs[i].sh_offset > info->size) {
-   fatal("%s is truncated. sechdrs[i].sh_offset=%u > 
sizeof(*hrd)=%ul\n", filename, (unsigned int)sechdrs[i].sh_offset, 
sizeof(*hdr));
+   fatal("%s is truncated. sechdrs[i].sh_offset=%lu > 
sizeof(*hrd)=%lu\n", filename, (unsigned long)sechdrs[i].sh_offset, 
sizeof(*hdr));
return 0;
}
secname = secstrings + sechdrs[i].sh_name;
@@ -907,7 +907,8 @@ static void warn_sec_mismatch(const char
 "before '%s' (at offset -0x%llx)\n",
 modname, fromsec, (unsigned long long)r.r_offset,
 secname, refsymname,
-elf->strtab + after->st_name);
+elf->strtab + after->st_name,
+(unsigned long long)r.r_offset);
} else {
warn("%s(%s+0x%llx): Section mismatch: reference to %s:%s\n",
 modname, fromsec, (unsigned long long)r.r_offset,
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC] [5/9] modpost: Fix a buffer overflow in modpost

2007-11-21 Thread Andi Kleen

When passing an file name > 1k the stack could be overflowed.
Not really a security issue, but still better plugged.


---
 scripts/mod/modpost.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux/scripts/mod/modpost.c
===
--- linux.orig/scripts/mod/modpost.c
+++ linux/scripts/mod/modpost.c
@@ -1656,7 +1656,6 @@ int main(int argc, char **argv)
 {
struct module *mod;
struct buffer buf = { };
-   char fname[SZ];
char *kernel_read = NULL, *module_read = NULL;
char *dump_write = NULL;
int opt;
@@ -1709,6 +1708,8 @@ int main(int argc, char **argv)
err = 0;
 
for (mod = modules; mod; mod = mod->next) {
+   char fname[strlen(mod->name) + 10];
+
if (mod->skip)
continue;
 
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC] [6/9] Implement namespace checking in modpost

2007-11-21 Thread Andi Kleen

This checks the namespaces at build time in modpost

---
 scripts/mod/modpost.c |  344 ++
 1 file changed, 317 insertions(+), 27 deletions(-)

Index: linux/scripts/mod/modpost.c
===
--- linux.orig/scripts/mod/modpost.c
+++ linux/scripts/mod/modpost.c
@@ -1,8 +1,9 @@
-/* Postprocess module symbol versions
+/* Postprocess module symbol versions and do various other module checks.
  *
  * Copyright 2003   Kai Germaschewski
  * Copyright 2002-2004  Rusty Russell, IBM Corporation
  * Copyright 2006   Sam Ravnborg
+ * Copyright 2007  Andi Kleen, SUSE Labs (changes licensed GPLv2 only)
  * Based in part on module-init-tools/depmod.c,file2alias
  *
  * This software may be used and distributed according to the terms
@@ -12,9 +13,13 @@
  */
 
 #include 
+#include 
 #include "modpost.h"
 #include "../../include/linux/license.h"
 
+#define NS_SEPARATOR '.'
+#define NS_SEPARATOR_STRING "."
+
 /* Are we using CONFIG_MODVERSIONS? */
 int modversions = 0;
 /* Warn about undefined symbols? (do so if we have vmlinux) */
@@ -27,6 +32,9 @@ static int external_module = 0;
 static int vmlinux_section_warnings = 1;
 /* Only warn about unresolved symbols */
 static int warn_unresolved = 0;
+/* Fixing those would cause too many ifdefs -- off by default. */
+static int warn_missing_modules = 0;
+
 /* How a symbol is exported */
 enum export {
export_plain,  export_unused, export_gpl,
@@ -105,19 +113,43 @@ static struct module *find_module(char *
return mod;
 }
 
-static struct module *new_module(char *modname)
+static const char *basename(const char *s)
+{
+   char *p = strrchr(s, '/');
+   if (p)
+   return p + 1;
+   return s;
+}
+
+static struct module *find_module_base(char *modname)
 {
struct module *mod;
-   char *p, *s;
 
-   mod = NOFAIL(malloc(sizeof(*mod)));
-   memset(mod, 0, sizeof(*mod));
-   p = NOFAIL(strdup(modname));
+   for (mod = modules; mod; mod = mod->next) {
+   if (strcmp(basename(mod->name), modname) == 0)
+   break;
+   }
+   return mod;
+}
 
+static void strip_o(char *p)
+{
+   char *s;
/* strip trailing .o */
if ((s = strrchr(p, '.')) != NULL)
if (strcmp(s, ".o") == 0)
*s = '\0';
+}
+
+static struct module *new_module(char *modname)
+{
+   struct module *mod;
+   char *p;
+
+   mod = NOFAIL(malloc(sizeof(*mod)));
+   memset(mod, 0, sizeof(*mod));
+   p = NOFAIL(strdup(modname));
+   strip_o(p);
 
/* add to list */
mod->name = p;
@@ -132,10 +164,12 @@ static struct module *new_module(char *m
  * struct symbol is also used for lists of unresolved symbols */
 
 #define SYMBOL_HASH_SIZE 1024
+#define NSALLOW_HASH_SIZE 64
 
 struct symbol {
struct symbol *next;
struct module *module;
+   const char *namespace;
unsigned int crc;
int crc_valid;
unsigned int weak:1;
@@ -147,10 +181,19 @@ struct symbol {
char name[0];
 };
 
+struct nsallow {
+   struct nsallow *next;
+   struct module *mod;
+   struct module *orig;
+   int ref;
+   char name[0];
+};
+
 static struct symbol *symbolhash[SYMBOL_HASH_SIZE];
+static struct nsallow *nsallowhash[NSALLOW_HASH_SIZE];
 
 /* This is based on the hash agorithm from gdbm, via tdb */
-static inline unsigned int tdb_hash(const char *name)
+static unsigned int tdb_hash(const char *name)
 {
unsigned value; /* Used to compute the hash value.  */
unsigned   i;   /* Used to cycle through random values. */
@@ -192,21 +235,67 @@ static struct symbol *new_symbol(const c
return new;
 }
 
-static struct symbol *find_symbol(const char *name)
+static struct symbol *find_symbol(const char *name, const char *ns)
 {
-   struct symbol *s;
+   struct symbol *s, *match;
 
/* For our purposes, .foo matches foo.  PPC64 needs this. */
if (name[0] == '.')
name++;
 
+   match = NULL;
for (s = symbolhash[tdb_hash(name) % SYMBOL_HASH_SIZE]; s; s=s->next) {
+   if (strcmp(s->name, name) == 0) {
+   match = s;
+   if (ns && s->namespace && strcmp(s->namespace, ns))
+   continue;
+   return s;
+   }
+   }
+   return ns ? NULL : match;
+}
+
+static struct nsallow *find_nsallow(const char *name, struct module *mod)
+{
+   struct nsallow *s;
+
+   for (s = nsallowhash[tdb_hash(name)%NSALLOW_HASH_SIZE]; s; s=s->next) {
+   if (strcmp(s->name, name) == 0 && s->mod == mod)
+   return s;
+   }
+   return NULL;
+}
+
+static s

[PATCH RFC] [3/9] modpost: Declare the modpost error functions as printf like

2007-11-21 Thread Andi Kleen

This way gcc can warn for wrong format strings

---
 scripts/mod/modpost.c |8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

Index: linux/scripts/mod/modpost.c
===
--- linux.orig/scripts/mod/modpost.c
+++ linux/scripts/mod/modpost.c
@@ -33,7 +33,9 @@ enum export {
export_unused_gpl, export_gpl_future, export_unknown
 };
 
-void fatal(const char *fmt, ...)
+#define PRINTF __attribute__ ((format (printf, 1, 2)))
+
+PRINTF void fatal(const char *fmt, ...)
 {
va_list arglist;
 
@@ -46,7 +48,7 @@ void fatal(const char *fmt, ...)
exit(1);
 }
 
-void warn(const char *fmt, ...)
+PRINTF void warn(const char *fmt, ...)
 {
va_list arglist;
 
@@ -57,7 +59,7 @@ void warn(const char *fmt, ...)
va_end(arglist);
 }
 
-void merror(const char *fmt, ...)
+PRINTF void merror(const char *fmt, ...)
 {
va_list arglist;
 
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC] [9/9] Add a inet namespace

2007-11-21 Thread Andi Kleen

Shared by IP, IPv6, DCCP, UDPLITE, SCTP. 

The symbols used by tunnel modules weren't put into any name space
because there are quite a lot of them.

---
 net/core/fib_rules.c|9 --
 net/ipv4/af_inet.c  |   52 
 net/ipv4/arp.c  |1 
 net/ipv4/icmp.c |   10 +++
 net/ipv4/inet_connection_sock.c |   40 +++---
 net/ipv4/inet_diag.c|4 +--
 net/ipv4/inet_hashtables.c  |8 +++---
 net/ipv4/inet_timewait_sock.c   |   12 -
 net/ipv4/ip_input.c |2 -
 net/ipv4/ip_output.c|7 +++--
 net/ipv4/ip_sockglue.c  |   10 +++
 11 files changed, 86 insertions(+), 69 deletions(-)

Index: linux/net/ipv4/af_inet.c
===
--- linux.orig/net/ipv4/af_inet.c
+++ linux/net/ipv4/af_inet.c
@@ -218,7 +218,7 @@ out:
 }
 
 u32 inet_ehash_secret __read_mostly;
-EXPORT_SYMBOL(inet_ehash_secret);
+EXPORT_SYMBOL_NS(inet, inet_ehash_secret);
 
 /*
  * inet_ehash_secret must be set exactly once
@@ -235,7 +235,7 @@ void build_ehash_secret(void)
inet_ehash_secret = rnd;
spin_unlock_bh(&inetsw_lock);
 }
-EXPORT_SYMBOL(build_ehash_secret);
+EXPORT_SYMBOL_NS(inet, build_ehash_secret);
 
 /*
  * Create an inet socket.
@@ -1127,7 +1127,7 @@ int inet_sk_rebuild_header(struct sock *
return err;
 }
 
-EXPORT_SYMBOL(inet_sk_rebuild_header);
+EXPORT_SYMBOL_NS(inet,inet_sk_rebuild_header);
 
 static int inet_gso_send_check(struct sk_buff *skb)
 {
@@ -1235,6 +1235,8 @@ unsigned long snmp_fold_field(void *mib[
}
return res;
 }
+/* AK: Not in inet namespace because they're a generic facility. Probably
+   should be in another file though. */
 EXPORT_SYMBOL_GPL(snmp_fold_field);
 
 int snmp_mib_init(void *ptr[2], size_t mibsize, size_t mibalign)
@@ -1499,20 +1501,30 @@ static int __init ipv4_proc_init(void)
 
 MODULE_ALIAS_NETPROTO(PF_INET);
 
-EXPORT_SYMBOL(inet_accept);
-EXPORT_SYMBOL(inet_bind);
-EXPORT_SYMBOL(inet_dgram_connect);
-EXPORT_SYMBOL(inet_dgram_ops);
-EXPORT_SYMBOL(inet_getname);
-EXPORT_SYMBOL(inet_ioctl);
-EXPORT_SYMBOL(inet_listen);
-EXPORT_SYMBOL(inet_register_protosw);
-EXPORT_SYMBOL(inet_release);
-EXPORT_SYMBOL(inet_sendmsg);
-EXPORT_SYMBOL(inet_shutdown);
-EXPORT_SYMBOL(inet_sock_destruct);
-EXPORT_SYMBOL(inet_stream_connect);
-EXPORT_SYMBOL(inet_stream_ops);
-EXPORT_SYMBOL(inet_unregister_protosw);
-EXPORT_SYMBOL(net_statistics);
-EXPORT_SYMBOL(sysctl_ip_nonlocal_bind);
+MODULE_NAMESPACE_ALLOW(inet, ipv6);
+MODULE_NAMESPACE_ALLOW(inet, udplite);
+MODULE_NAMESPACE_ALLOW(inet, dccp_ipv6);
+MODULE_NAMESPACE_ALLOW(inet, dccp_ipv4);
+MODULE_NAMESPACE_ALLOW(inet, dccp);
+MODULE_NAMESPACE_ALLOW(inet, sctp);
+
+/* RED-PEN: would be better to fix wanrouter */
+MODULE_NAMESPACE_ALLOW(inet, wanrouter);
+
+EXPORT_SYMBOL_NS(inet,inet_accept);
+EXPORT_SYMBOL_NS(inet,inet_bind);
+EXPORT_SYMBOL_NS(inet,inet_dgram_connect);
+EXPORT_SYMBOL_NS(inet,inet_dgram_ops);
+EXPORT_SYMBOL_NS(inet,inet_getname);
+EXPORT_SYMBOL_NS(inet,inet_ioctl);
+EXPORT_SYMBOL_NS(inet,inet_listen);
+EXPORT_SYMBOL_NS(inet,inet_register_protosw);
+EXPORT_SYMBOL_NS(inet,inet_release);
+EXPORT_SYMBOL_NS(inet,inet_sendmsg);
+EXPORT_SYMBOL_NS(inet,inet_shutdown);
+EXPORT_SYMBOL_NS(inet,inet_sock_destruct);
+EXPORT_SYMBOL_NS(inet,inet_stream_connect);
+EXPORT_SYMBOL_NS(inet,inet_stream_ops);
+EXPORT_SYMBOL_NS(inet,inet_unregister_protosw);
+EXPORT_SYMBOL_NS(inet,net_statistics);
+EXPORT_SYMBOL_NS(inet,sysctl_ip_nonlocal_bind);
Index: linux/net/ipv4/arp.c
===
--- linux.orig/net/ipv4/arp.c
+++ linux/net/ipv4/arp.c
@@ -1406,6 +1406,7 @@ static int __init arp_proc_init(void)
 
 #endif /* CONFIG_PROC_FS */
 
+/* No namespace because those are used by various drivers */
 EXPORT_SYMBOL(arp_broken_ops);
 EXPORT_SYMBOL(arp_find);
 EXPORT_SYMBOL(arp_create);
Index: linux/net/ipv4/icmp.c
===
--- linux.orig/net/ipv4/icmp.c
+++ linux/net/ipv4/icmp.c
@@ -1101,7 +1101,7 @@ void __init icmp_init(struct net_proto_f
}
 }
 
-EXPORT_SYMBOL(icmp_err_convert);
-EXPORT_SYMBOL(icmp_send);
-EXPORT_SYMBOL(icmp_statistics);
-EXPORT_SYMBOL(xrlim_allow);
+EXPORT_SYMBOL_NS(inet, icmp_err_convert);
+EXPORT_SYMBOL_NS(inet, icmp_send);
+EXPORT_SYMBOL_NS(inet, icmp_statistics);
+EXPORT_SYMBOL_NS(inet, xrlim_allow);
Index: linux/net/ipv4/inet_connection_sock.c
===
--- linux.orig/net/ipv4/inet_connection_sock.c
+++ linux/net/ipv4/inet_connection_sock.c
@@ -26,7 +26,7 @@
 
 #ifdef INET_CSK_DEBUG
 const char inet_csk_timer_bug_msg[] = "inet_csk BUG: unknown timer value\n";
-EXPORT_SYMBOL(inet_csk_timer_bug_msg);
+EXPORT_SYMBOL_NS(inet, inet_csk_timer_bug_msg);
 #endif
 
 /*
@@ -73,7 +73,7 @@ int inet_csk_bind_confl

[PATCH RFC] [8/9] Put UDP exports into a namespace

2007-11-21 Thread Andi Kleen

The UDP exports are only used by UDPv6 and UDP lite. They are internal functions
not supposed to be used by anybody else. So turn them into a name space that 
only allows those.

---
 net/ipv4/udp.c |   27 +++
 net/ipv4/udplite.c |6 +++---
 2 files changed, 18 insertions(+), 15 deletions(-)

Index: linux/net/ipv4/udp.c
===
--- linux.orig/net/ipv4/udp.c
+++ linux/net/ipv4/udp.c
@@ -105,6 +105,9 @@
 #include 
 #include "udp_impl.h"
 
+MODULE_NAMESPACE_ALLOW(udp, udplite);
+MODULE_NAMESPACE_ALLOW(udp, ipv6);
+
 /*
  * Snmp MIB for the UDP layer
  */
@@ -1641,18 +1644,18 @@ void udp4_proc_exit(void)
 }
 #endif /* CONFIG_PROC_FS */
 
-EXPORT_SYMBOL(udp_disconnect);
-EXPORT_SYMBOL(udp_hash);
-EXPORT_SYMBOL(udp_hash_lock);
-EXPORT_SYMBOL(udp_ioctl);
-EXPORT_SYMBOL(udp_get_port);
-EXPORT_SYMBOL(udp_prot);
-EXPORT_SYMBOL(udp_sendmsg);
-EXPORT_SYMBOL(udp_lib_getsockopt);
-EXPORT_SYMBOL(udp_lib_setsockopt);
-EXPORT_SYMBOL(udp_poll);
+EXPORT_SYMBOL_NS(udp, udp_disconnect);
+EXPORT_SYMBOL_NS(udp, udp_hash);
+EXPORT_SYMBOL_NS(udp, udp_hash_lock);
+EXPORT_SYMBOL_NS(udp, udp_ioctl);
+EXPORT_SYMBOL_NS(udp, udp_get_port);
+EXPORT_SYMBOL_NS(udp, udp_prot);
+EXPORT_SYMBOL_NS(udp, udp_sendmsg);
+EXPORT_SYMBOL_NS(udp, udp_lib_getsockopt);
+EXPORT_SYMBOL_NS(udp, udp_lib_setsockopt);
+EXPORT_SYMBOL_NS(udp, udp_poll);
 
 #ifdef CONFIG_PROC_FS
-EXPORT_SYMBOL(udp_proc_register);
-EXPORT_SYMBOL(udp_proc_unregister);
+EXPORT_SYMBOL_NS(udp, udp_proc_register);
+EXPORT_SYMBOL_NS(udp, udp_proc_unregister);
 #endif
Index: linux/net/ipv4/udplite.c
===
--- linux.orig/net/ipv4/udplite.c
+++ linux/net/ipv4/udplite.c
@@ -113,6 +113,6 @@ out_register_err:
printk(KERN_CRIT "%s: Cannot add UDP-Lite protocol.\n", __FUNCTION__);
 }
 
-EXPORT_SYMBOL(udplite_hash);
-EXPORT_SYMBOL(udplite_prot);
-EXPORT_SYMBOL(udplite_get_port);
+EXPORT_SYMBOL_NS(udp, udplite_hash);
+EXPORT_SYMBOL_NS(udp, udplite_prot);
+EXPORT_SYMBOL_NS(udp, udplite_get_port);
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC] [7/9] Convert TCP exports into namespaces

2007-11-21 Thread Andi Kleen

I defined two namespaces: tcp for TCP internals which are only used by 
tcp_ipv6.ko And tcpcong for exports used by the TCP congestion modules

No need to export any TCP internals to anybody else. So express this in a 
namespace.

I admit I'm not 100% sure tcpcong makes sense -- there might be a legitimate
need to have external out of tree congestion modules. They seem nearly like
drivers, but only nearly. If that was deemed the case it would be possible to 
remove tcpcong again to allow everybody to access this.

This implicitely turns all exports into GPL only, but that won't matter
because all modules allowed to import TCP functions are GPLed.

---
 net/ipv4/tcp.c   |   71 +++
 net/ipv4/tcp_cong.c  |   14 -
 net/ipv4/tcp_input.c |   12 +++
 net/ipv4/tcp_ipv4.c  |   38 -
 net/ipv4/tcp_minisocks.c |   12 +++
 net/ipv4/tcp_output.c|   12 +++
 net/ipv4/tcp_timer.c |2 -
 7 files changed, 87 insertions(+), 74 deletions(-)

Index: linux/net/ipv4/tcp.c
===
--- linux.orig/net/ipv4/tcp.c
+++ linux/net/ipv4/tcp.c
@@ -275,21 +275,21 @@ DEFINE_SNMP_STAT(struct tcp_mib, tcp_sta
 
 atomic_t tcp_orphan_count = ATOMIC_INIT(0);
 
-EXPORT_SYMBOL_GPL(tcp_orphan_count);
+EXPORT_SYMBOL_NS(tcp, tcp_orphan_count);
 
 int sysctl_tcp_mem[3] __read_mostly;
 int sysctl_tcp_wmem[3] __read_mostly;
 int sysctl_tcp_rmem[3] __read_mostly;
 
-EXPORT_SYMBOL(sysctl_tcp_mem);
-EXPORT_SYMBOL(sysctl_tcp_rmem);
-EXPORT_SYMBOL(sysctl_tcp_wmem);
+EXPORT_SYMBOL_NS(tcp, sysctl_tcp_mem);
+EXPORT_SYMBOL_NS(tcp, sysctl_tcp_rmem);
+EXPORT_SYMBOL_NS(tcp, sysctl_tcp_wmem);
 
 atomic_t tcp_memory_allocated; /* Current allocated memory. */
 atomic_t tcp_sockets_allocated;/* Current number of TCP sockets. */
 
-EXPORT_SYMBOL(tcp_memory_allocated);
-EXPORT_SYMBOL(tcp_sockets_allocated);
+EXPORT_SYMBOL_NS(tcp, tcp_memory_allocated);
+EXPORT_SYMBOL_NS(tcp, tcp_sockets_allocated);
 
 /*
  * Pressure flag: try to collapse.
@@ -299,7 +299,7 @@ EXPORT_SYMBOL(tcp_sockets_allocated);
  */
 int tcp_memory_pressure __read_mostly;
 
-EXPORT_SYMBOL(tcp_memory_pressure);
+EXPORT_SYMBOL_NS(tcp, tcp_memory_pressure);
 
 void tcp_enter_memory_pressure(void)
 {
@@ -309,7 +309,7 @@ void tcp_enter_memory_pressure(void)
}
 }
 
-EXPORT_SYMBOL(tcp_enter_memory_pressure);
+EXPORT_SYMBOL_NS(tcp, tcp_enter_memory_pressure);
 
 /*
  * Wait for a TCP event.
@@ -1995,7 +1995,7 @@ int compat_tcp_setsockopt(struct sock *s
return do_tcp_setsockopt(sk, level, optname, optval, optlen);
 }
 
-EXPORT_SYMBOL(compat_tcp_setsockopt);
+EXPORT_SYMBOL_NS(tcp, compat_tcp_setsockopt);
 #endif
 
 /* Return information about state of tcp endpoint in API format. */
@@ -2061,7 +2061,7 @@ void tcp_get_info(struct sock *sk, struc
info->tcpi_total_retrans = tp->total_retrans;
 }
 
-EXPORT_SYMBOL_GPL(tcp_get_info);
+EXPORT_SYMBOL_NS(tcp, tcp_get_info);
 
 static int do_tcp_getsockopt(struct sock *sk, int level,
int optname, char __user *optval, int __user *optlen)
@@ -2174,7 +2174,7 @@ int compat_tcp_getsockopt(struct sock *s
return do_tcp_getsockopt(sk, level, optname, optval, optlen);
 }
 
-EXPORT_SYMBOL(compat_tcp_getsockopt);
+EXPORT_SYMBOL_NS(tcp, compat_tcp_getsockopt);
 #endif
 
 struct sk_buff *tcp_tso_segment(struct sk_buff *skb, int features)
@@ -2262,7 +2262,7 @@ struct sk_buff *tcp_tso_segment(struct s
 out:
return segs;
 }
-EXPORT_SYMBOL(tcp_tso_segment);
+EXPORT_SYMBOL_NS(tcp, tcp_tso_segment);
 
 #ifdef CONFIG_TCP_MD5SIG
 static unsigned long tcp_md5sig_users;
@@ -2298,7 +2298,7 @@ void tcp_free_md5sig_pool(void)
__tcp_free_md5sig_pool(pool);
 }
 
-EXPORT_SYMBOL(tcp_free_md5sig_pool);
+EXPORT_SYMBOL_NS(tcp, tcp_free_md5sig_pool);
 
 static struct tcp_md5sig_pool **__tcp_alloc_md5sig_pool(void)
 {
@@ -2371,7 +2371,7 @@ retry:
return pool;
 }
 
-EXPORT_SYMBOL(tcp_alloc_md5sig_pool);
+EXPORT_SYMBOL_NS(tcp, tcp_alloc_md5sig_pool);
 
 struct tcp_md5sig_pool *__tcp_get_md5sig_pool(int cpu)
 {
@@ -2384,14 +2384,14 @@ struct tcp_md5sig_pool *__tcp_get_md5sig
return (p ? *per_cpu_ptr(p, cpu) : NULL);
 }
 
-EXPORT_SYMBOL(__tcp_get_md5sig_pool);
+EXPORT_SYMBOL_NS(tcp, __tcp_get_md5sig_pool);
 
 void __tcp_put_md5sig_pool(void)
 {
tcp_free_md5sig_pool();
 }
 
-EXPORT_SYMBOL(__tcp_put_md5sig_pool);
+EXPORT_SYMBOL_NS(tcp, __tcp_put_md5sig_pool);
 #endif
 
 void tcp_done(struct sock *sk)
@@ -2409,7 +2409,7 @@ void tcp_done(struct sock *sk)
else
inet_csk_destroy_sock(sk);
 }
-EXPORT_SYMBOL_GPL(tcp_done);
+EXPORT_SYMBOL_NS(tcp, tcp_done);
 
 extern void __skb_cb_too_small_for_tcp(int, int);
 extern struct tcp_congestion_ops tcp_reno;
@@ -2524,15 +2524,28 @@ void __init tcp_init(void)
tcp_register_congestion_control(&tcp_reno);
 }
 
-EXPORT_SYMBOL(tcp_close);
-EXPORT_SYMBOL(tcp_disconnect);

Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-21 Thread Andi Kleen

> I like this concept in general; I have one minor comment; right now
> your namespace argument is like
> 
> EXPORT_SYMBOL_NS(foo, some_symbol);
> 
> from a language-like pov I kinda wonder if it's nicer to do
> 
> EXPORT_SYMBOL_NS("foo", some_symbol);
> 
> because foo isn't something in C scope, but more a string-like
> identifier...

That wouldn't work for MODULE_ALLOW() because it appends the namespace
to other identifiers. I don't know of a way in the C processor to get
back from a string to a ## concatenable identifier.

For EXPORT_SYMBOL_NS it would be in theory possible, but making 
it asymmetric to MODULE_ALLOW would be ugly imho.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-22 Thread Andi Kleen
On Thursday 22 November 2007 04:56, Rusty Russell wrote:

> This is an interesting idea, thanks for the code!  My only question is
> whether we can get most of this benefit by dropping the indirection of
> namespaces and have something like "EXPORT_SYMBOL_TO(sym, modname)"?  It
> doesn't work so well for exporting to a group of modules, but that seems
> a reasonable line to draw anyway.

That would explode quickly already even for my example "inet" namespace.
It already has several modules.  I don't think so much duplication would be a 
good idea.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-22 Thread Andi Kleen
On Thursday 22 November 2007 12:06, Christoph Hellwig wrote:
> Very nice, looking forward to organize the exports mess a bit more.

I would need people to help me converting more subsystems to this new scheme.

In particular all exports that are only used by a single module are direct 
candidates for a namespace. For the others it has to be checked by someone
who knows the particular subsystem well.

e.g. for SCSI it would be nice to put the symbols only used by sd/sr/sg etc.
into a namespace.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-22 Thread Andi Kleen

> Creating the DCCP and its congestion control infrastructure (CCID)
> module namespaces is now on my TODO list. :-)

My original patchkit had DCCP actually done, but I ran into some problem
while forward porting and disabled it again. But should be reasonably
easy to resurrect.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-22 Thread Andi Kleen

> Andy, I like your idea.  IMHO, as Rusty said a simple EXPORT_SYMBOL_TO
> is better.

I don't think so. e.g. tcpcong would be very very messy this way.

> And I wonder if it is possible to export to something like  the struct
> device_driver? If it's possible then it will not limited to modules.

Not sure I follow you. Can you expand? 

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-22 Thread Andi Kleen
On Friday 23 November 2007 01:25, Rusty Russell wrote:
> On Thursday 22 November 2007 22:05:45 Christoph Hellwig wrote:
> > On Thu, Nov 22, 2007 at 02:56:22PM +1100, Rusty Russell wrote:
> > > This is an interesting idea, thanks for the code!  My only question
> > > is whether we can get most of this benefit by dropping the indirection
> > > of namespaces and have something like "EXPORT_SYMBOL_TO(sym, modname)"?
> > >  It doesn't work so well for exporting to a group of modules, but that
> > > seems a reasonable line to draw anyway.
> >
> > I'd say exporting to a group of modules is the main use case.  E.g. in
> > scsi there would be symbols exported to transport class modules only
> > or lots of the vfs_ symbols would be exported only to stackable
> > filesystems or nfsd.
>
> That's my point.  If there's a whole class of modules which can use a
> symbol, why are we ruling out external modules? 

The point is to get cleaner interfaces. Anything which is kind of internal
should only be used by closely related in tree modules which can be updated. 
Point of is not to be some kind of license enforcer or similar, there 
are already other mechanisms for that. Just to get the set of really
public kernel interfaces down to a manageable level.

But I still think exporting only to a single module would be to limiting 
for this case even. It would work for the TCP<->ipv6.ko post child,
but not for some of the other networking cases where it makes sense.

> If that's what you want, 
> why not have a list of permitted modules compiled into the kernel and allow
> no others?

That would not make the relationship explicit, which would not further
the goal.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-23 Thread Andi Kleen
On Fri, Nov 23, 2007 at 02:35:05PM +1100, Rusty Russell wrote:
> On Friday 23 November 2007 12:36:22 Andi Kleen wrote:
> > On Friday 23 November 2007 01:25, Rusty Russell wrote:
> > > That's my point.  If there's a whole class of modules which can use a
> > > symbol, why are we ruling out external modules?
> >
> > The point is to get cleaner interfaces.
> 
> But this doesn't change interfaces at all.  It makes modules fail to load 
> unless they're on a permitted list, which now requires maintenance.

The modules wouldn't be using the internal interfaces in the first
place with name spaces in place. This serves as a documentation
on what is considered internal. And if some obscure module (in or
out of tree) wants to use an internal interface they first have
to send the module maintainer a patch and get some review this way.

I believe that is fairly important in tree too because the 
kernel has become so big now that review cannot be the only
enforcement mechanism for this anymore.

Another secondary reason is that there are too many exported interfaces
in general. Several distributions have policies that require to 
keep the changes to these exported interfaces minimal and that
is very hard with thousands of exported symbol.  With name spaces
the number of truly publicly exported symbols will hopefully
shrink to a much smaller, more manageable set.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-24 Thread Andi Kleen
On Sat, Nov 24, 2007 at 03:53:34PM +1100, Rusty Russell wrote:
> So, you're saying that there's a problem with in-tree modules using symbols 
> they shouldn't?  Can you give an example?
> 
> > I believe that is fairly important in tree too because the
> > kernel has become so big now that review cannot be the only
> > enforcement mechanism for this anymore.
> 
> If people aren't reviewing, this won't make them review.  I don't think the 

With millions of LOC the primary maintainers cannot review everything.
It's not that anybody is doing a bad job -- it is just so much code
that explicit mechanisms are better than implicit contracts.

> problem is that people are conniving to avoid review.

No of course not -- it is just too much code to let everything
be reviewed by the core subsystem maintainers. But with explicit
marking of internal symbols they would need to look at it because
the relationship will be clearly spelled out in the code.

> > Several distributions have policies that require to 
> > keep the changes to these exported interfaces minimal and that
> > is very hard with thousands of exported symbol.  With name spaces
> > the number of truly publicly exported symbols will hopefully
> > shrink to a much smaller, more manageable set.
> 
> *This* makes sense.  But it's not clear that the burden should be placed on 
> kernel coders.  You can create a list yourself.  How do I tell the difference 
> between "truly publicly exported" symbols and others?

Out of tree solutions generally do not scale.  Nobody else can 
keep up with 2+ Million changes each merge window.

> 
> If a symbol has more than one in-tree user, it's hard to argue against an 

There are still classes of drivers. e.g. for the SCSI example: SD,SG,SR etc.
are more internal while low level drivers like aic7xxx are clearly external
drivers.

> out-of-tree module using the symbol, unless you're arguing against *all* 
> out-of-tree modules.

No, actually namespaces kind of help out of tree modules. Once they only
use interfaces that are really generic driver interfaces and fairly stable
their authors will have much less pain forward porting to newer kernel
version. But currently the authors cannot even know what is an instable
internal interface and what is a generic relatively stable driver level
interface. Namespaces are a mechanism to make this all explicit.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-27 Thread Andi Kleen

> > Perhaps you've got lots of patches were people are using internal APIs they 
> > shouldn't?
> > 
> 
> Maybe the issue is "who can tell" since what is external and what is
> internal is not explicitly defined?

Exactly.  Or rather it is not defined on the module level. We got 
"static" of course, but I think we should have a similar mechanism
on a module level.


> Explicitly documenting what comprises the kernel API (external,
> supported) 

It would not be fully supported either -- can still change etc. --
but there is a reasonable expectation that those external
APIs will change less often than internal interfaces.

> - forcing developers to identify their exports as part of the
> implementation or as part of the kernel API

That is EXPORT_SYMBOL already. The trouble is just that it covers
too much. My patchkit is trying to limit it again for a specific
use case -- exporting an "internal" interface to another module.
Or rather a set of modules. 

Standard example is TCP: TCP exports nearly everything and the
single user is the TCP code in ipv6.ko. Instead those symbols should
be limited to be only accessable to ipv6.ko. 

The reason I went with the more generic namespace mechanism
instead of EXPORT_SYMBOL_TO() is that ipv6 is ever split up
it would still work. 

Also using namespaces doesn't have any more overhead than
EXPORT_SYMBOL_TO() and the complexity is about the same
(not very much anyways -- just look at the patches) 

> - making it easier for reviewers to identify when developers are adding
> to the kernel API and thereby focusing the appropriate level of review
> to the new function

That is another reason.
 
-ANdi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-27 Thread Andi Kleen
On Tue, Nov 27, 2007 at 03:26:52PM +1100, Rusty Russell wrote:
> On Monday 26 November 2007 16:58:08 Roland Dreier wrote:
> >  > > I agree that we shouldn't make things too hard for out-of-tree
> >  > > modules, but I disagree with your first statement: there clearly is a
> >  > > large class of symbols that are used by multiple modules but which are
> >  > > not generically useful -- they are only useful by a certain small
> >  > > class of modules.
> >  >
> >  > If it is so clear, you should be able to easily provide examples?
> >
> > Sure -- Andi's example of symbols required only by TCP congestion
> > modules;
> 
> Exactly.  Why exactly should someone not write a new TCP congestion module?

Agreed the congestion modules are a corner case. I even mentioned that in the
patch. I would be happy to drop that one if that is the consensus.
It was more done as a example anyways. That is why I made it an separate
namespace from "tcp"

But for many other TCP symbols it makes a lot of sense: all the functions
only used by tcp_ipv6.c. If someone wants to write support for a "IPv7" or
similar they really should do it in tree. So I think the "tcp" and  "inet"
namespaces make a lot of sense.

> Right.  So presumably there will only ever be two drivers using this core 
> code, so no new users will ever be written?  Now we've found one use case, is 

If there are new users they will need to get proper review and should
be in tree.  MODULE_ALLOW() enforces that.

> it worth the complexity of namespaces?  Is it worth the halfway point of 

What complexity? You're always claiming it is complex. It isn't really.

> export-to-module?
> 

> No, we've seen the solution and various people applying it.  I'm still trying 
> to discover the problem it's solving.

Again for rusty @)

Goals are:
- Limit the interfaces available for out of tree modules to reasonably 
stable ones that are already used by a larger set of drivers.

This can also have further downstream advantages.
For example it might be a reasonable future rule to require all unconditionally
EXPORT_SYMBOL()s to have a complete LinuxDoc documentation entry.

- Explicitely declare in source what is clearly internal and not intended to
be a generally usable interface.

e.g. for the LinuxDoc example above such internal functions don't necessarily
need full LinuxDoc documentation.

- Force review from core maintainers for use of such internal interfaces

- Limit size of exported API to make stable ABIs for enterprise
distributions easier
[Yes I know that is not a popular topic on l-k, but it's a day-to-day
problem for these distros and out of tree solutions do not work]

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-27 Thread Andi Kleen
> OK, short of making IPv4 a module (which would be a worthy task :)

At some point there were patches, it is probably not very difficult.
But DaveM resisted at some point because he didn't want people
to replace the network stack (although I personally don't have a problem
with that)

> do you have an example where a symbol is used by more than one module
> but needs to be put into a namespace?

For SCSI: SD,SR,SG etc.
For Networking: e.g. symbols i put into inet, which are only
used by protocols (sctp, dccp, udplite, ipv6)
I already caught someone doing something wrong with that BTW -- 
wanrouter clearly does some things it shouldn't be doing. 
Or the fib namespace, where all the fib functions should be only
used by the two fib_* modules and ipv6/decnet.
For KVM: the exports used by kvm_amd/intel
For arch/x86: e.g. e820_*, IO_APIC_get_PCI_irq_vector, swiotlb, 
a lot of stuff only used by acpi processor.c, pcibios*, etc.etc.
Roland gave another example for Infiniband.

Also in general it allows flexibility later if modules get split.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-27 Thread Andi Kleen
On Tue, Nov 27, 2007 at 08:43:24AM -0700, Jonathan Corbet wrote:
> Rusty said:
> 
> > Well, introduce an EXPORT_SYMBOL_INTERNAL().  It's a lot less code.  But 
> > you'd 
> > still need to show that people are having trouble knowing what APIs to use.
> 
> Might the recent discussion on the exporting of sys_open() and
> sys_read() be an example here?  There would appear to be a consensus
> that people should not have used those functions, but they are now
> proving difficult to unexport.

That is a good example yes.

> Perhaps the best use of the namespace stuff might be for *future*
> exports which are needed to help make the mainline kernel fully modular,
> but which are not meant to be part of a more widely-used API?

Not sure about future only, but yes that is its intention.
Thanks for putting it clearly.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-27 Thread Andi Kleen
> With my "Enterprise" hat on, I can see where Andi was coming from
> originally. 

For the record my original motivation was to fix the "TCP exports everything
for ipv6.ko" case cleanly. I later realized that it would be useful for the
ABI stability issues too, but it was really not my primary motivation.
This is why I didn't even mention that in the original patch description.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-27 Thread Andi Kleen
On Tue, Nov 27, 2007 at 03:00:22PM -0800, Stephen Hemminger wrote:
> On Tue, 27 Nov 2007 23:37:43 +0100
> Andi Kleen <[EMAIL PROTECTED]> wrote:
> 
> > > With my "Enterprise" hat on, I can see where Andi was coming from
> > > originally. 
> > 
> > For the record my original motivation was to fix the "TCP exports everything
> > for ipv6.ko" case cleanly. I later realized that it would be useful for the
> > ABI stability issues too, but it was really not my primary motivation.
> > This is why I didn't even mention that in the original patch description.
> 
> Since ipv6 can never be removed because it references itself, the whole 
> concept

AFAIK that is obsolete anyways. It was done because it was feared it would
be broken, but at least the broken cases I knew about were all fixed
a long time ago. Most likely it could be safely removed.

> of a modular ipv6 is flawed. 

Modules that cannot be unloaded are still useful. Standard case: Distributions
like to offer an option to not use ipv6 because that is popular workaround
for the common "DNS server eats  queries and causes delays" issue.
Forcing the user to rebuild the kernel for this wouldn't be practical.
If ipv6 wasn't modular that would be hard to do.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-28 Thread Andi Kleen
On Wednesday 28 November 2007 17:48:17 Adrian Bunk wrote:
> On Wed, Nov 28, 2007 at 12:06:45AM +0100, Andi Kleen wrote:
> > On Tue, Nov 27, 2007 at 03:00:22PM -0800, Stephen Hemminger wrote:
> >...
> > > of a modular ipv6 is flawed. 
> > 
> > Modules that cannot be unloaded are still useful. Standard case: 
> > Distributions
> > like to offer an option to not use ipv6 because that is popular workaround
> > for the common "DNS server eats  queries and causes delays" issue.
> > Forcing the user to rebuild the kernel for this wouldn't be practical.
> > If ipv6 wasn't modular that would be hard to do.
> 
> It should be trivial doing it similar to the selinux=0 boot option.

They safe also a few hundred KB of memory this way.
I know it is not en vogue anymore to care about memory bloat, but
I personally like that.

-Andi 

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-29 Thread Andi Kleen

> I think it would be good if you could specify a default namespace
> per module, that could reduce the amount of necessary changes significantly.

But also give less documentation. It's also not that difficult to mark
the exports once. I've forward ported such patches over a few kernels
and didn't run into significant me

> obj-$(CONFIG_COMBINED) += combined.o
> combined-$(CONFIG_SUBOPTION) += combined_main.o combined_other.o
> obj-$(CONFIG_SINGLE) += single.o
> obj-$(CONFIG_OTHER) += other.o
> obj-$(CONFIG_API) += api.o
> 
> NAMESPACE = subsys   # default, used for other.o
> NAMESPACE_single.o = single  # used only for single.o
> NAMESPACE_combined.o = combined  # all parts of combined.o
> NAMESPACE_combined_other.o = special #except this one
> NAMESPACE_api.o =# api.o is put into the global ns

I would prefer to keep that inside the source files, again for 
documentation purposes. One goal of namespace was to make something
that was previously kind of implicit explicit and the default name
spaces would work against that again I think.

-Andi
 
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] [3/9] modpost: Declare the modpost error functions as printf like

2007-12-10 Thread Andi Kleen
On Mon, Dec 10, 2007 at 07:50:08PM +0100, Sam Ravnborg wrote:
> On Thu, Nov 22, 2007 at 03:43:08AM +0100, Andi Kleen wrote:
> > 
> > This way gcc can warn for wrong format strings
> 
> This loks good. Can I get i s-o-b then I will apply it.

Sorry must have been left out by mistake.

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

for both patches.

-Andi

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] [5/9] modpost: Fix a buffer overflow in modpost

2007-12-10 Thread Andi Kleen
On Monday 10 December 2007 20:32, Sam Ravnborg wrote:
> On Thu, Nov 22, 2007 at 03:43:10AM +0100, Andi Kleen wrote:
> > When passing an file name > 1k the stack could be overflowed.
> > Not really a security issue, but still better plugged.
>
> Looks good. A s-o-b line again please.

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>

> Although I am not so happy with the ue of gcc extensions.

That's not a gcc extension. It's C99.

-Andi
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Problem with implementation of TCP_DEFER_ACCEPT?

2007-09-02 Thread Andi Kleen
TJ <[EMAIL PROTECTED]> writes:
> 
> Therefore, anyone deploying apache web servers in a web-farm behind the
> Juniper DX load-balanders and using TCP multiplexing (for which they pay
> a hefty licence fee!) 

If they ask for that much money they can surely fix it to work
properly too? 

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates

2007-09-08 Thread Andi Kleen
James Chapman <[EMAIL PROTECTED]> writes:
> 
> Clearly, keeping a device in polled mode for 1-2 jiffies 

1-2 jiffies can be a long time on a HZ=100 kernel (20ms). A fast CPU
could do a lot of loops in this time, which would be waste of power
and CPU time.

On some platforms the precise timers (like ktime_get()) can be slow,
but often they are fast.  It might make sense to use a shorter
constant time wait on those with fast timers at least. Right now this
cannot be known by portable code, but there was a proposal some time
ago to export some global estimate to tell how fast
ktime_get().et.al. are. That could be reviewed.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: why does tcp_v[46]_conn_request not inc MIB stats

2007-09-18 Thread Andi Kleen
David Miller <[EMAIL PROTECTED]> writes:

> This is limiting embryonic mini-socket creation.  The listen overflow
> should only increment when the 3-way handshake completion is aborted
> because the listening socket limit is exceeded, which is entirely
> different from the embryonic limit.

That's true, but I think Rick has a point in that there should 
be some sort of (different) counter counting this.  In general
I believe every point in the stack who drops a packet should
have a statistics counter so that it can be later diagnosed.

Rick, the best way to get such a counter in is to just send a patch,
don't ask for it.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: follow-up: discrepancy with POSIX

2007-09-19 Thread Andi Kleen
Ulrich Drepper <[EMAIL PROTECTED]> writes:

> 
>fd = socket(AT_INET6, ...)
> 
>connect(fd, ...some IPv6 address...)
> 
>struct sockaddr_in6 sin6 = { .sin6_family = AF_INET6 };
>connect(fd, &sin6, sizeof (sin6));

The standard way to undo connect is to use AF_UNSPEC. Code to handle
that for dgram sockets is there. It's the same code for v4 and v6.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: follow-up: discrepancy with POSIX

2007-09-19 Thread Andi Kleen
On Wed, Sep 19, 2007 at 09:49:09AM -0700, Ulrich Drepper wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> Andi Kleen wrote:
> > The standard way to undo connect is to use AF_UNSPEC. Code to handle
> > that for dgram sockets is there. It's the same code for v4 and v6.
> 
> I quoted the standard and it does not say anything about AF_UNSPEC.  So
> you cannot simply make such broad statements.

Ok "standard" was perhaps a poor choice of words.

AF_UNSPEC used to be introduced long ago by Alan based on some early 
POSIX draft iirc.

Also incidentially it's a null address:

include/linux/socket.h:#define AF_UNSPEC0

> But the spec calls for a "null address" to be used and that's in my
> understanding something different from using AF_UNSPEC.

memset(&sockaddr, 0, sizeof(sockaddr)) should give you AF_UNSPEC

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: follow-up: discrepancy with POSIX

2007-09-19 Thread Andi Kleen
On Wed, Sep 19, 2007 at 10:46:54AM -0700, Ulrich Drepper wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> Andi Kleen wrote:
> >> But the spec calls for a "null address" to be used and that's in my
> >> understanding something different from using AF_UNSPEC.
> > 
> > memset(&sockaddr, 0, sizeof(sockaddr)) should give you AF_UNSPEC
> 
> But the spec calls for null address for the protocol.
> 
> That means the family for the null address is the same as the family of
> the socket.

Spec doesn't match traditional behaviour then. IPv4 0.0.0.0 is 
traditionally an synonym for old style all broadcast (255.255.255.255) 
on UDP/RAW and it's certainly possible to connect() to that. 

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: follow-up: discrepancy with POSIX

2007-09-19 Thread Andi Kleen
On Wed, Sep 19, 2007 at 11:02:00AM -0700, Ulrich Drepper wrote:
> > on UDP/RAW and it's certainly possible to connect() to that. 
> 
> Where do you get this from?  And where is this implemented?  I don't

Sorry it's actually loopback, not broadcast as implemented in Linux.
In Linux it's implemented in ip_route_output_slow(). Essentially
converted to 127.0.0.1

I think it's traditional BSD behaviour but couldn't find it on
a quick look in FreeBSD source (but haven't looked very intensively) 

Admittedly port 0 is somewhat dodgy for UDP too, but at least in RAW
context it might be valid.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: follow-up: discrepancy with POSIX

2007-09-19 Thread Andi Kleen
> It has hung-on in various places (stacks) as an "accepted" broadcast IP 
> in the receive path, but not the send path for quite possibly decades now.

Well it is valid in Linux for sending.  And who knows who relies on it.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/PATCH 2/3] UDP memory usage accounting: accounting unit and variable

2007-09-21 Thread Andi Kleen
Satoshi OSHIMA <[EMAIL PROTECTED]> writes:

> This patch introduces global variable for UDP memory accounting.
> The unit is page.

The global variable doesn't seem to be very MP scalable, especially
if you change it for each packet. This will be a very hot cache line,
in the worst case bouncing around a large machine.

Possible alternatives:
- Per CPU variables
- You only change the global on socket creation time (by pre allocating a large
amount) or when the system comes under memory pressure.
- Batching of the global updates for multiple packets [that's a variant
of the previous one, might be still too costly though]

Also for such variables it's usually good to cache line pad them on SMP
to avoid false sharing with something else.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] bnx2: factor out gzip unpacker

2007-09-21 Thread Andi Kleen
Denys Vlasenko <[EMAIL PROTECTED]> writes:
> 
> I plan to use gzip compression on following drivers' firmware,
> if patches will be accepted:
> 
>textdata bss dec hex filename
>   17653  109968 240  127861   1f375 drivers/net/acenic.o
>6628  120448   4  127080   1f068 drivers/net/dgrs.o
>  ^^

Just change the makefiles to always install gzip'ed modules
modutils knows how to unzip them on the fly.

That will catch all the firmware and all the other code too.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Make TCP prequeue configurable

2007-10-01 Thread Andi Kleen
David Miller <[EMAIL PROTECTED]> writes:
> 
> Furthermore, prequeue puts the stack input processing work into user
> context, which means that the users will be charged more fairly for
> the work that is done for them.

For more details on this people might want to read the old Lazy Receiver
Processing papers: http://www.cs.rice.edu/CS/Systems/LRP/

-Andi

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PROBLEM: tg3 spitting out uninitialized memory

2007-04-15 Thread Andi Kleen
> That's right.  See the discussion on a similar problem here:
> 
>  59644&admit=-682735245+1158251313899+28353475>

That's an AMD machine.

> 
> In this other case, the symptoms were very similar to yours.
> I suspected it was an IOMMU problem, and the problem went
> away when the user upgraded to the latest kernel at that
> time.

tg3s never use the IOMMU on AMD systems with production kernels
because they have a suitably large DMA mask. IOMMU is only used
there to remap memory that doesn't fit a device's DMA mask.
There is a debugging mode to force iommu always, but it is not used 
unless you configure it specially.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SIOCGIFCOUNT

2007-04-21 Thread Andi Kleen
On Saturday 21 April 2007 00:41:33 David Miller wrote:
> From: Andi Kleen <[EMAIL PROTECTED]>
> Date: Sat, 21 Apr 2007 00:04:51 +0200
> 
> > 0x8938 is SIOCGIFCOUNT. As far as I can see it is only defined in
> > sockios.h, but not implemented?
> >
> > Should it be removed from the include file or implemented?
> 
> Unless we can be %100 certain not one build will break by removing it,
> we should keep it around.

Ok, but should it not be implemented for the non compat case too?
I assume java has some reason to call it.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   3   4   5   6   >