from:"Andrew Morton"

Re: [PATCH] lib: remove "expecting prototype" kernel-doc warnings

2021-04-13 Thread Andrew Morton

On Sun, 11 Apr 2021 15:17:56 -0700 Randy Dunlap  wrote:

> Fix various kernel-doc warnings in lib/ due to missing or
> erroneous function names.
> Add kernel-doc for some function parameters that was missing.
> Use kernel-doc "Return:" notation in earlycpio.c.
> 
> Quietens the following warnings:
> 
> ../lib/earlycpio.c:61: warning: expecting prototype for cpio_data 
> find_cpio_data(). Prototype was for find_cpio_data() instead
> 
> ../lib/lru_cache.c:640: warning: expecting prototype for lc_dump(). Prototype 
> was for lc_seq_dump_details() instead
> lru_cache.c:90: warning: Function parameter or member 'cache' not described 
> in 'lc_create'

I'm still seeing this.

> lru_cache.c:90: warning: Function parameter or member 'cache' not described 
> in 'lc_create'

But it looks OK:

/**
 * lc_create - prepares to track objects in an active set
 * @name: descriptive name only used in lc_seq_printf_stats and 
lc_seq_dump_details
 * @cache: cache root pointer
 * @max_pending_changes: maximum changes to accumulate until a transaction is 
required
 * @e_count: number of elements allowed to be active simultaneously
 * @e_size: size of the tracked objects
 * @e_off: offset to the &struct lc_element member in a tracked object
 *
 * Returns a pointer to a newly initialized struct lru_cache on success,
 * or NULL on (allocation) failure.
 */
struct lru_cache *lc_create(const char *name, struct kmem_cache *cache,
unsigned max_pending_changes,
unsigned e_count, size_t e_size, size_t e_off)
{

Re: [PATCH V2 1/1] mm:improve the performance during fork

2021-03-30 Thread Andrew Morton

On Mon, 29 Mar 2021 20:36:35 +0800 qianjun.ker...@gmail.com wrote:

> From: jun qian 
> 
> In our project, Many business delays come from fork, so
> we started looking for the reason why fork is time-consuming.
> I used the ftrace with function_graph to trace the fork, found
> that the vm_normal_page will be called tens of thousands and
> the execution time of this vm_normal_page function is only a
> few nanoseconds. And the vm_normal_page is not a inline function.
> So I think if the function is inline style, it maybe reduce the
> call time overhead.
> 
> I did the following experiment:
> 
> use the bpftrace tool to trace the fork time :
> 
> bpftrace -e 'kprobe:_do_fork/comm=="redis-server"/ {@st=nsecs;} \
> kretprobe:_do_fork /comm=="redis-server"/{printf("the fork time \
> is %d us\n", (nsecs-@st)/1000)}'
> 
> no inline vm_normal_page:
> result:
> the fork time is 40743 us
> the fork time is 41746 us
> the fork time is 41336 us
> the fork time is 42417 us
> the fork time is 40612 us
> the fork time is 40930 us
> the fork time is 41910 us
> 
> inline vm_normal_page:
> result:
> the fork time is 39276 us
> the fork time is 38974 us
> the fork time is 39436 us
> the fork time is 38815 us
> the fork time is 39878 us
> the fork time is 39176 us
> 
> In the same test environment, we can get 3% to 4% of
> performance improvement.
> 
> note:the test data is from the 4.18.0-193.6.3.el8_2.v1.1.x86_64,
> because my product use this version kernel to test the redis
> server, If you need to compare the latest version of the kernel
> test data, you can refer to the version 1 Patch.
> 
> We need to compare the changes in the size of vmlinux:
>   inline   non-inline   diff
> vmlinux size  9709248 bytes9709824 bytes-576 bytes
> 

I get very different results with gcc-7.2.0:

q:/usr/src/25> size mm/memory.o
   textdata bss dec hex filename
  748983375  64   78337   13201 mm/memory.o-before
  751193363  64   78546   132d2 mm/memory.o-after

That's a somewhat significant increase in code size, and larger code
size has a worsened cache footprint.

Not that this is necessarily a bad thing for a function which is
tightly called many times in succession as is vm__normal_page()

> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -592,7 +592,7 @@ static void print_bad_pte(struct vm_area_struct *vma, 
> unsigned long addr,
>   * PFNMAP mappings in order to support COWable mappings.
>   *
>   */
> -struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> +inline struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long 
> addr,
>   pte_t pte)
>  {
>   unsigned long pfn = pte_pfn(pte);

I'm a bit surprised this made any difference - rumour has it that
modern gcc just ignores `inline' and makes up its own mind.  Which is
why we added __always_inline.

Re: [mm, net-next v2] mm: net: memcg accounting for TCP rx zerocopy

2021-03-17 Thread Andrew Morton

On Mon, 15 Mar 2021 18:30:03 -0700 Arjun Roy  wrote:

> From: Arjun Roy 
> 
> TCP zerocopy receive is used by high performance network applications
> to further scale. For RX zerocopy, the memory containing the network
> data filled by the network driver is directly mapped into the address
> space of high performance applications. To keep the TLB cost low,
> these applications unmap the network memory in big batches. So, this
> memory can remain mapped for long time. This can cause a memory
> isolation issue as this memory becomes unaccounted after getting
> mapped into the application address space. This patch adds the memcg
> accounting for such memory.
> 
> Accounting the network memory comes with its own unique challenges.
> The high performance NIC drivers use page pooling to reuse the pages
> to eliminate/reduce expensive setup steps like IOMMU. These drivers
> keep an extra reference on the pages and thus we can not depend on the
> page reference for the uncharging. The page in the pool may keep a
> memcg pinned for arbitrary long time or may get used by other memcg.
> 
> This patch decouples the uncharging of the page from the refcnt and
> associates it with the map count i.e. the page gets uncharged when the
> last address space unmaps it. Now the question is, what if the driver
> drops its reference while the page is still mapped? That is fine as
> the address space also holds a reference to the page i.e. the
> reference count can not drop to zero before the map count.

What tree were you hoping to get this merged through?  I'd suggest net
- it's more likely to get tested over there.

>
> ...
>
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c

These changes could be inside #ifdef CONFIG_NET.  Although I expect
MEMCG=y&&NET=n is pretty damn rare.

Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-10 Thread Andrew Morton

On Wed, 10 Mar 2021 10:46:15 + Mel Gorman  
wrote:

> This patch adds a new page allocator interface via alloc_pages_bulk,
> and __alloc_pages_bulk_nodemask. A caller requests a number of pages
> to be allocated and added to a list. They can be freed in bulk using
> free_pages_bulk().

Why am I surprised we don't already have this.

> The API is not guaranteed to return the requested number of pages and
> may fail if the preferred allocation zone has limited free memory, the
> cpuset changes during the allocation or page debugging decides to fail
> an allocation. It's up to the caller to request more pages in batch
> if necessary.
> 
> Note that this implementation is not very efficient and could be improved
> but it would require refactoring. The intent is to make it available early
> to determine what semantics are required by different callers. Once the
> full semantics are nailed down, it can be refactored.
> 
> ...
>
> +/* Drop reference counts and free order-0 pages from a list. */
> +void free_pages_bulk(struct list_head *list)
> +{
> + struct page *page, *next;
> +
> + list_for_each_entry_safe(page, next, list, lru) {
> + trace_mm_page_free_batched(page);
> + if (put_page_testzero(page)) {
> + list_del(&page->lru);
> + __free_pages_ok(page, 0, FPI_NONE);
> + }
> + }
> +}
> +EXPORT_SYMBOL_GPL(free_pages_bulk);

I expect that batching games are planned in here as well?

>  static inline unsigned int
>  gfp_to_alloc_flags(gfp_t gfp_mask)
>  {
> @@ -4919,6 +4934,9 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, 
> unsigned int order,
>   struct alloc_context *ac, gfp_t *alloc_mask,
>   unsigned int *alloc_flags)
>  {
> + gfp_mask &= gfp_allowed_mask;
> + *alloc_mask = gfp_mask;
> +
>   ac->highest_zoneidx = gfp_zone(gfp_mask);
>   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
>   ac->nodemask = nodemask;
> @@ -4960,6 +4978,99 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, 
> unsigned int order,
>   return true;
>  }
>  
> +/*
> + * This is a batched version of the page allocator that attempts to
> + * allocate nr_pages quickly from the preferred zone and add them to list.
> + */

Documentation is rather lame.  Returns number of pages allocated...

> +int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
> + nodemask_t *nodemask, int nr_pages,
> + struct list_head *alloc_list)
> +{
> + struct page *page;
> + unsigned long flags;
> + struct zone *zone;
> + struct zoneref *z;
> + struct per_cpu_pages *pcp;
> + struct list_head *pcp_list;
> + struct alloc_context ac;
> + gfp_t alloc_mask;
> + unsigned int alloc_flags;
> + int alloced = 0;
> +
> + if (nr_pages == 1)
> + goto failed;
> +
> + /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> + if (!prepare_alloc_pages(gfp_mask, 0, preferred_nid, nodemask, &ac, 
> &alloc_mask, &alloc_flags))
> + return 0;
> + gfp_mask = alloc_mask;
> +
> + /* Find an allowed local zone that meets the high watermark. */
> + for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, 
> ac.highest_zoneidx, ac.nodemask) {
> + unsigned long mark;
> +
> + if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
> + !__cpuset_zone_allowed(zone, gfp_mask)) {
> + continue;
> + }
> +
> + if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone &&
> + zone_to_nid(zone) != 
> zone_to_nid(ac.preferred_zoneref->zone)) {
> + goto failed;
> + }
> +
> + mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + 
> nr_pages;
> + if (zone_watermark_fast(zone, 0,  mark,
> + zonelist_zone_idx(ac.preferred_zoneref),
> + alloc_flags, gfp_mask)) {
> + break;
> + }
> + }

I suspect the above was stolen from elsewhere and that some code
commonification is planned.


> + if (!zone)
> + return 0;
> +
> + /* Attempt the batch allocation */
> + local_irq_save(flags);
> + pcp = &this_cpu_ptr(zone->pageset)->pcp;
> + pcp_list = &pcp->lists[ac.migratetype];
> +
> + while (alloced < nr_pages) {
> + page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
> + pcp, pcp_list);
> + if (!page)
> + break;
> +
> + prep_new_page(page, 0, gfp_mask, 0);

I wonder if it would be worth running prep_new_page() in a second pass,
after reenabling interrupts.

Speaking of which, will the realtime people get upset about the
irqs-off latency?  How many pages are we talking about here?

> + list_add

Re: [PATCH 0/5] Introduce a bulk order-0 page allocator with two in-tree users

2021-03-10 Thread Andrew Morton

On Wed, 10 Mar 2021 10:46:13 + Mel Gorman  
wrote:

> This series introduces a bulk order-0 page allocator with sunrpc and
> the network page pool being the first users.

Right now, the [0/n] doesn't even tell us that it's a performance
patchset!

The whole point of this patchset appears to appear in the final paragraph
of the final patch's changelog.

: For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
: redirecting xdp_frame packets into a veth, that does XDP_PASS to create
: an SKB from the xdp_frame, which then cannot return the page to the
: page_pool.  In this case, we saw[1] an improvement of 18.8% from using
: the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).

Much more detail on the overall objective and the observed results,
please?

Also, that workload looks awfully corner-casey.  How beneficial is this
work for more general and widely-used operations?

> The implementation is not
> particularly efficient and the intention is to iron out what the semantics
> of the API should have for users. Once the semantics are ironed out, it can
> be made more efficient.

And some guesstimates about how much benefit remains to be realized
would be helpful.

Re: [PATCH 0/3] Fix some seq_file users that were recently broken

2021-02-07 Thread Andrew Morton

On Sat, 6 Feb 2021 14:29:24 -0800 Jakub Kicinski  wrote:

> On Fri, 5 Feb 2021 14:35:50 -0800 Andrew Morton wrote:
> > On Fri, 05 Feb 2021 11:36:30 +1100 NeilBrown  wrote:
> > 
> > > A recent change to seq_file broke some users which were using seq_file
> > > in a non-"standard" way ...  though the "standard" isn't documented, so
> > > they can be excused.  The result is a possible leak - of memory in one
> > > case, of references to a 'transport' in the other.
> > > 
> > > These three patches:
> > >  1/ document and explain the problem
> > >  2/ fix the problem user in x86
> > >  3/ fix the problem user in net/sctp
> > 
> > 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and
> > interface") was August 2018, so I don't think "recent" applies here?
> > 
> > I didn't look closely, but it appears that the sctp procfs file is
> > world-readable.  So we gave unprivileged userspace the ability to leak
> > kernel memory?
> > 
> > So I'm thinking that we aim for 5.12-rc1 on all three patches with a 
> > cc:stable?
> 
> I'd rather take the sctp patch sooner, we'll send another batch 
> of networking fixes for 5.11, anyway. Would that be okay with you?

Sure.

Re: [PATCH 0/3] Fix some seq_file users that were recently broken

2021-02-05 Thread Andrew Morton

On Fri, 05 Feb 2021 11:36:30 +1100 NeilBrown  wrote:

> A recent change to seq_file broke some users which were using seq_file
> in a non-"standard" way ...  though the "standard" isn't documented, so
> they can be excused.  The result is a possible leak - of memory in one
> case, of references to a 'transport' in the other.
> 
> These three patches:
>  1/ document and explain the problem
>  2/ fix the problem user in x86
>  3/ fix the problem user in net/sctp
> 

1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and
interface") was August 2018, so I don't think "recent" applies here?

I didn't look closely, but it appears that the sctp procfs file is
world-readable.  So we gave unprivileged userspace the ability to leak
kernel memory?

So I'm thinking that we aim for 5.12-rc1 on all three patches with a cc:stable?

Re: linux-next: manual merge of the akpm-current tree with the bpf-next tree

2020-12-14 Thread Andrew Morton

On Mon, 14 Dec 2020 18:06:29 -0800 Jakub Kicinski  wrote:

> On Mon, 14 Dec 2020 17:40:21 -0800 Andrew Morton wrote:
> > On Mon, 14 Dec 2020 17:29:43 -0800 Roman Gushchin  wrote:
> > > On Tue, Dec 15, 2020 at 07:21:56AM +1100, Stephen Rothwell wrote:  
> > > > On Fri, 4 Dec 2020 20:20:05 +1100 Stephen Rothwell 
> > > >  wrote:  
> > > > > Today's linux-next merge of the akpm-current tree got conflicts in:
> > > > > 
> > > > >   include/linux/memcontrol.h
> > > > >   mm/memcontrol.c
> > > > > 
> > > > > between commit:
> > > > > 
> > > > >   bcfe06bf2622 ("mm: memcontrol: Use helpers to read page's memcg 
> > > > > data")
> > > > > 
> > > > > from the bpf-next tree and commits:
> > > > > 
> > > > >   6771a349b8c3 ("mm/memcg: remove incorrect comment")
> > > > >   c3970fcb1f21 ("mm: move lruvec stats update functions to vmstat.h")
> > > > > 
> > > > > from the akpm-current tree.
> > > > >   
> > > ...  
> > > > 
> > > > Just a reminder that this conflict still exists.  Commit bcfe06bf2622
> > > > is now in the net-next tree.  
> > > 
> > > Thanks, Stephen!
> > > 
> > > I wonder if it's better to update these 2 commits in the mm tree to avoid
> > > conflicts?
> > > 
> > > Basically split your fix into two and merge it into mm commits.
> > > The last chunk in the patch should be merged into "mm/memcg: remove 
> > > incorrect comment".
> > > And the rest into "mm: move lruvec stats update functions to vmstat.h".
> > > 
> > > Andrew, what do you think?  
> > 
> > I have "mm/memcg: remove incorrect comment" and "mm: move lruvec stats
> > update functions to vmstat.h" staged against Linus's tree and plan to
> > send them to him later today.  So I trust the BPF tree maintainers will
> > be able to resolve these minor things when those patches turn up in
> > mainline.
> 
> Hm. The code is in net-next by now. I was thinking of sending the
> Networking PR later today (tonight?) as well. I'm happy to hold off 
> or do whatever you require, but I'd appreciate more explicit / noob
> friendly instructions.

Linus tends not to like it when tree maintainers do last-minute
conflict fixes.

> AFAIU all we can do is tell Linus about the merge issue, and point 
> at Stephen's resolution.

That's the way to do it - including a (tested?) copy in the email would
be nice.

Re: linux-next: manual merge of the akpm-current tree with the bpf-next tree

2020-12-14 Thread Andrew Morton

On Mon, 14 Dec 2020 17:29:43 -0800 Roman Gushchin  wrote:

> On Tue, Dec 15, 2020 at 07:21:56AM +1100, Stephen Rothwell wrote:
> > Hi all,
> > 
> > On Fri, 4 Dec 2020 20:20:05 +1100 Stephen Rothwell  
> > wrote:
> > >
> > > Today's linux-next merge of the akpm-current tree got conflicts in:
> > > 
> > >   include/linux/memcontrol.h
> > >   mm/memcontrol.c
> > > 
> > > between commit:
> > > 
> > >   bcfe06bf2622 ("mm: memcontrol: Use helpers to read page's memcg data")
> > > 
> > > from the bpf-next tree and commits:
> > > 
> > >   6771a349b8c3 ("mm/memcg: remove incorrect comment")
> > >   c3970fcb1f21 ("mm: move lruvec stats update functions to vmstat.h")
> > > 
> > > from the akpm-current tree.
> > > 
> ...
> > 
> > Just a reminder that this conflict still exists.  Commit bcfe06bf2622
> > is now in the net-next tree.
> 
> Thanks, Stephen!
> 
> I wonder if it's better to update these 2 commits in the mm tree to avoid
> conflicts?
> 
> Basically split your fix into two and merge it into mm commits.
> The last chunk in the patch should be merged into "mm/memcg: remove incorrect 
> comment".
> And the rest into "mm: move lruvec stats update functions to vmstat.h".
> 
> Andrew, what do you think?

I have "mm/memcg: remove incorrect comment" and "mm: move lruvec stats
update functions to vmstat.h" staged against Linus's tree and plan to
send them to him later today.  So I trust the BPF tree maintainers will
be able to resolve these minor things when those patches turn up in
mainline.

Re: [PATCH v3 1/1] page_frag: Recover from memory pressure

2020-11-18 Thread Andrew Morton

On Wed, 18 Nov 2020 11:46:54 -0800 Jakub Kicinski  wrote:

> > 1. The kernel is under memory pressure and allocation of
> > PAGE_FRAG_CACHE_MAX_ORDER in __page_frag_cache_refill() will fail. Instead,
> > the pfmemalloc page is allocated for page_frag_cache->va.
> > 
> > 2: All skb->data from page_frag_cache->va (pfmemalloc) will have
> > skb->pfmemalloc=true. The skb will always be dropped by sock without
> > SOCK_MEMALLOC. This is an expected behaviour.
> > 
> > 3. Suppose a large amount of pages are reclaimed and kernel is not under
> > memory pressure any longer. We expect skb->pfmemalloc drop will not happen.
> > 
> > 4. Unfortunately, page_frag_alloc() does not proactively re-allocate
> > page_frag_alloc->va and will always re-use the prior pfmemalloc page. The
> > skb->pfmemalloc is always true even kernel is not under memory pressure any
> > longer.
> > 
> > Fix this by freeing and re-allocating the page instead of recycling it.
> 
> Andrew, are you taking this via -mm or should I put it in net? 
> I'm sending a PR to Linus tomorrow.

Please go ahead - if/when it appears in mainline or linux-next, I'll
drop the -mm copy.

Re: [PATCH bpf-next v5 01/34] mm: memcontrol: use helpers to read page's memcg data

2020-11-12 Thread Andrew Morton

On Thu, 12 Nov 2020 19:25:48 -0800 Alexei Starovoitov 
 wrote:

> On Thu, Nov 12, 2020 at 7:18 PM Andrew Morton  
> wrote:
> >
> > On Thu, 12 Nov 2020 19:04:56 -0800 Alexei Starovoitov 
> >  wrote:
> >
> > > On Thu, Nov 12, 2020 at 04:26:10PM -0800, Roman Gushchin wrote:
> > > >
> > > > These patches are not intended to be merged through the bpf tree.
> > > > They are included into the patchset to make bpf selftests pass and for
> > > > informational purposes.
> > > > It's written in the cover letter.
> > > ...
> > > > Maybe I had to just list their titles in the cover letter. Idk what's
> > > > the best option for such cross-subsystem dependencies.
> > >
> > > We had several situations in the past releases where dependent patches
> > > were merged into multiple trees. For that to happen cleanly from git pov
> > > one of the maintainers need to create a stable branch/tag and let other
> > > maintainers pull that branch into different trees. This way the sha-s
> > > stay the same and no conflicts arise during the merge window.
> > > In this case sounds like the first 4 patches are in mm tree already.
> > > Is there a branch/tag I can pull to get the first 4 into bpf-next?
> >
> > Not really, at present.  This is largely by design, although it does cause
> > this problem once or twice a year.
> >
> > These four patches:
> >
> > mm-memcontrol-use-helpers-to-read-pages-memcg-data.patch
> > mm-memcontrol-slab-use-helpers-to-access-slab-pages-memcg_data.patch
> > mm-introduce-page-memcg-flags.patch
> > mm-convert-page-kmemcg-type-to-a-page-memcg-flag.patch
> >
> > are sufficiently reviewed - please pull them into the bpf tree when
> > convenient.  Once they hit linux-next, I'll drop the -mm copies and the
> > bpf tree maintainers will then be responsible for whether & when they
> > get upstream.
> 
> That's certainly an option if they don't depend on other patches in the mm 
> tree.
> Roman probably knows best ?

That should be OK.  They apply and compile ;)

Re: [PATCH bpf-next v5 01/34] mm: memcontrol: use helpers to read page's memcg data

2020-11-12 Thread Andrew Morton

On Thu, 12 Nov 2020 19:04:56 -0800 Alexei Starovoitov 
 wrote:

> On Thu, Nov 12, 2020 at 04:26:10PM -0800, Roman Gushchin wrote:
> > 
> > These patches are not intended to be merged through the bpf tree.
> > They are included into the patchset to make bpf selftests pass and for
> > informational purposes.
> > It's written in the cover letter.
> ...
> > Maybe I had to just list their titles in the cover letter. Idk what's
> > the best option for such cross-subsystem dependencies.
> 
> We had several situations in the past releases where dependent patches
> were merged into multiple trees. For that to happen cleanly from git pov
> one of the maintainers need to create a stable branch/tag and let other
> maintainers pull that branch into different trees. This way the sha-s
> stay the same and no conflicts arise during the merge window.
> In this case sounds like the first 4 patches are in mm tree already.
> Is there a branch/tag I can pull to get the first 4 into bpf-next?

Not really, at present.  This is largely by design, although it does cause
this problem once or twice a year.

These four patches:

mm-memcontrol-use-helpers-to-read-pages-memcg-data.patch
mm-memcontrol-slab-use-helpers-to-access-slab-pages-memcg_data.patch
mm-introduce-page-memcg-flags.patch
mm-convert-page-kmemcg-type-to-a-page-memcg-flag.patch

are sufficiently reviewed - please pull them into the bpf tree when
convenient.  Once they hit linux-next, I'll drop the -mm copies and the
bpf tree maintainers will then be responsible for whether & when they
get upstream.

Re: [PATCH] compiler-clang: remove version check for BPF Tracing

2020-11-04 Thread Andrew Morton

On Wed,  4 Nov 2020 11:10:51 -0800 Nick Desaulniers  
wrote:

> bpftrace parses the kernel headers and uses Clang under the hood. Remove
> the version check when __BPF_TRACING__ is defined (as bpftrace does) so
> that this tool can continue to parse kernel headers, even with older
> clang sources.
> 
> Cc: 
> Fixes: commit 1f7a44f63e6c ("compiler-clang: add build check for clang 
> 10.0.1")

1f7a44f63e6c is only in 5.10-rcX, so I shall remove the cc:stable.

Re: [PATCH v2 1/2] mm: add GFP mask param to strndup_user

2020-08-21 Thread Andrew Morton

On Fri, 21 Aug 2020 20:28:26 -0700 Pascal Bouchareine  wrote:

> Let caller specify allocation.
> Preserve existing calls with GFP_USER.
> 
>  21 files changed, 65 insertions(+), 43 deletions(-)

Why change all existing callsites so that one callsite can pass in a
different gfp_t?

> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 1ca609f66fdf..3d94ba811f4b 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -326,7 +326,7 @@ static __poll_t dma_buf_poll(struct file *file, 
> poll_table *poll)
>   */
>  static long dma_buf_set_name(struct dma_buf *dmabuf, const char __user *buf)
>  {
> - char *name = strndup_user(buf, DMA_BUF_NAME_LEN);
> + char *name = strndup_user(buf, DMA_BUF_NAME_LEN, GFP_USER);
>   long ret = 0;

Wouldn't

#include 

char *__strndup_user(const char __user *s, long n, gfp_t gfp);

static inline char *strndup_user(const char __user *s, long n)
{
return __strndup_user(s, n, GFP_USER);
}

be simpler?



Also...

why does strndup_user() use GFP_USER?  Nobody will be mapping the
resulting strings into user pagetables (will they?).  This was done by
Al's 6c2c97a24f096e32, which doesn't have a changelog :(


In [patch 2/2],

+   desc = strndup_user(user_desc, SK_MAX_DESC_SIZE, GFP_KERNEL_ACCOUNT);

if GFP_USER is legit then shouldn't this be GFP_USER_ACCOUNT (ie,
GFP_USER|__GFP_ACCOUNT)?

Re: [PATCH v5 2/2] mm, treewide: Rename kzfree() to kfree_sensitive()

2020-06-16 Thread Andrew Morton

On Tue, 16 Jun 2020 11:43:11 -0400 Waiman Long  wrote:

> As said by Linus:
> 
>   A symmetric naming is only helpful if it implies symmetries in use.
>   Otherwise it's actively misleading.
> 
>   In "kzalloc()", the z is meaningful and an important part of what the
>   caller wants.
> 
>   In "kzfree()", the z is actively detrimental, because maybe in the
>   future we really _might_ want to use that "memfill(0xdeadbeef)" or
>   something. The "zero" part of the interface isn't even _relevant_.
> 
> The main reason that kzfree() exists is to clear sensitive information
> that should not be leaked to other future users of the same memory
> objects.
> 
> Rename kzfree() to kfree_sensitive() to follow the example of the
> recently added kvfree_sensitive() and make the intention of the API
> more explicit. In addition, memzero_explicit() is used to clear the
> memory to make sure that it won't get optimized away by the compiler.
> 
> The renaming is done by using the command sequence:
> 
>   git grep -w --name-only kzfree |\
>   xargs sed -i 's/\bkzfree\b/kfree_sensitive/'
> 
> followed by some editing of the kfree_sensitive() kerneldoc and adding
> a kzfree backward compatibility macro in slab.h.
> 
> ...
>
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -186,10 +186,12 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *, 
> struct mem_cgroup *);
>   */
>  void * __must_check krealloc(const void *, size_t, gfp_t);
>  void kfree(const void *);
> -void kzfree(const void *);
> +void kfree_sensitive(const void *);
>  size_t __ksize(const void *);
>  size_t ksize(const void *);
>  
> +#define kzfree(x)kfree_sensitive(x)  /* For backward compatibility */
> +

What was the thinking here?  Is this really necessary?

I suppose we could keep this around for a while to ease migration.  But
not for too long, please.

Re: [PATCH 12/23] bpf: handle the compat string in bpf_trace_copy_string better

2020-05-27 Thread Andrew Morton

On Thu, 21 May 2020 17:22:50 +0200 Christoph Hellwig  wrote:

> User the proper helper for kernel or userspace addresses based on
> TASK_SIZE instead of the dangerous strncpy_from_unsafe function.
> 
> ...
>
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -331,8 +331,11 @@ static void bpf_trace_copy_string(char *buf, void 
> *unsafe_ptr, char fmt_ptype,
>   switch (fmt_ptype) {
>   case 's':
>  #ifdef CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
> - strncpy_from_unsafe(buf, unsafe_ptr, bufsz);
> - break;
> + if ((unsigned long)unsafe_ptr < TASK_SIZE) {
> + strncpy_from_user_nofault(buf, user_ptr, bufsz);
> + break;
> + }
> + fallthrough;
>  #endif
>   case 'k':
>   strncpy_from_kernel_nofault(buf, unsafe_ptr, bufsz);

Another user of strncpy_from_unsafe() has popped up in linux-next's
bpf.  I did the below, but didn't try very hard - it's probably wrong
if CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE=n?

Anyway, please take a look at all the bpf_trace.c changes in
linux-next.


From: Andrew Morton 
Subject: bpf:bpf_seq_printf(): handle potentially unsafe format string better

User the proper helper for kernel or userspace addresses based on
TASK_SIZE instead of the dangerous strncpy_from_unsafe function.

Cc: Christoph Hellwig 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Masami Hiramatsu 
Cc: Thomas Gleixner 
Signed-off-by: Andrew Morton 
---

 kernel/trace/bpf_trace.c |   13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

--- a/kernel/trace/bpf_trace.c~xxx
+++ a/kernel/trace/bpf_trace.c
@@ -588,15 +588,22 @@ BPF_CALL_5(bpf_seq_printf, struct seq_fi
}
 
if (fmt[i] == 's') {
+   void *unsafe_ptr;
+
/* try our best to copy */
if (memcpy_cnt >= MAX_SEQ_PRINTF_MAX_MEMCPY) {
err = -E2BIG;
goto out;
}
 
-   err = strncpy_from_unsafe(bufs->buf[memcpy_cnt],
- (void *) (long) args[fmt_cnt],
- MAX_SEQ_PRINTF_STR_LEN);
+   unsafe_ptr = (void *)(long)args[fmt_cnt];
+   if ((unsigned long)unsafe_ptr < TASK_SIZE) {
+   err = strncpy_from_user_nofault(
+   bufs->buf[memcpy_cnt], unsafe_ptr,
+   MAX_SEQ_PRINTF_STR_LEN);
+   } else {
+   err = -EFAULT;
+   }
if (err < 0)
bufs->buf[memcpy_cnt][0] = '\0';
params[fmt_cnt] = (u64)(long)bufs->buf[memcpy_cnt];
_

Re: [PATCH 10/23] maccess: unify the probe kernel arch hooks

2020-05-27 Thread Andrew Morton

On Thu, 21 May 2020 17:22:48 +0200 Christoph Hellwig  wrote:

> Currently architectures have to override every routine that probes
> kernel memory, which includes a pure read and strcpy, both in strict
> and not strict variants.  Just provide a single arch hooks instead to
> make sure all architectures cover all the cases.

Fix a buildo.

--- a/arch/x86/mm/maccess.c~maccess-unify-the-probe-kernel-arch-hooks-fix
+++ a/arch/x86/mm/maccess.c
@@ -29,6 +29,6 @@ bool probe_kernel_read_allowed(const voi
 {
if (!strict)
return true;
-   return (unsigned long)vaddr >= TASK_SIZE_MAX;
+   return (unsigned long)unsafe_src >= TASK_SIZE_MAX;
 }
 #endif
_

Re: clean up and streamline probe_kernel_* and friends v4

2020-05-27 Thread Andrew Morton

On Tue, 26 May 2020 08:13:09 +0200 Christoph Hellwig  wrote:

> On Mon, May 25, 2020 at 03:19:12PM -0700, Andrew Morton wrote:
> > hm.  Applying linux-next to this series generates a lot of rejects against
> > powerpc:
> > 
> > -rw-rw-r-- 1 akpm akpm  493 May 25 15:06 arch/powerpc/kernel/kgdb.c.rej
> > -rw-rw-r-- 1 akpm akpm 6461 May 25 15:06 
> > arch/powerpc/kernel/trace/ftrace.c.rej
> > -rw-rw-r-- 1 akpm akpm  447 May 25 15:06 arch/powerpc/mm/fault.c.rej
> > -rw-rw-r-- 1 akpm akpm  623 May 25 15:06 arch/powerpc/perf/core-book3s.c.rej
> > -rw-rw-r-- 1 akpm akpm 1408 May 25 15:06 arch/riscv/kernel/patch.c.rej
> > 
> > the arch/powerpc/kernel/trace/ftrace.c ones aren't very trivial.
> > 
> > It's -rc7.  Perhaps we should park all this until 5.8-rc1?
> 
> As this is a pre-condition for the set_fs removal I'd really like to
> get the actual changes in.  All these conflicts seem to be about the
> last three cleanup patches just doing renaming, so can we just skip
> those three for now?  Then we can do the rename right after 5.8-rc1
> when we have the least chances for conflicts.

That seems to have worked.  "[PATCH 23/23] maccess: return -ERANGE when
copy_from_kernel_nofault_allowed fails" needed a bit of massaging to both
the patch and to the patch title.

Re: clean up and streamline probe_kernel_* and friends v4

2020-05-25 Thread Andrew Morton

On Thu, 21 May 2020 17:22:38 +0200 Christoph Hellwig  wrote:

> this series start cleaning up the safe kernel and user memory probing
> helpers in mm/maccess.c, and then allows architectures to implement
> the kernel probing without overriding the address space limit and
> temporarily allowing access to user memory.  It then switches x86
> over to this new mechanism by reusing the unsafe_* uaccess logic.
> 
> This version also switches to the saner copy_{from,to}_kernel_nofault
> naming suggested by Linus.
> 
> I kept the x86 helpers as-is without calling unsage_{get,put}_user as
> that avoids a number of hard to trace casts, and it will still work
> with the asm-goto based version easily.

hm.  Applying linux-next to this series generates a lot of rejects against
powerpc:

-rw-rw-r-- 1 akpm akpm  493 May 25 15:06 arch/powerpc/kernel/kgdb.c.rej
-rw-rw-r-- 1 akpm akpm 6461 May 25 15:06 arch/powerpc/kernel/trace/ftrace.c.rej
-rw-rw-r-- 1 akpm akpm  447 May 25 15:06 arch/powerpc/mm/fault.c.rej
-rw-rw-r-- 1 akpm akpm  623 May 25 15:06 arch/powerpc/perf/core-book3s.c.rej
-rw-rw-r-- 1 akpm akpm 1408 May 25 15:06 arch/riscv/kernel/patch.c.rej

the arch/powerpc/kernel/trace/ftrace.c ones aren't very trivial.

It's -rc7.  Perhaps we should park all this until 5.8-rc1?

Re: [PATCH] mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs

2019-02-13 Thread Andrew Morton

On Wed, 13 Feb 2019 22:11:58 +0100 Jann Horn  wrote:

> > This is probably more a davem patch than a -mm one.
> 
> Ah, sorry. I assumed that I just should go by which directory the
> patched code is in.
> 
> You did just add it to the -mm tree though, right? So I shouldn't
> resend it to davem?

Yes, please send to Dave.  I'll autodrop the -mm copy if/when it turns
up in -next.

Re: [PATCH] mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs

2019-02-13 Thread Andrew Morton

On Wed, 13 Feb 2019 21:41:57 +0100 Jann Horn  wrote:

> The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
> number of references that we might need to create in the fastpath later,
> the bump-allocation fastpath only has to modify the non-atomic bias value
> that tracks the number of extra references we hold instead of the atomic
> refcount. The maximum number of allocations we can serve (under the
> assumption that no allocation is made with size 0) is nc->size, so that's
> the bias used.
> 
> However, even when all memory in the allocation has been given away, a
> reference to the page is still held; and in the `offset < 0` slowpath, the
> page may be reused if everyone else has dropped their references.
> This means that the necessary number of references is actually
> `nc->size+1`.
> 
> Luckily, from a quick grep, it looks like the only path that can call
> page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
> requires CAP_NET_ADMIN in the init namespace and is only intended to be
> used for kernel testing and fuzzing.

For the net-naive, what is TAP?  It doesn't appear to mean
drivers/net/tap.c.

> To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
> `offset < 0` path, below the virt_to_page() call, and then repeatedly call
> writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
> with a vector consisting of 15 elements containing 1 byte each.
> 
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4675,11 +4675,11 @@ void *page_frag_alloc(struct page_frag_cache *nc,
>   /* Even if we own the page, we do not use atomic_set().
>* This would break get_page_unless_zero() users.
>*/
> - page_ref_add(page, size - 1);
> + page_ref_add(page, size);
>  
>   /* reset page count bias and offset to start of new frag */
>   nc->pfmemalloc = page_is_pfmemalloc(page);
> - nc->pagecnt_bias = size;
> + nc->pagecnt_bias = size + 1;
>   nc->offset = size;
>   }
>  
> @@ -4695,10 +4695,10 @@ void *page_frag_alloc(struct page_frag_cache *nc,
>   size = nc->size;
>  #endif
>   /* OK, page count is 0, we can safely set it */
> - set_page_count(page, size);
> + set_page_count(page, size + 1);
>  
>   /* reset page count bias and offset to start of new frag */
> - nc->pagecnt_bias = size;
> + nc->pagecnt_bias = size + 1;
>   offset = size - fragsz;
>   }

This is probably more a davem patch than a -mm one.

Re: [net-next PATCH 1/2] mm: add dma_addr_t to struct page

2019-02-11 Thread Andrew Morton

On Mon, 11 Feb 2019 17:06:46 +0100 Jesper Dangaard Brouer  
wrote:

> The page_pool API is using page->private to store DMA addresses.
> As pointed out by David Miller we can't use that on 32-bit architectures
> with 64-bit DMA
> 
> This patch adds a new dma_addr_t struct to allow storing DMA addresses
> 
> ..
>
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -95,6 +95,14 @@ struct page {
>*/
>   unsigned long private;
>   };
> + struct {/* page_pool used by netstack */
> + /**
> +  * @dma_addr: Page_pool need to store DMA-addr, and
> +  * cannot use @private, as DMA-mappings can be 64-bit
> +  * even on 32-bit Architectures.
> +  */

This comment is a bit awkward.  The discussion about why it doesn't use
->private is uninteresting going forward and is more material for a
changelog.

How about

/**
 * @dma_addr: page_pool requires a 64-bit value even on
 * 32-bit architectures.
 */

Otherwise,

Acked-by: Andrew Morton

Re: rfc: bool structure members (was Re: [PATCH V3] net/mlx4: Get rid of page operation after dma_alloc_coherent)

2018-12-20 Thread Andrew Morton

On Thu, 20 Dec 2018 18:25:05 -0800 Joe Perches  wrote:

> On Thu, 2018-12-20 at 09:49 -0800, Bart Van Assche wrote:
> > On Thu, 2018-12-20 at 18:44 +0100, Christoph Hellwig wrote:
> > > On Thu, Dec 20, 2018 at 10:43:18AM -0700, Jason Gunthorpe wrote:
> > > > >   - chunk->coherent is an int not a bool since checkpatch complains 
> > > > > about
> > > > > using bool in structs; see https://lkml.org/lkml/2017/11/21/384.
> > > > 
> > > > :( bool is much more readable and there is no performance concern in
> > > > this struct. I think checkpatch is overzealous here.
> > > 
> > > Yes.  Nevermind that this for bool vs bitfield.  A int is worse in
> > > every respect in the criteria used in that mail.
> > 
> > (+Joe Perches)
> > 
> > Hi Joe,
> 
> Hi all.
> 
> > This is the second time that I see that the checkpatch complaint about using
> > bool in a structure leads kernel contributors to a bad decision. Please 
> > consider
> > removing that warning from checkpatch.
> 
> I agree it's not a very good message nor is bool use
> of structure members a real problem except in very
> few cases.
> 
> I think the message could either be restated and bool
> members described as OK for unshared memory structures.
> 
> Right now this is the test:
> 
> # check for bool use in .h files
>   if ($realfile =~ /\.h$/ &&
>   $sline =~ /^.\s+bool\s*$Ident\s*(?::\s*d+\s*)?;/) {
>   CHK("BOOL_MEMBER",
>   "Avoid using bool structure members because of 
> possible alignment issues - see: https://lkml.org/lkml/2017/11/21/384\n"; . 
> $herecurr);

Probably better if this is in the tree. 
Documentation/process/coding-style.rst, perhaps.

Re: [PATCH 2/2] x86/modules: Make x86 allocs to flush when free

2018-11-28 Thread Andrew Morton

On Tue, 27 Nov 2018 16:07:54 -0800 Rick Edgecombe  
wrote:

> Change the module allocations to flush before freeing the pages.
> 
> ...
>
> --- a/arch/x86/kernel/module.c
> +++ b/arch/x86/kernel/module.c
> @@ -87,8 +87,8 @@ void *module_alloc(unsigned long size)
>   p = __vmalloc_node_range(size, MODULE_ALIGN,
>   MODULES_VADDR + get_module_load_offset(),
>   MODULES_END, GFP_KERNEL,
> - PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
> - __builtin_return_address(0));
> + PAGE_KERNEL_EXEC, VM_IMMEDIATE_UNMAP,
> + NUMA_NO_NODE, __builtin_return_address(0));
>   if (p && (kasan_module_alloc(p, size) < 0)) {
>   vfree(p);
>   return NULL;

Should any other architectures do this?

Re: [PATCH] mm: Replace all open encodings for NUMA_NO_NODE

2018-11-23 Thread Andrew Morton

On Fri, 23 Nov 2018 15:24:16 +0530 Anshuman Khandual 
 wrote:

> At present there are multiple places where invalid node number is encoded
> as -1. Even though implicitly understood it is always better to have macros
> in there. Replace these open encodings for an invalid node number with the
> global macro NUMA_NO_NODE. This helps remove NUMA related assumptions like
> 'invalid node' from various places redirecting them to a common definition.
> 
> ...
> 
> Build tested this with multiple cross compiler options like alpha, sparc,
> arm64, x86, powerpc, powerpc64le etc with their default config which might
> not have compiled tested all driver related changes. I will appreciate
> folks giving this a test in their respective build environment.
> 
> All these places for replacement were found by running the following grep
> patterns on the entire kernel code. Please let me know if this might have
> missed some instances. This might also have replaced some false positives.
> I will appreciate suggestions, inputs and review.
> 
> 1. git grep "nid == -1"
> 2. git grep "node == -1"
> 3. git grep "nid = -1"
> 4. git grep "node = -1"

The build testing is good, but I worry that some of the affected files
don't clearly have numa.h in their include paths, for the NUMA_NO_NODE
definition.

The first thing I looked it is arch/powerpc/include/asm/pci-bridge.h. 
Maybe it somehow manages to include numa.h via some nested include, but
if so, is that reliable across all config combinations and as code
evolves?

So I think that the patch should have added an explicit include of
numa.h, especially in cases where the affected file previously had no
references to any of the things which numa.h defines.

Re: [PATCH] bpfilter: fix user mode helper cross compilation

2018-06-27 Thread Andrew Morton

On Wed, 20 Jun 2018 16:04:34 +0200 Matteo Croce  wrote:

> Use $(OBJDUMP) instead of literal 'objdump' to avoid
> using host toolchain when cross compiling.
> 

I'm still having issues here, with ld.

x86_64 machine, ARCH=i386:

y:/usr/src/25> make V=1 M=net/bpfilter
test -e include/generated/autoconf.h -a -e include/config/auto.conf || (   \
echo >&2;   \
echo >&2 "  ERROR: Kernel configuration is invalid.";   \
echo >&2 " include/generated/autoconf.h or include/config/auto.conf are 
missing.";\
echo >&2 " Run 'make oldconfig && make prepare' on kernel src to fix 
it.";  \
echo >&2 ;  \
/bin/false)
mkdir -p net/bpfilter/.tmp_versions ; rm -f net/bpfilter/.tmp_versions/*
make -f ./scripts/Makefile.build obj=net/bpfilter
(cat /dev/null;   echo kernel/net/bpfilter/bpfilter.ko;) > 
net/bpfilter/modules.order
  ld -m elf_i386   -r -o net/bpfilter/bpfilter.o net/bpfilter/bpfilter_kern.o 
net/bpfilter/bpfilter_umh.o ; scripts/mod/modpost net/bpfilter/bpfilter.o
ld: i386:x86-64 architecture of input file `net/bpfilter/bpfilter_umh.o' is 
incompatible with i386 output
scripts/Makefile.build:530: recipe for target 'net/bpfilter/bpfilter.o' failed
make[1]: *** [net/bpfilter/bpfilter.o] Error 1
Makefile:1518: recipe for target '_module_net/bpfilter' failed
make: *** [_module_net/bpfilter] Error 2

y:/usr/src/25> ld --version
GNU ld (GNU Binutils for Ubuntu) 2.29.1

Re: [lkp-robot] 486ad79630 [ 15.532543] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004

2018-05-02 Thread Andrew Morton

On Wed, 2 May 2018 21:58:25 -0700 Cong Wang  wrote:

> On Wed, May 2, 2018 at 9:27 PM, Andrew Morton  
> wrote:
> >
> > So it's saying that something which got committed into Linus's tree
> > after 4.17-rc3 has caused a NULL deref in
> > sock_release->llc_ui_release+0x3a/0xd0
> 
> Do you mean it contains commit 3a04ce7130a7
> ("llc: fix NULL pointer deref for SOCK_ZAPPED")?

That was in 4.17-rc3 so if this report's bisection is correct, that
patch is innocent.

origin.patch (http://ozlabs.org/~akpm/mmots/broken-out/origin.patch)
contains no changes to net/llc/af_llc.c so perhaps this crash is also
occurring in 4.17-rc3 base.

Re: [lkp-robot] 486ad79630 [ 15.532543] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004

2018-05-02 Thread Andrew Morton


(networking cc's added)

On Thu, 3 May 2018 12:14:50 +0800 kernel test robot  wrote:

> Greetings,
> 
> 0day kernel testing robot got the below dmesg and the first bad commit is
> 
> git://git.cmpxchg.org/linux-mmotm.git master
> 
> commit 486ad79630d0ba0b7205a8db9fe15ba392f5ee32
> Author: Andrew Morton 
> AuthorDate: Fri Apr 20 22:00:53 2018 +
> Commit: Johannes Weiner 
> CommitDate: Fri Apr 20 22:00:53 2018 +
> 
> origin


OK, this got confusing.  origin.patch is the diff between 4.17-rc3 and
current mainline.

>
> [many lines deleted]
>
> [main] Setsockopt(101 c 1b24000 a) on fd 177 [3:5:240]
> [main] Setsockopt(1 2c 1b24000 4) on fd 178 [5:2:0]
> [main] Setsockopt(29 8 1b24000 4) on fd 180 [10:1:0]
> [main] Setsockopt(1 20 1b24000 4) on fd 181 [26:2:125]
> [main] Setsockopt(11 1 1b24000 4) on fd 183 [2:2:17]
> [   15.532543] BUG: unable to handle kernel NULL pointer dereference at 
> 0004
> [   15.534143] PGD 80001734b067 P4D 80001734b067 PUD 17350067 PMD 0 
> [   15.535516] Oops: 0002 [#1] PTI
> [   15.536165] Modules linked in:
> [   15.536798] CPU: 0 PID: 363 Comm: trinity-main Not tainted 
> 4.17.0-rc1-1-g486ad79 #2
> [   15.538396] RIP: 0010:llc_ui_release+0x3a/0xd0
> [   15.539293] RSP: 0018:c915bd70 EFLAGS: 00010202
> [   15.540345] RAX: 0001 RBX: 88001fa60008 RCX: 
> 0006
> [   15.541802] RDX: 0006 RSI: 88001fdda660 RDI: 
> 88001fa60008
> [   15.543139] RBP: c915bd80 R08:  R09: 
> 
> [   15.544725] R10:  R11:  R12: 
> 
> [   15.546287] R13: 88001fa61730 R14: 88001e130a60 R15: 
> 880019bdb3f0
> [   15.547962] FS:  7f2221bb1700() GS:82034000() 
> knlGS:
> [   15.549848] CS:  0010 DS:  ES:  CR0: 80050033
> [   15.551186] CR2: 0004 CR3: 1734e000 CR4: 
> 06b0
> [   15.552671] DR0: 02232000 DR1:  DR2: 
> 
> [   15.554105] DR3:  DR6: 0ff0 DR7: 
> 0600
> [   15.34] Call Trace:
> [   15.556049]  sock_release+0x14/0x60
> [   15.556767]  sock_close+0xd/0x20
> [   15.557427]  __fput+0xba/0x1f0
> [   15.558058]  fput+0x9/0x10
> [   15.558682]  task_work_run+0x73/0xa0
> [   15.559416]  do_exit+0x231/0xab0
> [   15.560079]  do_group_exit+0x3f/0xc0
> [   15.560810]  __x64_sys_exit_group+0x13/0x20
> [   15.561656]  do_syscall_64+0x58/0x2f0
> [   15.562407]  ? trace_hardirqs_off_thunk+0x1a/0x1c
> [   15.563360]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [   15.564471] RIP: 0033:0x7f2221696408
> [   15.565264] RSP: 002b:7ffe5c544c48 EFLAGS: 0206 ORIG_RAX: 
> 00e7
> [   15.566924] RAX: ffda RBX:  RCX: 
> 7f2221696408
> [   15.568485] RDX:  RSI: 003c RDI: 
> 
> [   15.570046] RBP:  R08: 00e7 R09: 
> ffa0
> [   15.571603] R10: 7ffe5c5449e0 R11: 0206 R12: 
> 0004
> [   15.573160] R13: 7ffe5c544e30 R14:  R15: 
> 
> [   15.574720] Code: 7b ff 43 78 0f 88 a5 6f 14 00 31 f6 48 89 df e8 ad 33 fb 
> ff 48 89 df e8 55 94 ff ff 85 c0 0f 84 84 00 00 00 4c 8b a3 d8 04 00 00 <41> 
> ff 44 24 04 0f 88 7f 6f 14 00 48 8b 43 58 f6 c4 01 74 58 48 
> [   15.578679] RIP: llc_ui_release+0x3a/0xd0 RSP: c915bd70
> [   15.579874] CR2: 0004
> [   15.580553] ---[ end trace 0dd8fdc6b7182234 ]---
>

So it's saying that something which got committed into Linus's tree
after 4.17-rc3 has caused a NULL deref in
sock_release->llc_ui_release+0x3a/0xd0

Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM

2018-05-01 Thread Andrew Morton

On Tue, 24 Apr 2018 12:33:01 -0400 (EDT) Mikulas Patocka  
wrote:

> 
> 
> On Tue, 24 Apr 2018, Michal Hocko wrote:
> 
> > On Tue 24-04-18 11:30:40, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > > 
> > > > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
> > > > 
> > > > > Fixing __vmalloc code 
> > > > > is easy and it doesn't require cooperation with maintainers.
> > > > 
> > > > But it is a hack against the intention of the scope api.
> > > 
> > > It is not!
> > 
> > This discussion simply doesn't make much sense it seems. The scope API
> > is to document the scope of the reclaim recursion critical section. That
> > certainly is not a utility function like vmalloc.
> 
> That 15-line __vmalloc bugfix doesn't prevent you (or any other kernel 
> developer) from converting the code to the scope API. You make nonsensical 
> excuses.
> 

Fun thread!

Winding back to the original problem, I'd state it as

- Caller uses kvmalloc() but passes the address into vmalloc-naive
  DMA API and

- Caller uses kvmalloc() but passes the address into kfree()

Yes?

If so, then...

Is there a way in which, in the kvmalloc-called-kmalloc path, we can
tag the slab-allocated memory with a "this memory was allocated with
kvmalloc()" flag?  I *think* there's extra per-object storage available
with suitable slab/slub debugging options?  Perhaps we could steal one
bit from the redzone, dunno.

If so then we can

a) set that flag in kvmalloc() if the kmalloc() call succeeded

b) check for that flag in the DMA code, WARN if it is set.

c) in kvfree(), clear that flag before calling kfree()

d) in kfree(), check for that flag and go WARN() if set.

So both potential bugs are detected all the time, dependent upon
CONFIG_SLUB_DEBUG (and perhaps other slub config options).

Re: simplify procfs code for seq_file instances

2018-04-24 Thread Andrew Morton

On Tue, 24 Apr 2018 16:23:04 +0200 Christoph Hellwig  wrote:

> On Thu, Apr 19, 2018 at 09:57:50PM +0300, Alexey Dobriyan wrote:
> > > git://git.infradead.org/users/hch/misc.git proc_create
> > 
> > 
> > I want to ask if it is time to start using poorman function overloading
> > with _b_c_e(). There are millions of allocation functions for example,
> > all slightly difference, and people will add more. Seeing /proc interfaces
> > doubled like this is painful.
> 
> Function overloading is totally unacceptable.
> 
> And I very much disagree with a tradeoff that keeps 5000 lines of 
> code vs a few new helpers.

OK, the curiosity and suspense are killing me.  What the heck is
"function overloading with _b_c_e()"?

Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM

2018-04-19 Thread Andrew Morton

On Thu, 19 Apr 2018 17:19:20 -0400 (EDT) Mikulas Patocka  
wrote:

> > > In order to detect these bugs reliably I submit this patch that changes
> > > kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> > > 
> > > ...
> > >
> > > --- linux-2.6.orig/mm/util.c  2018-04-18 15:46:23.0 +0200
> > > +++ linux-2.6/mm/util.c   2018-04-18 16:00:43.0 +0200
> > > @@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
> > >   */
> > >  void *kvmalloc_node(size_t size, gfp_t flags, int node)
> > >  {
> > > +#ifndef CONFIG_DEBUG_VM
> > >   gfp_t kmalloc_flags = flags;
> > >   void *ret;
> > >  
> > > @@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
> > >*/
> > >   if (ret || size <= PAGE_SIZE)
> > >   return ret;
> > > +#endif
> > >  
> > >   return __vmalloc_node_flags_caller(size, node, flags,
> > >   __builtin_return_address(0));
> > 
> > Well, it doesn't have to be done at compile-time, does it?  We could
> > add a knob (in debugfs, presumably) which enables this at runtime. 
> > That's far more user-friendly.
> 
> But who will turn it on in debugfs?

But who will turn it on in Kconfig?  Just a handful of developers.  We
could add SONFIG_DEBUG_SG to the list in
Documentation/process/submit-checklist.rst, but nobody reads that.

Also, a whole bunch of defconfigs set CONFIG_DEBUG_SG=y and some
googling indicates that they aren't the only ones...

> It should be default for debugging 
> kernels, so that users using them would report the error.

Well.  This isn't the first time we've wanted to enable expensive (or
noisy) debugging things in debug kernels, by any means.

So how could we define a debug kernel in which it's OK to enable such
things?

- Could be "it's an -rc kernel".  But then we'd be enabling a bunch of
  untested code when Linus cuts a release.

- Could be "it's an -rc kernel with SUBLEVEL <= 5".  But then we risk
  unexpected things happening when Linux cuts -rc6, which still isn't
  good.

- How about "it's an -rc kernel with odd-numbered SUBLEVEL and
  SUBLEVEL <= 5".  That way everybody who runs -rc1, -rc3 and -rc5 will
  have kvmalloc debugging enabled.  That's potentially nasty because
  vmalloc is much slower than kmalloc.  But kvmalloc() is only used for
  large and probably infrequent allocations, so it's probably OK.

I wonder how we get at SUBLEVEL from within .c.

Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM

2018-04-19 Thread Andrew Morton

On Thu, 19 Apr 2018 12:12:38 -0400 (EDT) Mikulas Patocka  
wrote:

> The kvmalloc function tries to use kmalloc and falls back to vmalloc if
> kmalloc fails.
> 
> Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then
> uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> code.
> 
> These bugs are hard to reproduce because vmalloc falls back to kmalloc
> only if memory is fragmented.

Yes, that's nasty.

> In order to detect these bugs reliably I submit this patch that changes
> kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> 
> ...
>
> --- linux-2.6.orig/mm/util.c  2018-04-18 15:46:23.0 +0200
> +++ linux-2.6/mm/util.c   2018-04-18 16:00:43.0 +0200
> @@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
>   */
>  void *kvmalloc_node(size_t size, gfp_t flags, int node)
>  {
> +#ifndef CONFIG_DEBUG_VM
>   gfp_t kmalloc_flags = flags;
>   void *ret;
>  
> @@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
>*/
>   if (ret || size <= PAGE_SIZE)
>   return ret;
> +#endif
>  
>   return __vmalloc_node_flags_caller(size, node, flags,
>   __builtin_return_address(0));

Well, it doesn't have to be done at compile-time, does it?  We could
add a knob (in debugfs, presumably) which enables this at runtime. 
That's far more user-friendly.

Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-13 Thread Andrew Morton

On Mon, 12 Mar 2018 21:28:57 -0700 Kees Cook  wrote:

> On Mon, Mar 12, 2018 at 4:57 PM, Linus Torvalds
>  wrote:
> > On Mon, Mar 12, 2018 at 3:55 PM, Andrew Morton
> >  wrote:
> >>
> >> Replacing the __builtin_choose_expr() with ?: works of course.
> >
> > Hmm. That sounds like the right thing to do. We were so myopically
> > staring at the __builtin_choose_expr() problem that we overlooked the
> > obvious solution.
> >
> > Using __builtin_constant_p() together with a ?: is in fact our common
> > pattern, so that should be fine. The only real reason to use
> > __builtin_choose_expr() is if you want to get the *type* to vary
> > depending on which side you choose, but that's not an issue for
> > min/max.
> 
> This doesn't solve it for -Wvla, unfortunately. That was the point of
> Josh's original suggestion of __builtin_choose_expr().
> 
> Try building with KCFLAGS=-Wval and checking net/ipv6/proc.c:
> 
> net/ipv6/proc.c: In function ‘snmp6_seq_show_item’:
> net/ipv6/proc.c:198:2: warning: ISO C90 forbids array ‘buff’ whose
> size can’t be evaluated [-Wvla]
>   unsigned long buff[SNMP_MIB_MAX];
>   ^~~~

PITA.  Didn't we once have a different way of detecting VLAs?  Some
post-compilation asm parser, iirc.

I suppose the world wouldn't end if we had a gcc version ifdef in
kernel.h.  We'll get to remove it in, oh, ten years.

Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-12 Thread Andrew Morton

On Fri, 9 Mar 2018 17:30:15 -0800 Kees Cook  wrote:

> > It's one reason why I wondered if simplifying the expression to have
> > just that single __builtin_constant_p() might not end up working..
> 
> Yeah, it seems like it doesn't bail out as "false" for complex
> expressions given to __builtin_constant_p().
> 
> If no magic solution, then which of these?
> 
> - go back to my original "multi-eval max only for constants" macro (meh)
> - add gcc version checks around this and similarly for -Wvla in the future 
> (eww)
> - raise gcc version (yikes)

Replacing the __builtin_choose_expr() with ?: works of course.  What
will be the runtime effects?

I tried replacing

__builtin_choose_expr(__builtin_constant_p(x) &&
  __builtin_constant_p(y),

with

__builtin_choose_expr(__builtin_constant_p(x + y),

but no success.

I'm not sure what else to try to trick gcc into working.

--- 
a/include/linux/kernel.h~kernelh-skip-single-eval-logic-on-literals-in-min-max-v3-fix
+++ a/include/linux/kernel.h
@@ -804,13 +804,10 @@ static inline void ftrace_dump(enum ftra
  * accidental VLA.
  */
 #define __min(t1, t2, x, y)\
-   __builtin_choose_expr(__builtin_constant_p(x) &&\
- __builtin_constant_p(y),  \
- (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\
- __single_eval_min(t1, t2, \
-   __UNIQUE_ID(min1_), \
-   __UNIQUE_ID(min2_), \
-   x, y))
+   ((__builtin_constant_p(x) && __builtin_constant_p(y)) ? \
+   ((t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y)) :   \
+   (__single_eval_min(t1, t2, __UNIQUE_ID(min1_),  \
+ __UNIQUE_ID(min2_), x, y)))
 
 /**
  * min - return minimum of two values of the same or compatible types
@@ -826,13 +823,11 @@ static inline void ftrace_dump(enum ftra
max1 > max2 ? max1 : max2; })
 
 #define __max(t1, t2, x, y)\
-   __builtin_choose_expr(__builtin_constant_p(x) &&\
- __builtin_constant_p(y),  \
- (t1)(x) > (t2)(y) ? (t1)(x) : (t2)(y),\
- __single_eval_max(t1, t2, \
-   __UNIQUE_ID(max1_), \
-   __UNIQUE_ID(max2_), \
-   x, y))
+   ((__builtin_constant_p(x) && __builtin_constant_p(y)) ? \
+   ((t1)(x) > (t2)(y) ? (t1)(x) : (t2)(y)) :   \
+   (__single_eval_max(t1, t2, __UNIQUE_ID(max1_),  \
+ __UNIQUE_ID(max2_), x, y)))
+
 /**
  * max - return maximum of two values of the same or compatible types
  * @x: first value
_

Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Andrew Morton

On Fri, 9 Mar 2018 16:28:51 -0800 Linus Torvalds 
 wrote:

> On Fri, Mar 9, 2018 at 4:07 PM, Andrew Morton  
> wrote:
> >
> > A brief poke failed to reveal a workaround - gcc-4.4.4 doesn't appear
> > to know that __builtin_constant_p(x) is a constant.  Or something.
> 
> LOL.
> 
> I suspect it might be that it wants to evaluate
> __builtin_choose_expr() at an earlier stage than it evaluates
> __builtin_constant_p(), so it's not that it doesn't know that
> __builtin_constant_p() is a constant, it just might not know it *yet*.
> 
> Maybe.
> 
> Side note, if it's not that, but just the "complex" expression that
> has the logical 'and' etc, maybe the code could just use
> 
>   __builtin_constant_p((x)+(y))
> 
> or something.

I'll do a bit more poking at it.

> But yeah:
> 
> > Sigh.  Wasn't there some talk about modernizing our toolchain
> > requirements?
> 
> Maybe it's just time to give up on 4.4.  We wanted 4.5 for "asm goto",
> and once we upgrade to 4.5 I think Arnd said that no distro actually
> ships it, so we might as well go to 4.6.
> 
> So maybe this is just the excuse to finally make that official, if
> there is no clever workaround any more.

I wonder which gcc versions actually accept Kees's addition.

Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Andrew Morton

On Fri, 9 Mar 2018 12:05:36 -0800 Kees Cook  wrote:

> When max() is used in stack array size calculations from literal values
> (e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the compiler
> thinks this is a dynamic calculation due to the single-eval logic, which
> is not needed in the literal case. This change removes several accidental
> stack VLAs from an x86 allmodconfig build:
> 
> $ diff -u before.txt after.txt | grep ^-
> -drivers/input/touchscreen/cyttsp4_core.c:871:2: warning: ISO C90 forbids 
> variable length array ‘ids’ [-Wvla]
> -fs/btrfs/tree-checker.c:344:4: warning: ISO C90 forbids variable length 
> array ‘namebuf’ [-Wvla]
> -lib/vsprintf.c:747:2: warning: ISO C90 forbids variable length array ‘sym’ 
> [-Wvla]
> -net/ipv4/proc.c:403:2: warning: ISO C90 forbids variable length array ‘buff’ 
> [-Wvla]
> -net/ipv6/proc.c:198:2: warning: ISO C90 forbids variable length array ‘buff’ 
> [-Wvla]
> -net/ipv6/proc.c:218:2: warning: ISO C90 forbids variable length array 
> ‘buff64’ [-Wvla]
> 
> Based on an earlier patch from Josh Poimboeuf.

v1, v2 and v3 of this patch all fail with gcc-4.4.4:

./include/linux/jiffies.h: In function 'jiffies_delta_to_clock_t':
./include/linux/jiffies.h:444: error: first argument to '__builtin_choose_expr' 
not a constant

That's with

#define __max(t1, t2, x, y) \
__builtin_choose_expr(__builtin_constant_p(x) &&\
  __builtin_constant_p(y) &&\
  __builtin_types_compatible_p(t1, t2), \
  (t1)(x) > (t2)(y) ? (t1)(x) : (t2)(y),\
  __single_eval_max(t1, t2, \
__UNIQUE_ID(max1_), \
__UNIQUE_ID(max2_), \
x, y))
/**
 * max - return maximum of two values of the same or compatible types
 * @x: first value
 * @y: second value
 */
#define max(x, y)   __max(typeof(x), typeof(y), x, y)


A brief poke failed to reveal a workaround - gcc-4.4.4 doesn't appear
to know that __builtin_constant_p(x) is a constant.  Or something.

Sigh.  Wasn't there some talk about modernizing our toolchain
requirements?

Re: [PATCH] net/9p: avoid -ERESTARTSYS leak to userspace

2018-03-09 Thread Andrew Morton

On Fri, 09 Mar 2018 21:41:38 +0100 Greg Kurz  wrote:

> If it was interrupted by a signal, the 9p client may need to send some
> more requests to the server for cleanup before returning to userspace.
> 
> To avoid such a last minute request to be interrupted right away, the
> client memorizes if a signal is pending, clears TIF_SIGPENDING, handles
> the request and calls recalc_sigpending() before returning.
> 
> Unfortunately, if the transmission of this cleanup request fails for any
> reason, the transport returns an error and the client propagates it right
> away, without calling recalc_sigpending().
> 
> This ends up with -ERESTARTSYS from the initially interrupted request
> crawling up to syscall exit, with TIF_SIGPENDING cleared by the cleanup
> request. The specific signal handling code, which is responsible for
> converting -ERESTARTSYS to -EINTR is not called, and userspace receives
> the confusing errno value:
> 
> open: Unknown error 512 (512)
> 
> This is really hard to hit in real life. I discovered the issue while
> working on hot-unplug of a virtio-9p-pci device with an instrumented
> QEMU allowing to control request completion.
> 
> Both p9_client_zc_rpc() and p9_client_rpc() functions have this buggy
> error path actually. Their code flow is a bit obscure and the best
> thing to do would probably be a full rewrite: to really ensure this
> situation of clearing TIF_SIGPENDING and returning -ERESTARTSYS can
> never happen.
> 
> But given the general lack of interest for the 9p code, I won't risk
> breaking more things. So this patch simply fixes the buggy paths in
> both functions with a trivial label+goto.
> 
> Thanks to Laurent Dufour for his help and suggestions on how to find
> the root cause and how to fix it.

That's a fairly straightforward-looking bug.  However the code still
looks a bit racy:


:   if (signal_pending(current)) {
:   sigpending = 1;
:   clear_thread_flag(TIF_SIGPENDING);
:   } else
:   sigpending = 0;
: 

what happens if signal_pending(current) becomes true right here?

:   err = c->trans_mod->request(c, req);


I'm surprised that the 9p client is mucking with signals at all. 
Signals are a userspace IPC scheme and kernel code should instead be
using the more powerful messaging mechanisms which we've developed. 
Ones which don't disrupt userspace state.

Why is this happening?  Is there some userspace/kernel interoperation
involved?

Re: [PATCH] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-08 Thread Andrew Morton

On Thu, 8 Mar 2018 13:40:45 -0800 Kees Cook  wrote:

> When max() is used in stack array size calculations from literal values
> (e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the compiler
> thinks this is a dynamic calculation due to the single-eval logic, which
> is not needed in the literal case. This change removes several accidental
> stack VLAs from an x86 allmodconfig build:
> 
> $ diff -u before.txt after.txt | grep ^-
> -drivers/input/touchscreen/cyttsp4_core.c:871:2: warning: ISO C90 forbids 
> variable length array ‘ids’ [-Wvla]
> -fs/btrfs/tree-checker.c:344:4: warning: ISO C90 forbids variable length 
> array ‘namebuf’ [-Wvla]
> -lib/vsprintf.c:747:2: warning: ISO C90 forbids variable length array ‘sym’ 
> [-Wvla]
> -net/ipv4/proc.c:403:2: warning: ISO C90 forbids variable length array ‘buff’ 
> [-Wvla]
> -net/ipv6/proc.c:198:2: warning: ISO C90 forbids variable length array ‘buff’ 
> [-Wvla]
> -net/ipv6/proc.c:218:2: warning: ISO C90 forbids variable length array 
> ‘buff64’ [-Wvla]
> 
> Based on an earlier patch from Josh Poimboeuf.
> 
> ...
>
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -787,37 +787,57 @@ static inline void ftrace_dump(enum ftrace_dump_mode 
> oops_dump_mode) { }
>   * strict type-checking.. See the
>   * "unnecessary" pointer comparison.
>   */
> -#define __min(t1, t2, min1, min2, x, y) ({   \
> +#define __single_eval_min(t1, t2, min1, min2, x, y) ({   \
>   t1 min1 = (x);  \
>   t2 min2 = (y);  \
>   (void) (&min1 == &min2);\
>   min1 < min2 ? min1 : min2; })
>  
> +/*
> + * In the case of builtin constant values, there is no need to do the
> + * double-evaluation protection, so the raw comparison can be made.
> + * This allows min()/max() to be used in stack array allocations and
> + * avoid the compiler thinking it is a dynamic value leading to an
> + * accidental VLA.
> + */
> +#define __min(t1, t2, x, y)  \
> + __builtin_choose_expr(__builtin_constant_p(x) &&\
> +   __builtin_constant_p(y) &&\
> +   __builtin_types_compatible_p(t1, t2), \
> +   (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\
> +   __single_eval_min(t1, t2, \
> + __UNIQUE_ID(max1_), \
> + __UNIQUE_ID(max2_), \
> + x, y))
> +

Holy crap.

I suppose gcc will one day be fixed and we won't need this.

Is there a good reason to convert min()?  Surely nobody will be using
min to dimension an array - always max?  Just for symmetry, I guess.

Re: [PATCH 0/3] Remove accidental VLA usage

2018-03-08 Thread Andrew Morton

On Thu, 8 Mar 2018 09:02:36 -0600 Josh Poimboeuf  wrote:

> On Wed, Mar 07, 2018 at 07:30:44PM -0800, Kees Cook wrote:
> > This series adds SIMPLE_MAX() to be used in places where a stack array
> > is actually fixed, but the compiler still warns about VLA usage due to
> > confusion caused by the safety checks in the max() macro.
> > 
> > I'm sending these via -mm since that's where I've introduced SIMPLE_MAX(),
> > and they should all have no operational differences.
> 
> What if we instead simplify the max() macro's type checking so that GCC
> can more easily fold the array size constants?  The below patch seems to
> work:
> 
> -/*
> - * min()/max()/clamp() macros that also do
> - * strict type-checking.. See the
> - * "unnecessary" pointer comparison.
> - */
> -#define __min(t1, t2, min1, min2, x, y) ({   \
> - t1 min1 = (x);  \
> - t2 min2 = (y);  \
> - (void) (&min1 == &min2);\
> - min1 < min2 ? min1 : min2; })
> +extern long __error_incompatible_types_in_min_macro;
> +extern long __error_incompatible_types_in_max_macro;
> +
> +#define __min(t1, t2, x, y)  \
> + __builtin_choose_expr(__builtin_types_compatible_p(t1, t2), \
> +   (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\
> +   (t1)__error_incompatible_types_in_min_macro)

This will move the error detection from compile-time to link-time. 
That's tolerable I guess, but a bit sad and should be flagged in the
changelog at least.

Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-02-07 Thread Andrew Morton

On Wed, 7 Feb 2018 18:44:39 +0100 Pablo Neira Ayuso  wrote:

> Hi,
> 
> On Wed, Jan 31, 2018 at 09:19:16AM +0100, Michal Hocko wrote:
> [...]
> > Yeah, we do not BUG but rather fail instead. See __vmalloc_node_range.
> > My excavation tools pointed me to "VM: Rework vmalloc code to support 
> > mapping of arbitray pages"
> > by Christoph back in 2002. So yes, we can safely remove it finally. Se
> > below.
> > 
> > 
> > From 8d52e1d939d101b0dafed6ae5c3c1376183e65bb Mon Sep 17 00:00:00 2001
> > From: Michal Hocko 
> > Date: Wed, 31 Jan 2018 09:16:56 +0100
> > Subject: [PATCH] net/netfilter/x_tables.c: remove size check
> > 
> > Back in 2002 vmalloc used to BUG on too large sizes. We are much better
> > behaved these days and vmalloc simply returns NULL for those. Remove
> > the check as it simply not needed and the comment even misleading.
> > 
> > Suggested-by: Andrew Morton 
> > Signed-off-by: Michal Hocko 
> > ---
> >  net/netfilter/x_tables.c | 4 
> >  1 file changed, 4 deletions(-)
> > 
> > diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> > index b55ec5aa51a6..48a6ff620493 100644
> > --- a/net/netfilter/x_tables.c
> > +++ b/net/netfilter/x_tables.c
> > @@ -999,10 +999,6 @@ struct xt_table_info *xt_alloc_table_info(unsigned int 
> > size)
> > if (sz < sizeof(*info))
> > return NULL;
> >  
> > -   /* Pedantry: prevent them from hitting BUG() in vmalloc.c --RR */
> > -   if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
> > -   return NULL;
> > -
> > /* __GFP_NORETRY is not fully supported by kvmalloc but it should
> >  * work reasonably well if sz is too large and bail out rather
> >  * than shoot all processes down before realizing there is nothing
> 
> Patchwork didn't catch this patch for some reason, would you mind to
> resend?

From: Michal Hocko 
Subject: net/netfilter/x_tables.c: remove size check

Back in 2002 vmalloc used to BUG on too large sizes.  We are much better
behaved these days and vmalloc simply returns NULL for those.  Remove the
check as it simply not needed and the comment is even misleading.

Link: http://lkml.kernel.org/r/20180131081916.go21...@dhcp22.suse.cz
Suggested-by: Andrew Morton 
Signed-off-by: Michal Hocko 
Reviewed-by: Andrew Morton 
Cc: Florian Westphal 
Cc: David S. Miller 
Signed-off-by: Andrew Morton 
---

 net/netfilter/x_tables.c |4 
 1 file changed, 4 deletions(-)

diff -puN net/netfilter/x_tables.c~net-netfilter-x_tablesc-remove-size-check 
net/netfilter/x_tables.c
--- a/net/netfilter/x_tables.c~net-netfilter-x_tablesc-remove-size-check
+++ a/net/netfilter/x_tables.c
@@ -1004,10 +1004,6 @@ struct xt_table_info *xt_alloc_table_inf
if (sz < sizeof(*info))
return NULL;
 
-   /* Pedantry: prevent them from hitting BUG() in vmalloc.c --RR */
-   if ((size >> PAGE_SHIFT) + 2 > totalram_pages)
-   return NULL;
-
/* __GFP_NORETRY is not fully supported by kvmalloc but it should
 * work reasonably well if sz is too large and bail out rather
 * than shoot all processes down before realizing there is nothing
_

Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-01-30 Thread Andrew Morton

On Tue, 30 Jan 2018 15:01:04 +0100 Michal Hocko  wrote:

> > Well, this is not about syzkaller, it merely pointed out a potential
> > DoS... And that has to be addressed somehow.
> 
> So how about this?
> ---

argh ;)

> >From d48e950f1b04f234b57b9e34c363bdcfec10aeee Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Tue, 30 Jan 2018 14:51:07 +0100
> Subject: [PATCH] net/netfilter/x_tables.c: make allocation less aggressive
> 
> syzbot has noticed that xt_alloc_table_info can allocate a lot of
> memory. This is an admin only interface but an admin in a namespace
> is sufficient as well. eacd86ca3b03 ("net/netfilter/x_tables.c: use
> kvmalloc() in xt_alloc_table_info()") has changed the opencoded
> kmalloc->vmalloc fallback into kvmalloc. It has dropped __GFP_NORETRY on
> the way because vmalloc has simply never fully supported __GFP_NORETRY
> semantic. This is still the case because e.g. page tables backing the
> vmalloc area are hardcoded GFP_KERNEL.
> 
> Revert back to __GFP_NORETRY as a poors man defence against excessively
> large allocation request here. We will not rule out the OOM killer
> completely but __GFP_NORETRY should at least stop the large request
> in most cases.
> 
> Fixes: eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() in 
> xt_alloc_table_info()")
> Signed-off-by: Michal Hocko 
> ---
>  net/netfilter/x_tables.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index d8571f414208..a5f5c29bcbdc 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -1003,7 +1003,13 @@ struct xt_table_info *xt_alloc_table_info(unsigned int 
> size)
>   if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
>   return NULL;

offtopic: preceding comment here is "prevent them from hitting BUG() in
vmalloc.c".  I suspect this is ancient code and vmalloc sure as heck
shouldn't go BUG with this input.  And it should be using `sz' ;)

So I suspect and hope that this code can be removed.  If not, let's fix
vmalloc!

> - info = kvmalloc(sz, GFP_KERNEL);
> + /*
> +  * __GFP_NORETRY is not fully supported by kvmalloc but it should
> +  * work reasonably well if sz is too large and bail out rather
> +  * than shoot all processes down before realizing there is nothing
> +  * more to reclaim.
> +  */
> + info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
>   if (!info)
>   return NULL;

checkpatch sayeth

networking block comments don't use an empty /* line, use /* Comment...

So I'll do that and shall scoot the patch Davewards.

Re: [PATCH V11 4/5] vsprintf: add printk specifier %px

2017-11-29 Thread Andrew Morton

On Wed, 29 Nov 2017 13:05:04 +1100 "Tobin C. Harding"  wrote:

> printk specifier %p now hashes all addresses before printing. Sometimes
> we need to see the actual unmodified address. This can be achieved using
> %lx but then we face the risk that if in future we want to change the
> way the Kernel handles printing of pointers we will have to grep through
> the already existent 50 000 %lx call sites. Let's add specifier %px as a
> clear, opt-in, way to print a pointer and maintain some level of
> isolation from all the other hex integer output within the Kernel.
> 
> Add printk specifier %px to print the actual unmodified address.
> 
> ...
>
> +Unmodified Addresses
> +
> +
> +::
> +
> + %px 01234567 or 0123456789abcdef
> +
> +For printing pointers when you _really_ want to print the address. Please
> +consider whether or not you are leaking sensitive information about the
> +Kernel layout in memory before printing pointers with %px. %px is
> +functionally equivalent to %lx. %px is preferred to %lx because it is more
> +uniquely grep'able. If, in the future, we need to modify the way the Kernel
> +handles printing pointers it will be nice to be able to find the call
> +sites.
> +

You might want to add a checkpatch rule which emits a stern
do-you-really-want-to-do-this warning when someone uses %px.

Re: [PATCH V11 3/5] printk: hash addresses printed with %p

2017-11-29 Thread Andrew Morton

On Wed, 29 Nov 2017 13:05:03 +1100 "Tobin C. Harding"  wrote:

> Currently there exist approximately 14 000 places in the kernel where
> addresses are being printed using an unadorned %p. This potentially
> leaks sensitive information regarding the Kernel layout in memory. Many
> of these calls are stale, instead of fixing every call lets hash the
> address by default before printing. This will of course break some
> users, forcing code printing needed addresses to be updated.
> 
> Code that _really_ needs the address will soon be able to use the new
> printk specifier %px to print the address.
> 
> For what it's worth, usage of unadorned %p can be broken down as
> follows (thanks to Joe Perches).
> 
> $ git grep -E '%p[^A-Za-z0-9]' | cut -f1 -d"/" | sort | uniq -c
>1084 arch
>  20 block
>  10 crypto
>  32 Documentation
>8121 drivers
>1221 fs
> 143 include
> 101 kernel
>  69 lib
> 100 mm
>1510 net
>  40 samples
>   7 scripts
>  11 security
> 166 sound
> 152 tools
>   2 virt
> 
> Add function ptr_to_id() to map an address to a 32 bit unique
> identifier. Hash any unadorned usage of specifier %p and any malformed
> specifiers.
> 
> ...
>
> @@ -1644,6 +1646,73 @@ char *device_node_string(char *buf, char *end, struct 
> device_node *dn,
>   return widen_string(buf, buf - buf_start, end, spec);
>  }
>  
> +static bool have_filled_random_ptr_key __read_mostly;
> +static siphash_key_t ptr_key __read_mostly;
> +
> +static void fill_random_ptr_key(struct random_ready_callback *unused)
> +{
> + get_random_bytes(&ptr_key, sizeof(ptr_key));
> + /*
> +  * have_filled_random_ptr_key==true is dependent on get_random_bytes().
> +  * ptr_to_id() needs to see have_filled_random_ptr_key==true
> +  * after get_random_bytes() returns.
> +  */
> + smp_mb();
> + WRITE_ONCE(have_filled_random_ptr_key, true);
> +}

I don't think I'm seeing anything which prevents two CPUs from
initializing ptr_key at the same time.  Probably doesn't matter much...

Re: [PATCH V11 0/5] hash addresses printed with %p

2017-11-29 Thread Andrew Morton

On Wed, 29 Nov 2017 13:05:00 +1100 "Tobin C. Harding"  wrote:

> Currently there exist approximately 14 000 places in the Kernel where
> addresses are being printed using an unadorned %p. This potentially
> leaks sensitive information regarding the Kernel layout in memory. Many
> of these calls are stale, instead of fixing every call lets hash the
> address by default before printing. This will of course break some
> users, forcing code printing needed addresses to be updated. We can add
> a printk specifier for this purpose (%px) to give developers a clear
> upgrade path for breakages caused by applying this patch set.
> 
> The added advantage of hashing %p is that security is now opt-out, if
> you _really_ want the address you have to work a little harder and use
> %px.
> 
> The idea for creating the printk specifier %px to print the actual
> address was suggested by Kees Cook (see below for email threads by
> subject).

Maybe I'm being thick, but...  if we're rendering these addresses
unusable by hashing them, why not just print something like
"" in their place?  That loses the uniqueness thing but I
wonder how valuable that is in practice?

Re: [PATCH] mm, page_alloc: re-enable softirq use of per-cpu page allocator

2017-04-10 Thread Andrew Morton

On Mon, 10 Apr 2017 16:08:21 +0100 Mel Gorman  
wrote:

> IRQ context were excluded from using the Per-Cpu-Pages (PCP) lists caching
> of order-0 pages in commit 374ad05ab64d ("mm, page_alloc: only use per-cpu
> allocator for irq-safe requests").
> 
> This unfortunately also included excluded SoftIRQ.  This hurt the performance
> for the use-case of refilling DMA RX rings in softirq context.

Out of curiosity: by how much did it "hurt"?

Tariq found:

: I disabled the page-cache (recycle) mechanism to stress the page
: allocator, and see a drastic degradation in BW, from 47.5 G in v4.10 to
: 31.4 G in v4.11-rc1 (34% drop).

then with this patch he found

: It looks very good!  I get line-rate (94Gbits/sec) with 8 streams, in
: comparison to less than 55Gbits/sec before.

Can I take this to mean that the page allocator's per-cpu-pages feature
ended up doubling the performance of this driver?  Better than the
driver's private page recycling?  I'd like to believe that, but am
having trouble doing so ;)

> This patch re-allow softirq context, which should be safe by disabling
> BH/softirq, while accessing the list.  PCP-lists access from both hard-IRQ
> and NMI context must not be allowed.  Peter Zijlstra says in_nmi() code
> never access the page allocator, thus it should be sufficient to only test
> for !in_irq().
> 
> One concern with this change is adding a BH (enable) scheduling point at
> both PCP alloc and free. If further concerns are highlighted by this patch,
> the result wiill be to revert 374ad05ab64d and try again at a later date
> to offset the irq enable/disable overhead.

Re: [mm PATCH 0/3] Page fragment updates

2016-12-05 Thread Andrew Morton

On Mon, 5 Dec 2016 09:01:12 -0800 Alexander Duyck  
wrote:

> On Tue, Nov 29, 2016 at 10:23 AM, Alexander Duyck
>  wrote:
> > This patch series takes care of a few cleanups for the page fragments API.
> >
> > ...
> 
> It's been about a week since I submitted this series.  Just wanted to
> check in and see if anyone had any feedback or if this is good to be
> accepted for 4.10-rc1 with the rest of the set?

Looks good to me.  I have it all queued for post-4.9 processing.

Re: [mm PATCH v3 21/23] mm: Add support for releasing multiple instances of a page

2016-11-21 Thread Andrew Morton

On Mon, 21 Nov 2016 08:21:39 -0800 Alexander Duyck  
wrote:

> >> + __free_pages_ok(page, order);
> >> + }
> >> +}
> >> +EXPORT_SYMBOL(__page_frag_drain);
> >
> > It's an exported-to-modules library function.  It should be documented,
> > please?  The page-frag API is only partially documented, but that's no
> > excuse.
> 
> Okay.  I assume you want the documentation as a follow-up patch since
> I received a notice that the patch was added to -mm?

Yes please.  Or a replacement patch which I'll temporarily turn into a
delta, either is fine.

> If you would like I could look at doing a couple of renaming patches
> so that we make the API a bit more consistent.  I could move the
> __alloc and __free to what you have suggested, and then take a look at
> trying to rename the refill/drain to be a bit more consistent in terms
> of what they are supposed to work on and how they are supposed to be
> used.

I think that would be better - it's hardly high-priority but a bit of
attention to the documentation and naming conventions would help tidy
things up.  When you can't find anything else to do ;)

Re: [mm PATCH v3 21/23] mm: Add support for releasing multiple instances of a page

2016-11-18 Thread Andrew Morton

On Thu, 10 Nov 2016 06:36:06 -0500 Alexander Duyck 
 wrote:

> This patch adds a function that allows us to batch free a page that has
> multiple references outstanding.  Specifically this function can be used to
> drop a page being used in the page frag alloc cache.  With this drivers can
> make use of functionality similar to the page frag alloc cache without
> having to do any workarounds for the fact that there is no function that
> frees multiple references.
> 
> ...
>
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -506,6 +506,8 @@ extern void free_hot_cold_page(struct page *page, bool 
> cold);
>  extern void free_hot_cold_page_list(struct list_head *list, bool cold);
>  
>  struct page_frag_cache;
> +extern void __page_frag_drain(struct page *page, unsigned int order,
> +   unsigned int count);
>  extern void *__alloc_page_frag(struct page_frag_cache *nc,
>  unsigned int fragsz, gfp_t gfp_mask);
>  extern void __free_page_frag(void *addr);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0fbfead..54fea40 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3912,6 +3912,20 @@ static struct page *__page_frag_refill(struct 
> page_frag_cache *nc,
>   return page;
>  }
>  
> +void __page_frag_drain(struct page *page, unsigned int order,
> +unsigned int count)
> +{
> + VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
> +
> + if (page_ref_sub_and_test(page, count)) {
> + if (order == 0)
> + free_hot_cold_page(page, false);
> + else
> + __free_pages_ok(page, order);
> + }
> +}
> +EXPORT_SYMBOL(__page_frag_drain);

It's an exported-to-modules library function.  It should be documented,
please?  The page-frag API is only partially documented, but that's no
excuse.

And perhaps documentation will help explain the naming choice.  Why
"drain"?  I'd have expected "put"?

And why the leading underscores.  The page-frag API is pretty weird :(

And inconsistent.  __alloc_page_frag -> page_frag_alloc,
__free_page_frag -> page_frag_free(), etc.  I must have been asleep
when I let that lot through.

Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)

2016-09-26 Thread Andrew Morton

On Thu, 22 Sep 2016 18:43:59 +0200 Vlastimil Babka  wrote:

> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
> with the number of fds passed. We had a customer report page allocation
> failures of order-4 for this allocation. This is a costly order, so it might
> easily fail, as the VM expects such allocation to have a lower-order fallback.
> 
> Such trivial fallback is vmalloc(), as the memory doesn't have to be
> physically contiguous. Also the allocation is temporary for the duration of 
> the
> syscall, so it's unlikely to stress vmalloc too much.
> 
> Note that the poll(2) syscall seems to use a linked list of order-0 pages, so
> it doesn't need this kind of fallback.
> 
> ...
>
> --- a/fs/select.c
> +++ b/fs/select.c
> @@ -29,6 +29,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  
> @@ -558,6 +559,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set 
> __user *outp,
>   struct fdtable *fdt;
>   /* Allocate small arguments on the stack to save memory and be faster */
>   long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];
> + unsigned long alloc_size;
>  
>   ret = -EINVAL;
>   if (n < 0)
> @@ -580,8 +582,12 @@ int core_sys_select(int n, fd_set __user *inp, fd_set 
> __user *outp,
>   bits = stack_fds;
>   if (size > sizeof(stack_fds) / 6) {
>   /* Not enough space in on-stack array; must use kmalloc */
> + alloc_size = 6 * size;

Well.  `size' is `unsigned'.  The multiplication will be done as 32-bit
so there was no point in making `alloc_size' unsigned long.

So can we tighten up the types in this function?  size_t might make
sense, but vmalloc() takes a ulong.

>   ret = -ENOMEM;
> - bits = kmalloc(6 * size, GFP_KERNEL);
> + bits = kmalloc(alloc_size, GFP_KERNEL|__GFP_NOWARN);
> + if (!bits && alloc_size > PAGE_SIZE)
> + bits = vmalloc(alloc_size);

I don't share Eric's concerns about performance here.  If the vmalloc()
is called, we're about to write to that quite large amount of memory
which we just allocated, and the vmalloc() overhead will be relatively
low.

>   if (!bits)
>   goto out_nofds;
>   }
> @@ -618,7 +624,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set 
> __user *outp,
>  
>  out:
>   if (bits != stack_fds)
> - kfree(bits);
> + kvfree(bits);
>  out_nofds:
>   return ret;

It otherwise looks OK to me.

Re: [PATCH 3/3] mm: memcontrol: consolidate cgroup socket tracking

2016-09-14 Thread Andrew Morton

On Thu, 15 Sep 2016 13:34:24 +0800 kbuild test robot  wrote:

> Hi Johannes,
> 
> [auto build test ERROR on net/master]
> [also build test ERROR on v4.8-rc6 next-20160914]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> [Suggest to use git(>=2.9.0) format-patch --base= (or --base=auto for 
> convenience) to record what (public, well-known) commit your patch series was 
> built on]
> [Check https://git-scm.com/docs/git-format-patch for more information]
> 
> url:
> https://github.com/0day-ci/linux/commits/Johannes-Weiner/mm-memcontrol-make-per-cpu-charge-cache-IRQ-safe-for-socket-accounting/20160915-035634
> config: m68k-sun3_defconfig (attached as .config)
> compiler: m68k-linux-gcc (GCC) 4.9.0
> reproduce:
> wget 
> https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
>  -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=m68k 
> 
> All errors (new ones prefixed by >>):
> 
>net/built-in.o: In function `sk_alloc':
> >> (.text+0x4076): undefined reference to `mem_cgroup_sk_alloc'
>net/built-in.o: In function `__sk_destruct':
> >> sock.c:(.text+0x457e): undefined reference to `mem_cgroup_sk_free'
>net/built-in.o: In function `sk_clone_lock':
>(.text+0x4f1c): undefined reference to `mem_cgroup_sk_alloc'

This?

--- a/mm/memcontrol.c~mm-memcontrol-consolidate-cgroup-socket-tracking-fix
+++ a/mm/memcontrol.c
@@ -5655,9 +5655,6 @@ void mem_cgroup_sk_alloc(struct sock *sk
 {
struct mem_cgroup *memcg;
 
-   if (!mem_cgroup_sockets_enabled)
-   return;
-
/*
 * Socket cloning can throw us here with sk_memcg already
 * filled. It won't however, necessarily happen from
--- a/net/core/sock.c~mm-memcontrol-consolidate-cgroup-socket-tracking-fix
+++ a/net/core/sock.c
@@ -1385,7 +1385,8 @@ static void sk_prot_free(struct proto *p
slab = prot->slab;
 
cgroup_sk_free(&sk->sk_cgrp_data);
-   mem_cgroup_sk_free(sk);
+   if (mem_cgroup_sockets_enabled)
+   mem_cgroup_sk_free(sk);
security_sk_free(sk);
if (slab != NULL)
kmem_cache_free(slab, sk);
@@ -1422,7 +1423,8 @@ struct sock *sk_alloc(struct net *net, i
sock_net_set(sk, net);
atomic_set(&sk->sk_wmem_alloc, 1);
 
-   mem_cgroup_sk_alloc(sk);
+   if (mem_cgroup_sockets_enabled)
+   mem_cgroup_sk_alloc(sk);
cgroup_sk_alloc(&sk->sk_cgrp_data);
sock_update_classid(&sk->sk_cgrp_data);
sock_update_netprioidx(&sk->sk_cgrp_data);
@@ -1569,7 +1571,8 @@ struct sock *sk_clone_lock(const struct
newsk->sk_incoming_cpu = raw_smp_processor_id();
atomic64_set(&newsk->sk_cookie, 0);
 
-   mem_cgroup_sk_alloc(newsk);
+   if (mem_cgroup_sockets_enabled)
+   mem_cgroup_sk_alloc(newsk);
cgroup_sk_alloc(&newsk->sk_cgrp_data);
 
/*
_

Re: [PATCH] remove lots of IS_ERR_VALUE abuses

2016-05-27 Thread Andrew Morton

On Fri, 27 May 2016 23:23:25 +0200 Arnd Bergmann  wrote:

> Most users of IS_ERR_VALUE() in the kernel are wrong, as they
> pass an 'int' into a function that takes an 'unsigned long'
> argument. This happens to work because the type is sign-extended
> on 64-bit architectures before it gets converted into an
> unsigned type.
> 
> However, anything that passes an 'unsigned short' or 'unsigned int'
> argument into IS_ERR_VALUE() is guaranteed to be broken, as are
> 8-bit integers and types that are wider than 'unsigned long'.
> 
> Andrzej Hajda has already fixed a lot of the worst abusers that
> were causing actual bugs, but it would be nice to prevent any
> users that are not passing 'unsigned long' arguments.
> 
> This patch changes all users of IS_ERR_VALUE() that I could find
> on 32-bit ARM randconfig builds and x86 allmodconfig. For the
> moment, this doesn't change the definition of IS_ERR_VALUE()
> because there are probably still architecture specific users
> elsewhere.

So you do plan to add some sort of typechecking into IS_ERR_VALUE()?

> Almost all the warnings I got are for files that are better off
> using 'if (err)' or 'if (err < 0)'.
> The only legitimate user I could find that we get a warning for
> is the (32-bit only) freescale fman driver, so I did not remove
> the IS_ERR_VALUE() there but changed the type to 'unsigned long'.
> For 9pfs, I just worked around one user whose calling conventions
> are so obscure that I did not dare change the behavior.
> 
> I was using this definition for testing:
> 
>  #define IS_ERR_VALUE(x) ((unsigned long*)NULL == (typeof (x)*)NULL && \
>unlikely((unsigned long long)(x) >= (unsigned long 
> long)(typeof(x))-MAX_ERRNO))
> 
> which ends up making all 16-bit or wider types work correctly with
> the most plausible interpretation of what IS_ERR_VALUE() was supposed
> to return according to its users, but also causes a compile-time
> warning for any users that do not pass an 'unsigned long' argument.
> 
> I suggested this approach earlier this year, but back then we ended
> up deciding to just fix the users that are obviously broken. After
> the initial warning that caused me to get involved in the discussion
> (fs/gfs2/dir.c) showed up again in the mainline kernel, Linus
> asked me to send the whole thing again.

Re: [Y2038] [RESEND PATCH 2/3] fs: poll/select/recvmmsg: use timespec64 for timeout events

2016-05-04 Thread Andrew Morton

On Wed, 04 May 2016 23:08:11 +0200 Arnd Bergmann  wrote:

> > But I'm less comfortable making the call on this one. It looks
> > relatively straight forward, but it would be good to have maintainer
> > acks before I add it to my tree.
> 
> Agreed. Feel free to add my
> 
> Reviewed-by: Arnd Bergmann 
> 
> at least (whoever picks it up).

In reply to [1/3] John said

: Looks ok at the first glance. I've queued these up for testing,
: however I only got #1 and #3 of the set. Are you hoping these two
: patches will go through tip/timers/core or are you looking for acks so
: they can go via another tree?

However none of the patches are in linux-next.

John had qualms about [2/3], but it looks like a straightforward
substitution in areas which will get plenty of testing

I'll grab the patches for now to get them some external testing.

Re: [Bug 117521] New: BUG: unable to handle kernel paging request at 000001a400015ff4

2016-05-02 Thread Andrew Morton


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

Thanks.  It's probably a TIPC issue.

On Mon, 02 May 2016 16:44:18 + bugzilla-dae...@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=117521
> 
> Bug ID: 117521
>Summary: BUG: unable to handle kernel paging request at
> 01a400015ff4
>Product: Memory Management
>Version: 2.5
> Kernel Version: 4.4.0
>   Hardware: All
> OS: Linux
>   Tree: Mainline
> Status: NEW
>   Severity: high
>   Priority: P1
>  Component: Other
>   Assignee: a...@linux-foundation.org
>   Reporter: gbala...@gmail.com
> Regression: No
> 
> [   65.954959] sm-msp-queue[1279]: unable to qualify my own domain name
> (dcsx5testslot3) -- using short name
> [  632.098785] perf interrupt took too long (2505 > 2500), lowering
> kernel.perf_event_max_sample_rate to 5
> [ 5880.428123] perf interrupt took too long (5585 > 5000), lowering
> kernel.perf_event_max_sample_rate to 25000
> [17934.014969] CE: hpet increased min_delta_ns to 20115 nsec
> [38956.721789] CE: hpet4 increased min_delta_ns to 20115 nsec
> [46927.872827] hrtimer: interrupt took 63361 ns
> [101662.241093] CE: hpet2 increased min_delta_ns to 20115 nsec
> [245973.044600] CE: hpet6 increased min_delta_ns to 20115 nsec
> [368639.565040] show_signal_msg: 6 callbacks suppressed
> [375832.498126] BUG: unable to handle kernel paging request at 
> 01a400015ff4
> [375832.505300] IP: [] queued_spin_lock_slowpath+0xe6/0x160
> [375832.512394] PGD 0 
> [375832.514657] Oops: 0002 [#1] SMP
> [375832.518306] Modules linked in: nf_log_ipv6 nf_log_ipv4 nf_log_common 
> xt_LOG
> sctp libcrc32c e1000e tipc udp_tunnel ip6_udp_tunnel 8021q garp iTCO_wdt
> xt_physdev br_netfilter bridge stp llc nf_conntrack_ipv4 nf_defrag_ipv4
> ipmiq_drv(O) sio_mmc(O) ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6
> nf_defrag_ipv6 xt_state nf_conntrack lockd ip6table_filter event_drv(O)
> ip6_tables grace pt_timer_info(O) ddi(O) usb_storage ixgbe igb i2c_i801
> iTCO_vendor_support i2c_algo_bit ioatdma intel_ips i2c_core pcspkr sunrpc ptp
> mdio dca pps_core lpc_ich tpm_tis mfd_core tpm [last unloaded: iTCO_wdt]
> [375832.573693] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G   O4.4.0
> #14
> [375832.581385] Hardware name: PT AMC124/Base Board Product Name, BIOS
> LGNAJFIP.PTI.0012.P15 01/15/2014
> [375832.591028] task: 880351a89b40 ti: 880351a9 task.ti:
> 880351a9
> [375832.599026] RIP: 0010:[]  []
> queued_spin_lock_slowpath+0xe6/0x160
> [375832.608964] RSP: 0018:88035fc83d58  EFLAGS: 00010002
> [375832.614825] RAX: 1447 RBX: 0292 RCX:
> 88035fc95fc0
> [375832.622743] RDX: 01a400015ff4 RSI: 0014 RDI:
> 880351232f80
> [375832.630567] RBP: 88035fc83d58 R08: 0101 R09:
> 0004
> [375832.638348] R10:  R11:  R12:
> 01001002
> [375832.645919] R13: 0001 R14:  R15:
> 
> [375832.653610] FS:  () GS:88035fc8()
> knlGS:
> [375832.662317] CS:  0010 DS:  ES:  CR0: 8005003b
> [375832.668483] CR2: 01a400015ff4 CR3: 01c0a000 CR4:
> 06e0
> [375832.676133] Stack:
> [375832.678344]  88035fc83d78 816de2c1 88034a8bba60
> 880351232f80
> [375832.686163]  88035fc83db8 810bc592 88035fc83dc8
> 880351758000
> [375832.694139]  01001002  b802f4bd
> a024e6f0
> [375832.702154] Call Trace:
> [375832.704844]   
> [375832.707018]  [] _raw_spin_lock_irqsave+0x31/0x40
> [375832.713970]  [] __wake_up+0x32/0x70
> [375832.719444]  [] ? tipc_recv_stream+0x370/0x370 [tipc]
> [375832.726589]  [] sock_def_wakeup+0x30/0x40
> [375832.732566]  [] tipc_sk_timeout+0x148/0x180 [tipc]
> [375832.739388]  [] ? tipc_recv_stream+0x370/0x370 [tipc]
> [375832.746507]  [] call_timer_fn+0x44/0x110
> [375832.752378]  [] ? cascade+0x4a/0x80
> [375832.757848]  [] ? tipc_recv_stream+0x370/0x370 [tipc]
> [375832.764871]  [] run_timer_softirq+0x22c/0x280
> [375832.771175]  [] __do_softirq+0xc8/0x260
> [375832.776958]  [] irq_exit+0x83/0xb0
> [375832.782369]  [] do_IRQ+0x65/0xf0
> [375832.787607]  [] common_interrupt+0x7f/0x7f
> [375832.793709]   
> [375832.795803]  [] ? cpuidle_enter_state+0xad/0x200
> [375832.802765]  [] ? cpuidle_enter_state+0x91/0x200
> [375832.809338]  [] cpuidle_enter+0x17/0x20
> [375832.815155]  [] call_cpuidle+0x37/0x60
> [375832.821184]  [] ? cpuidle_select+0x13/0x20
> [375832.827249]  [] cpu_startup_entry+0x211/0x2d0
> [375832.833535]  [] start_secondary+0x103/0x130
> [375832.839759] Code: 87 47 02 c1 e0 10 85 c0 74 38 48 89 c2 c1 e8 12 48 c1 ea
> 0c 83 e8 01 83 e2 30 48 98 48 81 c2 c0 5f 01 00 48 03 14 c5 0

Re: [PATCH net-next v7 01/19] lib/bitmap.c: conversion routines to/from u32 array

2016-02-08 Thread Andrew Morton

On Sun,  7 Feb 2016 17:08:45 -0800 David Decotigny  wrote:

> From: David Decotigny 
> 
> Aimed at transferring bitmaps to/from user-space in a 32/64-bit agnostic
> way.
> 
> Tested:
>   unit tests (next patch) on qemu i386, x86_64, ppc, ppc64 BE and LE,
>   ARM.
> 
> @@ -1060,6 +1062,90 @@ int bitmap_allocate_region(unsigned long *bitmap, 
> unsigned int pos, int order)
>  EXPORT_SYMBOL(bitmap_allocate_region);
>  
>  /**
> + * bitmap_from_u32array - copy the contents of a u32 array of bits to bitmap
> + *   @bitmap: array of unsigned longs, the destination bitmap, non NULL
> + *   @nbits: number of bits in @bitmap
> + *   @buf: array of u32 (in host byte order), the source bitmap, non NULL
> + *   @nwords: number of u32 words in @buf
> + *
> + * copy min(nbits, 32*nwords) bits from @buf to @bitmap, remaining
> + * bits between nword and nbits in @bitmap (if any) are cleared. In
> + * last word of @bitmap, the bits beyond nbits (if any) are kept
> + * unchanged.
> + */

This will leave the caller not knowing how many valid bits are actually
present in the resulting bitmap.  To determine that, the caller will
need to perform (duplicated) math on `nbits' and `nwords'.

> +void bitmap_from_u32array(unsigned long *bitmap, unsigned int nbits,
> +   const u32 *buf, unsigned int nwords)

So how about we make this return the number of valid bits in *bitmap?

> +/**
> + * bitmap_to_u32array - copy the contents of bitmap to a u32 array of bits
> + *   @buf: array of u32 (in host byte order), the dest bitmap, non NULL
> + *   @nwords: number of u32 words in @buf
> + *   @bitmap: array of unsigned longs, the source bitmap, non NULL
> + *   @nbits: number of bits in @bitmap
> + *
> + * copy min(nbits, 32*nwords) bits from @bitmap to @buf. Remaining
> + * bits after nbits in @buf (if any) are cleared.
> + */
> +void bitmap_to_u32array(u32 *buf, unsigned int nwords,
> + const unsigned long *bitmap, unsigned int nbits)

Ditto.

Re: [PATCH] mm: memcontrol: only manage socket pressure for CONFIG_INET

2015-12-09 Thread Andrew Morton

On Wed, 9 Dec 2015 18:05:05 -0500 Johannes Weiner  wrote:

> On Wed, Dec 09, 2015 at 02:28:36PM -0800, Andrew Morton wrote:
> > On Wed, 9 Dec 2015 13:58:58 -0500 Johannes Weiner  
> > wrote:
> > > The calls to tcp_init_cgroup() appear earlier in the series than "mm:
> > > memcontrol: hook up vmpressure to socket pressure". However, they get
> > > moved around a few times so fixing it earlier means respinning the
> > > series. Andrew, it's up to you whether we take the bisectability hit
> > > for !CONFIG_INET && CONFIG_MEMCG (how common is this?) or whether you
> > > want me to resend the series.
> > 
> > hm, drat, I was suspecting dependency issues here, but a test build
> > said it was OK.
> > 
> > Actually, I was expecting this patch series to depend on the linux-next
> > cgroup2 changes, but that doesn't appear to be the case.  *should* this
> > series be staged after the cgroup2 code?
> 
> Code-wise they are independent. My stuff is finishing up the new memcg
> control knobs, the cgroup2 stuff is changing how and when those knobs
> are exposed from within the cgroup core. I'm not relying on any recent
> changes in the cgroup core AFAICS, so the order shouldn't matter here.

OK, thanks.

> > Regarding this particular series: yes, I think we can live with a
> > bisection hole for !CONFIG_INET && CONFIG_MEMCG users.  But I'm not
> > sure why we're discussing bisection issues, because Arnd's build
> > failure occurs with everything applied?
> 
> Arnd's patches apply to the top of the stack, but they address issues
> introduced early in the series and the problematic code gets touched a
> lot in subsequent patches. E.g. the first build breakage is in ("net:
> tcp_memcontrol: simplify linkage between socket and page counter")
> when the tcp_init_cgroup() and tcp_destroy_cgroup() function calls get
> moved around and lose the CONFIG_INET protection.

Yeah, this is a pain.  I think I'll fold Arnd's fix into
mm-memcontrol-introduce-config_memcg_legacy_kmem.patch (which is staged
after all the other MM patches and after linux-next) and will pretend I
didn't know about the issue ;)

> Anyway, if we can live with the bisection caveat then Arnd's fixes on
> top of the kmem series look good to me. Depending on what Vladimir
> thinks we might want to replace the CONFIG_SLOB fix with something
> else later on, but that shouldn't be a problem, either.

I don't have a fix for the CONFIG_SLOB&&CONFIG_MEMCG issue yet.  I
agree that it would be best to make the combination work correctly
rather than banning it, but that does require a bit of runtime testing.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] mm: memcontrol: only manage socket pressure for CONFIG_INET

2015-12-09 Thread Andrew Morton

On Wed, 9 Dec 2015 13:58:58 -0500 Johannes Weiner  wrote:

> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 6faea81e66d7..73cd572167bb 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -4220,13 +4220,13 @@ mem_cgroup_css_online(struct cgroup_subsys_state 
> > *css)
> > if (ret)
> > return ret;
> >  
> > +#ifdef CONFIG_INET
> >  #ifdef CONFIG_MEMCG_LEGACY_KMEM
> > ret = tcp_init_cgroup(memcg);
> > if (ret)
> > return ret;
> >  #endif
> 
> The calls to tcp_init_cgroup() appear earlier in the series than "mm:
> memcontrol: hook up vmpressure to socket pressure". However, they get
> moved around a few times so fixing it earlier means respinning the
> series. Andrew, it's up to you whether we take the bisectability hit
> for !CONFIG_INET && CONFIG_MEMCG (how common is this?) or whether you
> want me to resend the series.

hm, drat, I was suspecting dependency issues here, but a test build
said it was OK.

Actually, I was expecting this patch series to depend on the linux-next
cgroup2 changes, but that doesn't appear to be the case.  *should* this
series be staged after the cgroup2 code?

Regarding this particular series: yes, I think we can live with a
bisection hole for !CONFIG_INET && CONFIG_MEMCG users.  But I'm not
sure why we're discussing bisection issues, because Arnd's build
failure occurs with everything applied?

> Sorry about the trouble. I don't have a git tree on kernel.org because
> we don't really use git in -mm, but the downside is that we don't get
> the benefits of the automatic build testing for all kinds of configs.
> I'll try to set up a git tree to expose series to full build coverage
> before they hit -mm and -next.

This sort of thing happens quite rarely.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Intel-wired-lan] [Patch V3 5/9] i40e: Use numa_mem_id() to better support memoryless node

2015-10-08 Thread Andrew Morton

On Wed, 19 Aug 2015 17:18:15 -0700 (PDT) David Rientjes  
wrote:

> On Wed, 19 Aug 2015, Patil, Kiran wrote:
> 
> > Acked-by: Kiran Patil 
> 
> Where's the call to preempt_disable() to prevent kernels with preemption 
> from making numa_node_id() invalid during this iteration?

David asked this question twice, received no answer and now the patch
is in the maintainer tree, destined for mainline.

If I was asked this question I would respond

  The use of numa_mem_id() is racy and best-effort.  If the unlikely
  race occurs, the memory allocation will occur on the wrong node, the
  overall result being very slightly suboptimal performance.  The
  existing use of numa_node_id() suffers from the same issue.

But I'm not the person proposing the patch.  Please don't just ignore
reviewer comments!

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: simplify configfs attributes V2

2015-10-05 Thread Andrew Morton

On Sat,  3 Oct 2015 15:32:36 +0200 Christoph Hellwig  wrote:

> This series consolidates the code to implement configfs attributes
> by providing the ->show and ->store method in common code and using
> container_of in the methods to access the containing structure.
> 
> This reduces source and binary size of configfs consumers a lot.

There's a version of this patch series (I assume V1) in linux-next via
Nicholas's tree so my hands are tied.  I trust that Nicholas will
update things.

Or maybe it was a slipup - modifying usb, ocfs2 etc from the iscsi tree
is innovative ;)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists

2015-10-02 Thread Andrew Morton

On Fri, 2 Oct 2015 15:40:39 +0200 Jesper Dangaard Brouer  
wrote:

> > Thus, I need introducing new code like this patch and at the same time
> > have to reduce the number of instruction-cache misses/usage.  In this
> > case we solve the problem by kmem_cache_free_bulk() not getting called
> > too often. Thus, +17 bytes will hopefully not matter too much... but on
> > the other hand we sort-of know that calling kmem_cache_free_bulk() will
> > cause icache misses.
> 
> I just tested this change on top of my net-use-case patchset... and for
> some strange reason the code with this WARN_ON is faster and have much
> less icache-misses (1,278,276 vs 2,719,158 L1-icache-load-misses).
> 
> Thus, I think we should keep your fix.
> 
> I cannot explain why using WARN_ON() is better and cause less icache
> misses.  And I hate when I don't understand every detail.
> 
>  My theory is, after reading the assembler code, that the UD2
> instruction (from BUG_ON) cause some kind of icache decoder stall
> (Intel experts???).  Now that should not be a problem, as UD2 is
> obviously placed as an unlikely branch and left at the end of the asm
> function call.  But the call to __slab_free() is also placed at the end
> of the asm function (gets inlined from slab_free() as unlikely).  And
> it is actually fairly likely that bulking is calling __slab_free (slub
> slowpath call).

Yes, I was looking at the asm code and the difference is pretty small:
a not-taken ud2 versus a not-taken "call warn_slowpath_null", mainly.

But I wouldn't assume that the microbenchmarking is meaningful.  I've
seen shockingly large (and quite repeatable) microbenchmarking
differences from small changes in code which isn't even executed (and
this is one such case, actually).  You add or remove just one byte of
text and half the kernel (or half the .o file?) gets a different
alignment and this seems to change everything.

Deleting the BUG altogether sounds the best solution.  As long as the
kernel crashes in some manner, we'll be able to work out what happened.
And it's cant-happen anyway, isn't it?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB freelists

2015-10-01 Thread Andrew Morton

On Wed, 30 Sep 2015 13:44:19 +0200 Jesper Dangaard Brouer  
wrote:

> Make it possible to free a freelist with several objects by adjusting
> API of slab_free() and __slab_free() to have head, tail and an objects
> counter (cnt).
> 
> Tail being NULL indicate single object free of head object.  This
> allow compiler inline constant propagation in slab_free() and
> slab_free_freelist_hook() to avoid adding any overhead in case of
> single object free.
> 
> This allows a freelist with several objects (all within the same
> slab-page) to be free'ed using a single locked cmpxchg_double in
> __slab_free() and with an unlocked cmpxchg_double in slab_free().
> 
> Object debugging on the free path is also extended to handle these
> freelists.  When CONFIG_SLUB_DEBUG is enabled it will also detect if
> objects don't belong to the same slab-page.
> 
> These changes are needed for the next patch to bulk free the detached
> freelists it introduces and constructs.
> 
> Micro benchmarking showed no performance reduction due to this change,
> when debugging is turned off (compiled with CONFIG_SLUB_DEBUG).
> 

checkpatch says

WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather 
than BUG() or BUG_ON()
#205: FILE: mm/slub.c:2888:
+   BUG_ON(!size);


Linus will get mad at you if he finds out, and we wouldn't want that.

--- a/mm/slub.c~slub-optimize-bulk-slowpath-free-by-detached-freelist-fix
+++ a/mm/slub.c
@@ -2885,7 +2885,8 @@ static int build_detached_freelist(struc
 /* Note that interrupts must be enabled when calling this function. */
 void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 {
-   BUG_ON(!size);
+   if (WARN_ON(!size))
+   return;
 
do {
struct detached_freelist df;
_


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [linux-next] oops in ip_route_input_noref

2015-09-18 Thread Andrew Morton

On Thu, 17 Sep 2015 10:58:52 +0200 Thierry Reding  
wrote:

> On Wed, Sep 16, 2015 at 09:04:15AM -0600, David Ahern wrote:
> > On 9/16/15 9:00 AM, Fabio Estevam wrote:
> > >On Wed, Sep 16, 2015 at 6:24 AM, Sergey Senozhatsky
> > > wrote:
> > >
> > >>added by b7503e0cdb5dbec5d201aa69dc14679b5ae8
> > >>
> > >> net: Add FIB table id to rtable
> > >>
> > >> Add the FIB table id to rtable to make the information available for
> > >> IPv4 as it is for IPv6.
> > >
> > >I see the same issue here when booting a mx25 ARM processor via NFS.
> > >
> > >defconfig is arch/arm/configs/imx_v4_v5_defconfig.
> > >
> > 
> > I am still not able to reproduce. While I work on a full Cumulus image for
> > other test cases here's a patch to try; eagle eye Nikolay noted a potential
> > use without init in the maze of goto's.
> > 
> > Thanks,
> > David
> 
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index da427a4a33fe..80f7c5b7b832 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -1712,6 +1712,7 @@ static int ip_route_input_slow(struct sk_buff *skb, 
> > __be32 daddr, __be32 saddr,
> > goto martian_source;
> >  
> > res.fi = NULL;
> > +   res.table = NULL;
> > if (ipv4_is_lbcast(daddr) || (saddr == 0 && daddr == 0))
> > goto brd_input;
> >  
> > @@ -1834,6 +1835,7 @@ out:  return err;
> > RT_CACHE_STAT_INC(in_no_route);
> > res.type = RTN_UNREACHABLE;
> > res.fi = NULL;
> > +   res.table = NULL;
> > goto local_input;
> >  
> > /*
> 
> I was seeing the same oops as Fabio (except that the faulting address
> was 0xb instead of 0x7) and after applying this patch I no longer see
> it:
> 
> Tested-by: Thierry Reding 

I've been hitting this as well.  An oops on boot in
ip_route_input_slow(), here:

#ifdef CONFIG_IP_ROUTE_CLASSID
rth->dst.tclassid = itag;
#endif
rth->rt_is_input = 1;
if (res.table)
-->>rth->rt_table_id = res.table->tb_id;

RT_CACHE_STAT_INC(in_slow_tot);


I did this, which made it go away:

--- a/net/ipv4/route.c~a
+++ a/net/ipv4/route.c
@@ -1692,6 +1692,8 @@ static int ip_route_input_slow(struct sk
struct net*net = dev_net(dev);
bool do_cache;
 
+   res.table = 0;
+
/* IP on this device is disabled. */
 
if (!in_dev)
_

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 6/7] slub: improve bulk alloc strategy

2015-06-16 Thread Andrew Morton

On Mon, 15 Jun 2015 17:52:46 +0200 Jesper Dangaard Brouer  
wrote:

> Call slowpath __slab_alloc() from within the bulk loop, as the
> side-effect of this call likely repopulates c->freelist.
> 
> Choose to reenable local IRQs while calling slowpath.
> 
> Saving some optimizations for later.  E.g. it is possible to
> extract parts of __slab_alloc() and avoid the unnecessary and
> expensive (37 cycles) local_irq_{save,restore}.  For now, be
> happy calling __slab_alloc() this lower icache impact of this
> func and I don't have to worry about correctness.
> 
> ...
>
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2776,8 +2776,23 @@ bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t 
> flags, size_t size,
>   for (i = 0; i < size; i++) {
>   void *object = c->freelist;
>  
> - if (!object)
> - break;
> + if (unlikely(!object)) {
> + c->tid = next_tid(c->tid);
> + local_irq_enable();
> +
> + /* Invoke slow path one time, then retry fastpath
> +  * as side-effect have updated c->freelist
> +  */

That isn't very grammatical.

Block comments are formatted

/*
 * like this
 */

please.


> + p[i] = __slab_alloc(s, flags, NUMA_NO_NODE,
> + _RET_IP_, c);
> + if (unlikely(!p[i])) {
> + __kmem_cache_free_bulk(s, i, p);
> + return false;
> + }
> + local_irq_disable();
> + c = this_cpu_ptr(s->cpu_slab);
> + continue; /* goto for-loop */
> + }
>  

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/7] slub: fix error path bug in kmem_cache_alloc_bulk

2015-06-16 Thread Andrew Morton

On Mon, 15 Jun 2015 17:52:26 +0200 Jesper Dangaard Brouer  
wrote:

> The current kmem_cache/SLAB bulking API need to release all objects
> in case the layer cannot satisfy the full request.
> 
> If __kmem_cache_alloc_bulk() fails, all allocated objects in array
> should be freed, but, __kmem_cache_alloc_bulk() can't know
> about objects allocated by this slub specific kmem_cache_alloc_bulk()
> function.

Can we fold patches 2, 3 and 4 into a single patch?

And maybe patch 5 as well.  I don't think we need all these
development-time increments in the permanent record.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/7] slub bulk alloc: extract objects from the per cpu slab

2015-06-16 Thread Andrew Morton

On Mon, 15 Jun 2015 17:52:07 +0200 Jesper Dangaard Brouer  
wrote:

> From: Christoph Lameter 
> 
> [NOTICE: Already in AKPM's quilt-queue]
> 
> First piece: acceleration of retrieval of per cpu objects
> 
> If we are allocating lots of objects then it is advantageous to disable
> interrupts and avoid the this_cpu_cmpxchg() operation to get these objects
> faster.
> 
> Note that we cannot do the fast operation if debugging is enabled, because
> we would have to add extra code to do all the debugging checks.  And it
> would not be fast anyway.
> 
> Note also that the requirement of having interrupts disabled
> avoids having to do processor flag operations.
> 
> Allocate as many objects as possible in the fast way and then fall back to
> the generic implementation for the rest of the objects.
> 
> ...
>
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2759,7 +2759,32 @@ EXPORT_SYMBOL(kmem_cache_free_bulk);
>  bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
>   void **p)
>  {
> - return kmem_cache_alloc_bulk(s, flags, size, p);
> + if (!kmem_cache_debug(s)) {
> + struct kmem_cache_cpu *c;
> +
> + /* Drain objects in the per cpu slab */
> + local_irq_disable();
> + c = this_cpu_ptr(s->cpu_slab);
> +
> + while (size) {
> + void *object = c->freelist;
> +
> + if (!object)
> + break;
> +
> + c->freelist = get_freepointer(s, object);
> + *p++ = object;
> + size--;
> +
> + if (unlikely(flags & __GFP_ZERO))
> + memset(object, 0, s->object_size);
> + }
> + c->tid = next_tid(c->tid);
> +
> + local_irq_enable();

It might be worth adding

if (!size)
return true;

here.  To avoid the pointless call to __kmem_cache_alloc_bulk().

It depends on the typical success rate of this allocation loop.  Do you
know what this is?

> + }
> +
> + return __kmem_cache_alloc_bulk(s, flags, size, p);
>  }
>  EXPORT_SYMBOL(kmem_cache_alloc_bulk);

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/7] slab: infrastructure for bulk object allocation and freeing

2015-06-16 Thread Andrew Morton

On Mon, 15 Jun 2015 17:51:56 +0200 Jesper Dangaard Brouer  
wrote:

> +bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> + void **p)
> +{
> + return kmem_cache_alloc_bulk(s, flags, size, p);
> +}

hm, any call to this function is going to be nasty, brutal and short.

--- 
a/mm/slab.c~slab-infrastructure-for-bulk-object-allocation-and-freeing-v3-fix
+++ a/mm/slab.c
@@ -3425,7 +3425,7 @@ EXPORT_SYMBOL(kmem_cache_free_bulk);
 bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
void **p)
 {
-   return kmem_cache_alloc_bulk(s, flags, size, p);
+   return __kmem_cache_alloc_bulk(s, flags, size, p);
 }
 EXPORT_SYMBOL(kmem_cache_alloc_bulk);
 
_

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/4] netconsole: implement extended console support

2015-05-12 Thread Andrew Morton

On Mon, 11 May 2015 12:41:34 -0400 Tejun Heo  wrote:

> printk logbuf keeps various metadata and optional key=value dictionary
> for structured messages, both of which are stripped when messages are
> handed to regular console drivers.
> 
> It can be useful to have this metadata and dictionary available to
> netconsole consumers.  This obviously makes logging via netconsole
> more complete and the sequence number in particular is useful in
> environments where messages may be lost or reordered in transit -
> e.g. when netconsole is used to collect messages in a large cluster
> where packets may have to travel congested hops to reach the
> aggregator.  The lost and reordered messages can easily be identified
> and handled accordingly using the sequence numbers.
> 
> printk recently added extended console support which can be selected
> by setting CON_EXTENDED flag.

There's no such thing as CON_EXTENDED.  Not sure what this is trying to
say.

>  From console driver side, not much
> changes.  The only difference is that the text passed to the write
> callback is formatted the same way as /dev/kmsg.
> 
> This patch implements extended console support for netconsole which
> can be enabled by either prepending "+" to a netconsole boot param
> entry or echoing 1 to "extended" file in configfs.  When enabled,
> netconsole transmits extended log messages with headers identical to
> /dev/kmsg output.
> 
> There's one complication due to message fragments.  netconsole limits
> the maximum message size to 1k and messages longer than that are split
> into multiple fragments.  As all extended console messages should
> carry matching headers and be uniquely identifiable, each extended
> message fragment carries full copy of the metadata and an extra header
> field to identify the specific fragment.  The optional header is of
> the form "ncfrag=OFF/LEN" where OFF is the byte offset into the
> message body and LEN is the total length.
> 
> To avoid unnecessarily making printk format extended messages,
> Extended netconsole is registered with printk when the first extended
> netconsole is configured.
>
> ...
>
> +static ssize_t store_extended(struct netconsole_target *nt,
> +   const char *buf,
> +   size_t count)
> +{
> + int extended;
> + int err;
> +
> + if (nt->enabled) {
> + pr_err("target (%s) is enabled, disable to update parameters\n",
> +config_item_name(&nt->item));
> + return -EINVAL;
> + }

What's the reason for the above?

It's unclear (to me, at least ;)) what "disable" means?  Specifically
what steps must the operator take to successfully perform this
operation?  A sentence detailing those steps in netconsole.txt would be
nice.

> + err = kstrtoint(buf, 10, &extended);
> + if (err < 0)
> + return err;
> + if (extended < 0 || extended > 1)
> + return -EINVAL;
> +
> + nt->extended = extended;
> +
> + return strnlen(buf, count);
> +}
> +
>
> ...
>
> +static void send_ext_msg_udp(struct netconsole_target *nt, const char *msg,
> +  int msg_len)
> +{
> + static char buf[MAX_PRINT_CHUNK];
> + const char *header, *body;
> + int offset = 0;
> + int header_len, body_len;
> +
> + if (msg_len <= MAX_PRINT_CHUNK) {
> + netpoll_send_udp(&nt->np, msg, msg_len);
> + return;
> + }
> +
> + /* need to insert extra header fields, detect header and body */
> + header = msg;
> + body = memchr(msg, ';', msg_len);
> + if (WARN_ON_ONCE(!body))
> + return;
> +
> + header_len = body - header;
> + body_len = msg_len - header_len - 1;
> + body++;
> +
> + /*
> +  * Transfer multiple chunks with the following extra header.
> +  * "ncfrag=/"
> +  */
> + memcpy(buf, header, header_len);
> +
> + while (offset < body_len) {
> + int this_header = header_len;
> + int this_chunk;
> +
> + this_header += scnprintf(buf + this_header,
> +  sizeof(buf) - this_header,
> +  ",ncfrag=%d/%d;", offset, body_len);
> +
> + this_chunk = min(body_len - offset,
> +  MAX_PRINT_CHUNK - this_header);
> + if (WARN_ON_ONCE(this_chunk <= 0))
> + return;
> +
> + memcpy(buf + this_header, body + offset, this_chunk);
> +
> + netpoll_send_udp(&nt->np, buf, this_header + this_chunk);
> +
> + offset += this_chunk;
> + }

What protects `buf'?  console_sem, I assume?

-   static char buf[MAX_PRINT_CHUNK];
+   static char buf[MAX_PRINT_CHUNK];   /* Protected by console_sem */

wouldn't hurt.

> +}
> +
> +static void write_ext_msg(struct console *con, const char *msg,


I've forgotten what's happening with this patchset.  There were a few
design-level issues raised

Re: [PATCH 00/28] Swap over NFS -v16

2008-02-26 Thread Andrew Morton

On Tue, 26 Feb 2008 11:50:42 +0100 Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> On Tue, 2008-02-26 at 17:03 +1100, Neil Brown wrote:
> > On Saturday February 23, [EMAIL PROTECTED] wrote:
>  
> > > What is the NFS and net people's take on all of this?
> > 
> > Well I'm only vaguely an NFS person, barely a net person, sporadically
> > an mm person, but I've had a look and it seems to mostly make sense.
> 
> Thanks for taking a look, and giving such elaborate feedback. I'll try
> and address these issues asap, but first let me reply to a few points
> here.

Neil's overview of what-all-this-is and how-it-all-works is really good. 
I'd suggest that you take it over, flesh it out and attach it firmly to the
patchset.  It really helps.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Bugme-new] [Bug 10073] New: Just-small-enough packets in tunnels are silently eaten

2008-02-25 Thread Andrew Morton

On Sat, 23 Feb 2008 09:17:14 -0800 (PST) [EMAIL PROTECTED] wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=10073
> 
>Summary: Just-small-enough packets in tunnels are silently eaten
>Product: Networking
>Version: 2.5
>  KernelVersion: 2.6.23 (mainline), 2.6.25-rc2 (mainline), 2.6.18-6-amd64
> (Debian
>   Platform: All
> OS/Version: Linux
>   Tree: Mainline
> Status: NEW
>   Severity: normal
>   Priority: P1
>  Component: IPV6
> AssignedTo: [EMAIL PROTECTED]
> ReportedBy: [EMAIL PROTECTED]
> 
> 
> Hi,
> 
> This has been broken for quite a while, but I haven't gotten around to debug 
> it
> until now.
> 
> I have an IPv6-in-IPv4 tunnel between two points, both with MTU 1500 on the
> regular v4 network. (I've verified that I can indeed send 1500-byte packets 
> and
> fragments over IPv4 from the two points.) By default, Linux assigns MTU 1480 
> to
> this tunnel.
> 
> However, if I try to ping -s 5000 from one side of the tunnel to the other, I
> get
> first "Packet too big, mtu=1480" and then on the next packet (when the machine
> tries to send 1480-byte fragments) "Packet too big, mtu=1472". After that, the
> packet goes through.
> 
> However, in some cases it seems I do not seem to get the "Packet too big" ICMP
> at all. In particular, if I change to a GRE tunnel (where the default MTU is
> 1476), and send in 1476-byte packets, they are just eaten. They clearly go 
> into
> the GRE tunnel (according to tcpdump), but no IPv4 packets ever go out on the
> other side, and no ICMPs are sent back. (There's no iptables rules on the
> router in question, nor any relevant sysctl settings except that IPv6
> forwarding is of course turned on.) If I lower MTU on the interfaces to 1468,
> everything seems to work just fine. (I cannot change the MTU of a regular ipip
> tunnel, so it's impossible for me to check whether a lower MTU would have 
> fixed
> the issues for those as well, but it seems reasonable.)
> 
> Any idea where the extra eight bytes go? Is there some inner layer of
> encapsulation in the kernel (adding eight bytes internally) that I've missed?
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: printk_ratelimit and net_ratelimit conflict and tunable behavior

2008-02-25 Thread Andrew Morton

On Mon, 25 Feb 2008 14:36:40 -0600 Steven Hawkes <[EMAIL PROTECTED]> wrote:

> From: Steve Hawkes <[EMAIL PROTECTED]>
> 
> The printk_ratelimit() and net_ratelimit() functions each have their own
> tunable parameters to control their respective rate limiting feature, but
> they share common state variables, preventing independent tuning of the
> parameters from working correctly. Also, changes to rate limiting tunable
> parameters do not always take effect properly since state is not recomputed
> when changes occur. For example, if ratelimit_burst is increased while rate
> limiting is occurring, the change won't take full effect until at least
> enough time between messages occurs so that the toks value reaches
> ratelimit_burst * ratelimit_jiffies. This can result in messages being
> suppressed when they should be allowed.
> 
> Implement independent state for printk_ratelimit() and net_ratelimit(), and
> update state when tunables are changed.
> 

This patch causes a large and nasty reject.

> ---
> --- linux-2.6.24/include/linux/kernel.h   2008-01-24 16:58:37.0 
> -0600
> +++ linux-2.6.24-printk_ratelimit/include/linux/kernel.h  2008-02-21 
> 11:20:41.751197312 -0600

Probably because you patched 2.6.24.  We're developing 2.6.25 now, and the
difference between the two is very large inded.  Please raise patches
against Linus's latest tree?

There are other patches pending against printk.c (in -mm and in git-sched)
but afacit they won't collide.

> @@ -196,8 +196,19 @@ static inline int log_buf_copy(char *des
>  
>  unsigned long int_sqrt(unsigned long);
>  
> +struct printk_ratelimit_state
> +{

Please do

struct printk_ratelimit_state {

> + unsigned long toks;
> + unsigned long last_jiffies;
> + int missed;
> + int limit_jiffies;
> + int limit_burst;
> + char const *facility;
> +};

I find that the best-value comments one can add to kernel code are to the
members of structures.  If the reader understands what all the fields do, the
code becomes simple to follow.

> --- linux-2.6.24/net/core/utils.c 2008-01-24 16:58:37.0 -0600
> +++ linux-2.6.24-printk_ratelimit/net/core/utils.c2008-02-21 
> 11:03:44.644337698 -0600
> @@ -41,7 +41,16 @@ EXPORT_SYMBOL(net_msg_warn);
>   */
>  int net_ratelimit(void)
>  {
> - return __printk_ratelimit(net_msg_cost, net_msg_burst);
> + static struct printk_ratelimit_state limit_state = {
> + .toks  = 10 * 5 * HZ,
> + .last_jiffies  = 0,
> + .missed= 0,
> + .limit_jiffies = 5 * HZ,
> + .limit_burst   = 10,
> + .facility  = "net"
> + };
> +
> + return __printk_ratelimit(net_msg_cost, net_msg_burst, &limit_state);

I don't get it.  There's one instance of limit_state, kernel-wide, and
__printk_ratelimit() modifies it.  What prevents one CPU's activities from
interfering with a second CPU's activities?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Bugme-new] [Bug 10071] New: kernel hang in inet_init

2008-02-25 Thread Andrew Morton

On Sat, 23 Feb 2008 00:34:06 -0800 (PST) [EMAIL PROTECTED] wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=10071
> 
>Summary: kernel hang in inet_init
>Product: Networking
>Version: 2.5
>  KernelVersion: 2.6.25 rc2 latest git
>   Platform: All
> OS/Version: Linux
>   Tree: Mainline
> Status: NEW
>   Severity: normal
>   Priority: P1
>  Component: IPV4
> AssignedTo: [EMAIL PROTECTED]
> ReportedBy: [EMAIL PROTECTED]
> 
> 
> Distribution:ubuntu hardy
> Hardware Environment:dell dimension 5150
> Problem Description:
> kernel hang mostly but boot slowly sometimes
> 
> Steps to reproduce:
> boot
> 
> when the computer manage to boot, inet_init last some time:
> [1.725306] NET: Registered protocol family 2
> [6.825384] IP route cache hash table entries: 32768 (order: 5, 131072
> bytes)
> [6.825803] TCP established hash table entries: 131072 (order: 8, 1048576
> bytes)
> [6.826691] TCP bind hash table entries: 65536 (order: 9, 2359296 bytes)
> [6.836160] TCP: Hash tables configured (established 131072 bind 65536)
> [6.836255] TCP reno registered
> [7.081569] initcall 0xc03b8920: inet_init+0x0/0x3aa() returned 0.
> [7.081569] initcall 0xc03b8920 ran for 5106 msecs: inet_init+0x0/0x3aa()
> 
> When booting with acpi=ht it works:
> [0.124668] Calling initcall 0xc03b8920: inet_init+0x0/0x3aa()
> [0.124867] NET: Registered protocol family 2
> [0.288067] IP route cache hash table entries: 32768 (order: 5, 131072
> bytes)
> [0.288856] TCP established hash table entries: 131072 (order: 8, 1048576
> bytes)
> [0.289739] TCP bind hash table entries: 65536 (order: 9, 2359296 bytes)
> [0.299156] TCP: Hash tables configured (established 131072 bind 65536)
> [0.299250] TCP reno registered
> Feb 23 09:10:28 tiyann kernel: [0.162267] initcall 0xc03b8920:
> inet_init+0x0/0x3aa() returned 0.
> [0.162383] initcall 0xc03b8920 ran for 33 msecs: inet_init+0x0/0x3aa()
> 
> there seem to be a weird interaction between acpi and inet_init
> 
> any hint on how to provide better information
> thanks
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Bugme-new] [Bug 10097] New: SMP BUG in __nf_conntrack_find

2008-02-25 Thread Andrew Morton

On Mon, 25 Feb 2008 10:44:08 -0800 (PST) [EMAIL PROTECTED] wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=10097
> 
>Summary: SMP BUG in __nf_conntrack_find
>Product: Networking
>Version: 2.5
>  KernelVersion: 2.6.25-rc3
>   Platform: All
> OS/Version: Linux
>   Tree: Mainline
> Status: NEW
>   Severity: normal
>   Priority: P1
>  Component: Netfilter/Iptables
> AssignedTo: [EMAIL PROTECTED]
> ReportedBy: [EMAIL PROTECTED]
> 
> 
> Latest working kernel version: 2.6.24.2
> Earliest failing kernel version: 2.6.24-rc3 (not checked before)
> Distribution: Bluewhite64
> Hardware Environment: Athlon X2 4200
> 
> Software Environment:
> samba 3.0, 2.6.25-rc3 kernel + HR + tickless + kernel SMP debugging
> 
> Problem Description:
> The Samba smbd daemon triggers regularly the following BUG with 2.6.25-rc3:
> 
> BUG: using smp_processor_id() in preemptible [] code: nmbd/3167
> caller is __nf_conntrack_find+0x119/0x150
> Pid: 3167, comm: nmbd Not tainted 2.6.25-rc3 #1
> 
> Call Trace:
>  [] debug_smp_processor_id+0xc4/0xd0
>  [] __nf_conntrack_find+0x119/0x150
>  [] nf_conntrack_find_get+0x19/0x80
>  [] nf_conntrack_in+0x1a4/0x5a0
>  [] ? restore_args+0x0/0x30
>  [] ipv4_conntrack_local+0x66/0x70
>  [] nf_iterate+0x62/0xa0
>  [] ? dst_output+0x0/0x10
>  [] nf_hook_slow+0x66/0xe0
>  [] ? dst_output+0x0/0x10
>  [] __ip_local_out+0xa5/0xb0
>  [] ip_local_out+0x11/0x30
>  [] ip_push_pending_frames+0x261/0x3e0
>  [] udp_push_pending_frames+0x233/0x3d0
>  [] udp_sendmsg+0x30f/0x710
>  [] ? default_wake_function+0x0/0x10
>  [] inet_sendmsg+0x45/0x80
>  [] sock_sendmsg+0xdf/0x110
>  [] ? autoremove_wake_function+0x0/0x40
>  [] ? hrtick_resched+0x77/0x90
>  [] ? trace_hardirqs_on+0xd5/0x160
>  [] ? sockfd_lookup_light+0x45/0x80
>  [] sys_sendto+0xea/0x120
>  [] ? _spin_unlock_irq+0x2b/0x60
>  [] ? trace_hardirqs_on+0xd5/0x160
>  [] ? _spin_unlock_irq+0x36/0x60
>  [] system_call_after_swapgs+0x7b/0x80
> 
> Steps to reproduce:
> Start smbd with the forementionned kernel instrumented for SMP and kernel
> debugging and hr + tickless enabled.
> 

Presumably this is in NF_CT_STAT_INC().  I wonder what caused it to start
happening.

Guys, this probably means that the developers who tested this change aren't
enabling the debug options which all kernel developers _should_ be enabling
when "testing" their code!  Documentation/SubmitChecklist has a handy list.

Should NF_CT_STAT_INC() be using local_inc()?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Bugme-new] [Bug 10063] New: Network problems with 2.6.23+

2008-02-24 Thread Andrew Morton


(plese respond via emailed reply-to-all, not via the bugzilla web interface)

On Thu, 21 Feb 2008 18:04:45 -0800 (PST) [EMAIL PROTECTED] wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=10063

There's more info (including a tcpdump) in bugzilla.

>Summary: Network problems with 2.6.23+
>Product: Networking
>Version: 2.5
>  KernelVersion: 2.6.24.2
>   Platform: All
> OS/Version: Linux
>   Tree: Mainline
> Status: NEW
>   Severity: high
>   Priority: P1
>  Component: IPV4
> AssignedTo: [EMAIL PROTECTED]
> ReportedBy: [EMAIL PROTECTED]
> 
> 
> Latest working kernel version: 2.6.23
> Earliest failing kernel version: 2.6.24
> Distribution: Gentoo
> Hardware Environment: 
> 
> model name  : Dual Core AMD Opteron(tm) Processor 170
> stepping: 2
> cpu MHz : 2009.144
> 
> 02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053 PCI-E
> Gigabit Ethernet Controller (rev 15) sky2
> 
> Working on 2.6.24 (as far as I can see) in :
> 
> model name  : Intel(R) Pentium(R) 4 CPU 2.80GHz
> stepping: 9
> cpu MHz : 2793.222
> 01:01.0 Ethernet controller: Intel Corporation 82547GI Gigabit Ethernet
> Controller
> 
> Software Environment: TCP-transfers (FTP,HTTP)
> Problem Description:
> 
> First I want to apologize that I'm not able to do alot of troubleshooting on
> this issue cause I don't know where to start exactly.
> 
> The problem is that I have, after upgrading to 2.6.24 from 2.6.23, experienced
> alot of "weird" network problems. For example, FTP-transfers to certain hosts
> on the Internet stalls after a while. I have one dst-host where this problem
> appears frequently. On another dst-host it doesn't happen as often. Sometimes
> the transfers doesn't even start. 
> 
> First we did alot of faultracing on the transmission but after downgrading the
> kernel it was apparent that the transmission wasn't the problem. 
> 
> I did not think about filing this report until i saw,
> http://kerneltrap.org/node/15550. So I guess I'm not alone.
> 
> I must also add that to some dst-hosts there is no problem at all. So I would
> guess that there might be something with the mtu,wscale or something between
> hosts. 
> 
> All the other dst-hosts I've tried is running 2.6.23
> 
> I will try to test alot of kernel-version to try to ease the troubleshooting. 
> 
> 
> Steps to reproduce:
> 
> Start an FTP-transfer from the box mention above.

I guess it might help if you can tell us the IP address of some of the
problematic servers.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Nework Emulation with UML, bug 8895

2008-02-24 Thread Andrew Morton

On Sat, 23 Feb 2008 14:04:18 +0100 clowncoder <[EMAIL PROTECTED]> wrote:

> Hello,
> You can have a configured and running network inside a single linux machine,
> only one script command is enough. After the start of all the machine, a
> graphical representation of your topology helps your interactions with
> the network. This virtual network can be downloaded at http://clownix.net,
> I must warn you, it is a 350mega bz2 file-system that needs 4 giga on 
> the host.
> Very usefull for networking development in kernel as proved by bug 8895
> that has been found with the above tool:
> 
> The uncorrected bug 8895:
> 
> file:  /usr/src/linux-2.6.24.2/net/ipv6/ip6_fib.c  line  796:   
> "dst_free(&rt->u.dst);"
> 
> --- /Comment From qmiao  2007-08-29 
> 23:33:07 /---
> 
> fib6_add
> ...
> 
> if (fn->leaf == NULL) {
> fn->leaf = rt;<--**-- rt is assigned to fn->leaf
> atomic_inc(&rt->rt6i_ref);
> }
> ...
> err = fib6_add_rt2node(fn, rt, info); <-**- return -EEXIST
> ...
> (Here err was not null)
> ...
> if (err) {
> ...
> dst_free(&rt->u.dst); <--**-- Actually rt is still in
> tree (fn->leaf = rt /* see above */)
> ...
> }
> 

Please cc netdev on net-related bug reports.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/8]: uninline & uninline

2008-02-23 Thread Andrew Morton

On Sat, 23 Feb 2008 14:15:06 +0100 Andi Kleen <[EMAIL PROTECTED]> wrote:

> Andrew Morton <[EMAIL PROTECTED]> writes:
> 
> 
> >> -41525  2066 f, 3370 +, 44895 -, diff: -41525  IS_ERR 
> >
> > This is a surprise.  I expect that the -mm-only
> > profile-likely-unlikely-macros.patch is the cause of this and mainline
> > doesn't have this problem.
> 
> Shouldn't they only have overhead when the respective CONFIG is enabled?

yup.

> > If true, then this likely/unlikely bloat has probably spread into a lot of
> > your other results and it all should be redone against mainline, sorry :(
> >
> > (I'm not aware of anyone having used profile-likely-unlikely-macros.patch
> > in quite some time.  That's unfortunate because it has turned up some
> > fairly flagrant code deoptimisations)
> 
> Is there any reason they couldn't just be merged to mainline? 
> 
> I think it's a useful facility.

ummm, now why did we made that decision...  I think we decided that it's
the sort of thing which one person can run once per few months and that
will deliver its full value.  I can maintain it in -mm and we're happy - no
need to add it to mainline.  No strong feelings either way really.

It does have the downside that the kernel explodes if someone adds unlikely
or likely to the vdso code and I need to occasionally hunt down new
additions and revert them in that patch.  That makes it a bit of a
maintenance burden.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 8/8] Jhash in too big for inlining, move under lib/

2008-02-23 Thread Andrew Morton

On Sat, 23 Feb 2008 12:05:36 +0200 (EET) "Ilpo Järvinen" <[EMAIL PROTECTED]> 
wrote:

> On Sat, 23 Feb 2008, Andrew Morton wrote:
> 
> > On Wed, 20 Feb 2008 15:47:18 +0200 "Ilpo Järvinen" <[EMAIL PROTECTED]> 
> > wrote:
> > 
> > > vmlinux.o:
> > >  62 functions changed, 66 bytes added, 10935 bytes removed, diff: -10869
> > > 
> > > ...+ these to lib/jhash.o:
> > >  jhash_3words: 112
> > >  jhash2: 276
> > >  jhash: 475
> > > 
> > > select for networking code might need a more fine-grained approach.
> > 
> > It should be possible to use a modular jhash.ko.  The things which you
> > have identified as clients of the jhash library are usually loaded as 
> > modules.
> > But in the case where someone does (say) NFSD=y we do need jhash.o linked 
> > into
> > vmlinux also. This is doable in Kconfig but I always forget how.
> 
> Ok, even though its not that likely that one lives without e.g.
> net/ipv4/inet_connection_sock.c or net/netlink/af_netlink.c?

Sure, the number of people who will want CONFIG_JHASH=n is very small.

> But maybe 
> some guys "really know what they are doing" and can come up with config 
> that would be able to build it as module (for other than proof-of-concept 
> uses I mean)... :-/
> 
> > Adrian, Sam and Randy are the repositories of knowledge here ;)
> 
> Thanks, I'll consult them in this. I've never needed to do any Kconfig 
> stuff so far so it's no surprise I have very little clue... :-)

Thanks.  If it gets messy I'd just put lib/jhash.o in obj-y.  I assume it's
pretty small.

> I've one question for you Andrew, how would you like this kind of 
> cross-subsys toucher to be merged, through you directly I suppose?

Sure, I can scrounge around for appropriate reviews and acks and can make
sure that nobody's pending work gets disrupted.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] alloc_percpu() fails to allocate percpu data

2008-02-23 Thread Andrew Morton

On Thu, 21 Feb 2008 19:00:03 +0100 Eric Dumazet <[EMAIL PROTECTED]> wrote:

> +#ifndef cache_line_size
> +#define cache_line_size()L1_CACHE_BYTES
> +#endif

argh, you made me look.

Really cache_line_size() should be implemented in include/linux/cache.h. 
Then we tromp the stupid private implementations in slob.c and slub.c.

Then we wonder why x86 uses a custom cache_line_size(), but still uses
L1_CACHE_BYTES for its L1_CACHE_ALIGN().

Once we've answered that, we look at your

+   /*
+* We should make sure each CPU gets private memory.
+*/
+   size = roundup(size, cache_line_size());

and wonder whether it should have used L1_CACHE_ALIGN().

I think I'd better stop looking.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 17/28] netvm: hook skb allocation to reserves

2008-02-23 Thread Andrew Morton

On Wed, 20 Feb 2008 15:46:27 +0100 Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> Change the skb allocation api to indicate RX usage and use this to fall back 
> to
> the reserve when needed. SKBs allocated from the reserve are tagged in
> skb->emergency.
> 
> Teach all other skb ops about emergency skbs and the reserve accounting.
> 
> Use the (new) packet split API to allocate and track fragment pages from the
> emergency reserve. Do this using an atomic counter in page->index. This is
> needed because the fragments have a different sharing semantic than that
> indicated by skb_shinfo()->dataref. 
> 
> Note that the decision to distinguish between regular and emergency SKBs 
> allows
> the accounting overhead to be limited to the later kind.
> 
> ...
>
> +static inline void skb_get_page(struct sk_buff *skb, struct page *page)
> +{
> + get_page(page);
> + if (skb_emergency(skb))
> + atomic_inc(&page->frag_count);
> +}
> +
> +static inline void skb_put_page(struct sk_buff *skb, struct page *page)
> +{
> + if (skb_emergency(skb) && atomic_dec_and_test(&page->frag_count))
> + rx_emergency_put(PAGE_SIZE);
> + put_page(page);
> +}

I'm thinking we should do `#define slowcall inline' then use that in the future.

>  static void skb_release_data(struct sk_buff *skb)
>  {
>   if (!skb->cloned ||
>   !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
>  &skb_shinfo(skb)->dataref)) {
> + int size;
> +
> +#ifdef NET_SKBUFF_DATA_USES_OFFSET
> + size = skb->end;
> +#else
> + size = skb->end - skb->head;
> +#endif

The patch adds rather a lot of ifdefs.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 10/28] mm: memory reserve management

2008-02-23 Thread Andrew Morton

On Wed, 20 Feb 2008 15:46:20 +0100 Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> Generic reserve management code. 
> 
> It provides methods to reserve and charge. Upon this, generic alloc/free style
> reserve pools could be build, which could fully replace mempool_t
> functionality.
> 
> It should also allow for a Banker's algorithm replacement of __GFP_NOFAIL.

Generally: the comments in this code are a bit straggly and hard to follow.
They'd be worth a revisit.

> +/*
> + * Simple output of the reserve tree in: /proc/reserve_info
> + * Example:
> + *
> + * localhost ~ # cat /proc/reserve_info
> + * total reserve  8156K (0/544817)
> + *   total network reserve  8156K (0/544817)
> + * network TX reserve 196K (0/49)
> + *   protocol TX pages  196K (0/49)
> + * network RX reserve 7960K (0/544768)
> + *   IPv6 route cache   1372K (0/4096)
> + *   IPv4 route cache   5468K (0/16384)
> + *   SKB data reserve   1120K (0/524288)
> + * IPv6 fragment cache560K (0/262144)
> + * IPv4 fragment cache560K (0/262144)
> + */

Well, "Simple" was a freudian typo.  Not designed for programmatic parsing,
I see.

> +static __init int mem_reserve_proc_init(void)
> +{
> + struct proc_dir_entry *entry;
> +
> + entry = create_proc_entry("reserve_info", S_IRUSR, NULL);

I think we're supposed to use proc_create().  Blame Alexey.

> + if (entry)
> + entry->proc_fops = &mem_reserve_opterations;
> +
> + return 0;
> +}
> +
> +__initcall(mem_reserve_proc_init);

module_init() is more trendy.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 08/28] mm: system wide ALLOC_NO_WATERMARK

2008-02-23 Thread Andrew Morton

On Wed, 20 Feb 2008 15:46:18 +0100 Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> Change ALLOC_NO_WATERMARK page allocation such that the reserves are system
> wide - which they are per setup_per_zone_pages_min(), when we scrape the
> barrel, do it properly.
> 

The changelog is fairly incomprehensible.

>  mm/page_alloc.c |6 ++
>  1 file changed, 6 insertions(+)
> 
> Index: linux-2.6/mm/page_alloc.c
> ===
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -1552,6 +1552,12 @@ restart:
>  rebalance:
>   if (alloc_flags & ALLOC_NO_WATERMARKS) {
>  nofail_alloc:
> + /*
> +  * break out of mempolicy boundaries
> +  */
> + zonelist = NODE_DATA(numa_node_id())->node_zonelists +
> + gfp_zone(gfp_mask);
> +
>   /* go through the zonelist yet again, ignoring mins */
>   page = get_page_from_freelist(gfp_mask, order, zonelist,
>   ALLOC_NO_WATERMARKS);

As is the patch.  People who care about mempolicies will want a better
explanation, please, so they can check that we're not busting their stuff.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 15/28] netvm: network reserve infrastructure

2008-02-23 Thread Andrew Morton

On Wed, 20 Feb 2008 15:46:25 +0100 Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> Provide the basic infrastructure to reserve and charge/account network memory.
> 
> We provide the following reserve tree:
> 
> 1)  total network reserve
> 2)network TX reserve
> 3)  protocol TX pages
> 4)network RX reserve
> 5)  SKB data reserve
> 
> [1] is used to make all the network reserves a single subtree, for easy
> manipulation.
> 
> [2] and [4] are merely for eastetic reasons.
> 
> The TX pages reserve [3] is assumed bounded by it being the upper bound of
> memory that can be used for sending pages (not quite true, but good enough)
> 
> The SKB reserve [5] is an aggregate reserve, which is used to charge SKB data
> against in the fallback path.
> 
> The consumers for these reserves are sockets marked with:
>   SOCK_MEMALLOC
> 
> Such sockets are to be used to service the VM (iow. to swap over). They
> must be handled kernel side, exposing such a socket to user-space is a BUG.
> 
> +/**
> + *   sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
> + *   @socks: number of new %SOCK_MEMALLOC sockets
> + *   @tx_resserve_pages: number of pages to (un)reserve for TX
> + *
> + *   This function adjusts the memalloc reserve based on system demand.
> + *   The RX reserve is a limit, and only added once, not for each socket.
> + *
> + *   NOTE:
> + *  @tx_reserve_pages is an upper-bound of memory used for TX hence
> + *  we need not account the pages like we do for RX pages.
> + */
> +int sk_adjust_memalloc(int socks, long tx_reserve_pages)
> +{
> + int nr_socks;
> + int err;
> +
> + err = mem_reserve_pages_add(&net_tx_pages, tx_reserve_pages);
> + if (err)
> + return err;
> +
> + nr_socks = atomic_read(&memalloc_socks);
> + if (!nr_socks && socks > 0)
> + err = mem_reserve_connect(&net_reserve, &mem_reserve_root);

This looks like it should have some locking?

> + nr_socks = atomic_add_return(socks, &memalloc_socks);
> + if (!nr_socks && socks)
> + err = mem_reserve_disconnect(&net_reserve);

Or does that try to make up for it?  Still looks fishy.

> + if (err)
> + mem_reserve_pages_add(&net_tx_pages, -tx_reserve_pages);
> +
> + return err;
> +}
> +
> +/**
> + *   sk_set_memalloc - sets %SOCK_MEMALLOC
> + *   @sk: socket to set it on
> + *
> + *   Set %SOCK_MEMALLOC on a socket and increase the memalloc reserve
> + *   accordingly.
> + */
> +int sk_set_memalloc(struct sock *sk)
> +{
> + int set = sock_flag(sk, SOCK_MEMALLOC);
> +#ifndef CONFIG_NETVM
> + BUG();
> +#endif

??  #error, maybe?

> + if (!set) {
> + int err = sk_adjust_memalloc(1, 0);
> + if (err)
> + return err;
> +
> + sock_set_flag(sk, SOCK_MEMALLOC);
> + sk->sk_allocation |= __GFP_MEMALLOC;
> + }
> + return !set;
> +}
> +EXPORT_SYMBOL_GPL(sk_set_memalloc);

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 00/28] Swap over NFS -v16

2008-02-23 Thread Andrew Morton

On Wed, 20 Feb 2008 15:46:10 +0100 Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> Another posting of the full swap over NFS series. 

Well I looked.  There's rather a lot of it and I wouldn't pretend to
understand it.

What is the NFS and net people's take on all of this?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 04/28] mm: kmem_estimate_pages()

2008-02-23 Thread Andrew Morton

On Wed, 20 Feb 2008 15:46:14 +0100 Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> Provide a method to get the upper bound on the pages needed to allocate
> a given number of objects from a given kmem_cache.
> 
> This lays the foundation for a generic reserve framework as presented in
> a later patch in this series. This framework needs to convert object demand
> (kmalloc() bytes, kmem_cache_alloc() objects) to pages.
> 
> ...
>
>  /*
> + * return the max number of pages required to allocated count
> + * objects from the given cache
> + */
> +unsigned kmem_estimate_pages(struct kmem_cache *s, gfp_t flags, int objects)

You might want to have another go at that comment.

> +/*
> + * return the max number of pages required to allocate @bytes from kmalloc
> + * in an unspecified number of allocation of heterogeneous size.
> + */
> +unsigned kestimate(gfp_t flags, size_t bytes)

And its pal.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 07/28] mm: emergency pool

2008-02-23 Thread Andrew Morton

On Wed, 20 Feb 2008 15:46:17 +0100 Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> @@ -213,7 +213,7 @@ enum zone_type {
>  
>  struct zone {
>   /* Fields commonly accessed by the page allocator */
> - unsigned long   pages_min, pages_low, pages_high;
> + unsigned long   pages_emerg, pages_min, pages_low, pages_high;

It would be nice to make these one-per-line, then document them.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 05/28] mm: allow PF_MEMALLOC from softirq context

2008-02-23 Thread Andrew Morton

On Wed, 20 Feb 2008 15:46:15 +0100 Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> Allow PF_MEMALLOC to be set in softirq context. When running softirqs from
> a borrowed context save current->flags, ksoftirqd will have its own 
> task_struct.

The second sentence doesn't make sense.

> This is needed to allow network softirq packet processing to make use of
> PF_MEMALLOC.
>
> ...
>
> +#define tsk_restore_flags(p, pflags, mask) \
> + do {(p)->flags &= ~(mask); \
> + (p)->flags |= ((pflags) & (mask)); } while (0)
> +

Does it need to be a macro?

If so, it really should cook up a temporary to avoid referencing p twice -
the children might be watching.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 09/28] mm: __GFP_MEMALLOC

2008-02-23 Thread Andrew Morton

On Wed, 20 Feb 2008 15:46:19 +0100 Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> __GFP_MEMALLOC will allow the allocation to disregard the watermarks, 
> much like PF_MEMALLOC.
> 

'twould be nice if the changelog had some explanation of the reason
for this change.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: WARNING: at net/ipv4/tcp_input.c:2054 tcp_mark_head_lost()

2008-02-23 Thread Andrew Morton


(cc netdev)

On Wed, 20 Feb 2008 20:04:39 -0800 (PST) Giangiacomo Mariotti <[EMAIL 
PROTECTED]> wrote:

> This is what I got with dmesg :
> 
> [  266.978695] WARNING: at net/ipv4/tcp_input.c:2054 tcp_mark_head_lost()
> [  266.978701] Pid: 0, comm: swapper Not tainted 2.6.24.2-my001 #1
> [  266.978703] 
> [  266.978704] Call Trace:
> [  266.978706][] tcp_ack+0x16d8/0x197f
> [  266.978721]  [] __wake_up+0x38/0x4e
> [  266.978727]  [] tcp_rcv_established+0xe2/0x8cb
> [  266.978732]  [] tcp_v4_do_rcv+0x30/0x39c
> [  266.978738]  [] tcp_v4_rcv+0x99b/0xa06
> [  266.978743]  [] __netdev_alloc_skb+0x29/0x43
> [  266.978749]  [] ip_local_deliver_finish+0x152/0x212
> [  266.978753]  [] ip_rcv_finish+0x2f8/0x31b
> [  266.978758]  [] netif_receive_skb+0x3ae/0x3d1
> [  266.978763]  [] rtl8169_rx_interrupt+0x45f/0x53e
> [  266.978768]  [] rtl8169_poll+0x36/0x16a
> [  266.978773]  [] net_rx_action+0xb7/0x1f3
> [  266.978778]  [] __do_softirq+0x65/0xce
> [  266.978782]  [] default_idle+0x0/0x3d
> [  266.978786]  [] call_softirq+0x1c/0x28
> [  266.978789]  [] do_softirq+0x2c/0x7d
> [  266.978792]  [] irq_exit+0x3f/0x84
> [  266.978794]  [] do_IRQ+0xb6/0xd5
> [  266.978797]  [] default_idle+0x0/0x3d
> [  266.978800]  [] ret_from_intr+0x0/0xa
> [  266.978801][] default_idle+0x29/0x3d
> [  266.978809]  [] cpu_idle+0x93/0xbb
> [  266.978813]  [] start_kernel+0x2bb/0x2c7
> [  266.978818]  [] _sinittext+0x123/0x12a
> [  266.978821] 
> 
> This though didn't cause any user-visible problem.
> 
> .config file :
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 8/8] Jhash in too big for inlining, move under lib/

2008-02-23 Thread Andrew Morton

On Wed, 20 Feb 2008 15:47:18 +0200 "Ilpo Järvinen" <[EMAIL PROTECTED]> wrote:

> vmlinux.o:
>  62 functions changed, 66 bytes added, 10935 bytes removed, diff: -10869
> 
> ...+ these to lib/jhash.o:
>  jhash_3words: 112
>  jhash2: 276
>  jhash: 475
> 
> select for networking code might need a more fine-grained approach.

It should be possible to use a modular jhash.ko.  The things which you
have identified as clients of the jhash library are usually loaded as modules.

But in the case where someone does (say) NFSD=y we do need jhash.o linked into
vmlinux also.  This is doable in Kconfig but I always forget how.  Adrian, Sam
and Randy are the repositories of knowledge here ;)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/8]: uninline & uninline

2008-02-23 Thread Andrew Morton

On Wed, 20 Feb 2008 15:47:10 +0200 "Ilpo J__rvinen" <[EMAIL PROTECTED]> wrote:

> Ok, here's the top of the list (1+ bytes):

This is good stuff - thanks.

> -41525  2066 f, 3370 +, 44895 -, diff: -41525  IS_ERR 

This is a surprise.  I expect that the -mm-only
profile-likely-unlikely-macros.patch is the cause of this and mainline
doesn't have this problem.

If true, then this likely/unlikely bloat has probably spread into a lot of
your other results and it all should be redone against mainline, sorry :(

(I'm not aware of anyone having used profile-likely-unlikely-macros.patch
in quite some time.  That's unfortunate because it has turned up some
fairly flagrant code deoptimisations)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Bugme-new] [Bug 10061] New: Hang in md5_resync

2008-02-21 Thread Andrew Morton


(please respond via emailed reply-to-all, not via the bugzilla web
interface, thanks)

On Thu, 21 Feb 2008 13:13:13 -0800 (PST) [EMAIL PROTECTED] wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=10061
> 
>Summary: Hang in md5_resync
>Product: IO/Storage
>Version: 2.5
>  KernelVersion: 2.6.25-rc2
>   Platform: All
> OS/Version: Linux
>   Tree: Mainline
> Status: NEW
>   Severity: normal
>   Priority: P1
>  Component: MD
> AssignedTo: [EMAIL PROTECTED]
> ReportedBy: [EMAIL PROTECTED]
> 
> 
> Latest working kernel version: 2.6.23
> Earliest failing kernel version: 2.6.25-rc2
> Distribution: Debian unstable
> Hardware Environment: sata_nv, ext3, Seagate SATA disks 
> Software Environment: N/A
> Problem Description: After upgrading to 2.6.25, I did some light I/O 
> (basically
> a tab-complete in my home directory), and the whole machine promptly hung.
> After a while, the serial console spat out:
> 
> [  548.684064] BUG: soft lockup - CPU#1 stuck for]
> [  548.684065] CPU 1:
> [  548.684065] Modules linked in: tun eeprom ide_generic ide_disk ide_cd_mod
> cdx
> [  548.684065] Pid: 3599, comm: md4_resync Not tainted 2.6.25-rc2 #1
> [  548.684065] RIP: 0010:[]  []
> __write_loc0
> [  548.684065] RSP: 0018:81011fcffc18  EFLAGS: 0287
> [  548.684065] RAX: 81011c14dfd8 RBX: 8100d42d30c0 RCX:
> 8100d42d30c0
> [  548.684065] RDX:  RSI: 8100d4cd0d90 RDI:
> 8100d4cd0d00
> [  548.684065] RBP: 81011fcffb90 R08: fe538a93 R09:
> 
> [  548.684065] R10: 00c0 R11: 803ad3e1 R12:
> 8020bcb6
> [  548.684065] R13: 81011fcffb90 R14: 8100d4cd0d00 R15:
> 
> [  548.684065] FS:  7f8748e0c6e0() GS:81011fcb5dc0()
> knlGS:0
> [  548.684065] CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
> [  548.684066] CR2: 7f8748e076f3 CR3: d296b000 CR4:
> 06e0
> [  548.684066] DR0:  DR1:  DR2:
> 
> [  548.684066] DR3:  DR6: 0ff0 DR7:
> 0400
> [  548.684066]
> [  548.684066] Call Trace:
> [  548.684066][] ? _write_lock_bh+0x1a/0x1c
> [  548.684066]  [] ? neigh_resolve_output+0xbd/0x281
> [  548.684066]  [] ? ip6_output+0xc5c/0xc71
> [  548.684066]  [] ? sock_alloc_send_skb+0x93/0x1e1
> [  548.684066]  [] ? __ndisc_send+0x3c6/0x4fb
> [  548.684066]  [] ? :sd_mod:sd_prep_fn+0x74/0x6fb
> [  548.684066]  [] ? ndisc_send_ns+0x71/0x7e
> [  548.684066]  [] ? scsi_request_fn+0x339/0x3a5
> [  548.684066]  [] ? ndisc_solicit+0x159/0x166
> [  548.684066]  [] ? __mod_timer+0xb0/0xbf
> [  548.684066]  [] ? neigh_timer_handler+0x292/0x2d0
> [  548.684066]  [] ? neigh_timer_handler+0x0/0x2d0
> [  548.684066]  [] ? run_timer_softirq+0x151/0x1bf
> [  548.684066]  [] ? __do_softirq+0x65/0xcf
> [  548.684066]  [] ? call_softirq+0x1c/0x28
> [  548.684066]  [] ? do_softirq+0x2c/0x68
> [  548.684066]  [] ? smp_apic_timer_interrupt+0x8a/0xa5
> [  548.684066]  [] ? apic_timer_interrupt+0x66/0x70
> [  548.684066][] ?
> :raid456:raid5_compute_sector+0x19
> [  548.684066]  [] ? :raid456:stripe_to_pdidx+0x48/0x51
> [  548.684066]  [] ? :raid456:sync_request+0x7d3/0x8a1
> [  548.684066]  [] ? generic_unplug_device+0x18/0x24
> [  548.684066]  [] ? blk_unplug+0x5a/0x60
> [  548.684066]  [] ? :raid456:unplug_slaves+0x55/0x91
> [  548.684066]  [] ? :md_mod:md_do_sync+0x54e/0x96c
> [  548.684066]  [] ? getnstimeofday+0x2f/0x83
> [  548.684066]  [] ? try_to_wake_up+0xfd/0x10e
> [  548.684066]  [] ? :md_mod:md_thread+0xde/0xfa
> [  548.684066]  [] ? :md_mod:md_thread+0x0/0xfa
> [  548.684066]  [] ? kthread+0x47/0x76
> [  548.684066]  [] ? schedule_tail+0x28/0x5c
> [  548.684066]  [] ? child_rip+0xa/0x12
> [  548.684066]  [] ? kthread+0x0/0x76
> [  548.684066]  [] ? child_rip+0x0/0x12
> [  548.684066]
> 
> and the machine continued hanging.
> 
> Downgraded to 2.6.23 again, seems to work fine. Will try 2.6.24.2 shortly.

This looks like a networking regression, not a RAID regression.

We did have a big locking bug in netlink in 2.6.25-rc2, although your trace
doesn't appear to point at that particular bug.

It would be good if you could test the latest Linux snapshot please, see if
we've already fixed this, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [2.6.25-rc2] e100: Trying to free already-free IRQ 11 during suspend ...

2008-02-18 Thread Andrew Morton

On Sun, 17 Feb 2008 15:36:50 +0300 Andrey Borzenkov <[EMAIL PROTECTED]> wrote:

> ... and possibly reboot/poweroff (it flows by too fast to be legible).
> 
> [ 8803.850634] ACPI: Preparing to enter system sleep state S3
> [ 8803.853141] Suspending console(s)
> [ 8805.287505] serial 00:09: disabled
> [ 8805.291564] Trying to free already-free IRQ 11
> [ 8805.291579] Pid: 6920, comm: pm-suspend Not tainted 2.6.25-rc2-1avb #2
> [ 8805.291628]  [] free_irq+0xb7/0x130
> [ 8805.291675]  [] e100_suspend+0xc0/0x100
> [ 8805.291724]  [] pci_device_suspend+0x26/0x70
> [ 8805.291747]  [] suspend_device+0x94/0xd0
> [ 8805.291763]  [] device_suspend+0x153/0x240
> [ 8805.291784]  [] suspend_devices_and_enter+0x4f/0xf0
> [ 8805.291808]  [] ? freeze_processes+0x3f/0x80
> [ 8805.291825]  [] enter_state+0xaa/0x140
> [ 8805.291840]  [] state_store+0x8f/0xd0
> [ 8805.291852]  [] ? state_store+0x0/0xd0
> [ 8805.291866]  [] kobj_attr_store+0x24/0x30
> [ 8805.291901]  [] sysfs_write_file+0xbb/0x110
> [ 8805.291936]  [] vfs_write+0x99/0x130
> [ 8805.291963]  [] ? sysfs_write_file+0x0/0x110
> [ 8805.291979]  [] sys_write+0x3d/0x70
> [ 8805.291998]  [] sysenter_past_esp+0x5f/0xa5
> [ 8805.292038]  ===
> [ 8805.347640] ACPI: PCI interrupt for device :00:06.0 disabled
> [ 8805.361128] ACPI: PCI interrupt for device :00:02.0 disabled
> [ 8805.376670]  hwsleep-0322 [00] enter_sleep_state : Entering sleep 
> state [S3]
> [ 8805.376670] Back to C!
> 
> Interface is unused normally (only for netconsole sometimes). dmesg and config
> attached.

Does reverting this:

commit 8543da6672b0994921f014f2250e27ae81645580
Author: Auke Kok <[EMAIL PROTECTED]>
Date:   Wed Dec 12 16:30:42 2007 -0800

e100: free IRQ to remove warningwhenrebooting

with this patch:

--- a/drivers/net/e100.c~revert-1
+++ a/drivers/net/e100.c
@@ -2804,9 +2804,8 @@ static int e100_suspend(struct pci_dev *
pci_enable_wake(pdev, PCI_D3cold, 0);
}
 
-   free_irq(pdev->irq, netdev);
-
pci_disable_device(pdev);
+   free_irq(pdev->irq, netdev);
pci_set_power_state(pdev, PCI_D3hot);
 
return 0;
@@ -2848,8 +2847,6 @@ static void e100_shutdown(struct pci_dev
pci_enable_wake(pdev, PCI_D3cold, 0);
}
 
-   free_irq(pdev->irq, netdev);
-
pci_disable_device(pdev);
pci_set_power_state(pdev, PCI_D3hot);
 }
_

fix it?

> Hmm ... after resume device has disappeared at all ...
> 
> {pts/1}% cat /proc/interrupts
>CPU0
>   0:1290492XT-PIC-XTtimer
>   1:   6675XT-PIC-XTi8042
>   2:  0XT-PIC-XTcascade
>   3:  2XT-PIC-XT
>   4:  2XT-PIC-XT
>   5:  3XT-PIC-XT
>   7:  4XT-PIC-XTirda0
>   8:  0XT-PIC-XTrtc0
>   9:583XT-PIC-XTacpi
>  10:  2XT-PIC-XT
>  11:  31483XT-PIC-XTyenta, yenta, yenta, ohci_hcd:usb1, ALI 
> 5451, pcmcia0.0
>  12:  28070XT-PIC-XTi8042
>  14:  21705XT-PIC-XTide0
>  15:  82123XT-PIC-XTide1
> NMI:  0   Non-maskable interrupts
> TRM:  0   Thermal event interrupts
> SPU:  0   Spurious interrupts
> ERR:  0

I hope that's not a separate bug...
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Linux 2.6.24.1 - kernel does not boot; IRQ trouble?

2008-02-18 Thread Andrew Morton

On Sun, 17 Feb 2008 00:54:08 + (GMT) Chris Rankin <[EMAIL PROTECTED]> wrote:

> [Try this again, except this time I'll force the attachment as inline text!]
> 
> Hi,
> 
> I have managed to boot 2.6.24.1 on this machine, with the NMI watchdog 
> enabled, by using the
> "acpi=noirq" option. (There does seem to be some unhappiness with bridge 
> symlinks in sysfs,
> though.)
> 
> ...
>
> sysfs: duplicate filename 'bridge' can not be created
> WARNING: at fs/sysfs/dir.c:424 sysfs_add_one()
> Pid: 1, comm: swapper Not tainted 2.6.24.1 #1
>  [] show_trace_log_lvl+0x1a/0x2f
>  [] show_trace+0x12/0x14
>  [] dump_stack+0x6c/0x72
>  [] sysfs_add_one+0x57/0xbc
>  [] sysfs_create_link+0xc2/0x10d
>  [] pci_bus_add_devices+0xbd/0x103
>  [] pci_legacy_init+0x56/0xe3
>  [] kernel_init+0x157/0x2c3
>  [] kernel_thread_helper+0x7/0x10
>  ===
> pci :00:01.0: Error creating sysfs bridge symlink, continuing...
> sysfs: duplicate filename 'bridge' can not be created
> WARNING: at fs/sysfs/dir.c:424 sysfs_add_one()
> Pid: 1, comm: swapper Not tainted 2.6.24.1 #1
>  [] show_trace_log_lvl+0x1a/0x2f
>  [] show_trace+0x12/0x14
>  [] dump_stack+0x6c/0x72
>  [] sysfs_add_one+0x57/0xbc
>  [] sysfs_create_link+0xc2/0x10d
>  [] pci_bus_add_devices+0xbd/0x103
>  [] pci_bus_add_devices+0xa5/0x103
>  [] pci_legacy_init+0x56/0xe3
>  [] kernel_init+0x157/0x2c3
>  [] kernel_thread_helper+0x7/0x10
>  ===

I have a vague feeling that this was fixed, perhaps in 2.6.24.x?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [2.6.25-rc2, 2.6.24-rc8] page allocation failure...

2008-02-18 Thread Andrew Morton

On Sun, 17 Feb 2008 13:20:59 + "Daniel J Blueman" <[EMAIL PROTECTED]> wrote:

> I'm still hitting this with e1000e on 2.6.25-rc2, 10 times again.
> 
> It's clearly non-fatal, but then do we expect it to occur?
> 
> Daniel
> 
> --- [dmesg]
> 
> [ 1250.822786] swapper: page allocation failure. order:3, mode:0x4020
> [ 1250.822786] Pid: 0, comm: swapper Not tainted 2.6.25-rc2-119 #2
> [ 1250.822786]
> [ 1250.822786] Call Trace:
> [ 1250.822786][] __alloc_pages+0x34e/0x3a0
> [ 1250.822786]  [] ? __netdev_alloc_skb+0x1f/0x40
> [ 1250.822786]  [] __slab_alloc+0x102/0x3d0
> [ 1250.822786]  [] ? __netdev_alloc_skb+0x1f/0x40
> [ 1250.822786]  [] __kmalloc_track_caller+0x7b/0xc0
> [ 1250.822786]  [] __alloc_skb+0x6f/0x160
> [ 1250.822786]  [] __netdev_alloc_skb+0x1f/0x40
> [ 1250.822786]  [] e1000_alloc_rx_buffers+0x1ed/0x260
> [ 1250.822786]  [] e1000_clean_rx_irq+0x22a/0x330
> [ 1250.822786]  [] e1000_clean+0x1e1/0x540
> [ 1250.822786]  [] ? tick_program_event+0x45/0x70
> [ 1250.822786]  [] net_rx_action+0x9a/0x150
> [ 1250.822786]  [] __do_softirq+0x74/0xf0
> [ 1250.822786]  [] call_softirq+0x1c/0x30
> [ 1250.822786]  [] do_softirq+0x3d/0x80
> [ 1250.822786]  [] irq_exit+0x85/0x90
> [ 1250.822786]  [] do_IRQ+0x85/0x100
> [ 1250.822786]  [] ? mwait_idle+0x0/0x50
> [ 1250.822786]  [] ret_from_intr+0x0/0xa
> [ 1250.822786][] ? mwait_idle+0x45/0x50
> [ 1250.822786]  [] ? enter_idle+0x22/0x30
> [ 1250.822786]  [] ? cpu_idle+0x74/0xa0
> [ 1250.822786]  [] ? rest_init+0x55/0x60

They're regularly reported with e1000 too - I don't think aything really
changed.

e1000 has this crazy problem where because of a cascade of follies (mainly
borked hardware) it has to do a 32kb allocation for a 9kb(?) packet.  It
would be sad if that was carried over into e1000e?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: include/linux/pcounter.h

2008-02-16 Thread Andrew Morton

On Sat, 16 Feb 2008 11:07:29 +0100 Eric Dumazet <[EMAIL PROTECTED]> wrote:

> 
> Andrew, pcounter is a temporary abstraction.

It's buggy!  Main problems are a) possible return of negative numbers b)
some of the API can't be from preemptible code c) excessive interrupt-off
time on some machines if used from irq-disabled sections.

> It is temporaty because it will vanish as soon as Christoph Clameter (or 
> somebody else) provides real cheap per cpu counter implementation.

numbers?

most of percpu_counter_add() is only executed once per FBC_BATCH calls.

> At time I introduced it in network tree (locally, not meant to invade kernel 
> land and makes you unhappy :) ), the goals were :

Well maybe as a temporary networking-only thing OK, based upon
performance-tested results.  But I don't think the present code is suitable
as part of the kernel-wide toolkit.

> Some counters (total sockets count) were a single integer, that were doing 
> ping-pong between cpus (SMP/NUMA). As they are basically lazy values (as we 
> dont really need to read their value), using plain atomic_t was overkill.
> 
> Using a plain percpu_counters was expensive (NR_CPUS*(32+sizeof(void *)) 
> instead of num_possible_cpus()*4).

No, percpu_counters use alloc_percpu(), which is O(num_possible_cpus), not
O(NR_CPUS).

> Using 'online' instead of 'possible' stuff is not really needed for a 
> temporary thing.

This was put in ./lib/!

> - We dont care of read sides.

Well the present single caller in networking might not care.  But this was
put in ./lib/ and was exported to modules.  That is an invitation to all
kernel developers to use it in new code.  Which may result in truly awful
performance on high-cpu-count machines.

>  We want really fast write side. Real fast.

eh?  It's called on a per-connection basis, not on a per-packet basis?

> Read side is when you do a "cat /proc/net/sockstat".
> That is ... once in a while...

For the current single caller.  But it's in ./lib/.

And there's always someone out there who does whatever we don't expect them
to do.

> Now when we allocate a new socket, code to increment the "socket count" is :
> 
> c03a74a8 :
> c03a74a8:   b8 90 26 5f c0  mov$0xc05f2690,%eax
> c03a74ad:   64 8b 0d 10 f1 5e c0mov%fs:0xc05ef110,%ecx
> c03a74b4:   01 14 01add%edx,(%ecx,%eax,1)
> c03a74b7:   c3  ret

I can't find that code.  I suspect that's the DEFINE_PER_CPU flavour, which
isn't used anywhere afaict.  Plus this omits the local_irq_save/restore (or
preempt_disable/enable) and the indirect function call, which can be
expensive.

> That is 4 instructions. I could be two in the future, thanks to current work 
> on fs/gs based percpu variables.
> 
> Current percpu_counters implementation is more expensive :
> 
> c021467b <__percpu_counter_add>:
> c021467b:   55  push   %ebp
> c021467c:   57  push   %edi
> c021467d:   89 c7   mov%eax,%edi
> c021467f:   56  push   %esi
> c0214680:   53  push   %ebx
> c0214681:   83 ec 04sub$0x4,%esp
> c0214684:   8b 40 14mov0x14(%eax),%eax
> c0214687:   64 8b 1d 08 f0 5e c0mov%fs:0xc05ef008,%ebx
> c021468e:   8b 6c 24 18 mov0x18(%esp),%ebp
> c0214692:   f7 d0   not%eax
> c0214694:   8b 1c 98mov(%eax,%ebx,4),%ebx
> c0214697:   89 1c 24mov%ebx,(%esp)
> c021469a:   8b 03   mov(%ebx),%eax
> c021469c:   89 c3   mov%eax,%ebx
> c021469e:   89 c6   mov%eax,%esi
> c02146a0:   c1 fe 1fsar$0x1f,%esi
> c02146a3:   89 e8   mov%ebp,%eax
> c02146a5:   01 d3   add%edx,%ebx
> c02146a7:   11 ce   adc%ecx,%esi
> c02146a9:   99  cltd
> c02146aa:   39 d6   cmp%edx,%esi
> c02146ac:   7f 15   jg c02146c3 
> <__percpu_counter_add+0x48>
> c02146ae:   7c 04   jl c02146b4 

One of the above two branches is taken ((FBC_BATCH-1)/FBC_BATCH)ths of the
time.

> <__percpu_counter_add+0x39>
> c02146b0:   39 eb   cmp%ebp,%ebx
> c02146b2:   73 0f   jaec02146c3 
> <__percpu_counter_add+0x48>
> c02146b4:   f7 dd   neg%ebp
> c02146b6:   89 e8   mov%ebp,%eax
> c02146b8:   99  cltd
> c02146b9:   39 d6   cmp%edx,%esi
> c02146bb:   7f 20   jg c02146dd 
> <__percpu_counter_add+0x62>
> c02146bd:   7c 04   jl c02146c3 
> <__percpu_counter_add+0x48>
> c02146bf:   39 eb   cmp%ebp,%ebx
> c02146c1:   77 1a   ja

Re: include/linux/pcounter.h

2008-02-15 Thread Andrew Morton


- First up, why was this added at all?  We have percpu_counter.h which
  has several years development invested in it.  afaict it would suit the
  present applications of pcounters.

  If some deficiency in percpu_counters has been identified, is it
  possible to correct that deficiency rather than implementing a whole new
  counter type?  That would be much better.

- The comments in pcounter.h appear to indicate that there is a
  performance advantage (and we infer that the advantage is when the
  statically-allocated flavour of pcounters is used).  When compared with
  percpu_counters the number of data-reference indirections is the same as
  with percpu_counters, so no advantage there.

  And, bizarrely, because of a quite inappropriate abstraction toy, both
  flavours of pcounters add an indirect function call which I believe is
  significantly more expensive than a plain old pointer indirection.

  So it's quite possible that DEFINE_PCOUNTER-style counters consume more
  memory and are slower than percpu_counters.  They will surely be much
  slower on the read side.  More below.

  If we really want to put some helper wrappers around
  DEFINE_PER_CPU(s32) then I'd suggest that we should do that as a
  standalone thing and not attempt to graft the same interface onto two
  quite different types of storage (DEFINE_PER_CPU and alloc_per_cpu)

- The comment "2)" in pcounter.h (which overflows 80 cols and probably
  wasn't passed through checkpatch) indicates that some other
  implementation (presumably plain old DEFINE_PER_CPU) will use
  NR_CPUS*(32+sizeof(void *)) bytes of storage.  But DEFINE_PCOUNTER will
  use as much memory as DEFINE_PER_CPU(s32) and both pcounter_alloc()-style
  pcounters and percpu_counters use
  num_possible_cpus()*sizeof(s32)+epsilon.

- The CONFIG_SMP=n stubs in pcounter.h are cheesy and are vulnerable to
  several well-known compilation risks which I always forget.  Should be
  converted to regular static inlines.

- the comment in lib/pcounter.c needlessly exceeds 80 cols.

- pcounter_dyn_add() will spew a
  use-of-smp_processor_id()-in-preemptible-code warning if used in places
  where one could reasonably use it.  The interface could do with a bit of
  a rethink.  Or at least some justification and documentation.

- pcounter_getval() will be disastrously inefficient if
  num_possible_cpus() is much greater than num_online_cpus().  It should
  use for_each_online_cpu() (as does percpu_counter), and implement a CPU
  hotplug notifier (as does percpu_counter).

  It will remain grossly inefficient at high CPU counts, unlike
  percpu_counters, which solved this problem by doing a batched spill into
  a central counter at add/sub time.

  The danger here is that someone will actually use this interface in new
  code.  Six months later (when it's too late to fix it) the big-NUMA guys
  come up and say "whaa, when our user does  it disabled interrupts
  for N milliseconds".

- pcounter_getval() can return incorrect negative numbers.  This can
  cause caller malfunctions in very rare situations because callers just
  don't expect the things which they're counting to go negative.

  We experienced this during the evolution of percpu_counter.  See
  percpu_counter_sum_positive() and friends.

- pcounter_alloc() should return -ENOMEM on allocation error, not "1".

- pcounter_free() perhaps shouldn't test for (self->per_cpu_values !=
  NULL), because callers shouldn't be calling it if pcounter_alloc() failed
  (arguable).



afaict the whole implementation can and should be removed and replaced with
percpu_counters.  I don't think there's much point in its ability to manage
DEFINE_PER_CPU counters: pcounter_getval() remains grossly inefficient (and
can return negative values) and quite a bit of new code will need to be put
in place to address that.

But perhaps there are plans to evolve it into something further in the
future, I don't know.  But I would suggest that the people who have worked
upon percpu_counters (principally Gautham, Peter Z, clameter and me) be
involved in that work.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/2] remove rcu_assign_pointer(NULL) penalty with type/macro safety

2008-02-15 Thread Andrew Morton

On Wed, 13 Feb 2008 14:00:24 -0800
"Paul E. McKenney" <[EMAIL PROTECTED]> wrote:

> Hello!
> 
> This is an updated version of the patch posted last November:
> 
>   http://archives.free.net.ph/message/20071201.003721.cd6ff17c.en.html
> 
> This new version permits arguments with side effects, for example:
> 
>   rcu_assign_pointer(global_p, p++);
> 
> and also verifies that the arguments are pointers, while still avoiding
> the unnecessary memory barrier when assigning NULL to a pointer.
> This memory-barrier avoidance means that rcu_assign_pointer() is now only
> permitted for pointers (not array indexes), and so this version emits a
> compiler warning if the first argument is not a pointer.  I built a "make
> allyesconfig" version on an x86 system, and received no such warnings.
> If RCU is ever applied to array indexes, then the second patch in this
> series should be applied, and the resulting rcu_assign_index() be used.
> 
> Given the rather surprising history of subtlely broken implementations of
> rcu_assign_pointer(), I took the precaution of generating a full set of
> test cases and verified that memory barriers and compiler warnings were
> emitted when required.  I guess it is the simple things that get you...
> 
> Signed-off-by: Paul E. McKenney <[EMAIL PROTECTED]>
> ---
> 
>  rcupdate.h |   16 
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff -urpNa -X dontdiff linux-2.6.24/include/linux/rcupdate.h 
> linux-2.6.24-rap/include/linux/rcupdate.h
> --- linux-2.6.24/include/linux/rcupdate.h 2008-01-24 14:58:37.0 
> -0800
> +++ linux-2.6.24-rap/include/linux/rcupdate.h 2008-02-13 13:36:47.0 
> -0800
> @@ -270,12 +270,20 @@ extern struct lockdep_map rcu_lock_map;
>   * structure after the pointer assignment.  More importantly, this
>   * call documents which pointers will be dereferenced by RCU read-side
>   * code.
> + *
> + * Throws a compiler warning for non-pointer arguments.
> + *
> + * Does not insert a memory barrier for a NULL pointer.
>   */
>  
> -#define rcu_assign_pointer(p, v) ({ \
> - smp_wmb(); \
> - (p) = (v); \
> - })
> +#define rcu_assign_pointer(p, v) \
> + ({ \
> + typeof(*p) *_p1 = (v); \
> + \
> + if (!__builtin_constant_p(v) || (_p1 != NULL)) \
> + smp_wmb(); \
> + (p) = _p1; \
> + })
>  

umm...

net/netfilter/core.c: In function 'nf_register_afinfo':
net/netfilter/core.c:39: warning: initialization discards qualifiers from 
pointer target type
net/netfilter/nf_log.c: In function 'nf_log_register':
net/netfilter/nf_log.c:37: warning: initialization discards qualifiers from 
pointer target type
net/netfilter/nf_queue.c: In function 'nf_register_queue_handler':
net/netfilter/nf_queue.c:38: warning: initialization discards qualifiers from 
pointer target type


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Bugme-new] [Bug 9990] New: tg3: eth0: The system may be re-ordering memory-mapped I/O cycles

2008-02-14 Thread Andrew Morton

On Thu, 14 Feb 2008 01:59:12 -0800 (PST) [EMAIL PROTECTED] wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=9990
> 
>Summary: tg3: eth0: The system may be re-ordering memory-mapped
> I/O cycles
>Product: Drivers
>Version: 2.5
>  KernelVersion: 2.6.24-git18
>   Platform: All
> OS/Version: Linux
>   Tree: Mainline
> Status: NEW
>   Severity: normal
>   Priority: P1
>  Component: Network
> AssignedTo: [EMAIL PROTECTED]
> ReportedBy: [EMAIL PROTECTED]
> 
> 
> Latest working kernel version: 2.6.24
> Earliest failing kernel version: 2.6.24-git18
> Distribution: Debian/testing
> Hardware Environment:
> Software Environment:
> Problem Description:
> 
> Feb 11 13:11:52 www kernel: [   12.015569] tg3: eth0: Link is up at 100 Mbps,
> full duplex.
> Feb 11 13:11:52 www kernel: [   12.015633] tg3: eth0: Flow control is on for 
> TX
> and on for RX.
> Feb 11 13:33:44 www kernel: [ 1328.538204] tg3: eth0: The system may be
> re-ordering memory-mapped I/O cycles to the network
> device, attempting to recover. Please report the problem to the driver
> maintainer and include system chipset information.
> Feb 11 13:33:44 www kernel: [ 1328.667255] tg3: eth0: Link is down.
> Feb 11 13:33:46 www kernel: [ 1330.560734] tg3: eth0: Link is up at 100 Mbps,
> full duplex.
> Feb 11 13:33:46 www kernel: [ 1330.560734] tg3: eth0: Flow control is on for 
> TX
> and on for RX.
> 
> After that, the machine rebooted (panic?)
> 
> Feb 11 13:35:14 www kernel: klogd 1.5.0#1.1, log source = /proc/kmsg started.
> 
> lspci -vvv info:
> 02:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit
> Ethernet (rev 10)
> Subsystem: Compaq Computer Corporation NC7782 Gigabit Server Adapter
> (PCI-X, 10,100,1000-T)
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+
> Stepping- SERR+ FastB2B- DisINTx-
> Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
> SERR-  Latency: 64 (16000ns min), Cache Line Size: 64 bytes
> Interrupt: pin A routed to IRQ 19
> Region 0: Memory at fdf7 (64-bit, non-prefetchable) [size=64K]
> [virtual] Expansion ROM at 8814 [disabled] [size=64K]
> Capabilities: [40] PCI-X non-bridge device
> Command: DPERE- ERO- RBC=2048 OST=1
> Status: Dev=02:02.0 64bit+ 133MHz+ SCD- USC- DC=simple
> DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
> Capabilities: [48] Power Management version 2
> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
> PME(D0-,D1-,D2-,D3hot+,D3cold+)
> Status: D0 PME-Enable+ DSel=0 DScale=1 PME-
> Capabilities: [50] Vital Product Data 
> Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ 
> Queue=0/3
> Enable-
> Address: fd7ffd6fdf7deeb8  Data: bdfd
> Kernel driver in use: tg3
> Kernel modules: tg3
> 
> 02:02.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit
> Ethernet (rev 10)
> Subsystem: Compaq Computer Corporation NC7782 Gigabit Server Adapter
> (PCI-X, 10,100,1000-T)
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+
> Stepping- SERR+ FastB2B- DisINTx-
> Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
> SERR-  Latency: 64 (16000ns min), Cache Line Size: 64 bytes
> Interrupt: pin B routed to IRQ 20
> Region 0: Memory at fdf6 (64-bit, non-prefetchable) [size=64K]
> Capabilities: [40] PCI-X non-bridge device
> Command: DPERE- ERO+ RBC=512 OST=1
> Status: Dev=02:02.1 64bit+ 133MHz+ SCD- USC- DC=simple
> DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
> Capabilities: [48] Power Management version 2
> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
> PME(D0-,D1-,D2-,D3hot+,D3cold+)
> Status: D0 PME-Enable- DSel=0 DScale=1 PME-
> Capabilities: [50] Vital Product Data 
> Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ 
> Queue=0/3
> Enable-
> Address: f73feeefe7f8  Data: 9bcd
> Kernel driver in use: tg3
> Kernel modules: tg3
> 
> 
> Steps to reproduce:
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] add rcu_assign_index() if ever needed

2008-02-13 Thread Andrew Morton

On Thu, 14 Feb 2008 09:02:09 +0530 Gautham R Shenoy <[EMAIL PROTECTED]> wrote:

> >  /**
> > + * rcu_assign_index - assign (publicize) a index of a newly
> > + * initialized array elementg that will be dereferenced by RCU
>    
> 
> I hope Andrew got that one while porting against the latest -mm :)

I don't actually read the comments - I just like to make sure they're
there ;)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/2] remove rcu_assign_pointer(NULL) penalty with type/macro safety

2008-02-13 Thread Andrew Morton

On Wed, 13 Feb 2008 15:57:38 -0800 (PST)
David Miller <[EMAIL PROTECTED]> wrote:

> From: Andrew Morton <[EMAIL PROTECTED]>
> Date: Wed, 13 Feb 2008 15:52:45 -0800
> 
> > On Wed, 13 Feb 2008 15:37:44 -0800
> > "Paul E. McKenney" <[EMAIL PROTECTED]> wrote:
> > 
> > > Ah.  It does take a bit to get fib_trie into one's build -- allyesconfig
> > > doesn't cut it.
> > 
> > This is not good.  The sole purpose of allmodconfig and allyesconfig is for
> > compilation and linkage coverage testing.  Hence we should aim to get as
> > much code as possible included in those cases.
> 
> Well, in this case there is a choice, either you use one routing
> lookup datastructure or the other.  It's not purposefully being hidden
> from the everything builds :-)

oic.  yes, that is a bit of a problem.  Oh well.

`make randconfig' seems to be able to enable CONFIG_IP_FIB_TRIE about one
time in eight ;)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/2] remove rcu_assign_pointer(NULL) penalty with type/macro safety

2008-02-13 Thread Andrew Morton

On Wed, 13 Feb 2008 15:37:44 -0800
"Paul E. McKenney" <[EMAIL PROTECTED]> wrote:

> Ah.  It does take a bit to get fib_trie into one's build -- allyesconfig
> doesn't cut it.

This is not good.  The sole purpose of allmodconfig and allyesconfig is for
compilation and linkage coverage testing.  Hence we should aim to get as
much code as possible included in those cases.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/2] remove rcu_assign_pointer(NULL) penalty with type/macro safety

2008-02-13 Thread Andrew Morton

On Wed, 13 Feb 2008 14:00:24 -0800
"Paul E. McKenney" <[EMAIL PROTECTED]> wrote:

> Hello!
> 
> This is an updated version of the patch posted last November:
> 
>   http://archives.free.net.ph/message/20071201.003721.cd6ff17c.en.html
> 
> This new version permits arguments with side effects, for example:
> 
>   rcu_assign_pointer(global_p, p++);
> 
> and also verifies that the arguments are pointers, while still avoiding
> the unnecessary memory barrier when assigning NULL to a pointer.
> This memory-barrier avoidance means that rcu_assign_pointer() is now only
> permitted for pointers (not array indexes), and so this version emits a
> compiler warning if the first argument is not a pointer.  I built a "make
> allyesconfig" version on an x86 system, and received no such warnings.
> If RCU is ever applied to array indexes, then the second patch in this
> series should be applied, and the resulting rcu_assign_index() be used.
> 
> Given the rather surprising history of subtlely broken implementations of
> rcu_assign_pointer(), I took the precaution of generating a full set of
> test cases and verified that memory barriers and compiler warnings were
> emitted when required.  I guess it is the simple things that get you...
> 
> Signed-off-by: Paul E. McKenney <[EMAIL PROTECTED]>
> ---
> 
>  rcupdate.h |   16 
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff -urpNa -X dontdiff linux-2.6.24/include/linux/rcupdate.h 
> linux-2.6.24-rap/include/linux/rcupdate.h
> --- linux-2.6.24/include/linux/rcupdate.h 2008-01-24 14:58:37.0 
> -0800
> +++ linux-2.6.24-rap/include/linux/rcupdate.h 2008-02-13 13:36:47.0 
> -0800

whoop, ancient kernel alert.

> @@ -270,12 +270,20 @@ extern struct lockdep_map rcu_lock_map;
>   * structure after the pointer assignment.  More importantly, this
>   * call documents which pointers will be dereferenced by RCU read-side
>   * code.
> + *
> + * Throws a compiler warning for non-pointer arguments.
> + *
> + * Does not insert a memory barrier for a NULL pointer.
>   */
>  
> -#define rcu_assign_pointer(p, v) ({ \
> - smp_wmb(); \
> - (p) = (v); \
> - })
> +#define rcu_assign_pointer(p, v) \
> + ({ \
> + typeof(*p) *_p1 = (v); \
> + \
> + if (!__builtin_constant_p(v) || (_p1 != NULL)) \
> + smp_wmb(); \
> + (p) = _p1; \
> + })

Someone already merged the dont-do-it-for-NULL patch so I reworked this
appropriately.  Was too lazy to update the changelog though.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 1045 matches

Mail list logo