Re: [1/6] 2.6.21-rc4: known regressions

2007-03-22 Thread Michal Piotrowski

On 23/03/07, Nick Piggin <[EMAIL PROTECTED]> wrote:

On Thu, Mar 22, 2007 at 06:40:41PM -0700, Linus Torvalds wrote:
>
> [ Ok, I think it's those timers again...
>
>   Ingo: let me just state how *happy* I am that I told you off when you
>   wanted to merge the hires timers and NO_HZ before 2.6.20 because they
>   were "stable". You were wrong, and 2.6.20 is at least in reasonable
>   shape. Now we just need to make sure that 2.6.21 will be too.. ]
>
> On Thu, 22 Mar 2007, Mingming Cao wrote:
> >
> > I might have missed something; so far I can't see a deadlock yet.
> > If there is a deadlock, I think we should see ext3_xattr_release_block()
> > and ext3_forget() on the stack. Is this the case?
>
> No. What's strange is that two (maybe more, I didn't check) processes seem
> to be stuck in
>
>[] schedule_timeout+0x70/0x8e
>[] schedule_timeout_uninterruptible+0x15/0x17
>[] journal_stop+0xe2/0x1e6
>[] journal_force_commit+0x1d/0x1f
>[] ext3_force_commit+0x22/0x24
>[] ext3_write_inode+0x34/0x3a
>[] __writeback_single_inode+0x1c5/0x2cb
>[] sync_inode+0x1c/0x2e
>[] ext3_sync_file+0xab/0xc0
>[] do_fsync+0x4b/0x98
>[] __do_fsync+0x20/0x2f
>[] sys_fsync+0xd/0xf
>[] syscall_call+0x7/0xb
>
> but that thing is literally:
>
>   ...
> do {
> old_handle_count = transaction->t_handle_count;
> schedule_timeout_uninterruptible(1);
> } while (old_handle_count != transaction->t_handle_count);
>   ...
>
> and especially if nothing is happening, I'd not expect
> "transaction->t_handle_count" to keep changing, so it should stop very
> quickly.
>
> Maybe it's CONFIG_NO_HZ again, and the problem is that timeout, and simply
> no timer tick happening?
>
> Bingo. I think that's it.
>
>   active timers:
>#0: hardirq_stack, tick_sched_timer, S:01
># expires at 953089300 nsecs [in -2567889 nsecs]
>#1: hardirq_stack, hrtimer_wakeup, S:01
># expires at 10858649798503 nsecs [in 1327754230614 nsecs]
> .expires_next   : 953089300 nsecs
>
> See
>
>   http://lkml.org/lkml/2007/3/16/288
>
> and that in turn points to the kernel log:
>
>   
http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc4/git-console.log

Seems convincing. Michal, can you post your .config, and if you had
dynticks and hrtimers enabled, try reproducing without them?
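The wait loop Linus quotes can be modeled in userspace to show why it depends on the tick. In the sketch below the `tick` callback stands in for `schedule_timeout_uninterruptible(1)`: the loop only terminates if something actually delivers ticks, which is exactly what NO_HZ can fail to do here. All names are illustrative, not the kernel API.

```c
#include <assert.h>

/*
 * Userspace model of the journal_stop() wait loop quoted above.
 * The tick callback stands in for schedule_timeout_uninterruptible(1);
 * with NO_HZ and no tick firing, the real loop never makes progress
 * and fsync() appears hung.  Illustrative names only.
 */
struct transaction {
	unsigned long t_handle_count;
};

/* Poll until t_handle_count stops changing; return how many polls it took. */
static int wait_for_handles_to_settle(struct transaction *t,
				      void (*tick)(struct transaction *))
{
	unsigned long old_handle_count;
	int polls = 0;

	do {
		old_handle_count = t->t_handle_count;
		tick(t);	/* kernel: schedule_timeout_uninterruptible(1) */
		polls++;
	} while (old_handle_count != t->t_handle_count);
	return polls;
}

/* A tick source that opens three more handles, then goes idle. */
static int busy_ticks = 3;
static void fake_tick(struct transaction *t)
{
	if (busy_ticks-- > 0)
		t->t_handle_count++;
}
```

Run against `fake_tick`, the loop settles after four polls; the failure mode under discussion is the case where the tick callback never runs at all.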



http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc4/git-config

I don't know how to reproduce this bug on 2.6.21-rc4. On 2.6.21-rc2-mm1
it was very simple: just run youtube, bash_shared_mapping, etc. In fact,
I haven't seen this bug for a week.

Unfortunately, I wasn't able to take a crash dump because of a sound
card driver bug (I do have a crash dump from 2.6.21-rc2-mm1).

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [QUICKLIST 1/5] Quicklists for page table pages V4

2007-03-22 Thread Andrew Morton
On Thu, 22 Mar 2007 23:52:05 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Thu, 22 Mar 2007, Andrew Morton wrote:
> 
> > On Thu, 22 Mar 2007 23:28:41 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> > 
> > > 1. Proven code from the IA64 arch.
> > > 
> > >   The method used here has been fine tuned for years and
> > >   is NUMA aware. It is based on the knowledge that accesses
> > >   to page table pages are sparse in nature. Taking a page
> > >   off the freelists instead of allocating a zeroed pages
> > >   allows a reduction of number of cachelines touched
> > >   in addition to getting rid of the slab overhead. So
> > >   performance improves.
> > 
> > By how much?
> 
> About 40% on fork+exit. See 
> 
> http://marc.info/?l=linux-ia64&m=110942798406005&w=2
> 

afaict that two-year-old, totally-different patch has nothing to do with my
repeatedly-asked question.  It appears to be consolidating three separate
quicklist allocators into one common implementation.

In an attempt to answer my own question (and hence to justify the retention
of this custom allocator) I did this:


diff -puN include/linux/quicklist.h~qlhack include/linux/quicklist.h
--- a/include/linux/quicklist.h~qlhack
+++ a/include/linux/quicklist.h
@@ -32,45 +32,17 @@ DECLARE_PER_CPU(struct quicklist, quickl
  */
 static inline void *quicklist_alloc(int nr, gfp_t flags, void (*ctor)(void *))
 {
-   struct quicklist *q;
-   void **p = NULL;
-
-   q =&get_cpu_var(quicklist)[nr];
-   p = q->page;
-   if (likely(p)) {
-   q->page = p[0];
-   p[0] = NULL;
-   q->nr_pages--;
-   }
-   put_cpu_var(quicklist);
-   if (likely(p))
-   return p;
-
-   p = (void *)__get_free_page(flags | __GFP_ZERO);
+   void *p = (void *)__get_free_page(flags | __GFP_ZERO);
if (ctor && p)
ctor(p);
return p;
 }
 
-static inline void quicklist_free(int nr, void (*dtor)(void *), void *pp)
+static inline void quicklist_free(int nr, void (*dtor)(void *), void *p)
 {
-   struct quicklist *q;
-   void **p = pp;
-   struct page *page = virt_to_page(p);
-   int nid = page_to_nid(page);
-
-   if (unlikely(nid != numa_node_id())) {
-   if (dtor)
-   dtor(p);
-   free_page((unsigned long)p);
-   return;
-   }
-
-   q = &get_cpu_var(quicklist)[nr];
-   p[0] = q->page;
-   q->page = p;
-   q->nr_pages++;
-   put_cpu_var(quicklist);
+   if (dtor)
+   dtor(p);
+   free_page((unsigned long)p);
 }
 
 void quicklist_trim(int nr, void (*dtor)(void *),
@@ -81,4 +53,3 @@ unsigned long quicklist_total_size(void)
 #endif
 
 #endif /* LINUX_QUICKLIST_H */
-
_

but it crashes early in the page allocator (i386) and I don't see why.  It
makes me wonder if we have a use-after-free which is hidden by the presence
of the quicklist buffering or something.
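For reference, the per-CPU freelist logic that the hack above deletes can be sketched in a single-threaded userspace form. Hypothetical names; `calloc` stands in for `__get_free_page(__GFP_ZERO)`, and the per-CPU and NUMA handling is omitted.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

/*
 * Single-CPU userspace sketch of the quicklist idea: freed pages are
 * chained through their own first word, so a later allocation can pop
 * a known-zeroed page without touching the page allocator.
 * Illustrative only - the real code is per-CPU and NUMA-aware.
 */
struct quicklist {
	void **page;	/* freelist head; the next pointer lives in p[0] */
	int nr_pages;
};

static void *quicklist_alloc(struct quicklist *q)
{
	void **p = q->page;

	if (p) {
		q->page = p[0];		/* pop */
		p[0] = NULL;		/* restore the all-zero invariant */
		q->nr_pages--;
		return p;
	}
	return calloc(1, PAGE_SIZE);	/* kernel: __get_free_page(__GFP_ZERO) */
}

static void quicklist_free(struct quicklist *q, void *pp)
{
	void **p = pp;

	p[0] = q->page;			/* push */
	q->page = p;
	q->nr_pages++;
}
```

A free followed by an alloc hands back the very same page, which is the cacheline/zeroing win Christoph cites; it is also why a use-after-free could go unnoticed behind the buffering, as suspected above.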



Re: [PATCH] slab: NUMA kmem_cache diet

2007-03-22 Thread Eric Dumazet

Pekka J Enberg a écrit :

(Please inline patches to the mail, makes it easier to review.)

On Thu, 22 Mar 2007, Eric Dumazet wrote:

Some NUMA machines have a big MAX_NUMNODES (possibly 1024), but fewer possible
nodes. This patch dynamically sizes the 'struct kmem_cache' to allocate only
needed space.

I moved the nodelists[] field to the end of struct kmem_cache and use the
following computation in kmem_cache_init():


Hmm, what seems a bit worrying is:

diff --git a/mm/slab.c b/mm/slab.c
index abf46ae..b187618 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -389,7 +389,6 @@ struct kmem_cache {
unsigned int buffer_size;
u32 reciprocal_buffer_size;
 /* 3) touched by every alloc & free from the backend */
-   struct kmem_list3 *nodelists[MAX_NUMNODES];

I think nodelists is placed at the beginning of the struct for a reason. 
But I have no idea if it actually makes any difference...


It might make a difference if STATS is on, because freehit/freemiss might 
share a cache line with nodelists. Apart from that, a kmem_cache struct is 
read_mostly: all changes are done outside of it, via array_cache or nodelists[].



Anyway, slab STATS is already an SMP/NUMA nightmare because of cache line 
ping-pong. We might place the STATS counters in one or more dedicated cache line(s)...
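The sizing trick Eric describes relies on keeping the per-node array last, so the struct can be truncated to the number of possible nodes. A minimal sketch under assumed field names (the real kmem_cache has many more members):

```c
#include <assert.h>
#include <stddef.h>

struct kmem_list3;			/* opaque per-node lists */

struct kmem_cache_sketch {
	unsigned int buffer_size;
	unsigned int reciprocal_buffer_size;
	/* ... other mostly-read fields ... */
	struct kmem_list3 *nodelists[];	/* flexible array member, kept last */
};

/* Bytes to allocate for a machine with nr_node_ids possible nodes. */
static size_t cache_struct_size(size_t nr_node_ids)
{
	return offsetof(struct kmem_cache_sketch, nodelists) +
	       nr_node_ids * sizeof(struct kmem_list3 *);
}
```

With MAX_NUMNODES == 1024 but only two possible nodes, this drops 1022 pointer slots from every cache, at the cost of moving nodelists[] away from the front of the struct.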




Re: [PATCH -mm try#2] Blackfin: on-chip Two Wire Interface I2C driver

2007-03-22 Thread Wu, Bryan
On Fri, 2007-03-23 at 08:27 +0100, Jean Delvare wrote:
> Hi Bryan,
> 
> On Fri, 23 Mar 2007 13:46:57 +0800, Wu, Bryan wrote:
> > Changelogs:
> > 
> > a) Fixed issues according to Jean's review.
> > b) Added MAINTAINERS information
> > c) Added I2C_HW_B_BLACKFIN to i2c-id.h
> 
> I2C_HW_B_* is traditionally used for drivers built on top of the
> i2c-algo-bit driver, which isn't the case of your driver, so it's a bit
> confusing. Please instead use:
> 
> #define I2C_HW_BLACKFIN   0x190001
> 
> I hope we'll be able to get rid of these IDs soon, so we no longer have
> to care about this mess.

Thanks, I appreciate it. When you want to get rid of these IDs, I will
remove the ID used by this patch.

> 
> Other than that I'm OK with the patch this time, I'll push it on my
> stack. I'll fix the ID issue myself, no need to resend. This also means
> that, from now on, any change to this driver should be provided as an
> incremental patch on top of this version.
> 
> Thanks,

OK, I will definitely follow this rule.
Thanks again
-Bryan Wu


Re: [PATCH -mm try#2] Blackfin: on-chip Two Wire Interface I2C driver

2007-03-22 Thread Jean Delvare
Hi Bryan,

On Fri, 23 Mar 2007 13:46:57 +0800, Wu, Bryan wrote:
> Changelogs:
> 
> a) Fixed issues according to Jean's review.
> b) Added MAINTAINERS information
> c) Added I2C_HW_B_BLACKFIN to i2c-id.h

I2C_HW_B_* is traditionally used for drivers built on top of the
i2c-algo-bit driver, which isn't the case of your driver, so it's a bit
confusing. Please instead use:

#define I2C_HW_BLACKFIN 0x190001

I hope we'll be able to get rid of these IDs soon, so we no longer have
to care about this mess.

Other than that I'm OK with the patch this time, I'll push it on my
stack. I'll fix the ID issue myself, no need to resend. This also means
that, from now on, any change to this driver should be provided as an
incremental patch on top of this version.

Thanks,
-- 
Jean Delvare


Re: [PATCH] lguest: clean rest of linkage warnings (bar one)

2007-03-22 Thread Rusty Russell
On Thu, 2007-03-22 at 12:45 +0100, Sam Ravnborg wrote:
> On Thu, Mar 22, 2007 at 09:09:42PM +1100, Rusty Russell wrote:
> > It also fixes the remaining warnings, except one.  The code in
> > modpost.c which needs to be taught that it's legal to link from
> > .paravirtprobe to .init.text is horrible, and I'm pretty sure I'd just
> > make it worse.
> 
> If you drop me a sample of the exact warning I will add this one too.
> Current kbuild.git contains a lot of fixes in this area, so I prefer to
> do so on top of that tree.

WARNING: vmlinux - Section mismatch: reference to .init.text: from .paravirtprobe between '__start_paravirtprobe' (at offset 0xc0464c7c) and '__stop_paravirtprobe'

Sam, you're always so polite and damn helpful: I'm pretty sure that
violates the core tenets of Linux etiquette.

Perhaps you could work on that?
Rusty.




Re: [PATCH] i386 GDT cleanups: Rename boot_gdt_table to boot_gdt

2007-03-22 Thread Rusty Russell
On Thu, 2007-03-22 at 16:59 +0100, Sébastien Dugué wrote:
>  Rename boot_gdt_table to boot_gdt to avoid the duplicate
> T(able).
> 
> Signed-off-by: Sébastien Dugué <[EMAIL PROTECTED]>
> 
> ---
>  arch/i386/kernel/head.S   |9 -
>  arch/i386/kernel/trampoline.S |   12 ++--
>  2 files changed, 10 insertions(+), 11 deletions(-)

Acked-by: Rusty Russell <[EMAIL PROTECTED]>

In future, I'd recommend adding a witty comment to any such trivial
patch: it's really the only way to get it featured on LWN's Kernel Quote
of the Week.

Damn you Jon for turning us all into show ponies! (*Hi mum!*)
Rusty.



Re: + uml-create-archh.patch added to -mm tree

2007-03-22 Thread Andrew Morton
On Fri, 23 Mar 2007 07:51:47 +0100 Blaisorblade <[EMAIL PROTECTED]> wrote:

> On Thursday 22 March 2007 22:44, [EMAIL PROTECTED] wrote:
> > The patch titled
> >  uml: mreate arch.h
>  ^
> > has been added to the -mm tree.  Its filename is
> >  uml-create-archh.patch
> mreate? I've also seen this in all other patches of this batch (examples 
> below), and both Jeff's original mails and patch filenames are correct. What 
> are your scripts doing here?

That's a semi-manual attempt to turn the incoming chaos into semi-consistent
output, which went wrong.  Fixed, thanks.


Re: 2.6.21-rc4-rt0-kdump (was: Re: [patch] setup_boot_APIC_clock() irq-enable fix)

2007-03-22 Thread Ingo Molnar

* Michal Piotrowski <[EMAIL PROTECTED]> wrote:

> >> > BUG: at kernel/fork.c:1033 copy_process()
> >>
> >> thanks Michal - this is a real bug that affects upstream too. Find 
> >> the fix below - i've test-booted it and it fixes the warning.
> > 
> > Problem is fixed, thanks.
> 
> BTW. It seems that nobody uses -rt as a crash dump kernel ;)

it's been tested with v2.6.20-rt8, and it should work as long as you 
enable CONFIG_RELOCATABLE. But I'm not using it myself, and 
v2.6.21-rc4-rt0 isn't a particularly encouraging version string for 
people to try ;)

> Hibernation is still broken.
> 
> http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc4-rt0/console.log
> http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc4-rt0/rt-config

what's the failure mode besides the lockdep + other debug messages - does it 
not resume? Your log seems to have at least one sequence of resume-related 
messages - those seem to have worked fine.

Ingo


Re: [PATCH] slab: NUMA kmem_cache diet

2007-03-22 Thread Pekka J Enberg
(Please inline patches to the mail, makes it easier to review.)

On Thu, 22 Mar 2007, Eric Dumazet wrote:
> Some NUMA machines have a big MAX_NUMNODES (possibly 1024), but fewer possible
> nodes. This patch dynamically sizes the 'struct kmem_cache' to allocate only
> needed space.
> 
> I moved the nodelists[] field to the end of struct kmem_cache and use the
> following computation in kmem_cache_init():

Hmm, what seems a bit worrying is:

diff --git a/mm/slab.c b/mm/slab.c
index abf46ae..b187618 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -389,7 +389,6 @@ struct kmem_cache {
unsigned int buffer_size;
u32 reciprocal_buffer_size;
 /* 3) touched by every alloc & free from the backend */
-   struct kmem_list3 *nodelists[MAX_NUMNODES];

I think nodelists is placed at the beginning of the struct for a reason. 
But I have no idea if it actually makes any difference...

Pekka


[git patches] net driver fixes

2007-03-22 Thread Jeff Garzik

Please pull from 'upstream-linus' branch of
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git 
upstream-linus

to receive the following updates:

 drivers/net/Kconfig  |   25 ++--
 drivers/net/cxgb3/common.h   |   15 ++
 drivers/net/cxgb3/cxgb3_main.c   |   90 --
 drivers/net/cxgb3/regs.h |   22 +++
 drivers/net/cxgb3/t3_hw.c|   15 ++-
 drivers/net/cxgb3/xgmac.c|  133 +--
 drivers/net/ewrk3.c  |3 +-
 drivers/net/mv643xx_eth.c|   14 ++
 drivers/net/myri10ge/myri10ge.c  |   22 +++-
 drivers/net/pci-skeleton.c   |4 +-
 drivers/net/saa9730.c|  177 +-
 drivers/net/skge.c   |  110 +---
 drivers/net/skge.h   |6 +-
 drivers/net/ucc_geth.c   |3 +-
 drivers/net/wireless/airo.c  |4 +-
 drivers/net/wireless/bcm43xx/bcm43xx_radio.c |   14 +-
 16 files changed, 460 insertions(+), 197 deletions(-)

Anton Blanchard (1):
  Fix return code in pci-skeleton.c

Brice Goglin (4):
  myri10ge: Serverworks HT2100 provides aligned PCIe completion
  myri10ge: update wcfifo and intr_coal_delay default values
  myri10ge: fix management of >4kB allocated pages
  myri10ge: update driver version to 1.3.0-1.226

Dale Farnsworth (1):
  mv643xx_eth: add mv643xx_eth_shutdown function

Divy Le Ray (5):
  cxgb3 - fix ethtool cmd on multiple queues port
  cxgb3 - Auto-load FW if mismatch detected
  cxgb3 - Fix potential MAC hang
  cxgb3 - T3B2 pcie config space
  cxgb3 - fix white spaces in drivers/net/Kconfig

Jeff Garzik (1):
  [netdrvr] ewrk3: correct card detection bug

Larry Finger (1):
  bcm43xx: MANUALWLAN fixes

Li Yang (1):
  Revert "ucc_geth: returns NETDEV_TX_BUSY when BD ring is full"

Michal Schmidt (1):
  airo: Fix an error path memory leak

Ralf Baechle (1):
  SAA9730: Fix large pile of warnings

Stephen Hemminger (3):
  skge: deadlock on tx timeout
  skge: mask irqs when device down
  skge: use per-port phy locking

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 5ff0922..c3f9f59 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -2372,22 +2372,23 @@ config CHELSIO_T1_NAPI
  when the driver is receiving lots of packets from the card.
 
 config CHELSIO_T3
-tristate "Chelsio Communications T3 10Gb Ethernet support"
-depends on PCI
-help
-  This driver supports Chelsio T3-based gigabit and 10Gb Ethernet
-  adapters.
+   tristate "Chelsio Communications T3 10Gb Ethernet support"
+   depends on PCI
+   select FW_LOADER
+   help
+ This driver supports Chelsio T3-based gigabit and 10Gb Ethernet
+ adapters.
 
-  For general information about Chelsio and our products, visit
-  our website at .
+ For general information about Chelsio and our products, visit
+ our website at .
 
-  For customer support, please visit our customer support page at
-  .
+ For customer support, please visit our customer support page at
+ .
 
-  Please send feedback to <[EMAIL PROTECTED]>.
+ Please send feedback to <[EMAIL PROTECTED]>.
 
-  To compile this driver as a module, choose M here: the module
-  will be called cxgb3.
+ To compile this driver as a module, choose M here: the module
+ will be called cxgb3.
 
 config EHEA
tristate "eHEA Ethernet support"
diff --git a/drivers/net/cxgb3/common.h b/drivers/net/cxgb3/common.h
index e23deeb..85e5543 100644
--- a/drivers/net/cxgb3/common.h
+++ b/drivers/net/cxgb3/common.h
@@ -260,6 +260,10 @@ struct mac_stats {
unsigned long serdes_signal_loss;
unsigned long xaui_pcs_ctc_err;
unsigned long xaui_pcs_align_change;
+
+   unsigned long num_toggled; /* # times toggled TxEn due to stuck TX */
+   unsigned long num_resets;  /* # times reset due to stuck TX */
+
 };
 
 struct tp_mib_stats {
@@ -400,6 +404,12 @@ struct adapter_params {
unsigned int rev;   /* chip revision */
 };
 
+enum { /* chip revisions */
+   T3_REV_A  = 0,
+   T3_REV_B  = 2,
+   T3_REV_B2 = 3,
+};
+
 struct trace_params {
u32 sip;
u32 sip_mask;
@@ -465,6 +475,10 @@ struct cmac {
struct adapter *adapter;
unsigned int offset;
unsigned int nucast;/* # of address filters for unicast MACs */
+   unsigned int tcnt;
+   unsigned int xcnt;
+   unsigned int toggle_cnt;
+   unsigned int txen;
struct mac_s

Re: + uml-create-archh.patch added to -mm tree

2007-03-22 Thread Blaisorblade
On Thursday 22 March 2007 22:44, [EMAIL PROTECTED] wrote:
> The patch titled
>  uml: mreate arch.h
   ^
> has been added to the -mm tree.  Its filename is
>  uml-create-archh.patch
mreate? I've also seen this in all other patches of this batch (examples 
below), and both Jeff's original mails and patch filenames are correct. What 
are your scripts doing here?

> The patch titled
>  uml: mreate as-layout.h
...
> The patch titled
>  uml: memove user_util.h
...
> The patch titled
>  uml: mdd missing __init declarations
...

Bye
-- 
Inform me of my mistakes, so I can add them to my list!
Paolo Giarrusso, aka Blaisorblade
http://www.user-mode-linux.org/~blaisorblade



[PATCH 2.6.21-rc4 04/15] md: add raid5_run_ops and support routines

2007-03-22 Thread Dan Williams
Prepare the raid5 implementation to use async_tx for running stripe
operations:
* biofill (copy data into request buffers to satisfy a read request)
* compute block (generate a missing block in the cache from the other
blocks)
* prexor (subtract existing data as part of the read-modify-write process)
* biodrain (copy data out of request buffers to satisfy a write request)
* postxor (recalculate parity for new data that has entered the cache)
* check (verify that the parity is correct)
* io (submit i/o to the member disks)
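The stripe operations listed above are meant to run as a dependency chain. The sketch below shows only that generic chaining pattern, not the real async_tx API: each operation is submitted with the descriptor of the operation it depends on, so e.g. biodrain can be made to wait for prexor, and postxor for biodrain. All names here are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Generic sketch of the descriptor-chaining idea behind running stripe
 * operations asynchronously: each op records the op it depends on.
 * NOT the real async_tx API - illustrative names only.
 */
struct tx_descriptor {
	const char *name;
	struct tx_descriptor *depends_on;
	int depth;		/* length of the dependency chain */
};

/* Submit an op that must run after 'dep' (or immediately if dep is NULL). */
static struct tx_descriptor *submit_op(struct tx_descriptor *d,
				       const char *name,
				       struct tx_descriptor *dep)
{
	d->name = name;
	d->depends_on = dep;
	d->depth = dep ? dep->depth + 1 : 1;
	return d;
}
```

A read-modify-write would then be expressed as prexor, then biodrain, then postxor, each submitted against the previous descriptor.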

Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the workqueue
* call bi_end_io for reads in ops_complete_biofill

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |  520 
 include/linux/raid/raid5.h |   63 +
 2 files changed, 580 insertions(+), 3 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 6386814..b7185a1 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -52,6 +52,7 @@
 #include "raid6.h"
 
 #include 
+#include 
 
 /*
  * Stripe cache
@@ -324,6 +325,525 @@ static struct stripe_head *get_active_stripe(raid5_conf_t 
*conf, sector_t sector
return sh;
 }
 
+static int
+raid5_end_read_request(struct bio * bi, unsigned int bytes_done, int error);
+static int
+raid5_end_write_request (struct bio *bi, unsigned int bytes_done, int error);
+
+static void ops_run_io(struct stripe_head *sh)
+{
+   raid5_conf_t *conf = sh->raid_conf;
+   int i, disks = sh->disks;
+
+   might_sleep();
+
+   for (i=disks; i-- ;) {
+   int rw;
+   struct bio *bi;
+   mdk_rdev_t *rdev;
+   if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags))
+   rw = WRITE;
+   else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags))
+   rw = READ;
+   else
+   continue;
+
+   bi = &sh->dev[i].req;
+
+   bi->bi_rw = rw;
+   if (rw == WRITE)
+   bi->bi_end_io = raid5_end_write_request;
+   else
+   bi->bi_end_io = raid5_end_read_request;
+
+   rcu_read_lock();
+   rdev = rcu_dereference(conf->disks[i].rdev);
+   if (rdev && test_bit(Faulty, &rdev->flags))
+   rdev = NULL;
+   if (rdev)
+   atomic_inc(&rdev->nr_pending);
+   rcu_read_unlock();
+
+   if (rdev) {
+   if (test_bit(STRIPE_SYNCING, &sh->state) ||
+   test_bit(STRIPE_EXPAND_SOURCE, &sh->state) ||
+   test_bit(STRIPE_EXPAND_READY, &sh->state))
+   md_sync_acct(rdev->bdev, STRIPE_SECTORS);
+
+   bi->bi_bdev = rdev->bdev;
+   PRINTK("%s: for %llu schedule op %ld on disc %d\n",
+   __FUNCTION__, (unsigned long long)sh->sector,
+   bi->bi_rw, i);
+   atomic_inc(&sh->count);
+   bi->bi_sector = sh->sector + rdev->data_offset;
+   bi->bi_flags = 1 << BIO_UPTODATE;
+   bi->bi_vcnt = 1;
+   bi->bi_max_vecs = 1;
+   bi->bi_idx = 0;
+   bi->bi_io_vec = &sh->dev[i].vec;
+   bi->bi_io_vec[0].bv_len = STRIPE_SIZE;
+   bi->bi_io_vec[0].bv_offset = 0;
+   bi->bi_size = STRIPE_SIZE;
+   bi->bi_next = NULL;
+   if (rw == WRITE &&
+   test_bit(R5_ReWrite, &sh->dev[i].flags))
+   atomic_add(STRIPE_SECTORS, 
&rdev->corrected_errors);
+   generic_make_request(bi);
+   } else {
+   if (rw == WRITE)
+   set_bit(STRIPE_DEGRADED, &sh->state);
+   PRINTK("skip op %ld on disc %d for sector %llu\n",
+   bi->bi_rw, i, (unsigned long long)sh->sector);
+   clear_bit(R5_LOCKED, &sh->dev[i].flags);
+   set_bit(STRIPE_HANDLE, &sh->state);
+   }
+   }
+}
+
+static struct dma_async_tx_descriptor *
+async_copy_data(int frombio, struct bio *bio, struct page *page, sector_t 
sector,
+   struct dma_async_tx_descriptor *tx)
+{
+   struct bio_vec *bvl;
+   struct page *bio_page;
+   int i;
+   int page_offset;
+
+   if (bio->bi_sector >= sector)
+   page_offset = (signed)(bio->bi_sector - sector) * 512;
+   else
+   page_offset = (signed)(sector - bio->bi_sector) * -512;
+   bio_for_each_segment(bvl, bio, i) {
+   int len = bio_iovec_idx(bio,i)->bv_len;
+ 

[PATCH 2.6.21-rc4 08/15] md: move raid5 parity checks to raid5_run_ops

2007-03-22 Thread Dan Williams
handle_stripe sets STRIPE_OP_CHECK to request a check operation in
raid5_run_ops.  If raid5_run_ops is able to perform the check with a
dma engine the parity will be preserved in memory removing the need to
re-read it from disk, as is necessary in the synchronous case.

'Repair' operations re-use the same logic as compute block, with the caveat
that the results of the compute block are immediately written back to the
parity disk.  To differentiate these operations the STRIPE_OP_MOD_REPAIR_PD
flag is added.
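The pending/ack/complete handshake the patch relies on can be modeled in userspace. The bit helpers below reimplement the kernel's test_and_set_bit/test_and_clear_bit semantics non-atomically, and the field names mirror sh->ops; this is a sketch of the protocol, not the patch's code.

```c
#include <assert.h>

/*
 * Userspace model of the STRIPE_OP_* bookkeeping: handle_stripe requests
 * an op (pending), raid5_run_ops picks it up (ack) and finishes it
 * (complete), then handle_stripe reaps the result and clears all three.
 * Non-atomic stand-ins for the kernel bitops; illustrative only.
 */
enum { OP_CHECK = 0 };

struct stripe_ops {
	unsigned long pending, ack, complete;
	int count;
};

static int test_and_set_bit(int nr, unsigned long *addr)
{
	int old = (*addr >> nr) & 1;
	*addr |= 1UL << nr;
	return old;
}

static int test_and_clear_bit(int nr, unsigned long *addr)
{
	int old = (*addr >> nr) & 1;
	*addr &= ~(1UL << nr);
	return old;
}

/* handle_stripe side: request a check exactly once */
static void request_check(struct stripe_ops *ops)
{
	if (!test_and_set_bit(OP_CHECK, &ops->pending))
		ops->count++;
}

/* raid5_run_ops side: acknowledge the request and complete it */
static void run_check(struct stripe_ops *ops)
{
	if (test_and_set_bit(OP_CHECK, &ops->ack))
		return;		/* already in flight */
	/* ... parity check (dma engine or synchronous) runs here ... */
	test_and_set_bit(OP_CHECK, &ops->complete);
}

/* handle_stripe again: reap the result, clearing all three stages */
static int reap_check(struct stripe_ops *ops)
{
	if (!test_and_clear_bit(OP_CHECK, &ops->complete))
		return 0;	/* not done yet */
	test_and_clear_bit(OP_CHECK, &ops->ack);
	test_and_clear_bit(OP_CHECK, &ops->pending);
	ops->count--;
	return 1;
}
```

The point of the three stages is visible in the hunk above: a second request while the first is pending is a no-op, and the result is acted on only once, when the complete bit is reaped.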

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |   81 
 1 files changed, 62 insertions(+), 19 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 9856742..17a114c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2411,32 +2411,75 @@ static void handle_stripe5(struct stripe_head *sh)
locked += handle_write_operations5(sh, rcw, 0);
}
 
-   /* maybe we need to check and possibly fix the parity for this stripe
-* Any reads will already have been scheduled, so we just see if enough 
data
-* is available
+   /* 1/ Maybe we need to check and possibly fix the parity for this 
stripe.
+*Any reads will already have been scheduled, so we just see if 
enough data
+*is available.
+* 2/ Hold off parity checks while parity dependent operations are in 
flight
+*(conflicting writes are protected by the 'locked' variable)
 */
-   if (syncing && locked == 0 &&
-   !test_bit(STRIPE_INSYNC, &sh->state)) {
+   if ((syncing && locked == 0 && !test_bit(STRIPE_OP_COMPUTE_BLK, 
&sh->ops.pending) &&
+   !test_bit(STRIPE_INSYNC, &sh->state)) ||
+   test_bit(STRIPE_OP_CHECK, &sh->ops.pending) ||
+   test_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending)) {
+
set_bit(STRIPE_HANDLE, &sh->state);
-   if (failed == 0) {
-   BUG_ON(uptodate != disks);
-   compute_parity5(sh, CHECK_PARITY);
-   uptodate--;
-   if (page_is_zero(sh->dev[sh->pd_idx].page)) {
-   /* parity is correct (on disc, not in buffer 
any more) */
-   set_bit(STRIPE_INSYNC, &sh->state);
-   } else {
-   conf->mddev->resync_mismatches += 
STRIPE_SECTORS;
-   if (test_bit(MD_RECOVERY_CHECK, 
&conf->mddev->recovery))
-   /* don't try to repair!! */
+   /* Take one of the following actions:
+* 1/ start a check parity operation if (uptodate == disks)
+* 2/ finish a check parity operation and act on the result
+* 3/ skip to the writeback section if we previously
+*initiated a recovery operation
+*/
+   if (failed == 0 && !test_bit(STRIPE_OP_MOD_REPAIR_PD, 
&sh->ops.pending)) {
+   if (!test_and_set_bit(STRIPE_OP_CHECK, 
&sh->ops.pending)) {
+   BUG_ON(uptodate != disks);
+   clear_bit(R5_UPTODATE, 
&sh->dev[sh->pd_idx].flags);
+   sh->ops.count++;
+   uptodate--;
+   } else if (test_and_clear_bit(STRIPE_OP_CHECK, 
&sh->ops.complete)) {
+   clear_bit(STRIPE_OP_CHECK, &sh->ops.ack);
+   clear_bit(STRIPE_OP_CHECK, &sh->ops.pending);
+
+   if (sh->ops.zero_sum_result == 0)
+   /* parity is correct (on disc, not in 
buffer any more) */
set_bit(STRIPE_INSYNC, &sh->state);
else {
-   compute_block(sh, sh->pd_idx);
-   uptodate++;
+   conf->mddev->resync_mismatches += 
STRIPE_SECTORS;
+   if (test_bit(MD_RECOVERY_CHECK, 
&conf->mddev->recovery))
+   /* don't try to repair!! */
+   set_bit(STRIPE_INSYNC, 
&sh->state);
+   else {
+   BUG_ON(test_and_set_bit(
+   STRIPE_OP_COMPUTE_BLK,
+   &sh->ops.pending));
+   set_bit(STRIPE_OP_MOD_REPAIR_PD,
+   &sh->ops.pending);
+   
BUG_ON(test_and_set_bit(R5_Wantcompute,
+   

[PATCH 2.6.21-rc4 09/15] md: satisfy raid5 read requests via raid5_run_ops

2007-03-22 Thread Dan Williams
Use raid5_run_ops to carry out the memory copies for a raid5 read request.

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |   40 +++-
 1 files changed, 15 insertions(+), 25 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 17a114c..bcd23fb 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1980,7 +1980,7 @@ static void handle_stripe5(struct stripe_head *sh)
int i;
int syncing, expanding, expanded;
int locked=0, uptodate=0, to_read=0, to_write=0, failed=0, written=0;
-   int compute=0, req_compute=0, non_overwrite=0;
+   int to_fill=0, compute=0, req_compute=0, non_overwrite=0;
int failed_num=0;
struct r5dev *dev;
unsigned long pending=0;
@@ -2004,34 +2004,20 @@ static void handle_stripe5(struct stripe_head *sh)
dev = &sh->dev[i];
clear_bit(R5_Insync, &dev->flags);
 
-   PRINTK("check %d: state 0x%lx read %p write %p written %p\n",
-   i, dev->flags, dev->toread, dev->towrite, dev->written);
-   /* maybe we can reply to a read */
+   PRINTK("check %d: state 0x%lx toread %p read %p write %p 
written %p\n",
+   i, dev->flags, dev->toread, dev->read, dev->towrite, 
dev->written);
+
+   /* maybe we can start a biofill operation */
if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread) {
-   struct bio *rbi, *rbi2;
-   PRINTK("Return read for disc %d\n", i);
-   spin_lock_irq(&conf->device_lock);
-   rbi = dev->toread;
-   dev->toread = NULL;
-   if (test_and_clear_bit(R5_Overlap, &dev->flags))
-   wake_up(&conf->wait_for_overlap);
-   spin_unlock_irq(&conf->device_lock);
-   while (rbi && rbi->bi_sector < dev->sector + 
STRIPE_SECTORS) {
-   copy_data(0, rbi, dev->page, dev->sector);
-   rbi2 = r5_next_bio(rbi, dev->sector);
-   spin_lock_irq(&conf->device_lock);
-   if (--rbi->bi_phys_segments == 0) {
-   rbi->bi_next = return_bi;
-   return_bi = rbi;
-   }
-   spin_unlock_irq(&conf->device_lock);
-   rbi = rbi2;
-   }
+   to_read--;
+   if (!test_bit(STRIPE_OP_BIOFILL, &sh->ops.pending))
+   set_bit(R5_Wantfill, &dev->flags);
}
 
/* now count some things */
if (test_bit(R5_LOCKED, &dev->flags)) locked++;
if (test_bit(R5_UPTODATE, &dev->flags)) uptodate++;
+   if (test_bit(R5_Wantfill, &dev->flags)) to_fill++;
if (test_bit(R5_Wantcompute, &dev->flags)) BUG_ON(++compute > 
1);
 
if (dev->toread) to_read++;
@@ -2055,9 +2041,13 @@ static void handle_stripe5(struct stripe_head *sh)
set_bit(R5_Insync, &dev->flags);
}
rcu_read_unlock();
+
+   if (to_fill && !test_and_set_bit(STRIPE_OP_BIOFILL, &sh->ops.pending))
+   sh->ops.count++;
+
PRINTK("locked=%d uptodate=%d to_read=%d"
-   " to_write=%d failed=%d failed_num=%d\n",
-   locked, uptodate, to_read, to_write, failed, failed_num);
+   " to_write=%d to_fill=%d failed=%d failed_num=%d\n",
+   locked, uptodate, to_read, to_write, to_fill, failed, 
failed_num);
/* check if the array has lost two devices and, if so, some requests 
might
 * need to be failed
 */


[PATCH 2.6.21-rc4 14/15] iop13xx: Surface the iop13xx adma units to the iop-adma driver

2007-03-22 Thread Dan Williams
Adds the platform device definitions and the architecture specific
support routines (i.e. register initialization and descriptor formats) for the
iop-adma driver.

Changelog:
* added 'descriptor pool size' to the platform data
* add base support for buffer sizes larger than 16MB (hw max)
* build error fix from Kirill A. Shutemov
* rebase for async_tx changes
* add interrupt support
* do not call platform register macros in driver code

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 arch/arm/mach-iop13xx/setup.c  |  194 +++
 include/asm-arm/arch-iop13xx/adma.h|  545 
 include/asm-arm/arch-iop13xx/iop13xx.h |   34 +-
 3 files changed, 752 insertions(+), 21 deletions(-)

diff --git a/arch/arm/mach-iop13xx/setup.c b/arch/arm/mach-iop13xx/setup.c
index 9a46bcd..43189c8 100644
--- a/arch/arm/mach-iop13xx/setup.c
+++ b/arch/arm/mach-iop13xx/setup.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define IOP13XX_UART_XTAL 4000
 #define IOP13XX_SETUP_DEBUG 0
@@ -236,6 +237,140 @@ static unsigned long iq8134x_probe_flash_size(void)
 }
 #endif
 
+/* ADMA Channels */
+static struct resource iop13xx_adma_0_resources[] = {
+   [0] = {
+   .start = IOP13XX_ADMA_PHYS_BASE(0),
+   .end = IOP13XX_ADMA_UPPER_PA(0),
+   .flags = IORESOURCE_MEM,
+   },
+   [1] = {
+   .start = IRQ_IOP13XX_ADMA0_EOT,
+   .end = IRQ_IOP13XX_ADMA0_EOT,
+   .flags = IORESOURCE_IRQ
+   },
+   [2] = {
+   .start = IRQ_IOP13XX_ADMA0_EOC,
+   .end = IRQ_IOP13XX_ADMA0_EOC,
+   .flags = IORESOURCE_IRQ
+   },
+   [3] = {
+   .start = IRQ_IOP13XX_ADMA0_ERR,
+   .end = IRQ_IOP13XX_ADMA0_ERR,
+   .flags = IORESOURCE_IRQ
+   }
+};
+
+static struct resource iop13xx_adma_1_resources[] = {
+   [0] = {
+   .start = IOP13XX_ADMA_PHYS_BASE(1),
+   .end = IOP13XX_ADMA_UPPER_PA(1),
+   .flags = IORESOURCE_MEM,
+   },
+   [1] = {
+   .start = IRQ_IOP13XX_ADMA1_EOT,
+   .end = IRQ_IOP13XX_ADMA1_EOT,
+   .flags = IORESOURCE_IRQ
+   },
+   [2] = {
+   .start = IRQ_IOP13XX_ADMA1_EOC,
+   .end = IRQ_IOP13XX_ADMA1_EOC,
+   .flags = IORESOURCE_IRQ
+   },
+   [3] = {
+   .start = IRQ_IOP13XX_ADMA1_ERR,
+   .end = IRQ_IOP13XX_ADMA1_ERR,
+   .flags = IORESOURCE_IRQ
+   }
+};
+
+static struct resource iop13xx_adma_2_resources[] = {
+   [0] = {
+   .start = IOP13XX_ADMA_PHYS_BASE(2),
+   .end = IOP13XX_ADMA_UPPER_PA(2),
+   .flags = IORESOURCE_MEM,
+   },
+   [1] = {
+   .start = IRQ_IOP13XX_ADMA2_EOT,
+   .end = IRQ_IOP13XX_ADMA2_EOT,
+   .flags = IORESOURCE_IRQ
+   },
+   [2] = {
+   .start = IRQ_IOP13XX_ADMA2_EOC,
+   .end = IRQ_IOP13XX_ADMA2_EOC,
+   .flags = IORESOURCE_IRQ
+   },
+   [3] = {
+   .start = IRQ_IOP13XX_ADMA2_ERR,
+   .end = IRQ_IOP13XX_ADMA2_ERR,
+   .flags = IORESOURCE_IRQ
+   }
+};
+
+static u64 iop13xx_adma_dmamask = DMA_64BIT_MASK;
+static struct iop_adma_platform_data iop13xx_adma_0_data = {
+   .hw_id = 0,
+   .capabilities = DMA_CAP_MEMCPY | DMA_CAP_XOR | DMA_CAP_DUAL_XOR |
+   DMA_CAP_ZERO_SUM | DMA_CAP_MEMSET |
+   DMA_CAP_MEMCPY_CRC32C | DMA_CAP_INTERRUPT,
+   .pool_size = PAGE_SIZE,
+};
+
+static struct iop_adma_platform_data iop13xx_adma_1_data = {
+   .hw_id = 1,
+   .capabilities = DMA_CAP_MEMCPY | DMA_CAP_XOR | DMA_CAP_DUAL_XOR |
+   DMA_CAP_ZERO_SUM | DMA_CAP_MEMSET |
+   DMA_CAP_MEMCPY_CRC32C | DMA_CAP_INTERRUPT,
+   .pool_size = PAGE_SIZE,
+};
+
+static struct iop_adma_platform_data iop13xx_adma_2_data = {
+   .hw_id = 2,
+   .capabilities = DMA_CAP_MEMCPY | DMA_CAP_XOR | DMA_CAP_DUAL_XOR |
+   DMA_CAP_ZERO_SUM | DMA_CAP_MEMSET |
+   DMA_CAP_MEMCPY_CRC32C | DMA_CAP_PQ_XOR |
+   DMA_CAP_PQ_UPDATE | DMA_CAP_PQ_ZERO_SUM |
+   DMA_CAP_INTERRUPT,
+   .pool_size = PAGE_SIZE,
+};
+
+/* The ids are fixed up later in iop13xx_platform_init */
+static struct platform_device iop13xx_adma_0_channel = {
+   .name = "IOP-ADMA",
+   .id = 0,
+   .num_resources = 4,
+   .resource = iop13xx_adma_0_resources,
+   .dev = {
+   .dma_mask = &iop13xx_adma_dmamask,
+   .coherent_dma_mask = DMA_64BIT_MASK,
+   .platform_data = (void *) &iop13xx_adma_0_data,
+   },
+};
+
+static struct platform_device iop13xx_adma_1_channel = {
+   .name = "IOP-ADMA",
+   .id = 0,
+   .num_resources = 4,
+   .resource = i

[PATCH 2.6.21-rc4 07/15] md: move raid5 compute block operations to raid5_run_ops

2007-03-22 Thread Dan Williams
handle_stripe sets STRIPE_OP_COMPUTE_BLK to request servicing from
raid5_run_ops.  It also sets a flag for the block being computed to let
other parts of handle_stripe submit dependent operations.  raid5_run_ops
guarantees that the compute operation completes before any dependent
operation starts.

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |  125 +++-
 1 files changed, 93 insertions(+), 32 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 4d1adb5..9856742 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1980,7 +1980,7 @@ static void handle_stripe5(struct stripe_head *sh)
int i;
int syncing, expanding, expanded;
int locked=0, uptodate=0, to_read=0, to_write=0, failed=0, written=0;
-   int non_overwrite = 0;
+   int compute=0, req_compute=0, non_overwrite=0;
int failed_num=0;
struct r5dev *dev;
unsigned long pending=0;
@@ -2032,8 +2032,8 @@ static void handle_stripe5(struct stripe_head *sh)
/* now count some things */
if (test_bit(R5_LOCKED, &dev->flags)) locked++;
if (test_bit(R5_UPTODATE, &dev->flags)) uptodate++;
+   if (test_bit(R5_Wantcompute, &dev->flags)) BUG_ON(++compute > 1);
 
-   
if (dev->toread) to_read++;
if (dev->towrite) {
to_write++;
@@ -2188,31 +2188,82 @@ static void handle_stripe5(struct stripe_head *sh)
 * parity, or to satisfy requests
 * or to load a block that is being partially written.
 */
-   if (to_read || non_overwrite || (syncing && (uptodate < disks)) || expanding) {
-   for (i=disks; i--;) {
-   dev = &sh->dev[i];
-   if (!test_bit(R5_LOCKED, &dev->flags) && !test_bit(R5_UPTODATE, &dev->flags) &&
-   (dev->toread ||
-(dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
-syncing ||
-expanding ||
-(failed && (sh->dev[failed_num].toread ||
-(sh->dev[failed_num].towrite && !test_bit(R5_OVERWRITE, &sh->dev[failed_num].flags
-   )
-   ) {
-   /* we would like to get this block, possibly
-* by computing it, but we might not be able to
+   if (to_read || non_overwrite || (syncing && (uptodate + compute < disks)) || expanding ||
+   test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending)) {
+
+   /* Clear completed compute operations.  Parity recovery
+* (STRIPE_OP_MOD_REPAIR_PD) implies a write-back which is handled
+* later on in this routine
+*/
+   if (test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.complete) &&
+   !test_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending)) {
+   clear_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.complete);
+   clear_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.ack);
+   clear_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending);
+   }
+
+   /* look for blocks to read/compute, skip this if a compute
+* is already in flight, or if the stripe contents are in the
+* midst of changing due to a write
+*/
+   if (!test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending) &&
+   !test_bit(STRIPE_OP_PREXOR, &sh->ops.pending) &&
+   !test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)) {
+   for (i=disks; i--;) {
+   dev = &sh->dev[i];
+
+   /* don't schedule compute operations or reads on
+* the parity block while a check is in flight
 */
-   if (uptodate == disks-1) {
-   PRINTK("Computing block %d\n", i);
-   compute_block(sh, i);
-   uptodate++;
-   } else if (test_bit(R5_Insync, &dev->flags)) {
-   set_bit(R5_LOCKED, &dev->flags);
-   set_bit(R5_Wantread, &dev->flags);
-   locked++;
-   PRINTK("Reading block %d (sync=%d)\n", 
-   i, syncing);
+   if ((i == sh->pd_idx) && test_bit(STRIPE_OP_CHECK, &sh->ops.pending))
+   continue;
+
+   if (!test_bit(R5_LOCKED, &dev->flags) 

[PATCH 2.6.21-rc4 03/15] dmaengine: add the async_tx api

2007-03-22 Thread Dan Williams
async_tx is an api to describe a series of bulk memory
transfers/transforms.  When possible these transactions are carried out by
asynchronous dma engines.  The api handles inter-transaction dependencies
and hides dma channel management from the client.  When a dma engine is not
present the transaction is carried out via synchronous software routines.

Xor operations are handled by async_tx, to this end xor.c is moved into
drivers/dma and is changed to take an explicit destination address and
a series of sources to match the hardware engine implementation.

When CONFIG_DMA_ENGINE is not set the asynchronous path is compiled away.

Changelog:
* fixed a leftover debug print
* don't allow callbacks in async_interrupt_cond
* fixed xor_block changes
* fixed usage of ASYNC_TX_XOR_DROP_DEST
* drop dma mapping methods, suggested by Chris Leech
* printk warning fixups from Andrew Morton
* don't use inline in C files, Adrian Bunk
* select the API when MD is enabled

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/Makefile |1 
 drivers/dma/Kconfig  |   15 +
 drivers/dma/Makefile |1 
 drivers/dma/async_tx.c   |  905 ++
 drivers/dma/xor.c|  153 
 drivers/md/Kconfig   |3 
 drivers/md/Makefile  |6 
 drivers/md/raid5.c   |   52 +--
 drivers/md/xor.c |  154 
 include/linux/async_tx.h |  180 +
 include/linux/raid/xor.h |5 
 11 files changed, 1286 insertions(+), 189 deletions(-)

diff --git a/drivers/Makefile b/drivers/Makefile
index 3a718f5..2e8de9e 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -62,6 +62,7 @@ obj-$(CONFIG_I2C) += i2c/
 obj-$(CONFIG_W1)   += w1/
 obj-$(CONFIG_HWMON)+= hwmon/
 obj-$(CONFIG_PHONE)+= telephony/
+obj-$(CONFIG_ASYNC_TX_DMA) += dma/
 obj-$(CONFIG_MD)   += md/
 obj-$(CONFIG_BT)   += bluetooth/
 obj-$(CONFIG_ISDN) += isdn/
diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 30d021d..292ddad 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -7,8 +7,8 @@ menu "DMA Engine support"
 config DMA_ENGINE
bool "Support for DMA engines"
---help---
- DMA engines offload copy operations from the CPU to dedicated
- hardware, allowing the copies to happen asynchronously.
+  DMA engines offload bulk memory operations from the CPU to dedicated
+  hardware, allowing the operations to happen asynchronously.
 
 comment "DMA Clients"
 
@@ -22,6 +22,16 @@ config NET_DMA
  Since this is the main user of the DMA engine, it should be enabled;
  say Y here.
 
+config ASYNC_TX_DMA
+   tristate "Asynchronous Bulk Memory Transfers/Transforms API"
+   ---help---
+ This enables the async_tx management layer for dma engines.
+ Subsystems coded to this API will use offload engines for bulk
+ memory operations where present.  Software implementations are
+ called when a dma engine is not present or fails to allocate
+ memory to carry out the transaction.
+ Current subsystems ported to async_tx: MD_RAID4,5
+
 comment "DMA Devices"
 
 config INTEL_IOATDMA
@@ -30,5 +40,4 @@ config INTEL_IOATDMA
default m
---help---
  Enable support for the Intel(R) I/OAT DMA engine.
-
 endmenu
diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
index bdcfdbd..6a99341 100644
--- a/drivers/dma/Makefile
+++ b/drivers/dma/Makefile
@@ -1,3 +1,4 @@
 obj-$(CONFIG_DMA_ENGINE) += dmaengine.o
 obj-$(CONFIG_NET_DMA) += iovlock.o
 obj-$(CONFIG_INTEL_IOATDMA) += ioatdma.o
+obj-$(CONFIG_ASYNC_TX_DMA) += async_tx.o xor.o
diff --git a/drivers/dma/async_tx.c b/drivers/dma/async_tx.c
new file mode 100644
index 000..8f7e701
--- /dev/null
+++ b/drivers/dma/async_tx.c
@@ -0,0 +1,905 @@
+/*
+ * Copyright(c) 2006 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59
+ * Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define ASYNC_TX_DEBUG 0
+#define PRINTK(x...) ((void)(ASYNC_TX_DEBUG && printk(x)))
+
+#ifdef CONFIG_DMA_ENGINE
+static stru

[PATCH 2.6.21-rc4 13/15] dmaengine: driver for the iop32x, iop33x, and iop13xx raid engines

2007-03-22 Thread Dan Williams
This is a driver for the iop DMA/AAU/ADMA units which are capable of pq_xor,
pq_update, pq_zero_sum, xor, dual_xor, xor_zero_sum, fill, copy+crc, and copy
operations.

Changelog:
* fixed a slot allocation bug in do_iop13xx_adma_xor that caused too few
slots to be requested eventually leading to data corruption
* enabled the slot allocation routine to attempt to free slots before
returning -ENOMEM
* switched the cleanup routine to solely use the software chain and the
status register to determine if a descriptor is complete.  This is
necessary to support other IOP engines that do not have status writeback
capability
* make the driver iop generic
* modified the allocation routines to understand allocating a group of
slots for a single operation
* added a null xor initialization operation for the xor only channel on
iop3xx
* support xor operations on buffers larger than the hardware maximum
* split the do_* routines into separate prep, src/dest set, submit stages
* added async_tx support (dependent operations initiation at cleanup time)
* simplified group handling
* added interrupt support (callbacks via tasklets)
* brought the pending depth inline with ioat (i.e. 4 descriptors)
* drop dma mapping methods, suggested by Chris Leech
* don't use inline in C files, Adrian Bunk
* remove static tasklet declarations
* make iop_adma_alloc_slots easier to read and remove chances for a
corrupted descriptor chain
* fix locking bug in iop_adma_alloc_chan_resources, Benjamin Herrenschmidt

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/dma/Kconfig |8 
 drivers/dma/Makefile|1 
 drivers/dma/iop-adma.c  | 1469 +++
 include/asm-arm/hardware/iop_adma.h |  121 +++
 4 files changed, 1599 insertions(+), 0 deletions(-)

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 292ddad..1c2ae4e 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -40,4 +40,12 @@ config INTEL_IOATDMA
default m
---help---
  Enable support for the Intel(R) I/OAT DMA engine.
+
+config INTEL_IOP_ADMA
+   tristate "Intel IOP ADMA support"
+   depends on DMA_ENGINE && (ARCH_IOP32X || ARCH_IOP33X || ARCH_IOP13XX)
+   default m
+   ---help---
+     Enable support for the Intel(R) IOP Series RAID engines.
+
 endmenu
diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
index 6a99341..8ebf10d 100644
--- a/drivers/dma/Makefile
+++ b/drivers/dma/Makefile
@@ -1,4 +1,5 @@
 obj-$(CONFIG_DMA_ENGINE) += dmaengine.o
 obj-$(CONFIG_NET_DMA) += iovlock.o
 obj-$(CONFIG_INTEL_IOATDMA) += ioatdma.o
+obj-$(CONFIG_INTEL_IOP_ADMA) += iop-adma.o
 obj-$(CONFIG_ASYNC_TX_DMA) += async_tx.o xor.o
diff --git a/drivers/dma/iop-adma.c b/drivers/dma/iop-adma.c
new file mode 100644
index 000..3ac03ef
--- /dev/null
+++ b/drivers/dma/iop-adma.c
@@ -0,0 +1,1469 @@
+/*
+ * Copyright(c) 2006 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59
+ * Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ */
+
+/*
+ * This driver supports the asynchronous DMA copy and RAID engines available
+ * on the Intel Xscale(R) family of I/O Processors (IOP 32x, 33x, 134x)
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define to_iop_adma_chan(chan) container_of(chan, struct iop_adma_chan, common)
+#define to_iop_adma_device(dev) container_of(dev, struct iop_adma_device, common)
+#define tx_to_iop_adma_slot(tx) container_of(tx, struct iop_adma_desc_slot, async_tx)
+
+#define IOP_ADMA_DEBUG 0
+#define PRINTK(x...) ((void)(IOP_ADMA_DEBUG && printk(x)))
+
+/**
+ * iop_adma_free_slots - flags descriptor slots for reuse
+ * @slot: Slot to free
+ * Caller must hold &iop_chan->lock while calling this function
+ */
+static void iop_adma_free_slots(struct iop_adma_desc_slot *slot)
+{
+   int stride = slot->slots_per_op;
+
+   while (stride--) {
+   slot->slots_per_op = 0;
+   slot = list_entry(slot->slot_node.next,
+   struct iop_adma_desc_slot,
+   slot_node);
+   }
+}
+
+static dma_cookie_t
+iop_adma_run_tx_complete_actions(str

[PATCH 2.6.21-rc4 15/15] iop3xx: Surface the iop3xx DMA and AAU units to the iop-adma driver

2007-03-22 Thread Dan Williams
Adds the platform device definitions and the architecture specific support
routines (i.e. register initialization and descriptor formats) for the
iop-adma driver.

Changelog:
* add support for > 1k zero sum buffer sizes
* added dma/aau platform devices to iq80321 and iq80332 setup
* fixed the calculation in iop_desc_is_aligned
* support xor buffer sizes larger than 16MB
* fix places where software descriptors are assumed to be contiguous, only
hardware descriptors are contiguous for up to a PAGE_SIZE buffer size
* convert to async_tx
* add interrupt support
* add platform devices for 80219 boards
* do not call platform register macros in driver code
* remove switch() statements for compatible register offsets/layouts

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 arch/arm/mach-iop32x/glantank.c|2 
 arch/arm/mach-iop32x/iq31244.c |5 
 arch/arm/mach-iop32x/iq80321.c |3 
 arch/arm/mach-iop32x/n2100.c   |2 
 arch/arm/mach-iop33x/iq80331.c |3 
 arch/arm/mach-iop33x/iq80332.c |3 
 arch/arm/plat-iop/Makefile |2 
 arch/arm/plat-iop/adma.c   |  198 +++
 include/asm-arm/arch-iop32x/adma.h |5 
 include/asm-arm/arch-iop33x/adma.h |5 
 include/asm-arm/hardware/iop3xx-adma.h |  893 
 include/asm-arm/hardware/iop3xx.h  |   68 --
 12 files changed, 1129 insertions(+), 60 deletions(-)

diff --git a/arch/arm/mach-iop32x/glantank.c b/arch/arm/mach-iop32x/glantank.c
index 45f4f13..2e0099b 100644
--- a/arch/arm/mach-iop32x/glantank.c
+++ b/arch/arm/mach-iop32x/glantank.c
@@ -180,6 +180,8 @@ static void __init glantank_init_machine(void)
platform_device_register(&iop3xx_i2c1_device);
platform_device_register(&glantank_flash_device);
platform_device_register(&glantank_serial_device);
+   platform_device_register(&iop3xx_dma_0_channel);
+   platform_device_register(&iop3xx_dma_1_channel);
 
pm_power_off = glantank_power_off;
 }
diff --git a/arch/arm/mach-iop32x/iq31244.c b/arch/arm/mach-iop32x/iq31244.c
index 571ac35..bf1c112 100644
--- a/arch/arm/mach-iop32x/iq31244.c
+++ b/arch/arm/mach-iop32x/iq31244.c
@@ -276,9 +276,14 @@ static void __init iq31244_init_machine(void)
platform_device_register(&iop3xx_i2c1_device);
platform_device_register(&iq31244_flash_device);
platform_device_register(&iq31244_serial_device);
+   platform_device_register(&iop3xx_dma_0_channel);
+   platform_device_register(&iop3xx_dma_1_channel);
 
if (is_80219())
pm_power_off = ep80219_power_off;
+
+   if (!is_80219())
+   platform_device_register(&iop3xx_aau_channel);
 }
 
 MACHINE_START(IQ31244, "Intel IQ31244")
diff --git a/arch/arm/mach-iop32x/iq80321.c b/arch/arm/mach-iop32x/iq80321.c
index 361c70c..474ec2a 100644
--- a/arch/arm/mach-iop32x/iq80321.c
+++ b/arch/arm/mach-iop32x/iq80321.c
@@ -180,6 +180,9 @@ static void __init iq80321_init_machine(void)
platform_device_register(&iop3xx_i2c1_device);
platform_device_register(&iq80321_flash_device);
platform_device_register(&iq80321_serial_device);
+   platform_device_register(&iop3xx_dma_0_channel);
+   platform_device_register(&iop3xx_dma_1_channel);
+   platform_device_register(&iop3xx_aau_channel);
 }
 
 MACHINE_START(IQ80321, "Intel IQ80321")
diff --git a/arch/arm/mach-iop32x/n2100.c b/arch/arm/mach-iop32x/n2100.c
index 5f07344..8e6fe13 100644
--- a/arch/arm/mach-iop32x/n2100.c
+++ b/arch/arm/mach-iop32x/n2100.c
@@ -245,6 +245,8 @@ static void __init n2100_init_machine(void)
platform_device_register(&iop3xx_i2c0_device);
platform_device_register(&n2100_flash_device);
platform_device_register(&n2100_serial_device);
+   platform_device_register(&iop3xx_dma_0_channel);
+   platform_device_register(&iop3xx_dma_1_channel);
 
pm_power_off = n2100_power_off;
 
diff --git a/arch/arm/mach-iop33x/iq80331.c b/arch/arm/mach-iop33x/iq80331.c
index 1a9e361..b4d12bf 100644
--- a/arch/arm/mach-iop33x/iq80331.c
+++ b/arch/arm/mach-iop33x/iq80331.c
@@ -135,6 +135,9 @@ static void __init iq80331_init_machine(void)
platform_device_register(&iop33x_uart0_device);
platform_device_register(&iop33x_uart1_device);
platform_device_register(&iq80331_flash_device);
+   platform_device_register(&iop3xx_dma_0_channel);
+   platform_device_register(&iop3xx_dma_1_channel);
+   platform_device_register(&iop3xx_aau_channel);
 }
 
 MACHINE_START(IQ80331, "Intel IQ80331")
diff --git a/arch/arm/mach-iop33x/iq80332.c b/arch/arm/mach-iop33x/iq80332.c
index 96d6f0f..2abb2d8 100644
--- a/arch/arm/mach-iop33x/iq80332.c
+++ b/arch/arm/mach-iop33x/iq80332.c
@@ -135,6 +135,9 @@ static void __init iq80332_init_machine(void)
platform_device_register(&iop33x_uart0_device);
platform_device_register(&iop33x_uart1_device);
platform_device_

[PATCH 2.6.21-rc4 12/15] md: remove raid5 compute_block and compute_parity5

2007-03-22 Thread Dan Williams
replaced by raid5_run_ops

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |  124 
 1 files changed, 0 insertions(+), 124 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 0be26c2..062df02 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1482,130 +1482,6 @@ static void copy_data(int frombio, struct bio *bio,
   }   \
} while(0)
 
-
-static void compute_block(struct stripe_head *sh, int dd_idx)
-{
-   int i, count, disks = sh->disks;
-   void *ptr[MAX_XOR_BLOCKS], *dest, *p;
-
-   PRINTK("compute_block, stripe %llu, idx %d\n", 
-   (unsigned long long)sh->sector, dd_idx);
-
-   dest = page_address(sh->dev[dd_idx].page);
-   memset(dest, 0, STRIPE_SIZE);
-   count = 0;
-   for (i = disks ; i--; ) {
-   if (i == dd_idx)
-   continue;
-   p = page_address(sh->dev[i].page);
-   if (test_bit(R5_UPTODATE, &sh->dev[i].flags))
-   ptr[count++] = p;
-   else
-   printk(KERN_ERR "compute_block() %d, stripe %llu, %d"
-   " not present\n", dd_idx,
-   (unsigned long long)sh->sector, i);
-
-   check_xor();
-   }
-   if (count)
-   xor_block(count, STRIPE_SIZE, dest, ptr);
-   set_bit(R5_UPTODATE, &sh->dev[dd_idx].flags);
-}
-
-static void compute_parity5(struct stripe_head *sh, int method)
-{
-   raid5_conf_t *conf = sh->raid_conf;
-   int i, pd_idx = sh->pd_idx, disks = sh->disks, count;
-   void *ptr[MAX_XOR_BLOCKS], *dest;
-   struct bio *chosen;
-
-   PRINTK("compute_parity5, stripe %llu, method %d\n",
-   (unsigned long long)sh->sector, method);
-
-   count = 0;
-   dest = page_address(sh->dev[pd_idx].page);
-   switch(method) {
-   case READ_MODIFY_WRITE:
-   BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags));
-   for (i=disks ; i-- ;) {
-   if (i==pd_idx)
-   continue;
-   if (sh->dev[i].towrite &&
-   test_bit(R5_UPTODATE, &sh->dev[i].flags)) {
-   ptr[count++] = page_address(sh->dev[i].page);
-   chosen = sh->dev[i].towrite;
-   sh->dev[i].towrite = NULL;
-
-   if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
-   wake_up(&conf->wait_for_overlap);
-
-   BUG_ON(sh->dev[i].written);
-   sh->dev[i].written = chosen;
-   check_xor();
-   }
-   }
-   break;
-   case RECONSTRUCT_WRITE:
-   memset(dest, 0, STRIPE_SIZE);
-   for (i= disks; i-- ;)
-   if (i!=pd_idx && sh->dev[i].towrite) {
-   chosen = sh->dev[i].towrite;
-   sh->dev[i].towrite = NULL;
-
-   if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
-   wake_up(&conf->wait_for_overlap);
-
-   BUG_ON(sh->dev[i].written);
-   sh->dev[i].written = chosen;
-   }
-   break;
-   case CHECK_PARITY:
-   break;
-   }
-   if (count) {
-   xor_block(count, STRIPE_SIZE, dest, ptr);
-   count = 0;
-   }
-   
-   for (i = disks; i--;)
-   if (sh->dev[i].written) {
-   sector_t sector = sh->dev[i].sector;
-   struct bio *wbi = sh->dev[i].written;
-   while (wbi && wbi->bi_sector < sector + STRIPE_SECTORS) {
-   copy_data(1, wbi, sh->dev[i].page, sector);
-   wbi = r5_next_bio(wbi, sector);
-   }
-
-   set_bit(R5_LOCKED, &sh->dev[i].flags);
-   set_bit(R5_UPTODATE, &sh->dev[i].flags);
-   }
-
-   switch(method) {
-   case RECONSTRUCT_WRITE:
-   case CHECK_PARITY:
-   for (i=disks; i--;)
-   if (i != pd_idx) {
-   ptr[count++] = page_address(sh->dev[i].page);
-   check_xor();
-   }
-   break;
-   case READ_MODIFY_WRITE:
-   for (i = disks; i--;)
-   if (sh->dev[i].written) {
-   ptr[count++] = page_address(sh->dev[i].page);
-   check_xor();
-   }

[PATCH 2.6.21-rc4 10/15] md: use async_tx and raid5_run_ops for raid5 expansion operations

2007-03-22 Thread Dan Williams
The parity calculation for an expansion operation is the same as the
calculation performed at the end of a write with the caveat that all blocks
in the stripe are scheduled to be written.  An expansion operation is
identified as a stripe with the POSTXOR flag set and the BIODRAIN flag not
set.

The bulk copy operation to the new stripe is handled inline by async_tx.

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |   48 
 1 files changed, 36 insertions(+), 12 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index bcd23fb..84a3f35 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2511,18 +2511,32 @@ static void handle_stripe5(struct stripe_head *sh)
}
}
 
-   if (expanded && test_bit(STRIPE_EXPANDING, &sh->state)) {
-   /* Need to write out all blocks after computing parity */
-   sh->disks = conf->raid_disks;
-   sh->pd_idx = stripe_to_pdidx(sh->sector, conf, conf->raid_disks);
-   compute_parity5(sh, RECONSTRUCT_WRITE);
+   /* Finish postxor operations initiated by the expansion
+* process
+*/
+   if (test_bit(STRIPE_OP_POSTXOR, &sh->ops.complete) &&
+   !test_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending)) {
+
+   clear_bit(STRIPE_EXPANDING, &sh->state);
+
+   clear_bit(STRIPE_OP_POSTXOR, &sh->ops.pending);
+   clear_bit(STRIPE_OP_POSTXOR, &sh->ops.ack);
+   clear_bit(STRIPE_OP_POSTXOR, &sh->ops.complete);
+
for (i= conf->raid_disks; i--;) {
-   set_bit(R5_LOCKED, &sh->dev[i].flags);
-   locked++;
set_bit(R5_Wantwrite, &sh->dev[i].flags);
+   if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+   sh->ops.count++;
}
-   clear_bit(STRIPE_EXPANDING, &sh->state);
-   } else if (expanded) {
+   }
+
+   if (expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
+   !test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)) {
+   /* Need to write out all blocks after computing parity */
+   sh->disks = conf->raid_disks;
+   sh->pd_idx = stripe_to_pdidx(sh->sector, conf, conf->raid_disks);
+   locked += handle_write_operations5(sh, 0, 1);
+   } else if (expanded && !test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)) {
clear_bit(STRIPE_EXPAND_READY, &sh->state);
atomic_dec(&conf->reshape_stripes);
wake_up(&conf->wait_for_overlap);
@@ -2533,6 +2547,7 @@ static void handle_stripe5(struct stripe_head *sh)
/* We have read all the blocks in this stripe and now we need to
 * copy some of them into a target stripe for expand.
 */
+   struct dma_async_tx_descriptor *tx = NULL;
clear_bit(STRIPE_EXPAND_SOURCE, &sh->state);
for (i=0; i< sh->disks; i++)
if (i != sh->pd_idx) {
@@ -2556,9 +2571,12 @@ static void handle_stripe5(struct stripe_head *sh)
release_stripe(sh2);
continue;
}
-   memcpy(page_address(sh2->dev[dd_idx].page),
-  page_address(sh->dev[i].page),
-  STRIPE_SIZE);
+
+   /* place all the copies on one channel */
+   tx = async_memcpy(sh2->dev[dd_idx].page,
+   sh->dev[i].page, 0, 0, STRIPE_SIZE,
+   ASYNC_TX_DEP_ACK, tx, NULL, NULL);
+
set_bit(R5_Expanded, &sh2->dev[dd_idx].flags);
set_bit(R5_UPTODATE, &sh2->dev[dd_idx].flags);
for (j=0; j<conf->raid_disks; j++)
@@ -2570,6 +2588,12 @@ static void handle_stripe5(struct stripe_head *sh)
set_bit(STRIPE_HANDLE, &sh2->state);
}
release_stripe(sh2);
+
+   /* done submitting copies, wait for them to complete */
+   if (i + 1 >= sh->disks) {
+   async_tx_ack(tx);
+   dma_wait_for_async_tx(tx);
+   }
}
}
 
-


[PATCH 2.6.21-rc4 11/15] md: move raid5 io requests to raid5_run_ops

2007-03-22 Thread Dan Williams
handle_stripe now only updates the state of stripes.  All execution of
operations is moved to raid5_run_ops.

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |   68 
 1 files changed, 10 insertions(+), 58 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 84a3f35..0be26c2 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2360,6 +2360,8 @@ static void handle_stripe5(struct stripe_head *sh)
PRINTK("Read_old block %d for r-m-w\n", i);
set_bit(R5_LOCKED, &dev->flags);
set_bit(R5_Wantread, &dev->flags);
+   if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+   sh->ops.count++;
locked++;
} else {
set_bit(STRIPE_DELAYED, &sh->state);
@@ -2380,6 +2382,8 @@ static void handle_stripe5(struct stripe_head *sh)
PRINTK("Read_old block %d for Reconstruct\n", i);
set_bit(R5_LOCKED, &dev->flags);
set_bit(R5_Wantread, &dev->flags);
+   if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+   sh->ops.count++;
locked++;
} else {
set_bit(STRIPE_DELAYED, &sh->state);
@@ -2479,6 +2483,8 @@ static void handle_stripe5(struct stripe_head *sh)
 
set_bit(R5_LOCKED, &dev->flags);
set_bit(R5_Wantwrite, &dev->flags);
+   if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+   sh->ops.count++;
clear_bit(STRIPE_DEGRADED, &sh->state);
locked++;
set_bit(STRIPE_INSYNC, &sh->state);
@@ -2500,12 +2506,16 @@ static void handle_stripe5(struct stripe_head *sh)
dev = &sh->dev[failed_num];
if (!test_bit(R5_ReWrite, &dev->flags)) {
set_bit(R5_Wantwrite, &dev->flags);
+   if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+   sh->ops.count++;
set_bit(R5_ReWrite, &dev->flags);
set_bit(R5_LOCKED, &dev->flags);
locked++;
} else {
/* let's read it back */
set_bit(R5_Wantread, &dev->flags);
+   if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+   sh->ops.count++;
set_bit(R5_LOCKED, &dev->flags);
locked++;
}
@@ -2615,64 +2625,6 @@ static void handle_stripe5(struct stripe_head *sh)
  test_bit(BIO_UPTODATE, &bi->bi_flags)
? 0 : -EIO);
}
-   for (i=disks; i-- ;) {
-   int rw;
-   struct bio *bi;
-   mdk_rdev_t *rdev;
-   if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags))
-   rw = WRITE;
-   else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags))
-   rw = READ;
-   else
-   continue;
- 
-   bi = &sh->dev[i].req;
- 
-   bi->bi_rw = rw;
-   if (rw == WRITE)
-   bi->bi_end_io = raid5_end_write_request;
-   else
-   bi->bi_end_io = raid5_end_read_request;
- 
-   rcu_read_lock();
-   rdev = rcu_dereference(conf->disks[i].rdev);
-   if (rdev && test_bit(Faulty, &rdev->flags))
-   rdev = NULL;
-   if (rdev)
-   atomic_inc(&rdev->nr_pending);
-   rcu_read_unlock();
- 
-   if (rdev) {
-   if (syncing || expanding || expanded)
-   md_sync_acct(rdev->bdev, STRIPE_SECTORS);
-
-   bi->bi_bdev = rdev->bdev;
-   PRINTK("for %llu schedule op %ld on disc %d\n",
-   (unsigned long long)sh->sector, bi->bi_rw, i);
-   atomic_inc(&sh->count);
-   bi->bi_sector = sh->sector + rdev->data_offset;
-   bi->bi_flags = 1 << BIO_UPTODATE;
-   bi->bi_vcnt = 1;
-   bi->bi_max_vecs 

[PATCH 2.6.21-rc4 06/15] md: move write operations to raid5_run_ops

2007-03-22 Thread Dan Williams
handle_stripe sets STRIPE_OP_PREXOR, STRIPE_OP_BIODRAIN, STRIPE_OP_POSTXOR
to request a write to the stripe cache.  raid5_run_ops is triggered to run
and executes the request outside the stripe lock.

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |  152 +---
 1 files changed, 131 insertions(+), 21 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 0397e33..4d1adb5 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1788,7 +1788,75 @@ static void compute_block_2(struct stripe_head *sh, int dd_idx1, int dd_idx2)
}
 }
 
+static int handle_write_operations5(struct stripe_head *sh, int rcw, int expand)
+{
+   int i, pd_idx = sh->pd_idx, disks = sh->disks;
+   int locked=0;
+
+   if (rcw == 0) {
+   /* skip the drain operation on an expand */
+   if (!expand) {
+   BUG_ON(test_and_set_bit(STRIPE_OP_BIODRAIN,
+   &sh->ops.pending));
+   sh->ops.count++;
+   }
+
+   BUG_ON(test_and_set_bit(STRIPE_OP_POSTXOR, &sh->ops.pending));
+   sh->ops.count++;
+
+   for (i=disks ; i-- ;) {
+   struct r5dev *dev = &sh->dev[i];
+
+   if (dev->towrite) {
+   set_bit(R5_LOCKED, &dev->flags);
+   if (!expand)
+   clear_bit(R5_UPTODATE, &dev->flags);
+   locked++;
+   }
+   }
+   } else {
+   BUG_ON(!(test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags) ||
+   test_bit(R5_Wantcompute, &sh->dev[pd_idx].flags)));
+
+   BUG_ON(test_and_set_bit(STRIPE_OP_PREXOR, &sh->ops.pending) ||
+   test_and_set_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending) ||
+   test_and_set_bit(STRIPE_OP_POSTXOR, &sh->ops.pending));
+
+   sh->ops.count += 3;
+
+   for (i=disks ; i-- ;) {
+   struct r5dev *dev = &sh->dev[i];
+   if (i==pd_idx)
+   continue;
 
+   /* For a read-modify write there may be blocks that are
+* locked for reading while others are ready to be written
+* so we distinguish these blocks by the R5_Wantprexor bit
+*/
+   if (dev->towrite &&
+   (test_bit(R5_UPTODATE, &dev->flags) ||
+   test_bit(R5_Wantcompute, &dev->flags))) {
+   set_bit(R5_Wantprexor, &dev->flags);
+   set_bit(R5_LOCKED, &dev->flags);
+   clear_bit(R5_UPTODATE, &dev->flags);
+   locked++;
+   }
+   }
+   }
+
+   /* keep the parity disk locked while asynchronous operations
+* are in flight
+*/
+   set_bit(R5_LOCKED, &sh->dev[pd_idx].flags);
+   clear_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
+   locked++;
+
+   PRINTK("%s: stripe %llu locked: %d pending: %lx\n",
+   __FUNCTION__, (unsigned long long)sh->sector,
+   locked, sh->ops.pending);
+
+   return locked;
+}
 
 /*
  * Each stripe/dev can have one or more bion attached.
@@ -2151,8 +2219,67 @@ static void handle_stripe5(struct stripe_head *sh)
set_bit(STRIPE_HANDLE, &sh->state);
}
 
-   /* now to consider writing and what else, if anything should be read */
-   if (to_write) {
+   /* Now we check to see if any write operations have recently
+* completed
+*/
+
+   /* leave prexor set until postxor is done, allows us to distinguish
+* a rmw from a rcw during biodrain
+*/
+   if (test_bit(STRIPE_OP_PREXOR, &sh->ops.complete) &&
+   test_bit(STRIPE_OP_POSTXOR, &sh->ops.complete)) {
+
+   clear_bit(STRIPE_OP_PREXOR, &sh->ops.complete);
+   clear_bit(STRIPE_OP_PREXOR, &sh->ops.ack);
+   clear_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
+
+   for (i=disks; i--;)
+   clear_bit(R5_Wantprexor, &sh->dev[i].flags);
+   }
+
+   /* if only POSTXOR is set then this is an 'expand' postxor */
+   if (test_bit(STRIPE_OP_BIODRAIN, &sh->ops.complete) &&
+   test_bit(STRIPE_OP_POSTXOR, &sh->ops.complete)) {
+
+   clear_bit(STRIPE_OP_BIODRAIN, &sh->ops.complete);
+   clear_bit(STRIPE_OP_BIODRAIN, &sh->ops.ack);
+   clear_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending);
+
+   clear_bit(STRIPE_OP_POSTXOR, &sh->ops.complete);
+   clear_bit(STRIPE_OP_POSTXOR, &sh->ops.ack);
+   clear_bit(STRIPE_OP_POSTXOR, &sh->ops.pendin

Re: [QUICKLIST 1/5] Quicklists for page table pages V4

2007-03-22 Thread Christoph Lameter
On Thu, 22 Mar 2007, Andrew Morton wrote:

> On Thu, 22 Mar 2007 23:28:41 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> 
> > 1. Proven code from the IA64 arch.
> > 
> > The method used here has been fine tuned for years and
> > is NUMA aware. It is based on the knowledge that accesses
> > to page table pages are sparse in nature. Taking a page
> > off the freelists instead of allocating a zeroed page
> > allows a reduction in the number of cachelines touched,
> > in addition to getting rid of the slab overhead. So
> > performance improves.
> 
> By how much?

About 40% on fork+exit. See 

http://marc.info/?l=linux-ia64&m=110942798406005&w=2
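The claimed savings come from recycling pages through a freelist and zeroing them at free time, so the allocation fast path touches almost no cachelines. A minimal userspace toy of that idea follows; the names (`quick_alloc`, `quick_free`, `PAGE_SZ`) are illustrative and are not the kernel's quicklist API.

```c
/* Toy quicklist: recycle zeroed "pages" instead of allocating and
 * zeroing fresh ones.  Pages are zeroed at free time; the only word
 * dirtied on the freelist is the embedded next pointer, which the
 * alloc path restores to zero. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SZ 4096

struct qpage { struct qpage *next; };

static struct qpage *quicklist;     /* head of the freelist */
static int quicklist_len;

static void *quick_alloc(void)
{
    if (quicklist) {                /* fast path: page already zeroed */
        struct qpage *p = quicklist;
        quicklist = p->next;
        quicklist_len--;
        p->next = NULL;             /* re-zero the word we borrowed */
        return p;
    }
    return calloc(1, PAGE_SZ);      /* slow path: fresh zeroed page */
}

static void quick_free(void *page)
{
    /* zero at free time, off the allocation hot path */
    memset(page, 0, PAGE_SZ);
    ((struct qpage *)page)->next = quicklist;
    quicklist = page;
    quicklist_len++;
}
```

A freed page comes back from `quick_alloc` already zeroed, without a second `memset`.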

 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2.6.21-rc4 02/15] ARM: Add drivers/dma to arch/arm/Kconfig

2007-03-22 Thread Dan Williams
Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 arch/arm/Kconfig |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index e7baca2..74077e3 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -997,6 +997,8 @@ source "drivers/mmc/Kconfig"
 
 source "drivers/rtc/Kconfig"
 
+source "drivers/dma/Kconfig"
+
 endmenu
 
 source "fs/Kconfig"


[PATCH 2.6.21-rc4 05/15] md: use raid5_run_ops for stripe cache operations

2007-03-22 Thread Dan Williams
Each stripe has three flag variables to reflect the state of operations
(pending, ack, and complete).
-pending: set to request servicing in raid5_run_ops
-ack: set to reflect that raid5_run_ops has seen this request
-complete: set when the operation is complete and it is ok for handle_stripe5
to clear 'pending' and 'ack'.
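The three flag words form a small state machine: handle_stripe5 requests work, raid5_run_ops acknowledges and executes it, and handle_stripe5 retires it once complete. A minimal userspace model of that lifecycle is sketched below; all names are illustrative and this is not the actual raid5 code.

```c
/* Toy model of the per-stripe pending/ack/complete bitmasks. */
#include <assert.h>

enum { STRIPE_OP_PREXOR = 0, STRIPE_OP_BIODRAIN = 1, STRIPE_OP_POSTXOR = 2 };

struct ops_state {
    unsigned long pending;   /* handle_stripe requested service     */
    unsigned long ack;       /* raid5_run_ops has seen the request  */
    unsigned long complete;  /* op done; pending/ack may be cleared */
};

static void request_op(struct ops_state *s, int op)
{
    s->pending |= 1UL << op;
}

/* raid5_run_ops side: claim only ops not already in flight */
static unsigned long claim_ops(struct ops_state *s)
{
    unsigned long fresh = s->pending & ~s->ack;
    s->ack |= fresh;
    return fresh;
}

static void finish_op(struct ops_state *s, int op)
{
    s->complete |= 1UL << op;
}

/* handle_stripe side: retire a completed op, clearing all three bits */
static int retire_op(struct ops_state *s, int op)
{
    unsigned long bit = 1UL << op;
    if (!(s->complete & bit))
        return 0;
    s->pending &= ~bit;
    s->ack &= ~bit;
    s->complete &= ~bit;
    return 1;
}
```

Claiming twice returns nothing the second time, which is the property get_stripe_work relies on to avoid resubmitting in-flight work.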

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |   65 +---
 1 files changed, 56 insertions(+), 9 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b7185a1..0397e33 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -126,6 +126,7 @@ static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh)
}
md_wakeup_thread(conf->mddev->thread);
} else {
+   BUG_ON(sh->ops.pending);
if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
atomic_dec(&conf->preread_active_stripes);
if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD)
@@ -225,7 +226,8 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int
 
BUG_ON(atomic_read(&sh->count) != 0);
BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
-   
+   BUG_ON(sh->ops.pending || sh->ops.ack || sh->ops.complete);
+
CHECK_DEVLOCK();
PRINTK("init_stripe called, stripe %llu\n", 
(unsigned long long)sh->sector);
@@ -241,11 +243,11 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int
for (i = sh->disks; i--; ) {
struct r5dev *dev = &sh->dev[i];
 
-   if (dev->toread || dev->towrite || dev->written ||
+   if (dev->toread || dev->read || dev->towrite || dev->written ||
test_bit(R5_LOCKED, &dev->flags)) {
-   printk("sector=%llx i=%d %p %p %p %d\n",
+   printk("sector=%llx i=%d %p %p %p %p %d\n",
   (unsigned long long)sh->sector, i, dev->toread,
-  dev->towrite, dev->written,
+  dev->read, dev->towrite, dev->written,
   test_bit(R5_LOCKED, &dev->flags));
BUG();
}
@@ -325,6 +327,43 @@ static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector
return sh;
 }
 
+/* check_op() ensures that we only dequeue an operation once */
+#define check_op(op) do {\
+   if (test_bit(op, &sh->ops.pending) &&\
+   !test_bit(op, &sh->ops.complete)) {\
+   if (test_and_set_bit(op, &sh->ops.ack))\
+   clear_bit(op, &pending);\
+   else\
+   ack++;\
+   } else\
+   clear_bit(op, &pending);\
+} while(0)
+
+/* find new work to run, do not resubmit work that is already
+ * in flight
+ */
+static unsigned long get_stripe_work(struct stripe_head *sh)
+{
+   unsigned long pending;
+   int ack = 0;
+
+   pending = sh->ops.pending;
+
+   check_op(STRIPE_OP_BIOFILL);
+   check_op(STRIPE_OP_COMPUTE_BLK);
+   check_op(STRIPE_OP_PREXOR);
+   check_op(STRIPE_OP_BIODRAIN);
+   check_op(STRIPE_OP_POSTXOR);
+   check_op(STRIPE_OP_CHECK);
+   if (test_and_clear_bit(STRIPE_OP_IO, &sh->ops.pending))
+   ack++;
+
+   sh->ops.count -= ack;
+   BUG_ON(sh->ops.count < 0);
+
+   return pending;
+}
+
 static int
 raid5_end_read_request(struct bio * bi, unsigned int bytes_done, int error);
 static int
@@ -1859,7 +1898,6 @@ static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int disks)
  *schedule a write of some buffers
  *return confirmation of parity correctness
  *
- * Parity calculations are done inside the stripe lock
  * buffers are taken off read_list or write_list, and bh_cache buffers
  * get BH_Lock set before the stripe lock is released.
  *
@@ -1877,10 +1915,11 @@ static void handle_stripe5(struct stripe_head *sh)
int non_overwrite = 0;
int failed_num=0;
struct r5dev *dev;
+   unsigned long pending=0;
 
-   PRINTK("handling stripe %llu, cnt=%d, pd_idx=%d\n",
-   (unsigned long long)sh->sector, atomic_read(&sh->count),
-   sh->pd_idx);
+   PRINTK("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d ops=%lx:%lx:%lx\n",
+  (unsigned long long)sh->sector, sh->state, atomic_read(&sh->count),
+  sh->pd_idx, sh->ops.pending, sh->ops.ack, sh->ops.complete);
 
spin_lock(&sh->lock);
clear_bit(STRIPE_HANDLE, &sh->state);
@@ -2330,8 +2369,14 @@ static void handle_stripe5(struct stripe_head *sh)
}
}
 
+   if (sh->ops.count)
+   pending = get_stripe_work(sh);
+
spin_unlock(&sh->lock)

[PATCH 2.6.21-rc4 01/15] dmaengine: add base support for the async_tx api

2007-03-22 Thread Dan Williams
The async_tx api provides methods for describing a chain of asynchronous
bulk memory transfers/transforms with support for inter-transactional
dependencies.  It is implemented as a dmaengine client
that smooths over the details of different hardware offload engine
implementations.  Code that is written to the api can optimize for
asynchronous operation and the api will fit the chain of operations to the
available offload resources.

Currently the raid5 implementation in the MD raid456 driver has been
converted to the async_tx api.  A driver for the offload engines on the
Intel Xscale series of I/O processors, iop-adma, is provided.  With the
iop-adma driver and async_tx, raid456 is able to offload copy, xor, and
xor-zero-sum operations to hardware engines.

On iop342 tiobench showed higher throughput for sequential writes (20 - 30%
improvement) and sequential reads to a degraded array (40 - 55%
improvement).  For the other cases performance was roughly equal, +/- a few
percentage points.  On a x86-smp platform the performance of the async_tx
implementation (in synchronous mode) was also +/- a few percentage points
of the original implementation.  According to 'top' CPU utilization was
positively affected in the offload case, but exact measurements have yet to
be taken.

The tiobench command line used for testing was:
tiobench --size 2048 --block 4096 --block 131072 --dir /mnt/raid --numruns 5
* iop342 had 1GB of memory available

This patch:
1/ introduces struct dma_async_tx_descriptor as a common field for all dmaengine
software descriptors
2/ converts the device_memcpy_* methods into separate prep, set src/dest, and
submit stages
3/ adds support for capabilities beyond memcpy (xor, memset, xor zero sum, completion interrupts)
4/ converts ioatdma to the new semantics
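Point 2 above splits the old one-shot memcpy call into separate prep, set-address, and submit stages, so descriptors can be built up before they are issued. A toy sketch of that staging pattern follows; the struct and function names are invented for illustration and are not the dmaengine API (the hardware submit is stood in for by a plain memcpy).

```c
/* Sketch of prep / set src-dest / submit staging for a copy descriptor. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct tx_desc {
    void *src, *dst;
    size_t len;
    int submitted;
    int cookie;        /* handle callers poll for completion */
};

static int next_cookie = 1;

/* stage 1: describe the operation; no addresses yet */
static struct tx_desc tx_prep_memcpy(size_t len)
{
    struct tx_desc d = { .len = len };
    return d;
}

/* stage 2: addresses can be filled in later, possibly by other code */
static void tx_set_src(struct tx_desc *d, void *src) { d->src = src; }
static void tx_set_dst(struct tx_desc *d, void *dst) { d->dst = dst; }

/* stage 3: submit hands the descriptor to the "engine" and returns a
 * cookie; here the engine is just a synchronous memcpy stand-in */
static int tx_submit(struct tx_desc *d)
{
    memcpy(d->dst, d->src, d->len);
    d->submitted = 1;
    return d->cookie = next_cookie++;
}
```

The separation lets a client chain and parameterize descriptors before committing them, which is what makes dependency chains between transactions expressible.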

Changelog:
* drop dma mapping methods, suggested by Chris Leech
* fix ioat_dma_dependency_added, also caught by Andrew Morton
* fix dma_sync_wait, change from Andrew Morton
* uninline large functions, change from Andrew Morton
* add tx->callback = NULL to dmaengine calls to interoperate with async_tx
  calls

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/dma/dmaengine.c   |  194 ++-
 drivers/dma/ioatdma.c |  248 -
 drivers/dma/ioatdma.h |8 +
 include/linux/dmaengine.h |  237 ---
 4 files changed, 439 insertions(+), 248 deletions(-)

diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index 322ee29..2285f33 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -59,6 +59,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -66,6 +67,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static DEFINE_MUTEX(dma_list_mutex);
 static LIST_HEAD(dma_device_list);
@@ -165,6 +167,24 @@ static struct dma_chan *dma_client_chan_alloc(struct dma_client *client)
return NULL;
 }
 
+enum dma_status dma_sync_wait(struct dma_chan *chan, dma_cookie_t cookie)
+{
+   enum dma_status status;
+   unsigned long dma_sync_wait_timeout = jiffies + msecs_to_jiffies(5000);
+
+   dma_async_issue_pending(chan);
+   do {
+   status = dma_async_is_tx_complete(chan, cookie, NULL, NULL);
+   if (time_after_eq(jiffies, dma_sync_wait_timeout)) {
+   printk(KERN_ERR "dma_sync_wait_timeout!\n");
+   return DMA_ERROR;
+   }
+   } while (status == DMA_IN_PROGRESS);
+
+   return status;
+}
+EXPORT_SYMBOL(dma_sync_wait);
+
 /**
  * dma_chan_cleanup - release a DMA channel's resources
  * @kref: kernel reference structure that contains the DMA channel device
@@ -211,7 +231,8 @@ static void dma_chans_rebalance(void)
mutex_lock(&dma_list_mutex);
 
list_for_each_entry(client, &dma_client_list, global_node) {
-   while (client->chans_desired > client->chan_count) {
+   while (client->chans_desired < 0 ||
+   client->chans_desired > client->chan_count) {
chan = dma_client_chan_alloc(client);
if (!chan)
break;
@@ -220,7 +241,8 @@ static void dma_chans_rebalance(void)
   chan,
   DMA_RESOURCE_ADDED);
}
-   while (client->chans_desired < client->chan_count) {
+   while (client->chans_desired >= 0 &&
+   client->chans_desired < client->chan_count) {
spin_lock_irqsave(&client->lock, flags);
chan = list_entry(client->channels.next,
  struct dma_chan,
@@ -297,12 +319,12 @@ EXPORT_SYMBOL(dma_async_client_unregister);
  * @number: count of DMA channels requested
  *
  * Clients call dma_async_client_chan_request() to specify how many

[PATCH 2.6.21-rc4 00/15] md raid5 acceleration and async_tx

2007-03-22 Thread Dan Williams
The following patch set implements the async_tx api and modifies
md-raid5 to issue memory copies and xor calculations asynchronously.
Async_tx is an extension of the existing dmaengine interface in the
kernel.  Async_tx allows kernel code to utilize application specific
acceleration engines when present and fall back to software routines
otherwise.  Further details can be found in the individual patch
headers. 
 
The implementation has been in -mm since 2.6.20-mm1 and has been
released in various forms to IOP users since 2.6.18
(http://sourceforge.net/projects/xscaleiop).  This release includes
cleanups to the iop-adma driver.  The md-raid5 patches are just rebased
versions of the 2.6.20-rc5 release. 
 
Given the release of an acceleration engine driver for another platform
(440spe http://marc.info/?l=linux-raid&m=117400143317440&w=2), I feel
more comfortable broaching the subject of 2.6.22 inclusion.
 
Regards, 
Dan
 
 
Dan Williams (15):
dmaengine: add base support for the async_tx api
ARM: Add drivers/dma to arch/arm/Kconfig
dmaengine: add the async_tx api
md: add raid5_run_ops and support routines
md: use raid5_run_ops for stripe cache operations
md: move write operations to raid5_run_ops
md: move raid5 compute block operations to raid5_run_ops
md: move raid5 parity checks to raid5_run_ops
md: satisfy raid5 read requests via raid5_run_ops
md: use async_tx and raid5_run_ops for raid5 expansion operations
md: move raid5 io requests to raid5_run_ops
md: remove raid5 compute_block and compute_parity5
dmaengine: driver for the iop32x, iop33x, and iop13xx raid engines
iop13xx: Surface the iop13xx adma units to the iop-adma driver
iop3xx: Surface the iop3xx DMA and AAU units to the iop-adma driver
 


Re: [uml-devel] [ PATCH 4/7 ] UML - create as-layout.h

2007-03-22 Thread Blaisorblade
On Thursday 22 March 2007 17:06, Jeff Dike wrote:
> This patch moves all the symbols defined in um_arch.c, which are
> mostly boundaries between different parts of the UML kernel address
> space, to a new header, as-layout.h.  There are also a few things here
> which aren't really related to address space layout, but which don't
> really have a better place to go.

Hey, I do like _these_ patches! A nice picture in that header could then be 
added (in the near future ;-) ), but at least one knows there are so many of 
them. And user_util.h is no more!

;-)

Bye!
-- 
Inform me of my mistakes, so I can add them to my list!
Paolo Giarrusso, aka Blaisorblade
http://www.user-mode-linux.org/~blaisorblade



Re: [QUICKLIST 1/5] Quicklists for page table pages V4

2007-03-22 Thread Andrew Morton
On Thu, 22 Mar 2007 23:28:41 -0700 (PDT) Christoph Lameter <[EMAIL PROTECTED]> wrote:

> 1. Proven code from the IA64 arch.
> 
>   The method used here has been fine tuned for years and
>   is NUMA aware. It is based on the knowledge that accesses
>   to page table pages are sparse in nature. Taking a page
>   off the freelists instead of allocating a zeroed page
>   allows a reduction in the number of cachelines touched,
>   in addition to getting rid of the slab overhead. So
>   performance improves.

By how much?


Re: [PATCH -mm try#2] Blackfin: architecture update patch

2007-03-22 Thread Wu, Bryan
On Fri, 2007-03-23 at 15:12 +0900, Paul Mundt wrote:
> On Fri, Mar 23, 2007 at 02:04:30PM +0800, Wu, Bryan wrote:
> > This is the latest blackfin update patch. Because there are lots of
> > issue fixing in this one, I put all modification in one update patch
> > which is located in:
> > https://blackfin.uclinux.org/gf/download/frsrelease/39/2707/blackfin-arch-2.6.21-rc4-mm1-update.patch
> > 
> I hope these will split up logically in the future so it's possible to
> reply to them without having to do manual mangling..
> 

From now on, I will follow this rule. Thanks.

> The patch generally looks fine, this is the only thing that really jumped
> out:
> 
> > Index: linux-2.6/include/asm-blackfin/pgtable.h
> > ===
> > --- linux-2.6.orig/include/asm-blackfin/pgtable.h
> > +++ linux-2.6/include/asm-blackfin/pgtable.h
> > @@ -59,4 +59,12 @@
> >  #define VMALLOC_START   0
> >  #defineVMALLOC_END 0x
> >  
> > +#define  __HAVE_ARCH_ENTER_LAZY_CPU_MODE
> > +#define arch_enter_lazy_cpu_mode() do {} while (0)
> > +#define arch_leave_lazy_cpu_mode() do {} while (0)
> > +
> > +#define  __HAVE_ARCH_ENTER_LAZY_MMU_MODE
> > +#define arch_enter_lazy_mmu_mode() do {} while (0)
> > +#define arch_leave_lazy_mmu_mode() do {} while (0)
> > +
> >  #endif /* _BLACKFIN_PGTABLE_H */
> 
> asm-generic/pgtable.h already does this if you don't explicitly define
> __HAVE_ARCH_ENTER_LAZY_{CPU,MMU}_MODE. So please kill this entirely. If
> you forgot to include asm-generic/pgtable.h, that's another matter..

OK, I will check this. These dummy definitions were added because
compilation fails without them.

Best Regards,
-Bryan Wu


[QUICKLIST 3/5] Quicklist support for i386

2007-03-22 Thread Christoph Lameter
i386: Convert to quicklists

Implement the i386 management of pgd and pmds using quicklists.

The i386 management of page table pages currently uses page sized slabs.
Getting rid of that using quicklists allows full use of the page flags
and the page->lru. So get rid of the improvised linked lists using
page->index and page->private.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc4-mm1/arch/i386/mm/init.c
===
--- linux-2.6.21-rc4-mm1.orig/arch/i386/mm/init.c   2007-03-15 
17:20:01.0 -0700
+++ linux-2.6.21-rc4-mm1/arch/i386/mm/init.c2007-03-20 14:21:52.0 
-0700
@@ -695,31 +695,6 @@ int remove_memory(u64 start, u64 size)
 EXPORT_SYMBOL_GPL(remove_memory);
 #endif
 
-struct kmem_cache *pgd_cache;
-struct kmem_cache *pmd_cache;
-
-void __init pgtable_cache_init(void)
-{
-   if (PTRS_PER_PMD > 1) {
-   pmd_cache = kmem_cache_create("pmd",
-   PTRS_PER_PMD*sizeof(pmd_t),
-   PTRS_PER_PMD*sizeof(pmd_t),
-   0,
-   pmd_ctor,
-   NULL);
-   if (!pmd_cache)
-   panic("pgtable_cache_init(): cannot create pmd cache");
-   }
-   pgd_cache = kmem_cache_create("pgd",
-   PTRS_PER_PGD*sizeof(pgd_t),
-   PTRS_PER_PGD*sizeof(pgd_t),
-   0,
-   pgd_ctor,
-   PTRS_PER_PMD == 1 ? pgd_dtor : NULL);
-   if (!pgd_cache)
-   panic("pgtable_cache_init(): Cannot create pgd cache");
-}
-
 /*
  * This function cannot be __init, since exceptions don't work in that
  * section.  Put this after the callers, so that it cannot be inlined.
Index: linux-2.6.21-rc4-mm1/arch/i386/mm/pgtable.c
===
--- linux-2.6.21-rc4-mm1.orig/arch/i386/mm/pgtable.c2007-03-15 
17:20:01.0 -0700
+++ linux-2.6.21-rc4-mm1/arch/i386/mm/pgtable.c 2007-03-20 14:55:47.0 
-0700
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -198,11 +199,6 @@ struct page *pte_alloc_one(struct mm_str
return pte;
 }
 
-void pmd_ctor(void *pmd, struct kmem_cache *cache, unsigned long flags)
-{
-   memset(pmd, 0, PTRS_PER_PMD*sizeof(pmd_t));
-}
-
 /*
  * List of all pgd's needed for non-PAE so it can invalidate entries
  * in both cached and uncached pgd's; not needed for PAE since the
@@ -211,36 +207,18 @@ void pmd_ctor(void *pmd, struct kmem_cac
  * against pageattr.c; it is the unique case in which a valid change
  * of kernel pagetables can't be lazily synchronized by vmalloc faults.
  * vmalloc faults work because attached pagetables are never freed.
- * The locking scheme was chosen on the basis of manfred's
- * recommendations and having no core impact whatsoever.
  * -- wli
  */
 DEFINE_SPINLOCK(pgd_lock);
-struct page *pgd_list;
-
-static inline void pgd_list_add(pgd_t *pgd)
-{
-   struct page *page = virt_to_page(pgd);
-   page->index = (unsigned long)pgd_list;
-   if (pgd_list)
-   set_page_private(pgd_list, (unsigned long)&page->index);
-   pgd_list = page;
-   set_page_private(page, (unsigned long)&pgd_list);
-}
+LIST_HEAD(pgd_list);
 
-static inline void pgd_list_del(pgd_t *pgd)
-{
-   struct page *next, **pprev, *page = virt_to_page(pgd);
-   next = (struct page *)page->index;
-   pprev = (struct page **)page_private(page);
-   *pprev = next;
-   if (next)
-   set_page_private(next, (unsigned long)pprev);
-}
+#define QUICK_PGD 0
+#define QUICK_PMD 1
 
-void pgd_ctor(void *pgd, struct kmem_cache *cache, unsigned long unused)
+void pgd_ctor(void *pgd)
 {
unsigned long flags;
+   struct page *page = virt_to_page(pgd);
 
if (PTRS_PER_PMD == 1) {
memset(pgd, 0, USER_PTRS_PER_PGD*sizeof(pgd_t));
@@ -259,31 +237,32 @@ void pgd_ctor(void *pgd, struct kmem_cac
__pa(swapper_pg_dir) >> PAGE_SHIFT,
USER_PTRS_PER_PGD, PTRS_PER_PGD - USER_PTRS_PER_PGD);
 
-   pgd_list_add(pgd);
+   list_add(&page->lru, &pgd_list);
spin_unlock_irqrestore(&pgd_lock, flags);
 }
 
 /* never called when PTRS_PER_PMD > 1 */
-void pgd_dtor(void *pgd, struct kmem_cache *cache, unsigned long unused)
+void pgd_dtor(void *pgd)
 {
unsigned long flags; /* can be called from interrupt context */
+   struct page *page = virt_to_page(pgd);
 
paravirt_release_pd(__pa(pgd) >> PAGE_SHIFT);
spin_lock_irqsave(&pgd_lock, flags);
-   pgd_list_del(pgd);
+   list_del(&page->lru);
spin_unlock_irqrestore(&pgd_lock, flags);
 }
 
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
int i;
-   p

[QUICKLIST 4/5] Quicklist support for x86_64

2007-03-22 Thread Christoph Lameter
Convert x86_64 to use quicklists

This adds caching of pgds, puds, pmds, and ptes. That way we can
avoid costly zeroing and initialization of special mappings in the
pgd.

A second quicklist is used to separate out PGD handling. Thus we can carry
the initialized pgds of terminating processes over to the next process
needing them.

Also clean up the pgd_list handling to use regular list macros. Not using
the slab allocator frees up the lru field so we can use regular list macros.

The addition and removal of pgds to and from the pgd_list is moved into the
constructor / destructor. We can then avoid moving pgds off the list while
they are still in the quicklists, further reducing pgd creation and
allocation overhead.
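The two-quicklist scheme described above can be modeled as two freelists with off-page bookkeeping: QUICK_PT pages are zeroed on free, while QUICK_PGD pages keep their initialized kernel entries so the constructor runs only for brand-new pages. A userspace toy follows; all names and the `KERNEL_FILL` pattern are illustrative, not the kernel implementation (error handling is elided).

```c
/* Toy two-quicklist allocator with a per-list constructor. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SZ 4096
#define KERNEL_FILL 0xab   /* stands in for the copied kernel pgd entries */

enum { QUICK_PGD = 0, QUICK_PT = 1 };

/* bookkeeping nodes live off-page, like the kernel's struct page lru */
struct qnode { struct qnode *next; void *page; };
static struct qnode *lists[2];
static int ctor_calls;

static void pgd_ctor(void *p)      /* runs once per brand-new pgd page */
{
    memset(p, KERNEL_FILL, PAGE_SZ);
    ctor_calls++;
}

static void *quick_alloc(int nr, void (*ctor)(void *))
{
    if (lists[nr]) {
        struct qnode *n = lists[nr];
        void *p = n->page;
        lists[nr] = n->next;
        free(n);
        return p;                  /* QUICK_PGD contents preserved */
    }
    void *p = calloc(1, PAGE_SZ);
    if (p && ctor)
        ctor(p);
    return p;
}

static void quick_free(int nr, void *p)
{
    struct qnode *n = malloc(sizeof(*n));
    if (nr == QUICK_PT)            /* PT pages must be zero on reuse */
        memset(p, 0, PAGE_SZ);
    n->page = p;
    n->next = lists[nr];
    lists[nr] = n;
}
```

A pgd recycled through the QUICK_PGD list comes back still carrying its kernel entries, so the constructor does not run again.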

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc4-mm1/arch/x86_64/Kconfig
===
--- linux-2.6.21-rc4-mm1.orig/arch/x86_64/Kconfig   2007-03-20 
14:20:34.0 -0700
+++ linux-2.6.21-rc4-mm1/arch/x86_64/Kconfig2007-03-20 14:21:57.0 
-0700
@@ -56,6 +56,14 @@ config ZONE_DMA
bool
default y
 
+config QUICKLIST
+   bool
+   default y
+
+config NR_QUICK
+   int
+   default 2
+
 config ISA
bool
 
Index: linux-2.6.21-rc4-mm1/include/asm-x86_64/pgalloc.h
===
--- linux-2.6.21-rc4-mm1.orig/include/asm-x86_64/pgalloc.h  2007-03-20 
14:21:06.0 -0700
+++ linux-2.6.21-rc4-mm1/include/asm-x86_64/pgalloc.h   2007-03-20 
14:55:47.0 -0700
@@ -4,6 +4,10 @@
 #include 
 #include 
 #include 
+#include 
+
+#define QUICK_PGD 0/* We preserve special mappings over free */
+#define QUICK_PT 1 /* Other page table pages that are zero on free */
 
 #define pmd_populate_kernel(mm, pmd, pte) \
set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
@@ -20,86 +24,77 @@ static inline void pmd_populate(struct m
 static inline void pmd_free(pmd_t *pmd)
 {
BUG_ON((unsigned long)pmd & (PAGE_SIZE-1));
-   free_page((unsigned long)pmd);
+   quicklist_free(QUICK_PT, NULL, pmd);
 }
 
 static inline pmd_t *pmd_alloc_one (struct mm_struct *mm, unsigned long addr)
 {
-   return (pmd_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+   return (pmd_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
 }
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-   return (pud_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+   return (pud_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
 }
 
 static inline void pud_free (pud_t *pud)
 {
BUG_ON((unsigned long)pud & (PAGE_SIZE-1));
-   free_page((unsigned long)pud);
+   quicklist_free(QUICK_PT, NULL, pud);
 }
 
-static inline void pgd_list_add(pgd_t *pgd)
+static inline void pgd_ctor(void *x)
 {
+   unsigned boundary;
+   pgd_t *pgd = x;
struct page *page = virt_to_page(pgd);
 
+   /*
+* Copy kernel pointers in from init.
+*/
+   boundary = pgd_index(__PAGE_OFFSET);
+   memcpy(pgd + boundary,
+   init_level4_pgt + boundary,
+   (PTRS_PER_PGD - boundary) * sizeof(pgd_t));
+
spin_lock(&pgd_lock);
-   page->index = (pgoff_t)pgd_list;
-   if (pgd_list)
-   pgd_list->private = (unsigned long)&page->index;
-   pgd_list = page;
-   page->private = (unsigned long)&pgd_list;
+   list_add(&page->lru, &pgd_list);
spin_unlock(&pgd_lock);
 }
 
-static inline void pgd_list_del(pgd_t *pgd)
+static inline void pgd_dtor(void *x)
 {
-   struct page *next, **pprev, *page = virt_to_page(pgd);
+   pgd_t *pgd = x;
+   struct page *page = virt_to_page(pgd);
 
spin_lock(&pgd_lock);
-   next = (struct page *)page->index;
-   pprev = (struct page **)page->private;
-   *pprev = next;
-   if (next)
-   next->private = (unsigned long)pprev;
+   list_del(&page->lru);
spin_unlock(&pgd_lock);
 }
 
+
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-   unsigned boundary;
-   pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
-   if (!pgd)
-   return NULL;
-   pgd_list_add(pgd);
-   /*
-* Copy kernel pointers in from init.
-* Could keep a freelist or slab cache of those because the kernel
-* part never changes.
-*/
-   boundary = pgd_index(__PAGE_OFFSET);
-   memset(pgd, 0, boundary * sizeof(pgd_t));
-   memcpy(pgd + boundary,
-  init_level4_pgt + boundary,
-  (PTRS_PER_PGD - boundary) * sizeof(pgd_t));
+   pgd_t *pgd = (pgd_t *)quicklist_alloc(QUICK_PGD,
+GFP_KERNEL|__GFP_REPEAT, pgd_ctor);
+
return pgd;
 }
 
 static inline void pgd_free(pgd_t *pgd)
 {
BUG_ON((unsigned long)pgd & (PAGE_SIZE-1));
-   pgd_list_del(pgd);
-   free_page((unsigned long)pgd);
+   qui

[QUICKLIST 5/5] Quicklist support for sparc64

2007-03-22 Thread Christoph Lameter
From: David Miller <[EMAIL PROTECTED]>

[QUICKLIST]: Add sparc64 quicklist support.

I ported this to sparc64 as per the patch below, tested on
UP SunBlade1500 and 24 cpu Niagara T1000.

Signed-off-by: David S. Miller <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc4-mm1/arch/sparc64/Kconfig
===
--- linux-2.6.21-rc4-mm1.orig/arch/sparc64/Kconfig  2007-03-20 
14:20:33.0 -0700
+++ linux-2.6.21-rc4-mm1/arch/sparc64/Kconfig   2007-03-20 14:22:03.0 
-0700
@@ -26,6 +26,10 @@ config MMU
bool
default y
 
+config QUICKLIST
+   bool
+   default y
+
 config STACKTRACE_SUPPORT
bool
default y
Index: linux-2.6.21-rc4-mm1/arch/sparc64/mm/init.c
===
--- linux-2.6.21-rc4-mm1.orig/arch/sparc64/mm/init.c2007-03-20 
14:20:33.0 -0700
+++ linux-2.6.21-rc4-mm1/arch/sparc64/mm/init.c 2007-03-20 14:22:03.0 
-0700
@@ -178,30 +178,6 @@ unsigned long sparc64_kern_sec_context _
 
 int bigkernel = 0;
 
-struct kmem_cache *pgtable_cache __read_mostly;
-
-static void zero_ctor(void *addr, struct kmem_cache *cache, unsigned long flags)
-{
-   clear_page(addr);
-}
-
-extern void tsb_cache_init(void);
-
-void pgtable_cache_init(void)
-{
-   pgtable_cache = kmem_cache_create("pgtable_cache",
- PAGE_SIZE, PAGE_SIZE,
- SLAB_HWCACHE_ALIGN |
- SLAB_MUST_HWCACHE_ALIGN,
- zero_ctor,
- NULL);
-   if (!pgtable_cache) {
-   prom_printf("Could not create pgtable_cache\n");
-   prom_halt();
-   }
-   tsb_cache_init();
-}
-
 #ifdef CONFIG_DEBUG_DCFLUSH
 atomic_t dcpage_flushes = ATOMIC_INIT(0);
 #ifdef CONFIG_SMP
Index: linux-2.6.21-rc4-mm1/arch/sparc64/mm/tsb.c
===
--- linux-2.6.21-rc4-mm1.orig/arch/sparc64/mm/tsb.c 2007-03-15 
17:20:01.0 -0700
+++ linux-2.6.21-rc4-mm1/arch/sparc64/mm/tsb.c  2007-03-20 14:22:03.0 
-0700
@@ -252,7 +252,7 @@ static const char *tsb_cache_names[8] = 
"tsb_1MB",
 };
 
-void __init tsb_cache_init(void)
+void __init pgtable_cache_init(void)
 {
unsigned long i;
 
Index: linux-2.6.21-rc4-mm1/include/asm-sparc64/pgalloc.h
===
--- linux-2.6.21-rc4-mm1.orig/include/asm-sparc64/pgalloc.h 2007-03-15 
17:20:01.0 -0700
+++ linux-2.6.21-rc4-mm1/include/asm-sparc64/pgalloc.h  2007-03-20 
14:55:47.0 -0700
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -13,52 +14,50 @@
 #include 
 
 /* Page table allocation/freeing. */
-extern struct kmem_cache *pgtable_cache;
 
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-   return kmem_cache_alloc(pgtable_cache, GFP_KERNEL);
+   return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline void pgd_free(pgd_t *pgd)
 {
-   kmem_cache_free(pgtable_cache, pgd);
+   quicklist_free(0, NULL, pgd);
 }
 
 #define pud_populate(MM, PUD, PMD) pud_set(PUD, PMD)
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-   return kmem_cache_alloc(pgtable_cache,
-   GFP_KERNEL|__GFP_REPEAT);
+   return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline void pmd_free(pmd_t *pmd)
 {
-   kmem_cache_free(pgtable_cache, pmd);
+   quicklist_free(0, NULL, pmd);
 }
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
  unsigned long address)
 {
-   return kmem_cache_alloc(pgtable_cache,
-   GFP_KERNEL|__GFP_REPEAT);
+   return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline struct page *pte_alloc_one(struct mm_struct *mm,
 unsigned long address)
 {
-   return virt_to_page(pte_alloc_one_kernel(mm, address));
+   void *pg = quicklist_alloc(0, GFP_KERNEL, NULL);
+   return pg ? virt_to_page(pg) : NULL;
 }

 static inline void pte_free_kernel(pte_t *pte)
 {
-   kmem_cache_free(pgtable_cache, pte);
+   quicklist_free(0, NULL, pte);
 }
 
 static inline void pte_free(struct page *ptepage)
 {
-   pte_free_kernel(page_address(ptepage));
+   quicklist_free(0, NULL, page_address(ptepage));
 }
 
 
@@ -66,6 +65,9 @@ static inline void pte_free(struct page 
 #define pmd_populate(MM,PMD,PTE_PAGE)  \
pmd_populate_kernel(MM,PMD,page_address(PTE_PAGE))
 
-#define check_pgt_cache()  do { } while (0)
+static inline void check_pgt_cache(void)
+{
+   quicklist_trim(0, NULL, 25, 16);
+}
 
 #endif /* _SPARC64_PGALLOC_H */
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[QUICKLIST 1/5] Quicklists for page table pages V4

2007-03-22 Thread Christoph Lameter
Quicklists for page table pages V4

V3->V4
- Rename quicklist_check to quicklist_trim and allow parameters
  to specify how to clean quicklists.
- Remove dead code

V2->V3
- Fix Kconfig issues by setting CONFIG_QUICKLIST explicitly
  and default to one quicklist if NR_QUICK is not set.
- Fix i386 support. (Cannot mix PMD and PTE allocs.)
- Discussion of V2.
  http://marc.info/?l=linux-kernel&m=117391339914767&w=2

V1->V2
- Add sparc64 patch
- Single i386 and x86_64 patch
- Update attribution
- Update justification
- Update approvals
- Earlier discussion of V1 was at
  http://marc.info/?l=linux-kernel&m=117357922219342&w=2

This patchset introduces an arch independent framework to handle lists
of recently used page table pages to replace the existing (ab)use of the
slab for that purpose.

1. Proven code from the IA64 arch.

The method used here has been fine-tuned for years and
is NUMA aware. It is based on the knowledge that accesses
to page table pages are sparse in nature. Taking a page
off the freelist instead of allocating a zeroed page
reduces the number of cachelines touched, in addition
to getting rid of the slab overhead, so performance
improves. This is particularly useful if pgds contain
standard mappings: we can save the teardown and setup
of such a page if we have some on the quicklists.
This also avoids the list operations that are otherwise
necessary on alloc and free to track pgds.

2. Light weight alternative to use slab to manage page size pages

Slab overhead is significant, and even page allocator use
is pretty heavyweight. The use of a per-cpu quicklist
means that we touch only two cachelines for an allocation.
There is no need to access the page struct (unless arch code
needs to fiddle with it). So the fast path just
means bringing in one cacheline at the beginning of the
page. That same cacheline may then be used to store the
page table entry, or a second cacheline may be used
if the page table entry is not in the first cacheline of
the page. The current code zeroes the page, which means
touching 32 cachelines (assuming 128-byte cachelines). We get
down from 32 to 2 cachelines in the fast path.

3. Fix conflicting use of page_structs by slab and arch code.

For example, both arches use the ->private and ->index fields to
create lists of pgds, and i386 also uses other page flags. The slab
can also use the ->private field for allocations that
are larger than page size, which would occur if one enables
debugging. In that case the arch code would overwrite the
pointer to the first page of the compound page allocated
by the slab. SLAB has been modified to not enable
debugging for such slabs (!).

There is potential for additional conflicts
here, especially since some arches also use page flags to mark
page table pages.

The patch removes these conflicts by no longer using
the slab for these purposes. The page allocator is more
suitable since PAGE_SIZE chunks are its domain.
Then we can start using standard list operations via
page->lru instead of improvising linked lists.

SLUB makes more extensive use of the page struct and so
far has had to create workarounds for these slabs. The ->index
field is used for the SLUB freelist, so SLUB cannot allow
the use of a freelist for these slabs and--like slab--
currently does not allow debugging and forces such slabs to
contain only a single object (avoiding the freelist).

If we do not get rid of these issues then both SLAB and SLUB
have to continue to provide special code paths to support these
slabs.

4. i386 gets lightweight NUMA aware management of page table pages.

Note that the use of SLAB on NUMA systems requires
alien caches to efficiently remove remote page
table pages, which (for a PAGE_SIZE allocation) is a lengthy
and expensive process. With quicklists no alien caches are
needed; pages can simply be returned to the correct node.

5. x86_64 gets lightweight page table page management.

This allows x86_64 arch code to repopulate pgds
and other page table entries faster. The list operations for pgds
are reduced in the same way as for i386, to the point where
a pgd is simply allocated from the page allocator and later
freed back to it. A pgd can pass through
the quicklists without having to be reinitialized.

6. Consolidation of code from multiple arches

So far arches have had their own implementations of quicklist
management. This patch moves that feature into the core, allowing
easier maintenance and consistent management of quicklists.
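The per-cpu freelist idea described above can be sketched in plain C. This is a hedched, single-threaded userspace model, not the kernel's quicklist API: free pages are linked through their own first word, a pop clears only that link word (the rest of the page is already zero), and a trim pass frees surplus pages in bounded batches, mirroring the quicklist_trim(0, NULL, 25, 16) call in the sparc64 patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Head of a freelist of zeroed "pages"; the next pointer is stored in
 * the first word of each free page, so no separate metadata is needed
 * -- the same trick the kernel quicklists use. */
static void *quicklist_head;
static long quicklist_size;

static void *ql_alloc(void)
{
	void *page = quicklist_head;

	if (page) {
		/* Pop: follow the link in the first word, then clear
		 * only that word -- the rest of the page is still zero. */
		quicklist_head = *(void **)page;
		*(void **)page = NULL;
		quicklist_size--;
		return page;
	}
	/* Freelist empty: fall back to a fully zeroed allocation. */
	return calloc(1, PAGE_SIZE);
}

static void ql_free(void *page)
{
	/* Push: the page must already be clear except the link word. */
	*(void **)page = quicklist_head;
	quicklist_head = page;
	quicklist_size++;
}

/* Trim the list down toward min_pages, freeing at most max_free pages
 * per call (cf. quicklist_trim(0, NULL, 25, 16) in the sparc64 patch). */
static void ql_trim(long min_pages, long max_free)
{
	while (quicklist_size > min_pages && max_free-- > 0)
		free(ql_alloc());
}
```

The batching in ql_trim corresponds to the MIN_PGT_PAGES / MAX_PGT_FREES_PER_PASS policy the IA64 patch deletes from arch code.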

[QUICKLIST 2/5] Quicklist support for IA64

2007-03-22 Thread Christoph Lameter
Quicklist for IA64

IA64 is the origin of the quicklist implementation. So cut out the pieces
that are now in core code and modify the functions called.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc4-mm1/arch/ia64/mm/init.c
===
--- linux-2.6.21-rc4-mm1.orig/arch/ia64/mm/init.c	2007-03-20 14:20:28.0 -0700
+++ linux-2.6.21-rc4-mm1/arch/ia64/mm/init.c	2007-03-20 14:21:47.0 -0700
@@ -39,9 +39,6 @@
 
 DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
 
-DEFINE_PER_CPU(unsigned long *, __pgtable_quicklist);
-DEFINE_PER_CPU(long, __pgtable_quicklist_size);
-
 extern void ia64_tlb_init (void);
 
 unsigned long MAX_DMA_ADDRESS = PAGE_OFFSET + 0x1UL;
@@ -56,54 +53,6 @@ EXPORT_SYMBOL(vmem_map);
 struct page *zero_page_memmap_ptr; /* map entry for zero page */
 EXPORT_SYMBOL(zero_page_memmap_ptr);
 
-#define MIN_PGT_PAGES  25UL
-#define MAX_PGT_FREES_PER_PASS 16L
-#define PGT_FRACTION_OF_NODE_MEM   16
-
-static inline long
-max_pgt_pages(void)
-{
-   u64 node_free_pages, max_pgt_pages;
-
-#ifndef CONFIG_NUMA
-   node_free_pages = nr_free_pages();
-#else
-   node_free_pages = node_page_state(numa_node_id(), NR_FREE_PAGES);
-#endif
-   max_pgt_pages = node_free_pages / PGT_FRACTION_OF_NODE_MEM;
-   max_pgt_pages = max(max_pgt_pages, MIN_PGT_PAGES);
-   return max_pgt_pages;
-}
-
-static inline long
-min_pages_to_free(void)
-{
-   long pages_to_free;
-
-   pages_to_free = pgtable_quicklist_size - max_pgt_pages();
-   pages_to_free = min(pages_to_free, MAX_PGT_FREES_PER_PASS);
-   return pages_to_free;
-}
-
-void
-check_pgt_cache(void)
-{
-   long pages_to_free;
-
-   if (unlikely(pgtable_quicklist_size <= MIN_PGT_PAGES))
-   return;
-
-   preempt_disable();
-   while (unlikely((pages_to_free = min_pages_to_free()) > 0)) {
-   while (pages_to_free--) {
-   free_page((unsigned long)pgtable_quicklist_alloc());
-   }
-   preempt_enable();
-   preempt_disable();
-   }
-   preempt_enable();
-}
-
 void
 lazy_mmu_prot_update (pte_t pte)
 {
Index: linux-2.6.21-rc4-mm1/include/asm-ia64/pgalloc.h
===
--- linux-2.6.21-rc4-mm1.orig/include/asm-ia64/pgalloc.h	2007-03-15 17:20:01.0 -0700
+++ linux-2.6.21-rc4-mm1/include/asm-ia64/pgalloc.h	2007-03-20 14:55:47.0 -0700
@@ -18,71 +18,18 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
-DECLARE_PER_CPU(unsigned long *, __pgtable_quicklist);
-#define pgtable_quicklist __ia64_per_cpu_var(__pgtable_quicklist)
-DECLARE_PER_CPU(long, __pgtable_quicklist_size);
-#define pgtable_quicklist_size __ia64_per_cpu_var(__pgtable_quicklist_size)
-
-static inline long pgtable_quicklist_total_size(void)
-{
-   long ql_size = 0;
-   int cpuid;
-
-   for_each_online_cpu(cpuid) {
-   ql_size += per_cpu(__pgtable_quicklist_size, cpuid);
-   }
-   return ql_size;
-}
-
-static inline void *pgtable_quicklist_alloc(void)
-{
-   unsigned long *ret = NULL;
-
-   preempt_disable();
-
-   ret = pgtable_quicklist;
-   if (likely(ret != NULL)) {
-   pgtable_quicklist = (unsigned long *)(*ret);
-   ret[0] = 0;
-   --pgtable_quicklist_size;
-   preempt_enable();
-   } else {
-   preempt_enable();
-   ret = (unsigned long *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
-   }
-
-   return ret;
-}
-
-static inline void pgtable_quicklist_free(void *pgtable_entry)
-{
-#ifdef CONFIG_NUMA
-   int nid = page_to_nid(virt_to_page(pgtable_entry));
-
-   if (unlikely(nid != numa_node_id())) {
-   free_page((unsigned long)pgtable_entry);
-   return;
-   }
-#endif
-
-   preempt_disable();
-   *(unsigned long *)pgtable_entry = (unsigned long)pgtable_quicklist;
-   pgtable_quicklist = (unsigned long *)pgtable_entry;
-   ++pgtable_quicklist_size;
-   preempt_enable();
-}
-
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-   return pgtable_quicklist_alloc();
+   return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline void pgd_free(pgd_t * pgd)
 {
-   pgtable_quicklist_free(pgd);
+   quicklist_free(0, NULL, pgd);
 }
 
 #ifdef CONFIG_PGTABLE_4
@@ -94,12 +41,12 @@ pgd_populate(struct mm_struct *mm, pgd_t
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-   return pgtable_quicklist_alloc();
+   return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline void pud_free(pud_t * pud)
 {
-   pgtable_quicklist_free(pud);
+   quicklist_free(0, NULL, pud);
 }
 #define __pud_free_tlb(tlb, pud)   pud_free(pud)
 #endif /* CONFIG_PGTABLE_4 */
@@ -112,12 +59,12 @@ pud_populat

[PATCH] sched: rsdl yet more fixes

2007-03-22 Thread Con Kolivas
This one should hopefully fix Andy's bug.

To be queued on top of what's already in -mm please. Will make v.33 with these
changes for other trees soon.

---
The wrong bit could be unset on requeue_task which could cause an oops.
Fix that.

sched_yield semantics became almost a noop so change back to expiring tasks
when yield is called.

recalc_task_prio() performed during pull_task() on SMP may not reliably
be doing the right thing to tasks queued on the new runqueue. Add a
special variant of enqueue_task that does its own local recalculation of
priority and quota.

rq->best_static_prio should not be set by realtime or SCHED_BATCH tasks.
Correct that, and microoptimise the code around setting best_static_prio.
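The next_entitled_slot() rewrite in the patch below boils down to one bitmap search; here is a toy model of it, with plain unsigned longs standing in for the kernel bitmap API (the function and variable names are simplified stand-ins): slots the task may not use are set either in its own bitmap or in its static-priority row of the prio matrix, and the entitled slot is the first zero bit at or after the search origin.

```c
#include <assert.h>

#define PRIO_RANGE 40

/* Toy model of next_entitled_slot(): returns the first priority slot
 * >= search_prio that is blocked neither by the task's own used-slot
 * bitmap nor by its static-priority row of the prio matrix.  Returns
 * PRIO_RANGE when no slot is left (caller queues the task as expired). */
static int next_entitled_slot(unsigned long task_bitmap,
			      unsigned long matrix_row,
			      int search_prio)
{
	unsigned long blocked = task_bitmap | matrix_row;
	int bit;

	for (bit = search_prio; bit < PRIO_RANGE; bit++)
		if (!(blocked & (1UL << bit)))
			return bit;
	return PRIO_RANGE;
}
```

The patch's simplification is exactly this shape: instead of two branches with different searches, one bitmap_or() plus one find_next_zero_bit() whose starting point (MAX_RT_PRIO or rq->prio_level) is chosen up front.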

Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>

---
 kernel/sched.c |  103 +++--
 1 file changed, 71 insertions(+), 32 deletions(-)

Index: linux-2.6.21-rc4-mm1/kernel/sched.c
===
--- linux-2.6.21-rc4-mm1.orig/kernel/sched.c	2007-03-23 11:28:25.0 +1100
+++ linux-2.6.21-rc4-mm1/kernel/sched.c	2007-03-23 17:28:19.0 +1100
@@ -714,17 +714,17 @@ static inline int entitled_slot(int stat
  */
 static inline int next_entitled_slot(struct task_struct *p, struct rq *rq)
 {
-   if (p->static_prio < rq->best_static_prio && p->policy != SCHED_BATCH)
-   return SCHED_PRIO(find_first_zero_bit(p->bitmap, PRIO_RANGE));
-   else {
-   DECLARE_BITMAP(tmp, PRIO_RANGE);
+   DECLARE_BITMAP(tmp, PRIO_RANGE);
+   int search_prio;
 
-   bitmap_or(tmp, p->bitmap,
- prio_matrix[USER_PRIO(p->static_prio)],
- PRIO_RANGE);
-   return SCHED_PRIO(find_next_zero_bit(tmp, PRIO_RANGE,
-   USER_PRIO(rq->prio_level)));
-   }
+   if (p->static_prio < rq->best_static_prio && p->policy != SCHED_BATCH)
+   search_prio = MAX_RT_PRIO;
+   else
+   search_prio = rq->prio_level;
+   bitmap_or(tmp, p->bitmap, prio_matrix[USER_PRIO(p->static_prio)],
+ PRIO_RANGE);
+   return SCHED_PRIO(find_next_zero_bit(tmp, PRIO_RANGE,
+   USER_PRIO(search_prio)));
 }
 
 static void queue_expired(struct task_struct *p, struct rq *rq)
@@ -817,7 +817,7 @@ static void requeue_task(struct task_str
list_move_tail(&p->run_list, p->array->queue + p->prio);
if (!rt_task(p)) {
if (list_empty(old_array->queue + old_prio))
-   __clear_bit(old_prio, p->array->prio_bitmap);
+   __clear_bit(old_prio, old_array->prio_bitmap);
set_dynamic_bit(p, rq);
}
 }
@@ -2074,25 +2074,54 @@ void sched_exec(void)
 }
 
 /*
+ * This is a unique version of enqueue_task for the SMP case where a task
+ * has just been moved across runqueues. It uses the information from the
+ * old runqueue to help it make a decision much like recalc_task_prio. As
+ * the new runqueue is almost certainly at a different prio_level than the
+ * src_rq it is cheapest just to pick the next entitled slot.
+ */
+static inline void enqueue_pulled_task(struct rq *src_rq, struct rq *rq,
+  struct task_struct *p)
+{
+   int queue_prio;
+
+   p->array = rq->active;
+   if (!rt_task(p)) {
+   if (p->rotation == src_rq->prio_rotation) {
+   if (p->array == src_rq->expired) {
+   queue_expired(p, rq);
+   goto out_queue;
+   }
+   } else
+   task_new_array(p, rq);
+   }
+   queue_prio = next_entitled_slot(p, rq);
+   if (queue_prio >= MAX_PRIO) {
+   queue_expired(p, rq);
+   goto out_queue;
+   }
+   rq_quota(rq, queue_prio) += p->quota;
+   p->prio = queue_prio;
+out_queue:
+   p->normal_prio = p->prio;
+   p->rotation = rq->prio_rotation;
+   sched_info_queued(p);
+   set_dynamic_bit(p, rq);
+   list_add_tail(&p->run_list, p->array->queue + p->prio);
+}
+
+/*
  * pull_task - move a task from a remote runqueue to the local runqueue.
  * Both runqueues must be locked.
  */
-static void pull_task(struct rq *src_rq, struct prio_array *src_array,
- struct task_struct *p, struct rq *this_rq,
- int this_cpu)
+static void pull_task(struct rq *src_rq, struct task_struct *p,
+ struct rq *this_rq, int this_cpu)
 {
dequeue_task(p, src_rq);
dec_nr_running(p, src_rq);
set_task_cpu(p, this_cpu);
inc_nr_running(p, this_rq);
-
-   /*
-* If this task has already been running on src_rq this priority
-* cycle, make the new runqueue think it has been on its cycle
-*/
-   if (p->rotation == src_rq->prio_rotation)
-   p->rotation =

Re: [PATCH -mm try#2] Blackfin: architecture update patch

2007-03-22 Thread Paul Mundt
On Fri, Mar 23, 2007 at 02:04:30PM +0800, Wu, Bryan wrote:
> This is the latest blackfin update patch. Because there are lots of
> issues fixed in this one, I put all modifications in one update patch,
> which is located at:
> https://blackfin.uclinux.org/gf/download/frsrelease/39/2707/blackfin-arch-2.6.21-rc4-mm1-update.patch
> 
I hope these will be split up logically in the future so it's possible to
reply to them without having to do manual mangling..

The patch generally looks fine; this is the only thing that really jumped
out:

> Index: linux-2.6/include/asm-blackfin/pgtable.h
> ===
> --- linux-2.6.orig/include/asm-blackfin/pgtable.h
> +++ linux-2.6/include/asm-blackfin/pgtable.h
> @@ -59,4 +59,12 @@
>  #define  VMALLOC_START   0
>  #define  VMALLOC_END 0x
>  
> +#define  __HAVE_ARCH_ENTER_LAZY_CPU_MODE
> +#define arch_enter_lazy_cpu_mode()   do {} while (0)
> +#define arch_leave_lazy_cpu_mode()   do {} while (0)
> +
> +#define  __HAVE_ARCH_ENTER_LAZY_MMU_MODE
> +#define arch_enter_lazy_mmu_mode()   do {} while (0)
> +#define arch_leave_lazy_mmu_mode()   do {} while (0)
> +
>  #endif   /* _BLACKFIN_PGTABLE_H */

asm-generic/pgtable.h already does this if you don't explicitly define
__HAVE_ARCH_ENTER_LAZY_{CPU,MMU}_MODE. So please kill this entirely. If
you forgot to include asm-generic/pgtable.h, that's another matter..
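The asm-generic convention Paul refers to is the usual __HAVE_ARCH_* guard pattern. A minimal self-contained sketch of how it composes (the guard and macro names are the real kernel ones; lazy_depth is purely an illustrative counter, not kernel code):

```c
#include <assert.h>

static int lazy_depth;

/* An arch that wants real hooks defines the guard plus its macros
 * before the generic header is pulled in: */
#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
#define arch_enter_lazy_mmu_mode()	do { lazy_depth++; } while (0)
#define arch_leave_lazy_mmu_mode()	do { lazy_depth--; } while (0)

/* asm-generic/pgtable.h then supplies no-op fallbacks only when the
 * guard is absent -- which is why Blackfin's explicit no-op copies
 * above are redundant: */
#ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
#define arch_enter_lazy_mmu_mode()	do { } while (0)
#define arch_leave_lazy_mmu_mode()	do { } while (0)
#endif
```

An arch that defines nothing gets the no-ops for free; defining the guard without including the generic header, as the Blackfin patch effectively did, only duplicates the fallback.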


Re: 2.6.21-rc4-mm1

2007-03-22 Thread Con Kolivas
On Friday 23 March 2007 05:17, Andy Whitcroft wrote:
> Ok, I have yet a third x86_64 machine that is blowing up with the latest
> 2.6.21-rc4-mm1+hotfixes+rsdl-0.32 but working with
> 2.6.21-rc4-mm1+hotfixes-RSDL.  I have results on various hotfix levels
> so I have just fired off a set of tests across the affected machines on
> that latest hotfix stack plus the RSDL backout and the results should be
> in in the next hour or two.
>
> I think there is a strong correlation between RSDL and these hangs.  Any
> suggestions as to the next step.

Found a nasty in requeue_task
+   if (list_empty(old_array->queue + old_prio))
+   __clear_bit(old_prio, p->array->prio_bitmap);

see anything wrong there? I do :P

I'll queue that up with the other changes pending and hopefully that will fix 
your bug.
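For readers who don't spot it: after the requeue, p->array is the *new* array, so the bit for the now-empty queue is cleared in the wrong bitmap. A toy model of the bug class (struct and field names are hypothetical simplifications of the RSDL structures, not the real scheduler code):

```c
#include <assert.h>

#define NPRIO 40

struct prio_array {
	unsigned long bitmap;		/* bit set => queue at that prio non-empty */
	int nr_queued[NPRIO];
};

static void requeue(struct prio_array *old_array, struct prio_array *new_array,
		    int old_prio, int new_prio, int buggy)
{
	old_array->nr_queued[old_prio]--;
	new_array->nr_queued[new_prio]++;
	new_array->bitmap |= 1UL << new_prio;

	if (old_array->nr_queued[old_prio] == 0) {
		/* The bug: clearing old_prio in p->array (the new array)
		 * leaves a stale bit in old_array, so the scheduler can
		 * later pick a priority whose queue is empty -- and oops. */
		struct prio_array *a = buggy ? new_array : old_array;
		a->bitmap &= ~(1UL << old_prio);
	}
}
```

With the buggy variant the old array keeps a set bit over an empty queue, which is exactly the dereference-an-empty-list oops Andy was hitting.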

-- 
-ck


Re: RSDL v0.31

2007-03-22 Thread Mike Galbraith
On Fri, 2007-03-23 at 16:59 +1100, Con Kolivas wrote:

> The deadline mechanism is easy to hit and works. Try printk'ing it.

Hm.  I did (.30), and it didn't kick in over an hour's time of doing this
and that.  After I did the "take your quota with you" change, it did kick
in.  Lots.

-Mike



Re: sysfs q [was: sysfs ugly timer interface]

2007-03-22 Thread Jan Engelhardt

On Mar 22 2007 21:48, Greg KH wrote:
>On Fri, Mar 23, 2007 at 02:24:46AM +0100, Jan Engelhardt wrote:
>> On Mar 22 2007 08:28, Greg KH wrote:
>> 
>> Question regarding sysfs files: How would you do something like 
>> /proc/net/nf_conntrack with sysfs? Have directories named like 0000, 
>> 0001, 0002, ...?
>
>I don't know, I've never said that _all_ proc files can move to sysfs. 
>For some things, like possibly the netfilter stuff, proc files make 
>more sense.

But proc is for processes (at least that's what its name indicates).

>Were you thinking of moving this file to sysfs?

No, not that one, but new modules. Everyone says "please, no new /proc 
files" [see examples 1, 2]. On the other hand,

[1] http://lkml.org/lkml/2007/1/21/34
[2] http://lkml.org/lkml/2005/2/3/285

 [EMAIL PROTECTED]:/home/maxim# cat 
 /sys/devices/system/clockevents/clockevents0/registered
  
 lapic F:0007 M:3(periodic) C: 1
 hpet  F:0003 M:1(shutdown) C: 0
 lapic F:0007 M:3(periodic) C: 0
 [EMAIL PROTECTED]:/home/maxim#
>>>
>>> Now... this file needs to die, before 2.6.21 is released. It tries to
>>> bring /proc-like parsing nightmare to sysfs. Kill it before it becomes
>>> part of stable ABI!

when there's a proc-style multi-line file like that clockevents thing in 
sysfs, people raise objections too (see above), which leads me to the 
question: if neither procfs nor sysfs are appropriate for such files, 
what is?


>What does the information in it represent?

A list of the currently tracked connections.



Jan
-- 


[PATCH -mm try#2] Blackfin: architecture update patch

2007-03-22 Thread Wu, Bryan
Hi folks:

This is the latest blackfin update patch. Because there are lots of
issues fixed in this one, I put all modifications in one update patch,
which is located at:
https://blackfin.uclinux.org/gf/download/frsrelease/39/2707/blackfin-arch-2.6.21-rc4-mm1-update.patch

Changelog:

1) Fixed most of the issues according to Arnd's and Paul's review.
2) Removed RCS tags.
3) Removed the unsupported BF535 machine.
4) Applied and tested based on blackfin-arch-balance-parenthesis-in-macros.patch.

Signed-off-by: Bryan Wu <[EMAIL PROTECTED]> 
---

 arch/blackfin/Kconfig   |  173 +-
 arch/blackfin/Makefile  |1 
 arch/blackfin/kernel/Makefile   |3 
 arch/blackfin/kernel/asm-offsets.c  |2 
 arch/blackfin/kernel/bfin_dma_5xx.c |  245 +--
 arch/blackfin/kernel/bfin_gpio.c|   41 
 arch/blackfin/kernel/bfin_ksyms.c   |2 
 arch/blackfin/kernel/dma-mapping.c  |   25 
 arch/blackfin/kernel/dualcore_test.c|2 
 arch/blackfin/kernel/entry.S|6 
 arch/blackfin/kernel/flat.c |  101 +
 arch/blackfin/kernel/init_task.c|5 
 arch/blackfin/kernel/irqchip.c  |7 
 arch/blackfin/kernel/module.c   |2 
 arch/blackfin/kernel/process.c  |   56 
 arch/blackfin/kernel/ptrace.c   |   41 
 arch/blackfin/kernel/setup.c|   79 -
 arch/blackfin/kernel/signal.c   |  193 --
 arch/blackfin/kernel/sys_bfin.c |   18 
 arch/blackfin/kernel/time.c |   18 
 arch/blackfin/kernel/traps.c|  110 -
 arch/blackfin/kernel/vmlinux.lds.S  |2 
 arch/blackfin/lib/ashldi3.c |6 
 arch/blackfin/lib/ashrdi3.c |6 
 arch/blackfin/lib/checksum.c|6 
 arch/blackfin/lib/divsi3.S  |9 
 arch/blackfin/lib/gcclib.h  |2 
 arch/blackfin/lib/ins.S |2 
 arch/blackfin/lib/lshrdi3.c |6 
 arch/blackfin/lib/memchr.S  |2 
 arch/blackfin/lib/memcmp.S  |2 
 arch/blackfin/lib/memcpy.S  |7 
 arch/blackfin/lib/memmove.S |2 
 arch/blackfin/lib/memset.S  |8 
 arch/blackfin/lib/modsi3.S  |9 
 arch/blackfin/lib/muldi3.c  |6 
 arch/blackfin/lib/outs.S|2 
 arch/blackfin/lib/smulsi3_highpart.S|6 
 arch/blackfin/lib/udivsi3.S |   11 
 arch/blackfin/lib/umodsi3.S |7 
 arch/blackfin/lib/umulsi3_highpart.S|6 
 arch/blackfin/mach-bf533/boards/cm_bf533.c  |2 
 arch/blackfin/mach-bf533/boards/ezkit.c |   11 
 arch/blackfin/mach-bf533/boards/generic_board.c |2 
 arch/blackfin/mach-bf533/boards/stamp.c |   17 
 arch/blackfin/mach-bf533/cpu.c  |2 
 arch/blackfin/mach-bf533/head.S |2 
 arch/blackfin/mach-bf533/ints-priority.c|2 
 arch/blackfin/mach-bf537/boards/Makefile|9 
 arch/blackfin/mach-bf537/boards/cm_bf537.c  |   13 
 arch/blackfin/mach-bf537/boards/eth_mac.c   |   52 
 arch/blackfin/mach-bf537/boards/generic_board.c |   24 
 arch/blackfin/mach-bf537/boards/pnav10.c|   34 
 arch/blackfin/mach-bf537/boards/stamp.c |   95 -
 arch/blackfin/mach-bf537/cpu.c  |2 
 arch/blackfin/mach-bf537/head.S |3 
 arch/blackfin/mach-bf537/ints-priority.c|2 
 arch/blackfin/mach-bf561/boards/Makefile|1 
 arch/blackfin/mach-bf561/boards/cm_bf561.c  |   23 
 arch/blackfin/mach-bf561/boards/ezkit.c |   19 
 arch/blackfin/mach-bf561/boards/generic_board.c |4 
 arch/blackfin/mach-bf561/coreb.c|2 
 arch/blackfin/mach-bf561/head.S |2 
 arch/blackfin/mach-bf561/ints-priority.c|2 
 arch/blackfin/mach-common/cache.S   |4 
 arch/blackfin/mach-common/cacheinit.S   |2 
 arch/blackfin/mach-common/cplbhdlr.S|6 
 arch/blackfin/mach-common/cplbinfo.c|2 
 arch/blackfin/mach-common/cplbmgr.S |6 
 arch/blackfin/mach-common/dpmc.S|2 
 arch/blackfin/mach-common/entry.S   |   19 
 arch/blackfin/mach-common/interrupt.S   |2 
 arch/blackfin/mach-common/ints-priority-dc.c|   29 
 arch/blackfin/mach-common/ints-priority-sc.c|   28 
 arch/blackfin/mach-common/irqpanic.c|2 
 arch/blackfin/mach-common/lock.S|2 
 arch/blackfin/mach-common/pm.c  |2 
 arch/blackfin/mm/Makefile   |2 
 arch/bla

Re: [PATCH 1/2] ehea: fix for dynamic lpar support

2007-03-22 Thread Jeff Garzik

Jan-Bernd Themann wrote:

The patch fixes bugs related to the probe / remove adapter
functionality (handling of OFDT nodes)

Signed-off-by: Jan-Bernd Themann <[EMAIL PROTECTED]>


applied 1-2




Re: controlling mmap()'d vs read/write() pages

2007-03-22 Thread Nick Piggin

Eric W. Biederman wrote:

Dave Hansen <[EMAIL PROTECTED]> writes:



So, I think we have a difference of opinion.  I think it's _all_ about
memory pressure, and you think it is _not_ about accounting for memory
pressure. :)  Perhaps we mean different things, but we appear to
disagree greatly on the surface.



I think it is about preventing a badly behaved container from having a
significant effect on the rest of the system, and in particular other
containers on the system.


That's Dave's point, I believe. Limiting mapped memory may be
mostly OK for well behaved applications, but it doesn't do anything
to stop bad ones from effectively DoSing the system or ruining any
guarantees you might proclaim (not that hard guarantees are always
possible without using virtualisation anyway).

This is why I'm surprised at efforts that go to such great lengths
to get accounting "just right" (but only for mmaped memory). You
may as well not even bother, IMO.

Give me an RSS limit big enough to run a couple of system calls and
a loop...

--
SUSE Labs, Novell Inc.




Re: RSDL v0.31

2007-03-22 Thread Con Kolivas
On Friday 23 March 2007 15:39, Mike Galbraith wrote:
> On Fri, 2007-03-23 at 09:50 +1100, Con Kolivas wrote:
> > Now to figure out some meaningful cheap way of improving this accounting.
>
> The accounting is easy iff tick resolution is good enough, the deadline
> mechanism is harder.  I did the "quota follows task" thing, but nothing
> good happens.  That just ensured that the deadline mechanism kicks in
> constantly because tick theft is a fact of tick-based life.  A
> reasonable fudge factor would help, but...
>
> I see problems wrt with trying to implement the deadline mechanism.
>
> As implemented, it can't identify who is doing the stealing (which
> happens constantly, even if userland if 100% hog) because of tick
> resolution accounting.  If you can't identify the culprit, you can't
> enforce the quota, and quotas which are not enforced are, strictly
> speaking, not quotas.  At tick time, you can only close the barn door
> after the cow has been stolen, and the thief can theoretically visit
> your barn an infinite number of times while you aren't watching the
> door.  ("don't blink" scenarios, and tick is backward-assward blink)
>
> You can count nanoseconds in schedule, and store the actual usage, but
> then you still have the problem of inaccuracies in sched_clock() from
> cross-cpu wakeup and migration.  Cross-cpu wakeups happen quite a lot.
> If sched_clock() _were_ absolutely accurate, you wouldn't need the
> runqueue deadline mechanism, because at slice tick time you can see
> everything you will ever see without moving enforcement directly into
> the most critical of paths.
>
> IMHO, unless it can be demonstrated that timeslice theft is a problem
> with a real-life scenario, you'd be better off dropping the queue
> ticking.  Time slices are a deadline mechanism, and in practice the god
> of randomness ensures that even fast movers do get caught often enough
> to make ticking tasks sufficient.
>
> (that was a very long-winded reply to one sentence because I spent a lot
> of time looking into this very subject and came to the conclusion that
> you can't get there from here.  fwiw, ymmv and all that of course;)
>
> > Thanks again!
>
> You're welcome.

The deadline mechanism is easy to hit and works. Try printk'ing it. There is 
some leeway to take tick accounting into the equation and I don't believe 
nanosecond resolution is required at all for this (how much leeway would you 
give then ;)). Eventually there is nothing to stop us using highres timers 
(blessed if they work as planned everywhere eventually) to do the events and 
do away with scheduler_tick entirely. For now ticks work fine; a reasonable 
estimate for smp migration will suffice (patch forthcoming).

-- 
-ck


Re: [PATCH 2.6.21 5/4] cxgb3 - fix white spaces in drivers/net/Kconfig

2007-03-22 Thread Jeff Garzik

[EMAIL PROTECTED] wrote:

From: Divy Le Ray <[EMAIL PROTECTED]>

Use tabs instead of white spaces for CHELSIO_T3 entry.

Signed-off-by: Divy Le Ray <[EMAIL PROTECTED]>


applied




Re: [PATCH 2.6.21 1/4] cxgb3 - fix ethtool cmd on multiple queues port

2007-03-22 Thread Jeff Garzik

[EMAIL PROTECTED] wrote:

From: Divy Le Ray <[EMAIL PROTECTED]>

Limit ethtool -g/-G to the given port's queues.

Signed-off-by: Divy Le Ray <[EMAIL PROTECTED]>


applied 1-4




Re: [PATCH] mv643xx_eth: add mv643xx_eth_shutdown function

2007-03-22 Thread Jeff Garzik

Dale Farnsworth wrote:

From: Dale Farnsworth <[EMAIL PROTECTED]>

mv643xx_eth_shutdown is needed for kexec.

Signed-off-by: Dale Farnsworth <[EMAIL PROTECTED]>

---
 drivers/net/mv643xx_eth.c |   14 ++
 1 file changed, 14 insertions(+)


applied




[PATCH -mm try#2] Blackfin: on-chip Two Wire Interface I2C driver

2007-03-22 Thread Wu, Bryan
Hi folks,

Changelog:

a) Fixed issues according to Jean's review.
b) Added MAINTAINERS information.
c) Added I2C_HW_B_BLACKFIN to i2c-id.h.

[PATCH] Blackfin: on-chip Two Wire Interface I2C driver

The i2c linux driver for blackfin architecture which supports blackfin
on-chip TWI controller i2c operation.

Signed-off-by: Bryan Wu <[EMAIL PROTECTED]>
Reviewed-by: Andrew Morton <[EMAIL PROTECTED]>
Reviewed-by: Alexey Dobriyan <[EMAIL PROTECTED]>
Reviewed-by: Jean Delvare <[EMAIL PROTECTED]>
Cc: David Brownell <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 MAINTAINERS   |7 
 drivers/i2c/busses/Kconfig|   16 
 drivers/i2c/busses/Makefile   |1 
 drivers/i2c/busses/i2c-bfin-twi.c |  644 ++
 include/linux/i2c-id.h|1 
 5 files changed, 669 insertions(+)

Index: linux-2.6/drivers/i2c/busses/Kconfig
===
--- linux-2.6.orig/drivers/i2c/busses/Kconfig
+++ linux-2.6/drivers/i2c/busses/Kconfig
@@ -91,6 +91,22 @@
  This driver can also be built as a module.  If so, the module
  will be called i2c-au1550.
 
+config I2C_BLACKFIN_TWI
+   tristate "Blackfin TWI I2C support"
+   depends on I2C && (BF534 || BF536 || BF537)
+   help
+ This is the TWI I2C device driver for Blackfin 534/536/537.
+ This driver can also be built as a module.  If so, the module
+ will be called i2c-bfin-twi.
+
+config I2C_BLACKFIN_TWI_CLK_KHZ
+   int "Blackfin TWI I2C clock (kHz)"
+   depends on I2C_BLACKFIN_TWI
+   range 10 400
+   default 50
+   help
 The TWI clock is specified in kHz.
+
 config I2C_ELEKTOR
tristate "Elektor ISA card"
depends on I2C && ISA && BROKEN_ON_SMP
Index: linux-2.6/drivers/i2c/busses/Makefile
===
--- linux-2.6.orig/drivers/i2c/busses/Makefile
+++ linux-2.6/drivers/i2c/busses/Makefile
@@ -10,6 +10,7 @@
 obj-$(CONFIG_I2C_AMD8111)  += i2c-amd8111.o
 obj-$(CONFIG_I2C_AT91) += i2c-at91.o
 obj-$(CONFIG_I2C_AU1550)   += i2c-au1550.o
+obj-$(CONFIG_I2C_BLACKFIN_TWI) += i2c-bfin-twi.o
 obj-$(CONFIG_I2C_ELEKTOR)  += i2c-elektor.o
 obj-$(CONFIG_I2C_HYDRA)+= i2c-hydra.o
 obj-$(CONFIG_I2C_I801) += i2c-i801.o
Index: linux-2.6/drivers/i2c/busses/i2c-bfin-twi.c
===
--- /dev/null
+++ linux-2.6/drivers/i2c/busses/i2c-bfin-twi.c
@@ -0,0 +1,644 @@
+/*
+ * drivers/i2c/busses/i2c-bfin-twi.c
+ *
+ * Description: Driver for Blackfin Two Wire Interface
+ *
+ * Author:  sonicz  <[EMAIL PROTECTED]>
+ *
+ * Copyright (c) 2005-2007 Analog Devices, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#define POLL_TIMEOUT   (2 * HZ)
+
+/* SMBus mode */
+#define TWI_I2C_MODE_STANDARD  0x01
+#define TWI_I2C_MODE_STANDARDSUB   0x02
+#define TWI_I2C_MODE_COMBINED  0x04
+
+struct bfin_twi_iface {
+   struct mutextwi_lock;
+   int irq;
+   spinlock_t  lock;
+   charread_write;
+   u8  command;
+   u8  *transPtr;
+   int readNum;
+   int writeNum;
+   int cur_mode;
+   int manual_stop;
+   int result;
+   int timeout_count;
+   struct timer_list   timeout_timer;
+   struct i2c_adapter  adap;
+   struct completion   complete;
+};
+
+static struct bfin_twi_iface twi_iface;
+
+static void bfin_twi_handle_interrupt(struct bfin_twi_iface *iface)
+{
+   unsigned short twi_int_status = bfin_read_TWI_INT_STAT();
+   unsigned short mast_stat = bfin_read_TWI_MASTER_STAT();
+
+   if (twi_int_status & XMTSERV) {
+   /* Transmit next data */
+   if (iface->writeNum > 0) {
+   bfin_write_TWI_XMT_DATA8(*(iface->transPtr++));
+   iface->

Re: sysfs q [was: sysfs ugly timer interface]

2007-03-22 Thread Greg KH
On Fri, Mar 23, 2007 at 02:24:46AM +0100, Jan Engelhardt wrote:
> 
> On Mar 22 2007 08:28, Greg KH wrote:
> >> 
> >> > [EMAIL PROTECTED]:/home/maxim# cat 
> >> > /sys/devices/system/clockevents/clockevents0/registered
> >> > lapicF:0007 M:3(periodic) C: 1
> >> > hpet F:0003 M:1(shutdown) C: 0
> >> > lapicF:0007 M:3(periodic) C: 0
> >> > [EMAIL PROTECTED]:/home/maxim#   
> >> 
> >> Now... this file needs to die, before 2.6.21 is released. It tries to
> >> bring /proc-like parsing nightmare to sysfs. Kill it before it becomes
> >> part of stable ABI!
> >
> >Eeek!
> 
> Question regarding sysfs files: How would you do something like
> /proc/net/nf_conntrack with sysfs? Have directories named like , 
> 0001, 0002, ..?

I don't know, I've never said that _all_ proc files can move to sysfs.
For some things, like possibly the netfilter stuff, proc files make more
sense.

Were you thinking of moving this file to sysfs?  What does the
information in it represent?

thanks,

greg k-h


Re: 2.6.21-rc[123] regression with NOAPIC

2007-03-22 Thread Ray Lee
Thomas Gleixner wrote:
> On Thu, 2007-03-22 at 15:16 +0100, Adrian Bunk wrote:
 Does it work if you do _not_ revert the commits, and instead replace in
 drivers/acpi/processor_idle.c the
   #ifdef ARCH_APICTIMER_STOPS_ON_C3
 with an
   #if 0
 ?
>>> Then NOAPIC probably works again, but booting w/o NOAPIC fails.
>> But we'll know that it's this code that has a problen with noapic
>> in the CONFIG_GENERIC_CLOCKEVENTS=n case.
> 
> Nope. This code does not have a problem. It causes a problem elsewhere:

I can still try the above if it ends up being a useful data point.

> 
> It calls switch_ipi_to_APIC_timer() or switch_APIC_timer_to_ipi(), which
> sets/clears a bit in the broadcast mask and enables / disables the local
> APIC timer.
> 
> I don't see right now, why this causes the box to lock up hard, but
> maybe the debug printk's below give us some hint.
> 
>   tglx
> 
> diff --git a/arch/x86_64/kernel/apic.c b/arch/x86_64/kernel/apic.c
> index 723417d..29376e2 100644
> --- a/arch/x86_64/kernel/apic.c
> +++ b/arch/x86_64/kernel/apic.c
> @@ -886,6 +886,8 @@ void disable_APIC_timer(void)
>   if (using_apic_timer) {
>   unsigned long v;
>  
> + printk("Disabling local APIC timer %d\n", apic_runs_main_timer);
> +
>   v = apic_read(APIC_LVTT);
>   /*
>* When an illegal vector value (0-15) is written to an LVT
> @@ -910,6 +912,7 @@ void enable_APIC_timer(void)
>   !cpu_isset(cpu, timer_interrupt_broadcast_ipi_mask)) {
>   unsigned long v;
>  
> + printk("Enabling local APIC timer: %d\n", apic_runs_main_timer);
>   v = apic_read(APIC_LVTT);
>   apic_write(APIC_LVTT, v & ~APIC_LVT_MASKED);
>   }
> @@ -934,6 +937,7 @@ void smp_send_timer_broadcast_ipi(void)
>  
>   cpus_and(mask, cpu_online_map, timer_interrupt_broadcast_ipi_mask);
>   if (!cpus_empty(mask)) {
> + printk("Send IPI\n");
>   send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
>   }
>  }
> 
> 

I didn't see the first two print, but I'm having to watch the bad
bootups (with NOAPIC) by eyesight alone, as I don't have a second system
to run netconsole on at the moment.

However, on the NOAPIC, locking boot, the last thing that prints out is
the final printk, Send IPI.

On the boots without NOAPIC, at the same spot roughly a thousand
(estimated) "Send IPI" messages hit the screen before transitioning to
the initramfs and continuing normally.

In the morning, I can rework the patch to set a global in the first two
cases (Disabling/Enabling local APIC timer), and print the result of
those in the last case, as we know the system will hang there. (I would
have done this before sending the message, but given our timezone
difference, figured this was a good start.)

Ray


Re: 2.6.21-rc4-rt0-kdump (was: Re: [patch] setup_boot_APIC_clock() irq-enable fix)

2007-03-22 Thread Vivek Goyal
On Thu, Mar 22, 2007 at 02:27:25PM +0100, Michal Piotrowski wrote:
> Michal Piotrowski napisał(a):
> > On 22/03/07, Ingo Molnar <[EMAIL PROTECTED]> wrote:
> >>
> >> * Michal Piotrowski <[EMAIL PROTECTED]> wrote:
> >>
> >> > Hi Ingo,
> >>
> >> > 2.6.21-rc4-rt0
> >>
> >> > BUG: at kernel/fork.c:1033 copy_process()
> >>
> >> thanks Michal - this is a real bug that affects upstream too. Find the
> >> fix below - i've test-booted it and it fixes the warning.
> > 
> > Problem is fixed, thanks.
> 
> BTW. It seems that nobody uses -rt as a crash dump kernel ;)
> 
> BUG: unable to handle kernel paging request at virtual address f7ebf8c4
>  printing eip:
> c1610192
> *pde = 
> stopped custom tracer.
> Oops:  [#1]
> PREEMPT 
> Modules linked in:
> CPU:0
> EIP:0060:[]Not tainted VLI
> EFLAGS: 00010206   (2.6.21-rc4-rt0-kdump #3)
> EIP is at copy_oldmem_page+0x4a/0xd0
> eax: 08c4   ebx: f7ebf000   ecx: 0100   edx: 0246
> esi: f7ebf8c4   edi: c4c520fc   ebp: c4d54e30   esp: c4d54e18
> ds: 007b   es: 007b   fs: 00d8  gs:   ss: 0068  preempt:0001
> Process swapper (pid: 1, ti=c4d54000 task=c4d52c20 task.ti=c4d54000)
> Stack: c17ab7e0 c183f982 c1969658 0400 0400 00037ebf c4d54e5c 
> c16af187 
>00037ebf c4c520fc 0400 08c4   c4c696e0 
> 0400 
>c4c520fc c4d54f94 c19a9cfd c4c520fc 0400 c4d54f78  
> c1840996 
> Call Trace:
>  [] read_from_oldmem+0x73/0x98
>  [] vmcore_init+0x26c/0xab7
>  [] init+0xaa/0x287
>  [] kernel_thread_helper+0x7/0x10
>  ===
> 
> l *copy_oldmem_page+0x4a/0xd0
> 0xc1610148 is in copy_oldmem_page (arch/i386/kernel/crash_dump.c:35).
> 30   * copying the data to a pre-allocated kernel page and then copying 
> to user
> 31   * space in non-atomic context.
> 32   */
> 33  ssize_t copy_oldmem_page(unsigned long pfn, char *buf,
> 34 size_t csize, unsigned long offset, 
> int userbuf)
> 35  {
> 36  void  *vaddr;
> 37
> 38  if (!csize)
> 39  return 0;
> 

Can you please paste the disassembly of copy_oldmem_page() on your system?
Not sure from where this faulting address 0xf7ebf8c4 is coming. We are still
in vmcore_init(), so we should be copying the data to kernel buffers only.
This looks like a valid kernel address.

Can you also put some printk()s here to find out where 0xf7ebf8c4 has
come from? It does not look like a fixed kernel virtual address returned by
kmap_atomic_pfn(). Is it then passed by the kernel as a parameter to
copy_oldmem_page()?

Thanks
Vivek


Re: [PATCH][2/2] double stack limit (rfc)

2007-03-22 Thread KAMEZAWA Hiroyuki
On Thu, 22 Mar 2007 21:56:03 -0700
"Tony Luck" <[EMAIL PROTECTED]> wrote:

> On 3/22/07, KAMEZAWA Hiroyuki <[EMAIL PROTECTED]> wrote:
> > I hear some people says that "When I set stack-size-limit to 32M,
> > I want to use 32M of memory stack..." and register-stack expansion can
> > fail because stack is used up by memory-stack.
> 
> An interesting dilemma.  If you apply this patch though, you might
> get someone complain that they set the stack limit to 32M, but
> execution continued as the program ran all the way to 64M!
> 
Yes, it consumes twice the memory in the worst case.

> Possibly you might argue that each of the memory stack and the
> RBS stack should be allowed to grow to the stacklimit ... in which
> case you'd need a more invasive patch that made separate vma
> for each of the stack and the RBS stack, and checked at fault
> time each would be allowed to grow to the stack limit. But I'm
> not sure that I like that ... ia64 happens to split different objects
> in the stack between the RBS and the memory stack depending
> on whether they happen to be allocated by the compiler to
> stack registers (r32-r127) or to actual memory locations.  Both
> types of allocation contribute to the total "stack" size of the
> process, so the existing behaviour of keeping the sum of the
> size of the RBS stack and the memory stack below the
> stack limit seems quite reasonable. 

I explained the same thing to my customers ;). I posted this as an RFC.
I'd like to hear other opinions, too.

-Kame
Note: [1/2] patch is just a bug fix. sorry for mixing.



Re: [PATCH 1/1] cpusets/sched_domain reconciliation

2007-03-22 Thread Paul Jackson
Andrew wrote:
> It isn't very nice.  It probably won't crash, but it _can_ crash and when

I guess I got lucky, Cliff, when I snuck in the recursion in the
other cpuset.c routines that you were using as an example here ;).

Since the current kernel/cpuset.c recursion seems only to be in code
paths for cpu hot unplug, it's not surprising that no one has actually
hit it enough times so far to motivate them to hunt me down and
exterminate me.  The intersection of the world's heavy cpuset users
with the world's heavy cpu unpluggers is a very small set indeed.

Maybe we should do something about this before that set of people grows
to include someone with violent tendencies.

I suppose it would work to set a hard-coded limit on how deep one
could make the cpuset hierarchy, and perhaps provide a kernel tunable
for those rare cases where someone needs more than this limit.

We could avoid the recursion here by converting it to its iterative
equivalent.  This equivalent would have to keep track, in a dynamically
allocated vector, of the cpuset pointers being worked, and if it got to the
end of that vector, reallocate a longer one.

It's not that much more code - and it's a fairly simple transformation
of a simple recursion on one variable to an iteration using a vector
of that variable.

Since the iterative code, using a dynamically sized vector, probably
doesn't add much more kernel text than the code to limit the depth and
provide for a kernel tunable to tweak the limit, and since the iterative
approach avoids imposing some arbitrary small limit on the user visible
API for what is really a small corner case, the iterative approach seems
like the better idea.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401


Re: [PATCH][2/2] double stack limit (rfc)

2007-03-22 Thread Tony Luck

On 3/22/07, KAMEZAWA Hiroyuki <[EMAIL PROTECTED]> wrote:

I hear some people says that "When I set stack-size-limit to 32M,
I want to use 32M of memory stack..." and register-stack expansion can
fail because stack is used up by memory-stack.


An interesting dilemma.  If you apply this patch though, you might
get someone complain that they set the stack limit to 32M, but
execution continued as the program ran all the way to 64M!

Possibly you might argue that each of the memory stack and the
RBS stack should be allowed to grow to the stacklimit ... in which
case you'd need a more invasive patch that made separate vma
for each of the stack and the RBS stack, and checked at fault
time each would be allowed to grow to the stack limit.  But I'm
not sure that I like that ... ia64 happens to split different objects
in the stack between the RBS and the memory stack depending
on whether they happen to be allocated by the compiler to
stack registers (r32-r127) or to actual memory locations.  Both
types of allocation contribute to the total "stack" size of the
process, so the existing behaviour of keeping the sum of the
size of the RBS stack and the memory stack below the
stack limit seems quite reasonable.  I'm willing to listen to
opposing arguments though.

-Tony


Re: kmalloc() with size zero

2007-03-22 Thread Vignesh Babu BM
On Fri, 2007-03-23 at 07:08 +0530, Jan Engelhardt wrote:
> 
> On Mar 22 2007 16:18, Stephane Eranian wrote:
> >
> I'd say "feature", glibc's malloc also returns an address on
> malloc(0).
> 
This is implementation-defined; the standard allows returning either a
null pointer or a valid address.
> 
> Jan
> --
-- 
Regards,  
Vignesh Babu BM  
_  
"Why is it that every time I'm with you, makes me believe in magic?"


Re: RSDL v0.31

2007-03-22 Thread Mike Galbraith
On Fri, 2007-03-23 at 09:50 +1100, Con Kolivas wrote:

> Now to figure out some meaningful cheap way of improving this accounting.

The accounting is easy iff tick resolution is good enough, the deadline
mechanism is harder.  I did the "quota follows task" thing, but nothing
good happens.  That just ensured that the deadline mechanism kicks in
constantly because tick theft is a fact of tick-based life.  A
reasonable fudge factor would help, but...

I see problems with trying to implement the deadline mechanism.

As implemented, it can't identify who is doing the stealing (which
happens constantly, even if userland is a 100% hog) because of tick
resolution accounting.  If you can't identify the culprit, you can't
enforce the quota, and quotas which are not enforced are, strictly
speaking, not quotas.  At tick time, you can only close the barn door
after the cow has been stolen, and the thief can theoretically visit
your barn an infinite number of times while you aren't watching the
door.  ("don't blink" scenarios, and tick is backward-assward blink)

You can count nanoseconds in schedule, and store the actual usage, but
then you still have the problem of inaccuracies in sched_clock() from
cross-cpu wakeup and migration.  Cross-cpu wakeups happen quite a lot.
If sched_clock() _were_ absolutely accurate, you wouldn't need the
runqueue deadline mechanism, because at slice tick time you can see
everything you will ever see without moving enforcement directly into
the most critical of paths.

IMHO, unless it can be demonstrated that timeslice theft is a problem
with a real-life scenario, you'd be better off dropping the queue
ticking.  Time slices are a deadline mechanism, and in practice the god
of randomness ensures that even fast movers do get caught often enough
to make ticking tasks sufficient.

(that was a very long-winded reply to one sentence because I spent a lot
of time looking into this very subject and came to the conclusion that
you can't get there from here.  fwiw, ymmv and all that of course;)

> Thanks again!

You're welcome.

-Mike



Re: [PATCH 0/21] MSI rework

2007-03-22 Thread Michael Ellerman
On Thu, 2007-03-22 at 15:08 -0700, Greg KH wrote:
> On Fri, Mar 23, 2007 at 09:02:16AM +1100, Benjamin Herrenschmidt wrote:
> > > > i.e.  First the simple bug fixes that should purely be restructure of
> > > > msi.c with no affect on anything outside of it.
> > > > 
> > > > And then get into the architecture enhancements.
> > > 
> > > I agree, care to break these down into a smaller series of patches that
> > > can go into -mm for testing?
> > 
> > I don't see the point in breaking the series... you can bisect half way
> > through if necessary... it's made of small patches that are done, afaik,
> > in such a way that the whole thing should still work at any point in the
> > series.
> > 
> > The series just expresses the dependency between them.
> 
> Ok, then which patches in the series should be acceptable to take right
> now for 2.6.22?  The "clean up the BUG" ones?

The series is already very fine-grained; I don't think I can split most of
the patches any further without producing an unbuildable kernel.

I think 1 up to and including 11 are safe as houses, they shouldn't have
any effect other than to clean up the code.

The rest make functional changes, but they're all quite small, self
contained, and easily bisectable. I'd certainly like Eric to have a look
at them, but at some point I think we're just going to have to bite the
bullet and merge them, and see what we get in the way of bug reports.

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person




[PATCH][2/2] double stack limit (rfc)

2007-03-22 Thread KAMEZAWA Hiroyuki
Now, ia64's hard stack size (rlimit.max) is the sum of the register-stack
size and the memory-stack size, but the soft stack size (rlimit.cur) is not
accounted as a sum; the two are accounted independently. This is inconsistent.

I hear some people say: "When I set the stack-size limit to 32M,
I want to use 32M of memory stack...", and register-stack expansion can
fail because the stack is used up by the memory stack.

This patch moves the register stack's base address lower.
With this patch, the meaning of the hard stack size can match the soft
stack size, and we avoid the case where the register backing store cannot
be expanded because the memory stack uses the whole stack.

How about this ?

Signed-Off-By: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc4/arch/ia64/mm/init.c
===
--- linux-2.6.21-rc4.orig/arch/ia64/mm/init.c
+++ linux-2.6.21-rc4/arch/ia64/mm/init.c
@@ -152,7 +152,7 @@ inline void
 ia64_set_rbs_bot (void)
 {
unsigned long stack_size = current->signal->rlim[RLIMIT_STACK].rlim_max 
& -16;
-
+   stack_size *= 2;
if (stack_size > MAX_USER_STACK_SIZE)
stack_size = MAX_USER_STACK_SIZE;
current->thread.rbs_bot = PAGE_ALIGN(current->mm->start_stack - 
stack_size);



[PATCH][ia64][1/2] bugfix stack layout upside-down

2007-03-22 Thread KAMEZAWA Hiroyuki
ia64 expects following vm layout
==
[register-stack]
[memory-stack]
==
But when ulimit -s is used and stack-base-address randomization is active,
the VM layout is sometimes the following:
==
[memory-stack]
[register-stack]
==

If this happens, the register stack cannot be expanded.

This patch fixes this bug by adjusting the register-stack base.

Signed-Off-By: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc4/arch/ia64/mm/init.c
===
--- linux-2.6.21-rc4.orig/arch/ia64/mm/init.c
+++ linux-2.6.21-rc4/arch/ia64/mm/init.c
@@ -155,7 +155,7 @@ ia64_set_rbs_bot (void)
 
if (stack_size > MAX_USER_STACK_SIZE)
stack_size = MAX_USER_STACK_SIZE;
-   current->thread.rbs_bot = STACK_TOP - stack_size;
+   current->thread.rbs_bot = PAGE_ALIGN(current->mm->start_stack - 
stack_size);
 }
 
 /*



Re: [PATCH 17/21] MSI: Clear the irq_desc's msi pointer on free

2007-03-22 Thread Eric W. Biederman
Michael Ellerman <[EMAIL PROTECTED]> writes:

> On Thu, 2007-03-22 at 08:23 -0600, Eric W. Biederman wrote:
>> Michael Ellerman <[EMAIL PROTECTED]> writes:
>> 
>> > Currently we never clear the msi_desc pointer in the irq_desc. This
>> > leaves us with a pointer to free'ed memory hanging around. No one seems
>> > to have hit this, so presumably other parts of the code are protecting
>> > us from ever using the stale pointer .. or we're just lucky, we should
>> > still clear it.
>> 
>> Hmm.  Maybe.  Currently this is done in dynamic_irq_cleanup,
>> at least for everything except sparc64.
>
> OK, I missed that. I still think we should do it here, otherwise there's
> a window, however small, where the msi_desc pointer is pointing at freed
> memory.

After following the code through: the current cleanup happens before the
point you are proposing, and in fact the irq is returned to the set of
irqs that can be allocated before you call set_irq_msi(irq, NULL).

Therefore you are doing this too late, and we need to ensure the
architecture code does it in arch_teardown_msi_irq.

Eric


Re: Possible Bug in mincore or mmap

2007-03-22 Thread Bruce Dubbs
Nick Piggin wrote:
> Bruce Dubbs wrote:
>> When testing an installation with tests from the Linux Test Project, my
>> kernels fail one instance of the mincore01 tests:
>>
>> mincoremincore011  PASS  :  expected failure: errno = 22 (Invalid
>> argument)
>> mincore012  PASS  :  expected failure: errno = 14 (Bad address)
>> mincore013  FAIL  :  call succeeded unexpectedly
>> mincore014  PASS  :  expected failure: errno = 12 (Cannot allocate
>> memory)011  PASS  :  expected failure: errno = 22 (Invalid argument)
>> mincore012  PASS  :  expected failure: errno = 14 (Bad address)
>> mincore013  FAIL  :  call succeeded unexpectedly
>> mincore014  PASS  :  expected failure: errno = 12 (Cannot allocate
>> memory)
>>
>> I pared down the test to the attached program.  The result is supposed
>> to fail as it is asking for memory information 5 times what should be
>> allocated.
>>
>> Upon experimenting, I found the test works properly if a printf is
>> executed before the mmap call.  I have tested on locally built, but
>> unmodified, 2.4.25, 2.6.12.5, and a 2.6.20.3 kernels and get the same
>> behavior.  The tests fail on IA32 architecture, but not 64-bit kernels.
>>  The test always works properly on FC6 and RHEL3.
>>
>> I've checked the archives for this issue and could not find anything
>> appropriate.
>>
>> Could this be a potential security issue as memory that is not supposed
>> to be accessible seems to be available to the user?  Is it expected
>> behavior?
> 
> It shouldn't be a security problem if mincore doesn't actually
> return the data.

Thanks for the response.  It may be interesting to note that adding:

buf = (char*)global_pointer + 2 * global_len;
i = *buf;

after the mincore call does not fault. Changing the second line above to
*buf = 1; gives a segmentation fault, as you would expect.

At a minimum, it appears mmap is allowing read access beyond the
allocated address space in some circumstances.

Upon thinking about it, it may be that the test is invalid.  I don't
believe there is anything tying the mincore query to the memory region
allocated by mmap.  If memory mapping occurs beyond the mmap requested
memory size to anticipate another memory request, mincore wouldn't fail.

Does this make any sense?




>> 
>>
>> #include 
>> #include 
>> #include 
>> #include 
>>
>> static int   PAGESIZE;
>> static char  file_name[]= "fooXX";
>> static void* global_pointer = NULL;
>> static int   global_len = 0;
>> static int   file_desc  = 0;
>>
>> int main(int argc, char **argv)
>> {
>> int i;
>> int result;
>> char*   buf;
>> unsigned char   vect[20] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
>>
>> PAGESIZE = getpagesize();
>> /* global_pointer will point to a mmapped area of global_len
>> bytes */
>> global_len = PAGESIZE*2;
>> buf = (char*)malloc(global_len);
>> memset(buf, 42, global_len);  // Asterisks
>> /* create a temporary file */
>> file_desc = mkstemp(file_name);
>> /* fill the temporary file with two pages of data */
>> write(file_desc, buf, global_len);
>> free(buf);
>> // Will work properly as long as print is before mmap function.
>> if ( argc > 1 ) printf("argc=%d\n", argc);
>>
>> /* map the file in memory */
>> global_pointer = mmap( NULL, global_len,
>> PROT_READ|PROT_WRITE|PROT_EXEC, MAP_SHARED, file_desc, 0);
>>
>> // Result should be -1 as the request is 5 times actual mapping
>> result = mincore(global_pointer, (size_t)(global_len*5), vect);
>>
>> // Print some data
>> printf("PAGESIZE=%d\n", PAGESIZE);
>> printf("global_len=%d\n", global_len);
>> printf("global_pointer=0x%x\n", (unsigned int)global_pointer);
>> printf("alloc=%d\n", (global_len+PAGESIZE-1) / PAGESIZE );
>> printf("Result=%d\n", result);
>> printf("vect: ");
>>
>> for ( i=0; i<20; i++) printf("%02x ", vect[i]);
>> printf("\n");
>> // Clean up
>> munmap(global_pointer, (size_t)global_len);
>> close(file_desc);
>> unlink(file_name);
>> }
> 
> 



Re: RSDL 0.31 causes slowdown

2007-03-22 Thread Con Kolivas

On 23/03/07, Tim Chen <[EMAIL PROTECTED]> wrote:

Con,

I've tried running Volanomark and found an 80% regression
with the RSDL 0.31 scheduler on 2.6.21-rc4 on a 2-socket Core 2 quad CPU
system (4 cpus per socket, 8 cpus for the system).

The results are sensitive to rr_interval. Using Con's patch to increase
rr_interval to a large value of 100, the regression was reduced to 30%
instead of 80%.

I ran Volanomark in loopback mode with 10 chatrooms
(20 clients per chatroom) configuration, with each client sending
out 1 messages.

http://www.volano.com/benchmarks.html

There are significant differences in the vmstat runqueue profile
between the 2.6.21-rc4 and the one with RSDL.

There are far fewer runnable jobs (see column 2) with RSDL 0.31 (rr_interval=15),
and higher idle time.


Thanks Tim.

Volanomark is a purely yield()-semantics-dependent workload (as
discussed many times previously). In the earlier form of RSDL I
softened the effect of sched_yield, but other changes since then have
made that softness border on a noop. Obviously when sched_yield is
relied upon, that will not be enough. Extending the rr interval simply
makes the yield slightly more effective and is not the proper
workaround. Since expiration of arrays is a regular, frequent
occurrence in RSDL, changing the yield semantics back to expiration
should cause a massive improvement in these values, without making the
yields as long as in mainline. It's impossible to know exactly what
the final result will be, since Java uses this timing-sensitive yield
for locking, but we can improve it drastically from this. I'll make a
patch soon to change yield again.

--
-ck


Re: [patch] 2.6.21-rc4 nicksched v32

2007-03-22 Thread Nick Piggin

Al Boldi wrote:

Nick Piggin wrote:



Timeslices get scaled by /proc/sys/kernel/base_timeslice. If you have bad
interactivity you could try lowering this and see if it helps.



As I am on 2.6.20, I can't really test this patch, but I tried your
scheduler from PlugSched and it's not bad.


OK, the one in plugsched is reasonably different...


I'm getting strange numbers with chew.c, though.  Try this:
Boot into /bin/sh
Run ./chew on one console.
Run ./chew on another console.
Watch latencies.

Console 1:
pid 671, prio   0, out for  799 ms, ran for  800 ms, load  50%
pid 671, prio   0, out for  799 ms, ran for  801 ms, load  50%
pid 671, prio   0, out for  799 ms, ran for  799 ms, load  49%
pid 671, prio   0, out for  800 ms, ran for  800 ms, load  49%

Console 2:
pid 672, prio   0, out for  800 ms, ran for  799 ms, load  49%
pid 672, prio   0, out for  799 ms, ran for  800 ms, load  50%
pid 672, prio   0, out for  799 ms, ran for  800 ms, load  50%
pid 672, prio   0, out for  799 ms, ran for  799 ms, load  49%
pid 672, prio   0, out for  799 ms, ran for  799 ms, load  49%

Looks good, but now add a cpu-hog, ie. while :; do :; done

Console 1:

pid 671, prio   0, out for  799 ms, ran for  399 ms, load  33%
pid 671, prio   0, out for  799 ms, ran for  399 ms, load  33%
pid 671, prio   0, out for  799 ms, ran for  399 ms, load  33%
pid 671, prio   0, out for  799 ms, ran for  399 ms, load  33%

Console 2:
pid 672, prio   0, out for 1599 ms, ran for  799 ms, load  33%
pid 672, prio   0, out for 1599 ms, ran for  799 ms, load  33%
pid 672, prio   0, out for 1599 ms, ran for  800 ms, load  33%
pid 672, prio   0, out for 1599 ms, ran for  799 ms, load  33%

It's smooth, but latencies are doubled on console 2.  Easy enough to fix, 
though: just press Scroll Lock to induce a sleep, then release it.


Yeah, this issue is since fixed in v32.


Also, latencies are huge.  Is there a way to fix latencies per nice?


Latencies should be quite a bit lower in v32. You can lower them further
with /proc/sys/kernel/base_timeslice; however, I would like to see
whether the current setting gives reasonable interactivity.

Thanks,
Nick

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 




Re: CIFS: reset ATTR_READONLY on Windows

2007-03-22 Thread Steve French
> When all write permission bits are removed from a file on 
> the Windows share (e.g. using chmod ugo-w filename), the 
> read-only bit is set in the file (as seen in the Windows Explorer).
> Now it is not possible to get the write permissions back (e.g. using
> chmod ugo+w filename). The read-only
> bit has to be removed in the Windows Explorer. When this is done, 
> the file is still displayed to be read-only (ls -l filename), 
> but it can be made writeable (chmod ugo+w filename). When the 
> patch from the message  mentioned above is applied (discovered 
> by Alan Tyson of HP, reported by Jeff Layton), the writeable bits 
> are displayed correctly as soon as the read-only bit is
> removed in the Windows Explorer. The main problem,  
> however, still exists as it is not possible to make the 
> file writeable from the Linux CIFS client.

> Investigating the problem further, I found out that the problem occurs  
> only if no other attribute is set on the file. If I set the archived or  
> hidden bit in the Windows Explorer or unmark the "for fast searching,  
> allow Indexing Service to index this file" checkbox (which sets the  
> ATTR_NOT_CONTENT_INDEXED bit), the read-only bit can be cleared from the  
> Linux client. Looking into the function cifs_setattr() in fs/cifs/inode.c
> and adding some debug output, it can be seen that time_buf.Attributes is  
> zero in the problem case and the attribute will not be set. Sending the  
> attributes with a zero value does not help as this seems to mean that no  
> attributes have to be changed. So I added the ATTR_NORMAL bit in the case  
> where ATTR_READONLY has to be cleared and no other attribute is set.

Urs,
Yes - you were correct.  Good catch.  Thanks. The patch I put in the cifs
tree was implemented slightly differently (I did not read the 2nd half of
your email until too late) but is a similar idea.   Would you take a look
at it?

http://master.kernel.org/git/?p=linux/kernel/git/sfrench/cifs-2.6.git;a=commitdiff;h=066fcb06d3e27c258bc229bb688ced2b16daa6c2

> I made another small change: The ATTR_READONLY is only cleared if all  
> writeable bits are set in the Linux permissions (the file is writeable for  
> user, group and others). This is caused by the check ((mode & S_IWUGO) ==  
> S_IWUGO). I think that this does not make sense, as soon as any write flag  
> is set, the file is no longer read-only.

I agree that the 2nd change is intuitive, but it changes behavior slightly
more, and this close to the next point release I would rather fix the bug
in the least risky way.

Fortunately this compensation won't matter as much once the code to store
the mode in the CIFS ACL is finished for those types of servers.  I am
tempted to just save the mode in an xattr (actually an OS/2 EA, which NTFS
still supports) for these cases.



Re: RSDL 0.31 causes slowdown

2007-03-22 Thread William Lee Irwin III
On Thu, Mar 22, 2007 at 01:21:46PM -0800, Tim Chen wrote:
> I've tried running Volanomark and found an 80% regression
> with RSDL 0.31 scheduler on 2.6.21-rc4 on a 2 socket Core 2 quad cpu
> system (4 cpus per socket, 8 cpus for system). 
> The results are sensitive to rr_interval. Using Con's patch to increase
> rr_interval to a large value of 100, 
> the regression reduced to 30% instead of 80%.
> I ran Volanomark in loopback mode with 10 chatrooms 
> (20 clients per chatroom) configuration, with each client sending
> out 1 messages. 
> http://www.volano.com/benchmarks.html
> There are significant differences in the vmstat runqueue profile 
> between the 2.6.21-rc4 and the one with RSDL.  
> There are a lot fewer runnable jobs (see col 2) with RSDL 0.31 
> (rr_interval=15) and higher idle time.

This would be yield() semantics. A flag or alternate syscall for "hard"
yield() semantics would resolve this (likely trapped into via LD_PRELOAD).
It may also be useful to have a few variants of yield_to() (a.k.a.
directed yields), such as ones donating timeslices by pid, by owner of
sysv semaphore, by owner of futex, and to other ends of pipes and UNIX
domain sockets if unique or otherwise able to be made sense of. It's
unclear how easily the latter could be utilized by userspace, though,
given the number of applications and libraries needing to be updated.


-- wli


Re: [1/6] 2.6.21-rc4: known regressions

2007-03-22 Thread Nick Piggin
On Thu, Mar 22, 2007 at 06:40:41PM -0700, Linus Torvalds wrote:
> 
> [ Ok, I think it's those timers again...
> 
>   Ingo: let me just state how *happy* I am that I told you off when you 
>   wanted to merge the hires timers and NO_HZ before 2.6.20 because they 
>   were "stable". You were wrong, and 2.6.20 is at least in reasonable 
>   shape. Now we just need to make sure that 2.6.21 will be too.. ]
> 
> On Thu, 22 Mar 2007, Mingming Cao wrote:
> > 
> > I might have missed something; so far I can't see a deadlock yet.
> > If there is a deadlock, I think we should see ext3_xattr_release_block()
> > and ext3_forget() on the stack. Is this the case?
> 
> No. What's strange is that two (maybe more, I didn't check) processes seem 
> to be stuck in
> 
>[] schedule_timeout+0x70/0x8e
>[] schedule_timeout_uninterruptible+0x15/0x17
>[] journal_stop+0xe2/0x1e6
>[] journal_force_commit+0x1d/0x1f
>[] ext3_force_commit+0x22/0x24
>[] ext3_write_inode+0x34/0x3a
>[] __writeback_single_inode+0x1c5/0x2cb
>[] sync_inode+0x1c/0x2e
>[] ext3_sync_file+0xab/0xc0
>[] do_fsync+0x4b/0x98
>[] __do_fsync+0x20/0x2f
>[] sys_fsync+0xd/0xf
>[] syscall_call+0x7/0xb
> 
> but that thing is literally:
> 
>   ...
> do {
> old_handle_count = transaction->t_handle_count;
> schedule_timeout_uninterruptible(1);
> } while (old_handle_count != transaction->t_handle_count);
>   ...
> 
> and especially if nothing is happening, I'd not expect 
> "transaction->t_handle_count" to keep changing, so it should stop very 
> quickly.
> 
> Maybe it's CONFIG_NO_HZ again, and the problem is that timeout, and simply 
> no timer tick happening?
> 
> Bingo. I think that's it.
> 
>   active timers:
>#0: hardirq_stack, tick_sched_timer, S:01
># expires at 953089300 nsecs [in -2567889 nsecs]
>#1: hardirq_stack, hrtimer_wakeup, S:01
># expires at 10858649798503 nsecs [in 1327754230614 nsecs]
> .expires_next   : 953089300 nsecs
> 
> See
> 
>   http://lkml.org/lkml/2007/3/16/288
> 
> and that in turn points to the kernel log:
> 
>   
> http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc4/git-console.log

Seems convincing. Michal, can you post your .config, and if you had
dynticks and hrtimers enabled, try reproducing without them?


Re: how can I touch softlockup watchdog on all cpus?

2007-03-22 Thread Jeremy Fitzhardinge
Dave Jones wrote:
> He wants to do this with interrupts off. on_each_cpu won't work in
> that situation.
>   

I was thinking of just before his big pause.  But it sounds like it's fairly
marginal.

>  > Or patch the softlockup watchdog to add a way to temporarily disable it.
>
> Seems pretty much the only way you could make this work.
>   

Yep.  The patches I just posted might do the trick.

J


Re: [1/6] 2.6.21-rc4: known regressions

2007-03-22 Thread Linus Torvalds

[ Ok, I think it's those timers again...

  Ingo: let me just state how *happy* I am that I told you off when you 
  wanted to merge the hires timers and NO_HZ before 2.6.20 because they 
  were "stable". You were wrong, and 2.6.20 is at least in reasonable 
  shape. Now we just need to make sure that 2.6.21 will be too.. ]

On Thu, 22 Mar 2007, Mingming Cao wrote:
> 
> I might have missed something; so far I can't see a deadlock yet.
> If there is a deadlock, I think we should see ext3_xattr_release_block()
> and ext3_forget() on the stack. Is this the case?

No. What's strange is that two (maybe more, I didn't check) processes seem 
to be stuck in

 [] schedule_timeout+0x70/0x8e
 [] schedule_timeout_uninterruptible+0x15/0x17
 [] journal_stop+0xe2/0x1e6
 [] journal_force_commit+0x1d/0x1f
 [] ext3_force_commit+0x22/0x24
 [] ext3_write_inode+0x34/0x3a
 [] __writeback_single_inode+0x1c5/0x2cb
 [] sync_inode+0x1c/0x2e
 [] ext3_sync_file+0xab/0xc0
 [] do_fsync+0x4b/0x98
 [] __do_fsync+0x20/0x2f
 [] sys_fsync+0xd/0xf
 [] syscall_call+0x7/0xb

but that thing is literally:

...
do {
old_handle_count = transaction->t_handle_count;
schedule_timeout_uninterruptible(1);
} while (old_handle_count != transaction->t_handle_count);
...

and especially if nothing is happening, I'd not expect 
"transaction->t_handle_count" to keep changing, so it should stop very 
quickly.

Maybe it's CONFIG_NO_HZ again, and the problem is that timeout, and simply 
no timer tick happening?

Bingo. I think that's it.

active timers:
 #0: hardirq_stack, tick_sched_timer, S:01
 # expires at 953089300 nsecs [in -2567889 nsecs]
 #1: hardirq_stack, hrtimer_wakeup, S:01
 # expires at 10858649798503 nsecs [in 1327754230614 nsecs]
  .expires_next   : 953089300 nsecs

See

http://lkml.org/lkml/2007/3/16/288

and that in turn points to the kernel log:


http://www.stardust.webpages.pl/files/tbf/bitis-gabonica/2.6.21-rc4/git-console.log


Re: kmalloc() with size zero

2007-03-22 Thread Jan Engelhardt

On Mar 22 2007 16:18, Stephane Eranian wrote:
>
>Hello,
>
>I ran into an issue with perfmon where I ended up calling
>kmalloc() with a size of zero. To my surprise, this did
>not return NULL but a valid data address.
>
>I am wondering if this is a property of kmalloc() or simply
>a bug. It is the case that the __kmalloc() code does not
>check for zero size.

I'd say "feature", glibc's malloc also returns an address on malloc(0).


Jan
-- 


Re: 2.6.21-rc4-mm1

2007-03-22 Thread Andrew Morton

Please always do reply-to-all.

On Fri, 23 Mar 2007 00:27:09 +0100 "J.A. Magallón" <[EMAIL PROTECTED]> wrote:

> On Mon, 19 Mar 2007 20:56:23 -0800, Andrew Morton <[EMAIL PROTECTED]> wrote:
> 
> > 
> > Temporarily at
> > 
> >   http://userweb.kernel.org/~akpm/2.6.21-rc4-mm1/
> > 
> > Will appear later at
> > 
> >   
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc4/2.6.21-rc4-mm1/
> > 
> 
> Is anybody having problems with optical drives and this kernel ?
> I can't get my dvdrw to spit any events to udevmonitor. If I mount it
> manually everything works fine.

Yes, I think one person reported something similar.

> Perhaps the problem is in hal/g-v-m or something else, but I assume that
> udevmonitor receives events directly from the kernel, doesn't it?

Probably related to the not-yet-completely-solved firmware loader failures.

It would be good if someone could do a bisection search on this.  I face a
fun evening hunting down a horrendous ext3 performance regression which is
now in mainline.



Re: [PATCH] Define CLONE_NEWPID flag

2007-03-22 Thread Herbert Poetzl
On Wed, Mar 21, 2007 at 01:39:38PM -0700, Andrew Morton wrote:
> On Wed, 21 Mar 2007 12:41:03 -0700
> [EMAIL PROTECTED] wrote:
> 
> > This was discussed on containers and we thought it would be useful
> > to reserve this flag.
> > ---
> > 
> > From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
> > Subject: [PATCH] Define CLONE_NEWPID flag
> > 
> > Define CLONE_NEWPID flag that will be used to clone pid namespaces.
> > 
> > Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
> > ---
> >  include/linux/sched.h |1 +
> >  1 file changed, 1 insertion(+)
> > 
> > Index: lx26-21-rc3-mm2/include/linux/sched.h
> > ===
> > --- lx26-21-rc3-mm2.orig/include/linux/sched.h  2007-03-20 
> > 20:13:19.0 -0700
> > +++ lx26-21-rc3-mm2/include/linux/sched.h   2007-03-21 11:10:33.0 
> > -0700
> > @@ -26,6 +26,7 @@
> >  #define CLONE_STOPPED  0x0200  /* Start in stopped 
> > state */
> >  #define CLONE_NEWUTS   0x0400  /* New utsname group? */
> >  #define CLONE_NEWIPC   0x0800  /* New ipcs */
> > +#define CLONE_NEWPID   0x1000  /* New pid namespace */
> 
> Do we actually have any need to reserve it at this time?  
> I'd have thought that we could defer adding this until we 
> have some code in-kernel which uses it.

FWIW, I'm fine with the reservation; we won't get around
it for the pid space, so we might as well register it now.
YMMV

best,
Herbert

> ___
> Containers mailing list
> [EMAIL PROTECTED]
> https://lists.linux-foundation.org/mailman/listinfo/containers


Re: max_loop limit

2007-03-22 Thread Jan Engelhardt

On Mar 22 2007 14:42, Eric Dumazet wrote:
>Instead of using :
>
>static struct loop_device *loop_dev;
>loop_dev = kmalloc(max_loop * sizeof(struct loop_device), GFP_KERNEL);
>
>Switch to :
>
>static struct loop_device **loop_dev;
>loop_dev = kmalloc(max_loop * sizeof(void *), GFP_KERNEL);
>if (!loop_dev) rollback...
>for (i = 0 ; i < max_loop ; i++) {
>   loop_dev[i] = kmalloc(sizeof(struct loop_device), GFP_KERNEL);
>   if (!loop_dev[i]) rollback...
>}
>
>This time, you would be limited to 16384 loop devices on x86_64, 32768 on i386
>:)

Oh noes. Please use a linked list (a perfect fit for kmalloc) if you really need
loads of loopdevs. Sorta

 struct loopdev {
struct list_head lh;
int lo_number;
 };

to keep the /dev/loop%d number consistent across loopdev removal.
Maybe it's even better to use an rbtree (a linked list does not scale).


Jan
-- 


Re: sysfs q [was: sysfs ugly timer interface]

2007-03-22 Thread Jan Engelhardt

On Mar 22 2007 08:28, Greg KH wrote:
>> 
>> > [EMAIL PROTECTED]:/home/maxim# cat 
>> > /sys/devices/system/clockevents/clockevents0/registered
>> > lapic F:0007 M:3(periodic) C: 1
>> > hpet  F:0003 M:1(shutdown) C: 0
>> > lapic F:0007 M:3(periodic) C: 0
>> > [EMAIL PROTECTED]:/home/maxim#   
>> 
>> Now... this file needs to die, before 2.6.21 is released. It tries to
>> bring /proc-like parsing nightmare to sysfs. Kill it before it becomes
>> part of stable ABI!
>
>Eeek!

Question regarding sysfs files: How would you do something like
/proc/net/nf_conntrack with sysfs? Have directories named like , 
0001, 0002, ..?


Jan
-- 


Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)

2007-03-22 Thread Christoph Lameter
On Thu, 22 Mar 2007, Siddha, Suresh B wrote:

> > You should check num_possible_nodes(), or nr_node_ids (this one is cheaper, 
> > it's a variable instead of a function call)
> 
> But that is based on a compile-time option, isn't it? Perhaps I need
> to use some other mechanism to find out whether the platform is NUMA capable..

No, it's runtime.



Re: how can I touch softlockup watchdog on all cpus?

2007-03-22 Thread Dave Jones
On Thu, Mar 22, 2007 at 03:46:54PM -0700, Jeremy Fitzhardinge wrote:
 > Cestonaro, Thilo (external) wrote:
 > > It's a requirement from a customer of ours, so I can't change it.
 > >
 > > But it doesn't happen often that my part is used. So I thought there is a 
 > > mechanism to disable or reset the watchdog
 > > because it is a legal pause for it. And there is one 
 > > "touch_softlockup_watchdog()", that does what I want,
 > > BUT just for the current cpu. And so the watchdog blats from the other cpu.
 > >   
 > 
 > on_each_cpu(touch_softlockup_watchdog, NULL, 0, 0)?

He wants to do this with interrupts off. on_each_cpu won't work in
that situation.

 > Or patch the softlockup watchdog to add a way to temporarily disable it.

Seems pretty much the only way you could make this work.

Dave

-- 
http://www.codemonkey.org.uk


Re: [PATCH] make KVM conform to sucky rdmsr interface

2007-03-22 Thread Rusty Russell
On Thu, 2007-03-22 at 14:30 -0700, Andrew Morton wrote:
> Which tree are you patching??

We crossed in the mail: you turfed out the paravirt.h cleanup patch it
applied to.

We have rolled the fixes into one patch, and are testing...

Rusty.




[PATCH 002 of 3] md: Clear the congested_fn when stopping a raid5

2007-03-22 Thread NeilBrown

If this mddev and queue got reused for another array that doesn't
register a congested_fn, this function would get called incorrectly.

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/md.c|1 +
 ./drivers/md/raid5.c |3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c   2007-03-23 11:13:41.0 +1100
+++ ./drivers/md/md.c   2007-03-23 11:13:41.0 +1100
@@ -3325,6 +3325,7 @@ static int do_md_stop(mddev_t * mddev, i
mddev->queue->merge_bvec_fn = NULL;
mddev->queue->unplug_fn = NULL;
mddev->queue->issue_flush_fn = NULL;
+   mddev->queue->backing_dev_info.congested_fn = NULL;
if (mddev->pers->sync_request)
sysfs_remove_group(&mddev->kobj, 
&md_redundancy_group);
 

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c2007-03-23 11:13:29.0 +1100
+++ ./drivers/md/raid5.c2007-03-23 11:13:41.0 +1100
@@ -4269,8 +4269,8 @@ static int run(mddev_t *mddev)
 
mddev->queue->unplug_fn = raid5_unplug_device;
mddev->queue->issue_flush_fn = raid5_issue_flush;
-   mddev->queue->backing_dev_info.congested_fn = raid5_congested;
mddev->queue->backing_dev_info.congested_data = mddev;
+   mddev->queue->backing_dev_info.congested_fn = raid5_congested;
 
mddev->array_size =  mddev->size * (conf->previous_raid_disks -
conf->max_degraded);
@@ -4301,6 +4301,7 @@ static int stop(mddev_t *mddev)
mddev->thread = NULL;
shrink_stripes(conf);
kfree(conf->stripe_hashtbl);
+   mddev->queue->backing_dev_info.congested_fn = NULL;
blk_sync_queue(mddev->queue); /* the unplug fn references 'conf'*/
sysfs_remove_group(&mddev->kobj, &raid5_attrs_group);
kfree(conf->disks);


[PATCH 003 of 3] md: Convert compile time warnings into runtime warnings.

2007-03-22 Thread NeilBrown

... still not sure why we need this 

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/md.c|   41 +++--
 ./drivers/md/raid5.c |   12 ++--
 2 files changed, 41 insertions(+), 12 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c   2007-03-23 11:13:41.0 +1100
+++ ./drivers/md/md.c   2007-03-23 12:06:34.0 +1100
@@ -1319,6 +1319,7 @@ static int bind_rdev_to_array(mdk_rdev_t
char b[BDEVNAME_SIZE];
struct kobject *ko;
char *s;
+   int err;
 
if (rdev->mddev) {
MD_BUG();
@@ -1353,20 +1354,29 @@ static int bind_rdev_to_array(mdk_rdev_t
while ( (s=strchr(rdev->kobj.k_name, '/')) != NULL)
*s = '!';

-   list_add(&rdev->same_set, &mddev->disks);
rdev->mddev = mddev;
printk(KERN_INFO "md: bind<%s>\n", b);
 
rdev->kobj.parent = &mddev->kobj;
-   kobject_add(&rdev->kobj);
+   if ((err = kobject_add(&rdev->kobj)))
+   goto fail;
 
if (rdev->bdev->bd_part)
ko = &rdev->bdev->bd_part->kobj;
else
ko = &rdev->bdev->bd_disk->kobj;
-   sysfs_create_link(&rdev->kobj, ko, "block");
+   if ((err = sysfs_create_link(&rdev->kobj, ko, "block"))) {
+   kobject_del(&rdev->kobj);
+   goto fail;
+   }
+   list_add(&rdev->same_set, &mddev->disks);
bd_claim_by_disk(rdev->bdev, rdev, mddev->gendisk);
return 0;
+
+ fail:
+   printk(KERN_WARNING "md: failed to register dev-%s for %s\n",
+  b, mdname(mddev));
+   return err;
 }
 
 static void unbind_rdev_from_array(mdk_rdev_t * rdev)
@@ -2966,7 +2976,9 @@ static struct kobject *md_probe(dev_t de
mddev->kobj.k_name = NULL;
snprintf(mddev->kobj.name, KOBJ_NAME_LEN, "%s", "md");
mddev->kobj.ktype = &md_ktype;
-   kobject_register(&mddev->kobj);
+   if (kobject_register(&mddev->kobj))
+   printk(KERN_WARNING "md: cannot register %s/md - name in use\n",
+  disk->disk_name);
return NULL;
 }
 
@@ -3144,9 +3156,12 @@ static int do_md_run(mddev_t * mddev)
bitmap_destroy(mddev);
return err;
}
-   if (mddev->pers->sync_request)
-   sysfs_create_group(&mddev->kobj, &md_redundancy_group);
-   else if (mddev->ro == 2) /* auto-readonly not meaningful */
+   if (mddev->pers->sync_request) {
+   if (sysfs_create_group(&mddev->kobj, &md_redundancy_group))
+   printk(KERN_WARNING
+  "md: cannot register extra attributes for %s\n",
+  mdname(mddev));
+   } else if (mddev->ro == 2) /* auto-readonly not meaningful */
mddev->ro = 0;
 
atomic_set(&mddev->writes_pending,0);
@@ -3160,7 +3175,9 @@ static int do_md_run(mddev_t * mddev)
if (rdev->raid_disk >= 0) {
char nm[20];
sprintf(nm, "rd%d", rdev->raid_disk);
-   sysfs_create_link(&mddev->kobj, &rdev->kobj, nm);
+   if (sysfs_create_link(&mddev->kobj, &rdev->kobj, nm))
+   printk("md: cannot register %s for %s\n",
+  nm, mdname(mddev));
}

set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
@@ -5388,8 +5405,12 @@ static int remove_and_add_spares(mddev_t
if (mddev->pers->hot_add_disk(mddev,rdev)) {
char nm[20];
sprintf(nm, "rd%d", rdev->raid_disk);
-   sysfs_create_link(&mddev->kobj,
- &rdev->kobj, nm);
+   if (sysfs_create_link(&mddev->kobj,
+ &rdev->kobj, nm))
+   printk(KERN_WARNING
+  "md: cannot register "
+  "%s for %s\n",
+  nm, mdname(mddev));
spares++;
md_new_event(mddev);
} else

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c2007-03-23 11:13:41.0 +1100
+++ ./drivers/md/raid5.c2007-03-23 12:06:00.0 +1100
@@ -4265,7 +4265,10 @@ static int run(mddev_t *mddev)
}
 
/* Ok, everything is just fine now */
-   sysfs_create_group(&mddev->kobj, &raid5_attrs_group);
+   if (sysfs_create_group(&mddev->kobj, &raid5_attrs_group))
+   pr

[PATCH 000 of 3] md: bug fixes for md for 2.6.21

2007-03-22 Thread NeilBrown
A minor new feature and 2 bug fixes for md suitable for 2.6.21

The minor feature is to make reshape (adding a drive to an array and
restriping it) work for raid4.  The code is all ready, it just wasn't
used.

Thanks,
NeilBrown


 [PATCH 001 of 3] md: Allow raid4 arrays to be reshaped.
 [PATCH 002 of 3] md: Clear the congested_fn when stopping a raid5
 [PATCH 003 of 3] md: Convert compile time warnings into runtime warnings.


[PATCH 001 of 3] md: Allow raid4 arrays to be reshaped.

2007-03-22 Thread NeilBrown

All that is missing is the function pointers in raid4_pers.


Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/raid5.c |4 
 1 file changed, 4 insertions(+)

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c2007-03-23 11:13:29.0 +1100
+++ ./drivers/md/raid5.c2007-03-23 11:13:29.0 +1100
@@ -4727,6 +4727,10 @@ static struct mdk_personality raid4_pers
.spare_active   = raid5_spare_active,
.sync_request   = sync_request,
.resize = raid5_resize,
+#ifdef CONFIG_MD_RAID5_RESHAPE
+   .check_reshape  = raid5_check_reshape,
+   .start_reshape  = raid5_start_reshape,
+#endif
.quiesce= raid5_quiesce,
 };
 


Re: [1/6] 2.6.21-rc4: known regressions

2007-03-22 Thread Mingming Cao
On Thu, 2007-03-22 at 08:21 -0700, Linus Torvalds wrote:
> 
> On Thu, 22 Mar 2007, Nick Piggin wrote:
> > 
> > Nothing sleeps on PageUptodate, so I don't think that could explain it.
> 
> Good point. I forget that we just test "uptodate", but then always sleep 
> on "locked".
> 
> > The fs: fix __block_write_full_page error case buffer submission patch
> > does change the locking, but I'd be really suprised if that was the
> > problem, because it changes locking to match the regular non-error path
> > submission.
> 
> I'd agree, except something clearly has changed ;^)
> 
> > > Alternatively, maybe it really is an _io_ problem (and the buffer-head 
> > > thing
> > > is just a red herring, and it could happen to other IO, it's just that
> > > metadata IO uses buffer heads), and it's the scheduler changes since
> > > 2.6.20..
> > 
> > I see what you mean. Could it be an ext3 or jbd change I wonder?
> 
> jbd hasn't changed since 2.6.20, and the ext3 changes are mostly 
> things like const'ness fixes. And others were things like changing 
> "journal_current_handle()" into "ext3_journal_current_handle()", which 
> looked exciting considering that the hung processes were waiting for the 
> journal, but the fact is, that's just an inline function that just calls 
> the old function, so..
> 
> But interestingly, there *is* a "EA block reference count racing fix" 
> that does move a lock_buffer()/unlock_buffer() to cover a bigger area. It 
> looks "obviously correct", but maybe there's a deadlock possibility there 
> with ext3_forget() or something?
> 

I might have missed something; so far I can't see a deadlock yet.
If there is a deadlock, I think we should see ext3_xattr_release_block()
and ext3_forget() on the stack. Is this the case?

Regards,
Mingming

>   Linus



Re: [patch] cache pipe buf page address for non-highmem arch

2007-03-22 Thread Andrew Morton
On Thu, 22 Mar 2007 17:51:11 -0700 "Ken Chen" <[EMAIL PROTECTED]> wrote:

> +#ifdef CONFIG_HIGHMEM
> +#define pipe_kmap kmap
> +#define pipe_kmap_atomic kmap_atomic
> +#else /* CONFIG_HIGHMEM */
> +static inline void *pipe_kmap(struct page *page)
> +{
> + return (void *) page->private;
> +}
> +static inline void *pipe_kmap_atomic(struct page *page, enum km_type type)
> +{
> + pagefault_disable();
> + return pipe_kmap(page);
> +}
> +#endif

If we're going to do this then we should also implement pipe_kunmap_atomic().
Relying upon kunmap_atomic() internals like this is weird-looking, and is
fragile against future changes to kunmap_atomic().


Re: Possible Bug in mincore or mmap

2007-03-22 Thread Nick Piggin

Bruce Dubbs wrote:

When testing an installation with tests from the Linux Test Project, my
kernels fail one instance of the mincore01 tests:

mincore01    1  PASS  :  expected failure: errno = 22 (Invalid argument)
mincore01    2  PASS  :  expected failure: errno = 14 (Bad address)
mincore01    3  FAIL  :  call succeeded unexpectedly
mincore01    4  PASS  :  expected failure: errno = 12 (Cannot allocate memory)

I pared down the test to the attached program.  The mincore call is supposed
to fail, as it asks for residency information on a region five times the
size of what was actually mapped.

Upon experimenting, I found the test works properly if a printf is
executed before the mmap call.  I have tested on locally built, but
unmodified, 2.4.25, 2.6.12.5, and a 2.6.20.3 kernels and get the same
behavior.  The tests fail on IA32 architecture, but not 64-bit kernels.
 The test always works properly on FC6 and RHEL3.

I've checked the archives for this issue and could not find anything
appropriate.

Could this be a potential security issue as memory that is not supposed
to be accessible seems to be available to the user?  Is it expected
behavior?


It shouldn't be a security problem if mincore doesn't actually
return the data.

The other thing is, that test may not be valid, because it doesn't
guarantee that you have nothing mapped immediately after the
global_pointer region. Maybe a difference in address space layout
is causing it to "correctly" fail on x86-64.
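This can be checked deterministically by putting a known hole inside the
probed range: mincore() fails with ENOMEM when the range covers unmapped
pages, so the LTP test only "fails correctly" when nothing happens to be
mapped after the region. A minimal sketch (the helper name is illustrative,
not from the LTP suite):

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map three anonymous pages, punch a hole in the middle, then probe
 * the whole range with mincore().  Returns the errno mincore() sets,
 * or 0 if the call unexpectedly succeeds. */
static int mincore_over_hole(void)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned char vec[3];   /* one residency byte per probed page */
    char *base = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    munmap(base + page, page);        /* guaranteed hole in the range */

    errno = 0;
    int err = (mincore(base, 3 * page, vec) == -1) ? errno : 0;

    munmap(base, page);
    munmap(base + 2 * page, page);
    return err;
}
```

Here mincore() must fail with ENOMEM because the probed range contains
unmapped memory; the LTP test instead probes past the end of a mapping and
only fails when the following address space happens to be empty, which
mmap() does not guarantee.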






Thanks.

  -- Bruce




#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

static int   PAGESIZE;
static char  file_name[]= "fooXX";
static void* global_pointer = NULL;
static int   global_len = 0;
static int   file_desc  = 0;

int main(int argc, char **argv)
{
int i;
int result;
char*   buf;
unsigned char   vect[20] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};



PAGESIZE = getpagesize();

/* global_pointer will point to a mmapped area of global_len bytes */

global_len = PAGESIZE*2;

buf = (char*)malloc(global_len);
memset(buf, 42, global_len);  // Asterisks 

/* create a temporary file */

file_desc = mkstemp(file_name);

/* fill the temporary file with two pages of data */

write(file_desc, buf, global_len);
free(buf);

// Will work properly as long as print is before mmap function.

if ( argc > 1 ) printf("argc=%d\n", argc);

/* map the file in memory */
global_pointer = mmap( NULL, global_len,
PROT_READ|PROT_WRITE|PROT_EXEC, MAP_SHARED, file_desc, 0);

// Result should be -1 as the request is 5 times actual mapping
result = mincore(global_pointer, (size_t)(global_len*5), vect);

// Print some data
printf("PAGESIZE=%d\n", PAGESIZE);
printf("global_len=%d\n", global_len);
printf("global_pointer=0x%x\n", (unsigned int)global_pointer);
printf("alloc=%d\n", (global_len+PAGESIZE-1) / PAGESIZE );
printf("Result=%d\n", result);
printf("vect: ");

for ( i=0; i<20; i++) printf("%02x ", vect[i]);
printf("\n");

// Clean up

munmap(global_pointer, (size_t)global_len);
close(file_desc);
unlink(file_name);
}



--
SUSE Labs, Novell Inc.




[patch] cache pipe buf page address for non-highmem arch

2007-03-22 Thread Ken Chen

It is really sad that we always call kmap and friends for every pipe
buffer page on 64-bit arches that don't use HIGHMEM, or in
configurations that don't turn on HIGHMEM.

The effect of calling kmap* is visible in the execution profile when
pipe code is being stressed.  It is especially true on amd's x86-64
platform where kmap() has to traverse through numa node index
calculation in order to convert struct page * to kernel virtual
address.  It is fairly pointless to perform that calculation repeatedly
on systems with no highmem (i.e., 64-bit arches like x86-64).  This patch
caches kernel pipe buffer page's kernel vaddr to speed up pipe buffer
mapping functions.
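The caching idea can be modelled in miniature: with no highmem, kmap()
always resolves to the same kernel virtual address for a given page, so it
can be computed once and stashed in the page's private field. A toy model
(the struct and helpers below are simplified stand-ins, not the real kernel
types):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures involved. */
struct toy_page {
    unsigned long private;   /* cached kernel virtual address */
    char data[64];           /* stand-in for the page frame itself */
};

/* What kmap()/page_address() has to recompute on every call
 * (on x86-64 this involves NUMA node index arithmetic). */
static void *slow_page_address(struct toy_page *page)
{
    return page->data;
}

/* Done once, when the pipe buffer page is allocated. */
static void cache_vaddr(struct toy_page *page)
{
    page->private = (unsigned long)slow_page_address(page);
}

/* pipe_kmap() analogue: a single load instead of a recomputation. */
static void *pipe_kmap(struct toy_page *page)
{
    return (void *)page->private;
}
```

The patch below applies exactly this shape: page->private is filled in at
allocation time in pipe_write(), and pipe_kmap()/pipe_kmap_atomic() just
read it back.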

There is another suboptimal block in pipe_read() where wake_up is
called twice.  I think it was an oversight since in pipe_write(), it
looks like it is doing the right thing.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>


--- linus-2.6.git/fs/pipe.c.orig2007-03-01 12:41:06.0 -0800
+++ linus-2.6.git/fs/pipe.c 2007-03-22 16:28:03.0 -0700
@@ -20,6 +20,22 @@

#include 
#include 
+#include 
+
+#ifdef CONFIG_HIGHMEM
+#define pipe_kmap  kmap
+#define pipe_kmap_atomic   kmap_atomic
+#else  /* CONFIG_HIGHMEM */
+static inline void *pipe_kmap(struct page *page)
+{
+   return (void *) page->private;
+}
+static inline void *pipe_kmap_atomic(struct page *page, enum km_type type)
+{
+   pagefault_disable();
+   return pipe_kmap(page);
+}
+#endif

/*
 * We use a start+len construction, which provides full use of the
@@ -169,10 +185,10 @@ void *generic_pipe_buf_map(struct pipe_i
{
if (atomic) {
buf->flags |= PIPE_BUF_FLAG_ATOMIC;
-   return kmap_atomic(buf->page, KM_USER0);
+   return pipe_kmap_atomic(buf->page, KM_USER0);
}

-   return kmap(buf->page);
+   return pipe_kmap(buf->page);
}

void generic_pipe_buf_unmap(struct pipe_inode_info *pipe,
@@ -316,6 +332,7 @@ redo:
if (do_wakeup) {
wake_up_interruptible_sync(&pipe->wait);
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
+   do_wakeup = 0;
}
pipe_wait(pipe);
}
@@ -423,6 +440,8 @@ redo1:
ret = ret ? : -ENOMEM;
break;
}
+   page->private = (unsigned long)
+   page_address(page);
pipe->tmp_page = page;
}
/* Always wake up, even if the copy fails. Otherwise
@@ -438,9 +457,9 @@ redo1:
iov_fault_in_pages_read(iov, chars);
redo2:
if (atomic)
-   src = kmap_atomic(page, KM_USER0);
+   src = pipe_kmap_atomic(page, KM_USER0);
else
-   src = kmap(page);
+   src = pipe_kmap(page);

error = pipe_iov_copy_from_user(src, iov, chars,
atomic);


Re: controlling mmap()'d vs read/write() pages

2007-03-22 Thread Herbert Poetzl
On Tue, Mar 20, 2007 at 03:19:16PM -0600, Eric W. Biederman wrote:
> Dave Hansen <[EMAIL PROTECTED]> writes:
> 
> >
> > So, I think we have a difference of opinion. I think it's _all_
> > about memory pressure, and you think it is _not_ about accounting
> > for memory pressure. :) Perhaps we mean different things, but we
> > appear to disagree greatly on the surface.
>
> I think it is about preventing a badly behaved container from having a
> significant effect on the rest of the system, and in particular other
> containers on the system.
>
> See below. I think to reach agreement we should start by discussing
> the algorithm that we see being used to keep the system functioning well
> and the theory behind that algorithm. Simply limiting memory is not
> enough to understand why it works.
>
> > Can we agree that there must be _some_ way to control the amounts of
> > unmapped page cache? Whether that's related somehow to the same way
> > we control RSS or done somehow at the I/O level, there must be some
> > way to control it. Agree?
> 
> A lot depends on what we measure and what we try and control.
> Currently what we have been measuring are amounts of RAM, and thus
> what we are trying to control is the amount of RAM.  If we want to
> control memory pressure we need a definition and a way to measure it.
> I think there may be potential if we did that but we would still need
> a memory limit to keep things like mlock in check.
> 
> So starting with a some definitions and theory.
> RSS is short for resident set size.  The resident set is how many
> pages are currently in memory (not on disk) and used by the
> application.  This includes the memory in page tables, but can
> reasonably be extended to include any memory a process can be shown to
> be using.
> 
> In theory there is some minimal RSS that you can give an application
> at which it will get productive work done.  Below the minimal RSS
> the application will spend the majority of real time waiting for
> pages to come in from disk, so it can execute the next instruction.
> The ultimate worst case here is a read instruction appearing on one
> page and its datum on another.  You have to have both pages in memory
> at the same time for the read to complete.  If you set the RSS hard
> limit to one page, the program will be continually restarting, either
> because the page it is on is not in memory or the page it is reading
> from is not in memory.
> 
> What we want to accomplish is to have a system that runs multiple
> containers without problems.  As a general memory management policy
> we can accomplish this by ensuring each container has at least
> its minimal RSS quota of pages.  By watching the paging activity
> of a container it is possible to detect when that container has
> too few pages and is spending all of its time I/O bound, and thus
> has slipped below its minimal RSS.
> 
> As such it is possible for the memory management system if we have
> container RSS accounting to dynamically figure out how much memory
> each container needs and to keep everyone above their minimal RSS
> most of the time when that is possible.  Basically to do this the
> memory manage code would need to keep dynamic RSS limits, and
> adjust them based upon need.
> 
> There is still the case when not all containers can have their
> minimal RSS, there is simply not enough memory in the system.
> 
> That is where having a hard settable RSS limit comes in.  With this
> we communicate to the application and the users beyond which point we
> consider their application to be abusing the system.
> 
> There is a lot of history with RSS limits showing their limitations
> and how they work.  It is fundamentally a dynamic policy instead of
> a static set of guarantees which allows for applications with a
> diverse set of memory requirements to work in harmony.
> 
> One of the very neat things about a hard RSS limit is that if there
> are extra resources on the system you can improve overall system
> performance by caching pages in the page cache instead of writing them
> to disk.

that is exactly what we (Linux-VServer) want ...
(sounds good to me, please keep up the good work in
this direction)

there is nothing wrong with hard limits if somebody
really wants them, even if they hurt the system as a
whole, but those limits shouldn't be the default ..

best,
Herbert
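The dynamic-limit idea Eric describes can be sketched as a simple feedback
loop: grow a container's RSS limit while it is faulting heavily, reclaim
slack when it is quiet, and never exceed the administrator's hard limit.
The thresholds and step sizes below are invented purely for illustration:

```c
#include <assert.h>

/* Toy feedback controller for a per-container dynamic RSS limit.
 * A high major-fault rate suggests the container is below its
 * minimal RSS; a low rate suggests it has slack to give back.
 * All constants are illustrative, not tuned values. */
static unsigned long adjust_rss_limit(unsigned long limit,
                                      unsigned long faults_per_sec,
                                      unsigned long hard_limit)
{
    const unsigned long thrash = 100;  /* faults/s: container is thrashing */
    const unsigned long quiet  = 5;    /* faults/s: container is idle */

    if (faults_per_sec > thrash && limit < hard_limit)
        limit += limit / 8;            /* grow by 12.5% */
    else if (faults_per_sec < quiet)
        limit -= limit / 16;           /* shrink by 6.25% */

    return limit > hard_limit ? hard_limit : limit;
}
```

The hard limit only comes into play when the feedback loop would otherwise
keep growing a misbehaving container, which matches the "abuse threshold"
role Eric assigns to it.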

> > http://linux-mm.org/SoftwareZones
> 
> I will try and take a look in a bit.
> 
> 
> Eric
> ___
> Containers mailing list
> [EMAIL PROTECTED]
> https://lists.linux-foundation.org/mailman/listinfo/containers


[PATCH 003 of 4] knfsd: nfsd4: demote "clientid in use" printk to a dprintk

2007-03-22 Thread NeilBrown

From: Bruce Fields <[EMAIL PROTECTED]>

The reused clientid here is a more of a problem for the client than the
server, and the client can report the problem itself if it's serious.

Signed-off-by: "J. Bruce Fields" <[EMAIL PROTECTED]>
Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./fs/nfsd/nfs4state.c |5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff .prev/fs/nfsd/nfs4state.c ./fs/nfsd/nfs4state.c
--- .prev/fs/nfsd/nfs4state.c   2007-03-23 11:19:14.0 +1100
+++ ./fs/nfsd/nfs4state.c   2007-03-23 11:19:14.0 +1100
@@ -750,9 +750,8 @@ nfsd4_setclientid(struct svc_rqst *rqstp
status = nfserr_clid_inuse;
if (!cmp_creds(&conf->cl_cred, &rqstp->rq_cred)
|| conf->cl_addr != sin->sin_addr.s_addr) {
-   printk("NFSD: setclientid: string in use by client"
-   "(clientid %08x/%08x)\n",
-   conf->cl_clientid.cl_boot, conf->cl_clientid.cl_id);
+   dprintk("NFSD: setclientid: string in use by client"
+   "at %u.%u.%u.%u\n", NIPQUAD(conf->cl_addr));
goto out;
}
}


[PATCH 001 of 4] knfsd: Allow nfsd READDIR to return 64bit cookies

2007-03-22 Thread NeilBrown

->readdir passes loff_t offsets (used as NFS cookies) to
nfs3svc_encode_entry{,_plus}, but when they pass it on to
encode_entry it becomes an 'off_t', which isn't good.

So filesystems that returned 64bit offsets would lose.
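The truncation is easy to reproduce outside the kernel: when off_t is 32
bits wide, passing a 64-bit cookie through it silently drops the high word.
A sketch using an explicit 32-bit stand-in type for the old parameter:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a 32-bit off_t, as encode_entry() effectively had. */
typedef int32_t old_off_t;

/* Model of a 64-bit directory cookie being squeezed through the
 * narrow parameter and back out again. */
static uint64_t pass_through_off_t(uint64_t cookie)
{
    old_off_t narrowed = (old_off_t)cookie; /* high 32 bits lost here */
    return (uint64_t)narrowed;
}
```

Any filesystem handing out cookies that don't fit in 32 bits would see them
corrupted on the way to the XDR encoder, which is exactly what the patch's
loff_t parameter avoids.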


Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./fs/nfsd/nfs3xdr.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff .prev/fs/nfsd/nfs3xdr.c ./fs/nfsd/nfs3xdr.c
--- .prev/fs/nfsd/nfs3xdr.c 2007-03-23 11:13:15.0 +1100
+++ ./fs/nfsd/nfs3xdr.c 2007-03-23 11:13:15.0 +1100
@@ -887,8 +887,8 @@ compose_entry_fh(struct nfsd3_readdirres
 #define NFS3_ENTRY_BAGGAGE (2 + 1 + 2 + 1)
 #define NFS3_ENTRYPLUS_BAGGAGE (1 + 21 + 1 + (NFS3_FHSIZE >> 2))
 static int
-encode_entry(struct readdir_cd *ccd, const char *name,
-int namlen, off_t offset, ino_t ino, unsigned int d_type, int plus)
+encode_entry(struct readdir_cd *ccd, const char *name, int namlen,
+loff_t offset, ino_t ino, unsigned int d_type, int plus)
 {
struct nfsd3_readdirres *cd = container_of(ccd, struct nfsd3_readdirres,
common);
@@ -908,7 +908,7 @@ encode_entry(struct readdir_cd *ccd, con
*cd->offset1 = htonl(offset64 & 0xffffffff);
cd->offset1 = NULL;
} else {
-   xdr_encode_hyper(cd->offset, (u64) offset);
+   xdr_encode_hyper(cd->offset, offset64);
}
}
 


[PATCH 004 of 4] knfsd: nfsd4: remove superfluous cancel_delayed_work() call

2007-03-22 Thread NeilBrown
From: "J. Bruce Fields" <[EMAIL PROTECTED]>

This cancel_delayed_work call is made from a function that is only
called from a piece of code that immediately follows a cancel and
destruction of the workqueue, so it's clearly a mistake.

Cc: Oleg Nesterov <[EMAIL PROTECTED]>
Signed-off-by: "J. Bruce Fields" <[EMAIL PROTECTED]>
Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./fs/nfsd/nfs4state.c |1 -
 1 file changed, 1 deletion(-)

diff .prev/fs/nfsd/nfs4state.c ./fs/nfsd/nfs4state.c
--- .prev/fs/nfsd/nfs4state.c   2007-03-23 11:19:14.0 +1100
+++ ./fs/nfsd/nfs4state.c   2007-03-23 11:19:34.0 +1100
@@ -3260,7 +3260,6 @@ __nfs4_state_shutdown(void)
unhash_delegation(dp);
}
 
-   cancel_delayed_work(&laundromat_work);
nfsd4_shutdown_recdir();
nfs4_init = 0;
 }


[PATCH 002 of 4] knfsd: nfsd4: fix inheritance flags on v4 ace derived from posix default ace

2007-03-22 Thread NeilBrown

From: Bruce Fields <[EMAIL PROTECTED]>

A regression introduced in the last set of acl patches removed the
INHERIT_ONLY flag from aces derived from the posix acl.  Fix.

Signed-off-by: "J. Bruce Fields" <[EMAIL PROTECTED]>
Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./fs/nfsd/nfs4acl.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff .prev/fs/nfsd/nfs4acl.c ./fs/nfsd/nfs4acl.c
--- .prev/fs/nfsd/nfs4acl.c 2007-03-23 11:18:58.0 +1100
+++ ./fs/nfsd/nfs4acl.c 2007-03-23 11:18:58.0 +1100
@@ -228,7 +228,7 @@ _posix_to_nfsv4_one(struct posix_acl *pa
struct posix_acl_summary pas;
unsigned short deny;
int eflag = ((flags & NFS4_ACL_TYPE_DEFAULT) ?
-   NFS4_INHERITANCE_FLAGS : 0);
+   NFS4_INHERITANCE_FLAGS | NFS4_ACE_INHERIT_ONLY_ACE : 0);
 
BUG_ON(pacl->a_count < 3);
summarize_posix_acl(pacl, &pas);


[PATCH 000 of 4] knfsd: 4 bugfixes for nfsd suitable for 2.6.21

2007-03-22 Thread NeilBrown
Following are 4 small bugfixes for nfsd that are suitable for 2.6.21.

Thanks,
NeilBrown

 [PATCH 001 of 4] knfsd: Allow nfsd READDIR to return 64bit cookies
 [PATCH 002 of 4] knfsd: nfsd4: fix inheritance flags on v4 ace derived from 
posix default ace
 [PATCH 003 of 4] knfsd: nfsd4: demote "clientid in use" printk to a dprintk
 [PATCH 004 of 4] knfsd: nfsd4: remove superfluous cancel_delayed_work() call


Re: New format Intel microcode...

2007-03-22 Thread Shaohua Li
On Thu, 2007-03-22 at 23:45 +, Daniel J Blueman wrote:
> Hi Shao-hua,
> 
> Is the tool you mentioned last June [1] available for splitting up the
> old firmware files to the new format (eg
> /lib/firmware/intel-ucode/06-0d-06), or are updates available from
> Intel (or otherwise) in this new format?
Yes, we are preparing the new-format data files and may put them on a
new website.  We will announce it when it's ready.

Thanks,
Shaohua


Re: 2.6.20.3 AMD64 oops in CFQ code

2007-03-22 Thread Neil Brown
On Thursday March 22, [EMAIL PROTECTED] wrote:
> 
> Not a cfq failure, but I have been able to reproduce a different oops
> at array stop time while i/o's were pending.  I have not dug into it
> enough to suggest a patch, but I wonder if it is somehow related to
> the cfq failure since it involves congestion and drives going away:

Thanks.   I know about that one and have a patch about to be posted
which should fix it.  But I don't completely understand it.

When a raid5 array shuts down, it clears mddev->private, but doesn't
clean q->backing_dev_info.congested_fn.  So if someone tries to call
that congested_fn, it will try to dereference mddev->private and Oops.

Only by the time that raid5 is shutting down, no-one should have a
reference to the device any more, and no-one should be in a position
to call congested_fn !!
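The hazard described here is a callback outliving its state: the queue
keeps a congested_fn pointer whose backing data (mddev->private) is torn
down first. A minimal model of the problem and of the safe teardown
ordering (names are illustrative, not the actual md code):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for q->backing_dev_info: a published callback plus
 * the private state it dereferences. */
struct backing_info {
    int (*congested_fn)(void *data);  /* may be called from writeback */
    void *congested_data;             /* points at driver-private state */
};

static int fake_congested(void *data)
{
    return *(int *)data;              /* oopses if state already gone */
}

/* Defensive call site: treat a half-torn-down device as not congested. */
static int query_congested(struct backing_info *bdi)
{
    if (!bdi->congested_fn || !bdi->congested_data)
        return 0;
    return bdi->congested_fn(bdi->congested_data);
}

/* Safe ordering: unpublish the callback before freeing the state
 * it depends on. */
static void safe_shutdown(struct backing_info *bdi)
{
    bdi->congested_fn = NULL;
    bdi->congested_data = NULL;
    /* ... only now is it safe to free the private state ... */
}
```

In the oops below, raid5 clears mddev->private without clearing
congested_fn, so a late caller (pdflush, here) takes the NULL dereference
this ordering avoids.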

Maybe pdflush is just trying to sync the block device, even though
there is no dirty data ... dunno.

But I don't think it is related to the cfq problem as this one is only
a problem when the array is being stopped.

Thanks,
NeilBrown


Re: [dm-devel] Fw: BUG: Files corrupt after moving LVM volume to USB disk

2007-03-22 Thread Marti Raudsepp

Summary of what I've managed to rule out so far:
1. this problem does not occur without dm-crypt
2. this problem does not occur when the file system hasn't been moved from
  the SATA disk to USB
3. Both reiserfs and ext3 are affected by this problem, but ext3
merely slows down.
4. Unmounting and re-mounting has no effect.

My initial report contains more details.

On 3/22/07, Alasdair G Kergon <[EMAIL PROTECTED]> wrote:

A couple of patches to try:
dm-io-fix-bi_max_vecs.patch
dm-merge-max_hw_sector.patch

and perhaps these three:
dm-crypt-fix-call-to-clone_init.patch
dm-crypt-fix-avoid-cloned-bio-ref-after-free.patch
dm-crypt-fix-remove-first_clone.patch


No luck. :(

And thanks for reminding me how annoying reboots are...

On 3/21/07, Lennart Sorensen <[EMAIL PROTECTED]> wrote:

Does this happen only with this combination, or can you eliminate
something as the cause?


Doesn't occur if dm-crypt is not involved (e.g., when placing reiserfs or
ext3 straight on top of /dev/primary/punchbag). I don't know if LVM is
necessary, as I don't have any unallocated space for "oldschool" partitions
left.

Nor do these problems occur when creating either file system directly on the
USB disk -- so, only when they have been lvmoved from the SATA disk, or when
the volume is physically dd'ed to the USB disk.

On 3/21/07, Lennart Sorensen <[EMAIL PROTECTED]> wrote:

Does it happen with ext3 or only reiserfs?


Tried with ext3 (LVM+dm-crypt+ext3), and the files didn't appear corrupt,
however, I could still see those messages in my dmesg, and disk access was
*extremely* slow, causing around 10k context switches per second (!?) reading
below 5 MB/s. Normally, I can get read speeds around 12-22 MB/s with this
particular USB disk. However, it's not CPU-bound, as over 80% of CPU time is
still spent in iowait.

When the ext3 file system is created directly on a volume that is located on
the USB disk, however, these problems do not occur, just like with reiserfs.

However, when I copied the same files to a reiserfs volume when the LV had
already been pvmoved to /dev/sdb, the files were readable (as pointed out in
my initial post). Interestingly, it did not introduce the same slowdown as
with ext3, reading around 13-14 MB/s with ~3k context switches per second.

To clarify (I double-checked all these scenarios):
(1) files written *before* pvmove are corrupt with reiserfs
(2) files written *after*  pvmove read fast with reiserfs
(3) files written *before* pvmove read slowly with ext3
(4) files written *after*  pvmove read slowly with ext3
(5) files written to an ext3 volume that was formatted on the USB disk and not
   pvmoved are fast and don't report these dmesg warnings.

This is purely my speculation, and I can't pretend to know much about the
inner workings of device-mapper/file systems, but it appears that ext3
statically derives some attributes during the creation of a file system,
while reiserfs gets them runtime. ext3 contains a slow workaround if these
attributes don't match (and still posts a warning), but reiserfs
returns outright I/O errors.

(P.S. does the LKML archive randomly omit some double-newlines in posts, or
is this a problem on my end?)

Marti Raudsepp


Re: 2.6.20.3 AMD64 oops in CFQ code

2007-03-22 Thread Dan Williams

On 3/22/07, Dan Williams <[EMAIL PROTECTED]> wrote:

On 3/22/07, Neil Brown <[EMAIL PROTECTED]> wrote:
> On Thursday March 22, [EMAIL PROTECTED] wrote:
> > On Thu, Mar 22 2007, [EMAIL PROTECTED] wrote:
> > > > 3 (I think) seperate instances of this, each involving raid5. Is your
> > > > array degraded or fully operational?
> > >
> > > Ding! A drive fell out the other day, which is why the problems only
> > > appeared recently.
> > >
> > > md5 : active raid5 sdf4[5] sdd4[3] sdc4[2] sdb4[1] sda4[0]
> > >   1719155200 blocks level 5, 64k chunk, algorithm 2 [6/5] [_U]
> > >   bitmap: 149/164 pages [596KB], 1024KB chunk
> > >
> > > H'm... this means that my alarm scripts aren't working.  Well, that's
> > > good to know.  The drive is being re-integrated now.
> >
> > Heh, at least something good came out of this bug then :-)
> > But that's reaffirming. Neil, are you following this? It smells somewhat
> > fishy wrt raid5.
>
> Yes, I've been trying to pay attention
>
> The evidence does seem to point to raid5 and degraded arrays being
> implicated.  However I'm having trouble finding how the fact that an array
> is degraded would be visible down in the elevator except for having a
> slightly different distribution of reads and writes.
>
> One possible way is that if an array is degraded, then some read
> requests will go through the stripe cache rather than direct to the
> device.  However I would more expect the direct-to-device path to have
> problems as it is much newer code.  Going through the cache for reads
> is very well tested code - and reads come from the cache for most
> writes anyway, so the elevator will still see lots of single-page
> reads.  It only ever sees single-page writes.
>
> There might be more pressure on the stripe cache when running
> degraded, so we might call the ->unplug_fn a little more often, but I
> doubt that would be noticeable.
>
> As you seem to suggest by the patch, it does look like some sort of
> unlocked access to the cfq_queue structure.  However apart from the
> comment before cfq_exit_single_io_context being in the wrong place
> (should be before __cfq_exit_single_io_context) I cannot see anything
> obviously wrong with the locking around that structure.
>
> So I'm afraid I'm stumped too.
>
> NeilBrown

Not a cfq failure, but I have been able to reproduce a different oops
at array stop time while i/o's were pending.  I have not dug into it
enough to suggest a patch, but I wonder if it is somehow related to
the cfq failure since it involves congestion and drives going away:

md: md0: recovery done.
Unable to handle kernel NULL pointer dereference at virtual address 00bc
pgd = 40004000
[00bc] *pgd=
Internal error: Oops: 17 [#1]
Modules linked in:
CPU: 0
PC is at raid5_congested+0x14/0x5c
LR is at sync_sb_inodes+0x278/0x2ec
pc : [<402801cc>]lr : [<400a39e8>]Not tainted
sp : 8a3e3ec4  ip : 8a3e3ed4  fp : 8a3e3ed0
r10: 40474878  r9 : 40474870  r8 : 40439710
r7 : 8a3e3f30  r6 : bfa76b78  r5 : 4161dc08  r4 : 40474800
r3 : 402801b8  r2 : 0004  r1 : 0001  r0 : 
Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  Segment kernel
Control: 400397F
Table: 7B7D4018  DAC: 0035
Process pdflush (pid: 1371, stack limit = 0x8a3e2250)
Stack: (0x8a3e3ec4 to 0x8a3e4000)
3ec0:  8a3e3f04 8a3e3ed4 400a39e8 402801c4 8a3e3f24 000129f9 40474800
3ee0: 4047483c 40439a44 8a3e3f30 40439710 40438a48 4045ae68 8a3e3f24 8a3e3f08
3f00: 400a3ca0 400a377c 8a3e3f30 1162 00012bed 40438a48 8a3e3f78 8a3e3f28
3f20: 40069b58 400a3bfc 00011e41 8a3e3f38   8a3e3f28 0400
3f40:      0025 8a3e3f80 8a3e3f8c
3f60: 40439750 8a3e2000 40438a48 8a3e3fc0 8a3e3f7c 4006ab68 40069a8c 0001
3f80: bfae2ac0 40069a80  8a3e3f8c 8a3e3f8c 00012805  8a3e2000
3fa0: 9e7e1f1c 4006aa40 0001  fffc 8a3e3ff4 8a3e3fc4 4005461c
3fc0: 4006aa4c 0001      
3fe0:    8a3e3ff8 40042320 40054520  
Backtrace:
[<402801b8>] (raid5_congested+0x0/0x5c) from [<400a39e8>]
(sync_sb_inodes+0x278/0x2ec)
[<400a3770>] (sync_sb_inodes+0x0/0x2ec) from [<400a3ca0>]
(writeback_inodes+0xb0/0xb8)
[<400a3bf0>] (writeback_inodes+0x0/0xb8) from [<40069b58>]
(wb_kupdate+0xd8/0x160)
 r7 = 40438A48  r6 = 00012BED  r5 = 1162  r4 = 8A3E3F30
[<40069a80>] (wb_kupdate+0x0/0x160) from [<4006ab68>] (pdflush+0x128/0x204)
 r8 = 40438A48  r7 = 8A3E2000  r6 = 40439750  r5 = 8A3E3F8C
 r4 = 8A3E3F80
[<4006aa40>] (pdflush+0x0/0x204) from [<4005461c>] (kthread+0x108/0x134)
[<40054514>] (kthread+0x0/0x134) from [<40042320>] (do_exit+0x0/0x844)
Code: e92dd800 e24cb004 e590 e3a01001 (e59030bc)
md: md0 stopped.
md: unbind
md: export_rdev(sda)
md: unbind
md: export_rdev(sdd)
md: unbind
md: export_rdev(sdc)
md: unbind
md: export_rdev(sdb)

2.6.20-rc3-iop1 on an iop348 platform.  SATA controller is sata_vsc.

Sorry, that's 2.6.21-rc3-iop1.

Re: 2.6.20.3 AMD64 oops in CFQ code

2007-03-22 Thread Dan Williams

On 3/22/07, Neil Brown <[EMAIL PROTECTED]> wrote:

On Thursday March 22, [EMAIL PROTECTED] wrote:
> On Thu, Mar 22 2007, [EMAIL PROTECTED] wrote:
> > > 3 (I think) seperate instances of this, each involving raid5. Is your
> > > array degraded or fully operational?
> >
> > Ding! A drive fell out the other day, which is why the problems only
> > appeared recently.
> >
> > md5 : active raid5 sdf4[5] sdd4[3] sdc4[2] sdb4[1] sda4[0]
> >   1719155200 blocks level 5, 64k chunk, algorithm 2 [6/5] [_U]
> >   bitmap: 149/164 pages [596KB], 1024KB chunk
> >
> > H'm... this means that my alarm scripts aren't working.  Well, that's
> > good to know.  The drive is being re-integrated now.
>
> Heh, at least something good came out of this bug then :-)
> But that's reaffirming. Neil, are you following this? It smells somewhat
> fishy wrt raid5.

Yes, I've been trying to pay attention

The evidence does seem to point to raid5 and degraded arrays being
implicated.  However I'm having trouble finding how the fact that an array
is degraded would be visible down in the elevator except for having a
slightly different distribution of reads and writes.

One possible way is that if an array is degraded, then some read
requests will go through the stripe cache rather than direct to the
device.  However I would more expect the direct-to-device path to have
problems as it is much newer code.  Going through the cache for reads
is very well tested code - and reads come from the cache for most
writes anyway, so the elevator will still see lots of single-page
reads.  It only ever sees single-page writes.

There might be more pressure on the stripe cache when running
degraded, so we might call the ->unplug_fn a little more often, but I
doubt that would be noticeable.

As you seem to suggest by the patch, it does look like some sort of
unlocked access to the cfq_queue structure.  However apart from the
comment before cfq_exit_single_io_context being in the wrong place
(should be before __cfq_exit_single_io_context) I cannot see anything
obviously wrong with the locking around that structure.

So I'm afraid I'm stumped too.

NeilBrown


Not a cfq failure, but I have been able to reproduce a different oops
at array stop time while i/o's were pending.  I have not dug into it
enough to suggest a patch, but I wonder if it is somehow related to
the cfq failure since it involves congestion and drives going away:

md: md0: recovery done.
Unable to handle kernel NULL pointer dereference at virtual address 00bc
pgd = 40004000
[00bc] *pgd=
Internal error: Oops: 17 [#1]
Modules linked in:
CPU: 0
PC is at raid5_congested+0x14/0x5c
LR is at sync_sb_inodes+0x278/0x2ec
pc : [<402801cc>]lr : [<400a39e8>]Not tainted
sp : 8a3e3ec4  ip : 8a3e3ed4  fp : 8a3e3ed0
r10: 40474878  r9 : 40474870  r8 : 40439710
r7 : 8a3e3f30  r6 : bfa76b78  r5 : 4161dc08  r4 : 40474800
r3 : 402801b8  r2 : 0004  r1 : 0001  r0 : 
Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  Segment kernel
Control: 400397F
Table: 7B7D4018  DAC: 0035
Process pdflush (pid: 1371, stack limit = 0x8a3e2250)
Stack: (0x8a3e3ec4 to 0x8a3e4000)
3ec0:  8a3e3f04 8a3e3ed4 400a39e8 402801c4 8a3e3f24 000129f9 40474800
3ee0: 4047483c 40439a44 8a3e3f30 40439710 40438a48 4045ae68 8a3e3f24 8a3e3f08
3f00: 400a3ca0 400a377c 8a3e3f30 1162 00012bed 40438a48 8a3e3f78 8a3e3f28
3f20: 40069b58 400a3bfc 00011e41 8a3e3f38   8a3e3f28 0400
3f40:      0025 8a3e3f80 8a3e3f8c
3f60: 40439750 8a3e2000 40438a48 8a3e3fc0 8a3e3f7c 4006ab68 40069a8c 0001
3f80: bfae2ac0 40069a80  8a3e3f8c 8a3e3f8c 00012805  8a3e2000
3fa0: 9e7e1f1c 4006aa40 0001  fffc 8a3e3ff4 8a3e3fc4 4005461c
3fc0: 4006aa4c 0001      
3fe0:    8a3e3ff8 40042320 40054520  
Backtrace:
[<402801b8>] (raid5_congested+0x0/0x5c) from [<400a39e8>]
(sync_sb_inodes+0x278/0x2ec)
[<400a3770>] (sync_sb_inodes+0x0/0x2ec) from [<400a3ca0>]
(writeback_inodes+0xb0/0xb8)
[<400a3bf0>] (writeback_inodes+0x0/0xb8) from [<40069b58>]
(wb_kupdate+0xd8/0x160)
r7 = 40438A48  r6 = 00012BED  r5 = 1162  r4 = 8A3E3F30
[<40069a80>] (wb_kupdate+0x0/0x160) from [<4006ab68>] (pdflush+0x128/0x204)
r8 = 40438A48  r7 = 8A3E2000  r6 = 40439750  r5 = 8A3E3F8C
r4 = 8A3E3F80
[<4006aa40>] (pdflush+0x0/0x204) from [<4005461c>] (kthread+0x108/0x134)
[<40054514>] (kthread+0x0/0x134) from [<40042320>] (do_exit+0x0/0x844)
Code: e92dd800 e24cb004 e590 e3a01001 (e59030bc)
md: md0 stopped.
md: unbind
md: export_rdev(sda)
md: unbind
md: export_rdev(sdd)
md: unbind
md: export_rdev(sdc)
md: unbind
md: export_rdev(sdb)

2.6.20-rc3-iop1 on an iop348 platform.  SATA controller is sata_vsc.

--
Dan

kmalloc() with size zero

2007-03-22 Thread Stephane Eranian
Hello,

I ran into an issue with perfmon where I ended up calling
kmalloc() with a size of zero. To my surprise, this did
not return NULL but a valid data address.

I am wondering if this is a property of kmalloc() or simply
a bug. It is the case that the __kmalloc() code does not
check for zero size.
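For comparison, userspace malloc() has the same latitude: malloc(0) may
return NULL or a unique pointer that must not be dereferenced but may be
freed. One defensive pattern is to make the zero-size case explicit rather
than rely on what the allocator happens to return (a sketch, not perfmon
code):

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Wrapper that treats a zero-size request as a caller bug instead
 * of depending on allocator-specific behaviour. */
static void *checked_alloc(size_t size)
{
    if (size == 0) {
        fprintf(stderr, "checked_alloc: zero-size request\n");
        return NULL;   /* kernel analogue: WARN and return NULL */
    }
    return malloc(size);
}
```

Whether kmalloc(0) returning a live pointer is intended behaviour or a
missing check is exactly the question above; either way, callers that can
pass zero are better off catching it themselves.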

Thanks,

-- 

-Stephane


Re: [PATCH RFC] Change softlockup watchdog to ignore stolen time

2007-03-22 Thread Jeremy Fitzhardinge
Jeremy Fitzhardinge wrote:
> The softlockup watchdog is currently a nuisance in a virtual machine,
> since the whole system could have the CPU stolen from it for a long
> period of time.  While it would be unlikely for a guest domain to be
> denied timer interrupts for over 10s, it could happen and any softlockup
> message would be completely spurious.
>
> Earlier I proposed that sched_clock() return time in unstolen
> nanoseconds, which is how Xen and VMI currently implement it.  If the
> softlockup watchdog uses sched_clock() to measure time, it would
> automatically ignore stolen time, and therefore only report when the
> guest itself locked up.  When running native, sched_clock() returns
> real-time nanoseconds, so the behaviour would be unchanged.
>
> Does this seem sound?
>
> Also, softlockup.c's use of jiffies seems archaic now.  Should it be
> converted to use timers?  Mightn't it report lockups just because there
> was no timer event?
>
> Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Less desperately broken version.

diff -r b41fb9e70d72 kernel/softlockup.c
--- a/kernel/softlockup.c   Thu Mar 22 16:25:15 2007 -0700
+++ b/kernel/softlockup.c   Thu Mar 22 17:03:03 2007 -0700
@@ -17,8 +17,8 @@
 
 static DEFINE_SPINLOCK(print_lock);
 
-static DEFINE_PER_CPU(unsigned long, touch_timestamp);
-static DEFINE_PER_CPU(unsigned long, print_timestamp);
+static DEFINE_PER_CPU(unsigned long long, touch_timestamp);
+static DEFINE_PER_CPU(unsigned long long, print_timestamp);
 static DEFINE_PER_CPU(struct task_struct *, watchdog_task);
 
 static int did_panic = 0;
@@ -37,7 +37,7 @@ static struct notifier_block panic_block
 
 void touch_softlockup_watchdog(void)
 {
-   __raw_get_cpu_var(touch_timestamp) = jiffies;
+   __raw_get_cpu_var(touch_timestamp) = sched_clock();
 }
 EXPORT_SYMBOL(touch_softlockup_watchdog);
 
@@ -48,10 +48,15 @@ void softlockup_tick(void)
 void softlockup_tick(void)
 {
int this_cpu = smp_processor_id();
-   unsigned long touch_timestamp = per_cpu(touch_timestamp, this_cpu);
+   unsigned long long touch_timestamp = per_cpu(touch_timestamp, this_cpu);
+   unsigned long long now;
 
-   /* prevent double reports: */
-   if (per_cpu(print_timestamp, this_cpu) == touch_timestamp ||
+   /* watchdog task hasn't updated timestamp yet */
+   if (touch_timestamp == 0)
+   return;
+
+   /* report at most once a second */
+   if (per_cpu(print_timestamp, this_cpu) < (touch_timestamp + NSEC_PER_SEC) ||
did_panic ||
!per_cpu(watchdog_task, this_cpu))
return;
@@ -62,12 +67,14 @@ void softlockup_tick(void)
return;
}
 
+   now = sched_clock();
+
/* Wake up the high-prio watchdog task every second: */
-   if (time_after(jiffies, touch_timestamp + HZ))
+   if (now > (touch_timestamp + NSEC_PER_SEC))
wake_up_process(per_cpu(watchdog_task, this_cpu));
 
/* Warn about unreasonable 10+ seconds delays: */
-   if (time_after(jiffies, touch_timestamp + 10*HZ)) {
+   if (now > (touch_timestamp + 10ull*NSEC_PER_SEC)) {
per_cpu(print_timestamp, this_cpu) = touch_timestamp;
 
spin_lock(&print_lock);
@@ -87,6 +94,9 @@ static int watchdog(void * __bind_cpu)
 
sched_setscheduler(current, SCHED_FIFO, &param);
current->flags |= PF_NOFREEZE;
+
+   /* initialize timestamp */
+   touch_softlockup_watchdog();
 
/*
 * Run briefly once per second to reset the softlockup timestamp.
@@ -120,7 +130,7 @@ cpu_callback(struct notifier_block *nfb,
printk("watchdog for %i failed\n", hotcpu);
return NOTIFY_BAD;
}
-   per_cpu(touch_timestamp, hotcpu) = jiffies;
+   per_cpu(touch_timestamp, hotcpu) = 0;
per_cpu(watchdog_task, hotcpu) = p;
kthread_bind(p, hotcpu);
break;



Re: [PATCH RFC] Change softlockup watchdog to ignore stolen time

2007-03-22 Thread Jeremy Fitzhardinge
Zachary Amsden wrote:
> No, it is not unlikely.  4-way SMP VMs idling exhibit this behavior
> with NO_HZ or NO_IDLE_HZ because they get quiet enough to schedule
> nothing on the APs.
>
> And that can happen on native hardware as well.

That's a separate problem.

>>
>> Also, softlockup.c's use of jiffies seems archaic now.  Should it be
>> converted to use timers?  Mightn't it report lockups just because there
>> was no timer event?
>>   
>
> This looks good to me, as a first order approximation.  But on native
> hardware, with NO_HZ, this is just broken to begin with.  Perhaps we
> should make SOFTLOCKUP depend on !NO_HZ.

OK, that just means the softlockup should, erm, do something else.  I
guess using an explicit timer would work, but I'm not sure if that
defeats the whole purpose.  Perhaps there should be a per-cpu disable
flag, which would be set when entering idle?

Something like this...

J

diff -r 3f00aa67786f include/linux/sched.h
--- a/include/linux/sched.h Thu Mar 22 17:03:13 2007 -0700
+++ b/include/linux/sched.h Thu Mar 22 17:09:23 2007 -0700
@@ -232,10 +232,18 @@ extern void scheduler_tick(void);
 
 #ifdef CONFIG_DETECT_SOFTLOCKUP
 extern void softlockup_tick(void);
+extern void softlockup_enable(void);
+extern void softlockup_disable(void);
 extern void spawn_softlockup_task(void);
 extern void touch_softlockup_watchdog(void);
 #else
 static inline void softlockup_tick(void)
+{
+}
+static inline void softlockup_enable(void)
+{
+}
+static inline void softlockup_disable(void)
 {
 }
 static inline void spawn_softlockup_task(void)
diff -r 3f00aa67786f kernel/softlockup.c
--- a/kernel/softlockup.c   Thu Mar 22 17:03:13 2007 -0700
+++ b/kernel/softlockup.c   Thu Mar 22 17:09:23 2007 -0700
@@ -20,6 +20,7 @@ static DEFINE_PER_CPU(unsigned long long
 static DEFINE_PER_CPU(unsigned long long, touch_timestamp);
 static DEFINE_PER_CPU(unsigned long long, print_timestamp);
 static DEFINE_PER_CPU(struct task_struct *, watchdog_task);
+static DEFINE_PER_CPU(int, enabled);
 
 static int did_panic = 0;
 
@@ -41,6 +42,17 @@ void touch_softlockup_watchdog(void)
 }
 EXPORT_SYMBOL(touch_softlockup_watchdog);
 
+void softlockup_enable(void)
+{
+   touch_softlockup_watchdog();
+   __get_cpu_var(enabled) = 1;
+}
+
+void softlockup_disable(void)
+{
+   __get_cpu_var(enabled) = 0;
+}
+
 /*
  * This callback runs from the timer interrupt, and checks
  * whether the watchdog thread has hung or not:
@@ -51,8 +63,8 @@ void softlockup_tick(void)
unsigned long long touch_timestamp = per_cpu(touch_timestamp, this_cpu);
unsigned long long now;
 
-   /* watchdog task hasn't updated timestamp yet */
-   if (touch_timestamp == 0)
+   /* return if not enabled */
+   if (!__get_cpu_var(enabled))
return;
 
/* report at most once a second */
@@ -95,8 +107,8 @@ static int watchdog(void * __bind_cpu)
sched_setscheduler(current, SCHED_FIFO, &param);
current->flags |= PF_NOFREEZE;
 
-   /* initialize timestamp */
-   touch_softlockup_watchdog();
+   /* enable on this cpu */
+   softlockup_enable();
 
/*
 * Run briefly once per second to reset the softlockup timestamp.
diff -r 3f00aa67786f kernel/time/tick-sched.c
--- a/kernel/time/tick-sched.c  Thu Mar 22 17:03:13 2007 -0700
+++ b/kernel/time/tick-sched.c  Thu Mar 22 17:09:23 2007 -0700
@@ -228,6 +228,8 @@ void tick_nohz_stop_sched_tick(void)
ts->idle_tick = ts->sched_timer.expires;
ts->tick_stopped = 1;
ts->idle_jiffies = last_jiffies;
+
+   softlockup_disable();
}
/*
 * calculate the expiry time for the next timer wheel
@@ -255,6 +257,7 @@ void tick_nohz_stop_sched_tick(void)
cpu_clear(cpu, nohz_cpu_mask);
}
raise_softirq_irqoff(TIMER_SOFTIRQ);
+
 out:
ts->next_jiffies = next_jiffies;
ts->last_jiffies = last_jiffies;
@@ -311,6 +314,8 @@ void tick_nohz_restart_sched_tick(void)
ts->tick_stopped  = 0;
hrtimer_cancel(&ts->sched_timer);
ts->sched_timer.expires = ts->idle_tick;
+
+   softlockup_enable();
 
while (1) {
/* Forward the time to expire in the future */
@@ -355,17 +360,12 @@ static void tick_nohz_handler(struct clo
tick_do_update_jiffies64(now);
 
/*
-* When we are idle and the tick is stopped, we have to touch
-* the watchdog as we might not schedule for a really long
-* time. This happens on complete idle SMP systems while
-* waiting on the login prompt. We also increment the "start
-* of idle" jiffy stamp so the idle accounting adjustment we
-* do when we go busy again does not account too much ticks.
-*/
-   if (ts->tick_stopped) {
-   touch_softlockup_watchdog();
+* Increment the "start of idle" jiffy stamp so the
