date:20070831

Re: [PATCH 5/5] Net: ath5k, kconfig changes

2007-08-31 Thread Nick Kossifidis

2007/8/31, Nick Kossifidis <[EMAIL PROTECTED]>:
> 2007/8/30, John W. Linville <[EMAIL PROTECTED]>:
> > On Thu, Aug 30, 2007 at 04:38:09AM +0300, Nick Kossifidis wrote:
> > > 2007/8/28, Christoph Hellwig <[EMAIL PROTECTED]>:
> >
> > > > Also this whole patch seems rather pointless.  It saves only
> > > > very little and turns the driver into a complete ifdef maze.
> >
> > > Also most
> > > people will use 5212 code only, 5211 cards are on some old laptops and
> > > 5210, well i couldn't even find  a 5210 for actual testing :P
> >
> > FWIW, I'd bet dollars to donuts that distros will enable them all
> > together.
> >
> > Is saving code space the only reason to turn these off?  How much
> > space do you save?
> >
> > Is there some way you can isolate and/or limit the number of ifdef
> > blocks further?  If so, we might consider a version of this patch
> > that depends on EMBEDDED or somesuch...?
> >
> > John
>
> O.K. as a first step i'll limit 5210 code only then, just an option
> like "support older 5210 chipsets" which is going to be off by default
> instead of 3 options. It's not just saving space, it's also saving
> some runtime checks. It's not really a gain in performance though,
> most checks are done during initialization and dfs setup, i just
> thought it would be useful to save as much cpu as possible.
>

Well, after some thought I removed them all; there is no real gain from
this in most cases (since people will mostly use newer 5212 chips and
compatibles).



-- 
GPG ID: 0xD21DB2DB
As you read this post global entropy rises. Have Fun ;-)
Nick
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH -mm] add-a-rounddown_pow_of_two-routine-to-log2h.patch fix

2007-08-31 Thread Mariusz Kozlowski

Hello,

This patch fixes the unbalanced parenthesis introduced by
add-a-rounddown_pow_of_two-routine-to-log2h.patch.

Signed-off-by: Mariusz Kozlowski <[EMAIL PROTECTED]>

 include/linux/log2.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-2.6.23-rc4-mm1-a/include/linux/log2.h 2007-09-01 07:23:28.0 +0200
+++ linux-2.6.23-rc4-mm1-b/include/linux/log2.h 2007-09-01 07:29:27.0 +0200
@@ -186,7 +186,7 @@ unsigned long __rounddown_pow_of_two(uns
 (  \
__builtin_constant_p(n) ? ( \
(n == 1) ? 0 :  \
-   (1UL << ilog2(n)) : \
+   (1UL << ilog2(n))) :\
__rounddown_pow_of_two(n)   \
  )
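
For reference, a userspace sketch of how the macro reads once the parenthesis is balanced. The helpers ilog2_sketch() and rounddown_runtime_sketch() are stand-ins invented for this sketch, not the kernel's ilog2() or __rounddown_pow_of_two(); the n == 1 case is copied from the hunk above as-is, and __builtin_constant_p assumes GCC or Clang.

```c
#include <assert.h>

/* Stand-in for ilog2(): index of the highest set bit (n must be > 0). */
static inline int ilog2_sketch(unsigned long n)
{
	int bit = -1;

	while (n) {
		n >>= 1;
		bit++;
	}
	return bit;
}

/* Stand-in for the non-constant path of the macro. */
static inline unsigned long rounddown_runtime_sketch(unsigned long n)
{
	assert(n != 0);	/* no power of two lies at or below 0 */
	return 1UL << ilog2_sketch(n);
}

/*
 * Same shape as the fixed macro: the "(" opened after the first "?"
 * is closed before the ":" that selects the runtime fallback.
 */
#define rounddown_pow_of_two_sketch(n)			\
(							\
	__builtin_constant_p(n) ? (			\
		((n) == 1) ? 0 :			\
		(1UL << ilog2_sketch(n))) :		\
	rounddown_runtime_sketch(n)			\
)
```

With the closing parenthesis missing, the ":" of the outer conditional would have been parsed inside the inner group and the macro would not compile.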


Re: [RFC 14/26] SLUB: __GFP_MOVABLE and SLAB_TEMPORARY support

2007-08-31 Thread KAMEZAWA Hiroyuki

On Fri, 31 Aug 2007 18:41:21 -0700
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> +#ifndef CONFIG_HIGHMEM
> + if (s->kick || s->flags & SLAB_TEMPORARY)
> + flags |= __GFP_MOVABLE;
> +#endif
> +

Should I do this as

#if !defined(CONFIG_HIGHMEM) && !defined(CONFIG_MEMORY_HOTREMOVE)

?

-Kame

2.6.23-rc4-mm1

2007-08-31 Thread Andrew Morton


ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.23-rc4/2.6.23-rc4-mm1/

- git-kbuild is broken and has been dropped

- git-ixgb is broken by git-net and has been dropped

- git-md-accel is broken by MD fixes and has been dropped

- git-v9fs breaks the build on all non-x86 and the fs has been disabled in
  config

- dynticks-for-x86_64 has returned



Changes since 2.6.23-rc3-mm1:


 origin.patch
 git-acpi.patch
 git-alsa.patch
 git-audit-master.patch
 git-avr32.patch
 git-cifs.patch
 git-cpufreq.patch
 git-powerpc.patch
 git-dvb.patch
 git-hwmon.patch
 git-gfs2-nmw.patch
 git-hid.patch
 git-ia64.patch
 git-ieee1394.patch
 git-infiniband.patch
 git-input.patch
 git-jfs.patch
 git-jg-misc.patch
 git-kvm.patch
 git-libata-all.patch
 git-m32r.patch
 git-mips.patch
 git-mmc.patch
 git-mtd.patch
 git-ubi.patch
 git-netdev-all.patch
 git-net.patch
 git-backlight.patch
 git-nfs.patch
 git-nfsd.patch
 git-ocfs2.patch
 git-r8169.patch
 git-selinux.patch
 git-s390.patch
 git-sched.patch
 git-sh.patch
 git-scsi-misc.patch
 git-scsi-rc-fixes.patch
 git-block.patch
 git-unionfs.patch
 git-v9fs.patch
 git-watchdog.patch
 git-wireless.patch
 git-ipwireless_cs.patch
 git-newsetup.patch
 git-xfs.patch
 git-cryptodev.patch
 git-xtensa.patch
 git-kgdb.patch

 git trees

-ecryptfs-fix-lookup-error-for-special-files.patch
-sparsemem-ensure-we-initialise-the-node-mapping-for-sparsemem_static.patch
-tpmdd-maintainers.patch
-kernel-auditscc-fix-an-off-by-one.patch
-document-linux-memory-policy-v3.patch
-futex_unlock_pi-hurts-my-brain-and-may-cause.patch
-dont-optimise-away-baud-rate-changes-when-bother-is-used.patch
-serial-add-support-for-ite-887x-chips.patch
-serial_txx9-fix-modem-control-line-handling.patch
-serial-8250-handle-saving-the-clear-on-read-bits-from-the-lsr.patch
-add-blacklisting-capability-to-serial_pci-to-avoid-misdetection.patch
-free_irq-fix-debug_shirq-handling.patch
-documentation-fix-getdelaysc-example-l-option-and-segv.patch
-h8300-missing-include.patch
-ensure-we-count-pages-transitioning-inactive-via-clear_active_flags.patch
-wait-for-page-writeback-when-directly-reclaiming-contiguous-areas.patch
-wait-for-page-writeback-when-directly-reclaiming-contiguous-areas-fix.patch
-correct-name-for-rtc-m41t80.patch
-fix-null-pointer-dereference-in-__vm_enough_memory.patch
-m68k-asm-pageh-needs-linux-compilerh.patch
-m68k-kill-superfluous-extern.patch
-m68k-remove-unnecessary-m68k_memoffset-export-and-init.patch
-remove-dead-code-in-via-pmu68k.patch
-m68k-use-_ac-instead-of-ifdef-__assembly__.patch
-m68k-enable-arbitary-speed-tty-support.patch
-m68k-dont-include-rodata-into-text-segment.patch
-m68k-fix-a-few-hickups-in-drivers-scsi-kconfig.patch
-zorro-make-sysfs-config-attribute-read-only.patch
-m68k-mac-make-mac_hid_mouse_emulate_buttons-declaration-visible.patch
-introduce-config_check_signature-was-re-uninline.patch
-posix-timers-fix-deletion-race.patch
-posix-timers-fix-creation-race.patch
-signalfd-fix-interaction-with-posix-timers.patch
-signalfd-make-it-group-wide-fix-posix-timers-scheduling.patch
-ipmi-fix-warning-in-ipmi_si_intfc.patch
-slab-skip-calling-cache_free_alien-when-the-platform-is-not-numa-capable.patch
-synclink_gt-fix-module-reference.patch
-fix-vm_fault-flags-conversion-for-hugetlb.patch
-w1-fix-w1_remove_master_device-searching.patch
-md-make-sure-a-re-add-after-a-restart-honours-bitmap-when-resyncing.patch
-md-correctly-update-sysfs-when-a-raid1-is-reshaped.patch
-uml-fix-previous-request-size-limit-fix.patch
-autofs4-deadlock-during-create.patch
-serial-add-pci-ids-for-pa-semi-pwrficient-onchip-uarts.patch
-cfag12864b-fix.patch
-slub-use-atomic_long_read-for-atomic_long-variables.patch
-slub-do-not-fail-on-broken-memory-configurations.patch
-rtc-max6902-minor-fixes.patch
-exec-kill-unsafe-bug_onsig-count-checks.patch
-xen-i386-xen-heads-fix-sections-mixup-update-2.patch
-check-for-ppc32-in-imsttfb.patch
-selectionh-add-tty_struct-forward-declaration.patch
-newport_con-warning-fix.patch
-i386-fix-lazy-mode-vmalloc-synchronization-for-paravirt.patch
-get_nodes-should-ignore-invalid-node.patch
-fix-ensure-we-dont-use-bootconsoles-after-init-has-been-released.patch
-au1100fb-move-au1100fb_fb_blank-beforce.patch
-pm-fix-dependencies-of-config_suspend-and-config_hibernation-updated-3x.patch
-remove-bdput-from-do_open-in-fs-block_devc.patch
-remove-bdput-from-do_open-in-fs-block_devc-fix.patch
-apply-memory-policies-to-top-two-highest-zones-when-highest-zone-is-zone_movable.patch
-enable-gpes-before-calling-_wak-on-resume.patch
-acpi-fix-a-warning-of-discarding-qualifiers-from-pointer-target-type.patch
-agk-dm-dm-rdac-fix-request-cmd_flags.patch
-gregkh-driver-sysfs-fix-locking-in-sysfs_lookup-and-sysfs_rename_dir.patch
-gregkh-driver-fix-off-by-one-in-sys-module-refcnt.patch
-gregkh-driver-howto-korean-translation-of-documentation-howto.patch
-gregkh-driver-howto-latest-lxr-url-address-changed.patch
-fix-gregkh-driver-driver-core-change-add_uevent_var-to-use-a-struct.patch

BUG POWERPC: snd-powermac hangs since 'Merge 32 and 64 bits asm-powerpc/io.h'

2007-08-31 Thread Dave Vasilevsky

When playing audio with the snd-powermac driver on a PowerMac G4
Quicksilver (Tumbler audio) the sound hangs after a few seconds.

- The time before a hang varies from one second to one minute.
- Killing the process playing sound and starting again will allow
sound to continue (for a few more seconds).
- Many different userspace audio systems and different audio sources
all encounter this bug--definitely a kernel issue.
- Vanilla kernels from 2.6.20 up to git HEAD show this behavior.
Others have reported this bug on distro kernels.[1][2]

I used git-bisect to find that the regression first occurred after git
commit 68a64357d15ae4f596e92715719071952006e83c "powerpc: Merge 32 and
64 bits asm-powerpc/io.h" by benh.[3]

My kernel debugging skills are admittedly limited, but I scattered
printks through some relevant files and found that audio stops just as
snd_pmac_pcm_update in sound/ppc/pmac.c encounters a struct dbdma_cmd
with xfer_status == 0x8088. Normally it should be 0x84 (ie: ACTIVE |
RUN) or 0x0. Hopefully this means something to somebody. I'm willing
to do more debugging if more information is necessary, kindly CC me on
replies to this message.

Thank you for your time,
Dave Vasilevsky


[1] https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.20/+bug/87652
[2] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=436723
[3] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=68a64357d15ae4f596e92715719071952006e83c

socket locking obscure code

2007-08-31 Thread Cyrill Gorcunov

Hi LKML,
 
Looking through lock_sock_nested (while trying to catch the
BUG in CIFS reported as bugzilla #8377), I found
that lock_sock_nested consists of:

void fastcall lock_sock_nested(struct sock *sk, int subclass)
{
    might_sleep();
--->spin_lock_bh(&sk->sk_lock.slock);
    if (sk->sk_lock.owner)
        __lock_sock(sk);
    sk->sk_lock.owner = (void *)1;
--->spin_unlock(&sk->sk_lock.slock);
    /*
     * The sk_lock has mutex_lock() semantics here:
     */
    mutex_acquire(&sk->sk_lock.dep_map, subclass, 0, _RET_IP_);
    local_bh_enable();
}
 
So why is spin_unlock used there instead of spin_unlock_bh?
Is it to cope with __lock_sock? Am I right?
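
One way to read the sequence, sketched as a userspace model: spin_lock_bh() disables bottom halves and takes the spinlock, plain spin_unlock() drops only the spinlock, and the final local_bh_enable() undoes the disable, so the pair is balanced by the end of the function. Everything below (the model_ names and counters) is hypothetical and only tracks that pairing; it does not explain why the kernel orders the steps this way.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state, not kernel data structures. */
struct model_cpu {
	int bh_disable_depth;	/* > 0 means bottom halves are disabled */
	bool slock_held;	/* models sk->sk_lock.slock */
};

static void model_local_bh_disable(struct model_cpu *c)
{
	c->bh_disable_depth++;
}

static void model_local_bh_enable(struct model_cpu *c)
{
	assert(c->bh_disable_depth > 0);
	c->bh_disable_depth--;
}

static void model_spin_lock(struct model_cpu *c)
{
	assert(!c->slock_held);
	c->slock_held = true;
}

static void model_spin_unlock(struct model_cpu *c)
{
	assert(c->slock_held);
	c->slock_held = false;
}

/* spin_lock_bh(): disable bottom halves, then take the lock. */
static void model_spin_lock_bh(struct model_cpu *c)
{
	model_local_bh_disable(c);
	model_spin_lock(c);
}

/* Same order of operations as lock_sock_nested(); owner handling omitted. */
static void model_lock_sock_nested(struct model_cpu *c)
{
	model_spin_lock_bh(c);
	/* ... ownership would be claimed here ... */
	model_spin_unlock(c);		/* lock dropped, BHs still disabled */
	assert(c->bh_disable_depth == 1);
	/* ... the mutex_acquire() annotation would sit here ... */
	model_local_bh_enable(c);	/* disable/enable now balanced */
}

/* Helper for trying the sketch: returns the final BH-disable depth. */
int model_run(void)
{
	struct model_cpu c = { 0, false };

	model_lock_sock_nested(&c);
	return c.bh_disable_depth;
}
```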
 
Cyrill
 

Re: [ANNOUNCE/RFC] Really Fair Scheduler

2007-08-31 Thread Mike Galbraith

On Fri, 2007-08-31 at 15:22 +0200, Roman Zippel wrote:

> Were there some kernel messages while running it?

Nope.

-Mike


Re: kernel-doc: fix doc blocks and html

2007-08-31 Thread Randy Dunlap

On Fri, 31 Aug 2007 20:58:45 -0700 Randy Dunlap wrote:

> From: Randy Dunlap <[EMAIL PROTECTED]>
> 
> Cc: [EMAIL PROTECTED]
> 
> Johannes Berg reports (Thanks!) that &struct names are not highlighted in
> html output format when they are inside a DOC: block.
> 
> DOC: blocks were not escaped thru xml_escape() like other kernel-doc
> comments were.  Fixed that.

Johannes is using a feature of kernel-doc that I wasn't even familiar
with, also one that no one else is using.  I'm sure that Johannes can
point us to some source code and generated output for it though.

If you want to see a little of it in source code form, 2 net drivers
use it, but then they aren't processed by kernel-doc for generated
output.  They are drivers/net/3c501.c & 3c527.c.  Look for the
"DOC:" comment blocks.

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

[PATCH/RFC] kernel-doc: fix doc blocks and html

2007-08-31 Thread Randy Dunlap

From: Randy Dunlap <[EMAIL PROTECTED]>

Cc: [EMAIL PROTECTED]

Johannes Berg reports (Thanks!) that &struct names are not highlighted in
html output format when they are inside a DOC: block.

DOC: blocks were not escaped thru xml_escape() like other kernel-doc
comments were.  Fixed that.

However, that left a problem with <p> ($blankline_html) being processed
thru xml_escape(), converting it to an escaped form of <p>, which isn't
good for the generated html output (the <p> should remain unchanged), so
this patch also introduces the notion of "local" kernel-doc meta-characters
('\\\\mnemonic:'), which are converted to html just before writing the
stream to its output file.


Please report any problems that you (anyone) see in "highlighting"
in any output mode (text, man, html, xml).

Also update copyright to include me.

Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
---
 scripts/kernel-doc |   41 -
 1 file changed, 32 insertions(+), 9 deletions(-)

--- linux-2.6.23-rc4.orig/scripts/kernel-doc
+++ linux-2.6.23-rc4/scripts/kernel-doc
@@ -5,6 +5,7 @@ use strict;
 ## Copyright (c) 1998 Michael Zucchi, All Rights Reserved##
 ## Copyright (C) 2000, 1  Tim Waugh <[EMAIL PROTECTED]>  ##
 ## Copyright (C) 2001  Simon Huggins ##
+## Copyright (C) 2005-2007  Randy Dunlap ##
 ##  ##
 ## #define enhancements by Armin Kuster <[EMAIL PROTECTED]> ##
 ## Copyright (c) 2000 MontaVista Software, Inc. ##
@@ -161,7 +162,7 @@ my $type_constant = '\%([-_\w]+)';
 my $type_func = '(\w+)\(\)';
 my $type_param = '\@(\w+)';
 my $type_struct = '\&((struct\s*)*[_\w]+)';
-my $type_struct_xml = '\\\amp;((struct\s*)*[_\w]+)';
+my $type_struct_xml = '\\((struct\s*)*[_\w]+)';
 my $type_env = '(\$\w+)';
 
 # Output conversion substitutions.
@@ -173,7 +174,9 @@ my %highlights_html = ( $type_constant, 
$type_struct_xml, "\$1",
$type_env, "\$1",
$type_param, "\$1" );
-my $blankline_html = "";
+my $local_lt = "lt:";
+my $local_gt = "gt:";
+my $blankline_html = $local_lt . "p" . $local_gt;  # was ""
 
 # XML, docbook format
 my %highlights_xml = ( "([^=])\\\"([^\\\"<]+)\\\"", "\$1\$2",
@@ -391,17 +394,19 @@ sub output_highlight {
 #  confess "output_highlight got called with no args?\n";
 #   }
 
+if ($output_mode eq "html") {
+   $contents = local_unescape($contents);
+   # convert data read & converted thru xml_escape() into  format:
+   $contents =~ s/\\/&/g;
+}
 #   print STDERR "contents b4:$contents\n";
 eval $dohighlight;
 die $@ if $@;
-if ($output_mode eq "html") {
-   $contents =~ s///;
-}
 #   print STDERR "contents af:$contents\n";
 
 foreach $line (split "\n", $contents) {
if ($line eq ""){
-   print $lineprefix, $blankline;
+   print $lineprefix, local_unescape($blankline);
} else {
$line =~ s/\\/\&/g;
if ($output_mode eq "man" && substr($line, 0, 1) eq ".") {
@@ -1752,7 +1757,13 @@ sub process_state3_type($$) {
 }
 }
 
-# replace <, >, and &
+# xml_escape: replace <, >, and & in the text stream;
+#
+# however, formatting controls that are generated internally/locally in the
+# kernel-doc script are not escaped here; instead, they begin life like
+# $blankline_html (4 of '\' followed by a mnemonic + ':'), then these strings
+# are converted to their mnemonic-expected output, without the 4 * '\' & ':',
+# just before actual output; (this is done by local_unescape())
 sub xml_escape($) {
my $text = shift;
if (($output_mode eq "text") || ($output_mode eq "man")) {
@@ -1764,6 +1775,18 @@ sub xml_escape($) {
return $text;
 }
 
+# convert local escape strings to html
+# local escape strings look like:  'menmonic:' (that's 4 backslashes)
+sub local_unescape($) {
+   my $text = shift;
+   if (($output_mode eq "text") || ($output_mode eq "man")) {
+   return $text;
+   }
+   $text =~ s/lt://g;
+   return $text;
+}
+
 sub process_file($) {
 my $file;
 my $identifier;
@@ -1903,7 +1926,7 @@ sub process_file($) {
} elsif ($state == 4) {
# Documentation block
if (/$doc_block/) {
-   dump_section($section, $contents);
+   dump_section($section, xml_escape($contents));
output_intro({'sectionlist' => \@sectionlist,
  'sections' => \%sections });
$contents = "";
@@ -1923,7 +1946,7 @@ sub process_file($) {
}
elsif (/$doc_end/)
{
-   dump_section($section, $contents);
+   dump_section($section, xml_escape($contents));

Re: [PATCH] Patch pvr2 driver to allow development of maple bus driver

2007-08-31 Thread Mike Frysinger

On Friday 31 August 2007, Adrian McMenamin wrote:
> On 31/08/2007, Mike Frysinger <[EMAIL PROTECTED]> wrote:
> > On 8/31/07, Adrian McMenamin <[EMAIL PROTECTED]> wrote:
> > > This patch makes the PVR2 VBLANK interrupt on the SEGA Dreamcast
> > > shareable - a small but necessary change to enable ongoing efforts to
> > > develop a driver for the maple bus on the Dreamcast. (Maple is Sega's
> > > proprietary serial interface for the Dreamcast and can be set to
> > > synchronise dma transfers to the VBLANK).
> > >
> > > This has no impact on the performance of the PVR2.
> >
> > sharable implies the interrupt handler checks to see if it actually
> > caused the interrupt ... which it doesnt at the moment ... presumably,
> > you're making it shared because another device will be using that
> > interrupt as well ... so when that other device gets an interrupt, how
> > do you know it's for that device and not PVR2 ?
>
> If the interrupt occurs then it will be for both of them. The hardware
> cannot be removed and the maple bus driver is set for hardware sync.
>
> The question seems redundant to me.

i really dont know how the maple bus works or what piece of hardware is wired 
up to the same interrupt line.  my point is that if the other device fires an 
interrupt, the pvr interrupt handler may be executed and attempt to do work 
when in reality the pvr was not the source of the interrupt.
-mike
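
The concern can be sketched in plain C as a userspace model of a shared interrupt line. On a shared line every registered handler runs, so each handler is expected to check its own device and report "not mine" when that device did not raise the interrupt. The device structure, the pending flag and the dispatch loop are hypothetical stand-ins, not the pvr2 or maple code; only the idea of an IRQ_NONE / IRQ_HANDLED return convention is borrowed from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return convention modelled on the kernel's irqreturn_t. */
enum model_irqreturn { MODEL_IRQ_NONE = 0, MODEL_IRQ_HANDLED = 1 };

/* Hypothetical device: "pending" models a status bit the handler can read. */
struct model_dev {
	bool pending;
	int work_done;
};

typedef enum model_irqreturn (*model_handler_t)(struct model_dev *dev);

/* Handler fit for a shared line: bail out unless our device interrupted. */
enum model_irqreturn checking_handler(struct model_dev *dev)
{
	if (!dev->pending)
		return MODEL_IRQ_NONE;
	dev->pending = false;
	dev->work_done++;
	return MODEL_IRQ_HANDLED;
}

/* Handler that assumes every interrupt is its own (the case in question). */
enum model_irqreturn unchecked_handler(struct model_dev *dev)
{
	dev->work_done++;
	return MODEL_IRQ_HANDLED;
}

struct model_action {
	model_handler_t handler;
	struct model_dev *dev;
};

/* Shared-line dispatch: every registered handler is called per interrupt. */
static int model_dispatch(struct model_action *actions, size_t n)
{
	int handled = 0;

	for (size_t i = 0; i < n; i++)
		handled += (actions[i].handler(actions[i].dev) == MODEL_IRQ_HANDLED);
	return handled;
}

/*
 * Fire one interrupt raised only by the second device and return the
 * amount of work the first device's handler did anyway.
 */
int model_spurious_work(model_handler_t first_handler)
{
	struct model_dev a = { false, 0 }, b = { true, 0 };
	struct model_action actions[] = {
		{ first_handler, &a },
		{ checking_handler, &b },
	};

	model_dispatch(actions, 2);
	return a.work_done;
}
```

If, as argued earlier in the thread, both devices always interrupt together on this hardware, the two handlers behave the same; the difference only shows when the other device fires alone.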



Re: [PATCH 3/6] x86: Convert cpu_sibling_map to be a per cpu variable (v2)

2007-08-31 Thread Andrew Morton

On Fri, 24 Aug 2007 15:26:57 -0700 [EMAIL PROTECTED] wrote:

> Convert cpu_sibling_map from a static array sized by NR_CPUS to a
> per_cpu variable.  This saves sizeof(cpumask_t) * NR unused cpus.
> Access is mostly from startup and CPU HOTPLUG functions.

ia64 allmodconfig:

kernel/sched.c: In function `cpu_to_phys_group':
kernel/sched.c:5937: error: `per_cpu__cpu_sibling_map' undeclared (first use in this function)
kernel/sched.c:5937: error: (Each undeclared identifier is reported only once
kernel/sched.c:5937: error: for each function it appears in.)
kernel/sched.c:5937: warning: type defaults to `int' in declaration of `type name'
kernel/sched.c:5937: error: invalid type argument of `unary *'
kernel/sched.c: In function `build_sched_domains':
kernel/sched.c:6172: error: `per_cpu__cpu_sibling_map' undeclared (first use in this function)
kernel/sched.c:6172: warning: type defaults to `int' in declaration of `type name'
kernel/sched.c:6172: error: invalid type argument of `unary *'
kernel/sched.c:6183: warning: type defaults to `int' in declaration of `type name'
kernel/sched.c:6183: error: invalid type argument of `unary *'
 

[RESEND][PATCH] dir_index: error out instead of BUG on corrupt hash dir limit

2007-08-31 Thread Eric Sandeen

(resend, this one got lost?  Got an acked-by from Andreas
last go-round)

(Andrew, Ted, should I be splitting out ext3 and ext4 patches and
sending separately...?)

Thanks,
-Eric

--

A corrupt ondisk hash dir limit will trip an assert in dx_probe,
which calls BUG().  Instead, we can just issue the warning and
fail dx_probe like the other 3 tests just before it.  Thanks
to aviro for suggesting this...  Tested with a hand-crafted
corrupt ext3 image, issues:

EXT3-fs warning (device loop0): dx_probe: Corrupt limit in dir inode 14337

vs. previous:

Assertion failure in dx_probe() at fs/ext3/namei.c:383: "dx_get_limit(entries) == dx_root_limit(dir, root->info.info_length)"
[ cut here ]
kernel BUG at fs/ext3/namei.c:383!
...
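
The change in behaviour can be sketched outside the kernel: a consistency check on data read from disk reports an error to the caller instead of asserting. The structure layout, the error value and the function names below are hypothetical stand-ins, not the ext3 definitions.

```c
#include <assert.h>
#include <stdio.h>

#define MODEL_ERR_BAD_DIR (-1)	/* placeholder error code for the sketch */

struct model_dir_root {
	unsigned int stored_limit;	/* limit field as found on disk */
	unsigned int expected_limit;	/* limit computed by the reader */
};

/*
 * Instead of assert(stored == expected), which would abort on a corrupt
 * image, warn, set *err and let the caller fail the lookup.
 */
static int probe_with_error(const struct model_dir_root *root, int *err)
{
	if (root->stored_limit != root->expected_limit) {
		fprintf(stderr, "warning: corrupt limit in dir (%u != %u)\n",
			root->stored_limit, root->expected_limit);
		*err = MODEL_ERR_BAD_DIR;
		return -1;
	}
	*err = 0;
	return 0;
}

/* Helper for trying the sketch: returns the error code for a given pair. */
int model_probe_result(unsigned int stored, unsigned int expected)
{
	struct model_dir_root root = { stored, expected };
	int err;

	probe_with_error(&root, &err);
	return err;
}
```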


Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>

Index: linux-2.6.22-rc4/fs/ext3/namei.c
===
--- linux-2.6.22-rc4.orig/fs/ext3/namei.c
+++ linux-2.6.22-rc4/fs/ext3/namei.c
@@ -379,8 +379,16 @@ dx_probe(struct dentry *dentry, struct i
 
entries = (struct dx_entry *) (((char *)&root->info) +
   root->info.info_length);
-   assert(dx_get_limit(entries) == dx_root_limit(dir,
- root->info.info_length));
+
+   if (dx_get_limit(entries) != dx_root_limit(dir,
+  root->info.info_length)) {
+   ext3_warning(dir->i_sb, __FUNCTION__,
+"Corrupt limit in dir inode %ld\n", dir->i_ino);
+   brelse(bh);
+   *err = ERR_BAD_DX_DIR;
+   goto fail;
+   }
+
dxtrace (printk("Look up %x", hash));
while (1)
{
Index: linux-2.6.22-rc4/fs/ext4/namei.c
===
--- linux-2.6.22-rc4.orig/fs/ext4/namei.c
+++ linux-2.6.22-rc4/fs/ext4/namei.c
@@ -379,8 +379,16 @@ dx_probe(struct dentry *dentry, struct i
 
entries = (struct dx_entry *) (((char *)&root->info) +
   root->info.info_length);
-   assert(dx_get_limit(entries) == dx_root_limit(dir,
- root->info.info_length));
+
+   if (dx_get_limit(entries) != dx_root_limit(dir,
+  root->info.info_length)) {
+   ext4_warning(dir->i_sb, __FUNCTION__,
+"Corrupt limit in dir inode %ld\n", dir->i_ino);
+   brelse(bh);
+   *err = ERR_BAD_DX_DIR;
+   goto fail;
+   }
+
dxtrace (printk("Look up %x", hash));
while (1)
{

-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 14/26] SLUB: __GFP_MOVABLE and SLAB_TEMPORARY support

2007-08-31 Thread Christoph Lameter

On Sat, 1 Sep 2007, KAMEZAWA Hiroyuki wrote:

> On Fri, 31 Aug 2007 18:41:21 -0700
> Christoph Lameter <[EMAIL PROTECTED]> wrote:
> 
> > +#ifndef CONFIG_HIGHMEM
> > +   if (s->kick || s->flags & SLAB_TEMPORARY)
> > +   flags |= __GFP_MOVABLE;
> > +#endif
> > +
> 
> Should I do this as
> 
> #if !defined(CONFIG_HIGHMEM) && !defined(CONFIG_MEMORY_HOTREMOVE)

Hmmm Not sure... I think the use of __GFP_MOVABLE the way it is up 
there will change as soon as Mel's antifrag patchset is merged.


Re: nmi_watchdog=2 regression in 2.6.21

2007-08-31 Thread Daniel Walker

On Sat, 2007-09-01 at 03:00 +0200, Björn Steinbrink wrote:
> On 2007.08.31 17:24:46 -0700, Daniel Walker wrote:
> > On Fri, 2007-08-31 at 20:06 +0200, Björn Steinbrink wrote:
> > 
> > 
> > > > something to do with the nmi hertz adjustment that happens after
> > > > check_nmi_watchdog() ..
> > > 
> > > Hm hm, does the same thing (watchdog stuck after check) happen with
> > > older kernels, ie. those before Stephane's changeset that made it use
> > > PERFCTR1?
> > 
> > I noticed the frequency gets turned down after check_nmi_watchdog() is
> > called.. I think it's supposed to trigger once per second, but it's more
> > like it updates randomly ..
> 
> It's once per second if the cpu is 100% busy, if it's just idling and
> halted, the performance counters won't be increased.

Didn't know that .. I ran hackbench while watching /proc/interrupts ,
and it ticks along ok on some cores .. 

The acid test was running an application that hangs the system, and it
caught it (although the system didn't recover from the lockup..) ..

> > In older kernels it's very slow, but it's more consistent ..
> 
> With the same load on the box? Maybe some other changes caused the box
> to behave differently (say, CFS), regarding eg. load distribution
> amongst the cores.

It must not have been the same load considering everything else.

I'm satisfied that Stephane's last patch fixes it ..

Daniel


Re: Copy large memory regions from & to userspace

2007-08-31 Thread Robert Hancock


Clemens Kolbitsch wrote:
> On Friday 31 August 2007 15:25:40 you wrote:
> > On 8/30/07, Clemens Kolbitsch <[EMAIL PROTECTED]> wrote:
> > > Hi!
> > > Just a short question: What is the correct method of copying large areas
> > > of memory from userspace into userspace when running in kernel-mode?
> >
> > relayfs?
>
> no... I'm copying user-memory to user-memory, not kernel-to-user, however
> running the code in kernel-mode.
>
> what i wanted to know is how to check the access-rights...
> i didn't get any other answers, so for now i'm just using
>
> if (access_ok(VERIFY_READ, from, PAGE_SIZE) &&
>     access_ok(VERIFY_WRITE, to, PAGE_SIZE))
> {
>     memcpy(to, from, PAGE_SIZE);
> }
>
> and hope that this is the *correct* way to do it...


No, it's not. access_ok does not guarantee that the memory region can be 
validly read or written. It only allows using __copy_to_user or 
__copy_from_user which skips the same checks that access_ok does.


I'm not aware of any code in the kernel that does userspace-to-userspace 
copies directly. Likely because there's rarely a need for it?
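
A userspace model of the distinction drawn above: a range check in the style of access_ok() says nothing about whether the pages are actually mapped, while the copy helpers report how many bytes could not be copied. One way to do a user-to-user copy under that constraint is to bounce the data through a kernel-side buffer using copy_from_user() and copy_to_user(); that approach is an assumption of this sketch, not something proposed in the thread. All model_ names and the fake address space are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 16u
#define MODEL_NPAGES    4u
#define MODEL_USER_TOP  (MODEL_PAGE_SIZE * MODEL_NPAGES)

/* Fake "user address space": addresses are offsets, page 2 is unmapped. */
static unsigned char model_mem[MODEL_USER_TOP];
static bool model_present[MODEL_NPAGES] = { true, true, false, true };

/* Like access_ok(): only checks that the range lies inside user space. */
bool model_access_ok(size_t addr, size_t len)
{
	return len <= MODEL_USER_TOP && addr <= MODEL_USER_TOP - len;
}

/* Like copy_from_user(): returns the number of bytes NOT copied. */
static size_t model_copy_from_user(void *dst, size_t src, size_t len)
{
	size_t i;

	if (!model_access_ok(src, len))
		return len;
	for (i = 0; i < len; i++) {
		if (!model_present[(src + i) / MODEL_PAGE_SIZE])
			break;			/* simulated page fault */
		((unsigned char *)dst)[i] = model_mem[src + i];
	}
	return len - i;
}

/* Like copy_to_user(): returns the number of bytes NOT copied. */
static size_t model_copy_to_user(size_t dst, const void *src, size_t len)
{
	size_t i;

	if (!model_access_ok(dst, len))
		return len;
	for (i = 0; i < len; i++) {
		if (!model_present[(dst + i) / MODEL_PAGE_SIZE])
			break;			/* simulated page fault */
		model_mem[dst + i] = ((const unsigned char *)src)[i];
	}
	return len - i;
}

/* Copy one page user-to-user via a bounce buffer; 0 on success, -1 on fault. */
int model_user_to_user(size_t to, size_t from)
{
	unsigned char bounce[MODEL_PAGE_SIZE];

	if (model_copy_from_user(bounce, from, sizeof(bounce)))
		return -1;
	if (model_copy_to_user(to, bounce, sizeof(bounce)))
		return -1;
	return 0;
}
```

In the model, the range starting at offset 32 passes the access_ok-style check even though its page is not present, which is the case a bare memcpy() would not survive.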


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/


[RFC 26/26] SLUB: Add debugging for slab defrag

2007-08-31 Thread Christoph Lameter

Add some debugging printks for slab defragmentation

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 mm/slub.c |   13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c2007-08-28 20:11:34.0 -0700
+++ linux-2.6/mm/slub.c 2007-08-28 20:21:39.0 -0700
@@ -2697,8 +2697,10 @@ int kmem_cache_isolate_slab(struct page 
 * This is necessary to make sure that the page does not vanish
 * from under us before we are able to check the result.
 */
-   if (!get_page_unless_zero(page))
+   if (!get_page_unless_zero(page)) {
+   printk(KERN_ERR "isolate %p zero ref\n", page);
return rc;
+   }
 
local_irq_save(flags);
slab_lock(page);
@@ -2712,6 +2714,8 @@ int kmem_cache_isolate_slab(struct page 
if (!PageSlab(page) || SlabFrozen(page) || !page->inuse) {
slab_unlock(page);
put_page(page);
+   printk(KERN_ERR "isolate faillock %p flags=%lx %s\n",
+   page, page->flags, PageSlab(page)?page->slab->name:"--");
goto out;
}
 
@@ -2739,6 +2743,7 @@ int kmem_cache_isolate_slab(struct page 
SetSlabFrozen(page);
slab_unlock(page);
rc = 0;
+   printk(KERN_ERR "Isolated %s slab=%p objects=%d\n", s->name, page, page->inuse);
 out:
local_irq_restore(flags);
return rc;
@@ -2809,6 +2814,8 @@ static int kmem_cache_vacate(struct page
 */
if (page->inuse == objects)
ClearSlabReclaimable(page);
+   printk(KERN_ERR "Finish vacate %s slab=%p objects=%d->%d\n",
+   s->name, page, objects, page->inuse);
 out:
leftover = page->inuse;
unfreeze_slab(s, page, tail);
@@ -2826,6 +2833,7 @@ int kmem_cache_reclaim(struct list_head 
void **scratch;
struct page *page;
struct page *page2;
+   int pages = 0;
 
if (list_empty(zaplist))
return 0;
@@ -2836,10 +2844,13 @@ int kmem_cache_reclaim(struct list_head 
 
list_for_each_entry_safe(page, page2, zaplist, lru) {
list_del(&page->lru);
+   pages++;
if (kmem_cache_vacate(page, scratch) == 0)
freed++;
}
kfree(scratch);
+   printk(KERN_ERR "kmem_cache_reclaim recovered %d of %d slabs.\n",
+   freed, pages);
return freed;
 }
 

-- 

Re: maturity and status and attributes, oh my!

2007-08-31 Thread Robert P. J. Day

On Fri, 31 Aug 2007, Mitchell Erblich wrote:

> "Robert P. J. Day" wrote:
> >
> >   at the risk of driving everyone here totally bonkers, i'm going to
> > take one last shot at explaining what i was thinking of when i first
> > proposed this whole "maturity level" thing.  and, just so you know,
> > the major reason i'm so cranked up about this is that i'm feeling just
> > a little territorial -- i was the one who first started nagging people
> > to consider this idea, so i'm a little edgy when i see folks finally
> > giving it some serious thought but appearing to get ready to implement
> > it entirely incorrectly in a way that's going to ruin it irreparably
> > and make it utterly useless.
> >
> >   this isn't just about defining a single feature called "maturity".
> > it's about defining a general mechanism so that you can add entirely
> > new (what i call) "attributes" to kernel features.  one attribute
> > could be "maturity", which could take one of a number of possible
> > values.  another could be "status", with the same restrictions.
> > heck, you could define the attribute "colour", and decide that various
> > kernel features could be labelled as (at most) one of "red", "green"
> > and "chartreuse."  that's what i mean by an "attribute", and
> > attributes would have two critical and non-negotiable properties:
> <<< snip
> >
> >   but i hope i've flogged this thoroughly to the point where people
> > can see what i'm driving at.  once you see (as in simon's patch) how
> > to add the first attribute, it's trivial to simply duplicate that code
> > to add as many more as you want.
> >
> > rday
> >
> > --
> > 
> > Robert P. J. Day
> > Linux Consulting, Training and Annoying Kernel Pedantry
> > Waterloo, Ontario, CANADA
> >
> > http://crashcourse.ca
> > 
> Robert Day,
>
> If I can interpret what you are asking about and changing it a bit.
>
> Don't you think that Maturity can be defined ALSO, as the
>number of known bugs and their priority / severity against an
>architecture dependent or independent item?
>
>Would this suffice and wouldn't it be easier to maintain?
>
>Mitchell Erblich

perhaps.  all i'm begging for is that these attributes be defined
cleanly and clearly, and following those two conditions i suggested
earlier:

1) all attributes are orthogonal to one another, and
2) values within an attribute are mutually exclusive

if you violate either of those conditions, i think you're going to
find it very difficult to design sane Kconfig entries around them.
if you don't believe me, give it a try.
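
The two conditions map naturally onto a data layout, sketched here in C: each attribute is its own field (so attributes stay orthogonal), and each field is an enum holding exactly one value (so values within an attribute are mutually exclusive). The attribute names "maturity" and "status" come from the message quoted above; the specific enum values and the feature table are made up for illustration and are not proposed Kconfig syntax.

```c
#include <assert.h>
#include <stddef.h>

/* One enum per attribute: a field can hold only one of its values. */
enum maturity { MATURITY_UNSET, MATURITY_EXPERIMENTAL, MATURITY_STABLE };
enum status   { STATUS_UNSET, STATUS_MAINTAINED, STATUS_DEPRECATED, STATUS_OBSOLETE };

/* One field per attribute: setting one never constrains the other. */
struct feature_attrs {
	const char *name;
	enum maturity maturity;
	enum status status;
};

/* Hypothetical features, chosen only to exercise the filter below. */
static const struct feature_attrs features[] = {
	{ "FEATURE_A", MATURITY_STABLE,       STATUS_DEPRECATED },
	{ "FEATURE_B", MATURITY_EXPERIMENTAL, STATUS_MAINTAINED },
	{ "FEATURE_C", MATURITY_UNSET,        STATUS_OBSOLETE   },
};

/* Count features matching a filter; a *_UNSET filter means "don't care". */
int count_matching(enum maturity m, enum status s)
{
	int n = 0;

	for (size_t i = 0; i < sizeof(features) / sizeof(features[0]); i++) {
		if (m != MATURITY_UNSET && features[i].maturity != m)
			continue;
		if (s != STATUS_UNSET && features[i].status != s)
			continue;
		n++;
	}
	return n;
}
```

Because the fields are independent, a filter on one attribute composes with a filter on the other without special cases, which is the property that overlapping or non-exclusive values would break.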

rday

p.s.  rather than saying that maturity can be defined "also" in
another way, you should say that it can be defined "instead" in
another way.  i don't want to think what it might mean to define
something like maturity in two different ways simultaneously, if
that's what you were suggesting.  that just makes my brain hurt.

-- 

Robert P. J. Day
Linux Consulting, Training and Annoying Kernel Pedantry
Waterloo, Ontario, CANADA

http://crashcourse.ca


[RFC 18/26] FS: ExtX filesystem defrag

2007-08-31 Thread Christoph Lameter

Support defragmentation for extX filesystem inodes

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 fs/ext2/super.c |9 +
 fs/ext3/super.c |8 
 fs/ext4/super.c |8 
 3 files changed, 25 insertions(+)

Index: linux-2.6/fs/ext2/super.c
===
--- linux-2.6.orig/fs/ext2/super.c  2007-08-28 19:48:06.0 -0700
+++ linux-2.6/fs/ext2/super.c   2007-08-28 20:16:05.0 -0700
@@ -168,6 +168,12 @@ static void init_once(void * foo, struct
inode_init_once(&ei->vfs_inode);
 }
 
+static void *ext2_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+   return fs_get_inodes(s, nr, v,
+   offsetof(struct ext2_inode_info, vfs_inode));
+}
+
 static int init_inodecache(void)
 {
ext2_inode_cachep = kmem_cache_create("ext2_inode_cache",
@@ -177,6 +183,9 @@ static int init_inodecache(void)
 init_once);
if (ext2_inode_cachep == NULL)
return -ENOMEM;
+
+   kmem_cache_setup_defrag(ext2_inode_cachep,
+   ext2_get_inodes, kick_inodes);
return 0;
 }
 
Index: linux-2.6/fs/ext3/super.c
===
--- linux-2.6.orig/fs/ext3/super.c  2007-08-28 19:48:06.0 -0700
+++ linux-2.6/fs/ext3/super.c   2007-08-28 20:16:05.0 -0700
@@ -484,6 +484,12 @@ static void init_once(void * foo, struct
inode_init_once(&ei->vfs_inode);
 }
 
+static void *ext3_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+   return fs_get_inodes(s, nr, v,
+   offsetof(struct ext3_inode_info, vfs_inode));
+}
+
 static int init_inodecache(void)
 {
ext3_inode_cachep = kmem_cache_create("ext3_inode_cache",
@@ -493,6 +499,8 @@ static int init_inodecache(void)
 init_once);
if (ext3_inode_cachep == NULL)
return -ENOMEM;
+   kmem_cache_setup_defrag(ext3_inode_cachep,
+   ext3_get_inodes, kick_inodes);
return 0;
 }
 
Index: linux-2.6/fs/ext4/super.c
===
--- linux-2.6.orig/fs/ext4/super.c  2007-08-28 19:48:06.0 -0700
+++ linux-2.6/fs/ext4/super.c   2007-08-28 20:16:05.0 -0700
@@ -535,6 +535,12 @@ static void init_once(void * foo, struct
inode_init_once(&ei->vfs_inode);
 }
 
+static void *ext4_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+   return fs_get_inodes(s, nr, v,
+   offsetof(struct ext4_inode_info, vfs_inode));
+}
+
 static int init_inodecache(void)
 {
ext4_inode_cachep = kmem_cache_create("ext4_inode_cache",
@@ -544,6 +550,8 @@ static int init_inodecache(void)
 init_once);
if (ext4_inode_cachep == NULL)
return -ENOMEM;
+   kmem_cache_setup_defrag(ext4_inode_cachep,
+   ext4_get_inodes, kick_inodes);
return 0;
 }
 


[RFC 19/26] FS: XFS slab defragmentation

2007-08-31 Thread Christoph Lameter

Support inode defragmentation for xfs

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 fs/xfs/linux-2.6/xfs_super.c |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 4528f9a..e60c90e 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -363,6 +363,11 @@ xfs_fs_inode_init_once(
inode_init_once(vn_to_inode((bhv_vnode_t *)vnode));
 }
 
+static void *xfs_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+   return fs_get_inodes(s, nr, v, offsetof(bhv_vnode_t, v_inode));
+}
+
 STATIC int
 xfs_init_zones(void)
 {
@@ -376,6 +381,7 @@ xfs_init_zones(void)
xfs_ioend_zone = kmem_zone_init(sizeof(xfs_ioend_t), "xfs_ioend");
if (!xfs_ioend_zone)
goto out_destroy_vnode_zone;
+   kmem_cache_setup_defrag(xfs_vnode_zone, xfs_get_inodes, kick_inodes);
 
xfs_ioend_pool = mempool_create_slab_pool(4 * MAX_BUF_PER_PAGE,
  xfs_ioend_zone);
-- 
1.5.2.4


[RFC 21/26] FS: Slab defrag: Reiserfs support

2007-08-31 Thread Christoph Lameter

Slab defragmentation: Support reiserfs inode defragmentation

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 fs/reiserfs/super.c |8 
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/fs/reiserfs/super.c b/fs/reiserfs/super.c
index 5b68dd3..0344be9 100644
--- a/fs/reiserfs/super.c
+++ b/fs/reiserfs/super.c
@@ -520,6 +520,12 @@ static void init_once(void *foo, struct kmem_cache * cachep, unsigned long flags
 #endif
 }
 
+static void *reiserfs_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+   return fs_get_inodes(s, nr, v,
+   offsetof(struct reiserfs_inode_info, vfs_inode));
+}
+
 static int init_inodecache(void)
 {
reiserfs_inode_cachep = kmem_cache_create("reiser_inode_cache",
@@ -530,6 +536,8 @@ static int init_inodecache(void)
  init_once);
if (reiserfs_inode_cachep == NULL)
return -ENOMEM;
+   kmem_cache_setup_defrag(reiserfs_inode_cachep,
+   reiserfs_get_inodes, kick_inodes);
return 0;
 }
 
-- 
1.5.2.4


[RFC 17/26] inodes: Support generic defragmentation

2007-08-31 Thread Christoph Lameter

This implements the ability to remove inodes in a particular slab
from the inode cache. In order to remove an inode we may have to write out
the pages of the inode and the inode itself, and remove the dentries
referring to the inode.

Provide generic functionality that can be used by filesystems that have
their own inode caches to also tie into the defragmentation functions
that are made available here.
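
The offset handling used for filesystems that embed struct inode in their
own structure can be sketched in userspace as follows. This is only an
illustration: struct demo_inode, struct demo_fs_inode and demo_to_inodes()
are hypothetical stand-ins, not kernel code, and the sketch assumes each
pointer handed in is the start of the filesystem-private structure. The
patch adds the offset to a void pointer directly (a GNU C extension); the
sketch casts to char * to stay within standard C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct inode and a filesystem's private inode. */
struct demo_inode { int i_state; };
struct demo_fs_inode {
	long private_data;            /* filesystem-specific fields come first */
	struct demo_inode vfs_inode;  /* embedded generic inode */
};

/*
 * Convert an array of pointers to slab objects into pointers to the
 * embedded generic inode by adding the member offset to each entry.
 */
static void demo_to_inodes(int nr, void **v, size_t offset)
{
	int i;

	assert(nr >= 0);
	for (i = 0; i < nr; i++)
		v[i] = (char *)v[i] + offset;
}
```

A caller would pass offsetof(struct demo_fs_inode, vfs_inode) as the offset,
which mirrors what the per-filesystem get_inodes wrappers in this series do.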

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 fs/inode.c |   95 +
 include/linux/fs.h |5 ++
 2 files changed, 100 insertions(+)

Index: linux-2.6/fs/inode.c
===
--- linux-2.6.orig/fs/inode.c   2007-08-28 19:48:07.0 -0700
+++ linux-2.6/fs/inode.c2007-08-28 20:15:26.0 -0700
@@ -1351,6 +1351,100 @@ static int __init set_ihash_entries(char
 }
 __setup("ihash_entries=", set_ihash_entries);
 
+static void *get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+   int i;
+
+   spin_lock(&inode_lock);
+   for (i = 0; i < nr; i++) {
+   struct inode *inode = v[i];
+
+   if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE))
+   v[i] = NULL;
+   else
+   __iget(inode);
+   }
+   spin_unlock(&inode_lock);
+   return NULL;
+}
+
+/*
+ * Function for filesystems that embed struct inode into their own
+ * structures. The offset is the offset of the struct inode in the fs inode.
+ */
+void *fs_get_inodes(struct kmem_cache *s, int nr, void **v,
+   unsigned long offset)
+{
+   int i;
+
+   for (i = 0; i < nr; i++)
+   v[i] += offset;
+
+   return get_inodes(s, nr, v);
+}
+EXPORT_SYMBOL(fs_get_inodes);
+
+void kick_inodes(struct kmem_cache *s, int nr, void **v, void *private)
+{
+   struct inode *inode;
+   int i;
+   int abort = 0;
+   LIST_HEAD(freeable);
+   struct super_block *sb;
+
+   for (i = 0; i < nr; i++) {
+   inode = v[i];
+   if (!inode)
+   continue;
+
+   if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+   if (remove_inode_buffers(inode))
+   invalidate_mapping_pages(&inode->i_data,
+   0, -1);
+   }
+
+   /* Invalidate children and dentry */
+   if (S_ISDIR(inode->i_mode)) {
+   struct dentry *d = d_find_alias(inode);
+
+   if (d) {
+   d_invalidate(d);
+   dput(d);
+   }
+   }
+
+   if (inode->i_state & I_DIRTY)
+   write_inode_now(inode, 1);
+
+   d_prune_aliases(inode);
+   }
+
+   mutex_lock(&iprune_mutex);
+   for (i = 0; i < nr; i++) {
+   inode = v[i];
+   if (!inode)
+   continue;
+
+   sb = inode->i_sb;
+   iput(inode);
+   if (abort || !(sb->s_flags & MS_ACTIVE))
+   continue;
+
+   spin_lock(&inode_lock);
+   abort = !can_unuse(inode);
+
+   if (!abort) {
+   list_move(&inode->i_list, &freeable);
+   inode->i_state |= I_FREEING;
+   inodes_stat.nr_unused--;
+   }
+   spin_unlock(&inode_lock);
+   }
+   dispose_list(&freeable);
+   mutex_unlock(&iprune_mutex);
+}
+EXPORT_SYMBOL(kick_inodes);
+
 /*
  * Initialize the waitqueues and inode hash table.
  */
@@ -1390,6 +1484,7 @@ void __init inode_init(unsigned long mem
 SLAB_MEM_SPREAD),
 init_once);
register_shrinker(&icache_shrinker);
+   kmem_cache_setup_defrag(inode_cachep, get_inodes, kick_inodes);
 
/* Hash may have been set up in inode_init_early */
if (!hashdist)
Index: linux-2.6/include/linux/fs.h
===
--- linux-2.6.orig/include/linux/fs.h   2007-08-28 19:48:07.0 -0700
+++ linux-2.6/include/linux/fs.h2007-08-28 20:15:26.0 -0700
@@ -1644,6 +1644,11 @@ static inline void insert_inode_hash(str
__insert_inode_hash(inode, inode->i_ino);
 }
 
+/* Helper functions for inode defragmentation support in filesystems */
+extern void kick_inodes(struct kmem_cache *, int, void **, void *);
+extern void *fs_get_inodes(struct kmem_cache *, int nr, void **,
+   unsigned long offset);
+
 extern struct file * get_empty_filp(void);
 extern void file_move(struct file *f, struct list_head *list);
 extern void file_kill(struct file *f);


[RFC 14/26] SLUB: __GFP_MOVABLE and SLAB_TEMPORARY support

2007-08-31 Thread Christoph Lameter

Slabs that are reclaimable fit the definition of the objects in
ZONE_MOVABLE, so set __GFP_MOVABLE on them. (This only works on
platforms where there is no HIGHMEM. Hopefully that restriction
will vanish at some point.)

Also add the SLAB_TEMPORARY flag for slab caches that allocate objects with
a short lifetime. Slabs with SLAB_TEMPORARY are also allocated with
__GFP_MOVABLE. Reclaim on them works by isolating the slab for a while and
waiting for the objects to expire.

The skbuff_head_cache is a prime example of such a slab. Add the
SLAB_TEMPORARY flag to it.
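
The flag decision described above can be sketched in userspace roughly like
this. The names demo_cache, demo_slab_gfp() and the DEMO_* values are
made up for illustration and do not match the real kernel constants; the
highmem parameter stands in for the CONFIG_HIGHMEM compile-time check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values only; the real ones live in kernel headers. */
#define DEMO_SLAB_TEMPORARY 0x1u
#define DEMO_GFP_MOVABLE    0x2u

struct demo_cache {
	unsigned int flags;
	bool has_kick;   /* true if a kick() reclaim method is registered */
};

/* Return the allocation flags to use for a new slab page of this cache. */
static unsigned int demo_slab_gfp(const struct demo_cache *s,
				  unsigned int gfp, bool highmem)
{
	assert(s != NULL);
	/* Movable only without HIGHMEM, and only if the slab can be emptied. */
	if (!highmem && (s->has_kick || (s->flags & DEMO_SLAB_TEMPORARY)))
		gfp |= DEMO_GFP_MOVABLE;
	return gfp;
}
```

A cache with neither a kick method nor the temporary flag keeps its
allocation flags unchanged, as does any cache on a HIGHMEM configuration.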

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 include/linux/slab.h |1 +
 mm/slub.c|8 +++-
 net/core/skbuff.c|2 +-
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 2923861..daffc22 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -23,6 +23,7 @@
 #define SLAB_POISON            0x00000800UL    /* DEBUG: Poison objects */
 #define SLAB_HWCACHE_ALIGN     0x00002000UL    /* Align objs on cache lines */
 #define SLAB_CACHE_DMA         0x00004000UL    /* Use GFP_DMA memory */
+#define SLAB_TEMPORARY         0x00008000UL    /* Only volatile objects */
 #define SLAB_STORE_USER        0x00010000UL    /* DEBUG: Store the last owner for bug hunting */
 #define SLAB_RECLAIM_ACCOUNT   0x00020000UL    /* Objects are reclaimable */
 #define SLAB_PANIC             0x00040000UL    /* Panic if kmem_cache_create() fails */
diff --git a/mm/slub.c b/mm/slub.c
index bad5291..85ba259 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1040,6 +1040,11 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
if (s->flags & SLAB_CACHE_DMA)
flags |= SLUB_DMA;
 
+#ifndef CONFIG_HIGHMEM
+   if (s->kick || s->flags & SLAB_TEMPORARY)
+   flags |= __GFP_MOVABLE;
+#endif
+
if (node == -1)
page = alloc_pages(flags, s->order);
else
@@ -1118,7 +1123,8 @@ static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
SLAB_STORE_USER | SLAB_TRACE))
SetSlabDebug(page);
-   if (s->kick)
+
+   if (s->flags & SLAB_TEMPORARY || s->kick)
SetSlabReclaimable(page);
 
  out:
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 35021eb..51b2236 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2020,7 +2020,7 @@ void __init skb_init(void)
skbuff_head_cache = kmem_cache_create("skbuff_head_cache",
  sizeof(struct sk_buff),
  0,
- SLAB_HWCACHE_ALIGN|SLAB_PANIC,
+ SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TEMPORARY,
  NULL);
skbuff_fclone_cache = kmem_cache_create("skbuff_fclone_cache",
(2*sizeof(struct sk_buff)) +
-- 
1.5.2.4


[RFC 24/26] dentries: Add constructor

2007-08-31 Thread Christoph Lameter

In order to support defragmentation on the dentry cache we need to have
a determined object state at all times. Without a constructor the object
would have a random state after allocation.

So provide a constructor.
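
Why a constructor gives a determined state can be sketched in userspace as
below. Everything here (demo_obj, demo_pool, demo_ctor() and friends) is a
hypothetical model, not the slab allocator: the point is only that fields
set up once by the constructor are valid on every object in the pool,
whether or not it is currently allocated, so a scanner can inspect them.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_POOL_SIZE 4

struct demo_obj {
	int lock_ready;   /* stands in for spin_lock_init() having run */
	void *inode;      /* stands in for d_inode; NULL when unused */
	int in_use;
};

struct demo_pool {
	struct demo_obj objs[DEMO_POOL_SIZE];
};

/* Constructor: runs once per object when its backing memory is set up. */
static void demo_ctor(struct demo_obj *o)
{
	o->lock_ready = 1;
	o->inode = NULL;
	o->in_use = 0;
}

static void demo_pool_init(struct demo_pool *p)
{
	size_t i;

	assert(p != NULL);
	for (i = 0; i < DEMO_POOL_SIZE; i++)
		demo_ctor(&p->objs[i]);
}

/* Count objects that are in a defined state and not handed out. */
static int demo_count_defined(const struct demo_pool *p)
{
	size_t i;
	int n = 0;

	for (i = 0; i < DEMO_POOL_SIZE; i++)
		if (p->objs[i].lock_ready && !p->objs[i].in_use)
			n++;
	return n;
}
```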

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 fs/dcache.c |   26 ++
 1 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 71e4877..282a467 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -874,6 +874,16 @@ static struct shrinker dcache_shrinker = {
.seeks = DEFAULT_SEEKS,
 };
 
+void dcache_ctor(void *p, struct kmem_cache *s, unsigned long flags)
+{
+   struct dentry *dentry = p;
+
+   spin_lock_init(&dentry->d_lock);
+   dentry->d_inode = NULL;
+   INIT_LIST_HEAD(&dentry->d_lru);
+   INIT_LIST_HEAD(&dentry->d_alias);
+}
+
 /**
  * d_alloc -   allocate a dcache entry
  * @parent: parent of entry to allocate
@@ -911,8 +921,6 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
 
atomic_set(&dentry->d_count, 1);
dentry->d_flags = DCACHE_UNHASHED;
-   spin_lock_init(&dentry->d_lock);
-   dentry->d_inode = NULL;
dentry->d_parent = NULL;
dentry->d_sb = NULL;
dentry->d_op = NULL;
@@ -922,9 +930,7 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
dentry->d_cookie = NULL;
 #endif
INIT_HLIST_NODE(&dentry->d_hash);
-   INIT_LIST_HEAD(&dentry->d_lru);
INIT_LIST_HEAD(&dentry->d_subdirs);
-   INIT_LIST_HEAD(&dentry->d_alias);
 
if (parent) {
dentry->d_parent = dget(parent);
@@ -2098,14 +2104,10 @@ static void __init dcache_init(unsigned long mempages)
 {
int loop;
 
-   /* 
-* A constructor could be added for stable state like the lists,
-* but it is probably not worth it because of the cache nature
-* of the dcache. 
-*/
-   dentry_cache = KMEM_CACHE(dentry,
-   SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
-   
+   dentry_cache = kmem_cache_create("dentry_cache", sizeof(struct dentry),
+   0, SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD,
+   dcache_ctor);
+
register_shrinker(&dcache_shrinker);
 
/* Hash may have been set up in dcache_init_early */
-- 
1.5.2.4


[RFC 16/26] Buffer heads: Support slab defrag

2007-08-31 Thread Christoph Lameter

Defragmentation support for buffer heads. We convert the references to
buffers to struct page references and try to remove the buffers from
those pages. If the pages are dirty then trigger writeout so that the
buffer heads can be removed later.
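
The conversion from buffer references to distinct page references can be
sketched in userspace as follows. demo_page, demo_bh and
demo_buffers_to_pages() are hypothetical names; the sketch leaves out page
reference counting and the PagePrivate check and only shows how several
buffers on the same page collapse into one entry in the vector.

```c
#include <assert.h>
#include <stddef.h>

struct demo_page { int id; };
struct demo_bh { struct demo_page *page; };

/*
 * Replace v[0..nr) (buffer pointers) with the distinct pages they refer
 * to, packed at the front of v. Remaining slots are set to NULL.
 * Returns the number of distinct pages.
 */
static int demo_buffers_to_pages(int nr, void **v)
{
	int i, j, n = 0;

	assert(nr >= 0);
	for (i = 0; i < nr; i++) {
		struct demo_bh *bh = v[i];
		struct demo_page *page = bh ? bh->page : NULL;

		v[i] = NULL;
		if (!page)
			continue;
		/* Skip pages that are already in the output part of v. */
		for (j = 0; j < n; j++)
			if (v[j] == page)
				break;
		if (j == n)
			v[n++] = page;
	}
	return n;
}
```

Because n never exceeds i, writing the output into the front of the same
array does not overwrite buffer pointers that have not been read yet.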

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 fs/buffer.c |  101 
 1 file changed, 101 insertions(+)

Index: linux-2.6/fs/buffer.c
===
--- linux-2.6.orig/fs/buffer.c  2007-08-28 20:13:08.0 -0700
+++ linux-2.6/fs/buffer.c   2007-08-28 20:14:30.0 -0700
@@ -3011,6 +3011,106 @@ init_buffer_head(void *data, struct kmem
INIT_LIST_HEAD(&bh->b_assoc_buffers);
 }
 
+/*
+ * Writeback a page to clean the dirty state
+ */
+static void trigger_write(struct page *page)
+{
+   struct address_space *mapping = page_mapping(page);
+   int rc;
+   struct writeback_control wbc = {
+   .sync_mode = WB_SYNC_NONE,
+   .nr_to_write = 1,
+   .range_start = 0,
+   .range_end = LLONG_MAX,
+   .nonblocking = 1,
+   .for_reclaim = 0
+   };
+
+   if (!mapping->a_ops->writepage)
+   /* No write method for the address space */
+   return;
+
+   if (!clear_page_dirty_for_io(page))
+   /* Someone else already triggered a write */
+   return;
+
+   rc = mapping->a_ops->writepage(page, &wbc);
+   if (rc < 0)
+   /* I/O Error writing */
+   return;
+
+   if (rc == AOP_WRITEPAGE_ACTIVATE)
+   unlock_page(page);
+}
+
+/*
+ * Get references on buffers.
+ *
+ * We obtain references on the page that uses the buffer. v[i] will point to
+ * the corresponding page after get_buffers() is through.
+ *
+ * We are safe from the underlying page being removed simply by doing
+ * a get_page_unless_zero. The buffer head removal may race at will.
+ * try_to_free_buffers will later take appropriate locks to remove the
+ * buffers if they are still there.
+ */
+static void *get_buffers(struct kmem_cache *s, int nr, void **v)
+{
+   struct page *page;
+   struct buffer_head *bh;
+   int i,j;
+   int n = 0;
+
+   for (i = 0; i < nr; i++) {
+   bh = v[i];
+   v[i] = NULL;
+
+   page = bh->b_page;
+
+   if (page && PagePrivate(page)) {
+   for (j = 0; j < n; j++)
+   if (page == v[j])
+   goto cont;
+   }
+
+   if (get_page_unless_zero(page))
+   v[n++] = page;
+cont:  ;
+   }
+   return NULL;
+}
+
+/*
+ * Despite its name, kick_buffers operates on a list of pointers to
+ * page structs that was set up by get_buffers
+ */
+static void kick_buffers(struct kmem_cache *s, int nr, void **v,
+   void *private)
+{
+   struct page *page;
+   int i;
+
+   for (i = 0; i < nr; i++) {
+   page = v[i];
+
+   if (!page || PageWriteback(page))
+   continue;
+
+
+   if (!TestSetPageLocked(page)) {
+   if (PageDirty(page))
+   trigger_write(page);
+   else {
+   if (PagePrivate(page))
+   try_to_free_buffers(page);
+   unlock_page(page);
+   }
+   }
+   put_page(page);
+   }
+}
+
 void __init buffer_init(void)
 {
int nrpages;
@@ -3020,6 +3120,7 @@ void __init buffer_init(void)
(SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
SLAB_MEM_SPREAD),
init_buffer_head);
+   kmem_cache_setup_defrag(bh_cachep, get_buffers, kick_buffers);
 
/*
 * Limit the bh occupancy to 10% of ZONE_NORMAL


[RFC 22/26] FS: Socket inode defragmentation

2007-08-31 Thread Christoph Lameter

Support inode defragmentation for sockets

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 net/socket.c |8 
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/net/socket.c b/net/socket.c
index ec07703..89fc7a5 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -264,6 +264,12 @@ static void init_once(void *foo, struct kmem_cache *cachep, unsigned long flags)
inode_init_once(&ei->vfs_inode);
 }
 
+static void *sock_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+   return fs_get_inodes(s, nr, v,
+   offsetof(struct socket_alloc, vfs_inode));
+}
+
 static int init_inodecache(void)
 {
sock_inode_cachep = kmem_cache_create("sock_inode_cache",
@@ -275,6 +281,8 @@ static int init_inodecache(void)
  init_once);
if (sock_inode_cachep == NULL)
return -ENOMEM;
+   kmem_cache_setup_defrag(sock_inode_cachep,
+   sock_get_inodes, kick_inodes);
return 0;
 }
 
-- 
1.5.2.4


[RFC 23/26] dentries: Extract common code to remove dentry from lru

2007-08-31 Thread Christoph Lameter

Extract the common code to remove a dentry from the lru into a new function
dentry_lru_remove().

Two call sites used list_del() instead of list_del_init(). AFAIK the
performance of both is the same. dentry_lru_remove() does a list_del_init().

As a result dentry->d_lru is now always empty when a dentry is freed.
A consistent state is useful to establish dentry state from slab defrag.
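
The difference that list_del_init() makes can be sketched with a minimal
userspace list. The demo_* names are hypothetical and only model the
behaviour of the kernel list helpers: after removal the entry points back
at itself, so an emptiness check on it stays meaningful and a second
removal is a harmless no-op.

```c
#include <assert.h>
#include <stddef.h>

struct demo_list { struct demo_list *next, *prev; };

static void demo_list_init(struct demo_list *l) { l->next = l->prev = l; }
static int demo_list_empty(const struct demo_list *l) { return l->next == l; }

static void demo_list_add(struct demo_list *entry, struct demo_list *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink and reinitialize, so the entry reads as "not on any list". */
static void demo_list_del_init(struct demo_list *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	demo_list_init(entry);
}

struct demo_dentry { struct demo_list lru; };
static int demo_nr_unused;

/* Safe to call whether or not the dentry is currently on the LRU. */
static void demo_lru_remove(struct demo_dentry *d)
{
	assert(d != NULL);
	if (!demo_list_empty(&d->lru)) {
		demo_list_del_init(&d->lru);
		demo_nr_unused--;
	}
}
```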

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 fs/dcache.c |   42 ++
 1 files changed, 14 insertions(+), 28 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 678d39d..71e4877 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -95,6 +95,14 @@ static void d_free(struct dentry *dentry)
call_rcu(&dentry->d_u.d_rcu, d_callback);
 }
 
+static void dentry_lru_remove(struct dentry *dentry)
+{
+   if (!list_empty(&dentry->d_lru)) {
+   list_del_init(&dentry->d_lru);
+   dentry_stat.nr_unused--;
+   }
+}
+
 /*
  * Release the dentry's inode, using the filesystem
  * d_iput() operation if defined.
@@ -211,13 +219,7 @@ repeat:
 unhash_it:
__d_drop(dentry);
 kill_it:
-   /* If dentry was on d_lru list
-* delete it from there
-*/
-   if (!list_empty(&dentry->d_lru)) {
-   list_del(&dentry->d_lru);
-   dentry_stat.nr_unused--;
-   }
+   dentry_lru_remove(dentry);
dentry = d_kill(dentry);
if (dentry)
goto repeat;
@@ -285,10 +287,7 @@ int d_invalidate(struct dentry * dentry)
 static inline struct dentry * __dget_locked(struct dentry *dentry)
 {
atomic_inc(&dentry->d_count);
-   if (!list_empty(&dentry->d_lru)) {
-   dentry_stat.nr_unused--;
-   list_del_init(&dentry->d_lru);
-   }
+   dentry_lru_remove(dentry);
return dentry;
 }
 
@@ -407,10 +406,7 @@ static void prune_one_dentry(struct dentry * dentry, int prune_parents)
 
if (dentry->d_op && dentry->d_op->d_delete)
dentry->d_op->d_delete(dentry);
-   if (!list_empty(&dentry->d_lru)) {
-   list_del(&dentry->d_lru);
-   dentry_stat.nr_unused--;
-   }
+   dentry_lru_remove(dentry);
__d_drop(dentry);
dentry = d_kill(dentry);
spin_lock(&dcache_lock);
@@ -600,10 +596,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 
/* detach this root from the system */
spin_lock(&dcache_lock);
-   if (!list_empty(&dentry->d_lru)) {
-   dentry_stat.nr_unused--;
-   list_del_init(&dentry->d_lru);
-   }
+   dentry_lru_remove(dentry);
__d_drop(dentry);
spin_unlock(&dcache_lock);
 
@@ -617,11 +610,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
spin_lock(&dcache_lock);
list_for_each_entry(loop, &dentry->d_subdirs,
d_u.d_child) {
-   if (!list_empty(&loop->d_lru)) {
-   dentry_stat.nr_unused--;
-   list_del_init(&loop->d_lru);
-   }
-
+   dentry_lru_remove(loop);
__d_drop(loop);
cond_resched_lock(&dcache_lock);
}
}
@@ -803,10 +792,7 @@ resume:
struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
next = tmp->next;
 
-   if (!list_empty(&dentry->d_lru)) {
-   dentry_stat.nr_unused--;
-   list_del_init(&dentry->d_lru);
-   }
+   dentry_lru_remove(dentry);
/* 
 * move only zero ref count dentries to the end 
 * of the unused list for prune_dcache
-- 
1.5.2.4


[RFC 13/26] SLUB: Add SlabReclaimable() to avoid repeated reclaim attempts

2007-08-31 Thread Christoph Lameter

Add a flag SlabReclaimable() that is set on slabs with a method
that allows defrag/reclaim. Clear the flag if a reclaim action is not
successful in reducing the number of objects in a slab. The reclaim
flag is set again if all objects have been allocated from it.
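
The life cycle of the flag can be sketched as a small userspace state
model. demo_slab, demo_after_reclaim() and demo_on_unfreeze() are
hypothetical names; the sketch only captures the two transitions described
above, not the locking or the page flag encoding used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_slab {
	int inuse;        /* objects currently allocated */
	int objects;      /* capacity of the slab */
	bool reclaimable;
};

/* Record the outcome of a reclaim pass; "before" is inuse prior to it. */
static void demo_after_reclaim(struct demo_slab *s, int before)
{
	assert(s != NULL);
	if (s->inuse == before)
		s->reclaimable = false;  /* nothing was freed: skip next time */
}

/* Called when the slab is put back; a full slab becomes a candidate again. */
static void demo_on_unfreeze(struct demo_slab *s)
{
	assert(s != NULL);
	if (s->inuse == s->objects)
		s->reclaimable = true;
}
```

The effect is that a slab whose objects could not be moved is not retried
on every pass, but gets another chance once its contents have turned over.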

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 mm/slub.c |   42 --
 1 file changed, 36 insertions(+), 6 deletions(-)

Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c2007-08-28 20:10:37.0 -0700
+++ linux-2.6/mm/slub.c 2007-08-28 20:10:47.0 -0700
@@ -107,6 +107,8 @@
 #define SLABDEBUG 0
 #endif
 
+#define SLABRECLAIMABLE (1 << PG_dirty)
+
 static inline int SlabFrozen(struct page *page)
 {
return page->flags & FROZEN;
@@ -137,6 +139,21 @@ static inline void ClearSlabDebug(struct
page->flags &= ~SLABDEBUG;
 }
 
+static inline int SlabReclaimable(struct page *page)
+{
+   return page->flags & SLABRECLAIMABLE;
+}
+
+static inline void SetSlabReclaimable(struct page *page)
+{
+   page->flags |= SLABRECLAIMABLE;
+}
+
+static inline void ClearSlabReclaimable(struct page *page)
+{
+   page->flags &= ~SLABRECLAIMABLE;
+}
+
 /*
  * Issues still to be resolved:
  *
@@ -1099,6 +1116,8 @@ static struct page *new_slab(struct kmem
if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
SLAB_STORE_USER | SLAB_TRACE))
SetSlabDebug(page);
+   if (s->kick)
+   SetSlabReclaimable(page);
 
  out:
if (flags & __GFP_WAIT)
@@ -1155,6 +1174,7 @@ static void discard_slab(struct kmem_cac
atomic_long_dec(&n->nr_slabs);
reset_page_mapcount(page);
__ClearPageSlab(page);
+   ClearSlabReclaimable(page);
free_slab(s, page);
 }
 
@@ -1328,8 +1348,12 @@ static void unfreeze_slab(struct kmem_ca
 
if (page->freelist)
add_partial(n, page, tail);
-   else if (SlabDebug(page) && (s->flags & SLAB_STORE_USER))
-   add_full(n, page);
+   else {
+   if (SlabDebug(page) && (s->flags & SLAB_STORE_USER))
+   add_full(n, page);
+   if (s->kick && !SlabReclaimable(page))
+   SetSlabReclaimable(page);
+   }
slab_unlock(page);
 
} else {
@@ -2659,7 +2683,7 @@ int kmem_cache_isolate_slab(struct page 
struct kmem_cache *s;
int rc = -ENOENT;
 
-   if (!PageSlab(page) || SlabFrozen(page))
+   if (!PageSlab(page) || SlabFrozen(page) || !SlabReclaimable(page))
return rc;
 
/*
@@ -2729,7 +2753,7 @@ static int kmem_cache_vacate(struct page
struct kmem_cache *s;
unsigned long *map;
int leftover;
-   int objects;
+   int objects = -1;
void *private;
unsigned long flags;
int tail = 1;
@@ -2739,7 +2763,7 @@ static int kmem_cache_vacate(struct page
slab_lock(page);
 
s = page->slab;
-   map = scratch + s->objects * sizeof(void **);
+   map = scratch + max_defrag_slab_objects * sizeof(void **);
if (!page->inuse || !s->kick)
goto out;
 
@@ -2773,10 +2797,13 @@ static int kmem_cache_vacate(struct page
local_irq_save(flags);
slab_lock(page);
tail = 0;
-out:
+
/*
 * Check the result and unfreeze the slab
 */
+   if (page->inuse == objects)
+   ClearSlabReclaimable(page);
+out:
leftover = page->inuse;
unfreeze_slab(s, page, tail);
local_irq_restore(flags);
@@ -2831,6 +2858,9 @@ static unsigned long __kmem_cache_shrink
if (inuse > s->objects / 4)
continue;
 
+   if (s->kick && !SlabReclaimable(page))
+   continue;
+
if (!slab_trylock(page))
continue;
 


[RFC 25/26] dentries: dentry defragmentation

2007-08-31 Thread Christoph Lameter

kick() is called after get() has been used and after the slab has dropped
all of its own locks. The dentry pruning for unused entries works in a
straightforward way.
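
The two-phase get()/kick() protocol can be sketched in userspace like this.
demo_obj, demo_get() and demo_kick() are hypothetical and use a plain
integer reference count with no locking; the sketch only shows the
division of labour: get() pins what can be reclaimed and clears the rest
from the vector, kick() later drops the pin and frees an object if the pin
was the last reference.

```c
#include <assert.h>
#include <stddef.h>

struct demo_obj {
	int refcount;   /* references held by other users plus our pin */
	int freeing;    /* already on its way out; must not be touched */
	int freed;      /* set once the object has been released */
};

/* Phase 1: pin every object that can be reclaimed, drop the rest from v. */
static void demo_get(int nr, void **v)
{
	int i;

	assert(nr >= 0);
	for (i = 0; i < nr; i++) {
		struct demo_obj *o = v[i];

		if (o->freeing)
			v[i] = NULL;
		else
			o->refcount++;
	}
}

/* Phase 2: release the pin; free the object if we held the last reference. */
static void demo_kick(int nr, void **v)
{
	int i;

	assert(nr >= 0);
	for (i = 0; i < nr; i++) {
		struct demo_obj *o = v[i];

		if (!o)
			continue;
		if (--o->refcount == 0)
			o->freed = 1;
	}
}
```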

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 fs/dcache.c |  100 +++-
 1 file changed, 99 insertions(+), 1 deletion(-)

Index: linux-2.6/fs/dcache.c
===
--- linux-2.6.orig/fs/dcache.c  2007-08-29 18:55:21.0 -0700
+++ linux-2.6/fs/dcache.c   2007-08-29 18:57:51.0 -0700
@@ -143,7 +143,10 @@ static struct dentry *d_kill(struct dent
 
list_del(&dentry->d_u.d_child);
dentry_stat.nr_dentry--;/* For d_free, below */
-   /*drops the locks, at that point nobody can reach this dentry */
+   /*
+* drops the locks, at that point nobody (aside from defrag)
+* can reach this dentry
+*/
dentry_iput(dentry);
parent = dentry->d_parent;
d_free(dentry);
@@ -2100,6 +2103,100 @@ static void __init dcache_init_early(voi
INIT_HLIST_HEAD(&dentry_hashtable[loop]);
 }
 
+/*
+ * The slab allocator is holding off frees. We can safely examine
+ * the object without the danger of it vanishing from under us.
+ */
+static void *get_dentries(struct kmem_cache *s, int nr, void **v)
+{
+   struct dentry *dentry;
+   int i;
+
+   spin_lock(&dcache_lock);
+   for (i = 0; i < nr; i++) {
+   dentry = v[i];
+
+   /*
+* Three sorts of dentries cannot be reclaimed:
+*
+* 1. dentries that are in the process of being allocated
+*or being freed. In that case the dentry is neither
+*on the LRU nor hashed.
+*
+* 2. Fake hashed entries as used for anonymous dentries
+*and pipe I/O. The fake hashed entries have d_flags
+*set to indicate a hashed entry. However, the
+*d_hash field indicates that the entry is not hashed.
+*
+* 3. dentries that have a backing store that is not
+*writable. This is true for tmpfs and other in
+*memory filesystems. Removing dentries from them
+*would lose dentries for good.
+*/
+   if ((d_unhashed(dentry) && list_empty(&dentry->d_lru)) ||
+  (!d_unhashed(dentry) && hlist_unhashed(&dentry->d_hash)) ||
+  (dentry->d_inode &&
+  !mapping_cap_writeback_dirty(dentry->d_inode->i_mapping)))
+   /* Ignore this dentry */
+   v[i] = NULL;
+   else
+   /* dget_locked will remove the dentry from the LRU */
+   dget_locked(dentry);
+   }
+   spin_unlock(&dcache_lock);
+   return NULL;
+}
+
+/*
+ * Slab has dropped all the locks. Get rid of the refcount obtained
+ * earlier and also free the object.
+ */
+static void kick_dentries(struct kmem_cache *s,
+   int nr, void **v, void *private)
+{
+   struct dentry *dentry;
+   int i;
+
+   /*
+* First invalidate the dentries without holding the dcache lock
+*/
+   for (i = 0; i < nr; i++) {
+   dentry = v[i];
+
+   if (dentry)
+   d_invalidate(dentry);
+   }
+
+   /*
+* If we are the last one holding a reference then the dentries can
+* be freed. We need the dcache_lock.
+*/
+   spin_lock(&dcache_lock);
+   for (i = 0; i < nr; i++) {
+   dentry = v[i];
+   if (!dentry)
+   continue;
+
+   spin_lock(&dentry->d_lock);
+   if (atomic_read(&dentry->d_count) > 1) {
+   spin_unlock(&dentry->d_lock);
+   spin_unlock(&dcache_lock);
+   dput(dentry);
+   spin_lock(&dcache_lock);
+   continue;
+   }
+
+   prune_one_dentry(dentry, 1);
+   }
+   spin_unlock(&dcache_lock);
+
+   /*
+* dentries are freed using RCU so we need to wait until RCU
+* operations are complete
+*/
+   synchronize_rcu();
+}
+
 static void __init dcache_init(unsigned long mempages)
 {
int loop;
@@ -2109,6 +2206,7 @@ static void __init dcache_init(unsigned 
dcache_ctor);
 
register_shrinker(&dcache_shrinker);
+   kmem_cache_setup_defrag(dentry_cache, get_dentries, kick_dentries);
 
/* Hash may have been set up in dcache_init_early */
if (!hashdist)


[RFC 12/26] SLUB: Slab reclaim through Lumpy reclaim

2007-08-31 Thread Christoph Lameter

Create two functions, kmem_cache_isolate_slab() and kmem_cache_reclaim(),
to support lumpy reclaim.

In order to isolate pages we will have to handle slab page allocations in
such a way that we can determine if a slab is valid whenever we access it
regardless of its time in life.

A valid slab that can be freed has PageSlab(page) set and page->inuse > 0.
So we need to make sure in allocate_slab that page->inuse is zero before
PageSlab is set; otherwise kmem_cache_vacate may operate on a slab that
has not been properly set up yet.

kmem_cache_isolate_slab() is called from lumpy reclaim to isolate pages
neighboring a page cache page that is being reclaimed. Lumpy reclaim will
gather the slabs and call kmem_cache_reclaim() on the list.

This means that we can remove a slab that is in the way of coalescing
together a higher order page.
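
The ordering requirement between page->inuse and PageSlab can be sketched
in userspace with C11 atomics. This is not what the patch does literally:
the patch uses smp_wmb() before __SetPageSlab(), while the sketch swaps in
a release store and an acquire load, and demo_page, demo_publish_slab() and
demo_slab_can_vacate() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_page {
	int inuse;            /* number of allocated objects */
	atomic_bool is_slab;  /* stands in for the PageSlab flag */
};

/* Writer: finish initializing inuse before publishing the slab flag. */
static void demo_publish_slab(struct demo_page *p)
{
	assert(p != NULL);
	p->inuse = 0;
	atomic_store_explicit(&p->is_slab, true, memory_order_release);
}

/* Reader: only trust inuse after the flag has been observed. */
static bool demo_slab_can_vacate(struct demo_page *p)
{
	assert(p != NULL);
	if (!atomic_load_explicit(&p->is_slab, memory_order_acquire))
		return false;
	return p->inuse > 0;
}
```

A reader that sees the flag is then guaranteed to see the zeroed inuse as
well, so a slab that is still being set up is never treated as holding
objects to vacate.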

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 include/linux/slab.h |2 +
 mm/slab.c|   13 +++
 mm/slub.c|   88 +++
 mm/vmscan.c  |   15 ++--
 4 files changed, 109 insertions(+), 9 deletions(-)

Index: linux-2.6/include/linux/slab.h
===
--- linux-2.6.orig/include/linux/slab.h 2007-08-28 20:05:42.0 -0700
+++ linux-2.6/include/linux/slab.h  2007-08-28 20:06:22.0 -0700
@@ -62,6 +62,8 @@ unsigned int kmem_cache_size(struct kmem
 const char *kmem_cache_name(struct kmem_cache *);
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
 int kmem_cache_defrag(int node);
+int kmem_cache_isolate_slab(struct page *);
+int kmem_cache_reclaim(struct list_head *);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
Index: linux-2.6/mm/slab.c
===
--- linux-2.6.orig/mm/slab.c2007-08-28 20:04:54.0 -0700
+++ linux-2.6/mm/slab.c 2007-08-28 20:06:22.0 -0700
@@ -2532,6 +2532,19 @@ int kmem_cache_defrag(int node)
return 0;
 }
 
+/*
+ * SLAB does not support slab defragmentation
+ */
+int kmem_cache_isolate_slab(struct page *page)
+{
+   return -ENOSYS;
+}
+
+int kmem_cache_reclaim(struct list_head *zaplist)
+{
+   return 0;
+}
+
 /**
  * kmem_cache_destroy - delete a cache
  * @cachep: the cache to destroy
Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c2007-08-28 20:04:54.0 -0700
+++ linux-2.6/mm/slub.c 2007-08-28 20:10:37.0 -0700
@@ -1006,6 +1006,7 @@ static inline int slab_pad_check(struct 
 static inline int check_object(struct kmem_cache *s, struct page *page,
void *object, int active) { return 1; }
 static inline void add_full(struct kmem_cache_node *n, struct page *page) {}
+static inline void remove_full(struct kmem_cache *s, struct page *page) {}
 static inline void kmem_cache_open_debug_check(struct kmem_cache *s) {}
 #define slub_debug 0
 #endif
@@ -1068,11 +1069,9 @@ static struct page *new_slab(struct kmem
n = get_node(s, page_to_nid(page));
if (n)
atomic_long_inc(&n->nr_slabs);
+
+   page->inuse = 0;
page->slab = s;
-   page->flags |= 1 << PG_slab;
-   if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
-   SLAB_STORE_USER | SLAB_TRACE))
-   SetSlabDebug(page);
 
start = page_address(page);
end = start + s->objects * s->size;
@@ -1090,8 +1089,18 @@ static struct page *new_slab(struct kmem
set_freepointer(s, last, NULL);
 
page->freelist = start;
-   page->inuse = 0;
-out:
+
+   /*
+* page->inuse must be 0 when PageSlab(page) becomes
+* true so that defrag knows that this slab is not in use.
+*/
+   smp_wmb();
+   __SetPageSlab(page);
+   if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
+   SLAB_STORE_USER | SLAB_TRACE))
+   SetSlabDebug(page);
+
+ out:
if (flags & __GFP_WAIT)
local_irq_disable();
return page;
@@ -2638,6 +2647,73 @@ static unsigned long count_partial(struc
return x;
 }
 
+ /*
+ * Isolate page from the slab partial lists. Return 0 if successful.
+ *
+ * After isolation the LRU field can be used to put the page onto
+ * a reclaim list.
+ */
+int kmem_cache_isolate_slab(struct page *page)
+{
+   unsigned long flags;
+   struct kmem_cache *s;
+   int rc = -ENOENT;
+
+   if (!PageSlab(page) || SlabFrozen(page))
+   return rc;
+
+   /*
+* Get a reference to the page. Return if it is freed or being freed.
+* This is necessary to make sure that the page does not vanish
+* from under us before we are able to check the result.
+*/
+   if (!get_page_unless_zero(page))
+   return rc;
+
+   local_irq_save(flags);
+

[RFC 20/26] FS: Proc filesystem support for slab defrag

2007-08-31 Thread Christoph Lameter

Support procfs inode defragmentation

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 fs/proc/inode.c |8 
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index a5b0dfd..83a66d7 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -113,6 +113,12 @@ static void init_once(void * foo, struct kmem_cache * 
cachep, unsigned long flag
inode_init_once(>vfs_inode);
 }
 
+static void *proc_get_inodes(struct kmem_cache *s, int nr, void **v)
+{
+   return fs_get_inodes(s, nr, v,
+   offsetof(struct proc_inode, vfs_inode));
+};
+
 int __init proc_init_inodecache(void)
 {
proc_inode_cachep = kmem_cache_create("proc_inode_cache",
@@ -122,6 +128,8 @@ int __init proc_init_inodecache(void)
 init_once);
if (proc_inode_cachep == NULL)
return -ENOMEM;
+   kmem_cache_setup_defrag(proc_inode_cachep,
+   proc_get_inodes, kick_inodes);
return 0;
 }
 
-- 
1.5.2.4


[RFC 15/26] bufferhead: Revert constructor removal

2007-08-31 Thread Christoph Lameter

The constructor for buffer_head slabs was removed recently. We need
the constructor in order to ensure that slab objects always have a definite
state, even before they are allocated.
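
Why a constructor gives that guarantee can be sketched in userspace. The
types and the model_init_slab() helper below are hypothetical stand-ins, not
the kernel's slab code; the point is only that the constructor runs over
every slot when a slab is created, so even never-allocated slots hold a
well-defined state.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for list_head and buffer_head. */
struct model_list_head {
	struct model_list_head *next, *prev;
};

struct model_bh {
	unsigned long state;
	struct model_list_head assoc;
};

/* Constructor: zero the object and make the list head point at itself. */
static void model_bh_ctor(void *data)
{
	struct model_bh *bh = data;

	memset(bh, 0, sizeof(*bh));
	bh->assoc.next = &bh->assoc;
	bh->assoc.prev = &bh->assoc;
}

/* Run the constructor over every slot of a freshly created "slab". */
static void model_init_slab(void *slab, size_t nr, size_t size,
			    void (*ctor)(void *))
{
	char *p = slab;

	for (size_t i = 0; i < nr; i++)
		ctor(p + i * size);
}
```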

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 fs/buffer.c |   19 +++
 1 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 0e5ec37..f4824d1 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2960,9 +2960,8 @@ static void recalc_bh_state(void)

 struct buffer_head *alloc_buffer_head(gfp_t gfp_flags)
 {
-   struct buffer_head *ret = kmem_cache_zalloc(bh_cachep, gfp_flags);
+   struct buffer_head *ret = kmem_cache_alloc(bh_cachep, gfp_flags);
if (ret) {
-   INIT_LIST_HEAD(&ret->b_assoc_buffers);
get_cpu_var(bh_accounting).nr++;
recalc_bh_state();
put_cpu_var(bh_accounting);
@@ -3003,12 +3002,24 @@ static int buffer_cpu_notify(struct notifier_block 
*self,
return NOTIFY_OK;
 }
 
+static void
+init_buffer_head(void *data, struct kmem_cache *cachep, unsigned long flags)
+{
+   struct buffer_head * bh = (struct buffer_head *)data;
+
+   memset(bh, 0, sizeof(*bh));
+   INIT_LIST_HEAD(&bh->b_assoc_buffers);
+}
+
 void __init buffer_init(void)
 {
int nrpages;
 
-   bh_cachep = KMEM_CACHE(buffer_head,
-   SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
+   bh_cachep = kmem_cache_create("buffer_head",
+   sizeof(struct buffer_head), 0,
+   (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
+   SLAB_MEM_SPREAD),
+   init_buffer_head);
 
/*
 * Limit the bh occupancy to 10% of ZONE_NORMAL
-- 
1.5.2.4


[RFC 10/26] SLUB: Trigger defragmentation from memory reclaim

2007-08-31 Thread Christoph Lameter

This patch triggers slab defragmentation from memory reclaim.
The logical point for this is after slab shrinking has been performed in
vmscan.c. At that point the fragmentation of a slab cache has been increased
by objects being freed. So we call kmem_cache_defrag() from there.

shrink_slab() in vmscan.c is called in some contexts to do
global shrinking of slabs and in others to do shrinking for
a particular zone. Pass the zone to shrink_slab() so that it
can call kmem_cache_defrag() and restrict the defragmentation to
the node that is under memory pressure.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 fs/drop_caches.c |2 +-
 include/linux/mm.h   |2 +-
 include/linux/slab.h |1 +
 mm/vmscan.c  |   27 ---
 4 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 59375ef..fb58e63 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -50,7 +50,7 @@ void drop_slab(void)
int nr_objects;
 
do {
-   nr_objects = shrink_slab(1000, GFP_KERNEL, 1000);
+   nr_objects = shrink_slab(1000, GFP_KERNEL, 1000, NULL);
} while (nr_objects > 10);
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a396aac..9fbb6ba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1202,7 +1202,7 @@ int in_gate_area_no_task(unsigned long addr);
 int drop_caches_sysctl_handler(struct ctl_table *, int, struct file *,
void __user *, size_t *, loff_t *);
 unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
-   unsigned long lru_pages);
+   unsigned long lru_pages, struct zone *zone);
 void drop_pagecache(void);
 void drop_slab(void);
 
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 848e9a7..7d8ec17 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -61,6 +61,7 @@ void kmem_cache_free(struct kmem_cache *, void *);
 unsigned int kmem_cache_size(struct kmem_cache *);
 const char *kmem_cache_name(struct kmem_cache *);
 int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
+int kmem_cache_defrag(int node);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d419e10..c6882d8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -150,10 +150,18 @@ EXPORT_SYMBOL(unregister_shrinker);
  * are eligible for the caller's allocation attempt.  It is used for balancing
  * slab reclaim versus page reclaim.
  *
+ * zone is the zone for which we are shrinking the slabs. If the intent
+ * is to do a global shrink then zone may be NULL. Specification of a
+ * zone is currently only used to limit slab defragmentation to a NUMA node.
+ * The performance of shrink_slab would be better (in particular under NUMA)
+ * if it could be targeted as a whole to the zone that is under memory
+ * pressure but the VFS infrastructure does not allow that at the present
+ * time.
+ *
  * Returns the number of slab objects which we shrunk.
  */
 unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
-   unsigned long lru_pages)
+   unsigned long lru_pages, struct zone *zone)
 {
struct shrinker *shrinker;
unsigned long ret = 0;
@@ -210,6 +218,8 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t 
gfp_mask,
shrinker->nr += total_scan;
}
up_read(_rwsem);
+   if (gfp_mask & __GFP_FS)
+   kmem_cache_defrag(zone ? zone_to_nid(zone) : -1);
return ret;
 }
 
@@ -1151,7 +1161,8 @@ unsigned long try_to_free_pages(struct zone **zones, int 
order, gfp_t gfp_mask)
if (!priority)
disable_swap_token();
nr_reclaimed += shrink_zones(priority, zones, &sc);
-   shrink_slab(sc.nr_scanned, gfp_mask, lru_pages);
+   shrink_slab(sc.nr_scanned, gfp_mask, lru_pages,
+   NULL);
if (reclaim_state) {
nr_reclaimed += reclaim_state->reclaimed_slab;
reclaim_state->reclaimed_slab = 0;
@@ -1321,7 +1332,7 @@ loop_again:
nr_reclaimed += shrink_zone(priority, zone, &sc);
reclaim_state->reclaimed_slab = 0;
nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
-   lru_pages);
+   lru_pages, zone);
nr_reclaimed += reclaim_state->reclaimed_slab;
total_scanned += sc.nr_scanned;
if (zone->all_unreclaimable)
@@ -1559,7 +1570,7 @@ unsigned long shrink_all_memory(unsigned long nr_pages)
/* If slab caches are huge, it's better to hit them first */
while (nr_slab >= lru_pages) {
reclaim_state.reclaimed_slab

[RFC 11/26] VM: Allow get_page_unless_zero on compound pages

2007-08-31 Thread Christoph Lameter

SLUB uses compound pages for larger slabs. We need to increment
the page count of these pages to make sure that they are not
freed under us while lumpy reclaim is trying to reclaim them.
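
The "take a reference unless the count is already zero" operation that
get_page_unless_zero() relies on can be modelled with a C11
compare-and-exchange loop. This is a userspace sketch; model_inc_not_zero()
is a hypothetical name and not the kernel's atomic_inc_not_zero()
implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Increment *count unless it is zero; report whether a reference was taken. */
static bool model_inc_not_zero(atomic_int *count)
{
	int old = atomic_load(count);

	while (old != 0) {
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(count, &old, old + 1))
			return true;
	}
	return false;
}
```

A count of zero means the page is already on its way to being freed, so the
caller must back off instead of resurrecting it.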

(The patch is also part of the large blocksize patchset)

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 include/linux/mm.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9fbb6ba..713d096 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -290,7 +290,7 @@ static inline int put_page_testzero(struct page *page)
  */
 static inline int get_page_unless_zero(struct page *page)
 {
-   VM_BUG_ON(PageCompound(page));
+   VM_BUG_ON(PageTail(page));
return atomic_inc_not_zero(&page->_count);
 }
 
-- 
1.5.2.4


[RFC 09/26] SLUB: Slab defrag core

2007-08-31 Thread Christoph Lameter

Slab defragmentation (aside from Lumpy Reclaim) may occur:

1. Unconditionally when the kernel calls kmem_cache_shrink() on a slab cache.

2. Use of the slabinfo command line to trigger slab shrinking.

3. Per node defrag conditionally when kmem_cache_defrag() is called.

   Defragmentation is only performed if the usage ratio of the slab cache
   is lower than the specified percentage. The usage ratio is measured
   by calculating the percentage of objects in use compared to the total
   number of objects that the slab cache could hold.

   kmem_cache_defrag takes a node parameter. This can either be -1 if
   defragmentation should be performed on all nodes, or a node number.
   If a node number was specified then defragmentation is only performed
   on a specific node.

   Slab defragmentation is a memory intensive operation that can be
   sped up in a NUMA system if mostly node local memory is accessed. That
   is the case if we have just performed reclaim on a node.
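
The threshold test in item 3 can be written down as a small calculation. The
helper names are hypothetical and the strict "<" comparison is an assumption
taken from the wording above; the kernel code may compute it differently.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Percentage of object slots in use. An empty cache reports 100 so that
 * it is never selected for defragmentation (assumption of this sketch).
 */
static unsigned long model_usage_percent(unsigned long objects_in_use,
					 unsigned long total_slots)
{
	if (total_slots == 0)
		return 100;
	return objects_in_use * 100 / total_slots;
}

/* Defragment only when usage drops below the configured percentage. */
static bool model_should_defrag(unsigned long objects_in_use,
				unsigned long total_slots,
				unsigned long defrag_ratio)
{
	return model_usage_percent(objects_in_use, total_slots) < defrag_ratio;
}
```

With a ratio of 30, a cache with 2 of 10 slots in use qualifies, while one
with 3 of 10 does not.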

In order for a slabcache to support defragmentation a couple of functions
must be set up via a call to kmem_cache_setup_defrag(). These are:
void *get(struct kmem_cache *s, int nr, void **objects)

Must obtain a reference to the listed objects. SLUB guarantees that
the objects are still allocated. However, other threads may be blocked
in slab_free attempting to free objects in the slab. These may succeed
as soon as get() returns to the slab allocator. The function must
be able to detect such situations and void the attempts to free such
objects (for example by voiding the corresponding entry in the objects
array).

No slab operations may be performed in get(). Interrupts
are disabled. What can be done is very limited. The slab lock
for the page with the object is taken. Any attempt to perform a slab
operation may lead to a deadlock.

get() returns a private pointer that is passed to kick(). Should we
be unable to obtain all references then that pointer may indicate
to the kick() function that it should not attempt any object removal
or move but simply drop the references.

void kick(struct kmem_cache *, int nr, void **objects, void *get_result)

After SLUB has established references to the objects in a
slab it will then drop all locks and use kick() to move objects out
of the slab. The existence of the object is guaranteed by virtue of
the earlier obtained references via get(). The callback may perform
any slab operation since no locks are held at the time of call.

The callback should remove the object from the slab in some way. This
may be accomplished by reclaiming the object and then running
kmem_cache_free() or reallocating it and then running
kmem_cache_free(). Reallocation is advantageous because the partial
slabs were just sorted to have the partial slabs with the most objects
first. Reallocation is likely to result in filling up a slab in
addition to freeing up one slab so that it also can be removed from
the partial list.

kick() does not return a result. SLUB will check the number of
remaining objects in the slab. If all objects were removed then
we know that the operation was successful.
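
The get()/kick() contract described above can be illustrated with a
userspace model. Everything here is hypothetical: struct model_obj, its
refcount, pinned and evicted fields, and the abort marker are inventions for
the sketch, and the struct kmem_cache argument of the real callbacks is
left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical object type used only for this sketch. */
struct model_obj {
	int refcount;  /* 0 means a free is already in flight */
	int pinned;    /* object cannot be moved or reclaimed */
	int evicted;   /* set once kick() has moved the object out */
};

/* Private pointer value telling kick() to only drop references. */
static int model_abort_marker;

/* Take references; void entries for objects that are being freed. */
static void *model_get(int nr, void **objects)
{
	void *private = NULL;

	for (int i = 0; i < nr; i++) {
		struct model_obj *o = objects[i];

		if (!o)
			continue;
		if (o->refcount == 0) {
			objects[i] = NULL;
			continue;
		}
		o->refcount++;
		if (o->pinned)
			private = &model_abort_marker;
	}
	return private;
}

/* Move objects out unless get() signalled failure, then drop references. */
static void model_kick(int nr, void **objects, void *private)
{
	for (int i = 0; i < nr; i++) {
		struct model_obj *o = objects[i];

		if (!o)
			continue;
		if (private != &model_abort_marker)
			o->evicted = 1; /* stands in for moving or reclaiming */
		o->refcount--;
	}
}
```

The same array is handed to both callbacks, so an entry voided in get() is
simply skipped by kick().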

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 mm/slab.c |5 +
 mm/slub.c |  265 ++
 2 files changed, 222 insertions(+), 48 deletions(-)

Index: linux-2.6/mm/slab.c
===
--- linux-2.6.orig/mm/slab.c2007-08-28 20:04:05.0 -0700
+++ linux-2.6/mm/slab.c 2007-08-28 20:04:54.0 -0700
@@ -2527,6 +2527,11 @@ int kmem_cache_shrink(struct kmem_cache 
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
 
+int kmem_cache_defrag(int node)
+{
+   return 0;
+}
+
 /**
  * kmem_cache_destroy - delete a cache
  * @cachep: the cache to destroy
Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c2007-08-28 20:04:10.0 -0700
+++ linux-2.6/mm/slub.c 2007-08-28 20:04:54.0 -0700
@@ -2639,75 +2639,244 @@ static unsigned long count_partial(struc
 }
 
 /*
- * kmem_cache_shrink removes empty slabs from the partial lists and sorts
- * the remaining slabs by the number of items in use. The slabs with the
- * most items in use come first. New allocations will then fill those up
- * and thus they can be removed from the partial lists.
+ * Vacate all objects in the given slab.
  *
- * The slabs with the least items are placed last. This results in them
- * being allocated from last increasing the chance that the last objects
- * are freed in them.
+ * The scratch area passed to list function is sufficient to hold
+ * struct listhead times objects per slab. We

[RFC 07/26] SLUB: Sort slab cache list and establish maximum objects for defrag slabs

2007-08-31 Thread Christoph Lameter

When defragmenting slabs it is advantageous to have all
defragmentable slabs together at the beginning of the list so that we do not
have to scan the complete list. When adding a slab cache, put defragmentable
caches first and others last.

Determine the maximum number of objects in defragmentable slabs. This allows
us to size the allocation of arrays holding refs to these objects later.
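
How the tracked maximum feeds the scratch allocation can be sketched as
follows. The arithmetic mirrors alloc_scratch() in the diff below (one
pointer per object plus a bitmap with one bit per object); the MODEL_ macros
and function names are hypothetical userspace stand-ins.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MODEL_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define MODEL_BITS_TO_LONGS(n) \
	(((n) + MODEL_BITS_PER_LONG - 1) / MODEL_BITS_PER_LONG)

/* Keep the largest objects-per-slab value seen among defragmentable caches. */
static size_t model_track_max(size_t current_max, size_t objects_per_slab)
{
	return objects_per_slab > current_max ? objects_per_slab : current_max;
}

/* Bytes needed: an array of object pointers followed by a bitmap. */
static size_t model_scratch_size(size_t max_objects)
{
	return max_objects * sizeof(void *) +
	       MODEL_BITS_TO_LONGS(max_objects) * sizeof(unsigned long);
}
```

Sizing for the maximum once means a single scratch buffer can serve any
defragmentable cache.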

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 mm/slub.c |   19 +--
 1 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 4a64038..9006069 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -226,6 +226,9 @@ static enum {
 static DECLARE_RWSEM(slub_lock);
 static LIST_HEAD(slab_caches);
 
+/* Maximum objects in defragmentable slabs */
+static unsigned int max_defrag_slab_objects = 0;
+
 /*
  * Tracking user of a slab.
  */
@@ -2385,7 +2388,7 @@ static struct kmem_cache *create_kmalloc_cache(struct 
kmem_cache *s,
flags, NULL))
goto panic;
 
-   list_add(&s->list, &slab_caches);
+   list_add_tail(&s->list, &slab_caches);
up_write(&slub_lock);
if (sysfs_slab_add(s))
goto panic;
@@ -2597,6 +2600,13 @@ void kfree(const void *x)
 }
 EXPORT_SYMBOL(kfree);
 
+static inline void *alloc_scratch(void)
+{
+   return kmalloc(max_defrag_slab_objects * sizeof(void *) +
+   BITS_TO_LONGS(max_defrag_slab_objects) * sizeof(unsigned long),
+   GFP_KERNEL);
+}
+
 void kmem_cache_setup_defrag(struct kmem_cache *s,
void *(*get)(struct kmem_cache *, int nr, void **),
void (*kick)(struct kmem_cache *, int nr, void **, void *private))
@@ -2608,6 +2618,11 @@ void kmem_cache_setup_defrag(struct kmem_cache *s,
BUG_ON(!s->ctor);
s->get = get;
s->kick = kick;
+   down_write(&slub_lock);
+   list_move(&s->list, &slab_caches);
+   if (s->objects > max_defrag_slab_objects)
+   max_defrag_slab_objects = s->objects;
+   up_write(&slub_lock);
 }
 EXPORT_SYMBOL(kmem_cache_setup_defrag);
 
@@ -2878,7 +2893,7 @@ struct kmem_cache *kmem_cache_create(const char *name, 
size_t size,
if (s) {
if (kmem_cache_open(s, GFP_KERNEL, name,
size, align, flags, ctor)) {
-   list_add(&s->list, &slab_caches);
+   list_add_tail(&s->list, &slab_caches);
up_write(&slub_lock);
if (sysfs_slab_add(s))
goto err;
-- 
1.5.2.4


[RFC 03/26] SLUB: Rename NUMA defrag_ratio to remote_node_defrag_ratio

2007-08-31 Thread Christoph Lameter

We need the defrag ratio for the non NUMA situation now. The NUMA defrag works
by allocating objects from partial slabs on remote nodes. Rename it to

remote_node_defrag_ratio

to be clear about this.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 include/linux/slub_def.h |5 -
 mm/slub.c|   17 +
 2 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 8aad7dc..5912b58 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -59,7 +59,10 @@ struct kmem_cache {
 #endif
 
 #ifdef CONFIG_NUMA
-   int defrag_ratio;
+   /*
+* Defragmentation by allocating from a remote node.
+*/
+   int remote_node_defrag_ratio;
struct kmem_cache_node *node[MAX_NUMNODES];
 #endif
 #ifdef CONFIG_SMP
diff --git a/mm/slub.c b/mm/slub.c
index aad6f83..e63aba5 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1267,7 +1267,8 @@ static struct page *get_any_partial(struct kmem_cache *s, 
gfp_t flags)
 * expensive if we do it every time we are trying to find a slab
 * with available objects.
 */
-   if (!s->defrag_ratio || get_cycles() % 1024 > s->defrag_ratio)
+   if (!s->remote_node_defrag_ratio ||
+   get_cycles() % 1024 > s->remote_node_defrag_ratio)
return NULL;
 
zonelist = _DATA(slab_node(current->mempolicy))
@@ -2200,7 +2201,7 @@ static int kmem_cache_open(struct kmem_cache *s, gfp_t 
gfpflags,
 
s->refcount = 1;
 #ifdef CONFIG_NUMA
-   s->defrag_ratio = 100;
+   s->remote_node_defrag_ratio = 100;
 #endif
 
if (!init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
@@ -3717,21 +3718,21 @@ static ssize_t free_calls_show(struct kmem_cache *s, 
char *buf)
 SLAB_ATTR_RO(free_calls);
 
 #ifdef CONFIG_NUMA
-static ssize_t defrag_ratio_show(struct kmem_cache *s, char *buf)
+static ssize_t remote_node_defrag_ratio_show(struct kmem_cache *s, char *buf)
 {
-   return sprintf(buf, "%d\n", s->defrag_ratio / 10);
+   return sprintf(buf, "%d\n", s->remote_node_defrag_ratio / 10);
 }
 
-static ssize_t defrag_ratio_store(struct kmem_cache *s,
+static ssize_t remote_node_defrag_ratio_store(struct kmem_cache *s,
const char *buf, size_t length)
 {
int n = simple_strtoul(buf, NULL, 10);
 
if (n < 100)
-   s->defrag_ratio = n * 10;
+   s->remote_node_defrag_ratio = n * 10;
return length;
 }
-SLAB_ATTR(defrag_ratio);
+SLAB_ATTR(remote_node_defrag_ratio);
 #endif
 
 static struct attribute * slab_attrs[] = {
@@ -3762,7 +3763,7 @@ static struct attribute * slab_attrs[] = {
_dma_attr.attr,
 #endif
 #ifdef CONFIG_NUMA
-   &defrag_ratio_attr.attr,
+   &remote_node_defrag_ratio_attr.attr,
 #endif
NULL
 };
-- 
1.5.2.4


[RFC 08/26] SLUB: Consolidate add_partial and add_partial_tail to one function

2007-08-31 Thread Christoph Lameter

Add a parameter to add_partial instead of having separate functions.
That allows detailed control from multiple places when putting
slabs back on the partial list. If we put slabs back at the front
then they are likely to be used immediately for allocations. If they are
put at the end then we can maximize the time that the partial slabs
spend without allocations.

When deactivating slab we can put the slabs that had remote objects freed
to them at the end of the list so that the cachelines can cool down.
Slabs that had objects from the cpu freed to them are put in the front
of the list to be reused ASAP.
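
The head-or-tail choice can be modelled with a minimal circular list. This
is a userspace sketch with hypothetical names; the list_lock and the
nr_partial counter of the real add_partial() are left out.

```c
#include <assert.h>

/* Hypothetical circular doubly linked list node. */
struct model_node {
	struct model_node *prev, *next;
	int id;
};

static void model_list_init(struct model_node *head)
{
	head->prev = head;
	head->next = head;
}

static void model_insert_between(struct model_node *n,
				 struct model_node *prev,
				 struct model_node *next)
{
	n->prev = prev;
	n->next = next;
	prev->next = n;
	next->prev = n;
}

/* tail != 0: queue at the end (cold); tail == 0: put first (hot). */
static void model_add_partial(struct model_node *head,
			      struct model_node *slab, int tail)
{
	if (tail)
		model_insert_between(slab, head->prev, head);
	else
		model_insert_between(slab, head, head->next);
}
```

Allocation takes slabs from the front, so a slab added with tail set is the
last one to receive new objects.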

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 mm/slub.c |   31 +++
 1 file changed, 15 insertions(+), 16 deletions(-)

Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c2007-08-28 20:03:16.0 -0700
+++ linux-2.6/mm/slub.c 2007-08-28 20:21:55.0 -0700
@@ -1173,19 +1173,15 @@ static __always_inline int slab_trylock(
 /*
  * Management of partially allocated slabs
  */
-static void add_partial_tail(struct kmem_cache_node *n, struct page *page)
+static void add_partial(struct kmem_cache_node *n,
+   struct page *page, int tail)
 {
spin_lock(&n->list_lock);
n->nr_partial++;
-   list_add_tail(&page->lru, &n->partial);
-   spin_unlock(&n->list_lock);
-}
-
-static void add_partial(struct kmem_cache_node *n, struct page *page)
-{
-   spin_lock(&n->list_lock);
-   n->nr_partial++;
-   list_add(&page->lru, &n->partial);
+   if (tail)
+   list_add_tail(&page->lru, &n->partial);
+   else
+   list_add(&page->lru, &n->partial);
spin_unlock(&n->list_lock);
 }
 
@@ -1314,7 +1310,7 @@ static struct page *get_partial(struct k
  *
  * On exit the slab lock will have been dropped.
  */
-static void unfreeze_slab(struct kmem_cache *s, struct page *page)
+static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
 {
struct kmem_cache_node *n = get_node(s, page_to_nid(page));
 
@@ -1322,7 +1318,7 @@ static void unfreeze_slab(struct kmem_ca
if (page->inuse) {
 
if (page->freelist)
-   add_partial(n, page);
+   add_partial(n, page, tail);
else if (SlabDebug(page) && (s->flags & SLAB_STORE_USER))
add_full(n, page);
slab_unlock(page);
@@ -1337,7 +1333,7 @@ static void unfreeze_slab(struct kmem_ca
 * partial list stays small. kmem_cache_shrink can
 * reclaim empty slabs from the partial list.
 */
-   add_partial_tail(n, page);
+   add_partial(n, page, 1);
slab_unlock(page);
} else {
slab_unlock(page);
@@ -1352,6 +1348,7 @@ static void unfreeze_slab(struct kmem_ca
 static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
struct page *page = c->page;
+   int tail = 1;
/*
 * Merge cpu freelist into freelist. Typically we get here
 * because both freelists are empty. So this is unlikely
@@ -1360,6 +1357,8 @@ static void deactivate_slab(struct kmem_
while (unlikely(c->freelist)) {
void **object;
 
+   tail = 0;   /* Hot objects. Put the slab first */
+
/* Retrieve object from cpu_freelist */
object = c->freelist;
c->freelist = c->freelist[c->offset];
@@ -1370,7 +1369,7 @@ static void deactivate_slab(struct kmem_
page->inuse--;
}
c->page = NULL;
-   unfreeze_slab(s, page);
+   unfreeze_slab(s, page, tail);
 }
 
 static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
@@ -1603,7 +1602,7 @@ checks_ok:
 * then add it.
 */
if (unlikely(!prior))
-   add_partial(get_node(s, page_to_nid(page)), page);
+   add_partial(get_node(s, page_to_nid(page)), page, 0);
 
 out_unlock:
slab_unlock(page);
@@ -2012,7 +2011,7 @@ static struct kmem_cache_node * __init e
 #endif
init_kmem_cache_node(n);
atomic_long_inc(&n->nr_slabs);
-   add_partial(n, page);
+   add_partial(n, page, 0);
 
/*
 * new_slab() disables interupts. If we do not reenable interrupts here


[RFC 01/26] SLUB: Extend slabinfo to support -D and -C options

2007-08-31 Thread Christoph Lameter

-D lists caches that support defragmentation

-C lists caches that use a ctor.

Change field names for defrag_ratio and remote_node_defrag_ratio.

Add determination of the allocation ratio for slab. The allocation ratio
is the percentage of available slots for objects in use.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 Documentation/vm/slabinfo.c |   52 --
 1 files changed, 44 insertions(+), 8 deletions(-)

diff --git a/Documentation/vm/slabinfo.c b/Documentation/vm/slabinfo.c
index 1af7bd5..1319756 100644
--- a/Documentation/vm/slabinfo.c
+++ b/Documentation/vm/slabinfo.c
@@ -30,6 +30,8 @@ struct slabinfo {
int hwcache_align, object_size, objs_per_slab;
int sanity_checks, slab_size, store_user, trace;
int order, poison, reclaim_account, red_zone;
+   int defrag, ctor;
+   int defrag_ratio, remote_node_defrag_ratio;
unsigned long partial, objects, slabs;
int numa[MAX_NODES];
int numa_partial[MAX_NODES];
@@ -56,6 +58,8 @@ int show_slab = 0;
 int skip_zero = 1;
 int show_numa = 0;
 int show_track = 0;
+int show_defrag = 0;
+int show_ctor = 0;
 int show_first_alias = 0;
 int validate = 0;
 int shrink = 0;
@@ -90,18 +94,20 @@ void fatal(const char *x, ...)
 void usage(void)
 {
printf("slabinfo 5/7/2007. (c) 2007 sgi. [EMAIL PROTECTED]"
-   "slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
+   "slabinfo [-aCDefhilnosSrtTvz1] [-d debugopts] [slab-regexp]\n"
"-a|--aliases   Show aliases\n"
+   "-C|--ctor  Show slabs with ctors\n"
"-d|--debug= Set/Clear Debug options\n"
-   "-e|--empty Show empty slabs\n"
+   "-D|--defragShow defragmentable caches\n"
+   "-e|--empty Show empty slabs\n"
"-f|--first-alias   Show first alias\n"
"-h|--help  Show usage information\n"
"-i|--inverted  Inverted list\n"
"-l|--slabs Show slabs\n"
"-n|--numa  Show NUMA information\n"
-   "-o|--ops   Show kmem_cache_ops\n"
+   "-o|--ops   Show kmem_cache_ops\n"
"-s|--shrinkShrink slabs\n"
-   "-r|--reportDetailed report on single slabs\n"
+   "-r|--reportDetailed report on single slabs\n"
"-S|--Size  Sort by size\n"
"-t|--tracking  Show alloc/free information\n"
"-T|--TotalsShow summary information\n"
@@ -281,7 +287,7 @@ int line = 0;
 void first_line(void)
 {
printf("Name   Objects ObjsizeSpace "
-   "Slabs/Part/Cpu  O/S O %%Fr %%Ef Flg\n");
+   "Slabs/Part/Cpu  O/S O %%Ra %%Ef Flg\n");
 }
 
 /*
@@ -324,7 +330,7 @@ void slab_numa(struct slabinfo *s, int mode)
return;
 
if (!line) {
-   printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
+   printf("\n%-21s: Rto ", mode ? "NUMA nodes" : "Slab");
for(node = 0; node <= highest_node; node++)
printf(" %4d", node);
printf("\n--");
@@ -333,6 +339,7 @@ void slab_numa(struct slabinfo *s, int mode)
printf("\n");
}
printf("%-21s ", mode ? "All slabs" : s->name);
+   printf("%3d ", s->remote_node_defrag_ratio);
for(node = 0; node <= highest_node; node++) {
char b[20];
 
@@ -406,6 +413,8 @@ void report(struct slabinfo *s)
printf("** Slabs are destroyed via RCU\n");
if (s->reclaim_account)
printf("** Reclaim accounting active\n");
+   if (s->defrag)
+   printf("** Defragmentation at %d%%\n", s->defrag_ratio);
 
printf("\nSizes (bytes) Slabs  Debug
Memory\n");

printf("\n");
@@ -452,6 +461,12 @@ void slabcache(struct slabinfo *s)
if (show_empty && s->slabs)
return;
 
+   if (show_defrag && !s->defrag)
+   return;
+
+   if (show_ctor && !s->ctor)
+   return;
+
store_size(size_str, slab_size(s));
sprintf(dist_str,"%lu/%lu/%d", s->slabs, s->partial, s->cpu_slabs);
 
@@ -462,6 +477,10 @@ void slabcache(struct slabinfo *s)
*p++ = '*';
if (s->cache_dma)
*p++ = 'd';
+   if (s->defrag)
+   *p++ = 'D';
+   if (s->ctor)
+   *p++ = 'C';
if (s->hwcache_align)
*p++ = 'A';
if (s->poison)
@@ -481,7 +500,7 @@ void slabcache(struct slabinfo *s)
printf("%-21s %8ld %7d %8s %14s %4d %1d %3ld %3ld %s\n",
s->name, s->objects,

[RFC 04/26] SLUB: Add defrag_ratio field and sysfs support.

2007-08-31 Thread Christoph Lameter

The defrag_ratio is used to set the threshold at which a slabcache should be
defragmented.

The allocation ratio is measured as a percentage of the available slots.
The percentage will be lower for slabs that are more fragmented.

Add a defrag ratio field and set it to 30% by default. A limit of 30% means
that fewer than 3 out of 10 available slots for objects are in use.
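
The way a new threshold is accepted through sysfs can be modelled in
userspace. This sketch follows the range check in defrag_ratio_store() from
the diff below, with strtoul() standing in for simple_strtoul();
model_store_defrag_ratio() is a hypothetical name.

```c
#include <assert.h>
#include <stdlib.h>

/* Default threshold set by this patch, in percent. */
#define MODEL_DEFAULT_DEFRAG_RATIO 30

/*
 * Parse a decimal percentage written by the user. Values below 100 replace
 * the current ratio; anything else leaves it unchanged.
 */
static int model_store_defrag_ratio(int current, const char *buf)
{
	unsigned long n = strtoul(buf, NULL, 10);

	if (n < 100)
		return (int)n;
	return current;
}
```

Out-of-range input is ignored without an error, which is how the store
routine in the diff behaves as well.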

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 include/linux/slub_def.h |7 +++
 mm/slub.c|   18 ++
 2 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 5912b58..291881d 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -52,6 +52,13 @@ struct kmem_cache {
void (*ctor)(void *, struct kmem_cache *, unsigned long);
int inuse;  /* Offset to metadata */
int align;  /* Alignment */
+   int defrag_ratio;   /*
+* objects/possible-objects limit. If we have
+* less than the specified percentage of
+* objects allocated then defrag passes
+* will start to occur during reclaim.
+*/
+
const char *name;   /* Name (only for display!) */
struct list_head list;  /* List of slab caches */
 #ifdef CONFIG_SLUB_DEBUG
diff --git a/mm/slub.c b/mm/slub.c
index e63aba5..f95a760 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2200,6 +2200,7 @@ static int kmem_cache_open(struct kmem_cache *s, gfp_t 
gfpflags,
goto error;
 
s->refcount = 1;
+   s->defrag_ratio = 30;
 #ifdef CONFIG_NUMA
s->remote_node_defrag_ratio = 100;
 #endif
@@ -3717,6 +3718,22 @@ static ssize_t free_calls_show(struct kmem_cache *s, 
char *buf)
 }
 SLAB_ATTR_RO(free_calls);
 
+static ssize_t defrag_ratio_show(struct kmem_cache *s, char *buf)
+{
+   return sprintf(buf, "%d\n", s->defrag_ratio);
+}
+
+static ssize_t defrag_ratio_store(struct kmem_cache *s,
+   const char *buf, size_t length)
+{
+   int n = simple_strtoul(buf, NULL, 10);
+
+   if (n < 100)
+   s->defrag_ratio = n;
+   return length;
+}
+SLAB_ATTR(defrag_ratio);
+
 #ifdef CONFIG_NUMA
 static ssize_t remote_node_defrag_ratio_show(struct kmem_cache *s, char *buf)
 {
@@ -3759,6 +3776,7 @@ static struct attribute * slab_attrs[] = {
_attr.attr,
_calls_attr.attr,
_calls_attr.attr,
+   &defrag_ratio_attr.attr,
 #ifdef CONFIG_ZONE_DMA
_dma_attr.attr,
 #endif
-- 
1.5.2.4


[RFC 06/26] SLUB: Add get() and kick() methods

2007-08-31 Thread Christoph Lameter

Add the two methods needed for defragmentation and add the display of the
methods via the proc interface.

Add documentation explaining the use of these methods.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 include/linux/slab.h |3 +++
 include/linux/slub_def.h |   32 
 mm/slub.c|   32 ++--
 3 files changed, 65 insertions(+), 2 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index d859354..848e9a7 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -54,6 +54,9 @@ struct kmem_cache *kmem_cache_create(const char *, size_t, 
size_t,
void (*)(void *, struct kmem_cache *, unsigned long));
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
+void kmem_cache_setup_defrag(struct kmem_cache *s,
+   void *(*get)(struct kmem_cache *, int nr, void **),
+   void (*kick)(struct kmem_cache *, int nr, void **, void *private));
 void kmem_cache_free(struct kmem_cache *, void *);
 unsigned int kmem_cache_size(struct kmem_cache *);
 const char *kmem_cache_name(struct kmem_cache *);
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 291881d..69c32a7 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -50,6 +50,38 @@ struct kmem_cache {
int objects;/* Number of objects in slab */
int refcount;   /* Refcount for slab cache destroy */
void (*ctor)(void *, struct kmem_cache *, unsigned long);
+
+   /*
+* Called with slab lock held and interrupts disabled.
+* No slab operation may be performed in get().
+*
+* Parameters passed are the number of objects to process
+* and an array of pointers to objects for which we
+* need references.
+*
+* Returns a pointer that is passed to the kick function.
+* If all objects cannot be moved then the pointer may
+* indicate that this won't work and then kick can simply
+* remove the references that were already obtained.
+*
+* The array passed to get() is also passed to kick(). The
+* function may remove objects by setting array elements to NULL.
+*/
+   void *(*get)(struct kmem_cache *, int nr, void **);
+
+   /*
+* Called with no locks held and interrupts enabled.
+* Any operation may be performed in kick().
+*
+* Parameters passed are the number of objects in the array,
+* the array of pointers to the objects and the pointer
+* returned by get().
+*
+* Success is checked by examining the number of remaining
+* objects in the slab.
+*/
+   void (*kick)(struct kmem_cache *, int nr, void **, void *private);
+
int inuse;  /* Offset to metadata */
int align;  /* Alignment */
int defrag_ratio;   /*
diff --git a/mm/slub.c b/mm/slub.c
index fc2f1e3..4a64038 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2597,6 +2597,20 @@ void kfree(const void *x)
 }
 EXPORT_SYMBOL(kfree);
 
+void kmem_cache_setup_defrag(struct kmem_cache *s,
+   void *(*get)(struct kmem_cache *, int nr, void **),
+   void (*kick)(struct kmem_cache *, int nr, void **, void *private))
+{
+   /*
+* Defragmentable slabs must have a ctor otherwise objects may be
+* in an undetermined state after they are allocated.
+*/
+   BUG_ON(!s->ctor);
+   s->get = get;
+   s->kick = kick;
+}
+EXPORT_SYMBOL(kmem_cache_setup_defrag);
+
 static unsigned long count_partial(struct kmem_cache_node *n)
 {
unsigned long flags;
@@ -2777,7 +2791,7 @@ static int slab_unmergeable(struct kmem_cache *s)
if (slub_nomerge || (s->flags & SLUB_NEVER_MERGE))
return 1;
 
-   if (s->ctor)
+   if (s->ctor || s->kick || s->get)
return 1;
 
/*
@@ -3507,7 +3521,21 @@ static ssize_t ops_show(struct kmem_cache *s, char *buf)
 
if (s->ctor) {
x += sprintf(buf + x, "ctor : ");
-   x += sprint_symbol(buf + x, (unsigned long)s->ops->ctor);
+   x += sprint_symbol(buf + x, (unsigned long)s->ctor);
+   x += sprintf(buf + x, "\n");
+   }
+
+   if (s->get) {
+   x += sprintf(buf + x, "get : ");
+   x += sprint_symbol(buf + x,
+   (unsigned long)s->get);
+   x += sprintf(buf + x, "\n");
+   }
+
+   if (s->kick) {
+   x += sprintf(buf + x, "kick : ");
+   x += sprint_symbol(buf + x,
+   (unsigned long)s->kick);
x += sprintf(buf + x, "\n");
}
return x;
-- 
1.5.2.4


[RFC 05/26] SLUB: Replace ctor field with ops field in /sys/slab/*

2007-08-31 Thread Christoph Lameter

Create an ops field in /sys/slab/*/ops to contain all the operations defined
on a slab. This will be used to display the additional operations that we
will define soon.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 mm/slub.c |   16 +---
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index f95a760..fc2f1e3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3501,16 +3501,18 @@ static ssize_t order_show(struct kmem_cache *s, char 
*buf)
 }
 SLAB_ATTR_RO(order);
 
-static ssize_t ctor_show(struct kmem_cache *s, char *buf)
+static ssize_t ops_show(struct kmem_cache *s, char *buf)
 {
-   if (s->ctor) {
-   int n = sprint_symbol(buf, (unsigned long)s->ctor);
+   int x = 0;
 
-   return n + sprintf(buf + n, "\n");
+   if (s->ctor) {
+   x += sprintf(buf + x, "ctor : ");
+   x += sprint_symbol(buf + x, (unsigned long)s->ops->ctor);
+   x += sprintf(buf + x, "\n");
}
-   return 0;
+   return x;
 }
-SLAB_ATTR_RO(ctor);
+SLAB_ATTR_RO(ops);
 
 static ssize_t aliases_show(struct kmem_cache *s, char *buf)
 {
@@ -3761,7 +3763,7 @@ static struct attribute * slab_attrs[] = {
_attr.attr,
_attr.attr,
_slabs_attr.attr,
-   &ctor_attr.attr,
+   &ops_attr.attr,
_attr.attr,
_attr.attr,
_checks_attr.attr,
-- 
1.5.2.4


[RFC 00/26] Slab defragmentation V5

2007-08-31 Thread Christoph Lameter

Slab defragmentation is mainly an issue if Linux is used as a fileserver
and large amounts of dentries, inodes and buffer heads accumulate. In some
load situations the slabs become very sparsely populated so that a lot of
memory is wasted by slabs that only contain one or a few objects. In
extreme cases the performance of a machine will become sluggish since
we are continually running reclaim. Slab defragmentation adds the
capability to recover wasted memory.

For lumpy reclaim slab defragmentation can be used to enhance the
ability to recover larger contiguous areas of memory. Lumpy reclaim currently
cannot do anything if a slab page is encountered. With slab defragmentation
that slab page can be removed and a large contiguous page freed. It may
be possible to have slab pages also part of ZONE_MOVABLE (Mel's defrag
scheme in 2.6.23) or the MOVABLE areas (antifrag patches in mm).

The trouble with this patchset is that it is difficult to validate.
Activities are only performed when special load situations are encountered.
Are there any tests that could give meaningful information about
the effectiveness of these measures? I have run various tests here
creating and deleting files and building kernels under low memory situations
to trigger these reclaim mechanisms but how does one measure their
effectiveness?

The patchset is also available via git

git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/slab.git defrag


We currently support the following types of reclaim:

1. dentry cache
2. inode cache (with a generic interface to allow easy setup of more
   filesystems than the currently supported ext2/3/4 reiserfs, XFS
   and proc)
3. buffer_head

One typical mechanism that triggers slab defragmentation on my systems
is the daily run of

updatedb

Updatedb scans all files on the system which causes a high inode and dentry
use. After updatedb is complete we need to go back to the regular use
patterns (typical on my machine: kernel compiles). Those need the memory now
for different purposes. The inodes and dentries used for updatedb will
gradually be aged by the dentry/inode reclaim algorithm which will free
up the dentries and inode entries randomly through the slabs that were
allocated. As a result the slabs will become sparsely populated. If they
become empty then they can be freed but a lot of them will remain sparsely
populated. That is where slab defrag comes in: It removes the slabs with
just a few entries reclaiming more memory for other uses.
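
A minimal userspace sketch of the kind of check a per-slab defrag ratio implies, assuming the ratio is compared against the percentage of objects in use relative to the objects the slabs could hold. The helper names slab_usage_percent() and needs_defrag() are made up for illustration and are not functions from the patchset; the default of 30 is the value the series sets in kmem_cache_open().

```c
#include <assert.h>

/* Percentage of object slots in use across a set of slabs.
 * Hypothetical helper; not part of the patchset. */
static unsigned long slab_usage_percent(unsigned long objects_in_use,
                                        unsigned long possible_objects)
{
    assert(possible_objects > 0);
    assert(objects_in_use <= possible_objects);
    return objects_in_use * 100 / possible_objects;
}

/* A cache becomes a defrag candidate once usage drops below the ratio. */
static int needs_defrag(unsigned long objects_in_use,
                        unsigned long possible_objects,
                        int defrag_ratio)
{
    return slab_usage_percent(objects_in_use, possible_objects)
           < (unsigned long)defrag_ratio;
}
```

With the default ratio of 30, a cache holding 10 objects in slabs that could hold 100 would qualify, while one holding 30 or more would not.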

V4->V5:
- Support lumpy reclaim for slabs
- Support reclaim via slab_shrink()
- Add constructors to ensure a consistent object state at all times.

V3->V4:
- Optimize scan for slabs that need defragmentation
- Add /sys/slab/*/defrag_ratio to allow setting defrag limits
  per slab.
- Add support for buffer heads.
- Describe how the cleanup after the daily updatedb can be
  improved by slab defragmentation.

V2->V3
- Support directory reclaim
- Add infrastructure to trigger defragmentation after slab shrinking if we
  have slabs with a high degree of fragmentation.

V1->V2
- Clean up control flow using a state variable. Simplify API. Back to 2
  functions that now take arrays of objects.
- Inode defrag support for a set of filesystems
- Fix up dentry defrag support to work on negative dentries by adding
  a new dentry flag that indicates that a dentry is not in the process
  of being freed or allocated.


[RFC 02/26] SLUB: Move count_partial()

2007-08-31 Thread Christoph Lameter

Move the counting function for objects in partial slabs so that it is placed
before kmem_cache_shrink. We will need to use it to establish the
fragmentation ratio of per node slab lists.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 mm/slub.c |   26 +-
 1 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 45c76fe..aad6f83 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2595,6 +2595,19 @@ void kfree(const void *x)
 }
 EXPORT_SYMBOL(kfree);
 
+static unsigned long count_partial(struct kmem_cache_node *n)
+{
+   unsigned long flags;
+   unsigned long x = 0;
+   struct page *page;
+
+   spin_lock_irqsave(&n->list_lock, flags);
+   list_for_each_entry(page, &n->partial, lru)
+   x += page->inuse;
+   spin_unlock_irqrestore(&n->list_lock, flags);
+   return x;
+}
+
 /*
  * kmem_cache_shrink removes empty slabs from the partial lists and sorts
  * the remaining slabs by the number of items in use. The slabs with the
@@ -3331,19 +3344,6 @@ static int list_locations(struct kmem_cache *s, char 
*buf,
return n;
 }
 
-static unsigned long count_partial(struct kmem_cache_node *n)
-{
-   unsigned long flags;
-   unsigned long x = 0;
-   struct page *page;
-
-   spin_lock_irqsave(&n->list_lock, flags);
-   list_for_each_entry(page, &n->partial, lru)
-   x += page->inuse;
-   spin_unlock_irqrestore(&n->list_lock, flags);
-   return x;
-}
-
 enum slab_stat_type {
SL_FULL,
SL_PARTIAL,
-- 
1.5.2.4


Re: maturity and status and attributes, oh my!

2007-08-31 Thread Robert P. J. Day

On Sat, 1 Sep 2007, Stefan Richter wrote:

> Robert P. J. Day wrote:
> ...
> > attributes would have two critical and non-negotiable properties:
> >
> > 1) they would be entirely orthogonal to one another, and
> > 2) they can be assigned at most one of a pre-defined set of values
>
> If they are fully orthogonal to one another, then they are also
> nonexclusive.  You want them to be mutually exclusive, not orthogonal.

*attributes* would be orthogonal to one another -- the values *within*
an attribute would be mutually exclusive.  maybe i phrased that badly
the first time.  so a feature could have both a maturity *and* a
status (just using my hypothetical attributes here), but no feature
can have an attribute with more than one value.

ergo, you can have a maturity of, say, deprecated, *and* a status of,
say, broken.  is that what you meant?  it's what i was getting at.
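
a rough sketch of that idea in C, purely for illustration: each attribute is its own enum (so a feature holds exactly one value per attribute), and a feature carries one field per attribute (so the attributes vary independently).  the maturity values are the ones from the progression discussed in this thread; STATUS_WORKING and all the identifier names are placeholders, not anything that exists in kconfig.

```c
#include <assert.h>
#include <stddef.h>

/* Values within one attribute are mutually exclusive. */
enum maturity {
    MATURITY_EXPERIMENTAL,
    MATURITY_NORMAL,
    MATURITY_DEPRECATED,
    MATURITY_OBSOLETE
};

/* Only "broken" is named in the thread; STATUS_WORKING is a placeholder. */
enum status {
    STATUS_WORKING,
    STATUS_BROKEN
};

/* Attributes are orthogonal: one independent field per attribute. */
struct feature {
    const char *name;
    enum maturity maturity;
    enum status status;
};

/* True when the feature carries exactly this pair of attribute values. */
static int feature_has(const struct feature *f, enum maturity m, enum status s)
{
    assert(f != NULL);
    return f->maturity == m && f->status == s;
}
```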

> >   experimental -> normal (stable) -> deprecated -> obsolete
> >
> >   it's a natural progression and, at any point, a feature cannot
> > possibly have more than one maturity value.  it would be as absurd
> > as saying that someone was a teenager *and* was a twenty-something
> > at the same time.
>
> Keep in mind though that 'experimental', in the context of Linux
> kernel features, has nothing to do with the age of a feature.

your point is well taken.  i'm just trying to draw a clear distinction
between what i see as the natural chronological progression of a
feature, and its actual level of functionality, which i'm firmly
convinced represent two *very* different pieces of information.
something which is marked as obsolete can still be known to be
functioning perfectly well, while something which is still bleeding
edge might be well known to be broken as well.

with regard to "experimental", what attribute would you imagine it
would be a possible value for, and what other possible values might
that attribute have as opposed to experimental?

rday
-- 

Robert P. J. Day
Linux Consulting, Training and Annoying Kernel Pedantry
Waterloo, Ontario, CANADA

http://crashcourse.ca


Re: [00/36] Large Blocksize Support V6

2007-08-31 Thread Christoph Lameter

Thanks to some help from Mingming Cao we now have support for extX with up to
64k blocksize. There were several issues in the jbd layer (the ext2
patch that Christoph complained about was dropped).

The patchset can be tested (assuming one has a current git tree)

git checkout -b largeblock
git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/largeblocksize.git largeblock

... Fiddle around with large blocksize functionality

git checkout master

... Back to Linus' tree.

git branch -D largeblock

... Get rid of it.


commit ed541c23b8e71a0217fd96d1b421992fdd7519df
Author: Mingming Cao <[EMAIL PROTECTED]>

JBD: blocks reservation fix for large block support

commit a1eaa33cf1600f18e961f1cf5c87820bca44df08
Author: Christoph Lameter <[EMAIL PROTECTED]>

Teach jbd/jbd2 slab management to support >8k block size.

commit 8199976e04333d66202edcaec6cef46771ed194e
Author: Christoph Lameter <[EMAIL PROTECTED]>

Do not use f_mapping in simple_prepare_write()

commit ac4d742ff3b3526d4c22d5b42e9f9fcc99881a8c
Author: Mingming Cao <[EMAIL PROTECTED]>

ext4: fix rec_len overflow with 64KB block size

commit f336a2d00e7c79500ff30fad40f6e3090319cbe7
Author: Mingming Cao <[EMAIL PROTECTED]>

ext3: fix rec_len overflow with 64KB block size

commit b0c1b74d42cce96c592f8d13b7b842a3e07b0273
Author: Christoph Lameter <[EMAIL PROTECTED]>

ext2: fix rec_len overflow with 64KB block size

commit 01229e6a2e84178a8b8467930c113a0096c069f2
Author: Mingming Cao <[EMAIL PROTECTED]>

Large Blocksize support for Ext2/3/4



Re: nmi_watchdog=2 regression in 2.6.21

2007-08-31 Thread Björn Steinbrink

On 2007.08.31 17:24:46 -0700, Daniel Walker wrote:
> On Fri, 2007-08-31 at 20:06 +0200, Björn Steinbrink wrote:
> 
> 
> > > something to do with the nmi hertz adjustment that happens after
> > > check_nmi_watchdog() ..
> > 
> > Hm hm, does the same thing (watchdog stuck after check) happen with
> > older kernels, ie. those before Stephane's changeset that made it use
> > PERFCTR1?
> 
> I noticed the frequency gets turned down after check_nmi_watchdog() is
> called.. I think it's supposed to trigger once per second, but it's more
> like it updates randomly ..

It's once per second if the cpu is 100% busy, if it's just idling and
halted, the performance counters won't be increased.
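
As a back-of-the-envelope model only (an assumption on my side, not the watchdog code): if the counter is armed to overflow after one second's worth of unhalted cycles divided by nmi_hz, and it only advances while the cpu is not halted, the number of NMIs you should see scales with how busy the cpu was. expected_nmis() is a made-up name for this sketch.

```c
#include <assert.h>

/* Simplified model: the perf counter only counts while the CPU is not
 * halted, so the NMI rate is nmi_hz scaled by the busy fraction. */
static unsigned long expected_nmis(unsigned long elapsed_sec,
                                   unsigned long nmi_hz,
                                   unsigned long busy_percent)
{
    assert(busy_percent <= 100);
    return elapsed_sec * nmi_hz * busy_percent / 100;
}
```

So with nmi_hz at 1, a fully loaded cpu would take about 60 NMIs in a minute, and a nearly idle one only a handful.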

> In older kernels it's very slow, but it's more consistent ..

With the same load on the box? Maybe some other changes caused the box
to behave differently (say, CFS), regarding eg. load distribution
amongst the cores.

> 
> Here is some output ..
> 
> morning-glory ~ # cat /proc/interrupts
>            CPU0       CPU1       CPU2       CPU3
>   0:        103          0          0          0   IO-APIC-edge      timer
>   1:          0          0          0          8   IO-APIC-edge      i8042
>   4:       2320          0          0          1   IO-APIC-edge      serial
>   8:          1          0          0          1   IO-APIC-edge      rtc
>  12:          0          0          0        113   IO-APIC-edge      i8042
>  14:       1143          0          0         10   IO-APIC-edge      ide0
>  16:        227          0          0          1   IO-APIC-fasteoi   uhci_hcd:usb2, eth0
>  18:          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb1
>  19:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
>  20:          0          0          0          1   IO-APIC-fasteoi   acpi
> NMI:        150        168        124        121
> LOC:       6188       6189       6187       6184
> ERR:          0
> MIS:          0
> morning-glory ~ # cat /proc/interrupts
>            CPU0       CPU1       CPU2       CPU3
>   0:        103          0          0          0   IO-APIC-edge      timer
>   1:          0          0          0          8   IO-APIC-edge      i8042
>   4:       2391          0          0          1   IO-APIC-edge      serial
>   8:          1          0          0          1   IO-APIC-edge      rtc
>  12:          0          0          0        113   IO-APIC-edge      i8042
>  14:       1143          0          0         10   IO-APIC-edge      ide0
>  16:        872          0          0          1   IO-APIC-fasteoi   uhci_hcd:usb2, eth0
>  18:          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb1
>  19:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
>  20:          0          0          0          1   IO-APIC-fasteoi   acpi
> NMI:        151        168        124        121
> LOC:      21443      21444      21442      21439
> ERR:          0
> MIS:          0
> dwalker2 ~ #
> 
> 
> If you look at the LOC values you'll notice a lot of time has passed,
> with only one NMI and on only one cpu ..
> 
> It's possible this is something else completely tho ..

At least from the interrupt side, that box looks pretty idle, so that's
expected I'd say.

> 
> > Maybe you could "activate" the Dprintk in write_watchdog_counter32() to
> > see which value gets written to the MSR? (I don't see any switch to
> > activate it, so maybe just s/Dprintk(/printk(KERN_WHATEVER / ?)
> 
> Here's the only lines printed,
> 
> setting INTEL_ARCH_PERFCTR0 to -0x0131385e
> setting INTEL_ARCH_PERFCTR0 to -0x0131385e
> setting INTEL_ARCH_PERFCTR0 to -0x0131385e
> setting INTEL_ARCH_PERFCTR0 to -0x0131385e

Ok, dumb I am.

The "interesting" call from p6_rearm passes NULL as desc, and thus the
printk is never called that way. But before we start flooding your logs,
could you just hog all cores with a simple cpu hog and check if that
causes the NMI counter to increase at about 1 Hz? If that doesn't work,
we can go back to that debug output.

Björn

Re: [PATCH 5/6] Use one zonelist that is filtered by nodemask

2007-08-31 Thread Christoph Lameter

Acked-by: Christoph Lameter <[EMAIL PROTECTED]>



Re: [PATCH 5/6] Use one zonelist that is filtered by nodemask

2007-08-31 Thread Christoph Lameter

Good idea. That gets rid of the GFP_THISNODE stuff that I introduced for 
the memoryless node patchset.



Re: nmi_watchdog=2 regression in 2.6.21

2007-08-31 Thread Daniel Walker

On Fri, 2007-08-31 at 20:06 +0200, Björn Steinbrink wrote:


> > something to do with the nmi hertz adjustment that happens after
> > check_nmi_watchdog() ..
> 
> Hm hm, does the same thing (watchdog stuck after check) happen with
> older kernels, ie. those before Stephane's changeset that made it use
> PERFCTR1?

I noticed the frequency gets turned down after check_nmi_watchdog() is
called.. I think it's supposed to trigger once per second, but it's more
like it updates randomly ..

In older kernels it's very slow, but it's more consistent ..

Here is some output ..

morning-glory ~ # cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:        103          0          0          0   IO-APIC-edge      timer
  1:          0          0          0          8   IO-APIC-edge      i8042
  4:       2320          0          0          1   IO-APIC-edge      serial
  8:          1          0          0          1   IO-APIC-edge      rtc
 12:          0          0          0        113   IO-APIC-edge      i8042
 14:       1143          0          0         10   IO-APIC-edge      ide0
 16:        227          0          0          1   IO-APIC-fasteoi   uhci_hcd:usb2, eth0
 18:          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb1
 19:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 20:          0          0          0          1   IO-APIC-fasteoi   acpi
NMI:        150        168        124        121
LOC:       6188       6189       6187       6184
ERR:          0
MIS:          0
morning-glory ~ # cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:        103          0          0          0   IO-APIC-edge      timer
  1:          0          0          0          8   IO-APIC-edge      i8042
  4:       2391          0          0          1   IO-APIC-edge      serial
  8:          1          0          0          1   IO-APIC-edge      rtc
 12:          0          0          0        113   IO-APIC-edge      i8042
 14:       1143          0          0         10   IO-APIC-edge      ide0
 16:        872          0          0          1   IO-APIC-fasteoi   uhci_hcd:usb2, eth0
 18:          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb1
 19:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 20:          0          0          0          1   IO-APIC-fasteoi   acpi
NMI:        151        168        124        121
LOC:      21443      21444      21442      21439
ERR:          0
MIS:          0
dwalker2 ~ #


If you look at the LOC values you'll notice a lot of time has passed,
with only one NMI and on only one cpu ..

It's possible this is something else completely tho ..

> Maybe you could "activate" the Dprintk in write_watchdog_counter32() to
> see which value gets written to the MSR? (I don't see any switch to
> activate it, so maybe just s/Dprintk(/printk(KERN_WHATEVER / ?)

Here's the only lines printed,

setting INTEL_ARCH_PERFCTR0 to -0x0131385e
setting INTEL_ARCH_PERFCTR0 to -0x0131385e
setting INTEL_ARCH_PERFCTR0 to -0x0131385e
setting INTEL_ARCH_PERFCTR0 to -0x0131385e

Daniel


Re: [PATCH 4/6] Filter based on a nodemask as well as a gfp_mask

2007-08-31 Thread Christoph Lameter

Acked-by: Christoph Lameter <[EMAIL PROTECTED]>



[RFC 2/2] JBD: blocks reservation fix for large block support

2007-08-31 Thread Mingming Cao

With large block support in the VM, the number of blocks per page can be less
than or equal to 1. This patch fixes the calculation of the number of blocks to
reserve in the journal for the case blocksize > pagesize.
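
A standalone sketch of the same clamping logic, runnable in userspace, for anyone who wants to check the arithmetic. blocks_per_page() is a made-up name that takes the two shift values as plain parameters instead of reading them from the inode and PAGE_CACHE_SHIFT.

```c
#include <assert.h>

/* Blocks that fit in one page, never less than 1.
 * page_shift and blocksize_bits are log2 of the page and block sizes. */
static int blocks_per_page(int page_shift, int blocksize_bits)
{
    int bits = page_shift - blocksize_bits;

    assert(page_shift >= 0 && blocksize_bits >= 0);
    /* Shifting by a negative count is undefined in C, so clamp to 1
     * when the block is as large as or larger than the page. */
    return bits > 0 ? 1 << bits : 1;
}
```

With 4k pages (shift 12), 1k blocks give 4 blocks per page, while 4k and 64k blocks both give 1.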



Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>

Index: my2.6/fs/jbd/journal.c
===
--- my2.6.orig/fs/jbd/journal.c 2007-08-31 13:27:16.0 -0700
+++ my2.6/fs/jbd/journal.c  2007-08-31 13:28:18.0 -0700
@@ -1611,7 +1611,12 @@ void journal_ack_err(journal_t *journal)
 
 int journal_blocks_per_page(struct inode *inode)
 {
-   return 1 << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
+   int bits = PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits;
+
+   if (bits > 0)
+   return 1 << bits;
+   else
+   return 1;
 }
 
 /*
Index: my2.6/fs/jbd2/journal.c
===
--- my2.6.orig/fs/jbd2/journal.c2007-08-31 13:32:21.0 -0700
+++ my2.6/fs/jbd2/journal.c 2007-08-31 13:32:30.0 -0700
@@ -1612,7 +1612,12 @@ void jbd2_journal_ack_err(journal_t *jou
 
 int jbd2_journal_blocks_per_page(struct inode *inode)
 {
-   return 1 << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
+   int bits = PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits;
+
+   if (bits > 0)
+   return 1 << bits;
+   else
+   return 1;
 }
 
 /*



[RFC 1/2] JBD: slab management support for large block(>8k)

2007-08-31 Thread Mingming Cao

From clameter:
Teach jbd/jbd2 slab management to support >8k block size. Without this, it
refused to mount on >8k ext3.
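
To illustrate the index mapping, here is a userspace sketch assuming 4k pages. sketch_get_order() is a stand-in written for this example (the smallest order such that a page shifted left by it covers the size), not the kernel's get_order() itself, and the other helper names are made up as well.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12    /* assumption: 4k pages */

/* Smallest order with (1UL << (SKETCH_PAGE_SHIFT + order)) >= size. */
static int sketch_get_order(unsigned long size)
{
    int order = 0;

    assert(size > 0);
    size = (size - 1) >> SKETCH_PAGE_SHIFT;
    while (size) {
        order++;
        size >>= 1;
    }
    return order;
}

/* Old macro: size >> 11, which skips index 3 (1k, 2k, 4k, 8k -> 0, 1, 2, 4). */
static int old_slab_index(unsigned long size)
{
    return (int)(size >> 11);
}

/* New macro: contiguous indexes 0..6 for 1k..64k. */
static int new_slab_index(unsigned long size)
{
    return sketch_get_order(size << (SKETCH_PAGE_SHIFT - 10));
}
```

Under that assumption the new mapping yields indexes 0 through 6 for 1k through 64k, which matches the seven names in the table with no NULL hole.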

Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>

Index: my2.6/fs/jbd/journal.c
===
--- my2.6.orig/fs/jbd/journal.c 2007-08-30 18:40:02.0 -0700
+++ my2.6/fs/jbd/journal.c  2007-08-31 11:01:18.0 -0700
@@ -1627,16 +1627,17 @@ void * __jbd_kmalloc (const char *where,
  * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
  * and allocate frozen and commit buffers from these slabs.
  *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
+ * (Note: We only seem to need the definitions here for the SLAB_DEBUG
+ * case. In non debug operations SLUB will find the corresponding kmalloc
+ * cache and create an alias. --clameter)
  */
-
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size)  (size >> 11)
+#define JBD_MAX_SLABS 7
+#define JBD_SLAB_INDEX(size)  get_order((size) << (PAGE_SHIFT - 10))
 
 static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
 static const char *jbd_slab_names[JBD_MAX_SLABS] = {
-   "jbd_1k", "jbd_2k", "jbd_4k", NULL, "jbd_8k"
+   "jbd_1k", "jbd_2k", "jbd_4k", "jbd_8k",
+   "jbd_16k", "jbd_32k", "jbd_64k"
 };
 
 static void journal_destroy_jbd_slabs(void)
Index: my2.6/fs/jbd2/journal.c
===
--- my2.6.orig/fs/jbd2/journal.c2007-08-30 18:40:02.0 -0700
+++ my2.6/fs/jbd2/journal.c 2007-08-31 11:04:37.0 -0700
@@ -1639,16 +1639,18 @@ void * __jbd2_kmalloc (const char *where
  * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
  * and allocate frozen and commit buffers from these slabs.
  *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
+ * (Note: We only seem to need the definitions here for the SLAB_DEBUG
+ * case. In non debug operations SLUB will find the corresponding kmalloc
+ * cache and create an alias. --clameter)
  */
 
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size)  (size >> 11)
+#define JBD_MAX_SLABS 7
+#define JBD_SLAB_INDEX(size)  get_order((size) << (PAGE_SHIFT - 10))
 
 static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
 static const char *jbd_slab_names[JBD_MAX_SLABS] = {
-   "jbd2_1k", "jbd2_2k", "jbd2_4k", NULL, "jbd2_8k"
+   "jbd2_1k", "jbd2_2k", "jbd2_4k", "jbd2_8k",
+"jbd2_16k", "jbd2_32k", "jbd2_64k"
 };
 
 static void jbd2_journal_destroy_jbd_slabs(void)



Re: v2.6.23-rc4-rt1 / new project URL

2007-08-31 Thread Daniel Walker

On Fri, 2007-08-31 at 22:59 +0200, Thomas Gleixner wrote:
> We're pleased to announce the release of the v2.6.23-rc4-rt1 kernel,
> which can be downloaded from a new place:
>  
>http://www.kernel.org/pub/linux/kernel/projects/rt/
>  
> The move to kernel.org is experimental for now, we'll keep it if it
> works out fine.
> 
> Changes since 2.6.23-rc2-rt2:
> 
> - update to -rc4
> - update to 2.6.23-rc4-hrt1
> 
> - UP compile fixes back merged (Kevin Hilman / Steven Rostedt)
> - various latency tracer fixes (Steven Rostedt)

I'm not sure which latency tracing fixes these are, but Steven's
get_monotonic_cycles() changes are racy .. It might be a little
premature to include them .. It at least fouls latency tracing on my
test machine.

> - simple_irq change (Kevin Hilman): needs more thought
> - RCU updates (Paul McKenney): needs proper integration
> - latency tracer changes (Daniel Walker): needs review
> - PICK_OP changes (Daniel Walker): needs review

The PICK_OP changes got reviewed by Ingo , as of,

http://marc.info/?l=linux-rt-users&m=118638506125380&w=2

They do need one small fix tho .. Below ..

Signed-off-by: Daniel Walker <[EMAIL PROTECTED]>

Index: linux-2.6.22/include/linux/spinlock.h
===
--- linux-2.6.22.orig/include/linux/spinlock.h  2007-09-01 00:08:04.0 
+
+++ linux-2.6.22/include/linux/spinlock.h   2007-09-01 00:07:48.0 
+
@@ -501,7 +501,7 @@ do {
\
 
 #define spin_trylock_irq(lock) \
__cond_lock(lock, PICK_SPIN_OP_RET(__spin_trylock_irq,  \
-   __spin_trylock_irq, lock))
+   _spin_trylock_irq, lock))
 
 #define spin_trylock_irqsave(lock, flags) \
__cond_lock(lock, PICK_SPIN_OP_RET(__spin_trylock_irqsave,  \



Re: [RFC 1/4] Large Blocksize support for Ext2/3/4

2007-08-31 Thread Mingming Cao

On Wed, 2007-08-29 at 17:47 -0700, Mingming Cao wrote:

> Just rebased to 2.6.23-rc4 and against the ext4 patch queue. Compile tested
> only.
>
> Next steps:
> Need e2fsprogs changes to be able to test this feature, as mkfs needs to be
> educated not to assume rec_len is blocksize all the time.
> Will try it with Christoph Lameter's large block patch next.
> 

Two problems were found when testing largeblock on ext3.  Patches to
follow. 

Good news is, with your changes, plus all these extN changes, I am able
to run ext2/3/4 with 64k block size, tested on x86 and ppc64 with 4k
page size. fsx test runs fine for an hour on ext3 with 16k blocksize on
x86:-)

Mingming


Re: [Bugme-new] [Bug 8957] New: Exported functions and variables

2007-08-31 Thread Satyam Sharma

On Fri, 31 Aug 2007, Matti Linnanvuori wrote:
> 
> It seems to me that kernel/module.c allows the whole kernel to use
> exported symbols during the execution of the init function if they are
> weak:
> /* Ok if weak.  */
>   if (ELF_ST_BIND(sym[i].st_info) == STB_WEAK)
>   break;
> That seems a possible way to produce the scenario of this so-called bug.

No, even that won't reproduce the bug you're talking about, and you
clearly don't know how weak symbols are supposed to work / be used :-)
simplify_symbols() -> resolve_symbol() is called to resolve /external/
symbols that the module-being-loaded references, and error out in case
no such (global, exported) symbol was currently found.

So the "sym[i]" there refers to the (as yet unresolved) symbol referenced
in the _dependent module B_, that it sees exported as a weak symbol
(probably because marked as such in some header prototype). That check is
to support usage where we still allow B to load without A being loaded,
because it's somehow ensured that B will never call that function at
runtime unless it is available ... something like:

extern void mod_a_func(void) __attribute__((weak));
static int __init mod_b_init(void)
{
	if (mod_a_func)
		mod_a_func();
	else {
		/* some remedial action */
		printk(KERN_INFO "own little mod_a_func fallback\n");
	}
	return 0;
}

Try running the same test I described in previous post with this change.

Moreover, failure to check (mod_a_func) will cause an _oops_, and *not*
the "module A's exported function being called even when module_init()
has not finished" issue that you've complained about. So things look
alright to me on this front, the bug has been rightly rejected as invalid.
And as Arjan pointed out, if you really saw such an issue, please post
some code instead, so that we can have a look.
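
The weak-reference check described above can also be tried outside the
kernel. What follows is only a userspace sketch, assuming GCC or Clang on
an ELF/Linux system; call_mod_a_or_fallback() is a made-up name, and
mod_a_func is deliberately left undefined so that the weak reference
resolves to NULL at link time:

```c
#include <stdio.h>

/* Hypothetical symbol: declared weak and never defined in this program,
 * standing in for a function that module A would export. */
extern void mod_a_func(void) __attribute__((weak));

/* Returns 1 if the real function was called, 0 if the fallback ran. */
int call_mod_a_or_fallback(void)
{
	if (mod_a_func) {
		mod_a_func();
		return 1;
	}
	/* some remedial action */
	puts("own little mod_a_func fallback");
	return 0;
}
```

Dropping the "if (mod_a_func)" test would turn the call into a jump
through a NULL pointer, which is the userspace counterpart of the oops
mentioned above.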

Thanks,

Satyam

Re: maturity and status and attributes, oh my!

2007-08-31 Thread Mitchell Erblich

"Robert P. J. Day" wrote:
> 
>   at the risk of driving everyone here totally bonkers, i'm going to
> take one last shot at explaining what i was thinking of when i first
> proposed this whole "maturity level" thing.  and, just so you know,
> the major reason i'm so cranked up about this is that i'm feeling just
> a little territorial -- i was the one who first started nagging people
> to consider this idea, so i'm a little edgy when i see folks finally
> giving it some serious thought but appearing to get ready to implement
> it entirely incorrectly in a way that's going to ruin it irreparably
> and make it utterly useless.
> 
>   this isn't just about defining a single feature called "maturity".
> it's about defining a general mechanism so that you can add entirely
> new (what i call) "attributes" to kernel features.  one attribute
> could be "maturity", which could take one of a number of possible
> values.  another could be "status", with the same restrictions.
> heck, you could define the attribute "colour", and decide that various
> kernel features could be labelled as (at most) one of "red", "green"
> and "chartreuse."  that's what i mean by an "attribute", and
> attributes would have two critical and non-negotiable properties:
<<< snip
> 
>   but i hope i've flogged this thoroughly to the point where people
> can see what i'm driving at.  once you see (as in simon's patch) how
> to add the first attribute, it's trivial to simply duplicate that code
> to add as many more as you want.
> 
> rday
> 
> --
> 
> Robert P. J. Day
> Linux Consulting, Training and Annoying Kernel Pedantry
> Waterloo, Ontario, CANADA
> 
> http://crashcourse.ca
> 
Robert Day,

If I may interpret what you are asking for and change it a bit:

Don't you think that maturity can ALSO be defined as the
   number of known bugs and their priority / severity against an
   architecture-dependent or architecture-independent item?

   Would this suffice, and wouldn't it be easier to maintain?

   Mitchell Erblich

Re: [2/4] 2.6.23-rc4: known regressions

2007-08-31 Thread Len Brown


> ACPI
> 
> Subject : 2.6.23-rc4: maxcpus still broken
> References  : http://lkml.org/lkml/2007/8/28/87
> Last known good : ?
> Submitter   : Alexey Dobriyan <[EMAIL PROTECTED]>
> Caused-By   : Len Brown <[EMAIL PROTECTED]>
>   commit 61ec7567db103d537329b0db9a887db570431ff4
> Handled-By  : Len Brown <[EMAIL PROTECTED]>
> Status  : problem is being debugged
> 

Hugh debugged and fixed this one, and it is checked into 2.6.23-rc4-git3
62e6f1e8bb7c48c02b8bdb3085c5f6365682149b 
fix maxcpus=1 oops in show_stat()


thanks,
-Len

Re: [ANNOUNCE] DeskOpt - on fly task, i/o scheduler optimization

2007-08-31 Thread Michal Piotrowski

On 01/09/2007, Chris Snook <[EMAIL PROTECTED]> wrote:
> Michal Piotrowski wrote:
> > Hi,
> >
> > Here is something that might be useful for gamers and audio/video editors
> > http://www.stardust.webpages.pl/files/tools/deskopt/
> >
> > You can easily tune CFS/CFQ scheduler params
>
> I would think that gamers and AV editors would want to be using deadline
> (or maybe even as), not cfq.  How well does it work with other I/O
> schedulers?

Actually it does not support other i/o schedulers (early stage of
development ;).

"Linux supports io scheduling priorities and classes since 2.6.13 with
the CFQ io scheduler." (ionice man page)

So we can only tune
antic_expire  est_time  read_batch_expire  read_expire
write_batch_expire  write_expire
for anticipatory and
fifo_batch  front_merges  read_expire  write_expire  writes_starved
for deadline.
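
For reference, these tunables are plain sysfs files under
/sys/block/<dev>/queue/iosched/. A rough sketch of how a tool could set
one of them follows; the helper names iosched_param_path() and
iosched_param_set() are made up for illustration and are not taken from
DeskOpt:

```c
#include <stdio.h>
#include <string.h>

/* Build the sysfs path of one I/O scheduler tunable, for example
 * /sys/block/sda/queue/iosched/read_expire.
 * Returns 0 on success, -1 on a bad name or a too-small buffer. */
int iosched_param_path(char *buf, size_t len, const char *dev, const char *param)
{
	int n;

	/* Reject names that would escape the intended directory. */
	if (strchr(dev, '/') || strchr(param, '/'))
		return -1;

	n = snprintf(buf, len, "/sys/block/%s/queue/iosched/%s", dev, param);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write a decimal value to the tunable. This needs root, and the file
 * only exists while the matching scheduler is active on the device.
 * Returns 0 on success, -1 on any error. */
int iosched_param_set(const char *dev, const char *param, long value)
{
	char path[256];
	FILE *f;
	int ok;

	if (iosched_param_path(path, sizeof(path), dev, param) != 0)
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%ld\n", value) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A call such as iosched_param_set("sda", "read_expire", 250) would then
adjust the deadline read expiry, assuming deadline is the active
scheduler on sda.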

I'll have a look at it.

>
> -- Chris
>

Regards,
Michal

-- 
LOG
http://www.stardust.webpages.pl/log/

Re: sata_via: write errors on PATA drive connected to VT6421

2007-08-31 Thread Alan Cox

On Sat, 1 Sep 2007 00:55:21 +0200
Ondrej Zary <[EMAIL PROTECTED]> wrote:

> Hello,
> I think that I've found and fixed the problem. There is a copy/paste bug in 
> vt6421_set_dma_mode() function which causes wrong values to be written to 
> PATA_UDMA_TIMING register.
> 
> 
> This patch fixes a copy/paste bug that breaks DMA modes on VT6421 PATA port.
> 
> Signed-off-by: Ondrej Zary <[EMAIL PROTECTED]>
> 
> --- linux-2.6.22.3-orig/drivers/ata/sata_via.c	2007-09-01 00:40:22.0 +0200
> +++ linux-2.6.22.3-router2/drivers/ata/sata_via.c	2007-09-01 00:10:40.0 +0200
> @@ -370,7 +370,7 @@
>  {
>  	struct pci_dev *pdev = to_pci_dev(ap->host->dev);
>  	static const u8 udma_bits[] = { 0xEE, 0xE8, 0xE6, 0xE4, 0xE2, 0xE1, 0xE0, 0xE0 };
> -	pci_write_config_byte(pdev, PATA_UDMA_TIMING, udma_bits[adev->pio_mode - XFER_UDMA_0]);
> +	pci_write_config_byte(pdev, PATA_UDMA_TIMING, udma_bits[adev->dma_mode - XFER_UDMA_0]);
>  }
>  
>  static const unsigned int svia_bar_sizes[] = {

Acked-by: Alan Cox <[EMAIL PROTECTED]>


Re: maturity and status and attributes, oh my!

2007-08-31 Thread Dave Jones

On Fri, Aug 31, 2007 at 05:38:34PM -0400, Robert P. J. Day wrote:

 >   it may be that some people had a different understanding of what was
 > meant by "maturity" than i did.  what *i* meant by that attribute is
 > a feature's current position in the normal software life cycle, and
 > that would be one of:
 > 
 >   experimental -> normal (stable) -> deprecated -> obsolete

Life isn't so black and white.
 * We have stuff go into the tree that isn't experimental on a regular
   basis, due to proving outside of Linus' tree, be that in -mm, or
   a distro tree, or anywhere else.
 * We've had code become undeprecated a few times.
 * Likewise stuff has sometimes got so fucked up that it's become
   experimental again (see the longhaul driver for a great example
   of a catastrophe in motion).

 >   it's a natural progression and, at any point, a feature cannot
 > possibly have more than one maturity value.  it would be as absurd as
 > saying that someone was a teenager *and* was a twenty-something at the
 > same time.  not possible.

Again, not so black and white.
It's feasible that something can be experimental on one architecture,
stable on another (typically x86), or even deprecated on x86, but still
supported on other architectures.
It's not just a per arch thing either, in some cases, we've had
differing levels of maturity based upon other hardware constraints,
or even varying versions of system software.

 > another attribute can then be what i was calling "status" but could
 > also be called "quality".   *that* is where you could categorize a
 > feature as one of FLAKY, BROKEN and so on.  that's an entirely
 > independent categorization from maturity, which means you could have
 > features that were both experimental and flaky, or deprecated and
 > broken, or what have you.  and those settings would be done with
 > separate Kconfig directives:

Kconfig is an awful mechanism for tracking whether something is stable or not.
Take for example the skge net driver.  It's "perfect" on some systems,
and utterly busted on others. How would you express that in Kconfig ?

Dave

-- 
http://www.codemonkey.org.uk

Re: sata_via: write errors on PATA drive connected to VT6421

2007-08-31 Thread Ondrej Zary

Hello,
I think that I've found and fixed the problem. There is a copy/paste bug in 
vt6421_set_dma_mode() function which causes wrong values to be written to 
PATA_UDMA_TIMING register.


This patch fixes a copy/paste bug that breaks DMA modes on VT6421 PATA port.

Signed-off-by: Ondrej Zary <[EMAIL PROTECTED]>

--- linux-2.6.22.3-orig/drivers/ata/sata_via.c	2007-09-01 00:40:22.0 +0200
+++ linux-2.6.22.3-router2/drivers/ata/sata_via.c	2007-09-01 00:10:40.0 +0200
@@ -370,7 +370,7 @@
 {
 	struct pci_dev *pdev = to_pci_dev(ap->host->dev);
 	static const u8 udma_bits[] = { 0xEE, 0xE8, 0xE6, 0xE4, 0xE2, 0xE1, 0xE0, 0xE0 };
-	pci_write_config_byte(pdev, PATA_UDMA_TIMING, udma_bits[adev->pio_mode - XFER_UDMA_0]);
+	pci_write_config_byte(pdev, PATA_UDMA_TIMING, udma_bits[adev->dma_mode - XFER_UDMA_0]);
 }
 
 static const unsigned int svia_bar_sizes[] = {
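
To see why indexing with pio_mode is harmful, here is a small standalone
sketch of the index arithmetic. The XFER_* values are the ones from
include/linux/ata.h; udma_index() and udma_timing() are illustrative
helpers that do not exist in sata_via.c, and the bounds check is an
addition for the demonstration only:

```c
#include <assert.h>
#include <stddef.h>

/* Transfer mode codes, as defined in include/linux/ata.h. */
#define XFER_PIO_4   0x0C
#define XFER_UDMA_0  0x40
#define XFER_UDMA_5  0x45

static const unsigned char udma_bits[] = {
	0xEE, 0xE8, 0xE6, 0xE4, 0xE2, 0xE1, 0xE0, 0xE0
};

/* Index into udma_bits[] for a mode; negative for any non-UDMA mode. */
int udma_index(int xfer_mode)
{
	return xfer_mode - XFER_UDMA_0;
}

/* Timing byte for a UDMA mode, or -1 if the mode is outside the table. */
int udma_timing(int xfer_mode)
{
	int idx = udma_index(xfer_mode);

	/* The table is expected to cover UDMA0 through UDMA7. */
	assert(sizeof(udma_bits) / sizeof(udma_bits[0]) == 8);

	if (idx < 0 || (size_t)idx >= sizeof(udma_bits) / sizeof(udma_bits[0]))
		return -1;
	return udma_bits[idx];
}
```

With a PIO mode such as XFER_PIO_4 the index is 0x0C - 0x40 = -52, so the
unpatched code reads well before the start of udma_bits[] and writes
whatever byte it finds there to PATA_UDMA_TIMING.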


-- 
Ondrej Zary

[PATCH 1/7] blk_end_request: add new request completion interface

2007-08-31 Thread Kiyoshi Ueda

This patch adds 2 new interfaces for request completion:
  o blk_end_request()   : called without queue lock
  o __blk_end_request() : called with queue lock held

Some device drivers call the generic functions below between
end_that_request_{first/chunk} and end_that_request_last():
  o add_disk_randomness()
  o blk_queue_end_tag()
  o blkdev_dequeue_request()
blk_end_request() now calls these as a part of generic request
completion, so every device driver ends up calling them.
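
As a rough illustration of the two entry points, here is a userspace
model of the control flow. It is only a sketch under stated assumptions:
every model_* name is hypothetical, the queue lock is reduced to a flag
so that lock discipline can be asserted, and the add_disk_randomness()
and tag handling steps are left out:

```c
#include <assert.h>

/* Hypothetical stand-ins for struct request_queue and struct request. */
struct model_queue {
	int locked;            /* 1 while the "queue lock" is held */
	int queued;            /* number of requests still on the queue */
};

struct model_request {
	struct model_queue *q;
	int bytes_left;        /* bytes not yet completed */
	int on_queue;          /* 1 while linked on the queue */
	int done;              /* set once the request is finished */
};

static void model_lock(struct model_queue *q)
{
	assert(!q->locked);    /* taking a held lock would deadlock */
	q->locked = 1;
}

static void model_unlock(struct model_queue *q)
{
	assert(q->locked);
	q->locked = 0;
}

/* Shared core: needlock says whether the lock must be taken here. */
static int model_end_request_core(struct model_request *rq, int nr_bytes,
				  int needlock)
{
	struct model_queue *q = rq->q;

	/* Partial completion, like __end_that_request_first() returning 1. */
	if (nr_bytes < rq->bytes_left) {
		rq->bytes_left -= nr_bytes;
		return 1;
	}
	rq->bytes_left = 0;

	if (needlock)
		model_lock(q);
	assert(q->locked);     /* dequeue and final completion need the lock */

	if (rq->on_queue) {
		rq->on_queue = 0;
		q->queued--;
	}
	rq->done = 1;

	if (needlock)
		model_unlock(q);
	return 0;
}

/* Counterpart of blk_end_request(): caller does not hold the lock. */
int model_blk_end_request(struct model_request *rq, int nr_bytes)
{
	return model_end_request_core(rq, nr_bytes, 1);
}

/* Counterpart of __blk_end_request(): caller already holds the lock. */
int model_blk_end_request_locked(struct model_request *rq, int nr_bytes)
{
	return model_end_request_core(rq, nr_bytes, 0);
}
```

A driver completion path would call the first variant from a context that
does not hold the queue lock, and the second one from code that already
runs under it; both return 1 while the request still has pending bytes.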

Signed-off-by: Kiyoshi Ueda <[EMAIL PROTECTED]>
Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]>
---
 block/ll_rw_blk.c      |   82 +
 include/linux/blkdev.h |    2 +
 2 files changed, 84 insertions(+)

diff -rupN 2.6.23-rc3-mm1/block/ll_rw_blk.c 01-blkendreq-interface/block/ll_rw_blk.c
--- 2.6.23-rc3-mm1/block/ll_rw_blk.c	2007-08-22 18:54:03.0 -0400
+++ 01-blkendreq-interface/block/ll_rw_blk.c	2007-08-23 17:19:20.0 -0400
@@ -3669,6 +3669,88 @@ void end_request(struct request *req, in
 
 EXPORT_SYMBOL(end_request);
 
+/**
+ * blk_end_request - Generic end_io function to complete a request.
+ * @rq:   the request being processed
+ * @uptodate: 1 for success, 0 for I/O error, < 0 for specific error
+ * @nr_bytes: number of bytes to complete
+ * @needlock: 1 if the queue lock must be taken here,
+ *            0 if the caller already holds the queue lock.
+ *
+ * Description:
+ * Ends I/O on a number of bytes attached to @rq.
+ * If @rq has leftover, sets it up for the next range of segments.
+ *
+ * Return:
+ * 0 - we are done with this request
+ * 1 - this request is not freed yet, it still has pending buffers.
+ **/
+static int blk_end_request(struct request *rq, int uptodate, int nr_bytes,
+  int needlock)
+{
+   struct request_queue *q = rq->q;
+   unsigned long flags = 0UL;
+
+   if (blk_fs_request(rq) || blk_pc_request(rq)) {
+   if (__end_that_request_first(rq, uptodate, nr_bytes))
+   return 1;
+   }
+
+   /*
+* No need to check the argument here because it is done
+* in add_disk_randomness().
+*/
+   add_disk_randomness(rq->rq_disk);
+
+   if (needlock)
+   spin_lock_irqsave(q->queue_lock, flags);
+
+   if (blk_rq_tagged(rq))
+   blk_queue_end_tag(q, rq);
+
+   if (!list_empty(&rq->queuelist))
+   blkdev_dequeue_request(rq);
+
+   end_that_request_last(rq, uptodate);
+
+   if (needlock)
+   spin_unlock_irqrestore(q->queue_lock, flags);
+
+   return 0;
+}
+
+/**
+ * blk_end_request - Helper function for drivers to complete the request.
+ * @rq:   the request being processed
+ * @uptodate: 1 for success, 0 for I/O error, < 0 for specific error
+ * @nr_bytes: number of bytes to complete
+ *
+ * Description:
+ * Ends I/O on a number of bytes attached to @rq.
+ * If @rq has leftover, sets it up for the next range of segments.
+ *
+ * Return:
+ * 0 - we are done with this request
+ * 1 - still buffers pending for this request
+ **/
+int blk_end_request(struct request *rq, int uptodate, int nr_bytes)
+{
+   return blk_end_request(rq, uptodate, nr_bytes, 1);
+}
+EXPORT_SYMBOL_GPL(blk_end_request);
+
+/**
+ * __blk_end_request - Helper function for drivers to complete the request.
+ *
+ * Description:
+ * Must be called with queue lock held unlike blk_end_request().
+ **/
+int __blk_end_request(struct request *rq, int uptodate, int nr_bytes)
+{
+   return blk_end_request(rq, uptodate, nr_bytes, 0);
+}
+EXPORT_SYMBOL_GPL(__blk_end_request);
+
 void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
 struct bio *bio)
 {
diff -rupN 2.6.23-rc3-mm1/include/linux/blkdev.h 
01-blkendreq-interface/include/linux/blkdev.h
--- 2.6.23-rc3-mm1/include/linux/blkdev.h   2007-08-13 00:25:24.0 
-0400
+++ 01-blkendreq-interface/include/linux/blkdev.h   2007-08-23 
17:22:50.0 -0400
@@ -728,6 +728,8 @@ static inline void blk_run_address_space
  * for parts of the original function. This prevents
  * code duplication in drivers.
  */
+extern int blk_end_request(struct request *rq, int uptodate, int nr_bytes);
+extern int __blk_end_request(struct request *rq, int uptodate, int nr_bytes);
 extern int end_that_request_first(struct request *, int, int);
 extern int end_that_request_chunk(struct request *, int, int);
 extern void end_that_request_last(struct request *, int);

[APPENDIX PATCH 5/5] blk_end_request: userspace multipath-tools for request-based dm

2007-08-31 Thread Kiyoshi Ueda

This patch changes multipath-tools to use request-based dm-multipath.
This patch should be applied on top of 8/28/2007 git multipath-tools.

Request-based dm itself is still under development and not ready
for inclusion.

Signed-off-by: Kiyoshi Ueda <[EMAIL PROTECTED]>
Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]>
---
 devmapper.c |2 ++
 1 files changed, 2 insertions(+)

diff -rupN git-multipath-tools-20070828/libmultipath/devmapper.c rqdm-multipath-tools/libmultipath/devmapper.c
--- git-multipath-tools-20070828/libmultipath/devmapper.c	2007-08-28 17:05:25.0 -0400
+++ rqdm-multipath-tools/libmultipath/devmapper.c	2007-08-28 17:50:47.0 -0400
@@ -174,6 +174,7 @@ dm_simplecmd (int task, const char *name
 #ifdef LIBDM_API_FLUSH
dm_task_no_flush(dmt);  /* for DM_DEVICE_SUSPEND/RESUME */
 #endif
+   dm_task_request_base(dmt);
 
r = dm_task_run (dmt);
 
@@ -211,6 +212,7 @@ dm_addmap (int task, const char *name, c
}
 
dm_task_no_open_count(dmt);
+   dm_task_request_base(dmt);
 
r = dm_task_run (dmt);
 

[APPENDIX PATCH 3/5] blk_end_request: dynamic load balancing for request-based dm-multipath

2007-08-31 Thread Kiyoshi Ueda

This patch adds dynamic load balancer to request-based dm-multipath.

Request-based dm itself is still under development and not ready
for inclusion.

Signed-off-by: Kiyoshi Ueda <[EMAIL PROTECTED]>
Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]>
---
 drivers/md/Makefile           |    3
 drivers/md/dm-adaptive.c      |  369 ++
 drivers/md/dm-load-balance.c  |  342 ++
 drivers/md/dm-mpath.c         |   32 ++-
 drivers/md/dm-path-selector.h |    7
 drivers/md/dm-round-robin.c   |    2
 drivers/md/dm.c               |    4
 include/linux/device-mapper.h |    3
 8 files changed, 742 insertions(+), 20 deletions(-)

diff -rupN a2-rqdm-mpath/drivers/md/dm-adaptive.c a3-rqdm-mpath-dlb/drivers/md/dm-adaptive.c
--- a2-rqdm-mpath/drivers/md/dm-adaptive.c	1969-12-31 19:00:00.0 -0500
+++ a3-rqdm-mpath-dlb/drivers/md/dm-adaptive.c	2007-08-28 16:41:34.0 -0400
@@ -0,0 +1,369 @@
+/*
+ * Copyright (C) 2007 NEC Corporation.  All Rights Reserved.
+ * dm-adaptive.c
+ *
+ * Module Author: Kiyoshi Ueda
+ *
+ * This file is released under the GPL.
+ *
+ * Adaptive path selector.
+ */
+
+#include "dm.h"
+#include "dm-path-selector.h"
+
+#define DM_MSG_PREFIX  "multipath adaptive"
+#define AD_MIN_IO  100
+#define AD_VERSION "0.2.0"
+
+struct selector {
+// spinlock_t lock;
+   struct list_head valid_paths;
+   struct list_head failed_paths;
+};
+
+struct path_info {
+   struct list_head list;
+   struct dm_path *path;
+   unsigned int repeat_count;
+
+   atomic_t in_flight; /* Total size of in-flight I/Os */
+   size_t perf;/* Recent performance of the path */
+   sector_t last_sectors;  /* Total sectors of the last disk_stat_read */
+   size_t last_io_ticks;   /* io_ticks of the last disk_stat_read */
+
+   size_t rqsz[2]; /* Size of the last request.  For Debug */
+};
+
+static void free_paths(struct list_head *paths)
+{
+   struct path_info *pi, *next;
+
+   list_for_each_entry_safe(pi, next, paths, list) {
+   list_del(&pi->list);
+   pi->path->pscontext = NULL;
+   kfree(pi);
+   }
+}
+
+static struct selector *alloc_selector(void)
+{
+   struct selector *s = kmalloc(sizeof(*s), GFP_KERNEL);
+
+   if (s) {
+   memset(s, 0, sizeof(*s));
+   INIT_LIST_HEAD(&s->valid_paths);
+   INIT_LIST_HEAD(&s->failed_paths);
+// s->lock = SPIN_LOCK_UNLOCKED;
+   }
+
+   return s;
+}
+
+static int ad_create(struct path_selector *ps, unsigned argc, char **argv)
+{
+   struct selector *s;
+
+   s = alloc_selector();
+   if (!s)
+   return -ENOMEM;
+
+   ps->context = s;
+   return 0;
+}
+
+static void ad_destroy(struct path_selector *ps)
+{
+   struct selector *s = (struct selector *) ps->context;
+
+   free_paths(&s->valid_paths);
+   free_paths(&s->failed_paths);
+   kfree(s);
+   ps->context = NULL;
+}
+
+static int ad_status(struct path_selector *ps, struct dm_path *path,
+   status_type_t type, char *result, unsigned int maxlen)
+{
+   struct path_info *pi;
+   int sz = 0;
+
+   if (!path)
+   DMEMIT("0 ");
+   else {
+   pi = (struct path_info *) path->pscontext;
+   if (!pi)
+   BUG();
+
+   switch (type) {
+   case STATUSTYPE_INFO:
+   DMEMIT("if:%08lu pf:%06lu rsR:%06lu rsW:%06lu ",
+   (unsigned long) atomic_read(&pi->in_flight),
+   pi->perf,
+   pi->rqsz[READ], pi->rqsz[WRITE]);
+   break;
+   case STATUSTYPE_TABLE:
+   DMEMIT("%u ", pi->repeat_count);
+   break;
+   }
+   }
+
+   return sz;
+}
+
+/*
+ * Note: Assuming IRQs are enabled when this function gets called.
+ */
+static int ad_add_path(struct path_selector *ps, struct dm_path *path,
+   int argc, char **argv, char **error)
+{
+   struct selector *s = (struct selector *) ps->context;
+   struct path_info *pi;
+   unsigned int repeat_count = AD_MIN_IO;
+   struct gendisk *disk = path->dev->bdev->bd_disk;
+
+   if (argc > 1) {
+   *error = "adaptive ps: incorrect number of arguments";
+   return -EINVAL;
+   }
+
+   /* First path argument is number of I/Os before switching path. */
+   if ((argc == 1) && (sscanf(argv[0], "%u", &repeat_count) != 1)) {
+   *error = "adaptive ps: invalid repeat count";
+   return -EINVAL;
+   }
+
+   /* allocate the path */
+   pi = kmalloc(sizeof(*pi), GFP_KERNEL);
+   if (!pi) {
+   *error = "adaptive ps: Error allocating path context";
+   return -ENOMEM;
+   }
+
+   pi->path =

[APPENDIX PATCH 4/5] blk_end_request: userspace device-mapper for request-based dm

2007-08-31 Thread Kiyoshi Ueda

This patch adds an option to the device-mapper userspace tool
for turning on request-based dm.

To turn on request-based dm, the following steps should work:
1. # dmsetup create --rqbase mpath0
2. # echo  | dmsetup load mpath0
3. # dmsetup resume mpath0
Note: Using bio-based targets with a request-based dm device causes
      a kernel panic, and vice versa.  (Fixing this is a TODO.)
      The patch-set converts only the multipath target to request-based,
      so please don't use other targets with a request-based dm device,
      or the patched multipath target with a bio-based dm device.
If you use multipath-tools (another patch), multipath-tools should
take care of all of this.
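
For orientation, here is a sketch of how the --rqbase switch might
travel from the task object to the ioctl flags word. The two flag values
are the ones in dm-ioctl.h as patched below; struct task_model and both
task_model_* helpers are hypothetical, and since the _flatten() hunk is
cut off in this archive, the flag folding shown here is an assumption:

```c
#include <assert.h>
#include <stdint.h>

#define DM_NOFLUSH_FLAG       (1u << 11)  /* existing flag in dm-ioctl.h */
#define DM_REQUEST_BASE_FLAG  (1u << 12)  /* added by this patch */

/* Hypothetical, trimmed-down stand-in for libdevmapper's struct dm_task. */
struct task_model {
	int no_flush;
	int request_base;
};

/* Mirrors dm_task_request_base(): mark the task, return 1 for success. */
int task_model_request_base(struct task_model *t)
{
	assert(t);
	t->request_base = 1;
	return 1;
}

/* Fold the per-task switches into the flags word passed to the kernel. */
uint32_t task_model_flags(const struct task_model *t)
{
	uint32_t flags = 0;

	assert(t);
	if (t->no_flush)
		flags |= DM_NOFLUSH_FLAG;
	if (t->request_base)
		flags |= DM_REQUEST_BASE_FLAG;
	return flags;
}
```

The point of the sketch is only that the new switch is one more bit in
the same flags word, independent of the existing no-flush bit.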

This patch should be applied on top of CVS device-mapper-1.02.23
of 8/28/2007.

Request-based dm itself is still under development and not ready
for inclusion.

Signed-off-by: Kiyoshi Ueda <[EMAIL PROTECTED]>
Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]>
---
 dmsetup/dmsetup.c |9 -
 kernel/ioctl/dm-ioctl.h   |5 +
 lib/.exported_symbols |1 +
 lib/ioctl/libdm-iface.c   |   13 -
 lib/ioctl/libdm-targets.h |1 +
 lib/libdevmapper.h|1 +
 6 files changed, 28 insertions(+), 2 deletions(-)

diff -rupN cvs-device-mapper-20070828/dmsetup/dmsetup.c rqdm-device-mapper/dmsetup/dmsetup.c
--- cvs-device-mapper-20070828/dmsetup/dmsetup.c	2007-08-21 12:26:06.0 -0400
+++ rqdm-device-mapper/dmsetup/dmsetup.c	2007-08-28 17:47:04.0 -0400
@@ -119,6 +119,7 @@ enum {
NOOPENCOUNT_ARG,
NOTABLE_ARG,
OPTIONS_ARG,
+   RQBASE_ARG,
SEPARATOR_ARG,
SHOWKEYS_ARG,
SORT_ARG,
@@ -493,6 +493,9 @@ static int _create(int argc, char **argv, 
if (_switches[NOOPENCOUNT_ARG] && !dm_task_no_open_count(dmt))
goto out;
 
+   if (_switches[RQBASE_ARG] && !dm_task_request_base(dmt))
+   goto out;
+
if (!dm_task_run(dmt))
goto out;
 
@@ -1978,7 +1982,7 @@ static struct command _commands[] = {
{"help", "[-c|-C|--columns]", 0, 0, _help},
{"create", " [-j|--major  -m|--minor ]\n"
  "\t  [-U|--uid ] [-G|--gid ] [-M|--mode 
]\n"
- "\t  [-u|uuid ]\n"
+ "\t  [-u|uuid ] [--rqbase]\n"
  "\t  [--notable | --table  | ]",
 1, 2, _create},
{"remove", "[-f|--force] ", 0, 1, _remove},
@@ -2366,6 +2370,7 @@ static int _process_switches(int *argc, 
{"noopencount", 0, , NOOPENCOUNT_ARG},
{"notable", 0, , NOTABLE_ARG},
{"options", 1, , OPTIONS_ARG},
+   {"rqbase", 0, , RQBASE_ARG},
{"separator", 1, , SEPARATOR_ARG},
{"showkeys", 0, , SHOWKEYS_ARG},
{"sort", 1, , SORT_ARG},
@@ -2498,6 +2503,8 @@ static int _process_switches(int *argc, 
_switches[NOLOCKFS_ARG]++;
if ((ind == NOOPENCOUNT_ARG))
_switches[NOOPENCOUNT_ARG]++;
+   if ((ind == RQBASE_ARG))
+   _switches[RQBASE_ARG]++;
if ((ind == SHOWKEYS_ARG))
_switches[SHOWKEYS_ARG]++;
if ((ind == TABLE_ARG)) {
diff -rupN cvs-device-mapper-20070828/kernel/ioctl/dm-ioctl.h rqdm-device-mapper/kernel/ioctl/dm-ioctl.h
--- cvs-device-mapper-20070828/kernel/ioctl/dm-ioctl.h	2006-10-12 11:42:24.0 -0400
+++ rqdm-device-mapper/kernel/ioctl/dm-ioctl.h	2007-08-28 17:47:04.0 -0400
@@ -330,4 +330,9 @@ typedef char ioctl_struct[308];
  */
 #define DM_NOFLUSH_FLAG		(1 << 11) /* In */
 
+/*
+ * Set this to create request based device-mapper device.
+ */
+#define DM_REQUEST_BASE_FLAG   (1 << 12) /* In */
+
 #endif /* _LINUX_DM_IOCTL_H */
diff -rupN cvs-device-mapper-20070828/lib/.exported_symbols rqdm-device-mapper/lib/.exported_symbols
--- cvs-device-mapper-20070828/lib/.exported_symbols	2007-07-28 06:48:36.0 -0400
+++ rqdm-device-mapper/lib/.exported_symbols	2007-08-28 17:47:04.0 -0400
@@ -32,6 +32,7 @@ dm_task_suppress_identical_reload
 dm_task_add_target
 dm_task_no_flush
 dm_task_no_open_count
+dm_task_request_base
 dm_task_skip_lockfs
 dm_task_update_nodes
 dm_task_run
diff -rupN cvs-device-mapper-20070828/lib/ioctl/libdm-iface.c rqdm-device-mapper/lib/ioctl/libdm-iface.c
--- cvs-device-mapper-20070828/lib/ioctl/libdm-iface.c	2007-08-21 12:26:07.0 -0400
+++ rqdm-device-mapper/lib/ioctl/libdm-iface.c	2007-08-28 17:47:04.0 -0400
@@ -1038,6 +1038,13 @@ int dm_task_no_open_count(struct dm_task
return 1;
 }
 
+int dm_task_request_base(struct dm_task *dmt)
+{
+   dmt->request_base = 1;
+
+   return 1;
+}
+
 int dm_task_skip_lockfs(struct dm_task *dmt)
 {
dmt->skip_lockfs = 1;
@@ -1281,6 +1288,8 @@ static struct dm_ioctl *_flatten(struct

[APPENDIX PATCH 2/5] blk_end_request: request-based dm-multipath

2007-08-31 Thread Kiyoshi Ueda

This patch converts dm-multipath target driver to request-based.

Request-based dm itself is still under development and not ready
for inclusion.

Signed-off-by: Kiyoshi Ueda <[EMAIL PROTECTED]>
Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]>
---
 dm-mpath.c     |  227 +
 dm-rq-record.h |   36 +
 2 files changed, 203 insertions(+), 60 deletions(-)

diff -rupN a1-rqdm-core/drivers/md/dm-mpath.c a2-rqdm-mpath/drivers/md/dm-mpath.c
--- a1-rqdm-core/drivers/md/dm-mpath.c  2007-08-13 00:25:24.0 -0400
+++ a2-rqdm-mpath/drivers/md/dm-mpath.c 2007-08-29 14:07:39.0 -0400
@@ -8,8 +8,7 @@
 #include "dm.h"
 #include "dm-path-selector.h"
 #include "dm-hw-handler.h"
-#include "dm-bio-list.h"
-#include "dm-bio-record.h"
+#include "dm-rq-record.h"
 
 #include 
 #include 
@@ -77,7 +76,7 @@ struct multipath {
unsigned saved_queue_if_no_path;/* Saved state during suspension */
 
struct work_struct process_queued_ios;
-   struct bio_list queued_ios;
+   struct list_head queued_ios;
unsigned queue_size;
 
struct work_struct trigger_event;
@@ -86,22 +85,22 @@ struct multipath {
 * We must use a mempool of dm_mpath_io structs so that we
 * can resubmit bios on error.
 */
-   mempool_t *mpio_pool;
+   mempool_t *mpio_pool; //REMOVE ME
 };
 
 /*
  * Context information attached to each bio we process.
  */
-struct dm_mpath_io {
+struct dm_mpath_io { //REMOVE ME
struct pgpath *pgpath;
-   struct dm_bio_details details;
+   struct dm_rq_details details;
 };
 
 typedef int (*action_fn) (struct pgpath *pgpath);
 
 #define MIN_IOS 256/* Mempool size */
 
-static struct kmem_cache *_mpio_cache;
+static struct kmem_cache *_mpio_cache; //REMOVE ME
 
 struct workqueue_struct *kmultipathd;
 static void process_queued_ios(struct work_struct *work);
@@ -171,6 +170,7 @@ static struct multipath *alloc_multipath
m = kzalloc(sizeof(*m), GFP_KERNEL);
if (m) {
INIT_LIST_HEAD(&m->priority_groups);
+   INIT_LIST_HEAD(&m->queued_ios);
spin_lock_init(&m->lock);
m->queue_io = 1;
INIT_WORK(&m->process_queued_ios, process_queued_ios);
@@ -299,7 +299,7 @@ static int __must_push_back(struct multi
dm_noflush_suspending(m->ti));
 }
 
-static int map_io(struct multipath *m, struct bio *bio,
+static int map_io(struct multipath *m, struct request *clone,
  struct dm_mpath_io *mpio, unsigned was_queued)
 {
int r = DM_MAPIO_REMAPPED;
@@ -321,19 +321,27 @@ static int map_io(struct multipath *m, s
if ((pgpath && m->queue_io) ||
(!pgpath && m->queue_if_no_path)) {
/* Queue for the daemon to resubmit */
-   bio_list_add(&m->queued_ios, bio);
+   list_add_tail(&clone->queuelist, &m->queued_ios);
m->queue_size++;
if ((m->pg_init_required && !m->pg_init_in_progress) ||
!m->queue_io)
queue_work(kmultipathd, &m->process_queued_ios);
pgpath = NULL;
+   clone->q = NULL;
+   clone->rq_disk = NULL;
r = DM_MAPIO_SUBMITTED;
-   } else if (pgpath)
-   bio->bi_bdev = pgpath->path.dev->bdev;
-   else if (__must_push_back(m))
+   } else if (pgpath) {
+   clone->q = bdev_get_queue(pgpath->path.dev->bdev);
+   clone->rq_disk = pgpath->path.dev->bdev->bd_disk;
+   } else if (__must_push_back(m)) {
+   clone->q = NULL;
+   clone->rq_disk = NULL;
r = DM_MAPIO_REQUEUE;
-   else
+   } else {
+   clone->q = NULL;
+   clone->rq_disk = NULL;
r = -EIO;   /* Failed */
+   }
 
mpio->pgpath = pgpath;
 
@@ -373,30 +381,28 @@ static void dispatch_queued_ios(struct m
 {
int r;
unsigned long flags;
-   struct bio *bio = NULL, *next;
struct dm_mpath_io *mpio;
union map_info *info;
+   struct request *clone, *n;
+   LIST_HEAD(cl);
 
spin_lock_irqsave(&m->lock, flags);
-   bio = bio_list_get(&m->queued_ios);
+   list_splice_init(&m->queued_ios, &cl);
spin_unlock_irqrestore(&m->lock, flags);
 
-   while (bio) {
-   next = bio->bi_next;
-   bio->bi_next = NULL;
+   list_for_each_entry_safe(clone, n, &cl, queuelist) {
+   list_del(&clone->queuelist);
 
-   info = dm_get_mapinfo(bio);
+   info = dm_get_rq_mapinfo(clone);
mpio = info->ptr;
 
-   r = map_io(m, bio, mpio, 1);
+   r = map_io(m, clone, mpio, 1);
if (r < 0)
-   bio_endio(bio, bio->bi_size, r);
+   blk_end_request(clone, r, blk_rq_size(clone));
else if (r == DM_MAPIO_REMAPPED)
-

[APPENDIX PATCH 1/5] blk_end_request: request-based dm core

2007-08-31 Thread Kiyoshi Ueda

This patch is an example of block device stacking at request level,
showing the necessity of blk_end_request() and how the new
rq->end_io() hook is used.
Request-based dm itself is still under development and not ready
for inclusion.

This patch adds request-based dm feature to dm core.
Request-based dm hooks clone's ->end_io() to check errors of clone
returned from device drivers.  (See clone_end_request())

# Currently, request-based dm can be turned on by an ioctl at dm device
# creation time, so the userspace patches are needed.
# The ioctl from userspace is ignored if the kernel doesn't support it,
# so please update the userspace tools first when you try this.
# (If the kernel is updated first, you will hit a kernel panic.)
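
The stacking idea can be pictured with a small userspace model. This is
only a sketch under stated assumptions: struct rq_model and struct
rq_tio_model are hypothetical stand-ins for struct request and struct
dm_rq_target_io, and the hook simply forwards the clone's error to the
original request, whereas a real target could also decide to retry:

```c
#include <assert.h>
#include <stddef.h>

struct rq_model;
typedef void (*model_end_io_fn)(struct rq_model *rq, int error);

/* Hypothetical, minimal stand-in for struct request. */
struct rq_model {
	model_end_io_fn end_io;  /* completion hook, may be NULL */
	void *end_io_data;       /* private data for the hook */
	int completed;
	int error;
};

/* Per-request context, loosely modelled on struct dm_rq_target_io. */
struct rq_tio_model {
	struct rq_model *orig;   /* the request the clone was made from */
	int error;               /* error seen when the clone completed */
};

/* What a lower driver does when it finishes a request. */
void model_complete(struct rq_model *rq, int error)
{
	assert(rq);
	if (rq->end_io) {
		rq->end_io(rq, error);   /* the hook takes over completion */
		return;
	}
	rq->completed = 1;
	rq->error = error;
}

/* Hook installed on the clone: record the error, finish the original. */
void model_clone_end_request(struct rq_model *clone, int error)
{
	struct rq_tio_model *tio = clone->end_io_data;

	clone->completed = 1;
	clone->error = error;
	tio->error = error;
	model_complete(tio->orig, error);
}

/* Prepare a clone of orig that reports back through tio. */
void model_setup_clone(struct rq_model *clone, struct rq_tio_model *tio,
		       struct rq_model *orig)
{
	tio->orig = orig;
	tio->error = 0;
	clone->end_io = model_clone_end_request;
	clone->end_io_data = tio;
	clone->completed = 0;
	clone->error = 0;
}
```

The lower driver only ever sees the clone; the stacking layer learns the
outcome through the hook and decides what happens to the original.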

Signed-off-by: Kiyoshi Ueda <[EMAIL PROTECTED]>
Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]>
---
 block/ll_rw_blk.c             |    9
 drivers/md/dm-hw-handler.h    |    1
 drivers/md/dm-ioctl.c         |    5
 drivers/md/dm-table.c         |   23 +
 drivers/md/dm.c               |  514 +-
 drivers/md/dm.h               |   13 +
 drivers/scsi/scsi_lib.c       |   38 +++
 include/linux/blkdev.h        |    6
 include/linux/device-mapper.h |   35 ++
 include/linux/dm-ioctl.h      |    9
 10 files changed, 638 insertions(+), 15 deletions(-)

diff -rupN 07-change-end-io/block/ll_rw_blk.c a1-rqdm-core/block/ll_rw_blk.c
--- 07-change-end-io/block/ll_rw_blk.c  2007-08-24 12:31:41.0 -0400
+++ a1-rqdm-core/block/ll_rw_blk.c  2007-08-29 13:53:12.0 -0400
@@ -177,6 +177,13 @@ void blk_queue_softirq_done(struct reque
 
 EXPORT_SYMBOL(blk_queue_softirq_done);
 
+void blk_queue_device_congested(struct request_queue *q, device_congested_fn *fn)
+{
+   q->device_congested_fn = fn;
+}
+
+EXPORT_SYMBOL_GPL(blk_queue_device_congested);
+
 /**
  * blk_queue_make_request - define an alternate make_request function for a device
  * @q:  the request queue for the device to be affected
@@ -3692,7 +3699,7 @@ int blk_end_io(struct request *rq, int u
struct request_queue *q = rq->q;
unsigned long flags = 0UL;
 
-   if (blk_fs_request(rq) || blk_pc_request(rq)) {
+   if ((blk_fs_request(rq) || blk_pc_request(rq)) && !blk_cloned_rq(rq)) {
if (__end_that_request_first(rq, uptodate, nr_bytes))
return 1;
}
diff -rupN 07-change-end-io/drivers/md/dm.c a1-rqdm-core/drivers/md/dm.c
--- 07-change-end-io/drivers/md/dm.c	2007-08-13 00:25:24.0 -0400
+++ a1-rqdm-core/drivers/md/dm.c	2007-08-30 11:19:30.0 -0400
@@ -51,6 +51,22 @@ struct dm_target_io {
union map_info info;
 };
 
+/*
+ * For request based dm.
+ * One of these is allocated per request.
+ *
+ * Since assuming "original request : cloned request = 1 : 1" and
+ * a counter for number of clones like struct dm_io.io_count isn't needed,
+ * struct dm_io and struct target_io can merge.
+ */
+struct dm_rq_target_io {
+   struct mapped_device *md;
+   int error;
+   struct request *rq;
+   struct dm_target *ti;
+   union map_info info;
+};
+
 union map_info *dm_get_mapinfo(struct bio *bio)
 {
if (bio && bio->bi_private)
@@ -58,6 +74,14 @@ union map_info *dm_get_mapinfo(struct bi
return NULL;
 }
 
+union map_info *dm_get_rq_mapinfo(struct request *rq)
+{
+   if (rq && rq->end_io_data)
+   return &((struct dm_rq_target_io *)rq->end_io_data)->info;
+   return NULL;
+}
+EXPORT_SYMBOL_GPL(dm_get_rq_mapinfo);
+
 #define MINOR_ALLOCED ((void *)-1)
 
 /*
@@ -70,6 +94,13 @@ union map_info *dm_get_mapinfo(struct bi
 #define DMF_DELETING 4
 #define DMF_NOFLUSH_SUSPENDING 5
 
+/*
+ * Bits for the md->features field.
+ */
+#define DM_FEAT_REQUEST_BASE (1 << 0)
+
+#define dm_feat_rq_base(md) ((md)->features & DM_FEAT_REQUEST_BASE)
+
 struct mapped_device {
struct rw_semaphore io_lock;
struct semaphore suspend_lock;
@@ -79,6 +110,7 @@ struct mapped_device {
atomic_t open_count;
 
unsigned long flags;
+   unsigned long features;
 
struct request_queue *queue;
struct gendisk *disk;
@@ -121,11 +153,16 @@ struct mapped_device {
 
/* forced geometry settings */
struct hd_geometry geometry;
+
+   /* For saving the address of __make_request for request based dm */
+   make_request_fn *saved_make_request_fn;
 };
 
 #define MIN_IOS 256
 static struct kmem_cache *_io_cache;
 static struct kmem_cache *_tio_cache;
+static struct kmem_cache *_rq_cache; /* clone pool for request-based dm */
+static struct kmem_cache *_rq_tio_cache; /* target_io pool for request-based dm */
 
 static int __init local_init(void)
 {
@@ -143,9 +180,27 @@ static int __init local_init(void)
return -ENOMEM;
}
 
+   _rq_cache = kmem_cache_create("dm_rq", sizeof(struct request),
+ 0, 0, NULL);
+   if (!_rq_cache) {
+

[PATCH 7/7] blk_end_request: change rq->end_io to cover request completion as a whole

2007-08-31 Thread Kiyoshi Ueda

This patch moves the point where rq->end_io() is called from the end
of end_that_request_last() to the top of blk_end_request().
This means that the whole request completion can be hooked by
rq->end_io(), because all device drivers call blk_end_request() to
complete a request.

Because the meaning of rq->end_io() is changed, existing rq->end_io()
users are changed as below:
  o Create a new end_io handler using blk_end_io().
blk_end_io() is a default rq->end_io() handler and can take
a callback function, which is called after end_that_request_last().
So the old end_io() handler can be used as the callback.
  o Set the new end_io handler to rq->end_io.

scsi_transport_sas.c:sas_smp_request() seems to expect that the request
is a bsg request and that rq->end_io() is bsg_rq_end_io().
So it is changed to call the handler directly.
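
The wrapping described above can be sketched in plain userspace C.
This is only a toy model under stated assumptions, not the patch
itself: struct toy_request, toy_blk_end_io() and the other toy_* names
are hypothetical stand-ins, and the needlock and drv_callback arguments
of the real blk_end_io() are left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

struct toy_request;
typedef void (toy_dtor_fn)(struct toy_request *rq, int error);
typedef int (toy_end_io_fn)(struct toy_request *rq, int error, int nr_bytes);

/* Hypothetical stand-in for struct request. */
struct toy_request {
	int bytes_left;         /* bytes not yet completed */
	int finished;           /* set where end_that_request_last() would run */
	int dtor_calls;         /* how often the old-style handler ran */
	toy_end_io_fn *end_io;  /* full completion handler, may be NULL */
};

/* Models the default handler: complete bytes, finish, then call the dtor. */
static int toy_blk_end_io(struct toy_request *rq, int error, int nr_bytes,
			  toy_dtor_fn *dtor)
{
	assert(rq != NULL && nr_bytes >= 0);
	if (nr_bytes < rq->bytes_left) {
		rq->bytes_left -= nr_bytes;
		return 1;               /* leftover: request not finished */
	}
	rq->bytes_left = 0;
	rq->finished = 1;
	if (dtor)
		dtor(rq, error);        /* old end_io body runs last */
	return 0;
}

/* What used to be the end_io handler becomes a plain callback. */
static void old_style_dtor(struct toy_request *rq, int error)
{
	(void)error;
	rq->dtor_calls++;
}

/* New-style end_io: the default handler plus the old handler as callback. */
static int new_style_end_io(struct toy_request *rq, int error, int nr_bytes)
{
	return toy_blk_end_io(rq, error, nr_bytes, old_style_dtor);
}

/* Models blk_end_request(): end_io, if set, covers the whole completion. */
static int toy_blk_end_request(struct toy_request *rq, int error, int nr_bytes)
{
	if (rq->end_io)
		return rq->end_io(rq, error, nr_bytes);
	return toy_blk_end_io(rq, error, nr_bytes, NULL);
}
```

In this model the old handler runs exactly once, and only on the call
that finishes the request; partial completions never reach it.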

Signed-off-by: Kiyoshi Ueda <[EMAIL PROTECTED]>
Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]>
---
 block/bsg.c                       |   20 +-
 block/ll_rw_blk.c                 |   70 ++
 drivers/md/dm-mpath-rdac.c        |   45 +---
 drivers/scsi/scsi_lib.c           |    9 
 drivers/scsi/scsi_transport_sas.c |    2 -
 include/linux/blkdev.h            |   12 +-
 include/linux/bsg.h               |    5 ++
 7 files changed, 139 insertions(+), 24 deletions(-)

diff -rupN 06-remove-old-interface/block/bsg.c 07-change-end-io/block/bsg.c
--- 06-remove-old-interface/block/bsg.c 2007-08-13 00:25:24.0 -0400
+++ 07-change-end-io/block/bsg.c    2007-08-24 12:34:14.0 -0400
@@ -312,9 +312,9 @@ out:
 
 /*
  * async completion call-back from the block layer, when scsi/ide/whatever
- * calls end_that_request_last() on a request
+ * calls blk_end_request() on a request
  */
-static void bsg_rq_end_io(struct request *rq, int uptodate)
+static void bsg_rq_dtor(struct request *rq, int uptodate)
 {
struct bsg_command *bc = rq->end_io_data;
struct bsg_device *bd = bc->bd;
@@ -333,6 +333,22 @@ static void bsg_rq_end_io(struct request
wake_up(&bd->wq_done);
 }
 
+static int bsg_rq_end_io(struct request *rq, int uptodate, int nr_bytes,
+int needlock, int (drv_callback)(struct request *))
+{
+   return blk_end_io(rq, uptodate, nr_bytes, needlock, bsg_rq_dtor,
+ drv_callback);
+}
+
+/*
+ * rq->q's lock must be held.
+ */
+void bsg_end_request(struct request *rq, int uptodate)
+{
+   bsg_rq_dtor(rq, uptodate);
+}
+EXPORT_SYMBOL_GPL(bsg_end_request);
+
 /*
  * do final setup of a 'bc' and submit the matching 'rq' to the block
  * layer for io
diff -rupN 06-remove-old-interface/block/ll_rw_blk.c 07-change-end-io/block/ll_rw_blk.c
--- 06-remove-old-interface/block/ll_rw_blk.c   2007-08-24 12:19:02.0 -0400
+++ 07-change-end-io/block/ll_rw_blk.c  2007-08-24 12:31:41.0 -0400
@@ -383,24 +383,45 @@ void blk_ordered_complete_seq(struct req
BUG();
 }
 
-static void pre_flush_end_io(struct request *rq, int error)
+static void pre_flush_dtor(struct request *rq, int error)
 {
elv_completed_request(rq->q, rq);
blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_PREFLUSH, error);
 }
 
-static void bar_end_io(struct request *rq, int error)
+static int pre_flush_end_io(struct request *rq, int uptodate, int nr_bytes,
+   int needlock, int (drv_callback)(struct request *))
+{
+   return blk_end_io(rq, uptodate, nr_bytes, needlock, pre_flush_dtor,
+ drv_callback);
+}
+
+static void bar_dtor(struct request *rq, int error)
 {
elv_completed_request(rq->q, rq);
blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_BAR, error);
 }
 
-static void post_flush_end_io(struct request *rq, int error)
+static int bar_end_io(struct request *rq, int uptodate, int nr_bytes,
+ int needlock, int (drv_callback)(struct request *))
+{
+   return blk_end_io(rq, uptodate, nr_bytes, needlock, bar_dtor,
+ drv_callback);
+}
+
+static void post_flush_dtor(struct request *rq, int error)
 {
elv_completed_request(rq->q, rq);
blk_ordered_complete_seq(rq->q, QUEUE_ORDSEQ_POSTFLUSH, error);
 }
 
+static int post_flush_end_io(struct request *rq, int uptodate, int nr_bytes,
+int needlock, int (drv_callback)(struct request *))
+{
+   return blk_end_io(rq, uptodate, nr_bytes, needlock, post_flush_dtor,
+ drv_callback);
+}
+
 static void queue_flush(struct request_queue *q, unsigned which)
 {
struct request *rq;
@@ -2780,11 +2801,11 @@ void blk_put_request(struct request *req
 EXPORT_SYMBOL(blk_put_request);
 
 /**
- * blk_end_sync_rq - executes a completion event on a request
+ * blk_sync_rq_dtor - executes a completion event on a request
  * @rq: request to complete
  * @error: end io status of the request
  */
-void blk_end_sync_rq(struct request *rq, int error)
+static void blk_sync_rq_dtor(struct

[PATCH 6/7] blk_end_request: remove/unexport end_that_request_*

2007-08-31 Thread Kiyoshi Ueda

This patch removes the following functions:
  o end_that_request_first()
  o end_that_request_chunk()
and stops exporting the functions below:
  o end_that_request_last()

Signed-off-by: Kiyoshi Ueda <[EMAIL PROTECTED]>
Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]>
---
 block/ll_rw_blk.c      |   61 -
 include/linux/blkdev.h |   15 
 2 files changed, 21 insertions(+), 55 deletions(-)

diff -rupN 05-ide-cd-change/block/ll_rw_blk.c 06-remove-old-interface/block/ll_rw_blk.c
--- 05-ide-cd-change/block/ll_rw_blk.c  2007-08-24 12:11:02.0 -0400
+++ 06-remove-old-interface/block/ll_rw_blk.c   2007-08-24 12:19:02.0 -0400
@@ -3388,6 +3388,20 @@ static void blk_recalc_rq_sectors(struct
}
 }
 
+/**
+ * __end_that_request_first - end I/O on a request
+ * @req:  the request being processed
+ * @uptodate: 1 for success, 0 for I/O error, < 0 for specific error
+ * @nr_bytes: number of bytes to complete
+ *
+ * Description:
+ * Ends I/O on a number of bytes attached to @req, and sets it up
+ * for the next range of segments (if any) in the cluster.
+ *
+ * Return:
+ * 0 - we are done with this request, call end_that_request_last()
+ * 1 - still buffers pending for this request
+ **/
 static int __end_that_request_first(struct request *req, int uptodate,
int nr_bytes)
 {
@@ -3498,49 +3512,6 @@ static int __end_that_request_first(stru
return 1;
 }
 
-/**
- * end_that_request_first - end I/O on a request
- * @req:  the request being processed
- * @uptodate: 1 for success, 0 for I/O error, < 0 for specific error
- * @nr_sectors: number of sectors to end I/O on
- *
- * Description:
- * Ends I/O on a number of sectors attached to @req, and sets it up
- * for the next range of segments (if any) in the cluster.
- *
- * Return:
- * 0 - we are done with this request, call end_that_request_last()
- * 1 - still buffers pending for this request
- **/
-int end_that_request_first(struct request *req, int uptodate, int nr_sectors)
-{
-   return __end_that_request_first(req, uptodate, nr_sectors << 9);
-}
-
-EXPORT_SYMBOL(end_that_request_first);
-
-/**
- * end_that_request_chunk - end I/O on a request
- * @req:  the request being processed
- * @uptodate: 1 for success, 0 for I/O error, < 0 for specific error
- * @nr_bytes: number of bytes to complete
- *
- * Description:
- * Ends I/O on a number of bytes attached to @req, and sets it up
- * for the next range of segments (if any). Like end_that_request_first(),
- * but deals with bytes instead of sectors.
- *
- * Return:
- * 0 - we are done with this request, call end_that_request_last()
- * 1 - still buffers pending for this request
- **/
-int end_that_request_chunk(struct request *req, int uptodate, int nr_bytes)
-{
-   return __end_that_request_first(req, uptodate, nr_bytes);
-}
-
-EXPORT_SYMBOL(end_that_request_chunk);
-
 /*
  * splice the completion data to a local structure and hand off to
  * process_completion_queue() to complete the requests
@@ -3620,7 +3591,7 @@ EXPORT_SYMBOL(blk_complete_request);
 /*
  * queue lock must be held
  */
-void end_that_request_last(struct request *req, int uptodate)
+static void end_that_request_last(struct request *req, int uptodate)
 {
struct gendisk *disk = req->rq_disk;
int error;
@@ -3655,8 +3626,6 @@ void end_that_request_last(struct reques
__blk_put_request(req->q, req);
 }
 
-EXPORT_SYMBOL(end_that_request_last);
-
 void end_request(struct request *req, int uptodate)
 {
__blk_end_request(req, uptodate, sect2byte(req->hard_cur_sectors));
diff -rupN 05-ide-cd-change/include/linux/blkdev.h 06-remove-old-interface/include/linux/blkdev.h
--- 05-ide-cd-change/include/linux/blkdev.h 2007-08-24 12:21:45.0 -0400
+++ 06-remove-old-interface/include/linux/blkdev.h  2007-08-24 12:21:15.0 -0400
@@ -720,19 +720,16 @@ static inline void blk_run_address_space
 }
 
 /*
- * end_request() and friends. Must be called with the request queue spinlock
- * acquired. All functions called within end_request() _must_be_ atomic.
+ * blk_end_request() and friends.
+ * __blk_end_request() and end_request() must be called with
+ * the request queue spinlock acquired.
  *
  * Several drivers define their own end_request and call
- * end_that_request_first() and end_that_request_last()
- * for parts of the original function. This prevents
- * code duplication in drivers.
+ * blk_end_request() for parts of the original function.
+ * This prevents code duplication in drivers.
  */
 extern int blk_end_request(struct request *rq, int uptodate, int nr_bytes);
 extern int __blk_end_request(struct request *rq, int uptodate, int nr_bytes);
-extern int end_that_request_first(struct request *, int, int);
-extern int end_that_request_chunk(struct request *, int, int);
-extern void end_that_request_last(struct

[PATCH 5/7] blk_end_request: change ide-cd (cdrom_newpc_intr)

2007-08-31 Thread Kiyoshi Ueda

This patch changes ide-cd (cdrom_newpc_intr) to use blk_end_request().
Because of the oddness of the driver, the patch adds a variant of
the interface, blk_end_request_callback().

cdrom_newpc_intr() of ide-cd is the only function in the kernel tree
which needs to call end_that_request_first() and
end_that_request_last() separately.
blk_end_request_callback() allows it to pass callback function to do
something between end_that_request_first() and end_that_request_last().

ide-cd (cdrom_newpc_intr) needs to do the following
after end_that_request_first() and before end_that_request_last():
  1. call post_transform_command() to modify request contents
  2. delay completing the request until DRQ_STAT is cleared

As for the second one, ide-cd waits for an interrupt from the device.
So blk_end_request() has to return without completing the request even
if nothing is left over in the request.
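
The callback behaviour described above can be modelled in userspace C
as follows. This is a hedged sketch, not the kernel code: struct
toy_request, its drq_busy field and wait_for_drq_clear() are
hypothetical names invented for illustration, and locking is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct request plus a fake device status bit. */
struct toy_request {
	int bytes_left;   /* bytes not yet completed */
	int completed;    /* set where end_that_request_last() would run */
	int drq_busy;     /* models DRQ_STAT still being set */
};

/*
 * Models the helper: complete bytes first, then ask the driver callback,
 * and only finish the request if the callback returns 0.
 */
static int toy_end_request_callback(struct toy_request *rq, int nr_bytes,
				    int (*drv_callback)(struct toy_request *))
{
	assert(rq != NULL && nr_bytes >= 0);
	if (nr_bytes < rq->bytes_left) {
		rq->bytes_left -= nr_bytes;
		return 1;       /* still has pending buffers */
	}
	rq->bytes_left = 0;
	if (drv_callback && drv_callback(rq))
		return 1;       /* driver wants to keep the request */
	rq->completed = 1;
	return 0;
}

/* Driver callback: non-zero means "do not complete the request yet". */
static int wait_for_drq_clear(struct toy_request *rq)
{
	return rq->drq_busy;
}
```

Note that a return value of 1 is ambiguous on purpose here, as in the
patch: it covers both "bytes are left" and "the driver declined".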

Signed-off-by: Kiyoshi Ueda <[EMAIL PROTECTED]>
Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]>
---
 block/ll_rw_blk.c      |   47 +++--
 drivers/ide/ide-cd.c   |   78 ++---
 include/linux/blkdev.h |    3 +
 3 files changed, 108 insertions(+), 20 deletions(-)

diff -rupN 04-other-caller-change/block/ll_rw_blk.c 05-ide-cd-change/block/ll_rw_blk.c
--- 04-other-caller-change/block/ll_rw_blk.c    2007-08-23 17:51:33.0 -0400
+++ 05-ide-cd-change/block/ll_rw_blk.c  2007-08-24 12:11:02.0 -0400
@@ -3671,6 +3671,10 @@ EXPORT_SYMBOL(end_request);
  * @nr_bytes: number of bytes to complete
  * @needlock: 1 for queue lock need to be held.
  *0 for queue lock held already.
+ * @drv_callback: function called between completion of bios in the request
+ *and completion of the request.
+ *If the callback returns non 0, this helper returns without
+ *completion of the request.
  *
  * Description:
  * Ends I/O on a number of bytes attached to @rq.
@@ -3681,7 +3685,8 @@ EXPORT_SYMBOL(end_request);
  * 1 - this request is not freed yet, it still has pending buffers.
  **/
 static int blk_end_request(struct request *rq, int uptodate, int nr_bytes,
-  int needlock)
+  int needlock,
+  int (drv_callback)(struct request *))
 {
struct request_queue *q = rq->q;
unsigned long flags = 0UL;
@@ -3691,6 +3696,10 @@ static int blk_end_request(struct re
return 1;
}
 
+   /* Special feature for drivers/ide/ide-cd.c:cdrom_newpc_intr() */
+   if (drv_callback && drv_callback(rq))
+   return 1;
+
/*
 * No need to check the argument here because it is done
 * in add_disk_randomness().
@@ -3730,7 +3739,7 @@ static int blk_end_request(struct re
  **/
 int blk_end_request(struct request *rq, int uptodate, int nr_bytes)
 {
-   return blk_end_request(rq, uptodate, nr_bytes, 1);
+   return blk_end_request(rq, uptodate, nr_bytes, 1, NULL);
 }
 EXPORT_SYMBOL_GPL(blk_end_request);
 
@@ -3742,10 +3751,42 @@ EXPORT_SYMBOL_GPL(blk_end_request);
  **/
 int __blk_end_request(struct request *rq, int uptodate, int nr_bytes)
 {
-   return blk_end_request(rq, uptodate, nr_bytes, 0);
+   return blk_end_request(rq, uptodate, nr_bytes, 0, NULL);
 }
 EXPORT_SYMBOL_GPL(__blk_end_request);
 
+/**
+ * blk_end_request_callback - Special helper function for the ide-cd driver
+ * @rq:   the request being processed
+ * @uptodate: 1 for success, 0 for I/O error, < 0 for specific error
+ * @nr_bytes: number of bytes to complete
+ * @drv_callback: function called between completion of bios in the request
+ *and completion of the request.
+ *If the callback returns non 0, this helper returns without
+ *completion of the request.
+ *
+ * Description:
+ * Ends I/O on a number of bytes attached to @rq.
+ * If @rq has leftover, sets it up for the next range of segments.
+ *
+ * This special helper function for the ide-cd driver is used
+ * to complete the request only in cdrom_newpc_intr().
+ * This interface will be removed when cdrom_newpc_intr() is rewritten.
+ * Don't use this interface in other places.
+ *
+ * Return:
+ * 0 - we are done with this request
+ * 1 - this request is not freed yet.
+ * this request still has pending buffers or
+ * the ide-cd driver doesn't want to finish this request yet.
+ **/
+int blk_end_request_callback(struct request *rq, int uptodate, int nr_bytes,
+int (drv_callback)(struct request *))
+{
+   return blk_end_request(rq, uptodate, nr_bytes, 1, drv_callback);
+}
+EXPORT_SYMBOL_GPL(blk_end_request_callback);
+
 void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
 struct bio *bio)
 {
diff -rupN

[PATCH 4/7] blk_end_request: cciss/cpqarray/xsysace change

2007-08-31 Thread Kiyoshi Ueda

This patch changes "odd" drivers to use blk_end_request().
The drivers are cciss, cpqarray and xsysace.


cciss and cpqarray directly call bio_endio() and disk_stat_add()
when completing request.  But those can be replaced with
__end_that_request_first().
After the replacement, request completion procedures of those drivers
become like the following:
o end_that_request_first()
o add_disk_randomness()
o end_that_request_last()
This can be converted to blk_end_request() by following
the rule (a) mentioned in the patch subject
"[PATCH 3/7] blk_end_request: changing "normal" drivers".


xsysace driver has a state machine in it.
It calls end_that_request_first() and end_that_request_last()
from different states. (ACE_FSM_STATE_REQ_TRANSFER and
ACE_FSM_STATE_REQ_COMPLETE, respectively.)
However, those states are consecutive, with no interruption
in between.
So we can just follow the standard conversion rule (a) mentioned in
the patch subject "[PATCH 3/7] blk_end_request: changing "normal" drivers".

Signed-off-by: Kiyoshi Ueda <[EMAIL PROTECTED]>
Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]>
---
 cciss.c|   26 +++---
 cpqarray.c |   28 ++--
 xsysace.c  |5 +
 3 files changed, 6 insertions(+), 53 deletions(-)

diff -rupN 03-normal-caller-change/drivers/block/cciss.c 04-other-caller-change/drivers/block/cciss.c
--- 03-normal-caller-change/drivers/block/cciss.c   2007-08-13 00:25:24.0 -0400
+++ 04-other-caller-change/drivers/block/cciss.c    2007-08-23 17:54:19.0 -0400
@@ -1187,18 +1187,6 @@ static int cciss_ioctl(struct inode *ino
}
 }
 
-static inline void complete_buffers(struct bio *bio, int status)
-{
-   while (bio) {
-   struct bio *xbh = bio->bi_next;
-   int nr_sectors = bio_sectors(bio);
-
-   bio->bi_next = NULL;
-   bio_endio(bio, nr_sectors << 9, status ? 0 : -EIO);
-   bio = xbh;
-   }
-}
-
 static void cciss_check_queues(ctlr_info_t *h)
 {
int start_queue = h->next_to_run;
@@ -1264,21 +1252,14 @@ static void cciss_softirq_done(struct re
pci_unmap_page(h->pdev, temp64.val, cmd->SG[i].Len, ddir);
}
 
-   complete_buffers(rq->bio, (rq->errors == 0));
-
-   if (blk_fs_request(rq)) {
-   const int rw = rq_data_dir(rq);
-
-   disk_stat_add(rq->rq_disk, sectors[rw], rq->nr_sectors);
-   }
-
 #ifdef CCISS_DEBUG
printk("Done with %p\n", rq);
 #endif /* CCISS_DEBUG */
 
-   add_disk_randomness(rq->rq_disk);
+   if (blk_end_request(rq, (rq->errors == 0), blk_rq_size(rq)))
+   BUG();
+
spin_lock_irqsave(&h->lock, flags);
-   end_that_request_last(rq, (rq->errors == 0));
cmd_free(h, cmd, 1);
cciss_check_queues(h);
spin_unlock_irqrestore(&h->lock, flags);
@@ -2504,7 +2485,6 @@ after_error_processing:
}
cmd->rq->data_len = 0;
cmd->rq->completion_data = cmd;
-   blk_add_trace_rq(cmd->rq->q, cmd->rq, BLK_TA_COMPLETE);
blk_complete_request(cmd->rq);
 }
 
diff -rupN 03-normal-caller-change/drivers/block/cpqarray.c 04-other-caller-change/drivers/block/cpqarray.c
--- 03-normal-caller-change/drivers/block/cpqarray.c    2007-08-13 00:25:24.0 -0400
+++ 04-other-caller-change/drivers/block/cpqarray.c 2007-08-23 17:54:19.0 -0400
@@ -166,7 +166,6 @@ static void start_io(ctlr_info_t *h);
 
 static inline void addQ(cmdlist_t **Qptr, cmdlist_t *c);
 static inline cmdlist_t *removeQ(cmdlist_t **Qptr, cmdlist_t *c);
-static inline void complete_buffers(struct bio *bio, int ok);
 static inline void complete_command(cmdlist_t *cmd, int timeout);
 
 static irqreturn_t do_ida_intr(int irq, void *dev_id);
@@ -978,20 +977,6 @@ static void start_io(ctlr_info_t *h)
}
 }
 
-static inline void complete_buffers(struct bio *bio, int ok)
-{
-   struct bio *xbh;
-   while(bio) {
-   int nr_sectors = bio_sectors(bio);
-
-   xbh = bio->bi_next;
-   bio->bi_next = NULL;
-   
-   bio_endio(bio, nr_sectors << 9, ok ? 0 : -EIO);
-
-   bio = xbh;
-   }
-}
 /*
  * Mark all buffers that cmd was responsible for
  */
@@ -1029,18 +1014,9 @@ static inline void complete_command(cmdl
 pci_unmap_page(hba[cmd->ctlr]->pci_dev, cmd->req.sg[i].addr,
cmd->req.sg[i].size, ddir);
 
-   complete_buffers(rq->bio, ok);
-
-   if (blk_fs_request(rq)) {
-   const int rw = rq_data_dir(rq);
-
-   disk_stat_add(rq->rq_disk, sectors[rw], rq->nr_sectors);
-   }
-
-   add_disk_randomness(rq->rq_disk);
-
DBGPX(printk("Done with %p\n", rq););
-   end_that_request_last(rq, ok ? 1 : -EIO);
+   if (__blk_end_request(rq, ok, blk_rq_size(rq)))
+   BUG();
 }
 
 /*
diff -rupN

[PATCH 3/7] blk_end_request: changing "normal" drivers

2007-08-31 Thread Kiyoshi Ueda

This patch converts "normal" drivers, which complete request
in a standard way shown below, to use blk_end_request().

 a) end_that_request_{chunk/first}
spin_lock_irqsave()
(add_disk_randomness(), blk_queue_end_tag(), blkdev_dequeue_request())
end_that_request_last()
spin_unlock_irqrestore()
=> blk_end_request()

 b) spin_lock_irqsave()
end_that_request_{chunk/first}
(add_disk_randomness(), blk_queue_end_tag(), blkdev_dequeue_request())
end_that_request_last()
spin_unlock_irqrestore()
=> spin_lock_irqsave()
   __blk_end_request()
   spin_unlock_irqrestore()

 c) end_that_request_last()
=> __blk_end_request()
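
The difference between rules (a) and (b) is only who takes the queue
lock. A rough userspace model, assuming a flag in place of the real
spinlock (all toy_* names are hypothetical, and the real
blk_end_request() holds the lock for a shorter span than this model):

```c
#include <assert.h>

/* Toy model, not kernel code: the queue "lock" is just a flag. */
struct toy_queue {
	int locked;
};

struct toy_request {
	struct toy_queue *q;
	int bytes_left;     /* bytes not yet completed */
	int done;           /* set once the request is fully completed */
};

static void toy_lock(struct toy_queue *q)
{
	assert(!q->locked);     /* catches double locking */
	q->locked = 1;
}

static void toy_unlock(struct toy_queue *q)
{
	assert(q->locked);
	q->locked = 0;
}

/* Models __blk_end_request(): the caller must already hold the lock. */
static int toy_end_request_locked(struct toy_request *rq, int nr_bytes)
{
	assert(rq->q->locked);
	if (nr_bytes < rq->bytes_left) {
		rq->bytes_left -= nr_bytes;
		return 1;       /* still has pending buffers */
	}
	rq->bytes_left = 0;
	rq->done = 1;
	return 0;               /* request fully completed */
}

/* Models blk_end_request(): called without the lock, takes it itself. */
static int toy_end_request(struct toy_request *rq, int nr_bytes)
{
	int ret;

	toy_lock(rq->q);
	ret = toy_end_request_locked(rq, nr_bytes);
	toy_unlock(rq->q);
	return ret;
}
```

A driver that already holds the lock (rule b) would call the locked
variant; calling toy_end_request() there would trip the assertion in
toy_lock(), which mirrors the deadlock the real code would hit.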

Signed-off-by: Kiyoshi Ueda <[EMAIL PROTECTED]>
Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]>
---
 arch/arm/plat-omap/mailbox.c|9 ++---
 arch/um/drivers/ubd_kern.c  |   10 +-
 block/elevator.c|4 ++--
 block/ll_rw_blk.c   |   15 +--
 drivers/block/DAC960.c  |6 ++
 drivers/block/floppy.c  |8 +++-
 drivers/block/lguest_blk.c  |5 +
 drivers/block/nbd.c |4 +---
 drivers/block/ps3disk.c |6 +-
 drivers/block/sunvdc.c  |5 +
 drivers/block/sx8.c |4 +---
 drivers/block/ub.c  |4 ++--
 drivers/block/viodasd.c |5 +
 drivers/block/xen-blkfront.c|5 ++---
 drivers/cdrom/viocd.c   |5 +
 drivers/ide/ide-cd.c|6 +++---
 drivers/ide/ide-io.c|   22 +++---
 drivers/message/i2o/i2o_block.c |8 ++--
 drivers/mmc/card/block.c|   24 +---
 drivers/mmc/card/queue.c|4 ++--
 drivers/s390/block/dasd.c   |4 +---
 drivers/s390/char/tape_block.c  |3 +--
 drivers/scsi/ide-scsi.c |8 
 drivers/scsi/scsi_lib.c |   13 ++---
 24 files changed, 57 insertions(+), 130 deletions(-)

diff -rupN 02-sect2byte-macro/arch/arm/plat-omap/mailbox.c 03-normal-caller-change/arch/arm/plat-omap/mailbox.c
--- 02-sect2byte-macro/arch/arm/plat-omap/mailbox.c 2007-08-13 00:25:24.0 -0400
+++ 03-normal-caller-change/arch/arm/plat-omap/mailbox.c    2007-08-23 17:51:33.0 -0400
@@ -117,7 +117,8 @@ static void mbox_tx_work(struct work_str
 
spin_lock(q->queue_lock);
blkdev_dequeue_request(rq);
-   end_that_request_last(rq, 0);
+   if (__blk_end_request(rq, 0, 0))
+   BUG();
spin_unlock(q->queue_lock);
}
 }
@@ -151,7 +152,8 @@ static void mbox_rx_work(struct work_str
 
spin_lock_irqsave(q->queue_lock, flags);
blkdev_dequeue_request(rq);
-   end_that_request_last(rq, 0);
+   if (__blk_end_request(rq, 0, 0))
+   BUG();
spin_unlock_irqrestore(q->queue_lock, flags);
 
mbox->rxq->callback((void *)msg);
@@ -265,7 +267,8 @@ omap_mbox_read(struct device *dev, struc
 
spin_lock_irqsave(q->queue_lock, flags);
blkdev_dequeue_request(rq);
-   end_that_request_last(rq, 0);
+   if (__blk_end_request(rq, 0, 0))
+   BUG();
spin_unlock_irqrestore(q->queue_lock, flags);
 
if (unlikely(mbox_seq_test(mbox, *p))) {
diff -rupN 02-sect2byte-macro/arch/um/drivers/ubd_kern.c 03-normal-caller-change/arch/um/drivers/ubd_kern.c
--- 02-sect2byte-macro/arch/um/drivers/ubd_kern.c   2007-08-22 18:54:03.0 -0400
+++ 03-normal-caller-change/arch/um/drivers/ubd_kern.c  2007-08-23 17:51:33.0 -0400
@@ -476,15 +476,7 @@ int thread_fd = -1;
 
 static void ubd_end_request(struct request *req, int bytes, int uptodate)
 {
-   if (!end_that_request_first(req, uptodate, bytes >> 9)) {
-   struct ubd *dev = req->rq_disk->private_data;
-   unsigned long flags;
-
-   add_disk_randomness(req->rq_disk);
-   spin_lock_irqsave(&dev->lock, flags);
-   end_that_request_last(req, uptodate);
-   spin_unlock_irqrestore(&dev->lock, flags);
-   }
+   blk_end_request(req, uptodate, bytes);
 }
 
 /* Callable only from interrupt context - otherwise you need to do
diff -rupN 02-sect2byte-macro/block/elevator.c 03-normal-caller-change/block/elevator.c
--- 02-sect2byte-macro/block/elevator.c 2007-08-13 00:25:24.0 -0400
+++ 03-normal-caller-change/block/elevator.c    2007-08-23 17:51:33.0 -0400
@@ -758,8 +758,8 @@ struct request *elv_next_request(struct 
 
blkdev_dequeue_request(rq);
rq->cmd_flags |= REQ_QUIET;
-   end_that_request_chunk(rq, 0, nr_bytes);
-   end_that_request_last(rq, 0);
+   if (__blk_end_request(rq, 0, nr_bytes))
+

[PATCH 2/7] blk_end_request: add blk_rq_size() macros

2007-08-31 Thread Kiyoshi Ueda

This patch adds macros to get the size of request in bytes.
They are useful because blk_end_request() takes bytes
as a completed I/O size instead of sectors.
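
As a quick illustration of the arithmetic, here is a userspace sketch
that reuses the macro definitions from the patch below. The struct
fake_request type and bytes_to_complete_all() are hypothetical and
exist only for this example; they mirror just the two fields the
macros read.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct request: only the fields the macros read. */
struct fake_request {
	unsigned long hard_nr_sectors;     /* sectors left in the whole request */
	unsigned int current_nr_sectors;   /* sectors left in the current segment */
};

/* Same definitions as in the patch: 1 sector = 512 bytes = 1 << 9. */
#define sect2byte(nr_sectors) ((nr_sectors) << 9)
#define blk_rq_size(rq) (sect2byte((rq)->hard_nr_sectors))
#define blk_rq_cur_size(rq) (sect2byte((rq)->current_nr_sectors))

/* Byte count a driver would pass to finish everything that is left. */
static unsigned long bytes_to_complete_all(const struct fake_request *rq)
{
	assert(rq != NULL);
	return blk_rq_size(rq);
}
```

With 8 sectors left overall and 2 in the current segment, the macros
yield 4096 and 1024 bytes respectively.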

Signed-off-by: Kiyoshi Ueda <[EMAIL PROTECTED]>
Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]>
---
 blkdev.h |9 +
 1 files changed, 9 insertions(+)

diff -rupN 01-blkendreq-interface/include/linux/blkdev.h 02-sect2byte-macro/include/linux/blkdev.h
--- 01-blkendreq-interface/include/linux/blkdev.h   2007-08-23 17:22:50.0 -0400
+++ 02-sect2byte-macro/include/linux/blkdev.h   2007-08-23 17:25:59.0 -0400
@@ -737,6 +737,15 @@ extern void end_request(struct request *
 extern void blk_complete_request(struct request *);
 
 /*
+ * blk_end_request() takes bytes instead of sectors as a complete size.
+ * blk_rq_size() returns the entire size left to complete in the request.
+ * blk_rq_cur_size() returns the size left to complete in the current segment.
+ */
+#define sect2byte(nr_sectors) ((nr_sectors) << 9)
+#define blk_rq_size(rq) (sect2byte((rq)->hard_nr_sectors))
+#define blk_rq_cur_size(rq) (sect2byte((rq)->current_nr_sectors))
+
+/*
  * end_that_request_first/chunk() takes an uptodate argument. we account
  * any value <= as an io error. 0 means -EIO for compatability reasons,
  * any other < 0 value is the direct error type. An uptodate value of

[PATCH 0/7] blk_end_request: full I/O completion handler

2007-08-31 Thread Kiyoshi Ueda

Hello,

This set of patches changes the request completion interface
between device drivers and the block layer from the current
2-step procedure, using end_that_request_{first/chunk} and
end_that_request_last(), to a 1-step procedure.

This change allows request-based multipath to hook in before
completing each chunk of a request, check it for errors and
retry it on another path if an error is detected.

Summaries of each patch are below:
  1/7: add new request completion interface, blk_end_request()
  2/7: add some macros to get the size of request in bytes
  3/7: convert normal drivers to use blk_end_request()
  4/7: convert odd drivers like cciss/cpqarray/xsysace to use
   blk_end_request()
  5/7: convert ide-cd (cdrom_newpc_intr) to use blk_end_request()
  6/7: remove/unexport no longer needed end_that_request_*
  7/7: change rq->end_io to cover request completion as a whole

I have tested the patch on two machines, ia64+QLA1280+QLA2200
and x86_64+SATA+IDE-CDROM.
I can't test other device drivers for which I don't have hardware.
So testing help and any comments would be very much appreciated.

The interface change causes code modifications of *ALL DEVICE DRIVERS*
which are using end_that_request_{first/chunk/last} to complete request.
But it should not affect the behavior.

Please review and apply if no problem.
This patch-set should be applied on top of 2.6.23-rc3-mm1.

BACKGROUND
==========
The patch is necessary to allow device stacking at request level,
that is request-based device-mapper multipath.
Currently, device-mapper is implemented as a stacking block device
at BIO level.  OTOH, request-based DM will stack at request level to
allow better multipathing decisions.
To allow device stacking at request level, the completion procedure
needs to provide a hook for it.
For example, dm-multipath has to check errors and retry with other
paths if necessary before returning the I/O result to upper layer.
struct request has 'end_io' hook currently.  But it's called at
the very late stage of completion handling where the I/O result
is already returned to the upper layer.
So we need something here.

The first approach to hook in completion of each chunk of request
was adding a new rq->end_io_first() hook and calling it on the top
of __end_that_request_first().
  - http://marc.theaimsgroup.com/?l=linux-scsi&m=115520444515914&w=2
  - http://marc.theaimsgroup.com/?l=linux-kernel&m=116656637425880&w=2
However, Jens pointed out that redesigning rq->end_io() as a full
completion handler would be better:

On Thu, 21 Dec 2006 08:49:47 +0100, Jens Axboe <[EMAIL PROTECTED]> wrote:
> Ok, I see what you are getting at. The current ->end_io() is called when
> the request has fully completed, you want notification for each chunk
> potentially completed.
> 
> I think a better design here would be to use ->end_io() as the full
> completion handler, similar to how bio->bi_end_io() works. A request
> originating from __make_request() would set something ala:
.
> instead of calling the functions manually. That would allow you to get
> notification right at the beginning and do what you need, without adding
> a special hook for this.

I thought his comment was reasonable.
So I modified the patches based on his suggestion.

WHAT IS CHANGED
===============
The change is basically illustrated by the following pseudo code:

[Before]
  if (end_that_request_{first/chunk} succeeds) { <-- completes bios

 end_that_request_last() <-- calls end_io()

  } else {

  }

[After]
  if (blk_end_request() succeeds) { <-- calls end_io(), completes bios

  } else {

  }

In detail, request completion procedures are changed like below.

[Before]
  o 2 steps completion using end_that_request_{first/chunk}
and end_that_request_last().
  o Device drivers have ownership of a request until they
call end_that_request_last().
  o rq->end_io() is called at the last stage of
end_that_request_last(), because some block layer code needs
request-specific handling when completing a request.

[After]
  o 1 step completion using blk_end_request().
(end_that_request_* are no longer used from device drivers.)
  o Device drivers give over ownership of a request
when calling blk_end_request().
If it returns 0, the request is completed.
If it returns 1, the request isn't completed and
the ownership is returned to the device driver again.
  o rq->end_io() is called at the top of blk_end_request() so that
it can hook all parts of request completion.
Existing users of rq->end_io() must be changed to do
all parts of request completion.
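
From a driver's point of view, the ownership rule in the [After] list
could look roughly like this. It is a userspace toy under assumed
names (struct toy_request, toy_blk_end_request() and
complete_in_chunks() are hypothetical), meant only to show how the
0/1 return value drives the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct request. */
struct toy_request {
	unsigned int bytes_left;   /* bytes not yet completed */
};

/*
 * Models the 1-step interface.
 * Returns 0 when the request is completed (the caller no longer owns it),
 * 1 when bytes are left and ownership goes back to the caller.
 */
static int toy_blk_end_request(struct toy_request *rq, unsigned int nr_bytes)
{
	assert(rq != NULL);
	if (nr_bytes < rq->bytes_left) {
		rq->bytes_left -= nr_bytes;
		return 1;
	}
	rq->bytes_left = 0;
	return 0;
}

/* A driver completing a request chunk by chunk; returns the call count. */
static int complete_in_chunks(struct toy_request *rq, unsigned int chunk)
{
	int calls = 0;

	assert(chunk > 0);      /* a zero chunk would never finish */
	do {
		calls++;
	} while (toy_blk_end_request(rq, chunk));
	return calls;
}
```

The driver never needs a separate "last" call: the same function
reports, through its return value, whether it still owns the request.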

EXAMPLE CODE
============

Request-based Device-mapper multipath patch-set is attached as appendix,
although it still needs some work and isn't ready for review.
It checks error of a request and retries the request using other paths
if error is detected, before completing bios in the request.
(See clone_end_request() in appendix#1.)
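
The retry idea can be pictured with a small userspace model. To be
clear, this is not clone_end_request() from the appendix, which is not
reproduced here; struct toy_clone and toy_multipath_end_io() are
hypothetical names, and path selection is reduced to "try the next
index".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a cloned request sent down one path. */
struct toy_clone {
	int path;        /* index of the path currently in use */
	int nr_paths;    /* number of paths available */
	int completed;   /* 1 once the result went to the upper layer */
	int result;      /* error code reported upward, 0 for success */
};

/*
 * Runs at the top of completion, before any bio is completed.
 * Returns 1 if the request was retried and is still in flight,
 * 0 if it was completed and the result was reported.
 */
static int toy_multipath_end_io(struct toy_clone *clone, int error)
{
	assert(clone != NULL);
	if (error && clone->path + 1 < clone->nr_paths) {
		clone->path++;          /* resubmit on the next path */
		return 1;
	}
	clone->result = error;          /* success, or no path left */
	clone->completed = 1;
	return 0;
}
```

Because the hook runs before the bios are completed, the upper layer
sees an error only after every path has been tried.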

Thanks,
Kiyoshi Ueda

Re: Heads Up: Next Batch Of Serial/TTY Changes

2007-08-31 Thread Alan Cox

> Who knows what other gremlins like this now live in the tree :-)
> 
> There was a similar spot a few lines down, both fixed
> as follows:

Thanks, I'll take a harder look over those. My test suite didn't check
them; it only checked that the main termios ioctls didn't scribble.

> And here is the sparc patch, could you please add it to
> your serial queue?  Thanks Alan.
> 
> [SPARC]: Add support for arbitrary serial speeds.
> 
> Signed-off-by: David S. Miller <[EMAIL PROTECTED]>cc[VEOF]); \


Thanks

Alan

Re: maturity and status and attributes, oh my!

2007-08-31 Thread Stefan Richter

Robert P. J. Day wrote:
...
> attributes would have two critical and non-negotiable properties:
> 
> 1) they would be entirely orthogonal to one another, and
> 2) they can be assigned at most one of a pre-defined set of values

If they are fully orthogonal to one another, then they are also
nonexclusive.  You want them to be mutually exclusive, not orthogonal.

...
>   experimental -> normal (stable) -> deprecated -> obsolete
> 
>   it's a natural progression and, at any point, a feature cannot
> possibly have more than one maturity value.  it would be as absurd as
> saying that someone was a teenager *and* was a twenty-something at the
> same time.

Keep in mind though that 'experimental', in the context of Linux kernel
features, has nothing to do with the age of a feature.
-- 
Stefan Richter
-=-=-=== =--= =
http://arcgraph.de/sr/

[PATCH] pci.h: Add PCI identifiers for mainpine cards

2007-08-31 Thread Alan Cox

> drivers/serial/8250_pci.c:2584: error: 'PCI_VENDOR_ID_MAINPINE' undeclared here (not in a function)
> drivers/serial/8250_pci.c:2584: error: 'PCI_DEVICE_ID_MAINPINE_PBRIDGE' undeclared here (not in a function)

Doh.

Signed-off-by: Alan Cox <[EMAIL PROTECTED]>

diff -u --new-file --recursive --exclude-from /usr/src/exclude linux.vanilla-2.6.23rc3-mm1/include/linux/pci_ids.h linux-2.6.23rc3-mm1/include/linux/pci_ids.h
--- linux.vanilla-2.6.23rc3-mm1/include/linux/pci_ids.h 2007-08-22 17:23:14.0 +0100
+++ linux-2.6.23rc3-mm1/include/linux/pci_ids.h 2007-08-22 17:50:52.0 +0100
@@ -1976,6 +1977,8 @@
 #define PCI_VENDOR_ID_TOPIC		0x151f
 #define PCI_DEVICE_ID_TOPIC_TP560  0x
 
+#define PCI_VENDOR_ID_MAINPINE 0x1522
+#define PCI_DEVICE_ID_MAINPINE_PBRIDGE 0x0100
 #define PCI_VENDOR_ID_ENE  0x1524
 #define PCI_DEVICE_ID_ENE_CB712_SD 0x0550
 #define PCI_DEVICE_ID_ENE_CB712_SD_2   0x0551



Re: [ANNOUNCE] DeskOpt - on fly task, i/o scheduler optimization

2007-08-31 Thread Chris Snook


Michal Piotrowski wrote:
> Hi,
>
> Here is something that might be useful for gamers and audio/video editors
> http://www.stardust.webpages.pl/files/tools/deskopt/
>
> You can easily tune CFS/CFQ scheduler params


I would think that gamers and AV editors would want to be using deadline 
(or maybe even as), not cfq.  How well does it work with other I/O 
schedulers?


-- Chris

Re: [PATCH] net/, drivers/net/ , missing EXPERIMENTAL in menus

2007-08-31 Thread Robert P. J. Day

On Fri, 31 Aug 2007, Jeff Garzik wrote:

> 'deprecated' and 'obsolete' are matters of discussed opinion,
> describing the utility of the code in question.  'broken' describes
> the state of the code itself.
>
> Clear difference.

precisely.  thank you for making my point for me.

rday
-- 

Robert P. J. Day
Linux Consulting, Training and Annoying Kernel Pedantry
Waterloo, Ontario, CANADA

http://crashcourse.ca


Re: Heads Up: Next Batch Of Serial/TTY Changes

2007-08-31 Thread David Miller

From: Alan Cox <[EMAIL PROTECTED]>
Date: Fri, 31 Aug 2007 23:16:13 +0100

> I don't see a real problem. You aren't using
> 
>   c_cflags & CBAUD = 0x1000
> 
> so that could become BOTHER.
> 
> the input bits also appear to be reserved and free ?

Never mind, I missed how you were doing the new termios2
struct.

I'm trying to wrap things up before jumping onto a plane early tomorrow
morning, but I still tried to whip together a patch and, while it was
mostly straightforward, I ran into a few problems.

n_tty_ioctl() for instance:

drivers/char/tty_ioctl.c:799: error: 'struct termios' has no member named 'c_ispeed'

This is calling the copy interface that is supposed to be using
a termios2 when the new interfaces are defined, however:

	case TIOCGLCKTRMIOS:
		if (kernel_termios_to_user_termios((struct termios __user *)arg, real_tty->termios_locked))
			return -EFAULT;
		return 0;

This is going to write over the end of the userspace
structure by a few bytes, and wasn't caught by you yet
because the i386 implementation is simply copy_to_user()
which does zero type checking.
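The overrun described above can be sketched in plain C. This is only an illustration, not the kernel code: `demo_termios`, `demo_termios2` and both helpers are hypothetical stand-ins, and the field sizes are assumptions chosen for the example. It shows how many bytes an untyped copy of the larger layout would write past a buffer sized for the smaller one, and what a field-by-field copy into the old layout looks like.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the old userspace layout (no speed fields). */
struct demo_termios {
	unsigned int c_iflag, c_oflag, c_cflag, c_lflag;
	unsigned char c_line;
	unsigned char c_cc[19];
};

/* Hypothetical stand-in for the new layout: same prefix plus two speeds. */
struct demo_termios2 {
	unsigned int c_iflag, c_oflag, c_cflag, c_lflag;
	unsigned char c_line;
	unsigned char c_cc[19];
	unsigned int c_ispeed, c_ospeed;
};

/* Bytes an untyped copy of the new struct writes past an old-sized buffer. */
static size_t demo_overrun_bytes(void)
{
	return sizeof(struct demo_termios2) - sizeof(struct demo_termios);
}

/* Typed copy: only the fields that exist in the old layout are written. */
static void demo_copy_legacy(struct demo_termios *dst,
			     const struct demo_termios2 *src)
{
	dst->c_iflag = src->c_iflag;
	dst->c_oflag = src->c_oflag;
	dst->c_cflag = src->c_cflag;
	dst->c_lflag = src->c_lflag;
	dst->c_line = src->c_line;
	memcpy(dst->c_cc, src->c_cc, sizeof(dst->c_cc));
}
```

Because the destination type is spelled out in `demo_copy_legacy()`, passing a pointer to the wrong structure would be a compile-time diagnostic rather than a silent overwrite.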

Who knows what other gremlins like this now live in the tree :-)

There was a similar spot a few lines down, both fixed
as follows:


diff --git a/drivers/char/tty_ioctl.c b/drivers/char/tty_ioctl.c
index 3423e9e..4a8969c 100644
--- a/drivers/char/tty_ioctl.c
+++ b/drivers/char/tty_ioctl.c
@@ -796,14 +796,14 @@ int n_tty_ioctl(struct tty_struct * tty, struct file * file,
 			retval = inq_canon(tty);
 			return put_user(retval, (unsigned int __user *) arg);
 		case TIOCGLCKTRMIOS:
-			if (kernel_termios_to_user_termios((struct termios __user *)arg, real_tty->termios_locked))
+			if (kernel_termios_to_user_termios_1((struct termios __user *)arg, real_tty->termios_locked))
 				return -EFAULT;
 			return 0;
 
 		case TIOCSLCKTRMIOS:
 			if (!capable(CAP_SYS_ADMIN))
 				return -EPERM;
-			if (user_termios_to_kernel_termios(real_tty->termios_locked, (struct termios __user *) arg))
+			if (user_termios_to_kernel_termios_1(real_tty->termios_locked, (struct termios __user *) arg))
 				return -EFAULT;
 			return 0;
 


And here is the sparc patch, could you please add it to
your serial queue?  Thanks Alan.

[SPARC]: Add support for arbitrary serial speeds.

Signed-off-by: David S. Miller <[EMAIL PROTECTED]>

diff --git a/include/asm-sparc/ioctls.h b/include/asm-sparc/ioctls.h
index bdf77b0..058c206 100644
--- a/include/asm-sparc/ioctls.h
+++ b/include/asm-sparc/ioctls.h
@@ -15,6 +15,10 @@
 #define TCSETS		_IOW('T', 9, struct termios)
 #define TCSETSW		_IOW('T', 10, struct termios)
 #define TCSETSF		_IOW('T', 11, struct termios)
+#define TCGETS2		_IOR('T', 12, struct termios2)
+#define TCSETS2		_IOW('T', 13, struct termios2)
+#define TCSETSW2	_IOW('T', 14, struct termios2)
+#define TCSETSF2	_IOW('T', 15, struct termios2)
 
 /* Note that all the ioctls that are not available in Linux have a 
  * double underscore on the front to: a) avoid some programs to
diff --git a/include/asm-sparc/termbits.h b/include/asm-sparc/termbits.h
index 5eb00a1..90cf221 100644
--- a/include/asm-sparc/termbits.h
+++ b/include/asm-sparc/termbits.h
@@ -31,6 +31,18 @@ struct termios {
 #endif
 };
 
+struct termios2 {
+	tcflag_t c_iflag;	/* input mode flags */
+	tcflag_t c_oflag;	/* output mode flags */
+	tcflag_t c_cflag;	/* control mode flags */
+	tcflag_t c_lflag;	/* local mode flags */
+	cc_t c_line;		/* line discipline */
+	cc_t c_cc[NCCS];	/* control characters */
+	cc_t _x_cc[2];		/* padding to match ktermios */
+	speed_t c_ispeed;	/* input speed */
+	speed_t c_ospeed;	/* output speed */
+};
+
 struct ktermios {
tcflag_t c_iflag;   /* input mode flags */
tcflag_t c_oflag;   /* output mode flags */
@@ -160,6 +172,7 @@ struct ktermios {
 #define CLOCAL   0x0800
 #define CBAUDEX   0x1000
 /* We'll never see these speeds with the Zilogs, but for completeness... */
+#define  BOTHER   0x1000
 #define  B57600   0x1001
 #define  B115200  0x1002
 #define  B230400  0x1003
@@ -189,6 +202,8 @@ struct ktermios {
 #define CMSPAR   0x4000  /* mark or space (stick) parity */
 #define CRTSCTS  0x8000  /* flow control */
 
+#define IBSHIFT  16/* Shift from CBAUD to CIBAUD */

Re: [PATCH] 8250_pci: Autodetect mainpine cards

2007-08-31 Thread Andrew Morton

On Wed, 22 Aug 2007 23:05:27 +0100
Alan Cox <[EMAIL PROTECTED]> wrote:

> Add support for a whole range of boards. Some are partly autodetected but
> not fully correctly others (PCI Express notably) not at all. Stick all
> the right entries in.
> 
> Thanks to Mainpine for information and testing.

drivers/serial/8250_pci.c:2584: error: 'PCI_VENDOR_ID_MAINPINE' undeclared here (not in a function)
drivers/serial/8250_pci.c:2584: error: 'PCI_DEVICE_ID_MAINPINE_PBRIDGE' undeclared here (not in a function)

Re: Heads Up: Next Batch Of Serial/TTY Changes

2007-08-31 Thread Jeff Garzik

Alan Cox wrote:
> On Fri, 31 Aug 2007 14:41:15 -0700 (PDT)
> David Miller <[EMAIL PROTECTED]> wrote:
>
> > From: Alan Cox <[EMAIL PROTECTED]>
> > Date: Fri, 31 Aug 2007 22:11:05 +0100
> >
> > > Firstly some architecture maintainers still haven't updated their
> > > platform for arbitrary tty speeds. The kernel is going to start whining
> > > and issuing warnings on your platform if you don't keep up with the
> > > programme (it's been 6 months).
> >
> > I took a look at this for sparc and I'm currently balking the same way
> > you did :-) The current bit usage on sparc just doesn't work properly
> > for what you're trying to do.
>
> I don't see a real problem. You aren't using
>
> c_cflags & CBAUD = 0x1000
>
> so that could become BOTHER.

Pooh says:  oh, bother!

Jeff


Re: libata not working for sis5533

2007-08-31 Thread Patrizio Bassi

Patrizio Bassi wrote:
> Michal Piotrowski wrote:
>> Hi,
>>
>> [Adding IDE wizards to CC]
>>
>> On 26/08/07, Patrizio Bassi <[EMAIL PROTECTED]> wrote:
>>   
>>> My sis630 chipset shipped with Asus A1000
>>> doesn't work properly with suspend with ide drivers
>>> (http://bugzilla.kernel.org/show_bug.cgi?id=7077)
>>>
>>> i tried to switch to libata but i cannot boot.
>>> I've enabled generic ide and sis specific code, both in-kernel. of
>>> course scsi too.
>>>
>>> when i boot i get: irq #14 nobody cared and stop
>>>
>>> i have to remove battery to reboot pc.
>>> I'm using 2.6.22.5, but i never got any libata kernel working.
>>>
>>> Patrizio
>>>
>>> ps. i'm writing from my desktop as i'm doing hardware maintenance on
>>> the laptop and could not boot it
>>>
>>> Please CC me.
>>>
>>> lspci:
>>> 00:00.0 Host bridge: Silicon Integrated Systems [SiS] 630 Host (rev 11)
>>> 00:00.1 IDE interface: Silicon Integrated Systems [SiS] 5513 [IDE] (rev d0)
>>> 00:01.0 ISA bridge: Silicon Integrated Systems [SiS] SiS85C503/5513 (LPC
>>> Bridge)
>>> 00:01.1 Ethernet controller: Silicon Integrated Systems [SiS] SiS900
>>> PCI Fast Ethernet (rev 80)
>>> 00:01.2 USB Controller: Silicon Integrated Systems [SiS] USB  1.0
>>> Controller (rev 07)
>>> 00:01.3 USB Controller: Silicon Integrated Systems [SiS] USB 1.0
>>> Controller (rev 07)
>>> 00:01.4 Multimedia audio controller: Silicon Integrated Systems [SiS]
>>> SiS PCI Audio Accelerator (rev 01)
>>>  00:01.6 Modem: Silicon Integrated Systems [SiS] AC'97 Modem
>>> Controller (rev a0)
>>> 00:02.0 PCI bridge: Silicon Integrated Systems [SiS] Virtual
>>> PCI-to-PCI bridge (AGP)
>>> 00:0a.0 CardBus bridge: Ricoh Co Ltd RL5c476 II (rev 80)
>>> 00:0a.1 CardBus bridge: Ricoh Co Ltd RL5c476 II (rev 80)
>>> 01:00.0 VGA compatible controller: Silicon Integrated Systems [SiS]
>>> 630/730 PCI/AGP VGA Display Adapter (rev 11)
>>> 
>>
>> Regards,
>> Michal
>>
>>   
>
>
I've been away for a week but found no reply. Did I lose any email, or
has there been no activity on this issue?

Patrizio


Re: Heads Up: Next Batch Of Serial/TTY Changes

2007-08-31 Thread Alan Cox

On Fri, 31 Aug 2007 14:41:15 -0700 (PDT)
David Miller <[EMAIL PROTECTED]> wrote:

> From: Alan Cox <[EMAIL PROTECTED]>
> Date: Fri, 31 Aug 2007 22:11:05 +0100
> 
> > Firstly some architecture maintainers still haven't updated their
> > platform for arbitrary tty speeds. The kernel is going to start whining
> > and issuing warnings on your platform if you don't keep up with the
> > programme (it's been 6 months).
> 
> I took a look at this for sparc and I'm currently balking the same way
> you did :-) The current bit usage on sparc just doesn't work properly
> for what you're trying to do.

I don't see a real problem. You aren't using

c_cflags & CBAUD = 0x1000

so that could become BOTHER.

the input bits also appear to be reserved and free ?
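A rough sketch of the idea under discussion: one otherwise unused value of the baud-rate selector field is given the meaning "take the rate from the explicit speed field". Everything below is illustrative. The `DEMO_*` constants are made-up values, not any architecture's real `termbits.h` definitions, and `demo_termios2` keeps only the two fields needed for the example.

```c
#include <stdint.h>

/* Illustrative values only; real architectures define their own. */
#define DEMO_CBAUD	0x100fu	/* mask covering the baud selector bits */
#define DEMO_B9600	0x000du	/* table entry: 9600 bps */
#define DEMO_B38400	0x000fu	/* table entry: 38400 bps */
#define DEMO_BOTHER	0x1000u	/* selector: use c_ospeed as given */

/* Hypothetical cut-down structure with just the fields used here. */
struct demo_termios2 {
	uint32_t c_cflag;
	uint32_t c_ospeed;
};

/* Decode the output speed: fixed table unless BOTHER is selected. */
static uint32_t demo_output_speed(const struct demo_termios2 *t)
{
	switch (t->c_cflag & DEMO_CBAUD) {
	case DEMO_BOTHER:
		return t->c_ospeed;
	case DEMO_B9600:
		return 9600;
	case DEMO_B38400:
		return 38400;
	default:
		return 0;	/* selector not known to this sketch */
	}
}

/* Request an arbitrary rate while leaving the other cflag bits alone. */
static void demo_set_custom_speed(struct demo_termios2 *t, uint32_t bps)
{
	t->c_cflag &= ~DEMO_CBAUD;
	t->c_cflag |= DEMO_BOTHER;
	t->c_ospeed = bps;
}
```

The point of the sketch is that the scheme only works if the chosen selector value is not already taken by a table entry, which is the question being raised for sparc.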


Re: maturity and status and attributes, oh my!

2007-08-31 Thread Jeff Garzik


Robert P. J. Day wrote:
>   it may be that some people had a different understanding of what was
> meant by "maturity" than i did.  what *i* meant by that attribute is
> a feature's current position in the normal software life cycle, and
> that would be one of:
>
>   experimental -> normal (stable) -> deprecated -> obsolete


People who actually write and maintain code disagree with your 
imaginative interpretation.


Jeff, who notes the snipping of CC from previous thread






Re: [PATCH] net/, drivers/net/ , missing EXPERIMENTAL in menus

2007-08-31 Thread Jeff Garzik


Robert P. J. Day wrote:
> On Fri, 31 Aug 2007, Jeff Garzik wrote:
>
> > Robert P. J. Day wrote:
> >
> > > i'm sure i'm going to get shouted down here, but i really disagree
> > > with "BROKEN" being considered a "maturity level".  IMHO, things
> > > like EXPERIMENTAL, DEPRECATED and OBSOLETE represent maturity
> > > levels, for what i think are obvious reasons.
> > >
> > > something like BROKEN, though, has *nothing* to do with maturity.
> > > a feature can be any of those maturity levels, and simultaneously
> > > be BROKEN.  i consider BROKEN to be what i call a "status", and
> > > different status levels might be the default of normal, or
> > > KIND_OF_FLAKY or TOTALLY_BORKED -- that's where BROKEN would fit
> > > in.
> >
> > BROKEN is definitely a maturity level.
>
> no.  it's not.  end of discussion.  you're wrong.
>
> the concept of "maturity level" reflects where in the life cycle some
> feature is.  it will typically start as "bleeding edge" or
> "experimental" or something like that, eventually stabilize to be
> normal (which would be the obvious default), after which, when its
> value starts to run out and it begins showing its age, it becomes
> "deprecated" and eventually "obsolete".  it's a natural and obvious
> progression.
>
> on the other hand, a feature can be "broken" at *any* point in that
> life cycle -- that's why it is absolutely *not* a maturity level.
> please don't fight with me on this, jeff.  you're simply wrong.

Get off your high horse and actually look at the patches that mark
things BROKEN.

'deprecated' and 'obsolete' are matters of discussed opinion,
describing the utility of the code in question.  'broken' describes the
state of the code itself.


Clear difference.

Jeff, one who actually marks this stuff as such




Re: [-mm PATCH] Memory controller improve user interface (v2)

2007-08-31 Thread Balbir Singh

On 9/1/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> On Fri, 31 Aug 2007 00:22:46 +0530
> Balbir Singh <[EMAIL PROTECTED]> wrote:
>
> > +/*
> > + * Strategy routines for formating read/write data
> > + */
> > +int mem_container_read_strategy(unsigned long long val, char *buf)
> > +{
> > + return sprintf(buf, "%llu Bytes\n", val);
> > +}
>
> It's a bit cheesy to be printing the units like this.  It's better to just
> print the raw number.
>
> If you really want to remind the user what units that number is in (not a
> bad idea) then it can be encoded in the filename, like
> /proc/sys/vm/min_free_kbytes, /proc/sys/vm/dirty_expire_centisecs, etc.
>

Sounds good, I'll change the file to memory.limit_in_bytes and
memory.usage_in_bytes.

>
> > +int mem_container_write_strategy(char *buf, unsigned long long *tmp)
> > +{
> > + *tmp = memparse(buf, &buf);
> > + if (*buf != '\0')
> > + return -EINVAL;
> > +
> > + printk("tmp is %llu\n", *tmp);
>
> don't think we want that.
>

Yes, I'll redo the patch and resend.

> > + /*
> > +  * Round up the value to the closest page size
> > +  */
> > + *tmp = ((*tmp + PAGE_SIZE - 1) >> PAGE_SHIFT) << PAGE_SHIFT;
> > + return 0;
> > +}

Thanks,
Balbir
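The two review points (print the raw number, and round a written limit up to a whole page) can be sketched in userspace C. This is a minimal sketch under stated assumptions, not the patch itself: `demo_parse_limit()` is a hypothetical stand-in for the kernel's `memparse()` built on `strtoull()`, the 4 KiB page size is assumed, and overflow of very large values is not handled.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define DEMO_PAGE_SHIFT	12	/* assumption: 4 KiB pages */
#define DEMO_PAGE_SIZE	(1ULL << DEMO_PAGE_SHIFT)

/*
 * Parse "4096", "64K", "2M" or "1G" into bytes and round the result up
 * to a multiple of the page size. Trailing junk is rejected.
 */
static int demo_parse_limit(const char *buf, unsigned long long *out)
{
	char *end;
	unsigned long long val;

	errno = 0;
	val = strtoull(buf, &end, 0);
	if (end == buf || errno)
		return -EINVAL;

	switch (*end) {
	case 'G': case 'g':
		val <<= 10;
		/* fall through */
	case 'M': case 'm':
		val <<= 10;
		/* fall through */
	case 'K': case 'k':
		val <<= 10;
		end++;
		break;
	default:
		break;
	}
	if (*end != '\0')
		return -EINVAL;

	*out = ((val + DEMO_PAGE_SIZE - 1) >> DEMO_PAGE_SHIFT) << DEMO_PAGE_SHIFT;
	return 0;
}

/* Emit the raw number only; the unit lives in the file name. */
static int demo_format_bytes(unsigned long long val, char *buf, size_t len)
{
	return snprintf(buf, len, "%llu\n", val);
}
```

With this shape a reader of a file such as memory.limit_in_bytes gets a value that scripts can consume directly, with no unit suffix to strip.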

maturity and status and attributes, oh my!

2007-08-31 Thread Robert P. J. Day


  at the risk of driving everyone here totally bonkers, i'm going to
take one last shot at explaining what i was thinking of when i first
proposed this whole "maturity level" thing.  and, just so you know,
the major reason i'm so cranked up about this is that i'm feeling just
a little territorial -- i was the one who first started nagging people
to consider this idea, so i'm a little edgy when i see folks finally
giving it some serious thought but appearing to get ready to implement
it entirely incorrectly in a way that's going to ruin it irreparably
and make it utterly useless.

  this isn't just about defining a single feature called "maturity".
it's about defining a general mechanism so that you can add entirely
new (what i call) "attributes" to kernel features.  one attribute
could be "maturity", which could take one of a number of possible
values.  another could be "status", with the same restrictions.
heck, you could define the attribute "colour", and decide that various
kernel features could be labelled as (at most) one of "red", "green"
and "chartreuse."  that's what i mean by an "attribute", and
attributes would have two critical and non-negotiable properties:

1) they would be entirely orthogonal to one another, and
2) they can be assigned at most one of a pre-defined set of values


  that's it.  it's really that simple and simon's earlier patch i
think fits that almost perfectly.  now, back to the disagreement.

  it may be that some people had a different understanding of what was
meant by "maturity" than i did.  what *i* meant by that attribute is
a feature's current position in the normal software life cycle, and
that would be one of:

  experimental -> normal (stable) -> deprecated -> obsolete

  it's a natural progression and, at any point, a feature cannot
possibly have more than one maturity value.  it would be as absurd as
saying that someone was a teenager *and* was a twenty-something at the
same time.  not possible.  and restricting an attribute to a single
value makes definitions and processing *way* easier down the road.
(and note that a feature's maturity says *nothing* about its current
level of quality.  that's next.)

  another attribute can then be what i was calling "status" but could
also be called "quality".   *that* is where you could categorize a
feature as one of FLAKY, BROKEN and so on.  that's an entirely
independent categorization from maturity, which means you could have
features that were both experimental and flaky, or deprecated and
broken, or what have you.  and those settings would be done with
separate Kconfig directives:

config WHATEVER
maturity DEPRECATED
status BROKEN

  from a quick perusal, simon's patch looked pretty much dead-on
(except for that teeth-grinding maturity level of BROKEN :-).  but
other than that, it looked good, although i'll have to go back later
and look more closely.

  but i hope i've flogged this thoroughly to the point where people
can see what i'm driving at.  once you see (as in simon's patch) how
to add the first attribute, it's trivial to simply duplicate that code
to add as many more as you want.
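One way to picture the proposal is as a record with one enum field per attribute: each attribute holds exactly one value, and the fields are set independently of each other. The sketch below is purely illustrative. The type names, the value lists and the prompt format are assumptions made for the example and are not taken from simon's patch or from the kconfig code.

```c
#include <stddef.h>
#include <stdio.h>

/* One value at a time per attribute; zero is the unmarked default. */
enum demo_maturity {
	DEMO_NORMAL = 0,
	DEMO_EXPERIMENTAL,
	DEMO_DEPRECATED,
	DEMO_OBSOLETE,
};

enum demo_status {
	DEMO_OK = 0,
	DEMO_FLAKY,
	DEMO_BROKEN,
};

/* Hypothetical per-feature record: the two attributes are independent. */
struct demo_feature {
	const char *name;
	enum demo_maturity maturity;
	enum demo_status status;
};

static const char *demo_maturity_tag(enum demo_maturity m)
{
	switch (m) {
	case DEMO_EXPERIMENTAL:	return " (EXPERIMENTAL)";
	case DEMO_DEPRECATED:	return " (DEPRECATED)";
	case DEMO_OBSOLETE:	return " (OBSOLETE)";
	default:		return "";
	}
}

static const char *demo_status_tag(enum demo_status s)
{
	switch (s) {
	case DEMO_FLAKY:	return " (FLAKY)";
	case DEMO_BROKEN:	return " (BROKEN)";
	default:		return "";
	}
}

/* Build a prompt such as "WHATEVER (DEPRECATED) (BROKEN)". */
static int demo_prompt(const struct demo_feature *f, char *buf, size_t len)
{
	return snprintf(buf, len, "%s%s%s", f->name,
			demo_maturity_tag(f->maturity),
			demo_status_tag(f->status));
}
```

Adding a third attribute in this model means adding one more enum and one more field, which matches the claim that further attributes are cheap once the first exists.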

rday

-- 

Robert P. J. Day
Linux Consulting, Training and Annoying Kernel Pedantry
Waterloo, Ontario, CANADA

http://crashcourse.ca


Re: Follow up to: NFS/RPC Hangs after updating time...

2007-08-31 Thread J. Bruce Fields

On Fri, Aug 31, 2007 at 02:35:19PM -0400, Morrison, Tom wrote:
> This is a follow-up...
> 
> After a huge pain in the rear upgrading from a 
> 2.6.11++ to a 2.6.23-rc3 (I'll give the powerpc
> folks a 'piece' of my mind on that front) - the 
> NFS hang problem that I was experiencing on the 
> older kernel is NOT occurring on this new version.
> 
> Now what do I do?

Well, between the time jump and the rpc debugging output, you've got
some great clues there--given some time I'm sure it would be possible to
completely figure out what's going on.

Unfortunately the people with the most knowledge of the code probably
don't have the time to fix problems on old kernels, so unless somebody
else recognizes the problem immediately, I'm not sure what to suggest.
Obviously, a wholesale upgrade to a more recent kernel would be the one
sure bet.

> Is the net/sunrpc net/nfsx pieces isolated enough 
> from the rest of the kernel that I could fork-lift 
> it back to the 2.6.11 (or is that really a lost cause).

I suspect it's a lost cause.  A lot has happened in the last couple
years.

--b.

> > It hangs after attempting to update the time from a 
> > nonsensical time (e.g.: 2 months ago) - the most significant
> > part of it is that it only hangs IFF it has started 
> > serving its NFS client boards before I attempt to 
> > update the time.
> > 
> > 
> > The most significant output (when turning on 
> > RPC debugging) is from:
> > 
> >   linux/net/sunrpc/cache.c (cache_check) - line 90:
> > 
> >  >> Want update, refage=1800, age=4288285
> > 
> > It continually loops through this method - and the cache
> > never gets updated...even though, with some additional
> > sleuthing (aka: additional debug printks), it thinks
> > that there is a cache update pending.
> 
> Can you reproduce the problem with the current kernel? (Say 2.6.22 or
> later?)
> 
> --b.

Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-08-31 Thread Alasdair G Kergon

On Thu, Aug 30, 2007 at 04:20:35PM -0700, Daniel Phillips wrote:
> Resubmitting a bio or submitting a dependent bio from 
> inside a block driver does not need to be throttled because all 
> resources required to guarantee completion must have been obtained 
> _before_ the bio was allowed to proceed into the block layer.

I'm toying with the idea of keeping track of the maximum device stack
depth for each stacked device, and only permitting it to increase in
controlled circumstances.

Alasdair
-- 
[EMAIL PROTECTED]

Re: [1/2] 2.6.23-rc3: known regressions with patches

2007-08-31 Thread Len Brown

On Wednesday 29 August 2007 11:28, Michal Piotrowski wrote:

> ACPI
> 
> Subject : the fan doesn't work any more
> References  : http://lkml.org/lkml/2007/8/28/359
> Last known good : ?
> Submitter   : Daniel Ritz <[EMAIL PROTECTED]>
> Caused-By   : Alexey Starikovskiy <[EMAIL PROTECTED]>
>   commit cd8c93a4e04dce8f00d1ef3a476aac8bd65ae40b
> Handled-By  : Alexey Starikovskiy <[EMAIL PROTECTED]>
> Patch   : http://lkml.org/lkml/2007/8/29/15
> Status  : patch was suggested

I believe that this is gone as of 2.6.23-rc4-git3

http://bugzilla.kernel.org/show_bug.cgi?id=8958

thanks,
-Len

Re: Heads Up: Next Batch Of Serial/TTY Changes

2007-08-31 Thread David Miller

From: Alan Cox <[EMAIL PROTECTED]>
Date: Fri, 31 Aug 2007 22:11:05 +0100

> Firstly some architecture maintainers still haven't updated their
> platform for arbitrary tty speeds. The kernel is going to start whining
> and issuing warnings on your platform if you don't keep up with the
> programme (it's been 6 months).

I took a look at this for sparc and I'm currently balking the same way
you did :-) The current bit usage on sparc just doesn't work properly
for what you're trying to do.

RE: Nonblocking call may block in a mutex? Nonblocking call after poll may fail?

2007-08-31 Thread David Schwartz


> If this output-buffer has "4-bytes space remaining for process A",
> then a non-blocking write of process A could still encounter a locked
> mutex, if process B is busy writing to the output-buffer.

Of course.

> Should process A now block/sleep until that mutex is free and it can
> access the output-buffer (and it's 4 bytes space)?

That depends on how long the other process might hold the mutex. If it's
just the time it takes to copy the buffer and do fast things, then it should
wait. If it might be a long time, then it's probably better not to block, as
the process requested.

> What about a non-blocking (write-) poll of process A: if the poll call
> succeeds (the output buffer has space remaining for process A), and
> process A now performs a non-blocking write: what happens if A
> encounters a blocked mutex, since process B is busy writing to the
> output-buffer.
> a) Should A block until the mutex is available?

Probably, unless the mutex is one that could be held for a very long time.
It really depends upon what semantics make sense with your driver. Is the
wait so short it should be considered not blocking or is it potentially long
enough that it should be avoided?

A non-blocking call does not mean it must never ever lose the CPU at all
under any circumstances. It just means no waits for "too long".

> b) Should A return -EAGAIN, even though the poll call succeeded?

If the wait would be for too long, then yes.

> c) Should it be impossible for this to happen! i.e. -> should process
> A already "have" the mutex in question, when the poll call succeeds
> (thus preventing B from writing to the output buffer)

No. Functions like 'poll' and 'select' are just status-reporting functions.
They should not change the semantics of other operations unless that's
unavoidable.

> For c) What if process A "has" the mutex, but never does the
> non-blocking write. Then no process can write, since the mutex is held
> by process A...

Right. That's why return values from 'poll' and 'select' don't guarantee
future behavior.
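Option (b) above can be sketched in userspace C with POSIX threads. This is a hedged illustration, not driver code: `demo_outbuf`, `demo_poll_writable()` and `demo_write_nonblock()` are hypothetical names, and the sketch takes the strict reading where any contention on the lock is treated as "too long" and reported as -EAGAIN. A real driver might instead wait on a lock that is only ever held briefly.

```c
#include <errno.h>
#include <pthread.h>
#include <string.h>
#include <sys/types.h>

#define DEMO_BUF_SIZE 16

/* Hypothetical shared output buffer protected by one mutex. */
struct demo_outbuf {
	pthread_mutex_t lock;
	size_t used;
	char data[DEMO_BUF_SIZE];
};

/* Status report only: says whether there was room, reserves nothing. */
static int demo_poll_writable(struct demo_outbuf *b)
{
	int writable;

	pthread_mutex_lock(&b->lock);
	writable = b->used < DEMO_BUF_SIZE;
	pthread_mutex_unlock(&b->lock);
	return writable;
}

/*
 * Non-blocking write: if another writer holds the lock, or the buffer
 * filled up since the last poll, return -EAGAIN instead of sleeping.
 * Otherwise copy as much as fits and return the byte count.
 */
static ssize_t demo_write_nonblock(struct demo_outbuf *b,
				   const char *src, size_t len)
{
	size_t room;

	if (pthread_mutex_trylock(&b->lock) != 0)
		return -EAGAIN;

	room = DEMO_BUF_SIZE - b->used;
	if (room == 0) {
		pthread_mutex_unlock(&b->lock);
		return -EAGAIN;
	}
	if (len > room)
		len = room;
	memcpy(b->data + b->used, src, len);
	b->used += len;
	pthread_mutex_unlock(&b->lock);
	return (ssize_t)len;
}
```

A caller therefore has to be ready for -EAGAIN even right after a successful poll, which is the behaviour described in the answers to (b) and (c).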

DS



Re: jffs2 deadlock introduced in linux 2.6.22.5

2007-08-31 Thread Jesper Juhl

On 31/08/2007, Jason Lunz <[EMAIL PROTECTED]> wrote:
> On Thu, Aug 30, 2007 at 11:23:55AM -0700, Jason Lunz wrote:
> > commit 1d8715b388c978b0f1b1bf4812fcee0e73b023d7 was added between
> > 2.6.22.4 and 2.6.22.5 to cure a locking problem, but it seems to have
> > introduced another (worse?) one.
>
> I spoke too soon. I checked more carefully, and this problem was
> introduced somewhere between 2.6.21 and 2.6.22. The jffs2 fix in
> 2.6.22.5 isn't the culprit.
>

Sounds like this belongs on the regression tracking page at
http://kernelnewbies.org/known_regressions and/or in a bugzilla
(http://bugzilla.kernel.org/) bugreport so it doesn't get lost.

-- 
Jesper Juhl <[EMAIL PROTECTED]>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please  http://www.expita.com/nomime.html

Re: [PATCH] net/, drivers/net/ , missing EXPERIMENTAL in menus

2007-08-31 Thread Randy Dunlap

On Fri, 31 Aug 2007 17:00:57 -0400 (EDT) Robert P. J. Day wrote:

> On Fri, 31 Aug 2007, Randy Dunlap wrote:
> 
> > What I like about the patch is that it associates some kconfig
> > symbol with prompt strings, so that we don't have to edit
> > "(EXPERIMENTAL)" all the darn time (e.g.).
> >
> > I'd be quite happy with calling it "status" rather than "maturity",
> > and with being able to use multiple of the status tags at one time,
> > such as
> >
> > config FOO
> > depends on BAR
> > status OBSOLETE BROKEN
> 
> g ... i already made my point in my earlier post.  i'd
> really, really like it if *this* attribute remained as "maturity".  an
> entirely *separate* attribute could be defined as a feature "status",
> which would be entirely orthogonal to maturity level, so that the
> above would be written as
> 
>   maturity OBSOLETE
>   status BROKEN
> 
> there's a reason for this -- any feature should have exactly *one*
> value for any attribute.  that is, in terms of maturity, a feature
> could be EXPERIMENTAL *or* DEPRECATED *or* OBSOLETE.  it ***can't***
> be more than one, as in both DEPRECATED *and* OBSOLETE.  to allow that
> flexibility is to descend into absurdity.


If Simon (or anyone else) continues to work on it, I'll leave this
decision up to them...


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

Re: jffs2 deadlock introduced in linux 2.6.22.5

2007-08-31 Thread Jason Lunz

On Thu, Aug 30, 2007 at 11:23:55AM -0700, Jason Lunz wrote:
> commit 1d8715b388c978b0f1b1bf4812fcee0e73b023d7 was added between
> 2.6.22.4 and 2.6.22.5 to cure a locking problem, but it seems to have
> introduced another (worse?) one.

I spoke too soon. I checked more carefully, and this problem was
introduced somewhere between 2.6.21 and 2.6.22. The jffs2 fix in
2.6.22.5 isn't the culprit.

Jason

[PATCH 5/5] Update Documentation/fb/00-INDEX - add new files, remove entries for deleted ones

2007-08-31 Thread Jesper Juhl


An update to Documentation/fb/00-INDEX is long overdue.
This patch adds entries for new files in the directory 
and removes entries for files that no longer exist. The
files are now also sorted alphabetically.


Signed-off-by: Jesper Juhl <[EMAIL PROTECTED]>
---

 Documentation/fb/00-INDEX |   46 
 1 files changed, 37 insertions(+), 9 deletions(-)

diff --git a/Documentation/fb/00-INDEX b/Documentation/fb/00-INDEX
index 92e89ae..caabbd3 100644
--- a/Documentation/fb/00-INDEX
+++ b/Documentation/fb/00-INDEX
@@ -5,21 +5,49 @@ please mail me.
 
 00-INDEX
 	- this file
+arkfb.txt
+   - info on the fbdev driver for ARK Logic chips.
+aty128fb.txt
+   - info on the ATI Rage128 frame buffer driver.
+cirrusfb.txt
+   - info on the driver for Cirrus Logic chipsets.
+cyblafb/
+   - directory with documentation files related to the cyblafb driver.
+deferred_io.txt
+   - an introduction to deferred IO.
+fbcon.txt
+   - intro to and usage guide for the framebuffer console (fbcon).
 framebuffer.txt
-   - introduction to frame buffer devices
+   - introduction to frame buffer devices.
+imacfb.txt
+   - info on the generic EFI platform driver for Intel based Macs.
+intel810.txt
+   - documentation for the Intel 810/815 framebuffer driver.
+intelfb.txt
+   - docs for Intel 830M/845G/852GM/855GM/865G/915G/945G fb driver.
 internals.txt
-   - quick overview of frame buffer device internals
+   - quick overview of frame buffer device internals.
+matroxfb.txt
+   - info on the Matrox framebuffer driver for Alpha, Intel and PPC.
 modedb.txt
-   - info on the video mode database
-aty128fb.txt
-   - info on the ATI Rage128 frame buffer driver
-clgenfb.txt
-   - info on the Cirrus Logic frame buffer driver
+   - info on the video mode database.
 matroxfb.txt
-   - info on the Matrox frame buffer driver
+   - info on the Matrox frame buffer driver.
 pvr2fb.txt
-   - info on the PowerVR 2 frame buffer driver
+   - info on the PowerVR 2 frame buffer driver.
+pxafb.txt
+   - info on the driver for the PXA25x LCD controller.
+s3fb.txt
+   - info on the fbdev driver for S3 Trio/Virge chips.
+sa1100fb.txt
+   - information about the driver for the SA-1100 LCD controller.
+sisfb.txt
+   - info on the framebuffer device driver for various SiS chips.
+sstfb.txt
+   - info on the frame buffer driver for 3dfx' Voodoo Graphics boards.
 tgafb.txt
    - info on the TGA (DECChip 21030) frame buffer driver
 vesafb.txt
    - info on the VESA frame buffer device
+vt8623fb.txt
+   - info on the fb driver for the graphics core in VIA VT8623 chipsets.
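
The description above says the index should list every file in the directory, drop entries for files that are gone, and be sorted alphabetically. A minimal sketch of how that could be checked is below. It is not part of the patch. The function names parse_index and check_index are hypothetical, and it assumes each entry is an unindented name followed by an indented "- description" line, and that "alphabetically" means plain code-point order.

```python
def parse_index(text):
    """Return entry names from 00-INDEX text.

    Assumption: an entry is an unindented line whose next line is an
    indented "- description"; this skips the free-text preamble.
    """
    lines = text.splitlines()
    names = []
    for cur, nxt in zip(lines, lines[1:]):
        is_name = bool(cur.strip()) and not cur[0].isspace()
        is_desc = nxt[:1].isspace() and nxt.lstrip().startswith("-")
        if is_name and is_desc:
            names.append(cur.strip())
    return names


def check_index(text, dir_names):
    """Compare index entries with the names actually in the directory."""
    names = parse_index(text)
    listed = {n.rstrip("/") for n in names}      # "cyblafb/" matches "cyblafb"
    present = {n.rstrip("/") for n in dir_names}
    return {
        "missing": sorted(present - listed),     # in directory, not in index
        "stale": sorted(listed - present),       # in index, file is gone
        "duplicates": sorted({n for n in names if names.count(n) > 1}),
        "sorted": names == sorted(names),        # code-point order (assumed)
    }
```

In practice dir_names could come from os.listdir() on the documentation directory. The duplicate check is there because an entry can end up listed twice when a sorted copy is added and the old one is kept.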




[PATCH 3/5][resend] Add a 00-INDEX file to Documentation/sysctl/

2007-08-31 Thread Jesper Juhl


Add a 00-INDEX file to Documentation/sysctl/


Signed-off-by: Jesper Juhl <[EMAIL PROTECTED]>
---

 00-INDEX |   16 
 1 file changed, 16 insertions(+)

--- /dev/null   2005-11-21 04:22:37.0 +0100
+++ linux-2.6/Documentation/sysctl/00-INDEX 2007-08-11 23:52:50.0 +0200
@@ -0,0 +1,16 @@
+00-INDEX
+   - this file.
+README
+   - general information about /proc/sys/ sysctl files.
+abi.txt
+   - documentation for /proc/sys/abi/*.
+ctl_unnumbered.txt
+   - explanation of why one should not add new binary sysctl numbers.
+fs.txt
+   - documentation for /proc/sys/fs/*.
+kernel.txt
+   - documentation for /proc/sys/kernel/*.
+sunrpc.txt
+   - documentation for /proc/sys/sunrpc/*.
+vm.txt
+   - documentation for /proc/sys/vm/*.




[PATCH 4/5][resend] Add a 00-INDEX file to Documentation/telephony/

2007-08-31 Thread Jesper Juhl


Add a 00-INDEX file to Documentation/telephony/


Signed-off-by: Jesper Juhl <[EMAIL PROTECTED]>
---

 00-INDEX |    4 
 1 file changed, 4 insertions(+)

--- /dev/null   2005-11-21 04:22:37.0 +0100
+++ linux-2.6/Documentation/telephony/00-INDEX  2007-08-11 23:55:54.0 +0200
@@ -0,0 +1,4 @@
+00-INDEX
+   - this file.
+ixj.txt
+   - document describing the Quicknet drivers.




[PATCH 1/5] Add a missing 00-INDEX file for Documentation/vm/

2007-08-31 Thread Jesper Juhl

This patch adds a 00-INDEX file to Documentation/vm/


Signed-off-by: Jesper Juhl <[EMAIL PROTECTED]>
---

 00-INDEX |   20 
 1 file changed, 20 insertions(+)

--- /dev/null   2005-11-21 04:22:37.0 +0100
+++ Documentation/vm/00-INDEX   2007-08-31 23:16:00.0 +0200
@@ -0,0 +1,20 @@
+00-INDEX
+   - this file.
+balance
+   - various information on memory balancing.
+hugetlbpage.txt
+   - a brief summary of hugetlbpage support in the Linux kernel.
+locking
+   - info on how locking and synchronization is done in the Linux vm code.
+numa
+   - information about NUMA specific code in the Linux vm.
+numa_memory_policy.txt
+   - documentation of concepts and APIs of the 2.6 memory policy support.
+overcommit-accounting
+   - description of the Linux kernels overcommit handling modes.
+page_migration
+   - description of page migration in NUMA systems.
+slabinfo.c
+   - source code for a tool to get reports about slabs.
+slub.txt
+   - a short users guide for SLUB.



