Re: linux-2.6.20-rc4-mm1 Reiser4 filesystem freeze and corruption

2007-01-22 Thread Vince

Zan Lynx wrote:

I have been running 2.6.20-rc2-mm1 without problems, but both rc3-mm1
and rc4-mm1 have been giving me these freezes.  They were happening
inside X and without external console it was impossible to get anything,
plus I was reluctant to test it since the freeze sometimes requires a
full fsck.reiser4 --build-fs to recover the filesystem.

> [...]

Hi,

I don't know if it is related, but I've hit the following BUG on 
2.6.20-rc4-mm1 (+ hot-fix patches applied):


---
kernel BUG at fs/reiser4/plugin/item/extent_file_ops.c:973!
invalid opcode:  [#1]
PREEMPT
last sysfs file: /devices/pci:00/:00:13.0/eth0/statistics/collisions
Modules linked in: binfmt_misc nfs lockd sunrpc radeon drm reiser4 
ati_remote fuse usbhid snd_via82xx snd_ac97_codec ac97_bus snd_pcm_oss 
snd_mixer_oss snd_pcm snd_page_alloc snd_mpu401_uart snd_seq_oss 
snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer 
snd_seq_device ohci1394 ieee1394 psmouse sr_mod cdrom sg ehci_hcd 
via_agp agpgart uhci_hcd usbcore i2c_viapro snd soundcore

CPU:0
EIP:0060:[]Not tainted VLI
EFLAGS: 00010282   (2.6.20-rc4-mm1 #1)
EIP is at reiser4_write_extent+0xd5/0x626 [reiser4]
eax: ccca139c   ebx: 0200   ecx: f5bec400   edx: ffe4
esi:    edi: f5bec414   ebp: da6ff274   esp: e17d7e34
ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Process sstrip (pid: 23858, ti=e17d6000 task=d8ffc570 task.ti=e17d6000)
Stack:  00100100 00200200 00100100 0034 bf826a50 e083ff00
   c000 da6ff2c8 dccba4c0 0005 01ff 021e
     0004 f9b6cdad  0004 0004 0001

Call Trace:
 [] reiser4_update_sd+0x22/0x28 [reiser4]
 [] notify_change+0x200/0x20f
 [] vsscanf+0x1e2/0x3ff
 [] write_unix_file+0x0/0x495 [reiser4]
 [] __remove_suid+0x10/0x14
 [] mark_page_accessed+0x1c/0x2e
 [] reiser4_txn_begin+0x1c/0x2e [reiser4]
 [] reiser4_write_extent+0x0/0x626 [reiser4]
 [] write_unix_file+0x25a/0x495 [reiser4]
 [] __handle_mm_fault+0x2bd/0x79b
 [] write_unix_file+0x0/0x495 [reiser4]
 [] vfs_write+0x8a/0x136
 [] sys_write+0x41/0x67
 [] sysenter_past_esp+0x5f/0x85
 ===
Code: 04 89 0c 24 31 c9 89 5c 24 04 e8 52 fc ff ff 31 d2 e9 59 05 00 00 
64 a1 08 00 00 00 8b 80 b4 04 00 00 8b 40 38 83 78 08 00 74 04 <0f> 0b 
eb fe 8b 8c 24 e0 00 00 00 31 db 8b 01 8b 51 04 89 c1 0f
EIP: [] reiser4_write_extent+0xd5/0x626 [reiser4] SS:ESP 
0068:e17d7e34
 <4>reiser4[sstrip(23858)]: release_unix_file 
(fs/reiser4/plugin/file/file.c:2417)[vs-44]:

WARNING: out of memory?
reiser4[sstrip(23858)]: release_unix_file 
(fs/reiser4/plugin/file/file.c:2417)[vs-44]:

WARNING: out of memory?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[fix, rfc] kbuild: O= with M= (Re: [PATCH -mm] replacement for broken kbuild-dont-put-temp-files-in-the-source-tree.patch)

2007-01-22 Thread Oleg Verych
On 2006-11-17, Oleg Verych wrote:
> On Tue, Oct 31, 2006 at 02:51:36PM +0100, olecom wrote:
> []
>> On Tue, Oct 31, 2006 at 02:14:16AM +0100, Horst Schirmeier wrote:
> []
>> > I'm not sure what you mean by $(objdir); I just got something to work
>> > which creates the /dev/null symlink in a (newly created if necessary)
>> > directory named
>> 
>> $(objtree) is the directory for all possible outputs of the build process;
>> it's set up by `O=' or `KBUILD_OUTPUT', and it is *not* the output directory
>> for ready external modules `$(M)'. Try to play with this, please.
>
> And for me, they are *not* working together:

It works with this:

Proposed-by: me

--- linux-2.6.20-rc5/scripts/Makefile.modpost.orig  2007-01-12 
19:54:26.0 +0100
+++ linux-2.6.20-rc5/scripts/Makefile.modpost   2007-01-23 08:23:51.583426500 
+0100
@@ -58,5 +58,5 @@
 #  Includes step 3,4
 quiet_cmd_modpost = MODPOST $(words $(filter-out vmlinux FORCE, $^)) modules
-  cmd_modpost = scripts/mod/modpost\
+  cmd_modpost = $(objtree)/scripts/mod/modpost \
 $(if $(CONFIG_MODVERSIONS),-m) \
$(if $(CONFIG_MODULE_SRCVERSION_ALL),-a,)  \
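The distinction the fix relies on can be sketched like this (all paths below
are hypothetical examples, not values from the patch):

```shell
# The three trees kbuild distinguishes when O= is combined with M=.
srctree=/usr/src/linux       # kernel source tree
objtree=/var/build/linux     # set by O= / KBUILD_OUTPUT: all build products
M=/home/user/mydriver        # external module sources

# The patch resolves modpost relative to the output tree instead of
# the current directory, so the external-module build finds it:
echo "cmd_modpost = $objtree/scripts/mod/modpost"
```

An external-module build would then be invoked along the lines of
`make -C $srctree O=$objtree M=$M modules`.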




[git patches] net driver fixes

2007-01-22 Thread Jeff Garzik

Please pull from 'upstream-linus' branch of
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git 
upstream-linus

to receive the following updates:

 drivers/net/ehea/ehea.h  |2 +-
 drivers/net/ehea/ehea_main.c |   56 +-
 drivers/net/ehea/ehea_phyp.c |   10 +-
 drivers/net/netxen/netxen_nic.h  |7 ++--
 drivers/net/netxen/netxen_nic_hw.c   |3 +-
 drivers/net/netxen/netxen_nic_main.c |2 +-
 drivers/net/pcmcia/3c589_cs.c|7 +++-
 drivers/net/phy/phy.c|3 +-
 8 files changed, 57 insertions(+), 33 deletions(-)

Amit S. Kale (2):
  NetXen: Firmware check modifications
  NetXen: Use pci_register_driver() instead of pci_module_init() in 
init_module

Komuro (1):
  modify 3c589_cs to be SMP safe

Kumar Gala (1):
  PHY: Export phy ethtool helpers

Thomas Klein (7):
  ehea: Fixed wrong dereferencation
  ehea: Fixing firmware queue config issue
  ehea: Modified initial autoneg state determination
  ehea: New method to determine number of available ports
  ehea: Improved logging of permission issues
  ehea: Added logging off associated errors
  ehea: Fixed possible nullpointer access

diff --git a/drivers/net/ehea/ehea.h b/drivers/net/ehea/ehea.h
index 39ad9f7..be10a3a 100644
--- a/drivers/net/ehea/ehea.h
+++ b/drivers/net/ehea/ehea.h
@@ -39,7 +39,7 @@
 #include 
 
 #define DRV_NAME   "ehea"
-#define DRV_VERSION"EHEA_0043"
+#define DRV_VERSION"EHEA_0044"
 
 #define EHEA_MSG_DEFAULT (NETIF_MSG_LINK | NETIF_MSG_TIMER \
| NETIF_MSG_RX_ERR | NETIF_MSG_TX_ERR)
diff --git a/drivers/net/ehea/ehea_main.c b/drivers/net/ehea/ehea_main.c
index 83fa32f..1072e69 100644
--- a/drivers/net/ehea/ehea_main.c
+++ b/drivers/net/ehea/ehea_main.c
@@ -558,12 +558,12 @@ static irqreturn_t ehea_qp_aff_irq_handler(int irq, void 
*param)
u32 qp_token;
 
eqe = ehea_poll_eq(port->qp_eq);
-   ehea_debug("eqe=%p", eqe);
+
while (eqe) {
-   ehea_debug("*eqe=%lx", *(u64*)eqe);
-   eqe = ehea_poll_eq(port->qp_eq);
qp_token = EHEA_BMASK_GET(EHEA_EQE_QP_TOKEN, eqe->entry);
-   ehea_debug("next eqe=%p", eqe);
+   ehea_error("QP aff_err: entry=0x%lx, token=0x%x",
+  eqe->entry, qp_token);
+   eqe = ehea_poll_eq(port->qp_eq);
}
 
return IRQ_HANDLED;
@@ -575,8 +575,9 @@ static struct ehea_port *ehea_get_port(struct ehea_adapter 
*adapter,
int i;
 
for (i = 0; i < adapter->num_ports; i++)
-   if (adapter->port[i]->logical_port_id == logical_port)
-   return adapter->port[i];
+   if (adapter->port[i])
+   if (adapter->port[i]->logical_port_id == logical_port)
+   return adapter->port[i];
return NULL;
 }
 
@@ -642,6 +643,8 @@ int ehea_sense_port_attr(struct ehea_port *port)
break;
}
 
+   port->autoneg = 1;
+
/* Number of default QPs */
port->num_def_qps = cb0->num_default_qps;
 
@@ -728,10 +731,7 @@ int ehea_set_portspeed(struct ehea_port *port, u32 
port_speed)
}
} else {
if (hret == H_AUTHORITY) {
-   ehea_info("Hypervisor denied setting port speed. Either"
- " this partition is not authorized to set "
- "port speed or another partition has modified"
- " port speed first.");
+   ehea_info("Hypervisor denied setting port speed");
ret = -EPERM;
} else {
ret = -EIO;
@@ -998,7 +998,7 @@ static int ehea_configure_port(struct ehea_port *port)
 | EHEA_BMASK_SET(PXLY_RC_JUMBO_FRAME, 1);
 
for (i = 0; i < port->num_def_qps; i++)
-   cb0->default_qpn_arr[i] = port->port_res[i].qp->init_attr.qp_nr;
+   cb0->default_qpn_arr[i] = port->port_res[0].qp->init_attr.qp_nr;
 
if (netif_msg_ifup(port))
ehea_dump(cb0, sizeof(*cb0), "ehea_configure_port");
@@ -1485,11 +1485,12 @@ out:
 
 static void ehea_promiscuous_error(u64 hret, int enable)
 {
-   ehea_info("Hypervisor denied %sabling promiscuous mode.%s",
- enable == 1 ? "en" : "dis",
- hret != H_AUTHORITY ? "" : " Another partition owning a "
- "logical port on the same physical port might have altered "
- "promiscuous mode first.");
+   if (hret == H_AUTHORITY)
+   ehea_info("Hypervisor denied %sabling promiscuous mode",
+ enable == 1 ? "en" : "dis");
+   else
+   ehea_error("failed %sabling promiscuous mode",
+  enable == 1 ? "en" : "dis");
 }
 
 static void 

Re: Suspend to RAM generates oops and general protection fault

2007-01-22 Thread Luming Yu

What about removing the psmouse module?

On 1/23/07, Jean-Marc Valin <[EMAIL PROTECTED]> wrote:

>>> will be a device driver. Common causes of suspend/resume problems from
>>> the list you give below are acpi modules, bluetooth and usb. I'd also be
>>> consider pcmcia, drm and fuse possibilities. But again, go for unloading
>>> everything possible in the first instance.
>> Actually, the reason I sent this is that when I showed the oops/gpf to
>> Matthew Garrett at linux.conf.au, he said it looked like a CPU hotplug
>> problem and suggested I send it to lkml. BTW, with 2.6.20-rc5, the
>> suspend to RAM now works ~95% of the time.
>
> Try a kernel without CONFIG_SMP... that will verify if it is SMP
> related.

Well, this happens to be my main work machine, which I'm not willing to
have running at half speed for several weeks. Anything else you can suggest?

Jean-Marc




Re: `make htmldocs` fails -- 2.6.20-rc4-mm1

2007-01-22 Thread Don Mullis
On Mon, 2007-01-22 at 22:22 -0800, Randy Dunlap wrote:
> On Mon, 22 Jan 2007 22:02:30 -0800 Don Mullis wrote:
> 
> > 
> > Bisection shows the bad patch to be:
> > gregkh-driver-uio-documentation.patch
> > 
> > The htmldocs build failure can be eliminated by:
> > quilt remove Documentation/DocBook/kernel-api.tmpl
> 
> or by:  quilt delete gregkh-driver-uio-documentation.patch ??

That would fix the htmldoc build too, but would throw out lots of
documentation.  Greg K-H would seem the prime candidate to propose a
fix.


> How about an accurate description of what kernel tree has this problem?
> It's not 2.6.19.  It's not 2.6.20-rc5.

2.6.20-rc4-mm1, sorry.  Forgot that posting as a reply to the
2.6.20-rc4-mm1 announcement is no help for someone receiving the mail
directly.




Re: `make htmldocs` fails

2007-01-22 Thread Greg KH
On Mon, Jan 22, 2007 at 10:02:30PM -0800, Don Mullis wrote:
> 
> Bisection shows the bad patch to be:
> gregkh-driver-uio-documentation.patch
> 
> The htmldocs build failure can be eliminated by:
> quilt remove Documentation/DocBook/kernel-api.tmpl
> 
> The error messages were:
> 
> .../linux-2.6.19 $ make htmldocs
>   DOCPROC Documentation/DocBook/kernel-api.xml
> 
> Warning(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//drivers/uio/uio.c:521):
>  No description found for parameter 'owner'
> 
> Warning(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//drivers/uio/uio.c:521):
>  No description found for parameter 'info'
> 
> Warning(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//drivers/uio/uio.c:591):
>  No description found for parameter 'idev'

Thanks, I've fixed these warnings now.

> 
> Error(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//include/linux/uio_driver.h:33):
>  cannot understand prototype: 'struct uio_info '

I think I've fixed this now, the next -mm should contain the update.

thanks for letting me know.

greg k-h


Re: `make htmldocs` fails

2007-01-22 Thread Randy Dunlap
On Mon, 22 Jan 2007 22:02:30 -0800 Don Mullis wrote:

> 
> Bisection shows the bad patch to be:
> gregkh-driver-uio-documentation.patch
> 
> The htmldocs build failure can be eliminated by:
> quilt remove Documentation/DocBook/kernel-api.tmpl

or by:  quilt delete gregkh-driver-uio-documentation.patch ??

> The error messages were:
> 
> .../linux-2.6.19 $ make htmldocs
>   DOCPROC Documentation/DocBook/kernel-api.xml
> 
> Warning(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//drivers/uio/uio.c:521):
>  No description found for parameter 'owner'
> 
> Warning(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//drivers/uio/uio.c:521):
>  No description found for parameter 'info'
> 
> Warning(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//drivers/uio/uio.c:591):
>  No description found for parameter 'idev'
> 
> Error(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//include/linux/uio_driver.h:33):
>  cannot understand prototype: 'struct uio_info '
> 
> Warning(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//include/linux/uio_driver.h):
>  no structured comments found
> make[1]: *** [Documentation/DocBook/kernel-api.xml] Error 1
> make: *** [htmldocs] Error 2
> 
> The failure was observed on an up-to-date Fedora Core 5 host. 

How about an accurate description of what kernel tree has this problem?
It's not 2.6.19.  It's not 2.6.20-rc5.

---
~Randy


Re: SATA hotplug from the user side ?

2007-01-22 Thread Tejun Heo
Henrique de Moraes Holschuh wrote:
> Does SATA electrical conector keying let the disk firmware unload
> heads before the user manages to pull it out enough to sever power?

I don't think so.

> If it does not, the drive will do an emergency head unload, which is
> not good and will likely reduce the drive's lifetime.

Probably.

> Using hdparm -Y before the unplug, or scsiadd -r (on a kernel that
> has Tejun's new patch to optionally issue an START_STOP_UNIT to the
> SCSI device enabled) is probably a good idea.  Unless it is a shared
> SATA port (I don't know if such a thing exists yet) and another box
> is talking to the disk, etc.

Agreed.  But it would be *much* better if all of this could be taken care of
by hald and its minions, so that the user can just tell the system that the
hdd is going to be removed and all these dirty tricks are done
automagically.

-- 
tejun


Re: [PATCH] kbuild: Replace remaining "depends" with "depends on"

2007-01-22 Thread Oleg Verych
On 2006-12-13, Robert P. J. Day wrote:
>
>   Replace the very few remaining "depends" Kconfig directives with
> "depends on".
>
> Signed-off-by: Robert P. J. Day <[EMAIL PROTECTED]>
>
> ---

For this kind of fix, please use the
"kconfig" subsystem prefix instead of
"kbuild" in the subject. Thanks.





Re: SATA hotplug from the user side ?

2007-01-22 Thread Tejun Heo
Soeren Sonnenburg wrote:
> OK, how about this (please especially check the non SIL part):
> 
> SATA Hotplug from the User Side
> 
> - For SIL3114 and SIL3124 you don't have to run any commands at all. It

ahci and the ck804 flavor of sata_nv can do hotplug without user
assistance too.

[--snip--]
> - For other chipsets, in addition to stopping use of the device before
> unplugging, one has to call scsiadd -r to fully remove it. E.g. in the
> following example the disk on scsi host 3 channel 0 id 0 lun 0 will be
> removed; then, after reinserting a disk, call scsiadd -s :
> 
> # scsiadd -p
> 
> Attached devices:
> Host: scsi2 Channel: 00 Id: 00 Lun: 00
>   Vendor: ATA  Model: ST3400832AS  Rev: 3.01
>   Type:   Direct-AccessANSI SCSI revision: 05
> Host: scsi3 Channel: 00 Id: 00 Lun: 00
>   Vendor: ATA  Model: ST3400620AS  Rev: 3.AA
>   Type:   Direct-AccessANSI SCSI revision: 05
> 
> # scsiadd -r 3 0 0 0
> # scsiadd -s

Doing the above might not be such a good idea on drivers which don't
implement new EH yet.  Those are sata_mv, sata_promise (getting there)
and sata_sx4.

Thanks.

-- 
tejun


Re: [updated PATCH] remove 555 unneeded #includes of sched.h

2007-01-22 Thread Oleg Verych
On 2006-12-29, Tim Schmielau wrote:
[]
> OK, building 2.6.20-rc2-mm1 with all 59 configs from arch/arm/configs 
> with and w/o the patch indeed found one mysterious #include that may not 
> be removed. Thanks, Russell!
>
> Andrew, please use the attached patch instead of the previous one, it also 
> has a slightly better patch description.

Great job!

About the patch: to make it smaller, I think you could use fewer
"unified context" lines (`diff -U1').

Nicely done!




`make htmldocs` fails

2007-01-22 Thread Don Mullis

Bisection shows the bad patch to be:
gregkh-driver-uio-documentation.patch

The htmldocs build failure can be eliminated by:
quilt remove Documentation/DocBook/kernel-api.tmpl

The error messages were:

.../linux-2.6.19 $ make htmldocs
  DOCPROC Documentation/DocBook/kernel-api.xml

Warning(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//drivers/uio/uio.c:521):
 No description found for parameter 'owner'

Warning(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//drivers/uio/uio.c:521):
 No description found for parameter 'info'

Warning(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//drivers/uio/uio.c:591):
 No description found for parameter 'idev'

Error(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//include/linux/uio_driver.h:33):
 cannot understand prototype: 'struct uio_info '

Warning(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//include/linux/uio_driver.h):
 no structured comments found
make[1]: *** [Documentation/DocBook/kernel-api.xml] Error 1
make: *** [htmldocs] Error 2

The failure was observed on an up-to-date Fedora Core 5 host. 


Re: [PATCH 2.6.20-rc5 1/7] ehea: Fixed wrong dereferencation

2007-01-22 Thread Jeff Garzik

Thomas Klein wrote:

Not only check the pointer against 0 but also the dereferenced value

Signed-off-by: Thomas Klein <[EMAIL PROTECTED]>
---


 drivers/net/ehea/ehea.h  |2 +-
 drivers/net/ehea/ehea_main.c |6 --
 2 files changed, 5 insertions(+), 3 deletions(-)


applied 1-7 to #upstream-fixes



Re: [PATCH 0/4] atl1: Attansic L1 ethernet driver

2007-01-22 Thread Jeff Garzik

Jay Cliburn wrote:
This is the latest submittal of the patchset providing support for the 
Attansic L1 gigabit ethernet adapter.  This patchset is built against 
kernel version 2.6.20-rc5.


This version incorporates all comments from:

Christoph Hellwig:
http://lkml.org/lkml/2007/1/11/43
http://lkml.org/lkml/2007/1/11/45
http://lkml.org/lkml/2007/1/11/48
http://lkml.org/lkml/2007/1/11/49

Jeff Garzik:
http://lkml.org/lkml/2007/1/18/233

Many thanks to both for reviewing the driver.

The monolithic version of this patchset may be found at:
ftp://hogchain.net/pub/linux/attansic/kernel_driver/atl1-2.0.5-linux-2.6.20.rc5.patch.bz2


OK, I have merged the monolithic patch into jgarzik/netdev-2.6.git#atl1. 
 Once I'm done merging patches tonight, I will merge this new 'atl1' 
branch into the 'ALL' meta-branch, which will auto-propagate this driver 
into Andrew Morton's -mm for testing.


For future driver updates, please send a patch rather than the full 
driver diff.


If it looks good, we will push for 2.6.21 (or 2.6.22, if updates or 
objections continue to come fast and furious).  It's in "the system" 
now, thanks for all your hard work!


Jeff





[PATCH 11/12] clocksource: remove update_callback

2007-01-22 Thread Daniel Walker
Uses the block notifier to replace the functionality of update_callback().
update_callback() was a special case specifically for the tsc, but including
it in the clocksource structure duplicated it needlessly for other clocks.

Signed-Off-By: Daniel Walker <[EMAIL PROTECTED]>

---
 arch/i386/kernel/tsc.c  |   35 ++-
 arch/x86_64/kernel/tsc.c|   19 +--
 include/linux/clocksource.h |2 --
 include/linux/timekeeping.h |1 +
 kernel/time/timekeeping.c   |   32 ++--
 5 files changed, 46 insertions(+), 43 deletions(-)

Index: linux-2.6.19/arch/i386/kernel/tsc.c
===
--- linux-2.6.19.orig/arch/i386/kernel/tsc.c
+++ linux-2.6.19/arch/i386/kernel/tsc.c
@@ -51,8 +51,7 @@ static int __init tsc_setup(char *str)
 __setup("notsc", tsc_setup);
 
 /*
- * code to mark and check if the TSC is unstable
- * due to cpufreq or due to unsynced TSCs
+ * Flag that denotes an unstable tsc and check function.
  */
 static int tsc_unstable;
 
@@ -61,12 +60,6 @@ static inline int check_tsc_unstable(voi
return tsc_unstable;
 }
 
-void mark_tsc_unstable(void)
-{
-   tsc_unstable = 1;
-}
-EXPORT_SYMBOL_GPL(mark_tsc_unstable);
-
 /* Accellerators for sched_clock()
  * convert from cycles(64bits) => nanoseconds (64bits)
  *  basic equation:
@@ -180,6 +173,7 @@ int recalibrate_cpu_khz(void)
if (cpu_has_tsc) {
cpu_khz = calculate_cpu_khz();
tsc_khz = cpu_khz;
+   mark_tsc_unstable();
cpu_data[0].loops_per_jiffy =
cpufreq_scale(cpu_data[0].loops_per_jiffy,
cpu_khz_old, cpu_khz);
@@ -332,7 +326,6 @@ core_initcall(cpufreq_tsc);
 /* clock source code */
 
 static unsigned long current_tsc_khz = 0;
-static int tsc_update_callback(void);
 
 static cycle_t read_tsc(void)
 {
@@ -350,32 +343,24 @@ static struct clocksource clocksource_ts
.mask   = CLOCKSOURCE_MASK(64),
.mult   = 0, /* to be set */
.shift  = 22,
-   .update_callback= tsc_update_callback,
.is_continuous  = 1,
.list   = LIST_HEAD_INIT(clocksource_tsc.list),
 };
 
-static int tsc_update_callback(void)
+/*
+ * code to mark if the TSC is unstable
+ * due to cpufreq or due to unsynced TSCs
+ */
+void mark_tsc_unstable(void)
 {
-   int change = 0;
-
/* check to see if we should switch to the safe clocksource: */
-   if (clocksource_tsc.rating != 0 && check_tsc_unstable()) {
+   if (unlikely(!tsc_unstable && clocksource_tsc.rating != 0)) {
clocksource_tsc.rating = 0;
clocksource_rating_change(&clocksource_tsc);
-   change = 1;
-   }
-
-   /* only update if tsc_khz has changed: */
-   if (current_tsc_khz != tsc_khz) {
-   current_tsc_khz = tsc_khz;
-   clocksource_tsc.mult = clocksource_khz2mult(current_tsc_khz,
-   clocksource_tsc.shift);
-   change = 1;
}
-
-   return change;
+   tsc_unstable = 1;
 }
+EXPORT_SYMBOL_GPL(mark_tsc_unstable);
 
 static int __init dmi_mark_tsc_unstable(struct dmi_system_id *d)
 {
Index: linux-2.6.19/arch/x86_64/kernel/tsc.c
===
--- linux-2.6.19.orig/arch/x86_64/kernel/tsc.c
+++ linux-2.6.19/arch/x86_64/kernel/tsc.c
@@ -47,11 +47,6 @@ static inline int check_tsc_unstable(voi
return tsc_unstable;
 }
 
-void mark_tsc_unstable(void)
-{
-   tsc_unstable = 1;
-}
-EXPORT_SYMBOL_GPL(mark_tsc_unstable);
 
 #ifdef CONFIG_CPU_FREQ
 
@@ -185,8 +180,6 @@ __setup("notsc", notsc_setup);
 
 /* clock source code: */
 
-static int tsc_update_callback(void);
-
 static cycle_t read_tsc(void)
 {
cycle_t ret = (cycle_t)get_cycles_sync();
@@ -206,24 +199,22 @@ static struct clocksource clocksource_ts
.mask   = (cycle_t)-1,
.mult   = 0, /* to be set */
.shift  = 22,
-   .update_callback= tsc_update_callback,
.is_continuous  = 1,
.vread  = vread_tsc,
.list   = LIST_HEAD_INIT(clocksource_tsc.list),
 };
 
-static int tsc_update_callback(void)
+void mark_tsc_unstable(void)
 {
-   int change = 0;
-
/* check to see if we should switch to the safe clocksource: */
-   if (clocksource_tsc.rating != 50 && check_tsc_unstable()) {
+   if (unlikely(!tsc_unstable && clocksource_tsc.rating != 50)) {
clocksource_tsc.rating = 50;
clocksource_rating_change(&clocksource_tsc);
-   change = 1;
}
-   return change;
+
+   tsc_unstable = 1;
 }
+EXPORT_SYMBOL_GPL(mark_tsc_unstable);
 
 static int __init init_tsc_clocksource(void)
 {

[PATCH 12/12] clocksource: atomic signals

2007-01-22 Thread Daniel Walker
Modifies the way clocks are switched to in the timekeeping code. The original
code would constantly monitor the clocksource list checking for newly added
clocksources. I modified this by using atomic types to signal when a new clock
is added. This allows the operation to be used only when it's needed.

The fast path is also reduced to checking a single atomic value.

Signed-Off-By: Daniel Walker <[EMAIL PROTECTED]>

---
 include/linux/clocksource.h |5 
 include/linux/timekeeping.h |   10 
 kernel/time/clocksource.c   |6 +
 kernel/time/timekeeping.c   |   51 +++-
 4 files changed, 49 insertions(+), 23 deletions(-)

Index: linux-2.6.19/include/linux/clocksource.h
===
--- linux-2.6.19.orig/include/linux/clocksource.h
+++ linux-2.6.19/include/linux/clocksource.h
@@ -26,6 +26,11 @@ typedef u64 cycle_t;
 extern struct clocksource clocksource_jiffies;
 
 /*
+ * Atomic signal that is specific to timekeeping.
+ */
+extern atomic_t clock_check;
+
+/*
  * Allows inlined calling for notifier routines.
  */
 extern struct atomic_notifier_head clocksource_list_notifier;
Index: linux-2.6.19/include/linux/timekeeping.h
===
--- linux-2.6.19.orig/include/linux/timekeeping.h
+++ linux-2.6.19/include/linux/timekeeping.h
@@ -5,15 +5,7 @@
 
 extern void update_wall_time(void);
 
-#ifdef CONFIG_GENERIC_TIME
-
-extern struct clocksource *clock;
-
-#else /* CONFIG_GENERIC_TIME */
-static inline int change_clocksource(void)
-{
-   return 0;
-}
+#ifndef CONFIG_GENERIC_TIME
 
 static inline void change_clocksource(void) { }
 static inline void timekeeping_init_notifier(void) { }
Index: linux-2.6.19/kernel/time/clocksource.c
===
--- linux-2.6.19.orig/kernel/time/clocksource.c
+++ linux-2.6.19/kernel/time/clocksource.c
@@ -50,6 +50,7 @@ static char override_name[32];
 static int finished_booting;
 
 ATOMIC_NOTIFIER_HEAD(clocksource_list_notifier);
+atomic_t clock_check = ATOMIC_INIT(0);
 
 /* clocksource_done_booting - Called near the end of bootup
  *
@@ -58,6 +59,8 @@ ATOMIC_NOTIFIER_HEAD(clocksource_list_no
 static int __init clocksource_done_booting(void)
 {
finished_booting = 1;
+   /* Check for a new clock now */
atomic_inc(&clock_check);
return 0;
 }
 
@@ -291,6 +294,9 @@ static ssize_t sysfs_override_clocksourc
/* try to select it: */
next_clocksource = select_clocksource();
 
+   /* Signal that there is a new clocksource */
atomic_inc(&clock_check);
+
spin_unlock_irq(&clocksource_lock);
 
return ret;
Index: linux-2.6.19/kernel/time/timekeeping.c
===
--- linux-2.6.19.orig/kernel/time/timekeeping.c
+++ linux-2.6.19/kernel/time/timekeeping.c
@@ -3,6 +3,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /*
@@ -19,7 +20,6 @@ static unsigned long timekeeping_suspend
  * Clock used for timekeeping
  */
struct clocksource *clock = &clocksource_jiffies;
-atomic_t clock_recalc_interval = ATOMIC_INIT(0);
 
 /*
  * The current time
@@ -150,11 +150,12 @@ int do_settimeofday(struct timespec *tv)
 EXPORT_SYMBOL(do_settimeofday);
 
 /**
- * change_clocksource - Swaps clocksources if a new one is available
+ * timkeeping_change_clocksource - Swaps clocksources if a new one is available
  *
  * Accumulates current time interval and initializes new clocksource
+ * Needs to be called with the xtime_lock held.
  */
-static int change_clocksource(void)
+static int timekeeping_change_clocksource(void)
 {
struct clocksource *new;
cycle_t now;
@@ -169,9 +170,15 @@ static int change_clocksource(void)
clock->cycle_last = now;
printk(KERN_INFO "Time: %s clocksource has been installed.\n",
   clock->name);
+   hrtimer_clock_notify();
+   clock->error = 0;
+   clock->xtime_nsec = 0;
+   clocksource_calculate_interval(clock, tick_nsec);
return 1;
-   } else if (unlikely(atomic_read(&clock_recalc_interval))) {
-   atomic_set(&clock_recalc_interval, 0);
+   } else {
+   clock->error = 0;
+   clock->xtime_nsec = 0;
+   clocksource_calculate_interval(clock, tick_nsec);
return 1;
}
return 0;
@@ -198,9 +205,14 @@ int timekeeping_is_continuous(void)
 static int
 clocksource_callback(struct notifier_block *nb, unsigned long op, void *c)
 {
-   if (c == clock && op == CLOCKSOURCE_NOTIFY_FREQ &&
-   !atomic_read(&clock_recalc_interval))
-   atomic_inc(&clock_recalc_interval);
+   if (likely(c != clock))
+   return 0;
+
+   switch (op) {
+   case CLOCKSOURCE_NOTIFY_FREQ:
+   case CLOCKSOURCE_NOTIFY_RATING:
+   

[PATCH 02/12] clocksource: rating sorted list

2007-01-22 Thread Daniel Walker
Converts the original plain list into a sorted list based on the clock rating.
Later in my tree this allows some of the variables to be dropped since the
highest rated clock is always at the front of the list. This also does some
other nice things like allow the sysfs files to print the clocks in a more
interesting order. It's forward looking.

Signed-Off-By: Daniel Walker <[EMAIL PROTECTED]>

---
 arch/i386/kernel/tsc.c  |2 
 arch/x86_64/kernel/tsc.c|2 
 include/linux/clocksource.h |9 ++-
 kernel/time/clocksource.c   |  132 +---
 4 files changed, 97 insertions(+), 48 deletions(-)

Index: linux-2.6.19/arch/i386/kernel/tsc.c
===
--- linux-2.6.19.orig/arch/i386/kernel/tsc.c
+++ linux-2.6.19/arch/i386/kernel/tsc.c
@@ -361,7 +361,7 @@ static int tsc_update_callback(void)
/* check to see if we should switch to the safe clocksource: */
if (clocksource_tsc.rating != 0 && check_tsc_unstable()) {
clocksource_tsc.rating = 0;
-   clocksource_reselect();
+   clocksource_rating_change(&clocksource_tsc);
change = 1;
}
 
Index: linux-2.6.19/arch/x86_64/kernel/tsc.c
===
--- linux-2.6.19.orig/arch/x86_64/kernel/tsc.c
+++ linux-2.6.19/arch/x86_64/kernel/tsc.c
@@ -218,7 +218,7 @@ static int tsc_update_callback(void)
/* check to see if we should switch to the safe clocksource: */
if (clocksource_tsc.rating != 50 && check_tsc_unstable()) {
clocksource_tsc.rating = 50;
-   clocksource_reselect();
+   clocksource_rating_change(&clocksource_tsc);
change = 1;
}
return change;
Index: linux-2.6.19/include/linux/clocksource.h
===
--- linux-2.6.19.orig/include/linux/clocksource.h
+++ linux-2.6.19/include/linux/clocksource.h
@@ -12,6 +12,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+
 #include 
 #include 
 
@@ -183,9 +186,9 @@ static inline void clocksource_calculate
 
 
 /* used to install a new clocksource */
-int clocksource_register(struct clocksource*);
-void clocksource_reselect(void);
-struct clocksource* clocksource_get_next(void);
+extern struct clocksource *clocksource_get_next(void);
+extern int clocksource_register(struct clocksource*);
+extern void clocksource_rating_change(struct clocksource*);
 
 #ifdef CONFIG_GENERIC_TIME_VSYSCALL
 extern void update_vsyscall(struct timespec *ts, struct clocksource *c);
Index: linux-2.6.19/kernel/time/clocksource.c
===
--- linux-2.6.19.orig/kernel/time/clocksource.c
+++ linux-2.6.19/kernel/time/clocksource.c
@@ -35,7 +35,7 @@
  * next_clocksource:
  * pending next selected clocksource.
  * clocksource_list:
- * linked list with the registered clocksources
+ * rating sorted linked list with the registered clocksources
  * clocksource_lock:
  * protects manipulations to curr_clocksource and next_clocksource
  * and the clocksource_list
@@ -80,69 +80,105 @@ struct clocksource *clocksource_get_next
 }
 
 /**
- * select_clocksource - Finds the best registered clocksource.
+ * __is_registered - Returns a clocksource if it's registered
+ * @name:  name of the clocksource to return
  *
  * Private function. Must hold clocksource_lock when called.
  *
- * Looks through the list of registered clocksources, returning
- * the one with the highest rating value. If there is a clocksource
- * name that matches the override string, it returns that clocksource.
+ * Returns the clocksource if registered, zero otherwise.
  */
-static struct clocksource *select_clocksource(void)
+static struct clocksource * __is_registered(char * name)
 {
-   struct clocksource *best = NULL;
struct list_head *tmp;
 
list_for_each(tmp, &clocksource_list) {
struct clocksource *src;
 
src = list_entry(tmp, struct clocksource, list);
-   if (!best)
-   best = src;
-
-   /* check for override: */
-   if (strlen(src->name) == strlen(override_name) &&
-   !strcmp(src->name, override_name)) {
-   best = src;
-   break;
-   }
-   /* pick the highest rating: */
-   if (src->rating > best->rating)
-   best = src;
+   if (!strcmp(src->name, name))
+   return src;
}
 
-   return best;
+   return 0;
 }
 
 /**
- * is_registered_source - Checks if clocksource is registered
- * @c: pointer to a clocksource
+ * __get_clock - Finds a specific clocksource
+ * @name:  name of the clocksource to return
  

[PATCH 03/12] clocksource: arm initialize list value

2007-01-22 Thread Daniel Walker
Update arch/arm/ with list initialization.

Signed-Off-By: Daniel Walker <[EMAIL PROTECTED]>

---
 arch/arm/mach-imx/time.c  |1 +
 arch/arm/mach-ixp4xx/common.c |1 +
 arch/arm/mach-netx/time.c |1 +
 arch/arm/mach-pxa/time.c  |1 +
 4 files changed, 4 insertions(+)

Index: linux-2.6.19/arch/arm/mach-imx/time.c
===
--- linux-2.6.19.orig/arch/arm/mach-imx/time.c
+++ linux-2.6.19/arch/arm/mach-imx/time.c
@@ -87,6 +87,7 @@ static struct clocksource clocksource_im
.read   = imx_get_cycles,
.mask   = 0xffffffff,
.shift  = 20,
+   .list   = LIST_HEAD_INIT(clocksource_imx.list),
.is_continuous  = 1,
 };
 
Index: linux-2.6.19/arch/arm/mach-ixp4xx/common.c
===
--- linux-2.6.19.orig/arch/arm/mach-ixp4xx/common.c
+++ linux-2.6.19/arch/arm/mach-ixp4xx/common.c
@@ -396,6 +396,7 @@ static struct clocksource clocksource_ix
.mask   = CLOCKSOURCE_MASK(32),
.shift  = 20,
.is_continuous  = 1,
+   .list   = LIST_HEAD_INIT(clocksource_ixp4xx.list),
 };
 
 unsigned long ixp4xx_timer_freq = FREQ;
Index: linux-2.6.19/arch/arm/mach-netx/time.c
===
--- linux-2.6.19.orig/arch/arm/mach-netx/time.c
+++ linux-2.6.19/arch/arm/mach-netx/time.c
@@ -62,6 +62,7 @@ static struct clocksource clocksource_ne
.read   = netx_get_cycles,
.mask   = CLOCKSOURCE_MASK(32),
.shift  = 20,
+   .list   = LIST_HEAD_INIT(clocksource_netx.list),
.is_continuous  = 1,
 };
 
Index: linux-2.6.19/arch/arm/mach-pxa/time.c
===
--- linux-2.6.19.orig/arch/arm/mach-pxa/time.c
+++ linux-2.6.19/arch/arm/mach-pxa/time.c
@@ -112,6 +112,7 @@ static struct clocksource clocksource_px
.read   = pxa_get_cycles,
.mask   = CLOCKSOURCE_MASK(32),
.shift  = 20,
+   .list   = LIST_HEAD_INIT(clocksource_pxa.list),
.is_continuous  = 1,
 };
 

-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 04/12] clocksource: avr32 initialize list value

2007-01-22 Thread Daniel Walker
Update arch/avr32/ with list initialization.

Signed-Off-By: Daniel Walker <[EMAIL PROTECTED]>

---
 arch/avr32/kernel/time.c |1 +
 1 file changed, 1 insertion(+)

Index: linux-2.6.19/arch/avr32/kernel/time.c
===
--- linux-2.6.19.orig/arch/avr32/kernel/time.c
+++ linux-2.6.19/arch/avr32/kernel/time.c
@@ -37,6 +37,7 @@ static struct clocksource clocksource_av
.read   = read_cycle_count,
.mask   = CLOCKSOURCE_MASK(32),
.shift  = 16,
+   .list   = LIST_HEAD_INIT(clocksource_avr32.list),
.is_continuous  = 1,
 };
 



[PATCH 06/12] clocksource: i386 initialize list value

2007-01-22 Thread Daniel Walker
Update arch/i386/ with list initialization.

Signed-Off-By: Daniel Walker <[EMAIL PROTECTED]>

---
 arch/i386/kernel/hpet.c  |1 +
 arch/i386/kernel/i8253.c |1 +
 arch/i386/kernel/tsc.c   |1 +
 3 files changed, 3 insertions(+)

Index: linux-2.6.19/arch/i386/kernel/hpet.c
===
--- linux-2.6.19.orig/arch/i386/kernel/hpet.c
+++ linux-2.6.19/arch/i386/kernel/hpet.c
@@ -282,6 +282,7 @@ static struct clocksource clocksource_hp
.mask   = HPET_MASK,
.shift  = HPET_SHIFT,
.is_continuous  = 1,
+   .list   = LIST_HEAD_INIT(clocksource_hpet.list),
 };
 
 static int __init init_hpet_clocksource(void)
Index: linux-2.6.19/arch/i386/kernel/i8253.c
===
--- linux-2.6.19.orig/arch/i386/kernel/i8253.c
+++ linux-2.6.19/arch/i386/kernel/i8253.c
@@ -177,6 +177,7 @@ static struct clocksource clocksource_pi
.mask   = CLOCKSOURCE_MASK(32),
.mult   = 0,
.shift  = 20,
+   .list   = LIST_HEAD_INIT(clocksource_pit.list),
 };
 
 static int __init init_pit_clocksource(void)
Index: linux-2.6.19/arch/i386/kernel/tsc.c
===
--- linux-2.6.19.orig/arch/i386/kernel/tsc.c
+++ linux-2.6.19/arch/i386/kernel/tsc.c
@@ -352,6 +352,7 @@ static struct clocksource clocksource_ts
.shift  = 22,
.update_callback= tsc_update_callback,
.is_continuous  = 1,
+   .list   = LIST_HEAD_INIT(clocksource_tsc.list),
 };
 
 static int tsc_update_callback(void)



[PATCH 05/12] clocksource: mips initialize list value

2007-01-22 Thread Daniel Walker
Update arch/mips/ with list initialization.

Signed-Off-By: Daniel Walker <[EMAIL PROTECTED]>

---
 arch/mips/kernel/time.c |1 +
 1 file changed, 1 insertion(+)

Index: linux-2.6.19/arch/mips/kernel/time.c
===
--- linux-2.6.19.orig/arch/mips/kernel/time.c
+++ linux-2.6.19/arch/mips/kernel/time.c
@@ -307,6 +307,7 @@ static unsigned int __init calibrate_hpt
 struct clocksource clocksource_mips = {
.name   = "MIPS",
.mask   = 0xffffffff,
+   .list   = LIST_HEAD_INIT(clocksource_mips.list),
.is_continuous  = 1,
 };
 



[PATCH 07/12] clocksource: x86_64 initialize list value

2007-01-22 Thread Daniel Walker
Update arch/x86_64/ with list initialization.

Signed-Off-By: Daniel Walker <[EMAIL PROTECTED]>

---
 arch/x86_64/kernel/hpet.c |1 +
 arch/x86_64/kernel/tsc.c  |1 +
 2 files changed, 2 insertions(+)

Index: linux-2.6.19/arch/x86_64/kernel/hpet.c
===
--- linux-2.6.19.orig/arch/x86_64/kernel/hpet.c
+++ linux-2.6.19/arch/x86_64/kernel/hpet.c
@@ -472,6 +472,7 @@ struct clocksource clocksource_hpet = {
.shift  = HPET_SHIFT,
.is_continuous  = 1,
.vread  = vread_hpet,
+   .list   = LIST_HEAD_INIT(clocksource_hpet.list),
 };
 
 static int __init init_hpet_clocksource(void)
Index: linux-2.6.19/arch/x86_64/kernel/tsc.c
===
--- linux-2.6.19.orig/arch/x86_64/kernel/tsc.c
+++ linux-2.6.19/arch/x86_64/kernel/tsc.c
@@ -209,6 +209,7 @@ static struct clocksource clocksource_ts
.update_callback= tsc_update_callback,
.is_continuous  = 1,
.vread  = vread_tsc,
+   .list   = LIST_HEAD_INIT(clocksource_tsc.list),
 };
 
 static int tsc_update_callback(void)



[PATCH 08/12] clocksource: driver initialize list value

2007-01-22 Thread Daniel Walker
Update drivers/clocksource/ with list initialization.

Signed-Off-By: Daniel Walker <[EMAIL PROTECTED]>

---
 drivers/clocksource/acpi_pm.c|1 +
 drivers/clocksource/cyclone.c|1 +
 drivers/clocksource/scx200_hrt.c |1 +
 3 files changed, 3 insertions(+)

Index: linux-2.6.19/drivers/clocksource/acpi_pm.c
===
--- linux-2.6.19.orig/drivers/clocksource/acpi_pm.c
+++ linux-2.6.19/drivers/clocksource/acpi_pm.c
@@ -74,6 +74,7 @@ static struct clocksource clocksource_ac
.mult   = 0, /*to be caluclated*/
.shift  = 22,
.is_continuous  = 1,
+   .list   = LIST_HEAD_INIT(clocksource_acpi_pm.list),
 };
 
 
Index: linux-2.6.19/drivers/clocksource/cyclone.c
===
--- linux-2.6.19.orig/drivers/clocksource/cyclone.c
+++ linux-2.6.19/drivers/clocksource/cyclone.c
@@ -32,6 +32,7 @@ static struct clocksource clocksource_cy
.mult   = 10,
.shift  = 0,
.is_continuous  = 1,
+   .list   = LIST_HEAD_INIT(clocksource_cyclone.list),
 };
 
 static int __init init_cyclone_clocksource(void)
Index: linux-2.6.19/drivers/clocksource/scx200_hrt.c
===
--- linux-2.6.19.orig/drivers/clocksource/scx200_hrt.c
+++ linux-2.6.19/drivers/clocksource/scx200_hrt.c
@@ -58,6 +58,7 @@ static struct clocksource cs_hrt = {
.read   = read_hrt,
.mask   = CLOCKSOURCE_MASK(32),
.is_continuous  = 1,
+   .list   = LIST_HEAD_INIT(cs_hrt.list),
/* mult, shift are set based on mhz27 flag */
 };
 



[PATCH 09/12] clocksource: initialize list value

2007-01-22 Thread Daniel Walker
A change to clocksource initialization. If the list field is initialized 
it allows clocksource_register to complete faster since it doesn't have 
to scan the list of clocks doing strcmp on each looking for duplicates.
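The trick can be sketched in userspace (a minimal model of the kernel's struct list_head and LIST_HEAD_INIT, with a made-up register_clock): a statically initialized node points at itself, so a single list_empty() check replaces the O(n) strcmp scan for duplicates.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace model of <linux/list.h>: a node whose pointers
 * refer to itself is "empty", i.e. not yet on any list. */
struct list_head { struct list_head *next, *prev; };

#define LIST_HEAD_INIT(name) { &(name), &(name) }

static int list_empty(const struct list_head *h)
{
    return h->next == h;
}

static void list_add(struct list_head *n, struct list_head *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Hypothetical registration: O(1) duplicate test instead of walking
 * the whole list comparing names. */
static int register_clock(struct list_head *clocks, struct list_head *c)
{
    if (!list_empty(c))
        return -1;      /* already registered */
    list_add(c, clocks);
    return 0;
}
```

This only works if every clocksource initializes its .list field, which is why the preceding per-arch patches exist.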

Signed-Off-By: Daniel Walker <[EMAIL PROTECTED]>

---
 kernel/time/clocksource.c |3 +--
 kernel/time/jiffies.c |1 +
 2 files changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.19/kernel/time/clocksource.c
===
--- linux-2.6.19.orig/kernel/time/clocksource.c
+++ linux-2.6.19/kernel/time/clocksource.c
@@ -186,12 +186,11 @@ int clocksource_register(struct clocksou
unsigned long flags;
 
spin_lock_irqsave(&clocksource_lock, flags);
-   if (unlikely(!list_empty(&c->list) && __is_registered(c->name))) {
+   if (unlikely(!list_empty(&c->list))) {
printk("register_clocksource: Cannot register %s clocksource. "
   "Already registered!", c->name);
ret = -EBUSY;
} else {
-   INIT_LIST_HEAD(&c->list);
__sorted_list_add(c);
/* scan the registered clocksources, and pick the best one */
next_clocksource = select_clocksource();
Index: linux-2.6.19/kernel/time/jiffies.c
===
--- linux-2.6.19.orig/kernel/time/jiffies.c
+++ linux-2.6.19/kernel/time/jiffies.c
@@ -63,6 +63,7 @@ struct clocksource clocksource_jiffies =
.mult   = NSEC_PER_JIFFY << JIFFIES_SHIFT, /* details above */
.shift  = JIFFIES_SHIFT,
.is_continuous  = 0, /* tick based, not free running */
+   .list   = LIST_HEAD_INIT(clocksource_jiffies.list),
 };
 
 static int __init init_jiffies_clocksource(void)



[PATCH 10/12] clocksource: add block notifier

2007-01-22 Thread Daniel Walker
Adds a callback interface for register/rating-change events. This is also used
later in this series to signal other interesting events.
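The notifier-chain pattern being added can be sketched in userspace (no locking, fixed-size array, invented names; the kernel's atomic notifier chains are more involved): subscribers register a callback, and the clocksource core calls the whole chain with an event code.

```c
#include <assert.h>
#include <stddef.h>

/* Event codes, mirroring the patch's CLOCKSOURCE_NOTIFY_* flags. */
#define NOTIFY_REGISTER 1
#define NOTIFY_RATING   2

typedef void (*notifier_fn)(int event, void *data);

static notifier_fn chain[8];
static int nchain;

/* Add a callback to the chain (simplified: no priorities, no unregister). */
static int notifier_register(notifier_fn fn)
{
    if (nchain >= 8)
        return -1;
    chain[nchain++] = fn;
    return 0;
}

/* Invoke every registered callback with the event. */
static void notifier_call_chain(int event, void *data)
{
    for (int i = 0; i < nchain; i++)
        chain[i](event, data);
}

/* Example subscriber: count rating-change events. */
static int rating_events;

static void on_event(int event, void *data)
{
    (void)data;
    if (event == NOTIFY_RATING)
        rating_events++;
}
```

In the patch itself, clocksource_register() and clocksource_rating_change() fire the chain after dropping clocksource_lock, so callbacks run outside the spinlock.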


Signed-Off-By: Daniel Walker <[EMAIL PROTECTED]>

---
 include/linux/clocksource.h |   37 +
 include/linux/timekeeping.h |3 +++
 kernel/time/clocksource.c   |   10 ++
 3 files changed, 50 insertions(+)

Index: linux-2.6.19/include/linux/clocksource.h
===
--- linux-2.6.19.orig/include/linux/clocksource.h
+++ linux-2.6.19/include/linux/clocksource.h
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -24,6 +25,42 @@ typedef u64 cycle_t;
 /* XXX - Would like a better way for initializing curr_clocksource */
 extern struct clocksource clocksource_jiffies;
 
+/*
+ * Allows inlined calling for notifier routines.
+ */
+extern struct atomic_notifier_head clocksource_list_notifier;
+
+/*
+ * Block notifier flags.
+ */
+#define CLOCKSOURCE_NOTIFY_REGISTER1
+#define CLOCKSOURCE_NOTIFY_RATING  2
+#define CLOCKSOURCE_NOTIFY_FREQ4
+
+/**
+ * clocksource_notifier_register - Registers a list change notifier
+ * @nb:pointer to a notifier block
+ *
+ * Returns zero always.
+ */
+static inline int clocksource_notifier_register(struct notifier_block *nb)
+{
+   return atomic_notifier_chain_register(&clocksource_list_notifier, nb);
+}
+
+/**
+ * clocksource_freq_change - Allows notification of dynamic frequency changes.
+ *
+ * Signals that a clocksource is dynamically changing its frequency.
+ * This could happen if a clocksource becomes more/less stable.
+ */
+static inline void clocksource_freq_change(struct clocksource *c)
+{
+   atomic_notifier_call_chain(&clocksource_list_notifier,
+  CLOCKSOURCE_NOTIFY_FREQ, c);
+}
+
+
 /**
  * struct clocksource - hardware abstraction for a free running counter
  * Provides mostly state-free accessors to the underlying hardware.
Index: linux-2.6.19/include/linux/timekeeping.h
===
--- linux-2.6.19.orig/include/linux/timekeeping.h
+++ linux-2.6.19/include/linux/timekeeping.h
@@ -14,6 +14,9 @@ static inline int change_clocksource(voi
 {
return 0;
 }
+
+static inline void change_clocksource(void) { }
+
 #endif /* !CONFIG_GENERIC_TIME */
 
 #endif /* _LINUX_TIMEKEEPING_H */
Index: linux-2.6.19/kernel/time/clocksource.c
===
--- linux-2.6.19.orig/kernel/time/clocksource.c
+++ linux-2.6.19/kernel/time/clocksource.c
@@ -49,6 +49,8 @@ static DEFINE_SPINLOCK(clocksource_lock)
 static char override_name[32];
 static int finished_booting;
 
+ATOMIC_NOTIFIER_HEAD(clocksource_list_notifier);
+
 /* clocksource_done_booting - Called near the end of bootup
  *
  * Hack to avoid lots of clocksource churn at boot time
@@ -196,6 +198,10 @@ int clocksource_register(struct clocksou
next_clocksource = select_clocksource();
}
spin_unlock_irqrestore(&clocksource_lock, flags);
+
+   atomic_notifier_call_chain(&clocksource_list_notifier,
+  CLOCKSOURCE_NOTIFY_REGISTER, c);
+
return ret;
 }
 EXPORT_SYMBOL(clocksource_register);
@@ -224,6 +230,10 @@ void clocksource_rating_change(struct cl
 
next_clocksource = select_clocksource();
spin_unlock_irqrestore(&clocksource_lock, flags);
+
+   atomic_notifier_call_chain(&clocksource_list_notifier,
+  CLOCKSOURCE_NOTIFY_RATING, c);
+
 }
 EXPORT_SYMBOL(clocksource_rating_change);
 



[PATCH 01/12] timekeeping: create kernel/time/timekeeping.c

2007-01-22 Thread Daniel Walker
Move the generic timekeeping code from kernel/timer.c to
kernel/time/timekeeping.c . This requires some glue code which is
added to the include/linux/timekeeping.h header.

Signed-Off-By: Daniel Walker <[EMAIL PROTECTED]> 

---
 include/linux/clocksource.h |3 
 include/linux/timekeeping.h |   19 +
 kernel/time/Makefile|2 
 kernel/time/clocksource.c   |3 
 kernel/time/timekeeping.c   |  639 
 kernel/timer.c  |  630 ---
 6 files changed, 663 insertions(+), 633 deletions(-)

Index: linux-2.6.19/include/linux/clocksource.h
===
--- linux-2.6.19.orig/include/linux/clocksource.h
+++ linux-2.6.19/include/linux/clocksource.h
@@ -18,6 +18,9 @@
 /* clocksource cycle base type */
 typedef u64 cycle_t;
 
+/* XXX - Would like a better way for initializing curr_clocksource */
+extern struct clocksource clocksource_jiffies;
+
 /**
  * struct clocksource - hardware abstraction for a free running counter
  * Provides mostly state-free accessors to the underlying hardware.
Index: linux-2.6.19/include/linux/timekeeping.h
===
--- /dev/null
+++ linux-2.6.19/include/linux/timekeeping.h
@@ -0,0 +1,19 @@
+#ifndef _LINUX_TIMEKEEPING_H
+#define _LINUX_TIMEKEEPING_H
+
+#include 
+
+extern void update_wall_time(void);
+
+#ifdef CONFIG_GENERIC_TIME
+
+extern struct clocksource *clock;
+
+#else /* CONFIG_GENERIC_TIME */
+static inline int change_clocksource(void)
+{
+   return 0;
+}
+#endif /* !CONFIG_GENERIC_TIME */
+
+#endif /* _LINUX_TIMEKEEPING_H */
Index: linux-2.6.19/kernel/time/Makefile
===
--- linux-2.6.19.orig/kernel/time/Makefile
+++ linux-2.6.19/kernel/time/Makefile
@@ -1,4 +1,4 @@
-obj-y += ntp.o clocksource.o jiffies.o timer_list.o
+obj-y += ntp.o clocksource.o jiffies.o timer_list.o timekeeping.o
 
 obj-$(CONFIG_GENERIC_CLOCKEVENTS)  += clockevents.o
 obj-$(CONFIG_TIMER_STATS)  += timer_stats.o
Index: linux-2.6.19/kernel/time/clocksource.c
===
--- linux-2.6.19.orig/kernel/time/clocksource.c
+++ linux-2.6.19/kernel/time/clocksource.c
@@ -29,9 +29,6 @@
 #include 
 #include 
 
-/* XXX - Would like a better way for initializing curr_clocksource */
-extern struct clocksource clocksource_jiffies;
-
 /*[Clocksource internal variables]-
  * curr_clocksource:
  * currently selected clocksource. Initialized to clocksource_jiffies.
Index: linux-2.6.19/kernel/time/timekeeping.c
===
--- /dev/null
+++ linux-2.6.19/kernel/time/timekeeping.c
@@ -0,0 +1,639 @@
+
+
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * flag for if timekeeping is suspended
+ */
+static int timekeeping_suspended;
+
+/*
+ * time in seconds when suspend began
+ */
+static unsigned long timekeeping_suspend_time;
+
+/*
+ * Clock used for timekeeping
+ */
+struct clocksource *clock = &clocksource_jiffies;
+
+/*
+ * The current time
+ * wall_to_monotonic is what we need to add to xtime (or xtime corrected
+ * for sub jiffie times) to get to monotonic time.  Monotonic is pegged
+ * at zero at system boot time, so wall_to_monotonic will be negative,
+ * however, we will ALWAYS keep the tv_nsec part positive so we can use
+ * the usual normalization.
+ */
+struct timespec xtime __attribute__ ((aligned (16)));
+struct timespec wall_to_monotonic __attribute__ ((aligned (16)));
+
+EXPORT_SYMBOL(xtime);
+
+#ifdef CONFIG_GENERIC_TIME
+/**
+ * __get_nsec_offset - Returns nanoseconds since last call to periodic_hook
+ *
+ * private function, must hold xtime_lock lock when being
+ * called. Returns the number of nanoseconds since the
+ * last call to update_wall_time() (adjusted by NTP scaling)
+ */
+static inline s64 __get_nsec_offset(void)
+{
+   cycle_t cycle_now, cycle_delta;
+   s64 ns_offset;
+
+   /* read clocksource: */
+   cycle_now = clocksource_read(clock);
+
+   /* calculate the delta since the last update_wall_time: */
+   cycle_delta = (cycle_now - clock->cycle_last) & clock->mask;
+
+   /* convert to nanoseconds: */
+   ns_offset = cyc2ns(clock, cycle_delta);
+
+   return ns_offset;
+}
+
+/**
+ * __get_realtime_clock_ts - Returns the time of day in a timespec
+ * @ts:pointer to the timespec to be set
+ *
+ * Returns the time of day in a timespec. Used by
+ * do_gettimeofday() and get_realtime_clock_ts().
+ */
+static inline void __get_realtime_clock_ts(struct timespec *ts)
+{
+   unsigned long seq;
+   s64 nsecs;
+
+   do {
+   seq = read_seqbegin(&xtime_lock);
+
+   *ts = xtime;
+   nsecs = __get_nsec_offset();
+
+   } while (read_seqretry(&xtime_lock, seq));
+
+   timespec_add_ns(ts, nsecs);
+}
+
+/**
+ * 

[PATCH 00/12] clocksource/timekeeping cleanup

2007-01-22 Thread Daniel Walker
This patchset represents the most stable clocksource changes in my tree. Also
John (and others) have reviewed these changes a few times. I think it's all 
acceptable.

Daniel



Re: Why active list and inactive list?

2007-01-22 Thread Balbir Singh

Rik van Riel wrote:

Nick Piggin wrote:


The other nice thing about it was that it didn't have a hard
cutoff that the current reclaim_mapped toggle does -- you could
opt to scan the mapped list at a lower ratio than the unmapped
one. Of course, it also has some downsides too, and would
require retuning...


Here's a simple idea for tuning.

For each list we keep track of:
1) the size of the list
2) the rate at which we scan the list
3) the fraction of (non new) pages that get
   referenced

That way we can determine which list has the largest
fraction of "idle" pages sitting around and consequently
which list should be scanned more aggressively.

For each list we can calculate how frequently the pages
in the list are being used:

pressure = referenced percentage * scan rate / list size

The VM can equalize the pressure by scanning the list with
lower usage less than the other list.  This way the VM can
give the right amount of memory to each type.
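Rik's formula can be worked through with toy numbers (a userspace sketch with invented names and values, kept in integer arithmetic the way kernel code would):

```c
#include <assert.h>

/* pressure = referenced_pct * scan_rate / list_size
 * A lower pressure means the list holds more idle pages and should be
 * scanned more aggressively. */
struct lru {
    long size;            /* pages on the list */
    long scan_rate;       /* pages scanned per interval */
    long referenced_pct;  /* % of scanned (non-new) pages referenced */
};

/* Scaled by 100 to avoid floating point. */
static long pressure(const struct lru *l)
{
    return l->referenced_pct * l->scan_rate * 100 / l->size;
}

/* Nonzero if list a should be scanned harder than list b. */
static int scan_harder(const struct lru *a, const struct lru *b)
{
    return pressure(a) < pressure(b);
}
```

With two equally sized lists scanned at the same rate, one 80% referenced and one 20% referenced, the mostly idle list gets the lower pressure and hence the heavier scanning, which is the equalization Rik describes.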



This sounds like a good thing to start with. I think we can
then use swappiness to decide what to evict.


Of course, each list needs to be divided into inactive and
active like the current VM, in order to make sure that the
pages which are used once cannot push the real working set
of that list out of memory.



Yes, that makes sense.


There is a more subtle problem when the list's working set
is larger than the amount of memory the list has.  In that
situation the VM will be faulting pages back in just after
they got evicted.  Something like my /proc/refaults code
can detect that and adjust the size of the undersized list
accordingly.

Of course, once we properly distinguish between the more
frequently and less frequently accessed pages within each
of the page sets (mapped/anonymous vs. unmapped) and have
the pressure between the lists equalized, why do we need
to keep them separate again?



:-)

--
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: problems with latest smbfs changes on 2.4.34 and security backports

2007-01-22 Thread Willy Tarreau
Hi Dann,

On Mon, Jan 22, 2007 at 11:19:43AM -0700, dann frazier wrote:
> On Mon, Jan 22, 2007 at 10:50:47AM +1100, Grant Coady wrote:
> > On Mon, 22 Jan 2007 00:03:21 +0100, Willy Tarreau <[EMAIL PROTECTED]> wrote:
> > [EMAIL PROTECTED]:/home/other$ uname -r
> > 2.4.34b
> > [EMAIL PROTECTED]:/home/other$ mkdir test
> > [EMAIL PROTECTED]:/home/other$ ln -s test testlink
> > ln: creating symbolic link `testlink' to `test': Operation not permitted
> > [EMAIL PROTECTED]:/home/other$ echo "this is also a test" > test/file
> > [EMAIL PROTECTED]:/home/other$ ln -s test/file test2
> > ln: creating symbolic link `test2' to `test/file': Operation not permitted
> > 
> > trying to create symlinks.
> > 
> > No problems creating symlinks with 2.4.33.3.
> 
> Yes, I've found that this varies depending upon the options passed. If
> uid=0, I can create symlinks, otherwise I always get permission
> denied. This behavior appears to be consistent with 2.6.
> 
> I also need to do some testing with the proposed patch to smbmount
> that will let you omit options (current versions will always pass an
> option to the kernel, even if the user did not provide one).
> If you do not pass options, the behavior should fallback to
> server-provided values.
> 
> Note that this bug has been my only interaction with smbfs, so I'm
> certainly no expert on how it *should* behave. My plan is to
> take all of the use cases we're coming up with and try to maintain
> the "historic" 2.4 behavior as much as possible, but still not
> silently dropping user-provided mount options. When the behavior needs
> to change to honor them, I'll try to match what current 2.6
> does. Make sense?

Yes, it does for me. So to sum up, I apply your patch to 2.4.34.1
and it restores the same behaviour for Santiago and Grant as they get
in 2.6. Whether it's the expected behaviour or not is not the point,
as it will be easier for us to later mimic 2.6 if a change is needed
since we're not experts at all in this area.

If we're all OK for this, I'll go with that.

Thanks guys,
Willy



Re: sigaction's ucontext_t with incorrect stack reference when SA_SIGINFO is being used ?

2007-01-22 Thread Nicholas Miell
On Mon, 2007-01-22 at 09:57 +0100, Xavier Roche wrote:
> Hi folks,
> 
> I have a probably lousy question regarding sigaction() behaviour when an
> alternate signal stack is used: it seems that I can not get the user
> stack reference in the ucontext_t stack context ; ie. the uc_stack
> member contains reference of the alternate signal stack, not the stack
> that was used before the crash.
> 
> Is this is a normal behaviour ? Is there a way to retrieve the original
> user's stack inside the signal callback ?
> 
> The example given below demonstrates the issue:
> top of stack==0x7f3d7000, alternative_stack==0x501010
> SEGV==0x7f3d6ff8; sp==0x501010; current stack is the alternate stack
> 
> It is obvious that the SEGV was a stack overflow: the si_addr address is
> just on the page below the stack limit.

POSIX says:
"the third argument can be cast to a pointer to an object of type
ucontext_t to refer to the receiving thread's context that was
interrupted when the signal was delivered."

so if uc_stack doesn't point to the stack in use immediately prior to
signal generation, this is a bug.

(In theory I should be able to pass the ucontext_t supplied to the
signal handler to setcontext() and resume execution exactly where I left
off -- glibc's refusal to support kernel-generated ucontexts gets in the
way of this, but the point still stands.)

I have no idea who to bother about i386 signal delivery, though. (And I
suspect this bug has probably been copied to other architectures as
well.)

-- 
Nicholas Miell <[EMAIL PROTECTED]>



[git patches] more ftape removal

2007-01-22 Thread Jeff Garzik

Remove bits left over from prior ftape removal.

Please pull from 'ftape' branch of
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/misc-2.6.git ftape

to receive the following updates:

 include/linux/Kbuild   |1 -
 include/linux/mtio.h   |  146 
 include/linux/qic117.h |  290 
 3 files changed, 0 insertions(+), 437 deletions(-)
 delete mode 100644 include/linux/qic117.h

Adrian Bunk (1):
  more ftape removal

diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index 862e483..8c634f9 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -129,7 +129,6 @@ header-y += posix_types.h
 header-y += ppdev.h
 header-y += prctl.h
 header-y += ps2esdi.h
-header-y += qic117.h
 header-y += qnxtypes.h
 header-y += quotaio_v1.h
 header-y += quotaio_v2.h
diff --git a/include/linux/mtio.h b/include/linux/mtio.h
index 8c66151..6f8d2d4 100644
--- a/include/linux/mtio.h
+++ b/include/linux/mtio.h
@@ -10,7 +10,6 @@
 
 #include 
 #include 
-#include 
 
 /*
  * Structures and definitions for mag tape io control commands
@@ -116,32 +115,6 @@ struct mtget {
 #define MT_ISFTAPE_UNKNOWN 0x80 /* obsolete */
 #define MT_ISFTAPE_FLAG0x80
 
-struct mt_tape_info {
-   long t_type;/* device type id (mt_type) */
-   char *t_name;   /* descriptive name */
-};
-
-#define MT_TAPE_INFO   { \
-   {MT_ISUNKNOWN,  "Unknown type of tape device"}, \
-   {MT_ISQIC02,"Generic QIC-02 tape streamer"}, \
-   {MT_ISWT5150,   "Wangtek 5150, QIC-150"}, \
-   {MT_ISARCHIVE_5945L2,   "Archive 5945L-2"}, \
-   {MT_ISCMSJ500,  "CMS Jumbo 500"}, \
-   {MT_ISTDC3610,  "Tandberg TDC 3610, QIC-24"}, \
-   {MT_ISARCHIVE_VP60I,"Archive VP60i, QIC-02"}, \
-   {MT_ISARCHIVE_2150L,"Archive Viper 2150L"}, \
-   {MT_ISARCHIVE_2060L,"Archive Viper 2060L"}, \
-   {MT_ISARCHIVESC499, "Archive SC-499 QIC-36 controller"}, \
-   {MT_ISQIC02_ALL_FEATURES, "Generic QIC-02 tape, all features"}, \
-   {MT_ISWT5099EEN24,  "Wangtek 5099-een24, 60MB"}, \
-   {MT_ISTEAC_MT2ST,   "Teac MT-2ST 155mb data cassette drive"}, \
-   {MT_ISEVEREX_FT40A, "Everex FT40A, QIC-40"}, \
-   {MT_ISONSTREAM_SC,  "OnStream SC-, DI-, DP-, or USB tape drive"}, \
-   {MT_ISSCSI1,"Generic SCSI-1 tape"}, \
-   {MT_ISSCSI2,"Generic SCSI-2 tape"}, \
-   {0, NULL} \
-}
-
 
 /* structure for MTIOCPOS - mag tape get position command */
 
@@ -150,130 +123,11 @@ struct   mtpos {
 };
 
 
-/*  structure for MTIOCVOLINFO, query information about the volume
- *  currently positioned at (zftape)
- */
-struct mtvolinfo {
-   unsigned int mt_volno;   /* vol-number */
-   unsigned int mt_blksz;   /* blocksize used when recording */
-   unsigned int mt_rawsize; /* raw tape space consumed, in kb */
-   unsigned int mt_size;/* volume size after decompression, in kb */
-   unsigned int mt_cmpr:1;  /* this volume has been compressed */
-};
-
-/* raw access to a floppy drive, read and write an arbitrary segment.
- * For ftape/zftape to support formatting etc.
- */
-#define MT_FT_RD_SINGLE  0
-#define MT_FT_RD_AHEAD   1
-#define MT_FT_WR_ASYNC   0 /* start tape only when all buffers are full */
-#define MT_FT_WR_MULTI   1 /* start tape, continue until buffers are empty  */
-#define MT_FT_WR_SINGLE  2 /* write a single segment and stop afterwards*/
-#define MT_FT_WR_DELETE  3 /* write deleted data marks, one segment at time */
-
-struct mtftseg
-{
-   unsigned mt_segno;   /* the segment to read or write */
-   unsigned mt_mode;/* modes for read/write (sync/async etc.) */
-   int  mt_result;  /* result of r/w request, not of the ioctl */
-   void__user *mt_data;/* User space buffer: must be 29kb */
-};
-
-/* get tape capacity (ftape/zftape)
- */
-struct mttapesize {
-   unsigned long mt_capacity; /* entire, uncompressed capacity 
-   * of a cartridge
-   */
-   unsigned long mt_used; /* what has been used so far, raw 
-   * uncompressed amount
-   */
-};
-
-/*  possible values of the ftfmt_op field
- */
-#define FTFMT_SET_PARMS1 /* set software parms */
-#define FTFMT_GET_PARMS2 /* get software parms */
-#define FTFMT_FORMAT_TRACK 3 /* start formatting a tape track   */
-#define FTFMT_STATUS   4 /* monitor formatting a tape track */
-#define FTFMT_VERIFY   5 /* verify the given segment*/
-
-struct ftfmtparms {
-   unsigned char  ft_qicstd;   /* QIC-40/QIC-80/QIC-3010/QIC-3020 */
-   unsigned char  ft_fmtcode;  /* Refer to the QIC specs */
-   unsigned char  ft_fhm;  /* floppy head max */
-   unsigned char  ft_ftm;  

[git patch] mention JFFS impending death

2007-01-22 Thread Jeff Garzik

JFFS is already marked CONFIG_BROKEN in fs/Kconfig, with a note that
it's going away in 2.6.21, but the corresponding update to
feature-removal-schedule.txt was accidentally omitted.  Fixed.

Please pull from 'kill-jffs-prep' branch of
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/misc-2.6.git kill-jffs-prep

to receive the following updates:

 Documentation/feature-removal-schedule.txt |7 +++
 1 files changed, 7 insertions(+), 0 deletions(-)

Jeff Garzik (1):
  Note that JFFS (v1) is to be deleted, in feature-removal-schedule.txt

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index fc53239..0ba6af0 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -318,3 +318,10 @@ Why:   /proc/acpi/button has been replaced by events to the input layer
 Who:   Len Brown <[EMAIL PROTECTED]>
 
 ---
+
+What:  JFFS (version 1)
+When:  2.6.21
+Why:   Unmaintained for years, superseded by JFFS2 for years.
+Who:   Jeff Garzik <[EMAIL PROTECTED]>
+
+---
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Why active list and inactive list?

2007-01-22 Thread Balbir Singh

Nick Piggin wrote:

Balbir Singh wrote:


This makes me wonder if it makes sense to split up the LRU into page
cache LRU and mapped pages LRU. I see two benefits

1. Currently based on swappiness, we might walk an entire list
   searching for page cache pages or mapped pages. With these
   lists separated, it should get easier and faster to implement
   this scheme
2. There is another parallel thread on implementing page cache
   limits. If the lists split out, we need not scan the entire
   list to find page cache pages to evict them.

Of course I might be missing something (some piece of history)


I actually had patches to do "split active lists" a while back.

They worked by lazily moving the page at reclaim-time, based on
whether or not it is mapped. This isn't too much worse than the
kernel's current idea of what a mapped page is.

They actually got a noticable speedup of the swapping kbuild
workload, but at this stage there were some more basic
improvements needed, so the difference could be smaller today.

The other nice thing about it was that it didn't have a hard
cutoff that the current reclaim_mapped toggle does -- you could
opt to scan the mapped list at a lower ratio than the unmapped
one. Of course, it also has some downsides too, and would
require retuning...



Thanks, I am motivated to experiment with the idea. I guess I need
to (re)discover the downsides for myself :-)

--
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-01-22 Thread yunfeng zhang

Re-coded my patch with tab = 8. Sorry!

  Signed-off-by: Yunfeng Zhang <[EMAIL PROTECTED]>

Index: linux-2.6.19/Documentation/vm_pps.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.19/Documentation/vm_pps.txt	2007-01-23 11:32:02.000000000 +0800
@@ -0,0 +1,236 @@
+ Pure Private Page System (pps)
+  [EMAIL PROTECTED]
+  December 24-26, 2006
+
+// Purpose <([{
+This file documents an idea first published at
+http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as part of my
+OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, the
+patch this document describes aims to enhance the performance of the Linux
+swap subsystem. You can find an overview of the idea in section <How to
+Reclaim Pages more Efficiently> and how I patched it into Linux 2.6.19 in
+section <Pure Private Page System -- pps>.
+// }])>
+
+// How to Reclaim Pages more Efficiently <([{
+A good idea originates from overall design and management ability; when you
+look down from a manager's view, you free yourself from disordered code and
+can spot some problems immediately.
+
+In a modern OS, the memory subsystem can be divided into three layers:
+1) Space layer (InodeSpace, UserSpace and CoreSpace).
+2) VMA layer (PrivateVMA and SharedVMA, memory architecture-independent layer).
+3) Page table, zone and memory inode layer (architecture-dependent).
+It may seem that Page/PTE should be placed on the 3rd layer, but here it is
+placed on the 2nd layer since it is the basic unit of a VMA.
+
+Since the 2nd layer gathers most of the page-access statistics, it is natural
+that the swap subsystem should be deployed and implemented on the 2nd layer.
+
+Undoubtedly, this approach has some virtues:
+1) The SwapDaemon can collect statistics on how processes access pages and use
+   them to unmap PTEs. SMP benefits especially, since we can use
+   flush_tlb_range to unmap PTEs in batches rather than issuing a TLB IPI
+   interrupt per page as the current legacy Linux swap subsystem does.
+2) Page faults can issue better readahead requests, since history data shows
+   that all related pages have conglomerating affinity. In contrast, Linux
+   page-fault readahead fetches the pages adjacent to the SwapSpace position
+   of the current page-fault page.
+3) It conforms to the POSIX madvise API family.
+4) It simplifies the Linux memory model dramatically. Keep in mind that the
+   new swap strategy works from the top down. In fact, the legacy Linux swap
+   subsystem is maybe the only one that works from the bottom up.
+
+Unfortunately, the Linux 2.6.19 swap subsystem is based on the 3rd layer -- a
+system on memory node::active_list/inactive_list.
+
+I've finished a patch, see section <Pure Private Page System -- pps>. Note, it
+ISN'T perfect.
+// }])>
+
+// Pure Private Page System -- pps  <([{
+As I mentioned in the previous section, fully applying my idea would require
+uprooting the page-centered swap subsystem and migrating it onto VMAs, but a
+huge gap has defeated me -- active_list and inactive_list. In fact, you can
+find lru_add_active code almost anywhere... It's IMPOSSIBLE for me to complete
+this by myself. This is also the difference between my design and Linux: in my
+OS, a page is totally in the charge of its new owner, while in Linux the page
+management system still traces it through the PG_active flag.
+
+So I conceived another solution:) That is, set up an independent page-recycling
+system rooted in the legacy Linux page system -- pps, intercept all private
+pages belonging to PrivateVMAs into pps, then use pps to recycle them. The
+whole job consists of two parts; here is the first -- PrivateVMA-oriented; the
+other is SharedVMA-oriented (it should be called SPS) and is scheduled for the
+future. Of course, once both are done, the legacy Linux page system will be
+empty.
+
+In fact, pps is centered on how to better collect and unmap process private
+pages; the whole process is divided into six stages -- . PPS uses
+init_mm::mm_list to enumerate all swappable UserSpace (shrink_private_vma)
+in mm/vmscan.c. The other sections cover the remaining aspects of pps:
+1)  is the basic data definition.
+2)  is focused on synchronization.
+3)  how private pages enter and leave pps.
+4)  which VMAs belong to pps.
+5)  the new daemon thread kppsd, pps statistics, etc.
+
+I'm also glad to highlight a new idea of mine -- dftlb, which is described in
+section <Delay to Flush TLB (dftlb)>.
+// }])>
+
+// Delay to Flush TLB (dftlb) <([{
+Delay to flush TLB (dftlb) is introduced to enhance TLB flushing efficiency.
+In brief, when we want to unmap a page from the page table of a process, why
+send a TLB IPI to the other CPUs immediately? Since every CPU has a timer
+interrupt, we can insert flushing tasks into the timer interrupt routine and
+get TLB flushing almost free of charge.
+
+The trick is implemented in
+1) TLB flushing task is added in fill_in_tlb_task of mm/vmscan.c.
+2) timer_flush_tlb_tasks of kernel/timer.c is used by other CPUs to execute
+   flushing tasks.
+3) all data are 

Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Robert Hancock

Björn Steinbrink wrote:

Hm, I don't think it is unhappy about looking at NV_INT_STATUS_CK804.
I'm running 2.6.20-rc5 with the INT_DEV check removed for 8 hours now
without a single problem and that should still look at
NV_INT_STATUS_CK804, right?
I just noticed that my last email might not have been clear enough. The
exceptions happened when I re-enabled the return statement in addition
to the debug message. Without the INT_DEV check, it is completely fine
AFAICT.


Indeed, it seems to be just the NV_INT_DEV check that is problematic. 
Here's a patch that's likely better to test, it forces the NV_INT_DEV 
flag on when a command is active, and also fixes that questionable code 
in nv_host_intr that I mentioned.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c	2007-01-19 19:18:53.000000000 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c	2007-01-22 22:33:43.000000000 -0600
@@ -700,7 +700,6 @@ static void nv_adma_check_cpb(struct ata
 static int nv_host_intr(struct ata_port *ap, u8 irq_stat)
 {
struct ata_queued_cmd *qc = ata_qc_from_tag(ap, ap->active_tag);
-   int handled;
 
/* freeze if hotplugged */
if (unlikely(irq_stat & (NV_INT_ADDED | NV_INT_REMOVED))) {
@@ -719,13 +718,7 @@ static int nv_host_intr(struct ata_port 
}
 
/* handle interrupt */
-   handled = ata_host_intr(ap, qc);
-   if (unlikely(!handled)) {
-   /* spurious, clear it */
-   ata_check_status(ap);
-   }
-
-   return 1;
+   return ata_host_intr(ap, qc);
 }
 
 static irqreturn_t nv_adma_interrupt(int irq, void *dev_instance)
@@ -752,6 +745,11 @@ static irqreturn_t nv_adma_interrupt(int
	if (pp->flags & NV_ADMA_PORT_REGISTER_MODE) {
		u8 irq_stat = readb(host->mmio_base + NV_INT_STATUS_CK804)
			>> (NV_INT_PORT_SHIFT * i);
+		if (ata_tag_valid(ap->active_tag))
+			/* NV_INT_DEV indication seems unreliable at times,
+			   at least in ADMA mode. Force it on always when a
+			   command is active, to prevent losing interrupts. */
+			irq_stat |= NV_INT_DEV;
		handled += nv_host_intr(ap, irq_stat);
		continue;
	}


Re: [patch] faster vgetcpu using sidt (take 2)

2007-01-22 Thread dean gaudet
On Thu, 18 Jan 2007, Andi Kleen wrote:

> > let me know what you think... thanks.
> 
> It's ok, although I would like to have the file in a separate directory.

cool -- do you have a directory in mind?

and would you like this change as two separate patches or one combined 
patch?

thanks
-dean


Re: [PATCH] binfmt_elf: core dump masking support

2007-01-22 Thread Kawai, Hidehiro
Hi,

>>>(run echo 1 > coremask, echo 0 > coremask in a loop while dumping
>>>core. Do you have enough locking to make it work as expected?)
>>
>>Currently, no lock is acquired.  But I think the kernel only has to
>>preserve the coremask setting in a local variable at the beginning of
>>core dumping.  I'm going to do this in the next version.
> 
> No, I do not think that is enough. At minimum, you'd need atomic_t
> variable. But I'd recomend against it. Playing with locking is tricky.

Why do you think it is not enough?  I think no locking is needed.
My design principle is that the core dump routine is controlled by
the bitmask that was assigned to the dumping process at the moment
the core dump started.  So if the coremask setting is changed while
the core is being dumped, the change doesn't affect the in-progress
dump.  This can be implemented as follows:

   core_dump_start:
  unsigned int mask = current->mm->coremask;
  for each VMA {
write a header which depends on the result of maydump(vma, mask)
  }
  for each VMA {
write a body which depends on the result of maydump(vma, mask)
  }

NOTE:
  maydump() is the central function, which decides whether a given
  VMA should be dumped or not.

What do you think about this?


Best regards,
-- 
Hidehiro Kawai
Hitachi, Ltd., Systems Development Laboratory




Re: Suspend to RAM generates oops and general protection fault

2007-01-22 Thread Jean-Marc Valin
>>> will be a device driver. Common causes of suspend/resume problems from
>>> the list you give below are acpi modules, bluetooth and usb. I'd also be
>>> consider pcmcia, drm and fuse possibilities. But again, go for unloading
>>> everything possible in the first instance.
>> Actually, the reason I sent this is that when I showed the oops/gpf to
>> Matthew Garrett at linux.conf.au, he said it looked like a CPU hotplug
>> problem and suggested I send it to lkml. BTW, with 2.6.20-rc5, the
>> suspend to RAM now works ~95% of the time.
> 
> Try a kernel without CONFIG_SMP... that will verify if it is SMP
> related.

Well, this happens to be my main work machine, which I'm not willing to
have running at half speed for several weeks. Anything else you can suggest?

Jean-Marc


Re: Suspend to RAM generates oops and general protection fault

2007-01-22 Thread Jean-Marc Valin
>> I just encountered the following oops and general protection fault
>> trying to suspend/resume my laptop. I've got a Dell D820 laptop with a 2
>> GHz Core 2 Duo CPU. It usually suspends/resumes fine but not always. The
>> relevant errors are below but the full dmesg log is at
>> http://people.xiph.org/~jm/suspend_resume_oops.txt and my config is in
>> http://people.xiph.org/~jm/config-2.6.20-rc5.txt
>>
>> This happens when I'm running 2.6.20-rc5. The previous kernel version I
>> was using is 2.6.19-rc6 and was much more broken (second attempt
>> *always* failed), so it's probably not a regression.
> 
> This is a shot against the odds, but could you please check if the attached
> patch has any effect?

Thanks, I'll try that. It may take a while because the problem only
happened once in dozens of suspend/resume cycles.

Jean-Marc

> Rafael
> 
> 
> 
> 
> 
> 
> Both process_zones() and drain_node_pages() check for populated zones before
> touching pagesets.  However, __drain_pages() does not do so.
> 
> This may result in a NULL pointer dereference for pagesets in unpopulated
> zones if a NUMA setup is combined with cpu hotplug.
> 
> Initially the unpopulated zone has the pcp pointers pointing to the boot
> pagesets.  Since the zone is not populated the boot pageset pointers will
> not be changed during page allocator and slab bootstrap.
> 
> If a cpu is later brought down (first call to __drain_pages()) then the pcp
> pointers for cpus in unpopulated zones are set to NULL since __drain_pages
> does not first check for an unpopulated zone.
> 
> If the cpu is then brought up again then we call process_zones(), which will
> ignore the unpopulated zone.  So the pageset pointers will still be NULL.
> 
> If the cpu is then again brought down then __drain_pages will attempt to drain
> pages by following the NULL pageset pointer for unpopulated zones.
> 
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
> 
> ---
>  mm/page_alloc.c |3 +++
>  1 file changed, 3 insertions(+)
> 
> Index: linux-2.6.20-rc4/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.20-rc4.orig/mm/page_alloc.c
> +++ linux-2.6.20-rc4/mm/page_alloc.c
> @@ -714,6 +714,9 @@ static void __drain_pages(unsigned int c
> 	for_each_zone(zone) {
> 		struct per_cpu_pageset *pset;
> 
> +		if (!populated_zone(zone))
> +			continue;
> +
>   pset = zone_pcp(zone, cpu);
>   for (i = 0; i < ARRAY_SIZE(pset->pcp); i++) {
>   struct per_cpu_pages *pcp;


Re: Why active list and inactive list?

2007-01-22 Thread Rik van Riel

Nick Piggin wrote:


The other nice thing about it was that it didn't have a hard
cutoff that the current reclaim_mapped toggle does -- you could
opt to scan the mapped list at a lower ratio than the unmapped
one. Of course, it also has some downsides too, and would
require retuning...


Here's a simple idea for tuning.

For each list we keep track of:
1) the size of the list
2) the rate at which we scan the list
3) the fraction of (non new) pages that get
   referenced

That way we can determine which list has the largest
fraction of "idle" pages sitting around and consequently
which list should be scanned more aggressively.

For each list we can calculate how frequently the pages
in the list are being used:

pressure = referenced percentage * scan rate / list size

The VM can equalize the pressure by scanning the list with
lower usage less than the other list.  This way the VM can
give the right amount of memory to each type.

Of course, each list needs to be divided into inactive and
active like the current VM, in order to make sure that the
pages which are used once cannot push the real working set
of that list out of memory.

There is a more subtle problem when the list's working set
is larger than the amount of memory the list has.  In that
situation the VM will be faulting pages back in just after
they got evicted.  Something like my /proc/refaults code
can detect that and adjust the size of the undersized list
accordingly.

Of course, once we properly distinguish between the more
frequently and less frequently accessed pages within each
of the page sets (mapped/anonymous vs. unmapped) and have
the pressure between the lists equalized, why do we need
to keep them separate again?

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem

2007-01-22 Thread yunfeng zhang


Patched against 2.6.19 leads to:

mm/vmscan.c: In function `shrink_pvma_scan_ptes':
mm/vmscan.c:1340: too many arguments to function `page_remove_rmap'

So changed
 page_remove_rmap(series.pages[i], vma);
to
 page_remove_rmap(series.pages[i]);



My work was against 2.6.19; when I updated to 2.6.20-rc5, the function had changed.


But your patch doesn't offer any swap-performance improvement for either swsusp
or tmpfs.  Swap-in is still half the speed of swap-out.



The current Linux page allocator provides pages fairly to every process, and
the swap daemon is started only when memory is low, so by the time it starts
to scan the active_list, the private pages of different processes are mixed up
with each other. vmscan.c:shrink_list() is the only place that attaches a disk
swap page to a page on the active_list; as a result, all private pages lose
their affinity on the swap partition. I will post a test later...


Re: Why active list and inactive list?

2007-01-22 Thread Nick Piggin

Balbir Singh wrote:


This makes me wonder if it makes sense to split up the LRU into page
cache LRU and mapped pages LRU. I see two benefits

1. Currently based on swappiness, we might walk an entire list
   searching for page cache pages or mapped pages. With these
   lists separated, it should get easier and faster to implement
   this scheme
2. There is another parallel thread on implementing page cache
   limits. If the lists split out, we need not scan the entire
   list to find page cache pages to evict them.

Of course I might be missing something (some piece of history)


I actually had patches to do "split active lists" a while back.

They worked by lazily moving the page at reclaim-time, based on
whether or not it is mapped. This isn't too much worse than the
kernel's current idea of what a mapped page is.

They actually got a noticable speedup of the swapping kbuild
workload, but at this stage there were some more basic
improvements needed, so the difference could be smaller today.

The other nice thing about it was that it didn't have a hard
cutoff that the current reclaim_mapped toggle does -- you could
opt to scan the mapped list at a lower ratio than the unmapped
one. Of course, it also has some downsides too, and would
require retuning...

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 




Re: Why active list and inactive list?

2007-01-22 Thread Christoph Lameter
On Tue, 23 Jan 2007, Balbir Singh wrote:

> Yes, good point, I see what you mean in terms of impact. But the trade
> off could come from shrink_active_list() which does
> 
> list_del(&page->lru)
> if (!reclaim_mapped && other_conditions)
>   list_add(&page->lru, &l_active);
> ...
> 
> In the case mentioned above, we would triple the cachlines when an area
> is mapped/unmapped (which might be acceptable since it is a state change
> for the page ;) ). In the trade-off I mentioned, it would happen
> everytime reclaim is invoked and it has nothing to do with a page changing
> state.
> 
> Did I miss something?

We do the list_del/list_add right now in reclaim while moving pages 
between active and inactive lists. However, reclaim is not run until the 
system is under memory pressure. Reclaim runs rarely and then lots of 
these movements occur. At that point it is likely that the 
cachelines are already available since the page structs had to be touched 
for earlier movements.



Re: Why active list and inactive list?

2007-01-22 Thread Balbir Singh

Christoph Lameter wrote:


perfmon can do much of what you are looking for.



Thanks, I'll look into it.

--
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Why active list and inactive list?

2007-01-22 Thread Balbir Singh

Christoph Lameter wrote:

On Tue, 23 Jan 2007, Balbir Singh wrote:


When you unmap or map, you need to touch the pte entries and know the
pages involved, so shouldn't be equivalent to a list_del and list_add
for each page impacted by the map/unmap operation?


When you unmap and map you must currently get exclusive access to the
cachelines of the pte and the cacheline of the page struct. If we use a
list_move on page->lru then we have would have to update pointers in up
to 4 other page structs. Thus we need exclusive access to 4 additional
cachelines. This triples the number of cachelines touched. Instead of 2
cachelines we need 6.




Yes, good point, I see what you mean in terms of impact. But the trade
off could come from shrink_active_list() which does

list_del(&page->lru)
if (!reclaim_mapped && other_conditions)
list_add(&page->lru, &l_active);
...

In the case mentioned above, we would triple the cachlines when an area
is mapped/unmapped (which might be acceptable since it is a state change
for the page ;) ). In the trade-off I mentioned, it would happen
everytime reclaim is invoked and it has nothing to do with a page changing
state.

Did I miss something?

--
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Why active list and inactive list?

2007-01-22 Thread Christoph Lameter
On Tue, 23 Jan 2007, Balbir Singh wrote:

> I have always wondered if it would be useful to have a kernel debug
> feature that can extract page references per task, it would be good
> to see the page references (last 'n') of a workload that is not
> doing too well on a particular system.

perfmon can do much of what you are looking for.



Re: Why active list and inactive list?

2007-01-22 Thread Balbir Singh

Rik van Riel wrote:

Christoph Lameter wrote:


With the proposed schemd you would have to move pages between lists if
they are mapped and unmapped by a process. Terminating a process could
lead to lots of pages moving to the unnmapped list.


That could be a problem.

Another problem is that any such heuristic in the VM is
bound to have corner cases that some workloads will hit.

It would be really nice if we came up with a page replacement
algorithm that did not need many extra heuristics to make it
work...



Yes, it's damn hard at times. I was reading through an article
(Architectural support for translation table management in large address
space machines -- Huck and Hays); it talks about how object-oriented
systems encourage more sharing and decrease the locality of the resulting
virtual-address memory stream. Even multithreading tends to hurt
locality of reference.

Unfortunately, we have only heuristics to go by, and of course their
mathematical model.

I have always wondered if it would be useful to have a kernel debug
feature that can extract page references per task, it would be good
to see the page references (last 'n') of a workload that is not
doing too well on a particular system.



--
Balbir Singh
Linux Technology Center
IBM, ISTL


[PATCH 05/12] md: move write operations to raid5_run_ops

2007-01-22 Thread Dan Williams
From: Dan Williams <[EMAIL PROTECTED]>

handle_stripe sets STRIPE_OP_PREXOR, STRIPE_OP_BIODRAIN, STRIPE_OP_POSTXOR
to request a write to the stripe cache.  raid5_run_ops is triggered to run
and executes the request outside the stripe lock.

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |  152 +---
 1 files changed, 131 insertions(+), 21 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 2c74f9b..2390657 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1788,7 +1788,75 @@ static void compute_block_2(struct stripe_head *sh, int dd_idx1, int dd_idx2)
}
 }
 
+static int handle_write_operations5(struct stripe_head *sh, int rcw, int expand)
+{
+   int i, pd_idx = sh->pd_idx, disks = sh->disks;
+   int locked=0;
+
+   if (rcw == 0) {
+   /* skip the drain operation on an expand */
+   if (!expand) {
+   BUG_ON(test_and_set_bit(STRIPE_OP_BIODRAIN,
+   &sh->ops.pending));
+   sh->ops.count++;
+   }
+
+   BUG_ON(test_and_set_bit(STRIPE_OP_POSTXOR, &sh->ops.pending));
+   sh->ops.count++;
+
+   for (i=disks ; i-- ;) {
+   struct r5dev *dev = &sh->dev[i];
+
+   if (dev->towrite) {
+   set_bit(R5_LOCKED, &dev->flags);
+   if (!expand)
+   clear_bit(R5_UPTODATE, &dev->flags);
+   locked++;
+   }
+   }
+   } else {
+   BUG_ON(!(test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags) ||
+   test_bit(R5_Wantcompute, &sh->dev[pd_idx].flags)));
+
+   BUG_ON(test_and_set_bit(STRIPE_OP_PREXOR, &sh->ops.pending) ||
+   test_and_set_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending) ||
+   test_and_set_bit(STRIPE_OP_POSTXOR, &sh->ops.pending));
+
+   sh->ops.count += 3;
+
+   for (i=disks ; i-- ;) {
+   struct r5dev *dev = &sh->dev[i];
+   if (i==pd_idx)
+   continue;
 
+   /* For a read-modify write there may be blocks that are
+* locked for reading while others are ready to be written
+* so we distinguish these blocks by the R5_Wantprexor bit
+*/
+   if (dev->towrite &&
+   (test_bit(R5_UPTODATE, &dev->flags) ||
+   test_bit(R5_Wantcompute, &dev->flags))) {
+   set_bit(R5_Wantprexor, &dev->flags);
+   set_bit(R5_LOCKED, &dev->flags);
+   clear_bit(R5_UPTODATE, &dev->flags);
+   locked++;
+   }
+   }
+   }
+
+   /* keep the parity disk locked while asynchronous operations
+* are in flight
+*/
+   set_bit(R5_LOCKED, &sh->dev[pd_idx].flags);
+   clear_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
+   locked++;
+
+   PRINTK("%s: stripe %llu locked: %d pending: %lx\n",
+   __FUNCTION__, (unsigned long long)sh->sector,
+   locked, sh->ops.pending);
+
+   return locked;
+}
 
 /*
  * Each stripe/dev can have one or more bion attached.
@@ -2151,8 +2219,67 @@ static void handle_stripe5(struct stripe_head *sh)
	set_bit(STRIPE_HANDLE, &sh->state);
}
 
-   /* now to consider writing and what else, if anything should be read */
-   if (to_write) {
+   /* Now we check to see if any write operations have recently
+* completed
+*/
+
+   /* leave prexor set until postxor is done, allows us to distinguish
+* a rmw from a rcw during biodrain
+*/
+   if (test_bit(STRIPE_OP_PREXOR, &sh->ops.complete) &&
+   test_bit(STRIPE_OP_POSTXOR, &sh->ops.complete)) {
+
+   clear_bit(STRIPE_OP_PREXOR, &sh->ops.complete);
+   clear_bit(STRIPE_OP_PREXOR, &sh->ops.ack);
+   clear_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
+
+   for (i=disks; i--;)
+   clear_bit(R5_Wantprexor, &sh->dev[i].flags);
+   }
+
+   /* if only POSTXOR is set then this is an 'expand' postxor */
+   if (test_bit(STRIPE_OP_BIODRAIN, &sh->ops.complete) &&
+   test_bit(STRIPE_OP_POSTXOR, &sh->ops.complete)) {
+
+   clear_bit(STRIPE_OP_BIODRAIN, &sh->ops.complete);
+   clear_bit(STRIPE_OP_BIODRAIN, &sh->ops.ack);
+   clear_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending);
+
+   clear_bit(STRIPE_OP_POSTXOR, &sh->ops.complete);
+   clear_bit(STRIPE_OP_POSTXOR, &sh->ops.ack);
+   clear_bit(STRIPE_OP_POSTXOR, &sh->ops.pending);
+
+   /* All the 'written' buffers and the parity block are ready to 
be
+ 

[PATCH 06/12] md: move raid5 compute block operations to raid5_run_ops

2007-01-22 Thread Dan Williams
From: Dan Williams <[EMAIL PROTECTED]>

handle_stripe sets STRIPE_OP_COMPUTE_BLK to request servicing from
raid5_run_ops.  It also sets a flag for the block being computed to let
other parts of handle_stripe submit dependent operations.  raid5_run_ops
guarantees that the compute operation completes before any dependent
operation starts.

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |  125 +++-
 1 files changed, 93 insertions(+), 32 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 2390657..279a30c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1980,7 +1980,7 @@ static void handle_stripe5(struct stripe_head *sh)
int i;
int syncing, expanding, expanded;
int locked=0, uptodate=0, to_read=0, to_write=0, failed=0, written=0;
-   int non_overwrite = 0;
+   int compute=0, req_compute=0, non_overwrite=0;
int failed_num=0;
struct r5dev *dev;
unsigned long pending=0;
@@ -2032,8 +2032,8 @@ static void handle_stripe5(struct stripe_head *sh)
/* now count some things */
 	if (test_bit(R5_LOCKED, &dev->flags)) locked++;
 	if (test_bit(R5_UPTODATE, &dev->flags)) uptodate++;
+	if (test_bit(R5_Wantcompute, &dev->flags)) BUG_ON(++compute > 1);
 
-   
if (dev->toread) to_read++;
if (dev->towrite) {
to_write++;
@@ -2188,31 +2188,82 @@ static void handle_stripe5(struct stripe_head *sh)
 * parity, or to satisfy requests
 * or to load a block that is being partially written.
 */
-	if (to_read || non_overwrite || (syncing && (uptodate < disks)) || expanding) {
-		for (i=disks; i--;) {
-			dev = &sh->dev[i];
-			if (!test_bit(R5_LOCKED, &dev->flags) && !test_bit(R5_UPTODATE, &dev->flags) &&
-			    (dev->toread ||
-			     (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
-			     syncing ||
-			     expanding ||
-			     (failed && (sh->dev[failed_num].toread ||
-					 (sh->dev[failed_num].towrite && !test_bit(R5_OVERWRITE, &sh->dev[failed_num].flags))))
-			    )
-			) {
-				/* we would like to get this block, possibly
-				 * by computing it, but we might not be able to
+	if (to_read || non_overwrite || (syncing && (uptodate + compute < disks)) || expanding ||
+	    test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending)) {
+
+		/* Clear completed compute operations.  Parity recovery
+		 * (STRIPE_OP_MOD_REPAIR_PD) implies a write-back which is handled
+		 * later on in this routine
+		 */
+		if (test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.complete) &&
+		    !test_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending)) {
+			clear_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.complete);
+			clear_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.ack);
+			clear_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending);
+		}
+
+		/* look for blocks to read/compute, skip this if a compute
+		 * is already in flight, or if the stripe contents are in the
+		 * midst of changing due to a write
+		 */
+		if (!test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending) &&
+		    !test_bit(STRIPE_OP_PREXOR, &sh->ops.pending) &&
+		    !test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)) {
+			for (i=disks; i--;) {
+				dev = &sh->dev[i];
+
+				/* don't schedule compute operations or reads on
+				 * the parity block while a check is in flight
 				 */
-				if (uptodate == disks-1) {
-					PRINTK("Computing block %d\n", i);
-					compute_block(sh, i);
-					uptodate++;
-				} else if (test_bit(R5_Insync, &dev->flags)) {
-					set_bit(R5_LOCKED, &dev->flags);
-					set_bit(R5_Wantread, &dev->flags);
-					locked++;
-					PRINTK("Reading block %d (sync=%d)\n",
-						i, syncing);
+				if ((i == sh->pd_idx) && test_bit(STRIPE_OP_CHECK, &sh->ops.pending))
+					continue;
+
+				if (!test_bit(R5_LOCKED, &dev->flags) && !test_bit(R5_UPTODATE, &dev->flags) &&
+   

[PATCH 03/12] md: add raid5_run_ops and support routines

2007-01-22 Thread Dan Williams
From: Dan Williams <[EMAIL PROTECTED]>

Prepare the raid5 implementation to use async_tx for running stripe
operations:
* biofill (copy data into request buffers to satisfy a read request)
* compute block (generate a missing block in the cache from the other
blocks)
* prexor (subtract existing data as part of the read-modify-write process)
* biodrain (copy data out of request buffers to satisfy a write request)
* postxor (recalculate parity for new data that has entered the cache)
* check (verify that the parity is correct)
* io (submit i/o to the member disks)
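
The operations listed above map naturally onto per-stripe flag bits. As a stand-alone sketch (the enum mirrors the STRIPE_OP_* names this series adds to include/linux/raid/raid5.h, but the helper functions are invented for illustration):

```c
#include <assert.h>

/* Bit positions for the stripe operations described above.
 * Illustrative model only, not the kernel's definitions. */
enum stripe_op {
	STRIPE_OP_BIOFILL,      /* copy data into request buffers (read)  */
	STRIPE_OP_COMPUTE_BLK,  /* regenerate a missing block             */
	STRIPE_OP_PREXOR,       /* subtract old data (read-modify-write)  */
	STRIPE_OP_BIODRAIN,     /* copy data out of request buffers       */
	STRIPE_OP_POSTXOR,      /* recompute parity over the new data     */
	STRIPE_OP_CHECK,        /* verify that the parity is correct      */
	STRIPE_OP_IO,           /* submit i/o to the member disks         */
};

/* request an operation by setting its bit in a pending mask */
static void op_set(unsigned long *mask, enum stripe_op op)
{
	*mask |= 1UL << op;
}

/* query whether an operation has been requested */
static int op_test(unsigned long mask, enum stripe_op op)
{
	return (mask >> op) & 1;
}
```

The real code performs these updates with atomic bitops (set_bit/test_bit) because handle_stripe and the completion callbacks run concurrently.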

Changelog:
* removed ops_complete_biodrain in favor of ops_complete_postxor and
ops_complete_write.
* removed the workqueue
* call bi_end_io for reads in ops_complete_biofill

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |  520 
 include/linux/raid/raid5.h |   63 +
 2 files changed, 580 insertions(+), 3 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 68b6fea..e70ee17 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -52,6 +52,7 @@
 #include "raid6.h"
 
 #include 
+#include 
 
 /*
  * Stripe cache
@@ -324,6 +325,525 @@ static struct stripe_head *get_active_stripe(raid5_conf_t 
*conf, sector_t sector
return sh;
 }
 
+static int
+raid5_end_read_request(struct bio * bi, unsigned int bytes_done, int error);
+static int
+raid5_end_write_request (struct bio *bi, unsigned int bytes_done, int error);
+
+static void ops_run_io(struct stripe_head *sh)
+{
+   raid5_conf_t *conf = sh->raid_conf;
+   int i, disks = sh->disks;
+
+   might_sleep();
+
+   for (i=disks; i-- ;) {
+   int rw;
+   struct bio *bi;
+   mdk_rdev_t *rdev;
+		if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags))
+			rw = WRITE;
+		else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags))
+			rw = READ;
+		else
+			continue;
+
+		bi = &sh->dev[i].req;
+
+   bi->bi_rw = rw;
+   if (rw == WRITE)
+   bi->bi_end_io = raid5_end_write_request;
+   else
+   bi->bi_end_io = raid5_end_read_request;
+
+   rcu_read_lock();
+		rdev = rcu_dereference(conf->disks[i].rdev);
+		if (rdev && test_bit(Faulty, &rdev->flags))
+			rdev = NULL;
+		if (rdev)
+			atomic_inc(&rdev->nr_pending);
+   rcu_read_unlock();
+
+   if (rdev) {
+			if (test_bit(STRIPE_SYNCING, &sh->state) ||
+			    test_bit(STRIPE_EXPAND_SOURCE, &sh->state) ||
+			    test_bit(STRIPE_EXPAND_READY, &sh->state))
+				md_sync_acct(rdev->bdev, STRIPE_SECTORS);
+
+   bi->bi_bdev = rdev->bdev;
+			PRINTK("%s: for %llu schedule op %ld on disc %d\n",
+				__FUNCTION__, (unsigned long long)sh->sector,
+				bi->bi_rw, i);
+			atomic_inc(&sh->count);
+			bi->bi_sector = sh->sector + rdev->data_offset;
+			bi->bi_flags = 1 << BIO_UPTODATE;
+			bi->bi_vcnt = 1;
+			bi->bi_max_vecs = 1;
+			bi->bi_idx = 0;
+			bi->bi_io_vec = &sh->dev[i].vec;
+			bi->bi_io_vec[0].bv_len = STRIPE_SIZE;
+			bi->bi_io_vec[0].bv_offset = 0;
+			bi->bi_size = STRIPE_SIZE;
+			bi->bi_next = NULL;
+			if (rw == WRITE &&
+			    test_bit(R5_ReWrite, &sh->dev[i].flags))
+				atomic_add(STRIPE_SECTORS, &rdev->corrected_errors);
+   generic_make_request(bi);
+   } else {
+   if (rw == WRITE)
+   set_bit(STRIPE_DEGRADED, >state);
+   PRINTK("skip op %ld on disc %d for sector %llu\n",
+   bi->bi_rw, i, (unsigned long long)sh->sector);
+			clear_bit(R5_LOCKED, &sh->dev[i].flags);
+			set_bit(STRIPE_HANDLE, &sh->state);
+   }
+   }
+}
+
+static struct dma_async_tx_descriptor *
+async_copy_data(int frombio, struct bio *bio, struct page *page, sector_t sector,
+   struct dma_async_tx_descriptor *tx)
+{
+   struct bio_vec *bvl;
+   struct page *bio_page;
+   int i;
+   int page_offset;
+
+   if (bio->bi_sector >= sector)
+   page_offset = (signed)(bio->bi_sector - sector) * 512;
+   else
+   page_offset = (signed)(sector - bio->bi_sector) * -512;
+   bio_for_each_segment(bvl, bio, i) {
+   int len = bio_iovec_idx(bio,i)->bv_len;
+   int clen;
+ 

[PATCH 04/12] md: use raid5_run_ops for stripe cache operations

2007-01-22 Thread Dan Williams
From: Dan Williams <[EMAIL PROTECTED]>

Each stripe has three flag variables to reflect the state of operations
(pending, ack, and complete).
-pending: set to request servicing in raid5_run_ops
-ack: set to reflect that raid5_runs_ops has seen this request
-complete: set when the operation is complete and it is ok for handle_stripe5
to clear 'pending' and 'ack'.
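
The three-stage lifecycle above can be sketched as a stand-alone state machine. Everything below is an illustrative model, not the kernel code; the real flags live in struct stripe_head and are manipulated with atomic bitops under the stripe lock:

```c
#include <assert.h>

/* Minimal model of the pending -> ack -> complete flag lifecycle.
 * One unsigned long per stage, one bit per operation type. */
struct stripe_ops_state {
	unsigned long pending;   /* handle_stripe5 requests servicing  */
	unsigned long ack;       /* raid5_run_ops has seen the request */
	unsigned long complete;  /* operation finished                 */
};

/* handle_stripe5 side: queue an operation */
static void op_request(struct stripe_ops_state *s, int op)
{
	s->pending |= 1UL << op;
}

/* raid5_run_ops side: pick up a requested operation exactly once */
static int op_acknowledge(struct stripe_ops_state *s, int op)
{
	if ((s->pending & 1UL << op) && !(s->ack & 1UL << op)) {
		s->ack |= 1UL << op;
		return 1;	/* caller should run the operation */
	}
	return 0;		/* already in flight, or not requested */
}

/* completion-callback side */
static void op_complete(struct stripe_ops_state *s, int op)
{
	s->complete |= 1UL << op;
}

/* handle_stripe5 side: it is now ok to clear 'pending' and 'ack' */
static void op_retire(struct stripe_ops_state *s, int op)
{
	s->complete &= ~(1UL << op);
	s->ack      &= ~(1UL << op);
	s->pending  &= ~(1UL << op);
}
```

The double-dequeue protection here is what the patch's check_op() macro implements with test_and_set_bit on the ack mask.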

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |   65 +---
 1 files changed, 56 insertions(+), 9 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index e70ee17..2c74f9b 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -126,6 +126,7 @@ static void __release_stripe(raid5_conf_t *conf, struct 
stripe_head *sh)
}
md_wakeup_thread(conf->mddev->thread);
} else {
+   BUG_ON(sh->ops.pending);
 		if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) {
 			atomic_dec(&conf->preread_active_stripes);
 			if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD)
@@ -225,7 +226,8 @@ static void init_stripe(struct stripe_head *sh, sector_t 
sector, int pd_idx, int
 
 	BUG_ON(atomic_read(&sh->count) != 0);
 	BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
-   
+   BUG_ON(sh->ops.pending || sh->ops.ack || sh->ops.complete);
+
CHECK_DEVLOCK();
PRINTK("init_stripe called, stripe %llu\n", 
(unsigned long long)sh->sector);
@@ -241,11 +243,11 @@ static void init_stripe(struct stripe_head *sh, sector_t 
sector, int pd_idx, int
for (i = sh->disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
 
-		if (dev->toread || dev->towrite || dev->written ||
+		if (dev->toread || dev->read || dev->towrite || dev->written ||
 		    test_bit(R5_LOCKED, &dev->flags)) {
-			printk("sector=%llx i=%d %p %p %p %d\n",
+			printk("sector=%llx i=%d %p %p %p %p %d\n",
 			       (unsigned long long)sh->sector, i, dev->toread,
-			       dev->towrite, dev->written,
+			       dev->read, dev->towrite, dev->written,
 			       test_bit(R5_LOCKED, &dev->flags));
BUG();
}
@@ -325,6 +327,43 @@ static struct stripe_head *get_active_stripe(raid5_conf_t 
*conf, sector_t sector
return sh;
 }
 
+/* check_op() ensures that we only dequeue an operation once */
+#define check_op(op) do {\
+	if (test_bit(op, &sh->ops.pending) &&\
+	    !test_bit(op, &sh->ops.complete)) {\
+		if (test_and_set_bit(op, &sh->ops.ack))\
+			clear_bit(op, &pending);\
+		else\
+			ack++;\
+	} else\
+		clear_bit(op, &pending);\
+} while(0)
+
+/* find new work to run, do not resubmit work that is already
+ * in flight
+ */
+static unsigned long get_stripe_work(struct stripe_head *sh)
+{
+   unsigned long pending;
+   int ack = 0;
+
+   pending = sh->ops.pending;
+
+   check_op(STRIPE_OP_BIOFILL);
+   check_op(STRIPE_OP_COMPUTE_BLK);
+   check_op(STRIPE_OP_PREXOR);
+   check_op(STRIPE_OP_BIODRAIN);
+   check_op(STRIPE_OP_POSTXOR);
+   check_op(STRIPE_OP_CHECK);
+	if (test_and_clear_bit(STRIPE_OP_IO, &sh->ops.pending))
+   ack++;
+
+   sh->ops.count -= ack;
+   BUG_ON(sh->ops.count < 0);
+
+   return pending;
+}
+
 static int
 raid5_end_read_request(struct bio * bi, unsigned int bytes_done, int error);
 static int
@@ -1859,7 +1898,6 @@ static int stripe_to_pdidx(sector_t stripe, raid5_conf_t 
*conf, int disks)
  *schedule a write of some buffers
  *return confirmation of parity correctness
  *
- * Parity calculations are done inside the stripe lock
  * buffers are taken off read_list or write_list, and bh_cache buffers
  * get BH_Lock set before the stripe lock is released.
  *
@@ -1877,10 +1915,11 @@ static void handle_stripe5(struct stripe_head *sh)
int non_overwrite = 0;
int failed_num=0;
struct r5dev *dev;
+   unsigned long pending=0;
 
-	PRINTK("handling stripe %llu, cnt=%d, pd_idx=%d\n",
-		(unsigned long long)sh->sector, atomic_read(&sh->count),
-		sh->pd_idx);
+	PRINTK("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d ops=%lx:%lx:%lx\n",
+	       (unsigned long long)sh->sector, sh->state, atomic_read(&sh->count),
+	       sh->pd_idx, sh->ops.pending, sh->ops.ack, sh->ops.complete);
 
 	spin_lock(&sh->lock);
 	clear_bit(STRIPE_HANDLE, &sh->state);
@@ -2330,8 +2369,14 @@ static void handle_stripe5(struct stripe_head *sh)
}
}
 
+   if (sh->ops.count)
+   pending = get_stripe_work(sh);
+
 	spin_unlock(&sh->lock);
 
+   if (pending)
+   

[PATCH 10/12] md: move raid5 io requests to raid5_run_ops

2007-01-22 Thread Dan Williams
From: Dan Williams <[EMAIL PROTECTED]>

handle_stripe now only updates the state of stripes.  All execution of
operations is moved to raid5_run_ops.

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |   68 
 1 files changed, 10 insertions(+), 58 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 1956b3c..8af084f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2360,6 +2360,8 @@ static void handle_stripe5(struct stripe_head *sh)
 					PRINTK("Read_old block %d for r-m-w\n", i);
 					set_bit(R5_LOCKED, &dev->flags);
 					set_bit(R5_Wantread, &dev->flags);
+					if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+						sh->ops.count++;
 					locked++;
 				} else {
 					set_bit(STRIPE_DELAYED, &sh->state);
@@ -2380,6 +2382,8 @@ static void handle_stripe5(struct stripe_head *sh)
 					PRINTK("Read_old block %d for Reconstruct\n", i);
 					set_bit(R5_LOCKED, &dev->flags);
 					set_bit(R5_Wantread, &dev->flags);
+					if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+						sh->ops.count++;
 					locked++;
 				} else {
 					set_bit(STRIPE_DELAYED, &sh->state);
@@ -2479,6 +2483,8 @@ static void handle_stripe5(struct stripe_head *sh)
 
 			set_bit(R5_LOCKED, &dev->flags);
 			set_bit(R5_Wantwrite, &dev->flags);
+			if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+				sh->ops.count++;
 			clear_bit(STRIPE_DEGRADED, &sh->state);
 			locked++;
 			set_bit(STRIPE_INSYNC, &sh->state);
@@ -2500,12 +2506,16 @@ static void handle_stripe5(struct stripe_head *sh)
 		dev = &sh->dev[failed_num];
 		if (!test_bit(R5_ReWrite, &dev->flags)) {
 			set_bit(R5_Wantwrite, &dev->flags);
+			if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+				sh->ops.count++;
 			set_bit(R5_ReWrite, &dev->flags);
 			set_bit(R5_LOCKED, &dev->flags);
 			locked++;
 		} else {
 			/* let's read it back */
 			set_bit(R5_Wantread, &dev->flags);
+			if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+				sh->ops.count++;
 			set_bit(R5_LOCKED, &dev->flags);
 			locked++;
}
@@ -2615,64 +2625,6 @@ static void handle_stripe5(struct stripe_head *sh)
 			  test_bit(BIO_UPTODATE, &bi->bi_flags) ? 0 : -EIO);
}
-   for (i=disks; i-- ;) {
-   int rw;
-   struct bio *bi;
-   mdk_rdev_t *rdev;
-	if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags))
-   rw = WRITE;
-	else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags))
-   rw = READ;
-   else
-   continue;
- 
-	bi = &sh->dev[i].req;
- 
-   bi->bi_rw = rw;
-   if (rw == WRITE)
-   bi->bi_end_io = raid5_end_write_request;
-   else
-   bi->bi_end_io = raid5_end_read_request;
- 
-   rcu_read_lock();
-   rdev = rcu_dereference(conf->disks[i].rdev);
-	if (rdev && test_bit(Faulty, &rdev->flags))
-   rdev = NULL;
-   if (rdev)
-		atomic_inc(&rdev->nr_pending);
-   rcu_read_unlock();
- 
-   if (rdev) {
-   if (syncing || expanding || expanded)
-   md_sync_acct(rdev->bdev, STRIPE_SECTORS);
-
-   bi->bi_bdev = rdev->bdev;
-   PRINTK("for %llu schedule op %ld on disc %d\n",
-   (unsigned long long)sh->sector, bi->bi_rw, i);
-		atomic_inc(&sh->count);
-   bi->bi_sector = sh->sector + rdev->data_offset;
-   bi->bi_flags = 1 << BIO_UPTODATE;
-   bi->bi_vcnt = 1;
-   bi->bi_max_vecs = 1;
-   bi->bi_idx = 0;
-   bi->bi_io_vec = 

[PATCH 11/12] md: remove raid5 compute_block and compute_parity5

2007-01-22 Thread Dan Williams
From: Dan Williams <[EMAIL PROTECTED]>

replaced by raid5_run_ops

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |  124 
 1 files changed, 0 insertions(+), 124 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8af084f..a981c35 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1480,130 +1480,6 @@ static void copy_data(int frombio, struct bio *bio,
   }   \
} while(0)
 
-
-static void compute_block(struct stripe_head *sh, int dd_idx)
-{
-   int i, count, disks = sh->disks;
-   void *ptr[MAX_XOR_BLOCKS], *dest, *p;
-
-   PRINTK("compute_block, stripe %llu, idx %d\n", 
-   (unsigned long long)sh->sector, dd_idx);
-
-   dest = page_address(sh->dev[dd_idx].page);
-   memset(dest, 0, STRIPE_SIZE);
-   count = 0;
-   for (i = disks ; i--; ) {
-   if (i == dd_idx)
-   continue;
-   p = page_address(sh->dev[i].page);
-		if (test_bit(R5_UPTODATE, &sh->dev[i].flags))
-   ptr[count++] = p;
-   else
-   printk(KERN_ERR "compute_block() %d, stripe %llu, %d"
-   " not present\n", dd_idx,
-   (unsigned long long)sh->sector, i);
-
-   check_xor();
-   }
-   if (count)
-   xor_block(count, STRIPE_SIZE, dest, ptr);
-	set_bit(R5_UPTODATE, &sh->dev[dd_idx].flags);
-}
-
-static void compute_parity5(struct stripe_head *sh, int method)
-{
-   raid5_conf_t *conf = sh->raid_conf;
-   int i, pd_idx = sh->pd_idx, disks = sh->disks, count;
-   void *ptr[MAX_XOR_BLOCKS], *dest;
-   struct bio *chosen;
-
-   PRINTK("compute_parity5, stripe %llu, method %d\n",
-   (unsigned long long)sh->sector, method);
-
-   count = 0;
-   dest = page_address(sh->dev[pd_idx].page);
-   switch(method) {
-   case READ_MODIFY_WRITE:
-		BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags));
-   for (i=disks ; i-- ;) {
-   if (i==pd_idx)
-   continue;
-   if (sh->dev[i].towrite &&
-			    test_bit(R5_UPTODATE, &sh->dev[i].flags)) {
-   ptr[count++] = page_address(sh->dev[i].page);
-   chosen = sh->dev[i].towrite;
-   sh->dev[i].towrite = NULL;
-
-				if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
-					wake_up(&conf->wait_for_overlap);
-
-   BUG_ON(sh->dev[i].written);
-   sh->dev[i].written = chosen;
-   check_xor();
-   }
-   }
-   break;
-   case RECONSTRUCT_WRITE:
-   memset(dest, 0, STRIPE_SIZE);
-   for (i= disks; i-- ;)
-   if (i!=pd_idx && sh->dev[i].towrite) {
-   chosen = sh->dev[i].towrite;
-   sh->dev[i].towrite = NULL;
-
-				if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
-					wake_up(&conf->wait_for_overlap);
-
-   BUG_ON(sh->dev[i].written);
-   sh->dev[i].written = chosen;
-   }
-   break;
-   case CHECK_PARITY:
-   break;
-   }
-   if (count) {
-   xor_block(count, STRIPE_SIZE, dest, ptr);
-   count = 0;
-   }
-   
-   for (i = disks; i--;)
-   if (sh->dev[i].written) {
-   sector_t sector = sh->dev[i].sector;
-   struct bio *wbi = sh->dev[i].written;
-			while (wbi && wbi->bi_sector < sector + STRIPE_SECTORS) {
-   copy_data(1, wbi, sh->dev[i].page, sector);
-   wbi = r5_next_bio(wbi, sector);
-   }
-
-			set_bit(R5_LOCKED, &sh->dev[i].flags);
-			set_bit(R5_UPTODATE, &sh->dev[i].flags);
-   }
-
-   switch(method) {
-   case RECONSTRUCT_WRITE:
-   case CHECK_PARITY:
-   for (i=disks; i--;)
-   if (i != pd_idx) {
-   ptr[count++] = page_address(sh->dev[i].page);
-   check_xor();
-   }
-   break;
-   case READ_MODIFY_WRITE:
-   for (i = disks; i--;)
-   if (sh->dev[i].written) {
-   ptr[count++] = page_address(sh->dev[i].page);
-   check_xor();
-   }
-  

[PATCH 08/12] md: satisfy raid5 read requests via raid5_run_ops

2007-01-22 Thread Dan Williams
From: Dan Williams <[EMAIL PROTECTED]>

Use raid5_run_ops to carry out the memory copies for a raid5 read request.

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |   40 +++-
 1 files changed, 15 insertions(+), 25 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 2422253..db8925f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1980,7 +1980,7 @@ static void handle_stripe5(struct stripe_head *sh)
int i;
int syncing, expanding, expanded;
int locked=0, uptodate=0, to_read=0, to_write=0, failed=0, written=0;
-   int compute=0, req_compute=0, non_overwrite=0;
+   int to_fill=0, compute=0, req_compute=0, non_overwrite=0;
int failed_num=0;
struct r5dev *dev;
unsigned long pending=0;
@@ -2004,34 +2004,20 @@ static void handle_stripe5(struct stripe_head *sh)
 		dev = &sh->dev[i];
 		clear_bit(R5_Insync, &dev->flags);
 
-   PRINTK("check %d: state 0x%lx read %p write %p written %p\n",
-   i, dev->flags, dev->toread, dev->towrite, dev->written);
-   /* maybe we can reply to a read */
+		PRINTK("check %d: state 0x%lx toread %p read %p write %p written %p\n",
+			i, dev->flags, dev->toread, dev->read, dev->towrite, dev->written);
+
+   /* maybe we can start a biofill operation */
 		if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread) {
-			struct bio *rbi, *rbi2;
-			PRINTK("Return read for disc %d\n", i);
-			spin_lock_irq(&conf->device_lock);
-			rbi = dev->toread;
-			dev->toread = NULL;
-			if (test_and_clear_bit(R5_Overlap, &dev->flags))
-				wake_up(&conf->wait_for_overlap);
-			spin_unlock_irq(&conf->device_lock);
-			while (rbi && rbi->bi_sector < dev->sector + STRIPE_SECTORS) {
-				copy_data(0, rbi, dev->page, dev->sector);
-				rbi2 = r5_next_bio(rbi, dev->sector);
-				spin_lock_irq(&conf->device_lock);
-				if (--rbi->bi_phys_segments == 0) {
-					rbi->bi_next = return_bi;
-					return_bi = rbi;
-				}
-				spin_unlock_irq(&conf->device_lock);
-				rbi = rbi2;
-			}
+			to_read--;
+			if (!test_bit(STRIPE_OP_BIOFILL, &sh->ops.pending))
+				set_bit(R5_Wantfill, &dev->flags);
}
 
/* now count some things */
 		if (test_bit(R5_LOCKED, &dev->flags)) locked++;
 		if (test_bit(R5_UPTODATE, &dev->flags)) uptodate++;
+		if (test_bit(R5_Wantfill, &dev->flags)) to_fill++;
 		if (test_bit(R5_Wantcompute, &dev->flags)) BUG_ON(++compute > 1);
 
if (dev->toread) to_read++;
@@ -2055,9 +2041,13 @@ static void handle_stripe5(struct stripe_head *sh)
 			set_bit(R5_Insync, &dev->flags);
}
rcu_read_unlock();
+
+	if (to_fill && !test_and_set_bit(STRIPE_OP_BIOFILL, &sh->ops.pending))
+   sh->ops.count++;
+
PRINTK("locked=%d uptodate=%d to_read=%d"
-   " to_write=%d failed=%d failed_num=%d\n",
-   locked, uptodate, to_read, to_write, failed, failed_num);
+   " to_write=%d to_fill=%d failed=%d failed_num=%d\n",
+		locked, uptodate, to_read, to_write, to_fill, failed, failed_num);
 	/* check if the array has lost two devices and, if so, some requests might
 * need to be failed
 */
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 12/12] dmaengine: driver for the iop32x, iop33x, and iop13xx raid engines

2007-01-22 Thread Dan Williams
From: Dan Williams <[EMAIL PROTECTED]>

This is a driver for the iop DMA/AAU/ADMA units which are capable of pq_xor,
pq_update, pq_zero_sum, xor, dual_xor, xor_zero_sum, fill, copy+crc, and copy
operations.

Changelog:
* fixed a slot allocation bug in do_iop13xx_adma_xor that caused too few
slots to be requested eventually leading to data corruption
* enabled the slot allocation routine to attempt to free slots before
returning -ENOMEM
* switched the cleanup routine to solely use the software chain and the
status register to determine if a descriptor is complete.  This is
necessary to support other IOP engines that do not have status writeback
capability
* make the driver iop generic
* modified the allocation routines to understand allocating a group of
slots for a single operation
* added a null xor initialization operation for the xor only channel on
iop3xx
* support xor operations on buffers larger than the hardware maximum
* split the do_* routines into separate prep, src/dest set, submit stages
* added async_tx support (dependent operations initiation at cleanup time)
* simplified group handling
* added interrupt support (callbacks via tasklets)
* brought the pending depth inline with ioat (i.e. 4 descriptors)

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/dma/Kconfig |8 
 drivers/dma/Makefile|1 
 drivers/dma/iop-adma.c  | 1511 +++
 include/asm-arm/hardware/iop_adma.h |  116 +++
 4 files changed, 1636 insertions(+), 0 deletions(-)

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index c82ed5f..d61e3e5 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -41,4 +41,12 @@ config INTEL_IOATDMA
default m
---help---
  Enable support for the Intel(R) I/OAT DMA engine.
+
+config INTEL_IOP_ADMA
+	tristate "Intel IOP ADMA support"
+	depends on DMA_ENGINE && (ARCH_IOP32X || ARCH_IOP33X || ARCH_IOP13XX)
+	default m
+	---help---
+	  Enable support for the Intel(R) IOP Series RAID engines.
+
 endmenu
diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
index 6a99341..8ebf10d 100644
--- a/drivers/dma/Makefile
+++ b/drivers/dma/Makefile
@@ -1,4 +1,5 @@
 obj-$(CONFIG_DMA_ENGINE) += dmaengine.o
 obj-$(CONFIG_NET_DMA) += iovlock.o
 obj-$(CONFIG_INTEL_IOATDMA) += ioatdma.o
+obj-$(CONFIG_INTEL_IOP_ADMA) += iop-adma.o
 obj-$(CONFIG_ASYNC_TX_DMA) += async_tx.o xor.o
diff --git a/drivers/dma/iop-adma.c b/drivers/dma/iop-adma.c
new file mode 100644
index 000..77f859e
--- /dev/null
+++ b/drivers/dma/iop-adma.c
@@ -0,0 +1,1511 @@
+/*
+ * Copyright(c) 2006 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59
+ * Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ */
+
+/*
+ * This driver supports the asynchronous DMA copy and RAID engines available
+ * on the Intel Xscale(R) family of I/O Processors (IOP 32x, 33x, 134x)
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define to_iop_adma_chan(chan) container_of(chan, struct iop_adma_chan, common)
+#define to_iop_adma_device(dev) container_of(dev, struct iop_adma_device, common)
+#define to_iop_adma_slot(lh) container_of(lh, struct iop_adma_desc_slot, slot_node)
+#define tx_to_iop_adma_slot(tx) container_of(tx, struct iop_adma_desc_slot, async_tx)
+
+#define IOP_ADMA_DEBUG 0
+#define PRINTK(x...) ((void)(IOP_ADMA_DEBUG && printk(x)))
+
+/**
+ * iop_adma_free_slots - flags descriptor slots for reuse
+ * @slot: Slot to free
+ * Caller must hold &iop_chan->lock while calling this function
+ */
+static inline void iop_adma_free_slots(struct iop_adma_desc_slot *slot)
+{
+   int stride = slot->stride;
+
+   while (stride--) {
+   slot->stride = 0;
+   slot = list_entry(slot->slot_node.next,
+   struct iop_adma_desc_slot,
+   slot_node);
+   }
+}
+
+static inline dma_cookie_t
+iop_adma_run_tx_complete_actions(struct iop_adma_desc_slot *desc,
+   struct iop_adma_chan *iop_chan, dma_cookie_t cookie)
+{
+   BUG_ON(desc->async_tx.cookie < 0);
+	spin_lock_bh(&desc->async_tx.lock);
+   

[PATCH 07/12] md: move raid5 parity checks to raid5_run_ops

2007-01-22 Thread Dan Williams
From: Dan Williams <[EMAIL PROTECTED]>

handle_stripe sets STRIPE_OP_CHECK to request a check operation in
raid5_run_ops.  If raid5_run_ops is able to perform the check with a
dma engine the parity will be preserved in memory removing the need to
re-read it from disk, as is necessary in the synchronous case.

'Repair' operations re-use the same logic as compute block, with the caveat
that the results of the compute block are immediately written back to the
parity disk.  To differentiate these operations the STRIPE_OP_MOD_REPAIR_PD
flag is added.
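
The check/repair distinction described above reads as follows in a stand-alone sketch. Block count, block size, and function names here are invented for the example; the real code XORs whole pages via xor_block() or a dma engine and reports a zero-sum result:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { NBLK = 4, BLKSZ = 16 };

/* check: XOR every data block together with the parity block and
 * expect zero everywhere; returns 0 when parity is consistent
 * (the "zero-sum" result), nonzero on a mismatch */
static int check_parity(uint8_t blk[NBLK][BLKSZ], uint8_t parity[BLKSZ])
{
	for (size_t i = 0; i < BLKSZ; i++) {
		uint8_t x = parity[i];
		for (int d = 0; d < NBLK; d++)
			x ^= blk[d][i];
		if (x)
			return 1;	/* mismatch */
	}
	return 0;
}

/* repair: recompute the parity block from the data blocks and write
 * it back - the same computation as compute block, immediately
 * applied to the parity disk's buffer */
static void repair_parity(uint8_t blk[NBLK][BLKSZ], uint8_t parity[BLKSZ])
{
	for (size_t i = 0; i < BLKSZ; i++) {
		uint8_t x = 0;
		for (int d = 0; d < NBLK; d++)
			x ^= blk[d][i];
		parity[i] = x;
	}
}
```

This is why the patch can reuse the compute-block machinery for repair: only the destination (the parity buffer) and the follow-up write-back differ.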

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |   81 
 1 files changed, 62 insertions(+), 19 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 279a30c..2422253 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2411,32 +2411,75 @@ static void handle_stripe5(struct stripe_head *sh)
locked += handle_write_operations5(sh, rcw, 0);
}
 
-   /* maybe we need to check and possibly fix the parity for this stripe
-* Any reads will already have been scheduled, so we just see if enough 
data
-* is available
+   /* 1/ Maybe we need to check and possibly fix the parity for this 
stripe.
+*Any reads will already have been scheduled, so we just see if 
enough data
+*is available.
+* 2/ Hold off parity checks while parity dependent operations are in 
flight
+*(conflicting writes are protected by the 'locked' variable)
 */
-	if (syncing && locked == 0 &&
-	    !test_bit(STRIPE_INSYNC, &sh->state)) {
+	if ((syncing && locked == 0 && !test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending) &&
+	    !test_bit(STRIPE_INSYNC, &sh->state)) ||
+	    test_bit(STRIPE_OP_CHECK, &sh->ops.pending) ||
+	    test_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending)) {
+
 		set_bit(STRIPE_HANDLE, &sh->state);
-		if (failed == 0) {
-			BUG_ON(uptodate != disks);
-			compute_parity5(sh, CHECK_PARITY);
-			uptodate--;
-			if (page_is_zero(sh->dev[sh->pd_idx].page)) {
-				/* parity is correct (on disc, not in buffer any more) */
-				set_bit(STRIPE_INSYNC, &sh->state);
-			} else {
-				conf->mddev->resync_mismatches += STRIPE_SECTORS;
-				if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery))
-					/* don't try to repair!! */
+		/* Take one of the following actions:
+		 * 1/ start a check parity operation if (uptodate == disks)
+		 * 2/ finish a check parity operation and act on the result
+		 * 3/ skip to the writeback section if we previously
+		 *    initiated a recovery operation
+		 */
+		if (failed == 0 && !test_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending)) {
+			if (!test_and_set_bit(STRIPE_OP_CHECK, &sh->ops.pending)) {
+				BUG_ON(uptodate != disks);
+				clear_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags);
+				sh->ops.count++;
+				uptodate--;
+			} else if (test_and_clear_bit(STRIPE_OP_CHECK, &sh->ops.complete)) {
+				clear_bit(STRIPE_OP_CHECK, &sh->ops.ack);
+				clear_bit(STRIPE_OP_CHECK, &sh->ops.pending);
+
+				if (sh->ops.zero_sum_result == 0)
+					/* parity is correct (on disc, not in buffer any more) */
 					set_bit(STRIPE_INSYNC, &sh->state);
 				else {
-					compute_block(sh, sh->pd_idx);
-					uptodate++;
+					conf->mddev->resync_mismatches += STRIPE_SECTORS;
+					if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery))
+						/* don't try to repair!! */
+						set_bit(STRIPE_INSYNC, &sh->state);
+					else {
+						BUG_ON(test_and_set_bit(
+							STRIPE_OP_COMPUTE_BLK,
+							&sh->ops.pending));
+						set_bit(STRIPE_OP_MOD_REPAIR_PD,
+							&sh->ops.pending);
+						BUG_ON(test_and_set_bit(R5_Wantcompute,
+							&sh->dev[sh->pd_idx].flags));
+

[PATCH 09/12] md: use async_tx and raid5_run_ops for raid5 expansion operations

2007-01-22 Thread Dan Williams
From: Dan Williams <[EMAIL PROTECTED]>

The parity calculation for an expansion operation is the same as the
calculation performed at the end of a write with the caveat that all blocks
in the stripe are scheduled to be written.  An expansion operation is
identified as a stripe with the POSTXOR flag set and the BIODRAIN flag not
set.

The bulk copy operation to the new stripe is handled inline by async_tx.
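
The identification rule above can be captured as a small predicate. The bit values below are illustrative, not the kernel's actual STRIPE_OP_* positions:

```c
#include <assert.h>

/* An expansion write sets POSTXOR without BIODRAIN; a normal write
 * sets both (data must first be drained from the request bios).
 * Bit positions are invented for this sketch. */
#define OP_BIODRAIN	(1UL << 0)
#define OP_POSTXOR	(1UL << 1)

/* nonzero when the pending mask describes an 'expand' postxor */
static int is_expand_postxor(unsigned long pending)
{
	return (pending & OP_POSTXOR) && !(pending & OP_BIODRAIN);
}
```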

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |   48 
 1 files changed, 36 insertions(+), 12 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index db8925f..1956b3c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2511,18 +2511,32 @@ static void handle_stripe5(struct stripe_head *sh)
}
}
 
-   if (expanded && test_bit(STRIPE_EXPANDING, >state)) {
-   /* Need to write out all blocks after computing parity */
-   sh->disks = conf->raid_disks;
-   sh->pd_idx = stripe_to_pdidx(sh->sector, conf, 
conf->raid_disks);
-   compute_parity5(sh, RECONSTRUCT_WRITE);
+   /* Finish postxor operations initiated by the expansion
+* process
+*/
+	if (test_bit(STRIPE_OP_POSTXOR, &sh->ops.complete) &&
+	    !test_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending)) {
+
+		clear_bit(STRIPE_EXPANDING, &sh->state);
+
+		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.pending);
+		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.ack);
+		clear_bit(STRIPE_OP_POSTXOR, &sh->ops.complete);
+
 		for (i= conf->raid_disks; i--;) {
-			set_bit(R5_LOCKED, &sh->dev[i].flags);
-			locked++;
 			set_bit(R5_Wantwrite, &sh->dev[i].flags);
+			if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
+				sh->ops.count++;
}
-   clear_bit(STRIPE_EXPANDING, >state);
-   } else if (expanded) {
+   }
+
+	if (expanded && test_bit(STRIPE_EXPANDING, &sh->state) &&
+	    !test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)) {
+		/* Need to write out all blocks after computing parity */
+		sh->disks = conf->raid_disks;
+		sh->pd_idx = stripe_to_pdidx(sh->sector, conf, conf->raid_disks);
+		locked += handle_write_operations5(sh, 0, 1);
+	} else if (expanded && !test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)) {
 		clear_bit(STRIPE_EXPAND_READY, &sh->state);
 		atomic_dec(&conf->reshape_stripes);
 		wake_up(&conf->wait_for_overlap);
@@ -2533,6 +2547,7 @@ static void handle_stripe5(struct stripe_head *sh)
/* We have read all the blocks in this stripe and now we need to
 * copy some of them into a target stripe for expand.
 */
+   struct dma_async_tx_descriptor *tx = NULL;
 		clear_bit(STRIPE_EXPAND_SOURCE, &sh->state);
for (i=0; i< sh->disks; i++)
if (i != sh->pd_idx) {
@@ -2556,9 +2571,12 @@ static void handle_stripe5(struct stripe_head *sh)
release_stripe(sh2);
continue;
}
-   memcpy(page_address(sh2->dev[dd_idx].page),
-  page_address(sh->dev[i].page),
-  STRIPE_SIZE);
+
+   /* place all the copies on one channel */
+   tx = async_memcpy(sh2->dev[dd_idx].page,
+   sh->dev[i].page, 0, 0, STRIPE_SIZE,
+   ASYNC_TX_DEP_ACK, tx, NULL, NULL);
+
 				set_bit(R5_Expanded, &sh2->dev[dd_idx].flags);
 				set_bit(R5_UPTODATE, &sh2->dev[dd_idx].flags);
 				for (j=0; j<conf->raid_disks; j++)
@@ -2570,6 +2588,12 @@ static void handle_stripe5(struct stripe_head *sh)
 					set_bit(STRIPE_HANDLE, &sh2->state);
}
release_stripe(sh2);
+
+   /* done submitting copies, wait for them to complete */
+   if (i + 1 >= sh->disks) {
+   async_tx_ack(tx);
+   dma_wait_for_async_tx(tx);
+   }
}
}
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 02/12] dmaengine: add the async_tx api

2007-01-22 Thread Dan Williams
From: Dan Williams <[EMAIL PROTECTED]>

async_tx is an api to describe a series of bulk memory
transfers/transforms.  When possible these transactions are carried out by
asynchronous dma engines.  The api handles inter-transaction dependencies
and hides dma channel management from the client.  When a dma engine is not
present the transaction is carried out via synchronous software routines.

Xor operations are handled by async_tx, to this end xor.c is moved into
drivers/dma and is changed to take an explicit destination address and
a series of sources to match the hardware engine implementation.

When CONFIG_DMA_ENGINE is not set the asynchronous path is compiled away.

Changelog:
* fixed a leftover debug print
* don't allow callbacks in async_interrupt_cond
* fixed xor_block changes
* fixed usage of ASYNC_TX_XOR_DROP_DEST

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/Makefile |1 
 drivers/dma/Kconfig  |   16 +
 drivers/dma/Makefile |1 
 drivers/dma/async_tx.c   |  910 ++
 drivers/dma/xor.c|  153 
 drivers/md/Kconfig   |2 
 drivers/md/Makefile  |6 
 drivers/md/raid5.c   |   52 +--
 drivers/md/xor.c |  154 
 include/linux/async_tx.h |  180 +
 include/linux/raid/xor.h |5 
 11 files changed, 1291 insertions(+), 189 deletions(-)

diff --git a/drivers/Makefile b/drivers/Makefile
index 0dd96d1..7d55837 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -61,6 +61,7 @@ obj-$(CONFIG_I2C) += i2c/
 obj-$(CONFIG_W1)   += w1/
 obj-$(CONFIG_HWMON)+= hwmon/
 obj-$(CONFIG_PHONE)+= telephony/
+obj-$(CONFIG_ASYNC_TX_DMA) += dma/
 obj-$(CONFIG_MD)   += md/
 obj-$(CONFIG_BT)   += bluetooth/
 obj-$(CONFIG_ISDN) += isdn/
diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 30d021d..c82ed5f 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -7,8 +7,8 @@ menu "DMA Engine support"
 config DMA_ENGINE
bool "Support for DMA engines"
---help---
- DMA engines offload copy operations from the CPU to dedicated
- hardware, allowing the copies to happen asynchronously.
+  DMA engines offload bulk memory operations from the CPU to dedicated
+  hardware, allowing the operations to happen asynchronously.
 
 comment "DMA Clients"
 
@@ -22,6 +22,17 @@ config NET_DMA
  Since this is the main user of the DMA engine, it should be enabled;
  say Y here.
 
+config ASYNC_TX_DMA
+   tristate "Asynchronous Bulk Memory Transfers/Transforms API"
+   default y
+   ---help---
+ This enables the async_tx management layer for dma engines.
+ Subsystems coded to this API will use offload engines for bulk
+ memory operations where present.  Software implementations are
+ called when a dma engine is not present or fails to allocate
+ memory to carry out the transaction.
+ Current subsystems ported to async_tx: MD_RAID4,5
+
 comment "DMA Devices"
 
 config INTEL_IOATDMA
@@ -30,5 +41,4 @@ config INTEL_IOATDMA
default m
---help---
  Enable support for the Intel(R) I/OAT DMA engine.
-
 endmenu
diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
index bdcfdbd..6a99341 100644
--- a/drivers/dma/Makefile
+++ b/drivers/dma/Makefile
@@ -1,3 +1,4 @@
 obj-$(CONFIG_DMA_ENGINE) += dmaengine.o
 obj-$(CONFIG_NET_DMA) += iovlock.o
 obj-$(CONFIG_INTEL_IOATDMA) += ioatdma.o
+obj-$(CONFIG_ASYNC_TX_DMA) += async_tx.o xor.o
diff --git a/drivers/dma/async_tx.c b/drivers/dma/async_tx.c
new file mode 100644
index 000..eee208d
--- /dev/null
+++ b/drivers/dma/async_tx.c
@@ -0,0 +1,910 @@
+/*
+ * Copyright(c) 2006 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59
+ * Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define ASYNC_TX_DEBUG 0
+#define PRINTK(x...) ((void)(ASYNC_TX_DEBUG && printk(x)))
+
+#ifdef CONFIG_DMA_ENGINE
+static struct dma_client *async_api_client;
+static struct async_channel_entry async_channel_directory[] = {
+   

[PATCH 01/12] dmaengine: add base support for the async_tx api

2007-01-22 Thread Dan Williams
From: Dan Williams <[EMAIL PROTECTED]>

* introduce struct dma_async_tx_descriptor as a common field for all dmaengine
software descriptors
* convert the device_memcpy_* methods into separate prep, set src/dest, and
submit stages
* support capabilities beyond memcpy (xor, memset, xor zero sum, completion
interrupts)
* convert ioatdma to the new semantics

Signed-off-by: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/dma/dmaengine.c   |   44 ++--
 drivers/dma/ioatdma.c |  256 ++--
 drivers/dma/ioatdma.h |8 +
 include/linux/dmaengine.h |  263 ++---
 4 files changed, 394 insertions(+), 177 deletions(-)

diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index 1527804..8d203ad 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -210,7 +210,8 @@ static void dma_chans_rebalance(void)
	mutex_lock(&dma_list_mutex);
 
	list_for_each_entry(client, &dma_client_list, global_node) {
-   while (client->chans_desired > client->chan_count) {
+   while (client->chans_desired < 0 ||
+   client->chans_desired > client->chan_count) {
chan = dma_client_chan_alloc(client);
if (!chan)
break;
@@ -219,7 +220,8 @@ static void dma_chans_rebalance(void)
   chan,
   DMA_RESOURCE_ADDED);
}
-   while (client->chans_desired < client->chan_count) {
+   while (client->chans_desired >= 0 &&
+   client->chans_desired < client->chan_count) {
spin_lock_irqsave(&client->lock, flags);
chan = list_entry(client->channels.next,
  struct dma_chan,
@@ -294,12 +296,12 @@ void dma_async_client_unregister(struct dma_client 
*client)
  * @number: count of DMA channels requested
  *
  * Clients call dma_async_client_chan_request() to specify how many
- * DMA channels they need, 0 to free all currently allocated.
+ * DMA channels they need, 0 to free all currently allocated. A request
+ * < 0 indicates the client wants to handle all engines in the system.
  * The resulting allocations/frees are indicated to the client via the
  * event callback.
  */
-void dma_async_client_chan_request(struct dma_client *client,
-   unsigned int number)
+void dma_async_client_chan_request(struct dma_client *client, int number)
 {
client->chans_desired = number;
dma_chans_rebalance();
@@ -318,6 +320,31 @@ int dma_async_device_register(struct dma_device *device)
if (!device)
return -ENODEV;
 
+   /* validate device routines */
+   BUG_ON(test_bit(DMA_MEMCPY, &device->capabilities) &&
+   !device->device_prep_dma_memcpy);
+   BUG_ON(test_bit(DMA_XOR, &device->capabilities) &&
+   !device->device_prep_dma_xor);
+   BUG_ON(test_bit(DMA_ZERO_SUM, &device->capabilities) &&
+   !device->device_prep_dma_zero_sum);
+   BUG_ON(test_bit(DMA_MEMSET, &device->capabilities) &&
+   !device->device_prep_dma_memset);
+   BUG_ON(test_bit(DMA_ZERO_SUM, &device->capabilities) &&
+   !device->device_prep_dma_interrupt);
+
+   BUG_ON(!device->device_alloc_chan_resources);
+   BUG_ON(!device->device_free_chan_resources);
+   BUG_ON(!device->device_tx_submit);
+   BUG_ON(!device->device_set_dest);
+   BUG_ON(!device->device_set_src);
+   BUG_ON(!device->device_dependency_added);
+   BUG_ON(!device->device_is_tx_complete);
+   BUG_ON(!device->map_page);
+   BUG_ON(!device->map_single);
+   BUG_ON(!device->unmap_page);
+   BUG_ON(!device->unmap_single);
+   BUG_ON(!device->device_issue_pending);
+
	init_completion(&device->done);
	kref_init(&device->refcount);
device->dev_id = id++;
@@ -402,11 +429,8 @@ subsys_initcall(dma_bus_init);
 EXPORT_SYMBOL(dma_async_client_register);
 EXPORT_SYMBOL(dma_async_client_unregister);
 EXPORT_SYMBOL(dma_async_client_chan_request);
-EXPORT_SYMBOL(dma_async_memcpy_buf_to_buf);
-EXPORT_SYMBOL(dma_async_memcpy_buf_to_pg);
-EXPORT_SYMBOL(dma_async_memcpy_pg_to_pg);
-EXPORT_SYMBOL(dma_async_memcpy_complete);
-EXPORT_SYMBOL(dma_async_memcpy_issue_pending);
+EXPORT_SYMBOL(dma_async_is_tx_complete);
+EXPORT_SYMBOL(dma_async_issue_pending);
 EXPORT_SYMBOL(dma_async_device_register);
 EXPORT_SYMBOL(dma_async_device_unregister);
 EXPORT_SYMBOL(dma_chan_cleanup);
diff --git a/drivers/dma/ioatdma.c b/drivers/dma/ioatdma.c
index 8e87261..70bdd18 100644
--- a/drivers/dma/ioatdma.c
+++ b/drivers/dma/ioatdma.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ioatdma.h"
 #include "ioatdma_io.h"
 #include "ioatdma_registers.h"
@@ -39,6 +40,7 @@
 #define to_ioat_chan(chan) container_of(chan, struct ioat_dma_chan, common)
 #define 

Re: Why active list and inactive list?

2007-01-22 Thread Christoph Lameter
On Tue, 23 Jan 2007, Balbir Singh wrote:

> When you unmap or map, you need to touch the pte entries and know the
> pages involved, so shouldn't be equivalent to a list_del and list_add
> for each page impacted by the map/unmap operation?

When you unmap and map you must currently get exclusive access to the 
cachelines of the pte and the cacheline of the page struct. If we use a 
list_move on page->lru then we have would have to update pointers in up 
to 4 other page structs. Thus we need exclusive access to 4 additional 
cachelines. This triples the number of cachelines touched. Instead of 2 
cachelines we need 6.




Re: Why active list and inactive list?

2007-01-22 Thread Balbir Singh

Christoph Lameter wrote:
> On Tue, 23 Jan 2007, Balbir Singh wrote:
>
>> This makes me wonder if it makes sense to split up the LRU into page
>> cache LRU and mapped pages LRU. I see two benefits
>>
>> 1. Currently based on swappiness, we might walk an entire list
>>    searching for page cache pages or mapped pages. With these
>>    lists separated, it should get easier and faster to implement
>>    this scheme
>> 2. There is another parallel thread on implementing page cache
>>    limits. If the lists split out, we need not scan the entire
>>    list to find page cache pages to evict them.
>>
>> Of course I might be missing something (some piece of history)
>
> This means page cache = unmapped file backed page right? Otherwise this
> would not work. I always thought that the page cache were all file backed
> pages both mapped and unmapped.

Yes, unfortunately my terminology was not clear. I mean unmapped file
backed pages.

> With the proposed scheme you would have to move pages between lists if
> they are mapped and unmapped by a process. Terminating a process could
> lead to lots of pages moving to the unmapped list.

When you unmap or map, you need to touch the pte entries and know the
pages involved, so shouldn't it be equivalent to a list_del and a list_add
for each page impacted by the map/unmap operation?

--
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: Linux v2.6.20-rc5

2007-01-22 Thread Herbert Xu
Jeff Chua <[EMAIL PROTECTED]> wrote:
> 
> 
> From: Jeff Chua <[EMAIL PROTECTED]>
> 
>>   CC [M]  drivers/kvm/vmx.o
>> {standard input}: Assembler messages:
>> {standard input}:3257: Error: bad register name `%sil'
>> make[2]: *** [drivers/kvm/vmx.o] Error 1
>> make[1]: *** [drivers/kvm] Error 2
>> make: *** [drivers] Error 2
> 
> I'm not using the kernel profiler, so here's a patch to make it work without 
> CONFIG_PROFILING.

Actually that only happens to work by chance (by making one of al/bl/cl/dl
available).  This patch should fix it properly.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index ce219e3..0aa2659 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -1824,7 +1824,7 @@ again:
 #endif
"setbe %0 \n\t"
"popf \n\t"
- : "=g" (fail)
+ : "=q" (fail)
  : "r"(vcpu->launched), "d"((unsigned long)HOST_RSP),
"c"(vcpu),
[rax]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RAX])),


Re: [PATCH 4/5] KVM: Fix asm constraints with CONFIG_FRAME_POINTER=n

2007-01-22 Thread Herbert Xu
Avi Kivity <[EMAIL PROTECTED]> wrote:
> A "g" constraint may place a local variable in an %rsp-relative memory 
> operand.
> but if your assembly changes %rsp, the operand points to the wrong location.
> 
> An "r" constraint fixes that.
> 
> Thanks to Ingo Molnar for neatly bisecting the problem.
> 
> Signed-off-by: Avi Kivity <[EMAIL PROTECTED]>
> 
> Index: linux-2.6/drivers/kvm/vmx.c
> ===
> --- linux-2.6.orig/drivers/kvm/vmx.c
> +++ linux-2.6/drivers/kvm/vmx.c
> @@ -1825,7 +1825,7 @@ again:
> #endif
>"setbe %0 \n\t"
>"popf \n\t"
> - : "=g" (fail)
> + : "=r" (fail)
>  : "r"(vcpu->launched), "d"((unsigned long)HOST_RSP),
>"c"(vcpu),
>[rax]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RAX])),

We need the following fix for 2.6.20.

[KVM] vmx: Fix register constraint in launch code

Both "=r" and "=g" break my build on i386:

$ make
  CC [M]  drivers/kvm/vmx.o
{standard input}: Assembler messages:
{standard input}:3318: Error: bad register name `%sil'
make[1]: *** [drivers/kvm/vmx.o] Error 1
make: *** [_module_drivers/kvm] Error 2

The reason is that setbe requires an 8-bit register but "=r" does not
constrain the target register to be one that has an 8-bit version on
i386.

According to

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10153

the correct constraint is "=q".

Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index ce219e3..0aa2659 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -1824,7 +1824,7 @@ again:
 #endif
"setbe %0 \n\t"
"popf \n\t"
- : "=g" (fail)
+ : "=q" (fail)
  : "r"(vcpu->launched), "d"((unsigned long)HOST_RSP),
"c"(vcpu),
[rax]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RAX])),


Re: [PATCH query] arm: i.MX/MX1 clock event source

2007-01-22 Thread Pavel Pisa
On Monday 22 January 2007 20:59, Ingo Molnar wrote:
> * Pavel Pisa <[EMAIL PROTECTED]> wrote:
> > Hello Thomas, Sascha and Ingo
> >
> > please can you find some time to review next patch
> >   arm: i.MX/MX1 clock event source
> > which has been sent to you and to the ALKML at 2007-01-13.
> >
> > http://thread.gmane.org/gmane.linux.ports.arm.kernel/29510/focus=29533
> >
> > There seems to be some problems, because this patch has not been
> > accepted to patch-2.6.20-rc5-rt7.patch, but GENERIC_CLOCKEVENTS are
> > set already for i.MX and this results in problems running the RT kernel
> > on this architecture.
>
> i've added your patch to -rt, but note that there's a new, slightly
> incompatible clockevents code in -rt now so you'll need to do some more
> (hopefully trivial) fixups for this to build and work.
>
>   Ingo


Hello Ingo,

thanks for the reply. I am attaching an updated version of the patch at the
end of this e-mail.
There is a problem with a missing include in tick-sched.c:

  CC  kernel/time/tick-sched.o
/usr/src/linux-2.6.20-rc5/kernel/time/tick-sched.c: In function 
`tick_nohz_handler':
/usr/src/linux-2.6.20-rc5/kernel/time/tick-sched.c:330: warning: implicit 
declaration of function `get_irq_regs'
/usr/src/linux-2.6.20-rc5/kernel/time/tick-sched.c:330: warning: initialization 
makes pointer from integer without a cast
/usr/src/linux-2.6.20-rc5/kernel/time/tick-sched.c: In function 
`tick_sched_timer':
/usr/src/linux-2.6.20-rc5/kernel/time/tick-sched.c:425: warning: initialization 
makes pointer from integer without a cast
  LD  kernel/time/built-in.o

--- linux-2.6.20-rc5.orig/kernel/time/tick-sched.c
+++ linux-2.6.20-rc5/kernel/time/tick-sched.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "tick-internal.h"

And

  CC  arch/arm/kernel/process.o
/usr/src/linux-2.6.20-rc5/arch/arm/kernel/process.c: In function `cpu_idle':
/usr/src/linux-2.6.20-rc5/arch/arm/kernel/process.c:157: warning: implicit 
declaration of function `hrtimer_stop_sched_tick'
/usr/src/linux-2.6.20-rc5/arch/arm/kernel/process.c:161: warning: implicit 
declaration of function `hrtimer_restart_sched_tick'

--- linux-2.6.20-rc5.orig/arch/arm/kernel/process.c
+++ linux-2.6.20-rc5/arch/arm/kernel/process.c
@@ -154,11 +154,11 @@ void cpu_idle(void)
if (!idle)
idle = default_idle;
leds_event(led_idle_start);
-   hrtimer_stop_sched_tick();
+   tick_nohz_stop_sched_tick();
while (!need_resched() && !need_resched_delayed())
idle();
leds_event(led_idle_end);
-   hrtimer_restart_sched_tick();
+   tick_nohz_restart_sched_tick();
local_irq_disable();
__preempt_enable_no_resched();
__schedule();

Unfortunately, even with these corrections the boot gets stuck at

Memory: 18972KB available (2488K code, 358K data, 92K init)

I have no time now to start a JTAG debugging session, so I will look at that
tomorrow or on Friday.

It seems that the interrupts are not coming from the device.

Best wishes

Pavel

==
Subject: arm: i.MX/MX1 clock event source

Support clock event source based on i.MX general purpose
timer in free running timer mode.

Signed-off-by: Pavel Pisa <[EMAIL PROTECTED]>

 arch/arm/mach-imx/time.c |  112 ---
 1 file changed, 107 insertions(+), 5 deletions(-)

Index: linux-2.6.20-rc5/arch/arm/mach-imx/time.c
===
--- linux-2.6.20-rc5.orig/arch/arm/mach-imx/time.c
+++ linux-2.6.20-rc5/arch/arm/mach-imx/time.c
@@ -15,6 +15,9 @@
 #include 
 #include 
 #include 
+#ifdef CONFIG_GENERIC_CLOCKEVENTS
+#include 
+#endif
 
 #include 
 #include 
@@ -25,6 +28,11 @@
 /* Use timer 1 as system timer */
 #define TIMER_BASE IMX_TIM1_BASE
 
+#ifdef CONFIG_GENERIC_CLOCKEVENTS
+static struct clock_event_device clockevent_imx;
+static enum clock_event_mode clockevent_mode = CLOCK_EVT_MODE_PERIODIC;
+#endif
+
 static unsigned long evt_diff;
 
 /*
@@ -33,6 +41,7 @@ static unsigned long evt_diff;
 static irqreturn_t
 imx_timer_interrupt(int irq, void *dev_id)
 {
+   unsigned long tcmp;
uint32_t tstat;
 
/* clear the interrupt */
@@ -42,13 +51,20 @@ imx_timer_interrupt(int irq, void *dev_i
if (tstat & TSTAT_COMP) {
do {
 
+#ifdef CONFIG_GENERIC_CLOCKEVENTS
+   if (clockevent_imx.event_handler)
+   clockevent_imx.event_handler(&clockevent_imx);
+   if (likely(clockevent_mode != CLOCK_EVT_MODE_PERIODIC))
+   break;
+#else
write_seqlock(&xtime_lock);
timer_tick();
write_sequnlock(&xtime_lock);
-   IMX_TCMP(TIMER_BASE) += evt_diff;
+#endif
+ 

Re: Why active list and inactive list?

2007-01-22 Thread Rik van Riel

Christoph Lameter wrote:
> On Mon, 22 Jan 2007, Rik van Riel wrote:
>
>> The big one is how we are to do some background aging in a
>> clock-pro system, so referenced bits don't just pile up when
>> the VM has enough memory - otherwise we might not know the
>> right pages to evict when a new process starts up and starts
>> allocating lots of memory.
>
> There are two bad choices right?
>
> 1. Scan for reference bits
>
>    Bad because we may have to scan quite a bit without too much
>    result. LRU allows us to defer this until memory is tight.
>    Any such scan will pollute the cache and cause a stall of
>    the app. You really do not want this for a realtime system.
>
> 2. Take faults on reference and update the page state.
>    Bad because this means a fault if the reference bit
>    has not been set. Could be many faults.
>
> Clock pro really requires 2 right? So lots of additional page faults?


Nope, the faults are not required.

I suspect you're confused with the part where it keeps track
of recently evicted (not resident in RAM at all) pages. That
kind of info is common in database replacement schemes, but
not in general purpose OS memory management.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: Why active list and inactive list?

2007-01-22 Thread Christoph Lameter
On Mon, 22 Jan 2007, Rik van Riel wrote:

> The big one is how we are to do some background aging in a
> clock-pro system, so referenced bits don't just pile up when
> the VM has enough memory - otherwise we might not know the
> right pages to evict when a new process starts up and starts
> allocating lots of memory.

There are two bad choices right?

1. Scan for reference bits

   Bad because we may have to scan quite a bit without too much
   result. LRU allows us to defer this until memory is tight.
   Any such scan will pollute the cache and cause a stall of
   the app. You really do not want this for a realtime system.

2. Take faults on reference and update the page state.
   Bad because this means a fault if the reference bit
   has not been set. Could be many faults.

Clock pro really requires 2 right? So lots of additional page faults?


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Björn Steinbrink
On 2007.01.22 19:24:22 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >>>Running a kernel with the return statement replace by a line that prints
> >>>the irq_stat instead.
> >>>
> >>>Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.
> >>40 minutes stress test now and no exception yet. What's interesting is
> >>that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
> >>might have get dropped are as above.
> >>I'll keep it running for some time and will then re-enable the return
> >>statement to see if there's a relation between the irq_stat 0x0 and the
> >>exception.
> >
> >No, doesn't seem to be related, did get 2 exceptions, but no irq_stat
> >0x0 for ata1. Syslog/dmesg has nothing new either, still the same
> >pattern of dismissed irq_stats.
> 
> I've finally managed to reproduce this problem on my box, by doing:
> 
> watch --interval=0.1 /sbin/hdparm -I /dev/sda
> 
> on one drive and then running bonnie++ on /dev/sdb connected to the 
> other port on the same controller device. Usually within a few minutes 
> one of the IDENTIFY commands would time out in the same way you guys 
> have been seeing.
> 
> Through some various trials and tribulations, the only conclusion I can 
> come to is that this controller really doesn't like that 
> NV_INT_STATUS_CK804 register being looked at in ADMA mode. I tried 
> adding some debug code to the qc_issue function that would check to see 
> if the BUSY flag in altstatus went high or that register showed an 
> interrupt within a certain time afterwards, however that really seemed 
> to hose things, the system wouldn't even boot.

Hm, I don't think it is unhappy about looking at NV_INT_STATUS_CK804.
I'm running 2.6.20-rc5 with the INT_DEV check removed for 8 hours now
without a single problem and that should still look at
NV_INT_STATUS_CK804, right?
I just noticed that my last email might not have been clear enough. The
exceptions happened when I re-enabled the return statement in addition
to the debug message. Without the INT_DEV check, it is completely fine
AFAICT.

> Try out this patch, it just calls the ata_host_intr function where 
> appropriate without using nv_host_intr which looks at the 
> NV_INT_STATUS_CK804 register. This is what the original ADMA patch from 
> Mr. Mysterious NVIDIA Person did, I'm guessing there may be a reason for 
> that. With this patch I can get through a whole bonnie++ run with the 
> repeated IDENTIFY requests running without seeing the error.

I'll see if I can schedule a test run for tomorrow, I currently need
this box.

Thanks,
Björn


Re: Why active list and inactive list?

2007-01-22 Thread Rik van Riel

Christoph Lameter wrote:
> On Mon, 22 Jan 2007, Rik van Riel wrote:
>
>> It would be really nice if we came up with a page replacement
>> algorithm that did not need many extra heuristics to make it
>> work...
>
> I guess the "clock" type algorithms are the most promising in that
> area. What happened to all those advanced page replacement endeavors?
> What is the most promising of those? You seem to have done a lot of work
> on those.


CLOCK-Pro seems the most promising algorithm, because it can
act well both as a first level cache (operating system running
applications) and as a second level cache (operating system
running as a file server), because it tracks both recency and
frequency well.

However, there are a few unanswered questions on clock-pro.

The big one is how we are to do some background aging in a
clock-pro system, so referenced bits don't just pile up when
the VM has enough memory - otherwise we might not know the
right pages to evict when a new process starts up and starts
allocating lots of memory.

At least we've solved the problems of keeping track of the
recently evicted pages in a cheap way, and balancing the
pressure/hotness of different caches against each other.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: Why active list and inactive list?

2007-01-22 Thread Andrea Arcangeli
On Tue, Jan 23, 2007 at 07:01:33AM +0530, Balbir Singh wrote:
> This makes me wonder if it makes sense to split up the LRU into page
> cache LRU and mapped pages LRU. I see two benefits
> 1. Currently based on swappiness, we might walk an entire list
>searching for page cache pages or mapped pages. With these
>lists separated, it should get easier and faster to implement
>this scheme

When I tried that a long time ago, I recall I had troubles, but there
wasn't the reclaim_mapped logic based on static values back then, so it
was even harder. However, it would still be a problem today to decide when
to switch from the unmapped to the mapped lru. When reclaim_mapped is
set, you'll still have to shrink some unmapped pages, and by splitting
you literally lose age information to save some cpu. Eventually you
risk spending time trying to free unfreeable pinned pages that sit
in the unmapped list before finally jumping to the mapped list. So
you have to add yet another list to get rid of the pinned stuff in the
unmapped list, and I stopped when I had to refile pages from the
"pinned" list to the unmapped list in irq I/O completion context; now
it's all spin_lock_irq so it would be more natural at least...

> 2. There is another parallel thread on implementing page cache
>limits. If the lists split out, we need not scan the entire
>list to find page cache pages to evict them.

BTW I'm unsure about the cache limit thread; the overhead of the vm
collection shouldn't be an issue, and those tend to hide vm
inefficiencies.

For example Neil has a patch to reduce the writeback cache to 10M-50M
(much lower than the current 1% minimum) to hide huge unfairness in
the writeback cache. I think they should mount the fs with -o sync
instead of using that patch until the unfairness is fixed or made
tunable. The patch itself is fine, though for that problem it only
looks like a workaround. So I at least try to be always quite skeptical
when I hear about cache "fixed size limiting" patches that improve
responsiveness or performance ;)

> Of course I might be missing something (some piece of history)

Partly ;) The code was very different back then; today it would be easier
thanks to reclaim_mapped, but the partial loss of age information and the
potential loss of cpu in a pinned-page walk would probably remain.


Re: Kernel 2.6.19.2 New RAID 5 Bug (oops when writing Samba -> RAID5)

2007-01-22 Thread Neil Brown
On Monday January 22, [EMAIL PROTECTED] wrote:
> On 1/22/07, Neil Brown <[EMAIL PROTECTED]> wrote:
> > On Monday January 22, [EMAIL PROTECTED] wrote:
> > > Justin Piszcz wrote:
> > > > My .config is attached, please let me know if any other information is
> > > > needed and please CC (lkml) as I am not on the list, thanks!
> > > >
> > > > Running Kernel 2.6.19.2 on a MD RAID5 volume.  Copying files over Samba 
> > > > to
> > > > the RAID5 running XFS.
> > > >
> > > > Any idea what happened here?
> > 
> > > >
> > > Without digging too deeply, I'd say you've hit the same bug Sami Farin
> > > and others
> > > have reported starting with 2.6.19: pages mapped with kmap_atomic()
> > > become unmapped
> > > during memcpy() or similar operations.  Try disabling preempt -- that
> > > seems to be the
> > > common factor.
> >
> > That is exactly the conclusion I had just come to (a kmap_atomic page
> > must be being unmapped during memcpy).  I wasn't aware that others had
> > reported it - thanks for that.
> >
> > Turning off CONFIG_PREEMPT certainly seems like a good idea.
> >
> Coming from an ARM background I am not yet versed in the inner
> workings of kmap_atomic, but if you have time for a question I am
> curious as to why spin_lock(>lock)  is not sufficient pre-emption
> protection for copy_data() in this case?
> 

Presumably there is a bug somewhere.
kmap_atomic itself calls inc_preempt_count so that preemption should
be disabled at least until the kunmap_atomic is called.

But apparently not.  The symptoms point exactly to the page getting
unmapped when it shouldn't.  Until that bug is found and fixed, the
workaround of turning off CONFIG_PREEMPT seems to make sense.

Of course it would be great if someone who can easily reproduce this
bug could do the 'git bisect' thing to find out where the bug crept
in.

NeilBrown


Re: Why active list and inactive list?

2007-01-22 Thread Christoph Lameter
On Mon, 22 Jan 2007, Rik van Riel wrote:

> It would be really nice if we came up with a page replacement
> algorithm that did not need many extra heuristics to make it
> work...

I guess the "clock" type algorithms are the most promising in that 
area. What happened to all those advanced page replacement endeavors? 
What is the most promising of those? You seem to have done a lot of work 
on those.



Re: Why active list and inactive list?

2007-01-22 Thread Rik van Riel

Christoph Lameter wrote:

With the proposed scheme you would have to move pages between lists if 
they are mapped and unmapped by a process. Terminating a process could 
lead to lots of pages moving to the unmapped list.


That could be a problem.

Another problem is that any such heuristic in the VM is
bound to have corner cases that some workloads will hit.

It would be really nice if we came up with a page replacement
algorithm that did not need many extra heuristics to make it
work...

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Peek at environment of procs

2007-01-22 Thread Jan Engelhardt
Hi,


what is the preferred way to get at another process's environment 
variables? /proc/$$/environ looks like the most portable way [across all 
arches Linux runs on], but it cannot easily be mmap'ed because the size 
is not known. In fact, mmap does not seem to work at all on that file. 
So I would have to allocate a large buffer (4K is the limit for procfs 
files AFAICR) to potentially hold big environments, which does not sound 
really wise either. Or is it the best choice available?


-`J'
-- 


Re: PROBLEM: KB->KiB, MB -> MiB, ... (IEC 60027-2)

2007-01-22 Thread Jan Engelhardt

On Jan 23 2007 02:04, Krzysztof Halasa wrote:
>Andreas Schwab <[EMAIL PROTECTED]> writes:
>
>> But other than the sector size there is no natural power of 2 connected to
>> disk size.  A disk can have any odd number of sectors.
>
>But the manufacturers don't count in sectors.
>
>It should be consistent, though. "How many GB of disk space do you
>need to store 2 GB of USB flash, and how many to store 2 GB RAM image"?

Here's a marketing gap a company could jump into:
  "first to count in real GB"


-`J'
-- 


Re: Why active list and inactive list?

2007-01-22 Thread Rik van Riel

Balbir Singh wrote:


This makes me wonder if it makes sense to split up the LRU into page
cache LRU and mapped pages LRU. I see two benefits


Unlikely.  I have seen several workloads fall over because they
did not throw out mapped pages soon enough.

If the kernel does not keep the most frequently accessed pages
resident, hit rates will suffer.  Sometimes (well, usually)
those are the mapped pages, but this is not true in all workloads.

Some workloads are very page cache heavy and it pays to keep
the more frequently accessed page cache pages resident while
discarding the less frequently accessed ones.

Since memory size has increased a lot more than disk speed
over the last decade (and this is likely to continue for the
next decades), the quality of page replacement algorithms is
likely to become more and more important over time.


1. Currently based on swappiness, we might walk an entire list
   searching for page cache pages or mapped pages. With these
   lists separated, it should get easier and faster to implement
   this scheme


How do you classify a mapped page cache page?

Another issue is that you'll want to make sure that page
cache pages that are referenced more frequently than the least
referenced mapped (I assume you mean anonymous?) pages stay in
memory, while swapping out those least used anonymous pages.

One way to do this could be to compare the scan rates, list
sizes and referenced percentage of both lists, to find out
which of the two caches is hotter.


2. There is another parallel thread on implementing page cache
   limits. If the lists split out, we need not scan the entire
   list to find page cache pages to evict them.


If the lists split out, there is no reason to limit the page
cache size because you can easily reclaim them.  Right?


Of course I might be missing something (some piece of history)


http://linux-mm.org/AdvancedPageReplacement

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: Kernel 2.6.19.2 New RAID 5 Bug (oops when writing Samba -> RAID5)

2007-01-22 Thread Dan Williams

On 1/22/07, Neil Brown <[EMAIL PROTECTED]> wrote:

On Monday January 22, [EMAIL PROTECTED] wrote:
> Justin Piszcz wrote:
> > My .config is attached, please let me know if any other information is
> > needed and please CC (lkml) as I am not on the list, thanks!
> >
> > Running Kernel 2.6.19.2 on a MD RAID5 volume.  Copying files over Samba to
> > the RAID5 running XFS.
> >
> > Any idea what happened here?

> >
> Without digging too deeply, I'd say you've hit the same bug Sami Farin
> and others
> have reported starting with 2.6.19: pages mapped with kmap_atomic()
> become unmapped
> during memcpy() or similar operations.  Try disabling preempt -- that
> seems to be the
> common factor.

That is exactly the conclusion I had just come to (a kmap_atomic page
must be being unmapped during memcpy).  I wasn't aware that others had
reported it - thanks for that.

Turning off CONFIG_PREEMPT certainly seems like a good idea.


Coming from an ARM background I am not yet versed in the inner
workings of kmap_atomic, but if you have time for a question I am
curious as to why spin_lock(&sh->lock) is not sufficient pre-emption
protection for copy_data() in this case?


NeilBrown


Regards,
Dan


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Robert Hancock

Alistair John Strachan wrote:

On Tuesday 23 January 2007 01:24, Robert Hancock wrote:

As a final aside, this is another case where the hardware docs for this
controller would really be useful, in order to know whether we are
actually supposed to be reading that register in ADMA mode or not. I
sent a query to Allen Martin at NVIDIA asking if there's a way I could
get access to the documents, but I haven't heard anything yet.


Obviously, NVIDIA's response is disappointing, but thank you for putting the 
time in to debug this problem. Definitely sounds like a hardware defect, I'm 
just glad there's a workaround.


Will we see this fix in 2.6.20?


Hopefully, assuming it actually does fix the problem for those who have 
been seeing it.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: Why active list and inactive list?

2007-01-22 Thread Christoph Lameter
On Tue, 23 Jan 2007, Balbir Singh wrote:

> This makes me wonder if it makes sense to split up the LRU into page
> cache LRU and mapped pages LRU. I see two benefits
> 
> 1. Currently based on swappiness, we might walk an entire list
>searching for page cache pages or mapped pages. With these
>lists separated, it should get easier and faster to implement
>this scheme
> 2. There is another parallel thread on implementing page cache
>limits. If the lists split out, we need not scan the entire
>list to find page cache pages to evict them.
> 
> Of course I might be missing something (some piece of history)

This means page cache = unmapped file-backed page, right? Otherwise this 
would not work. I always thought that the page cache was all file-backed 
pages, both mapped and unmapped.

With the proposed scheme you would have to move pages between lists if 
they are mapped and unmapped by a process. Terminating a process could 
lead to lots of pages moving to the unmapped list.






[PATCH] 9p: null terminate error strings for debug print

2007-01-22 Thread Eric Van Hensbergen
From: Eric Van Hensbergen <[EMAIL PROTECTED]> - unquoted

We weren't properly NULL terminating protocol error strings for our
debug printk resulting in garbage being included in the output when debug
was enabled.

Signed-off-by: Eric Van Hensbergen <[EMAIL PROTECTED]>
---
 fs/9p/error.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/9p/error.c b/fs/9p/error.c
index ae91555..0d7fa4e 100644
--- a/fs/9p/error.c
+++ b/fs/9p/error.c
@@ -83,6 +83,7 @@ int v9fs_errstr2errno(char *errstr, int len)
 
if (errno == 0) {
/* TODO: if error isn't found, add it dynamically */
+   errstr[len] = 0;
printk(KERN_ERR "%s: errstr :%s: not found\n", __FUNCTION__,
   errstr);
errno = 1;
-- 
1.5.0.rc1.gde38



[PATCH] 9p: fix segfault caused by race condition in meta-data operations

2007-01-22 Thread Eric Van Hensbergen
From: Eric Van Hensbergen <[EMAIL PROTECTED]> - unquoted

Running dbench multithreaded exposed a race condition where fid structures
were removed while in use.  This patch adds semaphores to meta-data operations
to protect the fid structure.  Some cleanup of error-case handling in the
inode operations is also included.

Signed-off-by: Eric Van Hensbergen <[EMAIL PROTECTED]>
---
 fs/9p/fid.c   |   69 +-
 fs/9p/fid.h   |5 ++
 fs/9p/vfs_file.c  |   47 ++--
 fs/9p/vfs_inode.c |  204 ++--
 4 files changed, 196 insertions(+), 129 deletions(-)

diff --git a/fs/9p/fid.c b/fs/9p/fid.c
index 2750720..a9b6301 100644
--- a/fs/9p/fid.c
+++ b/fs/9p/fid.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "debug.h"
 #include "v9fs.h"
@@ -84,6 +85,7 @@ struct v9fs_fid *v9fs_fid_create(struct v9fs_session_info 
*v9ses, int fid)
new->iounit = 0;
new->rdir_pos = 0;
new->rdir_fcall = NULL;
+   init_MUTEX(&new->lock);
INIT_LIST_HEAD(>list);
 
return new;
@@ -102,11 +104,11 @@ void v9fs_fid_destroy(struct v9fs_fid *fid)
 }
 
 /**
- * v9fs_fid_lookup - retrieve the right fid from a  particular dentry
+ * v9fs_fid_lookup - return a locked fid from a dentry
  * @dentry: dentry to look for fid in
- * @type: intent of lookup (operation or traversal)
  *
- * find a fid in the dentry
+ * find a fid in the dentry, obtain its semaphore and return a reference to it.
+ * code calling lookup is responsible for releasing lock
  *
  * TODO: only match fids that have the same uid as current user
  *
@@ -124,7 +126,68 @@ struct v9fs_fid *v9fs_fid_lookup(struct dentry *dentry)
 
if (!return_fid) {
dprintk(DEBUG_ERROR, "Couldn't find a fid in dentry\n");
+   return_fid = ERR_PTR(-EBADF);
}
 
+   if (down_interruptible(&return_fid->lock))
+   return ERR_PTR(-EINTR);
+
return return_fid;
 }
+
+/**
+ * v9fs_fid_clone - lookup the fid for a dentry, clone a private copy and 
release it
+ * @dentry: dentry to look for fid in
+ *
+ * find a fid in the dentry and then clone to a new private fid
+ *
+ * TODO: only match fids that have the same uid as current user
+ *
+ */
+
+struct v9fs_fid *v9fs_fid_clone(struct dentry *dentry)
+{
+   struct v9fs_session_info *v9ses = v9fs_inode2v9ses(dentry->d_inode);
+   struct v9fs_fid *base_fid, *new_fid = ERR_PTR(-EBADF);
+   struct v9fs_fcall *fcall = NULL;
+   int fid, err;
+
+   base_fid = v9fs_fid_lookup(dentry);
+
+   if(IS_ERR(base_fid))
+   return base_fid;
+
+   if(base_fid) {  /* clone fid */
+   fid = v9fs_get_idpool(&v9ses->fidpool);
+   if (fid < 0) {
+   eprintk(KERN_WARNING, "newfid fails!\n");
+   new_fid = ERR_PTR(-ENOSPC);
+   goto Release_Fid;
+   }
+
+   err = v9fs_t_walk(v9ses, base_fid->fid, fid, NULL, &fcall);
+   if (err < 0) {
+   dprintk(DEBUG_ERROR, "clone walk didn't work\n");
+   v9fs_put_idpool(fid, &v9ses->fidpool);
+   new_fid = ERR_PTR(err);
+   goto Free_Fcall;
+   }
+   new_fid = v9fs_fid_create(v9ses, fid);
+   if (new_fid == NULL) {
+   dprintk(DEBUG_ERROR, "out of memory\n");
+   new_fid = ERR_PTR(-ENOMEM);
+   }
+Free_Fcall:
+   kfree(fcall);
+   }
+
+Release_Fid:
+   up(&base_fid->lock);
+   return new_fid;
+}
+
+void v9fs_fid_clunk(struct v9fs_session_info *v9ses, struct v9fs_fid *fid)
+{
+   v9fs_t_clunk(v9ses, fid->fid);
+   v9fs_fid_destroy(fid);
+}
diff --git a/fs/9p/fid.h b/fs/9p/fid.h
index aa974d6..48fc170 100644
--- a/fs/9p/fid.h
+++ b/fs/9p/fid.h
@@ -30,6 +30,8 @@ struct v9fs_fid {
struct list_head list;   /* list of fids associated with a dentry */
struct list_head active; /* XXX - debug */
 
+   struct semaphore lock;
+
u32 fid;
unsigned char fidopen;/* set when fid is opened */
unsigned char fidclunked; /* set when fid has already been clunked */
@@ -55,3 +57,6 @@ struct v9fs_fid *v9fs_fid_get_created(struct dentry *);
 void v9fs_fid_destroy(struct v9fs_fid *fid);
 struct v9fs_fid *v9fs_fid_create(struct v9fs_session_info *, int fid);
 int v9fs_fid_insert(struct v9fs_fid *fid, struct dentry *dentry);
+struct v9fs_fid *v9fs_fid_clone(struct dentry *dentry);
+void v9fs_fid_clunk(struct v9fs_session_info *v9ses, struct v9fs_fid *fid);
+
diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index e86a071..9f17b0c 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -55,53 +55,22 @@ int v9fs_file_open(struct inode *inode, struct file *file)
struct v9fs_fid *vfid;
struct v9fs_fcall *fcall = NULL;
int omode;
-   int fid = V9FS_NOFID;
int err;
 

Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Alistair John Strachan
On Tuesday 23 January 2007 01:24, Robert Hancock wrote:
> As a final aside, this is another case where the hardware docs for this
> controller would really be useful, in order to know whether we are
> actually supposed to be reading that register in ADMA mode or not. I
> sent a query to Allen Martin at NVIDIA asking if there's a way I could
> get access to the documents, but I haven't heard anything yet.

Obviously, NVIDIA's response is disappointing, but thank you for putting the 
time in to debug this problem. Definitely sounds like a hardware defect, I'm 
just glad there's a workaround.

Will we see this fix in 2.6.20?

-- 
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.


Re: Why active list and inactive list?

2007-01-22 Thread Balbir Singh

Andrea Arcangeli wrote:

On Tue, Jan 23, 2007 at 01:10:46AM +0100, Niki Hammler wrote:

Dear Linux Developers/Enthusiasts,

For a course at my university I'm implementing parts of an operating
system where I get most ideas from the Linux Kernel (which I like very
much). One book I gain information from is [1].

Linux uses for its Page Replacing Algorithm (based on LRU) *two* chained
lists - one active list and one inactive list.
I implemented my PRA this way too.

Now the big question is: WHY do I need *two* lists? Isn't it just
overhead/more work? Are there any reasons for that?

In my opinion, it would be better to have just one list; pop frames to
be swapped out from the end of the list and push new frames in front of
it. Then just evaluate the frames and shift them around in the list.

Is there any explanation why Linux uses two lists?


Back then I designed it with two lru lists because splitting the
active from the inactive cache allows detecting the cache pollution
before it starts discarding the working set. The idea is that the
pollution will enter and exit the inactive list without ever being
elected to the active list because by definition it will never
generate a cache hit. The working set will instead trigger cache hits
during page faults or repeated reads, and it will be preserved better
by electing it to enter the active list.

A page in the inactive list will be collected much more quickly than a
page in the active list, so the pollution will be collected more
quickly than the working set. Then the VM while freeing cache tries to
keep a balance between the size of the two lists to avoid being too
unfair, obviously at some point the active list have to be
de-activated too. If your server "fits in ram" you'll find lots of
cache to be active and so the I/O activity not part of the working set
will be collected without affecting the working set much.


This makes me wonder if it makes sense to split up the LRU into page
cache LRU and mapped pages LRU. I see two benefits

1. Currently based on swappiness, we might walk an entire list
   searching for page cache pages or mapped pages. With these
   lists separated, it should get easier and faster to implement
   this scheme
2. There is another parallel thread on implementing page cache
   limits. If the lists split out, we need not scan the entire
   list to find page cache pages to evict them.

Of course I might be missing something (some piece of history)

--
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [patch] md: bitmap read_page error

2007-01-22 Thread yang yin

I think your patch is not enough to solve the read_page error
completely. I think in bitmap_init_from_disk we also need to check
that 'count' never exceeds the size of the file before calling the
read_page function. What do you think about it?
Thanks for your reply.

2007/1/23, Neil Brown <[EMAIL PROTECTED]>:

On Monday January 22, [EMAIL PROTECTED] wrote:
> If the bitmap size is less than one page including the super_block and
> bitmap, and the inode's i_blkbits is also small, the read_page call
> to read the sb_page may return an error.
> For example, if the device is 12800 chunks, its bitmap file size is
> about 1.6KB including the bitmap super block. But if the inode i_blkbits
> value of the bitmap file is 10, read_page will submit 4 bh to
> load the sb_page. Because the size of the bitmap is only 1.6KB, an
> error will occur in the while loop when doing the bmap operation for
> block 2, which will return 0. Then the bitmap can't be initialized
> because the read of the sb page fails.
>
> Another error is in the bitmap_init_from_disk function.  Before doing
> read_page, the calculation of the count value misses the size of the
> super block. When the bitmap needs just one page, it will read two pages
> including the super block. But on the second read, the count value will
> be set to 0, so not all of the bitmap will be read from disk and
> some bitmap data will be missed on the second page.
>
> I give a patch as following:

Thanks a lot for this.
Rather than checking the file size in read_page, I would like to make
sure the 'count' that is passed in never exceeds the size of the
file.  This should have the same effect.

So this is that patch I plan to submit.

Thanks again,
NeilBrown


--
Avoid reading past the end of a bitmap file.

In most cases we check the size of the bitmap file before
reading data from it.  However when reading the superblock,
we always read the first PAGE_SIZE bytes, which might not
always be appropriate.  So limit that read to the size of the
file if appropriate.

Also, we get the count of available bytes wrong in one place,
so that too can read past the end of the file.

Cc:  "yang yin" <[EMAIL PROTECTED]>
Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/bitmap.c |   12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff .prev/drivers/md/bitmap.c ./drivers/md/bitmap.c
--- .prev/drivers/md/bitmap.c   2007-01-23 09:44:11.0 +1100
+++ ./drivers/md/bitmap.c   2007-01-23 09:44:59.0 +1100
@@ -479,9 +479,12 @@ static int bitmap_read_sb(struct bitmap
int err = -EINVAL;

/* page 0 is the superblock, read it... */
-   if (bitmap->file)
-   bitmap->sb_page = read_page(bitmap->file, 0, bitmap, PAGE_SIZE);
-   else {
+   if (bitmap->file) {
+   loff_t isize = i_size_read(bitmap->file->f_mapping->host);
+   int bytes = isize > PAGE_SIZE ? PAGE_SIZE : isize;
+
+   bitmap->sb_page = read_page(bitmap->file, 0, bitmap, bytes);
+   } else {
bitmap->sb_page = read_sb_page(bitmap->mddev, bitmap->offset, 
0);
}
if (IS_ERR(bitmap->sb_page)) {
@@ -877,7 +880,8 @@ static int bitmap_init_from_disk(struct
int count;
/* unmap the old page, we're done with it */
if (index == num_pages-1)
-   count = bytes - index * PAGE_SIZE;
+   count = bytes + sizeof(bitmap_super_t)
+   - index * PAGE_SIZE;
else
count = PAGE_SIZE;
if (index == 0) {




Re: [PATCH] Make CARDBUS_MEM_SIZE and CARDBUS_IO_SIZE customizable

2007-01-22 Thread Atsushi Nemoto
On Mon, 22 Jan 2007 18:17:38 +0300, Sergei Shtylyov <[EMAIL PROTECTED]> wrote:
> > +   cbiosize=nn[KMG]The fixed amount of bus space which is
> > +   reserved for the CardBus bridges IO window.
> 
> It should be "bridge's"...

Thanks.  Updated again.


Subject: [PATCH] Make CARDBUS_MEM_SIZE and CARDBUS_IO_SIZE customizable

CARDBUS_MEM_SIZE was increased to 64MB in 2.6.20-rc2, but the larger
size may cause the reservation itself to fail on some platforms (for
example typical 32-bit MIPS).  Make it (and CARDBUS_IO_SIZE too)
customizable via the "pci=" option for such platforms.

Signed-off-by: Atsushi Nemoto <[EMAIL PROTECTED]>
---
 Documentation/kernel-parameters.txt |6 ++
 drivers/pci/pci.c   |6 ++
 drivers/pci/setup-bus.c |   27 +++
 include/linux/pci.h |3 +++
 4 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 25d2985..a194b8f 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1259,6 +1259,12 @@ and is between 256 and 4096 characters. 
This sorting is done to get a device
order compatible with older (<= 2.4) kernels.
nobfsortDon't sort PCI devices into breadth-first order.
+   cbiosize=nn[KMG]The fixed amount of bus space which is
+   reserved for the CardBus bridge's IO window.
+   The default value is 256 bytes.
+   cbmemsize=nn[KMG]   The fixed amount of bus space which is
+   reserved for the CardBus bridge's memory
+   window. The default value is 64 megabytes.
 
pcmv=   [HW,PCMCIA] BadgePAD 4
 
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 206c834..639069a 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1168,6 +1168,12 @@ static int __devinit pci_setup(char *str
if (*str && (str = pcibios_setup(str)) && *str) {
if (!strcmp(str, "nomsi")) {
pci_no_msi();
+   } else if (!strncmp(str, "cbiosize=", 9)) {
+   pci_cardbus_io_size =
+   memparse(str + 9, &str);
+   } else if (!strncmp(str, "cbmemsize=", 10)) {
+   pci_cardbus_mem_size =
+   memparse(str + 10, &str);
} else {
printk(KERN_ERR "PCI: Unknown option `%s'\n",
str);
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 89f3036..1dfc288 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -40,8 +40,11 @@ #define ROUND_UP(x, a)   (((x) + (a) - 1)
  * FIXME: IO should be max 256 bytes.  However, since we may
  * have a P2P bridge below a cardbus bridge, we need 4K.
  */
-#define CARDBUS_IO_SIZE(256)
-#define CARDBUS_MEM_SIZE   (64*1024*1024)
+#define DEFAULT_CARDBUS_IO_SIZE(256)
+#define DEFAULT_CARDBUS_MEM_SIZE   (64*1024*1024)
+/* pci=cbmemsize=nnM,cbiosize=nn can override this */
+unsigned long pci_cardbus_io_size = DEFAULT_CARDBUS_IO_SIZE;
+unsigned long pci_cardbus_mem_size = DEFAULT_CARDBUS_MEM_SIZE;
 
 static void __devinit
 pbus_assign_resources_sorted(struct pci_bus *bus)
@@ -415,12 +418,12 @@ pci_bus_size_cardbus(struct pci_bus *bus
 * Reserve some resources for CardBus.  We reserve
 * a fixed amount of bus space for CardBus bridges.
 */
-   b_res[0].start = CARDBUS_IO_SIZE;
-   b_res[0].end = b_res[0].start + CARDBUS_IO_SIZE - 1;
+   b_res[0].start = pci_cardbus_io_size;
+   b_res[0].end = b_res[0].start + pci_cardbus_io_size - 1;
b_res[0].flags |= IORESOURCE_IO;
 
-   b_res[1].start = CARDBUS_IO_SIZE;
-   b_res[1].end = b_res[1].start + CARDBUS_IO_SIZE - 1;
+   b_res[1].start = pci_cardbus_io_size;
+   b_res[1].end = b_res[1].start + pci_cardbus_io_size - 1;
b_res[1].flags |= IORESOURCE_IO;
 
/*
@@ -440,16 +443,16 @@ pci_bus_size_cardbus(struct pci_bus *bus
 * twice the size.
 */
if (ctrl & PCI_CB_BRIDGE_CTL_PREFETCH_MEM0) {
-   b_res[2].start = CARDBUS_MEM_SIZE;
-   b_res[2].end = b_res[2].start + CARDBUS_MEM_SIZE - 1;
+   b_res[2].start = pci_cardbus_mem_size;
+   b_res[2].end = b_res[2].start + pci_cardbus_mem_size - 1;
b_res[2].flags |= IORESOURCE_MEM | IORESOURCE_PREFETCH;
 
-   b_res[3].start = CARDBUS_MEM_SIZE;
-   b_res[3].end = b_res[3].start + CARDBUS_MEM_SIZE - 1;
+   

Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Robert Hancock

Björn Steinbrink wrote:

Running a kernel with the return statement replace by a line that prints
the irq_stat instead.

Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.

40 minutes stress test now and no exception yet. What's interesting is
that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
might have get dropped are as above.
I'll keep it running for some time and will then re-enable the return
statement to see if there's a relation between the irq_stat 0x0 and the
exception.


No, doesn't seem to be related, did get 2 exceptions, but no irq_stat
0x0 for ata1. Syslog/dmesg has nothing new either, still the same
pattern of dismissed irq_stats.


I've finally managed to reproduce this problem on my box, by doing:

watch --interval=0.1 /sbin/hdparm -I /dev/sda

on one drive and then running bonnie++ on /dev/sdb connected to the 
other port on the same controller device. Usually within a few minutes 
one of the IDENTIFY commands would time out in the same way you guys 
have been seeing.


Through some various trials and tribulations, the only conclusion I can 
come to is that this controller really doesn't like that 
NV_INT_STATUS_CK804 register being looked at in ADMA mode. I tried 
adding some debug code to the qc_issue function that would check to see 
if the BUSY flag in altstatus went high or that register showed an 
interrupt within a certain time afterwards, however that really seemed 
to hose things, the system wouldn't even boot.


Try out this patch, it just calls the ata_host_intr function where 
appropriate without using nv_host_intr which looks at the 
NV_INT_STATUS_CK804 register. This is what the original ADMA patch from 
Mr. Mysterious NVIDIA Person did, I'm guessing there may be a reason for 
that. With this patch I can get through a whole bonnie++ run with the 
repeated IDENTIFY requests running without seeing the error.


As an aside, there seems to be some dubious code in nv_host_intr, if 
ata_host_intr returns 0 for handled when a command is outstanding, it 
goes and calls ata_check_status anyway. This is rather dangerous since 
if an interrupt showed up right after ata_host_intr but before 
ata_check_status, the ata_check_status would clear it and we would 
forget about it. I tried fixing just that issue and still had this 
problem however. I suspect that code is truly broken and needs further 
thought, but this patch avoids calling it in the ADMA case, at any rate.


As a final aside, this is another case where the hardware docs for this 
controller would really be useful, in order to know whether we are 
actually supposed to be reading that register in ADMA mode or not. I 
sent a query to Allen Martin at NVIDIA asking if there's a way I could 
get access to the documents, but I haven't heard anything yet.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c  2007-01-19 19:18:53.0 
-0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c 2007-01-22 18:35:09.0 
-0600
@@ -750,9 +750,9 @@ static irqreturn_t nv_adma_interrupt(int
 
/* if in ATA register mode, use standard ata interrupt 
handler */
if (pp->flags & NV_ADMA_PORT_REGISTER_MODE) {
-   u8 irq_stat = readb(host->mmio_base + 
NV_INT_STATUS_CK804)
-   >> (NV_INT_PORT_SHIFT * i);
-   handled += nv_host_intr(ap, irq_stat);
+   struct ata_queued_cmd *qc = ata_qc_from_tag(ap, 
ap->active_tag);
+   if(qc && !(qc->tf.flags & ATA_TFLAG_POLLING))
+   handled += ata_host_intr(ap, qc);
continue;
}
 


Re: configfs: return value for drop_item()/make_item()?

2007-01-22 Thread Joel Becker
On Mon, Jan 22, 2007 at 01:35:36PM +0100, Michael Noisternig wrote:
> Sure, but what I meant to say was that the user, when creating a 
> directory, did not request creation of such sub-directories, so I see 
> them as created by the kernel.

Ahh, but userspace did!  It's part of the configfs contract.
They've asked for a new config item and all that it entails.

> If you argue that they are in fact created by the user because they are 
> a direct result of a user action, then I can apply the same argument to 
> this one example:
> ...
> >This is precisely what configfs is designed to forbid. The kernel
> >does not, ever, create configfs objects on its own. It does it as a
> >result of userspace action.
> 
> No. The sub-directory only appears as a direct result of the user 
> writing a value into the 'type' attribute. ;-)

Ok, you're stretching the metaphor.  Writing a value to a "type"
attribute is, indeed, a userspace action.  However, configfs' contract
is that only mkdir(2) creates objects.
We're not trying to create the do-everything-kitchen-sink system
here.  That way lies the problems we're trying to avoid.  That's why
configfs has a specific contract it provides to (a) userspace and (b)
client drivers.

> >you're never going to get it from configfs. You should be using
> >sysfs.
> 
> Hardly. sysfs doesn't allow the user to create directories. :>

sysfs certainly supports your "echo b > type" style of object
creation.  Your type_file->store() method gets a "b" in the buffer and
then does sysfs_mkdir() of your new object directory.  Here, the kernel
is creating the new object (the directory).

> Well, you don't need PTR_ERR().

Sure, you could use **new_item.  It's the same complexity
change.
 
> That's an interesting other solution, however it seems a bit redundant 
> (params are referenced by links as well as in the 'order' attribute 
> file) and not as simple as my method 2). I guess, for now, in lack of a 
> convincing solution, I will implement method 2) as the one easiest to 
> adapt to given my current code base.

But they are not referenced by the order file.  It's just an
attribute :-)  Really, you can look at it either way.  But configfs has
a specific perspective based on its contracts, and so it works within
them.

> Hm, I had envisioned the user to fully configure the module via file 
> system operations only. Now if the user is supposed to use a wrapper 
> program this sheds a different light on all those 
> what's-the-best-solution issues...

Certainly the user can do the configuration by hand.  It will
always work.  But why make them understand your userspace<->kernel API
when you can just provide a tool?  They're all going to script it up
anyway.

Joel

-- 

"The doctrine of human equality reposes on this: that there is no
 man really clever who has not found that he is stupid."
- Gilbert K. Chesterton

Joel Becker
Principal Software Developer
Oracle
E-mail: [EMAIL PROTECTED]
Phone: (650) 506-8127
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[ot] Re: coding style and re-inventing the wheel way too many times

2007-01-22 Thread Oleg Verych
On 2006-12-21, Robert P. J. Day wrote:
[]
>   in any event, even *i* am not going to go near this kind of cleanup,
> but is there anything actually worth doing about it?  just curious.

Moscow wasn't built at once...

You may notice that some others are making progress little by little:
- source cleanups;
- dependency fixes;
- header includes;
- warning elimination;
- code reading and analysis (brainwashing, as far as I'm concerned ;) and
a few-line fixes everywhere.

I think there are members here who have been doing things like that for at
least two years.

You are trying to focus (mostly) on style and sense -- ok. And maybe that
feels like being alone within a crowd. Anyway, as long as you feel that you
can do it, and patches are accepted and changes are *documented*, it's ok
too. Sometimes it's stupid, worthless or something else. If so, then
find more interesting things to do, unless you are trying to prove
something to somebody ;E.

As I see it, there is almost no place for emotions here, only technical stuff
and everything not far from it. See, there isn't any reply to your
message, except this one. I think that is why.

--
-o--=O`C
 #oo'L O
<___=E M



Re: 2.6.20-rc5: cp 18gb 18gb.2 = OOM killer, reproducible just like 2.16.19.2

2007-01-22 Thread Andrew Morton
On Tue, 23 Jan 2007 11:37:09 +1100
Donald Douwsma <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> >> On Sun, 21 Jan 2007 14:27:34 -0500 (EST) Justin Piszcz <[EMAIL PROTECTED]> 
> >> wrote:
> >> Why does copying an 18GB on a 74GB raptor raid1 cause the kernel to invoke 
> >> the OOM killer and kill all of my processes?
> > 
> > What's that?   Software raid or hardware raid?  If the latter, which driver?
> 
> I've hit this using local disk while testing xfs built against 2.6.20-rc4 
> (SMP x86_64)
> 
> dmesg follows; I'm not sure if anything in it is useful after the first 
> event, as our automated tests continued on after the failure.

This looks different.

> ...
>
> Mem-info:
> Node 0 DMA per-cpu:
> CPU0: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:  
>  0
> CPU1: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:  
>  0
> CPU2: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:  
>  0
> CPU3: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:  
>  0
> Node 0 DMA32 per-cpu:
> CPU0: Hot: hi:  186, btch:  31 usd:  31   Cold: hi:   62, btch:  15 usd:  
> 53
> CPU1: Hot: hi:  186, btch:  31 usd:   2   Cold: hi:   62, btch:  15 usd:  
> 60
> CPU2: Hot: hi:  186, btch:  31 usd:  20   Cold: hi:   62, btch:  15 usd:  
> 47
> CPU3: Hot: hi:  186, btch:  31 usd:  25   Cold: hi:   62, btch:  15 usd:  
> 56
> Active:76 inactive:495856 dirty:0 writeback:0 unstable:0 free:3680 slab:9119 
> mapped:32 pagetables:637

No dirty pages, no pages under writeback.

> Node 0 DMA free:8036kB min:24kB low:28kB high:36kB active:0kB inactive:1856kB 
> present:9376kB pages_scanned:3296
> all_unreclaimable? yes
> lowmem_reserve[]: 0 2003 2003
> Node 0 DMA32 free:6684kB min:5712kB low:7140kB high:8568kB active:304kB 
> inactive:1981624kB present:2052068kB

Inactive list is filled.

> pages_scanned:4343329 all_unreclaimable? yes

We scanned our guts out and decided that nothing was reclaimable.

> No available memory (MPOL_BIND): kill process 3492 (hald) score 0 or a child
> No available memory (MPOL_BIND): kill process 7914 (top) score 0 or a child
> No available memory (MPOL_BIND): kill process 4166 (nscd) score 0 or a child
> No available memory (MPOL_BIND): kill process 17869 (xfs_repair) score 0 or a 
> child

But in all cases a constrained memory policy was in use.


Re: XFS or Kernel Problem / Bug

2007-01-22 Thread David Chinner
On Mon, Jan 22, 2007 at 09:07:23AM +0100, Stefan Priebe - FH wrote:
> Hi!
> 
> The update of the IDE layer was in 2.6.19. I don't think it is a 
> hardware bug because all 5 of these machines have run fine for a few years 
> with 2.6.16.X and before. We switched to 2.6.18.6 on Monday last week and 
> all machines began to crash periodically. On Friday last week we downgraded 
> them all to 2.6.16.37 and all 5 machines run fine again. So I don't 
> believe it is a hardware problem. Do you really think that could be?

I was thinking more of a driver change that is being triggered on
that particular hardware. FWIW, did you test 2.6.19?

I really need a better idea of the workload these servers are running
and, ideally, a reproducible test case to track something like
this down. At the moment I have no idea what is going on and no
real information on which to even base a guess.

Were there any other messages in the log?

On Mon, Jan 22, 2007 at 10:42:36AM +0100, Stefan Priebe - FH wrote:
> Hi!
> 
> I've another idea... could it be, that it is a barrier problem? Since 
> barriers are enabled by default from 2.6.17 on ...

You could try turning it off. If it does fix the problem, then I'd be
pointing once again at hardware ;)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [PATCH 2.6.20-rc5] SPI: alternative fix for spi_busnum_to_master

2007-01-22 Thread Atsushi Nemoto
On Mon, 22 Jan 2007 14:12:02 -0800, David Brownell <[EMAIL PROTECTED]> wrote:
> > Here is a revised version.  The children list of spi_master_class
> > contains only spi_master class so we can just compare bus_num member
> > instead of class_id string.
> 
> Looks just a bit iffy ... though, thanks for helping to finally
> sort this out!

Well, so would the previous patch (which checked the class_id
string) be preferred?

> > +   cdev = class_device_get(cdev);
> > +   if (!cdev)
> > +   continue;
> 
> That "continue" case doesn't seem like it should be possible... but
> at any rate, the "get" can be deferred until the relevant class
> device is known, since that _valid_ handle can't disappear so long
> as that semaphore is held.  And if you find the right device but
> can't get a reference ... no point in continuing!
> 
> Something like a class_find_device() would be the best way to solve
> this sort of problem, IMO.  But we don't have one of those.  :(

Indeed the check can be omitted.  Should I send a new patch that just
moves class_device_get() into the "if (master->bus_num == bus_num)"
block?

The crash with udev is a 2.6.20 regression, so I would like to see this
fixed very soon.  Thank you for the review.

---
Atsushi Nemoto


Re: PROBLEM: KB->KiB, MB -> MiB, ... (IEC 60027-2)

2007-01-22 Thread Krzysztof Halasa
Andreas Schwab <[EMAIL PROTECTED]> writes:

> But other than the sector size there is no natural power of 2 connected to
> disk size.  A disk can have any odd number of sectors.

But the manufacturers don't count in sectors.

It should be consistent, though. "How many GB of disk space do you
need to store 2 GB of USB flash, and how many to store 2 GB RAM image"?
:-)
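The inconsistency the joke points at is easy to quantify. A small
illustrative sketch of the decimal vs binary units under discussion
(IEC 60027-2):

```python
# Decimal units (as disk and flash vendors count) vs binary units
# (as RAM is sized), per IEC 60027-2.
GB = 10**9    # gigabyte
GiB = 2**30   # gibibyte

flash = 2 * GB       # a "2 GB" USB stick
ram_image = 2 * GiB  # a "2 GB" RAM image

# The RAM image needs about 2.147 "disk GB" -- roughly 7% more bytes
# than the identically labelled flash stick.
print(ram_image / GB)
print(ram_image - flash)  # 147483648 bytes of ambiguity between the two
```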
-- 
Krzysztof Halasa


Re: [Announce] GIT v1.5.0-rc2

2007-01-22 Thread Carl Worth
On Mon, 22 Jan 2007 11:28:32 -0800, Junio C Hamano wrote:
> Thanks for your comments;

You're welcome.

> the attached probably needs proofreading.

In general, I like it. The git-branch documentation already talks
about "remote-tracking branches" so I've rewritten a couple of
sentences below to use that same terminology. Also there are a couple
of grammar errors related to pluralization, (likely the fault of
English being quite a bit less consistent than other languages with
subject/verb number agreement, etc.).

> +   A repository with the separate remote layout starts with only
> +   one default branch, 'master', to be used for your own
> +   development.  Unlike the traditional layout that copied all
> +   the upstream branches into your branch namespace (while
> +   renaming their 'master' to your 'origin'), they are not made
> +   into your branches.  Instead, they are kept track of using
> +   'refs/remotes/origin/$upstream_branch_name'.

  renaming remote 'master' to local 'origin'), the new approach
  puts upstream branches into local "remote-tracking branches"
  with their own namespace. These can be referenced with names
  such as "origin/$upstream_branch_name" and are stored in
  .git/refs/remotes rather than .git/refs/heads where normal
  branches are stored.

> +   This layout keeps your own branch namespace less cluttered,
> +   avoids name collision with your upstream, makes it possible
> +   to automatically track new branches created at the remote
> +   after you clone from it, and makes it easier to interact with
> +   more than one remote repositories.  There might be some

Should be "more than one remote repository.". Also I'd add, ", (see
the new 'git remote' command)" before the end of that sentence.

> +   * 'git branch' does not show the branches from your upstream.

Again to use the same terminology, "does not show the remote-tracking
branches.".

> +   Repositories initialized with the traditional layout
> +   continues to work (and will continue to work).

The 's' on "continues" is incorrect. Perhaps:

continue to work (and will work in the future as well).

or just drop the parenthetical phrase.
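For reference, the separate-remote layout being documented can be
observed with a few commands (a sketch assuming a git with this layout,
i.e. 1.5.0 or later; the branch names are illustrative):

```shell
cd "$(mktemp -d)"                   # scratch area
mkdir up && cd up && git init -q
git -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m init
git branch topic                    # an upstream development branch
cd .. && git clone -q up work && cd work

git branch        # only your own default branch: upstream branches not copied
git branch -r     # remote-tracking branches, e.g. origin/topic
git for-each-ref refs/remotes       # kept under refs/remotes, not refs/heads
```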

-Carl




Re: [PATCH -rt] whitespace cleanup for 2.6.20-rc5-rt7

2007-01-22 Thread Satoru Takeuchi
At Tue, 23 Jan 2007 00:42:31 +0100,
Richard Knutsson wrote:
> 
> Michal Piotrowski wrote:
> > How about this script?
> >
> > "d) Ensure that your patch does not add new trailing whitespace.  The 
> > below
> >   script will fix up your patch by stripping off such whitespace.
> >
> > #!/bin/sh
> >
> > strip1()
> > {
> > TMP=$(mktemp /tmp/XX)
> > cp "$1" "$TMP"
> > sed -e '/^+/s/[ ]*$//' <"$TMP" >"$1"
> > rm "$TMP"
> > }
> >
> > for i in "$@"
> > do
> > strip1 "$i"
> > done
> > "
> > http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt
> I believe:
> 
> for i in "$@"; do \
>   sed --in-place -e "s/[   ]+$//" "$i"
> done
> 
> will do as well...

Hi Richard,

IIRC, `+' is an extended-regex operator, so the -r option is needed:

sed -r --in-place -e "s/[ \t]+$//" "$i"
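Either flavor can be checked quickly; here is the BRE form from
tpp.txt, which needs no -r because it uses `*` (GNU sed assumed, since
`\t` inside a bracket expression is a GNU extension):

```shell
cd "$(mktemp -d)"                       # scratch area
# An added line with trailing spaces, as a patch hunk would contain.
printf '+int x = 1;   \n' > demo.patch

# Strip trailing spaces/tabs, but only on lines the patch adds.
sed -e '/^+/s/[ \t]*$//' demo.patch > stripped.patch

cat -A stripped.patch   # no trailing blanks remain before the '$'
```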

Satoru Takeuchi

> 
> Richard Knutsson
> 


Re: Why active list and inactive list?

2007-01-22 Thread Andrea Arcangeli
On Tue, Jan 23, 2007 at 01:10:46AM +0100, Niki Hammler wrote:
> Dear Linux Developers/Enthusiasts,
> 
> For a course at my university I'm implementing parts of an operating
> system where I get most ideas from the Linux Kernel (which I like very
> much). One book I gain information from is [1].
> 
> Linux uses *two* linked lists for its page replacement algorithm (based
> on LRU) - one active list and one inactive list.
> I implemented my PRA this way too.
> 
> Now the big question is: WHY do I need *two* lists? Isn't it just
> overhead/more work? Are there any reasons for that?
>
> In my opinion, it would be better to have just one list; pop frames to
> be swapped out from the end of the list and push new frames in front of
> it. Then just evaluate the frames and shift them around in the list.
> 
> Is there any explanation why Linux uses two lists?

Back then I designed it with two LRU lists because splitting the
active from the inactive cache allows detecting the cache pollution
before it starts discarding the working set. The idea is that the
pollution will enter and exit the inactive list without ever being
elected to the active list, because by definition it will never
generate a cache hit. The working set will instead trigger cache hits
during page faults or repeated reads, and it will be preserved better
by being elected to the active list.

A page in the inactive list will be collected much more quickly than a
page in the active list, so the pollution will be collected more
quickly than the working set. Then the VM, while freeing cache, tries
to keep a balance between the sizes of the two lists to avoid being
too unfair; obviously at some point the active list has to be
de-activated too. If your server "fits in ram" you'll find lots of
cache to be active, and so the I/O activity that is not part of the
working set will be collected without affecting the working set much.
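The two-list behaviour described above can be modeled in a few lines.
This is a toy sketch with illustrative promotion and balancing
policies, not the kernel's actual implementation:

```python
from collections import OrderedDict

class TwoListLRU:
    """Toy model of the split active/inactive LRU.

    New pages enter the inactive list; a second reference promotes a
    page to the active list, so one-shot (polluting) I/O ages out of
    the inactive list without ever displacing the working set.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.inactive = OrderedDict()   # newest entries at the end
        self.active = OrderedDict()

    def touch(self, page):
        if page in self.active:
            self.active.move_to_end(page)    # refresh recency
        elif page in self.inactive:
            del self.inactive[page]          # cache hit: promote
            self.active[page] = True
        else:
            self.inactive[page] = True       # first touch: inactive only
        self._balance()

    def _balance(self):
        # De-activate the oldest active page when the active list grows
        # past half the cache; reclaim only from the inactive list.
        while len(self.active) > self.capacity // 2:
            old, _ = self.active.popitem(last=False)
            self.inactive[old] = True
        while len(self.active) + len(self.inactive) > self.capacity:
            self.inactive.popitem(last=False)  # evict coldest inactive

# A streaming scan (pages 0..9 touched once each) cannot evict the
# repeatedly-hit working set {100, 101}:
lru = TwoListLRU(4)
for p in (100, 101, 100, 101):
    lru.touch(p)
for p in range(10):
    lru.touch(p)
print(sorted(lru.active))   # the working set survives in the active list
```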


Re: 2.6.20-rc5: cp 18gb 18gb.2 = OOM killer, reproducible just like 2.16.19.2

2007-01-22 Thread Donald Douwsma
Andrew Morton wrote:
>> On Sun, 21 Jan 2007 14:27:34 -0500 (EST) Justin Piszcz <[EMAIL PROTECTED]> 
>> wrote:
>> Why does copying an 18GB on a 74GB raptor raid1 cause the kernel to invoke 
>> the OOM killer and kill all of my processes?
> 
> What's that?   Software raid or hardware raid?  If the latter, which driver?

I've hit this using local disk while testing xfs built against 2.6.20-rc4 (SMP 
x86_64)

dmesg follows; I'm not sure if anything in it is useful after the first event, 
as our automated tests continued on after the failure.

> Please include /proc/meminfo from after the oom-killing.
>
> Please work out what is using all that slab memory, via /proc/slabinfo.

Sorry I didn't pick this up either.
I'll try to reproduce this and gather some more detailed info for a single 
event.

Donald


...
XFS mounting filesystem sdb5
Ending clean XFS mount for filesystem: sdb5
XFS mounting filesystem sdb5
Ending clean XFS mount for filesystem: sdb5
hald invoked oom-killer: gfp_mask=0x200d2, order=0, oomkilladj=0

Call Trace:
 [] out_of_memory+0x70/0x25d
 [] __alloc_pages+0x22c/0x2b5
 [] alloc_page_vma+0x71/0x76
 [] read_swap_cache_async+0x45/0xd8
 [] swapin_readahead+0x60/0xd3
 [] __handle_mm_fault+0x703/0x9d8
 [] do_page_fault+0x42b/0x7b3
 [] do_readv_writev+0x176/0x18b
 [] thread_return+0x0/0xed
 [] __const_udelay+0x2c/0x2d
 [] scsi_done+0x0/0x17
 [] error_exit+0x0/0x84

Mem-info:
Node 0 DMA per-cpu:
CPU0: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:   0
CPU1: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:   0
CPU2: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:   0
CPU3: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU0: Hot: hi:  186, btch:  31 usd:  31   Cold: hi:   62, btch:  15 usd:  53
CPU1: Hot: hi:  186, btch:  31 usd:   2   Cold: hi:   62, btch:  15 usd:  60
CPU2: Hot: hi:  186, btch:  31 usd:  20   Cold: hi:   62, btch:  15 usd:  47
CPU3: Hot: hi:  186, btch:  31 usd:  25   Cold: hi:   62, btch:  15 usd:  56
Active:76 inactive:495856 dirty:0 writeback:0 unstable:0 free:3680 slab:9119 
mapped:32 pagetables:637
Node 0 DMA free:8036kB min:24kB low:28kB high:36kB active:0kB inactive:1856kB 
present:9376kB pages_scanned:3296
all_unreclaimable? yes
lowmem_reserve[]: 0 2003 2003
Node 0 DMA32 free:6684kB min:5712kB low:7140kB high:8568kB active:304kB 
inactive:1981624kB present:2052068kB
pages_scanned:4343329 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0
Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 
1*2048kB 1*4096kB = 8036kB
Node 0 DMA32: 273*4kB 29*8kB 1*16kB 1*32kB 1*64kB 1*128kB 2*256kB 1*512kB 
0*1024kB 0*2048kB 1*4096kB = 6684kB
Swap cache: add 741048, delete 244661, find 84826/143198, race 680+239
Free swap  = 1088524kB
Total swap = 3140668kB
Free swap:   1088524kB
524224 pages of RAM
9619 reserved pages
259 pages shared
496388 pages swap cached
No available memory (MPOL_BIND): kill process 3492 (hald) score 0 or a child
Killed process 3626 (hald-addon-acpi)
top invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0

Call Trace:
 [] out_of_memory+0x70/0x25d
 [] __alloc_pages+0x22c/0x2b5
 [] alloc_pages_current+0x74/0x79
 [] __page_cache_alloc+0xb/0xe
 [] __do_page_cache_readahead+0xa1/0x217
 [] io_schedule+0x28/0x33
 [] __wait_on_bit_lock+0x5b/0x66
 [] __lock_page+0x72/0x78
 [] do_page_cache_readahead+0x4e/0x5a
 [] filemap_nopage+0x140/0x30c
 [] __handle_mm_fault+0x1fb/0x9d8
 [] do_page_fault+0x42b/0x7b3
 [] __wake_up+0x43/0x50
 [] tty_ldisc_deref+0x71/0x76
 [] error_exit+0x0/0x84

Mem-info:
Node 0 DMA per-cpu:
CPU0: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:   0
CPU1: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:   0
CPU2: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:   0
CPU3: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU0: Hot: hi:  186, btch:  31 usd:  31   Cold: hi:   62, btch:  15 usd:  53
CPU1: Hot: hi:  186, btch:  31 usd:   2   Cold: hi:   62, btch:  15 usd:  60
CPU2: Hot: hi:  186, btch:  31 usd:   1   Cold: hi:   62, btch:  15 usd:  10
CPU3: Hot: hi:  186, btch:  31 usd:  25   Cold: hi:   62, btch:  15 usd:  26
Active:90 inactive:496233 dirty:0 writeback:0 unstable:0 free:3485 slab:9119 
mapped:32 pagetables:637
Node 0 DMA free:8036kB min:24kB low:28kB high:36kB active:0kB inactive:1856kB 
present:9376kB pages_scanned:3328
all_unreclaimable? yes
lowmem_reserve[]: 0 2003 2003
Node 0 DMA32 free:5904kB min:5712kB low:7140kB high:8568kB active:360kB 
inactive:1983092kB present:2052068kB
pages_scanned:4587649 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0
Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 
1*2048kB 1*4096kB = 8036kB
Node 0 DMA32: 78*4kB 29*8kB 1*16kB 1*32kB 1*64kB 1*128kB 2*256kB 1*512kB 
0*1024kB 0*2048kB 1*4096kB = 

[PATCH 002 of 4] md: Make 'repair' actually work for raid1.

2007-01-22 Thread NeilBrown

When 'repair' finds a block that differs between the various
parts of the mirror, it is meant to write a chosen good version
to the others.  However it currently writes out the original data
to each.  The memcpy to make all the data the same is missing.
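In miniature, the missing step is the following (an illustrative
Python model, not the kernel code):

```python
# Toy model of the repair step in sync_request_write(): overwrite
# every mirror with the chosen good copy -- the operation the
# missing memcpy performs per page.
def repair(mirrors, good=0):
    for i in range(len(mirrors)):
        if i != good:
            mirrors[i] = list(mirrors[good])  # copy good data over
    return mirrors

mirrors = [[1, 2], [9, 9], [3, 3]]
print(repair(mirrors))  # all copies now match the first
```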


Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/raid1.c |5 +
 1 file changed, 5 insertions(+)

diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
--- .prev/drivers/md/raid1.c2007-01-23 11:13:45.0 +1100
+++ ./drivers/md/raid1.c2007-01-23 11:23:43.0 +1100
@@ -1221,6 +1221,11 @@ static void sync_request_write(mddev_t *
 				sbio->bi_sector = r1_bio->sector +
 					conf->mirrors[i].rdev->data_offset;
 				sbio->bi_bdev = conf->mirrors[i].rdev->bdev;
+				for (j = 0; j < vcnt ; j++)
+					memcpy(page_address(sbio->bi_io_vec[j].bv_page),
+					       page_address(pbio->bi_io_vec[j].bv_page),
+					       PAGE_SIZE);
+
 			}
 		}
 	}


[PATCH 003 of 4] md: Make sure the events count in an md array never returns to zero.

2007-01-22 Thread NeilBrown

Now that we sometimes step the array events count backwards
(when transitioning dirty->clean where nothing else interesting
has happened - so that we don't need to write to spares all the time),
it is possible for the event count to return to zero, which is
potentially confusing and triggers an MD_BUG.

We could possibly remove the MD_BUG, but it is just as easy, and
probably safer, to make sure we never return to zero.
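The guarded decrement can be modeled as follows (names are
illustrative; the real condition in md.c also checks
mddev->recovery_cp):

```python
def next_events(events, nospares, in_sync):
    # Roll back the odd dirty->clean step when nothing else happened,
    # sparing the spares a superblock write -- but never roll back to
    # zero, which would trip the MD_BUG sanity check.
    if nospares and in_sync and (events & 1) and events != 1:
        return events - 1
    return events + 1

print(next_events(3, True, True))  # rolls back to 2
print(next_events(1, True, True))  # would hit zero, so goes forward to 2
```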

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/md.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c   2007-01-23 11:13:44.0 +1100
+++ ./drivers/md/md.c   2007-01-23 11:23:58.0 +1100
@@ -1633,7 +1633,8 @@ repeat:
 * and 'events' is odd, we can roll back to the previous clean state */
if (nospares
&& (mddev->in_sync && mddev->recovery_cp == MaxSector)
-   && (mddev->events & 1))
+   && (mddev->events & 1)
+   && mddev->events != 1)
mddev->events--;
else {
/* otherwise we have to go forward and ... */


[PATCH 004 of 4] md: Avoid reading past the end of a bitmap file.

2007-01-22 Thread NeilBrown

In most cases we check the size of the bitmap file before
reading data from it.  However when reading the superblock,
we always read the first PAGE_SIZE bytes, which might not 
always be appropriate.  So limit that read to the size of the
file if appropriate.

Also, we get the count of available bytes wrong in one place,
so that too can read past the end of the file.

Cc:  "yang yin" <[EMAIL PROTECTED]>

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/bitmap.c |   12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff .prev/drivers/md/bitmap.c ./drivers/md/bitmap.c
--- .prev/drivers/md/bitmap.c   2007-01-23 11:13:43.0 +1100
+++ ./drivers/md/bitmap.c   2007-01-23 11:24:09.0 +1100
@@ -479,9 +479,12 @@ static int bitmap_read_sb(struct bitmap 
int err = -EINVAL;
 
/* page 0 is the superblock, read it... */
-   if (bitmap->file)
-   bitmap->sb_page = read_page(bitmap->file, 0, bitmap, PAGE_SIZE);
-   else {
+   if (bitmap->file) {
+   loff_t isize = i_size_read(bitmap->file->f_mapping->host);
+   int bytes = isize > PAGE_SIZE ? PAGE_SIZE : isize;
+
+   bitmap->sb_page = read_page(bitmap->file, 0, bitmap, bytes);
+   } else {
bitmap->sb_page = read_sb_page(bitmap->mddev, bitmap->offset, 0);
}
if (IS_ERR(bitmap->sb_page)) {
@@ -877,7 +880,8 @@ static int bitmap_init_from_disk(struct 
int count;
/* unmap the old page, we're done with it */
if (index == num_pages-1)
-   count = bytes - index * PAGE_SIZE;
+   count = bytes + sizeof(bitmap_super_t)
+   - index * PAGE_SIZE;
else
count = PAGE_SIZE;
if (index == 0) {


[PATCH 001 of 4] md: Update email address and status for MD in MAINTAINERS.

2007-01-22 Thread NeilBrown


Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./MAINTAINERS |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff .prev/MAINTAINERS ./MAINTAINERS
--- .prev/MAINTAINERS   2007-01-23 11:14:14.0 +1100
+++ ./MAINTAINERS   2007-01-23 11:23:03.0 +1100
@@ -3011,9 +3011,9 @@ SOFTWARE RAID (Multiple Disks) SUPPORT
 P: Ingo Molnar
 M: [EMAIL PROTECTED]
 P: Neil Brown
-M: [EMAIL PROTECTED]
+M: [EMAIL PROTECTED]
 L: linux-raid@vger.kernel.org
-S: Maintained
+S: Supported
 
 SOFTWARE SUSPEND:
 P: Pavel Machek


[PATCH 000 of 4] md: Introduction - Assorted bugfixes

2007-01-22 Thread NeilBrown
Following are 4 patches suitable for inclusion in 2.6.20.

Thanks,
NeilBrown

 [PATCH 001 of 4] md: Update email address and status for MD in MAINTAINERS.
 [PATCH 002 of 4] md: Make 'repair' actually work for raid1.
 [PATCH 003 of 4] md: Make sure the events count in an md array never returns 
to zero.
 [PATCH 004 of 4] md: Avoid reading past the end of a bitmap file.

