Re: linux-2.6.20-rc4-mm1 Reiser4 filesystem freeze and corruption
Zan Lynx wrote:
> I have been running 2.6.20-rc2-mm1 without problems, but both rc3-mm1 and
> rc4-mm1 have been giving me these freezes. They were happening inside X and
> without an external console it was impossible to get anything, plus I was
> reluctant to test it since the freeze sometimes requires a full
> fsck.reiser4 --build-fs to recover the filesystem.
> [...]

Hi,

I don't know if it is related, but I've had the following BUG on
2.6.20-rc4-mm1 (+ hot-fixes patches applied):

---
kernel BUG at fs/reiser4/plugin/item/extent_file_ops.c:973!
invalid opcode: [#1]
PREEMPT
last sysfs file: /devices/pci:00/:00:13.0/eth0/statistics/collisions
Modules linked in: binfmt_misc nfs lockd sunrpc radeon drm reiser4 ati_remote fuse usbhid snd_via82xx snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_mpu401_uart snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device ohci1394 ieee1394 psmouse sr_mod cdrom sg ehci_hcd via_agp agpgart uhci_hcd usbcore i2c_viapro snd soundcore
CPU:    0
EIP:    0060:[]    Not tainted VLI
EFLAGS: 00010282   (2.6.20-rc4-mm1 #1)
EIP is at reiser4_write_extent+0xd5/0x626 [reiser4]
eax: ccca139c   ebx: 0200   ecx: f5bec400   edx: ffe4
esi:            edi: f5bec414   ebp: da6ff274   esp: e17d7e34
ds: 007b   es: 007b   fs: 00d8   gs: 0033   ss: 0068
Process sstrip (pid: 23858, ti=e17d6000 task=d8ffc570 task.ti=e17d6000)
Stack: 00100100 00200200 00100100 0034 bf826a50 e083ff00 c000 da6ff2c8
       dccba4c0 0005 01ff 021e 0004 f9b6cdad 0004 0004
       0001
Call Trace:
 [] reiser4_update_sd+0x22/0x28 [reiser4]
 [] notify_change+0x200/0x20f
 [] vsscanf+0x1e2/0x3ff
 [] write_unix_file+0x0/0x495 [reiser4]
 [] __remove_suid+0x10/0x14
 [] mark_page_accessed+0x1c/0x2e
 [] reiser4_txn_begin+0x1c/0x2e [reiser4]
 [] reiser4_write_extent+0x0/0x626 [reiser4]
 [] write_unix_file+0x25a/0x495 [reiser4]
 [] __handle_mm_fault+0x2bd/0x79b
 [] write_unix_file+0x0/0x495 [reiser4]
 [] vfs_write+0x8a/0x136
 [] sys_write+0x41/0x67
 [] sysenter_past_esp+0x5f/0x85
 ===
Code: 04 89 0c 24 31 c9 89 5c 24 04 e8 52 fc ff ff 31 d2 e9 59 05 00 00 64 a1 08 00 00 00 8b 80 b4 04 00 00 8b 40 38 83 78 08 00 74 04 <0f> 0b eb fe 8b 8c 24 e0 00 00 00 31 db 8b 01 8b 51 04 89 c1 0f
EIP: [] reiser4_write_extent+0xd5/0x626 [reiser4] SS:ESP 0068:e17d7e34
<4>reiser4[sstrip(23858)]: release_unix_file (fs/reiser4/plugin/file/file.c:2417)[vs-44]: WARNING: out of memory?
reiser4[sstrip(23858)]: release_unix_file (fs/reiser4/plugin/file/file.c:2417)[vs-44]: WARNING: out of memory?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[fix, rfc] kbuild: O= with M= (Re: [PATCH -mm] replacement for broken kbuild-dont-put-temp-files-in-the-source-tree.patch)
On 2006-11-17, Oleg Verych wrote:
> On Tue, Oct 31, 2006 at 02:51:36PM +0100, olecom wrote:
> []
>> On Tue, Oct 31, 2006 at 02:14:16AM +0100, Horst Schirmeier wrote:
>> []
>> > I'm not sure what you mean by $(objdir); I just got something to work
>> > which creates the /dev/null symlink in a (newly created if necessary)
>> > directory named
>>
>> $(objtree) is the directory for all possible outputs of the build process;
>> it's set up by `O=' or `KBUILD_OUTPUT', and it is *not* the output
>> directory for ready external modules `$(M)'. Try to play with this, please.
>
> And for me, they are *not* working together:

It works with this:

Proposed-by: me

--- linux-2.6.20-rc5/scripts/Makefile.modpost.orig	2007-01-12 19:54:26.0 +0100
+++ linux-2.6.20-rc5/scripts/Makefile.modpost	2007-01-23 08:23:51.583426500 +0100
@@ -58,5 +58,5 @@
 # Includes step 3,4
 quiet_cmd_modpost = MODPOST $(words $(filter-out vmlinux FORCE, $^)) modules
-      cmd_modpost = scripts/mod/modpost \
+      cmd_modpost = $(objtree)/scripts/mod/modpost \
 	$(if $(CONFIG_MODVERSIONS),-m) \
 	$(if $(CONFIG_MODULE_SRCVERSION_ALL),-a,) \
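For reference, the combination this thread is about looks roughly like the
following (the paths are hypothetical placeholders; with the change above,
modpost is resolved under $(objtree) rather than the current directory):

```
# Configure and build a kernel into a separate output directory:
make O=/home/user/build/linux defconfig
make O=/home/user/build/linux

# Build an external module against that separate output tree:
make -C /home/user/build/linux M=/home/user/src/mymod modules
```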
[git patches] net driver fixes
Please pull from the 'upstream-linus' branch of
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git upstream-linus

to receive the following updates:

 drivers/net/ehea/ehea.h              |  2 +-
 drivers/net/ehea/ehea_main.c         | 56 +-
 drivers/net/ehea/ehea_phyp.c         | 10 +-
 drivers/net/netxen/netxen_nic.h      |  7 ++--
 drivers/net/netxen/netxen_nic_hw.c   |  3 +-
 drivers/net/netxen/netxen_nic_main.c |  2 +-
 drivers/net/pcmcia/3c589_cs.c        |  7 +++-
 drivers/net/phy/phy.c                |  3 +-
 8 files changed, 57 insertions(+), 33 deletions(-)

Amit S. Kale (2):
      NetXen: Firmware check modifications
      NetXen: Use pci_register_driver() instead of pci_module_init() in init_module

Komuro (1):
      modify 3c589_cs to be SMP safe

Kumar Gala (1):
      PHY: Export phy ethtool helpers

Thomas Klein (7):
      ehea: Fixed wrong dereferencation
      ehea: Fixing firmware queue config issue
      ehea: Modified initial autoneg state determination
      ehea: New method to determine number of available ports
      ehea: Improved logging of permission issues
      ehea: Added logging off associated errors
      ehea: Fixed possible nullpointer access

diff --git a/drivers/net/ehea/ehea.h b/drivers/net/ehea/ehea.h
index 39ad9f7..be10a3a 100644
--- a/drivers/net/ehea/ehea.h
+++ b/drivers/net/ehea/ehea.h
@@ -39,7 +39,7 @@
 #include

 #define DRV_NAME	"ehea"
-#define DRV_VERSION	"EHEA_0043"
+#define DRV_VERSION	"EHEA_0044"

 #define EHEA_MSG_DEFAULT (NETIF_MSG_LINK | NETIF_MSG_TIMER \
 	| NETIF_MSG_RX_ERR | NETIF_MSG_TX_ERR)
diff --git a/drivers/net/ehea/ehea_main.c b/drivers/net/ehea/ehea_main.c
index 83fa32f..1072e69 100644
--- a/drivers/net/ehea/ehea_main.c
+++ b/drivers/net/ehea/ehea_main.c
@@ -558,12 +558,12 @@ static irqreturn_t ehea_qp_aff_irq_handler(int irq, void *param)
 	u32 qp_token;

 	eqe = ehea_poll_eq(port->qp_eq);
-	ehea_debug("eqe=%p", eqe);
+
 	while (eqe) {
-		ehea_debug("*eqe=%lx", *(u64*)eqe);
-		eqe = ehea_poll_eq(port->qp_eq);
 		qp_token = EHEA_BMASK_GET(EHEA_EQE_QP_TOKEN, eqe->entry);
-		ehea_debug("next eqe=%p", eqe);
+		ehea_error("QP aff_err: entry=0x%lx, token=0x%x",
+			   eqe->entry, qp_token);
+		eqe = ehea_poll_eq(port->qp_eq);
 	}

 	return IRQ_HANDLED;
@@ -575,8 +575,9 @@ static struct ehea_port *ehea_get_port(struct ehea_adapter *adapter,
 	int i;

 	for (i = 0; i < adapter->num_ports; i++)
-		if (adapter->port[i]->logical_port_id == logical_port)
-			return adapter->port[i];
+		if (adapter->port[i])
+			if (adapter->port[i]->logical_port_id == logical_port)
+				return adapter->port[i];

 	return NULL;
 }
@@ -642,6 +643,8 @@ int ehea_sense_port_attr(struct ehea_port *port)
 		break;
 	}

+	port->autoneg = 1;
+
 	/* Number of default QPs */
 	port->num_def_qps = cb0->num_default_qps;
@@ -728,10 +731,7 @@ int ehea_set_portspeed(struct ehea_port *port, u32 port_speed)
 		}
 	} else {
 		if (hret == H_AUTHORITY) {
-			ehea_info("Hypervisor denied setting port speed. Either"
-				  " this partition is not authorized to set "
-				  "port speed or another partition has modified"
-				  " port speed first.");
+			ehea_info("Hypervisor denied setting port speed");
 			ret = -EPERM;
 		} else {
 			ret = -EIO;
@@ -998,7 +998,7 @@ static int ehea_configure_port(struct ehea_port *port)
 		| EHEA_BMASK_SET(PXLY_RC_JUMBO_FRAME, 1);

 	for (i = 0; i < port->num_def_qps; i++)
-		cb0->default_qpn_arr[i] = port->port_res[i].qp->init_attr.qp_nr;
+		cb0->default_qpn_arr[i] = port->port_res[0].qp->init_attr.qp_nr;

 	if (netif_msg_ifup(port))
 		ehea_dump(cb0, sizeof(*cb0), "ehea_configure_port");
@@ -1485,11 +1485,12 @@ out:
 static void ehea_promiscuous_error(u64 hret, int enable)
 {
-	ehea_info("Hypervisor denied %sabling promiscuous mode.%s",
-		  enable == 1 ? "en" : "dis",
-		  hret != H_AUTHORITY ? "" : " Another partition owning a "
-		  "logical port on the same physical port might have altered "
-		  "promiscuous mode first.");
+	if (hret == H_AUTHORITY)
+		ehea_info("Hypervisor denied %sabling promiscuous mode",
+			  enable == 1 ? "en" : "dis");
+	else
+		ehea_error("failed %sabling promiscuous mode",
+			   enable == 1 ? "en" : "dis");
 }

 static void
Re: Suspend to RAM generates oops and general protection fault
What about removing the psmouse module?

On 1/23/07, Jean-Marc Valin <[EMAIL PROTECTED]> wrote:
>>> will be a device driver. Common causes of suspend/resume problems from
>>> the list you give below are acpi modules, bluetooth and usb. I'd also
>>> consider pcmcia, drm and fuse as possibilities. But again, go for
>>> unloading everything possible in the first instance.
>>
>> Actually, the reason I sent this is that when I showed the oops/gpf to
>> Matthew Garrett at linux.conf.au, he said it looked like a CPU hotplug
>> problem and suggested I send it to lkml. BTW, with 2.6.20-rc5, the
>> suspend to RAM now works ~95% of the time.
>
> Try a kernel without CONFIG_SMP... that will verify if it is SMP related.

Well, this happens to be my main work machine, which I'm not willing to have
running at half speed for several weeks. Anything else you can suggest?

Jean-Marc
Re: `make htmldocs` fails -- 2.6.20-rc4-mm1
On Mon, 2007-01-22 at 22:22 -0800, Randy Dunlap wrote:
> On Mon, 22 Jan 2007 22:02:30 -0800 Don Mullis wrote:
>
>> Bisection shows the bad patch to be:
>> gregkh-driver-uio-documentation.patch
>>
>> The htmldocs build failure can be eliminated by:
>> quilt remove Documentation/DocBook/kernel-api.tmpl
>
> or by: quilt delete gregkh-driver-uio-documentation.patch ??

That would fix the htmldocs build too, but would throw out lots of
documentation. Greg K-H would seem the prime candidate to propose a fix.

> How about an accurate description of what kernel tree has this problem?
> It's not 2.6.19. It's not 2.6.20-rc5.

2.6.20-rc4-mm1, sorry. Forgot that posting as a reply to the 2.6.20-rc4-mm1
announcement is no help for someone receiving the mail directly.
Re: `make htmldocs` fails
On Mon, Jan 22, 2007 at 10:02:30PM -0800, Don Mullis wrote:
>
> Bisection shows the bad patch to be:
> gregkh-driver-uio-documentation.patch
>
> The htmldocs build failure can be eliminated by:
> quilt remove Documentation/DocBook/kernel-api.tmpl
>
> The error messages were:
>
> .../linux-2.6.19 $ make htmldocs
>   DOCPROC Documentation/DocBook/kernel-api.xml
>
> Warning(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//drivers/uio/uio.c:521): No description found for parameter 'owner'
>
> Warning(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//drivers/uio/uio.c:521): No description found for parameter 'info'
>
> Warning(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//drivers/uio/uio.c:591): No description found for parameter 'idev'

Thanks, I've fixed these warnings now.

> Error(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//include/linux/uio_driver.h:33): cannot understand prototype: 'struct uio_info '

I think I've fixed this now; the next -mm should contain the update.

Thanks for letting me know.

greg k-h
Re: `make htmldocs` fails
On Mon, 22 Jan 2007 22:02:30 -0800 Don Mullis wrote:
>
> Bisection shows the bad patch to be:
> gregkh-driver-uio-documentation.patch
>
> The htmldocs build failure can be eliminated by:
> quilt remove Documentation/DocBook/kernel-api.tmpl

or by: quilt delete gregkh-driver-uio-documentation.patch ??

> The error messages were:
>
> .../linux-2.6.19 $ make htmldocs
>   DOCPROC Documentation/DocBook/kernel-api.xml
>
> Warning(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//drivers/uio/uio.c:521): No description found for parameter 'owner'
>
> Warning(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//drivers/uio/uio.c:521): No description found for parameter 'info'
>
> Warning(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//drivers/uio/uio.c:591): No description found for parameter 'idev'
>
> Error(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//include/linux/uio_driver.h:33): cannot understand prototype: 'struct uio_info '
>
> Warning(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//include/linux/uio_driver.h): no structured comments found
> make[1]: *** [Documentation/DocBook/kernel-api.xml] Error 1
> make: *** [htmldocs] Error 2
>
> The failure was observed on an up-to-date Fedora Core 5 host.

How about an accurate description of what kernel tree has this problem?
It's not 2.6.19. It's not 2.6.20-rc5.

---
~Randy
Re: SATA hotplug from the user side ?
Henrique de Moraes Holschuh wrote:
> Does SATA electrical connector keying let the disk firmware unload
> heads before the user manages to pull it out enough to sever power?

I don't think so.

> If it does not, the drive will do an emergency head unload, which is
> not good and will likely reduce the drive's lifetime.

Probably.

> Using hdparm -Y before the unplug, or scsiadd -r (on a kernel that
> has Tejun's new patch to optionally issue a START_STOP_UNIT to the
> SCSI device enabled) is probably a good idea. Unless it is a shared
> SATA port (I don't know if such a thing exists yet) and another box
> is talking to the disk, etc.

Agreed. But it would be *much* better if all of this could be taken care of
by hald and its minions, such that the user can just tell the system that the
hdd is going to be removed and all these dirty tricks are done automagically.

--
tejun
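The manual sequence being discussed can be sketched as follows (the device
name /dev/sdb and its sysfs path are placeholders for the disk about to be
removed; run as root, and only on a disk nothing is using):

```
# Put the drive into standby (heads unloaded, spindle stopped)
# before physically pulling it:
hdparm -Y /dev/sdb

# Then detach it from the SCSI layer, either with scsiadd -r as in
# the examples elsewhere in this thread, or via sysfs:
echo 1 > /sys/block/sdb/device/delete
```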
Re: [PATCH] kbuild: Replace remaining "depends" with "depends on"
On 2006-12-13, Robert P. J. Day wrote:
>
> Replace the very few remaining "depends" Kconfig directives with
> "depends on".
>
> Signed-off-by: Robert P. J. Day <[EMAIL PROTECTED]>
>
> ---

For this kind of fix, please use the "kconfig" subsystem instead of "kbuild"
in the subject. Thanks.
Re: SATA hotplug from the user side ?
Soeren Sonnenburg wrote:
> OK, how about this (please especially check the non-SIL part):
>
> SATA Hotplug from the User Side
>
> - For SIL3114 and SIL3124 you don't have to run any commands at all.

ahci and the ck804 flavor of sata_nv can do hotplug without user assistance
too.

[--snip--]

> - For other chipsets, in addition to stopping use of the device before
>   unplugging, one has to call scsiadd -r to fully remove it. E.g. in the
>   following example the disk on scsi host 3 channel 0 id 0 lun 0 will be
>   removed; then, on reinserting a disk, call scsiadd -s:
>
> # scsiadd -p
>
> Attached devices:
> Host: scsi2 Channel: 00 Id: 00 Lun: 00
>   Vendor: ATA      Model: ST3400832AS      Rev: 3.01
>   Type:   Direct-Access                    ANSI SCSI revision: 05
> Host: scsi3 Channel: 00 Id: 00 Lun: 00
>   Vendor: ATA      Model: ST3400620AS      Rev: 3.AA
>   Type:   Direct-Access                    ANSI SCSI revision: 05
>
> # scsiadd -r 3 0 0 0
> # scsiadd -s

Doing the above might not be such a good idea on drivers which don't
implement new EH yet. Those are sata_mv, sata_promise (getting there) and
sata_sx4.

Thanks.

--
tejun
Re: [updated PATCH] remove 555 unneeded #includes of sched.h
On 2006-12-29, Tim Schmielau wrote:
[]
> OK, building 2.6.20-rc2-mm1 with all 59 configs from arch/arm/configs
> with and w/o the patch indeed found one mysterious #include that may not
> be removed. Thanks, Russell!
>
> Andrew, please use the attached patch instead of the previous one; it also
> has a slightly better patch description.

Great job! About the patch: to make it smaller, I think you could use fewer
"unified context" lines (`diff -u1`). Nicely done!
`make htmldocs` fails
Bisection shows the bad patch to be:
gregkh-driver-uio-documentation.patch

The htmldocs build failure can be eliminated by:
quilt remove Documentation/DocBook/kernel-api.tmpl

The error messages were:

.../linux-2.6.19 $ make htmldocs
  DOCPROC Documentation/DocBook/kernel-api.xml

Warning(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//drivers/uio/uio.c:521): No description found for parameter 'owner'

Warning(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//drivers/uio/uio.c:521): No description found for parameter 'info'

Warning(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//drivers/uio/uio.c:591): No description found for parameter 'idev'

Error(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//include/linux/uio_driver.h:33): cannot understand prototype: 'struct uio_info '

Warning(/noback/kernel.org/2.6.20-rc_fault-injection/linux-2.6.19//include/linux/uio_driver.h): no structured comments found
make[1]: *** [Documentation/DocBook/kernel-api.xml] Error 1
make: *** [htmldocs] Error 2

The failure was observed on an up-to-date Fedora Core 5 host.
Re: [PATCH 2.6.20-rc5 1/7] ehea: Fixed wrong dereferencation
Thomas Klein wrote:
> Not only check the pointer against 0 but also the dereferenced value
>
> Signed-off-by: Thomas Klein <[EMAIL PROTECTED]>
> ---
>  drivers/net/ehea/ehea.h      | 2 +-
>  drivers/net/ehea/ehea_main.c | 6 --
>  2 files changed, 5 insertions(+), 3 deletions(-)

applied 1-7 to #upstream-fixes
Re: [PATCH 0/4] atl1: Attansic L1 ethernet driver
Jay Cliburn wrote:
> This is the latest submittal of the patchset providing support for the
> Attansic L1 gigabit ethernet adapter. This patchset is built against
> kernel version 2.6.20-rc5.
>
> This version incorporates all comments from:
>
> Christoph Hellwig:
> http://lkml.org/lkml/2007/1/11/43
> http://lkml.org/lkml/2007/1/11/45
> http://lkml.org/lkml/2007/1/11/48
> http://lkml.org/lkml/2007/1/11/49
>
> Jeff Garzik:
> http://lkml.org/lkml/2007/1/18/233
>
> Many thanks to both for reviewing the driver.
>
> The monolithic version of this patchset may be found at:
> ftp://hogchain.net/pub/linux/attansic/kernel_driver/atl1-2.0.5-linux-2.6.20.rc5.patch.bz2

OK, I have merged the monolithic patch into jgarzik/netdev-2.6.git#atl1.

Once I'm done merging patches tonight, I will merge this new 'atl1' branch
into the 'ALL' meta-branch, which will auto-propagate this driver into
Andrew Morton's -mm for testing.

For future driver updates, please send a patch rather than the full driver
diff. If it looks good, we will push for 2.6.21 (or 2.6.22, if updates or
objections continue to come fast and furious).

It's in "the system" now; thanks for all your hard work!

	Jeff
[PATCH 11/12] clocksource: remove update_callback
Uses the block notifier to replace the functionality of update_callback().
update_callback() was a special case specifically for the tsc, but including
it in the clocksource structure duplicated it needlessly for other clocks.

Signed-Off-By: Daniel Walker <[EMAIL PROTECTED]>

---
 arch/i386/kernel/tsc.c      | 35 ++-
 arch/x86_64/kernel/tsc.c    | 19 +--
 include/linux/clocksource.h |  2 --
 include/linux/timekeeping.h |  1 +
 kernel/time/timekeeping.c   | 32 ++--
 5 files changed, 46 insertions(+), 43 deletions(-)

Index: linux-2.6.19/arch/i386/kernel/tsc.c
===
--- linux-2.6.19.orig/arch/i386/kernel/tsc.c
+++ linux-2.6.19/arch/i386/kernel/tsc.c
@@ -51,8 +51,7 @@ static int __init tsc_setup(char *str)
 __setup("notsc", tsc_setup);

 /*
- * code to mark and check if the TSC is unstable
- * due to cpufreq or due to unsynced TSCs
+ * Flag that denotes an unstable tsc and check function.
  */
 static int tsc_unstable;

@@ -61,12 +60,6 @@ static inline int check_tsc_unstable(voi
 	return tsc_unstable;
 }

-void mark_tsc_unstable(void)
-{
-	tsc_unstable = 1;
-}
-EXPORT_SYMBOL_GPL(mark_tsc_unstable);
-
 /* Accellerators for sched_clock()
  * convert from cycles(64bits) => nanoseconds (64bits)
  * basic equation:
@@ -180,6 +173,7 @@ int recalibrate_cpu_khz(void)
 	if (cpu_has_tsc) {
 		cpu_khz = calculate_cpu_khz();
 		tsc_khz = cpu_khz;
+		mark_tsc_unstable();
 		cpu_data[0].loops_per_jiffy =
 			cpufreq_scale(cpu_data[0].loops_per_jiffy,
 				      cpu_khz_old, cpu_khz);
@@ -332,7 +326,6 @@ core_initcall(cpufreq_tsc);
 /* clock source code */

 static unsigned long current_tsc_khz = 0;
-static int tsc_update_callback(void);

 static cycle_t read_tsc(void)
 {
@@ -350,32 +343,24 @@ static struct clocksource clocksource_ts
 	.mask			= CLOCKSOURCE_MASK(64),
 	.mult			= 0, /* to be set */
 	.shift			= 22,
-	.update_callback	= tsc_update_callback,
 	.is_continuous		= 1,
 	.list			= LIST_HEAD_INIT(clocksource_tsc.list),
 };

-static int tsc_update_callback(void)
+/*
+ * code to mark if the TSC is unstable
+ * due to cpufreq or due to unsynced TSCs
+ */
+void mark_tsc_unstable(void)
 {
-	int change = 0;
-
 	/* check to see if we should switch to the safe clocksource: */
-	if (clocksource_tsc.rating != 0 && check_tsc_unstable()) {
+	if (unlikely(!tsc_unstable && clocksource_tsc.rating != 0)) {
 		clocksource_tsc.rating = 0;
 		clocksource_rating_change(&clocksource_tsc);
-		change = 1;
-	}
-
-	/* only update if tsc_khz has changed: */
-	if (current_tsc_khz != tsc_khz) {
-		current_tsc_khz = tsc_khz;
-		clocksource_tsc.mult = clocksource_khz2mult(current_tsc_khz,
-					clocksource_tsc.shift);
-		change = 1;
 	}
-
-	return change;
+	tsc_unstable = 1;
 }
+EXPORT_SYMBOL_GPL(mark_tsc_unstable);

 static int __init dmi_mark_tsc_unstable(struct dmi_system_id *d)
 {
Index: linux-2.6.19/arch/x86_64/kernel/tsc.c
===
--- linux-2.6.19.orig/arch/x86_64/kernel/tsc.c
+++ linux-2.6.19/arch/x86_64/kernel/tsc.c
@@ -47,11 +47,6 @@ static inline int check_tsc_unstable(voi
 	return tsc_unstable;
 }

-void mark_tsc_unstable(void)
-{
-	tsc_unstable = 1;
-}
-EXPORT_SYMBOL_GPL(mark_tsc_unstable);

 #ifdef CONFIG_CPU_FREQ

@@ -185,8 +180,6 @@ __setup("notsc", notsc_setup);

 /* clock source code: */

-static int tsc_update_callback(void);
-
 static cycle_t read_tsc(void)
 {
 	cycle_t ret = (cycle_t)get_cycles_sync();
@@ -206,24 +199,22 @@ static struct clocksource clocksource_ts
 	.mask			= (cycle_t)-1,
 	.mult			= 0, /* to be set */
 	.shift			= 22,
-	.update_callback	= tsc_update_callback,
 	.is_continuous		= 1,
 	.vread			= vread_tsc,
 	.list			= LIST_HEAD_INIT(clocksource_tsc.list),
 };

-static int tsc_update_callback(void)
+void mark_tsc_unstable(void)
 {
-	int change = 0;
-
 	/* check to see if we should switch to the safe clocksource: */
-	if (clocksource_tsc.rating != 50 && check_tsc_unstable()) {
+	if (unlikely(!tsc_unstable && clocksource_tsc.rating != 50)) {
 		clocksource_tsc.rating = 50;
 		clocksource_rating_change(&clocksource_tsc);
-		change = 1;
 	}
-	return change;
+
+	tsc_unstable = 1;
 }
+EXPORT_SYMBOL_GPL(mark_tsc_unstable);

 static int __init init_tsc_clocksource(void)
 {
[PATCH 12/12] clocksource: atomic signals
Modifies the way clocks are switched to in the timekeeping code. The original
code would constantly monitor the clocksource list, checking for newly added
clocksources. I modified this by using atomic types to signal when a new
clock is added. This allows the operation to be used only when it's needed.
The fast path is also reduced to checking a single atomic value.

Signed-Off-By: Daniel Walker <[EMAIL PROTECTED]>

---
 include/linux/clocksource.h |  5
 include/linux/timekeeping.h | 10
 kernel/time/clocksource.c   |  6 +
 kernel/time/timekeeping.c   | 51 +++-
 4 files changed, 49 insertions(+), 23 deletions(-)

Index: linux-2.6.19/include/linux/clocksource.h
===
--- linux-2.6.19.orig/include/linux/clocksource.h
+++ linux-2.6.19/include/linux/clocksource.h
@@ -26,6 +26,11 @@ typedef u64 cycle_t;
 extern struct clocksource clocksource_jiffies;

 /*
+ * Atomic signal that is specific to timekeeping.
+ */
+extern atomic_t clock_check;
+
+/*
  * Allows inlined calling for notifier routines.
  */
 extern struct atomic_notifier_head clocksource_list_notifier;
Index: linux-2.6.19/include/linux/timekeeping.h
===
--- linux-2.6.19.orig/include/linux/timekeeping.h
+++ linux-2.6.19/include/linux/timekeeping.h
@@ -5,15 +5,7 @@

 extern void update_wall_time(void);

-#ifdef CONFIG_GENERIC_TIME
-
-extern struct clocksource *clock;
-
-#else /* CONFIG_GENERIC_TIME */
-static inline int change_clocksource(void)
-{
-	return 0;
-}
+#ifndef CONFIG_GENERIC_TIME

 static inline void change_clocksource(void) { }
 static inline void timekeeping_init_notifier(void) { }
Index: linux-2.6.19/kernel/time/clocksource.c
===
--- linux-2.6.19.orig/kernel/time/clocksource.c
+++ linux-2.6.19/kernel/time/clocksource.c
@@ -50,6 +50,7 @@ static char override_name[32];
 static int finished_booting;

 ATOMIC_NOTIFIER_HEAD(clocksource_list_notifier);
+atomic_t clock_check = ATOMIC_INIT(0);

 /* clocksource_done_booting - Called near the end of bootup
  *
@@ -58,6 +59,8 @@ ATOMIC_NOTIFIER_HEAD(clocksource_list_no
 static int __init clocksource_done_booting(void)
 {
 	finished_booting = 1;
+	/* Check for a new clock now */
+	atomic_inc(&clock_check);
 	return 0;
 }

@@ -291,6 +294,9 @@ static ssize_t sysfs_override_clocksourc
 	/* try to select it: */
 	next_clocksource = select_clocksource();

+	/* Signal that there is a new clocksource */
+	atomic_inc(&clock_check);
+
 	spin_unlock_irq(&clocksource_lock);

 	return ret;
Index: linux-2.6.19/kernel/time/timekeeping.c
===
--- linux-2.6.19.orig/kernel/time/timekeeping.c
+++ linux-2.6.19/kernel/time/timekeeping.c
@@ -3,6 +3,7 @@
 #include
 #include
 #include
+#include
 #include

 /*
@@ -19,7 +20,6 @@ static unsigned long timekeeping_suspend
  * Clock used for timekeeping
  */
 struct clocksource *clock = &clocksource_jiffies;
-atomic_t clock_recalc_interval = ATOMIC_INIT(0);

 /*
  * The current time
@@ -150,11 +150,12 @@ int do_settimeofday(struct timespec *tv)
 EXPORT_SYMBOL(do_settimeofday);

 /**
- * change_clocksource - Swaps clocksources if a new one is available
+ * timekeeping_change_clocksource - Swaps clocksources if a new one is available
  *
  * Accumulates current time interval and initializes new clocksource
+ * Needs to be called with the xtime_lock held.
  */
-static int change_clocksource(void)
+static int timekeeping_change_clocksource(void)
 {
 	struct clocksource *new;
 	cycle_t now;
@@ -169,9 +170,15 @@ static int change_clocksource(void)
 		clock->cycle_last = now;
 		printk(KERN_INFO "Time: %s clocksource has been installed.\n",
 		       clock->name);
+		hrtimer_clock_notify();
+		clock->error = 0;
+		clock->xtime_nsec = 0;
+		clocksource_calculate_interval(clock, tick_nsec);
 		return 1;
-	} else if (unlikely(atomic_read(&clock_recalc_interval))) {
-		atomic_set(&clock_recalc_interval, 0);
+	} else {
+		clock->error = 0;
+		clock->xtime_nsec = 0;
+		clocksource_calculate_interval(clock, tick_nsec);
 		return 1;
 	}
 	return 0;
@@ -198,9 +205,14 @@ int timekeeping_is_continuous(void)
 static int clocksource_callback(struct notifier_block *nb, unsigned long op,
 				void *c)
 {
-	if (c == clock && op == CLOCKSOURCE_NOTIFY_FREQ &&
-	    !atomic_read(&clock_recalc_interval))
-		atomic_inc(&clock_recalc_interval);
+	if (likely(c != clock))
+		return 0;
+
+	switch (op) {
+	case CLOCKSOURCE_NOTIFY_FREQ:
+	case CLOCKSOURCE_NOTIFY_RATING:
+
[PATCH 02/12] clocksource: rating sorted list
Converts the original plain list into a sorted list based on the clock
rating. Later in my tree this allows some of the variables to be dropped,
since the highest-rated clock is always at the front of the list. This also
does some other nice things, like allowing the sysfs files to print the
clocks in a more interesting order. It's forward looking.

Signed-Off-By: Daniel Walker <[EMAIL PROTECTED]>

---
 arch/i386/kernel/tsc.c      |   2
 arch/x86_64/kernel/tsc.c    |   2
 include/linux/clocksource.h |   9 ++-
 kernel/time/clocksource.c   | 132 +---
 4 files changed, 97 insertions(+), 48 deletions(-)

Index: linux-2.6.19/arch/i386/kernel/tsc.c
===
--- linux-2.6.19.orig/arch/i386/kernel/tsc.c
+++ linux-2.6.19/arch/i386/kernel/tsc.c
@@ -361,7 +361,7 @@ static int tsc_update_callback(void)
 	/* check to see if we should switch to the safe clocksource: */
 	if (clocksource_tsc.rating != 0 && check_tsc_unstable()) {
 		clocksource_tsc.rating = 0;
-		clocksource_reselect();
+		clocksource_rating_change(&clocksource_tsc);
 		change = 1;
 	}
Index: linux-2.6.19/arch/x86_64/kernel/tsc.c
===
--- linux-2.6.19.orig/arch/x86_64/kernel/tsc.c
+++ linux-2.6.19/arch/x86_64/kernel/tsc.c
@@ -218,7 +218,7 @@ static int tsc_update_callback(void)
 	/* check to see if we should switch to the safe clocksource: */
 	if (clocksource_tsc.rating != 50 && check_tsc_unstable()) {
 		clocksource_tsc.rating = 50;
-		clocksource_reselect();
+		clocksource_rating_change(&clocksource_tsc);
 		change = 1;
 	}
 	return change;
Index: linux-2.6.19/include/linux/clocksource.h
===
--- linux-2.6.19.orig/include/linux/clocksource.h
+++ linux-2.6.19/include/linux/clocksource.h
@@ -12,6 +12,9 @@
 #include
 #include
 #include
+#include
+#include
+
 #include
 #include

@@ -183,9 +186,9 @@ static inline void clocksource_calculate

 /* used to install a new clocksource */
-int clocksource_register(struct clocksource*);
-void clocksource_reselect(void);
-struct clocksource* clocksource_get_next(void);
+extern struct clocksource *clocksource_get_next(void);
+extern int clocksource_register(struct clocksource*);
+extern void clocksource_rating_change(struct clocksource*);

 #ifdef CONFIG_GENERIC_TIME_VSYSCALL
 extern void update_vsyscall(struct timespec *ts, struct clocksource *c);
Index: linux-2.6.19/kernel/time/clocksource.c
===
--- linux-2.6.19.orig/kernel/time/clocksource.c
+++ linux-2.6.19/kernel/time/clocksource.c
@@ -35,7 +35,7 @@
  * next_clocksource:
  *	pending next selected clocksource.
  * clocksource_list:
- *	linked list with the registered clocksources
+ *	rating sorted linked list with the registered clocksources
  * clocksource_lock:
  *	protects manipulations to curr_clocksource and next_clocksource
  *	and the clocksource_list
@@ -80,69 +80,105 @@ struct clocksource *clocksource_get_next
 }

 /**
- * select_clocksource - Finds the best registered clocksource.
+ * __is_registered - Returns a clocksource if it's registered
+ * @name:	name of the clocksource to return
  *
  * Private function. Must hold clocksource_lock when called.
  *
- * Looks through the list of registered clocksources, returning
- * the one with the highest rating value. If there is a clocksource
- * name that matches the override string, it returns that clocksource.
+ * Returns the clocksource if registered, zero otherwise.
+ * If no clocksources are registered the jiffies clock is
+ * returned.
  */
-static struct clocksource *select_clocksource(void)
+static struct clocksource * __is_registered(char * name)
 {
-	struct clocksource *best = NULL;
 	struct list_head *tmp;

 	list_for_each(tmp, &clocksource_list) {
 		struct clocksource *src;

 		src = list_entry(tmp, struct clocksource, list);
-		if (!best)
-			best = src;
-
-		/* check for override: */
-		if (strlen(src->name) == strlen(override_name) &&
-		    !strcmp(src->name, override_name)) {
-			best = src;
-			break;
-		}
-		/* pick the highest rating: */
-		if (src->rating > best->rating)
-			best = src;
+		if (!strcmp(src->name, name))
+			return src;
 	}

-	return best;
+	return 0;
 }

 /**
- * is_registered_source - Checks if clocksource is registered
- * @c: pointer to a clocksource
+ * __get_clock - Finds a specific clocksource
+ * @name: name of the clocksource to return
[PATCH 03/12] clocksource: arm initialize list value
Update arch/arm/ with list initialization. Signed-Off-By: Daniel Walker <[EMAIL PROTECTED]> --- arch/arm/mach-imx/time.c |1 + arch/arm/mach-ixp4xx/common.c |1 + arch/arm/mach-netx/time.c |1 + arch/arm/mach-pxa/time.c |1 + 4 files changed, 4 insertions(+) Index: linux-2.6.19/arch/arm/mach-imx/time.c === --- linux-2.6.19.orig/arch/arm/mach-imx/time.c +++ linux-2.6.19/arch/arm/mach-imx/time.c @@ -87,6 +87,7 @@ static struct clocksource clocksource_im .read = imx_get_cycles, .mask = 0x, .shift = 20, + .list = LIST_HEAD_INIT(clocksource_imx.list), .is_continuous = 1, }; Index: linux-2.6.19/arch/arm/mach-ixp4xx/common.c === --- linux-2.6.19.orig/arch/arm/mach-ixp4xx/common.c +++ linux-2.6.19/arch/arm/mach-ixp4xx/common.c @@ -396,6 +396,7 @@ static struct clocksource clocksource_ix .mask = CLOCKSOURCE_MASK(32), .shift = 20, .is_continuous = 1, + .list = LIST_HEAD_INIT(clocksource_ixp4xx.list), }; unsigned long ixp4xx_timer_freq = FREQ; Index: linux-2.6.19/arch/arm/mach-netx/time.c === --- linux-2.6.19.orig/arch/arm/mach-netx/time.c +++ linux-2.6.19/arch/arm/mach-netx/time.c @@ -62,6 +62,7 @@ static struct clocksource clocksource_ne .read = netx_get_cycles, .mask = CLOCKSOURCE_MASK(32), .shift = 20, + .list = LIST_HEAD_INIT(clocksource_netx.list), .is_continuous = 1, }; Index: linux-2.6.19/arch/arm/mach-pxa/time.c === --- linux-2.6.19.orig/arch/arm/mach-pxa/time.c +++ linux-2.6.19/arch/arm/mach-pxa/time.c @@ -112,6 +112,7 @@ static struct clocksource clocksource_px .read = pxa_get_cycles, .mask = CLOCKSOURCE_MASK(32), .shift = 20, + .list = LIST_HEAD_INIT(clocksource_pxa.list), .is_continuous = 1, }; -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 04/12] clocksource: avr32 initialize list value
Update arch/avr32/ with list initialization. Signed-Off-By: Daniel Walker <[EMAIL PROTECTED]> --- arch/avr32/kernel/time.c |1 + 1 file changed, 1 insertion(+) Index: linux-2.6.19/arch/avr32/kernel/time.c === --- linux-2.6.19.orig/arch/avr32/kernel/time.c +++ linux-2.6.19/arch/avr32/kernel/time.c @@ -37,6 +37,7 @@ static struct clocksource clocksource_av .read = read_cycle_count, .mask = CLOCKSOURCE_MASK(32), .shift = 16, + .list = LIST_HEAD_INIT(clocksource_avr32.list), .is_continuous = 1, }; -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 06/12] clocksource: i386 initialize list value
Update arch/i386/ with list initialization. Signed-Off-By: Daniel Walker <[EMAIL PROTECTED]> --- arch/i386/kernel/hpet.c |1 + arch/i386/kernel/i8253.c |1 + arch/i386/kernel/tsc.c |1 + 3 files changed, 3 insertions(+) Index: linux-2.6.19/arch/i386/kernel/hpet.c === --- linux-2.6.19.orig/arch/i386/kernel/hpet.c +++ linux-2.6.19/arch/i386/kernel/hpet.c @@ -282,6 +282,7 @@ static struct clocksource clocksource_hp .mask = HPET_MASK, .shift = HPET_SHIFT, .is_continuous = 1, + .list = LIST_HEAD_INIT(clocksource_hpet.list), }; static int __init init_hpet_clocksource(void) Index: linux-2.6.19/arch/i386/kernel/i8253.c === --- linux-2.6.19.orig/arch/i386/kernel/i8253.c +++ linux-2.6.19/arch/i386/kernel/i8253.c @@ -177,6 +177,7 @@ static struct clocksource clocksource_pi .mask = CLOCKSOURCE_MASK(32), .mult = 0, .shift = 20, + .list = LIST_HEAD_INIT(clocksource_pit.list), }; static int __init init_pit_clocksource(void) Index: linux-2.6.19/arch/i386/kernel/tsc.c === --- linux-2.6.19.orig/arch/i386/kernel/tsc.c +++ linux-2.6.19/arch/i386/kernel/tsc.c @@ -352,6 +352,7 @@ static struct clocksource clocksource_ts .shift = 22, .update_callback= tsc_update_callback, .is_continuous = 1, + .list = LIST_HEAD_INIT(clocksource_tsc.list), }; static int tsc_update_callback(void) -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 05/12] clocksource: mips initialize list value
Update arch/mips/ with list initialization. Signed-Off-By: Daniel Walker <[EMAIL PROTECTED]> --- arch/mips/kernel/time.c |1 + 1 file changed, 1 insertion(+) Index: linux-2.6.19/arch/mips/kernel/time.c === --- linux-2.6.19.orig/arch/mips/kernel/time.c +++ linux-2.6.19/arch/mips/kernel/time.c @@ -307,6 +307,7 @@ static unsigned int __init calibrate_hpt struct clocksource clocksource_mips = { .name = "MIPS", .mask = 0x, + .list = LIST_HEAD_INIT(clocksource_mips.list), .is_continuous = 1, }; -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 07/12] clocksource: x86_64 initialize list value
Update arch/x86_64/ with list initialization. Signed-Off-By: Daniel Walker <[EMAIL PROTECTED]> --- arch/x86_64/kernel/hpet.c |1 + arch/x86_64/kernel/tsc.c |1 + 2 files changed, 2 insertions(+) Index: linux-2.6.19/arch/x86_64/kernel/hpet.c === --- linux-2.6.19.orig/arch/x86_64/kernel/hpet.c +++ linux-2.6.19/arch/x86_64/kernel/hpet.c @@ -472,6 +472,7 @@ struct clocksource clocksource_hpet = { .shift = HPET_SHIFT, .is_continuous = 1, .vread = vread_hpet, + .list = LIST_HEAD_INIT(clocksource_hpet.list), }; static int __init init_hpet_clocksource(void) Index: linux-2.6.19/arch/x86_64/kernel/tsc.c === --- linux-2.6.19.orig/arch/x86_64/kernel/tsc.c +++ linux-2.6.19/arch/x86_64/kernel/tsc.c @@ -209,6 +209,7 @@ static struct clocksource clocksource_ts .update_callback= tsc_update_callback, .is_continuous = 1, .vread = vread_tsc, + .list = LIST_HEAD_INIT(clocksource_tsc.list), }; static int tsc_update_callback(void) -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 08/12] clocksource: driver initialize list value
Update drivers/clocksource/ with list initialization. Signed-Off-By: Daniel Walker <[EMAIL PROTECTED]> --- drivers/clocksource/acpi_pm.c|1 + drivers/clocksource/cyclone.c|1 + drivers/clocksource/scx200_hrt.c |1 + 3 files changed, 3 insertions(+) Index: linux-2.6.19/drivers/clocksource/acpi_pm.c === --- linux-2.6.19.orig/drivers/clocksource/acpi_pm.c +++ linux-2.6.19/drivers/clocksource/acpi_pm.c @@ -74,6 +74,7 @@ static struct clocksource clocksource_ac .mult = 0, /*to be caluclated*/ .shift = 22, .is_continuous = 1, + .list = LIST_HEAD_INIT(clocksource_acpi_pm.list), }; Index: linux-2.6.19/drivers/clocksource/cyclone.c === --- linux-2.6.19.orig/drivers/clocksource/cyclone.c +++ linux-2.6.19/drivers/clocksource/cyclone.c @@ -32,6 +32,7 @@ static struct clocksource clocksource_cy .mult = 10, .shift = 0, .is_continuous = 1, + .list = LIST_HEAD_INIT(clocksource_cyclone.list), }; static int __init init_cyclone_clocksource(void) Index: linux-2.6.19/drivers/clocksource/scx200_hrt.c === --- linux-2.6.19.orig/drivers/clocksource/scx200_hrt.c +++ linux-2.6.19/drivers/clocksource/scx200_hrt.c @@ -58,6 +58,7 @@ static struct clocksource cs_hrt = { .read = read_hrt, .mask = CLOCKSOURCE_MASK(32), .is_continuous = 1, + .list = LIST_HEAD_INIT(cs_hrt.list), /* mult, shift are set based on mhz27 flag */ }; -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 09/12] clocksource: initialize list value
A change to clocksource initialization. If the list field is initialized it allows clocksource_register to complete faster since it doesn't have to scan the list of clocks doing strcmp on each looking for duplicates. Signed-Off-By: Daniel Walker <[EMAIL PROTECTED]> --- kernel/time/clocksource.c |3 +-- kernel/time/jiffies.c |1 + 2 files changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.19/kernel/time/clocksource.c === --- linux-2.6.19.orig/kernel/time/clocksource.c +++ linux-2.6.19/kernel/time/clocksource.c @@ -186,12 +186,11 @@ int clocksource_register(struct clocksou unsigned long flags; spin_lock_irqsave(_lock, flags); - if (unlikely(!list_empty(>list) && __is_registered(c->name))) { + if (unlikely(!list_empty(>list))) { printk("register_clocksource: Cannot register %s clocksource. " "Already registered!", c->name); ret = -EBUSY; } else { - INIT_LIST_HEAD(>list); __sorted_list_add(c); /* scan the registered clocksources, and pick the best one */ next_clocksource = select_clocksource(); Index: linux-2.6.19/kernel/time/jiffies.c === --- linux-2.6.19.orig/kernel/time/jiffies.c +++ linux-2.6.19/kernel/time/jiffies.c @@ -63,6 +63,7 @@ struct clocksource clocksource_jiffies = .mult = NSEC_PER_JIFFY << JIFFIES_SHIFT, /* details above */ .shift = JIFFIES_SHIFT, .is_continuous = 0, /* tick based, not free running */ + .list = LIST_HEAD_INIT(clocksource_jiffies.list), }; static int __init init_jiffies_clocksource(void) -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 10/12] clocksource: add block notifier
Adds a callback interface for register/rating change events. This is also used later in this series to signal other interesting events. Signed-Off-By: Daniel Walker <[EMAIL PROTECTED]> --- include/linux/clocksource.h | 37 + include/linux/timekeeping.h |3 +++ kernel/time/clocksource.c | 10 ++ 3 files changed, 50 insertions(+) Index: linux-2.6.19/include/linux/clocksource.h === --- linux-2.6.19.orig/include/linux/clocksource.h +++ linux-2.6.19/include/linux/clocksource.h @@ -14,6 +14,7 @@ #include #include #include +#include #include #include @@ -24,6 +25,42 @@ typedef u64 cycle_t; /* XXX - Would like a better way for initializing curr_clocksource */ extern struct clocksource clocksource_jiffies; +/* + * Allows inlined calling for notifier routines. + */ +extern struct atomic_notifier_head clocksource_list_notifier; + +/* + * Block notifier flags. + */ +#define CLOCKSOURCE_NOTIFY_REGISTER 1 +#define CLOCKSOURCE_NOTIFY_RATING 2 +#define CLOCKSOURCE_NOTIFY_FREQ 4 + +/** + * clocksource_notifier_register - Registers a list change notifier + * @nb: pointer to a notifier block + * + * Returns zero always. + */ +static inline int clocksource_notifier_register(struct notifier_block *nb) +{ + return atomic_notifier_chain_register(&clocksource_list_notifier, nb); +} + +/** + * clocksource_freq_change - Allows notification of dynamic frequency changes. + * + * Signals that a clocksource is dynamically changing its frequency. + * This could happen if a clocksource becomes more/less stable. + */ +static inline void clocksource_freq_change(struct clocksource *c) +{ + atomic_notifier_call_chain(&clocksource_list_notifier, + CLOCKSOURCE_NOTIFY_FREQ, c); +} + + /** * struct clocksource - hardware abstraction for a free running counter * Provides mostly state-free accessors to the underlying hardware.
Index: linux-2.6.19/include/linux/timekeeping.h === --- linux-2.6.19.orig/include/linux/timekeeping.h +++ linux-2.6.19/include/linux/timekeeping.h @@ -14,6 +14,9 @@ static inline int change_clocksource(voi { return 0; } + +static inline void change_clocksource(void) { } + #endif /* !CONFIG_GENERIC_TIME */ #endif /* _LINUX_TIMEKEEPING_H */ Index: linux-2.6.19/kernel/time/clocksource.c === --- linux-2.6.19.orig/kernel/time/clocksource.c +++ linux-2.6.19/kernel/time/clocksource.c @@ -49,6 +49,8 @@ static DEFINE_SPINLOCK(clocksource_lock) static char override_name[32]; static int finished_booting; +ATOMIC_NOTIFIER_HEAD(clocksource_list_notifier); + /* clocksource_done_booting - Called near the end of bootup * * Hack to avoid lots of clocksource churn at boot time @@ -196,6 +198,10 @@ int clocksource_register(struct clocksou next_clocksource = select_clocksource(); } spin_unlock_irqrestore(_lock, flags); + + atomic_notifier_call_chain(_list_notifier, + CLOCKSOURCE_NOTIFY_REGISTER, c); + return ret; } EXPORT_SYMBOL(clocksource_register); @@ -224,6 +230,10 @@ void clocksource_rating_change(struct cl next_clocksource = select_clocksource(); spin_unlock_irqrestore(_lock, flags); + + atomic_notifier_call_chain(_list_notifier, + CLOCKSOURCE_NOTIFY_RATING, c); + } EXPORT_SYMBOL(clocksource_rating_change); -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 01/12] timekeeping: create kernel/time/timekeeping.c
Move the generic timekeeping code from kernel/timer.c to kernel/time/timekeeping.c . This requires some glue code which is added to the include/linux/timekeeping.h header. Signed-Off-By: Daniel Walker <[EMAIL PROTECTED]> --- include/linux/clocksource.h |3 include/linux/timekeeping.h | 19 + kernel/time/Makefile|2 kernel/time/clocksource.c |3 kernel/time/timekeeping.c | 639 kernel/timer.c | 630 --- 6 files changed, 663 insertions(+), 633 deletions(-) Index: linux-2.6.19/include/linux/clocksource.h === --- linux-2.6.19.orig/include/linux/clocksource.h +++ linux-2.6.19/include/linux/clocksource.h @@ -18,6 +18,9 @@ /* clocksource cycle base type */ typedef u64 cycle_t; +/* XXX - Would like a better way for initializing curr_clocksource */ +extern struct clocksource clocksource_jiffies; + /** * struct clocksource - hardware abstraction for a free running counter * Provides mostly state-free accessors to the underlying hardware. Index: linux-2.6.19/include/linux/timekeeping.h === --- /dev/null +++ linux-2.6.19/include/linux/timekeeping.h @@ -0,0 +1,19 @@ +#ifndef _LINUX_TIMEKEEPING_H +#define _LINUX_TIMEKEEPING_H + +#include + +extern void update_wall_time(void); + +#ifdef CONFIG_GENERIC_TIME + +extern struct clocksource *clock; + +#else /* CONFIG_GENERIC_TIME */ +static inline int change_clocksource(void) +{ + return 0; +} +#endif /* !CONFIG_GENERIC_TIME */ + +#endif /* _LINUX_TIMEKEEPING_H */ Index: linux-2.6.19/kernel/time/Makefile === --- linux-2.6.19.orig/kernel/time/Makefile +++ linux-2.6.19/kernel/time/Makefile @@ -1,4 +1,4 @@ -obj-y += ntp.o clocksource.o jiffies.o timer_list.o +obj-y += ntp.o clocksource.o jiffies.o timer_list.o timekeeping.o obj-$(CONFIG_GENERIC_CLOCKEVENTS) += clockevents.o obj-$(CONFIG_TIMER_STATS) += timer_stats.o Index: linux-2.6.19/kernel/time/clocksource.c === --- linux-2.6.19.orig/kernel/time/clocksource.c +++ linux-2.6.19/kernel/time/clocksource.c @@ -29,9 +29,6 @@ #include #include -/* XXX - Would like a better way for initializing 
curr_clocksource */ -extern struct clocksource clocksource_jiffies; - /*[Clocksource internal variables]- * curr_clocksource: * currently selected clocksource. Initialized to clocksource_jiffies. Index: linux-2.6.19/kernel/time/timekeeping.c === --- /dev/null +++ linux-2.6.19/kernel/time/timekeeping.c @@ -0,0 +1,639 @@ + + +#include +#include +#include +#include + +/* + * flag for if timekeeping is suspended + */ +static int timekeeping_suspended; + +/* + * time in seconds when suspend began + */ +static unsigned long timekeeping_suspend_time; + +/* + * Clock used for timekeeping + */ +struct clocksource *clock = _jiffies; + +/* + * The current time + * wall_to_monotonic is what we need to add to xtime (or xtime corrected + * for sub jiffie times) to get to monotonic time. Monotonic is pegged + * at zero at system boot time, so wall_to_monotonic will be negative, + * however, we will ALWAYS keep the tv_nsec part positive so we can use + * the usual normalization. + */ +struct timespec xtime __attribute__ ((aligned (16))); +struct timespec wall_to_monotonic __attribute__ ((aligned (16))); + +EXPORT_SYMBOL(xtime); + +#ifdef CONFIG_GENERIC_TIME +/** + * __get_nsec_offset - Returns nanoseconds since last call to periodic_hook + * + * private function, must hold xtime_lock lock when being + * called. Returns the number of nanoseconds since the + * last call to update_wall_time() (adjusted by NTP scaling) + */ +static inline s64 __get_nsec_offset(void) +{ + cycle_t cycle_now, cycle_delta; + s64 ns_offset; + + /* read clocksource: */ + cycle_now = clocksource_read(clock); + + /* calculate the delta since the last update_wall_time: */ + cycle_delta = (cycle_now - clock->cycle_last) & clock->mask; + + /* convert to nanoseconds: */ + ns_offset = cyc2ns(clock, cycle_delta); + + return ns_offset; +} + +/** + * __get_realtime_clock_ts - Returns the time of day in a timespec + * @ts:pointer to the timespec to be set + * + * Returns the time of day in a timespec. 
Used by + * do_gettimeofday() and get_realtime_clock_ts(). + */ +static inline void __get_realtime_clock_ts(struct timespec *ts) +{ + unsigned long seq; + s64 nsecs; + + do { + seq = read_seqbegin(_lock); + + *ts = xtime; + nsecs = __get_nsec_offset(); + + } while (read_seqretry(_lock, seq)); + + timespec_add_ns(ts, nsecs); +} + +/** + *
[PATCH 00/12] clocksource/timekeeping cleanup
This patchset represents the most stable clocksource changes in my tree. Also John (and others) have reviewed these changes a few times. I think it's all acceptable. Daniel -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Why active list and inactive list?
Rik van Riel wrote: Nick Piggin wrote: The other nice thing about it was that it didn't have a hard cutoff that the current reclaim_mapped toggle does -- you could opt to scan the mapped list at a lower ratio than the unmapped one. Of course, it also has some downsides too, and would require retuning... Here's a simple idea for tuning. For each list we keep track of: 1) the size of the list 2) the rate at which we scan the list 3) the fraction of (non new) pages that get referenced That way we can determine which list has the largest fraction of "idle" pages sitting around and consequently which list should be scanned more aggressively. For each list we can calculate how frequently the pages in the list are being used: pressure = referenced percentage * scan rate / list size The VM can equalize the pressure by scanning the list with lower usage less than the other list. This way the VM can give the right amount of memory to each type. This sounds like a good thing to start with. I think we can then use swappiness to decide what to evict. Of course, each list needs to be divided into inactive and active like the current VM, in order to make sure that the pages which are used once cannot push the real working set of that list out of memory. Yes, that makes sense. There is a more subtle problem when the list's working set is larger than the amount of memory the list has. In that situation the VM will be faulting pages back in just after they got evicted. Something like my /proc/refaults code can detect that and adjust the size of the undersized list accordingly. Of course, once we properly distinguish between the more frequently and less frequently accessed pages within each of the page sets (mapped/anonymous vs. unmapped) and have the pressure between the lists equalized, why do we need to keep them separate again? 
:-) -- Balbir Singh Linux Technology Center IBM, ISTL - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: problems with latest smbfs changes on 2.4.34 and security backports
Hi Dann, On Mon, Jan 22, 2007 at 11:19:43AM -0700, dann frazier wrote: > On Mon, Jan 22, 2007 at 10:50:47AM +1100, Grant Coady wrote: > > On Mon, 22 Jan 2007 00:03:21 +0100, Willy Tarreau <[EMAIL PROTECTED]> wrote: > > [EMAIL PROTECTED]:/home/other$ uname -r > > 2.4.34b > > [EMAIL PROTECTED]:/home/other$ mkdir test > > [EMAIL PROTECTED]:/home/other$ ln -s test testlink > > ln: creating symbolic link `testlink' to `test': Operation not permitted > > [EMAIL PROTECTED]:/home/other$ echo "this is also a test" > test/file > > [EMAIL PROTECTED]:/home/other$ ln -s test/file test2 > > ln: creating symbolic link `test2' to `test/file': Operation not permitted > > > > trying to create symlinks. > > > > No problems creating symlinks with 2.4.33.3. > > Yes, I've found that this varies depending upon the options passed. If > uid=0, I can create symlinks, otherwise I always get permission > denied. This behavior appears to be consistent with 2.6. > > I also need to do some testing with the proposed patch to smbmount > that will let you omit options (current versions will always pass an > option to the kernel, even if you the user did not provide one). > If you do not pass options, the behavior should fallback to > server-provided values. > > Note that this bug has been my only interaction with smbfs, so I'm > certainly no expert on how it *should* behave. My plan is to > take all of the use cases we're coming up with and try to maintain > the "historic" 2.4 behavior as much as possible, but still not > silently dropping user-provided mount options. When the behavior needs > to change to honor them, I'll try to match what current 2.6 > does. Make sense? Yes, it does for me. So to sum up, I apply your patch to 2.4.34.1 and it restores the same behaviour for Santiago and Grant as they get in 2.6. Whether it's the expected behaviour or not is not the point, as it will be easier for us to later mimic 2.6 if a change is needed since we're not experts at all in this area. 
If we're all OK for this, I'll go with that. Thanks guys, Willy - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: sigaction's ucontext_t with incorrect stack reference when SA_SIGINFO is being used ?
On Mon, 2007-01-22 at 09:57 +0100, Xavier Roche wrote: > Hi folks, > > I have a probably louzy question regarding sigaction() behaviour when an > alternate signal stack is used: it seems that I can not get the user > stack reference in the ucontext_t stack context ; ie. the uc_stack > member contains reference of the alternate signal stack, not the stack > that was used before the crash. > > Is this is a normal behaviour ? Is there a way to retrieve the original > user's stack inside the signal callback ? > > The example given below demonstrates the issue: > top of stack==0x7f3d7000, alternative_stack==0x501010 > SEGV==0x7f3d6ff8; sp==0x501010; current stack is the alternate stack > > It is obvious that the SEGV was a stack overflow: the si_addr address is > just on the page below the stack limit. POSIX says: "the third argument can be cast to a pointer to an object of type ucontext_t to refer to the receiving thread's context that was interrupted when the signal was delivered." so if uc_stack doesn't point to the stack in use immediately prior to signal generation, this is a bug. (In theory I should be able to pass the ucontext_t supplied to the signal handler to setcontext() and resume execution exactly where I left off -- glibc's refusal to support kernel-generated ucontexts gets in the way of this, but the point still stands.) I have no idea who to bother about i386 signal delivery, though. (And I suspect this bug has probably been copied to other architectures as well.) -- Nicholas Miell <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[git patches] more ftape removal
Remove bits left over from prior ftape removal. Please pull from 'ftape' branch of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/misc-2.6.git ftape to receive the following updates: include/linux/Kbuild |1 - include/linux/mtio.h | 146 include/linux/qic117.h | 290 3 files changed, 0 insertions(+), 437 deletions(-) delete mode 100644 include/linux/qic117.h Adrian Bunk (1): more ftape removal diff --git a/include/linux/Kbuild b/include/linux/Kbuild index 862e483..8c634f9 100644 --- a/include/linux/Kbuild +++ b/include/linux/Kbuild @@ -129,7 +129,6 @@ header-y += posix_types.h header-y += ppdev.h header-y += prctl.h header-y += ps2esdi.h -header-y += qic117.h header-y += qnxtypes.h header-y += quotaio_v1.h header-y += quotaio_v2.h diff --git a/include/linux/mtio.h b/include/linux/mtio.h index 8c66151..6f8d2d4 100644 --- a/include/linux/mtio.h +++ b/include/linux/mtio.h @@ -10,7 +10,6 @@ #include #include -#include /* * Structures and definitions for mag tape io control commands @@ -116,32 +115,6 @@ struct mtget { #define MT_ISFTAPE_UNKNOWN 0x80 /* obsolete */ #define MT_ISFTAPE_FLAG0x80 -struct mt_tape_info { - long t_type;/* device type id (mt_type) */ - char *t_name; /* descriptive name */ -}; - -#define MT_TAPE_INFO { \ - {MT_ISUNKNOWN, "Unknown type of tape device"}, \ - {MT_ISQIC02,"Generic QIC-02 tape streamer"}, \ - {MT_ISWT5150, "Wangtek 5150, QIC-150"}, \ - {MT_ISARCHIVE_5945L2, "Archive 5945L-2"}, \ - {MT_ISCMSJ500, "CMS Jumbo 500"}, \ - {MT_ISTDC3610, "Tandberg TDC 3610, QIC-24"}, \ - {MT_ISARCHIVE_VP60I,"Archive VP60i, QIC-02"}, \ - {MT_ISARCHIVE_2150L,"Archive Viper 2150L"}, \ - {MT_ISARCHIVE_2060L,"Archive Viper 2060L"}, \ - {MT_ISARCHIVESC499, "Archive SC-499 QIC-36 controller"}, \ - {MT_ISQIC02_ALL_FEATURES, "Generic QIC-02 tape, all features"}, \ - {MT_ISWT5099EEN24, "Wangtek 5099-een24, 60MB"}, \ - {MT_ISTEAC_MT2ST, "Teac MT-2ST 155mb data cassette drive"}, \ - {MT_ISEVEREX_FT40A, "Everex FT40A, QIC-40"}, \ - {MT_ISONSTREAM_SC, "OnStream SC-, 
DI-, DP-, or USB tape drive"}, \ - {MT_ISSCSI1,"Generic SCSI-1 tape"}, \ - {MT_ISSCSI2,"Generic SCSI-2 tape"}, \ - {0, NULL} \ -} - /* structure for MTIOCPOS - mag tape get position command */ @@ -150,130 +123,11 @@ struct mtpos { }; -/* structure for MTIOCVOLINFO, query information about the volume - * currently positioned at (zftape) - */ -struct mtvolinfo { - unsigned int mt_volno; /* vol-number */ - unsigned int mt_blksz; /* blocksize used when recording */ - unsigned int mt_rawsize; /* raw tape space consumed, in kb */ - unsigned int mt_size;/* volume size after decompression, in kb */ - unsigned int mt_cmpr:1; /* this volume has been compressed */ -}; - -/* raw access to a floppy drive, read and write an arbitrary segment. - * For ftape/zftape to support formatting etc. - */ -#define MT_FT_RD_SINGLE 0 -#define MT_FT_RD_AHEAD 1 -#define MT_FT_WR_ASYNC 0 /* start tape only when all buffers are full */ -#define MT_FT_WR_MULTI 1 /* start tape, continue until buffers are empty */ -#define MT_FT_WR_SINGLE 2 /* write a single segment and stop afterwards*/ -#define MT_FT_WR_DELETE 3 /* write deleted data marks, one segment at time */ - -struct mtftseg -{ - unsigned mt_segno; /* the segment to read or write */ - unsigned mt_mode;/* modes for read/write (sync/async etc.) 
*/ - int mt_result; /* result of r/w request, not of the ioctl */ - void__user *mt_data;/* User space buffer: must be 29kb */ -}; - -/* get tape capacity (ftape/zftape) - */ -struct mttapesize { - unsigned long mt_capacity; /* entire, uncompressed capacity - * of a cartridge - */ - unsigned long mt_used; /* what has been used so far, raw - * uncompressed amount - */ -}; - -/* possible values of the ftfmt_op field - */ -#define FTFMT_SET_PARMS1 /* set software parms */ -#define FTFMT_GET_PARMS2 /* get software parms */ -#define FTFMT_FORMAT_TRACK 3 /* start formatting a tape track */ -#define FTFMT_STATUS 4 /* monitor formatting a tape track */ -#define FTFMT_VERIFY 5 /* verify the given segment*/ - -struct ftfmtparms { - unsigned char ft_qicstd; /* QIC-40/QIC-80/QIC-3010/QIC-3020 */ - unsigned char ft_fmtcode; /* Refer to the QIC specs */ - unsigned char ft_fhm; /* floppy head max */ - unsigned char ft_ftm;
[git patch] mention JFFS impending death
JFFS is already marked CONFIG_BROKEN in fs/Kconfig, with a note that it's going away in 2.6.21, but the corresponding update to feature-removal-schedule.txt was accidentally omitted. Fixed. Please pull from 'kill-jffs-prep' branch of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/misc-2.6.git kill-jffs-prep to receive the following updates: Documentation/feature-removal-schedule.txt |7 +++ 1 files changed, 7 insertions(+), 0 deletions(-) Jeff Garzik (1): Note that JFFS (v1) is to be deleted, in feature-removal-schedule.txt diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index fc53239..0ba6af0 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -318,3 +318,10 @@ Why: /proc/acpi/button has been replaced by events to the input layer Who: Len Brown <[EMAIL PROTECTED]> --- + +What: JFFS (version 1) +When: 2.6.21 +Why: Unmaintained for years, superseded by JFFS2 for years. +Who: Jeff Garzik <[EMAIL PROTECTED]> + +--- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Why active list and inactive list?
Nick Piggin wrote: Balbir Singh wrote: This makes me wonder if it makes sense to split up the LRU into page cache LRU and mapped pages LRU. I see two benefits 1. Currently based on swappiness, we might walk an entire list searching for page cache pages or mapped pages. With these lists separated, it should get easier and faster to implement this scheme 2. There is another parallel thread on implementing page cache limits. If the lists split out, we need not scan the entire list to find page cache pages to evict them. Of course I might be missing something (some piece of history) I actually had patches to do "split active lists" a while back. They worked by lazily moving the page at reclaim-time, based on whether or not it is mapped. This isn't too much worse than the kernel's current idea of what a mapped page is. They actually got a noticable speedup of the swapping kbuild workload, but at this stage there were some more basic improvements needed, so the difference could be smaller today. The other nice thing about it was that it didn't have a hard cutoff that the current reclaim_mapped toggle does -- you could opt to scan the mapped list at a lower ratio than the unmapped one. Of course, it also has some downsides too, and would require retuning... Thanks, I am motivated to experiment with the idea. I guess I need to (re)discover the downsides for myself :-) -- Balbir Singh Linux Technology Center IBM, ISTL - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
Re-coded my patch, tab = 8. Sorry! Signed-off-by: Yunfeng Zhang <[EMAIL PROTECTED]> Index: linux-2.6.19/Documentation/vm_pps.txt === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6.19/Documentation/vm_pps.txt 2007-01-23 11:32:02.0 +0800 @@ -0,0 +1,236 @@ + Pure Private Page System (pps) + [EMAIL PROTECTED] + December 24-26, 2006 + +// Purpose <([{ +This file documents an idea first published at +http://www.ussg.iu.edu/hypermail/linux/kernel/0607.2/0451.html, as part of my +OS -- main page http://blog.chinaunix.net/u/21764/index.php. In brief, this +patch aims to enhance the performance of the Linux swap +subsystem. You can find an overview of the idea in section and how I patched it into Linux 2.6.19 in section +. +// }])> + +// How to Reclaim Pages More Efficiently <([{ +A good idea originates from overall design and management ability: when you +look down from a manager's view, you free yourself from disordered code and +spot problems immediately. + +In a modern OS, the memory subsystem can be divided into three layers +1) Space layer (InodeSpace, UserSpace and CoreSpace). +2) VMA layer (PrivateVMA and SharedVMA, memory architecture-independent layer). +3) Page table, zone and memory inode layer (architecture-dependent). +It may seem that Page/PTE should be placed in the 3rd layer, but +here it is placed in the 2nd layer since it is the basic unit of a VMA. + +Since the 2nd layer gathers most of the page-access statistics, it is +natural that the swap subsystem should be deployed and implemented on the 2nd +layer. + +Undoubtedly, this has several virtues +1) The SwapDaemon can collect statistics on how processes access pages and use + them to unmap PTEs. SMP especially benefits, because flush_tlb_range can + unmap PTEs in batches rather than issuing a TLB IPI interrupt per page as in + the current Linux legacy swap subsystem. 
+2) Page faults can issue better readahead requests, since history data shows + all related pages have a clustering affinity. In contrast, Linux page-fault + readahead fetches pages relative to the SwapSpace position of the current + faulting page. +3) It conforms to the POSIX madvise API family. +4) It simplifies the Linux memory model dramatically. Keep in mind that the + new swap strategy works top-down. In fact, the Linux legacy swap subsystem + is perhaps the only one that works bottom-up. + +Unfortunately, the Linux 2.6.19 swap subsystem is based on the 3rd layer -- a +system built on memory node::active_list/inactive_list. + +I've finished a patch, see section . Note, it +ISN'T perfect. +// }])> + +// Pure Private Page System -- pps <([{ +As I mentioned in the previous section, applying my idea perfectly would +require uprooting the page-centered swap subsystem and migrating it onto VMAs, +but a huge gap has defeated me -- active_list and inactive_list. In fact, you +can find lru_add_active code everywhere ... It's IMPOSSIBLE for me to complete +this alone. It is also the difference between my design and Linux: in my OS, a +page is entirely the charge of its new owner, while in Linux the page +management system still traces it via the PG_active flag. + +So I conceived another solution:) That is, set up an independent page-recycle +system rooted in the Linux legacy page system -- pps, intercept all private +pages belonging to PrivateVMAs into pps, then use pps to recycle them. The +whole job consists of two parts; here is the first -- PrivateVMA-oriented; the +other is SharedVMA-oriented (to be called SPS), scheduled for the future. Of +course, once both are done, the Linux legacy page system will be emptied. + +In fact, pps is centered on how to better collect and unmap process-private +pages; the whole process is divided into six stages -- . PPS +uses init_mm::mm_list to enumerate all swappable UserSpace +(shrink_private_vma) in mm/vmscan.c. 
Other sections cover the remaining aspects of pps +1) basic data definitions. +2) synchronization. +3) how private pages enter and leave pps. +4) which VMAs belong to pps. +5) the new daemon thread kppsd, pps statistics, etc. + +I'm also glad to highlight a new idea of mine -- dftlb, which is described in +section . +// }])> + +// Delay to Flush TLB (dftlb) <([{ +Delay-to-flush-TLB is introduced to make TLB flushing more efficient. In +brief, when we want to unmap a page from a process's page table, why send a +TLB IPI to the other CPUs immediately? Since every CPU has a timer interrupt, +we can insert flushing tasks into the timer interrupt routine and get TLB +flushing essentially for free. + +The trick is implemented in +1) a TLB flushing task is added in fill_in_tlb_task of mm/vmscan.c. +2) timer_flush_tlb_tasks of kernel/timer.c is used by other CPUs to execute + flushing tasks. +3) all data are
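The dftlb scheme described above can be sketched in plain C. This is a toy user-space model under stated assumptions, not the patch itself: the names fill_in_tlb_task and timer_flush_tlb_tasks follow the text, but the fixed-size queue and the function bodies are illustrative assumptions.

```c
#include <stdbool.h>

/* Illustrative model of "delay to flush TLB": instead of sending an IPI
 * per unmapped page, the unmap path queues a flush task, and each CPU
 * drains the queue from its own timer interrupt.  All names and sizes
 * here are hypothetical. */

#define MAX_TLB_TASKS 64

struct tlb_task {
	unsigned long start;	/* first virtual address to flush */
	unsigned long end;	/* one past the last address */
};

static struct tlb_task tlb_tasks[MAX_TLB_TASKS];
static int tlb_task_count;

/* Called at unmap time: record the range instead of IPI-ing right away.
 * Returns false when the queue is full and the caller must flush
 * synchronously, as a real implementation would have to. */
static bool fill_in_tlb_task(unsigned long start, unsigned long end)
{
	if (tlb_task_count >= MAX_TLB_TASKS)
		return false;
	tlb_tasks[tlb_task_count].start = start;
	tlb_tasks[tlb_task_count].end = end;
	tlb_task_count++;
	return true;
}

/* Called from the per-CPU timer interrupt: drain all pending flushes.
 * A real kernel would call flush_tlb_range() for each queued task here;
 * this model just empties the queue and reports how many it flushed. */
static int timer_flush_tlb_tasks(void)
{
	int done = tlb_task_count;

	tlb_task_count = 0;
	return done;
}
```

The interesting property is visible even in the toy: unmap-time work becomes a cheap enqueue, and the expensive cross-CPU flush piggybacks on an interrupt that fires anyway.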
Re: SATA exceptions with 2.6.20-rc5
Björn Steinbrink wrote: Hm, I don't think it is unhappy about looking at NV_INT_STATUS_CK804. I'm running 2.6.20-rc5 with the INT_DEV check removed for 8 hours now without a single problem and that should still look at NV_INT_STATUS_CK804, right? I just noticed that my last email might not have been clear enough. The exceptions happened when I re-enabled the return statement in addition to the debug message. Without the INT_DEV check, it is completely fine AFAICT. Indeed, it seems to be just the NV_INT_DEV check that is problematic. Here's a patch that's likely better to test, it forces the NV_INT_DEV flag on when a command is active, and also fixes that questionable code in nv_host_intr that I mentioned. -- Robert Hancock Saskatoon, SK, Canada To email, remove "nospam" from [EMAIL PROTECTED] Home Page: http://www.roberthancock.com/ --- linux-2.6.20-rc5/drivers/ata/sata_nv.c 2007-01-19 19:18:53.0 -0600 +++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c 2007-01-22 22:33:43.0 -0600 @@ -700,7 +700,6 @@ static void nv_adma_check_cpb(struct ata static int nv_host_intr(struct ata_port *ap, u8 irq_stat) { struct ata_queued_cmd *qc = ata_qc_from_tag(ap, ap->active_tag); - int handled; /* freeze if hotplugged */ if (unlikely(irq_stat & (NV_INT_ADDED | NV_INT_REMOVED))) { @@ -719,13 +718,7 @@ static int nv_host_intr(struct ata_port } /* handle interrupt */ - handled = ata_host_intr(ap, qc); - if (unlikely(!handled)) { - /* spurious, clear it */ - ata_check_status(ap); - } - - return 1; + return ata_host_intr(ap, qc); } static irqreturn_t nv_adma_interrupt(int irq, void *dev_instance) @@ -752,6 +745,11 @@ static irqreturn_t nv_adma_interrupt(int if (pp->flags & NV_ADMA_PORT_REGISTER_MODE) { u8 irq_stat = readb(host->mmio_base + NV_INT_STATUS_CK804) >> (NV_INT_PORT_SHIFT * i); + if(ata_tag_valid(ap->active_tag)) + /** NV_INT_DEV indication seems unreliable at times + at least in ADMA mode. Force it on always when a + command is active, to prevent losing interrupts. 
*/ + irq_stat |= NV_INT_DEV; handled += nv_host_intr(ap, irq_stat); continue; }
Re: [patch] faster vgetcpu using sidt (take 2)
On Thu, 18 Jan 2007, Andi Kleen wrote: > > let me know what you think... thanks. > > It's ok, although I would like to have the file in a separate directory. cool -- do you have a directory in mind? and would you like this change as two separate patches or one combined patch? thanks -dean
Re: [PATCH] binfmt_elf: core dump masking support
Hi, >>>(run echo 1 > coremask, echo 0 > coremask in a loop while dumping >>>core. Do you have enough locking to make it work as expected?) >> >>Currently, no lock is acquired. But I think the kernel only >>has to preserve the coremask setting in a local variable at the >>beginning of core dumping. I'm going to do this in the next version. > > No, I do not think that is enough. At minimum, you'd need an atomic_t > variable. But I'd recommend against it. Playing with locking is tricky.

Why do you think it is not enough? I think no locking is needed. My design principle is that the core dump routine is controlled by the bitmask which was assigned to the dumping process at the time core dumping started. So if a coremask setting is changed while dumping, the change doesn't affect the current dumping process. This can be implemented as follows:

core_dump_start:
	unsigned int mask = current->mm->coremask;
	for each VMA {
		write a header which depends on the result of maydump(vma, mask)
	}
	for each VMA {
		write a body which depends on the result of maydump(vma, mask)
	}

NOTE: maydump() is the central function, which decides whether a given VMA should be dumped or not. What do you think about this? Best regards, -- Hidehiro Kawai Hitachi, Ltd., Systems Development Laboratory
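The snapshot argument above can be shown in a small C sketch. This is a hypothetical stand-in, not the kernel code: the toy `struct vma`, `struct mm`, the `COREMASK_DUMP_SHARED` bit, and the maydump() policy are all assumptions, kept only detailed enough to show why a single snapshot of the mask keeps the two dump passes consistent.

```c
#include <stddef.h>

struct vma { int is_shared; };		/* toy VMA: one attribute bit */
struct mm { unsigned int coremask; };	/* toy mm with the proposed mask */

#define COREMASK_DUMP_SHARED 0x1	/* hypothetical mask bit */

/* Toy dump policy: private VMAs are always dumped, shared VMAs only
 * when the mask bit is set. */
static int maydump(const struct vma *vma, unsigned int mask)
{
	if (vma->is_shared)
		return (mask & COREMASK_DUMP_SHARED) != 0;
	return 1;
}

/* Returns the number of VMAs dumped, or -1 if the header pass and the
 * body pass disagreed.  Because both passes use the one snapshot taken
 * at the start, a concurrent change of mm->coremask (simulated below)
 * cannot make them disagree. */
static int core_dump(struct mm *mm, const struct vma *vmas, int n)
{
	unsigned int mask = mm->coremask;	/* single snapshot at dump start */
	int headers = 0, bodies = 0;

	for (int i = 0; i < n; i++)		/* header pass */
		headers += maydump(&vmas[i], mask);

	mm->coremask = 0;	/* simulate a racing "echo 0 > coremask" */

	for (int i = 0; i < n; i++)		/* body pass sees the same snapshot */
		bodies += maydump(&vmas[i], mask);

	return headers == bodies ? headers : -1;
}
```

Reading `mm->coremask` into the local `mask` is the whole synchronization story: an unsigned int load is a single access, so no lock is needed for the dump to be internally consistent.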
Re: Suspend to RAM generates oops and general protection fault
>>> will be a device driver. Common causes of suspend/resume problems from >>> the list you give below are acpi modules, bluetooth and usb. I'd also >>> consider pcmcia, drm and fuse possibilities. But again, go for unloading >>> everything possible in the first instance. >> Actually, the reason I sent this is that when I showed the oops/gpf to >> Matthew Garrett at linux.conf.au, he said it looked like a CPU hotplug >> problem and suggested I send it to lkml. BTW, with 2.6.20-rc5, the >> suspend to RAM now works ~95% of the time. > > Try a kernel without CONFIG_SMP... that will verify if it is SMP > related. Well, this happens to be my main work machine, which I'm not willing to have running at half speed for several weeks. Anything else you can suggest? Jean-Marc
Re: Suspend to RAM generates oops and general protection fault
>> I just encountered the following oops and general protection fault >> trying to suspend/resume my laptop. I've got a Dell D820 laptop with a 2 >> GHz Core 2 Duo CPU. It usually suspends/resumes fine but not always. The >> relevant errors are below but the full dmesg log is at >> http://people.xiph.org/~jm/suspend_resume_oops.txt and my config is in >> http://people.xiph.org/~jm/config-2.6.20-rc5.txt >> >> This happens when I'm running 2.6.20-rc5. The previous kernel version I >> was using is 2.6.19-rc6 and was much more broken (second attempt >> *always* failed), so it's probably not a regression. > > This is a shot against the odds, but could you please check if the attached > patch has any effect? Thanks, I'll try that. It may take a while because the problem only happened once in dozens of suspend/resume cycles. Jean-Marc > Rafael > > > > > > > Both process_zones() and drain_node_pages() check for populated zones before > touching pagesets. However, __drain_pages() does not do so. > > This may result in a NULL pointer dereference for pagesets in unpopulated > zones if a NUMA setup is combined with cpu hotplug. > > Initially the unpopulated zone has the pcp pointers pointing to the boot > pagesets. Since the zone is not populated the boot pageset pointers will > not be changed during page allocator and slab bootstrap. > > If a cpu is later brought down (first call to __drain_pages()) then the pcp > pointers for cpus in unpopulated zones are set to NULL since __drain_pages > does not first check for an unpopulated zone. > > If the cpu is then brought up again then we call process_zones() which will > ignore > the unpopulated zone. So the pageset pointers will still be NULL. > > If the cpu is then again brought down then __drain_pages will attempt to drain > pages by following the NULL pageset pointer for unpopulated zones. 
> > Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> > > --- > mm/page_alloc.c | 3 +++ > 1 file changed, 3 insertions(+) > > Index: linux-2.6.20-rc4/mm/page_alloc.c > === > --- linux-2.6.20-rc4.orig/mm/page_alloc.c > +++ linux-2.6.20-rc4/mm/page_alloc.c > @@ -714,6 +714,9 @@ static void __drain_pages(unsigned int cpu)
> 	for_each_zone(zone) {
> 		struct per_cpu_pageset *pset;
>
> +		if (!populated_zone(zone))
> +			continue;
> +
> 		pset = zone_pcp(zone, cpu);
> 		for (i = 0; i < ARRAY_SIZE(pset->pcp); i++) {
> 			struct per_cpu_pages *pcp;
Re: Why active list and inactive list?
Nick Piggin wrote: The other nice thing about it was that it didn't have a hard cutoff that the current reclaim_mapped toggle does -- you could opt to scan the mapped list at a lower ratio than the unmapped one. Of course, it also has some downsides too, and would require retuning... Here's a simple idea for tuning. For each list we keep track of: 1) the size of the list 2) the rate at which we scan the list 3) the fraction of (non new) pages that get referenced That way we can determine which list has the largest fraction of "idle" pages sitting around and consequently which list should be scanned more aggressively. For each list we can calculate how frequently the pages in the list are being used: pressure = referenced percentage * scan rate / list size The VM can equalize the pressure by scanning the list with lower usage less than the other list. This way the VM can give the right amount of memory to each type. Of course, each list needs to be divided into inactive and active like the current VM, in order to make sure that the pages which are used once cannot push the real working set of that list out of memory. There is a more subtle problem when the list's working set is larger than the amount of memory the list has. In that situation the VM will be faulting pages back in just after they got evicted. Something like my /proc/refaults code can detect that and adjust the size of the undersized list accordingly. Of course, once we properly distinguish between the more frequently and less frequently accessed pages within each of the page sets (mapped/anonymous vs. unmapped) and have the pressure between the lists equalized, why do we need to keep them separate again? -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. 
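The pressure rule above is easy to make concrete. The following sketch is illustrative only, with hypothetical numbers and names; it just encodes Rik's formula, pressure = referenced fraction * scan rate / list size, and the decision it implies:

```c
/* Hypothetical sketch of the per-list pressure heuristic.  The struct
 * fields mirror the three quantities the text says to track per list. */

struct lru_stats {
	double referenced_frac;	/* fraction of scanned (non-new) pages referenced */
	double scan_rate;	/* pages scanned per unit time */
	double size;		/* pages on the list */
};

/* pressure = referenced fraction * scan rate / list size:
 * how frequently the pages on this list are actually being used. */
static double lru_pressure(const struct lru_stats *s)
{
	return s->referenced_frac * s->scan_rate / s->size;
}

/* Returns 1 when the mapped list is hotter than the unmapped one,
 * i.e. the VM should equalize by scanning the unmapped list harder. */
static int mapped_hotter(const struct lru_stats *mapped,
			 const struct lru_stats *unmapped)
{
	return lru_pressure(mapped) > lru_pressure(unmapped);
}
```

With a hot mapped list (say 80% referenced) and a large, rarely referenced unmapped list, the pressures differ by an order of magnitude, so the scan ratio adjustment falls straight out of the comparison.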
Re: [PATCH 2.6.20-rc5 1/1] MM: enhance Linux swap subsystem
Patched against 2.6.19, this leads to: mm/vmscan.c: In function `shrink_pvma_scan_ptes': mm/vmscan.c:1340: too many arguments to function `page_remove_rmap' So I changed page_remove_rmap(series.pages[i], vma); to page_remove_rmap(series.pages[i]); I worked against 2.6.19, but the function changed when updating to 2.6.20-rc5. However, your patch doesn't offer any swap-performance improvement for either swsusp or tmpfs. Swap-in is still half the speed of swap-out. Current Linux page allocation provides pages fairly to every process, and the swap daemon is only started when memory is low, so by the time it starts to scan the active_list, the private pages of different processes are mixed up with each other. vmscan.c:shrink_list() is the only place that attaches a disk swap page to a page on the active_list; as a result, all private pages lose their affinity on the swap partition. I will give a test later...
Re: Why active list and inactive list?
Balbir Singh wrote: This makes me wonder if it makes sense to split up the LRU into page cache LRU and mapped pages LRU. I see two benefits 1. Currently based on swappiness, we might walk an entire list searching for page cache pages or mapped pages. With these lists separated, it should get easier and faster to implement this scheme 2. There is another parallel thread on implementing page cache limits. If the lists split out, we need not scan the entire list to find page cache pages to evict them. Of course I might be missing something (some piece of history) I actually had patches to do "split active lists" a while back. They worked by lazily moving the page at reclaim-time, based on whether or not it is mapped. This isn't too much worse than the kernel's current idea of what a mapped page is. They actually got a noticeable speedup of the swapping kbuild workload, but at this stage there were some more basic improvements needed, so the difference could be smaller today. The other nice thing about it was that it didn't have a hard cutoff that the current reclaim_mapped toggle does -- you could opt to scan the mapped list at a lower ratio than the unmapped one. Of course, it also has some downsides too, and would require retuning... -- SUSE Labs, Novell Inc.
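The lazy reclaim-time move Nick describes can be sketched as follows. This is a minimal illustrative model, not the actual patches: real LRUs are list_heads with locking, while here the lists are plain arrays and the classification function is a hypothetical name.

```c
/* Toy model of "split active lists" with lazy movement: pages are never
 * moved on mmap/munmap; only when reclaim inspects a page does it get
 * filed onto the mapped or unmapped list, based on its mapcount at that
 * moment.  This avoids touching extra cachelines on every map/unmap. */

#define NPAGES 8

struct page { int mapcount; };

struct split_lru {
	struct page *mapped[NPAGES];
	struct page *unmapped[NPAGES];
	int nmapped, nunmapped;
};

/* Reclaim-time classification: the only place list membership changes.
 * A page that was unmapped since it was last filed simply lands on the
 * other list the next time reclaim looks at it. */
static void lazy_file(struct split_lru *lru, struct page *page)
{
	if (page->mapcount > 0)
		lru->mapped[lru->nmapped++] = page;
	else
		lru->unmapped[lru->nunmapped++] = page;
}
```

Because classification happens only under reclaim, a terminating process does not trigger a burst of list moves; its pages just drift to the unmapped list as reclaim encounters them, which is exactly the property the eager-move scheme discussed later in the thread lacks.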
Re: Why active list and inactive list?
On Tue, 23 Jan 2007, Balbir Singh wrote: > Yes, good point, I see what you mean in terms of impact. But the trade > off could come from shrink_active_list() which does > > list_del(&page->lru); > if (!reclaim_mapped && other_conditions) > list_add(&page->lru, &l_active); > ... > > In the case mentioned above, we would triple the cachelines when an area > is mapped/unmapped (which might be acceptable since it is a state change > for the page ;) ). In the trade-off I mentioned, it would happen > every time reclaim is invoked and it has nothing to do with a page changing > state. > > Did I miss something? We do the list_del/list_add right now in reclaim while moving pages between active and inactive lists. However, reclaim is not run until the system is under memory pressure. Reclaim is run rarely and then lots of these movements occur. At that point it is likely that the cachelines are already available, since the page structs had to be touched for earlier movements.
Re: Why active list and inactive list?
Christoph Lameter wrote: perfmon can do much of what you are looking for. Thanks, I'll look into it. -- Balbir Singh Linux Technology Center IBM, ISTL
Re: Why active list and inactive list?
Christoph Lameter wrote: On Tue, 23 Jan 2007, Balbir Singh wrote: When you unmap or map, you need to touch the pte entries and know the pages involved, so shouldn't it be equivalent to a list_del and list_add for each page impacted by the map/unmap operation? When you unmap and map you must currently get exclusive access to the cachelines of the pte and the cacheline of the page struct. If we use a list_move on page->lru then we would have to update pointers in up to 4 other page structs. Thus we need exclusive access to 4 additional cachelines. This triples the number of cachelines touched. Instead of 2 cachelines we need 6. Yes, good point, I see what you mean in terms of impact. But the trade off could come from shrink_active_list() which does list_del(&page->lru); if (!reclaim_mapped && other_conditions) list_add(&page->lru, &l_active); ... In the case mentioned above, we would triple the cachelines when an area is mapped/unmapped (which might be acceptable since it is a state change for the page ;) ). In the trade-off I mentioned, it would happen every time reclaim is invoked and it has nothing to do with a page changing state. Did I miss something? -- Balbir Singh Linux Technology Center IBM, ISTL
Re: Why active list and inactive list?
On Tue, 23 Jan 2007, Balbir Singh wrote: > I have always wondered if it would be useful to have a kernel debug > feature that can extract page references per task, it would be good > to see the page references (last 'n') of a workload that is not > doing too well on a particular system. perfmon can do much of what you are looking for.
Re: Why active list and inactive list?
Rik van Riel wrote: Christoph Lameter wrote: With the proposed scheme you would have to move pages between lists if they are mapped and unmapped by a process. Terminating a process could lead to lots of pages moving to the unmapped list. That could be a problem. Another problem is that any such heuristic in the VM is bound to have corner cases that some workloads will hit. It would be really nice if we came up with a page replacement algorithm that did not need many extra heuristics to make it work... Yes, it's damn hard at times. I was reading through an article (Architectural support for translation table management in large address space machines - Huck and Hays); it talks about how object-oriented systems encourage more sharing and decrease the locality of the resulting virtual address stream. Even multi-threading tends to impact locality of reference. Unfortunately, we have only heuristics to go by and of course their mathematical models. I have always wondered if it would be useful to have a kernel debug feature that can extract page references per task; it would be good to see the last 'n' page references of a workload that is not doing too well on a particular system. -- Balbir Singh Linux Technology Center IBM, ISTL
[PATCH 05/12] md: move write operations to raid5_run_ops
From: Dan Williams <[EMAIL PROTECTED]> handle_stripe sets STRIPE_OP_PREXOR, STRIPE_OP_BIODRAIN, STRIPE_OP_POSTXOR to request a write to the stripe cache. raid5_run_ops is triggered to run and executes the request outside the stripe lock. Signed-off-by: Dan Williams <[EMAIL PROTECTED]> --- drivers/md/raid5.c | 152 +--- 1 files changed, 131 insertions(+), 21 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 2c74f9b..2390657 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -1788,7 +1788,75 @@ static void compute_block_2(struct stripe_head *sh, int dd_idx1, int dd_idx2) } } +static int handle_write_operations5(struct stripe_head *sh, int rcw, int expand) +{ + int i, pd_idx = sh->pd_idx, disks = sh->disks; + int locked=0; + + if (rcw == 0) { + /* skip the drain operation on an expand */ + if (!expand) { + BUG_ON(test_and_set_bit(STRIPE_OP_BIODRAIN, + &sh->ops.pending)); + sh->ops.count++; + } + + BUG_ON(test_and_set_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)); + sh->ops.count++; + + for (i=disks ; i-- ;) { + struct r5dev *dev = &sh->dev[i]; + + if (dev->towrite) { + set_bit(R5_LOCKED, &dev->flags); + if (!expand) + clear_bit(R5_UPTODATE, &dev->flags); + locked++; + } + } + } else { + BUG_ON(!(test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags) || + test_bit(R5_Wantcompute, &sh->dev[pd_idx].flags))); + + BUG_ON(test_and_set_bit(STRIPE_OP_PREXOR, &sh->ops.pending) || + test_and_set_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending) || + test_and_set_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)); + + sh->ops.count += 3; + + for (i=disks ; i-- ;) { + struct r5dev *dev = &sh->dev[i]; + if (i==pd_idx) + continue; + /* For a read-modify write there may be blocks that are +* locked for reading while others are ready to be written +* so we distinguish these blocks by the R5_Wantprexor bit +*/ + if (dev->towrite && + (test_bit(R5_UPTODATE, &dev->flags) || + test_bit(R5_Wantcompute, &dev->flags))) { + set_bit(R5_Wantprexor, &dev->flags); + set_bit(R5_LOCKED, &dev->flags); + clear_bit(R5_UPTODATE, &dev->flags); + locked++; + } + } + } + + 
/* keep the parity disk locked while asynchronous operations +* are in flight +*/ + set_bit(R5_LOCKED, &sh->dev[pd_idx].flags); + clear_bit(R5_UPTODATE, &sh->dev[pd_idx].flags); + locked++; + + PRINTK("%s: stripe %llu locked: %d pending: %lx\n", + __FUNCTION__, (unsigned long long)sh->sector, + locked, sh->ops.pending); + + return locked; +} /* * Each stripe/dev can have one or more bion attached. @@ -2151,8 +2219,67 @@ static void handle_stripe5(struct stripe_head *sh) set_bit(STRIPE_HANDLE, &sh->state); } - /* now to consider writing and what else, if anything should be read */ - if (to_write) { + /* Now we check to see if any write operations have recently +* completed +*/ + + /* leave prexor set until postxor is done, allows us to distinguish +* a rmw from a rcw during biodrain +*/ + if (test_bit(STRIPE_OP_PREXOR, &sh->ops.complete) && + test_bit(STRIPE_OP_POSTXOR, &sh->ops.complete)) { + + clear_bit(STRIPE_OP_PREXOR, &sh->ops.complete); + clear_bit(STRIPE_OP_PREXOR, &sh->ops.ack); + clear_bit(STRIPE_OP_PREXOR, &sh->ops.pending); + + for (i=disks; i--;) + clear_bit(R5_Wantprexor, &sh->dev[i].flags); + } + + /* if only POSTXOR is set then this is an 'expand' postxor */ + if (test_bit(STRIPE_OP_BIODRAIN, &sh->ops.complete) && + test_bit(STRIPE_OP_POSTXOR, &sh->ops.complete)) { + + clear_bit(STRIPE_OP_BIODRAIN, &sh->ops.complete); + clear_bit(STRIPE_OP_BIODRAIN, &sh->ops.ack); + clear_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending); + + clear_bit(STRIPE_OP_POSTXOR, &sh->ops.complete); + clear_bit(STRIPE_OP_POSTXOR, &sh->ops.ack); + clear_bit(STRIPE_OP_POSTXOR, &sh->ops.pending); + + /* All the 'written' buffers and the parity block are ready to be +
[PATCH 06/12] md: move raid5 compute block operations to raid5_run_ops
From: Dan Williams <[EMAIL PROTECTED]> handle_stripe sets STRIPE_OP_COMPUTE_BLK to request servicing from raid5_run_ops. It also sets a flag for the block being computed to let other parts of handle_stripe submit dependent operations. raid5_run_ops guarantees that the compute operation completes before any dependent operation starts. Signed-off-by: Dan Williams <[EMAIL PROTECTED]> --- drivers/md/raid5.c | 125 +++- 1 files changed, 93 insertions(+), 32 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 2390657..279a30c 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -1980,7 +1980,7 @@ static void handle_stripe5(struct stripe_head *sh) int i; int syncing, expanding, expanded; int locked=0, uptodate=0, to_read=0, to_write=0, failed=0, written=0; - int non_overwrite = 0; + int compute=0, req_compute=0, non_overwrite=0; int failed_num=0; struct r5dev *dev; unsigned long pending=0; @@ -2032,8 +2032,8 @@ static void handle_stripe5(struct stripe_head *sh) /* now count some things */ if (test_bit(R5_LOCKED, &dev->flags)) locked++; if (test_bit(R5_UPTODATE, &dev->flags)) uptodate++; + if (test_bit(R5_Wantcompute, &dev->flags)) BUG_ON(++compute > 1); - if (dev->toread) to_read++; if (dev->towrite) { to_write++; @@ -2188,31 +2188,82 @@ static void handle_stripe5(struct stripe_head *sh) * parity, or to satisfy requests * or to load a block that is being partially written. 
*/ - if (to_read || non_overwrite || (syncing && (uptodate < disks)) || expanding) { - for (i=disks; i--;) { - dev = &sh->dev[i]; - if (!test_bit(R5_LOCKED, &dev->flags) && !test_bit(R5_UPTODATE, &dev->flags) && - (dev->toread || -(dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) || -syncing || -expanding || -(failed && (sh->dev[failed_num].toread || -(sh->dev[failed_num].towrite && !test_bit(R5_OVERWRITE, &sh->dev[failed_num].flags - ) - ) { - /* we would like to get this block, possibly -* by computing it, but we might not be able to + if (to_read || non_overwrite || (syncing && (uptodate + compute < disks)) || expanding || + test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending)) { + + /* Clear completed compute operations. Parity recovery +* (STRIPE_OP_MOD_REPAIR_PD) implies a write-back which is handled +* later on in this routine +*/ + if (test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.complete) && + !test_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending)) { + clear_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.complete); + clear_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.ack); + clear_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending); + } + + /* look for blocks to read/compute, skip this if a compute +* is already in flight, or if the stripe contents are in the +* midst of changing due to a write +*/ + if (!test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending) && + !test_bit(STRIPE_OP_PREXOR, &sh->ops.pending) && + !test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)) { + for (i=disks; i--;) { + dev = &sh->dev[i]; + + /* don't schedule compute operations or reads on +* the parity block while a check is in flight */ - if (uptodate == disks-1) { - PRINTK("Computing block %d\n", i); - compute_block(sh, i); - uptodate++; - } else if (test_bit(R5_Insync, &dev->flags)) { - set_bit(R5_LOCKED, &dev->flags); - set_bit(R5_Wantread, &dev->flags); - locked++; - PRINTK("Reading block %d (sync=%d)\n", - i, syncing); + if ((i == sh->pd_idx) && test_bit(STRIPE_OP_CHECK, &sh->ops.pending)) + continue; + + if (!test_bit(R5_LOCKED, &dev->flags) && !test_bit(R5_UPTODATE, &dev->flags) &&
[PATCH 03/12] md: add raid5_run_ops and support routines
From: Dan Williams <[EMAIL PROTECTED]> Prepare the raid5 implementation to use async_tx for running stripe operations: * biofill (copy data into request buffers to satisfy a read request) * compute block (generate a missing block in the cache from the other blocks) * prexor (subtract existing data as part of the read-modify-write process) * biodrain (copy data out of request buffers to satisfy a write request) * postxor (recalculate parity for new data that has entered the cache) * check (verify that the parity is correct) * io (submit i/o to the member disks) Changelog: * removed ops_complete_biodrain in favor of ops_complete_postxor and ops_complete_write. * removed the workqueue * call bi_end_io for reads in ops_complete_biofill Signed-off-by: Dan Williams <[EMAIL PROTECTED]> --- drivers/md/raid5.c | 520 include/linux/raid/raid5.h | 63 + 2 files changed, 580 insertions(+), 3 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 68b6fea..e70ee17 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -52,6 +52,7 @@ #include "raid6.h" #include +#include /* * Stripe cache @@ -324,6 +325,525 @@ static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector return sh; } +static int +raid5_end_read_request(struct bio * bi, unsigned int bytes_done, int error); +static int +raid5_end_write_request (struct bio *bi, unsigned int bytes_done, int error); + +static void ops_run_io(struct stripe_head *sh) +{ + raid5_conf_t *conf = sh->raid_conf; + int i, disks = sh->disks; + + might_sleep(); + + for (i=disks; i-- ;) { + int rw; + struct bio *bi; + mdk_rdev_t *rdev; + if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags)) + rw = WRITE; + else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags)) + rw = READ; + else + continue; + + bi = &sh->dev[i].req; + + bi->bi_rw = rw; + if (rw == WRITE) + bi->bi_end_io = raid5_end_write_request; + else + bi->bi_end_io = raid5_end_read_request; + + rcu_read_lock(); + rdev = 
rcu_dereference(conf->disks[i].rdev); + if (rdev && test_bit(Faulty, >flags)) + rdev = NULL; + if (rdev) + atomic_inc(>nr_pending); + rcu_read_unlock(); + + if (rdev) { + if (test_bit(STRIPE_SYNCING, >state) || + test_bit(STRIPE_EXPAND_SOURCE, >state) || + test_bit(STRIPE_EXPAND_READY, >state)) + md_sync_acct(rdev->bdev, STRIPE_SECTORS); + + bi->bi_bdev = rdev->bdev; + PRINTK("%s: for %llu schedule op %ld on disc %d\n", + __FUNCTION__, (unsigned long long)sh->sector, + bi->bi_rw, i); + atomic_inc(>count); + bi->bi_sector = sh->sector + rdev->data_offset; + bi->bi_flags = 1 << BIO_UPTODATE; + bi->bi_vcnt = 1; + bi->bi_max_vecs = 1; + bi->bi_idx = 0; + bi->bi_io_vec = >dev[i].vec; + bi->bi_io_vec[0].bv_len = STRIPE_SIZE; + bi->bi_io_vec[0].bv_offset = 0; + bi->bi_size = STRIPE_SIZE; + bi->bi_next = NULL; + if (rw == WRITE && + test_bit(R5_ReWrite, >dev[i].flags)) + atomic_add(STRIPE_SECTORS, >corrected_errors); + generic_make_request(bi); + } else { + if (rw == WRITE) + set_bit(STRIPE_DEGRADED, >state); + PRINTK("skip op %ld on disc %d for sector %llu\n", + bi->bi_rw, i, (unsigned long long)sh->sector); + clear_bit(R5_LOCKED, >dev[i].flags); + set_bit(STRIPE_HANDLE, >state); + } + } +} + +static struct dma_async_tx_descriptor * +async_copy_data(int frombio, struct bio *bio, struct page *page, sector_t sector, + struct dma_async_tx_descriptor *tx) +{ + struct bio_vec *bvl; + struct page *bio_page; + int i; + int page_offset; + + if (bio->bi_sector >= sector) + page_offset = (signed)(bio->bi_sector - sector) * 512; + else + page_offset = (signed)(sector - bio->bi_sector) * -512; + bio_for_each_segment(bvl, bio, i) { + int len = bio_iovec_idx(bio,i)->bv_len; + int clen; +
[PATCH 04/12] md: use raid5_run_ops for stripe cache operations
From: Dan Williams <[EMAIL PROTECTED]> Each stripe has three flag variables to reflect the state of operations (pending, ack, and complete). -pending: set to request servicing in raid5_run_ops -ack: set to reflect that raid5_runs_ops has seen this request -complete: set when the operation is complete and it is ok for handle_stripe5 to clear 'pending' and 'ack'. Signed-off-by: Dan Williams <[EMAIL PROTECTED]> --- drivers/md/raid5.c | 65 +--- 1 files changed, 56 insertions(+), 9 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index e70ee17..2c74f9b 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -126,6 +126,7 @@ static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh) } md_wakeup_thread(conf->mddev->thread); } else { + BUG_ON(sh->ops.pending); if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, >state)) { atomic_dec(>preread_active_stripes); if (atomic_read(>preread_active_stripes) < IO_THRESHOLD) @@ -225,7 +226,8 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int BUG_ON(atomic_read(>count) != 0); BUG_ON(test_bit(STRIPE_HANDLE, >state)); - + BUG_ON(sh->ops.pending || sh->ops.ack || sh->ops.complete); + CHECK_DEVLOCK(); PRINTK("init_stripe called, stripe %llu\n", (unsigned long long)sh->sector); @@ -241,11 +243,11 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int for (i = sh->disks; i--; ) { struct r5dev *dev = >dev[i]; - if (dev->toread || dev->towrite || dev->written || + if (dev->toread || dev->read || dev->towrite || dev->written || test_bit(R5_LOCKED, >flags)) { - printk("sector=%llx i=%d %p %p %p %d\n", + printk("sector=%llx i=%d %p %p %p %p %d\n", (unsigned long long)sh->sector, i, dev->toread, - dev->towrite, dev->written, + dev->read, dev->towrite, dev->written, test_bit(R5_LOCKED, >flags)); BUG(); } @@ -325,6 +327,43 @@ static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector return sh; } +/* check_op() ensures that we only 
dequeue an operation once */ +#define check_op(op) do {\ + if (test_bit(op, >ops.pending) &&\ + !test_bit(op, >ops.complete)) {\ + if (test_and_set_bit(op, >ops.ack))\ + clear_bit(op, );\ + else\ + ack++;\ + } else\ + clear_bit(op, );\ +} while(0) + +/* find new work to run, do not resubmit work that is already + * in flight + */ +static unsigned long get_stripe_work(struct stripe_head *sh) +{ + unsigned long pending; + int ack = 0; + + pending = sh->ops.pending; + + check_op(STRIPE_OP_BIOFILL); + check_op(STRIPE_OP_COMPUTE_BLK); + check_op(STRIPE_OP_PREXOR); + check_op(STRIPE_OP_BIODRAIN); + check_op(STRIPE_OP_POSTXOR); + check_op(STRIPE_OP_CHECK); + if (test_and_clear_bit(STRIPE_OP_IO, >ops.pending)) + ack++; + + sh->ops.count -= ack; + BUG_ON(sh->ops.count < 0); + + return pending; +} + static int raid5_end_read_request(struct bio * bi, unsigned int bytes_done, int error); static int @@ -1859,7 +1898,6 @@ static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int disks) *schedule a write of some buffers *return confirmation of parity correctness * - * Parity calculations are done inside the stripe lock * buffers are taken off read_list or write_list, and bh_cache buffers * get BH_Lock set before the stripe lock is released. * @@ -1877,10 +1915,11 @@ static void handle_stripe5(struct stripe_head *sh) int non_overwrite = 0; int failed_num=0; struct r5dev *dev; + unsigned long pending=0; - PRINTK("handling stripe %llu, cnt=%d, pd_idx=%d\n", - (unsigned long long)sh->sector, atomic_read(>count), - sh->pd_idx); + PRINTK("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d ops=%lx:%lx:%lx\n", + (unsigned long long)sh->sector, sh->state, atomic_read(>count), + sh->pd_idx, sh->ops.pending, sh->ops.ack, sh->ops.complete); spin_lock(>lock); clear_bit(STRIPE_HANDLE, >state); @@ -2330,8 +2369,14 @@ static void handle_stripe5(struct stripe_head *sh) } } + if (sh->ops.count) + pending = get_stripe_work(sh); + spin_unlock(>lock); + if (pending) +
[PATCH 10/12] md: move raid5 io requests to raid5_run_ops
From: Dan Williams <[EMAIL PROTECTED]> handle_stripe now only updates the state of stripes. All execution of operations is moved to raid5_run_ops. Signed-off-by: Dan Williams <[EMAIL PROTECTED]> --- drivers/md/raid5.c | 68 1 files changed, 10 insertions(+), 58 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 1956b3c..8af084f 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -2360,6 +2360,8 @@ static void handle_stripe5(struct stripe_head *sh) PRINTK("Read_old block %d for r-m-w\n", i); set_bit(R5_LOCKED, >flags); set_bit(R5_Wantread, >flags); + if (!test_and_set_bit(STRIPE_OP_IO, >ops.pending)) + sh->ops.count++; locked++; } else { set_bit(STRIPE_DELAYED, >state); @@ -2380,6 +2382,8 @@ static void handle_stripe5(struct stripe_head *sh) PRINTK("Read_old block %d for Reconstruct\n", i); set_bit(R5_LOCKED, >flags); set_bit(R5_Wantread, >flags); + if (!test_and_set_bit(STRIPE_OP_IO, >ops.pending)) + sh->ops.count++; locked++; } else { set_bit(STRIPE_DELAYED, >state); @@ -2479,6 +2483,8 @@ static void handle_stripe5(struct stripe_head *sh) set_bit(R5_LOCKED, >flags); set_bit(R5_Wantwrite, >flags); + if (!test_and_set_bit(STRIPE_OP_IO, >ops.pending)) + sh->ops.count++; clear_bit(STRIPE_DEGRADED, >state); locked++; set_bit(STRIPE_INSYNC, >state); @@ -2500,12 +2506,16 @@ static void handle_stripe5(struct stripe_head *sh) dev = >dev[failed_num]; if (!test_bit(R5_ReWrite, >flags)) { set_bit(R5_Wantwrite, >flags); + if (!test_and_set_bit(STRIPE_OP_IO, >ops.pending)) + sh->ops.count++; set_bit(R5_ReWrite, >flags); set_bit(R5_LOCKED, >flags); locked++; } else { /* let's read it back */ set_bit(R5_Wantread, >flags); + if (!test_and_set_bit(STRIPE_OP_IO, >ops.pending)) + sh->ops.count++; set_bit(R5_LOCKED, >flags); locked++; } @@ -2615,64 +2625,6 @@ static void handle_stripe5(struct stripe_head *sh) test_bit(BIO_UPTODATE, >bi_flags) ? 
0 : -EIO); } - for (i=disks; i-- ;) { - int rw; - struct bio *bi; - mdk_rdev_t *rdev; - if (test_and_clear_bit(R5_Wantwrite, >dev[i].flags)) - rw = WRITE; - else if (test_and_clear_bit(R5_Wantread, >dev[i].flags)) - rw = READ; - else - continue; - - bi = >dev[i].req; - - bi->bi_rw = rw; - if (rw == WRITE) - bi->bi_end_io = raid5_end_write_request; - else - bi->bi_end_io = raid5_end_read_request; - - rcu_read_lock(); - rdev = rcu_dereference(conf->disks[i].rdev); - if (rdev && test_bit(Faulty, >flags)) - rdev = NULL; - if (rdev) - atomic_inc(>nr_pending); - rcu_read_unlock(); - - if (rdev) { - if (syncing || expanding || expanded) - md_sync_acct(rdev->bdev, STRIPE_SECTORS); - - bi->bi_bdev = rdev->bdev; - PRINTK("for %llu schedule op %ld on disc %d\n", - (unsigned long long)sh->sector, bi->bi_rw, i); - atomic_inc(>count); - bi->bi_sector = sh->sector + rdev->data_offset; - bi->bi_flags = 1 << BIO_UPTODATE; - bi->bi_vcnt = 1; - bi->bi_max_vecs = 1; - bi->bi_idx = 0; - bi->bi_io_vec =
[PATCH 11/12] md: remove raid5 compute_block and compute_parity5
From: Dan Williams <[EMAIL PROTECTED]> replaced by raid5_run_ops Signed-off-by: Dan Williams <[EMAIL PROTECTED]> --- drivers/md/raid5.c | 124 1 files changed, 0 insertions(+), 124 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 8af084f..a981c35 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -1480,130 +1480,6 @@ static void copy_data(int frombio, struct bio *bio, } \ } while(0) - -static void compute_block(struct stripe_head *sh, int dd_idx) -{ - int i, count, disks = sh->disks; - void *ptr[MAX_XOR_BLOCKS], *dest, *p; - - PRINTK("compute_block, stripe %llu, idx %d\n", - (unsigned long long)sh->sector, dd_idx); - - dest = page_address(sh->dev[dd_idx].page); - memset(dest, 0, STRIPE_SIZE); - count = 0; - for (i = disks ; i--; ) { - if (i == dd_idx) - continue; - p = page_address(sh->dev[i].page); - if (test_bit(R5_UPTODATE, >dev[i].flags)) - ptr[count++] = p; - else - printk(KERN_ERR "compute_block() %d, stripe %llu, %d" - " not present\n", dd_idx, - (unsigned long long)sh->sector, i); - - check_xor(); - } - if (count) - xor_block(count, STRIPE_SIZE, dest, ptr); - set_bit(R5_UPTODATE, >dev[dd_idx].flags); -} - -static void compute_parity5(struct stripe_head *sh, int method) -{ - raid5_conf_t *conf = sh->raid_conf; - int i, pd_idx = sh->pd_idx, disks = sh->disks, count; - void *ptr[MAX_XOR_BLOCKS], *dest; - struct bio *chosen; - - PRINTK("compute_parity5, stripe %llu, method %d\n", - (unsigned long long)sh->sector, method); - - count = 0; - dest = page_address(sh->dev[pd_idx].page); - switch(method) { - case READ_MODIFY_WRITE: - BUG_ON(!test_bit(R5_UPTODATE, >dev[pd_idx].flags)); - for (i=disks ; i-- ;) { - if (i==pd_idx) - continue; - if (sh->dev[i].towrite && - test_bit(R5_UPTODATE, >dev[i].flags)) { - ptr[count++] = page_address(sh->dev[i].page); - chosen = sh->dev[i].towrite; - sh->dev[i].towrite = NULL; - - if (test_and_clear_bit(R5_Overlap, >dev[i].flags)) - wake_up(>wait_for_overlap); - - BUG_ON(sh->dev[i].written); - 
sh->dev[i].written = chosen; - check_xor(); - } - } - break; - case RECONSTRUCT_WRITE: - memset(dest, 0, STRIPE_SIZE); - for (i= disks; i-- ;) - if (i!=pd_idx && sh->dev[i].towrite) { - chosen = sh->dev[i].towrite; - sh->dev[i].towrite = NULL; - - if (test_and_clear_bit(R5_Overlap, >dev[i].flags)) - wake_up(>wait_for_overlap); - - BUG_ON(sh->dev[i].written); - sh->dev[i].written = chosen; - } - break; - case CHECK_PARITY: - break; - } - if (count) { - xor_block(count, STRIPE_SIZE, dest, ptr); - count = 0; - } - - for (i = disks; i--;) - if (sh->dev[i].written) { - sector_t sector = sh->dev[i].sector; - struct bio *wbi = sh->dev[i].written; - while (wbi && wbi->bi_sector < sector + STRIPE_SECTORS) { - copy_data(1, wbi, sh->dev[i].page, sector); - wbi = r5_next_bio(wbi, sector); - } - - set_bit(R5_LOCKED, >dev[i].flags); - set_bit(R5_UPTODATE, >dev[i].flags); - } - - switch(method) { - case RECONSTRUCT_WRITE: - case CHECK_PARITY: - for (i=disks; i--;) - if (i != pd_idx) { - ptr[count++] = page_address(sh->dev[i].page); - check_xor(); - } - break; - case READ_MODIFY_WRITE: - for (i = disks; i--;) - if (sh->dev[i].written) { - ptr[count++] = page_address(sh->dev[i].page); - check_xor(); - } -
[PATCH 08/12] md: satisfy raid5 read requests via raid5_run_ops
From: Dan Williams <[EMAIL PROTECTED]> Use raid5_run_ops to carry out the memory copies for a raid5 read request. Signed-off-by: Dan Williams <[EMAIL PROTECTED]> --- drivers/md/raid5.c | 40 +++- 1 files changed, 15 insertions(+), 25 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 2422253..db8925f 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -1980,7 +1980,7 @@ static void handle_stripe5(struct stripe_head *sh) int i; int syncing, expanding, expanded; int locked=0, uptodate=0, to_read=0, to_write=0, failed=0, written=0; - int compute=0, req_compute=0, non_overwrite=0; + int to_fill=0, compute=0, req_compute=0, non_overwrite=0; int failed_num=0; struct r5dev *dev; unsigned long pending=0; @@ -2004,34 +2004,20 @@ static void handle_stripe5(struct stripe_head *sh) dev = >dev[i]; clear_bit(R5_Insync, >flags); - PRINTK("check %d: state 0x%lx read %p write %p written %p\n", - i, dev->flags, dev->toread, dev->towrite, dev->written); - /* maybe we can reply to a read */ + PRINTK("check %d: state 0x%lx toread %p read %p write %p written %p\n", + i, dev->flags, dev->toread, dev->read, dev->towrite, dev->written); + + /* maybe we can start a biofill operation */ if (test_bit(R5_UPTODATE, >flags) && dev->toread) { - struct bio *rbi, *rbi2; - PRINTK("Return read for disc %d\n", i); - spin_lock_irq(>device_lock); - rbi = dev->toread; - dev->toread = NULL; - if (test_and_clear_bit(R5_Overlap, >flags)) - wake_up(>wait_for_overlap); - spin_unlock_irq(>device_lock); - while (rbi && rbi->bi_sector < dev->sector + STRIPE_SECTORS) { - copy_data(0, rbi, dev->page, dev->sector); - rbi2 = r5_next_bio(rbi, dev->sector); - spin_lock_irq(>device_lock); - if (--rbi->bi_phys_segments == 0) { - rbi->bi_next = return_bi; - return_bi = rbi; - } - spin_unlock_irq(>device_lock); - rbi = rbi2; - } + to_read--; + if (!test_bit(STRIPE_OP_BIOFILL, >ops.pending)) + set_bit(R5_Wantfill, >flags); } /* now count some things */ if (test_bit(R5_LOCKED, >flags)) locked++; 
if (test_bit(R5_UPTODATE, >flags)) uptodate++; + if (test_bit(R5_Wantfill, >flags)) to_fill++; if (test_bit(R5_Wantcompute, >flags)) BUG_ON(++compute > 1); if (dev->toread) to_read++; @@ -2055,9 +2041,13 @@ static void handle_stripe5(struct stripe_head *sh) set_bit(R5_Insync, >flags); } rcu_read_unlock(); + + if (to_fill && !test_and_set_bit(STRIPE_OP_BIOFILL, >ops.pending)) + sh->ops.count++; + PRINTK("locked=%d uptodate=%d to_read=%d" - " to_write=%d failed=%d failed_num=%d\n", - locked, uptodate, to_read, to_write, failed, failed_num); + " to_write=%d to_fill=%d failed=%d failed_num=%d\n", + locked, uptodate, to_read, to_write, to_fill, failed, failed_num); /* check if the array has lost two devices and, if so, some requests might * need to be failed */ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 12/12] dmaengine: driver for the iop32x, iop33x, and iop13xx raid engines
From: Dan Williams <[EMAIL PROTECTED]> This is a driver for the iop DMA/AAU/ADMA units which are capable of pq_xor, pq_update, pq_zero_sum, xor, dual_xor, xor_zero_sum, fill, copy+crc, and copy operations. Changelog: * fixed a slot allocation bug in do_iop13xx_adma_xor that caused too few slots to be requested eventually leading to data corruption * enabled the slot allocation routine to attempt to free slots before returning -ENOMEM * switched the cleanup routine to solely use the software chain and the status register to determine if a descriptor is complete. This is necessary to support other IOP engines that do not have status writeback capability * make the driver iop generic * modified the allocation routines to understand allocating a group of slots for a single operation * added a null xor initialization operation for the xor only channel on iop3xx * support xor operations on buffers larger than the hardware maximum * split the do_* routines into separate prep, src/dest set, submit stages * added async_tx support (dependent operations initiation at cleanup time) * simplified group handling * added interrupt support (callbacks via tasklets) * brought the pending depth inline with ioat (i.e. 4 descriptors) Signed-off-by: Dan Williams <[EMAIL PROTECTED]> --- drivers/dma/Kconfig |8 drivers/dma/Makefile|1 drivers/dma/iop-adma.c | 1511 +++ include/asm-arm/hardware/iop_adma.h | 116 +++ 4 files changed, 1636 insertions(+), 0 deletions(-) diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig index c82ed5f..d61e3e5 100644 --- a/drivers/dma/Kconfig +++ b/drivers/dma/Kconfig @@ -41,4 +41,12 @@ config INTEL_IOATDMA default m ---help--- Enable support for the Intel(R) I/OAT DMA engine. + +config INTEL_IOP_ADMA +tristate "Intel IOP ADMA support" +depends on DMA_ENGINE && (ARCH_IOP32X || ARCH_IOP33X || ARCH_IOP13XX) +default m +---help--- + Enable support for the Intel(R) IOP Series RAID engines. 
+ endmenu diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile index 6a99341..8ebf10d 100644 --- a/drivers/dma/Makefile +++ b/drivers/dma/Makefile @@ -1,4 +1,5 @@ obj-$(CONFIG_DMA_ENGINE) += dmaengine.o obj-$(CONFIG_NET_DMA) += iovlock.o obj-$(CONFIG_INTEL_IOATDMA) += ioatdma.o +obj-$(CONFIG_INTEL_IOP_ADMA) += iop-adma.o obj-$(CONFIG_ASYNC_TX_DMA) += async_tx.o xor.o diff --git a/drivers/dma/iop-adma.c b/drivers/dma/iop-adma.c new file mode 100644 index 000..77f859e --- /dev/null +++ b/drivers/dma/iop-adma.c @@ -0,0 +1,1511 @@ +/* + * Copyright(c) 2006 Intel Corporation. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the Free + * Software Foundation; either version 2 of the License, or (at your option) + * any later version. + * + * This program is distributed in the hope that it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., 59 + * Temple Place - Suite 330, Boston, MA 02111-1307, USA. + * + * The full GNU General Public License is included in this distribution in the + * file called COPYING. 
+ */ + +/* + * This driver supports the asynchrounous DMA copy and RAID engines available + * on the Intel Xscale(R) family of I/O Processors (IOP 32x, 33x, 134x) + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define to_iop_adma_chan(chan) container_of(chan, struct iop_adma_chan, common) +#define to_iop_adma_device(dev) container_of(dev, struct iop_adma_device, common) +#define to_iop_adma_slot(lh) container_of(lh, struct iop_adma_desc_slot, slot_node) +#define tx_to_iop_adma_slot(tx) container_of(tx, struct iop_adma_desc_slot, async_tx) + +#define IOP_ADMA_DEBUG 0 +#define PRINTK(x...) ((void)(IOP_ADMA_DEBUG && printk(x))) + +/** + * iop_adma_free_slots - flags descriptor slots for reuse + * @slot: Slot to free + * Caller must hold _chan->lock while calling this function + */ +static inline void iop_adma_free_slots(struct iop_adma_desc_slot *slot) +{ + int stride = slot->stride; + + while (stride--) { + slot->stride = 0; + slot = list_entry(slot->slot_node.next, + struct iop_adma_desc_slot, + slot_node); + } +} + +static inline dma_cookie_t +iop_adma_run_tx_complete_actions(struct iop_adma_desc_slot *desc, + struct iop_adma_chan *iop_chan, dma_cookie_t cookie) +{ + BUG_ON(desc->async_tx.cookie < 0); + spin_lock_bh(>async_tx.lock); +
[PATCH 07/12] md: move raid5 parity checks to raid5_run_ops
From: Dan Williams <[EMAIL PROTECTED]> handle_stripe sets STRIPE_OP_CHECK to request a check operation in raid5_run_ops. If raid5_run_ops is able to perform the check with a dma engine the parity will be preserved in memory removing the need to re-read it from disk, as is necessary in the synchronous case. 'Repair' operations re-use the same logic as compute block, with the caveat that the results of the compute block are immediately written back to the parity disk. To differentiate these operations the STRIPE_OP_MOD_REPAIR_PD flag is added. Signed-off-by: Dan Williams <[EMAIL PROTECTED]> --- drivers/md/raid5.c | 81 1 files changed, 62 insertions(+), 19 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 279a30c..2422253 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -2411,32 +2411,75 @@ static void handle_stripe5(struct stripe_head *sh) locked += handle_write_operations5(sh, rcw, 0); } - /* maybe we need to check and possibly fix the parity for this stripe -* Any reads will already have been scheduled, so we just see if enough data -* is available + /* 1/ Maybe we need to check and possibly fix the parity for this stripe. +*Any reads will already have been scheduled, so we just see if enough data +*is available. 
+* 2/ Hold off parity checks while parity dependent operations are in flight +*(conflicting writes are protected by the 'locked' variable) */ - if (syncing && locked == 0 && - !test_bit(STRIPE_INSYNC, >state)) { + if ((syncing && locked == 0 && !test_bit(STRIPE_OP_COMPUTE_BLK, >ops.pending) && + !test_bit(STRIPE_INSYNC, >state)) || + test_bit(STRIPE_OP_CHECK, >ops.pending) || + test_bit(STRIPE_OP_MOD_REPAIR_PD, >ops.pending)) { + set_bit(STRIPE_HANDLE, >state); - if (failed == 0) { - BUG_ON(uptodate != disks); - compute_parity5(sh, CHECK_PARITY); - uptodate--; - if (page_is_zero(sh->dev[sh->pd_idx].page)) { - /* parity is correct (on disc, not in buffer any more) */ - set_bit(STRIPE_INSYNC, >state); - } else { - conf->mddev->resync_mismatches += STRIPE_SECTORS; - if (test_bit(MD_RECOVERY_CHECK, >mddev->recovery)) - /* don't try to repair!! */ + /* Take one of the following actions: +* 1/ start a check parity operation if (uptodate == disks) +* 2/ finish a check parity operation and act on the result +* 3/ skip to the writeback section if we previously +*initiated a recovery operation +*/ + if (failed == 0 && !test_bit(STRIPE_OP_MOD_REPAIR_PD, >ops.pending)) { + if (!test_and_set_bit(STRIPE_OP_CHECK, >ops.pending)) { + BUG_ON(uptodate != disks); + clear_bit(R5_UPTODATE, >dev[sh->pd_idx].flags); + sh->ops.count++; + uptodate--; + } else if (test_and_clear_bit(STRIPE_OP_CHECK, >ops.complete)) { + clear_bit(STRIPE_OP_CHECK, >ops.ack); + clear_bit(STRIPE_OP_CHECK, >ops.pending); + + if (sh->ops.zero_sum_result == 0) + /* parity is correct (on disc, not in buffer any more) */ set_bit(STRIPE_INSYNC, >state); else { - compute_block(sh, sh->pd_idx); - uptodate++; + conf->mddev->resync_mismatches += STRIPE_SECTORS; + if (test_bit(MD_RECOVERY_CHECK, >mddev->recovery)) + /* don't try to repair!! 
*/ + set_bit(STRIPE_INSYNC, >state); + else { + BUG_ON(test_and_set_bit( + STRIPE_OP_COMPUTE_BLK, + >ops.pending)); + set_bit(STRIPE_OP_MOD_REPAIR_PD, + >ops.pending); + BUG_ON(test_and_set_bit(R5_Wantcompute, + >dev[sh->pd_idx].flags)); +
[PATCH 09/12] md: use async_tx and raid5_run_ops for raid5 expansion operations
From: Dan Williams <[EMAIL PROTECTED]> The parity calculation for an expansion operation is the same as the calculation performed at the end of a write with the caveat that all blocks in the stripe are scheduled to be written. An expansion operation is identified as a stripe with the POSTXOR flag set and the BIODRAIN flag not set. The bulk copy operation to the new stripe is handled inline by async_tx. Signed-off-by: Dan Williams <[EMAIL PROTECTED]> --- drivers/md/raid5.c | 48 1 files changed, 36 insertions(+), 12 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index db8925f..1956b3c 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -2511,18 +2511,32 @@ static void handle_stripe5(struct stripe_head *sh) } } - if (expanded && test_bit(STRIPE_EXPANDING, >state)) { - /* Need to write out all blocks after computing parity */ - sh->disks = conf->raid_disks; - sh->pd_idx = stripe_to_pdidx(sh->sector, conf, conf->raid_disks); - compute_parity5(sh, RECONSTRUCT_WRITE); + /* Finish postxor operations initiated by the expansion +* process +*/ + if (test_bit(STRIPE_OP_POSTXOR, >ops.complete) && + !test_bit(STRIPE_OP_BIODRAIN, >ops.pending)) { + + clear_bit(STRIPE_EXPANDING, >state); + + clear_bit(STRIPE_OP_POSTXOR, >ops.pending); + clear_bit(STRIPE_OP_POSTXOR, >ops.ack); + clear_bit(STRIPE_OP_POSTXOR, >ops.complete); + for (i= conf->raid_disks; i--;) { - set_bit(R5_LOCKED, >dev[i].flags); - locked++; set_bit(R5_Wantwrite, >dev[i].flags); + if (!test_and_set_bit(STRIPE_OP_IO, >ops.pending)) + sh->ops.count++; } - clear_bit(STRIPE_EXPANDING, >state); - } else if (expanded) { + } + + if (expanded && test_bit(STRIPE_EXPANDING, >state) && + !test_bit(STRIPE_OP_POSTXOR, >ops.pending)) { + /* Need to write out all blocks after computing parity */ + sh->disks = conf->raid_disks; + sh->pd_idx = stripe_to_pdidx(sh->sector, conf, conf->raid_disks); + locked += handle_write_operations5(sh, 0, 1); + } else if (expanded && !test_bit(STRIPE_OP_POSTXOR, 
>ops.pending)) { clear_bit(STRIPE_EXPAND_READY, >state); atomic_dec(>reshape_stripes); wake_up(>wait_for_overlap); @@ -2533,6 +2547,7 @@ static void handle_stripe5(struct stripe_head *sh) /* We have read all the blocks in this stripe and now we need to * copy some of them into a target stripe for expand. */ + struct dma_async_tx_descriptor *tx = NULL; clear_bit(STRIPE_EXPAND_SOURCE, >state); for (i=0; i< sh->disks; i++) if (i != sh->pd_idx) { @@ -2556,9 +2571,12 @@ static void handle_stripe5(struct stripe_head *sh) release_stripe(sh2); continue; } - memcpy(page_address(sh2->dev[dd_idx].page), - page_address(sh->dev[i].page), - STRIPE_SIZE); + + /* place all the copies on one channel */ + tx = async_memcpy(sh2->dev[dd_idx].page, + sh->dev[i].page, 0, 0, STRIPE_SIZE, + ASYNC_TX_DEP_ACK, tx, NULL, NULL); + set_bit(R5_Expanded, >dev[dd_idx].flags); set_bit(R5_UPTODATE, >dev[dd_idx].flags); for (j=0; jraid_disks; j++) @@ -2570,6 +2588,12 @@ static void handle_stripe5(struct stripe_head *sh) set_bit(STRIPE_HANDLE, >state); } release_stripe(sh2); + + /* done submitting copies, wait for them to complete */ + if (i + 1 >= sh->disks) { + async_tx_ack(tx); + dma_wait_for_async_tx(tx); + } } }
[PATCH 02/12] dmaengine: add the async_tx api
From: Dan Williams <[EMAIL PROTECTED]> async_tx is an api to describe a series of bulk memory transfers/transforms. When possible these transactions are carried out by asynchrounous dma engines. The api handles inter-transaction dependencies and hides dma channel management from the client. When a dma engine is not present the transaction is carried out via synchronous software routines. Xor operations are handled by async_tx, to this end xor.c is moved into drivers/dma and is changed to take an explicit destination address and a series of sources to match the hardware engine implementation. When CONFIG_DMA_ENGINE is not set the asynchrounous path is compiled away. Changelog: * fixed a leftover debug print * don't allow callbacks in async_interrupt_cond * fixed xor_block changes * fixed usage of ASYNC_TX_XOR_DROP_DEST Signed-off-by: Dan Williams <[EMAIL PROTECTED]> --- drivers/Makefile |1 drivers/dma/Kconfig | 16 + drivers/dma/Makefile |1 drivers/dma/async_tx.c | 910 ++ drivers/dma/xor.c| 153 drivers/md/Kconfig |2 drivers/md/Makefile |6 drivers/md/raid5.c | 52 +-- drivers/md/xor.c | 154 include/linux/async_tx.h | 180 + include/linux/raid/xor.h |5 11 files changed, 1291 insertions(+), 189 deletions(-) diff --git a/drivers/Makefile b/drivers/Makefile index 0dd96d1..7d55837 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -61,6 +61,7 @@ obj-$(CONFIG_I2C) += i2c/ obj-$(CONFIG_W1) += w1/ obj-$(CONFIG_HWMON)+= hwmon/ obj-$(CONFIG_PHONE)+= telephony/ +obj-$(CONFIG_ASYNC_TX_DMA) += dma/ obj-$(CONFIG_MD) += md/ obj-$(CONFIG_BT) += bluetooth/ obj-$(CONFIG_ISDN) += isdn/ diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig index 30d021d..c82ed5f 100644 --- a/drivers/dma/Kconfig +++ b/drivers/dma/Kconfig @@ -7,8 +7,8 @@ menu "DMA Engine support" config DMA_ENGINE bool "Support for DMA engines" ---help--- - DMA engines offload copy operations from the CPU to dedicated - hardware, allowing the copies to happen asynchronously. 
+ DMA engines offload bulk memory operations from the CPU to dedicated + hardware, allowing the operations to happen asynchronously. comment "DMA Clients" @@ -22,6 +22,17 @@ config NET_DMA Since this is the main user of the DMA engine, it should be enabled; say Y here. +config ASYNC_TX_DMA + tristate "Asynchronous Bulk Memory Transfers/Transforms API" + default y + ---help--- + This enables the async_tx management layer for dma engines. + Subsystems coded to this API will use offload engines for bulk + memory operations where present. Software implementations are + called when a dma engine is not present or fails to allocate + memory to carry out the transaction. + Current subsystems ported to async_tx: MD_RAID4,5 + comment "DMA Devices" config INTEL_IOATDMA @@ -30,5 +41,4 @@ config INTEL_IOATDMA default m ---help--- Enable support for the Intel(R) I/OAT DMA engine. - endmenu diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile index bdcfdbd..6a99341 100644 --- a/drivers/dma/Makefile +++ b/drivers/dma/Makefile @@ -1,3 +1,4 @@ obj-$(CONFIG_DMA_ENGINE) += dmaengine.o obj-$(CONFIG_NET_DMA) += iovlock.o obj-$(CONFIG_INTEL_IOATDMA) += ioatdma.o +obj-$(CONFIG_ASYNC_TX_DMA) += async_tx.o xor.o diff --git a/drivers/dma/async_tx.c b/drivers/dma/async_tx.c new file mode 100644 index 000..eee208d --- /dev/null +++ b/drivers/dma/async_tx.c @@ -0,0 +1,910 @@ +/* + * Copyright(c) 2006 Intel Corporation. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the Free + * Software Foundation; either version 2 of the License, or (at your option) + * any later version. + * + * This program is distributed in the hope that it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. 
+ * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., 59 + * Temple Place - Suite 330, Boston, MA 02111-1307, USA. + * + * The full GNU General Public License is included in this distribution in the + * file called COPYING. + */ +#include +#include +#include +#include +#include + +#define ASYNC_TX_DEBUG 0 +#define PRINTK(x...) ((void)(ASYNC_TX_DEBUG && printk(x))) + +#ifdef CONFIG_DMA_ENGINE +static struct dma_client *async_api_client; +static struct async_channel_entry async_channel_directory[] = { +
[PATCH 01/12] dmaengine: add base support for the async_tx api
From: Dan Williams <[EMAIL PROTECTED]> * introduce struct dma_async_tx_descriptor as a common field for all dmaengine software descriptors * convert the device_memcpy_* methods into separate prep, set src/dest, and submit stages * support capabilities beyond memcpy (xor, memset, xor zero sum, completion interrupts) * convert ioatdma to the new semantics Signed-off-by: Dan Williams <[EMAIL PROTECTED]> --- drivers/dma/dmaengine.c | 44 ++-- drivers/dma/ioatdma.c | 256 ++-- drivers/dma/ioatdma.h |8 + include/linux/dmaengine.h | 263 ++--- 4 files changed, 394 insertions(+), 177 deletions(-) diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c index 1527804..8d203ad 100644 --- a/drivers/dma/dmaengine.c +++ b/drivers/dma/dmaengine.c @@ -210,7 +210,8 @@ static void dma_chans_rebalance(void) mutex_lock(_list_mutex); list_for_each_entry(client, _client_list, global_node) { - while (client->chans_desired > client->chan_count) { + while (client->chans_desired < 0 || + client->chans_desired > client->chan_count) { chan = dma_client_chan_alloc(client); if (!chan) break; @@ -219,7 +220,8 @@ static void dma_chans_rebalance(void) chan, DMA_RESOURCE_ADDED); } - while (client->chans_desired < client->chan_count) { + while (client->chans_desired >= 0 && + client->chans_desired < client->chan_count) { spin_lock_irqsave(>lock, flags); chan = list_entry(client->channels.next, struct dma_chan, @@ -294,12 +296,12 @@ void dma_async_client_unregister(struct dma_client *client) * @number: count of DMA channels requested * * Clients call dma_async_client_chan_request() to specify how many - * DMA channels they need, 0 to free all currently allocated. + * DMA channels they need, 0 to free all currently allocated. A request + * < 0 indicates the client wants to handle all engines in the system. * The resulting allocations/frees are indicated to the client via the * event callback. 
*/ -void dma_async_client_chan_request(struct dma_client *client, - unsigned int number) +void dma_async_client_chan_request(struct dma_client *client, int number) { client->chans_desired = number; dma_chans_rebalance(); @@ -318,6 +320,31 @@ int dma_async_device_register(struct dma_device *device) if (!device) return -ENODEV; + /* validate device routines */ + BUG_ON(test_bit(DMA_MEMCPY, >capabilities) && + !device->device_prep_dma_memcpy); + BUG_ON(test_bit(DMA_XOR, >capabilities) && + !device->device_prep_dma_xor); + BUG_ON(test_bit(DMA_ZERO_SUM, >capabilities) && + !device->device_prep_dma_zero_sum); + BUG_ON(test_bit(DMA_MEMSET, >capabilities) && + !device->device_prep_dma_memset); + BUG_ON(test_bit(DMA_ZERO_SUM, >capabilities) && + !device->device_prep_dma_interrupt); + + BUG_ON(!device->device_alloc_chan_resources); + BUG_ON(!device->device_free_chan_resources); + BUG_ON(!device->device_tx_submit); + BUG_ON(!device->device_set_dest); + BUG_ON(!device->device_set_src); + BUG_ON(!device->device_dependency_added); + BUG_ON(!device->device_is_tx_complete); + BUG_ON(!device->map_page); + BUG_ON(!device->map_single); + BUG_ON(!device->unmap_page); + BUG_ON(!device->unmap_single); + BUG_ON(!device->device_issue_pending); + init_completion(>done); kref_init(>refcount); device->dev_id = id++; @@ -402,11 +429,8 @@ subsys_initcall(dma_bus_init); EXPORT_SYMBOL(dma_async_client_register); EXPORT_SYMBOL(dma_async_client_unregister); EXPORT_SYMBOL(dma_async_client_chan_request); -EXPORT_SYMBOL(dma_async_memcpy_buf_to_buf); -EXPORT_SYMBOL(dma_async_memcpy_buf_to_pg); -EXPORT_SYMBOL(dma_async_memcpy_pg_to_pg); -EXPORT_SYMBOL(dma_async_memcpy_complete); -EXPORT_SYMBOL(dma_async_memcpy_issue_pending); +EXPORT_SYMBOL(dma_async_is_tx_complete); +EXPORT_SYMBOL(dma_async_issue_pending); EXPORT_SYMBOL(dma_async_device_register); EXPORT_SYMBOL(dma_async_device_unregister); EXPORT_SYMBOL(dma_chan_cleanup); diff --git a/drivers/dma/ioatdma.c b/drivers/dma/ioatdma.c index 
8e87261..70bdd18 100644 --- a/drivers/dma/ioatdma.c +++ b/drivers/dma/ioatdma.c @@ -31,6 +31,7 @@ #include #include #include +#include #include "ioatdma.h" #include "ioatdma_io.h" #include "ioatdma_registers.h" @@ -39,6 +40,7 @@ #define to_ioat_chan(chan) container_of(chan, struct ioat_dma_chan, common) #define
Re: Why active list and inactive list?
On Tue, 23 Jan 2007, Balbir Singh wrote: > When you unmap or map, you need to touch the pte entries and know the > pages involved, so shouldn't it be equivalent to a list_del and list_add > for each page impacted by the map/unmap operation? When you unmap and map you must currently get exclusive access to the cachelines of the pte and the cacheline of the page struct. If we use a list_move on page->lru then we would have to update pointers in up to 4 other page structs. Thus we need exclusive access to 4 additional cachelines. This triples the number of cachelines touched. Instead of 2 cachelines we need 6. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Why active list and inactive list?
Christoph Lameter wrote: On Tue, 23 Jan 2007, Balbir Singh wrote: This makes me wonder if it makes sense to split up the LRU into page cache LRU and mapped pages LRU. I see two benefits 1. Currently based on swappiness, we might walk an entire list searching for page cache pages or mapped pages. With these lists separated, it should get easier and faster to implement this scheme 2. There is another parallel thread on implementing page cache limits. If the lists split out, we need not scan the entire list to find page cache pages to evict them. Of course I might be missing something (some piece of history) This means page cache = unmapped file backed page right? Otherwise this would not work. I always thought that the page cache were all file backed pages both mapped and unmapped. Yes, unfortunately my terminology was not clear. I mean unmapped file backed pages. With the proposed scheme you would have to move pages between lists if they are mapped and unmapped by a process. Terminating a process could lead to lots of pages moving to the unmapped list. When you unmap or map, you need to touch the pte entries and know the pages involved, so shouldn't it be equivalent to a list_del and list_add for each page impacted by the map/unmap operation? -- Balbir Singh Linux Technology Center IBM, ISTL
Re: Linux v2.6.20-rc5
Jeff Chua <[EMAIL PROTECTED]> wrote: > > > From: Jeff Chua <[EMAIL PROTECTED]> > >> CC [M] drivers/kvm/vmx.o >> {standard input}: Assembler messages: >> {standard input}:3257: Error: bad register name `%sil' >> make[2]: *** [drivers/kvm/vmx.o] Error 1 >> make[1]: *** [drivers/kvm] Error 2 >> make: *** [drivers] Error 2 > > I'm not using the kernel profiler, so here's a patch to make it work without > CONFIG_PROFILING. Actually that only happens to work by chance (by making one of al/bl/cl/dl available). This patch should fix it properly. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c index ce219e3..0aa2659 100644 --- a/drivers/kvm/vmx.c +++ b/drivers/kvm/vmx.c @@ -1824,7 +1824,7 @@ again: #endif "setbe %0 \n\t" "popf \n\t" - : "=g" (fail) + : "=q" (fail) : "r"(vcpu->launched), "d"((unsigned long)HOST_RSP), "c"(vcpu), [rax]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RAX])), - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/5] KVM: Fix asm constraints with CONFIG_FRAME_POINTER=n
Avi Kivity <[EMAIL PROTECTED]> wrote: > A "g" constraint may place a local variable in an %rsp-relative memory > operand. > but if your assembly changes %rsp, the operand points to the wrong location. > > An "r" constraint fixes that. > > Thanks to Ingo Molnar for neatly bisecting the problem. > > Signed-off-by: Avi Kivity <[EMAIL PROTECTED]> > > Index: linux-2.6/drivers/kvm/vmx.c > === > --- linux-2.6.orig/drivers/kvm/vmx.c > +++ linux-2.6/drivers/kvm/vmx.c > @@ -1825,7 +1825,7 @@ again: > #endif >"setbe %0 \n\t" >"popf \n\t" > - : "=g" (fail) > + : "=r" (fail) > : "r"(vcpu->launched), "d"((unsigned long)HOST_RSP), >"c"(vcpu), >[rax]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RAX])), We need the following fix for 2.6.20. [KVM] vmx: Fix register constraint in launch code Both "=r" and "=g" breaks my build on i386: $ make CC [M] drivers/kvm/vmx.o {standard input}: Assembler messages: {standard input}:3318: Error: bad register name `%sil' make[1]: *** [drivers/kvm/vmx.o] Error 1 make: *** [_module_drivers/kvm] Error 2 The reason is that setbe requires an 8-bit register but "=r" does not constrain the target register to be one that has an 8-bit version on i386. According to http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10153 the correct constraint is "=q". 
Signed-off-by: Herbert Xu <[EMAIL PROTECTED]> Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c index ce219e3..0aa2659 100644 --- a/drivers/kvm/vmx.c +++ b/drivers/kvm/vmx.c @@ -1824,7 +1824,7 @@ again: #endif "setbe %0 \n\t" "popf \n\t" - : "=g" (fail) + : "=q" (fail) : "r"(vcpu->launched), "d"((unsigned long)HOST_RSP), "c"(vcpu), [rax]"i"(offsetof(struct kvm_vcpu, regs[VCPU_REGS_RAX])), - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH query] arm: i.MX/MX1 clock event source
On Monday 22 January 2007 20:59, Ingo Molnar wrote: > * Pavel Pisa <[EMAIL PROTECTED]> wrote: > > Hello Thomas, Sascha and Ingo > > > > please can you find some time to review next patch > > arm: i.MX/MX1 clock event source > > which has been sent to you and to the ALKML at 2007-01-13. > > > > http://thread.gmane.org/gmane.linux.ports.arm.kernel/29510/focus=29533 > > > > There seems to be some problems, because this patch has not been > > accepted to patch-2.6.20-rc5-rt7.patch, but GENERIC_CLOCKEVENTS are > > set already for i.MX and this results in a problems to run RT kernel > > on this architecture. > > i've added your patch to -rt, but note that there's a new, slightly > incompatible clockevents code in -rt now so you'll need to do some more > (hopefully trivial) fixups for this to build and work. > > Ingo Hello Ingo, thanks for reply. I am attaching updated version of the patch at the end of e-mail. There is problem with missing include in tick-sched.c CC kernel/time/tick-sched.o /usr/src/linux-2.6.20-rc5/kernel/time/tick-sched.c: In function `tick_nohz_handler': /usr/src/linux-2.6.20-rc5/kernel/time/tick-sched.c:330: warning: implicit declaration of function `get_irq_regs' /usr/src/linux-2.6.20-rc5/kernel/time/tick-sched.c:330: warning: initialization makes pointer from integer without a cast /usr/src/linux-2.6.20-rc5/kernel/time/tick-sched.c: In function `tick_sched_timer': /usr/src/linux-2.6.20-rc5/kernel/time/tick-sched.c:425: warning: initialization makes pointer from integer without a cast LD kernel/time/built-in.o --- linux-2.6.20-rc5.orig/kernel/time/tick-sched.c +++ linux-2.6.20-rc5/kernel/time/tick-sched.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "tick-internal.h" And CC arch/arm/kernel/process.o /usr/src/linux-2.6.20-rc5/arch/arm/kernel/process.c: In function `cpu_idle': /usr/src/linux-2.6.20-rc5/arch/arm/kernel/process.c:157: warning: implicit declaration of function `hrtimer_stop_sched_tick' 
/usr/src/linux-2.6.20-rc5/arch/arm/kernel/process.c:161: warning: implicit declaration of function `hrtimer_restart_sched_tick' --- linux-2.6.20-rc5.orig/arch/arm/kernel/process.c +++ linux-2.6.20-rc5/arch/arm/kernel/process.c @@ -154,11 +154,11 @@ void cpu_idle(void) if (!idle) idle = default_idle; leds_event(led_idle_start); - hrtimer_stop_sched_tick(); + tick_nohz_stop_sched_tick(); while (!need_resched() && !need_resched_delayed()) idle(); leds_event(led_idle_end); - hrtimer_restart_sched_tick(); + tick_nohz_restart_sched_tick(); local_irq_disable(); __preempt_enable_no_resched(); __schedule(); Unfortunately, even with these corrections boot stuck at Memory: 18972KB available (2488K code, 358K data, 92K init) I have not time now to start JTAG debugging session, so I look at that tomorrow or on Friday. It seems, that the interrupts are not coming from device. Best wishes Pavel == Subject: arm: i.MX/MX1 clock event source Support clock event source based on i.MX general purpose timer in free running timer mode. 
Signed-off-by: Pavel Pisa <[EMAIL PROTECTED]> arch/arm/mach-imx/time.c | 112 --- 1 file changed, 107 insertions(+), 5 deletions(-) Index: linux-2.6.20-rc5/arch/arm/mach-imx/time.c === --- linux-2.6.20-rc5.orig/arch/arm/mach-imx/time.c +++ linux-2.6.20-rc5/arch/arm/mach-imx/time.c @@ -15,6 +15,9 @@ #include #include #include +#ifdef CONFIG_GENERIC_CLOCKEVENTS +#include +#endif #include #include @@ -25,6 +28,11 @@ /* Use timer 1 as system timer */ #define TIMER_BASE IMX_TIM1_BASE +#ifdef CONFIG_GENERIC_CLOCKEVENTS +static struct clock_event_device clockevent_imx; +static enum clock_event_mode clockevent_mode = CLOCK_EVT_MODE_PERIODIC; +#endif + static unsigned long evt_diff; /* @@ -33,6 +41,7 @@ static unsigned long evt_diff; static irqreturn_t imx_timer_interrupt(int irq, void *dev_id) { + unsigned long tcmp; uint32_t tstat; /* clear the interrupt */ @@ -42,13 +51,20 @@ imx_timer_interrupt(int irq, void *dev_i if (tstat & TSTAT_COMP) { do { +#ifdef CONFIG_GENERIC_CLOCKEVENTS + if (clockevent_imx.event_handler) + clockevent_imx.event_handler(_imx); + if (likely(clockevent_mode != CLOCK_EVT_MODE_PERIODIC)) + break; +#else write_seqlock(_lock); timer_tick(); write_sequnlock(_lock); - IMX_TCMP(TIMER_BASE) += evt_diff; +#endif +
Re: Why active list and inactive list?
Christoph Lameter wrote: On Mon, 22 Jan 2007, Rik van Riel wrote: The big one is how we are to do some background aging in a clock-pro system, so referenced bits don't just pile up when the VM has enough memory - otherwise we might not know the right pages to evict when a new process starts up and starts allocating lots of memory. There are two bad choices right? 1. Scan for reference bits Bad because we may have to scan quite a bit without too much result. LRU allows us to defer this until memory is tight. Any such scan will pollute the cache and cause a stall of the app. You really do not want this for a realtime system. 2. Take faults on reference and update the page state. Bad because this means a fault if the reference bit has not been set. Could be many faults. Clock pro really requires 2 right? So lots of additional page faults? Nope, the faults are not required. I suspect you're confused with the part where it keeps track of recently evicted (not resident in RAM at all) pages. That kind of info is common in database replacement schemes, but not in general purpose OS memory management. -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
Re: Why active list and inactive list?
On Mon, 22 Jan 2007, Rik van Riel wrote: > The big one is how we are to do some background aging in a > clock-pro system, so referenced bits don't just pile up when > the VM has enough memory - otherwise we might not know the > right pages to evict when a new process starts up and starts > allocating lots of memory. There are two bad choices right? 1. Scan for reference bits Bad because we may have to scan quite a bit without too much result. LRU allows us to defer this until memory is tight. Any such scan will pollute the cache and cause a stall of the app. You really do not want this for a realtime system. 2. Take faults on reference and update the page state. Bad because this means a fault if the reference bit has not been set. Could be many faults. Clock pro really requires 2 right? So lots of additional page faults?
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.22 19:24:22 -0600, Robert Hancock wrote: > Björn Steinbrink wrote: > >>>Running a kernel with the return statement replace by a line that prints > >>>the irq_stat instead. > >>> > >>>Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2. > >>40 minutes stress test now and no exception yet. What's interesting is > >>that ata1 saw exactly one interrupt with irq_stat 0x0, all others that > >>might have get dropped are as above. > >>I'll keep it running for some time and will then re-enable the return > >>statement to see if there's a relation between the irq_stat 0x0 and the > >>exception. > > > >No, doesn't seem to be related, did get 2 exceptions, but no irq_stat > >0x0 for ata1. Syslog/dmesg has nothing new either, still the same > >pattern of dismissed irq_stats. > > I've finally managed to reproduce this problem on my box, by doing: > > watch --interval=0.1 /sbin/hdparm -I /dev/sda > > on one drive and then running bonnie++ on /dev/sdb connected to the > other port on the same controller device. Usually within a few minutes > one of the IDENTIFY commands would time out in the same way you guys > have been seeing. > > Through some various trials and tribulations, the only conclusion I can > come to is that this controller really doesn't like that > NV_INT_STATUS_CK804 register being looked at in ADMA mode. I tried > adding some debug code to the qc_issue function that would check to see > if the BUSY flag in altstatus went high or that register showed an > interrupt within a certain time afterwards, however that really seemed > to hose things, the system wouldn't even boot. Hm, I don't think it is unhappy about looking at NV_INT_STATUS_CK804. I'm running 2.6.20-rc5 with the INT_DEV check removed for 8 hours now without a single problem and that should still look at NV_INT_STATUS_CK804, right? I just noticed that my last email might not have been clear enough. 
The exceptions happened when I re-enabled the return statement in addition to the debug message. Without the INT_DEV check, it is completely fine AFAICT. > Try out this patch, it just calls the ata_host_intr function where > appropriate without using nv_host_intr which looks at the > NV_INT_STATUS_CK804 register. This is what the original ADMA patch from > Mr. Mysterious NVIDIA Person did, I'm guessing there may be a reason for > that. With this patch I can get through a whole bonnie++ run with the > repeated IDENTIFY requests running without seeing the error. I'll see if I can schedule a test run for tomorrow, I currently need this box. Thanks, Björn
Re: Why active list and inactive list?
Christoph Lameter wrote: On Mon, 22 Jan 2007, Rik van Riel wrote: It would be really nice if we came up with a page replacement algorithm that did not need many extra heuristics to make it work... I guess the "clock" type algorithms are the most promising in that area. What happened to all those advanced page replacement endeavors? What is the most promising of those? You seem to have done a lot of work on those. CLOCK-Pro seems the most promising algorithm, because it can act well both as a first level cache (operating system running applications) and as a second level cache (operating system running as a file server), because it tracks both recency and frequency well. However, there are a few unanswered questions on clock-pro. The big one is how we are to do some background aging in a clock-pro system, so referenced bits don't just pile up when the VM has enough memory - otherwise we might not know the right pages to evict when a new process starts up and starts allocating lots of memory. At least we've solved the problems of keeping track of the recently evicted pages in a cheap way, and balancing the pressure/hotness of different caches against each other. -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
Re: Why active list and inactive list?
On Tue, Jan 23, 2007 at 07:01:33AM +0530, Balbir Singh wrote: > This makes me wonder if it makes sense to split up the LRU into page > cache LRU and mapped pages LRU. I see two benefits > 1. Currently based on swappiness, we might walk an entire list >searching for page cache pages or mapped pages. With these >lists separated, it should get easier and faster to implement >this scheme When I tried that a long time ago, I recall I had troubles, but there wasn't the reclaim_mapped based on static values so it was even harder. However it would still be a problem today to decide when to switch from the unmapped to the mapped lru. When reclaim_mapped is set, you'll still have to shrink some unmapped page, and by splitting you literally lose age information to save some cpu. Eventually you risk spending time trying to free unfreeable pinned pages that sit in the unmapped list before finally jumping to the mapped list. So you have to add yet another list to get rid of the pinned stuff in the unmapped list, and I stopped when I had to refile pages from the "pinned" list to the unmapped list in irq I/O completion context; now it's all spin_lock_irq so it would be more natural at least... > 2. There is another parallel thread on implementing page cache >limits. If the lists split out, we need not scan the entire >list to find page cache pages to evict them. BTW I'm unsure about the cache limit thread; the overhead of the vm collection shouldn't be an issue, and those tend to hide vm inefficiencies. For example Neil has a patch to reduce the writeback cache to 10M-50M (much lower than the current 1% minimum) to hide huge unfairness in the writeback cache. I think they should mount the fs with -o sync instead of using that patch until the unfairness is fixed or tunable. The patch itself is fine, though for that problem it only looks like a workaround.
So I at least try to be always quite skeptical when I hear about cache "fixed size limiting" patches that improve responsiveness or performance ;) > Of course I might be missing something (some piece of history) Partly ;) Code was very different back then, today it would be easier thanks to reclaim_mapped but the partial loss of age information and potential loss of cpu in a pinned walk would probably remain.
Re: Kernel 2.6.19.2 New RAID 5 Bug (oops when writing Samba -> RAID5)
On Monday January 22, [EMAIL PROTECTED] wrote: > On 1/22/07, Neil Brown <[EMAIL PROTECTED]> wrote: > > On Monday January 22, [EMAIL PROTECTED] wrote: > > > Justin Piszcz wrote: > > > > My .config is attached, please let me know if any other information is > > > > needed and please CC (lkml) as I am not on the list, thanks! > > > > > > > > Running Kernel 2.6.19.2 on a MD RAID5 volume. Copying files over Samba > > > > to > > > > the RAID5 running XFS. > > > > > > > > Any idea what happened here? > > > > > > > > > Without digging too deeply, I'd say you've hit the same bug Sami Farin > > > and others > > > have reported starting with 2.6.19: pages mapped with kmap_atomic() > > > become unmapped > > > during memcpy() or similar operations. Try disabling preempt -- that > > > seems to be the > > > common factor. > > > > That is exactly the conclusion I had just come to (a kmap_atomic page > > must be being unmapped during memcpy). I wasn't aware that others had > > reported it - thanks for that. > > > > Turning off CONFIG_PREEMPT certainly seems like a good idea. > > > Coming from an ARM background I am not yet versed in the inner > workings of kmap_atomic, but if you have time for a question I am > curious as to why spin_lock(>lock) is not sufficient pre-emption > protection for copy_data() in this case? > Presumably there is a bug somewhere. kmap_atomic itself calls inc_preempt_count so that preemption should be disabled at least until the kunmap_atomic is called. But apparently not. The symptoms point exactly to the page getting unmapped when it shouldn't. Until that bug is found and fixed, the work around of turning of CONFIG_PREEMPT seems to make sense. Of course it would be great if someone who can easily reproduce this bug could do the 'git bisect' thing to find out where the bug crept in. 
NeilBrown
Re: Why active list and inactive list?
On Mon, 22 Jan 2007, Rik van Riel wrote: > It would be really nice if we came up with a page replacement > algorithm that did not need many extra heuristics to make it > work... I guess the "clock" type algorithms are the most promising in that area. What happened to all those advanced page replacement endeavors? What is the most promising of those? You seem to have done a lot of work on those.
Re: Why active list and inactive list?
Christoph Lameter wrote: With the proposed scheme you would have to move pages between lists if they are mapped and unmapped by a process. Terminating a process could lead to lots of pages moving to the unmapped list. That could be a problem. Another problem is that any such heuristic in the VM is bound to have corner cases that some workloads will hit. It would be really nice if we came up with a page replacement algorithm that did not need many extra heuristics to make it work... -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
Peek at environment of procs
Hi, what is the preferred way to get at another process's environment variables? /proc/$$/environ looks like the most portable way [across all arches Linux runs on], but it cannot easily be mmap'ed because the size is not known. In fact, mmap does not seem to work at all on that file. So I would have to allocate a large buffer (4K is the limit for procfs files AFAICR) to potentially hold big environments, which does not sound really wise either. Or is it the best choice available? -`J' --
Re: PROBLEM: KB->KiB, MB -> MiB, ... (IEC 60027-2)
On Jan 23 2007 02:04, Krzysztof Halasa wrote: >Andreas Schwab <[EMAIL PROTECTED]> writes: > >> But other than the sector size there is no natural power of 2 connected to >> disk size. A disk can have any odd number of sectors. > >But the manufacturers don't count in sectors. > >It should be consistent, though. "How many GB of disk space do you >need to store 2 GB of USB flash, and how many to store 2 GB RAM image"? Here's the marketing gap a company could jump in: "first to count in real GB" -`J' --
Re: Why active list and inactive list?
Balbir Singh wrote: This makes me wonder if it makes sense to split up the LRU into page cache LRU and mapped pages LRU. I see two benefits Unlikely. I have seen several workloads fall over because they did not throw out mapped pages soon enough. If the kernel does not keep the most frequently accessed pages resident, hit rates will suffer. Sometimes (well, usually) those are the mapped pages, but this is not true in all workloads. Some workloads are very page cache heavy and it pays to keep the more frequently accessed page cache pages resident while discarding the less frequently accessed ones. Since memory size has increased a lot more than disk speed over the last decade (and this is likely to continue for the next decades), the quality of page replacement algorithms is likely to become more and more important over time. 1. Currently based on swappiness, we might walk an entire list searching for page cache pages or mapped pages. With these lists separated, it should get easier and faster to implement this scheme How do you classify a mapped page cache page? Another issue is that you'll want to make sure that the page cache pages that are referenced more frequently than the least referenced mapped (I assume you mean anonymous?) pages in memory, while swapping out those least used anonymous pages. One way to do this could be to compare the scan rates, list sizes and referenced percentage of both lists, to find out which of the two caches is hotter. 2. There is another parallel thread on implementing page cache limits. If the lists split out, we need not scan the entire list to find page cache pages to evict them. If the lists split out, there is no reason to limit the page cache size because you can easily reclaim them. Right? Of course I might be missing something (some piece of history) http://linux-mm.org/AdvancedPageReplacement -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. 
Each group calls the other unpatriotic.
Re: Kernel 2.6.19.2 New RAID 5 Bug (oops when writing Samba -> RAID5)
On 1/22/07, Neil Brown <[EMAIL PROTECTED]> wrote: On Monday January 22, [EMAIL PROTECTED] wrote: > Justin Piszcz wrote: > > My .config is attached, please let me know if any other information is > > needed and please CC (lkml) as I am not on the list, thanks! > > > > Running Kernel 2.6.19.2 on a MD RAID5 volume. Copying files over Samba to > > the RAID5 running XFS. > > > > Any idea what happened here? > > > Without digging too deeply, I'd say you've hit the same bug Sami Farin > and others > have reported starting with 2.6.19: pages mapped with kmap_atomic() > become unmapped > during memcpy() or similar operations. Try disabling preempt -- that > seems to be the > common factor. That is exactly the conclusion I had just come to (a kmap_atomic page must be being unmapped during memcpy). I wasn't aware that others had reported it - thanks for that. Turning off CONFIG_PREEMPT certainly seems like a good idea. Coming from an ARM background I am not yet versed in the inner workings of kmap_atomic, but if you have time for a question I am curious as to why spin_lock(>lock) is not sufficient pre-emption protection for copy_data() in this case? NeilBrown Regards, Dan
Re: SATA exceptions with 2.6.20-rc5
Alistair John Strachan wrote: On Tuesday 23 January 2007 01:24, Robert Hancock wrote: As a final aside, this is another case where the hardware docs for this controller would really be useful, in order to know whether we are actually supposed to be reading that register in ADMA mode or not. I sent a query to Allen Martin at NVIDIA asking if there's a way I could get access to the documents, but I haven't heard anything yet. Obviously, NVIDIA's response is disappointing, but thank you for putting the time in to debug this problem. Definitely sounds like a hardware defect, I'm just glad there's a workaround. Will we see this fix in 2.6.20? Hopefully, assuming it actually does fix the problem for those that have been seeing it.. -- Robert Hancock Saskatoon, SK, Canada To email, remove "nospam" from [EMAIL PROTECTED] Home Page: http://www.roberthancock.com/
Re: Why active list and inactive list?
On Tue, 23 Jan 2007, Balbir Singh wrote: > This makes me wonder if it makes sense to split up the LRU into page > cache LRU and mapped pages LRU. I see two benefits > > 1. Currently based on swappiness, we might walk an entire list > searching for page cache pages or mapped pages. With these > lists separated, it should get easier and faster to implement > this scheme > 2. There is another parallel thread on implementing page cache > limits. If the lists split out, we need not scan the entire > list to find page cache pages to evict them. > > Of course I might be missing something (some piece of history) This means page cache = unmapped file backed page, right? Otherwise this would not work. I always thought that the page cache was all file backed pages, both mapped and unmapped. With the proposed scheme you would have to move pages between lists as they are mapped and unmapped by a process. Terminating a process could lead to lots of pages moving to the unmapped list.
[PATCH] 9p: null terminate error strings for debug print
From: Eric Van Hensbergen <[EMAIL PROTECTED]>

We weren't properly NULL terminating protocol error strings for our
debug printk, resulting in garbage being included in the output when
debug was enabled.

Signed-off-by: Eric Van Hensbergen <[EMAIL PROTECTED]>
---
 fs/9p/error.c | 1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/9p/error.c b/fs/9p/error.c
index ae91555..0d7fa4e 100644
--- a/fs/9p/error.c
+++ b/fs/9p/error.c
@@ -83,6 +83,7 @@ int v9fs_errstr2errno(char *errstr, int len)
 	if (errno == 0) {
 		/* TODO: if error isn't found, add it dynamically */
+		errstr[len] = 0;
 		printk(KERN_ERR "%s: errstr :%s: not found\n",
 			__FUNCTION__, errstr);
 		errno = 1;
-- 
1.5.0.rc1.gde38
[PATCH] 9p: fix segfault caused by race condition in meta-data operations
From: Eric Van Hensbergen <[EMAIL PROTECTED]> - unquoted Running dbench multithreaded exposed a race condition where fid structures were removed while in use. This patch adds semaphores to meta-data operations to protect the fid structure. Some cleanup of error-case handling in the inode operations is also included. Signed-off-by: Eric Van Hensbergen <[EMAIL PROTECTED]> --- fs/9p/fid.c | 69 +- fs/9p/fid.h |5 ++ fs/9p/vfs_file.c | 47 ++-- fs/9p/vfs_inode.c | 204 ++-- 4 files changed, 196 insertions(+), 129 deletions(-) diff --git a/fs/9p/fid.c b/fs/9p/fid.c index 2750720..a9b6301 100644 --- a/fs/9p/fid.c +++ b/fs/9p/fid.c @@ -25,6 +25,7 @@ #include #include #include +#include #include "debug.h" #include "v9fs.h" @@ -84,6 +85,7 @@ struct v9fs_fid *v9fs_fid_create(struct v9fs_session_info *v9ses, int fid) new->iounit = 0; new->rdir_pos = 0; new->rdir_fcall = NULL; + init_MUTEX(>lock); INIT_LIST_HEAD(>list); return new; @@ -102,11 +104,11 @@ void v9fs_fid_destroy(struct v9fs_fid *fid) } /** - * v9fs_fid_lookup - retrieve the right fid from a particular dentry + * v9fs_fid_lookup - return a locked fid from a dentry * @dentry: dentry to look for fid in - * @type: intent of lookup (operation or traversal) * - * find a fid in the dentry + * find a fid in the dentry, obtain its semaphore and return a reference to it. 
+ * code calling lookup is responsible for releasing lock * * TODO: only match fids that have the same uid as current user * @@ -124,7 +126,68 @@ struct v9fs_fid *v9fs_fid_lookup(struct dentry *dentry) if (!return_fid) { dprintk(DEBUG_ERROR, "Couldn't find a fid in dentry\n"); + return_fid = ERR_PTR(-EBADF); } + if(down_interruptible(_fid->lock)) + return ERR_PTR(-EINTR); + return return_fid; } + +/** + * v9fs_fid_clone - lookup the fid for a dentry, clone a private copy and release it + * @dentry: dentry to look for fid in + * + * find a fid in the dentry and then clone to a new private fid + * + * TODO: only match fids that have the same uid as current user + * + */ + +struct v9fs_fid *v9fs_fid_clone(struct dentry *dentry) +{ + struct v9fs_session_info *v9ses = v9fs_inode2v9ses(dentry->d_inode); + struct v9fs_fid *base_fid, *new_fid = ERR_PTR(-EBADF); + struct v9fs_fcall *fcall = NULL; + int fid, err; + + base_fid = v9fs_fid_lookup(dentry); + + if(IS_ERR(base_fid)) + return base_fid; + + if(base_fid) { /* clone fid */ + fid = v9fs_get_idpool(>fidpool); + if (fid < 0) { + eprintk(KERN_WARNING, "newfid fails!\n"); + new_fid = ERR_PTR(-ENOSPC); + goto Release_Fid; + } + + err = v9fs_t_walk(v9ses, base_fid->fid, fid, NULL, ); + if (err < 0) { + dprintk(DEBUG_ERROR, "clone walk didn't work\n"); + v9fs_put_idpool(fid, >fidpool); + new_fid = ERR_PTR(err); + goto Free_Fcall; + } + new_fid = v9fs_fid_create(v9ses, fid); + if (new_fid == NULL) { + dprintk(DEBUG_ERROR, "out of memory\n"); + new_fid = ERR_PTR(-ENOMEM); + } +Free_Fcall: + kfree(fcall); + } + +Release_Fid: + up(_fid->lock); + return new_fid; +} + +void v9fs_fid_clunk(struct v9fs_session_info *v9ses, struct v9fs_fid *fid) +{ + v9fs_t_clunk(v9ses, fid->fid); + v9fs_fid_destroy(fid); +} diff --git a/fs/9p/fid.h b/fs/9p/fid.h index aa974d6..48fc170 100644 --- a/fs/9p/fid.h +++ b/fs/9p/fid.h @@ -30,6 +30,8 @@ struct v9fs_fid { struct list_head list; /* list of fids associated with a dentry */ struct list_head 
active; /* XXX - debug */ + struct semaphore lock; + u32 fid; unsigned char fidopen;/* set when fid is opened */ unsigned char fidclunked; /* set when fid has already been clunked */ @@ -55,3 +57,6 @@ struct v9fs_fid *v9fs_fid_get_created(struct dentry *); void v9fs_fid_destroy(struct v9fs_fid *fid); struct v9fs_fid *v9fs_fid_create(struct v9fs_session_info *, int fid); int v9fs_fid_insert(struct v9fs_fid *fid, struct dentry *dentry); +struct v9fs_fid *v9fs_fid_clone(struct dentry *dentry); +void v9fs_fid_clunk(struct v9fs_session_info *v9ses, struct v9fs_fid *fid); + diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c index e86a071..9f17b0c 100644 --- a/fs/9p/vfs_file.c +++ b/fs/9p/vfs_file.c @@ -55,53 +55,22 @@ int v9fs_file_open(struct inode *inode, struct file *file) struct v9fs_fid *vfid; struct v9fs_fcall *fcall = NULL; int omode; - int fid = V9FS_NOFID; int err;
Re: SATA exceptions with 2.6.20-rc5
On Tuesday 23 January 2007 01:24, Robert Hancock wrote: > As a final aside, this is another case where the hardware docs for this > controller would really be useful, in order to know whether we are > actually supposed to be reading that register in ADMA mode or not. I > sent a query to Allen Martin at NVIDIA asking if there's a way I could > get access to the documents, but I haven't heard anything yet. Obviously, NVIDIA's response is disappointing, but thank you for putting the time in to debug this problem. Definitely sounds like a hardware defect, I'm just glad there's a workaround. Will we see this fix in 2.6.20? -- Cheers, Alistair. Final year Computer Science undergraduate. 1F2 55 South Clerk Street, Edinburgh, UK.
Re: Why active list and inactive list?
Andrea Arcangeli wrote: On Tue, Jan 23, 2007 at 01:10:46AM +0100, Niki Hammler wrote: Dear Linux Developers/Enthusiasts, For a course at my university I'm implementing parts of an operating system where I get most ideas from the Linux Kernel (which I like very much). One book I gain information from is [1]. Linux uses for its Page Replacing Algorithm (based on LRU) *two* chained lists - one active list and one inactive list. I implemented my PRA this way too. Now the big question is: WHY do I need *two* lists? Isn't it just overhead/more work? Are there any reasons for that? In my opinion, it would be better to have just one list: pop frames to be swapped out from the end of the list and push new frames in front of it. Then just evaluate the frames and shift them around in the list. Is there any explanation why Linux uses two lists? Back then I designed it with two lru lists because splitting the active from the inactive cache allows detecting the cache pollution before it starts discarding the working set. The idea is that the pollution will enter and exit the inactive list without ever being elected to the active list, because by definition it will never generate a cache hit. The working set will instead trigger cache hits during page faults or repeated reads, and it will be preserved better by electing it to enter the active list. A page in the inactive list will be collected much more quickly than a page in the active list, so the pollution will be collected more quickly than the working set. Then the VM, while freeing cache, tries to keep a balance between the size of the two lists to avoid being too unfair; obviously at some point the active list has to be de-activated too. If your server "fits in ram" you'll find lots of cache to be active and so the I/O activity not part of the working set will be collected without affecting the working set much. This makes me wonder if it makes sense to split up the LRU into page cache LRU and mapped pages LRU.
I see two benefits:

1. Currently, based on swappiness, we might walk an entire list searching for page cache pages or mapped pages. With these lists separated, it should get easier and faster to implement this scheme.
2. There is another parallel thread on implementing page cache limits. If the lists are split out, we need not scan the entire list to find page cache pages to evict them.

Of course I might be missing something (some piece of history) -- Balbir Singh Linux Technology Center IBM, ISTL
Re: [patch] md: bitmap read_page error
I think your patch is not enough to solve the read_page error completely. I think in bitmap_init_from_disk we also need to check that 'count' never exceeds the size of the file before calling the read_page function. What do you think about it? Thanks for your reply. 2007/1/23, Neil Brown <[EMAIL PROTECTED]>: On Monday January 22, [EMAIL PROTECTED] wrote: > If the bitmap size is less than one page including super_block and > bitmap and the inode's i_blkbits is also small, when doing the > read_page function call to read the sb_page, it may return an error. > For example, if the device is 12800 chunks, its bitmap file size is > about 1.6KB including the bitmap super block. But if the inode i_blkbits > value of the bitmap file is 10, read_page will submit 4 bh to > load the sb_page. Because the size of the bitmap is only 1.6KB, in the > while loop, the error will occur when doing the bmap operation for block > 2, which will return 0. Then the bitmap can't be initialized because > the read of the sb page fails. > > Another error is in the bitmap_init_from_disk function. Before doing > read_page, calculating the count value misses the size of the super > block. When the bitmap just needs one page, it will read two pages > including the super block. But at the second read, the count value will > be set to 0, and not all the bitmap will be read from the disk; > some bitmap will be missed at the second page. > > I give a patch as follows: Thanks a lot for this. Rather than checking the file size in read_page, I would like to make sure the 'count' that is passed in never exceeds the size of the file. This should have the same effect. So this is the patch I plan to submit. Thanks again, NeilBrown -- Avoid reading past the end of a bitmap file. In most cases we check the size of the bitmap file before reading data from it. However when reading the superblock, we always read the first PAGE_SIZE bytes, which might not always be appropriate. So limit that read to the size of the file if appropriate.
Also, we get the count of available bytes wrong in one place, so that too can read past the end of the file.

Cc: "yang yin" <[EMAIL PROTECTED]>
Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/bitmap.c | 12
 1 file changed, 8 insertions(+), 4 deletions(-)

diff .prev/drivers/md/bitmap.c ./drivers/md/bitmap.c
--- .prev/drivers/md/bitmap.c	2007-01-23 09:44:11.0 +1100
+++ ./drivers/md/bitmap.c	2007-01-23 09:44:59.0 +1100
@@ -479,9 +479,12 @@ static int bitmap_read_sb(struct bitmap
 	int err = -EINVAL;

 	/* page 0 is the superblock, read it... */
-	if (bitmap->file)
-		bitmap->sb_page = read_page(bitmap->file, 0, bitmap, PAGE_SIZE);
-	else {
+	if (bitmap->file) {
+		loff_t isize = i_size_read(bitmap->file->f_mapping->host);
+		int bytes = isize > PAGE_SIZE ? PAGE_SIZE : isize;
+
+		bitmap->sb_page = read_page(bitmap->file, 0, bitmap, bytes);
+	} else {
 		bitmap->sb_page = read_sb_page(bitmap->mddev, bitmap->offset, 0);
 	}
 	if (IS_ERR(bitmap->sb_page)) {
@@ -877,7 +880,8 @@ static int bitmap_init_from_disk(struct
 			int count;
 			/* unmap the old page, we're done with it */
 			if (index == num_pages-1)
-				count = bytes - index * PAGE_SIZE;
+				count = bytes + sizeof(bitmap_super_t)
+					- index * PAGE_SIZE;
 			else
 				count = PAGE_SIZE;
 			if (index == 0) {
Re: [PATCH] Make CARDBUS_MEM_SIZE and CARDBUS_IO_SIZE customizable
On Mon, 22 Jan 2007 18:17:38 +0300, Sergei Shtylyov <[EMAIL PROTECTED]> wrote: > > + cbiosize=nn[KMG]The fixed amount of bus space which is > > + reserved for the CardBus bridges IO window. > > It should be "bridge's"... Thanks. Updated again. Subject: [PATCH] Make CARDBUS_MEM_SIZE and CARDBUS_IO_SIZE customizable CARDBUS_MEM_SIZE was increased to 64MB on 2.6.20-rc2, but a larger size might result in allocation failure for the reservation itself on some platforms (for example typical 32bit MIPS). Make it (and CARDBUS_IO_SIZE too) customizable by the "pci=" option for such platforms. Signed-off-by: Atsushi Nemoto <[EMAIL PROTECTED]> --- Documentation/kernel-parameters.txt |6 ++ drivers/pci/pci.c |6 ++ drivers/pci/setup-bus.c | 27 +++ include/linux/pci.h |3 +++ 4 files changed, 30 insertions(+), 12 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 25d2985..a194b8f 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -1259,6 +1259,12 @@ and is between 256 and 4096 characters. This sorting is done to get a device order compatible with older (<= 2.4) kernels. nobfsortDon't sort PCI devices into breadth-first order. + cbiosize=nn[KMG]The fixed amount of bus space which is + reserved for the CardBus bridge's IO window. + The default value is 256 bytes. + cbmemsize=nn[KMG] The fixed amount of bus space which is + reserved for the CardBus bridge's memory + window. The default value is 64 megabytes.
pcmv= [HW,PCMCIA] BadgePAD 4 diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 206c834..639069a 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1168,6 +1168,12 @@ static int __devinit pci_setup(char *str if (*str && (str = pcibios_setup(str)) && *str) { if (!strcmp(str, "nomsi")) { pci_no_msi(); + } else if (!strncmp(str, "cbiosize=", 9)) { + pci_cardbus_io_size = + memparse(str + 9, ); + } else if (!strncmp(str, "cbmemsize=", 10)) { + pci_cardbus_mem_size = + memparse(str + 10, ); } else { printk(KERN_ERR "PCI: Unknown option `%s'\n", str); diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c index 89f3036..1dfc288 100644 --- a/drivers/pci/setup-bus.c +++ b/drivers/pci/setup-bus.c @@ -40,8 +40,11 @@ #define ROUND_UP(x, a) (((x) + (a) - 1) * FIXME: IO should be max 256 bytes. However, since we may * have a P2P bridge below a cardbus bridge, we need 4K. */ -#define CARDBUS_IO_SIZE(256) -#define CARDBUS_MEM_SIZE (64*1024*1024) +#define DEFAULT_CARDBUS_IO_SIZE(256) +#define DEFAULT_CARDBUS_MEM_SIZE (64*1024*1024) +/* pci=cbmemsize=nnM,cbiosize=nn can override this */ +unsigned long pci_cardbus_io_size = DEFAULT_CARDBUS_IO_SIZE; +unsigned long pci_cardbus_mem_size = DEFAULT_CARDBUS_MEM_SIZE; static void __devinit pbus_assign_resources_sorted(struct pci_bus *bus) @@ -415,12 +418,12 @@ pci_bus_size_cardbus(struct pci_bus *bus * Reserve some resources for CardBus. We reserve * a fixed amount of bus space for CardBus bridges. 
*/ - b_res[0].start = CARDBUS_IO_SIZE; - b_res[0].end = b_res[0].start + CARDBUS_IO_SIZE - 1; + b_res[0].start = pci_cardbus_io_size; + b_res[0].end = b_res[0].start + pci_cardbus_io_size - 1; b_res[0].flags |= IORESOURCE_IO; - b_res[1].start = CARDBUS_IO_SIZE; - b_res[1].end = b_res[1].start + CARDBUS_IO_SIZE - 1; + b_res[1].start = pci_cardbus_io_size; + b_res[1].end = b_res[1].start + pci_cardbus_io_size - 1; b_res[1].flags |= IORESOURCE_IO; /* @@ -440,16 +443,16 @@ pci_bus_size_cardbus(struct pci_bus *bus * twice the size. */ if (ctrl & PCI_CB_BRIDGE_CTL_PREFETCH_MEM0) { - b_res[2].start = CARDBUS_MEM_SIZE; - b_res[2].end = b_res[2].start + CARDBUS_MEM_SIZE - 1; + b_res[2].start = pci_cardbus_mem_size; + b_res[2].end = b_res[2].start + pci_cardbus_mem_size - 1; b_res[2].flags |= IORESOURCE_MEM | IORESOURCE_PREFETCH; - b_res[3].start = CARDBUS_MEM_SIZE; - b_res[3].end = b_res[3].start + CARDBUS_MEM_SIZE - 1; +
Re: SATA exceptions with 2.6.20-rc5
Björn Steinbrink wrote: Running a kernel with the return statement replaced by a line that prints the irq_stat instead. Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2. 40 minutes stress test now and no exception yet. What's interesting is that ata1 saw exactly one interrupt with irq_stat 0x0; all others that might have gotten dropped are as above. I'll keep it running for some time and will then re-enable the return statement to see if there's a relation between the irq_stat 0x0 and the exception. No, doesn't seem to be related, did get 2 exceptions, but no irq_stat 0x0 for ata1. Syslog/dmesg has nothing new either, still the same pattern of dismissed irq_stats. I've finally managed to reproduce this problem on my box, by doing: watch --interval=0.1 /sbin/hdparm -I /dev/sda on one drive and then running bonnie++ on /dev/sdb connected to the other port on the same controller device. Usually within a few minutes one of the IDENTIFY commands would time out in the same way you guys have been seeing. Through some various trials and tribulations, the only conclusion I can come to is that this controller really doesn't like that NV_INT_STATUS_CK804 register being looked at in ADMA mode. I tried adding some debug code to the qc_issue function that would check to see if the BUSY flag in altstatus went high or that register showed an interrupt within a certain time afterwards, however that really seemed to hose things, the system wouldn't even boot. Try out this patch; it just calls the ata_host_intr function where appropriate without using nv_host_intr, which looks at the NV_INT_STATUS_CK804 register. This is what the original ADMA patch from Mr. Mysterious NVIDIA Person did, so I'm guessing there may be a reason for that. With this patch I can get through a whole bonnie++ run with the repeated IDENTIFY requests running without seeing the error.
As an aside, there seems to be some dubious code in nv_host_intr: if ata_host_intr returns 0 for handled when a command is outstanding, it goes and calls ata_check_status anyway. This is rather dangerous, since if an interrupt showed up right after ata_host_intr but before ata_check_status, the ata_check_status would clear it and we would forget about it. I tried fixing just that issue and still had this problem, however. I suspect that code is truly broken and needs further thought, but this patch avoids calling it in the ADMA case, at any rate. As a final aside, this is another case where the hardware docs for this controller would really be useful, in order to know whether we are actually supposed to be reading that register in ADMA mode or not. I sent a query to Allen Martin at NVIDIA asking if there's a way I could get access to the documents, but I haven't heard anything yet. -- Robert Hancock Saskatoon, SK, Canada To email, remove "nospam" from [EMAIL PROTECTED] Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c	2007-01-19 19:18:53.0 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c	2007-01-22 18:35:09.0 -0600
@@ -750,9 +750,9 @@ static irqreturn_t nv_adma_interrupt(int
 		/* if in ATA register mode, use standard ata interrupt handler */
 		if (pp->flags & NV_ADMA_PORT_REGISTER_MODE) {
-			u8 irq_stat = readb(host->mmio_base + NV_INT_STATUS_CK804)
-				>> (NV_INT_PORT_SHIFT * i);
-			handled += nv_host_intr(ap, irq_stat);
+			struct ata_queued_cmd *qc = ata_qc_from_tag(ap, ap->active_tag);
+			if (qc && !(qc->tf.flags & ATA_TFLAG_POLLING))
+				handled += ata_host_intr(ap, qc);
 			continue;
 		}
Re: configfs: return value for drop_item()/make_item()?
On Mon, Jan 22, 2007 at 01:35:36PM +0100, Michael Noisternig wrote: > Sure, but what I meant to say was that the user, when creating a > directory, did not request creation of such sub-directories, so I see > them as created by the kernel. Ahh, but userspace did! It's part of the configfs contract. They've asked for a new config item and all that it entails. > If you argue that they are in fact created by the user because they are > a direct result of a user action, then I can apply the same argument to > this one example: > ... > >This is precisely what configfs is designed to forbid. The kernel > >does not, ever, create configfs objects on its own. It does it as a > >result of userspace action. > > No. The sub-directory only appears as a direct result of the user > writing a value into the 'type' attribute. ;-) Ok, you're stretching the metaphor. Writing a value to a "type" attribute is, indeed, a userspace action. However, configfs' contract is that only mkdir(2) creates objects. We're not trying to create the do-everything-kitchen-sink system here. That way lies the problems we're trying to avoid. That's why configfs has a specific contract it provides to (a) userspace and (b) client drivers. > >you're never going to get it from configfs. You should be using > >sysfs. > > Hardly. sysfs doesn't allow the user to create directories. :> sysfs certainly supports your "echo b > type" style of object creation. Your type_file->store() method gets a "b" in the buffer and then does sysfs_mkdir() of your new object directory. Here, the kernel is creating the new object (the directory). > Well, you don't need PTR_ERR(). Sure, you could use **new_item. It's the same complexity change. > That's an interesting other solution, however it seems a bit redundant > (params are referenced by links as well as in the 'order' attribute > file) and not as simple as my method 2).
I guess, for now, for lack of a > convincing solution, I will implement method 2) as the one easiest to > adapt to given my current code base. But they are not referenced by the order file. It's just an attribute :-) Really, you can look at it either way. But configfs has a specific perspective based on its contracts, and so it works within them. > Hm, I had envisioned the user to fully configure the module via file > system operations only. Now if the user is supposed to use a wrapper > program this sheds a different light on all those > what's-the-best-solution issues... Certainly the user can do the configuration by hand. It will always work. But why make them understand your userspace<->kernel API when you can just provide a tool? They're all going to script it up anyway. Joel -- "The doctrine of human equality reposes on this: that there is no man really clever who has not found that he is stupid." - Gilbert K. Chesterton Joel Becker Principal Software Developer Oracle E-mail: [EMAIL PROTECTED] Phone: (650) 506-8127
[ot] Re: coding style and re-inventing the wheel way too many times
On 2006-12-21, Robert P. J. Day wrote: [] > in any event, even *i* am not going to go near this kind of cleanup, > but is there anything actually worth doing about it? just curious. Moscow wasn't built at once... You may notice that some others are doing little-by-little steps: - source cleanups; - right dependencies; - header includes; - warnings elimination; - code reading-analyzing (brainwashing, as for me;) and a few line fixes everywhere. I think there are members here who have been doing things like that for at least two years. You are trying to focus (mostly) on style and sense -- ok. And maybe it's like being alone within a crowd. Anyways, as long as you feel that you can do that, patches are accepted, and changes are *documented*, it's ok, also. Sometimes it's stupid, worthless or something else. If so, then find more interesting things to do, unless you are going to prove something to somebody ;E. There's, as I can see, almost no place for emotions here, only technical stuff and everything not far from it. See, there isn't any reply to your message, except this one. I think, because of that. -- -o--=O`C #oo'L O <___=E M
Re: 2.6.20-rc5: cp 18gb 18gb.2 = OOM killer, reproducible just like 2.16.19.2
On Tue, 23 Jan 2007 11:37:09 +1100 Donald Douwsma <[EMAIL PROTECTED]> wrote: > Andrew Morton wrote: > >> On Sun, 21 Jan 2007 14:27:34 -0500 (EST) Justin Piszcz <[EMAIL PROTECTED]> > >> wrote: > >> Why does copying an 18GB on a 74GB raptor raid1 cause the kernel to invoke > >> the OOM killer and kill all of my processes? > > > > What's that? Software raid or hardware raid? If the latter, which driver? > > I've hit this using local disk while testing xfs built against 2.6.20-rc4 > (SMP x86_64) > > dmesg follows, I'm not sure if anything in this is useful after the first > event as our automated tests continued on > after the failure. This looks different. > ... > > Mem-info: > Node 0 DMA per-cpu: > CPU0: Hot: hi:0, btch: 1 usd: 0 Cold: hi:0, btch: 1 usd: > 0 > CPU1: Hot: hi:0, btch: 1 usd: 0 Cold: hi:0, btch: 1 usd: > 0 > CPU2: Hot: hi:0, btch: 1 usd: 0 Cold: hi:0, btch: 1 usd: > 0 > CPU3: Hot: hi:0, btch: 1 usd: 0 Cold: hi:0, btch: 1 usd: > 0 > Node 0 DMA32 per-cpu: > CPU0: Hot: hi: 186, btch: 31 usd: 31 Cold: hi: 62, btch: 15 usd: > 53 > CPU1: Hot: hi: 186, btch: 31 usd: 2 Cold: hi: 62, btch: 15 usd: > 60 > CPU2: Hot: hi: 186, btch: 31 usd: 20 Cold: hi: 62, btch: 15 usd: > 47 > CPU3: Hot: hi: 186, btch: 31 usd: 25 Cold: hi: 62, btch: 15 usd: > 56 > Active:76 inactive:495856 dirty:0 writeback:0 unstable:0 free:3680 slab:9119 > mapped:32 pagetables:637 No dirty pages, no pages under writeback. > Node 0 DMA free:8036kB min:24kB low:28kB high:36kB active:0kB inactive:1856kB > present:9376kB pages_scanned:3296 > all_unreclaimable? yes > lowmem_reserve[]: 0 2003 2003 > Node 0 DMA32 free:6684kB min:5712kB low:7140kB high:8568kB active:304kB > inactive:1981624kB present:2052068kB Inactive list is filled. > pages_scanned:4343329 all_unreclaimable? yes We scanned our guts out and decided that nothing was reclaimable. 
> No available memory (MPOL_BIND): kill process 3492 (hald) score 0 or a child > No available memory (MPOL_BIND): kill process 7914 (top) score 0 or a child > No available memory (MPOL_BIND): kill process 4166 (nscd) score 0 or a child > No available memory (MPOL_BIND): kill process 17869 (xfs_repair) score 0 or a > child But in all cases a constrained memory policy was in use.
Re: XFS or Kernel Problem / Bug
On Mon, Jan 22, 2007 at 09:07:23AM +0100, Stefan Priebe - FH wrote: > Hi! > > The update of the IDE layer was in 2.6.19. I don't think it is a > hardware bug because all these 5 machines have run fine for a few years with > 2.6.16.X and before. We switched to 2.6.18.6 on Monday last week and all > machines began to crash periodically. On Friday last week we downgraded > them all to 2.6.16.37 and all 5 machines run fine again. So I don't > believe it is a hardware problem. Do you really think that could be? I was thinking more of a driver change that is being triggered on that particular hardware. FWIW, did you test 2.6.19? I really need a better idea of the workload these servers are running and, ideally, a reproducible test case to track something like this down. At the moment I have no idea what is going on and no real information on which to even base a guess. Were there any other messages in the log? On Mon, Jan 22, 2007 at 10:42:36AM +0100, Stefan Priebe - FH wrote: > Hi! > > I've another idea... could it be that it is a barrier problem? Since > barriers are enabled by default from 2.6.17 on ... You could try turning it off. If it does fix the problem, then I'd be pointing once again at hardware ;) Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
Re: [PATCH 2.6.20-rc5] SPI: alternative fix for spi_busnum_to_master
On Mon, 22 Jan 2007 14:12:02 -0800, David Brownell <[EMAIL PROTECTED]> wrote: > > Here is a revised version. The children list of spi_master_class > > contains only spi_master class devices, so we can just compare the bus_num member > > instead of the class_id string. > > Looks just a bit iffy ... though, thanks for helping to finally > sort this out! Well, so would the previous patch (which was checking the class_id string) be preferred? > > + cdev = class_device_get(cdev); > > + if (!cdev) > > + continue; > > That "continue" case doesn't seem like it should be possible... but > at any rate, the "get" can be deferred until the relevant class > device is known, since that _valid_ handle can't disappear so long > as that semaphore is held. And if you find the right device but > can't get a reference ... no point in continuing! > > Something like a class_find_device() would be the best way to solve > this sort of problem, IMO. But we don't have one of those. :( Indeed the check can be omitted. Should I send a new patch just moving class_device_get() into the "if (master->bus_num == bus_num)" block? The crash with udev is a 2.6.20 regression, so I wish this to be fixed very soon. Thank you for the review. --- Atsushi Nemoto
Re: PROBLEM: KB->KiB, MB -> MiB, ... (IEC 60027-2)
Andreas Schwab <[EMAIL PROTECTED]> writes: > But other than the sector size there is no natural power of 2 connected to > disk size. A disk can have any odd number of sectors. But the manufacturers don't count in sectors. It should be consistent, though. "How many GB of disk space do you need to store 2 GB of USB flash, and how many to store 2 GB RAM image"? :-) -- Krzysztof Halasa
Re: [Announce] GIT v1.5.0-rc2
On Mon, 22 Jan 2007 11:28:32 -0800, Junio C Hamano wrote: > Thanks for your comments; You're welcome. > the attached probably needs proofreading. In general, I like it. The git-branch documentation already talks about "remote-tracking branches" so I've rewritten a couple of sentences below to use that same terminology. Also there are a couple of grammar errors related to pluralization (likely the fault of English being quite a bit less consistent than other languages with subject/verb number agreement, etc.). > + A repository with the separate remote layout starts with only > + one default branch, 'master', to be used for your own > + development. Unlike the traditional layout that copied all > + the upstream branches into your branch namespace (while > + renaming their 'master' to your 'origin'), they are not made > + into your branches. Instead, they are kept track of using > + 'refs/remotes/origin/$upstream_branch_name'. renaming remote 'master' to local 'origin'), the new approach puts upstream branches into local "remote-tracking branches" with their own namespace. These can be referenced with names such as "origin/$upstream_branch_name" and are stored in .git/refs/remotes rather than .git/refs/heads where normal branches are stored. > + This layout keeps your own branch namespace less cluttered, > + avoids name collision with your upstream, makes it possible > + to automatically track new branches created at the remote > + after you clone from it, and makes it easier to interact with > + more than one remote repositories. There might be some Should be "more than one remote repository.". Also I'd add ", (see the new 'git remote' command)" before the end of that sentence. > + * 'git branch' does not show the branches from your upstream. Again, to use the same terminology, "does not show the remote-tracking branches.". > + Repositories initialized with the traditional layout > + continues to work (and will continue to work). The 's' on "continues" is incorrect. 
Perhaps: continue to work (and will work in the future as well). Or just drop the parenthetical phrase. -Carl
Re: [PATCH -rt] whitespace cleanup for 2.6.20-rc5-rt7
At Tue, 23 Jan 2007 00:42:31 +0100, Richard Knutsson wrote: > > Michal Piotrowski wrote: > > How about this script? > > > > "d) Ensure that your patch does not add new trailing whitespace. The > > below > > script will fix up your patch by stripping off such whitespace. > > > > #!/bin/sh > > > > strip1() > > { > > TMP=$(mktemp /tmp/XX) > > cp "$1" "$TMP" > > sed -e '/^+/s/[ \t]*$//' <"$TMP" >"$1" > > rm "$TMP" > > } > > > > for i in "$@" > > do > > strip1 "$i" > > done > > " > > http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt > I believe: > > for i in "$@"; do \ > sed --in-place -e "s/[ \t]+$//" "$i" > done > > will do as well... Hi Richard, IIRC, `+' is extended regex, so the -r option is needed: sed -r --in-place -e "s/[ \t]+$//" "$i" Satoru Takeuchi
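Putting the thread's fixes together, a version of the stripping script might look like the following (a sketch assuming GNU sed: -E enables the extended-regex `+`, -i edits in place, and [[:blank:]] matches space-or-tab without relying on an invisible literal tab surviving copy/paste):

```shell
#!/bin/sh
# Strip trailing spaces/tabs from the added ("+") lines of each patch
# named on the command line, editing the files in place.
strip_trailing() {
	sed -E -i -e '/^\+/s/[[:blank:]]+$//' "$1"
}

for f in "$@"; do
	strip_trailing "$f"
done
```

Only lines beginning with `+` are touched, so context lines in the patch keep their original whitespace.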
Re: Why active list and inactive list?
On Tue, Jan 23, 2007 at 01:10:46AM +0100, Niki Hammler wrote: > Dear Linux Developers/Enthusiasts, > > For a course at my university I'm implementing parts of an operating > system, taking most of my ideas from the Linux kernel (which I like very > much). One book I gain information from is [1]. > > Linux uses for its page replacement algorithm (based on LRU) *two* linked > lists - one active list and one inactive list. > I implemented my PRA this way too. > > Now the big question is: WHY do I need *two* lists? Isn't it just > overhead/more work? Are there any reasons for that? > > In my opinion, it would be better to have just one list; pop frames to > be swapped out from the end of the list and push new frames in front of > it. Then just evaluate the frames and shift them around in the list. > > Is there any explanation why Linux uses two lists? Back then I designed it with two LRU lists because splitting the active cache from the inactive cache allows the VM to detect cache pollution before it starts discarding the working set. The idea is that the pollution will enter and exit the inactive list without ever being elected to the active list, because by definition it will never generate a cache hit. The working set will instead trigger cache hits during page faults or repeated reads, and it will be preserved better by electing it to enter the active list. A page in the inactive list will be collected much more quickly than a page in the active list, so the pollution will be collected more quickly than the working set. Then the VM, while freeing cache, tries to keep a balance between the sizes of the two lists to avoid being too unfair; obviously at some point the active list has to be deactivated too. If your server "fits in RAM" you'll find lots of cache to be active, and so the I/O activity not part of the working set will be collected without affecting the working set much. 
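The two-list idea above can be modeled in a few lines of C. This is a toy model, not the kernel's implementation: a single referenced bit stands in for the promotion logic, and reclaim scans a flat array instead of walking real list heads:

```c
#include <assert.h>
#include <stdbool.h>

#define NPAGES 8

/* Each cached page is either on the inactive or the active "list".
 * A page enters the cache inactive; a second reference (a cache hit)
 * promotes it to active.  Reclaim takes from the inactive set first,
 * so single-use pollution is discarded before the working set. */
struct page_model {
	bool present;
	bool active;		/* promoted to the active list? */
	bool referenced;	/* hit since entering the inactive list? */
};

static struct page_model cache[NPAGES];

void touch_page(int i)
{
	if (!cache[i].present) {	/* fault: enter inactive list */
		cache[i].present = true;
		cache[i].active = false;
		cache[i].referenced = false;
	} else if (!cache[i].active) {
		if (cache[i].referenced)
			cache[i].active = true;	/* second hit: promote */
		else
			cache[i].referenced = true;
	}
}

/* Reclaim one page, preferring the inactive set; returns index or -1. */
int reclaim_one(void)
{
	for (int i = 0; i < NPAGES; i++)
		if (cache[i].present && !cache[i].active) {
			cache[i].present = false;
			return i;
		}
	for (int i = 0; i < NPAGES; i++)
		if (cache[i].present) {		/* fall back to active */
			cache[i].present = false;
			return i;
		}
	return -1;
}
```

A page touched repeatedly gets promoted and survives reclaim; a page touched once never leaves the inactive set and is the first to go — exactly the pollution-detection property described above.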
Re: 2.6.20-rc5: cp 18gb 18gb.2 = OOM killer, reproducible just like 2.16.19.2
Andrew Morton wrote: >> On Sun, 21 Jan 2007 14:27:34 -0500 (EST) Justin Piszcz <[EMAIL PROTECTED]> >> wrote: >> Why does copying an 18GB file on a 74GB raptor raid1 cause the kernel to invoke >> the OOM killer and kill all of my processes? > > What's that? Software raid or hardware raid? If the latter, which driver? I've hit this using local disk while testing xfs built against 2.6.20-rc4 (SMP x86_64). dmesg follows; I'm not sure if anything in this is useful after the first event as our automated tests continued on after the failure. > Please include /proc/meminfo from after the oom-killing. > > Please work out what is using all that slab memory, via /proc/slabinfo. Sorry I didn't pick this up either. I'll try to reproduce this and gather some more detailed info for a single event. Donald ... XFS mounting filesystem sdb5 Ending clean XFS mount for filesystem: sdb5 XFS mounting filesystem sdb5 Ending clean XFS mount for filesystem: sdb5 hald invoked oom-killer: gfp_mask=0x200d2, order=0, oomkilladj=0 Call Trace: [] out_of_memory+0x70/0x25d [] __alloc_pages+0x22c/0x2b5 [] alloc_page_vma+0x71/0x76 [] read_swap_cache_async+0x45/0xd8 [] swapin_readahead+0x60/0xd3 [] __handle_mm_fault+0x703/0x9d8 [] do_page_fault+0x42b/0x7b3 [] do_readv_writev+0x176/0x18b [] thread_return+0x0/0xed [] __const_udelay+0x2c/0x2d [] scsi_done+0x0/0x17 [] error_exit+0x0/0x84 Mem-info: Node 0 DMA per-cpu: CPU0: Hot: hi:0, btch: 1 usd: 0 Cold: hi:0, btch: 1 usd: 0 CPU1: Hot: hi:0, btch: 1 usd: 0 Cold: hi:0, btch: 1 usd: 0 CPU2: Hot: hi:0, btch: 1 usd: 0 Cold: hi:0, btch: 1 usd: 0 CPU3: Hot: hi:0, btch: 1 usd: 0 Cold: hi:0, btch: 1 usd: 0 Node 0 DMA32 per-cpu: CPU0: Hot: hi: 186, btch: 31 usd: 31 Cold: hi: 62, btch: 15 usd: 53 CPU1: Hot: hi: 186, btch: 31 usd: 2 Cold: hi: 62, btch: 15 usd: 60 CPU2: Hot: hi: 186, btch: 31 usd: 20 Cold: hi: 62, btch: 15 usd: 47 CPU3: Hot: hi: 186, btch: 31 usd: 25 Cold: hi: 62, btch: 15 usd: 56 Active:76 inactive:495856 dirty:0 writeback:0 unstable:0 free:3680 slab:9119 
mapped:32 pagetables:637 Node 0 DMA free:8036kB min:24kB low:28kB high:36kB active:0kB inactive:1856kB present:9376kB pages_scanned:3296 all_unreclaimable? yes lowmem_reserve[]: 0 2003 2003 Node 0 DMA32 free:6684kB min:5712kB low:7140kB high:8568kB active:304kB inactive:1981624kB present:2052068kB pages_scanned:4343329 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 8036kB Node 0 DMA32: 273*4kB 29*8kB 1*16kB 1*32kB 1*64kB 1*128kB 2*256kB 1*512kB 0*1024kB 0*2048kB 1*4096kB = 6684kB Swap cache: add 741048, delete 244661, find 84826/143198, race 680+239 Free swap = 1088524kB Total swap = 3140668kB Free swap: 1088524kB 524224 pages of RAM 9619 reserved pages 259 pages shared 496388 pages swap cached No available memory (MPOL_BIND): kill process 3492 (hald) score 0 or a child Killed process 3626 (hald-addon-acpi) top invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0 Call Trace: [] out_of_memory+0x70/0x25d [] __alloc_pages+0x22c/0x2b5 [] alloc_pages_current+0x74/0x79 [] __page_cache_alloc+0xb/0xe [] __do_page_cache_readahead+0xa1/0x217 [] io_schedule+0x28/0x33 [] __wait_on_bit_lock+0x5b/0x66 [] __lock_page+0x72/0x78 [] do_page_cache_readahead+0x4e/0x5a [] filemap_nopage+0x140/0x30c [] __handle_mm_fault+0x1fb/0x9d8 [] do_page_fault+0x42b/0x7b3 [] __wake_up+0x43/0x50 [] tty_ldisc_deref+0x71/0x76 [] error_exit+0x0/0x84 Mem-info: Node 0 DMA per-cpu: CPU0: Hot: hi:0, btch: 1 usd: 0 Cold: hi:0, btch: 1 usd: 0 CPU1: Hot: hi:0, btch: 1 usd: 0 Cold: hi:0, btch: 1 usd: 0 CPU2: Hot: hi:0, btch: 1 usd: 0 Cold: hi:0, btch: 1 usd: 0 CPU3: Hot: hi:0, btch: 1 usd: 0 Cold: hi:0, btch: 1 usd: 0 Node 0 DMA32 per-cpu: CPU0: Hot: hi: 186, btch: 31 usd: 31 Cold: hi: 62, btch: 15 usd: 53 CPU1: Hot: hi: 186, btch: 31 usd: 2 Cold: hi: 62, btch: 15 usd: 60 CPU2: Hot: hi: 186, btch: 31 usd: 1 Cold: hi: 62, btch: 15 usd: 10 CPU3: Hot: hi: 186, btch: 31 usd: 25 Cold: hi: 62, btch: 15 usd: 26 
Active:90 inactive:496233 dirty:0 writeback:0 unstable:0 free:3485 slab:9119 mapped:32 pagetables:637 Node 0 DMA free:8036kB min:24kB low:28kB high:36kB active:0kB inactive:1856kB present:9376kB pages_scanned:3328 all_unreclaimable? yes lowmem_reserve[]: 0 2003 2003 Node 0 DMA32 free:5904kB min:5712kB low:7140kB high:8568kB active:360kB inactive:1983092kB present:2052068kB pages_scanned:4587649 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 8036kB Node 0 DMA32: 78*4kB 29*8kB 1*16kB 1*32kB 1*64kB 1*128kB 2*256kB 1*512kB 0*1024kB 0*2048kB 1*4096kB =
[PATCH 002 of 4] md: Make 'repair' actually work for raid1.
When 'repair' finds a block that differs between the various parts of the mirror, it is meant to write a chosen good version to the others. However it currently writes out the original data to each. The memcpy to make all the data the same is missing. Signed-off-by: Neil Brown <[EMAIL PROTECTED]> ### Diffstat output ./drivers/md/raid1.c | 5 + 1 file changed, 5 insertions(+) diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c --- .prev/drivers/md/raid1.c 2007-01-23 11:13:45.0 +1100 +++ ./drivers/md/raid1.c 2007-01-23 11:23:43.0 +1100 @@ -1221,6 +1221,11 @@ static void sync_request_write(mddev_t * sbio->bi_sector = r1_bio->sector + conf->mirrors[i].rdev->data_offset; sbio->bi_bdev = conf->mirrors[i].rdev->bdev; + for (j = 0; j < vcnt ; j++) + memcpy(page_address(sbio->bi_io_vec[j].bv_page), + page_address(pbio->bi_io_vec[j].bv_page), + PAGE_SIZE); + } } }
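What the patch adds can be pictured in miniature: once a good copy is chosen, its data must actually be copied over the other copies before they are rewritten. Below is a sketch of that step, with plain byte arrays standing in for the bio page vectors (not the raid1 code itself):

```c
#include <assert.h>
#include <string.h>

#define NMIRRORS 3
#define BLK 16		/* stand-in for PAGE_SIZE */

/* Copy the chosen good mirror's data over every other mirror; this is
 * the step that was missing before the patch, which rewrote each
 * mirror's own (possibly stale) data instead. */
void repair_block(unsigned char mirrors[NMIRRORS][BLK], int good)
{
	for (int i = 0; i < NMIRRORS; i++)
		if (i != good)
			memcpy(mirrors[i], mirrors[good], BLK);
}
```

After repair_block() every copy holds the chosen version, so the subsequent writes make the mirrors consistent.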
[PATCH 003 of 4] md: Make sure the events count in an md array never returns to zero.
Now that we sometimes step the array events count backwards (when transitioning dirty->clean where nothing else interesting has happened - so that we don't need to write to spares all the time), it is possible for the event count to return to zero, which is potentially confusing and triggers an MD_BUG. We could possibly remove the MD_BUG, but it is just as easy, and probably safer, to make sure we never return to zero. Signed-off-by: Neil Brown <[EMAIL PROTECTED]> ### Diffstat output ./drivers/md/md.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff .prev/drivers/md/md.c ./drivers/md/md.c --- .prev/drivers/md/md.c 2007-01-23 11:13:44.0 +1100 +++ ./drivers/md/md.c 2007-01-23 11:23:58.0 +1100 @@ -1633,7 +1633,8 @@ repeat: * and 'events' is odd, we can roll back to the previous clean state */ if (nospares && (mddev->in_sync && mddev->recovery_cp == MaxSector) - && (mddev->events & 1)) + && (mddev->events & 1) + && mddev->events != 1) mddev->events--; else { /* otherwise we have to go forward and ... */
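The rule the patch enforces can be stated as a tiny pure function. This is a sketch of the decision only, not the md code: the in_sync/recovery_cp condition is collapsed into a single 'clean' flag:

```c
#include <assert.h>

typedef unsigned long long u64;

/* Decide the next event count on a superblock update: an odd count may
 * be rolled back one step on an uneventful dirty->clean transition,
 * but never from 1 back to 0 -- that case steps forward instead. */
u64 next_events(u64 events, int nospares, int clean)
{
	if (nospares && clean && (events & 1) && events != 1)
		return events - 1;
	return events + 1;
}
```

With the extra `events != 1` term, the count can oscillate (e.g. 2, 3, 2, 3, ...) but can never reach zero, so the MD_BUG case is unreachable.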
[PATCH 004 of 4] md: Avoid reading past the end of a bitmap file.
In most cases we check the size of the bitmap file before reading data from it. However when reading the superblock, we always read the first PAGE_SIZE bytes, which might not always be appropriate. So limit that read to the size of the file if appropriate. Also, we get the count of available bytes wrong in one place, so that too can read past the end of the file. Cc: "yang yin" <[EMAIL PROTECTED]> Signed-off-by: Neil Brown <[EMAIL PROTECTED]> ### Diffstat output ./drivers/md/bitmap.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff .prev/drivers/md/bitmap.c ./drivers/md/bitmap.c --- .prev/drivers/md/bitmap.c 2007-01-23 11:13:43.0 +1100 +++ ./drivers/md/bitmap.c 2007-01-23 11:24:09.0 +1100 @@ -479,9 +479,12 @@ static int bitmap_read_sb(struct bitmap int err = -EINVAL; /* page 0 is the superblock, read it... */ - if (bitmap->file) - bitmap->sb_page = read_page(bitmap->file, 0, bitmap, PAGE_SIZE); - else { + if (bitmap->file) { + loff_t isize = i_size_read(bitmap->file->f_mapping->host); + int bytes = isize > PAGE_SIZE ? PAGE_SIZE : isize; + + bitmap->sb_page = read_page(bitmap->file, 0, bitmap, bytes); + } else { bitmap->sb_page = read_sb_page(bitmap->mddev, bitmap->offset, 0); } if (IS_ERR(bitmap->sb_page)) { @@ -877,7 +880,8 @@ static int bitmap_init_from_disk(struct int count; /* unmap the old page, we're done with it */ if (index == num_pages-1) - count = bytes - index * PAGE_SIZE; + count = bytes + sizeof(bitmap_super_t) + - index * PAGE_SIZE; else count = PAGE_SIZE; if (index == 0) {
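The first hunk's clamping logic is simple enough to state on its own (a sketch; i_size_read() and the page-cache plumbing are elided, and SB_PAGE_SIZE stands in for PAGE_SIZE):

```c
#include <assert.h>

#define SB_PAGE_SIZE 4096	/* stand-in for PAGE_SIZE */

/* Never ask for more superblock bytes than the bitmap file contains:
 * a short file yields a short read, a big file is capped at one page. */
long sb_read_bytes(long long isize)
{
	return isize > SB_PAGE_SIZE ? SB_PAGE_SIZE : (long)isize;
}
```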
[PATCH 001 of 4] md: Update email address and status for MD in MAINTAINERS.
Signed-off-by: Neil Brown <[EMAIL PROTECTED]> ### Diffstat output ./MAINTAINERS |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff .prev/MAINTAINERS ./MAINTAINERS --- .prev/MAINTAINERS 2007-01-23 11:14:14.0 +1100 +++ ./MAINTAINERS 2007-01-23 11:23:03.0 +1100 @@ -3011,9 +3011,9 @@ SOFTWARE RAID (Multiple Disks) SUPPORT P: Ingo Molnar M: [EMAIL PROTECTED] P: Neil Brown -M: [EMAIL PROTECTED] +M: [EMAIL PROTECTED] L: linux-raid@vger.kernel.org -S: Maintained +S: Supported SOFTWARE SUSPEND: P: Pavel Machek
[PATCH 000 of 4] md: Introduction - Assorted bugfixes
Following are 4 patches suitable for inclusion in 2.6.20. Thanks, NeilBrown [PATCH 001 of 4] md: Update email address and status for MD in MAINTAINERS. [PATCH 002 of 4] md: Make 'repair' actually work for raid1. [PATCH 003 of 4] md: Make sure the events count in an md array never returns to zero. [PATCH 004 of 4] md: Avoid reading past the end of a bitmap file.