Re: [PATCH] maestro3 oops + fix (for ac9!)
D'oh, looks like if power management is disabled, pmdev is NULL (I get that message when I load the module), but we try to derefence it anyways. The fix is obvious: duh, yeah, I'll send out a proper patch that handles the pm_register failure too. thanks. -- zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
On Thu, Jan 18, 2001 at 08:49:38AM -0800, Linus Torvalds wrote: That has its advantages: it's a very local thing, and doesn't need any state. However, the fact is that you _need_ the persistency of a socket option if you want to take advantage of external programs etc getting good behaviour without having to know that they are talking to a socket. *nod* We set TCP_CORK on the socket we handed to external programs that were being run via 'site exec' in an ftp server. It resulted in much nicer packets being spit out, especially in the 'ls' case where it likes to write() on really goofy boundaries. [yes, ftp and 'site exec' in particular are far from sexy, but do the same with CGI scripts and the world might care :)] - z - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Possible Bug: drivers/sound/maestro.c
have this problem (or a similar one, anyway -- sometimes the sound becomes distorted or comes only through one speaker) under both Linux 2.2 and Win2K. If it was just Linux, I'd assume it was a driver problem, but the This is a long-standing bug with the maestro2 driver. My current theory is that its a race condition where the APUs get confused while we update their control memory, but this doesn't make total sense. Some of the bug reports I get are implying that the sound is breaking when we're not touching the apu's control mem. Maybe implying a nastier silicon bug.. I've been meaning to try implementing a work around to the theoretical bug, but I've always had trouble triggering it :/ The fun part of this (and why you would see the bug in win2k) is that the maestro2 is very poorly documented. I've never heard of anyone having full docs on the APUs, including the people I've talked to at ESS. They bought the part from another company.. (and thankfully axed it in the maestro3) so we're stabbing in the dark. - z - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
[PATCH] maestro3 PM update
This gets rid of the PM callback maestro3 was using to catch suspend/resume events, and just uses the pci_dev callbacks instead. It has minor cleanups in the _probe() path, including a braindead oops fix. It works on my laptop, but I'd really appreciate it seeing more testing. Alan, could you queue this for 2.4-testing and eventually Linus? [most of the patch is noise from code blocks moving around :/] Thanks to Andres Salomon for pointing out the oops case and motivating the cleanup.. -- zach --- maestro3.c-1.10 Wed Jan 31 00:45:54 2001 +++ maestro3.c Wed Jan 31 00:39:44 2001 @@ -28,6 +28,9 @@ * Shouts go out to Mike "DJ XPCom" Ang. * * History + * v1.20 - Jan 30 2001 - Zach Brown [EMAIL PROTECTED] + * get rid of pm callback and use pci_dev suspend/resume instead + * m3_probe cleanups, including pm oops think-o * v1.10 - Jan 6 2001 - Zach Brown [EMAIL PROTECTED] * revert to lame remap_page_range mmap() just to make it work * record mmap fixed. @@ -134,7 +137,6 @@ #include asm/hardirq.h #include linux/spinlock.h #include linux/ac97_codec.h -#include linux/pm.h /* * for crizappy mmap() @@ -145,7 +147,7 @@ #define M_DEBUG 1 -#define DRIVER_VERSION "1.10" +#define DRIVER_VERSION "1.20" #define M3_MODULE_NAME "maestro3" #define PFX M3_MODULE_NAME ": " @@ -356,9 +358,8 @@ /* * I'm not very good at laying out functions in a file :) */ -static int m3_pm_callback(struct pm_dev *dev, pm_request_t req, void *data); static int m3_notifier(struct notifier_block *nb, unsigned long event, void *buf); -static int m3_suspend(struct m3_card *card); +static void m3_suspend(struct pci_dev *pci_dev); static void check_suspend(struct m3_card *card); struct notifier_block m3_reboot_nb = {m3_notifier, NULL, 0}; @@ -2585,10 +2586,8 @@ u32 n; int i; struct m3_card *card = NULL; -int num = 0; int ret = 0; int card_type = pci_id-driver_data; -struct pm_dev *pmdev; DPRINTK(DPMOD, "in maestro_install\n"); @@ -2606,7 +2605,7 @@ printk(KERN_WARNING PFX "out of memory\n"); return -ENOMEM; } -memset(card, 0, sizeof(*card)); +memset(card, 0, sizeof(struct m3_card)); card-pcidev = pci_dev; init_waitqueue_head(card-suspend_queue); @@ -2677,7 +2676,6 @@ printk(KERN_WARNING PFX "initing a dsp device that is already in use?\n"); /* register devices */ if ((s-dev_audio = register_sound_dsp(m3_audio_fops, -1)) 0) { -num = i; break; } } @@ -2690,20 +2688,11 @@ goto out; } -pmdev = pm_register(PM_PCI_DEV, PM_PCI_ID(pci_dev), m3_pm_callback); - -if( pmdev == NULL) { -printk(KERN_WARNING PFX "couldn't register pm callback, suspend/resume might not work.\n"); -/* XXX do error stuff :) */ -} - -pmdev-data = card; +pci_dev-driver_data = card; m3_enable_ints(card); m3_assp_continue(card); -printk(KERN_INFO PFX "%d channels configured.\n", num); - out: if(ret) { if(card-iobase) @@ -2729,7 +2718,6 @@ { struct m3_card *card; -pm_unregister_all(m3_pm_callback); unregister_reboot_notifier(m3_reboot_nb); while ((card = devs)) { @@ -2767,72 +2755,17 @@ for(card = devs; card != NULL; card = card-next) { if(!card-in_suspend) -m3_suspend(card); -} -return 0; -} - -MODULE_AUTHOR("Zach Brown [EMAIL PROTECTED]"); -MODULE_DESCRIPTION("ESS Maestro3/Allegro Driver"); -#ifdef M_DEBUG -MODULE_PARM(debug,"i"); -#endif -MODULE_PARM(external_amp,"i"); - -static struct pci_driver m3_pci_driver = { -name: "ess_m3_audio", -id_table: m3_id_table, -probe: m3_probe, -remove: m3_remove, -}; - -static int __init m3_init_module(void) -{ -if (!pci_present()) /* No PCI bus in this machine! */ -return -ENODEV; - -printk(KERN_INFO PFX "version " DRIVER_VERSION " built at " __TIME__ " " __DATE__ "\n"); - -if (register_reboot_notifier(m3_reboot_nb)) { -printk(KERN_WARNING PFX "reboot notifier registration failed\n"); -return -ENODEV; /* ? */ -} - -if (!pci_register_driver(m3_pci_driver)) { -pci_unregister_driver(m3_pci_driver); -return -ENODEV; +m3_suspend(card-pcidev); /* XXX legal? */ } return 0; } -static void __exit m3_cleanup_module(void) -{ -pci_unregister_driver(m3_pci_driver); -} - -module_init(m3_init_module); -module_exit(m3_cleanup_module); - -void check_suspend(struct m3_card *card) -{ -DECLARE_WAITQUEUE(wait, current); - -if(!card-in_suspend) -return; - -card-in_suspend++; -add_wait_queue(card-suspend_queue, wait); -current-state = TASK
Re: PATCH: maestro ported to 2.4 PCI API
- ported to Linux 2.4 PCI API, PCI module based, cleaned up return values. (taking into account all the hints Jeff has given me ;) cool :) - did NOT change any power management support, since I don't know anything about power management. someone else (in .de, it appears to be the source of maestro hacking nowadays :)) is cleaning up the power management. - bumped version. we might as well just stop using these, they don't mean much of anything anymore. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] maestro3 2.4.2 pci dma address fix
It seems the maestro3 can't DMA from mem above 256M. Thanks to everyone who sent in bug reports that lead us to discover this.. this patch fixes that and also allocates memory at driver init time rather than at device open time.. please apply with wild abandon. -- zach --- linux-2.4.2/drivers/sound/maestro3.c.dmaWed Feb 28 09:22:18 2001 +++ linux-2.4.2/drivers/sound/maestro3.cWed Feb 28 10:10:07 2001 @@ -28,6 +28,9 @@ * Shouts go out to Mike "DJ XPCom" Ang. * * History + * v1.22 - Feb 28 2001 - Zach Brown [EMAIL PROTECTED] + * allocate mem at insmod/setup, rather than open + * limit pci dma addresses to 28bit, thanks guys. * v1.21 - Feb 04 2001 - Zach Brown [EMAIL PROTECTED] * fix up really dumb notifier - suspend oops * v1.20 - Jan 30 2001 - Zach Brown [EMAIL PROTECTED] @@ -150,7 +153,7 @@ #define M_DEBUG 1 -#define DRIVER_VERSION "1.21" +#define DRIVER_VERSION "1.22" #define M3_MODULE_NAME "maestro3" #define PFX M3_MODULE_NAME ": " @@ -330,6 +333,12 @@ MODULE_DEVICE_TABLE (pci, m3_id_table); +/* + * reports seem to indicate that the m3 is limited + * to 28bit bus addresses. rgggh... + */ +#define M3_PCI_DMA_MASK 0x0fff + static unsigned ld2(unsigned int x) { @@ -1943,6 +1952,9 @@ static void free_dmabuf(struct pci_dev *pci_dev, struct dmabuf *db) { +if(db-rawbuf == NULL) +return; + DPRINTK(DPSTR,"freeing %p from dmabuf %p\n",db-rawbuf, db); { @@ -1967,7 +1979,7 @@ int minor = MINOR(inode-i_rdev); struct m3_card *c; struct m3_state *s = NULL; -int i, ret = 0; +int i; unsigned char fmtm = ~0, fmts = 0; unsigned long flags; @@ -2013,10 +2025,6 @@ spin_lock_irqsave(s-lock, flags); if (file-f_mode FMODE_READ) { -if(allocate_dmabuf(s-card-pcidev, (s-dma_adc))) { -ret = -ENOMEM; -goto out; -} fmtm = ~((ESS_FMT_STEREO | ESS_FMT_16BIT) ESS_ADC_SHIFT); if ((minor 0xf) == SND_DEV_DSP16) fmts |= ESS_FMT_16BIT ESS_ADC_SHIFT; @@ -2025,10 +2033,6 @@ set_adc_rate(s, 8000); } if (file-f_mode FMODE_WRITE) { -if(allocate_dmabuf(s-card-pcidev, (s-dma_dac))) { -ret = -ENOMEM; -goto out; -} fmtm = ~((ESS_FMT_STEREO | ESS_FMT_16BIT) ESS_DAC_SHIFT); if ((minor 0xf) == SND_DEV_DSP16) fmts |= ESS_FMT_16BIT ESS_DAC_SHIFT; @@ -2040,7 +2044,6 @@ s-open_mode |= file-f_mode (FMODE_READ | FMODE_WRITE); MOD_INC_USE_COUNT; -out: up(s-open_sem); spin_unlock_irqrestore(s-lock, flags); return 0; @@ -2064,7 +2067,6 @@ m3_remove_list(s-card, (s-card-mixer_list), s-dma_dac.mixer_index); nuke_lists(s-card, (s-dma_dac)); } -free_dmabuf(s-card-pcidev, (s-dma_dac)); } if (file-f_mode FMODE_READ) { stop_adc(s); @@ -2072,7 +2074,6 @@ m3_remove_list(s-card, (s-card-adc1_list), s-dma_adc.adc1_index); nuke_lists(s-card, (s-dma_adc)); } -free_dmabuf(s-card-pcidev, (s-dma_adc)); } s-open_mode = (~file-f_mode) (FMODE_READ|FMODE_WRITE); @@ -2594,8 +2595,8 @@ DPRINTK(DPMOD, "in maestro_install\n"); -if (!pci_dma_supported(pci_dev, 0x)) { -printk(KERN_ERR PFX "architecture does not support 32bit PCI busmaster DMA\n"); +if (!pci_dma_supported(pci_dev, M3_PCI_DMA_MASK)) { +printk(KERN_ERR PFX "architecture does not support limiting to 28bit PCI bus +addresses\n"); return -ENODEV; } @@ -2604,6 +2605,8 @@ pci_set_master(pci_dev); +pci_dev-dma_mask = M3_PCI_DMA_MASK; + if( (card = kmalloc(sizeof(struct m3_card), GFP_KERNEL)) == NULL) { printk(KERN_WARNING PFX "out of memory\n"); return -ENOMEM; @@ -2681,6 +2684,12 @@ if ((s-dev_audio = register_sound_dsp(m3_audio_fops, -1)) 0) { break; } + +if( allocate_dmabuf(card-pcidev, (s-dma_adc)) || +allocate_dmabuf(card-pcidev, (s-dma_dac))) { +ret = -ENOMEM; +goto out; +} } if(request_irq(card-irq, m3_interrupt, SA_SHIRQ, card_names[card-card_type], card)) { @@ -2734,8 +2743,12 @@ for(i=0;iNR_DSPS;i++) { struct m3_state *s = card-channels[i]; -if(s-dev_audio != -1) -unregister_sound_dsp(s-dev_audio); +if(s-dev_audio 0) +continue; + +unregister_sound_dsp(s-dev_audio); +free_dmabuf(card-pcidev, s-dma_adc); +free_dmabuf(card-pcidev, s-dma_dac); } release_region(card-iobase, 256); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a messa
[RFC] pci_dma_set_mask()
This patch really has two parts. Most of it adds a helper function that does the if(pci_dma_supported()) { -dma_mask = mask } code path. I was using the api today and didn't realize that I had to set the mask myself, I assumed the _supported call would do it. If people prefer the struct setting api, thats fine with me :) It also seems goofy that pci_dma_supported() unconditionally return true on x86. If a device needs a mask smaller than GFP_DMA (admitedly probably won't exist in the real world, one hopes), they're going to have troubles because x86 just falls back to GFP_DMA when the mask isn't 0x.. all totally untested, of course.. -- zach --- linux-2.4.2/include/linux/pci.h.dmasup Wed Feb 28 10:26:14 2001 +++ linux-2.4.2/include/linux/pci.h Wed Feb 28 10:30:12 2001 @@ -527,6 +527,7 @@ int pci_enable_device(struct pci_dev *dev); void pci_set_master(struct pci_dev *dev); +int pci_set_dma_mask(struct pci_dev *dev, dma_addr_t mask); int pci_set_power_state(struct pci_dev *dev, int state); int pci_assign_resource(struct pci_dev *dev, int i); --- linux-2.4.2/include/asm-i386/pci.h.dmasup Wed Feb 28 10:19:01 2001 +++ linux-2.4.2/include/asm-i386/pci.h Wed Feb 28 10:25:40 2001 @@ -152,6 +152,14 @@ */ extern inline int pci_dma_supported(struct pci_dev *hwdev, dma_addr_t mask) { +/* + * we fall back to GFP_DMA when the mask isn't all 1s, + * so we can't guarantee allocations that must be + * within a tighter range than GFP_DMA.. + */ +if(mask 0x00ff) +return 0; + return 1; } --- linux-2.4.2/drivers/pci/pci.c.dmasupWed Feb 28 10:26:34 2001 +++ linux-2.4.2/drivers/pci/pci.c Wed Feb 28 10:27:34 2001 @@ -518,6 +518,18 @@ pcibios_set_master(dev); } +int +pci_set_dma_mask(struct pci_dev *dev, dma_addr_t mask) +{ +if(! pci_dma_supported(dev, mask)) +return 0; + +dev-dma_mask = mask; + +return 1; +} + + /* * Translate the low bits of the PCI base * to the resource type @@ -1206,6 +1218,7 @@ EXPORT_SYMBOL(pci_find_slot); EXPORT_SYMBOL(pci_find_subsys); EXPORT_SYMBOL(pci_set_master); +EXPORT_SYMBOL(pci_set_dma_mask); EXPORT_SYMBOL(pci_set_power_state); EXPORT_SYMBOL(pci_assign_resource); EXPORT_SYMBOL(pci_register_driver); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] pci_dma_set_mask()
pci_dma_supported has a boolean return, but the kernel norm is to return zero on success, and -EFOO on error. I like your proposal with the *nod* I just followed pci_dma_supported(). extremely minor nit that I think pci_set_dma_mask should return ENODEV or EIO or something on error, and zero on success. I agree, though I'd like to leave the decision up to people who live and breathe this stuff. please feel free to make minor adjustments and submit :) - z - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC] pci_set_dma_mask() + doc :)
please feel free to flame or apply, I'm not sure I'm really fond of the code example.. - z --- linux-2.4.2/include/linux/pci.h.dmasup Wed Feb 28 10:26:14 2001 +++ linux-2.4.2/include/linux/pci.h Wed Feb 28 10:30:12 2001 @@ -527,6 +527,7 @@ int pci_enable_device(struct pci_dev *dev); void pci_set_master(struct pci_dev *dev); +int pci_set_dma_mask(struct pci_dev *dev, dma_addr_t mask); int pci_set_power_state(struct pci_dev *dev, int state); int pci_assign_resource(struct pci_dev *dev, int i); --- linux-2.4.2/include/asm-i386/pci.h.dmasup Wed Feb 28 10:19:01 2001 +++ linux-2.4.2/include/asm-i386/pci.h Wed Feb 28 10:25:40 2001 @@ -152,6 +152,14 @@ */ extern inline int pci_dma_supported(struct pci_dev *hwdev, dma_addr_t mask) { +/* + * we fall back to GFP_DMA when the mask isn't all 1s, + * so we can't guarantee allocations that must be + * within a tighter range than GFP_DMA.. + */ +if(mask 0x00ff) +return 0; + return 1; } --- linux-2.4.2/drivers/pci/pci.c.dmasupWed Feb 28 10:26:34 2001 +++ linux-2.4.2/drivers/pci/pci.c Thu Mar 1 19:04:35 2001 @@ -518,6 +518,18 @@ pcibios_set_master(dev); } +int +pci_set_dma_mask(struct pci_dev *dev, dma_addr_t mask) +{ +if(! pci_dma_supported(dev, mask)) +return -EIO; + +dev-dma_mask = mask; + +return 0; +} + + /* * Translate the low bits of the PCI base * to the resource type @@ -1206,6 +1218,7 @@ EXPORT_SYMBOL(pci_find_slot); EXPORT_SYMBOL(pci_find_subsys); EXPORT_SYMBOL(pci_set_master); +EXPORT_SYMBOL(pci_set_dma_mask); EXPORT_SYMBOL(pci_set_power_state); EXPORT_SYMBOL(pci_assign_resource); EXPORT_SYMBOL(pci_register_driver); --- linux-2.4.2/Documentation/DMA-mapping.txt.dmasupWed Feb 28 23:03:04 2001 +++ linux-2.4.2/Documentation/DMA-mapping.txt Thu Mar 1 19:29:38 2001 @@ -71,12 +71,35 @@ if (! pci_dma_supported(pdev, 0x00ff)) goto ignore_this_device; +When DMA is possible for a given mask, the PCI layer must be informed of the +mask for later allocation operations on the device. This is achieved by +setting the dma_mask member of the pci_dev structure, like so: + +#define MY_HW_DMA_MASK 0x00ff + + if (! pci_dma_supported(pdev, MY_HW_DMA_MASK)) + goto ignore_this_device; + + pdev-dma_mask = MY_HW_DMA_MASK; + +A helper function is provided which performs this common code sequence: + + int pci_set_dma_mask(struct pci_dev *pdev, dma_addr_t device_mask) + +Unlike pci_dma_supported(), this returns -EIO when the PCI layer will not be +able to DMA with addresses restricted by that mask, and returns 0 when DMA +transfers are possible. If the call succeeds, the dma_mask will have been +updated so that your driver need not worry about it. + There is a case which we are aware of at this time, which is worth mentioning in this documentation. If your device supports multiple functions (for example a sound card provides playback and record functions) and the various different functions have _different_ DMA addressing limitations, you may wish to probe each mask and -only provide the functionality which the machine can handle. +only provide the functionality which the machine can handle. It +is important that the last call to pci_set_dma_mask() be for the +most specific mask. + Here is pseudo-code showing how this might be done: #define PLAYBACK_ADDRESS_BITS 0x @@ -86,14 +109,14 @@ struct pci_dev *pdev; ... - if (pci_dma_supported(pdev, PLAYBACK_ADDRESS_BITS)) { + if (pci_set_dma_mask(pdev, PLAYBACK_ADDRESS_BITS)) { card-playback_enabled = 1; } else { card-playback_enabled = 0; printk(KERN_WARN "%s: Playback disabled due to DMA limitations.\n", card-name); } - if (pci_dma_supported(pdev, RECORD_ADDRESS_BITS)) { + if (pci_set_dma_mask(pdev, RECORD_ADDRESS_BITS)) { card-record_enabled = 1; } else { card-record_enabled = 0; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[CFT] maestro update vs 2.2.18
I finally spent some time fixing up the maestro driver. lots of feature additions had backed up, and the source was rotting. Its still gross, but at least its cleaned up a bit. "It works for me" on my pentium with an ESS maestro2 engineering board, but laptops will be another story entirely. I'd love it if people could apply this patch to vanilla 2.2.18 and let me know how it goes. The patch does a few things. Most interestingly for the user, it moves away from the model of having multiple /dev/dsp? files and instead allows /dev/dsp to be opened concurrently. It also adds some support for the hardware volume buttons on laptops, but not all vendors wire this the same way. As I don't have a maestro-bearing laptop, this is totally untested. The code is butchered, so the diff is almost illegible. Perhaps I'll learn and do things in stages next time, but I was on a roll :) One of the more notable changes involves using the kernel's ac97_codec code rather than its own. Hopefully this will result in better mixer behaviour. I'm particularly interested in hearing how suspend/resume functions, whether or not the multi-open stuff works, and I'd like to get subvendor IDs from people whose laptop's hardware volume buttons work. See the Documentation/sound/Maestro text for instructions on enabling multi-open (channels=2 or 4) and hardware volume support (hw_vol=1). Its an awfully large diff, so it can be fetched from: http://www.zabbo.net/maestro/patches/2.2.18-mega-1.diff.gz if this works I'll officially submit it and make the same sorts of changes to 2.4. -- zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: IDE UDMA on a CMD-648 Chip
I can't find the chip's datasheet. CMD only gives it to direct customers. I do have the datasheet for the CMD-646U, a prior UDMA supporting chip. Have you tried mailing them?. I sent mail to something silly like support@cmd. After they found the right person for me to talk to, I mentioned wanting to play with linux support, and he happily sent the datasheets for the 648 and 649. I use andre's 2.2 ide patches to get support for it.. -- zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: user space web server accelerator support
Fantastic! I was not aware of it, sorry... where can I find some doc? There are some patches in the apache source rpms in http://www.zabbo.net/phhttpd/ that shows how apache can connect to another daemon and get its incoming connections sockets from it. phhttpd itself is pretty hairy code (don't ask :)), but the apache changes are pretty straight forward. -- zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Patch: linux-2.4.0-test11/drivers/sound/maestro.c port to new PCI interface
Unrelated to your change: the maestro reboot notifier shouldn't need to unregister all that stuff. Who cares if the sound devices are freed, since we are rebooting. free_irq+maestro_power seems sufficient. or maybe stop_dma+free_irq+poweroff. its only the power stuff that matters. some biosen don't power down properly if the chip isn't powered down. That could actually be because of weird changes we make to the chip and then mask by powering it down, but shutting it down made the machines halt again :) -- zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Patch: linux-2.4.0-test11/drivers/sound/maestro.c port to new PCI interface
[first off, thanks for kicking this stuff into gear Adam. I'm way too lazy to do this stuff of my own volition :)] was to serialize access to the mixer, there are surely better ways to do it. Why are interrupts disabled? the maestro has crappy register indirection that you must use to do nearly anything. when the locking was added we just went nutty with one big lock :) yes, this could be cleaned up with a mixer specific lock as the mixer regs aren't atomic, but don't require the same indirection as the rest of the maestro stuff. the locking fix is still valid if they're all under one big lock. As far as I can tell, I agree with you, but I do not think that is related to the move to the new PCI interface. *nod* I also agree that the ioctl patch is kind of a bandaid over the problems that you described, and, while Zach Brown can speak The biggest problem is that the current code is gross gross gross. I've been avoiding dealing with it too much in the hopes that moving to oss_audio will make things much more friendly across the board. problem permanent, and (2) it is an incremental improvement over the status quo. *nod* I'd like 2.2/2.4 to at least be solid for people so I can disappear (again) to do the oss_audio work. Really, I'd hope it wouldn't take all that long. That said, I am willing to help try to clean up the locking code in maestro if Zach thinks it is worth doing right now, since I have a notebook computer with a maestro chip in If you get me a patch thats obviously correct, I'd happily agree to it going in. I'm not sure its immeditately needed, but won't stop you from playing with things :). Jeff has the start of a patch that moves maestro.c to be a consumer of ac97_codec. It was really close, except for some locking oopsies. If you're dorking around with the locking, you might be able to fix that up at the same time. The new ac97 interface cleans up the code a lot. you can look at how the maestro3 does it, its almost identical to how maestro.c + ac97 would look. [hah, except that maestro3 hasn't had any locking applied to it at all.] This driver needs __devxxx, I've heard mention of some hotpluggable audio that is based on the maestro chipset (which chip I don't remember) scary. :) this all reminds me, the maestro3 driver is working well enough for most people that I should get it going on 2.4 and submit it.. -- zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: sunhme.c patch for new PCI interface (UNTESTED)
On Thu, Nov 16, 2000 at 04:22:36AM -0800, David S. Miller wrote: Sure, that sounds nice. Actually, one of the possible "grand plans" for 2.5 is a unified "struct device". I don't know what will actually happen here. apologies for pointing out the potentially obvoius, but people might want to investigate 'newbus' while brainstorming about this: http://www.daemonnews.org/27/newbus-intro.html -- zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Maestro3/Allegro: (was ESS device 1998)
On Thu, Nov 02, 2000 at 12:03:41PM +, Mo McKinlay wrote: I recently obtained an HP Omnibook XE2 laptop. It's a reasonably As people have mentioned, there is an alpha free driver near http://www.zabbo.net/maestro3/. Its not quite up to par yet. maybe the web page should talk a bit more about the chip familiy. The maestro3 has a lot of pieces in common with the maestro2, except for the part of the chip that did pcm manipulation. the m3 only has a dsp where the m2 had specific silicon for doing pcm work. the allegro is a "slimmed down" maestro3, and neither have anything to do with cirrus/crystal CS parts as far as I know :) I expect you'll have the 'slow down' problem on the Xe2, we have the clocking messed up on some implementations (those that don't clock the thing at 49mhz, as god intended? :)) I've given up on the internal modem (I'm 90% sure it's some kind of *nod* Its the usual mc97 codec setup that leaves the hard work for the processor. I'm sure one can play around with the dsp on it as well, but we don't have specs on the dsp's internals. -- zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
maestro3 2.2 oss driver snapshot
http://www.zabbo.net/maestro3/ contains a driver for the maestro3 and allegro chipsets that I'm fairly happy with. its 2.2 only for now, play with 2.4 at your own risk. for now it includes its own ac97_codec.c that is backported from 2.2. I expect playback to work as well as ac97 mixing. apm support works pretty darn well, you can suspend during pcm playback and it should start playing again on resume. mmap() should work, but is untested. record does not work at all. if you test it, please let me know how it goes and tell me all about your hardware. I'll have a polished version later that will be submitted into the kernel proper. -- zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
[PATCH] maestro3 2.4.4-pre5 fixes
The maestro3 driver has a error cleanup bug that would leave its reboot notifier registerd after exiting init_module with an error. Bayad. It also had a minor bug where it didn't explicitly set the codec-id before probing.. this patch vs 2.4.4-pre5 (I hope, its really vs jeff's cvs :)) fixes both. Linus, please apply. -- zach Index: maestro3.c === RCS file: /cvsroot/gkernel/linux_2_4/drivers/sound/maestro3.c,v retrieving revision 1.1.1.24.6.1 diff -u -r1.1.1.24.6.1 maestro3.c --- maestro3.c 2001/04/20 01:40:38 1.1.1.24.6.1 +++ maestro3.c 2001/04/21 14:36:21 @@ -2307,6 +2307,8 @@ codec-private_data = card; codec-codec_read = m3_ac97_read; codec-codec_write = m3_ac97_write; +/* someday we should support secondary codecs.. */ +codec-id = 0; if (ac97_probe_codec(codec) == 0) { printk(KERN_ERR PFX "codec probe failed\n"); @@ -2933,6 +2935,7 @@ if (!pci_register_driver(m3_pci_driver)) { pci_unregister_driver(m3_pci_driver); +unregister_reboot_notifier(m3_reboot_nb); return -ENODEV; } return 0; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: X15 alpha release: as fast as TUX but in user space
i think Zach's phhttpd is an important milestone as well, it's the first userspace webserver that shows how to use event-based, sigio-based async networking IO and sendfile() under Linux. (I believe it had some *blush* performance problems related to sigio queue overflow, these issues might be solved in the latest kernels.) The zerocopy enhancements should help phhttpd as well. oh, it has a bunch of problems :) over-threading created complexity in the fast path. It always spends memory on a contiguous header region for each connection, which may not be valid in the days of zero copy sendmsg. It does IO in the fast path. And looking back at it, I'm struck by how naive most of the code is :) :) I've always been tempted to go back and take a real swing at a nice content server, but there's only so many hours in the day, and apache+thttpd+tux complete the problem space. If X15 isn't released with an open license, I may be tempted yet again :) - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Patch to make ymfpci legacy address 16 bits
On Wed, May 09, 2001 at 05:08:15PM -0400, Jeff Garzik wrote: Why does maestro.c not use my suggestion? Because it doesn't use struct pci_driver. I finally found an able hacker with maestro hardware with power management. He not only fixed the nasty pm races that were causing channel corruption, but he seems willing to move it towards the proper pci_driver APIs as well. So don't worry Jeff, we'll fix the evil maestro.c driver yet :) :) - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] ess maestro, support for hardware volume control
this patch applies to (at least) 2.4.3 up to and including 2.4.6-pre2. It enables the hardware volume control feature of the maestro. cool. I had support for this in the mega-patch I posted long ago, but I never seperated and submitted those changes 'cause I'm a moron. By giving hwv=0 to insmod one can explicitly disable it. Setting can we have a better name like 'hwvol_enable'? + set_mixer(c, 0, val); careful. you just used the indirect ac97 registers without holding the card's lock.. if another processor does a mixer ioctl while this is happening you'll get weird behaviour. it looks like you should just be able to spin_{,un}lock(card-lock) around that call, but the maestro locking is so friggin' twisty.. this gets even more exciting when the driver uses ac97_codec, which the mega-patch also had in it. Fix the locking (and the obscure parameter name? :)) and it looks fine.. good work. -- zach - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] ess maestro, support for hardware volume control
below is the version with the suggested fixes, and with s/hwv/hwvol/ for hwv_input also. fantastic, thanks lukas. alan, can you throw this in -ac? I don't think it will cause problems for people with nonstandard wiring on the hw vol pins (read: dell lattitudes), but if it does we can blacklist the lattitudes (later turning into enabling support code for their custom hw vol wiring.. its on the todo list :/). This should work on inspirons. - z --- linux-2.4.6-pre2/drivers/sound/maestro.c Sat Jun 9 16:55:22 2001 +++ linux/drivers/sound/maestro.c Sat Jun 9 19:48:34 2001 @@ -115,6 +115,10 @@ * themselves, but we'll see. * * History + * v0.15 - Jun 09 2001 - Lukas Schroeder [EMAIL PROTECTED] + * enable hardware volume control (by default) + * add hwvol_enable= to allow disabling of HWV (values are 0 or 1) + * add hwvol_input= to allow selecting the HWV input pins (values are 0 or 1) * (still kind of v0.14) Nov 23 - Alan Cox [EMAIL PROTECTED] * Add clocking= for people with seriously warped hardware * (still v0.14) Nov 10 2000 - Bartlomiej Zolnierkiewicz [EMAIL PROTECTED] @@ -269,8 +273,13 @@ static int clocking=48000; +/* enable hardware volume control? */ +static int hwvol_enable = 1; +/* hardware volume input pin selection */ +static int hwvol_input = 0; + /* - */ -#define DRIVER_VERSION 0.14 +#define DRIVER_VERSION 0.15 #ifndef PCI_VENDOR_ESS #define PCI_VENDOR_ESS 0x125D @@ -312,6 +321,9 @@ #define NR_APUS 64 #define NR_APU_REGS 16 +/* steps per hardware volume count */ +#define HWV_MIXER_STEP 15 + /* acpi states */ enum { ACPI_D0=0, @@ -514,6 +526,7 @@ /* - */ +static void set_mixer(struct ess_card *card,unsigned int mixer, unsigned int val ) ; static void check_suspend(struct ess_card *card); static struct ess_card *devs = NULL; @@ -1898,10 +1911,20 @@ if(event(16)) { - /* XXX if we have a hw volume control int enable - all the ints? doesn't make sense.. */ + unsigned int val; + event = inw(c-iobase+0x18); - outb(0xFF, c-iobase+0x1A); + outb((16), c-iobase+0x1A); + + /* read the HW Master Volume Counter + Bits 7:5 Master Volume Left + Bits 3:1 Master Volume Right +*/ + i = inb(c-iobase+0x1f); + val = ((HWV_MIXER_STEP * ((i1) 7)) 8) | HWV_MIXER_STEP * ((i5) 7); + spin_lock(s-lock); + set_mixer(c, 0, val); + spin_unlock(s-lock); } else { @@ -3088,8 +3111,10 @@ w=~(114);/* External clock */ w=~(17); /* HWV off */ + if (hwvol_enable) w|=(17); w=~(16); /* Debounce off */ - w=~(15); /* GPIO 4:5 */ + w=~(15); /* GPIO 4:5 ; HVI pin selection */ + if (hwvol_input) w|=(15); w|= (14); /* Disconnect from the CHI. Enabling this made a dell 7500 work. */ w=~(12); /* MIDI fix off (undoc) */ w=~(11); /* reserved, always write 0 */ @@ -3170,7 +3195,8 @@ outw(w, iobase+0x18); w=inw(iobase+0x18); - w=~(16); /* Harpo off */ + w=~(16); /* HWV irq off */ + if (hwvol_enable) w|=(16); outw(w, iobase+0x18); w=inw(iobase+0x18); @@ -3487,6 +3513,7 @@ /* now go to sleep 'till something interesting happens */ maestro_power(card,ACPI_D2); + printk(KERN_INFO maestro: hardware volume control %senabled\n, (hwvol_enable) ? : not ); printk(KERN_INFO maestro: %d channels configured.\n, num); return 1; } @@ -3593,6 +3620,10 @@ MODULE_PARM(dsps_order,i); MODULE_PARM(use_pm,i); MODULE_PARM(clocking, i); + +MODULE_PARM(hwvol_enable, i); +MODULE_PARM(hwvol_input, i); + void cleanup_module(void) { M_printk(maestro: unloading\n); - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] ess maestro, support for hardware volume control
I now have a patch that will output the hwv buttons pressed (up, down, mute) to a new dynamically allocated misc device as letters u, d, m, instead of directly modifying the mixer. Anyone want that? It's more flexible than either the patch that's currently in -ac or Lukas's patch, but you need a little userspace daemon for it to do anything useful. hmm.. how do the alsa guys deal with hw volume? I'm not sure I like adding more stuff to the OSS API. -- zach - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC][PATCH] cutting up struct kernel_stat into cpu_stat
The attached patch-in-progress removes the per-cpu statistics from struct kernel_stat and puts them in a cpu_stat structure, one per cpu, cacheline padded. The data is still coolated and presented through /proc/stat, but another file /proc/cpustat is also added. The locking is as nonexistant as it was with kernel_stat, but who cares, they're just fuzzy stats to be eyeballed by system tuners :). A tool for printing the cpu stats specifically can be found near: http://www.osdlab.org/sw_resources/cpustat/index.shtml Its output is almost identical to solaris' mpstat. I'm not sure I like the macro use, but it shields the callers from the union garbage. We can easily also make a THIS_CPU_STAT_ADD() interface, as some have hinted would be nice :) Currently its mostly ( :) ) only collecting the stats that were collected in kernel_stat. I'd like to add more stats -- page faults, syscalls, cross-cpu calls, etc. I understand people not wanting more live cachelines in the fast paths. I can make CPU_CRITICAL_STAT defines that are config-ed out.. comments? If its ok I can whip up a patch that updates all the ports use of -irqs[] as well. - z [ heading out for lunch :) ] --- linux-2.4.5-cpustat/fs/proc/proc_misc.c.cpustat Fri Apr 13 20:26:07 2001 +++ linux-2.4.5-cpustat/fs/proc/proc_misc.c Thu Jun 21 12:23:49 2001 @@ -265,32 +265,36 @@ int i, len; extern unsigned long total_forks; unsigned long jif = jiffies; - unsigned int sum = 0, user = 0, nice = 0, system = 0; + unsigned int sum = 0, user = 0, nice = 0, system = 0, ctxt = 0; int major, disk; for (i = 0 ; i smp_num_cpus; i++) { int cpu = cpu_logical_map(i), j; - user += kstat.per_cpu_user[cpu]; - nice += kstat.per_cpu_nice[cpu]; - system += kstat.per_cpu_system[cpu]; + user += CPU_STAT_VAL(cpu, user); + nice += CPU_STAT_VAL(cpu, nice); + system += CPU_STAT_VAL(cpu, system); + ctxt += CPU_STAT_VAL(cpu, context_swtch); #if !defined(CONFIG_ARCH_S390) for (j = 0 ; j NR_IRQS ; j++) - sum += kstat.irqs[cpu][j]; + sum += CPU_STAT_VAL(cpu, irqs[j]); #endif } len = sprintf(page, cpu %u %u %u %lu\n, user, nice, system, jif * smp_num_cpus - (user + nice + system)); - for (i = 0 ; i smp_num_cpus; i++) + for (i = 0 ; i smp_num_cpus; i++) { + unsigned int user_i, nice_i, system_i; + + user_i = CPU_STAT_VAL(cpu_logical_map(i), user); + nice_i = CPU_STAT_VAL(cpu_logical_map(i), nice); + system_i = CPU_STAT_VAL(cpu_logical_map(i), system); + len += sprintf(page + len, cpu%d %u %u %u %lu\n, i, - kstat.per_cpu_user[cpu_logical_map(i)], - kstat.per_cpu_nice[cpu_logical_map(i)], - kstat.per_cpu_system[cpu_logical_map(i)], - jif - ( kstat.per_cpu_user[cpu_logical_map(i)] \ - + kstat.per_cpu_nice[cpu_logical_map(i)] \ - + kstat.per_cpu_system[cpu_logical_map(i)])); + user_i, nice_i, system_i, + jif - ( user_i + nice_i + system_i ) ); + } len += sprintf(page + len, page %u %u\n swap %u %u\n @@ -330,13 +334,58 @@ \nctxt %u\n btime %lu\n processes %lu\n, - kstat.context_swtch, + ctxt, xtime.tv_sec - jif / HZ, total_forks); return proc_calc_metrics(page, start, off, count, eof, len); } +static int cstat_read_proc(char *page, char **start, off_t off, +int count, int *eof, void *data) +{ + int i, len; + + len = sprintf(page, cpu_stat 0.0\n); + + for (i = 0 ; i smp_num_cpus; i++) { + unsigned int user, nice, system; + int j, cpu = cpu_logical_map(i); + +#if !defined(CONFIG_ARCH_S390) + len += sprintf(page + len, cpu%d irqs , cpu); + for (j = 0 ; j NR_IRQS ; j++) { + len += sprintf(page + len, %u, + CPU_STAT_VAL(cpu, irqs[j])); + } + len += sprintf(page + len, \n); +#endif +#if defined(CONFIG_SMP) + len += sprintf(page + len, cpu%d context_migration %u\n, + cpu, CPU_STAT_VAL(cpu, context_migration)); +#endif + len += sprintf(page + len, cpu%d bottom_halves %u\n, + cpu, CPU_STAT_VAL(cpu, bh)); + len += sprintf(page + len, cpu%d context_switches %u\n, + cpu, CPU_STAT_VAL(cpu, context_swtch)); + + user =
[RFC][PATCH] struct kernel_stat - struct cpu_stat[NR_CPUS]
Currently struct kernel_stat has a few pre cpu arrays. This creates cacheline exchange noise as the cpus update their entries in each array. This patch creates an array of per cpu structs. The structure is padded to the length of a cacheline. The meat of the patch against 2.4.6-pre8 is attached. The rest of the patch is rather large because it touches all the architectures' use of ks-irqs[], and can be found at http://www.osdlab.org/sw_resources/cpustat/cpustat-2.4.6.pre8-1.diff These per cpu statistics are reported via a new /proc/cpustat, a quick tool for processing that output, vmstat-style, can be found near http://www.osdlab.org/sw_resources/cpustat/index.shtml In addition to shuffling structures around, the patch adds the recording of context scheduling migrations as well as minor, major, and cow faults. I'd really like to hear what people think of the patch. Unless someone strongly is dead set against it, I'd like to send it off to Linus and move on to other things :) Would arch maintainers rather that I fed them per-arch patches for their trees? - z --- linux-2.4.6-pre8/fs/proc/proc_misc.c.cpustatFri Apr 13 20:26:07 2001 +++ linux-2.4.6-pre8/fs/proc/proc_misc.cMon Jul 2 15:04:49 2001 @@ -265,32 +265,37 @@ int i, len; extern unsigned long total_forks; unsigned long jif = jiffies; - unsigned int sum = 0, user = 0, nice = 0, system = 0; + unsigned int sum = 0, user = 0, nice = 0, system = 0, ctxt = 0; int major, disk; for (i = 0 ; i smp_num_cpus; i++) { int cpu = cpu_logical_map(i), j; - user += kstat.per_cpu_user[cpu]; - nice += kstat.per_cpu_nice[cpu]; - system += kstat.per_cpu_system[cpu]; + user += cpu_stat[cpu].user; + nice += cpu_stat[cpu].nice; + system += cpu_stat[cpu].system; + ctxt += cpu_stat[cpu].context_swtch; #if !defined(CONFIG_ARCH_S390) for (j = 0 ; j NR_IRQS ; j++) - sum += kstat.irqs[cpu][j]; + sum += cpu_stat[cpu].irqs[j]; #endif } len = sprintf(page, cpu %u %u %u %lu\n, user, nice, system, jif * smp_num_cpus - (user + nice + system)); - for (i = 0 ; i smp_num_cpus; i++) + for (i = 0 ; i smp_num_cpus; i++) { + unsigned int user_i, nice_i, system_i; + int cpu = cpu_logical_map(i); + + user_i = cpu_stat[cpu].user; + nice_i = cpu_stat[cpu].nice; + system_i = cpu_stat[cpu].system; + len += sprintf(page + len, cpu%d %u %u %u %lu\n, i, - kstat.per_cpu_user[cpu_logical_map(i)], - kstat.per_cpu_nice[cpu_logical_map(i)], - kstat.per_cpu_system[cpu_logical_map(i)], - jif - ( kstat.per_cpu_user[cpu_logical_map(i)] \ - + kstat.per_cpu_nice[cpu_logical_map(i)] \ - + kstat.per_cpu_system[cpu_logical_map(i)])); + user_i, nice_i, system_i, + jif - ( user_i + nice_i + system_i ) ); + } len += sprintf(page + len, page %u %u\n swap %u %u\n @@ -330,13 +335,64 @@ \nctxt %u\n btime %lu\n processes %lu\n, - kstat.context_swtch, + ctxt, xtime.tv_sec - jif / HZ, total_forks); return proc_calc_metrics(page, start, off, count, eof, len); } +static int cstat_read_proc(char *page, char **start, off_t off, +int count, int *eof, void *data) +{ + int i, len; + + len = sprintf(page, cpu_stat 0.0\n); + + for (i = 0 ; i smp_num_cpus; i++) { + unsigned int user, nice, system; + int j, cpu = cpu_logical_map(i); + struct cpu_stat *cs = cpu_stat[cpu]; + +#if !defined(CONFIG_ARCH_S390) + len += sprintf(page + len, cpu%d irqs , cpu); + for (j = 0 ; j NR_IRQS ; j++) { + len += sprintf(page + len, %u, + cs-irqs[j]); + } + len += sprintf(page + len, \n); +#endif +#if defined(CONFIG_SMP) + len += sprintf(page + len, cpu%d context_migration %u\n, + cpu, cs-context_migration); +#endif + len += sprintf(page + len, cpu%d context_switches %u\n, + cpu, cs-context_swtch); + + len += sprintf(page + len, cpu%d major_faults %u\n, + cpu, cs-major_fault); + len += sprintf(page + len, cpu%d minor_faults %u\n, + cpu, cs-minor_fault); + len += sprintf(page + len, cpu%d
[PATCH] 2.4.7-pre3 kernel_stat - cpu_stat[NR_CPUS]
This patch does the following: - creates a cacheline aligned/padded struct cpu_stat[NR_CPUS]. - moves the [NR_CPUS] members of kernel_stat into cpu_stat This moves the stat data that a cpu will update into a contiguous region. Previous users of kernel_stat would compete for an array's cacheline with other cpus. - creates /proc/cpu/[0-9]+/ and fs/proc/proc_cpu.c with code for managing files in the cpu's directories. This should be useful for rusty's */active and arjan/rmk's */frequency. I have no strong feelings about where this lives or what it should be named. - adds collection of fault statistics and adds 'context migration' recording, per cpu. - updates users of kernel_stat in proc, reports shouldn't change - updates every friggin' user of kstat.irqs[] in arch/ with a macro so that we never have to do this again. The patch is rather large, due to that last bit. It can be found at http://www.osdlab.org/sw_resources/cpustat/cpustat-2.4.7.pre3-1.diff with a tool for summarizing /proc/cpu/*/stat at: http://www.osdlab.org/sw_resources/cpustat/index.shtml I'd like to get this sent to Linus soon, but wanted to run the /proc/cpu/* stuff by l-k one last time. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
maestro3 patch, resent
duh. I sent this to rutgers originally.. --- Date: Mon, 5 Feb 2001 07:42:25 -0500 From: Zach Brown [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: [PATCH] maestro3 2.4.1-ac2 shutdown fix Its a wonder that anyone lets me write code. The following fixes a goofy shutdown problem with the 2.4 maestro3 driver as it appears in Alan's 2.4.1-ac2 patch. If power management was disabled the maestro3 driver would oops trying to save the dsp state as the machine shut down. The full source for the up to date 2.4 driver can be found at: http://www.zabbo.net/maestro3/maestro3-2.4-20010204.tar.gz thanks again to Andres Salomon for continuing to report my dumb bugs. Hopefully this will be the last of the trivial bugs, this driver needs some real meaningful cleaning up.. -- zach --- maestro3.c Mon Feb 5 06:51:58 2001 +++ maestro3.c Mon Feb 5 07:40:53 2001 @@ -28,6 +28,8 @@ * Shouts go out to Mike "DJ XPCom" Ang. * * History + * v1.21 - Feb 04 2001 - Zach Brown [EMAIL PROTECTED] + * fix up really dumb notifier - suspend oops * v1.20 - Jan 30 2001 - Zach Brown [EMAIL PROTECTED] * get rid of pm callback and use pci_dev suspend/resume instead * m3_probe cleanups, including pm oops think-o @@ -147,7 +149,7 @@ #define M_DEBUG 1 -#define DRIVER_VERSION "1.20" +#define DRIVER_VERSION "1.21" #define M3_MODULE_NAME "maestro3" #define PFX M3_MODULE_NAME ": " @@ -2763,7 +2765,6 @@ static void m3_suspend(struct pci_dev *pci_dev) { unsigned long flags; -int index; int i; struct m3_card *card = pci_dev-driver_data; @@ -2788,15 +2789,18 @@ m3_assp_halt(card); -index = 0; -DPRINTK(DPMOD, "saving code\n"); -for(i = REV_B_CODE_MEMORY_BEGIN ; i = REV_B_CODE_MEMORY_END; i++) -card-suspend_mem[index++] = -m3_assp_read(card, MEMTYPE_INTERNAL_CODE, i); -DPRINTK(DPMOD, "saving data\n"); -for(i = REV_B_DATA_MEMORY_BEGIN ; i = REV_B_DATA_MEMORY_END; i++) -card-suspend_mem[index++] = -m3_assp_read(card, MEMTYPE_INTERNAL_DATA, i); +if(card-suspend_mem) { +int index = 0; + +DPRINTK(DPMOD, "saving code\n"); +for(i = REV_B_CODE_MEMORY_BEGIN ; i = REV_B_CODE_MEMORY_END; i++) +card-suspend_mem[index++] = +m3_assp_read(card, MEMTYPE_INTERNAL_CODE, i); +DPRINTK(DPMOD, "saving data\n"); +for(i = REV_B_DATA_MEMORY_BEGIN ; i = REV_B_DATA_MEMORY_END; i++) +card-suspend_mem[index++] = +m3_assp_read(card, MEMTYPE_INTERNAL_DATA, i); +} DPRINTK(DPMOD, "powering down apci regs\n"); m3_outw(card, 0x, 0x54); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] maestro3 still oopses?
The maestro3 driver, included in 2.4.2-pre2 (which I assume is the same as maestro3-2.4-20010204.tar.gz, I haven't bothered to try it; I'm perfectly happy w/ my patch), oopses upon shutdown. the maestro3 snapshot in 2.4.2pre2 is not up to date. I imagine it came from alan, who got the jan30 patch, but didn't get the trivial feb 04 patch that fixes the oops you're seeing. the proper patch to 2.4.2-pre2 (and 2.4ac-current, presumably) is attached. Does it fix you problem? [it tries to do so without duplicating code, you'll notice.] + * use pci_module_init() instead of pci_register_driver(). I'd rather do this in a seperate patch that does lots of other cleanups that are pending. - z --- maestro3.c Mon Feb 5 06:51:58 2001 +++ maestro3.c Mon Feb 5 07:40:53 2001 @@ -28,6 +28,8 @@ * Shouts go out to Mike "DJ XPCom" Ang. * * History + * v1.21 - Feb 04 2001 - Zach Brown [EMAIL PROTECTED] + * fix up really dumb notifier - suspend oops * v1.20 - Jan 30 2001 - Zach Brown [EMAIL PROTECTED] * get rid of pm callback and use pci_dev suspend/resume instead * m3_probe cleanups, including pm oops think-o @@ -147,7 +149,7 @@ #define M_DEBUG 1 -#define DRIVER_VERSION "1.20" +#define DRIVER_VERSION "1.21" #define M3_MODULE_NAME "maestro3" #define PFX M3_MODULE_NAME ": " @@ -2763,7 +2765,6 @@ static void m3_suspend(struct pci_dev *pci_dev) { unsigned long flags; -int index; int i; struct m3_card *card = pci_dev-driver_data; @@ -2788,15 +2789,18 @@ m3_assp_halt(card); -index = 0; -DPRINTK(DPMOD, "saving code\n"); -for(i = REV_B_CODE_MEMORY_BEGIN ; i = REV_B_CODE_MEMORY_END; i++) -card-suspend_mem[index++] = -m3_assp_read(card, MEMTYPE_INTERNAL_CODE, i); -DPRINTK(DPMOD, "saving data\n"); -for(i = REV_B_DATA_MEMORY_BEGIN ; i = REV_B_DATA_MEMORY_END; i++) -card-suspend_mem[index++] = -m3_assp_read(card, MEMTYPE_INTERNAL_DATA, i); +if(card-suspend_mem) { +int index = 0; + +DPRINTK(DPMOD, "saving code\n"); +for(i = REV_B_CODE_MEMORY_BEGIN ; i = REV_B_CODE_MEMORY_END; i++) +card-suspend_mem[index++] = +m3_assp_read(card, MEMTYPE_INTERNAL_CODE, i); +DPRINTK(DPMOD, "saving data\n"); +for(i = REV_B_DATA_MEMORY_BEGIN ; i = REV_B_DATA_MEMORY_END; i++) +card-suspend_mem[index++] = +m3_assp_read(card, MEMTYPE_INTERNAL_DATA, i); +} DPRINTK(DPMOD, "powering down apci regs\n"); m3_outw(card, 0x, 0x54); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: user space web server accelerator support
Zach, have you ever noticed such a performance bottleneck in your phhttpd? yup, this is definitely something you don't want to be doing in the fast path :) Any thoughts? Sorry I don't remember the start of this thread, but I'll ask anyway; have you looked at Ingo Molnar's Tux server? Its state of the art unix serving, implemented in the linux kernel: http://people.redhat.com/mingo/TUX-patches/ -- zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Introduce a handy list_first_entry macro
On Tue, Apr 17, 2007 at 03:18:56PM +0400, Pavel Emelianov wrote: There are many places in the kernel where the construction like foo = list_entry(head-next, struct foo_struct, list); are used. The code might look more descriptive and neat if using the macro list_first_entry(head, type, member) \ list_entry((head)-next, type, member) - dquot = list_entry(dirty-next, struct dquot, dq_dirty); + dquot = list_first_entry(dirty, struct dquot, dq_dirty); I think it's more than just descriptive and neat. A common sneaky bug is to accidentally pass list-next into list_entry() instead of list-next. This is easy to do because one is used to typing list in list_empty(), etc. So by following -next in the macro we make the list argument consistent in list_empty() and list_first_entry() and hopefully reduce the risk of making that mistake. I like it. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] dio: invalidate clean pages before dio write
dio: invalidate clean pages before dio write This patch fixes a user-triggerable oops that was reported by Leonid Ananiev as archived at http://lkml.org/lkml/2007/2/8/337. dio writes invalidate clean pages that intersect the written region so that subsequent buffered reads go to disk to read the new data. If this fails the interface tries to tell the caller that the cache is inconsistent by returning EIO. Before this patch we had the problem where this invalidation failure would clobber -EIOCBQUEUED as it made its way from fs/direct-io.c to fs/aio.c. Both fs/aio.c and bio completion call aio_complete() and we reference freed memory, usually oopsing. This patch addresses this problem by invalidating before the write so that we can cleanly return -EIO before -direct_IO() has had a chance to return -EIOCBQUEUED. There is a compromise here. During the dio write we can fault in mmap()ed pages which intersect the written range with get_user_pages() if the user provided them for the source buffer. This is a crazy thing to do, but we can make it mostly work in most cases by trying the invalidation again. The compromise is that we won't return an error if this second invalidation fails if it's an AIO write and we have -EIOCBQUEUED. This was tested by having two processes race performing large O_DIRECT and buffered ordered writes. Within minutes ext3 would see a race between ext3_releasepage() and jbd holding a reference on ordered data buffers and would cause invalidation to fail, panicing the box. The test can be found in the 'aio_dio_bugs' test group in test.kernel.org/autotest. After this patch the test passes. Signed-off-by: Zach Brown [EMAIL PROTECTED] --- mm/filemap.c | 48 1 file changed, 36 insertions(+), 12 deletions(-) --- a/mm/filemap.c Wed Feb 28 06:00:13 2007 + +++ b/mm/filemap.c Thu Mar 08 17:48:58 2007 -0800 @@ -2379,7 +2379,8 @@ generic_file_direct_IO(int rw, struct ki struct file *file = iocb-ki_filp; struct address_space *mapping = file-f_mapping; ssize_t retval; - size_t write_len = 0; + size_t write_len; + pgoff_t end = 0; /* silence gcc */ /* * If it's a write, unmap all mmappings of the file up-front. This @@ -2388,23 +2389,46 @@ generic_file_direct_IO(int rw, struct ki */ if (rw == WRITE) { write_len = iov_length(iov, nr_segs); + end = (offset + write_len - 1) PAGE_CACHE_SHIFT; if (mapping_mapped(mapping)) unmap_mapping_range(mapping, offset, write_len, 0); } retval = filemap_write_and_wait(mapping); - if (retval == 0) { - retval = mapping-a_ops-direct_IO(rw, iocb, iov, - offset, nr_segs); - if (rw == WRITE mapping-nrpages) { - pgoff_t end = (offset + write_len - 1) -PAGE_CACHE_SHIFT; - int err = invalidate_inode_pages2_range(mapping, + if (retval) + goto out; + + /* +* After a write we want buffered reads to be sure to go to disk to get +* the new data. We invalidate clean cached page from the region we're +* about to write. We do this *before* the write so that we can return +* -EIO without clobbering -EIOCBQUEUED from -direct_IO(). +*/ + if (rw == WRITE mapping-nrpages) { + retval = invalidate_inode_pages2_range(mapping, offset PAGE_CACHE_SHIFT, end); - if (err) - retval = err; - } - } + if (retval) + goto out; + } + + retval = mapping-a_ops-direct_IO(rw, iocb, iov, offset, nr_segs); + if (retval) + goto out; + + /* +* Finally, try again to invalidate clean pages which might have been +* faulted in by get_user_pages() if the source of the write was an +* mmap()ed region of the file we're writing. That's a pretty crazy +* thing to do, so we don't support it 100%. If this invalidation +* fails and we have -EIOCBQUEUED we ignore the failure. +*/ + if (rw == WRITE mapping-nrpages) { + int err = invalidate_inode_pages2_range(mapping, + offset PAGE_CACHE_SHIFT, end); + if (err retval = 0) + retval = err; + } +out: return retval; } - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 00/11] ANNOUNCE: Syslets, generic asynchronous system call support
I'm finally back from my travel and conference hiatus.. you guys have been busy! :) On Feb 13, 2007, at 6:20 AM, Ingo Molnar wrote: I'm pleased to announce the first release of the Syslet kernel feature and kernel subsystem, which provides generic asynchrous system call support: http://redhat.com/~mingo/syslet-patches/ In general, I really like the look of this. I think I'm convinced that your strong preference to do this with full kernel threads (1:1 task_struct - thread_info/stack relationship) is the right thing to do. The fibrils fell on the side of risking bugs by sharing task_structs amongst stacks executing kernel paths. This, correct me if I'm wrong, falls on the side of risking behavioural quirks stemming from task_struct references that we happen to have not enabled sharing of yet. I have strong hopes that we won't actually *care* about the behavioural differences we get from having individual task structs (which share the important things!) between syscall handlers. The *only* seemingly significant case I've managed to find, the IO scheduler priority and context fields, is easy enough to fix up. Jens and I have been talking about that. It's been bugging him for other reasons. So, thanks, nice work. I'm going to focus on finding out if its feasible for The Database to use this instead of the current iocb mechanics. I'm optimistic. Syslets are small, simple, lightweight programs (consisting of system-calls, 'atoms') I will admit, though, that I'm not at all convinced that we need this. Adding a system call for addition (just addition? how far do we go?!) sure feels like a warning sign to me that we're heading down a slippery slope. I would rather we started with an obviously minimal syscall which just takes an array of calls and args and executes them unconditionally. But its existance doesn't stop the use case I care about. So it's hard to get *too* worked up about it. Comments, suggestions, reports are welcome! For what it's worth, it looks like 'x86-optimized-copy_uatom.patch' got some hunks that should have been in 'x86-optimized- sys_umem_add.patch'. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 05/11] syslets: core code
But the whole point is that the notion of a register is wrong in the first place. [...] forget about it then. The thing we register is dead-simple: struct async_head_user { struct syslet_uatom __user **completion_ring; unsigned long ring_size_bytes; unsigned long max_nr_threads; }; this can be passed in to sys_async_exec() as a second pointer, and the kernel can put the expected-completion pointer (and the user ring idx pointer) into its struct atom. It's just a few instructions, and only in the cachemiss case. that would make completions arbitrarily split-up-able. No registration whatsoever. A waiter could specify which ring's events it is interested in. A 'ring' could be a single-entry thing as well, for a single instance of pending IO. I like this, too. (Not surprisingly, having outlined something like it in a mail in one of the previous threads :)). I'll bring up the POSIX AIO list IO case. It wants to issue a group of IOs and sleep until they all return. Being able to cheaply instantiate a ring implicitly with the submission of the IO calls in the list will make implementing this almost too easy. It'd obviously just wait for that list's ring to drain. I hope. There might be complications around the edges (waiting for multiple list IOs to drain?), but it seems like this would be on the right track. I might be alone in caring about having a less ridiculous POSIX AIO interface in glibc, though, I'll admit. It seems like it'd be a pretty sad missed opportunity if we rolled a fantastic general AIO interface and left glibc to still screw around with it's own manual threading :/. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 05/14] syslets: core code
+static void +__mark_async_thread_ready(struct async_thread *at, struct async_head *ah) +{ + list_del(at-entry); + list_add_tail(at-entry, ah-ready_async_threads); +__mark_async_thread_busy(struct async_thread *at, struct async_head *ah) +{ + list_del(at-entry); + list_add_tail(at-entry, ah-busy_async_threads); list_move_tail()? - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] aio: fix kernel bug when page is temporally busy
If invalidate_inode_pages2_range() will return EIOCBRETRY as the patch aio: fix kernel bug when page is temporally busy Sorry Leonid, this patch is not safe. It returns -EIOCBRETRY without guaranteeing that kick_iocb() will be called. This can lead to operations hanging, both AIO and calls that come through do_sync_{read,write}. It overwrites -EIOCBQUEUED, leading to an aio_complete() while a retry is happening. This can lead to reference count confusion. Double-frees, referencing freed memory, that kind of thing. This isn't a new problem. The current code that overwrites with -EIO has this problem. But moving to -EIOCBRETRY does introduce new behaviour of aio_complete() and the retry path racing. I'll have a candidate patch to address the problem of EIO being raised on the way back up from a path which has returned -EIOCBQUEUED. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 05/11] syslets: core code
2) On the client facing side (port 53), I'd very much hope for a way to do 'recvv' on datagram sockets, so I can retrieve a whole bunch of UDP datagrams with only one kernel transition. I want to highlight this point that Bert is making. Whenever we talk about AIO and kernel threads some folks are rightly concerned that we're talking about a thread *per IO* and fear that memory consumption will be fatal. Take the case of userspace which implements what we'd think of as page cache writeback. (*coughs, points at email address*). It wants to issue thousands of IOs to disjoint regions of a file. Thousands of kernel threads, oh crap! But it only issues each IO with a separate syscall (or io_submit() op) because it doesn't have an interface that lets it specify IOs that vector user memory addresses *and file position*. If we had a seemingly obvious interface that let it kick off batched IOs to different parts of the file, the looming disaster of a thread per IO vanishes in that case. struct off_vec { off_t pos; size_t len; }; long sys_sgwrite(int fd, struct iovec *memvec, size_t mv_count, struct off_vec *ovec, size_t ov_count); It doesn't take long to imagine other uses for this that are less exotic. Take e2fsck and its iterating through indirect blocks or directory data blocks. It has a list of disjoint file regions (blocks) it wants to read, but it does them serially to keep the code from getting even more confusing. blktrace a clean e2fsck -f some time.. it's leaving *HALF* of the disk read bandwith on the table by performing serial block-sized reads. If it could specify batches of them the code would still be simple but it could tell the kernel and IO scheduler *exactly* what it wants, without having to mess around with sys_readahead() or AIO or any of that junk :). Anyway, that's just something that's been on my mind. If there are obvious clean opportunities to get more done with single syscalls, it might not be such a bad thing. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] aio: fix kernel bug when page is temporally busy
On Feb 15, 2007, at 11:11 AM, Ananiev, Leonid I wrote: It returns -EIOCBRETRY without guaranteeing that kick_iocb() will be called. This can lead to operations hanging If EIOCBRETRY then generic_file_aio_write() will be recalled for the same iocb. Only if kick_iocb() is called. It won't be called if i_i_p2_r() was the only thing to return -EIOCBRETRY. It overwrites -EIOCBQUEUED, leading to an aio_complete() while a retry is happening. EIOCBQUEUED or EIOCBRETRY does not lead to aio_complete() call: Not by fs/aio.c, but *by the place that originated -EIOCBQUEUED*. Later. After IO has completed. see fs/direct-io.c:dio_bio_end_aio(). This is what -EIOCBQUEUED means! It's a promise to call aio_complete () in the future. Have you read the giant comment over the definition of struct kiocb in include/linux/aio.h? This can lead to reference count confusion. But just reference count confusion was deleted by patch. Isn't it? Sorry, I don't understand what you're trying to ask here. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] aio: fix kernel bug when page is temporally busy
On Feb 15, 2007, at 3:32 PM, Ananiev, Leonid I wrote: If EIOCBRETRY then generic_file_aio_write() will be recalled for the same iocb. Only if kick_iocb() is called. It won't be called if i_i_p2_r() was the only thing to return -EIOCBRETRY. It is not need to call kick_iocb() for generic_file_aio_write() calling. It is recalled without any wakeup waiting: for (;;) { ret = filp-f_op-aio_write(kiocb, iov, 1, kiocb.ki_pos); if (ret != -EIOCBRETRY) break; wait_on_retry_sync_kiocb(kiocb); } Note: wait_on_retry_sync_kiocb() does not wait. Yes it does. It will not return until kiocbSetKicked() is called, and that is only called from kick_iocb(). It overwrites -EIOCBQUEUED Do you mean that there is one more kernel bug which overwrites -EIOCBQUEUED by any errno or number of bytes and this new value is returned to caller as an IO result while IO is not finished yet. The proposed patch does not crate this bug if any. Right, and I said that in the mail you're quoting. It actually fixes a kernel panic bag when iocb.users count becomes incorrect. The bag Kernel BUG at fs/aio.c:509 is there because aio_run_iocb() have not a chance to differ real EIO and EIO which is actually means EAGAYN or EIOCBRETRY. Yes, I understand the bug you're trying to fix. You're introducing other bugs with the patch. It will not be merged. It is interesting that I've not seen any EIOCBQUEUED returned to aio_run_iocb() during 5 hours aiostress running. What arguments are you running aio-stress with? -EIOCBQUEUED is only used for O_DIRECT, and then only in certain circumstances. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] aio: propogate post-EIOCBQUEUED errors to completion event
aio: propogate post-EIOCBQUEUED errors to completion event This addresses an oops reported by Leonid Ananiev [EMAIL PROTECTED] as archived at http://lkml.org/lkml/2007/2/8/337. O_DIRECT kicks off bios and returns -EIOCBQUEUED to indicate its intention to call aio_complete() once the bios complete. As we return from submission we must preserve the -EIOCBQUEUED return code so that fs/aio.c knows to let the bio completion call aio_complete(). This stops us from returning errors after O_DIRECT submission. But we have a few places that are very interested in generating errors after bio submission. The most critical of these is invalidating the page cache after a write. This avoids exposing stale data to buffered operations that are performed after the O_DIRECT write succeeds. We must do this after submission because the user buffer might have been an mmap()ed buffer of the region being written to. The get_user_pages() in the O_DIRECT completion path could have faulted in stale data. So this patch introduces a helper, aio_propogate_error(), which queues post-submission errors in the iocb so that they are given to the user completion event when aio_complete() is finally called. To get this working we change the aio_complete() path so that the ring insertion is performed as the final iocb reference is dropped. This gives the submission path time to queue its pending error before it drops its reference. This increases the space in the iocb as it has to record the two result codes from aio_complete() and the pending error from the submission path. This was tested by running O_DIRECT aio-stress concurrently with buffered reads while triggering EIO in invalidate_inode_pages2_range() with the help of a debugfs bool hack. Previously the kernel would oops as fs/aio.c and bio completion both called aio_complete(). With this patch aio-stress sees -EIO. Signed-off-by: Zach Brown [EMAIL PROTECTED] --- fs/aio.c| 109 ++ include/linux/aio.h | 30 +++ mm/filemap.c|4 - 3 files changed, 91 insertions(+), 52 deletions(-) --- a/fs/aio.c Mon Feb 19 11:52:27 2007 -0800 +++ b/fs/aio.c Mon Feb 19 12:32:52 2007 -0800 @@ -192,6 +192,46 @@ static int aio_setup_ring(struct kioctx (void)__event; \ kunmap_atomic((void *)((unsigned long)__event PAGE_MASK), km); \ } while(0) + +static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb *iocb) +{ + struct aio_ring_info*info; + struct aio_ring *ring; + struct io_event *event; + unsigned long tail; + + assert_spin_locked(ctx-ctx_lock); + + info = ctx-ring_info; + ring = kmap_atomic(info-ring_pages[0], KM_IRQ1); + + tail = info-tail; + event = aio_ring_event(info, tail, KM_IRQ0); + if (++tail = info-nr) + tail = 0; + + event-obj = (u64)(unsigned long)iocb-ki_obj.user; + event-data = iocb-ki_user_data; + event-res = iocb-ki_pending_err ? iocb-ki_pending_err : iocb-ki_res; + event-res2 = iocb-ki_res2; + + dprintk(aio_complete: %p[%lu]: %p: %p %Lx %d %lx %lx\n, + ctx, tail, iocb, iocb-ki_obj.user, iocb-ki_user_data, + iocb-ki_pending_err, iocb-ki_res, iocb-ki_res2); + + /* after flagging the request as done, we +* must never even look at it again +*/ + smp_wmb(); /* make event visible before updating tail */ + + info-tail = tail; + ring-tail = tail; + + put_aio_ring_event(event, KM_IRQ0); + kunmap_atomic(ring, KM_IRQ1); + + pr_debug(added to ring %p at [%lu]\n, iocb, tail); +} /* ioctx_alloc * Allocates and initializes an ioctx. Returns an ERR_PTR if it failed. @@ -418,6 +458,7 @@ static struct kiocb fastcall *__aio_get_ req-ki_cancel = NULL; req-ki_retry = NULL; req-ki_dtor = NULL; + req-ki_pending_err = 0; req-private = NULL; req-ki_iovec = NULL; INIT_LIST_HEAD(req-ki_run_list); @@ -507,10 +548,14 @@ static int __aio_put_req(struct kioctx * assert_spin_locked(ctx-ctx_lock); - req-ki_users --; + req-ki_users--; BUG_ON(req-ki_users 0); if (likely(req-ki_users)) return 0; + + if (kiocbIsInserted(req)) + aio_ring_insert_entry(ctx, req); + list_del(req-ki_list);/* remove from active_reqs */ req-ki_cancel = NULL; req-ki_retry = NULL; @@ -924,12 +969,8 @@ int fastcall aio_complete(struct kiocb * int fastcall aio_complete(struct kiocb *iocb, long res, long res2) { struct kioctx *ctx = iocb-ki_ctx; - struct aio_ring_info*info; - struct aio_ring *ring; - struct io_event *event; + int ret; unsigned long flags; - unsigned long tail; - int ret
Re: [PATCH] aio: propogate post-EIOCBQUEUED errors to completion event
I *think* the patch is right, but picking the changes to the code and watching its movement at the same time is making my head spin. Really? The only things that changed are the assignment from iocb- res (after testing pending_err) instead of the 'res' argument and the dprintk. But, uh, sure. If it's a burden I can respin it as two patches. Here they come! - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2] aio: create aio_ring_insert_entry() function
aio: create aio_ring_insert_entry() function This patch just hoists the process of inserting an entry into the completion ring into its own function. No functionality is changed. This is in preparation for calling ring insertion from another path with slightly different mechanics. diff -r b551e6cbf434 fs/aio.c Signed-off-by: Zach Brown [EMAIL PROTECTED] --- fs/aio.c | 77 + 1 file changed, 43 insertions(+), 34 deletions(-) --- a/fs/aio.c Mon Feb 19 13:09:01 2007 -0800 +++ b/fs/aio.c Mon Feb 19 13:12:04 2007 -0800 @@ -192,6 +192,47 @@ static int aio_setup_ring(struct kioctx (void)__event; \ kunmap_atomic((void *)((unsigned long)__event PAGE_MASK), km); \ } while(0) + +static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb *iocb, + long res, long res2) +{ + struct aio_ring_info*info; + struct aio_ring *ring; + struct io_event *event; + unsigned long tail; + + assert_spin_locked(ctx-ctx_lock); + + info = ctx-ring_info; + ring = kmap_atomic(info-ring_pages[0], KM_IRQ1); + + tail = info-tail; + event = aio_ring_event(info, tail, KM_IRQ0); + if (++tail = info-nr) + tail = 0; + + event-obj = (u64)(unsigned long)iocb-ki_obj.user; + event-data = iocb-ki_user_data; + event-res = res; + event-res2 = res2; + + dprintk(aio_complete: %p[%lu]: %p: %p %Lx %lx %lx\n, + ctx, tail, iocb, iocb-ki_obj.user, iocb-ki_user_data, + res, res2); + + /* after flagging the request as done, we +* must never even look at it again +*/ + smp_wmb(); /* make event visible before updating tail */ + + info-tail = tail; + ring-tail = tail; + + put_aio_ring_event(event, KM_IRQ0); + kunmap_atomic(ring, KM_IRQ1); + + pr_debug(added to ring %p at [%lu]\n, iocb, tail); +} /* ioctx_alloc * Allocates and initializes an ioctx. Returns an ERR_PTR if it failed. @@ -924,11 +965,7 @@ int fastcall aio_complete(struct kiocb * int fastcall aio_complete(struct kiocb *iocb, long res, long res2) { struct kioctx *ctx = iocb-ki_ctx; - struct aio_ring_info*info; - struct aio_ring *ring; - struct io_event *event; unsigned long flags; - unsigned long tail; int ret; /* @@ -946,8 +983,6 @@ int fastcall aio_complete(struct kiocb * return 1; } - info = ctx-ring_info; - /* add a completion event to the ring buffer. * must be done holding ctx-ctx_lock to prevent * other code from messing with the tail @@ -966,34 +1001,8 @@ int fastcall aio_complete(struct kiocb * if (kiocbIsCancelled(iocb)) goto put_rq; - ring = kmap_atomic(info-ring_pages[0], KM_IRQ1); - - tail = info-tail; - event = aio_ring_event(info, tail, KM_IRQ0); - if (++tail = info-nr) - tail = 0; - - event-obj = (u64)(unsigned long)iocb-ki_obj.user; - event-data = iocb-ki_user_data; - event-res = res; - event-res2 = res2; - - dprintk(aio_complete: %p[%lu]: %p: %p %Lx %lx %lx\n, - ctx, tail, iocb, iocb-ki_obj.user, iocb-ki_user_data, - res, res2); - - /* after flagging the request as done, we -* must never even look at it again -*/ - smp_wmb(); /* make event visible before updating tail */ - - info-tail = tail; - ring-tail = tail; - - put_aio_ring_event(event, KM_IRQ0); - kunmap_atomic(ring, KM_IRQ1); - - pr_debug(added to ring %p at [%lu]\n, iocb, tail); + aio_ring_insert_entry(ctx, iocb, res, res2); + put_rq: /* everything turned out well, dispose of the aiocb. */ ret = __aio_put_req(ctx, iocb); - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
aio: propogate post-EIOCBQUEUED errors to completion event This addresses an oops reported by Leonid Ananiev [EMAIL PROTECTED] as archived at http://lkml.org/lkml/2007/2/8/337. O_DIRECT kicks off bios and returns -EIOCBQUEUED to indicate its intention to call aio_complete() once the bios complete. As we return from submission we must preserve the -EIOCBQUEUED return code so that fs/aio.c knows to let the bio completion call aio_complete(). This stops us from returning errors after O_DIRECT submission. But we have a few places that are very interested in generating errors after bio submission. The most critical of these is invalidating the page cache after a write. This avoids exposing stale data to buffered operations that are performed after the O_DIRECT write succeeds. We must do this after submission because the user buffer might have been an mmap()ed buffer of the region being written to. The get_user_pages() in the O_DIRECT completion path could have faulted in stale data. So this patch introduces a helper, aio_propogate_error(), which queues post-submission errors in the iocb so that they are given to the user completion event when aio_complete() is finally called. To get this working we change the aio_complete() path so that the ring insertion is performed as the final iocb reference is dropped. This gives the submission path time to queue its pending error before it drops its reference. This increases the space in the iocb as it has to record the two result codes from aio_complete() and the pending error from the submission path. This was tested by running O_DIRECT aio-stress concurrently with buffered reads while triggering EIO in invalidate_inode_pages2_range() with the help of a debugfs bool hack. Previously the kernel would oops as fs/aio.c and bio completion both called aio_complete(). With this patch aio-stress sees -EIO. Signed-off-by: Zach Brown [EMAIL PROTECTED] --- fs/aio.c| 49 +- include/linux/aio.h | 30 + mm/filemap.c|4 +-- 3 files changed, 57 insertions(+), 26 deletions(-) --- a/fs/aio.c Mon Feb 19 13:12:20 2007 -0800 +++ b/fs/aio.c Mon Feb 19 13:16:00 2007 -0800 @@ -193,8 +193,7 @@ static int aio_setup_ring(struct kioctx kunmap_atomic((void *)((unsigned long)__event PAGE_MASK), km); \ } while(0) -static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb *iocb, - long res, long res2) +static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb *iocb) { struct aio_ring_info*info; struct aio_ring *ring; @@ -213,12 +212,12 @@ static void aio_ring_insert_entry(struct event-obj = (u64)(unsigned long)iocb-ki_obj.user; event-data = iocb-ki_user_data; - event-res = res; - event-res2 = res2; - - dprintk(aio_complete: %p[%lu]: %p: %p %Lx %lx %lx\n, + event-res = iocb-ki_pending_err ? iocb-ki_pending_err : iocb-ki_res; + event-res2 = iocb-ki_res2; + + dprintk(aio_complete: %p[%lu]: %p: %p %Lx %d %lx %lx\n, ctx, tail, iocb, iocb-ki_obj.user, iocb-ki_user_data, - res, res2); + iocb-ki_pending_err, iocb-ki_res, iocb-ki_res2); /* after flagging the request as done, we * must never even look at it again @@ -459,6 +458,7 @@ static struct kiocb fastcall *__aio_get_ req-ki_cancel = NULL; req-ki_retry = NULL; req-ki_dtor = NULL; + req-ki_pending_err = 0; req-private = NULL; req-ki_iovec = NULL; INIT_LIST_HEAD(req-ki_run_list); @@ -548,10 +548,14 @@ static int __aio_put_req(struct kioctx * assert_spin_locked(ctx-ctx_lock); - req-ki_users --; + req-ki_users--; BUG_ON(req-ki_users 0); if (likely(req-ki_users)) return 0; + + if (kiocbIsInserted(req)) + aio_ring_insert_entry(ctx, req); + list_del(req-ki_list);/* remove from active_reqs */ req-ki_cancel = NULL; req-ki_retry = NULL; @@ -983,27 +987,24 @@ int fastcall aio_complete(struct kiocb * return 1; } - /* add a completion event to the ring buffer. -* must be done holding ctx-ctx_lock to prevent -* other code from messing with the tail -* pointer since we might be called from irq -* context. -*/ + /* +* We queue up the completion codes into the iocb. They are combined +* with a potential error from the submission path and inserted into +* the ring once the last reference to the iocb is dropped. Cancelled +* iocbs don't insert events on completion because userland was given +* an event directly as part of the cancelation interface. +*/ spin_lock_irqsave(ctx-ctx_lock, flags); if (iocb-ki_run_list.prev !list_empty(iocb
Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
Zach So this patch introduces a helper, aio_propogate_error(), ...which is spelled incorrectly: aio_propagate_error. Man, I am batting 1000! Randy also made fun of my 'intead'. Zach which queues post-submission errors in the iocb so that they are Zach given to the user completion event when aio_complete() is Zach finally called. Ugly, but I can't think of a better way to do it, either. Yeah, this seemed to be the lesser of the available evils. We (Chris and I, while in california) considered introducing a primitive to have the submission path wait for aio_complete() to be called so that it could just return the error. We also thought about turning EIOCB {RETRY,QUEUED} into bits on the iocb instead of return codes that we have to lovingly pass back up to fs/aio.c. This seemed to be the least intrusive :/. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] aio: propogate post-EIOCBQUEUED errors to completion event
I would argue that one common cause of the EIO is userland error (mmap concurrent with O_DIRECT), and EIO is the correct answer. I disagree. That means that using the pagecache to synchronize things like the proposed online defragmentation will occasionally make O_DIRECT users fail. Well, only if the pages couldn't be invalidated. Arguably ext3 should be trying a lot harder to invalidate before giving up and returning failure from invalidate_page(). Hopefully this all goes away in the long term with Chris' placeholder work that stops insertions into the radix tree while dio ops are in flight. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] convert aio event reap to use atomic-op instead of spin_lock
First, I'll NAK this and all AIO patches until the patch description says that it's been run through the regression tests that we've started collecting in autotest. They're trivial to run, never fear: cd /usr/local svn checkout svn://test.kernel.org/autotest/trunk/client autotest cd autotest bin/autotest tests/aio_dio_bugs/control (We'll be moving this to LTP soon, I think, but that will just change the commands to cut and paste.) Resurrect an old patch that uses atomic operation to update ring buffer index on AIO event queue. This work allows futher application/libaio optimization to run fast path io_getevents in user space. I have mixed feelings. I think the userspace getevents support was poorly designed and the simple fact that we've gone this long without it says just how desperately the feature isn't needed. At the same time, it's a small change and the code is already there. So it's hard to get too worked up about it. We can always remove support in the future by never changing the indexes if we require that apps fall back to sys_io_getevents(). In any case, let's see a test which exercises userspace event collection before we claim to support it. I'd prefer a little stand-alone C app like the ones in autotest for simple unit testing. A fio patch to stress test it would probably also be well received. I've also added one more change on top of old implementation that rounds ring buffer size to power of 2 to allow fast head/tail calculation. With the new scheme, there is no more modulo operation on them and we simply increment either pointer index directly. Please quantify the improvement. We're allocating more ring elements than userspace asked for and than were accounted for in aio_max_nr. That cost, however teeny, will be measured against the benefit. - /* Compensate for the ring buffer's head/tail overlap entry */ - nr_events += 2; /* 1 is required, 2 for good luck */ + /* round nr_event to next power of 2 */ + nr_events = roundup_pow_of_two(nr_events); Is that buggy? How will the code tell if head == tail means a full ring or an empty ring? The old code added that extra event to let it tell the ring was full before head == tail and it would think it's empty again, I think. I'm guessing roundup(nr_events + 1) is needed. -#define aio_ring_event(info, nr, km) ({ \ - unsigned pos = (nr) + AIO_EVENTS_OFFSET;\ +#define aio_ring_event(info, __nr, km) ({\ + unsigned pos = ((__nr) ((info)-nr - 1)) + AIO_EVENTS_OFFSET; \ struct io_event *__event; \ __event = kmap_atomic( \ (info)-ring_pages[pos / AIO_EVENTS_PER_PAGE], km); \ (info) is now evaluated twice. __info = (info), please, just like __event there. put_aio_ring_event(event, KM_IRQ0); kunmap_atomic(ring, KM_IRQ1); The ring page, which is mmaped to userspace at some weird address, was just written through a kmap. Do we need a flush_dcache_page()? This isn't this patch's problem, but it'd need to be addressed if we're using the ring from userspace. - - struct io_event io_events[0]; + struct io_event io_events[0]; Try to minimize gratuitous reformatting, please. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] convert aio event reap to use atomic-op instead of spin_lock
Sorry I wasn't thorough enough. And partially because I was worried about changing structure type for user space facing struct aio_ring. Now that I looked through all arches, it looks safe as all arch's atomic_t has the same size as int. Here is the updated patch. @@ -144,7 +144,7 @@ struct kiocb { struct aio_ring { unsignedid; /* kernel internal index number */ unsignednr; /* number of io_events */ - unsignedhead; + atomic_thead; unsignedtail; Embedding an atomic_t in an ABI struct? That makes everyone else nervous too, right? It may look safe on i386/x86-64 today, but this doesn't seem like wise practice. Is there any reason to believe that atomic_t will never change size? Does anything else do this already? If nothing else, the unsigned (should be __u32, sigh) could be cast to an atomic_t. Is being able to do atomic work on a u32 between the kernel and userspace something that all archs have support for? I mean, take the fact that userspace and the kernel could both be doing these atomic ops on different virtual addresses and so conceivably different cachelines. Is that a problem for anyone? I do find myself wondering if the notion of userspace ring synchronization shouldn't be built around futexes. They weren't around when this mmap()ed ring business was created. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] convert aio event reap to use atomic-op instead of spin_lock
Ken uses the other (superior!) way of implementing ringbuffers: the head and tail pointers (the naming of which AIO appears to have reversed) are not constrained to the ringsize - they are simply allowed to wrap through 0xfff. A-ha! That sure sounds great. I'd be happy to see the kernel side fs/aio.c ring management move in that direction, sure, with a specific patch to do that. I get less excited when we talk about changing the index semantics as part of supporting serialization of the ring with userspace. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] convert aio event reap to use atomic-op instead of spin_lock
I have mixed feelings. I think the userspace getevents support was poorly designed and the simple fact that we've gone this long without it says just how desperately the feature isn't needed. I kept on getting requests from application developers who want that feature. My initial patch was dated back May 2004. My point is that demand for it doesn't seem to be very high, and your use of the past tense there seems to support that. And the patches to support it are not looking great. Is it worth it, then? What difference are you measuring so far? - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] convert aio event reap to use atomic-op instead of spin_lock
I kept on getting requests from application developers who want that feature. My initial patch was dated back May 2004. The right way to do it involves synchronization between the kernel side io_getevents() and the userspace code pulling events out of the ring. Alan Cox suggested embedding a futex in the shared memory region, but I don't think anyone ever implemented that. Yeah, I like the idea of futexes. I'm worried that virtual aliasing spells doom for the current home-brewed serialization that fs/aio.c is doing with the shared ring head/tail accesses. Am I worrying about nothing here? I will look into this aside from this patch. That's probably the case. Also, any changes in this area *must* correctly update the compat/incompat feature flags in the ring buffer header. That has been missed in the past... Do you know of anyone using the current ring info ABI? The *only* user I know of is the check of ctx-magic in libaio. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
On Feb 20, 2007, at 9:33 AM, Ananiev, Leonid I wrote: There is change in the patch which is uncommented in preface. Now aio_ring_insert_entry() is not called if req-ki_users=1. Before was called. Could you comment it? It is described in the patch description, though perhaps not explicitly enough: To get this working we change the aio_complete() path so that the ring insertion is performed as the final iocb reference is dropped. This stops an aio_complete() from generating a completion event before the submission path has had a chance to generate an error. This could happen if O_DIRECT bios completed very quickly before the submission path had a chance to return up and call ivalidate_inode_pages2_range(). - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] aio: propogate post-EIOCBQUEUED errors to completion event
There are some strange O_DIRECT corner cases in here such that the 'last writer' may actually be a 'last reader' and winning can mean have a copy of the page in page cache older than the copy on disk. As long as it is marked dirty so that it eventually gets synced to disk, it shouldn't matter. No, Chris is pointing out that an an O_DIRECT write can leave clean read pages in the page cache. All it takes is giving a source buffer for the write which is an mmap ()ed apeture of the region that is being written to. If you get the offsets right you can get the get_user_pages() down in fs/direct-io.c will populate the page cache before the actual O_DIRECT write gets to it. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
On Feb 21, 2007, at 12:35 AM, Ken Chen wrote: On 2/20/07, Ananiev, Leonid [EMAIL PROTECTED] wrote: 1) mem=1G in kernel boot param if you have more 2) unmount; mk2fs; mount 3) dd if=/dev/zero of=test_file bs=1M count=1200 4) aiostress -s 1200m -O -o 2 -i 1 -r 16k test_file 5) if i++50 goto 2). Would you please instrument the call chain of invalidate_complete_page2() and tell us exactly where it returns zero value in your failure case? invalidate_complete_page2 try_to_release_page ext3_releasepage journal_try_to_free_buffers ??? For what it's worth, Badari has explained this race in the past in a credible way. I'll take the liberty of pasting a mail from him: kjournald submited buffers for IO and waiting for them to finish. Note that it has a ref. against the buffer. journal_commit_transaction() ... submited buffers for IO /* Waiting for IO to complete */ while (commit_transaction-t_locked_list) { ... get_bh(bh); if (buffer_locked(bh)) { spin_unlock(journal-j_list_lock); wait_on_buffer(bh); spin_lock(journal-j_list_lock); } .. put_bh(bh); } Now, DIO process comes to frees the jh through journal_try_to_free_buffers() but fails to drop_buffers() since kjournald() has a reference against it. invalidate_inode_pages2_range() .. ext3_releasepage() journal_try_to_free_buffers() journal_put_journal_head() __journal_try_to_free_buffer() --- freed jh try_to_free_buffers() drop_buffers() if (buffer_busy(bh)) goto failed; --- returns EIO due to b_count I don't mean to say that we shouldn't get traces to confirm the theory, just sharing. And now we can point to this in the archives next time :). - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
This is an interesting trick, but I'd like to consider hard whether the added complexity is worth it. Could you list the various other cases you have in mind which would want to use it ? I'm happy to report that the sync case and the invalidate_inode_pages2_range() case are the only two which try to rewrite 'ret'. I audited all the call paths between aio_{read,write} and -EIOCBQUUEUED. So, sure, maybe we can shuffle the house of cards a little in the direction away from having a fs/aio.c helper for the situation by simply removing the few current instances of the problem. It won't scale as people add problems without knowing about the rule to not overwrite -EIOCBQUEUED, but hopefully that won't be a rule for much longer :) :). For the O_SYNC case we could add another magical test to the is_async assignment which won't perform AIO on O_SYNC descriptors. For the invalidate_inode_pages2_range() case, I wonder if the right place to issue this is after the direct IO write has completed and before aio_complete() is issued (somewhat like the way we do bio_check_pages_dirty for DIO reads), rather than after submission when the IO may still not have hit the disk. This would also make the behaviour uniform for synchronous and async cases. Hmm, I think I like that. It solves the problem for the current sole user of -EIOCBQUEUED without too much disruption. BTW, am I right in interpreting that with your change aio_complete () may trigger an io_getevents() wakeup, before the corresponding event is placed on the ring buffer ? Hmm, yeah, it looks like I goofed that. I'll roll a patch which does the invalidation down in fs/direct-io.c. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 00/13] Syslets, Threadlets, generic AIO support, v3
Plus there's the fundamental killer that KAIO is a /lot/ harder to implement (and to maintain) on the kernel side: it has to be implemented for every IO discipline, and even for the IO disciplines it supports at the moment, it is not truly asynchronous for things like metadata blocking or VFS blocking. To handle things like metadata blocking it has to resort to non-statemachine techniques like retries - which are bad for performance. Yes, yes, yes. As one of the poor suckers who has been fixing bugs in fs/aio.c and fs/direct-io.c, I really want everyone to read Ingo's paragraph a few times. Have it printed on a t-shirt. Look at the number of man-years that have gone into fs/aio.c and fs/ direct-io.c. After all that effort it *barely* supports non-blocking O_DIRECT IO. The maintenance overhead of those two files, above all else, is what pushed me to finally try that nutty fibril attempt. Syslets/threadlets on the other hand, once the core is implemented, have near zero ongoing maintainance cost (compared to KAIO pushed into every IO subsystem) and cover all IO disciplines and API variants immediately, and they are as perfectly asynchronous as it gets. Amen. As an experiment, I'm working on backing the sys_io_*() calls with syslets. It's looking very promising so far. So all in one, i used to think that AIO state-machines have a long- term place within the kernel, but with syslets i think i've proven myself embarrasingly wrong =B-) Welcome to the party :). - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 00/13] Syslets, Threadlets, generic AIO support, v3
The more I think about it, a reasonable solution might actually be to use threadlets for disk I/O and pure event based processing for networking. It is two different handling paths and non-unified, but that might be the price for good performance :-) I generally agree, with some comments. If we come to the decision that there are some message rates that are better suited to delivery into a user-read ring (10gige rx to kevent, say) then it doesn't seem like it would be much of a stretch to add a facility where syslet completion could be funneled into that channel as well. I also wonder if there isn't some opportunity to cut down the number of syscalls / op in networking land. Is it madness to think of a call like recvmsgv() which could provide a vector of msghdrs? It might not make sense, but it might cut down on the per-op overhead for loads that know they're going to be heavy enough to get a decent amount of batching without fatally harming latency. Maybe those loads are rare.. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 00/13] Syslets, Threadlets, generic AIO support, v3
direct-io.c is evil. Ridiculously. You will have a hard time finding someone to defend it, I predict :). There is good news on that front, too. Chris (Mason) is making progress on getting rid of the worst of the Magical Locking that makes buffered races with O_DIRECT ops so awful. I'm not holding my breath for a page cache so fine grained that it could pin and reference 512B granular user regions and build bios from them, though that sure would be nice :). As an experiment, I'm working on backing the sys_io_*() calls with syslets. It's looking very promising so far. Great, I'd love to see the comparisons. I'm out for data. If it sucks, well, we'll know just how much. I'm pretty hopeful that it won't :). One other implementation to consider is actually using kernel threads compared to how syslets perform. Direct IO for one always blocks, so there shouldn't be much of a performance difference compared to syslets, with the bonus that no arch specific code is needed. Yeah, I'm starting with raw kernel threads so we can get some numbers before moving to syslets. One of the benefits syslets bring is the ability to start processing the next pending request the moment current request processing blocks. In the concurrent O_DIRECT write case that avoids releasing a ton of kernel threads which all just run to serialize on i_mutex (potentially bouncing it around cache domains) as the O_DIRECT ops are built and sent. -z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [rfc][patch] dynamic resizing dentry hash using RCU
On Feb 23, 2007, at 7:37 AM, Nick Piggin wrote: The dentry hash uses up 8MB for 1 million entries on my 4GB system is one of the biggest wasters of memory for me. Because I rarely have more than one or two hundred thousand dentries. And that's with several kernel trees worth of entries. Most desktop and probably even many types of servers will only use a fraction of that. So I introduce a new method for resizing hash tables with RCU, and apply that to the dentry hash. Can you compare what you've done to the design that Paul and David talked about a year ago? http://lkml.org/lkml/2006/1/30/74 I'd love to see a generic implementation of RCU hashing that subsystems can then take advantage of. It's long been on the fun side of my todo list. The side I never get to :/. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] aio: fix oops because of extra IO control block freeing.
On Mar 7, 2007, at 2:14 PM, Andrew Morton wrote: On Mon, 05 Mar 2007 17:23:33 +0300 Leonid Ananiev [EMAIL PROTECTED] wrote: From Leonid Ananiev The patch fixes oops because of extra IO control block freeing. IO is retried if page could not be invalidated. This patch is incorrect and shouldn't be merged. Signed-off-by: Leonid Ananiev [EMAIL PROTECTED] The patch fixes oops Kernel BUG at fs/aio.c:509 archived at http://lkml.org/lkml/2007/2/8/337 The number of IO control block (iocb)users 0. If page could not be invalidated by invalidate_inode_pages2_range() than EIO is returned. It happens if journal_try_to_free_buffers() fails to drop_buffers(). This EIO is not differing from real IO competition with EIO and aio_complete() is called. (I'm going to read this as This EIO is misinterpreted as real IO completion with -EIO and aio_complete() is called.) Later aio_complete() is called from dio_bio_end_aio() and frees iocb once more. This analysis is correct. Nothing can clobber -EIOCBQUEUED as it is returned up from fs/direct-io.c to fs/aio.c. This -EIO from invalidation is one of the two places that currently break this rule. The fix I had hoped for, invalidating down in fs/direct-io.c before returning -EIOCBQUEUED, doesn't work because it ends up getting the ordering between the journal lock and the page lock backwards. Sigh. (note to self: help lockdep warn us about that ordering) After patch generic_file_direct_IO() sets PgBusy flag in iocb if page could not be invalidated. iocb is retried after IO competition. The process is waked up if IO is SYNC else iocb is kicked. The lines ___if (ret != -EIOCBRETRY)___ is deleted because nothing set to EIOCBRETRY. True, but this is gratuitously cruel to external users of - EIOCBRETRY. Silently breaking them doesn't sound like a great plan. If we really decide to remove EIOCBRETRY support we'd get rid of all the retry infrastructure and remove the EIOCBRETRY errno so their builds failed. That's a separate issue that shouldn't be confused with this EIOCBQUEUED clobbering. Just removing some pieces of the infrastructure willy-nilly isn't acceptable. Please copy [EMAIL PROTECTED] on AIO patches. Next patches 2/3 and 3/3 do cleanup only. I cannot find those patches. Me either. I was waiting for them to arrive before responding. diff -upr linux-2.6.20/mm/filemap.c linux-2.6.20-aio2/mm/filemap.c --- linux-2.6.20/mm/filemap.c 2007-02-04 21:44:54.0 +0300 +++ linux-2.6.20-aio2/mm/filemap.c 2007-03-04 21:46:10.0 +0300 @@ -2413,10 +2413,9 @@ generic_file_direct_IO(int rw, struct ki if (rw == WRITE mapping-nrpages) { pgoff_t end = (offset + write_len - 1) PAGE_CACHE_SHIFT; - int err = invalidate_inode_pages2_range(mapping, - offset PAGE_CACHE_SHIFT, end); - if (err) - retval = err; + if (invalidate_inode_pages2_range(mapping, + offset PAGE_CACHE_SHIFT, end)) + kiocbSetPgBusy(iocb); There are two problems behind this bug: - ext3_releasepage() returns failure if it races with kjournald holding a reference while it waits for a transaction to commit. - generic_file_direct_IO() causes an oops if it clobbers -EIOCBQUEUED with the return code from invalidate_inode_pages2_range()..- releasepage(). This fix makes the incorrect assertion that *any* failure from invalidate_inode_pages2_range(), which might not have anything to do with this race you're currently seeing, is transitory and should trigger a retry. That's wrong, it should be returning the error. Now, getting ext3_releasepage() to not fail if this race hits to begin with is another story. Chris has some ideas about reworking the page laundering helper to make that more reliable. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] aio: remove bare user-triggerable error printk
aio: remove bare user-triggerable error printk The user can generate console output if they cause do_mmap() to fail during sys_io_setup(). This was seen in a regression test that does exactly that by spinning calling mmap() until it gets -ENOMEM before calling io_setup(). We don't need this printk all, just remove it. Signed-off-by: Zach Brown [EMAIL PROTECTED] --- fs/aio.c |1 - 1 file changed, 1 deletion(-) --- a/fs/aio.c Tue Mar 27 14:56:08 2007 -0700 +++ b/fs/aio.c Tue Mar 27 14:56:49 2007 -0700 @@ -136,7 +136,6 @@ static int aio_setup_ring(struct kioctx 0); if (IS_ERR((void *)info-mmap_base)) { up_write(ctx-mm-mmap_sem); - printk(mmap err: %ld\n, -info-mmap_base); info-mmap_size = 0; aio_free_ring(ctx); return -EAGAIN; - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] aio: remove bare user-triggerable error printk
We don't need this printk all, just remove it. (at all. Sigh.) - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] cache pipe buf page address for non-highmem arch
+#define pipe_kmap_atomic(page, type) pipe_kmap(page) +#define pipe_kunmap(page) do { } while (0) +#define pipe_kunmap_atomic(page, type) do { } while (0) Please don't drop arguments in stubs. It can let completely broken code compile, like: pipe_kunmap(SOME_COMPLETE_NONSENSE); Static inlines with empty bodies are the gold standard. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] cache pipe buf page address for non-highmem arch
Does this look OK? Almost... #ifdef CONFIG_HIGHMEM static inline void pipe_kunmap_atomic(void *addr, enum km_type type) #else /* CONFIG_HIGHMEM */ static inline void pipe_kunmap_atomic(struct page *page, enum km_type type) - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 5/6] syslets: add generic syslets infrastructure
Firstly, why not just specify an address for the return value and be done with it? This infrastructure seems overkill, and you can always extend later if required. Sorry, which infrastructure? Providing the function and stack to return to? Sure, I could certainly entertain the idea of not having syslet tasks return to userspace in the first pass. Ingo sure seemed excited by the idea. Or do you mean the syscall return value ending up in the userspace completion event ring? That's mostly about being able to wait for pending syslets to complete. Secondly, you really should allow integration with an eventfd so you don't make the posix AIO mistake of providing a poll-incompatible interface. Yeah, this seems straight forward enough that I haven't made it an initial priority. I'm sure it will be helpful for people who are stuck integrating with entrenched software that wants to wait for pollable fds. For more flexible software, though, it's compelling to now be able to aggregate waiting for completion of the existing waiting syscalls (poll, epoll_wait, futexes, whatever) by issuing them as concurrent syslets. Finally, and probably most alarmingly, AFAICT randomly changing TID will break all threaded programs, which means this won't be fitted into existing code bases, making it YA niche Linux-only API 8( Yeah, this still needs to be investigated. I haven't yet and I haven't heard of anyone else trying their hand at it. In the YANLOA mode apps would know that executing syslets is an implicit clone() and would act accordingly. 8(, indeed. I wonder if there isn't an opportunity to add a clone() flag which juggles the association between TIDs and task_structs. I don't relish the idea of investigating the life cycles of task_struct references that derive from TIDs and seeing how those would race with a syslet blocking and cloning, but, well, maybe that's what needs to be done. This all isn't my area of expertise, though, sadly. It would be swell if someone wanted to look into it before I'm forced to learn yet another weird corner of the kernel. - z -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 5/6] syslets: add generic syslets infrastructure
Or do you mean the syscall return value ending up in the userspace completion event ring? That's mostly about being able to wait for pending syslets to complete. The latter. A ring is optimal for processing a huge number of requests, but if you're really going to be firing off syslet threads all over the place you're not going to be optimal anyway. And being able to point the return value to the stack or into some datastructure is way nicer to code (zero setup == easy to understand and easy to convert). One of Linus' rhetorical requirements for the syslets work is that we be able to wait for the result without spending overhead building up state in some completion context. The userland rings in the current syslet patches achieve that and don't seem outrageously complicated. I have a hard time getting worked up about this particular piece of the puzzle, though. If people are excited about *only* providing a pollable fd to collect the syslet completions, well, great, whatever. This must be solved, yet all avenues seem crawling with worms. Yup. - z -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 5/6] syslets: add generic syslets infrastructure
Linus Torvalds wrote: On Thu, 10 Jan 2008, Rusty Russell wrote: I'd have to read his original statement, but eventfd doesn't build up state, so I think it qualifies. How about you guys battle it out by giving an example program usign the interface? Here's a favourite really simple load of mine: - do the equivalent of ls -lR or find /usr as quickly as possible, without playing sort by the inode numbers games (that don't work in general, but are great for some filesystems) Do this on a directory that isn't newly created, but has had files added and removed over time (so that the return order of readdir() isn't dense and sorted in the inode tables already). The classic example is ls -l /usr/bin or similar. Sure, that's straight forward enough. We've all written little test apps for variants of this load in the past, anyway. (It was one of the first things I did for fibrils, Ingo had a variant for syslets which read small file data too, Chris has a syslet mode in his 'acp' util, etc.) I was going to send out a patch series pretty soon which includes cleanups (I think) of the sys_indirect() infrastructure. I can throw together this little test app along with that. - z -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: hwclock failure in x86.git
I'm no expert, but I happened to notice this go by. The first thing I notice about the path is that ioport_32.c and the unified ioport.c use __clear_bit, while ioport_64.c uses clear_bit. That doesn't seem too critical. +#ifdef CONFIG_X86_32 +asmlinkage long sys_iopl(unsigned long regsp) +{ + volatile struct pt_regs *regs = (struct pt_regs *)regsp; + unsigned int level = regs-bx; + unsigned int old = (regs-flags 12) 3; + + if (level 3) + return -EINVAL; + /* Trying to gain more privileges? */ + if (level old) { + if (!capable(CAP_SYS_RAWIO)) + return -EPERM; + } + regs-flags = (regs-flags ~X86_EFLAGS_IOPL) | (level 12); + + return 0; +} -asmlinkage long sys_iopl(unsigned long regsp) -{ - volatile struct pt_regs *regs = (struct pt_regs *)regsp; - unsigned int level = regs-bx; - unsigned int old = (regs-flags 12) 3; - struct thread_struct *t = current-thread; - - if (level 3) - return -EINVAL; - /* Trying to gain more privileges? */ - if (level old) { - if (!capable(CAP_SYS_RAWIO)) - return -EPERM; - } - - t-iopl = level 12; - regs-flags = (regs-flags ~X86_EFLAGS_IOPL) | t-iopl; - set_iopl_mask(t-iopl); - - return 0; -} Is it OK that we lost the different final steps of those two functions? It looks like someone might have missed the differing tail sections of the function when copying and pasting and updating just the start of the functions? Sorry if I missed why this is OK and intended, I didn't read very closely. - z -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/4] sys_indirect system call
Ulrich Drepper wrote: The following patches provide an alternative implementation of the sys_indirect system call which has been discussed a few times. This no system call allows us to extend existing system call interfaces with adding more system calls. I might quarrel with some details, but I really like the general notion. I think we can use this to pass per-syscall syslet data to the scheduler. Today the patches have submission syscalls which basically wrap some data for the scheduler around the syscall ABI. This might save the syslet patches from needing to know about each arch's ABI. We'll see. I'll give it a try. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/4] sys_indirect system call
BTW, I've botched the x86-on-x86_64 support. I have a patch but need to patch it before I'll submit v3 of the patch set. If you want to work on the patch and get syslet support going, let me know, I'll send the latest version. I probably won't come around to trying sys_indirect with syslets until next week, so it's probably fine for me to just wait for you to get v3 out. (and I'm more likely to test native i386 or x86_64 anyway, so.. :)) - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Top kernel oopses/warnings this week
Report counts may be too high due to duplicate recognition of the very same report.¹ this is true however it's .. a hard issue. It's really hard to distinguish a duplicate report from two reports of the same bug. Can we hack some data in to oops output to help? Say a giant per-boot anonymous random number (yeah, I know, harder than it sounds) and then an incrementing oops counter. That'd also let you discover that the latter oopses in a chain of oopses might be fall-out from the head of the chain. - z -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [2.6.24 BUG] 100% iowait on host while UML is running
We could check ctx-reqs_active before scheduling to determine whether or not we are waiting for I/O, but this would require taking the context lock in order to be accurate. Given that the test would be only for the sake of book keeping, it might be okay to do it outside of the lock. Zach, what are your thoughts on this? I agree that it'd be OK to test it outside the lock, though we'll want some commentary: /* Try to only show up in io wait if there are ops in flight */ if (ctx-reqs_active) io_schedule(); else schedule(); It's cheap, safe, and accurate the overwhelming majority of the time :). We only need it in read_events(). The other two io_schedule() calls are only reached to wait on pending reqs specifically. It still won't make sense for iocbs which aren't performing IO, but I guess that's one more bridge to cross when we come to it. Do you want to throw this tiny patch together and submit it? - z -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/6] indirect: use asmlinkage in i386 syscall table prototype
call_indirect() was using the wrong calling convention for the system call handlers. system call handlers would get mixed up arguments. Signed-off-by: Zach Brown [EMAIL PROTECTED] diff --git a/include/asm-x86/indirect_32.h b/include/asm-x86/indirect_32.h index a1b72ac..e3dea8e 100644 --- a/include/asm-x86/indirect_32.h +++ b/include/asm-x86/indirect_32.h @@ -15,8 +15,8 @@ struct indirect_registers { static inline long call_indirect(struct indirect_registers *regs) { - extern long (*sys_call_table[]) (__u32, __u32, __u32, __u32, __u32, __u32); - + extern asmlinkage long (*sys_call_table[])(long, long, long, + long, long, long); return sys_call_table[INDIRECT_SYSCALL(regs)](regs-ebx, regs-ecx, regs-edx, regs-esi, regs-edi, regs-ebp); -- 1.5.2.2 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
syslets v7: back to basics
The following patches are a substantial refactoring of the syslet code. I'm branding them as the v7 release of the syslet infrastructure, though they represent a signifiant change in focus. My current focus is to see the most fundamental functionality brought to maturity. To me, this means getting a ABI that is used by applications through glibc on x86 and PPC64. Only once that is ready should we distract ourselves with advanced complexity. To that end, this patch series differs from v6 in significant ways: * syslets are initiated by providing syslet arguments to sys_indirect(). * uatoms, threadlets, and the kaio changes are postponed until they can be justified and rebuilt on more complete infrastructure. (I'm not saying these shouldn't or won't be persued. I'm saying that we should get the simplest piece working first.) * the code is clarified and commented, the patches are bisectable and pass checkpatch. The use of sys_indirect() and the move from 'atom's simplified the ABI considerably. I've put a trivial example in a syslet-userspace git tree: git://git.kernel.org/pub/scm/linux/kernel/git/zab/syslets-userspace.git This git repository will grow more tests and documentation over time. The patches sent with this mail are based on the v6 indirect patches but they aren't included. The full syslets patch series, including the indirect patches, are available in a few forms: broken out patch series: http://www.kernel.org/pub/linux/kernel/people/zab/broken-out/syslets/ in a 'syslets' git branch off of current linux-2.6.git: git://git.kernel.org/pub/scm/linux/kernel/git/zab/linux-2.6.git syslets git tree of the guilt .git/patches directory: git://git.kernel.org/pub/scm/linux/kernel/git/zab/guilt-series.git The patches were barely tested on i386 and x86_64. There are both implementation details and design problems left. My hope is that we can address these in the coming weeks. - Do we stop the user from initiating more syslets than fit in the ring? - Do we worry now about the hashed ring mutexes scaling poorly? (They will.) - What are the semantics of ptrace()ing a syslet submission which blocks? - How should applications deal with waiting syslet tasks with stale data in their task_struct? (syslet, setuid, syslet..) - Issuing a syslet is an implicit sys_clone(), will apps pass in clone flags? - Are the u32 ring index reads and writes atomic for supported architectures? Any feedback on these questions would be greatly appreciated. I'm particularly interested in hearing from people who are trying to use syslets in their applications. This will involve awkward wrappers instead of glibc calls for now, and your machine may explode, but hopefully the chance to influence the design of syslets would make it worth the effort. Finally, I carried the enormous cc: list for this mail over from previous syslet releases. If you want to be removed or added to the list for future syslet releases, please do let me know. - z -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/6] syslet: asm-generic support to disable syslets
This provides an implementation of the architecture dependent portion of syslets which disables syslet operations. This patch is an incomplete demonstration. All asm-*/syslet*.h files would include these files until their architectures provide implementations which enable syslet support. Signed-off-by: Zach Brown [EMAIL PROTECTED] diff --git a/include/asm-generic/syslet-abi.h b/include/asm-generic/syslet-abi.h new file mode 100644 index 000..5d19971 --- /dev/null +++ b/include/asm-generic/syslet-abi.h @@ -0,0 +1,11 @@ +#ifndef _ASM_GENERIC_SYSLET_ABI_H +#define _ASM_GENERIC_SYSLET_ABI_H + +/* + * I'm assuming that a u64 ip and u64 esp won't be enough for all + * archs, so I just let each arch define its own. + */ +struct syslet_frame { +}; + +#endif diff --git a/include/asm-generic/syslet.h b/include/asm-generic/syslet.h new file mode 100644 index 000..de9a750 --- /dev/null +++ b/include/asm-generic/syslet.h @@ -0,0 +1,34 @@ +#ifndef _ASM_GENERIC_SYSLET_H +#define _ASM_GENERIC_SYSLET_H + +/* + * This provider of the arch-specific syslet APIs is used when an architecture + * doesn't support syslets. + */ + +/* this stops the other functions from ever being called */ +static inline int syslet_frame_valid(struct syslet_frame *frame) +{ + return 0; +} + +static inline void set_user_frame(struct task_struct *task, + struct syslet_frame *frame) +{ + BUG(); +} + +static inline void move_user_context(struct task_struct *dest, + struct task_struct *src) +{ + BUG(); +} + +static inline int create_syslet_thread(long (*fn)(void *), + void *arg, unsigned long flags) +{ + BUG(); + return 0; +} + +#endif diff --git a/include/asm-x86/syslet-abi.h b/include/asm-x86/syslet-abi.h new file mode 100644 index 000..14a7182 --- /dev/null +++ b/include/asm-x86/syslet-abi.h @@ -0,0 +1 @@ +#include asm-generic/syslet-abi.h diff --git a/include/asm-x86/syslet.h b/include/asm-x86/syslet.h new file mode 100644 index 000..583d810 --- /dev/null +++ b/include/asm-x86/syslet.h @@ -0,0 +1 @@ +#include asm-generic/syslet.h -- 1.5.2.2 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/6] syslet: introduce abi structs
This patch adds the architecture independent structures of the syslet ABI. Signed-off-by: Zach Brown [EMAIL PROTECTED] diff --git a/include/linux/syslet-abi.h b/include/linux/syslet-abi.h new file mode 100644 index 000..a8bc1a3 --- /dev/null +++ b/include/linux/syslet-abi.h @@ -0,0 +1,34 @@ +#ifndef _LINUX_SYSLET_ABI_H +#define _LINUX_SYSLET_ABI_H + +#include asm/syslet-abi.h /* for struct syslet_frame */ + +struct syslet_args { + u64 completion_ring_ptr; + u64 caller_data; + struct syslet_frame frame; +}; + +struct syslet_completion { + u64 status; + u64 caller_data; +}; + +/* + * The ring follows the wrapping convention as described by Andrew at: + * http://lkml.org/lkml/2007/4/11/276 + * The head is updated by the kernel as completions are added and the + * tail is updated by userspace as completions are removed. + * + * The number of elements must be a power of two and the ring must be + * aligned to a u64. + */ +struct syslet_ring { + u32 kernel_head; + u32 user_tail; + u32 elements; + u32 wait_group; + struct syslet_completion comp[0]; +}; + +#endif -- 1.5.2.2 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 4/6] syslets: add indirect args
This adds the syslet indirect args to the indirect_params union. This is broken, but it lets us simply demonstrate the rest of the syslet universe around the indirect argument passing convention. A caller could well want to perform a syscall that uses indirect arguments as a syscall. Maybe we turn indirect_params into a struct that contains a union for arguments which can never be used concurrently. This needs wider discussion. Signed-off-by: Zach Brown [EMAIL PROTECTED] diff --git a/include/linux/indirect.h b/include/linux/indirect.h index 97f9ac4..5d5abd7 100644 --- a/include/linux/indirect.h +++ b/include/linux/indirect.h @@ -3,6 +3,7 @@ #define _LINUX_INDIRECT_H #include asm/indirect.h +#include linux/syslet-abi.h /* IMPORTANT: @@ -14,6 +15,7 @@ union indirect_params { struct { int flags; } file_flags; + struct syslet_args syslet; }; #define INDIRECT_PARAM(set, name) current-indirect_params.set.name -- 1.5.2.2 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 5/6] syslets: add generic syslets infrastructure
The indirect syslet arguments specify where to store the completion and what function in userspcae to return to once the syslet has been executed. The details of how we pass the indirect syslet arguments needs help. We parse the indirect syslet arguments in sys_indirect() before we call the given system call. If they're OK we mark the task as ready to become a syslet. We make sure that there is a child task waiting. We call into kernel/syslet.c from the scheduler when we try to block a task which has been marked as ready. A child task is woken and returns to userspace. We store the result of the system call in the userspace ring back up in sys_indrect() as the system call finally finishes. At that point the original task returns to the frame that userspace provided in the indirect syslet args. This generic infrastructure relies on architecture specific routines to create a new child task, move userspace state from one kernel task to another, and to setup the userspace return frame in ptregs. Code in asm-generic just returns -EINVAL until an architecture provides the needed routines. This is a simplification of Ingo's more involved syslet and threatlet infrastructure which was built around 'uatoms'. Enough code has changed that it wasn't appropriate to bring the previous Signed-off-by lines forward. Signed-off-by: Zach Brown [EMAIL PROTECTED]; diff --git a/fs/exec.c b/fs/exec.c index 282240a..942262f 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -51,6 +51,7 @@ #include linux/tsacct_kern.h #include linux/cn_proc.h #include linux/audit.h +#include linux/syslet.h #include asm/uaccess.h #include asm/mmu_context.h @@ -1614,6 +1615,8 @@ static int coredump_wait(int exit_code) complete(vfork_done); } + kill_syslet_tasks(tsk); + if (core_waiters) wait_for_completion(startup_done); fail: diff --git a/include/asm-generic/errno.h b/include/asm-generic/errno.h index e8852c0..26674c4 100644 --- a/include/asm-generic/errno.h +++ b/include/asm-generic/errno.h @@ -106,4 +106,7 @@ #defineEOWNERDEAD 130 /* Owner died */ #defineENOTRECOVERABLE 131 /* State not recoverable */ +/* for syslets */ +#defineESYSLETPENDING 132 /* syslet syscall now pending */ + #endif diff --git a/include/linux/sched.h b/include/linux/sched.h index 61a4b83..a134966 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1182,6 +1182,13 @@ struct task_struct { /* Additional system call parameters. */ union indirect_params indirect_params; + + /* task waiting to return to userspace if we block as a syslet */ + spinlock_t syslet_lock; + struct list_headsyslet_tasks; + unsignedsyslet_ready:1, + syslet_return:1, + syslet_exit:1; }; /* diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index addb39f..1a44838 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -55,6 +55,7 @@ struct compat_timeval; struct robust_list_head; struct getcpu_cache; struct indirect_registers; +struct syslet_ring; #include linux/types.h #include linux/aio_abi.h @@ -615,6 +616,8 @@ asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len); asmlinkage long sys_indirect(struct indirect_registers __user *userregs, void __user *userparams, size_t paramslen, int flags); +asmlinkage long sys_syslet_ring_wait(struct syslet_ring __user *ring, +unsigned long user_idx); int kernel_execve(const char *filename, char *const argv[], char *const envp[]); diff --git a/include/linux/syslet.h b/include/linux/syslet.h new file mode 100644 index 000..734b98f --- /dev/null +++ b/include/linux/syslet.h @@ -0,0 +1,18 @@ +#ifndef _LINUX_SYSLET_H +#define _LINUX_SYSLET_H + +#include linux/syslet-abi.h +#include asm/syslet.h + +void syslet_init(struct task_struct *tsk); +void kill_syslet_tasks(struct task_struct *cur); +void syslet_schedule(struct task_struct *cur); +int syslet_pre_indirect(void); +int syslet_post_indirect(int status); + +static inline int syslet_args_present(union indirect_params *params) +{ + return params-syslet.completion_ring_ptr; +} + +#endif diff --git a/kernel/Makefile b/kernel/Makefile index bcad37d..7a7dfbe 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -9,7 +9,7 @@ obj-y = sched.o fork.o exec_domain.o panic.o printk.o profile.o \ rcupdate.o extable.o params.o posix-timers.o \ kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \ hrtimer.o rwsem.o latency.o nsproxy.o srcu.o \ - utsname.o notifier.o + utsname.o notifier.o syslet.o obj-$(CONFIG_SYSCTL) += sysctl_check.o obj-$(CONFIG_STACKTRACE) += stacktrace.o diff --git a/kernel/exit.c b/kernel/exit.c index
[PATCH 6/6] syslets: add both 32bit and 64bit x86 syslet support
This adds the architecture-specific routines needed by syslets for x86. The syslet thread creation routines create a new thread which executes a kernel function and then returns to userspace instead of exiting. move_user_context() and set_user_frame() let the scheduler modify a child thread so that it returns to userspace at the same place that a blocking system call would have when it finished. This currently performs a very expensive copy of the fpu state. Intel is working on a more robust patch which allocates the i387 state off of thread_struct. When that is ready this can just juggle pointers to transfer the fpu state. The syslets infrastructure needs to work with ptregs for the task which is in sys_indirect(). So we add a PTREGSCALL stub around sys_indirect() in x86_64. Finally, we wire up sys_syslet_ring_wait(). Signed-off-by: Zach Brown [EMAIL PROTECTED] diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S index dc7f938..66a121d 100644 --- a/arch/x86/kernel/entry_32.S +++ b/arch/x86/kernel/entry_32.S @@ -1025,6 +1025,30 @@ ENTRY(kernel_thread_helper) CFI_ENDPROC ENDPROC(kernel_thread_helper) +ENTRY(syslet_thread_helper) + CFI_STARTPROC + /* +* Allocate space on the stack for pt-regs. +* sizeof(struct pt_regs) == 64, and we've got 8 bytes on the +* kernel stack already: +*/ + subl $64-8, %esp + CFI_ADJUST_CFA_OFFSET 64-8 + movl %edx,%eax + push %edx + CFI_ADJUST_CFA_OFFSET 4 + call *%ebx + addl $4, %esp + CFI_ADJUST_CFA_OFFSET -4 + + movl %eax, PT_EAX(%esp) + + GET_THREAD_INFO(%ebp) + + jmp syscall_exit + CFI_ENDPROC +ENDPROC(syslet_thread_helper) + #ifdef CONFIG_XEN ENTRY(xen_hypervisor_callback) CFI_STARTPROC diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index 3a058bb..08e34f6 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -412,6 +412,7 @@ END(\label) PTREGSCALL stub_rt_sigsuspend, sys_rt_sigsuspend, %rdx PTREGSCALL stub_sigaltstack, sys_sigaltstack, %rdx PTREGSCALL stub_iopl, sys_iopl, %rsi + PTREGSCALL stub_indirect, sys_indirect, %r8 ENTRY(ptregscall_common) popq %r11 @@ -994,6 +995,71 @@ child_rip: ENDPROC(child_rip) /* + * Create a syslet kernel thread. This differs from a thread created with + * kernel_thread() in that it returns to userspace after it finishes executing + * its given kernel function. + * + * C extern interface: + * extern long create_syslet_thread(int (*fn)(void *), + * void * arg, unsigned long flags) + * + * asm input arguments: + * rdi: fn, rsi: arg, rdx: flags + */ +ENTRY(create_syslet_thread) + CFI_STARTPROC + FAKE_STACK_FRAME $syslet_child_rip + SAVE_ALL + + # rdi: flags, rsi: usp, rdx: will be pt_regs + movq %rdx,%rdi + movq $-1, %rsi + movq %rsp, %rdx + + xorl %r8d,%r8d + xorl %r9d,%r9d + + # clone now + call do_fork + movq %rax,RAX(%rsp) + xorl %edi,%edi + + /* +* It isn't worth to check for reschedule here, +* so internally to the x86_64 port you can rely on kernel_thread() +* not to reschedule the child before returning, this avoids the need +* of hacks for example to fork off the per-CPU idle tasks. + * [Hopefully no generic code relies on the reschedule -AK] +*/ + RESTORE_ALL + UNFAKE_STACK_FRAME + ret + CFI_ENDPROC +ENDPROC(syslet_kernel_thread) + +syslet_child_rip: + CFI_STARTPROC + + movq %rdi, %rax + movq %rsi, %rdi + call *%rax + + /* +* Fix up the PDA - we might return with sysexit: +*/ + RESTORE_TOP_OF_STACK %r11 + + /* +* return to user-space: +*/ + movq %rax, RAX(%rsp) + RESTORE_REST + jmp int_ret_from_sys_call + + CFI_ENDPROC +ENDPROC(syslet_child_rip) + +/* * execve(). This function needs to use IRET, not SYSRET, to set up all state properly. * * C extern interface: diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c index 7b89958..7bf2836 100644 --- a/arch/x86/kernel/process_32.c +++ b/arch/x86/kernel/process_32.c @@ -394,6 +394,39 @@ int kernel_thread(int (*fn)(void *), void * arg, unsigned long flags) EXPORT_SYMBOL(kernel_thread); /* + * This gets run with %ebx containing the + * function to call, and %edx containing + * the args. + */ +void syslet_thread_helper(void); + +/* + * Create a syslet kernel thread. This differs from a thread created with + * kernel_thread() in that it returns to userspace after it finishes executing + * its given kernel function. + */ +int create_syslet_thread(long (*fn)(void *), void *arg, unsigned long flags) +{ + struct pt_regs regs; + + memset(regs, 0, sizeof(regs)); + + regs.ebx = (unsigned long)fn
Re: [PATCH 5/6] syslets: add generic syslets infrastructure
+/* + * syslet_ring doesn't have any kernel-side storage. Userspace allocates them + * in their address space and initializes their fields and then passes them to + * the kernel. + * + * These hashes provide the kernel-side storage for the wait queues which + * sys_syslet_ring_wait() uses and the mutex which completion uses to serialize + * the (possible blocking) ordered writes of the completion and kernel head + * index into the ring. + * + * We chose the bucket that supports a given ring by hashing a u32 that + * userspace sets in the ring. + */ +#define SYSLET_HASH_BITS (CONFIG_BASE_SMALL ? 4 : 8) +#define SYSLET_HASH_NR (1 SYSLET_HASH_BITS) +#define SYSLET_HASH_MASK (SYSLET_HASH_NR - 1) +static wait_queue_head_t syslet_waitqs[SYSLET_HASH_NR]; +static struct mutex syslet_muts[SYSLET_HASH_NR]; Why do you care about hashed tables scalability and not using trees? Well, this notion of letting tasks safely complete to any ring they can address is just a possibility. We might decide that it's not worth it. This implementation was an easy example that borrows from the way futexes do similar work. I like it because you could have, say, different processes completing into a ring in shared memory. If we do allow this kind of flexible ring specification, it's not at all clear that trees would be the best way to address the scalability limits. There are lots of possibilities, including locking the page lock of the page which holds the head index. - z -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
David Miller wrote: From: Ulrich Drepper [EMAIL PROTECTED] Date: Tue, 20 Nov 2007 01:53:14 -0500 FWIW, I think this indirect syscall stuff is the most ugly interface I've ever seen proposed for the kernel. Well, there's no XML in /proc :) :). But, yes, I agree that the internal code needs a lot more cleanup before being considered for merging. And I agree with all of the objections raised by both H. Pater Anvin and Eric Dumazet. I'm worried, too. Do we have a stronger alternative? I'm all ears, this isn't really my area of expertise. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv4 4/6] Allow setting FD_CLOEXEC flag for new sockets
+union indirect_params { + struct { +int flags; + } file_flags; +}; Have you given thought to having to perform compat translation on this? Today it's only copied directly from the user pointer into the union in the task_struct. I'd love if we could only use fixed-width fields in the interface to avoid the compat nonsense. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv4 1/6] actual sys_indirect code
Ingo Molnar wrote: cool patchset. Small nit, the series is not bisectable: Yeah, in a few places. Uli, this is super easy to test if you maintain the patches with guilt. With some easy scripting around guilt push you can verify that the series builds as each patch is applied. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv4 4/6] Allow setting FD_CLOEXEC flag for new sockets
Since there is no legacy interface to worry about all members added to the structure can and should be neutral of the word size. OK, perhaps add a giant comment warning about that. History tells us that people will get it wrong. We've done this with some syscalls already (like pread64) where we always use the wide form in the parameter list. It's just more simple here since it does not have to split into two 32-bit registers. *nod* - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv4 0/6] sys_indirect system call
This was mentioned in one of my mails. I added the parameter to accommodate Linus's and Zack's idea to use the functionality for syslets as well. Not really a multiplexer, it is meant to be a execute synchronously or asynchronously flag. In the latter case an additional parameter might be needed to indicate the notification mechanism. I'm sure the additional parameter will be needed, and it might be pretty involved. I think the current notion of syslets needs, at the very least: - a userspace stack and IP to return to after going async - the completion context when completed async - a pointer to a return code when completed without blocking We'll see. I'm working on it. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
That's only because you're being, deliberately or accidentally, vague about what your actual (as opposed to imagined) requirements are. Maybe I can help by summarizing how syslets fit in to this. Currently the syslet patches add a single submission call which includes an argument which is a structure which duplicates the system call ABI. The submission syscall in the kernel does some syslet specific work which amounts to verifying state and storing it in the task_struct. It then has to unpack the system call arguments from this submission syscall argument and call the specified system call. Every architecture will need helpers, then, on either side. They'll need to pack their arguments into the struct and then unpack and call in the kernel. The PPC64 guys have already expressed concern about this. It's, in effect, adding the syslet arguments to every single system call. So, instead of duplicating the system call ABI in the argument to a syslet submission syscall, we could pass the syslet arguments via this indirect parameters convention. This, hopefully, will reduce complexity by reducing the number of places that we have to muck around with the sycall ABI. That's the high level summary, anyway. I'm working on the simplest expression of this mechanism at the moment. We'll have code to argue about before the silly thanksgiving break, I hope. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
Where does this INDIRECT_PARAM() macro get defined? I do not see it being defined anywhere in these patches. Defined in linux/indirect.h: +#define INDIRECT_PARAM(set, name) current-indirect_params.set.name Not my idea, I was following one review comment. This was not in the patches you posted, I double checked before sending my reply. Not to belabor this point, but it was: http://lkml.org/lkml/2007/11/20/53 $ grep -l INDIRECT_PARAM .git/patches/master/* .git/patches/master/indirect-v4-4.patch .git/patches/master/indirect-v4-5.patch .git/patches/master/indirect-v4-6.patch Maybe the patches got to you out of order so you saw 5/ before 4/? - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH]loop cleanup in fs/namespace.c - repost
The patch given below replaces the goto-loop by a while-based one. That certainly looks fine. I would also replace the 'return' with 'break', but I guess that's more of a question of personal preference. Besides, it removes the export for the same routine, because there are no users for it outside of the core VFS code. This doesn't look fine. Did you test this? mntput_no_expire() is called from mntput() which is an inline function in mount.h. So lots of callers of mntput() in modules will end up trying to call mntput_no_expire() from modules. $ nm fs/fuse/fuse.ko | grep mntput_no_expire U mntput_no_expire - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH]loop cleanup in fs/namespace.c - repost
This doesn't look fine. Did you test this? Oops, my fault. Of course, I tested the patch, but kernel modules are disabled in my test setup, so I missed the error. :) Enclosed to this message is a new patch, which replaces the goto-loop by the while-based one, but leaves the EXPORT_SYMBOL macro intact. It certainly looks OK to me now, for whatever that's worth. You probably want to wait 'till the next merge window to get it in, though. It's just a cleanup and so shouldn't go in this late in the -rc line. Maybe Andrew will be willing to queue it until that time in -mm. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/6] indirect: use asmlinkage in i386 syscall table prototype
+extern asmlinkage long (*sys_call_table[])(long, long, long, This should be something like below instead, otherwise gcc wont parse asmlinkage as being an attribute of the function signature. extern long (asmlinkage *sys_call_table[])(long, long, long, Yeah, thanks for pointing these out. As it happened, Jens beat you to it :). Both problems have been fixed in the git repositories for the guilt series and userspace tools, respectively. You can always fetch the most recent versions from there. - z -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: syslets v7: back to basics
I pulled from your tree to look over the patches, and noticed that it looks like several commits were merged improperly. It looks like they were auto merged or something from an email, and the commit message contains the email headers, rather than just the commit message in the body. This leads to the shortlog showing entries that start with Return-Path:. These are patches that guilt imported from email messages. It didn't strip the headers and I didn't care to. I'll try to in the future, it isn't a big deal. I was hoping to find at least some initial information on the overall design in Documentation/ but don't see any. Have you written any yet that I could take a look at elsewhere maybe? No, but it's coming. I'd like to have some robust documentation so that Ulrich can help me understand what more he'd need to support POSIX AIO with syslets from glibc. Some of the things I was trying to figure out is does each syslet get its own stack, Yes. Each blocking operation has a thread that is performing the operation synchronously. The benefit is that the thread is only created if the operation blocks. If it doesn't block then it's a normal system call invocation. You don't have to manage threads and communicate the arguments and results of system calls amongst threads for the case where it never blocks. and schedule only at a few well defined points No, every blocking point is considered a scheduling point. , and if so, would it then be fair to characterize them as kernel mode fibers? I'm not sure what exactly you mean by kernel mode fibers (I can guess, but I'd rather not). From the answer of to the last question, though, I'm going to guess that it might not be the most apt characterization. - z -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv5 0/5] sys_indirect system call
Note that the sys_indirect system call takes an additional parameter which is for now forced to be zero. This parameter is meant to enable the use of sys_indirect to create syslets, asynchronously executed system calls. This syslet approach is also the main reason for the interface in the form proposed here. Can you talk more about how you imagined syslets working with sys_indirect()? I threw something together to test most of the machinery around the actual argument passing. It just added the syslet arguments to the union: http://lkml.org/lkml/2007/12/6/338 http://lkml.org/lkml/2007/12/6/339 This obviously doesn't work because it's a perfectly reasonable thing to want to provide the file_flags to some call as well as the syslet args to perform that call in a syslet. Then it butchers sys_indirect() itself: http://lkml.org/lkml/2007/12/6/340 The syslet machinery wants to run some code before and after the system call itself when the syslet arguments are provided. It can also call almost every single system call, white-listing in sys_indirect() is probably not the answer here. I could throw together some ideas, of course, but thought I'd see if you guys already had a design in mind that I could pursue. - z -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] dio: falling through to buffered I/O when invalidation of a page fails
Hisashi Hifumi wrote: Hi. Current dio has some problems: 1, In ext3 ordered, dio write can return with EIO because of the race between invalidation of a page and jbd. jbd pins the bhs while committing journal so try_to_release_page fails when jbd is committing the transaction. Yeah. It sure would be fantastic if some ext3 expert could stop this from happening somehow. But that hasn't happened in.. uh.. Badari, for how many years has this been on the radar? :) Past discussion about this issue is as follows. http://marc.info/?t=11934343124r=1w=2 http://marc.info/?t=11265676282r=1w=2 2, invalidate_inode_pages2_range() sets ret=-EIO when invalidate_complete_page2() fails, but this ret is cleared if do_launder_page() succeed on a page of next index. Oops. That's too bad. So maybe we should fix it by not stomping on that return code? ret2 = do_launder() if (ret2 == 0) ret2 = invalidate() if (ret == 0) ret = ret2 I'd be surprised if we ever wanted to mask an -EIO when later pages laundered successfully. In this case, dio is carried out even if invalidate_complete_page2() fails on some pages. This can cause inconsistency between memory and blocks on HDD because the page cache still exists. Yeah. I solved problems above by introducing invalidate_inode_pages3_range() and falling through to buffered I/O when invalidation of a page failed. Well, I like the idea of more intelligently dealing with the known problem between dio and ext3. I'm not sure that falling back to buffered is right. If dio can tell that it only failed because jbd held the buffer, should we have waited for the transaction to complete before trying to invalidate again? If we could do that, we'd avoid performing the IO twice. We can distinguish between failure of page invalidation and other errors with the return value of invalidate_inode_pages3_range(). I'm not sure duplicating the invalidation loop into a new function is the right thing. Maybe we'd just tweak inode_pages2 to indicate to the caller the specific failing circumstances somehow. Maybe. - z -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] fix invalidate_inode_pages2_range not to clear ret
Hisashi Hifumi wrote: Hi. DIO invalidates page cache through invalidate_inode_pages2_range(). invalidate_inode_pages2_range() sets ret=-EIO when invalidate_complete_page2() fails, but this ret is cleared if do_launder_page() succeed on a page of next index. In this case, dio is carried out even if invalidate_complete_page2() fails on some pages. This can cause inconsistency between memory and blocks on HDD because the page cache still exists. Following patch fixes this issue. I like the idea of fixing this in a separate patch, yes, thanks! But... Thanks. Signed-off-by :Hisashi Hifumi [EMAIL PROTECTED] diff -Nrup linux-2.6.24-rc5.org/mm/truncate.c linux-2.6.24-rc5/mm/truncate.c --- linux-2.6.24-rc5.org/mm/truncate.c2007-12-12 16:32:45.0 +0900 +++ linux-2.6.24-rc5/mm/truncate.c2007-12-13 11:45:29.0 +0900 @@ -392,6 +392,7 @@ int invalidate_inode_pages2_range(struct pgoff_t next; int i; int ret = 0; +int ret2 = 0; int did_range_unmap = 0; int wrapped = 0; @@ -441,13 +442,13 @@ int invalidate_inode_pages2_range(struct BUG_ON(page_mapped(page)); ret = do_launder_page(mapping, page); if (ret == 0 !invalidate_complete_page2(mapping, page)) -ret = -EIO; +ret2 = -EIO; unlock_page(page); } pagevec_release(pvec); cond_resched(); } -return ret; +return !ret ? ret2 : ret; } EXPORT_SYMBOL_GPL(invalidate_inode_pages2_range); ... this doesn't work. Notice that it only propagates the -EIO into ret2? It can lose errors from do_launder_page() itself because they aren't stored into ret2, which is what's returned if the last do_launder_page() succeeds. This isn't I meant when I mentioned that 'ret2' pattern. The idea is to store the errors from inner calls into some private var and then only promote those errors as the function's return code (ret) if there was an error and ret wasn't already set. We do this in a few places in dio, you'll notice. This patch gets the meaning of 'ret' and 'ret2' opposite from those uses, which is kind of confusing. So, uh, how about the following. Totally untested, but compiled. --- invalidate_inode_pages2_range(): consistently return first error Hisashi Hifumi noticed that we were losing errors in invalidate_inode_pages2_range(). Later do_launder_page() calls could overwrite errors generated by earlier calls. Fix this by storing do_launder_page in a temporary variable which is only promoted to the function's return code if it hadn't already generated an error. Signed-off-by: Zach Brown [EMAIL PROTECTED] Cc: Hisashi Hifumi [EMAIL PROTECTED] diff --git a/mm/truncate.c b/mm/truncate.c index cadc156..24578b3 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -392,6 +392,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping, pgoff_t next; int i; int ret = 0; + int rc; int did_range_unmap = 0; int wrapped = 0; @@ -439,9 +440,11 @@ int invalidate_inode_pages2_range(struct address_space *mapping, } } BUG_ON(page_mapped(page)); - ret = do_launder_page(mapping, page); - if (ret == 0 !invalidate_complete_page2(mapping, page)) - ret = -EIO; + rc = do_launder_page(mapping, page); + if (rc == 0 !invalidate_complete_page2(mapping, page)) + rc = -EIO; + if (rc ret == 0) + ret = rc; unlock_page(page); } pagevec_release(pvec); - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] dio: falling through to buffered I/O when invalidation of a page fails
If anyone has a testcase - I can take a look at the problem again. I can try and throw something together.. - z -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: lockdep warning with LTP dio test (v2.6.24-rc6-125-g5356f66)
Erez Zadok wrote: Setting: ltp-full-20071031, dio01 test on ext3 with Linus's latest tree. Kernel w/ SMP, preemption, and lockdep configured. This is a real lock ordering problem. Thanks for reporting it. The updating of atime inside sys_mmap() orders the mmap_sem in the vfs outside of the journal handle in ext3's inode dirtying: - #1 (jbd_handle){--..}: [c023068f] __lock_acquire+0x9cc/0xb95 [c0230c2f] lock_acquire+0x5f/0x78 [c029c7e9] journal_start+0xee/0xf8 [c02959d6] ext3_journal_start_sb+0x48/0x4a [c029152b] ext3_dirty_inode+0x27/0x6c [c026f701] __mark_inode_dirty+0x29/0x144 [c026817b] touch_atime+0xb7/0xbc [c023af6d] generic_file_mmap+0x2d/0x42 [c024a5cc] mmap_region+0x1e6/0x3b4 [c024aa6b] do_mmap_pgoff+0x1fb/0x253 [c02067af] sys_mmap2+0x9b/0xb5 [c020275e] syscall_call+0x7/0xb [] 0x ext3_direct_IO() orders the journal handle outside of the mmap_sem that dio_get_page() acquires to pin pages with get_user_pages(): - #0 (mm-mmap_sem){}: [c023057f] __lock_acquire+0x8bc/0xb95 [c0230c2f] lock_acquire+0x5f/0x78 [c0397d4f] down_read+0x3a/0x4c [c02778a2] dio_get_page+0x4e/0x15d [c0278352] __blockdev_direct_IO+0x431/0xa81 [c0291318] ext3_direct_IO+0x10c/0x1a1 [c023c091] generic_file_direct_IO+0x124/0x139 [c023c0fc] generic_file_direct_write+0x56/0x11c [c023ca15] __generic_file_aio_write_nolock+0x33d/0x489 [c023cbb9] generic_file_aio_write+0x58/0xb6 [c028d4e3] ext3_file_write+0x27/0x99 [c0256d0f] do_sync_write+0xc5/0x102 [c0257463] vfs_write+0x90/0x119 [c0257a25] sys_write+0x3d/0x61 [c02026d6] sysenter_past_esp+0x5f/0xa5 [] 0x Two fixes come to mind: 1) use something like Peter's -mmap_prepare() to update atime before acquiring the mmap_sem. ( http://lkml.org/lkml/2007/11/11/97 ). I don't know if this would leave more paths which do a journal_start() while holding the mmap_sem. 2) rework ext3's dio to only hold the jbd handle in ext3_get_block(). Chris has a patch for this kicking around somewhere but I'm told it has problems exposing old blocks in ordered data mode. Does anyone have preferences? I could go either way. I certainly don't like the idea of journal handles being held across the entirety of fs/direct-io.c. It's yet another case of O_DIRECT differing wildly from the buffered path :(. - z -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] aio: partial write should not return error code.
Rusty Russell wrote: When an AIO write gets an error after writing some data (eg. ENOSPC), it should return the amount written already, not the error. Just like write() is supposed to. Andrew, please don't queue this fix. I think the bug is valid but the patch is subtly dangerous. diff -r 18802689361a fs/aio.c --- a/fs/aio.cThu Jan 03 15:22:24 2008 +1100 +++ b/fs/aio.cThu Jan 03 18:05:25 2008 +1100 @@ -1346,6 +1350,13 @@ static ssize_t aio_rw_vect_retry(struct /* This means we must have transferred all that we could */ /* No need to retry anymore */ if ((ret == 0) || (iocb-ki_left == 0)) + ret = iocb-ki_nbytes - iocb-ki_left; + + /* If we managed to write some out we return that, rather than + * the eventual error. */ + if (opcode == IOCB_CMD_PWRITEV + ret 0 + iocb-ki_nbytes - iocb-ki_left) ret = iocb-ki_nbytes - iocb-ki_left; This doesn't account for the (sigh) -EIOCB* error codes. They must be returned to the caller so that it can properly handle the iocb reference counting. Failure to do so can lead to oopses. To be fair, I think you'll have a really hard time finding an -aio_write() implementation which would return partial progress and *then* one of the magical errnos. But the infrastructure does allow it. So maybe we could get a helper in aio.h that abstracts out the (ret 0 ret != -EIOCBQUEUED ret != -EIOCBRETRY) condition. Then I think this patch would be fine. I assigned a bug to remind myself to revisit this if you aren't excited by continuing with the patch: http://bugzilla.kernel.org/show_bug.cgi?id=9681 - z -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] aio: negative offset should return -EINVAL
Rusty Russell wrote: An AIO read or write should return -EINVAL if the offset is negative. This check matches the one in pread and pwrite. This was found by the libaio test suite. Signed-off-by: Rusty Russell [EMAIL PROTECTED] This looks fine to me. Signed-off-by: Zach Brown [EMAIL PROTECTED] - z -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] aio: partial write should not return error code.
No, that's fine, here is the new one: When an AIO write gets a non-retry error after writing some data (eg. ENOSPC), it should return the amount written already, not the error. Just like write() is supposed to. This was found by the libaio test suite. Signed-off-by: Rusty Russell [EMAIL PROTECTED] This looks good, feel free to push this from your tree. Acked-By: Zach Brown [EMAIL PROTECTED] - z -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 5/5] Make wait_on_retry_sync_kiocb killable
Matthew Wilcox wrote: Use TASK_KILLABLE to allow wait_on_retry_sync_kiocb to return -EINTR. All callers then check the return value and break out of their loops. This won't work because sync kiocbs are a nasty hack that don't follow the (also nasty) refcounting patterns of the aio core. -EIOCBRETRY means that aio_{read,write}() has taken on the IO kiocb reference and has ensured that call kick_iocb() will be called in the future. Usually kick_iocb() would queue the kiocb to have its ki_retry method called by the kernel aio threads while holding that reference. But sync kiocbs are on-stack and aren't reference counted. kick_iocb() magic: /* sync iocbs are easy: they can only ever be executing from a * single context. */ if (is_sync_kiocb(iocb)) { kiocbSetKicked(iocb); wake_up_process(iocb-ki_obj.tsk); return; } So, with this patch, if we catch a signal and return from wait_on_retry_sync_kiocb() and return from do_sync_{read,write}() then that on-stack sync kiocb is going to be long gone when kick_iocb() goes to work with it. So the first step would be to make sync kiocbs real refcounted structures so that kick_iocb() could find that the sync submitter has disappeared. But then we have to worry about leaving retrying operations in flight after the sync submitter has returned from their system call. They might be VERY SURPRISED to find that a read() implemented with do_sync_read() is still writing into their userspace pointer after the syscall was interrupted by a signal. This leads us to the possibility of working with the ki_cancel method to stop a pending operation if a signal is caught from a sync submitter. In practice, nothing sets ki_cancel. And finally, this code will not be run in a solely mainline kernel. The only thing in mainline that returns -EIOCBRETRY is the goofy usb gadget. It has both -{read,write} and -aio_{read,write} file op methods so vfs_{read,write}() will never call do_sync_{read,write}(). Sure, out-of-tree aio providers (SDP?) might get caught up in this. (Ha ha! Welcome to fs/aio.c!) So I'm not sure where to go with this. It's a mess, but it doesn't seem like anything is using it. A significant clean up of the retry and cancelation support in fs/aio.c is in flight. Maybe we can revisit this once that settles down. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 01/31] Add an ERR_CAST() macro to complement ERR_PTR and co. [try #5]
+ * ERR_CAST - Explicitly cast an error-valued pointer to another pointer type + * @ptr: The pointer to cast. + * + * Explicitly cast an error-valued pointer to another pointer type in such a + * way as to make it clear that's what's going on. + */ +static inline void *ERR_CAST(const void *ptr) +{ + return (void *) ptr; +} Just to nit, surely you don't need the cast inside the function. The casting happens at the call site between the argument and returned pointer. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 01/31] Add an ERR_CAST() macro to complement ERR_PTR and co. [try #5]
Roland Dreier wrote: +static inline void *ERR_CAST(const void *ptr) +{ +return (void *) ptr; +} Just to nit, surely you don't need the cast inside the function. The casting happens at the call site between the argument and returned pointer. The way it's written you kinda do, since it takes a const void * and returns a plain void *. Ah, right, I missed that. - z - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/