from:"Zach Brown"

Re: [PATCH] maestro3 oops + fix (for ac9!)

2001-01-16 Thread Zach Brown


 D'oh, looks like if power management is disabled, pmdev is NULL (I get
 that message when I load the module), but we try to derefence it anyways.
 The fix is obvious:

duh, yeah, I'll send out a proper patch that handles the pm_register
failure too.

thanks.

-- 
 zach
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]

2001-01-18 Thread Zach Brown


On Thu, Jan 18, 2001 at 08:49:38AM -0800, Linus Torvalds wrote:

 That has its advantages: it's a very local thing, and doesn't need any
 state. However, the fact is that you _need_ the persistency of a socket
 option if you want to take advantage of external programs etc getting good
 behaviour without having to know that they are talking to a socket. 

*nod*

We set TCP_CORK on the socket we handed to external programs that were
being run via 'site exec' in an ftp server.  It resulted in much nicer
packets being spit out, especially in the 'ls' case where it likes to
write() on really goofy boundaries.

[yes, ftp and 'site exec' in particular are far from sexy, but do the same
with CGI scripts and the world might care :)]

- z
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Possible Bug: drivers/sound/maestro.c

2001-01-29 Thread Zach Brown


  have this problem (or a similar one, anyway -- sometimes the sound becomes
  distorted or comes only through one speaker) under both Linux 2.2 and
  Win2K. If it was just Linux, I'd assume it was a driver problem, but the

This is a long-standing bug with the maestro2 driver.  

My current theory is that its a race condition where the APUs get
confused while we update their control memory, but this doesn't make
total sense.  Some of the bug reports I get are implying that the sound
is breaking when we're not touching the apu's control mem.  Maybe
implying a nastier silicon bug..

I've been meaning to try implementing a work around to the theoretical
bug, but I've always had trouble triggering it :/

The fun part of this (and why you would see the bug in win2k) is that
the maestro2 is very poorly documented.  I've never heard of anyone
having full docs on the APUs, including the people I've talked to at ESS.
They bought the part from another company.. (and thankfully axed it in
the maestro3)

so we're stabbing in the dark.

- z
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

[PATCH] maestro3 PM update

2001-01-29 Thread Zach Brown


This gets rid of the PM callback maestro3 was using to catch
suspend/resume events, and just uses the pci_dev callbacks instead.
It has minor cleanups in the _probe() path, including a braindead
oops fix.  It works on my laptop, but I'd really appreciate it seeing
more testing.

Alan, could you queue this for 2.4-testing and eventually Linus?

[most of the patch is noise from code blocks moving around :/]

Thanks to Andres Salomon for pointing out the oops case and motivating
the cleanup..

-- 
 zach

--- maestro3.c-1.10 Wed Jan 31 00:45:54 2001
+++ maestro3.c  Wed Jan 31 00:39:44 2001
@@ -28,6 +28,9 @@
  * Shouts go out to Mike "DJ XPCom" Ang.
  *
  * History
+ *  v1.20 - Jan 30 2001 - Zach Brown [EMAIL PROTECTED]
+ *   get rid of pm callback and use pci_dev suspend/resume instead
+ *   m3_probe cleanups, including pm oops think-o
  *  v1.10 - Jan 6 2001 - Zach Brown [EMAIL PROTECTED]
  *   revert to lame remap_page_range mmap() just to make it work
  *   record mmap fixed.
@@ -134,7 +137,6 @@
 #include asm/hardirq.h
 #include linux/spinlock.h
 #include linux/ac97_codec.h
-#include linux/pm.h
 
  /*
   * for crizappy mmap()
@@ -145,7 +147,7 @@
 
 #define M_DEBUG 1
 
-#define DRIVER_VERSION  "1.10"
+#define DRIVER_VERSION  "1.20"
 #define M3_MODULE_NAME  "maestro3"
 #define PFX M3_MODULE_NAME ": "
 
@@ -356,9 +358,8 @@
 /*
  * I'm not very good at laying out functions in a file :)
  */
-static int m3_pm_callback(struct pm_dev *dev, pm_request_t req, void *data);
 static int m3_notifier(struct notifier_block *nb, unsigned long event, void *buf);
-static int m3_suspend(struct m3_card *card);
+static void m3_suspend(struct pci_dev *pci_dev);
 static void check_suspend(struct m3_card *card);
 
 struct notifier_block m3_reboot_nb = {m3_notifier, NULL, 0};
@@ -2585,10 +2586,8 @@
 u32 n;
 int i;
 struct m3_card *card = NULL;
-int num = 0;
 int ret = 0;
 int card_type = pci_id-driver_data;
-struct pm_dev *pmdev;
 
 DPRINTK(DPMOD, "in maestro_install\n");
 
@@ -2606,7 +2605,7 @@
 printk(KERN_WARNING PFX "out of memory\n");
 return -ENOMEM;
 }
-memset(card, 0, sizeof(*card));
+memset(card, 0, sizeof(struct m3_card));
 card-pcidev = pci_dev;
 init_waitqueue_head(card-suspend_queue);
 
@@ -2677,7 +2676,6 @@
 printk(KERN_WARNING PFX "initing a dsp device that is already in use?\n");
 /* register devices */
 if ((s-dev_audio = register_sound_dsp(m3_audio_fops, -1))  0) {
-num = i;
 break;
 }
 }
@@ -2690,20 +2688,11 @@
 goto out;
 }
 
-pmdev = pm_register(PM_PCI_DEV, PM_PCI_ID(pci_dev), m3_pm_callback);
-
-if( pmdev == NULL) {
-printk(KERN_WARNING PFX "couldn't register pm callback, suspend/resume might 
not work.\n");
-/* XXX do error stuff :) */
-}
-
-pmdev-data = card;
+pci_dev-driver_data = card;
 
 m3_enable_ints(card);
 m3_assp_continue(card);
 
-printk(KERN_INFO PFX "%d channels configured.\n", num);
-
 out:
 if(ret) {
 if(card-iobase)
@@ -2729,7 +2718,6 @@
 {
 struct m3_card *card;
 
-pm_unregister_all(m3_pm_callback);
 unregister_reboot_notifier(m3_reboot_nb);
 
 while ((card = devs)) {
@@ -2767,72 +2755,17 @@
 
 for(card = devs; card != NULL; card = card-next) {
 if(!card-in_suspend)
-m3_suspend(card);
-}
-return 0;
-}
-
-MODULE_AUTHOR("Zach Brown [EMAIL PROTECTED]");
-MODULE_DESCRIPTION("ESS Maestro3/Allegro Driver");
-#ifdef M_DEBUG
-MODULE_PARM(debug,"i");
-#endif
-MODULE_PARM(external_amp,"i");
-
-static struct pci_driver m3_pci_driver = {
-name:   "ess_m3_audio",
-id_table:   m3_id_table,
-probe:  m3_probe,
-remove: m3_remove,
-};
-
-static int __init m3_init_module(void)
-{
-if (!pci_present())   /* No PCI bus in this machine! */
-return -ENODEV;
-
-printk(KERN_INFO PFX "version " DRIVER_VERSION " built at " __TIME__ " " __DATE__ 
"\n");
-
-if (register_reboot_notifier(m3_reboot_nb)) {
-printk(KERN_WARNING PFX "reboot notifier registration failed\n");
-return -ENODEV; /* ? */
-}
-
-if (!pci_register_driver(m3_pci_driver)) {
-pci_unregister_driver(m3_pci_driver);
-return -ENODEV;
+m3_suspend(card-pcidev); /* XXX legal? */
 }
 return 0;
 }
 
-static void __exit m3_cleanup_module(void)
-{
-pci_unregister_driver(m3_pci_driver);
-}
-
-module_init(m3_init_module);
-module_exit(m3_cleanup_module);
-
-void check_suspend(struct m3_card *card)
-{
-DECLARE_WAITQUEUE(wait, current);
-
-if(!card-in_suspend) 
-return;
-
-card-in_suspend++;
-add_wait_queue(card-suspend_queue, wait);
-current-state = TASK

Re: PATCH: maestro ported to 2.4 PCI API

2001-05-21 Thread Zach Brown


   - ported to Linux 2.4 PCI API, PCI module based, cleaned up
 return values. (taking into account all the hints Jeff has given
 me ;)

cool :)

   - did NOT change any power management support, since I don't know
 anything about power management.

someone else (in .de, it appears to be the source of maestro hacking
nowadays :)) is cleaning up the power management.

   - bumped version.

we might as well just stop using these, they don't mean much of anything
anymore.

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] maestro3 2.4.2 pci dma address fix

2001-02-28 Thread Zach Brown


It seems the maestro3 can't DMA from mem above 256M.   Thanks to
everyone who sent in bug reports that lead us to discover this.. 

this patch fixes that and also allocates memory at driver init time
rather than at device open time..

please apply with wild abandon.

-- 
 zach

--- linux-2.4.2/drivers/sound/maestro3.c.dmaWed Feb 28 09:22:18 2001
+++ linux-2.4.2/drivers/sound/maestro3.cWed Feb 28 10:10:07 2001
@@ -28,6 +28,9 @@
  * Shouts go out to Mike "DJ XPCom" Ang.
  *
  * History
+ *  v1.22 - Feb 28 2001 - Zach Brown [EMAIL PROTECTED]
+ *   allocate mem at insmod/setup, rather than open
+ *   limit pci dma addresses to 28bit, thanks guys.
  *  v1.21 - Feb 04 2001 - Zach Brown [EMAIL PROTECTED]
  *   fix up really dumb notifier - suspend oops
  *  v1.20 - Jan 30 2001 - Zach Brown [EMAIL PROTECTED]
@@ -150,7 +153,7 @@
 
 #define M_DEBUG 1
 
-#define DRIVER_VERSION  "1.21"
+#define DRIVER_VERSION  "1.22"
 #define M3_MODULE_NAME  "maestro3"
 #define PFX M3_MODULE_NAME ": "
 
@@ -330,6 +333,12 @@
 
 MODULE_DEVICE_TABLE (pci, m3_id_table);
 
+/*
+ * reports seem to indicate that the m3 is limited
+ * to 28bit bus addresses.  rgggh...
+ */
+#define M3_PCI_DMA_MASK 0x0fff
+
 static unsigned 
 ld2(unsigned int x)
 {
@@ -1943,6 +1952,9 @@
 static void
 free_dmabuf(struct pci_dev *pci_dev, struct dmabuf *db)
 {
+if(db-rawbuf == NULL)
+return;
+
 DPRINTK(DPSTR,"freeing %p from dmabuf %p\n",db-rawbuf, db);
 
 {
@@ -1967,7 +1979,7 @@
 int minor = MINOR(inode-i_rdev);
 struct m3_card *c;
 struct m3_state *s = NULL;
-int i, ret = 0;
+int i;
 unsigned char fmtm = ~0, fmts = 0;
 unsigned long flags;
 
@@ -2013,10 +2025,6 @@
 spin_lock_irqsave(s-lock, flags);
 
 if (file-f_mode  FMODE_READ) {
-if(allocate_dmabuf(s-card-pcidev, (s-dma_adc)))  {
-ret = -ENOMEM;
-goto out;
-}
 fmtm = ~((ESS_FMT_STEREO | ESS_FMT_16BIT)  ESS_ADC_SHIFT);
 if ((minor  0xf) == SND_DEV_DSP16)
 fmts |= ESS_FMT_16BIT  ESS_ADC_SHIFT; 
@@ -2025,10 +2033,6 @@
 set_adc_rate(s, 8000);
 }
 if (file-f_mode  FMODE_WRITE) {
-if(allocate_dmabuf(s-card-pcidev, (s-dma_dac)))  {
-ret = -ENOMEM;
-goto out;
-}
 fmtm = ~((ESS_FMT_STEREO | ESS_FMT_16BIT)  ESS_DAC_SHIFT);
 if ((minor  0xf) == SND_DEV_DSP16)
 fmts |= ESS_FMT_16BIT  ESS_DAC_SHIFT;
@@ -2040,7 +2044,6 @@
 s-open_mode |= file-f_mode  (FMODE_READ | FMODE_WRITE);
 
 MOD_INC_USE_COUNT;
-out:
 up(s-open_sem);
 spin_unlock_irqrestore(s-lock, flags);
 return 0;
@@ -2064,7 +2067,6 @@
 m3_remove_list(s-card, (s-card-mixer_list), s-dma_dac.mixer_index);
 nuke_lists(s-card, (s-dma_dac));
 }
-free_dmabuf(s-card-pcidev, (s-dma_dac));
 }
 if (file-f_mode  FMODE_READ) {
 stop_adc(s);
@@ -2072,7 +2074,6 @@
 m3_remove_list(s-card, (s-card-adc1_list), s-dma_adc.adc1_index);
 nuke_lists(s-card, (s-dma_adc));
 }
-free_dmabuf(s-card-pcidev, (s-dma_adc));
 }
 
 s-open_mode = (~file-f_mode)  (FMODE_READ|FMODE_WRITE);
@@ -2594,8 +2595,8 @@
 
 DPRINTK(DPMOD, "in maestro_install\n");
 
-if (!pci_dma_supported(pci_dev, 0x)) {
-printk(KERN_ERR PFX "architecture does not support 32bit PCI busmaster 
DMA\n");
+if (!pci_dma_supported(pci_dev, M3_PCI_DMA_MASK)) {
+printk(KERN_ERR PFX "architecture does not support limiting to 28bit PCI bus 
+addresses\n");
 return -ENODEV;
 }
 
@@ -2604,6 +2605,8 @@
 
 pci_set_master(pci_dev);
 
+pci_dev-dma_mask = M3_PCI_DMA_MASK;
+
 if( (card = kmalloc(sizeof(struct m3_card), GFP_KERNEL)) == NULL) {
 printk(KERN_WARNING PFX "out of memory\n");
 return -ENOMEM;
@@ -2681,6 +2684,12 @@
 if ((s-dev_audio = register_sound_dsp(m3_audio_fops, -1))  0) {
 break;
 }
+
+if( allocate_dmabuf(card-pcidev, (s-dma_adc)) ||
+allocate_dmabuf(card-pcidev, (s-dma_dac)))  { 
+ret = -ENOMEM;
+goto out;
+}
 }
 
 if(request_irq(card-irq, m3_interrupt, SA_SHIRQ, card_names[card-card_type], 
card)) {
@@ -2734,8 +2743,12 @@
 for(i=0;iNR_DSPS;i++)
 {
 struct m3_state *s = card-channels[i];
-if(s-dev_audio != -1)
-unregister_sound_dsp(s-dev_audio);
+if(s-dev_audio  0)
+continue;
+
+unregister_sound_dsp(s-dev_audio);
+free_dmabuf(card-pcidev, s-dma_adc);
+free_dmabuf(card-pcidev, s-dma_dac);
 }
 
 release_region(card-iobase, 256);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a messa

[RFC] pci_dma_set_mask()

2001-02-28 Thread Zach Brown


This patch really has two parts.  Most of it adds a helper function that
does the 

if(pci_dma_supported()) { -dma_mask = mask }

code path.  I was using the api today and didn't realize that I had to
set the mask myself, I assumed the _supported call would do it.   If
people prefer the struct setting api, thats fine with me :)  

It also seems goofy that pci_dma_supported() unconditionally return true
on x86.  If a device needs a mask smaller than GFP_DMA (admitedly probably
won't exist in the real world, one hopes), they're going to have troubles
because x86 just falls back to GFP_DMA when the mask isn't 0x..

all totally untested, of course..

-- 
 zach

--- linux-2.4.2/include/linux/pci.h.dmasup  Wed Feb 28 10:26:14 2001
+++ linux-2.4.2/include/linux/pci.h Wed Feb 28 10:30:12 2001
@@ -527,6 +527,7 @@
 
 int pci_enable_device(struct pci_dev *dev);
 void pci_set_master(struct pci_dev *dev);
+int pci_set_dma_mask(struct pci_dev *dev, dma_addr_t mask);
 int pci_set_power_state(struct pci_dev *dev, int state);
 int pci_assign_resource(struct pci_dev *dev, int i);
 
--- linux-2.4.2/include/asm-i386/pci.h.dmasup   Wed Feb 28 10:19:01 2001
+++ linux-2.4.2/include/asm-i386/pci.h  Wed Feb 28 10:25:40 2001
@@ -152,6 +152,14 @@
  */
 extern inline int pci_dma_supported(struct pci_dev *hwdev, dma_addr_t mask)
 {
+/*
+ * we fall back to GFP_DMA when the mask isn't all 1s,
+ * so we can't guarantee allocations that must be
+ * within a tighter range than GFP_DMA..
+ */
+if(mask  0x00ff)
+return 0;
+
return 1;
 }
 
--- linux-2.4.2/drivers/pci/pci.c.dmasupWed Feb 28 10:26:34 2001
+++ linux-2.4.2/drivers/pci/pci.c   Wed Feb 28 10:27:34 2001
@@ -518,6 +518,18 @@
pcibios_set_master(dev);
 }
 
+int
+pci_set_dma_mask(struct pci_dev *dev, dma_addr_t mask)
+{
+if(! pci_dma_supported(dev, mask))
+return 0;
+
+dev-dma_mask = mask;
+
+return 1;
+}
+
+
 /*
  * Translate the low bits of the PCI base
  * to the resource type
@@ -1206,6 +1218,7 @@
 EXPORT_SYMBOL(pci_find_slot);
 EXPORT_SYMBOL(pci_find_subsys);
 EXPORT_SYMBOL(pci_set_master);
+EXPORT_SYMBOL(pci_set_dma_mask);
 EXPORT_SYMBOL(pci_set_power_state);
 EXPORT_SYMBOL(pci_assign_resource);
 EXPORT_SYMBOL(pci_register_driver);

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] pci_dma_set_mask()

2001-02-28 Thread Zach Brown


 pci_dma_supported has a boolean return, but the kernel norm is to return
 zero on success, and -EFOO on error.  I like your proposal with the

*nod*  I just followed pci_dma_supported().

 extremely minor nit that I think pci_set_dma_mask should return ENODEV
 or EIO or something on error, and zero on success.

I agree, though I'd like to leave the decision up to people who live and
breathe this stuff.

please feel free to make minor adjustments and submit :)

- z
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC] pci_set_dma_mask() + doc :)

2001-03-01 Thread Zach Brown


please feel free to flame or apply, I'm not sure I'm really fond of the
code example..

- z

--- linux-2.4.2/include/linux/pci.h.dmasup  Wed Feb 28 10:26:14 2001
+++ linux-2.4.2/include/linux/pci.h Wed Feb 28 10:30:12 2001
@@ -527,6 +527,7 @@
 
 int pci_enable_device(struct pci_dev *dev);
 void pci_set_master(struct pci_dev *dev);
+int pci_set_dma_mask(struct pci_dev *dev, dma_addr_t mask);
 int pci_set_power_state(struct pci_dev *dev, int state);
 int pci_assign_resource(struct pci_dev *dev, int i);
 
--- linux-2.4.2/include/asm-i386/pci.h.dmasup   Wed Feb 28 10:19:01 2001
+++ linux-2.4.2/include/asm-i386/pci.h  Wed Feb 28 10:25:40 2001
@@ -152,6 +152,14 @@
  */
 extern inline int pci_dma_supported(struct pci_dev *hwdev, dma_addr_t mask)
 {
+/*
+ * we fall back to GFP_DMA when the mask isn't all 1s,
+ * so we can't guarantee allocations that must be
+ * within a tighter range than GFP_DMA..
+ */
+if(mask  0x00ff)
+return 0;
+
return 1;
 }
 
--- linux-2.4.2/drivers/pci/pci.c.dmasupWed Feb 28 10:26:34 2001
+++ linux-2.4.2/drivers/pci/pci.c   Thu Mar  1 19:04:35 2001
@@ -518,6 +518,18 @@
pcibios_set_master(dev);
 }
 
+int
+pci_set_dma_mask(struct pci_dev *dev, dma_addr_t mask)
+{
+if(! pci_dma_supported(dev, mask))
+return -EIO;
+
+dev-dma_mask = mask;
+
+return 0;
+}
+
+
 /*
  * Translate the low bits of the PCI base
  * to the resource type
@@ -1206,6 +1218,7 @@
 EXPORT_SYMBOL(pci_find_slot);
 EXPORT_SYMBOL(pci_find_subsys);
 EXPORT_SYMBOL(pci_set_master);
+EXPORT_SYMBOL(pci_set_dma_mask);
 EXPORT_SYMBOL(pci_set_power_state);
 EXPORT_SYMBOL(pci_assign_resource);
 EXPORT_SYMBOL(pci_register_driver);
--- linux-2.4.2/Documentation/DMA-mapping.txt.dmasupWed Feb 28 23:03:04 2001
+++ linux-2.4.2/Documentation/DMA-mapping.txt   Thu Mar  1 19:29:38 2001
@@ -71,12 +71,35 @@
if (! pci_dma_supported(pdev, 0x00ff))
goto ignore_this_device;
 
+When DMA is possible for a given mask, the PCI layer must be informed of the
+mask for later allocation operations on the device.  This is achieved by
+setting the dma_mask member of the pci_dev structure, like so:
+
+#define MY_HW_DMA_MASK 0x00ff
+
+   if (! pci_dma_supported(pdev, MY_HW_DMA_MASK))
+   goto ignore_this_device;
+
+   pdev-dma_mask = MY_HW_DMA_MASK;
+
+A helper function is provided which performs this common code sequence:
+
+   int pci_set_dma_mask(struct pci_dev *pdev, dma_addr_t device_mask)
+
+Unlike pci_dma_supported(), this returns -EIO when the PCI layer will not be
+able to DMA with addresses restricted by that mask, and returns 0 when DMA
+transfers are possible.  If the call succeeds, the dma_mask will have been
+updated so that your driver need not worry about it.
+
 There is a case which we are aware of at this time, which is worth
 mentioning in this documentation.  If your device supports multiple
 functions (for example a sound card provides playback and record
 functions) and the various different functions have _different_
 DMA addressing limitations, you may wish to probe each mask and
-only provide the functionality which the machine can handle.
+only provide the functionality which the machine can handle.  It
+is important that the last call to pci_set_dma_mask() be for the 
+most specific mask.
+
 Here is pseudo-code showing how this might be done:
 
#define PLAYBACK_ADDRESS_BITS   0x
@@ -86,14 +109,14 @@
struct pci_dev *pdev;
 
...
-   if (pci_dma_supported(pdev, PLAYBACK_ADDRESS_BITS)) {
+   if (pci_set_dma_mask(pdev, PLAYBACK_ADDRESS_BITS)) {
card-playback_enabled = 1;
} else {
card-playback_enabled = 0;
printk(KERN_WARN "%s: Playback disabled due to DMA limitations.\n",
   card-name);
}
-   if (pci_dma_supported(pdev, RECORD_ADDRESS_BITS)) {
+   if (pci_set_dma_mask(pdev, RECORD_ADDRESS_BITS)) {
card-record_enabled = 1;
} else {
card-record_enabled = 0;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[CFT] maestro update vs 2.2.18

2001-03-04 Thread Zach Brown


I finally spent some time fixing up the maestro driver.  lots of
feature additions had backed up, and the source was rotting.  Its still
gross, but at least its cleaned up a bit.  "It works for me" on my
pentium with an ESS maestro2 engineering board, but laptops will be
another story entirely.  I'd love it if people could apply this patch to
vanilla 2.2.18 and let me know how it goes.  

The patch does a few things.  Most interestingly for the user, it moves
away from the model of having multiple /dev/dsp? files and instead allows
/dev/dsp to be opened concurrently.  It also adds some support for the
hardware volume buttons on laptops, but not all vendors wire this the same
way.  As I don't have a maestro-bearing laptop, this is totally untested.

The code is butchered, so the diff is almost illegible.  Perhaps I'll
learn and do things in stages next time, but I was on a roll :)  One
of the more notable changes involves using the kernel's ac97_codec
code rather than its own.  Hopefully this will result in better mixer
behaviour.

I'm particularly interested in hearing how suspend/resume functions,
whether or not the multi-open stuff works, and I'd like to get subvendor
IDs from people whose laptop's hardware volume buttons work.  See the
Documentation/sound/Maestro text for instructions on enabling multi-open
(channels=2 or 4) and hardware volume support (hw_vol=1).

Its an awfully large diff, so it can be fetched from:

http://www.zabbo.net/maestro/patches/2.2.18-mega-1.diff.gz

if this works I'll officially submit it and make the same sorts of
changes to 2.4.  

-- 
 zach
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: IDE UDMA on a CMD-648 Chip

2001-03-17 Thread Zach Brown


 I can't find the chip's datasheet. CMD only gives it to direct customers.
 I do have the datasheet for the CMD-646U, a prior UDMA supporting chip.

Have you tried mailing them?.  I sent mail to something silly like
support@cmd.  After they found the right person for me to talk to, I
mentioned wanting to play with linux support, and he happily sent the
datasheets for the 648 and 649.

I use andre's 2.2 ide patches to get support for it..

-- 
 zach
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: user space web server accelerator support

2001-03-20 Thread Zach Brown


 Fantastic!
 
 I was not aware of it, sorry... where can I find some doc?

There are some patches in the apache source rpms in
http://www.zabbo.net/phhttpd/ that shows how apache can connect to
another daemon and get its incoming connections sockets from it.

phhttpd itself is pretty hairy code (don't ask :)), but the apache
changes are pretty straight forward.

-- 
 zach
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Patch: linux-2.4.0-test11/drivers/sound/maestro.c port to new PCI interface

2000-11-21 Thread Zach Brown


 Unrelated to your change:  the maestro reboot notifier shouldn't need to
 unregister all that stuff.  Who cares if the sound devices are freed,
 since we are rebooting.  free_irq+maestro_power seems sufficient.  or
 maybe stop_dma+free_irq+poweroff.

its only the power stuff that matters.  some biosen don't power down
properly if the chip isn't powered down.  That could actually be because
of weird changes we make to the chip and then mask by powering it down,
but shutting it down made the machines halt again :)

-- 
 zach
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: Patch: linux-2.4.0-test11/drivers/sound/maestro.c port to new PCI interface

2000-11-21 Thread Zach Brown


[first off, thanks for kicking this stuff into gear Adam.  I'm way too
lazy to do this stuff of my own volition :)]

 was to serialize access to the mixer, there are surely better ways to do
 it.  Why are interrupts disabled?

the maestro has crappy register indirection that you must use to do
nearly anything.  when the locking was added we just went nutty with one
big lock :)  yes, this could be cleaned up with a mixer specific lock
as the mixer regs aren't atomic, but don't require the same indirection
as the rest of the maestro stuff.  the locking fix is still valid if
they're all under one big lock.

   As far as I can tell, I agree with you, but I do not think
 that is related to the move to the new PCI interface.

*nod*

   I also agree that the ioctl patch is kind of a bandaid over
 the problems that you described, and, while Zach Brown can speak

The biggest problem is that the current code is gross gross gross.
I've been avoiding dealing with it too much in the hopes that moving to
oss_audio will make things much more friendly across the board.

 problem permanent, and (2) it is an incremental improvement over
 the status quo.

*nod*  I'd like 2.2/2.4 to at least be solid for people so I can
disappear (again) to do the oss_audio work.  Really, I'd hope it
wouldn't take all that long.

   That said, I am willing to help try to clean up the
 locking code in maestro if Zach thinks it is worth doing right
 now, since I have a notebook computer with a maestro chip in

If you get me a patch thats obviously correct, I'd happily agree to
it going in.  I'm not sure its immeditately needed, but won't stop you
from playing with things :).  Jeff has the start of a patch that moves
maestro.c to be a consumer of ac97_codec.  It was really close, except
for some locking oopsies.  If you're dorking around with the locking, you
might be able to fix that up at the same time.  The new ac97 interface
cleans up the code a lot. you can look at how the maestro3 does it,
its almost identical to how maestro.c + ac97 would look.  [hah, except
that maestro3 hasn't had any locking applied to it at all.]

 This driver needs __devxxx, I've heard mention of some hotpluggable
 audio that is based on the maestro chipset (which chip I don't remember)

scary. :)

this all reminds me, the maestro3 driver is working well enough for most
people that I should get it going on 2.4 and submit it..

-- 
 zach
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: sunhme.c patch for new PCI interface (UNTESTED)

2000-11-16 Thread Zach Brown


On Thu, Nov 16, 2000 at 04:22:36AM -0800, David S. Miller wrote:

 Sure, that sounds nice.
 
 Actually, one of the possible "grand plans" for 2.5 is a unified
 "struct device".  I don't know what will actually happen here.

apologies for pointing out the potentially obvoius, but people might
want to investigate 'newbus' while brainstorming about this:

http://www.daemonnews.org/27/newbus-intro.html

-- 
 zach
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Maestro3/Allegro: (was ESS device 1998)

2000-11-02 Thread Zach Brown


On Thu, Nov 02, 2000 at 12:03:41PM +, Mo McKinlay wrote:

 I recently obtained an HP Omnibook XE2 laptop. It's a reasonably

As people have mentioned, there is an alpha free driver near
http://www.zabbo.net/maestro3/.  Its not quite up to par yet.  

maybe the web page should talk a bit more about the chip familiy. The
maestro3 has a lot of pieces in common with the maestro2, except for
the part of the chip that did pcm manipulation.  the m3 only has a dsp
where the m2 had specific silicon for doing pcm work.  the allegro
is a "slimmed down" maestro3, and neither have anything to do with
cirrus/crystal CS parts as far as I know :)

I expect you'll have the 'slow down' problem on the Xe2, we have the
clocking messed up on some implementations (those that don't clock the
thing at 49mhz, as god intended? :))
 
 I've given up on the internal modem (I'm 90% sure it's some kind of

*nod*  Its the usual mc97 codec setup that leaves the hard work for the
processor.  I'm sure one can play around with the dsp on it as well,
but we don't have specs on the dsp's internals.

-- 
 zach
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

maestro3 2.2 oss driver snapshot

2000-11-04 Thread Zach Brown


http://www.zabbo.net/maestro3/

contains a driver for the maestro3 and allegro chipsets that I'm fairly
happy with.

its 2.2 only for now, play with 2.4 at your own risk.  for now it
includes its own ac97_codec.c that is backported from 2.2.

I expect playback to work as well as ac97 mixing.  apm support works pretty
darn well, you can suspend during pcm playback and it should start
playing again on resume.  mmap() should work, but is untested.

record does not work at all.

if you test it, please let me know how it goes and tell me all about
your hardware.  I'll have a polished version later that will be submitted
into the kernel proper.

-- 
 zach
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

[PATCH] maestro3 2.4.4-pre5 fixes

2001-04-21 Thread Zach Brown


The maestro3 driver has a error cleanup bug that would leave its reboot
notifier registerd after exiting init_module with an error.  Bayad.
It also had a minor bug where it didn't explicitly set the codec-id
before probing..

this patch vs 2.4.4-pre5 (I hope, its really vs jeff's cvs :)) fixes both.

Linus, please apply.

-- 
 zach

Index: maestro3.c
===
RCS file: /cvsroot/gkernel/linux_2_4/drivers/sound/maestro3.c,v
retrieving revision 1.1.1.24.6.1
diff -u -r1.1.1.24.6.1 maestro3.c
--- maestro3.c  2001/04/20 01:40:38 1.1.1.24.6.1
+++ maestro3.c  2001/04/21 14:36:21
@@ -2307,6 +2307,8 @@
 codec-private_data = card;
 codec-codec_read = m3_ac97_read;
 codec-codec_write = m3_ac97_write;
+/* someday we should support secondary codecs.. */
+codec-id = 0;
 
 if (ac97_probe_codec(codec) == 0) {
 printk(KERN_ERR PFX "codec probe failed\n");
@@ -2933,6 +2935,7 @@
 
 if (!pci_register_driver(m3_pci_driver)) {
 pci_unregister_driver(m3_pci_driver);
+unregister_reboot_notifier(m3_reboot_nb);
 return -ENODEV;
 }
 return 0;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: X15 alpha release: as fast as TUX but in user space

2001-05-02 Thread Zach Brown


 i think Zach's phhttpd is an important milestone as well, it's the first
 userspace webserver that shows how to use event-based, sigio-based async
 networking IO and sendfile() under Linux. (I believe it had some

*blush*

 performance problems related to sigio queue overflow, these issues might
 be solved in the latest kernels.) The zerocopy enhancements should help
 phhttpd as well.

oh, it has a bunch of problems :)  over-threading created complexity in
the fast path.  It always spends memory on a contiguous header region for
each connection, which may not be valid in the days of zero copy sendmsg.
It does IO in the fast path.  And looking back at it, I'm struck by how
naive most of the code is :) :)

I've always been tempted to go back and take a real swing at a
nice content server, but there's only so many hours in the day, and
apache+thttpd+tux complete the problem space.  If X15 isn't released
with an open license, I may be tempted yet again :)

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Patch to make ymfpci legacy address 16 bits

2001-05-09 Thread Zach Brown


On Wed, May 09, 2001 at 05:08:15PM -0400, Jeff Garzik wrote:

 Why does maestro.c not use my suggestion?  Because it doesn't use struct
 pci_driver.

I finally found an able hacker with maestro hardware with power
management.  He not only fixed the nasty pm races that were causing
channel corruption, but he seems willing to move it towards the proper
pci_driver APIs as well.

So don't worry Jeff, we'll fix the evil maestro.c driver yet :) :)

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] ess maestro, support for hardware volume control

2001-06-09 Thread Zach Brown


 this patch applies to (at least) 2.4.3 up to and including 2.4.6-pre2.
 It enables the hardware volume control feature of the maestro.

cool.  I had support for this in the mega-patch I posted long ago, but
I never seperated and submitted those changes 'cause I'm a moron.

 By giving hwv=0 to insmod one can explicitly disable it. Setting

can we have a better name like 'hwvol_enable'?

 + set_mixer(c, 0, val);

careful.  you just used the indirect ac97 registers without holding the
card's lock..  if another processor does a mixer ioctl while this is
happening you'll get weird behaviour.

it looks like you should just be able to spin_{,un}lock(card-lock)
around that call, but the maestro locking is so friggin' twisty.. this
gets even more exciting when the driver uses ac97_codec, which the
mega-patch also had in it.

Fix the locking (and the obscure parameter name? :)) and it looks
fine.. good work.

-- 
 zach
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] ess maestro, support for hardware volume control

2001-06-09 Thread Zach Brown


 below is the version with the suggested fixes, and with s/hwv/hwvol/ for
 hwv_input also.

fantastic, thanks lukas.

alan, can you throw this in -ac?  I don't think it will cause problems
for people with nonstandard wiring on the hw vol pins (read: dell
lattitudes), but if it does we can blacklist the lattitudes (later turning
into enabling support code for their custom hw vol wiring.. its on the
todo list :/).  This should work on inspirons.

- z

 --- linux-2.4.6-pre2/drivers/sound/maestro.c  Sat Jun  9 16:55:22 2001
 +++ linux/drivers/sound/maestro.c Sat Jun  9 19:48:34 2001
 @@ -115,6 +115,10 @@
   *   themselves, but we'll see.  
   *   
   * History
 + *  v0.15 - Jun 09 2001 - Lukas Schroeder [EMAIL PROTECTED]
 + *   enable hardware volume control (by default)
 + *   add hwvol_enable= to allow disabling of HWV (values are 0 or 1)
 + *   add hwvol_input= to allow selecting the HWV input pins (values are 0 or 1)
   *  (still kind of v0.14) Nov 23 - Alan Cox [EMAIL PROTECTED]
   *   Add clocking= for people with seriously warped hardware
   *  (still v0.14) Nov 10 2000 - Bartlomiej Zolnierkiewicz [EMAIL PROTECTED]
 @@ -269,8 +273,13 @@
   
  static int clocking=48000;
  
 +/* enable hardware volume control? */
 +static int hwvol_enable = 1;
 +/* hardware volume input pin selection */
 +static int hwvol_input = 0;
 +
  /* - */
 -#define DRIVER_VERSION 0.14
 +#define DRIVER_VERSION 0.15
  
  #ifndef PCI_VENDOR_ESS
  #define PCI_VENDOR_ESS   0x125D
 @@ -312,6 +321,9 @@
  #define NR_APUS  64
  #define NR_APU_REGS  16
  
 +/* steps per hardware volume count */
 +#define HWV_MIXER_STEP   15
 +
  /* acpi states */
  enum {
   ACPI_D0=0,
 @@ -514,6 +526,7 @@
  
  /* - */
  
 +static void set_mixer(struct ess_card *card,unsigned int mixer, unsigned int val ) ;
  static void check_suspend(struct ess_card *card);
  
  static struct ess_card *devs = NULL;
 @@ -1898,10 +1911,20 @@
  
   if(event(16))
   {
 - /* XXX if we have a hw volume control int enable
 - all the ints?  doesn't make sense.. */
 + unsigned int val;
 +
   event = inw(c-iobase+0x18);
 - outb(0xFF, c-iobase+0x1A);
 + outb((16), c-iobase+0x1A);
 +
 + /* read the HW Master Volume Counter
 +   Bits 7:5   Master Volume Left
 +   Bits 3:1   Master Volume Right
 +*/
 + i = inb(c-iobase+0x1f);
 + val = ((HWV_MIXER_STEP * ((i1)  7))  8) | HWV_MIXER_STEP * 
((i5)  7);
 + spin_lock(s-lock);
 + set_mixer(c, 0, val);
 + spin_unlock(s-lock);
   }
   else
   {
 @@ -3088,8 +3111,10 @@
   w=~(114);/* External clock */
   
   w=~(17); /* HWV off */
 + if (hwvol_enable) w|=(17);
   w=~(16); /* Debounce off */
 - w=~(15); /* GPIO 4:5 */
 + w=~(15); /* GPIO 4:5 ; HVI pin selection */
 + if (hwvol_input) w|=(15);
   w|= (14); /* Disconnect from the CHI.  Enabling this made a dell 
7500 work. */
   w=~(12); /* MIDI fix off (undoc) */
   w=~(11); /* reserved, always write 0 */
 @@ -3170,7 +3195,8 @@
   outw(w, iobase+0x18);
  
   w=inw(iobase+0x18);
 - w=~(16); /* Harpo off */
 + w=~(16); /* HWV irq off */
 + if (hwvol_enable) w|=(16);
   outw(w, iobase+0x18);
   
   w=inw(iobase+0x18);
 @@ -3487,6 +3513,7 @@
   /* now go to sleep 'till something interesting happens */
   maestro_power(card,ACPI_D2);
  
 + printk(KERN_INFO maestro: hardware volume control %senabled\n, 
(hwvol_enable) ?  : not );
   printk(KERN_INFO maestro: %d channels configured.\n, num);
   return 1; 
  }
 @@ -3593,6 +3620,10 @@
  MODULE_PARM(dsps_order,i);
  MODULE_PARM(use_pm,i);
  MODULE_PARM(clocking, i);
 +
 +MODULE_PARM(hwvol_enable, i);
 +MODULE_PARM(hwvol_input, i);
 +
  
  void cleanup_module(void) {
   M_printk(maestro: unloading\n);
 
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] ess maestro, support for hardware volume control

2001-06-09 Thread Zach Brown


 I now have a patch that will output the hwv buttons pressed (up,
 down, mute) to a new dynamically allocated misc device as letters
 u, d, m, instead of directly modifying the mixer.  Anyone want
 that?  It's more flexible than either the patch that's currently
 in -ac or Lukas's patch, but you need a little userspace daemon
 for it to do anything useful.

hmm.. how do the alsa guys deal with hw volume?  I'm not sure I like
adding more stuff to the OSS API.  

-- 
 zach
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC][PATCH] cutting up struct kernel_stat into cpu_stat

2001-06-21 Thread Zach Brown


The attached patch-in-progress removes the per-cpu statistics from
struct kernel_stat and puts them in a cpu_stat structure, one per cpu,
cacheline padded.  The data is still coolated and presented through
/proc/stat, but another file /proc/cpustat is also added.  The locking
is as nonexistant as it was with kernel_stat, but who cares, they're
just fuzzy stats to be eyeballed by system tuners :).

A tool for printing the cpu stats specifically can be found near:

http://www.osdlab.org/sw_resources/cpustat/index.shtml

Its output is almost identical to solaris' mpstat.  

I'm not sure I like the macro use, but it shields the callers from the
union garbage.  We can easily also make a THIS_CPU_STAT_ADD() interface,
as some have hinted would be nice :)

Currently its mostly ( :) ) only collecting the stats that were
collected in kernel_stat.  I'd like to add more stats -- page faults,
syscalls, cross-cpu calls, etc.  I understand people not wanting more
live cachelines in the fast paths.  I can make CPU_CRITICAL_STAT defines
that are config-ed out..

comments?  If its ok I can whip up a patch that updates all the ports
use of -irqs[] as well.

- z
[ heading out for lunch :) ]


--- linux-2.4.5-cpustat/fs/proc/proc_misc.c.cpustat Fri Apr 13 20:26:07 2001
+++ linux-2.4.5-cpustat/fs/proc/proc_misc.c Thu Jun 21 12:23:49 2001
@@ -265,32 +265,36 @@
int i, len;
extern unsigned long total_forks;
unsigned long jif = jiffies;
-   unsigned int sum = 0, user = 0, nice = 0, system = 0;
+   unsigned int sum = 0, user = 0, nice = 0, system = 0, ctxt = 0;
int major, disk;
 
for (i = 0 ; i  smp_num_cpus; i++) {
int cpu = cpu_logical_map(i), j;
 
-   user += kstat.per_cpu_user[cpu];
-   nice += kstat.per_cpu_nice[cpu];
-   system += kstat.per_cpu_system[cpu];
+   user += CPU_STAT_VAL(cpu, user);
+   nice += CPU_STAT_VAL(cpu, nice);
+   system += CPU_STAT_VAL(cpu, system);
+   ctxt += CPU_STAT_VAL(cpu, context_swtch);
 #if !defined(CONFIG_ARCH_S390)
for (j = 0 ; j  NR_IRQS ; j++)
-   sum += kstat.irqs[cpu][j];
+   sum += CPU_STAT_VAL(cpu, irqs[j]);
 #endif
}
 
len = sprintf(page, cpu  %u %u %u %lu\n, user, nice, system,
  jif * smp_num_cpus - (user + nice + system));
-   for (i = 0 ; i  smp_num_cpus; i++)
+   for (i = 0 ; i  smp_num_cpus; i++) {
+   unsigned int user_i, nice_i, system_i;
+
+   user_i = CPU_STAT_VAL(cpu_logical_map(i), user);
+   nice_i = CPU_STAT_VAL(cpu_logical_map(i), nice);
+   system_i = CPU_STAT_VAL(cpu_logical_map(i), system);
+
len += sprintf(page + len, cpu%d %u %u %u %lu\n,
i,
-   kstat.per_cpu_user[cpu_logical_map(i)],
-   kstat.per_cpu_nice[cpu_logical_map(i)],
-   kstat.per_cpu_system[cpu_logical_map(i)],
-   jif - (  kstat.per_cpu_user[cpu_logical_map(i)] \
-  + kstat.per_cpu_nice[cpu_logical_map(i)] \
-  + kstat.per_cpu_system[cpu_logical_map(i)]));
+   user_i, nice_i, system_i, 
+   jif - (  user_i + nice_i + system_i ) );
+   }
len += sprintf(page + len,
page %u %u\n
 swap %u %u\n
@@ -330,13 +334,58 @@
\nctxt %u\n
btime %lu\n
processes %lu\n,
-   kstat.context_swtch,
+   ctxt, 
xtime.tv_sec - jif / HZ,
total_forks);
 
return proc_calc_metrics(page, start, off, count, eof, len);
 }
 
+static int cstat_read_proc(char *page, char **start, off_t off,
+int count, int *eof, void *data)
+{
+   int i, len;
+
+   len = sprintf(page, cpu_stat 0.0\n);
+
+   for (i = 0 ; i  smp_num_cpus; i++) {
+   unsigned int user, nice, system;
+   int j, cpu = cpu_logical_map(i);
+
+#if !defined(CONFIG_ARCH_S390)
+   len += sprintf(page + len, cpu%d irqs ,  cpu);
+   for (j = 0 ; j  NR_IRQS ; j++) {
+   len += sprintf(page + len,  %u, 
+   CPU_STAT_VAL(cpu, irqs[j]));
+   }
+   len += sprintf(page + len, \n);
+#endif
+#if defined(CONFIG_SMP)
+   len += sprintf(page + len, cpu%d context_migration %u\n,  
+   cpu, CPU_STAT_VAL(cpu, context_migration));
+#endif
+   len += sprintf(page + len, cpu%d bottom_halves %u\n,  
+   cpu, CPU_STAT_VAL(cpu, bh));
+   len += sprintf(page + len, cpu%d context_switches %u\n,  
+   cpu, CPU_STAT_VAL(cpu, context_swtch));
+
+   user =

[RFC][PATCH] struct kernel_stat - struct cpu_stat[NR_CPUS]

2001-07-02 Thread Zach Brown


Currently struct kernel_stat has a few pre cpu arrays.  This creates
cacheline exchange noise as the cpus update their entries in each array.
This patch creates an array of per cpu structs.  The structure is padded
to the length of a cacheline.  The meat of the patch against 2.4.6-pre8
is attached.  The rest of the patch is rather large because it touches
all the architectures' use of ks-irqs[], and can be found at

http://www.osdlab.org/sw_resources/cpustat/cpustat-2.4.6.pre8-1.diff

These per cpu statistics are reported via a new /proc/cpustat, a quick
tool for processing that output, vmstat-style, can be found near

http://www.osdlab.org/sw_resources/cpustat/index.shtml

In addition to shuffling structures around, the patch adds the recording
of context scheduling migrations as well as minor, major, and
cow faults.

I'd really like to hear what people think of the patch.  Unless someone
strongly is dead set against it, I'd like to send it off to Linus and
move on to other things :)  Would arch maintainers rather that I fed
them per-arch patches for their trees?

- z

--- linux-2.4.6-pre8/fs/proc/proc_misc.c.cpustatFri Apr 13 20:26:07 2001
+++ linux-2.4.6-pre8/fs/proc/proc_misc.cMon Jul  2 15:04:49 2001
@@ -265,32 +265,37 @@
int i, len;
extern unsigned long total_forks;
unsigned long jif = jiffies;
-   unsigned int sum = 0, user = 0, nice = 0, system = 0;
+   unsigned int sum = 0, user = 0, nice = 0, system = 0, ctxt = 0;
int major, disk;
 
for (i = 0 ; i  smp_num_cpus; i++) {
int cpu = cpu_logical_map(i), j;
 
-   user += kstat.per_cpu_user[cpu];
-   nice += kstat.per_cpu_nice[cpu];
-   system += kstat.per_cpu_system[cpu];
+   user += cpu_stat[cpu].user;
+   nice += cpu_stat[cpu].nice;
+   system += cpu_stat[cpu].system;
+   ctxt += cpu_stat[cpu].context_swtch;
 #if !defined(CONFIG_ARCH_S390)
for (j = 0 ; j  NR_IRQS ; j++)
-   sum += kstat.irqs[cpu][j];
+   sum += cpu_stat[cpu].irqs[j];
 #endif
}
 
len = sprintf(page, cpu  %u %u %u %lu\n, user, nice, system,
  jif * smp_num_cpus - (user + nice + system));
-   for (i = 0 ; i  smp_num_cpus; i++)
+   for (i = 0 ; i  smp_num_cpus; i++) {
+   unsigned int user_i, nice_i, system_i;
+   int cpu = cpu_logical_map(i);
+
+   user_i = cpu_stat[cpu].user;
+   nice_i = cpu_stat[cpu].nice;
+   system_i = cpu_stat[cpu].system;
+
len += sprintf(page + len, cpu%d %u %u %u %lu\n,
i,
-   kstat.per_cpu_user[cpu_logical_map(i)],
-   kstat.per_cpu_nice[cpu_logical_map(i)],
-   kstat.per_cpu_system[cpu_logical_map(i)],
-   jif - (  kstat.per_cpu_user[cpu_logical_map(i)] \
-  + kstat.per_cpu_nice[cpu_logical_map(i)] \
-  + kstat.per_cpu_system[cpu_logical_map(i)]));
+   user_i, nice_i, system_i, 
+   jif - (  user_i + nice_i + system_i ) );
+   }
len += sprintf(page + len,
page %u %u\n
 swap %u %u\n
@@ -330,13 +335,64 @@
\nctxt %u\n
btime %lu\n
processes %lu\n,
-   kstat.context_swtch,
+   ctxt, 
xtime.tv_sec - jif / HZ,
total_forks);
 
return proc_calc_metrics(page, start, off, count, eof, len);
 }
 
+static int cstat_read_proc(char *page, char **start, off_t off,
+int count, int *eof, void *data)
+{
+   int i, len;
+
+   len = sprintf(page, cpu_stat 0.0\n);
+
+   for (i = 0 ; i  smp_num_cpus; i++) {
+   unsigned int user, nice, system;
+   int j, cpu = cpu_logical_map(i);
+   struct cpu_stat *cs = cpu_stat[cpu];
+
+#if !defined(CONFIG_ARCH_S390)
+   len += sprintf(page + len, cpu%d irqs ,  cpu);
+   for (j = 0 ; j  NR_IRQS ; j++) {
+   len += sprintf(page + len,  %u, 
+   cs-irqs[j]);
+   }
+   len += sprintf(page + len, \n);
+#endif
+#if defined(CONFIG_SMP)
+   len += sprintf(page + len, cpu%d context_migration %u\n,  
+   cpu, cs-context_migration);
+#endif
+   len += sprintf(page + len, cpu%d context_switches %u\n,  
+   cpu, cs-context_swtch);
+
+   len += sprintf(page + len, cpu%d major_faults %u\n,  
+   cpu, cs-major_fault);
+   len += sprintf(page + len, cpu%d minor_faults %u\n,  
+   cpu, cs-minor_fault);
+   len += sprintf(page + len, cpu%d

[PATCH] 2.4.7-pre3 kernel_stat - cpu_stat[NR_CPUS]

2001-07-06 Thread Zach Brown


This patch does the following:

- creates a cacheline aligned/padded struct cpu_stat[NR_CPUS].  
- moves the [NR_CPUS] members of kernel_stat into cpu_stat

This moves the stat data that a cpu will update into a contiguous region.
Previous users of kernel_stat would compete for an array's cacheline
with other cpus.

- creates /proc/cpu/[0-9]+/ and fs/proc/proc_cpu.c with code for managing
  files in the cpu's directories.

This should be useful for rusty's */active and arjan/rmk's */frequency.
I have no strong feelings about where this lives or what it should be
named.

- adds collection of fault statistics and adds 'context migration'
  recording, per cpu.
- updates users of kernel_stat in proc, reports shouldn't change
- updates every friggin' user of kstat.irqs[] in arch/ with a macro so
  that we never have to do this again.

The patch is rather large, due to that last bit.  It can be found at

http://www.osdlab.org/sw_resources/cpustat/cpustat-2.4.7.pre3-1.diff

with a tool for summarizing /proc/cpu/*/stat at:

http://www.osdlab.org/sw_resources/cpustat/index.shtml

I'd like to get this sent to Linus soon, but wanted to run the
/proc/cpu/* stuff by l-k one last time.

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

maestro3 patch, resent

2001-02-07 Thread Zach Brown


duh.  I sent this to rutgers originally..

--- 

Date: Mon, 5 Feb 2001 07:42:25 -0500
From: Zach Brown [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: [PATCH] maestro3 2.4.1-ac2 shutdown fix

Its a wonder that anyone lets me write code.

The following fixes a goofy shutdown problem with the 2.4 maestro3 driver
as it appears in Alan's 2.4.1-ac2 patch.   If power management was
disabled the maestro3 driver would oops trying to save the dsp state as
the machine shut down.  

The full source for the up to date 2.4 driver can be found at:

http://www.zabbo.net/maestro3/maestro3-2.4-20010204.tar.gz

thanks again to Andres Salomon for continuing to report my dumb bugs.
Hopefully this will be the last of the trivial bugs, this driver needs
some real meaningful cleaning up..

-- 
 zach

--- maestro3.c  Mon Feb  5 06:51:58 2001
+++ maestro3.c  Mon Feb  5 07:40:53 2001
@@ -28,6 +28,8 @@
  * Shouts go out to Mike "DJ XPCom" Ang.
  *
  * History
+ *  v1.21 - Feb 04 2001 - Zach Brown [EMAIL PROTECTED]
+ *   fix up really dumb notifier - suspend oops
  *  v1.20 - Jan 30 2001 - Zach Brown [EMAIL PROTECTED]
  *   get rid of pm callback and use pci_dev suspend/resume instead
  *   m3_probe cleanups, including pm oops think-o
@@ -147,7 +149,7 @@
 
 #define M_DEBUG 1
 
-#define DRIVER_VERSION  "1.20"
+#define DRIVER_VERSION  "1.21"
 #define M3_MODULE_NAME  "maestro3"
 #define PFX M3_MODULE_NAME ": "
 
@@ -2763,7 +2765,6 @@
 static void m3_suspend(struct pci_dev *pci_dev)
 {
 unsigned long flags;
-int index;
 int i;
 struct m3_card *card = pci_dev-driver_data;
 
@@ -2788,15 +2789,18 @@
 
 m3_assp_halt(card);
 
-index = 0;
-DPRINTK(DPMOD, "saving code\n");
-for(i = REV_B_CODE_MEMORY_BEGIN ; i = REV_B_CODE_MEMORY_END; i++)
-card-suspend_mem[index++] = 
-m3_assp_read(card, MEMTYPE_INTERNAL_CODE, i);
-DPRINTK(DPMOD, "saving data\n");
-for(i = REV_B_DATA_MEMORY_BEGIN ; i = REV_B_DATA_MEMORY_END; i++)
-card-suspend_mem[index++] = 
-m3_assp_read(card, MEMTYPE_INTERNAL_DATA, i);
+if(card-suspend_mem) {
+int index = 0;
+
+DPRINTK(DPMOD, "saving code\n");
+for(i = REV_B_CODE_MEMORY_BEGIN ; i = REV_B_CODE_MEMORY_END; i++)
+card-suspend_mem[index++] = 
+m3_assp_read(card, MEMTYPE_INTERNAL_CODE, i);
+DPRINTK(DPMOD, "saving data\n");
+for(i = REV_B_DATA_MEMORY_BEGIN ; i = REV_B_DATA_MEMORY_END; i++)
+card-suspend_mem[index++] = 
+m3_assp_read(card, MEMTYPE_INTERNAL_DATA, i);
+}
 
 DPRINTK(DPMOD, "powering down apci regs\n");
 m3_outw(card, 0x, 0x54);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH] maestro3 still oopses?

2001-02-09 Thread Zach Brown


 The maestro3 driver, included in 2.4.2-pre2 (which I assume is the
 same as maestro3-2.4-20010204.tar.gz, I haven't bothered to try it;
 I'm perfectly happy w/ my patch), oopses upon shutdown.

the maestro3 snapshot in 2.4.2pre2 is not up to date.  I imagine it came
from alan, who got the jan30 patch, but didn't get the trivial feb 04
patch that fixes the oops you're seeing.

the proper patch to 2.4.2-pre2 (and 2.4ac-current, presumably)
is attached.  Does it fix you problem?  [it tries to do so without
duplicating code, you'll notice.]

 + *   use pci_module_init() instead of pci_register_driver().

I'd rather do this in a seperate patch that does lots of other cleanups
that are pending.

- z

--- maestro3.c  Mon Feb  5 06:51:58 2001
+++ maestro3.c  Mon Feb  5 07:40:53 2001
@@ -28,6 +28,8 @@
  * Shouts go out to Mike "DJ XPCom" Ang.
  *
  * History
+ *  v1.21 - Feb 04 2001 - Zach Brown [EMAIL PROTECTED]
+ *   fix up really dumb notifier - suspend oops
  *  v1.20 - Jan 30 2001 - Zach Brown [EMAIL PROTECTED]
  *   get rid of pm callback and use pci_dev suspend/resume instead
  *   m3_probe cleanups, including pm oops think-o
@@ -147,7 +149,7 @@
 
 #define M_DEBUG 1
 
-#define DRIVER_VERSION  "1.20"
+#define DRIVER_VERSION  "1.21"
 #define M3_MODULE_NAME  "maestro3"
 #define PFX M3_MODULE_NAME ": "
 
@@ -2763,7 +2765,6 @@
 static void m3_suspend(struct pci_dev *pci_dev)
 {
 unsigned long flags;
-int index;
 int i;
 struct m3_card *card = pci_dev-driver_data;
 
@@ -2788,15 +2789,18 @@
 
 m3_assp_halt(card);
 
-index = 0;
-DPRINTK(DPMOD, "saving code\n");
-for(i = REV_B_CODE_MEMORY_BEGIN ; i = REV_B_CODE_MEMORY_END; i++)
-card-suspend_mem[index++] = 
-m3_assp_read(card, MEMTYPE_INTERNAL_CODE, i);
-DPRINTK(DPMOD, "saving data\n");
-for(i = REV_B_DATA_MEMORY_BEGIN ; i = REV_B_DATA_MEMORY_END; i++)
-card-suspend_mem[index++] = 
-m3_assp_read(card, MEMTYPE_INTERNAL_DATA, i);
+if(card-suspend_mem) {
+int index = 0;
+
+DPRINTK(DPMOD, "saving code\n");
+for(i = REV_B_CODE_MEMORY_BEGIN ; i = REV_B_CODE_MEMORY_END; i++)
+card-suspend_mem[index++] = 
+m3_assp_read(card, MEMTYPE_INTERNAL_CODE, i);
+DPRINTK(DPMOD, "saving data\n");
+for(i = REV_B_DATA_MEMORY_BEGIN ; i = REV_B_DATA_MEMORY_END; i++)
+card-suspend_mem[index++] = 
+m3_assp_read(card, MEMTYPE_INTERNAL_DATA, i);
+}
 
 DPRINTK(DPMOD, "powering down apci regs\n");
 m3_outw(card, 0x, 0x54);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: user space web server accelerator support

2001-03-23 Thread Zach Brown


 Zach, have you ever noticed such a performance bottleneck in your phhttpd?

yup, this is definitely something you don't want to be doing in the fast
path :)

 Any thoughts?

Sorry I don't remember the start of this thread, but I'll ask anyway;
have you looked at Ingo Molnar's Tux server?  Its state of the art unix
serving, implemented in the linux kernel:

http://people.redhat.com/mingo/TUX-patches/

-- 
 zach
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Introduce a handy list_first_entry macro

2007-04-17 Thread Zach Brown

On Tue, Apr 17, 2007 at 03:18:56PM +0400, Pavel Emelianov wrote:
 There are many places in the kernel where the construction like
 
foo = list_entry(head-next, struct foo_struct, list);
 
 are used. 
 The code might look more descriptive and neat if using the macro
 
list_first_entry(head, type, member) \
  list_entry((head)-next, type, member)

 - dquot = list_entry(dirty-next, struct dquot, dq_dirty);
 + dquot = list_first_entry(dirty, struct dquot, dq_dirty);

I think it's more than just descriptive and neat.  A common sneaky bug
is to accidentally pass list-next into list_entry() instead of
list-next. This is easy to do because one is used to typing list in
list_empty(), etc.  So by following -next in the macro we make the list
argument consistent in list_empty() and list_first_entry() and hopefully
reduce the risk of making that mistake.

I like it.

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] dio: invalidate clean pages before dio write

2007-03-09 Thread Zach Brown

dio: invalidate clean pages before dio write

This patch fixes a user-triggerable oops that was reported by Leonid Ananiev as
archived at http://lkml.org/lkml/2007/2/8/337.

dio writes invalidate clean pages that intersect the written region so that
subsequent buffered reads go to disk to read the new data.  If this fails the
interface tries to tell the caller that the cache is inconsistent by returning
EIO.

Before this patch we had the problem where this invalidation failure would
clobber -EIOCBQUEUED as it made its way from fs/direct-io.c to fs/aio.c.  Both
fs/aio.c and bio completion call aio_complete() and we reference freed memory,
usually oopsing.

This patch addresses this problem by invalidating before the write so that we
can cleanly return -EIO before -direct_IO() has had a chance to return
-EIOCBQUEUED.

There is a compromise here.  During the dio write we can fault in mmap()ed
pages which intersect the written range with get_user_pages() if the user
provided them for the source buffer.  This is a crazy thing to do, but we can
make it mostly work in most cases by trying the invalidation again.   The
compromise is that we won't return an error if this second invalidation fails
if it's an AIO write and we have -EIOCBQUEUED.

This was tested by having two processes race performing large O_DIRECT and
buffered ordered writes.  Within minutes ext3 would see a race between
ext3_releasepage() and jbd holding a reference on ordered data buffers and
would cause invalidation to fail, panicing the box.  The test can be found in
the 'aio_dio_bugs' test group in test.kernel.org/autotest.  After this patch
the test passes.

Signed-off-by: Zach Brown [EMAIL PROTECTED]
---

 mm/filemap.c |   48 
 1 file changed, 36 insertions(+), 12 deletions(-)

--- a/mm/filemap.c  Wed Feb 28 06:00:13 2007 +
+++ b/mm/filemap.c  Thu Mar 08 17:48:58 2007 -0800
@@ -2379,7 +2379,8 @@ generic_file_direct_IO(int rw, struct ki
struct file *file = iocb-ki_filp;
struct address_space *mapping = file-f_mapping;
ssize_t retval;
-   size_t write_len = 0;
+   size_t write_len;
+   pgoff_t end = 0; /* silence gcc */
 
/*
 * If it's a write, unmap all mmappings of the file up-front.  This
@@ -2388,23 +2389,46 @@ generic_file_direct_IO(int rw, struct ki
 */
if (rw == WRITE) {
write_len = iov_length(iov, nr_segs);
+   end = (offset + write_len - 1)  PAGE_CACHE_SHIFT;
if (mapping_mapped(mapping))
unmap_mapping_range(mapping, offset, write_len, 0);
}
 
retval = filemap_write_and_wait(mapping);
-   if (retval == 0) {
-   retval = mapping-a_ops-direct_IO(rw, iocb, iov,
-   offset, nr_segs);
-   if (rw == WRITE  mapping-nrpages) {
-   pgoff_t end = (offset + write_len - 1)
-PAGE_CACHE_SHIFT;
-   int err = invalidate_inode_pages2_range(mapping,
+   if (retval)
+   goto out;
+
+   /*
+* After a write we want buffered reads to be sure to go to disk to get
+* the new data.  We invalidate clean cached page from the region we're
+* about to write.  We do this *before* the write so that we can return
+* -EIO without clobbering -EIOCBQUEUED from -direct_IO().
+*/
+   if (rw == WRITE  mapping-nrpages) {
+   retval = invalidate_inode_pages2_range(mapping,
offset  PAGE_CACHE_SHIFT, end);
-   if (err)
-   retval = err;
-   }
-   }
+   if (retval)
+   goto out;
+   }
+
+   retval = mapping-a_ops-direct_IO(rw, iocb, iov, offset, nr_segs);
+   if (retval)
+   goto out;
+
+   /* 
+* Finally, try again to invalidate clean pages which might have been
+* faulted in by get_user_pages() if the source of the write was an
+* mmap()ed region of the file we're writing.  That's a pretty crazy
+* thing to do, so we don't support it 100%.  If this invalidation
+* fails and we have -EIOCBQUEUED we ignore the failure.
+*/
+   if (rw == WRITE  mapping-nrpages) {
+   int err = invalidate_inode_pages2_range(mapping,
+ offset  PAGE_CACHE_SHIFT, end);
+   if (err  retval = 0)
+   retval = err;
+   }
+out:
return retval;
 }
 
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 00/11] ANNOUNCE: Syslets, generic asynchronous system call support

2007-02-14 Thread Zach Brown



I'm finally back from my travel and conference hiatus.. you guys have  
been busy! :)


On Feb 13, 2007, at 6:20 AM, Ingo Molnar wrote:

I'm pleased to announce the first release of the Syslet kernel  
feature

and kernel subsystem, which provides generic asynchrous system call
support:

   http://redhat.com/~mingo/syslet-patches/


In general, I really like the look of this.

I think I'm convinced that your strong preference to do this with  
full kernel threads (1:1 task_struct - thread_info/stack  
relationship) is the right thing to do.  The fibrils fell on the side  
of risking bugs by sharing task_structs amongst stacks executing  
kernel paths.  This, correct me if I'm wrong, falls on the side of  
risking behavioural quirks stemming from task_struct references that  
we happen to have not enabled sharing of yet.


I have strong hopes that we won't actually *care* about the  
behavioural differences we get from having individual task structs  
(which share the important things!) between syscall handlers.  The  
*only* seemingly significant case I've managed to find, the IO  
scheduler priority and context fields, is easy enough to fix up.   
Jens and I have been talking about that.  It's been bugging him for  
other reasons.


So, thanks, nice work.  I'm going to focus on finding out if its  
feasible for The Database to use this instead of the current iocb  
mechanics.  I'm optimistic.



Syslets are small, simple, lightweight programs (consisting of
system-calls, 'atoms')


I will admit, though, that I'm not at all convinced that we need  
this.  Adding a system call for addition (just addition?  how far do  
we go?!) sure feels like a warning sign to me that we're heading down  
a slippery slope.  I would rather we started with an obviously  
minimal syscall which just takes an array of calls and args and  
executes them unconditionally.


But its existance doesn't stop the use case I care about.  So it's  
hard to get *too* worked up about it.



Comments, suggestions, reports are welcome!


For what it's worth, it looks like 'x86-optimized-copy_uatom.patch'  
got some hunks that should have been in 'x86-optimized- 
sys_umem_add.patch'.


- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 05/11] syslets: core code

2007-02-14 Thread Zach Brown

But the whole point is that the notion of a register is wrong in  
the

first place. [...]


forget about it then. The thing we register is dead-simple:

 struct async_head_user {
 struct syslet_uatom __user  **completion_ring;
 unsigned long   ring_size_bytes;
 unsigned long   max_nr_threads;
 };

this can be passed in to sys_async_exec() as a second pointer, and the
kernel can put the expected-completion pointer (and the user ring idx
pointer) into its struct atom. It's just a few instructions, and  
only in

the cachemiss case.

that would make completions arbitrarily split-up-able. No registration
whatsoever. A waiter could specify which ring's events it is  
interested

in. A 'ring' could be a single-entry thing as well, for a single
instance of pending IO.


I like this, too.  (Not surprisingly, having outlined something like  
it in a mail in one of the previous threads :)).


I'll bring up the POSIX AIO list IO case.  It wants to issue a  
group of IOs and sleep until they all return.  Being able to cheaply  
instantiate a ring implicitly with the submission of the IO calls in  
the list will make implementing this almost too easy.  It'd obviously  
just wait for that list's ring to drain.


I hope.  There might be complications around the edges (waiting for  
multiple list IOs to drain?), but it seems like this would be on the  
right track.


I might be alone in caring about having a less ridiculous POSIX AIO  
interface in glibc, though, I'll admit.  It seems like it'd be a  
pretty sad missed opportunity if we rolled a fantastic general AIO  
interface and left glibc to still screw around with it's own manual  
threading :/.


- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 05/14] syslets: core code

2007-02-15 Thread Zach Brown




+static void
+__mark_async_thread_ready(struct async_thread *at, struct  
async_head *ah)

+{
+   list_del(at-entry);
+   list_add_tail(at-entry, ah-ready_async_threads);


+__mark_async_thread_busy(struct async_thread *at, struct  
async_head *ah)

+{
+   list_del(at-entry);
+   list_add_tail(at-entry, ah-busy_async_threads);


list_move_tail()?

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] aio: fix kernel bug when page is temporally busy

2007-02-15 Thread Zach Brown


If invalidate_inode_pages2_range() will return EIOCBRETRY as the patch
aio: fix kernel bug when page is temporally busy


Sorry Leonid, this patch is not safe.

It returns -EIOCBRETRY without guaranteeing that kick_iocb() will be  
called.  This can lead to operations hanging, both AIO and calls that  
come through do_sync_{read,write}.


It overwrites -EIOCBQUEUED, leading to an aio_complete() while a  
retry is happening.  This can lead to reference count confusion.   
Double-frees, referencing freed memory, that kind of thing.  This  
isn't a new problem.  The current code that overwrites with -EIO has  
this problem.  But moving to -EIOCBRETRY does introduce new behaviour  
of aio_complete() and the retry path racing.


I'll have a candidate patch to address the problem of EIO being  
raised on the way back up from a path which has returned -EIOCBQUEUED.


- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 05/11] syslets: core code

2007-02-15 Thread Zach Brown

2) 	On the client facing side (port 53), I'd very much hope for a  
way to

do 'recvv' on datagram sockets, so I can retrieve a whole bunch of
UDP datagrams with only one kernel transition.


I want to highlight this point that Bert is making.

Whenever we talk about AIO and kernel threads some folks are rightly  
concerned that we're talking about a thread *per IO* and fear that  
memory consumption will be fatal.


Take the case of userspace which implements what we'd think of as  
page cache writeback.  (*coughs, points at email address*).  It wants  
to issue thousands of IOs to disjoint regions of a file.  Thousands  
of kernel threads, oh crap!


But it only issues each IO with a separate syscall (or io_submit()  
op) because it doesn't have an interface that lets it specify IOs  
that vector user memory addresses *and file position*.


If we had a seemingly obvious interface that let it kick off batched  
IOs to different parts of the file, the looming disaster of a thread  
per IO vanishes in that case.


struct off_vec {
off_t pos;
size_t len;
};

long sys_sgwrite(int fd, struct iovec *memvec, size_t mv_count,
struct off_vec *ovec, size_t ov_count);

It doesn't take long to imagine other uses for this that are less  
exotic.


Take e2fsck and its iterating through indirect blocks or directory  
data blocks.  It has a list of disjoint file regions (blocks) it  
wants to read, but it does them serially to keep the code from  
getting even more confusing.  blktrace a clean e2fsck -f some time..  
it's leaving *HALF* of the disk read bandwith on the table by  
performing serial block-sized reads.  If it could specify batches of  
them the code would still be simple but it could tell the kernel and  
IO scheduler *exactly* what it wants, without having to mess around  
with sys_readahead() or AIO or any of that junk :).


Anyway, that's just something that's been on my mind.  If there are  
obvious clean opportunities to get more done with single syscalls, it  
might not be such a bad thing.


- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] aio: fix kernel bug when page is temporally busy

2007-02-15 Thread Zach Brown



On Feb 15, 2007, at 11:11 AM, Ananiev, Leonid I wrote:


It returns -EIOCBRETRY without guaranteeing that kick_iocb() will be
called.  This can lead to operations hanging


If EIOCBRETRY then generic_file_aio_write() will be recalled for the
same iocb.


Only if kick_iocb() is called.  It won't be called if i_i_p2_r() was  
the only thing to return -EIOCBRETRY.





It overwrites -EIOCBQUEUED, leading to an aio_complete() while a
retry is happening.


EIOCBQUEUED or EIOCBRETRY does not lead to aio_complete() call:


Not by fs/aio.c, but *by the place that originated -EIOCBQUEUED*.   
Later.  After IO has completed.  see fs/direct-io.c:dio_bio_end_aio().


This is what -EIOCBQUEUED means!  It's a promise to call aio_complete 
() in the future.


Have you read the giant comment over the definition of struct kiocb  
in include/linux/aio.h?



This can lead to reference count confusion.

But just reference count confusion was deleted by patch. Isn't it?


Sorry, I don't understand what you're trying to ask here.

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] aio: fix kernel bug when page is temporally busy

2007-02-15 Thread Zach Brown



On Feb 15, 2007, at 3:32 PM, Ananiev, Leonid I wrote:


If EIOCBRETRY then generic_file_aio_write() will be recalled for the
same iocb.

Only if kick_iocb() is called.  It won't be called if i_i_p2_r() was
the only thing to return -EIOCBRETRY.

It is not need to call kick_iocb()
for generic_file_aio_write() calling.
It is recalled without any wakeup waiting:
for (;;) {
ret = filp-f_op-aio_write(kiocb, iov, 1,
kiocb.ki_pos);
if (ret != -EIOCBRETRY)
break;
wait_on_retry_sync_kiocb(kiocb);
}
Note: wait_on_retry_sync_kiocb() does not wait.


Yes it does.  It will not return until kiocbSetKicked() is called,  
and that is only called from kick_iocb().



It overwrites -EIOCBQUEUED

Do you mean that there is one more kernel bug which
overwrites -EIOCBQUEUED by any errno or number of bytes and this
new value is returned to caller as an IO result
while IO is not finished yet.

The proposed patch does not crate this bug if any.


Right, and I said that in the mail you're quoting.


It actually fixes a kernel panic bag when iocb.users count becomes
incorrect. The bag  Kernel BUG at fs/aio.c:509 is there because
aio_run_iocb() have not a chance to differ real EIO and
EIO which is actually means EAGAYN or EIOCBRETRY.


Yes, I understand the bug you're trying to fix.  You're introducing  
other bugs with the patch.  It will not be merged.



It is interesting that I've not seen any EIOCBQUEUED returned
to aio_run_iocb() during 5 hours aiostress running.


What arguments are you running aio-stress with?  -EIOCBQUEUED is only  
used for O_DIRECT, and then only in certain circumstances.


- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] aio: propogate post-EIOCBQUEUED errors to completion event

2007-02-19 Thread Zach Brown

aio: propogate post-EIOCBQUEUED errors to completion event

This addresses an oops reported by Leonid Ananiev [EMAIL PROTECTED]
as archived at http://lkml.org/lkml/2007/2/8/337.

O_DIRECT kicks off bios and returns -EIOCBQUEUED to indicate its intention to
call aio_complete() once the bios complete.   As we return from submission we
must preserve the -EIOCBQUEUED return code so that fs/aio.c knows to let the
bio completion call aio_complete().  This stops us from returning errors after
O_DIRECT submission.

But we have a few places that are very interested in generating errors after
bio submission.

The most critical of these is invalidating the page cache after a write.  This
avoids exposing stale data to buffered operations that are performed after the
O_DIRECT write succeeds.  We must do this after submission because the user
buffer might have been an mmap()ed buffer of the region being written to.  The
get_user_pages() in the O_DIRECT completion path could have faulted in stale
data.

So this patch introduces a helper, aio_propogate_error(), which queues
post-submission errors in the iocb so that they are given to the user
completion event when aio_complete() is finally called.

To get this working we change the aio_complete() path so that the ring
insertion is performed as the final iocb reference is dropped.  This gives the
submission path time to queue its pending error before it drops its reference.
This increases the space in the iocb as it has to record the two result codes
from aio_complete() and the pending error from the submission path.

This was tested by running O_DIRECT aio-stress concurrently with buffered reads
while triggering EIO in invalidate_inode_pages2_range() with the help of a
debugfs bool hack.  Previously the kernel would oops as fs/aio.c and bio
completion both called aio_complete().  With this patch aio-stress sees -EIO.

Signed-off-by: Zach Brown [EMAIL PROTECTED]
---

 fs/aio.c|  109 ++
 include/linux/aio.h |   30 +++
 mm/filemap.c|4 -
 3 files changed, 91 insertions(+), 52 deletions(-)

--- a/fs/aio.c  Mon Feb 19 11:52:27 2007 -0800
+++ b/fs/aio.c  Mon Feb 19 12:32:52 2007 -0800
@@ -192,6 +192,46 @@ static int aio_setup_ring(struct kioctx 
(void)__event;  \
kunmap_atomic((void *)((unsigned long)__event  PAGE_MASK), km); \
 } while(0)
+
+static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb *iocb)
+{
+   struct aio_ring_info*info;
+   struct aio_ring *ring;
+   struct io_event *event;
+   unsigned long   tail;   
+
+   assert_spin_locked(ctx-ctx_lock);
+   
+   info = ctx-ring_info;
+   ring = kmap_atomic(info-ring_pages[0], KM_IRQ1);
+
+   tail = info-tail;
+   event = aio_ring_event(info, tail, KM_IRQ0);
+   if (++tail = info-nr)
+   tail = 0;
+
+   event-obj = (u64)(unsigned long)iocb-ki_obj.user;
+   event-data = iocb-ki_user_data;
+   event-res = iocb-ki_pending_err ? iocb-ki_pending_err : iocb-ki_res;
+   event-res2 = iocb-ki_res2;
+
+   dprintk(aio_complete: %p[%lu]: %p: %p %Lx %d %lx %lx\n,
+   ctx, tail, iocb, iocb-ki_obj.user, iocb-ki_user_data,
+   iocb-ki_pending_err, iocb-ki_res, iocb-ki_res2);
+
+   /* after flagging the request as done, we
+* must never even look at it again
+*/
+   smp_wmb();  /* make event visible before updating tail */
+
+   info-tail = tail;
+   ring-tail = tail;
+
+   put_aio_ring_event(event, KM_IRQ0);
+   kunmap_atomic(ring, KM_IRQ1);
+
+   pr_debug(added to ring %p at [%lu]\n, iocb, tail);
+}
 
 /* ioctx_alloc
  * Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
@@ -418,6 +458,7 @@ static struct kiocb fastcall *__aio_get_
req-ki_cancel = NULL;
req-ki_retry = NULL;
req-ki_dtor = NULL;
+   req-ki_pending_err = 0;
req-private = NULL;
req-ki_iovec = NULL;
INIT_LIST_HEAD(req-ki_run_list);
@@ -507,10 +548,14 @@ static int __aio_put_req(struct kioctx *
 
assert_spin_locked(ctx-ctx_lock);
 
-   req-ki_users --;
+   req-ki_users--;
BUG_ON(req-ki_users  0);
if (likely(req-ki_users))
return 0;
+
+   if (kiocbIsInserted(req))
+   aio_ring_insert_entry(ctx, req);
+
list_del(req-ki_list);/* remove from active_reqs */
req-ki_cancel = NULL;
req-ki_retry = NULL;
@@ -924,12 +969,8 @@ int fastcall aio_complete(struct kiocb *
 int fastcall aio_complete(struct kiocb *iocb, long res, long res2)
 {
struct kioctx   *ctx = iocb-ki_ctx;
-   struct aio_ring_info*info;
-   struct aio_ring *ring;
-   struct io_event *event;
+   int ret;
unsigned long   flags;
-   unsigned long   tail;
-   int ret

Re: [PATCH] aio: propogate post-EIOCBQUEUED errors to completion event

2007-02-19 Thread Zach Brown




  I
*think* the patch is right, but picking the changes to the code and  
watching

its movement at the same time is making my head spin.


Really?  The only things that changed are the assignment from iocb- 
res (after testing pending_err) instead of the 'res' argument and  
the dprintk.


But, uh, sure.  If it's a burden I can respin it as two patches.   
Here they come!


- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 1/2] aio: create aio_ring_insert_entry() function

2007-02-19 Thread Zach Brown

aio: create aio_ring_insert_entry() function

This patch just hoists the process of inserting an entry into the completion
ring into its own function.  No functionality is changed.  This is in
preparation for calling ring insertion from another path with slightly
different mechanics.

diff -r b551e6cbf434 fs/aio.c
Signed-off-by: Zach Brown [EMAIL PROTECTED]
---

 fs/aio.c |   77 +
 1 file changed, 43 insertions(+), 34 deletions(-)

--- a/fs/aio.c  Mon Feb 19 13:09:01 2007 -0800
+++ b/fs/aio.c  Mon Feb 19 13:12:04 2007 -0800
@@ -192,6 +192,47 @@ static int aio_setup_ring(struct kioctx 
(void)__event;  \
kunmap_atomic((void *)((unsigned long)__event  PAGE_MASK), km); \
 } while(0)
+
+static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb *iocb,
+ long res, long res2)
+{
+   struct aio_ring_info*info;
+   struct aio_ring *ring;
+   struct io_event *event;
+   unsigned long   tail;   
+
+   assert_spin_locked(ctx-ctx_lock);
+
+   info = ctx-ring_info;
+   ring = kmap_atomic(info-ring_pages[0], KM_IRQ1);
+
+   tail = info-tail;
+   event = aio_ring_event(info, tail, KM_IRQ0);
+   if (++tail = info-nr)
+   tail = 0;
+
+   event-obj = (u64)(unsigned long)iocb-ki_obj.user;
+   event-data = iocb-ki_user_data;
+   event-res = res;
+   event-res2 = res2;
+
+   dprintk(aio_complete: %p[%lu]: %p: %p %Lx %lx %lx\n,
+   ctx, tail, iocb, iocb-ki_obj.user, iocb-ki_user_data,
+   res, res2);
+
+   /* after flagging the request as done, we
+* must never even look at it again
+*/
+   smp_wmb();  /* make event visible before updating tail */
+
+   info-tail = tail;
+   ring-tail = tail;
+
+   put_aio_ring_event(event, KM_IRQ0);
+   kunmap_atomic(ring, KM_IRQ1);
+
+   pr_debug(added to ring %p at [%lu]\n, iocb, tail);
+}
 
 /* ioctx_alloc
  * Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
@@ -924,11 +965,7 @@ int fastcall aio_complete(struct kiocb *
 int fastcall aio_complete(struct kiocb *iocb, long res, long res2)
 {
struct kioctx   *ctx = iocb-ki_ctx;
-   struct aio_ring_info*info;
-   struct aio_ring *ring;
-   struct io_event *event;
unsigned long   flags;
-   unsigned long   tail;
int ret;
 
/*
@@ -946,8 +983,6 @@ int fastcall aio_complete(struct kiocb *
return 1;
}
 
-   info = ctx-ring_info;
-
/* add a completion event to the ring buffer.
 * must be done holding ctx-ctx_lock to prevent
 * other code from messing with the tail
@@ -966,34 +1001,8 @@ int fastcall aio_complete(struct kiocb *
if (kiocbIsCancelled(iocb))
goto put_rq;
 
-   ring = kmap_atomic(info-ring_pages[0], KM_IRQ1);
-
-   tail = info-tail;
-   event = aio_ring_event(info, tail, KM_IRQ0);
-   if (++tail = info-nr)
-   tail = 0;
-
-   event-obj = (u64)(unsigned long)iocb-ki_obj.user;
-   event-data = iocb-ki_user_data;
-   event-res = res;
-   event-res2 = res2;
-
-   dprintk(aio_complete: %p[%lu]: %p: %p %Lx %lx %lx\n,
-   ctx, tail, iocb, iocb-ki_obj.user, iocb-ki_user_data,
-   res, res2);
-
-   /* after flagging the request as done, we
-* must never even look at it again
-*/
-   smp_wmb();  /* make event visible before updating tail */
-
-   info-tail = tail;
-   ring-tail = tail;
-
-   put_aio_ring_event(event, KM_IRQ0);
-   kunmap_atomic(ring, KM_IRQ1);
-
-   pr_debug(added to ring %p at [%lu]\n, iocb, tail);
+   aio_ring_insert_entry(ctx, iocb, res, res2);
+
 put_rq:
/* everything turned out well, dispose of the aiocb. */
ret = __aio_put_req(ctx, iocb);
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event

2007-02-19 Thread Zach Brown

aio: propogate post-EIOCBQUEUED errors to completion event

This addresses an oops reported by Leonid Ananiev [EMAIL PROTECTED]
as archived at http://lkml.org/lkml/2007/2/8/337.

O_DIRECT kicks off bios and returns -EIOCBQUEUED to indicate its intention to
call aio_complete() once the bios complete.   As we return from submission we
must preserve the -EIOCBQUEUED return code so that fs/aio.c knows to let the
bio completion call aio_complete().  This stops us from returning errors after
O_DIRECT submission.

But we have a few places that are very interested in generating errors after
bio submission.

The most critical of these is invalidating the page cache after a write.  This
avoids exposing stale data to buffered operations that are performed after the
O_DIRECT write succeeds.  We must do this after submission because the user
buffer might have been an mmap()ed buffer of the region being written to.  The
get_user_pages() in the O_DIRECT completion path could have faulted in stale
data.

So this patch introduces a helper, aio_propogate_error(), which queues
post-submission errors in the iocb so that they are given to the user
completion event when aio_complete() is finally called.

To get this working we change the aio_complete() path so that the ring
insertion is performed as the final iocb reference is dropped.  This gives the
submission path time to queue its pending error before it drops its reference.
This increases the space in the iocb as it has to record the two result codes
from aio_complete() and the pending error from the submission path.

This was tested by running O_DIRECT aio-stress concurrently with buffered reads
while triggering EIO in invalidate_inode_pages2_range() with the help of a
debugfs bool hack.  Previously the kernel would oops as fs/aio.c and bio
completion both called aio_complete().  With this patch aio-stress sees -EIO.

Signed-off-by: Zach Brown [EMAIL PROTECTED]
---

 fs/aio.c|   49 +-
 include/linux/aio.h |   30 +
 mm/filemap.c|4 +--
 3 files changed, 57 insertions(+), 26 deletions(-)

--- a/fs/aio.c  Mon Feb 19 13:12:20 2007 -0800
+++ b/fs/aio.c  Mon Feb 19 13:16:00 2007 -0800
@@ -193,8 +193,7 @@ static int aio_setup_ring(struct kioctx 
kunmap_atomic((void *)((unsigned long)__event  PAGE_MASK), km); \
 } while(0)
 
-static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb *iocb,
- long res, long res2)
+static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb *iocb)
 {
struct aio_ring_info*info;
struct aio_ring *ring;
@@ -213,12 +212,12 @@ static void aio_ring_insert_entry(struct
 
event-obj = (u64)(unsigned long)iocb-ki_obj.user;
event-data = iocb-ki_user_data;
-   event-res = res;
-   event-res2 = res2;
-
-   dprintk(aio_complete: %p[%lu]: %p: %p %Lx %lx %lx\n,
+   event-res = iocb-ki_pending_err ? iocb-ki_pending_err : iocb-ki_res;
+   event-res2 = iocb-ki_res2;
+
+   dprintk(aio_complete: %p[%lu]: %p: %p %Lx %d %lx %lx\n,
ctx, tail, iocb, iocb-ki_obj.user, iocb-ki_user_data,
-   res, res2);
+   iocb-ki_pending_err, iocb-ki_res, iocb-ki_res2);
 
/* after flagging the request as done, we
 * must never even look at it again
@@ -459,6 +458,7 @@ static struct kiocb fastcall *__aio_get_
req-ki_cancel = NULL;
req-ki_retry = NULL;
req-ki_dtor = NULL;
+   req-ki_pending_err = 0;
req-private = NULL;
req-ki_iovec = NULL;
INIT_LIST_HEAD(req-ki_run_list);
@@ -548,10 +548,14 @@ static int __aio_put_req(struct kioctx *
 
assert_spin_locked(ctx-ctx_lock);
 
-   req-ki_users --;
+   req-ki_users--;
BUG_ON(req-ki_users  0);
if (likely(req-ki_users))
return 0;
+
+   if (kiocbIsInserted(req))
+   aio_ring_insert_entry(ctx, req);
+
list_del(req-ki_list);/* remove from active_reqs */
req-ki_cancel = NULL;
req-ki_retry = NULL;
@@ -983,27 +987,24 @@ int fastcall aio_complete(struct kiocb *
return 1;
}
 
-   /* add a completion event to the ring buffer.
-* must be done holding ctx-ctx_lock to prevent
-* other code from messing with the tail
-* pointer since we might be called from irq
-* context.
-*/
+   /*
+* We queue up the completion codes into the iocb.  They are combined
+* with a potential error from the submission path and inserted into
+* the ring once the last reference to the iocb is dropped.  Cancelled
+* iocbs don't insert events on completion because userland was given
+* an event directly as part of the cancelation interface.
+*/
spin_lock_irqsave(ctx-ctx_lock, flags);
 
if (iocb-ki_run_list.prev  !list_empty(iocb

Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event

2007-02-19 Thread Zach Brown




Zach So this patch introduces a helper, aio_propogate_error(),

...which is spelled incorrectly:  aio_propagate_error.


Man, I am batting 1000!  Randy also made fun of my 'intead'.


Zach which queues post-submission errors in the iocb so that they are
Zach given to the user completion event when aio_complete() is
Zach finally called.

Ugly, but I can't think of a better way to do it, either.


Yeah, this seemed to be the lesser of the available evils.  We (Chris  
and I, while in california) considered introducing a primitive to  
have the submission path wait for aio_complete() to be called so that  
it could just return the error.  We also thought about turning EIOCB 
{RETRY,QUEUED} into bits on the iocb instead of return codes that we  
have to lovingly pass back up to fs/aio.c.


This seemed to be the least intrusive :/.

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] aio: propogate post-EIOCBQUEUED errors to completion event

2007-02-19 Thread Zach Brown


I would argue that one common cause of the EIO is userland
error (mmap concurrent with O_DIRECT), and EIO is the correct answer.


I disagree.  That means that using the pagecache to synchronize  
things like
the proposed online defragmentation will occasionally make O_DIRECT  
users

fail.


Well, only if the pages couldn't be invalidated.  Arguably ext3  
should be trying a lot harder to invalidate before giving up and  
returning failure from invalidate_page().


Hopefully this all goes away in the long term with Chris'  
placeholder work that stops insertions into the radix tree while  
dio ops are in flight.


- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] convert aio event reap to use atomic-op instead of spin_lock

2007-04-11 Thread Zach Brown

First, I'll NAK this and all AIO patches until the patch description
says that it's been run through the regression tests that we've started
collecting in autotest.  They're trivial to run, never fear:

 cd /usr/local
 svn checkout svn://test.kernel.org/autotest/trunk/client autotest
 cd autotest
 bin/autotest tests/aio_dio_bugs/control

(We'll be moving this to LTP soon, I think, but that will just change
the commands to cut and paste.) 

 Resurrect an old patch that uses atomic operation to update ring buffer
 index on AIO event queue.  This work allows futher application/libaio
 optimization to run fast path io_getevents in user space.

I have mixed feelings.  I think the userspace getevents support was
poorly designed and the simple fact that we've gone this long without it
says just how desperately the feature isn't needed.

At the same time, it's a small change and the code is already there.  So
it's hard to get too worked up about it.  We can always remove support
in the future by never changing the indexes if we require that apps fall
back to sys_io_getevents().

In any case, let's see a test which exercises userspace event collection
before we claim to support it.  I'd prefer a little stand-alone C app
like the ones in autotest for simple unit testing.  A fio patch to
stress test it would probably also be well received.

 I've also added one more change on top of old implementation that rounds
 ring buffer size to power of 2 to allow fast head/tail calculation. With
 the new scheme, there is no more modulo operation on them and we simply
 increment either pointer index directly.  

Please quantify the improvement.  We're allocating more ring elements
than userspace asked for and than were accounted for in aio_max_nr.
That cost, however teeny, will be measured against the benefit.

 - /* Compensate for the ring buffer's head/tail overlap entry */
 - nr_events += 2; /* 1 is required, 2 for good luck */
 + /* round nr_event to next power of 2 */
 + nr_events = roundup_pow_of_two(nr_events);

Is that buggy?  How will the code tell if head == tail means a full ring
or an empty ring?  The old code added that extra event to let it tell
the ring was full before head == tail and it would think it's empty
again, I think.  I'm guessing roundup(nr_events + 1) is needed.

 -#define aio_ring_event(info, nr, km) ({  
 \
 - unsigned pos = (nr) + AIO_EVENTS_OFFSET;\
 +#define aio_ring_event(info, __nr, km) ({\
 + unsigned pos = ((__nr)  ((info)-nr - 1)) + AIO_EVENTS_OFFSET; \
   struct io_event *__event;   \
   __event = kmap_atomic(  \
   (info)-ring_pages[pos / AIO_EVENTS_PER_PAGE], km); \

(info) is now evaluated twice.  __info = (info), please, just like
__event there.

   put_aio_ring_event(event, KM_IRQ0);
   kunmap_atomic(ring, KM_IRQ1);

The ring page, which is mmaped to userspace at some weird address, was
just written through a kmap.  Do we need a flush_dcache_page()?  This
isn't this patch's problem, but it'd need to be addressed if we're using
the ring from userspace.

 -
 - struct io_event io_events[0];
 + struct io_event io_events[0];

Try to minimize gratuitous reformatting, please.

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] convert aio event reap to use atomic-op instead of spin_lock

2007-04-11 Thread Zach Brown

 Sorry I wasn't thorough enough.  And partially because I was worried
 about changing structure type for user space facing struct aio_ring.
 Now that I looked through all arches, it looks safe as all arch's
 atomic_t has the same size as int.

 Here is the updated patch.

 @@ -144,7 +144,7 @@ struct kiocb {
  struct aio_ring {
   unsignedid; /* kernel internal index number */
   unsignednr; /* number of io_events */
 - unsignedhead;
 + atomic_thead;
   unsignedtail;

Embedding an atomic_t in an ABI struct?  That makes everyone else
nervous too, right?

It may look safe on i386/x86-64 today, but this doesn't seem like wise
practice.  Is there any reason to believe that atomic_t will never
change size?  Does anything else do this already?

If nothing else, the unsigned (should be __u32, sigh) could be cast to
an atomic_t.

Is being able to do atomic work on a u32 between the kernel and
userspace something that all archs have support for?  I mean, take the
fact that userspace and the kernel could both be doing these atomic ops
on different virtual addresses and so conceivably different cachelines.
Is that a problem for anyone?

I do find myself wondering if the notion of userspace ring
synchronization shouldn't be built around futexes.  They weren't around
when this mmap()ed ring business was created.

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] convert aio event reap to use atomic-op instead of spin_lock

2007-04-11 Thread Zach Brown

 Ken uses the other (superior!) way of implementing ringbuffers: the head
 and tail pointers (the naming of which AIO appears to have reversed) are
 not constrained to the ringsize - they are simply allowed to wrap through
 0xfff.

A-ha!  That sure sounds great.

I'd be happy to see the kernel side fs/aio.c ring management move in
that direction, sure, with a specific patch to do that.

I get less excited when we talk about changing the index semantics as
part of supporting serialization of the ring with userspace.

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] convert aio event reap to use atomic-op instead of spin_lock

2007-04-11 Thread Zach Brown

 I have mixed feelings.  I think the userspace getevents support was
 poorly designed and the simple fact that we've gone this long without it
 says just how desperately the feature isn't needed.
 
 I kept on getting requests from application developers who want that
 feature.  My initial patch was dated back May 2004.

My point is that demand for it doesn't seem to be very high, and your
use of the past tense there seems to support that.  And the patches to
support it are not looking great.

Is it worth it, then?  What difference are you measuring so far?

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] convert aio event reap to use atomic-op instead of spin_lock

2007-04-11 Thread Zach Brown

  I kept on getting requests from application developers who want that
  feature.  My initial patch was dated back May 2004.
 
 The right way to do it involves synchronization between the kernel side 
 io_getevents() and the userspace code pulling events out of the ring.  
 Alan Cox suggested embedding a futex in the shared memory region, but I 
 don't think anyone ever implemented that.

Yeah, I like the idea of futexes.

I'm worried that virtual aliasing spells doom for the current
home-brewed serialization that fs/aio.c is doing with the shared ring
head/tail accesses.  Am I worrying about nothing here?

  I will look into this aside from this patch.
 
 That's probably the case.  Also, any changes in this area *must* correctly 
 update the compat/incompat feature flags in the ring buffer header.  That 
 has been missed in the past...

Do you know of anyone using the current ring info ABI?

The *only* user I know of is the check of ctx-magic in libaio.

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event

2007-02-20 Thread Zach Brown



On Feb 20, 2007, at 9:33 AM, Ananiev, Leonid I wrote:


There is change in the patch which is uncommented in preface.
Now aio_ring_insert_entry() is not called if req-ki_users=1.
Before was called.
Could you comment it?


It is described in the patch description, though perhaps not  
explicitly enough:


To get this working we change the aio_complete() path so that the ring
insertion is performed as the final iocb reference is dropped.

This stops an aio_complete() from generating a completion event  
before the submission path has had a chance to generate an error.   
This could happen if O_DIRECT bios completed very quickly before the  
submission path had a chance to return up and call  
ivalidate_inode_pages2_range().


- z


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] aio: propogate post-EIOCBQUEUED errors to completion event

2007-02-20 Thread Zach Brown

There are some strange O_DIRECT corner cases in here such that the  
'last
writer' may actually be a 'last reader' and winning can mean have  
a copy

of the page in page cache older than the copy on disk.


As long as it is marked dirty so that it eventually gets synced to  
disk,

it shouldn't matter.


No, Chris is pointing out that an an O_DIRECT write can leave clean  
read pages in the page cache.


All it takes is giving a source buffer for the write which is an mmap 
()ed apeture of the region that is being written to.  If you get the  
offsets right you can get the get_user_pages() down in fs/direct-io.c  
will populate the page cache before the actual O_DIRECT write gets to  
it.


- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event

2007-02-21 Thread Zach Brown



On Feb 21, 2007, at 12:35 AM, Ken Chen wrote:


On 2/20/07, Ananiev, Leonid [EMAIL PROTECTED] wrote:

1) mem=1G in kernel boot param if you have more
2) unmount; mk2fs; mount
3) dd if=/dev/zero of=test_file bs=1M count=1200
4) aiostress -s 1200m -O -o 2 -i 1 -r 16k test_file
5) if i++50 goto 2).


Would you please instrument the call chain of
invalidate_complete_page2() and tell us exactly where it returns zero
value in your failure case?

  invalidate_complete_page2
 try_to_release_page
ext3_releasepage
   journal_try_to_free_buffers
  ???


For what it's worth, Badari has explained this race in the past in a  
credible way.  I'll take the liberty of pasting a mail from him:



kjournald submited buffers for IO and waiting for them to finish.
Note that it has a ref. against the buffer.

journal_commit_transaction()
...
submited buffers for IO
/* Waiting for IO to complete */
while (commit_transaction-t_locked_list) {
...
get_bh(bh);
if (buffer_locked(bh)) {
spin_unlock(journal-j_list_lock);
wait_on_buffer(bh);  
spin_lock(journal-j_list_lock);
}

..
put_bh(bh);
}

Now, DIO process comes to frees the jh through  
journal_try_to_free_buffers()
but fails to drop_buffers() since kjournald() has a reference against  
it.

invalidate_inode_pages2_range()
..
ext3_releasepage()
journal_try_to_free_buffers()
journal_put_journal_head()
__journal_try_to_free_buffer()
--- freed jh

try_to_free_buffers()
drop_buffers()
if (buffer_busy(bh))
goto failed;
  --- returns EIO due to  
b_count




I don't mean to say that we shouldn't get traces to confirm the  
theory, just sharing.  And now we can point to this in the archives  
next time :).


- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event

2007-02-21 Thread Zach Brown

This is an interesting trick, but I'd like to consider hard whether  
the added
complexity is worth it. Could you list the various other cases you  
have in mind

which would want to use it ?


I'm happy to report that the sync case and the  
invalidate_inode_pages2_range() case are the only two which try to  
rewrite 'ret'.  I audited all the call paths between aio_{read,write}  
and -EIOCBQUUEUED.


So, sure, maybe we can shuffle the house of cards a little in the  
direction away from having a fs/aio.c helper for the situation by  
simply removing the few current instances of the problem.  It won't  
scale as people add problems without knowing about the rule to not  
overwrite -EIOCBQUEUED, but hopefully that won't be a rule for much  
longer :) :).


For the O_SYNC case we could add another magical test to the is_async  
assignment which won't perform AIO on O_SYNC descriptors.



For the invalidate_inode_pages2_range() case, I wonder if the right
place to issue this is after the direct IO write has completed and  
before
aio_complete() is issued (somewhat like the way we do  
bio_check_pages_dirty
for DIO reads), rather than after submission when the IO may still  
not have
hit the disk. This would also make the behaviour uniform for  
synchronous and

async cases.


Hmm, I think I like that.  It solves the problem for the current sole  
user of -EIOCBQUEUED without too much disruption.


BTW, am I right in interpreting that with your change aio_complete 
() may
trigger an io_getevents() wakeup, before the corresponding event is  
placed

on the ring buffer ?


Hmm, yeah, it looks like I goofed that.

I'll roll a patch which does the invalidation down in fs/direct-io.c.

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 00/13] Syslets, Threadlets, generic AIO support, v3

2007-02-22 Thread Zach Brown


Plus there's the fundamental killer that KAIO is a /lot/ harder to
implement (and to maintain) on the kernel side: it has to be  
implemented
for every IO discipline, and even for the IO disciplines it  
supports at

the moment, it is not truly asynchronous for things like metadata
blocking or VFS blocking. To handle things like metadata blocking  
it has

to resort to non-statemachine techniques like retries - which are bad
for performance.


Yes, yes, yes.

As one of the poor suckers who has been fixing bugs in fs/aio.c and  
fs/direct-io.c, I really want everyone to read Ingo's paragraph a few  
times.  Have it printed on a t-shirt.


Look at the number of man-years that have gone into fs/aio.c and fs/ 
direct-io.c.  After all that effort it *barely* supports non-blocking  
O_DIRECT IO.


The maintenance overhead of those two files, above all else, is what  
pushed me to finally try that nutty fibril attempt.


Syslets/threadlets on the other hand, once the core is implemented,  
have
near zero ongoing maintainance cost (compared to KAIO pushed into  
every
IO subsystem) and cover all IO disciplines and API variants  
immediately,

and they are as perfectly asynchronous as it gets.


Amen.

As an experiment, I'm working on backing the sys_io_*() calls with  
syslets.  It's looking very promising so far.


So all in one, i used to think that AIO state-machines have a long- 
term

place within the kernel, but with syslets i think i've proven myself
embarrasingly wrong =B-)


Welcome to the party :).

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 00/13] Syslets, Threadlets, generic AIO support, v3

2007-02-22 Thread Zach Brown


The more I think about it, a reasonable solution might actually be to
use threadlets for disk I/O and pure event based processing for
networking.  It is two different handling paths and non-unified,
but that might be the price for good performance :-)


I generally agree, with some comments.

If we come to the decision that there are some message rates that are  
better suited to delivery into a user-read ring (10gige rx to kevent,  
say) then it doesn't seem like it would be much of a stretch to add a  
facility where syslet completion could be funneled into that channel  
as well.


I also wonder if there isn't some opportunity to cut down the number  
of syscalls / op in networking land.  Is it madness to think of a  
call like recvmsgv() which could provide a vector of msghdrs?  It  
might not make sense, but it might cut down on the per-op overhead  
for loads that know they're going to be heavy enough to get a decent  
amount of batching without fatally harming latency.  Maybe those  
loads are rare..


- z 
-

To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 00/13] Syslets, Threadlets, generic AIO support, v3

2007-02-22 Thread Zach Brown


direct-io.c is evil.  Ridiculously.


You will have a hard time finding someone to defend it, I predict :).

There is good news on that front, too.  Chris (Mason) is making  
progress on getting rid of the worst of the Magical Locking that  
makes buffered races with O_DIRECT ops so awful.


I'm not holding my breath for a page cache so fine grained that it  
could pin and reference 512B granular user regions and build bios  
from them, though that sure would be nice :).



As an experiment, I'm working on backing the sys_io_*() calls with
syslets.  It's looking very promising so far.


Great, I'd love to see the comparisons.


I'm out for data.  If it sucks, well, we'll know just how much.  I'm  
pretty hopeful that it won't :).



One other implementation to consider is actually using kernel threads
compared to how syslets perform.  Direct IO for one always blocks, so
there shouldn't be much of a performance difference compared to  
syslets,

with the bonus that no arch specific code is needed.


Yeah, I'm starting with raw kernel threads so we can get some numbers  
before moving to syslets.


One of the benefits syslets bring is the ability to start processing  
the next pending request the moment current request processing  
blocks.  In the concurrent O_DIRECT write case that avoids releasing  
a ton of kernel threads which all just run to serialize on i_mutex  
(potentially bouncing it around cache domains) as the O_DIRECT ops  
are built and sent.


-z 
-

To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [rfc][patch] dynamic resizing dentry hash using RCU

2007-02-23 Thread Zach Brown



On Feb 23, 2007, at 7:37 AM, Nick Piggin wrote:



The dentry hash uses up 8MB for 1 million entries on my 4GB system  
is one
of the biggest wasters of memory for me. Because I rarely have more  
than one or
two hundred thousand dentries. And that's with several kernel trees  
worth of
entries. Most desktop and probably even many types of servers will  
only use a

fraction of that.

So I introduce a new method for resizing hash tables with RCU, and  
apply

that to the dentry hash.


Can you compare what you've done to the design that Paul and David  
talked about a year ago?


  http://lkml.org/lkml/2006/1/30/74

I'd love to see a generic implementation of RCU hashing that  
subsystems can then take advantage of.  It's long been on the fun  
side of my todo list.  The side I never get to :/.


- z

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/3] aio: fix oops because of extra IO control block freeing.

2007-03-07 Thread Zach Brown


On Mar 7, 2007, at 2:14 PM, Andrew Morton wrote:


On Mon, 05 Mar 2007 17:23:33 +0300
Leonid Ananiev [EMAIL PROTECTED] wrote:


 From Leonid Ananiev

The patch fixes oops because of extra IO control block freeing.
IO is retried if page could not be invalidated.


This patch is incorrect and shouldn't be merged.



Signed-off-by: Leonid Ananiev [EMAIL PROTECTED]

The patch fixes oops Kernel BUG at fs/aio.c:509 archived at
http://lkml.org/lkml/2007/2/8/337
The number of IO control block (iocb)users  0.
If page could not be invalidated by invalidate_inode_pages2_range()
than EIO is returned. It happens if journal_try_to_free_buffers()  
fails

to drop_buffers().
This EIO is not differing from real IO competition with EIO and
aio_complete() is called.


(I'm going to read this as This EIO is misinterpreted as real IO  
completion with -EIO and aio_complete() is called.)



Later aio_complete() is called from dio_bio_end_aio() and frees iocb
once more.


This analysis is correct.  Nothing can clobber -EIOCBQUEUED as it is  
returned up from fs/direct-io.c to fs/aio.c.  This -EIO from  
invalidation is one of the two places that currently break this rule.


The fix I had hoped for, invalidating down in fs/direct-io.c before  
returning -EIOCBQUEUED, doesn't work because it ends up getting the  
ordering between the journal lock and the page lock backwards.   
Sigh.  (note to self: help lockdep warn us about that ordering)



After patch generic_file_direct_IO() sets PgBusy flag in iocb
if page could not be invalidated. iocb is retried after IO  
competition.

The process is waked up if IO is SYNC else iocb is kicked.
The lines ___if (ret != -EIOCBRETRY)___ is deleted because
nothing set to EIOCBRETRY.


True, but this is gratuitously cruel to external users of - 
EIOCBRETRY.  Silently breaking them doesn't sound like a great plan.   
If we really decide to remove EIOCBRETRY support we'd get rid of all  
the retry infrastructure and remove the EIOCBRETRY errno so their  
builds failed.  That's a separate issue that shouldn't be confused  
with this EIOCBQUEUED clobbering.  Just removing some pieces of the  
infrastructure willy-nilly isn't acceptable.




Please copy [EMAIL PROTECTED] on AIO patches.


Next patches 2/3 and 3/3 do cleanup only.


I cannot find those patches.


Me either.  I was waiting for them to arrive before responding.


diff -upr linux-2.6.20/mm/filemap.c linux-2.6.20-aio2/mm/filemap.c
--- linux-2.6.20/mm/filemap.c   2007-02-04 21:44:54.0 +0300
+++ linux-2.6.20-aio2/mm/filemap.c	2007-03-04 21:46:10.0  
+0300

@@ -2413,10 +2413,9 @@ generic_file_direct_IO(int rw, struct ki
if (rw == WRITE  mapping-nrpages) {
pgoff_t end = (offset + write_len - 1)
 PAGE_CACHE_SHIFT;
-   int err = invalidate_inode_pages2_range(mapping,
-   offset  PAGE_CACHE_SHIFT, end);
-   if (err)
-   retval = err;
+   if (invalidate_inode_pages2_range(mapping,
+   offset  PAGE_CACHE_SHIFT, end))
+   kiocbSetPgBusy(iocb);


There are two problems behind this bug:

- ext3_releasepage() returns failure if it races with kjournald  
holding a reference while it waits for a transaction to commit.


- generic_file_direct_IO() causes an oops if it clobbers -EIOCBQUEUED  
with the return code from invalidate_inode_pages2_range()..- 
releasepage().


This fix makes the incorrect assertion that *any* failure from  
invalidate_inode_pages2_range(), which might not have anything to do  
with this race you're currently seeing, is transitory and should  
trigger a retry.  That's wrong, it should be returning the error.


Now, getting ext3_releasepage() to not fail if this race hits to  
begin with is another story.  Chris has some ideas about reworking  
the page laundering helper to make that more reliable.


- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] aio: remove bare user-triggerable error printk

2007-03-27 Thread Zach Brown

aio: remove bare user-triggerable error printk

The user can generate console output if they cause do_mmap() to fail during
sys_io_setup().  This was seen in a regression test that does exactly that by
spinning calling mmap() until it gets -ENOMEM before calling io_setup().

We don't need this printk all, just remove it.

Signed-off-by: Zach Brown [EMAIL PROTECTED]
---

 fs/aio.c |1 -
 1 file changed, 1 deletion(-)

--- a/fs/aio.c  Tue Mar 27 14:56:08 2007 -0700
+++ b/fs/aio.c  Tue Mar 27 14:56:49 2007 -0700
@@ -136,7 +136,6 @@ static int aio_setup_ring(struct kioctx 
  0);
if (IS_ERR((void *)info-mmap_base)) {
up_write(ctx-mm-mmap_sem);
-   printk(mmap err: %ld\n, -info-mmap_base);
info-mmap_size = 0;
aio_free_ring(ctx);
return -EAGAIN;
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] aio: remove bare user-triggerable error printk

2007-03-27 Thread Zach Brown


We don't need this printk all, just remove it.


(at all.  Sigh.)

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] cache pipe buf page address for non-highmem arch

2007-03-27 Thread Zach Brown


+#define pipe_kmap_atomic(page, type)   pipe_kmap(page)
+#define pipe_kunmap(page)  do { } while (0)
+#define pipe_kunmap_atomic(page, type) do { } while (0)


Please don't drop arguments in stubs.  It can let completely broken  
code compile, like:


pipe_kunmap(SOME_COMPLETE_NONSENSE);

Static inlines with empty bodies are the gold standard.

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] cache pipe buf page address for non-highmem arch

2007-03-28 Thread Zach Brown


Does this look OK?


Almost...


#ifdef CONFIG_HIGHMEM
static inline void pipe_kunmap_atomic(void *addr, enum km_type type)
#else   /* CONFIG_HIGHMEM */
static inline void pipe_kunmap_atomic(struct page *page, enum  
km_type type)


- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 5/6] syslets: add generic syslets infrastructure

2008-01-08 Thread Zach Brown


 Firstly, why not just specify an address for the return value and be done 
 with it?  This infrastructure seems overkill, and you can always extend later 
 if required.

Sorry, which infrastructure?

Providing the function and stack to return to?  Sure, I could certainly
entertain the idea of not having syslet tasks return to userspace in the
first pass.  Ingo sure seemed excited by the idea.

Or do you mean the syscall return value ending up in the userspace
completion event ring?  That's mostly about being able to wait for
pending syslets to complete.

 Secondly, you really should allow integration with an eventfd so you don't 
 make the posix AIO mistake of providing a poll-incompatible interface.

Yeah, this seems straight forward enough that I haven't made it an
initial priority.  I'm sure it will be helpful for people who are stuck
integrating with entrenched software that wants to wait for pollable fds.

For more flexible software, though, it's compelling to now be able to
aggregate waiting for completion of the existing waiting syscalls (poll,
epoll_wait, futexes, whatever) by issuing them as concurrent syslets.

 Finally, and probably most alarmingly, AFAICT randomly changing TID will 
 break 
 all threaded programs, which means this won't be fitted into existing code 
 bases, making it YA niche Linux-only API 8(

Yeah, this still needs to be investigated.  I haven't yet and I haven't
heard of anyone else trying their hand at it.

In the YANLOA mode apps would know that executing syslets is an implicit
clone() and would act accordingly.  8(, indeed.

I wonder if there isn't an opportunity to add a clone() flag which
juggles the association between TIDs and task_structs.  I don't relish
the idea of investigating the life cycles of task_struct references that
derive from TIDs and seeing how those would race with a syslet blocking
and cloning, but, well, maybe that's what needs to be done.

This all isn't my area of expertise, though, sadly.  It would be swell
if someone wanted to look into it before I'm forced to learn yet another
weird corner of the kernel.

- z
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 5/6] syslets: add generic syslets infrastructure

2008-01-09 Thread Zach Brown


 Or do you mean the syscall return value ending up in the userspace
 completion event ring?  That's mostly about being able to wait for
 pending syslets to complete.
 
 The latter.  A ring is optimal for processing a huge number of requests, but 
 if you're really going to be firing off syslet threads all over the place 
 you're not going to be optimal anyway.  And being able to point the return 
 value to the stack or into some datastructure is way nicer to code (zero 
 setup == easy to understand and easy to convert).

One of Linus' rhetorical requirements for the syslets work is that we be
able to wait for the result without spending overhead building up state
in some completion context.  The userland rings in the current syslet
patches achieve that and don't seem outrageously complicated.

I have a hard time getting worked up about this particular piece of the
puzzle, though.  If people are excited about *only* providing a pollable
fd to collect the syslet completions, well, great, whatever.

 This must be solved, yet all avenues seem crawling with worms.

Yup.

- z
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 5/6] syslets: add generic syslets infrastructure

2008-01-09 Thread Zach Brown

Linus Torvalds wrote:
 
 On Thu, 10 Jan 2008, Rusty Russell wrote:
 I'd have to read his original statement, but eventfd doesn't build up state, 
 so I think it qualifies.
 
 How about you guys battle it out by giving an example program usign the 
 interface?
 
 Here's a favourite really simple load of mine:
 
  - do the equivalent of ls -lR or find /usr as quickly as possible, 
without playing sort by the inode numbers games (that don't work in 
general, but are great for some filesystems)
 
Do this on a directory that isn't newly created, but has had files 
added and removed over time (so that the return order of readdir() 
isn't dense and sorted in the inode tables already). The classic 
example is ls -l /usr/bin or similar.

Sure, that's straight forward enough.  We've all written little test
apps for variants of this load in the past, anyway.  (It was one of the
first things I did for fibrils, Ingo had a variant for syslets which
read small file data too, Chris has a syslet mode in his 'acp' util, etc.)

I was going to send out a patch series pretty soon which includes
cleanups (I think) of the sys_indirect() infrastructure.  I can throw
together this little test app along with that.

- z
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: hwclock failure in x86.git

2008-01-10 Thread Zach Brown

I'm no expert, but I happened to notice this go by.

 The first thing I notice about the path is that ioport_32.c and the unified 
 ioport.c use __clear_bit,
 while ioport_64.c uses clear_bit.

That doesn't seem too critical.

 +#ifdef CONFIG_X86_32
 +asmlinkage long sys_iopl(unsigned long regsp)
 +{
 + volatile struct pt_regs *regs = (struct pt_regs *)regsp;
 + unsigned int level = regs-bx;
 + unsigned int old = (regs-flags  12)  3;
 +
 + if (level  3)
 + return -EINVAL;
 + /* Trying to gain more privileges? */
 + if (level  old) {
 + if (!capable(CAP_SYS_RAWIO))
 + return -EPERM;
 + }
 + regs-flags = (regs-flags  ~X86_EFLAGS_IOPL) | (level  12);
 +
 + return 0;
 +}

 -asmlinkage long sys_iopl(unsigned long regsp)
 -{
 - volatile struct pt_regs *regs = (struct pt_regs *)regsp;
 - unsigned int level = regs-bx;
 - unsigned int old = (regs-flags  12)  3;
 - struct thread_struct *t = current-thread;
 -
 - if (level  3)
 - return -EINVAL;
 - /* Trying to gain more privileges? */
 - if (level  old) {
 - if (!capable(CAP_SYS_RAWIO))
 - return -EPERM;
 - }
 -
 - t-iopl = level  12;
 - regs-flags = (regs-flags  ~X86_EFLAGS_IOPL) | t-iopl;
 - set_iopl_mask(t-iopl);
 -
 - return 0;
 -}

Is it OK that we lost the different final steps of those two functions?
 It looks like someone might have missed the differing tail sections of
the function when copying and pasting and updating just the start of the
functions?

Sorry if I missed why this is OK and intended, I didn't read very closely.

- z
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/4] sys_indirect system call

2007-11-15 Thread Zach Brown

Ulrich Drepper wrote:
 The following patches provide an alternative implementation of the
 sys_indirect system call which has been discussed a few times.
 This no system call allows us to extend existing system call
 interfaces with adding more system calls.

I might quarrel with some details, but I really like the general notion.

I think we can use this to pass per-syscall syslet data to the
scheduler.  Today the patches have submission syscalls which basically
wrap some data for the scheduler around the syscall ABI.  This might
save the syslet patches from needing to know about each arch's ABI.

We'll see.  I'll give it a try.

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/4] sys_indirect system call

2007-11-15 Thread Zach Brown


 BTW, I've botched the x86-on-x86_64 support.  I have a patch but need to
 patch it before I'll submit v3 of the patch set.  If you want to work on
 the patch and get syslet support going, let me know, I'll send the
 latest version.

I probably won't come around to trying sys_indirect with syslets until
next week, so it's probably fine for me to just wait for you to get v3
out.  (and I'm more likely to test native i386 or x86_64 anyway, so.. :))

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Top kernel oopses/warnings this week

2007-12-17 Thread Zach Brown


 Report counts may be too high due to duplicate recognition of the very
 same report.¹
 
 this is true however it's .. a hard issue. It's really hard to
 distinguish a duplicate report from
 two reports of the same bug.

Can we hack some data in to oops output to help?  Say a giant per-boot
anonymous random number (yeah, I know, harder than it sounds) and then
an incrementing oops counter.  That'd also let you discover that the
latter oopses in a chain of oopses might be fall-out from the head of
the chain.

- z
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [2.6.24 BUG] 100% iowait on host while UML is running

2007-12-03 Thread Zach Brown


 We could check ctx-reqs_active before scheduling to determine whether
 or not we are waiting for I/O, but this would require taking the
 context lock in order to be accurate.  Given that the test would be
 only for the sake of book keeping, it might be okay to do it outside
 of the lock.
 
 Zach, what are your thoughts on this?

I agree that it'd be OK to test it outside the lock, though we'll want
some commentary:

/* Try to only show up in io wait if there are ops in flight */
if (ctx-reqs_active)
io_schedule();
else
schedule();

It's cheap, safe, and accurate the overwhelming majority of the time :).

We only need it in read_events().  The other two io_schedule() calls are
only reached to wait on pending reqs specifically.

It still won't make sense for iocbs which aren't performing IO, but I
guess that's one more bridge to cross when we come to it.

Do you want to throw this tiny patch together and submit it?

- z
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 1/6] indirect: use asmlinkage in i386 syscall table prototype

2007-12-06 Thread Zach Brown

call_indirect() was using the wrong calling convention for the system call
handlers.  system call handlers would get mixed up arguments.

Signed-off-by: Zach Brown [EMAIL PROTECTED]

diff --git a/include/asm-x86/indirect_32.h b/include/asm-x86/indirect_32.h
index a1b72ac..e3dea8e 100644
--- a/include/asm-x86/indirect_32.h
+++ b/include/asm-x86/indirect_32.h
@@ -15,8 +15,8 @@ struct indirect_registers {
 
 static inline long call_indirect(struct indirect_registers *regs)
 {
-  extern long (*sys_call_table[]) (__u32, __u32, __u32, __u32, __u32, __u32);
-
+   extern asmlinkage long (*sys_call_table[])(long, long, long,
+  long, long, long);
   return sys_call_table[INDIRECT_SYSCALL(regs)](regs-ebx, regs-ecx,
regs-edx, regs-esi,
regs-edi, regs-ebp);
-- 
1.5.2.2

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

syslets v7: back to basics

2007-12-06 Thread Zach Brown


The following patches are a substantial refactoring of the syslet code.  I'm
branding them as the v7 release of the syslet infrastructure, though they
represent a signifiant change in focus.

My current focus is to see the most fundamental functionality brought to
maturity.  To me, this means getting a ABI that is used by applications through
glibc on x86 and PPC64.   Only once that is ready should we distract ourselves
with advanced complexity.

To that end, this patch series differs from v6 in significant ways:

 * syslets are initiated by providing syslet arguments to sys_indirect().

 * uatoms, threadlets, and the kaio changes are postponed until they can be
   justified and rebuilt on more complete infrastructure.  (I'm not saying
   these shouldn't or won't be persued.  I'm saying that we should get the
   simplest piece working first.)

 * the code is clarified and commented, the patches are bisectable and pass
   checkpatch.

The use of sys_indirect() and the move from 'atom's simplified the ABI
considerably.  I've put a trivial example in a syslet-userspace git tree:

git://git.kernel.org/pub/scm/linux/kernel/git/zab/syslets-userspace.git

This git repository will grow more tests and documentation over time.

The patches sent with this mail are based on the v6 indirect patches but they
aren't included.  The full syslets patch series, including the indirect
patches, are available in a few forms:

broken out patch series:
http://www.kernel.org/pub/linux/kernel/people/zab/broken-out/syslets/

in a 'syslets' git branch off of current linux-2.6.git:
git://git.kernel.org/pub/scm/linux/kernel/git/zab/linux-2.6.git syslets

git tree of the guilt .git/patches directory:
git://git.kernel.org/pub/scm/linux/kernel/git/zab/guilt-series.git

The patches were barely tested on i386 and x86_64.

There are both implementation details and design problems left.  My hope is
that we can address these in the coming weeks.

 - Do we stop the user from initiating more syslets than fit in the ring? 
 - Do we worry now about the hashed ring mutexes scaling poorly?  (They will.)
 - What are the semantics of ptrace()ing a syslet submission which blocks?
 - How should applications deal with waiting syslet tasks with stale data
   in their task_struct?  (syslet, setuid, syslet..)
 - Issuing a syslet is an implicit sys_clone(), will apps pass in clone flags?
 - Are the u32 ring index reads and writes atomic for supported architectures?

Any feedback on these questions would be greatly appreciated.

I'm particularly interested in hearing from people who are trying to use
syslets in their applications.  This will involve awkward wrappers instead of
glibc calls for now, and your machine may explode, but hopefully the chance to
influence the design of syslets would make it worth the effort.

Finally, I carried the enormous cc: list for this mail over from previous
syslet releases.  If you want to be removed or added to the list for future
syslet releases, please do let me know.

- z
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 2/6] syslet: asm-generic support to disable syslets

2007-12-06 Thread Zach Brown

This provides an implementation of the architecture dependent portion of
syslets which disables syslet operations.

This patch is an incomplete demonstration.  All asm-*/syslet*.h files would
include these files until their architectures provide implementations which
enable syslet support.

Signed-off-by: Zach Brown [EMAIL PROTECTED]

diff --git a/include/asm-generic/syslet-abi.h b/include/asm-generic/syslet-abi.h
new file mode 100644
index 000..5d19971
--- /dev/null
+++ b/include/asm-generic/syslet-abi.h
@@ -0,0 +1,11 @@
+#ifndef _ASM_GENERIC_SYSLET_ABI_H
+#define _ASM_GENERIC_SYSLET_ABI_H
+
+/*
+ * I'm assuming that a u64 ip and u64 esp won't be enough for all
+ * archs, so I just let each arch define its own.
+ */
+struct syslet_frame {
+};
+
+#endif
diff --git a/include/asm-generic/syslet.h b/include/asm-generic/syslet.h
new file mode 100644
index 000..de9a750
--- /dev/null
+++ b/include/asm-generic/syslet.h
@@ -0,0 +1,34 @@
+#ifndef _ASM_GENERIC_SYSLET_H
+#define _ASM_GENERIC_SYSLET_H
+
+/*
+ * This provider of the arch-specific syslet APIs is used when an architecture
+ * doesn't support syslets.
+ */
+
+/* this stops the other functions from ever being called */
+static inline int syslet_frame_valid(struct syslet_frame *frame)
+{
+   return 0;
+}
+
+static inline void set_user_frame(struct task_struct *task,
+ struct syslet_frame *frame)
+{
+   BUG();
+}
+
+static inline void move_user_context(struct task_struct *dest,
+   struct task_struct *src)
+{
+   BUG();
+}
+
+static inline int create_syslet_thread(long (*fn)(void *),
+  void *arg, unsigned long flags)
+{
+   BUG();
+   return 0;
+}
+
+#endif
diff --git a/include/asm-x86/syslet-abi.h b/include/asm-x86/syslet-abi.h
new file mode 100644
index 000..14a7182
--- /dev/null
+++ b/include/asm-x86/syslet-abi.h
@@ -0,0 +1 @@
+#include asm-generic/syslet-abi.h
diff --git a/include/asm-x86/syslet.h b/include/asm-x86/syslet.h
new file mode 100644
index 000..583d810
--- /dev/null
+++ b/include/asm-x86/syslet.h
@@ -0,0 +1 @@
+#include asm-generic/syslet.h
-- 
1.5.2.2

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 3/6] syslet: introduce abi structs

2007-12-06 Thread Zach Brown

This patch adds the architecture independent structures of the
syslet ABI.

Signed-off-by: Zach Brown [EMAIL PROTECTED]

diff --git a/include/linux/syslet-abi.h b/include/linux/syslet-abi.h
new file mode 100644
index 000..a8bc1a3
--- /dev/null
+++ b/include/linux/syslet-abi.h
@@ -0,0 +1,34 @@
+#ifndef _LINUX_SYSLET_ABI_H
+#define _LINUX_SYSLET_ABI_H
+
+#include asm/syslet-abi.h /* for struct syslet_frame */
+
+struct syslet_args {
+   u64 completion_ring_ptr;
+   u64 caller_data;
+   struct syslet_frame frame;
+};
+
+struct syslet_completion {
+   u64 status;
+   u64 caller_data;
+};
+
+/*
+ * The ring follows the wrapping convention as described by Andrew at:
+ * http://lkml.org/lkml/2007/4/11/276
+ * The head is updated by the kernel as completions are added and the
+ * tail is updated by userspace as completions are removed.
+ *
+ * The number of elements must be a power of two and the ring must be
+ * aligned to a u64.
+ */
+struct syslet_ring {
+   u32 kernel_head;
+   u32 user_tail;
+   u32 elements;
+   u32 wait_group;
+   struct syslet_completion comp[0];
+};
+
+#endif
-- 
1.5.2.2

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 4/6] syslets: add indirect args

2007-12-06 Thread Zach Brown

This adds the syslet indirect args to the indirect_params union.

This is broken, but it lets us simply demonstrate the rest of the syslet
universe around the indirect argument passing convention.

A caller could well want to perform a syscall that uses indirect arguments as a
syscall.  Maybe we turn indirect_params into a struct that contains a union for
arguments which can never be used concurrently.  This needs wider discussion.

Signed-off-by: Zach Brown [EMAIL PROTECTED]

diff --git a/include/linux/indirect.h b/include/linux/indirect.h
index 97f9ac4..5d5abd7 100644
--- a/include/linux/indirect.h
+++ b/include/linux/indirect.h
@@ -3,6 +3,7 @@
 #define _LINUX_INDIRECT_H
 
 #include asm/indirect.h
+#include linux/syslet-abi.h
 
 
 /* IMPORTANT:
@@ -14,6 +15,7 @@ union indirect_params {
   struct {
 int flags;
   } file_flags;
+  struct syslet_args syslet;
 };
 
 #define INDIRECT_PARAM(set, name) current-indirect_params.set.name
-- 
1.5.2.2

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 5/6] syslets: add generic syslets infrastructure

2007-12-06 Thread Zach Brown

The indirect syslet arguments specify where to store the completion and what
function in userspcae to return to once the syslet has been executed.  The
details of how we pass the indirect syslet arguments needs help.

We parse the indirect syslet arguments in sys_indirect() before we call
the given system call.  If they're OK we mark the task as ready to become
a syslet.  We make sure that there is a child task waiting.

We call into kernel/syslet.c from the scheduler when we try to block a task
which has been marked as ready.   A child task is woken and returns to
userspace.

We store the result of the system call in the userspace ring back up in
sys_indrect() as the system call finally finishes.  At that point the original
task returns to the frame that userspace provided in the indirect syslet args.

This generic infrastructure relies on architecture specific routines to create
a new child task, move userspace state from one kernel task to another, and to
setup the userspace return frame in ptregs.  Code in asm-generic just returns
-EINVAL until an architecture provides the needed routines.

This is a simplification of Ingo's more involved syslet and threatlet
infrastructure which was built around 'uatoms'.  Enough code has changed that
it wasn't appropriate to bring the previous Signed-off-by lines forward.

Signed-off-by: Zach Brown [EMAIL PROTECTED];

diff --git a/fs/exec.c b/fs/exec.c
index 282240a..942262f 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -51,6 +51,7 @@
 #include linux/tsacct_kern.h
 #include linux/cn_proc.h
 #include linux/audit.h
+#include linux/syslet.h
 
 #include asm/uaccess.h
 #include asm/mmu_context.h
@@ -1614,6 +1615,8 @@ static int coredump_wait(int exit_code)
complete(vfork_done);
}
 
+   kill_syslet_tasks(tsk);
+
if (core_waiters)
wait_for_completion(startup_done);
 fail:
diff --git a/include/asm-generic/errno.h b/include/asm-generic/errno.h
index e8852c0..26674c4 100644
--- a/include/asm-generic/errno.h
+++ b/include/asm-generic/errno.h
@@ -106,4 +106,7 @@
 #defineEOWNERDEAD  130 /* Owner died */
 #defineENOTRECOVERABLE 131 /* State not recoverable */
 
+/* for syslets */
+#defineESYSLETPENDING  132 /* syslet syscall now pending */
+
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 61a4b83..a134966 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1182,6 +1182,13 @@ struct task_struct {
 
/* Additional system call parameters.  */
union indirect_params indirect_params;
+
+   /* task waiting to return to userspace if we block as a syslet */
+   spinlock_t  syslet_lock;
+   struct list_headsyslet_tasks;
+   unsignedsyslet_ready:1,
+   syslet_return:1,
+   syslet_exit:1;
 };
 
 /*
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index addb39f..1a44838 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -55,6 +55,7 @@ struct compat_timeval;
 struct robust_list_head;
 struct getcpu_cache;
 struct indirect_registers;
+struct syslet_ring;
 
 #include linux/types.h
 #include linux/aio_abi.h
@@ -615,6 +616,8 @@ asmlinkage long sys_fallocate(int fd, int mode, loff_t 
offset, loff_t len);
 asmlinkage long sys_indirect(struct indirect_registers __user *userregs,
 void __user *userparams, size_t paramslen,
 int flags);
+asmlinkage long sys_syslet_ring_wait(struct syslet_ring __user *ring,
+unsigned long user_idx);
 
 int kernel_execve(const char *filename, char *const argv[], char *const 
envp[]);
 
diff --git a/include/linux/syslet.h b/include/linux/syslet.h
new file mode 100644
index 000..734b98f
--- /dev/null
+++ b/include/linux/syslet.h
@@ -0,0 +1,18 @@
+#ifndef _LINUX_SYSLET_H
+#define _LINUX_SYSLET_H
+
+#include linux/syslet-abi.h
+#include asm/syslet.h
+
+void syslet_init(struct task_struct *tsk);
+void kill_syslet_tasks(struct task_struct *cur);
+void syslet_schedule(struct task_struct *cur);
+int syslet_pre_indirect(void);
+int syslet_post_indirect(int status);
+
+static inline int syslet_args_present(union indirect_params *params)
+{
+   return params-syslet.completion_ring_ptr;
+}
+
+#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index bcad37d..7a7dfbe 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -9,7 +9,7 @@ obj-y = sched.o fork.o exec_domain.o panic.o printk.o 
profile.o \
rcupdate.o extable.o params.o posix-timers.o \
kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
hrtimer.o rwsem.o latency.o nsproxy.o srcu.o \
-   utsname.o notifier.o
+   utsname.o notifier.o syslet.o
 
 obj-$(CONFIG_SYSCTL) += sysctl_check.o
 obj-$(CONFIG_STACKTRACE) += stacktrace.o
diff --git a/kernel/exit.c b/kernel/exit.c
index

[PATCH 6/6] syslets: add both 32bit and 64bit x86 syslet support

2007-12-06 Thread Zach Brown

This adds the architecture-specific routines needed by syslets for x86.

The syslet thread creation routines create a new thread which executes
a kernel function and then returns to userspace instead of exiting.

move_user_context() and set_user_frame() let the scheduler modify a child
thread so that it returns to userspace at the same place that a blocking
system call would have when it finished.  This currently performs a very
expensive copy of the fpu state.  Intel is working on a more robust patch
which allocates the i387 state off of thread_struct.  When that is ready
this can just juggle pointers to transfer the fpu state.

The syslets infrastructure needs to work with ptregs for the task which
is in sys_indirect().  So we add a PTREGSCALL stub around sys_indirect()
in x86_64.

Finally, we wire up sys_syslet_ring_wait().

Signed-off-by: Zach Brown [EMAIL PROTECTED]

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index dc7f938..66a121d 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -1025,6 +1025,30 @@ ENTRY(kernel_thread_helper)
CFI_ENDPROC
 ENDPROC(kernel_thread_helper)
 
+ENTRY(syslet_thread_helper)
+   CFI_STARTPROC
+   /*
+* Allocate space on the stack for pt-regs.
+* sizeof(struct pt_regs) == 64, and we've got 8 bytes on the
+* kernel stack already:
+*/
+   subl $64-8, %esp
+   CFI_ADJUST_CFA_OFFSET 64-8
+   movl %edx,%eax
+   push %edx
+   CFI_ADJUST_CFA_OFFSET 4
+   call *%ebx
+   addl $4, %esp
+   CFI_ADJUST_CFA_OFFSET -4
+
+   movl %eax, PT_EAX(%esp)
+
+   GET_THREAD_INFO(%ebp)
+
+   jmp syscall_exit
+   CFI_ENDPROC
+ENDPROC(syslet_thread_helper)
+
 #ifdef CONFIG_XEN
 ENTRY(xen_hypervisor_callback)
CFI_STARTPROC
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 3a058bb..08e34f6 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -412,6 +412,7 @@ END(\label)
PTREGSCALL stub_rt_sigsuspend, sys_rt_sigsuspend, %rdx
PTREGSCALL stub_sigaltstack, sys_sigaltstack, %rdx
PTREGSCALL stub_iopl, sys_iopl, %rsi
+   PTREGSCALL stub_indirect, sys_indirect, %r8
 
 ENTRY(ptregscall_common)
popq %r11
@@ -994,6 +995,71 @@ child_rip:
 ENDPROC(child_rip)
 
 /*
+ * Create a syslet kernel thread.  This differs from a thread created with
+ * kernel_thread() in that it returns to userspace after it finishes executing
+ * its given kernel function.
+ *
+ * C extern interface:
+ * extern long create_syslet_thread(int (*fn)(void *),
+ *  void * arg, unsigned long flags)
+ *
+ * asm input arguments:
+ * rdi: fn, rsi: arg, rdx: flags
+ */
+ENTRY(create_syslet_thread)
+   CFI_STARTPROC
+   FAKE_STACK_FRAME $syslet_child_rip
+   SAVE_ALL
+
+   # rdi: flags, rsi: usp, rdx: will be pt_regs
+   movq %rdx,%rdi
+   movq $-1, %rsi
+   movq %rsp, %rdx
+
+   xorl %r8d,%r8d
+   xorl %r9d,%r9d
+
+   # clone now
+   call do_fork
+   movq %rax,RAX(%rsp)
+   xorl %edi,%edi
+
+   /*
+* It isn't worth to check for reschedule here,
+* so internally to the x86_64 port you can rely on kernel_thread()
+* not to reschedule the child before returning, this avoids the need
+* of hacks for example to fork off the per-CPU idle tasks.
+ * [Hopefully no generic code relies on the reschedule -AK]
+*/
+   RESTORE_ALL
+   UNFAKE_STACK_FRAME
+   ret
+   CFI_ENDPROC
+ENDPROC(syslet_kernel_thread)
+
+syslet_child_rip:
+   CFI_STARTPROC
+
+   movq %rdi, %rax
+   movq %rsi, %rdi
+   call *%rax
+
+   /*
+* Fix up the PDA - we might return with sysexit:
+*/
+   RESTORE_TOP_OF_STACK %r11
+
+   /*
+* return to user-space:
+*/
+   movq %rax, RAX(%rsp)
+   RESTORE_REST
+   jmp int_ret_from_sys_call
+
+   CFI_ENDPROC
+ENDPROC(syslet_child_rip)
+
+/*
  * execve(). This function needs to use IRET, not SYSRET, to set up all state 
properly.
  *
  * C extern interface:
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 7b89958..7bf2836 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -394,6 +394,39 @@ int kernel_thread(int (*fn)(void *), void * arg, unsigned 
long flags)
 EXPORT_SYMBOL(kernel_thread);
 
 /*
+ * This gets run with %ebx containing the
+ * function to call, and %edx containing
+ * the args.
+ */
+void syslet_thread_helper(void);
+
+/*
+ * Create a syslet kernel thread.  This differs from a thread created with
+ * kernel_thread() in that it returns to userspace after it finishes executing
+ * its given kernel function.
+ */
+int create_syslet_thread(long (*fn)(void *), void *arg, unsigned long flags)
+{
+   struct pt_regs regs;
+
+   memset(regs, 0, sizeof(regs));
+
+   regs.ebx = (unsigned long)fn

Re: [PATCH 5/6] syslets: add generic syslets infrastructure

2007-12-07 Thread Zach Brown


 +/*
 + * syslet_ring doesn't have any kernel-side storage.  Userspace allocates 
 them
 + * in their address space and initializes their fields and then passes them 
 to
 + * the kernel.
 + *
 + * These hashes provide the kernel-side storage for the wait queues which
 + * sys_syslet_ring_wait() uses and the mutex which completion uses to 
 serialize
 + * the (possible blocking) ordered writes of the completion and kernel head
 + * index into the ring.
 + *
 + * We chose the bucket that supports a given ring by hashing a u32 that
 + * userspace sets in the ring.
 + */
 +#define SYSLET_HASH_BITS (CONFIG_BASE_SMALL ? 4 : 8)
 +#define SYSLET_HASH_NR (1  SYSLET_HASH_BITS)
 +#define SYSLET_HASH_MASK (SYSLET_HASH_NR - 1)
 +static wait_queue_head_t syslet_waitqs[SYSLET_HASH_NR];
 +static struct mutex syslet_muts[SYSLET_HASH_NR];
 
 Why do you care about hashed tables scalability and not using trees?

Well, this notion of letting tasks safely complete to any ring they can
address is just a possibility.  We might decide that it's not worth it.
 This implementation was an easy example that borrows from the way
futexes do similar work.

I like it because you could have, say, different processes completing
into a ring in shared memory.

If we do allow this kind of flexible ring specification, it's not at all
clear that trees would be the best way to address the scalability
limits.  There are lots of possibilities, including locking the page
lock of the page which holds the head index.

- z
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

2007-11-20 Thread Zach Brown

David Miller wrote:
 From: Ulrich Drepper [EMAIL PROTECTED]
 Date: Tue, 20 Nov 2007 01:53:14 -0500

 FWIW, I think this indirect syscall stuff is the most ugly interface
 I've ever seen proposed for the kernel.

Well, there's no XML in /proc :) :).

But, yes, I agree that the internal code needs a lot more cleanup before
being considered for merging.

 And I agree with all of the objections raised by both H. Pater Anvin
 and Eric Dumazet.

I'm worried, too.  Do we have a stronger alternative?  I'm all ears,
this isn't really my area of expertise.

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv4 4/6] Allow setting FD_CLOEXEC flag for new sockets

2007-11-20 Thread Zach Brown


 +union indirect_params {
 +  struct {
 +int flags;
 +  } file_flags;
 +};

Have you given thought to having to perform compat translation on this?
  Today it's only copied directly from the user pointer into the union
in the task_struct.

I'd love if we could only use fixed-width fields in the interface to
avoid the compat nonsense.

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv4 1/6] actual sys_indirect code

2007-11-20 Thread Zach Brown

Ingo Molnar wrote:
 cool patchset. Small nit, the series is not bisectable:

Yeah, in a few places.

Uli, this is super easy to test if you maintain the patches with guilt.
 With some easy scripting around guilt push you can verify that the
series builds as each patch is applied.

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv4 4/6] Allow setting FD_CLOEXEC flag for new sockets

2007-11-20 Thread Zach Brown


 Since there is no legacy interface to worry about all members added to
 the structure can and should be neutral of the word size.

OK, perhaps add a giant comment warning about that.  History tells us
that people will get it wrong.

 We've done
 this with some syscalls already (like pread64) where we always use the
 wide form in the parameter list.  It's just more simple here since it
 does not have to split into two 32-bit registers.

*nod*

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv4 0/6] sys_indirect system call

2007-11-20 Thread Zach Brown


 This was mentioned in one of my mails.  I added the parameter to
 accommodate Linus's and Zack's idea to use the functionality for syslets
 as well.  Not really a multiplexer, it is meant to be a execute
 synchronously or asynchronously flag.  In the latter case an additional
 parameter might be needed to indicate the notification mechanism.

I'm sure the additional parameter will be needed, and it might be pretty
involved.  I think the current notion of syslets needs, at the very least:

 - a userspace stack and IP to return to after going async
 - the completion context when completed async
 - a pointer to a return code when completed without blocking

We'll see.  I'm working on it.

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

2007-11-20 Thread Zach Brown


 That's only because you're being, deliberately or accidentally, vague
 about what your actual (as opposed to imagined) requirements are.

Maybe I can help by summarizing how syslets fit in to this.

Currently the syslet patches add a single submission call which includes
an argument which is a structure which duplicates the system call ABI.
The submission syscall in the kernel does some syslet specific work
which amounts to verifying state and storing it in the task_struct.  It
then has to unpack the system call arguments from this submission
syscall argument and call the specified system call.

Every architecture will need helpers, then, on either side.  They'll
need to pack their arguments into the struct and then unpack and call in
the kernel.  The PPC64 guys have already expressed concern about this.

It's, in effect, adding the syslet arguments to every single system call.

So, instead of duplicating the system call ABI in the argument to a
syslet submission syscall, we could pass the syslet arguments via this
indirect parameters convention.  This, hopefully, will reduce complexity
by reducing the number of places that we have to muck around with the
sycall ABI.

That's the high level summary, anyway.  I'm working on the simplest
expression of this mechanism at the moment.  We'll have code to argue
about before the silly thanksgiving break, I hope.

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets

2007-11-20 Thread Zach Brown


 Where does this INDIRECT_PARAM() macro get defined?  I do not
 see it being defined anywhere in these patches.
 Defined in linux/indirect.h:

 +#define INDIRECT_PARAM(set, name) current-indirect_params.set.name

 Not my idea, I was following one review comment.
 
 This was not in the patches you posted, I double checked before
 sending my reply.

Not to belabor this point, but it was:

http://lkml.org/lkml/2007/11/20/53

$ grep -l INDIRECT_PARAM .git/patches/master/*
.git/patches/master/indirect-v4-4.patch
.git/patches/master/indirect-v4-5.patch
.git/patches/master/indirect-v4-6.patch

Maybe the patches got to you out of order so you saw 5/ before 4/?

- z

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH]loop cleanup in fs/namespace.c - repost

2007-11-21 Thread Zach Brown


 The patch given below replaces the goto-loop by a while-based one.

That certainly looks fine.  I would also replace the 'return' with
'break', but I guess that's more of a question of personal preference.

 Besides, it removes the export for the same routine, because there are
 no users for it outside of the core VFS code.

This doesn't look fine.  Did you test this?

mntput_no_expire() is called from mntput() which is an inline function
in mount.h.  So lots of callers of mntput() in modules will end up
trying to call mntput_no_expire() from modules.

$ nm fs/fuse/fuse.ko | grep mntput_no_expire
 U mntput_no_expire

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH]loop cleanup in fs/namespace.c - repost

2007-11-21 Thread Zach Brown


 This doesn't look fine.  Did you test this?
 
 Oops, my fault. Of course, I tested the patch, but kernel modules are
 disabled in my test setup, so I missed the error.

:)

 Enclosed to this message is a new patch, which replaces the goto-loop by
 the while-based one, but leaves the EXPORT_SYMBOL macro intact.

It certainly looks OK to me now, for whatever that's worth.

You probably want to wait 'till the next merge window to get it in,
though.  It's just a cleanup and so shouldn't go in this late in the -rc
line.

Maybe Andrew will be willing to queue it until that time in -mm.

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/6] indirect: use asmlinkage in i386 syscall table prototype

2007-12-08 Thread Zach Brown


 +extern asmlinkage long (*sys_call_table[])(long, long, long,
 This should be something like below instead, otherwise gcc wont parse
 asmlinkage as being an attribute of the function signature.
   extern long (asmlinkage *sys_call_table[])(long, long, long,

Yeah, thanks for pointing these out.  As it happened, Jens beat you to
it :).

Both problems have been fixed in the git repositories for the guilt
series and userspace tools, respectively.  You can always fetch the most
recent versions from there.

- z
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: syslets v7: back to basics

2007-12-10 Thread Zach Brown


 I pulled from your tree to look over the patches, and noticed that it
 looks like several commits were merged improperly.  It looks like they
 were auto merged or something from an email, and the commit message
 contains the email headers, rather than just the commit message in the
 body.  This leads to the shortlog showing entries that start with
 Return-Path:.

These are patches that guilt imported from email messages.  It didn't
strip the headers and I didn't care to.  I'll try to in the future, it
isn't a big deal.

 I was hoping to find at least some initial information on the overall
 design in Documentation/ but don't see any.  Have you written any yet
 that I could take a look at elsewhere maybe?

No, but it's coming.  I'd like to have some robust documentation so that
Ulrich can help me understand what more he'd need to support POSIX AIO
with syslets from glibc.

 Some of the things I was trying to figure out is does each syslet get
 its own stack,

Yes.  Each blocking operation has a thread that is performing the
operation synchronously.  The benefit is that the thread is only created
if the operation blocks.  If it doesn't block then it's a normal system
call invocation.  You don't have to manage threads and communicate the
arguments and results of system calls amongst threads for the case where
it never blocks.

 and schedule only at a few well defined points

No, every blocking point is considered a scheduling point.

 , and if
 so, would it then be fair to characterize them as kernel mode fibers?

I'm not sure what exactly you mean by kernel mode fibers (I can guess,
but I'd rather not).  From the answer of to the last question, though,
I'm going to guess that it might not be the most apt characterization.

- z

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv5 0/5] sys_indirect system call

2007-12-11 Thread Zach Brown


 Note that the sys_indirect system call takes an additional parameter which
 is for now forced to be zero.  This parameter is meant to enable the use
 of sys_indirect to create syslets, asynchronously executed system calls.
 This syslet approach is also the main reason for the interface in the form
 proposed here.

Can you talk more about how you imagined syslets working with
sys_indirect()?

I threw something together to test most of the machinery around the
actual argument passing.  It just added the syslet arguments to the union:

  http://lkml.org/lkml/2007/12/6/338
  http://lkml.org/lkml/2007/12/6/339

This obviously doesn't work because it's a perfectly reasonable thing to
want to provide the file_flags to some call as well as the syslet args
to perform that call in a syslet.

Then it butchers sys_indirect() itself:

  http://lkml.org/lkml/2007/12/6/340

The syslet machinery wants to run some code before and after the system
call itself when the syslet arguments are provided.  It can also call
almost every single system call, white-listing in sys_indirect() is
probably not the answer here.

I could throw together some ideas, of course, but thought I'd see if you
guys already had a design in mind that I could pursue.

- z
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] dio: falling through to buffered I/O when invalidation of a page fails

2007-12-11 Thread Zach Brown

Hisashi Hifumi wrote:
 Hi.
 
 Current dio has some problems:
 1, In ext3 ordered, dio write can return with EIO because of the race
 between invalidation of
 a page and jbd. jbd pins the bhs while committing journal so
 try_to_release_page fails when jbd
 is committing the transaction.

Yeah.  It sure would be fantastic if some ext3 expert could stop this
from happening somehow.  But that hasn't happened in.. uh.. Badari, for
how many years has this been on the radar? :)

 
 Past discussion about this issue is as follows.
 http://marc.info/?t=11934343124r=1w=2
 http://marc.info/?t=11265676282r=1w=2
 
 2, invalidate_inode_pages2_range() sets ret=-EIO when
 invalidate_complete_page2()
 fails, but this ret is cleared if do_launder_page() succeed on a page of
 next index.

Oops.  That's too bad.  So maybe we should fix it by not stomping on
that return code?

ret2 = do_launder()
if (ret2 == 0)
ret2 = invalidate()
if (ret == 0)
ret = ret2

I'd be surprised if we ever wanted to mask an -EIO when later pages
laundered successfully.

 In this case, dio is carried out even if invalidate_complete_page2()
 fails on some pages.
 This can cause inconsistency between memory and blocks on HDD because
 the page
 cache still exists.

Yeah.

 I solved problems above by introducing invalidate_inode_pages3_range()
 and falling
 through to buffered I/O when invalidation of a page failed.

Well, I like the idea of more intelligently dealing with the known
problem between dio and ext3.  I'm not sure that falling back to
buffered is right.

If dio can tell that it only failed because jbd held the buffer, should
we have waited for the transaction to complete before trying to
invalidate again?

If we could do that, we'd avoid performing the IO twice.

 We can distinguish between failure of page invalidation and other errors
 with the return value of invalidate_inode_pages3_range().

I'm not sure duplicating the invalidation loop into a new function is
the right thing.  Maybe we'd just tweak inode_pages2 to indicate to the
caller the specific failing circumstances somehow.  Maybe.

- z
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] fix invalidate_inode_pages2_range not to clear ret

2007-12-13 Thread Zach Brown

Hisashi Hifumi wrote:
 Hi.
 
 DIO invalidates page cache through invalidate_inode_pages2_range().
 invalidate_inode_pages2_range() sets ret=-EIO when
 invalidate_complete_page2()
 fails, but this ret is cleared if do_launder_page() succeed on a page of
 next index.
 In this case, dio is carried out even if invalidate_complete_page2()
 fails on some pages.
 This can cause inconsistency between memory and blocks on HDD because
 the page
 cache still exists.
 
 Following patch fixes this issue.

I like the idea of fixing this in a separate patch, yes, thanks!  But...

 
 Thanks.
 
 Signed-off-by :Hisashi Hifumi [EMAIL PROTECTED]
 
 diff -Nrup linux-2.6.24-rc5.org/mm/truncate.c
 linux-2.6.24-rc5/mm/truncate.c
 --- linux-2.6.24-rc5.org/mm/truncate.c2007-12-12 16:32:45.0
 +0900
 +++ linux-2.6.24-rc5/mm/truncate.c2007-12-13 11:45:29.0 +0900
 @@ -392,6 +392,7 @@ int invalidate_inode_pages2_range(struct
  pgoff_t next;
  int i;
  int ret = 0;
 +int ret2 = 0;
  int did_range_unmap = 0;
  int wrapped = 0;
 
 @@ -441,13 +442,13 @@ int invalidate_inode_pages2_range(struct
  BUG_ON(page_mapped(page));
  ret = do_launder_page(mapping, page);
  if (ret == 0  !invalidate_complete_page2(mapping, page))
 -ret = -EIO;
 +ret2 = -EIO;
  unlock_page(page);
  }
  pagevec_release(pvec);
  cond_resched();
  }
 -return ret;
 +return !ret ? ret2 : ret;
  }
  EXPORT_SYMBOL_GPL(invalidate_inode_pages2_range);

... this doesn't work. Notice that it only propagates the -EIO into ret2? It 
can lose errors from do_launder_page() itself because they aren't stored into 
ret2, which is what's returned if the last do_launder_page() succeeds.

This isn't I meant when I mentioned that 'ret2' pattern. The idea is to store 
the errors from inner calls into some private var and then only promote those 
errors as the function's return code (ret) if there was an error and ret wasn't 
already set.  We do this in a few places in dio, you'll notice.  This patch 
gets the meaning of 'ret' and 'ret2' opposite from those uses, which is kind of 
confusing.

So, uh, how about the following. Totally untested, but compiled.

---

invalidate_inode_pages2_range(): consistently return first error

Hisashi Hifumi noticed that we were losing errors in
invalidate_inode_pages2_range().  Later do_launder_page() calls could overwrite
errors generated by earlier calls.  Fix this by storing do_launder_page in a
temporary variable which is only promoted to the function's return code if it
hadn't already generated an error.

Signed-off-by: Zach Brown [EMAIL PROTECTED]
Cc: Hisashi Hifumi [EMAIL PROTECTED] 

diff --git a/mm/truncate.c b/mm/truncate.c
index cadc156..24578b3 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -392,6 +392,7 @@ int invalidate_inode_pages2_range(struct address_space 
*mapping,
pgoff_t next;
int i;
int ret = 0;
+   int rc;
int did_range_unmap = 0;
int wrapped = 0;
 
@@ -439,9 +440,11 @@ int invalidate_inode_pages2_range(struct address_space 
*mapping,
}
}
BUG_ON(page_mapped(page));
-   ret = do_launder_page(mapping, page);
-   if (ret == 0  !invalidate_complete_page2(mapping, 
page))
-   ret = -EIO;
+   rc = do_launder_page(mapping, page);
+   if (rc == 0  !invalidate_complete_page2(mapping, 
page))
+   rc = -EIO;
+   if (rc  ret == 0)
+   ret = rc;
unlock_page(page);
}
pagevec_release(pvec);

  
 
 -
 To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] dio: falling through to buffered I/O when invalidation of a page fails

2007-12-14 Thread Zach Brown


 If anyone has a testcase - I can take a look at the problem again.

I can try and throw something together..

- z
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: lockdep warning with LTP dio test (v2.6.24-rc6-125-g5356f66)

2008-01-02 Thread Zach Brown

Erez Zadok wrote:
 Setting: ltp-full-20071031, dio01 test on ext3 with Linus's latest tree.
 Kernel w/ SMP, preemption, and lockdep configured.

This is a real lock ordering problem.  Thanks for reporting it.

The updating of atime inside sys_mmap() orders the mmap_sem in the vfs
outside of the journal handle in ext3's inode dirtying:

 - #1 (jbd_handle){--..}:
[c023068f] __lock_acquire+0x9cc/0xb95
[c0230c2f] lock_acquire+0x5f/0x78
[c029c7e9] journal_start+0xee/0xf8
[c02959d6] ext3_journal_start_sb+0x48/0x4a
[c029152b] ext3_dirty_inode+0x27/0x6c
[c026f701] __mark_inode_dirty+0x29/0x144
[c026817b] touch_atime+0xb7/0xbc
[c023af6d] generic_file_mmap+0x2d/0x42
[c024a5cc] mmap_region+0x1e6/0x3b4
[c024aa6b] do_mmap_pgoff+0x1fb/0x253
[c02067af] sys_mmap2+0x9b/0xb5
[c020275e] syscall_call+0x7/0xb
[] 0x

ext3_direct_IO() orders the journal handle outside of the mmap_sem that
dio_get_page() acquires to pin pages with get_user_pages():

 - #0 (mm-mmap_sem){}:
[c023057f] __lock_acquire+0x8bc/0xb95
[c0230c2f] lock_acquire+0x5f/0x78
[c0397d4f] down_read+0x3a/0x4c
[c02778a2] dio_get_page+0x4e/0x15d
[c0278352] __blockdev_direct_IO+0x431/0xa81
[c0291318] ext3_direct_IO+0x10c/0x1a1
[c023c091] generic_file_direct_IO+0x124/0x139
[c023c0fc] generic_file_direct_write+0x56/0x11c
[c023ca15] __generic_file_aio_write_nolock+0x33d/0x489
[c023cbb9] generic_file_aio_write+0x58/0xb6
[c028d4e3] ext3_file_write+0x27/0x99
[c0256d0f] do_sync_write+0xc5/0x102
[c0257463] vfs_write+0x90/0x119
[c0257a25] sys_write+0x3d/0x61
[c02026d6] sysenter_past_esp+0x5f/0xa5
[] 0x

Two fixes come to mind:

1) use something like Peter's -mmap_prepare() to update atime before
acquiring the mmap_sem.  ( http://lkml.org/lkml/2007/11/11/97 ).  I
don't know if this would leave more paths which do a journal_start()
while holding the mmap_sem.

2) rework ext3's dio to only hold the jbd handle in ext3_get_block().
Chris has a patch for this kicking around somewhere but I'm told it has
problems exposing old blocks in ordered data mode.

Does anyone have preferences?  I could go either way.  I certainly don't
like the idea of journal handles being held across the entirety of
fs/direct-io.c.  It's yet another case of O_DIRECT differing wildly from
the buffered path :(.

- z
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] aio: partial write should not return error code.

2008-01-03 Thread Zach Brown

Rusty Russell wrote:
 When an AIO write gets an error after writing some data (eg. ENOSPC),
 it should return the amount written already, not the error.  Just like
 write() is supposed to.

Andrew, please don't queue this fix.  I think the bug is valid but the
patch is subtly dangerous.

 diff -r 18802689361a fs/aio.c
 --- a/fs/aio.cThu Jan 03 15:22:24 2008 +1100
 +++ b/fs/aio.cThu Jan 03 18:05:25 2008 +1100
 @@ -1346,6 +1350,13 @@ static ssize_t aio_rw_vect_retry(struct 
   /* This means we must have transferred all that we could */
   /* No need to retry anymore */
   if ((ret == 0) || (iocb-ki_left == 0))
 + ret = iocb-ki_nbytes - iocb-ki_left;
 +
 + /* If we managed to write some out we return that, rather than
 +  * the eventual error. */
 + if (opcode == IOCB_CMD_PWRITEV
 +  ret  0
 +  iocb-ki_nbytes - iocb-ki_left)
   ret = iocb-ki_nbytes - iocb-ki_left;

This doesn't account for the (sigh) -EIOCB* error codes.  They must be
returned to the caller so that it can properly handle the iocb reference
counting.  Failure to do so can lead to oopses.

To be fair, I think you'll have a really hard time finding an
-aio_write() implementation which would return partial progress and
*then* one of the magical errnos.  But the infrastructure does allow it.

So maybe we could get a helper in aio.h that abstracts out the

(ret  0  ret != -EIOCBQUEUED  ret != -EIOCBRETRY)

condition.  Then I think this patch would be fine.

I assigned a bug to remind myself to revisit this if you aren't excited
by continuing with the patch:

  http://bugzilla.kernel.org/show_bug.cgi?id=9681

- z
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] aio: negative offset should return -EINVAL

2008-01-03 Thread Zach Brown

Rusty Russell wrote:
 An AIO read or write should return -EINVAL if the offset is negative.
 This check matches the one in pread and pwrite.
 
 This was found by the libaio test suite.
 
 Signed-off-by: Rusty Russell [EMAIL PROTECTED]

This looks fine to me.

Signed-off-by: Zach Brown [EMAIL PROTECTED]

- z
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] aio: partial write should not return error code.

2008-01-04 Thread Zach Brown


 
 No, that's fine, here is the new one:
 
 When an AIO write gets a non-retry error after writing some data
 (eg. ENOSPC), it should return the amount written already, not the
 error.  Just like write() is supposed to.
 
 This was found by the libaio test suite.
 
 Signed-off-by: Rusty Russell [EMAIL PROTECTED]

This looks good, feel free to push this from your tree.

Acked-By: Zach Brown [EMAIL PROTECTED]

- z
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 5/5] Make wait_on_retry_sync_kiocb killable

2007-10-25 Thread Zach Brown


Matthew Wilcox wrote:
 Use TASK_KILLABLE to allow wait_on_retry_sync_kiocb to return -EINTR.
 All callers then check the return value and break out of their loops.

This won't work because sync kiocbs are a nasty hack that don't follow
the (also nasty) refcounting patterns of the aio core.

-EIOCBRETRY means that aio_{read,write}() has taken on the IO kiocb
reference and has ensured that call kick_iocb() will be called in the
future.

Usually kick_iocb() would queue the kiocb to have its ki_retry method
called by the kernel aio threads while holding that reference.  But
sync kiocbs are on-stack and aren't reference counted.  kick_iocb() magic:

/* sync iocbs are easy: they can only ever be executing from a

 * single context. */
if (is_sync_kiocb(iocb)) {
kiocbSetKicked(iocb);
wake_up_process(iocb-ki_obj.tsk);

return;

}

So, with this patch, if we catch a signal and return from
wait_on_retry_sync_kiocb() and return from do_sync_{read,write}() then
that on-stack sync kiocb is going to be long gone when kick_iocb() goes
to work with it.

So the first step would be to make sync kiocbs real refcounted
structures so that kick_iocb() could find that the sync submitter has
disappeared.

But then we have to worry about leaving retrying operations in flight
after the sync submitter has returned from their system call.  They
might be VERY SURPRISED to find that a read() implemented with
do_sync_read() is still writing into their userspace pointer after the
syscall was interrupted by a signal.

This leads us to the possibility of working with the ki_cancel method to
stop a pending operation if a signal is caught from a sync submitter.
In practice, nothing sets ki_cancel.

And finally, this code will not be run in a solely mainline kernel.  The
only thing in mainline that returns -EIOCBRETRY is the goofy usb gadget.
 It has both -{read,write} and -aio_{read,write} file op methods so
vfs_{read,write}() will never call do_sync_{read,write}().  Sure,
out-of-tree aio providers (SDP?) might get caught up in this.

(Ha ha!  Welcome to fs/aio.c!)

So I'm not sure where to go with this.  It's a mess, but it doesn't seem
like anything is using it.  A significant clean up of the retry and
cancelation support in fs/aio.c is in flight.  Maybe we can revisit this
once that settles down.

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 01/31] Add an ERR_CAST() macro to complement ERR_PTR and co. [try #5]

2007-10-25 Thread Zach Brown


 + * ERR_CAST - Explicitly cast an error-valued pointer to another pointer type
 + * @ptr: The pointer to cast.
 + *
 + * Explicitly cast an error-valued pointer to another pointer type in such a
 + * way as to make it clear that's what's going on.
 + */
 +static inline void *ERR_CAST(const void *ptr)
 +{
 + return (void *) ptr;
 +}

Just to nit, surely you don't need the cast inside the function.  The
casting happens at the call site between the argument and returned pointer.

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 01/31] Add an ERR_CAST() macro to complement ERR_PTR and co. [try #5]

2007-10-25 Thread Zach Brown

Roland Dreier wrote:
+static inline void *ERR_CAST(const void *ptr)
+{
+return (void *) ptr;
+}
   
   Just to nit, surely you don't need the cast inside the function.  The
   casting happens at the call site between the argument and returned pointer.
 
 The way it's written you kinda do, since it takes a const void * and
 returns a plain void *.

Ah, right, I missed that.

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 1040 matches

Mail list logo