date:20061214

2.6.18-stable review patch.  If anyone has any objections, please let us know.
--

From: Jeet Chaudhuri <[EMAIL PROTECTED]>

We must reserve SAR + MAX_HEADER bytes for IrLMP to fit in.
This fixes an oops reported (and fixed) by Jeet Chaudhuri, when max_sdu_size
is greater than 0.

Signed-off-by: Samuel Ortiz <[EMAIL PROTECTED]>
Signed-off-by: David S. Miller <[EMAIL PROTECTED]>
Signed-off-by: Chris Wright <[EMAIL PROTECTED]>

---
 net/irda/irttp.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- linux-2.6.18.5.orig/net/irda/irttp.c
+++ linux-2.6.18.5/net/irda/irttp.c
@@ -1098,7 +1098,7 @@ int irttp_connect_request(struct tsap_cb
return -ENOMEM;
 
/* Reserve space for MUX_CONTROL and LAP header */
-   skb_reserve(tx_skb, TTP_MAX_HEADER);
+   skb_reserve(tx_skb, TTP_MAX_HEADER + TTP_SAR_HEADER);
} else {
tx_skb = userdata;
/*
@@ -1346,7 +1346,7 @@ int irttp_connect_response(struct tsap_c
return -ENOMEM;
 
/* Reserve space for MUX_CONTROL and LAP header */
-   skb_reserve(tx_skb, TTP_MAX_HEADER);
+   skb_reserve(tx_skb, TTP_MAX_HEADER + TTP_SAR_HEADER);
} else {
tx_skb = userdata;
/*

--
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 11/24] XFRM: Use output device disable_xfrm for forwarded packets

2.6.18-stable review patch.  If anyone has any objections, please let us know.
--

From: David Miller <[EMAIL PROTECTED]>

Currently the behaviour of disable_xfrm is inconsistent between
locally generated and forwarded packets. For locally generated
packets disable_xfrm disables the policy lookup if it is set on
the output device, for forwarded traffic however it looks at the
input device. This makes it impossible to disable xfrm on all
devices but a dummy device and use normal routing to direct
traffic to that device.

Always use the output device when checking disable_xfrm.

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>
Signed-off-by: David S. Miller <[EMAIL PROTECTED]>
Signed-off-by: Chris Wright <[EMAIL PROTECTED]>
---
commit 9be2b4e36fb04bbc968693ef95a75acc17cf2931
Author: Patrick McHardy <[EMAIL PROTECTED]>
Date:   Mon Dec 4 19:59:00 2006 -0800

 net/ipv4/route.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-2.6.18.5.orig/net/ipv4/route.c
+++ linux-2.6.18.5/net/ipv4/route.c
@@ -1775,7 +1775,7 @@ static inline int __mkroute_input(struct
 #endif
if (in_dev->cnf.no_policy)
rth->u.dst.flags |= DST_NOPOLICY;
-   if (in_dev->cnf.no_xfrm)
+   if (out_dev->cnf.no_xfrm)
rth->u.dst.flags |= DST_NOXFRM;
rth->fl.fl4_dst = daddr;
rth->rt_dst = daddr;

--
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 21/24] softirq: remove BUG_ONs which can incorrectly trigger

2.6.18-stable review patch.  If anyone has any objections, please let us know.
--

From: Zachary Amsden <[EMAIL PROTECTED]>

It is possible to have tasklets get scheduled before softirqd has had a chance
to spawn on all CPUs.  This is totally harmless; after success during action
CPU_UP_PREPARE, action CPU_ONLINE will be called, which immediately wakes
softirqd on the appropriate CPU to process the already pending tasklets.  So
there is no danger of having a missed wakeup for any tasklets that were
already pending.

In particular, i386 is affected by this during startup, and is visible when
using a very large initrd; during the time it takes for the initrd to be
decompressed, a timer IRQ can come in and schedule RCU callbacks.  It is also
possible that resending of a hardware IRQ via a softirq triggers the same bug.

Because of different timing conditions, this shows up in all emulators and
virtual machines tested, including Xen, VMware, Virtual PC, and Qemu.  It is
also possible to trigger on native hardware with a large enough initrd,
although I don't have a reliable case demonstrating that.

Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Cc: Ingo Molnar <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
Signed-off-by: Chris Wright <[EMAIL PROTECTED]>
---

 kernel/softirq.c |2 --
 1 file changed, 2 deletions(-)

--- linux-2.6.18.5.orig/kernel/softirq.c
+++ linux-2.6.18.5/kernel/softirq.c
@@ -574,8 +574,6 @@ static int __cpuinit cpu_callback(struct
 
switch (action) {
case CPU_UP_PREPARE:
-   BUG_ON(per_cpu(tasklet_vec, hotcpu).list);
-   BUG_ON(per_cpu(tasklet_hi_vec, hotcpu).list);
p = kthread_create(ksoftirqd, hcpu, "ksoftirqd/%d", hotcpu);
if (IS_ERR(p)) {
printk("ksoftirqd for %i failed\n", hotcpu);

--
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 09/24] PKT_SCHED act_gact: division by zero

2.6.18-stable review patch.  If anyone has any objections, please let us know.
--

From: David Miller <[EMAIL PROTECTED]>

Not returning -EINVAL, because someone might want to use the value
zero in some future gact_prob algorithm?

Signed-off-by: Kim Nordlund <[EMAIL PROTECTED]>
Signed-off-by: David S. Miller <[EMAIL PROTECTED]>
Signed-off-by: Chris Wright <[EMAIL PROTECTED]>
---
 net/sched/act_gact.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- linux-2.6.18.5.orig/net/sched/act_gact.c
+++ linux-2.6.18.5/net/sched/act_gact.c
@@ -54,14 +54,14 @@ static DEFINE_RWLOCK(gact_lock);
 #ifdef CONFIG_GACT_PROB
 static int gact_net_rand(struct tcf_gact *p)
 {
-   if (net_random()%p->pval)
+   if (!p->pval || net_random()%p->pval)
return p->action;
return p->paction;
 }
 
 static int gact_determ(struct tcf_gact *p)
 {
-   if (p->bstats.packets%p->pval)
+   if (!p->pval || p->bstats.packets%p->pval)
return p->action;
return p->paction;
 }

--
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[git patches] libata updates

2006-12-14 Thread Jeff Garzik

Includes some pre-window-closing stuff (new drivers), but I was waiting
on some bug fixes before pushing.

Please pull from 'upstream-linus' branch of
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev.git 
upstream-linus

to receive the following updates:

 drivers/ata/Kconfig|   20 ++-
 drivers/ata/Makefile   |2 +
 drivers/ata/ahci.c |2 -
 drivers/ata/ata_piix.c |   24 ++-
 drivers/ata/libata-core.c  |   16 +-
 drivers/ata/libata-scsi.c  |4 +-
 drivers/ata/pata_ali.c |4 +-
 drivers/ata/pata_cs5520.c  |2 +-
 drivers/ata/pata_cs5530.c  |8 +-
 drivers/ata/pata_hpt366.c  |2 +-
 drivers/ata/pata_hpt37x.c  |4 +-
 drivers/ata/pata_hpt3x3.c  |2 +-
 drivers/ata/pata_it8213.c  |  354 +
 drivers/ata/pata_jmicron.c |2 +-
 drivers/ata/pata_marvell.c |4 +-
 drivers/ata/pata_mpc52xx.c |  563 
 drivers/ata/pata_serverworks.c |2 +-
 drivers/ata/pata_sil680.c  |2 +-
 drivers/ata/pata_sis.c |2 +-
 drivers/ata/pata_via.c |   10 +-
 drivers/ata/pata_winbond.c |   16 +-
 drivers/ata/sata_nv.c  |   18 +-
 drivers/ata/sata_sis.c |   79 --
 drivers/ata/sata_svw.c |   41 +++-
 drivers/ata/sata_via.c |2 +-
 include/linux/libata.h |2 +-
 include/linux/pci_ids.h|1 +
 27 files changed, 1099 insertions(+), 89 deletions(-)
 create mode 100644 drivers/ata/pata_it8213.c
 create mode 100644 drivers/ata/pata_mpc52xx.c

Alan (1):
  pata_it8213: Add new driver for the IT8213 card

Jason Gaston (1):
  ata_piix: IDE mode SATA patch for Intel ICH9

Jeff Garzik (3):
  [libata] use kmap_atomic(KM_IRQ0) in SCSI simulator
  [libata] trim trailing whitespace
  [libata] sata_svw: Disable ATAPI DMA on current boards (errata workaround)

Sylvain Munaut (1):
  libata: Add support for the MPC52xx ATA controller

Tejun Heo (3):
  ata_piix: use piix_host_stop() in ich_pata_ops
  libata: don't initialize sg in ata_exec_internal() if DMA_NONE (take #2)
  ahci: do not mangle saved HOST_CAP while resetting controller

Uwe Koziolek (1):
  sata_sis: support SiS966/966L

diff --git a/drivers/ata/Kconfig b/drivers/ata/Kconfig
index 984ab28..fb1de86 100644
--- a/drivers/ata/Kconfig
+++ b/drivers/ata/Kconfig
@@ -292,7 +292,7 @@ config PATA_ISAPNP
  If unsure, say N.
 
 config PATA_IT821X
-   tristate "IT821x PATA support (Experimental)"
+   tristate "IT8211/2 PATA support (Experimental)"
depends on PCI && EXPERIMENTAL
help
  This option enables support for the ITE 8211 and 8212
@@ -301,6 +301,15 @@ config PATA_IT821X
 
  If unsure, say N.
 
+config PATA_IT8213
+   tristate "IT8213 PATA support (Experimental)"
+   depends on PCI && EXPERIMENTAL
+   help
+ This option enables support for the ITE 821 PATA
+  controllers via the new ATA layer.
+
+ If unsure, say N.
+
 config PATA_JMICRON
tristate "JMicron PATA support"
depends on PCI
@@ -337,6 +346,15 @@ config PATA_MARVELL
 
  If unsure, say N.
 
+config PATA_MPC52xx
+   tristate "Freescale MPC52xx SoC internal IDE"
+   depends on PPC_MPC52xx
+   help
+ This option enables support for integrated IDE controller
+ of the Freescale MPC52xx SoC.
+
+ If unsure, say N.
+
 config PATA_MPIIX
tristate "Intel PATA MPIIX support"
depends on PCI
diff --git a/drivers/ata/Makefile b/drivers/ata/Makefile
index bc3d81a..a0df15d 100644
--- a/drivers/ata/Makefile
+++ b/drivers/ata/Makefile
@@ -33,11 +33,13 @@ obj-$(CONFIG_PATA_HPT3X2N)  += pata_hpt3x2n.o
 obj-$(CONFIG_PATA_HPT3X3)  += pata_hpt3x3.o
 obj-$(CONFIG_PATA_ISAPNP)  += pata_isapnp.o
 obj-$(CONFIG_PATA_IT821X)  += pata_it821x.o
+obj-$(CONFIG_PATA_IT8213)  += pata_it8213.o
 obj-$(CONFIG_PATA_JMICRON) += pata_jmicron.o
 obj-$(CONFIG_PATA_NETCELL) += pata_netcell.o
 obj-$(CONFIG_PATA_NS87410) += pata_ns87410.o
 obj-$(CONFIG_PATA_OPTI)+= pata_opti.o
 obj-$(CONFIG_PATA_OPTIDMA) += pata_optidma.o
+obj-$(CONFIG_PATA_MPC52xx) += pata_mpc52xx.o
 obj-$(CONFIG_PATA_MARVELL) += pata_marvell.o
 obj-$(CONFIG_PATA_MPIIX)   += pata_mpiix.o
 obj-$(CONFIG_PATA_OLDPIIX) += pata_oldpiix.o
diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index f36da48..dbae6d9 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -645,8 +645,6 @@ static int ahci_reset_controller(void __iomem *mmio, struct 
pci_dev *pdev)
u32 cap_save, impl_save, tmp;
 
cap_save = readl(mmio + HOST_CAP);
-   cap_save &= ( (1<<28) | (1<<17) );
-   cap_save |= (1 << 27);
impl_save = readl(mmio + HOST_PORTS_IMPL);
 
/* global controller reset */
diff --git a/drivers/ata/ata_piix.c b/drivers/ata/ata_piix.c
index

Re: [PATCH] ata_piix: use piix_host_stop() in ich_pata_ops

2006-12-14 Thread Catalin Marinas

On 13/12/06, Catalin Marinas <[EMAIL PROTECTED]> wrote:

On 11/12/06, Tejun Heo <[EMAIL PROTECTED]> wrote:
> piix_init_one() allocates host private data which should be freed by
> piix_host_stop().  ich_pata_ops wasn't converted to piix_host_stop()
> while merging, leaking 4 bytes on driver detach.  Fix it.

I tried your patch last night but the leak is still reported. I need
to investigate further and put some printk's in the piix_host_stop
function to check whether the freeing really takes place.

The piix_host_stop() isn't actually called on my machine, so this is
not the cause of the leak.

What causes the leak seem to be the error returned by
ata_pci_init_one() called from piix_init_one(). These are the related
kernel messages:

ata_piix :00:1f.1: version 2.00ac7
ACPI: PCI Interrupt :00:1f.1[A] -> GSI 16 (level, low) -> IRQ 18
PCI: Unable to reserve I/O region #1:[EMAIL PROTECTED] for device :00:1f.1
ata_piix: probe of :00:1f.1 failed with error -16

I think the call to ata_pci_init_one() should be followed by some
clean-up in case it fails. There is also another return without
clean-up in piix_init_one() after the call to piix_disable_ahci(). I
don't have time to try this now.

--
Catalin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC: 2.6 patch] simplify drivers/md/md.c:update_size()

2006-12-14 Thread Adrian Bunk

While looking at commit 8ddeeae51f2f197b4fafcba117ee8191b49d843e,
I got the impression that this commit couldn't fix anything, since the 
"size" variable can't be changed before "fit" gets used.

Is there any big thinko, or is the patch below that slightly simplifies 
update_size() semantically equivalent to the current code?

Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>

---

 drivers/md/md.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- linux-2.6.19-mm1/drivers/md/md.c.old2006-12-15 00:57:05.0 
+0100
+++ linux-2.6.19-mm1/drivers/md/md.c2006-12-15 00:57:42.0 +0100
@@ -4039,57 +4039,56 @@
 * Generate a 128 bit UUID
 */
get_random_bytes(mddev->uuid, 16);
 
mddev->new_level = mddev->level;
mddev->new_chunk = mddev->chunk_size;
mddev->new_layout = mddev->layout;
mddev->delta_disks = 0;
 
mddev->dead = 0;
return 0;
 }
 
 static int update_size(mddev_t *mddev, unsigned long size)
 {
mdk_rdev_t * rdev;
int rv;
struct list_head *tmp;
-   int fit = (size == 0);
 
if (mddev->pers->resize == NULL)
return -EINVAL;
/* The "size" is the amount of each device that is used.
 * This can only make sense for arrays with redundancy.
 * linear and raid0 always use whatever space is available
 * We can only consider changing the size if no resync
 * or reconstruction is happening, and if the new size
 * is acceptable. It must fit before the sb_offset or,
 * if that is sync_thread)
return -EBUSY;
ITERATE_RDEV(mddev,rdev,tmp) {
sector_t avail;
avail = rdev->size * 2;
 
-   if (fit && (size == 0 || size > avail/2))
+   if (size == 0)
size = avail/2;
if (avail < ((sector_t)size << 1))
return -ENOSPC;
}
rv = mddev->pers->resize(mddev, (sector_t)size *2);
if (!rv) {
struct block_device *bdev;
 
bdev = bdget_disk(mddev->gendisk, 0);
if (bdev) {
mutex_lock(>bd_inode->i_mutex);
i_size_write(bdev->bd_inode, (loff_t)mddev->array_size 
<< 10);
mutex_unlock(>bd_inode->i_mutex);
bdput(bdev);
}
}
return rv;
 }
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC: 2.6 patch] simplify drivers/md/md.c:update_size()

2006-12-14 Thread Doug Ledford

On Fri, 2006-12-15 at 01:19 +0100, Adrian Bunk wrote:
> While looking at commit 8ddeeae51f2f197b4fafcba117ee8191b49d843e,
> I got the impression that this commit couldn't fix anything, since the 
> "size" variable can't be changed before "fit" gets used.
> 
> Is there any big thinko, or is the patch below that slightly simplifies 
> update_size() semantically equivalent to the current code?

No, this patch is broken.  Where it fails is specifically the case where
you want to autofit the largest possible size, you have different size
devices, and the first device is not the smallest.  When you hit the
first device, you will set size, then as you repeat the ITERATE_RDEV
loop, when you hit the smaller device, size will be non-0 and you'll
then trigger the later if and return -ENOSPC.  In the case of autofit,
you have to preserve the fit variable instead of looking at size so you
know whether or not to modify the size when you hit a smaller device
later in the list.

> Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>
> 
> ---
> 
>  drivers/md/md.c |3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> --- linux-2.6.19-mm1/drivers/md/md.c.old  2006-12-15 00:57:05.0 
> +0100
> +++ linux-2.6.19-mm1/drivers/md/md.c  2006-12-15 00:57:42.0 +0100
> @@ -4039,57 +4039,56 @@
>* Generate a 128 bit UUID
>*/
>   get_random_bytes(mddev->uuid, 16);
>  
>   mddev->new_level = mddev->level;
>   mddev->new_chunk = mddev->chunk_size;
>   mddev->new_layout = mddev->layout;
>   mddev->delta_disks = 0;
>  
>   mddev->dead = 0;
>   return 0;
>  }
>  
>  static int update_size(mddev_t *mddev, unsigned long size)
>  {
>   mdk_rdev_t * rdev;
>   int rv;
>   struct list_head *tmp;
> - int fit = (size == 0);
>  
>   if (mddev->pers->resize == NULL)
>   return -EINVAL;
>   /* The "size" is the amount of each device that is used.
>* This can only make sense for arrays with redundancy.
>* linear and raid0 always use whatever space is available
>* We can only consider changing the size if no resync
>* or reconstruction is happening, and if the new size
>* is acceptable. It must fit before the sb_offset or,
>* if that is * size of each device.
>* If size is zero, we find the largest size that fits.
>*/
>   if (mddev->sync_thread)
>   return -EBUSY;
>   ITERATE_RDEV(mddev,rdev,tmp) {
>   sector_t avail;
>   avail = rdev->size * 2;
>  
> - if (fit && (size == 0 || size > avail/2))
> + if (size == 0)
>   size = avail/2;
>   if (avail < ((sector_t)size << 1))
>   return -ENOSPC;
>   }
>   rv = mddev->pers->resize(mddev, (sector_t)size *2);
>   if (!rv) {
>   struct block_device *bdev;
>  
>   bdev = bdget_disk(mddev->gendisk, 0);
>   if (bdev) {
>   mutex_lock(>bd_inode->i_mutex);
>   i_size_write(bdev->bd_inode, (loff_t)mddev->array_size 
> << 10);
>   mutex_unlock(>bd_inode->i_mutex);
>   bdput(bdev);
>   }
>   }
>   return rv;
>  }
-- 
Doug Ledford <[EMAIL PROTECTED]>
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part

Re: [RFC: 2.6 patch] simplify drivers/md/md.c:update_size()

2006-12-14 Thread Adrian Bunk

On Thu, Dec 14, 2006 at 07:36:35PM -0500, Doug Ledford wrote:
> On Fri, 2006-12-15 at 01:19 +0100, Adrian Bunk wrote:
> > While looking at commit 8ddeeae51f2f197b4fafcba117ee8191b49d843e,
> > I got the impression that this commit couldn't fix anything, since the 
> > "size" variable can't be changed before "fit" gets used.
> > 
> > Is there any big thinko, or is the patch below that slightly simplifies 
> > update_size() semantically equivalent to the current code?
> 
> No, this patch is broken.  Where it fails is specifically the case where
> you want to autofit the largest possible size, you have different size
> devices, and the first device is not the smallest.  When you hit the
> first device, you will set size, then as you repeat the ITERATE_RDEV
> loop, when you hit the smaller device, size will be non-0 and you'll
> then trigger the later if and return -ENOSPC.  In the case of autofit,
> you have to preserve the fit variable instead of looking at size so you
> know whether or not to modify the size when you hit a smaller device
> later in the list.
>...

OK, sorry, I've got my thinko:

ITERATE_RDEV() is a loop.

That's what I missed.

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [git patches] libata updates

2006-12-14 Thread Alan

>  
> +config PATA_IT8213
> + tristate "IT8213 PATA support (Experimental)"
> + depends on PCI && EXPERIMENTAL
> + help
> +   This option enables support for the ITE 821 PATA

Typo (IT8213) - probably my fault but only just noticed it


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v4 06/13] Completion Queues


Functions to manipulate CQs.

Signed-off-by: Steve Wise <[EMAIL PROTECTED]>
---

 drivers/infiniband/hw/cxgb3/iwch_cq.c |  231 +
 1 files changed, 231 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_cq.c 
b/drivers/infiniband/hw/cxgb3/iwch_cq.c
new file mode 100644
index 000..9d82df4
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_cq.c
@@ -0,0 +1,231 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include "iwch_provider.h"
+#include "iwch.h"
+
+/*
+ * Get one cq entry from cxio and map it to openib.
+ *
+ * Returns:
+ * 0   EMPTY;
+ * 1   cqe returned
+ * -EAGAIN caller must try again
+ * any other -errnofatal error
+ */
+int iwch_poll_cq_one(struct iwch_dev *rhp, struct iwch_cq *chp,
+struct ib_wc *wc)
+{
+   struct iwch_qp *qhp = NULL;
+   struct t3_cqe cqe, *rd_cqe;
+   struct t3_wq *wq;
+   u32 credit = 0;
+   u8 cqe_flushed;
+   u64 cookie;
+   int ret = 1;
+
+   rd_cqe = cxio_next_cqe(>cq);
+
+   if (!rd_cqe)
+   return 0;
+
+   qhp = get_qhp(rhp, CQE_QPID(*rd_cqe));
+   if (!qhp)
+   wq = NULL;
+   else {
+   spin_lock(>lock);
+   wq = &(qhp->wq);
+   }
+   ret = cxio_poll_cq(wq, &(chp->cq), , _flushed, ,
+  );
+   if (t3a_device(chp->rhp) && credit) {
+   PDBG("%s updating %d cq credits on id %d\n", __FUNCTION__, 
+credit, chp->cq.cqid);
+   cxio_hal_cq_op(>rdev, >cq, CQ_CREDIT_UPDATE, credit);
+   }
+
+   if (ret) {
+   ret = -EAGAIN;
+   goto out;
+   }
+   ret = 1;
+
+   wc->wr_id = cookie;
+   wc->qp_num = qhp->wq.qpid;
+   wc->vendor_err = CQE_STATUS(cqe);
+
+   PDBG("%s qpid 0x%x type %d opcode %d status 0x%x wrid hi 0x%x "
+"lo 0x%x cookie 0x%llx\n", __FUNCTION__, 
+CQE_QPID(cqe), CQE_TYPE(cqe),
+CQE_OPCODE(cqe), CQE_STATUS(cqe), CQE_WRID_HI(cqe),
+CQE_WRID_LOW(cqe), cookie);
+
+   if (CQE_TYPE(cqe) == 0) {
+   if (!CQE_STATUS(cqe))
+   wc->byte_len = CQE_LEN(cqe);
+   else
+   wc->byte_len = 0;
+   wc->opcode = IB_WC_RECV;
+   } else {
+   switch (CQE_OPCODE(cqe)) {
+   case T3_RDMA_WRITE:
+   wc->opcode = IB_WC_RDMA_WRITE;
+   break;
+   case T3_READ_REQ:
+   wc->opcode = IB_WC_RDMA_READ;
+   wc->byte_len = CQE_LEN(cqe);
+   break;
+   case T3_SEND:
+   case T3_SEND_WITH_SE:
+   wc->opcode = IB_WC_SEND;
+   break;
+   case T3_BIND_MW:
+   wc->opcode = IB_WC_BIND_MW;
+   break;
+
+   /* these aren't supported yet */
+   case T3_SEND_WITH_INV:
+   case T3_SEND_WITH_SE_INV:
+   case T3_LOCAL_INV:
+   case T3_FAST_REGISTER:
+   default:
+   printk(KERN_ERR MOD "Unexpected opcode %d "
+  "in the CQE received for QPID=0x%0x\n", 
+  CQE_OPCODE(cqe), CQE_QPID(cqe));
+   ret = -EINVAL;
+

[PATCH v4 07/13] Async Event Handler


Code to handle async events coming from the T3 RDMA Core.

Signed-off-by: Steve Wise <[EMAIL PROTECTED]>
---

 drivers/infiniband/hw/cxgb3/iwch_ev.c |  231 +
 1 files changed, 231 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_ev.c 
b/drivers/infiniband/hw/cxgb3/iwch_ev.c
new file mode 100644
index 000..b0bd014
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_ev.c
@@ -0,0 +1,231 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include 
+#include 
+#include 
+#include "iwch_provider.h"
+#include "iwch.h"
+#include "iwch_cm.h"
+#include "cxio_hal.h"
+#include "cxio_wr.h"
+
+static void post_qp_event(struct iwch_dev *rnicp, struct iwch_cq *chp,
+ struct respQ_msg_t *rsp_msg,
+ enum ib_event_type ib_event, 
+ int send_term)
+{
+   struct ib_event event;
+   struct iwch_qp_attributes attrs;
+   struct iwch_qp *qhp;
+
+   printk(KERN_ERR "%s - AE qpid 0x%x opcode %d status 0x%x "
+  "type %d wrid.hi 0x%x wrid.lo 0x%x \n", __FUNCTION__, 
+  CQE_QPID(rsp_msg->cqe), CQE_OPCODE(rsp_msg->cqe), 
+  CQE_STATUS(rsp_msg->cqe), CQE_TYPE(rsp_msg->cqe),
+  CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe));
+
+   spin_lock(>lock);
+   qhp = get_qhp(rnicp, CQE_QPID(rsp_msg->cqe));
+
+   if (!qhp) {
+   printk(KERN_ERR "%s unaffiliated error 0x%x qpid 0x%x\n", 
+  __FUNCTION__, CQE_STATUS(rsp_msg->cqe), 
+  CQE_QPID(rsp_msg->cqe));
+   spin_unlock(>lock);
+   return;
+   }
+
+   if ((qhp->attr.state == IWCH_QP_STATE_ERROR) ||
+   (qhp->attr.state == IWCH_QP_STATE_TERMINATE)) {
+   PDBG("%s AE received after RTS - "
+"qp state %d qpid 0x%x status 0x%x\n", __FUNCTION__, 
+qhp->attr.state, qhp->wq.qpid, CQE_STATUS(rsp_msg->cqe));
+   spin_unlock(>lock);
+   return;
+   }
+
+   atomic_inc(>refcnt);
+   spin_unlock(>lock);
+
+   event.event = ib_event;
+   event.device = chp->ibcq.device;
+   if (ib_event == IB_EVENT_CQ_ERR)
+   event.element.cq = >ibcq;
+   else 
+   event.element.qp = >ibqp;
+
+   if (qhp->ibqp.event_handler)
+   (*qhp->ibqp.event_handler)(, qhp->ibqp.qp_context);
+
+   if (qhp->attr.state == IWCH_QP_STATE_RTS) {
+   attrs.next_state = IWCH_QP_STATE_TERMINATE;
+   iwch_modify_qp(qhp->rhp, qhp, IWCH_QP_ATTR_NEXT_STATE, 
+  , 1);
+   if (send_term)
+   iwch_post_terminate(qhp, rsp_msg);
+   } 
+
+   if (atomic_dec_and_test(>refcnt))
+   wake_up(>wait);
+}
+
+void iwch_ev_dispatch(struct cxio_rdev *rdev_p, struct sk_buff *skb)
+{
+   struct iwch_dev *rnicp;
+   struct respQ_msg_t *rsp_msg = (struct respQ_msg_t *) skb->data;
+   struct iwch_cq *chp;
+   struct iwch_qp *qhp;
+   u32 cqid = RSPQ_CQID(rsp_msg);
+
+   rnicp = (struct iwch_dev *) rdev_p->ulp;
+   spin_lock(>lock);
+   chp = get_chp(rnicp, cqid);
+   qhp = get_qhp(rnicp, CQE_QPID(rsp_msg->cqe));
+   if (!chp || !qhp) {
+   printk(KERN_ERR MOD "BAD AE cqid 0x%x qpid 0x%x opcode %d "
+  "status 0x%x type %d wrid.hi 0x%x wrid.lo 0x%x \n", 
+  cqid,

[PATCH v4 02/13] Device Discovery and ULLD Linkage


Code to discover all the T3 devices and register them 
with the T3 RDMA Core and the Linux RDMA Core.

Signed-off-by: Steve Wise <[EMAIL PROTECTED]>
---

 drivers/infiniband/hw/cxgb3/iwch.c |  189 
 drivers/infiniband/hw/cxgb3/iwch.h |  175 +
 2 files changed, 364 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch.c 
b/drivers/infiniband/hw/cxgb3/iwch.c
new file mode 100644
index 000..acbe449
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch.c
@@ -0,0 +1,189 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include 
+#include 
+
+#include 
+
+#include "cxgb3_offload.h"
+#include "iwch_provider.h"
+#include "iwch_user.h"
+#include "iwch.h"
+#include "iwch_cm.h"
+
+#define DRV_VERSION "1.1"
+
+MODULE_AUTHOR("Boyd Faulkner, Steve Wise");
+MODULE_DESCRIPTION("Chelsio T3 RDMA Driver");
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_VERSION(DRV_VERSION);
+
+cxgb3_cpl_handler_func t3c_handlers[NUM_CPL_CMDS];
+
+static void open_rnic_dev(struct t3cdev *);
+static void close_rnic_dev(struct t3cdev *);
+
+struct cxgb3_client t3c_client = {
+   .name = "iw_cxgb3",
+   .add = open_rnic_dev,
+   .remove = close_rnic_dev,
+   .handlers = t3c_handlers,
+   .redirect = iwch_ep_redirect
+};
+
+static LIST_HEAD(dev_list);
+static DEFINE_MUTEX(dev_mutex);
+
+static void rnic_init(struct iwch_dev *rnicp)
+{
+   PDBG("%s iwch_dev %p\n", __FUNCTION__,  rnicp);
+   idr_init(>cqidr);
+   idr_init(>qpidr);
+   idr_init(>mmidr);
+   spin_lock_init(>lock);
+
+   rnicp->attr.vendor_id = 0x168;
+   rnicp->attr.vendor_part_id = 7;
+   rnicp->attr.max_qps = T3_MAX_NUM_QP - 32;
+   rnicp->attr.max_wrs = (1UL << 24) - 1;
+   rnicp->attr.max_sge_per_wr = T3_MAX_SGE;
+   rnicp->attr.max_sge_per_rdma_write_wr = T3_MAX_SGE;
+   rnicp->attr.max_cqs = T3_MAX_NUM_CQ - 1;
+   rnicp->attr.max_cqes_per_cq = (1UL << 24) - 1;
+   rnicp->attr.max_mem_regs = cxio_num_stags(>rdev);
+   rnicp->attr.max_phys_buf_entries = T3_MAX_PBL_SIZE;
+   rnicp->attr.max_pds = T3_MAX_NUM_PD - 1;
+   rnicp->attr.mem_pgsizes_bitmask = 0x7FFF;   /* 4KB-128MB */
+   rnicp->attr.can_resize_wq = 0;
+   rnicp->attr.max_rdma_reads_per_qp = 8;
+   rnicp->attr.max_rdma_read_resources =
+   rnicp->attr.max_rdma_reads_per_qp * rnicp->attr.max_qps;
+   rnicp->attr.max_rdma_read_qp_depth = 8; /* IRD */
+   rnicp->attr.max_rdma_read_depth =
+   rnicp->attr.max_rdma_read_qp_depth * rnicp->attr.max_qps;
+   rnicp->attr.rq_overflow_handled = 0;
+   rnicp->attr.can_modify_ird = 0;
+   rnicp->attr.can_modify_ord = 0;
+   rnicp->attr.max_mem_windows = rnicp->attr.max_mem_regs - 1;
+   rnicp->attr.stag0_value = 1;
+   rnicp->attr.zbva_support = 1;
+   rnicp->attr.local_invalidate_fence = 1;
+   rnicp->attr.cq_overflow_detection = 1;
+   return;
+}
+
+static void open_rnic_dev(struct t3cdev *tdev)
+{
+   struct iwch_dev *rnicp;
+   static int vers_printed;
+
+   PDBG("%s t3cdev %p\n", __FUNCTION__,  tdev);
+   if (!vers_printed++) 
+   printk(KERN_INFO MOD "Chelsio T3 RDMA Driver - version %s\n",
+  DRV_VERSION);
+   rnicp = (struct iwch_dev *)ib_alloc_device(sizeof(*rnicp));
+   if (!rnicp) {
+   printk(KERN_ERR MOD "Cannot allocate ib device\n");
+   return;
+   }
+   rnicp->rdev.ulp = rnicp;
+

[PATCH v4 05/13] Queue Pairs


Code to manipulate the QP.

Signed-off-by: Steve Wise <[EMAIL PROTECTED]>
---

 drivers/infiniband/hw/cxgb3/iwch_qp.c | 1007 +
 1 files changed, 1007 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c 
b/drivers/infiniband/hw/cxgb3/iwch_qp.c
new file mode 100644
index 000..9f6b251
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c
@@ -0,0 +1,1007 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include "iwch_provider.h"
+#include "iwch.h"
+#include "iwch_cm.h"
+#include "cxio_hal.h"
+
+#define NO_SUPPORT -1
+
+static inline int iwch_build_rdma_send(union t3_wr *wqe, struct ib_send_wr *wr,
+  u8 * flit_cnt)
+{
+   int i;
+   u32 plen;
+
+   switch (wr->opcode) {
+   case IB_WR_SEND:
+   case IB_WR_SEND_WITH_IMM:
+   if (wr->send_flags & IB_SEND_SOLICITED)
+   wqe->send.rdmaop = T3_SEND_WITH_SE;
+   else
+   wqe->send.rdmaop = T3_SEND;
+   wqe->send.rem_stag = 0;
+   break;
+#if 0  /* Not currently supported */
+   case TYPE_SEND_INVALIDATE:
+   case TYPE_SEND_INVALIDATE_IMMEDIATE:
+   wqe->send.rdmaop = T3_SEND_WITH_INV;
+   wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey);
+   break;
+   case TYPE_SEND_SE_INVALIDATE:
+   wqe->send.rdmaop = T3_SEND_WITH_SE_INV;
+   wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey);
+   break;
+#endif
+   default:
+   break;
+   }
+   if (wr->num_sge > T3_MAX_SGE)
+   return -EINVAL;
+   wqe->send.reserved[0] = 0;
+   wqe->send.reserved[1] = 0;
+   wqe->send.reserved[2] = 0;
+   if (wr->opcode == IB_WR_SEND_WITH_IMM) {
+   plen = 4;
+   wqe->send.sgl[0].stag = wr->imm_data;
+   wqe->send.sgl[0].len = __constant_cpu_to_be32(0);
+   wqe->send.num_sgle = __constant_cpu_to_be32(0);
+   *flit_cnt = 5;
+   } else {
+   plen = 0;
+   for (i = 0; i < wr->num_sge; i++) {
+   if ((plen + wr->sg_list[i].length) < plen) {
+   return -EMSGSIZE;
+   }
+   plen += wr->sg_list[i].length;
+   wqe->send.sgl[i].stag =
+   cpu_to_be32(wr->sg_list[i].lkey);
+   wqe->send.sgl[i].len =
+   cpu_to_be32(wr->sg_list[i].length);
+   wqe->send.sgl[i].to = cpu_to_be64(wr->sg_list[i].addr);
+   }
+   wqe->send.num_sgle = cpu_to_be32(wr->num_sge);
+   *flit_cnt = 4 + ((wr->num_sge) << 1);
+   }
+   wqe->send.plen = cpu_to_be32(plen);
+   return 0;
+}
+
+static inline int iwch_build_rdma_write(union t3_wr *wqe, struct ib_send_wr 
*wr,
+   u8 *flit_cnt)
+{
+   int i;
+   u32 plen;
+   if (wr->num_sge > T3_MAX_SGE)
+   return -EINVAL;
+   wqe->write.rdmaop = T3_RDMA_WRITE;
+   wqe->write.reserved[0] = 0;
+   wqe->write.reserved[1] = 0;
+   wqe->write.reserved[2] = 0;
+   wqe->write.stag_sink = cpu_to_be32(wr->wr.rdma.rkey);
+   wqe->write.to_sink = cpu_to_be64(wr->wr.rdma.remote_addr);
+
+   if (wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) {
+   plen = 4;
+

[PATCH v4 03/13] Provider Methods and Data Structures


Provider methods to support the Linux RDMA verbs.

Signed-off-by: Steve Wise <[EMAIL PROTECTED]>
---

 drivers/infiniband/hw/cxgb3/iwch_provider.c | 1171 +++
 drivers/infiniband/hw/cxgb3/iwch_provider.h |  363 
 drivers/infiniband/hw/cxgb3/iwch_user.h |   68 ++
 3 files changed, 1602 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c 
b/drivers/infiniband/hw/cxgb3/iwch_provider.c
new file mode 100644
index 000..e9721b1
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -0,0 +1,1171 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include "iwch.h"
+#include "iwch_provider.h"
+#include "iwch_cm.h"
+#include "iwch_user.h"
+
+static int iwch_modify_port(struct ib_device *ibdev,
+   u8 port, int port_modify_mask,
+   struct ib_port_modify *props)
+{
+   return -ENOSYS;
+}
+
+static struct ib_ah *iwch_ah_create(struct ib_pd *pd,
+   struct ib_ah_attr *ah_attr)
+{
+   return ERR_PTR(-ENOSYS);
+}
+
+static int iwch_ah_destroy(struct ib_ah *ah)
+{
+   return -ENOSYS;
+}
+
+static int iwch_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 
lid)
+{
+   return -ENOSYS;
+}
+
+static int iwch_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 
lid)
+{
+   return -ENOSYS;
+}
+
+static int iwch_process_mad(struct ib_device *ibdev,
+   int mad_flags,
+   u8 port_num,
+   struct ib_wc *in_wc,
+   struct ib_grh *in_grh,
+   struct ib_mad *in_mad, struct ib_mad *out_mad)
+{
+   return -ENOSYS;
+}
+
+static int iwch_dealloc_ucontext(struct ib_ucontext *context)
+{
+   struct iwch_dev *rhp = to_iwch_dev(context->device);
+   struct iwch_ucontext *ucontext = to_iwch_ucontext(context);
+   PDBG("%s context %p\n", __FUNCTION__, context);
+   cxio_release_ucontext(>rdev, >uctx);
+   kfree(ucontext);
+   return 0;
+}
+
+static struct ib_ucontext *iwch_alloc_ucontext(struct ib_device *ibdev,
+   struct ib_udata *udata)
+{
+   struct iwch_ucontext *context;
+   struct iwch_dev *rhp = to_iwch_dev(ibdev);
+
+   PDBG("%s ibdev %p\n", __FUNCTION__, ibdev);
+   context = kmalloc(sizeof(*context), GFP_KERNEL);
+   if (!context)
+   return ERR_PTR(-ENOMEM);
+   cxio_init_ucontext(>rdev, >uctx);
+   INIT_LIST_HEAD(>mmaps);
+   spin_lock_init(>mmap_lock);
+   return >ibucontext;
+}
+
+static int iwch_destroy_cq(struct ib_cq *ib_cq)
+{
+   struct iwch_cq *chp;
+
+   PDBG("%s ib_cq %p\n", __FUNCTION__, ib_cq);
+   chp = to_iwch_cq(ib_cq);
+
+   remove_handle(chp->rhp, >rhp->cqidr, chp->cq.cqid);
+   atomic_dec(>refcnt);
+   wait_event(chp->wait, !atomic_read(>refcnt));
+
+   cxio_destroy_cq(>rhp->rdev, >cq);
+   kfree(chp);
+   return 0;
+}
+
+static struct ib_cq *iwch_create_cq(struct ib_device *ibdev, int entries,
+struct ib_ucontext *context,
+struct ib_udata *udata)
+{
+   struct iwch_dev *rhp;
+   struct iwch_cq *chp;
+   struct iwch_create_cq_resp

[PATCH v4 00/13] 2.6.20 Chelsio T3 RDMA Driver


Roland, 

I think this is ready to go once the ethernet driver is pulled in.

Version 4 changes:

- Cleaned up spacing in the Kconfig file
- Remove locking.txt file - its not needed
- Remove -O1 from the debug config option
- BugFix: support new LLD interface for dual-port adapters

Version 3 changes:

- BugFix: Don't use mutex inside of the mmap function.
- BugFix: Move QP to TERMINATE when TERMINATE AE is processed
- Support the new work queue design
- Merged up to linus's tree as of 12/8/2006
- Misc nits

Version 2 changes:

- Make code sparse endian clean
- Use IDRs for mapping QP and CQ IDs to structure pointers instead
  of arrays
- Clean up confusing bitfields
- Use random32() instead of local random function
- Use krefs to track endpoint reference counts
- Misc nits

-

The following series implements the Chelsio T3 iWARP/RDMA Driver to
be considered for inclusion in 2.6.20.  It depends on the Chelsio T3
Ethernet driver which is also under review now for 2.6.20. 

The latest Chelsio T3 Ethernet driver patch can be pulled from:

http://service.chelsio.com/kernel.org/cxgb3.patch.bz2

A complete GIT kernel tree with all the T3 drivers can be pulled from:

git://staging.openfabrics.org/~swise/cxgb3.git

Thanks,

Steve.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v4 13/13] Kconfig/Makefile


Signed-off-by: Steve Wise <[EMAIL PROTECTED]>
---

 drivers/infiniband/Kconfig   |1 +
 drivers/infiniband/Makefile  |1 +
 drivers/infiniband/hw/cxgb3/Kconfig  |   27 +++
 drivers/infiniband/hw/cxgb3/Makefile |   12 
 4 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 59b3932..06453ab 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -38,6 +38,7 @@ source "drivers/infiniband/hw/mthca/Kcon
 source "drivers/infiniband/hw/ipath/Kconfig"
 source "drivers/infiniband/hw/ehca/Kconfig"
 source "drivers/infiniband/hw/amso1100/Kconfig"
+source "drivers/infiniband/hw/cxgb3/Kconfig"
 
 source "drivers/infiniband/ulp/ipoib/Kconfig"
 
diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile
index 570b30a..69bdd55 100644
--- a/drivers/infiniband/Makefile
+++ b/drivers/infiniband/Makefile
@@ -3,6 +3,7 @@ obj-$(CONFIG_INFINIBAND_MTHCA)  += hw/mt
 obj-$(CONFIG_INFINIBAND_IPATH) += hw/ipath/
 obj-$(CONFIG_INFINIBAND_EHCA)  += hw/ehca/
 obj-$(CONFIG_INFINIBAND_AMSO1100)  += hw/amso1100/
+obj-$(CONFIG_INFINIBAND_CXGB3) += hw/cxgb3/
 obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/
 obj-$(CONFIG_INFINIBAND_SRP)   += ulp/srp/
 obj-$(CONFIG_INFINIBAND_ISER)  += ulp/iser/
diff --git a/drivers/infiniband/hw/cxgb3/Kconfig 
b/drivers/infiniband/hw/cxgb3/Kconfig
new file mode 100644
index 000..d3db264
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/Kconfig
@@ -0,0 +1,27 @@
+config INFINIBAND_CXGB3
+   tristate "Chelsio RDMA Driver"
+   depends on CHELSIO_T3 && INFINIBAND
+   select GENERIC_ALLOCATOR
+   ---help---
+ This is an iWARP/RDMA driver for the Chelsio T3 1GbE and
+ 10GbE adapters.
+
+ For general information about Chelsio and our products, visit
+ our website at .
+
+ For customer support, please visit our customer support page at
+ .
+
+ Please send feedback to <[EMAIL PROTECTED]>.
+
+ To compile this driver as a module, choose M here: the module
+ will be called iw_cxgb3.
+
+config INFINIBAND_CXGB3_DEBUG
+   bool "Verbose debugging output"
+   depends on INFINIBAND_CXGB3
+   default n
+   ---help---
+ This option causes the Chelsio RDMA driver to produce copious
+ amounts of debug messages.  Select this if you are developing
+ the driver or trying to diagnose a problem.
diff --git a/drivers/infiniband/hw/cxgb3/Makefile 
b/drivers/infiniband/hw/cxgb3/Makefile
new file mode 100644
index 000..7a89f6d
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/Makefile
@@ -0,0 +1,12 @@
+EXTRA_CFLAGS += -I$(TOPDIR)/drivers/net/cxgb3 \
+   -I$(TOPDIR)/drivers/infiniband/hw/cxgb3/core 
+
+obj-$(CONFIG_INFINIBAND_CXGB3) += iw_cxgb3.o
+
+iw_cxgb3-y :=  iwch_cm.o iwch_ev.o iwch_cq.o iwch_qp.o iwch_mem.o \
+  iwch_provider.o iwch.o core/cxio_hal.o core/cxio_resource.o
+
+ifdef CONFIG_INFINIBAND_CXGB3_DEBUG
+EXTRA_CFLAGS += -DDEBUG -g 
+iw_cxgb3-y += core/cxio_dbg.o
+endif
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v4 12/13] Core Debug functions


Debug code to dump various data structs, some of which are in 
adapter memory.

Signed-off-by: Steve Wise <[EMAIL PROTECTED]>
---

 drivers/infiniband/hw/cxgb3/core/cxio_dbg.c |  205 +++
 1 files changed, 205 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c 
b/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c
new file mode 100644
index 000..22f4f75
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c
@@ -0,0 +1,205 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifdef DEBUG
+#include 
+#include "common.h"
+#include "cxgb3_ioctl.h"
+#include "cxio_hal.h"
+#include "cxio_wr.h"
+
+void cxio_dump_tpt(struct cxio_rdev *rdev, u32 stag) 
+{
+   struct ch_mem_range *m;
+   u64 *data;
+   int rc;
+   int size = 32;
+
+   m = kmalloc(sizeof(*m) + size, GFP_ATOMIC);
+   if (!m) {
+   PDBG("%s couldn't allocate memory.\n", __FUNCTION__);
+   return;
+   }
+   m->mem_id = MEM_PMRX;
+   m->addr = (stag>>8) * 32 + rdev->rnic_info.tpt_base;
+   m->len = size;
+   PDBG("%s TPT addr 0x%x len %d\n", __FUNCTION__, m->addr, m->len);
+   rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m);
+   if (rc) {
+   PDBG("%s toectl returned error %d\n", __FUNCTION__, rc);
+   kfree(m);
+   return;
+   }
+
+   data = (u64 *)m->buf;
+   while (size > 0) {
+   PDBG("TPT %08x: %016llx\n", m->addr, (u64)*data);
+   size -= 8;
+   data++;
+   m->addr += 8;
+   }
+   kfree(m);
+}
+
+void cxio_dump_pbl(struct cxio_rdev *rdev, u32 pbl_addr, uint len, u8 shift)
+{
+   struct ch_mem_range *m;
+   u64 *data;
+   int rc;
+   int size, npages;
+
+   shift += 12;
+   npages = (len + (1ULL << shift) - 1) >> shift;
+   size = npages * sizeof(u64);
+
+   m = kmalloc(sizeof(*m) + size, GFP_ATOMIC);
+   if (!m) {
+   PDBG("%s couldn't allocate memory.\n", __FUNCTION__);
+   return;
+   }
+   m->mem_id = MEM_PMRX;
+   m->addr = pbl_addr;
+   m->len = size;
+   PDBG("%s PBL addr 0x%x len %d depth %d\n", 
+   __FUNCTION__, m->addr, m->len, npages);
+   rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m);
+   if (rc) {
+   PDBG("%s toectl returned error %d\n", __FUNCTION__, rc);
+   kfree(m);
+   return;
+   }
+
+   data = (u64 *)m->buf;
+   while (size > 0) {
+   PDBG("PBL %08x: %016llx\n", m->addr, (u64)*data);
+   size -= 8;
+   data++;
+   m->addr += 8;
+   }
+   kfree(m);
+}
+
+void cxio_dump_wqe(union t3_wr *wqe)
+{
+   __be64 *data = (__be64 *)wqe;
+   uint size = (uint)(be64_to_cpu(*data) & 0xff);
+
+   if (size == 0) 
+   size = 8;
+   while (size > 0) {
+   PDBG("WQE %p: %016llx\n", data, be64_to_cpu(*data));
+   size--;
+   data++;
+   }
+}
+
+void cxio_dump_wce(struct t3_cqe *wce)
+{
+   __be64 *data = (__be64 *)wce;
+   int size = sizeof(*wce);
+
+   while (size > 0) {
+   PDBG("WCE %p: %016llx\n", data, be64_to_cpu(*data));
+   size -= 8;
+   data++;
+   }
+}
+
+void cxio_dump_rqt(struct cxio_rdev *rdev, u32 hwtid, int nents)
+{
+   struct ch_mem_range *m;
+   int size = nents * 64;
+   u64 *data;
+

[PATCH v4 10/13] Core HAL


The RDMA Core interfaces with the T3 HW and ULLD providing a low level
RDMA interface.

Signed-off-by: Steve Wise <[EMAIL PROTECTED]>
---

 drivers/infiniband/hw/cxgb3/core/cxio_hal.c | 1302 +++
 drivers/infiniband/hw/cxgb3/core/cxio_hal.h |  201 
 2 files changed, 1503 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_hal.c 
b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c
new file mode 100644
index 000..ffc4ec0
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c
@@ -0,0 +1,1302 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+#include "cxio_resource.h"
+#include "cxio_hal.h"
+#include "cxgb3_offload.h"
+#include "sge_defs.h"
+
+static struct cxio_rdev *rdev_tbl[T3_MAX_NUM_RNIC];
+static cxio_hal_ev_callback_func_t cxio_ev_cb = NULL;
+
+static inline struct cxio_rdev *cxio_hal_find_rdev_by_name(char *dev_name)
+{
+   int i;
+   for (i = 0; i < T3_MAX_NUM_RNIC; i++)
+   if (rdev_tbl[i])
+   if (!strcmp(rdev_tbl[i]->dev_name, dev_name))
+   return rdev_tbl[i];
+   return NULL;
+}
+
+static inline struct cxio_rdev *cxio_hal_find_rdev_by_t3cdev(struct t3cdev
+*tdev)
+{
+   int i;
+   for (i = 0; i < T3_MAX_NUM_RNIC; i++)
+   if (rdev_tbl[i])
+   if (rdev_tbl[i]->t3cdev_p == tdev)
+   return rdev_tbl[i];
+   return NULL;
+}
+
+static inline int cxio_hal_add_rdev(struct cxio_rdev *rdev_p)
+{
+   int i;
+   for (i = 0; i < T3_MAX_NUM_RNIC; i++)
+   if (!rdev_tbl[i]) {
+   rdev_tbl[i] = rdev_p;
+   break;
+   }
+   return (i == T3_MAX_NUM_RNIC);
+}
+
+static inline void cxio_hal_delete_rdev(struct cxio_rdev *rdev_p)
+{
+   int i;
+   for (i = 0; i < T3_MAX_NUM_RNIC; i++)
+   if (rdev_tbl[i] == rdev_p) {
+   rdev_tbl[i] = NULL;
+   break;
+   }
+}
+
+int cxio_hal_cq_op(struct cxio_rdev *rdev_p, struct t3_cq *cq, 
+  enum t3_cq_opcode op, u32 credit)
+{
+   int ret;
+   struct t3_cqe *cqe;
+   u32 rptr;
+
+   struct rdma_cq_op setup;
+   setup.id = cq->cqid;
+   setup.credits = (op == CQ_CREDIT_UPDATE) ? credit : 0;
+   setup.op = op;
+   ret = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_OP, );
+
+   if ((ret < 0) || (op == CQ_CREDIT_UPDATE)) 
+   return ret;
+
+   /*
+* If the rearm returned an index other than our current index,
+* then there might be CQE's in flight (being DMA'd).  We must wait
+* here for them to complete or the consumer can miss a notification.
+*/
+   if (Q_PTR2IDX((cq->rptr), cq->size_log2) != ret) {
+   int i=0;
+
+   rptr = cq->rptr;
+
+   /* 
+* Keep the generation correct by bumping rptr until it
+* matches the index returned by the rearm - 1.
+*/
+   while (Q_PTR2IDX((rptr+1), cq->size_log2) != ret)
+   rptr++;
+
+   /* 
+* Now rptr is the index for the (last) cqe that was 
+* in-flight at the time the HW rearmed the CQ.  We 
+* spin until that CQE is valid.
+

[PATCH v4 01/13] Linux RDMA Core Changes


Support provider-specific data in ib_uverbs_cmd_req_notify_cq().
The Chelsio iwarp provider library needs to pass information to the
kernel verb for re-arming the CQ.

Signed-off-by: Steve Wise <[EMAIL PROTECTED]>
---

 drivers/infiniband/core/uverbs_cmd.c  |9 +++--
 drivers/infiniband/hw/amso1100/c2.h   |2 +-
 drivers/infiniband/hw/amso1100/c2_cq.c|3 ++-
 drivers/infiniband/hw/ehca/ehca_iverbs.h  |3 ++-
 drivers/infiniband/hw/ehca/ehca_reqs.c|3 ++-
 drivers/infiniband/hw/ipath/ipath_cq.c|4 +++-
 drivers/infiniband/hw/ipath/ipath_verbs.h |3 ++-
 drivers/infiniband/hw/mthca/mthca_cq.c|6 --
 drivers/infiniband/hw/mthca/mthca_dev.h   |4 ++--
 include/rdma/ib_verbs.h   |5 +++--
 10 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/core/uverbs_cmd.c 
b/drivers/infiniband/core/uverbs_cmd.c
index 743247e..5dd1de9 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -959,6 +959,7 @@ ssize_t ib_uverbs_req_notify_cq(struct i
int out_len)
 {
struct ib_uverbs_req_notify_cq cmd;
+   struct ib_udata   udata;
struct ib_cq  *cq;
 
if (copy_from_user(, buf, sizeof cmd))
@@ -968,8 +969,12 @@ ssize_t ib_uverbs_req_notify_cq(struct i
if (!cq)
return -EINVAL;
 
-   ib_req_notify_cq(cq, cmd.solicited_only ?
-IB_CQ_SOLICITED : IB_CQ_NEXT_COMP);
+   INIT_UDATA(, buf + sizeof cmd, 0,
+  in_len - sizeof cmd, 0); 
+
+   cq->device->req_notify_cq(cq, cmd.solicited_only ?
+ IB_CQ_SOLICITED : IB_CQ_NEXT_COMP,
+ );
 
put_cq_read(cq);
 
diff --git a/drivers/infiniband/hw/amso1100/c2.h 
b/drivers/infiniband/hw/amso1100/c2.h
index 04a9db5..9a76869 100644
--- a/drivers/infiniband/hw/amso1100/c2.h
+++ b/drivers/infiniband/hw/amso1100/c2.h
@@ -519,7 +519,7 @@ extern void c2_free_cq(struct c2_dev *c2
 extern void c2_cq_event(struct c2_dev *c2dev, u32 mq_index);
 extern void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index);
 extern int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc 
*entry);
-extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify);
+extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, struct 
ib_udata *udata);
 
 /* CM */
 extern int c2_llp_connect(struct iw_cm_id *cm_id,
diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c 
b/drivers/infiniband/hw/amso1100/c2_cq.c
index 05c9154..7ce8bca 100644
--- a/drivers/infiniband/hw/amso1100/c2_cq.c
+++ b/drivers/infiniband/hw/amso1100/c2_cq.c
@@ -217,7 +217,8 @@ int c2_poll_cq(struct ib_cq *ibcq, int n
return npolled;
 }
 
-int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
+int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify,
+ struct ib_udata *udata)
 {
struct c2_mq_shared __iomem *shared;
struct c2_cq *cq;
diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h 
b/drivers/infiniband/hw/ehca/ehca_iverbs.h
index 3720e30..566b30c 100644
--- a/drivers/infiniband/hw/ehca/ehca_iverbs.h
+++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h
@@ -135,7 +135,8 @@ int ehca_poll_cq(struct ib_cq *cq, int n
 
 int ehca_peek_cq(struct ib_cq *cq, int wc_cnt);
 
-int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify);
+int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify,
+  struct ib_udata *udata);
 
 struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 struct ib_qp_init_attr *init_attr,
diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c 
b/drivers/infiniband/hw/ehca/ehca_reqs.c
index b46bda1..3ed6992 100644
--- a/drivers/infiniband/hw/ehca/ehca_reqs.c
+++ b/drivers/infiniband/hw/ehca/ehca_reqs.c
@@ -634,7 +634,8 @@ poll_cq_exit0:
return ret;
 }
 
-int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify)
+int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify,
+  struct ib_udata *udata)
 {
struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq);
 
diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c 
b/drivers/infiniband/hw/ipath/ipath_cq.c
index 87462e0..27ba4db 100644
--- a/drivers/infiniband/hw/ipath/ipath_cq.c
+++ b/drivers/infiniband/hw/ipath/ipath_cq.c
@@ -307,13 +307,15 @@ int ipath_destroy_cq(struct ib_cq *ibcq)
  * ipath_req_notify_cq - change the notification type for a completion queue
  * @ibcq: the completion queue
  * @notify: the type of notification to request
+ * @udata: user data 
  *
  * Returns 0 for success.
  *
  * This may be called from interrupt context.  Also called by
  * ib_req_notify_cq() in the generic verbs code.
  */
-int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
+int ipath_req_notify_cq(struct ib_cq *ibcq, enum

[PATCH v4 08/13] Memory Registration


Functions to register memory regions.

Signed-off-by: Steve Wise <[EMAIL PROTECTED]>
---

 drivers/infiniband/hw/cxgb3/iwch_mem.c |  170 
 1 files changed, 170 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_mem.c 
b/drivers/infiniband/hw/cxgb3/iwch_mem.c
new file mode 100644
index 000..774d11e
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_mem.c
@@ -0,0 +1,170 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include 
+
+#include 
+#include 
+
+#include "cxio_hal.h"
+#include "iwch.h"
+#include "iwch_provider.h"
+
+int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php,
+   struct iwch_mr *mhp,
+   int shift,
+   __be64 *page_list)
+{
+   u32 stag;
+   u32 mmid;
+
+
+   if (cxio_register_phys_mem(>rdev,
+  , mhp->attr.pdid,
+  mhp->attr.perms,
+  mhp->attr.zbva,
+  mhp->attr.va_fbo,
+  mhp->attr.len,
+  shift-12,
+  page_list,
+  >attr.pbl_size, >attr.pbl_addr))
+   return -ENOMEM;
+   mhp->attr.state = 1;
+   mhp->attr.stag = stag;
+   mmid = stag >> 8;
+   mhp->ibmr.rkey = mhp->ibmr.lkey = stag;
+   insert_handle(rhp, >mmidr, mhp, mmid); 
+   PDBG("%s mmid 0x%x mhp %p\n", __FUNCTION__, mmid, mhp);
+   return 0;
+}
+
+int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php,
+   struct iwch_mr *mhp,
+   int shift,
+   __be64 *page_list,
+   int npages)
+{
+   u32 stag;
+   u32 mmid;
+
+
+   /* We could support this... */
+   if (npages > mhp->attr.pbl_size)
+   return -ENOMEM;
+
+   stag = mhp->attr.stag;
+   if (cxio_reregister_phys_mem(>rdev,
+  , mhp->attr.pdid,
+  mhp->attr.perms,
+  mhp->attr.zbva,
+  mhp->attr.va_fbo,
+  mhp->attr.len,
+  shift-12,
+  page_list,
+  >attr.pbl_size, >attr.pbl_addr))
+   return -ENOMEM;
+   mhp->attr.state = 1;
+   mhp->attr.stag = stag;
+   mmid = stag >> 8;
+   mhp->ibmr.rkey = mhp->ibmr.lkey = stag;
+   insert_handle(rhp, >mmidr, mhp, mmid); 
+   PDBG("%s mmid 0x%x mhp %p\n", __FUNCTION__, mmid, mhp);
+   return 0;
+}
+
+int build_phys_page_list(struct ib_phys_buf *buffer_list,
+   int num_phys_buf,
+   u64 *iova_start,
+   u64 *total_size,
+   int *npages,
+   int *shift,
+   __be64 **page_list)
+{
+   u64 mask;
+   int i, j, n;
+
+   mask = 0;
+   *total_size = 0;
+   for (i = 0; i < num_phys_buf; ++i) {
+   if (i != 0 && buffer_list[i].addr & ~PAGE_MASK)
+   return -EINVAL;
+

[PATCH v4 09/13] Core WQE/CQE Types


T3 WQE and CQE structures, defines, etc...

Signed-off-by: Steve Wise <[EMAIL PROTECTED]>
---

 drivers/infiniband/hw/cxgb3/core/cxio_wr.h |  685 
 1 files changed, 685 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_wr.h 
b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h
new file mode 100644
index 000..45870be
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h
@@ -0,0 +1,685 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef __CXIO_WR_H__
+#define __CXIO_WR_H__
+
+#include 
+#include 
+#include 
+#include "firmware_exports.h"
+
+#define T3_MAX_SGE  4
+
+#define Q_EMPTY(rptr,wptr) ((rptr)==(wptr))
+#define Q_FULL(rptr,wptr,size_log2)  ( (((wptr)-(rptr))>>(size_log2)) && \
+  ((rptr)!=(wptr)) )
+#define Q_GENBIT(ptr,size_log2) (!(((ptr)>>size_log2)&0x1))
+#define Q_FREECNT(rptr,wptr,size_log2) ((1UL<> S_FW_RIWR_OP)) & M_FW_RIWR_OP)
+
+#define S_FW_RIWR_SOPEOP   22
+#define M_FW_RIWR_SOPEOP   0x3
+#define V_FW_RIWR_SOPEOP(x)((x) << S_FW_RIWR_SOPEOP)
+
+#define S_FW_RIWR_FLAGS8
+#define M_FW_RIWR_FLAGS0x3f
+#define V_FW_RIWR_FLAGS(x) ((x) << S_FW_RIWR_FLAGS)
+#define G_FW_RIWR_FLAGS(x) x) >> S_FW_RIWR_FLAGS)) & M_FW_RIWR_FLAGS)
+
+#define S_FW_RIWR_TID  8
+#define V_FW_RIWR_TID(x)   ((x) << S_FW_RIWR_TID)
+
+#define S_FW_RIWR_LEN  0
+#define V_FW_RIWR_LEN(x)   ((x) << S_FW_RIWR_LEN)
+
+#define S_FW_RIWR_GEN   31
+#define V_FW_RIWR_GEN(x)((x)  << S_FW_RIWR_GEN)
+
+struct t3_sge {
+   __be32 stag;
+   __be32 len;
+   __be64 to;
+};
+
+/* If num_sgle is zero, flit 5+ contains immediate data.*/
+struct t3_send_wr {
+   struct fw_riwrh wrh;/* 0 */
+   union t3_wrid wrid; /* 1 */
+
+   u8 rdmaop;  /* 2 */
+   u8 reserved[3];
+   __be32 rem_stag;
+   __be32 plen;/* 3 */
+   __be32 num_sgle;
+   struct t3_sge sgl[T3_MAX_SGE];  /* 4+ */
+};
+
+struct t3_local_inv_wr {
+   struct fw_riwrh wrh;/* 0 */
+   union t3_wrid wrid; /* 1 */
+   __be32 stag;/* 2 */
+   __be32 reserved3;
+};
+
+struct t3_rdma_write_wr {
+   struct fw_riwrh wrh;/* 0 */
+   union t3_wrid wrid; /* 1 */
+   u8 rdmaop;  /* 2 */
+   u8 reserved[3];
+   __be32 stag_sink;
+   __be64 to_sink; /* 3 */
+   __be32 plen;/* 4 */
+   __be32 num_sgle;
+   struct t3_sge sgl[T3_MAX_SGE];  /* 5+ */
+};
+
+struct t3_rdma_read_wr {
+   struct fw_riwrh wrh;/* 0 */
+   union t3_wrid wrid; /* 1 */
+   u8 rdmaop;  /* 2 */
+   u8 reserved[3];
+   __be32 rem_stag;
+   __be64 rem_to;  /* 3 */
+   __be32 local_stag;  /* 4 */
+   __be32 local_len;
+   __be64 local_to;/* 5 */
+};
+
+enum t3_addr_type {
+   T3_VA_BASED_TO = 0x0,
+   T3_ZERO_BASED_TO = 0x1
+} __attribute__ ((packed));
+
+enum t3_mem_perms {
+   T3_MEM_ACCESS_LOCAL_READ = 0x1,
+   T3_MEM_ACCESS_LOCAL_WRITE = 0x2,
+   T3_MEM_ACCESS_REM_READ = 0x4,
+   T3_MEM_ACCESS_REM_WRITE = 0x8
+} __attribute__ ((packed));
+
+struct t3_bind_mw_wr {
+   struct fw_riwrh wrh;/* 0 */
+   union t3_wrid wrid; /* 1 */
+   u16 reserved;   /* 2 */
+   u8 type;
+   u8 perms;
+   __be32 mr_stag;
+   __be32 mw_stag; /* 3 */
+

[PATCH v4 11/13] Core Resource Allocation


Core functions to carve up adapter memory, stag, qp, and cq IDs.

Signed-off-by: Steve Wise <[EMAIL PROTECTED]>
---

 drivers/infiniband/hw/cxgb3/core/cxio_resource.c |  331 ++
 drivers/infiniband/hw/cxgb3/core/cxio_resource.h |   70 +
 2 files changed, 401 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_resource.c 
b/drivers/infiniband/hw/cxgb3/core/cxio_resource.c
new file mode 100644
index 000..444df15
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_resource.c
@@ -0,0 +1,331 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+/* Crude resource management */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "cxio_resource.h"
+#include "cxio_hal.h"
+
+static struct kfifo *rhdl_fifo;
+static spinlock_t rhdl_fifo_lock;
+
+#define RANDOM_SIZE 16
+
+static int __cxio_init_resource_fifo(struct kfifo **fifo,
+  spinlock_t *fifo_lock,
+  u32 nr, u32 skip_low,
+  u32 skip_high,
+  int random)
+{
+   u32 i, j, entry = 0, idx;
+   u32 random_bytes;
+   u32 rarray[16];
+   spin_lock_init(fifo_lock);
+
+   *fifo = kfifo_alloc(nr * sizeof(u32), GFP_KERNEL, fifo_lock);
+   if (IS_ERR(*fifo))
+   return -ENOMEM;
+
+   for (i = 0; i < skip_low + skip_high; i++)
+   __kfifo_put(*fifo, (unsigned char *) , sizeof(u32));
+   if (random) {
+   j = 0;
+   random_bytes = random32();
+   for (i = 0; i < RANDOM_SIZE; i++)
+   rarray[i] = i + skip_low;
+   for (i = skip_low + RANDOM_SIZE; i < nr - skip_high; i++) {
+   if (j >= RANDOM_SIZE) {
+   j = 0;
+   random_bytes = random32();
+   }
+   idx = (random_bytes >> (j * 2)) & 0xF;
+   __kfifo_put(*fifo, 
+   (unsigned char *) [idx],
+   sizeof(u32));
+   rarray[idx] = i;
+   j++;
+   }
+   for (i = 0; i < RANDOM_SIZE; i++)
+   __kfifo_put(*fifo, 
+   (unsigned char *) [i],
+   sizeof(u32));
+   } else
+   for (i = skip_low; i < nr - skip_high; i++)
+   __kfifo_put(*fifo, (unsigned char *) , sizeof(u32));
+
+   for (i = 0; i < skip_low + skip_high; i++)
+   kfifo_get(*fifo, (unsigned char *) , sizeof(u32));
+   return 0;
+}
+
+static int cxio_init_resource_fifo(struct kfifo **fifo, spinlock_t * fifo_lock,
+  u32 nr, u32 skip_low, u32 skip_high)
+{
+   return (__cxio_init_resource_fifo(fifo, fifo_lock, nr, skip_low, 
+ skip_high, 0));
+}
+
+static int cxio_init_resource_fifo_random(struct kfifo **fifo,
+  spinlock_t * fifo_lock,
+  u32 nr, u32 skip_low, u32 skip_high)
+{
+
+   return (__cxio_init_resource_fifo(fifo, fifo_lock, nr, skip_low, 
+ skip_high, 1));
+}
+
+static int cxio_init_qpid_fifo(struct cxio_rdev *rdev_p)
+{
+   u32 i;
+
+   spin_lock_init(_p->rscp->qpid_fifo_lock);
+
+   rdev_p->rscp->qpid_fifo

Re: [PATCH 2.6.19-git19] BUG due to bad argument to ieee80211softmac_assoc_work

2006-12-14 Thread Johannes Berg

On Wed, 2006-12-13 at 13:17 -0500, Michael Bommarito wrote:

> Attached is a patch that fixes this (the actual change is two lines
> but context provided in patch for review).  The dmesg containing call
> trace is attached to the bugzilla entry above.

You forgot to attach the patch but IIRC it's been found and fixed
already.

johannes


signature.asc
Description: This is a digitally signed message part

Re: [RFC] split NAPI from network device.

2006-12-14 Thread Benjamin Herrenschmidt

On Wed, 2006-12-13 at 15:46 -0800, Stephen Hemminger wrote:
> Split off NAPI part from network device, this patch is build tested
> only! It breaks kernel API for network devices, and only three examples
> are fixed (skge, sky2, and tg3).
> 
> 1. Decomposition allows different NAPI <-> network device
>Some hardware has N devices for one IRQ, others like MSI-X
>want multiple receive's for one device.
> 
> 2. Cleanup locking with netpoll
> 
> 3. Change poll callback arguements and semantics
> 
> 4. Make softnet_data static (only in dev.c)

Thanks !

I'll give a go at adapting emac and maybe a few more when I get 5mn to
spare...

Ben.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2.6.19-git19] BUG due to bad argument to ieee80211softmac_assoc_work

2006-12-14 Thread Michael Bommarito

Hello Uli,
 Yes, apologies, I had been waiting for an abandoned bugzilla entry
to get attention, and when I realized it was assigned to a dead-end, I
had simply posted the patch without checking for prior messages.
 I was further confused by the fact that it hadn't made its way into
any of the 19-gitX sets (and for that matter, the window for
2.6.20-rc1 has come and gone and this still remains unfixed), despite
how clear the error was and how trivial the fix seems.

-Mike

On 12/14/06, Uli Kunitz <[EMAIL PROTECTED]> wrote:

Michael,

I sent a patch to this list on Sunday, that patched the problem. It
seems to be migrated into the wireless-2.6 git tree.

Regards,

Uli
Am 13.12.2006 um 19:17 schrieb Michael Bommarito:

> This didn't get much attention on bugzilla and I figured it was
> important enough to forward along to the whole list since it's been
> lingering around in ieee80211-softmac since 19-git5 at least.
> http://bugzilla.kernel.org/show_bug.cgi?id=7657
>
> Somebody was passing the whole mac device structure to
> ieee80211softmac_assoc_work instead of just the assocation work, which
> lead to much death and locking.
>
> Attached is a patch that fixes this (the actual change is two lines
> but context provided in patch for review).  The dmesg containing call
> trace is attached to the bugzilla entry above.
>
> -Mike
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Uli Kunitz

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2.6.19-git19] BUG due to bad argument to ieee80211softmac_assoc_work

2006-12-14 Thread Uli Kunitz


Michael,

I sent a patch to this list on Sunday, that patched the problem. It  
seems to be migrated into the wireless-2.6 git tree.


Regards,

Uli
Am 13.12.2006 um 19:17 schrieb Michael Bommarito:


This didn't get much attention on bugzilla and I figured it was
important enough to forward along to the whole list since it's been
lingering around in ieee80211-softmac since 19-git5 at least.
http://bugzilla.kernel.org/show_bug.cgi?id=7657

Somebody was passing the whole mac device structure to
ieee80211softmac_assoc_work instead of just the assocation work, which
lead to much death and locking.

Attached is a patch that fixes this (the actual change is two lines
but context provided in patch for review).  The dmesg containing call
trace is attached to the bugzilla entry above.

-Mike
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
Uli Kunitz



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2.6.19-git19] BUG due to bad argument to ieee80211softmac_assoc_work

2006-12-14 Thread Larry Finger

Michael Bommarito wrote:

Hello Uli,
Yes, apologies, I had been waiting for an abandoned bugzilla entry
to get attention, and when I realized it was assigned to a dead-end, I
had simply posted the patch without checking for prior messages.
I was further confused by the fact that it hadn't made its way into
any of the 19-gitX sets (and for that matter, the window for
2.6.20-rc1 has come and gone and this still remains unfixed), despite
how clear the error was and how trivial the fix seems.

I was not aware that a bugzilla entry existed for this problem. I learned about it when my system
would hang on bootup if the bcm43xx card was installed. By bisection, I learned which commit was
causing the problem. About that time, the complete fix was discussed on the netdev and bcm43xx
mailing lists. I was a little perturbed that only part of the fix was accepted into 2.6.19-gitX.

The full fix was pushed to John Linville on Dec. 10, who pushed it on to Jeff Garzik on Dec. 11. I
have not yet seen any message sending it on to Andrew Morton or Linus.

A bug fix will always be accepted, particularly one that only changes 2 lines - it is only a new
feature that will no longer be accepted once the -rc1 stage is reached. If this message doesn't do
the trick and it isn't included by -rc2, I'll ping Jeff to see what happened. Changes always take
longer than one likes, but one needs to be careful.

Larry

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems

2006-12-14 Thread Nikolai Joukov

> Nikolai Joukov wrote:
> > We have designed a new stackable file system that we called RAIF:
> > Redundant Array of Independent Filesystems.
>
> Great!
>
> > We have performed some benchmarking on a 3GHz PC with 2GB of RAM and U320
> > SCSI disks.  Compared to the Linux RAID driver, RAIF has overheads of
> > about 20-25% under the Postmark v1.5 benchmark in case of striping and
> > replication.  In case of RAID4 and RAID5-like configurations, RAIF
> > performed about two times *better* than software RAID and even better than
> > an Adaptec 2120S RAID5 controller.
>
> I am not surprised.  RAID 4/5/6 performance is highly sensitive to the
> underlying hw, and thus needs a fair amount of fine tuning.

Nevertheless, performance is not the biggest advantage of RAIF.  For
read-biased workloads RAID is always slightly faster than RAIF.  The
biggest advantages of RAIF are flexible configurations (e.g., can combine
NFS and local file systems), per-file-type storage policies, and the fact
that files are stored as files on the lower file systems (which is
convenient).

> > This is because RAIF is located above
> > file system caches and can cache parity as normal data when needed.  We
> > have more performance details in a technical report, if anyone is
> > interested.
>
> Definitely interested.  Can you give a link?

The main focus of the paper is on a general OS profiling method and not
on RAIF.  However, it has some details about the RAIF benchmarking with
Postmark in Chapter 9:

Figures 9.7 and 9.8 also show profiles of the Linux RAID5 and RAIF5
operation under the same Postmark workload.

Nikolai.
-
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems

2006-12-14 Thread Charles Manning

On Friday 15 December 2006 10:01, Nikolai Joukov wrote:
> > Nikolai Joukov wrote:
> > > We have designed a new stackable file system that we called RAIF:
> > > Redundant Array of Independent Filesystems.
> >
> > Great!

Yes, definitely...

I see the major benefit being in the mobile, industrial and embedded systems 
arena. Perhaps this might come as a suprise to people, but a very large and 
ever growing number (perhaps even most) Linux devices don't use block devices 
for storage. Instead they use flash file systems or nfs, niether of which use 
local block devices.

It looks like RAIF gives a way to provide redundancy etc on these devices.


> >
> > > We have performed some benchmarking on a 3GHz PC with 2GB of RAM and
> > > U320 SCSI disks.  Compared to the Linux RAID driver, RAIF has overheads
> > > of about 20-25% under the Postmark v1.5 benchmark in case of striping
> > > and replication.  In case of RAID4 and RAID5-like configurations, RAIF
> > > performed about two times *better* than software RAID and even better
> > > than an Adaptec 2120S RAID5 controller.
> >
> > I am not surprised.  RAID 4/5/6 performance is highly sensitive to the
> > underlying hw, and thus needs a fair amount of fine tuning.
>
> Nevertheless, performance is not the biggest advantage of RAIF.  For
> read-biased workloads RAID is always slightly faster than RAIF.  The
> biggest advantages of RAIF are flexible configurations (e.g., can combine
> NFS and local file systems), per-file-type storage policies, and the fact
> that files are stored as files on the lower file systems (which is
> convenient).
>
> > > This is because RAIF is located above
> > > file system caches and can cache parity as normal data when needed.  We
> > > have more performance details in a technical report, if anyone is
> > > interested.
> >
> > Definitely interested.  Can you give a link?
>
> The main focus of the paper is on a general OS profiling method and not
> on RAIF.  However, it has some details about the RAIF benchmarking with
> Postmark in Chapter 9:
>
>   
>
> Figures 9.7 and 9.8 also show profiles of the Linux RAID5 and RAIF5
> operation under the same Postmark workload.
>
> Nikolai.
> -
> Nikolai Joukov, Ph.D.
> Filesystems and Storage Laboratory
> Stony Brook University
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems

2006-12-14 Thread berk walker




Nikolai Joukov wrote:


  

Figures 9.7 and 9.8 also show profiles of the Linux RAID5 and RAIF5
operation under the same Postmark workload.

Nikolai.
-
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University

  


Well, Congratulations, Doctor!!  [Must be nice to be exiled to Stony 
Brook!!  Oh, well, not I]


For some reason, I can not connect to the above link, but I may not need 
to.  Does [should] it contain a link/pointer to the underlying source 
code?  This concept sounds very interesting, and I am sure that many of 
us would like to look closer, and maybe even get a taste.



Here's hoping that source exists, and that it is available for us.

Thanks
b-

  
-

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems

2006-12-14 Thread Nikolai Joukov

> > We started the project in April 2004.  Right now I am using it as my
> > /home/kolya file system at home.  We believe that at this stage RAIF is
> > mature enough for others to try it out.  The code is available at:
> >
> > 
> >
> > The code requires no kernel patches and compiles for a wide range of
> > kernels as a module.  The latest kernel we used it for is 2.6.13 and we
> > are in the process of porting it to 2.6.19.
> >
> > We will be happy to hear your back.
>
> When removing a file from the underlying branch, the oops below happens.
> Wouldn't it be possible to just fail the branch instead of oopsing?

This is a known problem of all Linux stackable file systems.  Users are
not supposed to change the file systems below mounted stackable file
systems (but they can read them).  One of the ways to enforce it is to use
overlay mounts.  For example, mount the lower file systems at
/raif/b0 ... /raif/bN and then mount RAIF at /raif.  Stackable file
systems recently started getting into the kernel and we hope that there
will be a better solution for this problem in the future.  Having said
that, you are right: failing the branch would be the right thing to do.

Nikolai.
-
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems

2006-12-14 Thread Nikolai Joukov

> Well, Congratulations, Doctor!!  [Must be nice to be exiled to Stony
> Brook!!  Oh, well, not I]

Long Island is a very nice place with lots of vineries and perfect sand
beaches - don't envy :-)

> Here's hoping that source exists, and that it is available for us.

I guess, you are subscribed to the linux-raid list only.  Unfortunately, I
didn't CC my post to that list and one of the replies was CC'd there
without the link.  The original post is available here:

  

And the link to the sources is:

  

Nikolai.
-
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: 2.6.18.4: flush_workqueue calls mutex_lock in interrupt environment

2006-12-14 Thread Chen, Kenneth W

Andrew Morton wrote on Thursday, December 14, 2006 5:20 PM
> it's hard to disagree.
> 
> Begin forwarded message:
> > On Wed, 2006-12-13 at 08:25 +0100, xb wrote:
> > > Hi all,
> > > 
> > > Running some IO stress tests on a 8*ways IA64 platform, we got:
> > >  BUG: warning at kernel/mutex.c:132/__mutex_lock_common()  message
> > > followed by:
> > >  Unable to handle kernel paging request at virtual address
> > > 00200200
> > > oops corresponding to anon_vma_unlink() calling list_del() on a
> > > poisonned list.
> > > 
> > > Having a look to the stack, we see that flush_workqueue() calls
> > > mutex_lock() with softirqs disabled.
> > 
> > something is wrong here... flush_workqueue() is a sleeping function and
> > is not allowed to be called in such a context!
> 
> It seems utterly insane to have aio_complete() flush a workqueue. That
> function has to be called from a number of different environments,
> including non-sleep tolerant environments.
> 
> For instance it means that directIO on NFS will now cause the rpciod
> workqueues to call flush_workqueue(aio_wq), thus slowing down all RPC
> activity.

The bug appears to be somewhere else, somehow the ref count on ioctx is
all messed up.

In aio_complete, __put_ioctx() should not be invoked because ref count
on ioctx is supposedly more than 2, aio_complete decrement it once and
should return without invoking the free function.

The real freeing ioctx should be coming from exit_aio() or io_destroy(),
in which case both wait until no further pending AIO request via
wait_for_all_aios().

- Ken
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

2.6.18 mmap hangs unrelated apps

2006-12-14 Thread Michal Sabala

Hello LKML,

I am observing processes entering uninterruptible sleep apparently due
to an unrelated application using mmap over nfs. Applications in
"uninterruptible sleep" hang indefinitely while other applications
continue working properly.

The code causing the mmap nfs hangs does the following:
(as replicated by the included test-mmap.c file)

  1. create file on nfs (file_A, descr_A)
  2. make file_A a sparse 200MB file
  3. mmap descr_A
  4. close descr_A
  5. unlink file_A
  6. memcpy 200MB to mmaped buffer
  7. create a second file on nfs (file_B, descr_B)
  8. write() 200MB from mmaped buffer to descr_B
  9. close descr_B
  10. munmap first file

This code may need to be ran tens to hundred runs to trigger the condition.

During the execution of the above code, unrelated applications enter
uninterruptible sleep (D) - usually firefox2.0, Xorg/XFree86, gimp2.2, gconfd
or bash; probably the most active processes.

`dmesg` shows nothing of interest.

`free` shows anywhere between 1MB and 80MB of memory still remaining
free when the problem occurs.

`cat /proc/*PID*/wchan` for all hanging processes contains page_sync.

* Client Setups:

  Linux 2.6.18 debian kernel (not tainted)
  Intel P3/800
  512MB ram
  0 swap
  NFS root (rw,noatime,rsize=8192,wsize=8192,nfsvers=3,hard,lock,udp)
  NIC: 100mbit tulip Cardbus
  NFS server is Linux 2.6.8 (debian)
  Gnome running with ooffice, gimp2.2 and firefox2 open

  and

  Linux 2.6.18 debian kernel (not tainted)
  Intel P4/2.8
  mem=192M boot option
  0 swap
  NFS home (rw,nosuid,rsize=8192,wsize=8192,hard)
  NIC: 100mbit e100 PCI
  NFS server is Apple OSX 10.3
  Gnome running with ooffice, gimp2.2 and firefox2 open

This happens with NFS servers based on Linux 2.6.8 and OSX 10.3.x. There
is nothing unusual in the server log files. Other than large nfs mmaps
on limited ram clients, NFS clients are 100% stable (file locking, performance,
6 month uptimes, etc..)

NOTE:
  I also ran the same code on the P4 machine in /tmp (local disk)
and it too caused some applications to enter uninterruptible sleep
(dozens of consecutive runs were needed). As such this looks not to
be directly related to nfs.

I would like to assist in any way I can in tracking this bug. I am open
to running patched kernels, etc...

Thank You,
 Sincerely,

   Michal Sabala



PS. thank you for all the hard work on the Linux kernel.


--- test-mmap.c: 

#include 
#include 
#include 
#include 
#include 
#include 

#include 

int main (int argc, char * argv[] ){

  char * data = 0;
  int blocks = 12800;
  int bSize = 16384;
  
  char mmapFileName[] = "temp-XX";
  int mmapFileDes = mkstemp( mmapFileName );
  if ( mmapFileDes == -1 ){
printf( "cannot make temporary file %s !\n", mmapFileName );
exit( -1 );
  }

  printf( "using desc %d tempfile %s\n", mmapFileDes, mmapFileName );

  errno = 0;  
  if ( lseek( mmapFileDes, (blocks*bSize)-1, SEEK_SET ) == -1 ){
if ( errno != 0 ){
perror ( "lseek error: " );
}
printf(  "cannot lseek tempfile %s !\n", mmapFileName);
close( mmapFileDes );
unlink( mmapFileName );
exit( -1 );
  }

  if ( write( mmapFileDes, "X", 1 ) != 1 ){
printf(  "cannot sparse write tempfile %s !\n", mmapFileName);
close( mmapFileDes );
unlink( mmapFileName );
exit( -1 );
  }

  data = mmap ( NULL, (blocks*bSize), PROT_READ | PROT_WRITE, MAP_SHARED, 
mmapFileDes, 0 );
  if ( data == (void *) -1 ){
printf(  "mmap of %s failed!\n", mmapFileName );
close( mmapFileDes );
unlink( mmapFileName );
exit( -1 );
  }

  printf( "block size: %d, blocks num: %d\n", bSize, blocks);

  close( mmapFileDes );
  unlink( mmapFileName );

  int i;
  char * ptr = data;
  for ( i = 1; i <= blocks; i++ ){
printf( "wrote %d of %d blocks to %s\n", i, blocks, mmapFileName );
memset( ptr, 0, bSize ); 
ptr += bSize;
  }

  // msync( data, blocks*bSize, MS_SYNC );

  char destFile[] = "destination-XX";
  int destDes = mkstemp( destFile );
  if ( destDes == -1 ){
printf( "cannot make destination file %s !\n", destFile );
exit( -1 );
  }

  printf( "using desc %d destfile %s\n", destDes, destFile);
 
  ptr = data;
  for ( i = 1; i <= blocks; i++ ){
int wLen = write( destDes, ptr, bSize );
printf( "wrote %d of %d blocks to %s\n", i, blocks, destFile );
if ( wLen != bSize ){
  printf( "debug: short write to %s at %d bytes\n", destFile, wLen );
}
ptr += bSize;
  }
  
  close( destDes );
  
  munmap( data, blocks*bSize );

  exit( 0 );
}

-- 
Michal "Saahbs" Sabala
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: ieee1394 in 2.6.20-rc1 (was Re: Linux 2.6.20-rc1)

2006-12-14 Thread Gene Heskett

On Thursday 14 December 2006 12:48, Stefan Richter wrote:
[...]
>
>(Anyway, that's unrelated to Gene's issues.)

And which I haven't had a chance to check yet, the camera is still in the 
truck and I've been busier than a one legged man in an ass kicking 
contest today.  I did get 2.6.20-rc1 built and its whats running, but 
that is as far as I got, too many other honeydo's.  Tomorrow hopefully.  
If I don't wind up using a backhoe for a divining rod, looking for our 
sewer which is beginning to nag us occasionally.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2006 by Maurice Eugene Heskett, all rights reserved.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 2.6.20-rc1] fix vm_events_fold_cpu() build breakage

2006-12-14 Thread Magnus Damm

fix vm_events_fold_cpu() build breakage

2.6.20-rc1 does not build properly if CONFIG_VM_EVENT_COUNTERS is set
and CONFIG_HOTPLUG is unset:

  CC  init/version.o
  LD  init/built-in.o
  LD  .tmp_vmlinux1
mm/built-in.o: In function `page_alloc_cpu_notify':
page_alloc.c:(.text+0x56eb): undefined reference to `vm_events_fold_cpu'
make: *** [.tmp_vmlinux1] Error 1

Signed-Off-By: Magnus Damm <[EMAIL PROTECTED]>
---

 Applies on top of linux-2.6.20-rc1.

 include/linux/vmstat.h |4 
 1 file changed, 4 insertions(+)

--- 0001/include/linux/vmstat.h
+++ 0003/include/linux/vmstat.h 2006-12-15 11:46:23.0 +0900
@@ -73,7 +73,11 @@ static inline void count_vm_events(enum 
 }
 
 extern void all_vm_events(unsigned long *);
+#ifdef CONFIG_HOTPLUG
 extern void vm_events_fold_cpu(int cpu);
+#else
+#define vm_events_fold_cpu(x)  do { } while (0)
+#endif
 
 #else
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[BUG -rt] scheduling in atomic.

2006-12-14 Thread Steven Rostedt

Ingo,

I've hit this. I compiled the kernel as CONFIG_PREEMPT, and turned off
IRQ's as threads.

BUG: scheduling while atomic: swapper/0x0001/1, CPU#3

Call Trace:
 [] dump_trace+0xaa/0x404
 [] show_trace+0x3c/0x52
 [] dump_stack+0x15/0x17
 [] __sched_text_start+0x8a/0xbb7
 [] schedule+0xd3/0xf3
 [] flush_cpu_workqueue+0x72/0xa4
 [] flush_workqueue+0x6d/0x95
 [] schedule_on_each_cpu+0xe8/0xff
 [] filevec_add_drain_all+0x12/0x14
 [] remove_proc_entry+0xaf/0x258
 [] unregister_handler_proc+0x23/0x48
 [] free_irq+0xda/0x114
 [] i8042_probe+0x338/0x75c
 [] platform_drv_probe+0x12/0x14
 [] really_probe+0x54/0xee
 [] driver_probe_device+0xae/0xba
 [] __device_attach+0x9/0xb
 [] bus_for_each_drv+0x47/0x7d
 [] device_attach+0x65/0x79
 [] bus_attach_device+0x24/0x4c
 [] device_add+0x38f/0x505
 [] platform_device_add+0x11a/0x152
 [] i8042_init+0x2b0/0x30d
 [] init+0x182/0x344
 [] child_rip+0xa/0x12


Seems that we have this in remove_proc_entry:

spin_lock(_subdir_lock);
for (p = >subdir; *p; p=&(*p)->next ) {

[...]

proc_kill_inodes(de);

[...]

}
spin_unlock(_subdir_lock);

And in proc_kill_inodes:

static void proc_kill_inodes(struct proc_dir_entry *de)
{
struct file *filp;
struct super_block *sb = proc_mnt->mnt_sb;

/*
 * Actually it's a partial revoke().
 */
filevec_add_drain_all();

[...]
}

and in filevec_add_drain_all:

int filevec_add_drain_all(void)
{
return schedule_on_each_cpu(filevec_add_drain_per_cpu, NULL);
}


And schedule_on_each_cpu is easily schedulable.

So it seems that it schedules while holding a spin lock.

I don't know this code very well, and don't have time to look too deep
into it, but I figure that I would report it.

-- Steve




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 2.6.18.3 also 2.6.19 XFS xfs_force_shutdown (was: XFS internal error [...])

2006-12-14 Thread David Chinner

On Thu, Dec 14, 2006 at 06:21:49PM +0900, Shinichiro HIDA wrote:
> Hi,
> 
> ;; Sorry for late, and Thanks for following up.
> 
> > In <[EMAIL PROTECTED]> 
> > David Chinner <[EMAIL PROTECTED]> wrote:
> > On Wed, Dec 13, 2006 at 02:12:23PM +0900, Shinichiro HIDA wrote:
> > > Hi,
> > > 
> > > I met same problem on my 2 machines, 2.6.19 (Debian unstable) also
> > > 2.6.18.3 (Debian stable),
> > Should have been preceeded with some other output explaining the
> > reason for the shutdown.

> Dec 12 21:31:25 lune kernel: xfs_da_do_buf: bno 16777216
> Dec 12 21:31:25 lune kernel: dir: inode 9078346
> Dec 12 21:31:25 lune kernel: Filesystem "hdf5": XFS internal error 
> xfs_da_do_buf(1) at line 1995 of file fs/xfs/xfs_da_btree.c.  Caller 
> 0xc02982ec

Ok, that bno (16777216) is a definite sign of corruption
caused by the 2.6.17.x (x <=6) kernels.

> > Did these machines run 2.6.17.x where x<= 6?
> > i.e. is this problem:
> 
> > http://oss.sgi.com/projects/xfs/faq.html#dir2
> 
> Yes, I could boot this machine(lune) with 2.6.17.6. 

I wasn't suggesting that you use this kernel - that could cause more
corruption to occur.  What I was asking is if you have run a kernel
of this version in the past (i.e. before you upgraded to 2.6.18.3)?

Regardless, I suggest you get the latest xfsprogs and run xfs_repair
on your filesystems to fix the problem.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 007 of 14] knfsd: SUNRPC: Provide room in svc_rqst for larger addresses

On Wed, 13 Dec 2006 10:59:11 +1100
NeilBrown <[EMAIL PROTECTED]> wrote:

> From: Chuck Lever <[EMAIL PROTECTED]>
> Expand the rq_addr field to allow it to contain larger addresses.

This patch breaks the NFS server on my heroically modern RH FC1 machine.

There's a mysterious 30-second pause when initscripts are bringing up
mountd.

showmount (from a FC5 client) works:

box:/usr/src/25> 0 showmount -e vmm
Export list for vmm:
/ *
/mnt/hda5 *

But things get really exciting when we try to mount it:


box:/usr/src/25> 0 mount vmm:/mnt/hda5 /mnt 
*** buffer overflow detected ***: mount terminated
=== Backtrace: =
/lib64/libc.so.6(__chk_fail+0x2f)[0x32adbdfaef]
mount[0x40bcf8]
mount[0x4044c5]
mount[0x405850]
mount[0x406388]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x32adb1ce54]
mount[0x4034a9]
=== Memory map: 
0040-00414000 r-xp  08:01 3041513
/bin/mount
00513000-00514000 rw-p 00013000 08:01 3041513
/bin/mount
00514000-00516000 rw-p 00514000 00:00 0 
00613000-00615000 rw-p 00013000 08:01 3041513
/bin/mount
00615000-00636000 rw-p 00615000 00:00 0  [heap]
32ad90-32ad91a000 r-xp  08:01 1619031
/lib64/ld-2.4.so
32ada19000-32ada1a000 r--p 00019000 08:01 1619031
/lib64/ld-2.4.so
32ada1a000-32ada1b000 rw-p 0001a000 08:01 1619031
/lib64/ld-2.4.so
32adb0-32adc3f000 r-xp  08:01 1619091
/lib64/libc-2.4.so
32adc3f000-32add3f000 ---p 0013f000 08:01 1619091
/lib64/libc-2.4.so
32add3f000-32add43000 r--p 0013f000 08:01 1619091
/lib64/libc-2.4.so
32add43000-32add44000 rw-p 00143000 08:01 1619091
/lib64/libc-2.4.so
32add44000-32add49000 rw-p 32add44000 00:00 0 
32ade0-32ade02000 r-xp  08:01 1619011
/lib64/libuuid.so.1.2
32ade02000-32adf02000 ---p 2000 08:01 1619011
/lib64/libuuid.so.1.2
32adf02000-32adf03000 rw-p 2000 08:01 1619011
/lib64/libuuid.so.1.2
32ae00-32ae002000 r-xp  08:01 1619095
/lib64/libdl-2.4.so
32ae002000-32ae102000 ---p 2000 08:01 1619095
/lib64/libdl-2.4.so
32ae102000-32ae103000 r--p 2000 08:01 1619095
/lib64/libdl-2.4.so
32ae103000-32ae104000 rw-p 3000 08:01 1619095
/lib64/libdl-2.4.so
32ae20-32ae20e000 r-xp  08:01 1619005
/lib64/libdevmapper.so.1.02
32ae20e000-32ae30e000 ---p e000 08:01 1619005
/lib64/libdevmapper.so.1.02
32ae30e000-32ae31 rw-p e000 08:01 1619005
/lib64/libdevmapper.so.1.02
32ae40-32ae408000 r-xp  08:01 1619066
/lib64/libblkid.so.1.0
32ae408000-32ae508000 ---p 8000 08:01 1619066
/lib64/libblkid.so.1.0
32ae508000-32ae509000 rw-p 8000 08:01 1619066
/lib64/libblkid.so.1.0
32b060-32b060d000 r-xp  08:01 1619093
/lib64/libgcc_s-4.1.1-20060525.so.1
32b060d000-32b070d000 ---p d000 08:01 1619093
/lib64/libgcc_s-4.1.1-20060525.so.1
32b070d000-32b070e000 rw-p d000 08:01 1619093
/lib64/libgcc_s-4.1.1-20060525.so.1
32b280-32b2814000 r-xp  08:01 1619102
/lib64/libselinux.so.1
32b2814000-32b2913000 ---p 00014000 08:01 1619102
/lib64/libselinux.so.1
32b2913000-32b2915000 rw-p 00013000 08:01 1619102
/lib64/libselinux.so.1
32b2915000-32b2916000 rw-p 32b2915000 00:00 0 
32b2a0-32b2a38000 r-xp  08:01 1619101
/lib64/libsepol.so.1
32b2a38000-32b2b37000 ---p 00038000 08:01 1619101
/lib64/libsepol.so.1
32b2b37000-32b2b38000 rw-p 00037000 08:01 1619101
/lib64/libsepol.so.1
32b2b38000-32b2b42000 rw-p 32b2b38000 00:00 0 
2b9eea00c000-2b9eea00d000 rw-p 2b9eea00c000 00:00 0 
2b9eea032000-2b9eea036000 rw-p 2b9eea032000 00:00 0 
2b9eea036000-2b9eea039000 r-xp  08:01 1618858
/lib64/libsetrans.so.0
2b9eea039000-2b9eea138000 ---p 3000 08:01 1618858
/lib64/libsetrans.so.0
2b9eea138000-2b9eea139000 rw-p 2000 08:01 1618858
/lib64/libsetrans.so.0
2b9eea139000-2b9eea143000 r-xp  08:01 1619053
/lib64/libnss_files-2.4.so
2b9eea143000-2b9eea242000 ---p a000 08:01 1619053
/lib64/libnss_files-2.4.so
2b9eea242000-2b9eea243000 r--p 9000 08:01 1619053
/lib64/libnss_files-2.4.so
2b9eea243000-2b9eea244000 rw-p a000 08:01 1619053
/lib64/libnss_files-2.4.so

Re: Abolishing the DMCA

2006-12-14 Thread Alexandre Oliva

On Dec 14, 2006, Greg KH <[EMAIL PROTECTED]> wrote:

> I think you missed the point that my patch prevents valid usages of
> non-GPL modules from happening, which is not acceptable.

What if you changed your patch so as to only permit loading of
possibly-infringing drivers after some flag in /proc is set, and
logging to the console a message explaining (i) why such drivers might
be infringing and how to contact the copyright holders to get the
infringement stopped, and (ii) how to get it loaded if you believe
it's ok.

Then the patch would change from a probably-harmful DRM technique to
an educational tool, that wouldn't impose any major inconvenience to
those who are entitled to use the combination of code that can't be
distributed.

-- 
Alexandre Oliva http://www.lsd.ic.unicamp.br/~oliva/
FSF Latin America Board Member http://www.fsfla.org/
Red Hat Compiler Engineer   [EMAIL PROTECTED], gcc.gnu.org}
Free Software Evangelist  [EMAIL PROTECTED], gnu.org}
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/4] lumpy reclaim v2

On Wed, 6 Dec 2006 16:59:35 +
Andy Whitcroft <[EMAIL PROTECTED]> wrote:

> + tmp = __pfn_to_page(pfn);

ia64 doesn't implement __page_to_pfn.  Why did you not use page_to_pfn()?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Introduce time_data, a new structure to hold jiffies, xtime, xtime_lock, wall_to_monotonic, calc_load_count and avenrun

On Wed, 13 Dec 2006 22:26:26 +0100
Eric Dumazet <[EMAIL PROTECTED]> wrote:

> This patch introduces a new structure called time_data, where some time 
> keeping related variables are put together to share as few cache lines as 
> possible.

ia64 refers to xtime_lock from assembly and hence doesn't link.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: GPL only modules

2006-12-14 Thread Alexandre Oliva

On Dec 14, 2006, "Jeff V. Merkey" <[EMAIL PROTECTED]> wrote:

> FREE implies a transfer of ownsership

It's about freedom, not price.  And even then, it's the license that
has not cost, not the copyright.

> and you also have to contend with the Doctrine of Estoppel.  i.e. if
> someone has been using the code for over two years, and you have not
> brought a cause of action, you are BARRED from doing so under the
> Doctrine of Estoppel and statute of limitations.

Sure, but we're not necessarily talking about code that is two years
old.  We're talking about future releases.  Then, if someone
interfaces with code that was already there before, they might claim
they're still entitled to do so.  But if it's new code they interface
with, or new code they wrote after this clarification is published,
would they still be entitled to estoppel?  FWIW, IANAL.

-- 
Alexandre Oliva http://www.lsd.ic.unicamp.br/~oliva/
FSF Latin America Board Member http://www.fsfla.org/
Red Hat Compiler Engineer   [EMAIL PROTECTED], gcc.gnu.org}
Free Software Evangelist  [EMAIL PROTECTED], gnu.org}
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Task watchers v2

Associate function calls with significant events in a task's lifetime much like
we handle kernel and module init/exit functions. This creates a table for each
of the following events in the task_watchers_table ELF section:

WATCH_TASK_INIT at the beginning of a fork/clone system call when the
new task struct first becomes available.

WATCH_TASK_CLONE just before returning successfully from a fork/clone.

WATCH_TASK_EXEC just before successfully returning from the exec
system call.

WATCH_TASK_UID every time a task's real or effective user id changes.

WATCH_TASK_GID every time a task's real or effective group id changes.

WATCH_TASK_EXIT at the beginning of do_exit when a task is exiting
for any reason. 

WATCH_TASK_FREE is called before critical task structures like
the mm_struct become inaccessible and the task is subsequently freed.

The next patch will add a debugfs interface for measuring fork and exit rates
which can be used to calculate the overhead of the task watcher infrastructure.

Subsequent patches will make use of task watchers to simplify fork, exit,
and many of the system calls that set [er][ug]ids.

Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>
Cc: Andrew Morton <[EMAIL PROTECTED]>
Cc: Jes Sorensen <[EMAIL PROTECTED]>
Cc: Christoph Hellwig <[EMAIL PROTECTED]>
Cc: Al Viro <[EMAIL PROTECTED]>
Cc: Steve Grubb <[EMAIL PROTECTED]>
Cc: linux-audit@redhat.com
Cc: Paul Jackson <[EMAIL PROTECTED]>
---
 fs/exec.c |3 ++
 include/asm-generic/vmlinux.lds.h |   26 +
 include/linux/init.h  |   47 ++
 kernel/Makefile   |2 -
 kernel/exit.c |3 ++
 kernel/fork.c |   15 
 kernel/sys.c  |9 +++
 kernel/task_watchers.c|   41 +
 8 files changed, 141 insertions(+), 5 deletions(-)

Index: linux-2.6.19/kernel/sys.c
===
--- linux-2.6.19.orig/kernel/sys.c
+++ linux-2.6.19/kernel/sys.c
@@ -27,10 +27,11 @@
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
 #include 
 
@@ -957,10 +958,11 @@ asmlinkage long sys_setregid(gid_t rgid,
current->fsgid = new_egid;
current->egid = new_egid;
current->gid = new_rgid;
key_fsgid_changed(current);
proc_id_connector(current, PROC_EVENT_GID);
+   notify_task_watchers(WATCH_TASK_GID, 0, current);
return 0;
 }
 
 /*
  * setgid() is implemented like SysV w/ SAVED_IDS 
@@ -992,10 +994,11 @@ asmlinkage long sys_setgid(gid_t gid)
else
return -EPERM;
 
key_fsgid_changed(current);
proc_id_connector(current, PROC_EVENT_GID);
+   notify_task_watchers(WATCH_TASK_GID, 0, current);
return 0;
 }
   
 static int set_user(uid_t new_ruid, int dumpclear)
 {
@@ -1080,10 +1083,11 @@ asmlinkage long sys_setreuid(uid_t ruid,
current->suid = current->euid;
current->fsuid = current->euid;
 
key_fsuid_changed(current);
proc_id_connector(current, PROC_EVENT_UID);
+   notify_task_watchers(WATCH_TASK_UID, 0, current);
 
return security_task_post_setuid(old_ruid, old_euid, old_suid, 
LSM_SETID_RE);
 }
 
 
@@ -1127,10 +1131,11 @@ asmlinkage long sys_setuid(uid_t uid)
current->fsuid = current->euid = uid;
current->suid = new_suid;
 
key_fsuid_changed(current);
proc_id_connector(current, PROC_EVENT_UID);
+   notify_task_watchers(WATCH_TASK_UID, 0, current);
 
return security_task_post_setuid(old_ruid, old_euid, old_suid, 
LSM_SETID_ID);
 }
 
 
@@ -1175,10 +1180,11 @@ asmlinkage long sys_setresuid(uid_t ruid
if (suid != (uid_t) -1)
current->suid = suid;
 
key_fsuid_changed(current);
proc_id_connector(current, PROC_EVENT_UID);
+   notify_task_watchers(WATCH_TASK_UID, 0, current);
 
return security_task_post_setuid(old_ruid, old_euid, old_suid, 
LSM_SETID_RES);
 }
 
 asmlinkage long sys_getresuid(uid_t __user *ruid, uid_t __user *euid, uid_t 
__user *suid)
@@ -1227,10 +1233,11 @@ asmlinkage long sys_setresgid(gid_t rgid
if (sgid != (gid_t) -1)
current->sgid = sgid;
 
key_fsgid_changed(current);
proc_id_connector(current, PROC_EVENT_GID);
+   notify_task_watchers(WATCH_TASK_GID, 0, current);
return 0;
 }
 
 asmlinkage long sys_getresgid(gid_t __user *rgid, gid_t __user *egid, gid_t 
__user *sgid)
 {
@@ -1268,10 +1275,11 @@ asmlinkage long sys_setfsuid(uid_t uid)
current->fsuid = uid;
}
 
key_fsuid_changed(current);
proc_id_connector(current, PROC_EVENT_UID);
+   notify_task_watchers(WATCH_TASK_UID, 0, current);
 
security_task_post_setuid(old_fsuid, (uid_t)-1, (uid_t)-1, 
LSM_SETID_FS);
 
return old_fsuid;
 }
@@

[PATCH 00/10] Introduction

This is version 2 of my Task Watchers patches with performance enhancements.

Task watchers calls functions whenever a task forks, execs, changes its
[re][ug]id, or exits.

Task watchers is primarily useful to existing kernel code as a means of making
the code in fork and exit more readable. Kernel code uses these paths by
marking a function as a task watcher much like modules mark their init
functions with module_init(). This improves the readability of copy_process().

The first patch adds the basic infrastructure of task watchers: notification
function calls in the various paths and a table of function pointers to be
called. It uses an ELF section because parts of the table must be gathered
from all over the kernel code and using the linker is easier than resolving
and maintaining complex header interdependencies. Furthermore, using a list
proved to have much higher impact on the size of the patches and was deemed
unacceptable overhead. An ELF table is also ideal because its "readonly" nature 
means that no locking nor list traversal are required.

Subsequent patches adapt existing parts of the kernel to use a task watcher
 -- typically in the fork, clone, and exit paths:

FEATURE (notes)   RELEVANT CONFIG VARIABLE
---
audit [ CONFIG_AUDIT ...  ]
semundo   [ CONFIG_SYSVIPC]
cpusets   [ CONFIG_CPUSETS]
mempolicy [ CONFIG_NUMA   ]
trace irqflags[ CONFIG_TRACE_IRQFLAGS ]
lockdep   [ CONFIG_LOCKDEP]
keys (for processes -- not for thread groups) [ CONFIG_KEYS   ]
process events connector  [ CONFIG_PROC_EVENTS]

TODO:
Mark the task watcher table ELF section read-only. I've tried to "fix"
the .lds files to do this with no success. I'd really appreciate help
from folks familiar with writing linker scripts.

I'm working on three more patches that add support for creating a task
watcher from within a module using an ELF section. They haven't recieved
as much attention since I've been focusing on measuring the performance
impact of these patches.

Changes:
since v2 ():
Added ELF section annotations to the functions handling the events
Added section annotation to the lookup table in kernel/task_watchers.c
Added prefetch hints to the function pointer array walk
Renamed the macros (better?)
Retested the patches
Reduced noise in test results (0.6 - 1%, 2+% previously)

With the last prefetch patch I was able to measure a performance increase in
the range of 0.4 to 2.8%. I sampled 100 times and took the mean for each patch.
Since the numbers seemed to be a source of confusion last time I've tried to
simplify them here:

PatchMean (forks/second)
06925.16 (baseline)
17170.81  task watchers
27100.34  audit
37114.47  semundo
47185.7   cpusets
57121.41  numa-mempolicy
67070.82  irqflags
77012.61  lockdep
87116.54  keys
97116.35  procevents
12   7109.52  prefetch

7109.52 - 6925.16 = +184 forks/second (+2.6%)

So the patch series now actually improves performance a little.

All the numbers from the tests are available if anyone wishes to analyze them
independently.

Please consider for inclusion in -mm.

Cheers,
-Matt Helsley
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Register process keyrings task watcher

Make the keyring code use a task watcher to initialize and free per-task data.

NOTE:
We can't make copy_thread_group_keys() in copy_signal() a task watcher because 
it needs the task's signal field (struct signal_struct).

Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>
Cc: David Howells <[EMAIL PROTECTED]>
---
 include/linux/key.h  |8 
 kernel/exit.c|2 --
 kernel/fork.c|6 +-
 kernel/sys.c |8 
 security/keys/process_keys.c |   21 ++---
 5 files changed, 15 insertions(+), 30 deletions(-)

Index: linux-2.6.19/include/linux/key.h
===
--- linux-2.6.19.orig/include/linux/key.h
+++ linux-2.6.19/include/linux/key.h
@@ -335,18 +335,14 @@ extern void keyring_replace_payload(stru
  */
 extern struct key root_user_keyring, root_session_keyring;
 extern int alloc_uid_keyring(struct user_struct *user,
 struct task_struct *ctx);
 extern void switch_uid_keyring(struct user_struct *new_user);
-extern int copy_keys(unsigned long clone_flags, struct task_struct *tsk);
 extern int copy_thread_group_keys(struct task_struct *tsk);
-extern void exit_keys(struct task_struct *tsk);
 extern void exit_thread_group_keys(struct signal_struct *tg);
 extern int suid_keys(struct task_struct *tsk);
 extern int exec_keys(struct task_struct *tsk);
-extern void key_fsuid_changed(struct task_struct *tsk);
-extern void key_fsgid_changed(struct task_struct *tsk);
 extern void key_init(void);
 
 #define __install_session_keyring(tsk, keyring)\
 ({ \
struct key *old_session = tsk->signal->session_keyring; \
@@ -365,18 +361,14 @@ extern void key_init(void);
 #define key_ref_to_ptr(k)  ({ NULL; })
 #define is_key_possessed(k)0
 #define alloc_uid_keyring(u,c) 0
 #define switch_uid_keyring(u)  do { } while(0)
 #define __install_session_keyring(t, k)({ NULL; })
-#define copy_keys(f,t) 0
 #define copy_thread_group_keys(t)  0
-#define exit_keys(t)   do { } while(0)
 #define exit_thread_group_keys(tg) do { } while(0)
 #define suid_keys(t)   do { } while(0)
 #define exec_keys(t)   do { } while(0)
-#define key_fsuid_changed(t)   do { } while(0)
-#define key_fsgid_changed(t)   do { } while(0)
 #define key_init() do { } while(0)
 
 /* Initial keyrings */
 extern struct key root_user_keyring;
 extern struct key root_session_keyring;
Index: linux-2.6.19/kernel/fork.c
===
--- linux-2.6.19.orig/kernel/fork.c
+++ linux-2.6.19/kernel/fork.c
@@ -1077,14 +1077,12 @@ static struct task_struct *copy_process(
goto bad_fork_cleanup_fs;
if ((retval = copy_signal(clone_flags, p)))
goto bad_fork_cleanup_sighand;
if ((retval = copy_mm(clone_flags, p)))
goto bad_fork_cleanup_signal;
-   if ((retval = copy_keys(clone_flags, p)))
-   goto bad_fork_cleanup_mm;
if ((retval = copy_namespaces(clone_flags, p)))
-   goto bad_fork_cleanup_keys;
+   goto bad_fork_cleanup_mm;
retval = copy_thread(0, clone_flags, stack_start, stack_size, p, regs);
if (retval)
goto bad_fork_cleanup_namespaces;
 
p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : 
NULL;
@@ -1226,12 +1224,10 @@ static struct task_struct *copy_process(
proc_fork_connector(p);
return p;
 
 bad_fork_cleanup_namespaces:
exit_task_namespaces(p);
-bad_fork_cleanup_keys:
-   exit_keys(p);
 bad_fork_cleanup_mm:
if (p->mm)
mmput(p->mm);
 bad_fork_cleanup_signal:
cleanup_signal(p);
Index: linux-2.6.19/security/keys/process_keys.c
===
--- linux-2.6.19.orig/security/keys/process_keys.c
+++ linux-2.6.19/security/keys/process_keys.c
@@ -15,10 +15,11 @@
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "internal.h"
 
 /* session keyring create vs join semaphore */
 static DEFINE_MUTEX(key_session_mutex);
@@ -276,11 +277,12 @@ int copy_thread_group_keys(struct task_s
 
 /*/
 /*
  * copy the keys for fork
  */
-int copy_keys(unsigned long clone_flags, struct task_struct *tsk)
+static int __task_init copy_keys(unsigned long clone_flags,
+struct task_struct *tsk)
 {
key_check(tsk->thread_keyring);
key_check(tsk->request_key_auth);
 
/* no thread keyring yet */
@@ -290,10 +292,11 @@ int copy_keys(unsigned long clone_flags,
key_get(tsk->request_key_auth);
 
return 0;

Register audit task watcher

Change audit to register a task watcher function rather than modify
the copy_process() and do_exit() paths directly.

Removes an unlikely() hint from kernel/exit.c:
if (unlikely(tsk->audit_context))
audit_free(tsk);
This use of unlikely() is an artifact of audit_free()'s former invocation from
__put_task_struct() (commit: fa84cb935d4ec601528f5e2f0d5d31e7876a5044).
Clearly in the __put_task_struct() path it would be called much more frequently
than do_exit() and hence the use of unlikely() there was justified. However, in
the new location the hint most likely offers no measurable performance impact.

Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>
Cc: Al Viro <[EMAIL PROTECTED]>
Cc: Steve Grubb <[EMAIL PROTECTED]>
Cc: linux-audit@redhat.com
---
 include/linux/audit.h |4 
 kernel/auditsc.c  |9 ++---
 kernel/exit.c |3 ---
 kernel/fork.c |7 +--
 4 files changed, 7 insertions(+), 16 deletions(-)

Index: linux-2.6.19/kernel/auditsc.c
===
--- linux-2.6.19.orig/kernel/auditsc.c
+++ linux-2.6.19/kernel/auditsc.c
@@ -677,11 +677,11 @@ static inline struct audit_context *audi
  * Filter on the task information and allocate a per-task audit context
  * if necessary.  Doing so turns on system call auditing for the
  * specified task.  This is called from copy_process, so no lock is
  * needed.
  */
-int audit_alloc(struct task_struct *tsk)
+static int __task_init audit_alloc(unsigned long val, struct task_struct *tsk)
 {
struct audit_context *context;
enum audit_state state;
 
if (likely(!audit_enabled))
@@ -703,10 +703,11 @@ int audit_alloc(struct task_struct *tsk)
 
tsk->audit_context  = context;
set_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT);
return 0;
 }
+DEFINE_TASK_INITCALL(audit_alloc);
 
 static inline void audit_free_context(struct audit_context *context)
 {
struct audit_context *previous;
int  count = 0;
@@ -1033,28 +1034,30 @@ static void audit_log_exit(struct audit_
  * audit_free - free a per-task audit context
  * @tsk: task whose audit context block to free
  *
  * Called from copy_process and do_exit
  */
-void audit_free(struct task_struct *tsk)
+static int __task_free audit_free(unsigned long val, struct task_struct *tsk)
 {
struct audit_context *context;
 
context = audit_get_context(tsk, 0, 0);
if (likely(!context))
-   return;
+   return 0;
 
/* Check for system calls that do not go through the exit
 * function (e.g., exit_group), then free context block. 
 * We use GFP_ATOMIC here because we might be doing this 
 * in the context of the idle thread */
/* that can happen only if we are called from do_exit() */
if (context->in_syscall && context->auditable)
audit_log_exit(context, tsk);
 
audit_free_context(context);
+   return 0;
 }
+DEFINE_TASK_FREECALL(audit_free);
 
 /**
  * audit_syscall_entry - fill in an audit record at syscall entry
  * @tsk: task being audited
  * @arch: architecture type
Index: linux-2.6.19/include/linux/audit.h
===
--- linux-2.6.19.orig/include/linux/audit.h
+++ linux-2.6.19/include/linux/audit.h
@@ -332,12 +332,10 @@ struct mqstat;
 extern int __init audit_register_class(int class, unsigned *list);
 extern int audit_classify_syscall(int abi, unsigned syscall);
 #ifdef CONFIG_AUDITSYSCALL
 /* These are defined in auditsc.c */
/* Public API */
-extern int  audit_alloc(struct task_struct *task);
-extern void audit_free(struct task_struct *task);
 extern void audit_syscall_entry(int arch,
int major, unsigned long a0, unsigned long a1,
unsigned long a2, unsigned long a3);
 extern void audit_syscall_exit(int failed, long return_code);
 extern void __audit_getname(const char *name);
@@ -432,12 +430,10 @@ static inline int audit_mq_getsetattr(mq
return __audit_mq_getsetattr(mqdes, mqstat);
return 0;
 }
 extern int audit_n_rules;
 #else
-#define audit_alloc(t) ({ 0; })
-#define audit_free(t) do { ; } while (0)
 #define audit_syscall_entry(ta,a,b,c,d,e) do { ; } while (0)
 #define audit_syscall_exit(f,r) do { ; } while (0)
 #define audit_dummy_context() 1
 #define audit_getname(n) do { ; } while (0)
 #define audit_putname(n) do { ; } while (0)
Index: linux-2.6.19/kernel/fork.c
===
--- linux-2.6.19.orig/kernel/fork.c
+++ linux-2.6.19/kernel/fork.c
@@ -37,11 +37,10 @@
 #include 
 #include 
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
 #include 
 #include 
@@ -1102,15 +1101,13 @@ static struct task_struct *copy_process(
p->blocked_on = NULL; /* not blocked yet */

Prefetch hint

Prefetch the entire array of function pointers.

Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>

---
 kernel/task_watchers.c |2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6.19/kernel/task_watchers.c
===
--- linux-2.6.19.orig/kernel/task_watchers.c
+++ linux-2.6.19/kernel/task_watchers.c
@@ -1,6 +1,7 @@
 #include 
+#include 
 
 /* Defined in include/asm-generic/vmlinux.lds.h */
 extern const task_watcher_fn __start_task_init[],
__start_task_clone[], __start_task_exec[],
__start_task_uid[], __start_task_gid[],
@@ -30,10 +31,11 @@ int notify_task_watchers(unsigned int ev
 
tw_call = twtable[ev];
tw_end = twtable[ev + 1];
 
/* Call all of the watchers, report the first error */
+   prefetch_range(tw_call, tw_end - tw_call);
for (; tw_call < tw_end; tw_call++) {
err = (*tw_call)(val, tsk);
if (unlikely((err < 0) && (ret_err == 0)))
ret_err = err;
}

--
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Register IRQ flag tracing task watcher

Register an irq-flag-tracing task watcher instead of hooking into
copy_process().

Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>
---
 kernel/fork.c   |   19 ---
 kernel/irq/handle.c |   24 
 2 files changed, 24 insertions(+), 19 deletions(-)

Index: linux-2.6.19/kernel/fork.c
===
--- linux-2.6.19.orig/kernel/fork.c
+++ linux-2.6.19/kernel/fork.c
@@ -1059,29 +1059,10 @@ static struct task_struct *copy_process(
p->tgid = current->tgid;
 
retval = notify_task_watchers(WATCH_TASK_INIT, clone_flags, p);
if (retval < 0)
goto bad_fork_cleanup_delays_binfmt;
-#ifdef CONFIG_TRACE_IRQFLAGS
-   p->irq_events = 0;
-#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
-   p->hardirqs_enabled = 1;
-#else
-   p->hardirqs_enabled = 0;
-#endif
-   p->hardirq_enable_ip = 0;
-   p->hardirq_enable_event = 0;
-   p->hardirq_disable_ip = _THIS_IP_;
-   p->hardirq_disable_event = 0;
-   p->softirqs_enabled = 1;
-   p->softirq_enable_ip = _THIS_IP_;
-   p->softirq_enable_event = 0;
-   p->softirq_disable_ip = 0;
-   p->softirq_disable_event = 0;
-   p->hardirq_context = 0;
-   p->softirq_context = 0;
-#endif
 #ifdef CONFIG_LOCKDEP
p->lockdep_depth = 0; /* no locks held yet */
p->curr_chain_key = 0;
p->lockdep_recursion = 0;
 #endif
Index: linux-2.6.19/kernel/irq/handle.c
===
--- linux-2.6.19.orig/kernel/irq/handle.c
+++ linux-2.6.19/kernel/irq/handle.c
@@ -13,10 +13,11 @@
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 
 #include "internals.h"
 
 /**
  * handle_bad_irq - handle spurious and unhandled irqs
@@ -266,6 +267,29 @@ void early_init_irq_lock_class(void)
 
for (i = 0; i < NR_IRQS; i++)
lockdep_set_class(_desc[i].lock, _desc_lock_class);
 }
 
+static int __task_init init_task_trace_irqflags(unsigned long clone_flags,
+   struct task_struct *p)
+{
+   p->irq_events = 0;
+#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
+   p->hardirqs_enabled = 1;
+#else
+   p->hardirqs_enabled = 0;
+#endif
+   p->hardirq_enable_ip = 0;
+   p->hardirq_enable_event = 0;
+   p->hardirq_disable_ip = _THIS_IP_;
+   p->hardirq_disable_event = 0;
+   p->softirqs_enabled = 1;
+   p->softirq_enable_ip = _THIS_IP_;
+   p->softirq_enable_event = 0;
+   p->softirq_disable_ip = 0;
+   p->softirq_disable_event = 0;
+   p->hardirq_context = 0;
+   p->softirq_context = 0;
+   return 0;
+}
+DEFINE_TASK_INITCALL(init_task_trace_irqflags);
 #endif

--
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Register process events connector

Make the Process events connector use task watchers instead of hooking the
paths it's interested in.

Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>
---
 drivers/connector/cn_proc.c |   51 +++-
 fs/exec.c   |1 
 include/linux/cn_proc.h |   21 --
 kernel/exit.c   |2 -
 kernel/fork.c   |2 -
 kernel/sys.c|8 --
 6 files changed, 36 insertions(+), 49 deletions(-)

Index: linux-2.6.19/drivers/connector/cn_proc.c
===
--- linux-2.6.19.orig/drivers/connector/cn_proc.c
+++ linux-2.6.19/drivers/connector/cn_proc.c
@@ -44,19 +44,20 @@ static inline void get_seq(__u32 *ts, in
*ts = get_cpu_var(proc_event_counts)++;
*cpu = smp_processor_id();
put_cpu_var(proc_event_counts);
 }
 
-void proc_fork_connector(struct task_struct *task)
+static int proc_fork_connector(unsigned long clone_flags,
+  struct task_struct *task)
 {
struct cn_msg *msg;
struct proc_event *ev;
__u8 buffer[CN_PROC_MSG_SIZE];
struct timespec ts;
 
if (atomic_read(_event_num_listeners) < 1)
-   return;
+   return 0;
 
msg = (struct cn_msg*)buffer;
ev = (struct proc_event*)msg->data;
get_seq(>seq, >cpu);
ktime_get_ts(); /* get high res monotonic timestamp */
@@ -70,21 +71,24 @@ void proc_fork_connector(struct task_str
memcpy(>id, _proc_event_id, sizeof(msg->id));
msg->ack = 0; /* not used */
msg->len = sizeof(*ev);
/*  If cn_netlink_send() failed, the data is not sent */
cn_netlink_send(msg, CN_IDX_PROC, GFP_KERNEL);
+   return 0;
 }
+DEFINE_TASK_CLONECALL(proc_fork_connector);
 
-void proc_exec_connector(struct task_struct *task)
+static int proc_exec_connector(unsigned long ignore,
+  struct task_struct *task)
 {
struct cn_msg *msg;
struct proc_event *ev;
struct timespec ts;
__u8 buffer[CN_PROC_MSG_SIZE];
 
if (atomic_read(_event_num_listeners) < 1)
-   return;
+   return 0;
 
msg = (struct cn_msg*)buffer;
ev = (struct proc_event*)msg->data;
get_seq(>seq, >cpu);
ktime_get_ts(); /* get high res monotonic timestamp */
@@ -95,21 +99,23 @@ void proc_exec_connector(struct task_str
 
memcpy(>id, _proc_event_id, sizeof(msg->id));
msg->ack = 0; /* not used */
msg->len = sizeof(*ev);
cn_netlink_send(msg, CN_IDX_PROC, GFP_KERNEL);
+   return 0;
 }
+DEFINE_TASK_EXECCALL(proc_exec_connector);
 
-void proc_id_connector(struct task_struct *task, int which_id)
+static int process_change_id(unsigned long which_id, struct task_struct *task)
 {
struct cn_msg *msg;
struct proc_event *ev;
__u8 buffer[CN_PROC_MSG_SIZE];
struct timespec ts;
 
if (atomic_read(_event_num_listeners) < 1)
-   return;
+   return 0;
 
msg = (struct cn_msg*)buffer;
ev = (struct proc_event*)msg->data;
ev->what = which_id;
ev->event_data.id.process_pid = task->pid;
@@ -119,47 +125,64 @@ void proc_id_connector(struct task_struc
ev->event_data.id.e.euid = task->euid;
} else if (which_id == PROC_EVENT_GID) {
ev->event_data.id.r.rgid = task->gid;
ev->event_data.id.e.egid = task->egid;
} else
-   return;
+   return 0;
get_seq(>seq, >cpu);
ktime_get_ts(); /* get high res monotonic timestamp */
ev->timestamp_ns = timespec_to_ns();
 
memcpy(>id, _proc_event_id, sizeof(msg->id));
msg->ack = 0; /* not used */
msg->len = sizeof(*ev);
cn_netlink_send(msg, CN_IDX_PROC, GFP_KERNEL);
+   return 0;
+}
+
+static int proc_change_uid_connector(unsigned long ignore,
+struct task_struct *task)
+{
+   return process_change_id(PROC_EVENT_UID, task);
+}
+DEFINE_TASK_UIDCALL(proc_change_uid_connector);
+
+static int proc_change_gid_connector(unsigned long ignore,
+struct task_struct *task)
+{
+   return process_change_id(PROC_EVENT_GID, task);
 }
+DEFINE_TASK_GIDCALL(proc_change_gid_connector);
 
-void proc_exit_connector(struct task_struct *task)
+static int proc_exit_connector(unsigned long code, struct task_struct *task)
 {
struct cn_msg *msg;
struct proc_event *ev;
__u8 buffer[CN_PROC_MSG_SIZE];
struct timespec ts;
 
if (atomic_read(_event_num_listeners) < 1)
-   return;
+   return 0;
 
msg = (struct cn_msg*)buffer;
ev = (struct proc_event*)msg->data;
get_seq(>seq, >cpu);
ktime_get_ts(); /* get high res monotonic timestamp */
ev->timestamp_ns =

Register lockdep task watcher

Register a task watcher for lockdep instead of hooking into copy_process().

Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>
---
 kernel/fork.c|5 -
 kernel/lockdep.c |   11 +++
 2 files changed, 11 insertions(+), 5 deletions(-)

Index: linux-2.6.19/kernel/fork.c
===
--- linux-2.6.19.orig/kernel/fork.c
+++ linux-2.6.19/kernel/fork.c
@@ -1059,15 +1059,10 @@ static struct task_struct *copy_process(
p->tgid = current->tgid;
 
retval = notify_task_watchers(WATCH_TASK_INIT, clone_flags, p);
if (retval < 0)
goto bad_fork_cleanup_delays_binfmt;
-#ifdef CONFIG_LOCKDEP
-   p->lockdep_depth = 0; /* no locks held yet */
-   p->curr_chain_key = 0;
-   p->lockdep_recursion = 0;
-#endif
 
 #ifdef CONFIG_DEBUG_MUTEXES
p->blocked_on = NULL; /* not blocked yet */
 #endif
 
Index: linux-2.6.19/kernel/lockdep.c
===
--- linux-2.6.19.orig/kernel/lockdep.c
+++ linux-2.6.19/kernel/lockdep.c
@@ -25,10 +25,11 @@
  * mapping lock dependencies runtime.
  */
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
@@ -2557,10 +2558,20 @@ void __init lockdep_init(void)
INIT_LIST_HEAD(chainhash_table + i);
 
lockdep_initialized = 1;
 }
 
+static int __task_init init_task_lockdep(unsigned long clone_flags,
+struct task_struct *p)
+{
+   p->lockdep_depth = 0; /* no locks held yet */
+   p->curr_chain_key = 0;
+   p->lockdep_recursion = 0;
+   return 0;
+}
+DEFINE_TASK_INITCALL(init_task_lockdep);
+
 void __init lockdep_info(void)
 {
printk("Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., 
Ingo Molnar\n");
 
printk("... MAX_LOCKDEP_SUBCLASSES:%lu\n", MAX_LOCKDEP_SUBCLASSES);

--
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Register NUMA mempolicy task watcher

Register a NUMA mempolicy task watcher instead of hooking into
copy_process() and do_exit() directly.

Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>
---
 kernel/exit.c  |4 
 kernel/fork.c  |   15 +--
 mm/mempolicy.c |   25 +
 3 files changed, 26 insertions(+), 18 deletions(-)

Benchmark results:
System: 4 1.7GHz ppc64 (Power 4+) processors, 30968600MB RAM, 2.6.19-rc2-mm2 
kernel

Clone   Number of Children Cloned
500075001   12500   15000   
17500

---
Mean17836.3 18085.2 18220.4 18225   18319   
18339
Dev 302.801 314.617 303.079 293.46  287.267 
294.819
Err (%) 1.69767 1.73963 1.6634  1.6102  1.56814 
1.60761

ForkNumber of Children Forked
500075001   12500   15000   
17500

---
Mean17896.2 17990   18100.6 18242.3 18244   
18346.9
Dev 301.64  285.698 295.646 304.361 299.472 
287.153
Err (%) 1.6855  1.58809 1.63335 1.66844 1.64148 
1.56513

Kernbench:
Elapsed: 124.532s User: 439.732s System: 46.497s CPU: 389.9%
439.71user 46.48system 2:04.24elapsed 391%CPU (0avgtext+0avgdata 0maxresident)k
439.79user 46.42system 2:05.10elapsed 388%CPU (0avgtext+0avgdata 0maxresident)k
439.74user 46.44system 2:04.60elapsed 390%CPU (0avgtext+0avgdata 0maxresident)k
439.75user 46.64system 2:04.74elapsed 389%CPU (0avgtext+0avgdata 0maxresident)k
439.61user 46.45system 2:05.36elapsed 387%CPU (0avgtext+0avgdata 0maxresident)k
439.60user 46.43system 2:04.33elapsed 390%CPU (0avgtext+0avgdata 0maxresident)k
439.77user 46.47system 2:04.34elapsed 391%CPU (0avgtext+0avgdata 0maxresident)k
439.87user 46.45system 2:04.10elapsed 391%CPU (0avgtext+0avgdata 0maxresident)k
439.76user 46.71system 2:04.58elapsed 390%CPU (0avgtext+0avgdata 0maxresident)k
439.72user 46.48system 2:03.93elapsed 392%CPU (0avgtext+0avgdata 0maxresident)k

Index: linux-2.6.19/mm/mempolicy.c
===
--- linux-2.6.19.orig/mm/mempolicy.c
+++ linux-2.6.19/mm/mempolicy.c
@@ -87,10 +87,11 @@
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
 
 /* Internal flags */
@@ -1331,10 +1332,34 @@ struct mempolicy *__mpol_copy(struct mem
}
}
return new;
 }
 
+static int __task_init init_task_mempolicy(unsigned long clone_flags,
+  struct task_struct *tsk)
+{
+   tsk->mempolicy = mpol_copy(tsk->mempolicy);
+   if (IS_ERR(tsk->mempolicy)) {
+   int retval;
+
+   retval = PTR_ERR(tsk->mempolicy);
+   tsk->mempolicy = NULL;
+   return retval;
+   }
+   mpol_fix_fork_child_flag(tsk);
+   return 0;
+}
+DEFINE_TASK_INITCALL(init_task_mempolicy);
+
+static int __task_free free_task_mempolicy(unsigned int ignored,
+  struct task_struct *tsk)
+{
+   mpol_free(tsk->mempolicy);
+   tsk->mempolicy = NULL;
+}
+DEFINE_TASK_FREECALL(free_task_mempolicy);
+
 /* Slow path of a mempolicy comparison */
 int __mpol_equal(struct mempolicy *a, struct mempolicy *b)
 {
if (!a || !b)
return 0;
Index: linux-2.6.19/kernel/fork.c
===
--- linux-2.6.19.orig/kernel/fork.c
+++ linux-2.6.19/kernel/fork.c
@@ -1059,19 +1059,10 @@ static struct task_struct *copy_process(
p->tgid = current->tgid;
 
retval = notify_task_watchers(WATCH_TASK_INIT, clone_flags, p);
if (retval < 0)
goto bad_fork_cleanup_delays_binfmt;
-#ifdef CONFIG_NUMA
-   p->mempolicy = mpol_copy(p->mempolicy);
-   if (IS_ERR(p->mempolicy)) {
-   retval = PTR_ERR(p->mempolicy);
-   p->mempolicy = NULL;
-   goto bad_fork_cleanup_delays_binfmt;
-   }
-   mpol_fix_fork_child_flag(p);
-#endif
 #ifdef CONFIG_TRACE_IRQFLAGS
p->irq_events = 0;
 #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
p->hardirqs_enabled = 1;
 #else
@@ -1098,11 +1089,11 @@ static struct task_struct *copy_process(
 #ifdef CONFIG_DEBUG_MUTEXES
p->blocked_on = NULL; /* not blocked yet */
 #endif
 
if ((retval = security_task_alloc(p)))
-   goto bad_fork_cleanup_policy;
+   goto bad_fork_cleanup_delays_binfmt;
/* copy all the process information */
if ((retval = copy_files(clone_flags, p)))
goto bad_fork_cleanup_security;
if ((retval =

Register cpuset task watcher

Register a task watcher for cpusets instead of hooking into
copy_process() and do_exit() directly.

Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>
Cc: Paul Jackson <[EMAIL PROTECTED]>
---
 include/linux/cpuset.h |4 
 kernel/cpuset.c|   11 +--
 kernel/exit.c  |2 --
 kernel/fork.c  |6 +-
 4 files changed, 10 insertions(+), 13 deletions(-)

Index: linux-2.6.19/kernel/fork.c
===
--- linux-2.6.19.orig/kernel/fork.c
+++ linux-2.6.19/kernel/fork.c
@@ -28,11 +28,10 @@
 #include 
 #include 
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
 #include 
 #include 
@@ -1060,17 +1059,16 @@ static struct task_struct *copy_process(
p->tgid = current->tgid;
 
retval = notify_task_watchers(WATCH_TASK_INIT, clone_flags, p);
if (retval < 0)
goto bad_fork_cleanup_delays_binfmt;
-   cpuset_fork(p);
 #ifdef CONFIG_NUMA
p->mempolicy = mpol_copy(p->mempolicy);
if (IS_ERR(p->mempolicy)) {
retval = PTR_ERR(p->mempolicy);
p->mempolicy = NULL;
-   goto bad_fork_cleanup_cpuset;
+   goto bad_fork_cleanup_delays_binfmt;
}
mpol_fix_fork_child_flag(p);
 #endif
 #ifdef CONFIG_TRACE_IRQFLAGS
p->irq_events = 0;
@@ -1279,13 +1277,11 @@ bad_fork_cleanup_files:
 bad_fork_cleanup_security:
security_task_free(p);
 bad_fork_cleanup_policy:
 #ifdef CONFIG_NUMA
mpol_free(p->mempolicy);
-bad_fork_cleanup_cpuset:
 #endif
-   cpuset_exit(p);
 bad_fork_cleanup_delays_binfmt:
delayacct_tsk_free(p);
notify_task_watchers(WATCH_TASK_FREE, 0, p);
if (p->binfmt)
module_put(p->binfmt->module);
Index: linux-2.6.19/kernel/cpuset.c
===
--- linux-2.6.19.orig/kernel/cpuset.c
+++ linux-2.6.19/kernel/cpuset.c
@@ -47,10 +47,11 @@
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
 #include 
 
@@ -2173,17 +2174,20 @@ void __init cpuset_init_smp(void)
  *
  * At the point that cpuset_fork() is called, 'current' is the parent
  * task, and the passed argument 'child' points to the child task.
  **/
 
-void cpuset_fork(struct task_struct *child)
+static int __task_init cpuset_fork(unsigned long clone_flags,
+   struct task_struct *child)
 {
task_lock(current);
child->cpuset = current->cpuset;
atomic_inc(>cpuset->count);
task_unlock(current);
+   return 0;
 }
+DEFINE_TASK_INITCALL(cpuset_fork);
 
 /**
  * cpuset_exit - detach cpuset from exiting task
  * @tsk: pointer to task_struct of exiting process
  *
@@ -2240,11 +2244,12 @@ void cpuset_fork(struct task_struct *chi
  *to NULL here, and check in cpuset_update_task_memory_state()
  *for a NULL pointer.  This hack avoids that NULL check, for no
  *cost (other than this way too long comment ;).
  **/
 
-void cpuset_exit(struct task_struct *tsk)
+static int __task_free cpuset_exit(unsigned long exit_code,
+   struct task_struct *tsk)
 {
struct cpuset *cs;
 
cs = tsk->cpuset;
tsk->cpuset = _cpuset;  /* the_top_cpuset_hack - see above */
@@ -2258,11 +2263,13 @@ void cpuset_exit(struct task_struct *tsk
mutex_unlock(_mutex);
cpuset_release_agent(pathbuf);
} else {
atomic_dec(>count);
}
+   return 0;
 }
+DEFINE_TASK_FREECALL(cpuset_exit);
 
 /**
  * cpuset_cpus_allowed - return cpus_allowed mask from a tasks cpuset.
  * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
  *
Index: linux-2.6.19/kernel/exit.c
===
--- linux-2.6.19.orig/kernel/exit.c
+++ linux-2.6.19/kernel/exit.c
@@ -28,11 +28,10 @@
 #include 
 #include 
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
 #include 
 #include 
@@ -918,11 +917,10 @@ fastcall NORET_TYPE void do_exit(long co
if (group_dead)
acct_process();
__exit_files(tsk);
__exit_fs(tsk);
exit_thread();
-   cpuset_exit(tsk);
exit_keys(tsk);
 
if (group_dead && tsk->signal->leader)
disassociate_ctty(1);
 
Index: linux-2.6.19/include/linux/cpuset.h
===
--- linux-2.6.19.orig/include/linux/cpuset.h
+++ linux-2.6.19/include/linux/cpuset.h
@@ -17,12 +17,10 @@
 extern int number_of_cpusets;  /* How many cpusets are defined in system? */
 
 extern int cpuset_init_early(void);
 extern int cpuset_init(void);
 extern void cpuset_init_smp(void);
-extern void cpuset_fork(struct task_struct *p);
-extern void cpuset_exit(struct task_struct *p);
 extern cpumask_t cpuset_cpus_allowed(struct task_struct *p);

Register semundo task watcher

Make the semaphore undo code use a task watcher instead of hooking into
copy_process() and do_exit() directly.

Signed-off-by: Matt Helsley <[EMAIL PROTECTED]>
---
 include/linux/sem.h |   17 -
 ipc/sem.c   |   12 
 kernel/exit.c   |2 --
 kernel/fork.c   |6 +-
 4 files changed, 9 insertions(+), 28 deletions(-)

Index: linux-2.6.19/ipc/sem.c
===
--- linux-2.6.19.orig/ipc/sem.c
+++ linux-2.6.19/ipc/sem.c
@@ -81,10 +81,11 @@
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include "util.h"
 
 #define sem_ids(ns)(*((ns)->ids[IPC_SEM_IDS]))
@@ -1287,11 +1288,11 @@ asmlinkage long sys_semop (int semid, st
  * See the notes above unlock_semundo() regarding the spin_lock_init()
  * in this code.  Initialize the undo_list->lock here instead of 
get_undo_list()
  * because of the reasoning in the comment above unlock_semundo.
  */
 
-int copy_semundo(unsigned long clone_flags, struct task_struct *tsk)
+static int __task_init copy_semundo(unsigned long clone_flags, struct 
task_struct *tsk)
 {
struct sem_undo_list *undo_list;
int error;
 
if (clone_flags & CLONE_SYSVSEM) {
@@ -1303,10 +1304,11 @@ int copy_semundo(unsigned long clone_fla
} else 
tsk->sysvsem.undo_list = NULL;
 
return 0;
 }
+DEFINE_TASK_INITCALL(copy_semundo);
 
 /*
  * add semadj values to semaphores, free undo structures.
  * undo structures are not freed when semaphore arrays are destroyed
  * so some of them may be out of date.
@@ -1316,22 +1318,22 @@ int copy_semundo(unsigned long clone_fla
  * should we queue up and wait until we can do so legally?
  * The original implementation attempted to do this (queue and wait).
  * The current implementation does not do so. The POSIX standard
  * and SVID should be consulted to determine what behavior is mandated.
  */
-void exit_sem(struct task_struct *tsk)
+static int __task_free exit_sem(unsigned long ignored, struct task_struct *tsk)
 {
struct sem_undo_list *undo_list;
struct sem_undo *u, **up;
struct ipc_namespace *ns;
 
undo_list = tsk->sysvsem.undo_list;
if (!undo_list)
-   return;
+   return 0;
 
if (!atomic_dec_and_test(_list->refcnt))
-   return;
+   return 0;
 
ns = tsk->nsproxy->ipc_ns;
/* There's no need to hold the semundo list lock, as current
  * is the last task exiting for this undo list.
 */
@@ -1394,11 +1396,13 @@ found:
update_queue(sma);
 next_entry:
sem_unlock(sma);
}
kfree(undo_list);
+   return 0;
 }
+DEFINE_TASK_FREECALL(exit_sem);
 
 #ifdef CONFIG_PROC_FS
 static int sysvipc_sem_proc_show(struct seq_file *s, void *it)
 {
struct sem_array *sma = it;
Index: linux-2.6.19/kernel/exit.c
===
--- linux-2.6.19.orig/kernel/exit.c
+++ linux-2.6.19/kernel/exit.c
@@ -45,11 +45,10 @@
 #include 
 #include 
 #include 
 #include 
 
-extern void sem_exit (void);
 extern struct task_struct *child_reaper;
 
 static void exit_mm(struct task_struct * tsk);
 
 static void __unhash_process(struct task_struct *p)
@@ -916,11 +915,10 @@ fastcall NORET_TYPE void do_exit(long co
exit_mm(tsk);
notify_task_watchers(WATCH_TASK_FREE, code, tsk);
 
if (group_dead)
acct_process();
-   exit_sem(tsk);
__exit_files(tsk);
__exit_fs(tsk);
exit_thread();
cpuset_exit(tsk);
exit_keys(tsk);
Index: linux-2.6.19/kernel/fork.c
===
--- linux-2.6.19.orig/kernel/fork.c
+++ linux-2.6.19/kernel/fork.c
@@ -1102,14 +1102,12 @@ static struct task_struct *copy_process(
 #endif
 
if ((retval = security_task_alloc(p)))
goto bad_fork_cleanup_policy;
/* copy all the process information */
-   if ((retval = copy_semundo(clone_flags, p)))
-   goto bad_fork_cleanup_security;
if ((retval = copy_files(clone_flags, p)))
-   goto bad_fork_cleanup_semundo;
+   goto bad_fork_cleanup_security;
if ((retval = copy_fs(clone_flags, p)))
goto bad_fork_cleanup_files;
if ((retval = copy_sighand(clone_flags, p)))
goto bad_fork_cleanup_fs;
if ((retval = copy_signal(clone_flags, p)))
@@ -1276,12 +1274,10 @@ bad_fork_cleanup_sighand:
__cleanup_sighand(p->sighand);
 bad_fork_cleanup_fs:
exit_fs(p); /* blocking */
 bad_fork_cleanup_files:
exit_files(p); /* blocking */
-bad_fork_cleanup_semundo:
-   exit_sem(p);
 bad_fork_cleanup_security:
security_task_free(p);
 bad_fork_cleanup_policy:
 #ifdef CONFIG_NUMA
mpol_free(p->mempolicy);
Index: linux-2.6.19/include/linux/sem.h

Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems

2006-12-14 Thread Al Boldi

Nikolai Joukov wrote:
> > > We started the project in April 2004.  Right now I am using it as my
> > > /home/kolya file system at home.  We believe that at this stage RAIF
> > > is mature enough for others to try it out.  The code is available at:
> > >
> > >   
> > >
> > > The code requires no kernel patches and compiles for a wide range of
> > > kernels as a module.  The latest kernel we used it for is 2.6.13 and
> > > we are in the process of porting it to 2.6.19.
> > >
> > > We will be happy to hear your back.
> >
> > When removing a file from the underlying branch, the oops below happens.
> > Wouldn't it be possible to just fail the branch instead of oopsing?
>
> This is a known problem of all Linux stackable file systems.  Users are
> not supposed to change the file systems below mounted stackable file
> systems (but they can read them).  One of the ways to enforce it is to use
> overlay mounts.  For example, mount the lower file systems at
> /raif/b0 ... /raif/bN and then mount RAIF at /raif.  Stackable file
> systems recently started getting into the kernel and we hope that there
> will be a better solution for this problem in the future.  Having said
> that, you are right: failing the branch would be the right thing to do.

Good.  It seems that there is also some tmpfs/raif-over-nfs deadlock 
situation.  Can't really tell if it's the kernel or the raif, but when do 
you think the patches could be brought into sync with the current mainline?


Thanks!

--
Al

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems

2006-12-14 Thread Al Boldi

Nikolai Joukov wrote:
> > Nikolai Joukov wrote:
> > > We have designed a new stackable file system that we called RAIF:
> > > Redundant Array of Independent Filesystems.
> >
> > Great!
> >
> > > We have performed some benchmarking on a 3GHz PC with 2GB of RAM and
> > > U320 SCSI disks.  Compared to the Linux RAID driver, RAIF has
> > > overheads of about 20-25% under the Postmark v1.5 benchmark in case of
> > > striping and replication.  In case of RAID4 and RAID5-like
> > > configurations, RAIF performed about two times *better* than software
> > > RAID and even better than an Adaptec 2120S RAID5 controller.
> >
> > I am not surprised.  RAID 4/5/6 performance is highly sensitive to the
> > underlying hw, and thus needs a fair amount of fine tuning.
>
> Nevertheless, performance is not the biggest advantage of RAIF.  For
> read-biased workloads RAID is always slightly faster than RAIF.  The
> biggest advantages of RAIF are flexible configurations (e.g., can combine
> NFS and local file systems), per-file-type storage policies, and the fact
> that files are stored as files on the lower file systems (which is
> convenient).

Ok, a I was just about to inform you of a three nfs-branch raif which was 
unable to fill the net pipe.  So it looks like a 25% performance hit across 
the board.  Should be possible to reduce to sub 3% though once RAIF matures, 
don't you think?


> > > This is because RAIF is located above
> > > file system caches and can cache parity as normal data when needed. 
> > > We have more performance details in a technical report, if anyone is
> > > interested.
> >
> > Definitely interested.  Can you give a link?
>
> The main focus of the paper is on a general OS profiling method and not
> on RAIF.  However, it has some details about the RAIF benchmarking with
> Postmark in Chapter 9:
>
>   
>
> Figures 9.7 and 9.8 also show profiles of the Linux RAID5 and RAIF5
> operation under the same Postmark workload.


Thanks!

--
Al

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

crash in 'wake_up_interruptible()' on SMP

2006-12-14 Thread kiran kumar


Can some one explain why I see the below crash on Intel Xeon SMP box.
The kernel version is 2.6.11. This is what I'm trying to do in the
driver.

1.Submit a request to a device in 'unlocked_ioctl()' and issue
'wait_event_interruptible_timeout()' for 10 jiffies. There can be many
such outstanding requests issued by different processes and all these
are placed in a queue.
2.The 'wake_up_interruptible()' is issued either from tasklet or a
poll-thread which polls on the status of the request.
3. The request queue is protected using  'spin_lock_bh/spin_unlock_bh'
to be softIRQ safe. I'm stating this to point that
'spin_lock_irqsave/spin_lock_irqrestore' is issued only within waitQ.

If i either not use 'unlocked_ioctl()' i.e. use ioctl() (or)comment
out 'wake_up_interruptible()' call I don't see the crash. Is
wake_up_interruptible SMP safe???

Regards,
kiran

/-/
[EMAIL PROTECTED] ~]# [ cut here ]
kernel BUG at include/asm/spinlock.h:112!
invalid operand:  [#1]
SMP
Modules linked in: pkp_drv(U) md5 ipv6 parport_pc lp parport autofs4
rfcomm l2cap bluetooth sunrpc dm_mod video button battery ac uhci_hcd
hw_random i2c_i801 i2c_core shpchp e1000 e100 mii floppy sata_sil
libata scsi_mod ext3 jbd
CPU:0
EIP:0060:[]Tainted: P  VLI
EFLAGS: 00210002   (2.6.11-1.1369_FC4smp)
EIP is at _spin_unlock_irqrestore+0x26/0x30
eax: 0001   ebx: cfefb810   ecx: cfefb810   edx: 00200292
esi:    edi: e0af0f20   ebp:    esp: c8ae2e10
ds: 007b   es: 007b   ss: 0068
Process swamp (pid: 5920, threadinfo=c8ae2000 task=c79d4a80)
Stack: badc0ded e0c8e1af  e0af0268 e0a86ef8 e0c87f46 0802 
  00200286 e0a86ef8 00200286 e0c9c580  e0c87dd1 cfec1810 d028a810
   bf9fc228 e0c8dab4 0001 c8ae2000 3f37331a bf9fc228 e0c9c580
/--/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

2.6.20-rc1-mm1


Temporarily at

http://userweb.kernel.org/~akpm/2.6.20-rc1-mm1/

Will appear later at


ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc1/2.6.20-rc1-mm1/



- Added the avr32 devel tree as git-avr32.patch (Haavard Skinnemoen)

- Don't enable locking API self-tests on powerpc - it explodes in a
  spectacular fashion.




Boilerplate:

- See the `hot-fixes' directory for any important updates to this patchset.

- To fetch an -mm tree using git, use (for example)

  git-fetch git://git.kernel.org/pub/scm/linux/kernel/git/smurf/linux-trees.git 
tag v2.6.16-rc2-mm1
  git-checkout -b local-v2.6.16-rc2-mm1 v2.6.16-rc2-mm1

- -mm kernel commit activity can be reviewed by subscribing to the
  mm-commits mailing list.

echo "subscribe mm-commits" | mail [EMAIL PROTECTED]

- If you hit a bug in -mm and it is not obvious which patch caused it, it is
  most valuable if you can perform a bisection search to identify which patch
  introduced the bug.  Instructions for this process are at

http://www.zip.com.au/~akpm/linux/patches/stuff/bisecting-mm-trees.txt

  But beware that this process takes some time (around ten rebuilds and
  reboots), so consider reporting the bug first and if we cannot immediately
  identify the faulty patch, then perform the bisection search.

- When reporting bugs, please try to Cc: the relevant maintainer and mailing
  list on any email.

- When reporting bugs in this kernel via email, please also rewrite the
  email Subject: in some manner to reflect the nature of the bug.  Some
  developers filter by Subject: when looking for messages to read.

- Semi-daily snapshots of the -mm lineup are uploaded to
  ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/ and are announced on
  the mm-commits list.



Changes since 2.6.19-mm1:

 origin.patch
 git-acpi.patch
 git-alsa.patch
 git-avr32.patch
 git-cpufreq.patch
 git-drm.patch
 git-dvb.patch
 git-gfs2-nmw.patch
 git-ieee1394.patch
 git-infiniband.patch
 git-libata-all.patch
 git-lxdialog.patch
 git-mmc.patch
 git-mmc-fixup.patch
 git-mtd.patch
 git-ubi.patch
 git-netdev-all.patch
 git-ioat.patch
 git-ocfs2.patch
 git-pcmcia.patch
 git-chelsio.patch
 git-selinux.patch
 git-pciseg.patch
 git-s390.patch
 git-sh.patch
 git-sas.patch
 git-sparc64.patch
 git-qla3xxx.patch
 git-wireless.patch
 git-gccbug.patch

 git trees.

-x86-smp-export-smp_num_siblings-for-oprofile.patch
-tty-export-get_current_tty.patch
-ieee80211softmac-fix-errors-related-to-the-work_struct-changes.patch
-kvm-add-missing-include.patch
-kvm-put-kvm-in-a-new-virtualization-menu.patch
-kvm-clean-up-amd-svm-debug-registers-load-and-unload.patch
-kvm-replace-__x86_64__-with-config_x86_64.patch
-fix-more-workqueue-build-breakage-tps65010.patch
-another-build-fix-header-rearrangements-osk.patch
-uml-fix-net_kern-workqueue-abuse.patch
-isdn-gigaset-fix-possible-missing-wakeup.patch
-i2o_exec_exit-and-i2o_driver_exit-should-not-be-__exit.patch
-cpufreq-fix-bug-in-duplicate-freq-elimination-code-in-acpi-cpufreq.patch
-gregkh-driver-modules-state.patch
-gregkh-driver-driver-core-delete-virtual-directory-on-class_unregister.patch
-gregkh-driver-debugfs-inotify-create-mkdir-support.patch
-gregkh-driver-debugfs-coding-style-fixes.patch
-gregkh-driver-debugfs-file-directory-creation-error-handling.patch
-gregkh-driver-debugfs-more-file-directory-creation-error-handling.patch
-gregkh-driver-debugfs-file-directory-removal-fix.patch
-gregkh-driver-driver-core-platform_driver_probe-can-save-codespace-save-codespace.patch
-gregkh-driver-driver-core-make-platform_device_add_data-accept-a-const-pointer.patch
-gregkh-driver-driver-core-deprecate-pm_legacy-default-it-to-n.patch
-drm-fix-return-value-check.patch
-drm-handle-pci_enable_device-failure.patch
-jdelvare-i2c-i2c-documentation-typos.patch
-jdelvare-i2c-i2c-update-i2c-id-list.patch
-jdelvare-i2c-i2c-delete-ite-bus-driver.patch
-jdelvare-i2c-i2c-pnx-new-driver.patch
-jdelvare-i2c-i2c-ibm_iic-add_request_release_mem_region.patch
-jdelvare-i2c-i2c-nforce2-cleanup.patch
-jdelvare-i2c-i2c-lockdep-handle-recursive-locking.patch
-jdelvare-i2c-i2c-at91-new-bus-driver.patch
-jdelvare-i2c-i2c-dev-make-I2C_FUNCS-ioctl-faster.patch
-jdelvare-i2c-i2c-remove-extraneous-whitespace.patch
-jdelvare-i2c-i2c-core-use-__ATTR.patch
-jdelvare-i2c-i2c-i801-documentation-update.patch
-jdelvare-i2c-i2c-fix-broken-ds1337-initialization.patch
-jdelvare-i2c-i2c-versatile-new-arm-bus-driver.patch
-jdelvare-i2c-i2c-discard-del-bus-wrappers.patch
-jdelvare-i2c-i2c-i801-enable-PEC-on-ICH6.patch
-jdelvare-i2c-i2c-dev-fix-return-value-check.patch
-jdelvare-i2c-i2c-dev-merge-kfree.patch
-jdelvare-i2c-i2c-omap-prescaler-formula.patch
-jdelvare-hwmon-hwmon-f71805f-add-fanctl-1-prepare.patch
-jdelvare-hwmon-hwmon-f71805f-add-fanctl-2-manual-mode.patch
-jdelvare-hwmon-hwmon-f71805f-add-fanctl-3-pwm-freq.patch
-jdelvare-hwmon-hwmon-f71805f-add-fanctl-4-pwm-mode.patch
-jdelvare-hwmon-hwmon-f71805f-add-fanctl-5-speed-mode.patch

RE: 2.6.18.4: flush_workqueue calls mutex_lock in interrupt environment

2006-12-14 Thread Chen, Kenneth W

Chen, Kenneth wrote on Thursday, December 14, 2006 5:59 PM
> > It seems utterly insane to have aio_complete() flush a workqueue. That
> > function has to be called from a number of different environments,
> > including non-sleep tolerant environments.
> > 
> > For instance it means that directIO on NFS will now cause the rpciod
> > workqueues to call flush_workqueue(aio_wq), thus slowing down all RPC
> > activity.
> 
> The bug appears to be somewhere else, somehow the ref count on ioctx is
> all messed up.
> 
> In aio_complete, __put_ioctx() should not be invoked because ref count
> on ioctx is supposedly more than 2, aio_complete decrement it once and
> should return without invoking the free function.
> 
> The real freeing ioctx should be coming from exit_aio() or io_destroy(),
> in which case both wait until no further pending AIO request via
> wait_for_all_aios().

Ah, I think I see the bug: it must be a race between io_destroy() and
aio_complete().  A possible scenario:

cpu0   cpu1
io_destroy aio_complete
  wait_for_all_aios {__aio_put_req
 ... ctx->reqs_active--;
 if (!ctx->reqs_active)
return;
  }
  ...
  put_ioctx(ioctx)

 put_ioctx(ctx);
bam! Bug trigger!

AIO finished on cpu1 and while in the middle of aio_complete, cpu0 starts
io_destroy sequence, sees no pending AIO, went ahead decrement the ref
count on ioctx.  At a later point in aio_complete, the put_ioctx decrement
last ref count and calls the ioctx freeing function and there it triggered
the bug warning.

A simple fix would be to access ctx->reqs_active inside ctx spin lock in 
wait_for_all_aios().  At the mean time, I would like to
remove ref counting
for each iocb because we already performing ref count using reqs_active. This
would also prevent similar buggy code in the future.


Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./fs/aio.c.orig 2006-11-29 13:57:37.0 -0800
+++ ./fs/aio.c  2006-12-14 20:45:14.0 -0800
@@ -298,17 +298,23 @@ static void wait_for_all_aios(struct kio
struct task_struct *tsk = current;
DECLARE_WAITQUEUE(wait, tsk);
 
+   spin_lock_irq(>ctx_lock);
if (!ctx->reqs_active)
-   return;
+   goto out;
 
add_wait_queue(>wait, );
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
while (ctx->reqs_active) {
+   spin_unlock_irq(>ctx_lock);
schedule();
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+   spin_lock_irq(>ctx_lock);
}
__set_task_state(tsk, TASK_RUNNING);
remove_wait_queue(>wait, );
+
+out:
+   spin_unlock_irq(>ctx_lock);
 }
 
 /* wait_on_sync_kiocb:
@@ -425,7 +431,6 @@ static struct kiocb fastcall *__aio_get_
ring = kmap_atomic(ctx->ring_info.ring_pages[0], KM_USER0);
if (ctx->reqs_active < aio_ring_avail(>ring_info, ring)) {
list_add(>ki_list, >active_reqs);
-   get_ioctx(ctx);
ctx->reqs_active++;
okay = 1;
}
@@ -538,8 +543,6 @@ int fastcall aio_put_req(struct kiocb *r
spin_lock_irq(>ctx_lock);
ret = __aio_put_req(ctx, req);
spin_unlock_irq(>ctx_lock);
-   if (ret)
-   put_ioctx(ctx);
return ret;
 }
 
@@ -795,8 +798,7 @@ static int __aio_run_iocbs(struct kioctx
 */
iocb->ki_users++;   /* grab extra reference */
aio_run_iocb(iocb);
-   if (__aio_put_req(ctx, iocb))  /* drop extra ref */
-   put_ioctx(ctx);
+   __aio_put_req(ctx, iocb);
}
if (!list_empty(>run_list))
return 1;
@@ -942,7 +944,6 @@ int fastcall aio_complete(struct kiocb *
struct io_event *event;
unsigned long   flags;
unsigned long   tail;
-   int ret;
 
/*
 * Special case handling for sync iocbs:
@@ -1011,18 +1012,12 @@ int fastcall aio_complete(struct kiocb *
pr_debug("%ld retries: %zd of %zd\n", iocb->ki_retried,
iocb->ki_nbytes - iocb->ki_left, iocb->ki_nbytes);
 put_rq:
-   /* everything turned out well, dispose of the aiocb. */
-   ret = __aio_put_req(ctx, iocb);
-
spin_unlock_irqrestore(>ctx_lock, flags);
 
if (waitqueue_active(>wait))
wake_up(>wait);
 
-   if (ret)
-   put_ioctx(ctx);
-
-   return ret;
+   return aio_put_req(iocb);
 }
 
 /* aio_read_evt
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: realtime-preempt and arm

2006-12-14 Thread tike64

Steven Rostedt <[EMAIL PROTECTED]> wrote:
> So you got a big jitter using nanosleep???  If that's the case, could
> you post the times you got. I'll also boot a kernel with the latest
> -rt patch, without highres compiled, and see if I can reproduce the
> same on x86.

You're very kind! Here you go:

This is from "Linux uclibc 2.6.14.2 #12 PREEMPT" without -rt:

100 revs; min: 19888 max: 20386 avg: 20013
100 revs; min: 19724 max: 20296 avg: 20013
100 revs; min: 19920 max: 20322 avg: 20013
100 revs; min: 19840 max: 20323 avg: 20016
100 revs; min: 10276 max: 42789 avg: 21294
100 revs; min: 10466 max: 34080 avg: 21687
100 revs; min: 10249 max: 30594 avg: 21161
100 revs; min: 10962 max: 34421 avg: 21415
100 revs; min: 10437 max: 31338 avg: 20562
100 revs; min: 11660 max: 29751 avg: 21066
100 revs; min: 10457 max: 30612 avg: 21417
100 revs; min: 10270 max: 37828 avg: 21513

First four lines are with the system otherwise idle. Then I fired 'ls
-Rl /mnt/some/nfs/share' on a framebuffer console.

And the same on a "Linux uclibc 2.6.18-rt6 #19 PREEMPT":

100 revs; min: 19847 max: 20242 avg: 20014
100 revs; min: 19685 max: 20332 avg: 20014
100 revs; min: 19652 max: 20374 avg: 20014
100 revs; min: 19622 max: 20399 avg: 20012
100 revs; min: 19736 max: 26612 avg: 20074
100 revs; min: 19478 max: 21199 avg: 20021
100 revs; min: 19569 max: 21093 avg: 20022
100 revs; min: 19582 max: 20460 avg: 20017
100 revs; min: 19723 max: 20410 avg: 20016
100 revs; min: 19459 max: 24565 avg: 20056
100 revs; min: 19610 max: 24257 avg: 20053
100 revs; min: 19376 max: 26848 avg: 20079
100 revs; min: 19445 max: 26522 avg: 20077
100 revs; min: 19510 max: 22349 avg: 20034
100 revs; min: 19562 max: 20334 avg: 20017

The one to be blamed the most seems to be FB. 'ls ... > /dev/null'
leads to less than 2ms slips.

I'm supposed to make a 10ms control loop, so I could live with a couple
of ms jitter. 7ms is rather high and I think it tells about some
problem which makes one wonder if even higher occasional slips are
possible.

I made my test code visible if you want to take a look: www dot
riihineva dot no-ip dot org uphill public uphill test-rt.c

--

tike

Do you Yahoo!?
Everyone is raving about the all-new Yahoo! Mail beta.
http://new.mail.yahoo.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Abolishing the DMCA (was GPL only modules)

2006-12-14 Thread Willy Tarreau

On Thu, Dec 14, 2006 at 01:09:06PM -0800, Michael ODonald wrote:
> Linus Torvalds wrote:
> > DMCA is bad because it puts technical limits over
> > the rights expressly granted by copyright law.
> 
> The best ways to get rich corporations on our side in fighting the
> DMCA is to use the DMCA to hurt their profits. Companies that rely on
> binary drivers would have several options:
> 
> 1) Lobby politicians to repeal the DMCA, thereby allowing the
> companies to *internally* circumvent Linuxs GPL-only
> pseudo-restriction all they want by simply changing the source code.
> 
> 2) Release the binary drivers as open source or use their economic
> clout to pressure the makers of the binary drivers.
> 
> 3) Use FOSS-friendly hardware.
> 
> Im sorry, but theres currently no economic push for repealing the
> DMCA; the only people trying to abolish it are idealists who are
> easily out-bought by the media cartel. This is our only chance to put
> some corporate money muscle behind the otherwise doomed anti-DMCA
> movement.

4) make no effort to support Linux

You're not the center of the world, never forget it !

Willy

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 2/3] acpi: Add a docked sysfs file to the dock driver.

2006-12-14 Thread Len Brown

On Thursday 14 December 2006 02:16, Holger Macht wrote:
> On Mon 11. Dec - 12:05:08, Kristen Carlson Accardi wrote:

> > Ok - how is this?
>
> Looks good to me, thanks!

> > Signed-off-by: Kristen Carlson Accardi <[EMAIL PROTECTED]>
>
> Signed-off-by: Holger Macht <[EMAIL PROTECTED]>

Applied.
thanks,
-Len

commit 8ea86e0ba7c9d16ae0f35cb0c4165194fa573f7a
Author: Kristen Carlson Accardi <[EMAIL PROTECTED]>
Date:   Mon Dec 11 12:05:08 2006 -0800

ACPI: dock: add uevent to indicate change in device status

Send a uevent to indicate a device change whenever we dock or
undock, so that userspace may now check the dock status via sysfs.

Signed-off-by: Kristen Carlson Accardi <[EMAIL PROTECTED]>
Signed-off-by: Holger Macht <[EMAIL PROTECTED]>
Signed-off-by: Len Brown <[EMAIL PROTECTED]>

diff --git a/drivers/acpi/dock.c b/drivers/acpi/dock.c
index 8c6828b..215f5b3 100644
--- a/drivers/acpi/dock.c
+++ b/drivers/acpi/dock.c
@@ -326,10 +326,12 @@ static void hotplug_dock_devices(struct dock_station 
*ds, u32 event)
 
 static void dock_event(struct dock_station *ds, u32 event, int num)
 {
+   struct device *dev = _device.dev;
/*
-* we don't do events until someone tells me that
-* they would like to have them.
+* Indicate that the status of the dock station has
+* changed.
 */
+   kobject_uevent(>kobj, KOBJ_CHANGE);
 }
 
 /**
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Asynchronous Crypto suppor for MPC8360E's Security Engine

2006-12-14 Thread n . balaji


Hi,
  I am working on MPC8360E Security Engine. I have ported the Openswan
2.4.5(IPSec --KLIPS) with OCF to MPC8360E's Security Engine (Talitos).
Encryption and Decryption is working. But when I check the performance of
Talitos with netio benchmark Tool, IPSec S/W Algorithms is giving more
bandwidth than Talitos.
  I do not know that why Talitos is giving less bandwidth and any probelm
in Openswan or OCF or Talitos driver or Talitos H/W. Please give your
suggestions and if you have any link related to Talitos, send to me.

  Linux kernel version is 2.6.11.

 I am not a member of the above mailing lists. Please send the mail to me.

-Thanks
 N.Balaji


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 4/6] SMP boot hook for paravirt

Add VMI SMP boot hook.  We emulate a regular boot sequence and use the
same APIC IPI initiation, we just poke magic values to load into the CPU
state when the startup IPI is received, rather than having to jump through a
real mode trampoline.

This is all that was needed to get SMP to work.

Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]>
Subject: SMP boot hook for paravirt

diff -r acfb7a15715f arch/i386/kernel/paravirt.c
--- a/arch/i386/kernel/paravirt.c   Thu Dec 14 16:22:03 2006 -0800
+++ b/arch/i386/kernel/paravirt.c   Thu Dec 14 16:51:48 2006 -0800
@@ -572,5 +572,7 @@ struct paravirt_ops paravirt_ops = {
 
.irq_enable_sysexit = native_irq_enable_sysexit,
.iret = native_iret,
+
+   .startup_ipi_hook = (void *)native_nop,
 };
 EXPORT_SYMBOL(paravirt_ops);
diff -r acfb7a15715f arch/i386/kernel/smpboot.c
--- a/arch/i386/kernel/smpboot.cThu Dec 14 16:22:03 2006 -0800
+++ b/arch/i386/kernel/smpboot.cThu Dec 14 16:51:52 2006 -0800
@@ -831,6 +831,13 @@ wakeup_secondary_cpu(int phys_apicid, un
num_starts = 0;
 
/*
+* Paravirt / VMI wants a startup IPI hook here to set up the
+* target processor state.
+*/
+   startup_ipi_hook(phys_apicid, (unsigned long) start_secondary,
+(unsigned long) stack_start.esp);
+
+   /*
 * Run STARTUP IPI loop.
 */
Dprintk("#startup loops: %d.\n", num_starts);
diff -r acfb7a15715f include/asm-i386/paravirt.h
--- a/include/asm-i386/paravirt.h   Thu Dec 14 16:22:03 2006 -0800
+++ b/include/asm-i386/paravirt.h   Thu Dec 14 16:51:48 2006 -0800
@@ -151,6 +151,8 @@ struct paravirt_ops
/* These two are jmp to, not actually called. */
void (fastcall *irq_enable_sysexit)(void);
void (fastcall *iret)(void);
+
+   void (fastcall *startup_ipi_hook)(int phys_apicid, unsigned long 
start_eip, unsigned long start_esp);
 };
 
 /* Mark a paravirt probe function. */
@@ -323,6 +325,13 @@ static inline unsigned long apic_read(un
 }
 #endif
 
+#ifdef CONFIG_SMP
+static inline void startup_ipi_hook(int phys_apicid, unsigned long start_eip,
+   unsigned long start_esp)
+{
+   return paravirt_ops.startup_ipi_hook(phys_apicid, start_eip, start_esp);
+}
+#endif
 
 #define __flush_tlb() paravirt_ops.flush_tlb_user()
 #define __flush_tlb_global() paravirt_ops.flush_tlb_kernel()
diff -r acfb7a15715f include/asm-i386/smp.h
--- a/include/asm-i386/smp.hThu Dec 14 16:22:03 2006 -0800
+++ b/include/asm-i386/smp.hThu Dec 14 16:52:21 2006 -0800
@@ -52,6 +52,11 @@ extern void cpu_uninit(void);
 extern void cpu_uninit(void);
 #endif
 
+#ifndef CONFIG_PARAVIRT
+#define startup_ipi_hook(phys_apicid, start_eip, start_esp)\
+do { } while (0)
+#endif
+
 /*
  * This function is needed by all SMP systems. It must _always_ be valid
  * from the initial startup. We map APIC_BASE very early in page_setup(),
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 6/6] VMI timer patches

VMI timer code.  It works by taking over the local APIC clock when APIC is
configured, which requires a couple hooks into the APIC code.  The backend
timer code could be commonized into the timer infrastructure, but there are
some pieces missing (stolen time, in particular), and the exact semantics
of when to do accounting for NO_IDLE need to be shared between different
hypervisors as well.  So for now, VMI timer is a separate module.

Subject: VMI timer patches
Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]>

diff -r 77e4058e936b arch/i386/Kconfig
--- a/arch/i386/Kconfig Thu Dec 14 16:40:14 2006 -0800
+++ b/arch/i386/Kconfig Thu Dec 14 16:40:16 2006 -0800
@@ -1227,3 +1227,12 @@ config KTIME_SCALAR
 config KTIME_SCALAR
bool
default y
+
+config NO_IDLE_HZ
+   bool
+   depends on PARAVIRT
+   default y
+   help
+ Switches the regular HZ timer off when the system is going idle.
+ This helps a hypervisor detect that the Linux system is idle,
+ reducing the overhead of idle systems.
diff -r 77e4058e936b arch/i386/kernel/Makefile
--- a/arch/i386/kernel/Makefile Thu Dec 14 16:40:14 2006 -0800
+++ b/arch/i386/kernel/Makefile Thu Dec 14 16:40:16 2006 -0800
@@ -40,7 +40,7 @@ obj-$(CONFIG_HPET_TIMER)  += hpet.o
 obj-$(CONFIG_HPET_TIMER)   += hpet.o
 obj-$(CONFIG_K8_NB)+= k8.o
 
-obj-$(CONFIG_VMI)  += vmi.o
+obj-$(CONFIG_VMI)  += vmi.o vmitime.o
 
 # Make sure this is linked after any other paravirt_ops structs: see head.S
 obj-$(CONFIG_PARAVIRT) += paravirt.o
diff -r 77e4058e936b arch/i386/kernel/apic.c
--- a/arch/i386/kernel/apic.c   Thu Dec 14 16:40:14 2006 -0800
+++ b/arch/i386/kernel/apic.c   Thu Dec 14 16:40:16 2006 -0800
@@ -1395,7 +1395,7 @@ int __init APIC_init_uniprocessor (void)
if (!skip_ioapic_setup && nr_ioapics)
setup_IO_APIC();
 #endif
-   setup_boot_APIC_clock();
+   setup_boot_clock();
 
return 0;
 }
diff -r 77e4058e936b arch/i386/kernel/entry.S
--- a/arch/i386/kernel/entry.S  Thu Dec 14 16:40:14 2006 -0800
+++ b/arch/i386/kernel/entry.S  Thu Dec 14 16:40:16 2006 -0800
@@ -622,6 +622,11 @@ ENTRY(name)\
 /* The include is where all of the SMP etc. interrupts come from */
 #include "entry_arch.h"
 
+/* This alternate entry is needed because we hijack the apic LVTT */
+#if defined(CONFIG_VMI) && defined(CONFIG_X86_LOCAL_APIC)
+BUILD_INTERRUPT(apic_vmi_timer_interrupt,LOCAL_TIMER_VECTOR)
+#endif
+
 KPROBE_ENTRY(page_fault)
RING0_EC_FRAME
pushl $do_page_fault
diff -r 77e4058e936b arch/i386/kernel/paravirt.c
--- a/arch/i386/kernel/paravirt.c   Thu Dec 14 16:40:14 2006 -0800
+++ b/arch/i386/kernel/paravirt.c   Thu Dec 14 16:40:16 2006 -0800
@@ -544,6 +544,8 @@ struct paravirt_ops paravirt_ops = {
.apic_write = native_apic_write,
.apic_write_atomic = native_apic_write_atomic,
.apic_read = native_apic_read,
+   .setup_boot_clock = setup_boot_APIC_clock,
+   .setup_secondary_clock = setup_secondary_APIC_clock,
 #endif
.set_lazy_mode = (void *)native_nop,
 
diff -r 77e4058e936b arch/i386/kernel/smpboot.c
--- a/arch/i386/kernel/smpboot.cThu Dec 14 16:40:14 2006 -0800
+++ b/arch/i386/kernel/smpboot.cThu Dec 14 16:40:16 2006 -0800
@@ -556,7 +556,7 @@ static void __devinit start_secondary(vo
smp_callin();
while (!cpu_isset(smp_processor_id(), smp_commenced_mask))
rep_nop();
-   setup_secondary_APIC_clock();
+   setup_secondary_clock();
if (nmi_watchdog == NMI_IO_APIC) {
disable_8259A_irq(0);
enable_NMI_through_LVT0(NULL);
@@ -1330,7 +1330,7 @@ static void __init smp_boot_cpus(unsigne
 
smpboot_setup_io_apic();
 
-   setup_boot_APIC_clock();
+   setup_boot_clock();
 
/*
 * Synchronize the TSC with the AP
diff -r 77e4058e936b arch/i386/kernel/time.c
--- a/arch/i386/kernel/time.c   Thu Dec 14 16:40:14 2006 -0800
+++ b/arch/i386/kernel/time.c   Thu Dec 14 16:40:16 2006 -0800
@@ -232,6 +232,7 @@ static void sync_cmos_clock(unsigned lon
 static void sync_cmos_clock(unsigned long dummy);
 
 static DEFINE_TIMER(sync_cmos_timer, sync_cmos_clock, 0, 0);
+int no_sync_cmos_clock;
 
 static void sync_cmos_clock(unsigned long dummy)
 {
@@ -275,7 +276,8 @@ static void sync_cmos_clock(unsigned lon
 
 void notify_arch_cmos_timer(void)
 {
-   mod_timer(_cmos_timer, jiffies + 1);
+   if (!no_sync_cmos_clock)
+   mod_timer(_cmos_timer, jiffies + 1);
 }
 
 static long clock_cmos_diff;
diff -r 77e4058e936b arch/i386/kernel/tsc.c
--- a/arch/i386/kernel/tsc.cThu Dec 14 16:40:14 2006 -0800
+++ b/arch/i386/kernel/tsc.cThu Dec 14 16:40:16 2006 -0800
@@ -23,6 +23,7 @@
  * an extra value to store the TSC freq
  */
 unsigned int tsc_khz;
+unsigned long long (*custom_sched_clock)(void);
 
 int tsc_disable __cpuinitdata = 0;
 
@@

[PATCH 0/6] VMI paravirt-ops patches

These are the patches for the VMI backend to paravirt-ops.  Base
kernel where I tested them was 2.6.19-git20.

Basically, there are only a couple of hooks needed that were left
out of the initial paravirt-ops merge, and then the backend code
is a very straightforward implementation of the paravirt-ops
functions.

Andrew or Linus, please apply or shoot me nasty feedback that I
will promptly turn into marvelous looking code.  I've Cc'd Andi,
who originally was going to take up the patches, but seems to
have been snowed in.

Zach
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 1/6] Page allocation hooks for VMI backend

The VMI backend uses explicit page type notification to track shadow
page tables.  The allocation of page table roots is especially tricky.
We need to clone the root for non-PAE mode while it is protected under
the pgd lock to correctly copy the shadow.

We don't need to allocate pgds in PAE mode, (PDPs in Intel terminology)
as they only have 4 entries, and are cached entirely by the processor,
which makes shadowing them rather simple.

For base page table level allocation, pmd_populate provides the exact hook
point we need.  Also, we need to allocate pages when splitting a large page,
and we must release pages before returning the page to any free pool.

Despite being required with these slightly odd semantics for VMI, Xen also 
uses these hooks to determine the exact moment when page tables are created
or released.

Subject: Page allocation hooks for VMI backend
Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]>

===
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -545,6 +545,12 @@ struct paravirt_ops paravirt_ops = {
.flush_tlb_kernel = native_flush_tlb_global,
.flush_tlb_single = native_flush_tlb_single,
 
+   .alloc_pt = (void *)native_nop,
+   .alloc_pd = (void *)native_nop,
+   .alloc_pd_clone = (void *)native_nop,
+   .release_pt = (void *)native_nop,
+   .release_pd = (void *)native_nop,
+
.set_pte = native_set_pte,
.set_pte_at = native_set_pte_at,
.set_pmd = native_set_pmd,
===
--- a/arch/i386/mm/init.c
+++ b/arch/i386/mm/init.c
@@ -62,6 +62,7 @@ static pmd_t * __init one_md_table_init(

 #ifdef CONFIG_X86_PAE
pmd_table = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
+   paravirt_alloc_pd(__pa(pmd_table) >> PAGE_SHIFT);
set_pgd(pgd, __pgd(__pa(pmd_table) | _PAGE_PRESENT));
pud = pud_offset(pgd, 0);
if (pmd_table != pmd_offset(pud, 0)) 
@@ -82,6 +83,7 @@ static pte_t * __init one_page_table_ini
 {
if (pmd_none(*pmd)) {
pte_t *page_table = (pte_t *) 
alloc_bootmem_low_pages(PAGE_SIZE);
+   paravirt_alloc_pt(__pa(page_table) >> PAGE_SHIFT);
set_pmd(pmd, __pmd(__pa(page_table) | _PAGE_TABLE));
if (page_table != pte_offset_kernel(pmd, 0))
BUG();  
@@ -347,6 +349,8 @@ static void __init pagetable_init (void)
/* Init entries of the first-level page table to the zero page */
for (i = 0; i < PTRS_PER_PGD; i++)
set_pgd(pgd_base + i, __pgd(__pa(empty_zero_page) | 
_PAGE_PRESENT));
+#else
+   paravirt_alloc_pd(__pa(swapper_pg_dir) >> PAGE_SHIFT);
 #endif
 
/* Enable PSE if available */
===
--- a/arch/i386/mm/pageattr.c
+++ b/arch/i386/mm/pageattr.c
@@ -60,6 +60,7 @@ static struct page *split_large_page(uns
address = __pa(address);
addr = address & LARGE_PAGE_MASK; 
pbase = (pte_t *)page_address(base);
+   paravirt_alloc_pt(page_to_pfn(base));
for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
set_pte([i], pfn_pte(addr >> PAGE_SHIFT,
   addr == address ? prot : ref_prot));
@@ -166,6 +167,7 @@ __change_page_attr(struct page *page, pg
if (!PageReserved(kpte_page)) {
if (cpu_has_pse && (page_private(kpte_page) == 0)) {
ClearPagePrivate(kpte_page);
+   paravirt_release_pt(page_to_pfn(kpte_page));
list_add(_page->lru, _list);
revert_page(kpte_page, address);
}
===
--- a/arch/i386/mm/pgtable.c
+++ b/arch/i386/mm/pgtable.c
@@ -245,8 +245,14 @@ void pgd_ctor(void *pgd, kmem_cache_t *c
clone_pgd_range((pgd_t *)pgd + USER_PTRS_PER_PGD,
swapper_pg_dir + USER_PTRS_PER_PGD,
KERNEL_PGD_PTRS);
+
if (PTRS_PER_PMD > 1)
return;
+
+   /* must happen under lock */
+   paravirt_alloc_pd_clone(__pa(pgd) >> PAGE_SHIFT,
+   __pa(swapper_pg_dir) >> PAGE_SHIFT,
+   USER_PTRS_PER_PGD, PTRS_PER_PGD - USER_PTRS_PER_PGD);
 
pgd_list_add(pgd);
spin_unlock_irqrestore(_lock, flags);
@@ -257,6 +263,7 @@ void pgd_dtor(void *pgd, kmem_cache_t *c
 {
unsigned long flags; /* can be called from interrupt context */
 
+   paravirt_release_pd(__pa(pgd) >> PAGE_SHIFT);
spin_lock_irqsave(_lock, flags);
pgd_list_del(pgd);
spin_unlock_irqrestore(_lock, flags);
@@ -274,13 +281,18 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
pmd_t *pmd = kmem_cache_alloc(pmd_cache, GFP_KERNEL);
if (!pmd)

[PATCH 5/6] VMI backend for paravirt-ops

Fairly straightforward implementation of VMI backend for paravirt-ops.

Subject: VMI backend for paravirt-ops
Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]>

diff -r d8711b11c1eb arch/i386/Kconfig
--- a/arch/i386/Kconfig Tue Dec 12 13:51:06 2006 -0800
+++ b/arch/i386/Kconfig Tue Dec 12 13:51:13 2006 -0800
@@ -192,6 +192,15 @@ config PARAVIRT
  under a hypervisor, improving performance significantly.
  However, when run without a hypervisor the kernel is
  theoretically slower.  If in doubt, say N.
+
+config VMI
+   bool "VMI Paravirt-ops support"
+   depends on PARAVIRT
+   default y
+   help
+ VMI provides a paravirtualized interface to multiple hypervisors
+ include VMware ESX server and Xen by connecting to a ROM module
+ provided by the hypervisor.
 
 config ACPI_SRAT
bool
diff -r d8711b11c1eb arch/i386/kernel/Makefile
--- a/arch/i386/kernel/Makefile Tue Dec 12 13:51:06 2006 -0800
+++ b/arch/i386/kernel/Makefile Tue Dec 12 13:51:13 2006 -0800
@@ -39,6 +39,8 @@ obj-$(CONFIG_EARLY_PRINTK)+= early_prin
 obj-$(CONFIG_EARLY_PRINTK) += early_printk.o
 obj-$(CONFIG_HPET_TIMER)   += hpet.o
 obj-$(CONFIG_K8_NB)+= k8.o
+
+obj-$(CONFIG_VMI)  += vmi.o
 
 # Make sure this is linked after any other paravirt_ops structs: see head.S
 obj-$(CONFIG_PARAVIRT) += paravirt.o
diff -r d8711b11c1eb arch/i386/kernel/head.S
--- a/arch/i386/kernel/head.S   Tue Dec 12 13:51:06 2006 -0800
+++ b/arch/i386/kernel/head.S   Tue Dec 12 13:51:13 2006 -0800
@@ -360,7 +360,7 @@ 1:  movb $1,X86_HARD_MATH
  * cpu_gdt_table and boot_pda; for secondary CPUs, these will be
  * that CPU's GDT and PDA.
  */
-setup_pda:
+ENTRY(setup_pda)
/* get the PDA pointer */
movl start_pda, %eax
 
diff -r d8711b11c1eb arch/i386/kernel/io_apic.c
--- a/arch/i386/kernel/io_apic.cTue Dec 12 13:51:06 2006 -0800
+++ b/arch/i386/kernel/io_apic.cTue Dec 12 13:51:13 2006 -0800
@@ -1914,7 +1914,7 @@ static void __init setup_ioapic_ids_from
 static void __init setup_ioapic_ids_from_mpc(void) { }
 #endif
 
-static int no_timer_check __initdata;
+int no_timer_check __initdata;
 
 static int __init notimercheck(char *s)
 {
diff -r d8711b11c1eb arch/i386/kernel/setup.c
--- a/arch/i386/kernel/setup.c  Tue Dec 12 13:51:06 2006 -0800
+++ b/arch/i386/kernel/setup.c  Tue Dec 12 13:51:13 2006 -0800
@@ -60,6 +60,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -581,6 +582,14 @@ void __init setup_arch(char **cmdline_p)
 
max_low_pfn = setup_memory();
 
+#ifdef CONFIG_VMI
+   /*
+* Must be after max_low_pfn is determined, and before kernel
+* pagetables are setup.
+*/
+   vmi_init();
+#endif
+
/*
 * NOTE: before this point _nobody_ is allowed to allocate
 * any memory using the bootmem allocator.  Although the
diff -r d8711b11c1eb arch/i386/kernel/smpboot.c
--- a/arch/i386/kernel/smpboot.cTue Dec 12 13:51:06 2006 -0800
+++ b/arch/i386/kernel/smpboot.cTue Dec 12 13:51:13 2006 -0800
@@ -63,6 +63,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* Set if we find a B stepping CPU */
 static int __devinitdata smp_b_stepping;
@@ -547,6 +548,9 @@ static void __devinit start_secondary(vo
 * booting is too fragile that we want to limit the
 * things done here to the most necessary things.
 */
+#ifdef CONFIG_VMI
+   vmi_bringup();
+#endif
secondary_cpu_init();
preempt_disable();
smp_callin();
diff -r d8711b11c1eb arch/i386/mm/pgtable.c
--- a/arch/i386/mm/pgtable.cTue Dec 12 13:51:06 2006 -0800
+++ b/arch/i386/mm/pgtable.cTue Dec 12 13:51:13 2006 -0800
@@ -171,6 +171,8 @@ void reserve_top_address(unsigned long r
 void reserve_top_address(unsigned long reserve)
 {
BUG_ON(fixmaps > 0);
+   printk(KERN_INFO "Reserving virtual address space above 0x%08x\n",
+  (int)-reserve);
 #ifdef CONFIG_COMPAT_VDSO
BUG_ON(reserve != 0);
 #else
diff -r d8711b11c1eb include/asm-i386/timer.h
--- a/include/asm-i386/timer.h  Tue Dec 12 13:51:06 2006 -0800
+++ b/include/asm-i386/timer.h  Tue Dec 12 13:51:13 2006 -0800
@@ -8,6 +8,7 @@ void setup_pit_timer(void);
 /* Modifiers for buggy PIT handling */
 extern int pit_latch_buggy;
 extern int timer_ack;
+extern int no_timer_check;
 extern int recalibrate_cpu_khz(void);
 
 #endif
diff -r d8711b11c1eb arch/i386/kernel/vmi.c
--- /dev/null   Thu Jan 01 00:00:00 1970 +
+++ b/arch/i386/kernel/vmi.cTue Dec 12 13:51:13 2006 -0800
@@ -0,0 +1,901 @@
+/*
+ * VMI specific paravirt-ops implementation
+ *
+ * Copyright (C) 2005, VMware, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is

[PATCH 2/6] Paravirt CPU hypercall batching mode

The VMI ROM has a mode where hypercalls can be queued and batched.  This turns
out to be a significant win during context switch, but must be done at a
specific point before side effects to CPU state are visible to subsequent
instructions.  This is similar to the MMU batching hooks already provided.
The same hooks could be used by the Xen backend to implement a context switch
multicall.

To explain a bit more about lazy modes in the paravirt patches, basically, the
idea is that only one of lazy CPU or MMU mode can be active at any given time.
Lazy MMU mode is similar to this lazy CPU mode, and allows for batching of
multiple PTE updates (say, inside a remap loop), but to avoid keeping some kind
of state machine about when to flush cpu or mmu updates, we just allow one or
the other to be active.  Although there is no real reason a more comprehensive
scheme could not be implemented, there is also no demonstrated need for this
extra complexity.

Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]>
Subject: Paravirt CPU hypercall batching mode

diff -r 01f2e46c1416 arch/i386/kernel/paravirt.c
--- a/arch/i386/kernel/paravirt.c   Thu Dec 14 14:26:24 2006 -0800
+++ b/arch/i386/kernel/paravirt.c   Thu Dec 14 14:44:56 2006 -0800
@@ -545,6 +545,7 @@ struct paravirt_ops paravirt_ops = {
.apic_write_atomic = native_apic_write_atomic,
.apic_read = native_apic_read,
 #endif
+   .set_lazy_mode = (void *)native_nop,
 
.flush_tlb_user = native_flush_tlb,
.flush_tlb_kernel = native_flush_tlb_global,
diff -r 01f2e46c1416 arch/i386/kernel/process.c
--- a/arch/i386/kernel/process.cThu Dec 14 14:26:24 2006 -0800
+++ b/arch/i386/kernel/process.cThu Dec 14 14:50:22 2006 -0800
@@ -665,6 +665,31 @@ struct task_struct fastcall * __switch_t
load_TLS(next, cpu);
 
/*
+* Now maybe handle debug registers and/or IO bitmaps
+*/
+   if (unlikely((task_thread_info(next_p)->flags & _TIF_WORK_CTXSW)
+   || test_tsk_thread_flag(prev_p, TIF_IO_BITMAP)))
+   __switch_to_xtra(next_p, tss);
+
+   disable_tsc(prev_p, next_p);
+
+   /*
+* Leave lazy mode, flushing any hypercalls made here.
+* This must be done before restoring TLS segments so
+* the GDT and LDT are properly updated, and must be
+* done before math_state_restore, so the TS bit is up
+* to date.
+*/
+   arch_leave_lazy_cpu_mode();
+
+   /* If the task has used fpu the last 5 timeslices, just do a full
+* restore of the math state immediately to avoid the trap; the
+* chances of needing FPU soon are obviously high now
+*/
+   if (next_p->fpu_counter > 5)
+   math_state_restore();
+
+   /*
 * Restore %fs if needed.
 *
 * Glibc normally makes %fs be zero.
@@ -673,22 +698,6 @@ struct task_struct fastcall * __switch_t
loadsegment(fs, next->fs);
 
write_pda(pcurrent, next_p);
-
-   /*
-* Now maybe handle debug registers and/or IO bitmaps
-*/
-   if (unlikely((task_thread_info(next_p)->flags & _TIF_WORK_CTXSW)
-   || test_tsk_thread_flag(prev_p, TIF_IO_BITMAP)))
-   __switch_to_xtra(next_p, tss);
-
-   disable_tsc(prev_p, next_p);
-
-   /* If the task has used fpu the last 5 timeslices, just do a full
-* restore of the math state immediately to avoid the trap; the
-* chances of needing FPU soon are obviously high now
-*/
-   if (next_p->fpu_counter > 5)
-   math_state_restore();
 
return prev_p;
 }
diff -r 01f2e46c1416 include/asm-generic/pgtable.h
--- a/include/asm-generic/pgtable.h Thu Dec 14 14:26:24 2006 -0800
+++ b/include/asm-generic/pgtable.h Thu Dec 14 14:44:56 2006 -0800
@@ -183,6 +183,19 @@ static inline void ptep_set_wrprotect(st
 #endif
 
 /*
+ * A facility to provide batching of the reload of page tables with the
+ * actual context switch code for paravirtualized guests.  By convention,
+ * only one of the lazy modes (CPU, MMU) should be active at any given
+ * time, entry should never be nested, and entry and exits should always
+ * be paired.  This is for sanity of maintaining and reasoning about the
+ * kernel code.
+ */
+#ifndef __HAVE_ARCH_ENTER_LAZY_CPU_MODE
+#define arch_enter_lazy_cpu_mode() do {} while (0)
+#define arch_leave_lazy_cpu_mode() do {} while (0)
+#endif
+
+/*
  * When walking page tables, get the address of the next boundary,
  * or the end address of the range if that comes earlier.  Although no
  * vma end wraps to 0, rounded up __boundary may wrap to 0 throughout.
diff -r 01f2e46c1416 include/asm-i386/paravirt.h
--- a/include/asm-i386/paravirt.h   Thu Dec 14 14:26:24 2006 -0800
+++ b/include/asm-i386/paravirt.h   Thu Dec 14 14:44:56 2006 -0800
@@ -146,6 +146,8 @@ struct paravirt_ops
void (fastcall *pmd_clear)(pmd_t *pmdp);
 #endif
 
+   void (fastcall

[PATCH 3/6] IOPL handling for paravirt guests