Re: [RFC 0/8] Cpuset aware writeback

2007-01-15 Thread Peter Zijlstra
On Mon, 2007-01-15 at 21:47 -0800, Christoph Lameter wrote:
> Currently cpusets are not able to do proper writeback since
> dirty ratio calculations and writeback are all done for the system
> as a whole. This may result in a large percentage of a cpuset
> to become dirty without writeout being triggered. Under NFS
> this can lead to OOM conditions.
> 
> Writeback will occur during the LRU scans. But such writeout
> is not effective since we write page by page and not in inode page
> order (regular writeback).
> 
> In order to fix the problem we first of all introduce a method to
> establish a map of nodes that contain dirty pages for each
> inode mapping.
> 
> Secondly we modify the dirty limit calculation to be based
> on the active cpuset.
> 
> If we are in a cpuset then we select only inodes for writeback
> that have pages on the nodes of the cpuset.
> 
> After we have the cpuset throttling in place we can then make
> further fixups:
> 
> A. We can do inode based writeout from direct reclaim
>    avoiding single page writes to the filesystem.
> 
> B. We add a new counter NR_UNRECLAIMABLE that is subtracted
>    from the available pages in a node. This allows us to
>    accurately calculate the dirty ratio even if large portions
>    of the node have been allocated for huge pages or for
>    slab pages.

What about mlock'ed pages?

> There are a couple of points where some better ideas could be used:
> 
> 1. The nodemask expands the inode structure significantly if the
> architecture allows a high number of nodes. This is only an issue
> for IA64. For that platform we expand the inode structure by 128 bytes
> (to support 1024 nodes). The last patch attempts to address the issue
> by using the knowledge about the maximum possible number of nodes
> determined on bootup to shrink the nodemask.

Not the prettiest indeed, no ideas though.

> 2. The calculation of the per cpuset limits can require looping
> over a number of nodes which may bring the performance of get_dirty_limits
> near pre 2.6.18 performance (before the introduction of the ZVC counters)
> (only for cpuset based limit calculation). There is no way of keeping these
> counters per cpuset since cpusets may overlap.

Well, you gain functionality, you lose some runtime, sad but probably
worth it.

Otherwise it all looks good.

Acked-by: Peter Zijlstra <[EMAIL PROTECTED]>
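
The per-cpuset limit discussed above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not code from the patch series: the struct, its fields, and the function names are hypothetical, and the cpuset is reduced to a plain bit mask of node numbers. It shows why a system-wide ratio can miss a cpuset whose nodes are mostly dirty, and how subtracting unreclaimable pages tightens the limit.

```c
#include <stddef.h>

/* Hypothetical per-node counters; names are illustrative only. */
struct node_stats {
    unsigned long total_pages;    /* all pages on the node */
    unsigned long unreclaimable;  /* e.g. huge pages or slab pages */
    unsigned long dirty_pages;    /* pages currently dirty */
};

/* Pages that could be dirtied on the nodes selected by mask. */
static unsigned long dirtyable_pages(const struct node_stats *nodes,
                                     size_t nr_nodes, unsigned long mask)
{
    unsigned long sum = 0;

    for (size_t i = 0; i < nr_nodes; i++)
        if (mask & (1UL << i))
            sum += nodes[i].total_pages - nodes[i].unreclaimable;
    return sum;
}

/* Dirty pages on the nodes selected by mask. */
static unsigned long dirty_on_nodes(const struct node_stats *nodes,
                                    size_t nr_nodes, unsigned long mask)
{
    unsigned long sum = 0;

    for (size_t i = 0; i < nr_nodes; i++)
        if (mask & (1UL << i))
            sum += nodes[i].dirty_pages;
    return sum;
}

/* Return 1 if the selected nodes exceed ratio_percent dirty pages. */
static int over_dirty_limit(const struct node_stats *nodes, size_t nr_nodes,
                            unsigned long mask, unsigned int ratio_percent)
{
    unsigned long limit =
        dirtyable_pages(nodes, nr_nodes, mask) * ratio_percent / 100;

    return dirty_on_nodes(nodes, nr_nodes, mask) > limit;
}
```

With four nodes of 1000 pages each and 500 dirty pages all on node 0, a 40% limit over the whole system (1600 pages) is not exceeded, while the same ratio applied to a cpuset containing only node 0 (400 pages) is.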

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


82571EB gigabit on e1000 in 2.6.20-rc5

2007-01-15 Thread Allen Parker
I have a PCI-E PRO/1000 MT Quad Port adapter, which works quite well 
under 2.6.19.2 but fails to see link under 2.6.20-rc5. Earlier today I 
reported this to [EMAIL PROTECTED], but thought I should get the 
word out in case someone else is testing this kernel on this NIC chipset.


Due to changes between 2.6.19.2 and 2.6.20, Intel driver 7.3.20 will not 
compile for 2.6.20, nor will the 2.6.19.2 in-tree driver.


Error output:
  CC [M]  drivers/net/e1000/e1000_main.o
drivers/net/e1000/e1000_main.c:1132:45: error: macro "INIT_WORK" passed 3 arguments, but takes just 2
drivers/net/e1000/e1000_main.c: In function 'e1000_probe':
drivers/net/e1000/e1000_main.c:1131: error: 'INIT_WORK' undeclared (first use in this function)
drivers/net/e1000/e1000_main.c:1131: error: (Each undeclared identifier is reported only once
drivers/net/e1000/e1000_main.c:1131: error: for each function it appears in.)

make[3]: *** [drivers/net/e1000/e1000_main.o] Error 1
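
The "passed 3 arguments, but takes just 2" error is consistent with the workqueue API change in 2.6.20, where INIT_WORK() lost its data argument and work handlers receive the struct work_struct pointer instead. Below is a minimal userspace sketch of that calling pattern. The types are stand-ins written for this example, and struct adapter with its fields is hypothetical; it is not the e1000 driver code.

```c
#include <stddef.h>

/* Userspace stand-ins for the kernel types, for illustration only. */
struct work_struct {
    void (*func)(struct work_struct *work);
};

/* Two-argument form: the work item and its handler. */
#define INIT_WORK(w, f) ((w)->func = (f))

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical driver state with an embedded work item. */
struct adapter {
    int resets;
    struct work_struct reset_task;
};

/* The handler gets the work item and recovers the enclosing structure. */
static void reset_task_fn(struct work_struct *work)
{
    struct adapter *adapter =
        container_of(work, struct adapter, reset_task);

    adapter->resets++;
}

static void adapter_init(struct adapter *adapter)
{
    adapter->resets = 0;
    /* No separate data pointer is passed any more. */
    INIT_WORK(&adapter->reset_task, reset_task_fn);
}

/* Stand-in for the workqueue core running a queued item. */
static void run_work(struct work_struct *work)
{
    work->func(work);
}
```

An out-of-tree driver written against the older three-argument form would need its handlers converted along these lines before it builds on 2.6.20.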

lspci -nn output (quad port):
09:00.0 Ethernet controller [0200]: Intel Corporation 82571EB Gigabit Ethernet Controller [8086:10a4] (rev 06)
09:00.1 Ethernet controller [0200]: Intel Corporation 82571EB Gigabit Ethernet Controller [8086:10a4] (rev 06)
0a:00.0 Ethernet controller [0200]: Intel Corporation 82571EB Gigabit Ethernet Controller [8086:10a4] (rev 06)
0a:00.1 Ethernet controller [0200]: Intel Corporation 82571EB Gigabit Ethernet Controller [8086:10a4] (rev 06)

lspci -nn output (dual port):
07:00.0 Ethernet controller [0200]: Intel Corporation 82571EB Gigabit Ethernet Controller [8086:105e] (rev 06)
07:00.1 Ethernet controller [0200]: Intel Corporation 82571EB Gigabit Ethernet Controller [8086:105e] (rev 06)


From what I've been able to gather, other Intel Pro/1000 chipsets work 
fine in 2.6.20-rc5. If the e1000 guys need any assistance testing, I'll 
be more than happy to volunteer myself as a guinea pig for patches.


Allen Parker


Re: [PATCH -mm 9/10][RFC] aio: usb gadget remove aio file ops

2007-01-15 Thread David Brownell
On Monday 15 January 2007 5:54 pm, Nate Diller wrote:
> This removes the aio implementation from the usb gadget file system. 

NAK.  I see a deep misunderstanding here.


> Aside 
> from making very creative (!) use of the aio retry path, it can't be of any
> use performance-wise 

Other than the basic win of letting one userspace thread keep an I/O
stream active while at the same time processing the data it reads or
writes??  That's the "async" part of AIO.

There's a not-so-little thing called "I/O overlap" ... which is the only
way to prevent wasting bandwidth between (non-cacheable) I/O requests,
and thus is the only way to let userspace code achieve anything close
to the maximum I/O bandwidth the hardware can achieve.
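
A back-of-the-envelope model of the overlap argument, under simplifying assumptions that are not from this thread: every request takes the same I/O time and the same processing time, and the next request is in flight while the previous one is being processed. The function names are illustrative.

```c
/* Blocking pattern: issue a request, wait for it, process it, repeat. */
static unsigned long serial_time(unsigned long n, unsigned long io,
                                 unsigned long cpu)
{
    return n * (io + cpu);
}

/*
 * Overlapped pattern: the first transfer and the last processing step
 * cannot be hidden; in between, each step costs the slower of the two.
 */
static unsigned long overlapped_time(unsigned long n, unsigned long io,
                                     unsigned long cpu)
{
    unsigned long slower = io > cpu ? io : cpu;

    if (n == 0)
        return 0;
    return io + (n - 1) * slower + cpu;
}
```

For 4 requests with 100 us of I/O and 60 us of processing each, the blocking pattern takes 640 us and the overlapped one 460 us. In the blocking case the I/O channel sits idle during every processing step.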

We want to see the host side "usbfs" evolve to support AIO like this
too, for the same reasons.  (Currently it has fairly ugly AIO code
that looks unlike any other AIO code in Linux.  Recent updates to
support a file-per-endpoint device model are a necessary precursor
to switching over to standard AIO syscalls.)


> because it always kmalloc()s a bounce buffer for the 
> *whole* I/O size.

By and large that's a negligible factor compared to being able to
achieve I/O overlap.  ISTR the reason for not doing fancy DMA magic
was that the cost of this style AIO was under 1 KByte object code
on ARM, which was easy to justify ... while DMA magic to do that
sort of stuff would be much fatter, as well as more error prone.

(And that's why the "creative" use of the retry path.  As I've
observed before, "retry" is a misnomer in the general sense of
an async I/O framework.  It's more of a semi-completion callback;
I/O can't in general be "retried" on error or fault, and even in
the current usage it's not really a "retry".)


Now that high speed peripheral hardware is becoming more common on
embedded Linuxes -- TI has DaVinci, OMAP 2430, TUSB6010 (as found
in the new Nokia 800 tablets); Atmel AVR32 AP7000; at least a couple
parts that should be able to use the same musb_hdrc driver as those
TI parts; and a few other chips I've heard of -- there may be some
virtue in eliminating the memcpy, since those CPUs don't have many
MIPS to waste.  (Iff the memcpy turns out to be a real issue...)


> Perhaps the only reason to keep it around is the ability 
> to cancel I/O requests, which only applies when using the user space async
> I/O interface.  

It's good to have almost the complete kernel API functionality
exposed to userspace, and having I/O cancelation is an inevitable
consequence of a complete AIO framework ... but that particular
issue was not a driving concern.


The reason for AIO is to have a *STANDARD* userspace interface
for *ASYNC I/O* which otherwise can't exist.  You know, the kind
of I/O interface that can't be implemented with read() and write()
syscalls, which for non-buffered I/O necessarily preclude all I/O
overlap.  AIO itself is a direct match to most I/O frameworks'
primitives.  (AIOCB being directly analogous to peripheral side
"struct usb_request" and host side "struct urb".)


You know, I've always thought that one reason the AIO discussions
seemed strange is that they weren't really focussed on I/O (the
lowlevel after-the-caches stuff) so much as filesystems (several
layers up in the stack, with intervening caching frameworks).

The first several implementations of AIO that I saw were restricted
to "real" I/O and not applicable to disk backed files.  So while I
was glad the Linux approach didn't make that mistake, it's seemed
that it might be wanting to make a converse mistake: neglecting I/O
that isn't aimed at data stored on disks.


> I highly doubt that is enough incentive to justify the extra 
> complexity here or in user-space, so I think it's a safe bet to remove this. 
> If that feature is still desired, it would be possible to implement a sync
> interface that does an interruptible sleep.

What's needed is an async, non-sleeping, interface ... with I/O
overlap.  That's antithetical to using read()/write() calls, so
your proposed approach couldn't possibly work.

- Dave




Re: [PATCH -mm 4/10][RFC] aio: convert aio_complete to file_endio_t

2007-01-15 Thread David Brownell
On Monday 15 January 2007 5:54 pm, Nate Diller wrote:
> --- a/drivers/usb/gadget/inode.c  2007-01-12 14:42:29.0 -0800
> +++ b/drivers/usb/gadget/inode.c  2007-01-12 14:25:34.0 -0800
> @@ -559,35 +559,32 @@ static int ep_aio_cancel(struct kiocb *i
>   return value;
>  }
>  
> -static ssize_t ep_aio_read_retry(struct kiocb *iocb)
> +static int ep_aio_read_retry(struct kiocb *iocb)
>  {
>   struct kiocb_priv   *priv = iocb->private;
> - ssize_t len, total;
> - int i;
> + ssize_t total;
> + int i, err = 0;
>  
>   /* we "retry" to get the right mm context for this: */
>  
>   /* copy stuff into user buffers */
>   total = priv->actual;
> - len = 0;
>   for (i=0; i < priv->nr_segs; i++) {
>   ssize_t this = min((ssize_t)(priv->iv[i].iov_len), total);
>  
>   if (copy_to_user(priv->iv[i].iov_base, priv->buf, this)) {
> - if (len == 0)
> - len = -EFAULT;
> + err = -EFAULT;

Discarding the capability to report partial success, e.g. that the first N
bytes were properly transferred?  I don't see any virtue in that change.
Quite the opposite in fact.

I think you're also expecting that if N bytes were requested, that's always
how many will be received.  That's not true for packetized I/O such as USB
isochronous transfers ... where it's quite legit (and in some cases routine)
for the other end to send packets that are shorter than the maximum allowed.
Sending a zero length packet is not the same as sending no packet at all,
for another example.
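
The partial-success behaviour at issue can be sketched in userspace. Everything here is an assumption made for illustration: struct seg, the faulty flag, and mock_copy() stand in for the iovec array and copy_to_user(), and the loop is not the gadgetfs code. The point is the return convention: report the bytes already delivered, and return -EFAULT only when nothing was copied.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical destination segment; faulty simulates a bad user address. */
struct seg {
    char *base;
    size_t len;
    int faulty;
};

/* Stand-in for copy_to_user(): nonzero means the copy faulted. */
static int mock_copy(char *dst, const char *src, size_t n, int faulty)
{
    if (faulty)
        return -1;
    memcpy(dst, src, n);
    return 0;
}

/*
 * Scatter 'total' received bytes into the segments. 'total' may be
 * smaller than the sum of the segment lengths (a short packet).
 */
static ssize_t scatter_copy(const struct seg *segs, int nr,
                            const char *buf, size_t total)
{
    ssize_t done = 0;

    for (int i = 0; i < nr && total > 0; i++) {
        size_t chunk = segs[i].len < total ? segs[i].len : total;

        if (mock_copy(segs[i].base, buf, chunk, segs[i].faulty)) {
            if (done == 0)
                done = -EFAULT;  /* nothing delivered at all */
            break;               /* otherwise report the partial count */
        }
        buf += chunk;
        total -= chunk;
        done += (ssize_t)chunk;
    }
    return done;
}
```

A caller that sees a positive count smaller than the request can tell a short or partially delivered transfer apart from a complete failure.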


Re: Some kind of 2.6.19 NFS regression

2007-01-15 Thread Trond Myklebust
On Mon, 2007-01-15 at 18:26 -0500, Daniel Drake wrote:
> Hi,
> 
> Tim Ryan has reported the following bug at the Gentoo bugzilla:
> 
> https://bugs.gentoo.org/show_bug.cgi?id=162199
> 
> His home dir is mounted over NFS. 2.6.18 worked OK but 2.6.19 is very 
> slow to load the desktop environment. NFS is suspected here as the 
> problem does not exist for users with local homedirs. This might not be 
> a straightforward performance issue as it does seem to perform OK on the 
> console.
> 
> The bug still exists in unpatched 2.6.20-rc5.
> 
> Is this a known issue? Should we report a new bug on the kernel bugzilla?
> 
> Thanks,
> Daniel

I couldn't find any information whatsoever in that bug report as to what
mount options he is using, or what server export options are in use. No
info either about what networking hardware he is using (or what drivers
are in use).

I'd also recommend using something like ttcp to see if large packets
(NFS read/write packets are typically ~ 32k large) are being transmitted
efficiently.

Cheers
  Trond


[PATCH] slip: Replace kmalloc() + memset() pairs with the appropriate kzalloc() calls

2007-01-15 Thread joe jin
This patch replaces kmalloc() + memset() pairs with the appropriate
kzalloc().

Signed-off-by: Joe Jin <[EMAIL PROTECTED]>

--- drivers/net/slip.c.orig 2007-01-16 14:21:52.0 +0800
+++ drivers/net/slip.c  2007-01-16 14:23:07.0 +0800
@@ -1343,15 +1343,12 @@
    printk(KERN_INFO "SLIP linefill/keepalive option.\n");
 #endif
 
-   slip_devs = kmalloc(sizeof(struct net_device *)*slip_maxdev, GFP_KERNEL);
+   slip_devs = kzalloc(sizeof(struct net_device *)*slip_maxdev, GFP_KERNEL);
    if (!slip_devs) {
        printk(KERN_ERR "SLIP: Can't allocate slip devices array!  Uaargh! (-> No SLIP available)\n");
        return -ENOMEM;
    }
 
-   /* Clear the pointer array, we allocate devices when we need them */
-   memset(slip_devs, 0, sizeof(struct net_device *)*slip_maxdev);
-
    /* Fill in our line protocol discipline, and register it */
    if ((status = tty_register_ldisc(N_SLIP, &sl_ldisc)) != 0)  {
        printk(KERN_ERR "SLIP: can't register line discipline (err = %d)\n", status);
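
The same transformation has a direct userspace analogue, shown here as a sketch with malloc()/calloc() standing in for kmalloc()/kzalloc(). The function names are made up for this example, and struct net_device is left opaque.

```c
#include <stdlib.h>
#include <string.h>

struct net_device;  /* opaque here: only pointers to it are stored */

/* Before: allocate, then clear the array in a second step. */
static struct net_device **alloc_devs_two_step(size_t maxdev)
{
    struct net_device **devs = malloc(sizeof(*devs) * maxdev);

    if (devs)
        memset(devs, 0, sizeof(*devs) * maxdev);
    return devs;
}

/* After: a single call that returns zero-filled memory. */
static struct net_device **alloc_devs_zeroed(size_t maxdev)
{
    return calloc(maxdev, sizeof(struct net_device *));
}
```

Both variants hand back an array of null pointers, but the second cannot forget the clearing step. For arrays, the kernel also offers kcalloc(), which takes the element count and element size separately, much like calloc().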





[RFC][PATCH 5/6] per namespace tunables

2007-01-15 Thread Nadia . Derbey
[PATCH 05/06]


This patch introduces all that is needed to process per namespace tunables.


Signed-off-by: Nadia Derbey <[EMAIL PROTECTED]>


---
 include/linux/akt.h   |   12 +++
 kernel/autotune/akt.c |   80 ++
 2 files changed, 73 insertions(+), 19 deletions(-)

Index: linux-2.6.20-rc4/include/linux/akt.h
===
--- linux-2.6.20-rc4.orig/include/linux/akt.h   2007-01-15 15:21:47.0 +0100
+++ linux-2.6.20-rc4/include/linux/akt.h   2007-01-15 15:31:44.0 +0100
@@ -154,6 +154,7 @@ struct auto_tune {
  */
 #define AUTO_TUNE_ENABLE  0x01
 #define TUNABLE_REGISTERED  0x02
+#define TUNABLE_IPC_NS  0x04
 
 
 /*
@@ -204,6 +205,8 @@ static inline int is_tunable_registered(
}
 
 
+#define DECLARE_TUNABLE(s) struct auto_tune s;
+
 #define DEFINE_TUNABLE(s, thr, min, max, tun, chk, type)   \
struct auto_tune s = TUNABLE_INIT(#s, thr, min, max, tun, chk, type)
 
@@ -215,6 +218,13 @@ static inline int is_tunable_registered(
(s).max.abs_value.val_##type = _max;\
} while (0)
 
+#define init_tunable_ipcns(ns, s, thr, min, max, tun, chk, type)   \
+   do {\
+   DEFINE_TUNABLE(s, thr, min, max, tun, chk, type);   \
+   s.flags |= TUNABLE_IPC_NS;  \
+   ns->s = s;  \
+   } while (0)
+
 
 static inline void set_autotuning_routine(struct auto_tune *tunable,
auto_tune_fn fn)
@@ -269,7 +279,9 @@ extern ssize_t store_tunable_max(struct 
 #else  /* CONFIG_AKT */
 
 
+#define DECLARE_TUNABLE(s)
 #define DEFINE_TUNABLE(s, thresh, min, max, tun, chk, type)
+#define init_tunable_ipcns(ns, s, th, m, M, tun, chk, type)  do { } while (0)
 #define set_tunable_min_max(s, min, max, type)   do { } while (0)
 #define set_autotuning_routine(s, fn)do { } while (0)
 
Index: linux-2.6.20-rc4/kernel/autotune/akt.c
===
--- linux-2.6.20-rc4.orig/kernel/autotune/akt.c 2007-01-15 15:25:35.0 +0100
+++ linux-2.6.20-rc4/kernel/autotune/akt.c  2007-01-15 15:37:16.0 +0100
@@ -32,6 +32,7 @@
  *  store_tunable_min  (exported)
  *  show_tunable_max   (exported)
  *  store_tunable_max  (exported)
+ *  get_ns_tunable (static)
  */
 
 #include 
@@ -45,6 +46,8 @@
 #define AKT_AUTO   1
 #define AKT_MANUAL 0
 
+static struct auto_tune *get_ns_tunable(struct auto_tune *);
+
 
 
 /*
@@ -142,6 +145,7 @@ int unregister_tunable(struct auto_tune 
 ssize_t show_tuning_mode(struct auto_tune *tun_addr, char *buf)
 {
int valid;
+   struct auto_tune *which;
 
if (tun_addr == NULL) {
printk(KERN_ERR
@@ -149,11 +153,13 @@ ssize_t show_tuning_mode(struct auto_tun
return -EINVAL;
}
 
-   spin_lock(&tun_addr->tunable_lck);
+   which = get_ns_tunable(tun_addr);
+
+   spin_lock(&which->tunable_lck);
 
-   valid = is_auto_tune_enabled(tun_addr);
+   valid = is_auto_tune_enabled(which);
 
-   spin_unlock(&tun_addr->tunable_lck);
+   spin_unlock(&which->tunable_lck);
 
return snprintf(buf, PAGE_SIZE, "%d\n", valid);
 }
@@ -176,6 +182,7 @@ ssize_t store_tuning_mode(struct auto_tu
size_t count)
 {
int new_value;
+   struct auto_tune *which;
int rc;
 
if ((rc = sscanf(buffer, "%d", &new_value)) != 1)
@@ -190,18 +197,20 @@ ssize_t store_tuning_mode(struct auto_tu
return -EINVAL;
}
 
-   spin_lock(&tun_addr->tunable_lck);
+   which = get_ns_tunable(tun_addr);
+
+   spin_lock(&which->tunable_lck);
 
switch (new_value) {
case AKT_AUTO:
-   tun_addr->flags |= AUTO_TUNE_ENABLE;
+   which->flags |= AUTO_TUNE_ENABLE;
break;
case AKT_MANUAL:
-   tun_addr->flags &= ~AUTO_TUNE_ENABLE;
+   which->flags &= ~AUTO_TUNE_ENABLE;
break;
}
 
-   spin_unlock(&tun_addr->tunable_lck);
+   spin_unlock(&which->tunable_lck);
 
return strnlen(buffer, PAGE_SIZE);
 }
@@ -218,6 +227,7 @@ ssize_t store_tuning_mode(struct auto_tu
 ssize_t show_tunable_min(struct auto_tune *tun_addr, char *buf)
 {
ssize_t rc;
+   struct auto_tune *which;
 
if (tun_addr == NULL) {
printk(KERN_ERR
@@ -225,11 +235,13 @@ ssize_t show_tunable_min(struct auto_tun
return -EINVAL;
}
 
-   spin_lock(&tun_addr->tunable_lck);
+   which = get_ns_tunable(tun_addr);
 
-   rc = tun_addr->min.show(tun_addr, buf);
+   spin_lock(&which->tunable_lck);
 
-   spin_unlock(&tun_addr->tunable_lck);
+   rc = which->min.show(which, buf);
+

[RFC][PATCH 6/6] automatic tuning applied to some kernel components

2007-01-15 Thread Nadia . Derbey
[PATCH 06/06]


The following kernel components register a tunable structure and call the
auto-tuning routine:
  . file system
  . shared memory (per namespace)
  . semaphore (per namespace)
  . message queues (per namespace)


Signed-off-by: Nadia Derbey <[EMAIL PROTECTED]>


---
 fs/file_table.c |   81 
 include/linux/akt.h |1 
 include/linux/ipc.h |6 +++
 init/main.c |1 
 ipc/msg.c   |   19 
 ipc/sem.c   |   41 ++
 ipc/shm.c   |   74 ---
 7 files changed, 218 insertions(+), 5 deletions(-)

Index: linux-2.6.20-rc4/fs/file_table.c
===
--- linux-2.6.20-rc4.orig/fs/file_table.c   2007-01-15 13:08:14.0 +0100
+++ linux-2.6.20-rc4/fs/file_table.c   2007-01-15 15:44:39.0 +0100
@@ -21,6 +21,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 
@@ -34,6 +36,71 @@ __cacheline_aligned_in_smp DEFINE_SPINLO
 
 static struct percpu_counter nr_files __cacheline_aligned_in_smp;
 
+#ifdef CONFIG_AKT
+
+static int get_nr_files(void);
+
+/** automatic tuning **/
+#define FILPTHRESH 80  /* threshold = 80% */
+
+/*
+ * FUNCTION:This is the routine called to accomplish auto tuning for the
+ *  max_files tunable.
+ *
+ *  Upwards adjustment:
+ *  Adjustment is needed if nr_files has reached
+ *  (threshold / 100 * max_files)
+ *  In that case, max_files is set to
+ *  (tunable + max_files * (100 - threshold) / 100)
+ *
+ *  Downwards adjustment:
+ *   Adjustment is needed if nr_files has fallen under
+ *   (threshold / 100 * max_files previous value)
+ *   In that case max_files is set back to its previous value,
+ *   i.e. to (max_files * 100 / (200 - threshold))
+ *
+ * PARAMETERS:  cmd: controls the adjustment direction (up / down)
+ *  params: pointer to the registered tunable structure
+ *
+ * EXECUTION ENVIRONMENT: This routine should be called with the
+ *params->tunable_lck lock held
+ *
+ * RETURN VALUE: 1 if tunable has been adjusted
+ *   0 else
+ */
+static inline int maxfiles_auto_tuning(int cmd, struct auto_tune *params)
+{
+   int thr = params->threshold;
+   int min = params->min.value.val_int;
+   int max = params->max.value.val_int;
+   int tun = files_stat.max_files;
+
+   if (cmd == AKT_UP) {
+   if (get_nr_files() >= tun * thr / 100 && tun < max) {
+   int new = tun * (200 - thr) / 100;
+
+   files_stat.max_files = min(max, new);
+   return 1;
+   } else
+   return 0;
+   }
+
+   if (get_nr_files() < tun * thr / (200 - thr) && tun > min) {
+   int new = tun * 100 / (200 - thr);
+
+   files_stat.max_files = max(min, new);
+   return 1;
+   } else
+   return 0;
+}
+
+#endif /* CONFIG_AKT */
+
+/* The maximum value will be known later on */
+DEFINE_TUNABLE(maxfiles_akt, FILPTHRESH, 0, 0, &files_stat.max_files,
+   &nr_files, int);
+
+
 static inline void file_free_rcu(struct rcu_head *head)
 {
struct file *f =  container_of(head, struct file, f_u.fu_rcuhead);
@@ -44,6 +111,8 @@ static inline void file_free(struct file
 {
percpu_counter_dec(&nr_files);
call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
+
+   activate_auto_tuning(AKT_DOWN, &maxfiles_akt);
 }
 
 /*
@@ -91,6 +160,8 @@ struct file *get_empty_filp(void)
static int old_max;
struct file * f;
 
+   activate_auto_tuning(AKT_UP, &maxfiles_akt);
+
/*
 * Privileged users can go above max_files
 */
@@ -299,6 +370,16 @@ void __init files_init(unsigned long mem
files_stat.max_files = n; 
if (files_stat.max_files < NR_FILE)
files_stat.max_files = NR_FILE;
+
+   set_tunable_min_max(maxfiles_akt, n, n * 2, int);
+   set_autotuning_routine(&maxfiles_akt, maxfiles_auto_tuning);
+
files_defer_init();
percpu_counter_init(&nr_files, 0);
 } 
+
+void __init files_late_init(void)
+{
+   if (register_tunable(&maxfiles_akt))
+   printk(KERN_WARNING "Failed registering tunable file-max\n");
+}
Index: linux-2.6.20-rc4/include/linux/akt.h
===
--- linux-2.6.20-rc4.orig/include/linux/akt.h   2007-01-15 15:31:44.0 +0100
+++ linux-2.6.20-rc4/include/linux/akt.h   2007-01-15 15:45:29.0 +0100
@@ -295,5 +295,6 @@ static inline void init_auto_tuning(void
 #endif /* CONFIG_AKT */
 
 extern void fork_late_init(void);
+extern void files_late_init(void);
 
 #endif /* AKT_H */
Index: 
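
The adjustment arithmetic documented for maxfiles_auto_tuning() can be checked with a small userspace model. This is a sketch under assumptions: struct tune_state and auto_tune() are hypothetical names, the numeric values given to AKT_UP and AKT_DOWN are placeholders (their definitions do not appear in this excerpt), and the in-use count is passed in as a parameter instead of being read from a counter.

```c
/* Placeholder values; the real definitions are not shown in this excerpt. */
#define AKT_UP   0
#define AKT_DOWN 1

/* Hypothetical, reduced tunable state. */
struct tune_state {
    int threshold;  /* percent, e.g. 80 */
    int min;        /* lower bound for value */
    int max;        /* upper bound for value */
    int value;      /* current limit, e.g. max_files */
};

/* Returns 1 if the limit was adjusted, 0 otherwise. */
static int auto_tune(int cmd, struct tune_state *t, int in_use)
{
    int thr = t->threshold;
    int tun = t->value;

    if (cmd == AKT_UP) {
        /* Grow once usage reaches thr percent of the limit. */
        if (in_use >= tun * thr / 100 && tun < t->max) {
            int next = tun * (200 - thr) / 100;

            t->value = next < t->max ? next : t->max;
            return 1;
        }
        return 0;
    }

    /* Shrink once usage is below thr percent of the previous limit. */
    if (in_use < tun * thr / (200 - thr) && tun > t->min) {
        int next = tun * 100 / (200 - thr);

        t->value = next > t->min ? next : t->min;
        return 1;
    }
    return 0;
}
```

With a threshold of 80 and a limit of 1000, reaching 800 in use raises the limit to 1200; once usage falls below 800 again, the limit returns to 1000.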

[RFC][PATCH 4/6] min and max kobjects

2007-01-15 Thread Nadia . Derbey
[PATCH 04/06]


Introduces the kobjects associated to each tunable min and max value


Signed-off-by: Nadia Derbey <[EMAIL PROTECTED]>


---
 include/linux/akt.h |   30 
 include/linux/akt_ops.h |  311 
 kernel/autotune/akt.c   |  120 
 kernel/autotune/akt_sysfs.c |8 +
 4 files changed, 469 insertions(+)

Index: linux-2.6.20-rc4/include/linux/akt.h
===
--- linux-2.6.20-rc4.orig/include/linux/akt.h   2007-01-15 15:08:41.0 +0100
+++ linux-2.6.20-rc4/include/linux/akt.h   2007-01-15 15:21:47.0 +0100
@@ -62,6 +62,13 @@ struct tunable_kobject {
  * auto_tune structure.
  * These values are type dependent and are used as high / low boundaries when
  * tuning up or down.
+ * The show and store routines (that are type dependent too) are here for
+ * sysfs support (since the min and max can be updated through sysfs).
+ * The abs_value field is used to check that we are not:
+ *   . falling under the very 1st min value when updating the min value
+ * through sysfs
+ *   . going over the very 1st max value when updating the max value
+ * through sysfs
  * The type is known when the tunable is defined (see DEFINE_TUNABLE macro).
  */
 struct typed_value {
@@ -74,6 +81,17 @@ struct typed_value {
long   val_long;
ulong  val_ulong;
} value;
+   union {
+   short  val_short;
+   ushort val_ushort;
+   int    val_int;
+   uint   val_uint;
+   size_t val_size_t;
+   long   val_long;
+   ulong  val_ulong;
+   } abs_value;
+   ssize_t (*show)(struct auto_tune *, char *);
+   ssize_t (*store)(struct auto_tune *, const char *, size_t);
 };
 
 
@@ -170,9 +188,15 @@ static inline int is_tunable_registered(
.threshold  = (_thresh),\
.min= { \
.value  = { .val_##type = (_min), },\
+   .abs_value  = { .val_##type = (_min), },\
+   .show   = show_tunable_min_##type,  \
+   .store  = store_tunable_min_##type, \
},  \
.max= { \
.value  = { .val_##type = (_max), },\
+   .abs_value  = { .val_##type = (_max), },\
+   .show   = show_tunable_max_##type,  \
+   .store  = store_tunable_max_##type, \
},  \
.tun_kobj   = { .tun = NULL, }, \
.tunable= (_tun),   \
@@ -186,7 +210,9 @@ static inline int is_tunable_registered(
 #define set_tunable_min_max(s, _min, _max, type)   \
do {\
(s).min.value.val_##type = _min;\
+   (s).min.abs_value.val_##type = _min;\
(s).max.value.val_##type = _max;\
+   (s).max.abs_value.val_##type = _max;\
} while (0)
 
 
@@ -234,6 +260,10 @@ extern int unregister_tunable(struct aut
 extern int tunable_sysfs_setup(struct auto_tune *);
 extern ssize_t show_tuning_mode(struct auto_tune *, char *);
 extern ssize_t store_tuning_mode(struct auto_tune *, const char *, size_t);
+extern ssize_t show_tunable_min(struct auto_tune *, char *);
+extern ssize_t store_tunable_min(struct auto_tune *, const char *, size_t);
+extern ssize_t show_tunable_max(struct auto_tune *, char *);
+extern ssize_t store_tunable_max(struct auto_tune *, const char *, size_t);
 
 
 #else  /* CONFIG_AKT */
Index: linux-2.6.20-rc4/include/linux/akt_ops.h
===
--- linux-2.6.20-rc4.orig/include/linux/akt_ops.h   2007-01-15 14:28:16.0 +0100
+++ linux-2.6.20-rc4/include/linux/akt_ops.h   2007-01-15 15:22:53.0 +0100
@@ -182,5 +182,316 @@ static inline int default_auto_tuning_ul
 }
 
 
+/*
+ * member can be one of min / max
+ */
+#define __show_tunable_member(member, p, type, buf, format, y) \
+do {   \
+   type _xx = (type) p->member.value.val_##type;   \
+   \
+   y = snprintf(buf, PAGE_SIZE, format "\n", _xx); \
+} while (0)
+
+/*
+ * Show routines for the min and max tunables values
+ */
+static inline ssize_t show_tunable_min_short(struct auto_tune *p, char *buf)
+{
+   ssize_t _count;
+   __show_tunable_member(min, p, 
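
The patch text is cut off at this point. The per-type show routines rely on token pasting (val_##type) to select a union member; a reduced userspace sketch of that technique follows. It is an assumption-laden illustration, not the patch's code: union typed, DEFINE_SHOW, and the generated show_int()/show_long() are hypothetical names, and the buffer length is passed explicitly instead of using PAGE_SIZE.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, reduced typed value with two member types only. */
union typed {
    int val_int;
    long val_long;
};

/*
 * Generate one show routine per type. val_##type picks the union
 * member and show_##type names the generated function.
 */
#define DEFINE_SHOW(type, format)                                    \
static int show_##type(const union typed *v, char *buf, size_t len)  \
{                                                                    \
    return snprintf(buf, len, format "\n", v->val_##type);           \
}

DEFINE_SHOW(int, "%d")
DEFINE_SHOW(long, "%ld")
```

Each generated routine formats the matching member followed by a newline and returns the number of characters written, which is the shape a sysfs show callback expects.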

[RFC][PATCH 2/6] auto_tuning activation

2007-01-15 Thread Nadia . Derbey
[PATCH 02/06]

Introduces the auto-tuning activation routine

The auto-tuning routine is called by the fork kernel component


Signed-off-by: Nadia Derbey <[EMAIL PROTECTED]>


---
 include/linux/akt.h |   50 ++
 kernel/exit.c   |   11 +++
 kernel/fork.c   |2 ++
 3 files changed, 63 insertions(+)

Index: linux-2.6.20-rc4/include/linux/akt.h
===
--- linux-2.6.20-rc4.orig/include/linux/akt.h   2007-01-15 14:26:24.0 +0100
+++ linux-2.6.20-rc4/include/linux/akt.h   2007-01-15 15:00:31.0 +0100
@@ -118,12 +118,22 @@ struct auto_tune {
 /*
  * Flags for a registered tunable
  */
+#define AUTO_TUNE_ENABLE  0x01
 #define TUNABLE_REGISTERED  0x02
 
 
 /*
  * When calling this routine the tunable lock should be held
  */
+static inline int is_auto_tune_enabled(struct auto_tune *tunable)
+{
+   return (tunable->flags & AUTO_TUNE_ENABLE) == AUTO_TUNE_ENABLE;
+}
+
+
+/*
+ * When calling this routine the tunable lock should be held
+ */
 static inline int is_tunable_registered(struct auto_tune *tunable)
 {
return (tunable->flags & TUNABLE_REGISTERED) == TUNABLE_REGISTERED;
@@ -163,6 +173,44 @@ static inline int is_tunable_registered(
} while (0)
 
 
+static inline void set_autotuning_routine(struct auto_tune *tunable,
+   auto_tune_fn fn)
+{
+   if (fn != NULL)
+   tunable->auto_tune = fn;
+}
+
+
+/*
+ * direction may be one of:
+ *AKT_UP: adjust up (i.e. increase tunable value when needed)
+ *AKT_DOWN: adjust down (i.e. decrease tunable value when needed)
+ */
+static inline int activate_auto_tuning(int direction,
+   struct auto_tune *tunable)
+{
+   int ret = 0;
+
+   BUG_ON(direction != AKT_UP && direction != AKT_DOWN);
+
+   if (tunable == NULL)
+   return 0;
+
+   spin_lock(&tunable->tunable_lck);
+
+   if (!is_auto_tune_enabled(tunable) ||
+   !is_tunable_registered(tunable)) {
+   spin_unlock(&tunable->tunable_lck);
+   return 0;
+   }
+
+   ret = tunable->auto_tune(direction, tunable);
+
+   spin_unlock(&tunable->tunable_lck);
+   return ret;
+}
+
+
 
 extern int register_tunable(struct auto_tune *);
 extern int unregister_tunable(struct auto_tune *);
@@ -173,7 +221,9 @@ extern int unregister_tunable(struct aut
 
 #define DEFINE_TUNABLE(s, thresh, min, max, tun, chk, type)
 #define set_tunable_min_max(s, min, max, type)   do { } while (0)
+#define set_autotuning_routine(s, fn)do { } while (0)
 
+#define activate_auto_tuning(direction, tunable) ( { 0; } )
 
 #define register_tunable(a) 0
 #define unregister_tunable(a)   0
Index: linux-2.6.20-rc4/kernel/fork.c
===
--- linux-2.6.20-rc4.orig/kernel/fork.c 2007-01-15 14:36:48.0 +0100
+++ linux-2.6.20-rc4/kernel/fork.c  2007-01-15 14:57:28.0 +0100
@@ -995,6 +995,8 @@ static struct task_struct *copy_process(
if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM))
return ERR_PTR(-EINVAL);
 
+   activate_auto_tuning(AKT_UP, &max_threads_akt);
+
retval = security_task_create(clone_flags);
if (retval)
goto fork_out;
Index: linux-2.6.20-rc4/kernel/exit.c
===
--- linux-2.6.20-rc4.orig/kernel/exit.c 2007-01-15 13:08:15.0 +0100
+++ linux-2.6.20-rc4/kernel/exit.c  2007-01-15 14:58:23.0 +0100
@@ -42,12 +42,15 @@
 #include  /* for audit_free() */
 #include 
 #include 
+#include 
 
 #include 
 #include 
 #include 
 #include 
 
+extern struct auto_tune max_threads_akt;
+
 extern void sem_exit (void);
 
 static void exit_mm(struct task_struct * tsk);
@@ -172,6 +175,14 @@ repeat:
 
sched_exit(p);
write_unlock_irq(&tasklist_lock);
+
+   /*
+    * nr_threads has been decremented in __unhash_process: adjust
+    * max_threads down if needed
+    * We do it here to avoid calling activate_auto_tuning under lock
+    */
+   activate_auto_tuning(AKT_DOWN, &max_threads_akt);
+
proc_flush_task(p);
release_thread(p);
call_rcu(&p->rcu, delayed_put_task_struct);
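
The gating done by activate_auto_tuning() can be sketched in userspace as follows. The flag values are the ones defined in the patch; everything else is an assumption for illustration: the struct is reduced to what the example needs, the calls counter and count_calls() are hypothetical, and the tunable_lck locking of the real routine is left out.

```c
#include <stddef.h>

#define AUTO_TUNE_ENABLE   0x01
#define TUNABLE_REGISTERED 0x02

struct auto_tune;
typedef int (*auto_tune_fn)(int direction, struct auto_tune *tunable);

/* Reduced stand-in for the patch's struct auto_tune. */
struct auto_tune {
    unsigned int flags;
    auto_tune_fn auto_tune;
    int calls;  /* hypothetical counter, only for this sketch */
};

/* Run the adjustment routine only for enabled, registered tunables. */
static int activate(int direction, struct auto_tune *tunable)
{
    unsigned int need = AUTO_TUNE_ENABLE | TUNABLE_REGISTERED;

    if (tunable == NULL)
        return 0;
    /* The real routine holds tunable_lck around the check and the call. */
    if ((tunable->flags & need) != need)
        return 0;
    return tunable->auto_tune(direction, tunable);
}

/* Hypothetical adjustment routine that just records the call. */
static int count_calls(int direction, struct auto_tune *tunable)
{
    (void)direction;
    tunable->calls++;
    return 1;
}
```

A tunable that is registered but left in manual mode is skipped, so the hooks in the fork and exit paths do nothing until auto-tuning is switched on.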



[RFC][PATCH 3/6] tunables associated kobjects

2007-01-15 Thread Nadia . Derbey
[PATCH 03/06]


Introduces the kobjects associated to each tunable and the sysfs registration


Signed-off-by: Nadia Derbey <[EMAIL PROTECTED]>


---
 include/linux/akt.h |   25 -
 init/main.c |1 
 kernel/autotune/Makefile|2 
 kernel/autotune/akt.c   |   86 +
 kernel/autotune/akt_sysfs.c |  214 
 5 files changed, 324 insertions(+), 4 deletions(-)

Index: linux-2.6.20-rc4/include/linux/akt.h
===
--- linux-2.6.20-rc4.orig/include/linux/akt.h   2007-01-15 15:00:31.0 +0100
+++ linux-2.6.20-rc4/include/linux/akt.h   2007-01-15 15:08:41.0 +0100
@@ -48,6 +48,16 @@ typedef int (*auto_tune_fn)(int, struct 
 
 
 /*
+ * for sysfs support
+ */
+struct tunable_kobject {
+   struct kobject kobj;
+   struct auto_tune *tun;
+};
+
+
+
+/*
  * Structure used to describe the min / max values for a tunable inside the
  * auto_tune structure.
  * These values are type dependent and are used as high / low boundaries when
@@ -73,7 +83,12 @@ struct typed_value {
  * allocated for each registered tunable, and the associated kobject exported
  * via sysfs.
  *
- * The structure lock (tunable_lck) protects
+ * This structure may be accessed in 2 ways:
+ *   . directly from inside the kernel subsystem that uses it (during tunable
+ * automatic adjustment)
+ *   . from sysfs, while updating the kobject attributes
+ *
+ * In both cases, the structure lock (tunable_lck) is taken: it protects
  * against concurrent accesses to tunable and checked pointers
  *
  * A pointer to this structure is passed in to  the automatic adjustment
@@ -108,6 +123,7 @@ struct auto_tune {
/* and associated show / store routines) */
struct typed_value max; /* max value the tunable can ever reach */
/* and associated show / store routines) */
+   struct tunable_kobject tun_kobj; /* used for sysfs support */
void *tunable;  /* address of the tunable to adjust */
void *checked;  /* address of the variable that is controlled by */
/* the tunable. This is the calling subsystem's */
@@ -158,6 +174,7 @@ static inline int is_tunable_registered(
.max= { \
.value  = { .val_##type = (_max), },\
},  \
+   .tun_kobj   = { .tun = NULL, }, \
.tunable= (_tun),   \
.checked= (_chk),   \
}
@@ -211,9 +228,12 @@ static inline int activate_auto_tuning(i
 }
 
 
-
+extern void init_auto_tuning(void);
 extern int register_tunable(struct auto_tune *);
 extern int unregister_tunable(struct auto_tune *);
+extern int tunable_sysfs_setup(struct auto_tune *);
+extern ssize_t show_tuning_mode(struct auto_tune *, char *);
+extern ssize_t store_tuning_mode(struct auto_tune *, const char *, size_t);
 
 
 #else  /* CONFIG_AKT */
@@ -228,6 +248,7 @@ extern int unregister_tunable(struct aut
 #define register_tunable(a) 0
 #define unregister_tunable(a)   0
 
+static inline void init_auto_tuning(void)   { }
 
 #endif /* CONFIG_AKT */
 
Index: linux-2.6.20-rc4/init/main.c
===
--- linux-2.6.20-rc4.orig/init/main.c   2007-01-15 14:29:17.0 +0100
+++ linux-2.6.20-rc4/init/main.c2007-01-15 15:09:27.0 +0100
@@ -614,6 +614,7 @@ asmlinkage void __init start_kernel(void
signals_init();
/* rootfs populating might need page-writeback */
page_writeback_init();
+   init_auto_tuning();
fork_late_init();
 #ifdef CONFIG_PROC_FS
proc_root_init();
Index: linux-2.6.20-rc4/kernel/autotune/Makefile
===
--- linux-2.6.20-rc4.orig/kernel/autotune/Makefile  2007-01-15 
14:31:57.0 +0100
+++ linux-2.6.20-rc4/kernel/autotune/Makefile   2007-01-15 15:09:57.0 
+0100
@@ -2,6 +2,6 @@
 # Makefile for akt
 #
 
-obj-y := akt.o
+obj-y := akt.o akt_sysfs.o
 
 
Index: linux-2.6.20-rc4/kernel/autotune/akt.c
===
--- linux-2.6.20-rc4.orig/kernel/autotune/akt.c 2007-01-15 14:51:54.0 
+0100
+++ linux-2.6.20-rc4/kernel/autotune/akt.c  2007-01-15 15:13:31.0 
+0100
@@ -26,6 +26,8 @@
  *   FUNCTIONS:
  *  register_tunable   (exported)
  *  unregister_tunable (exported)
+ *  show_tuning_mode   (exported)
+ *  store_tuning_mode  (exported)
  */
 
 #include 
@@ -36,6 +38,8 @@
 
 
 
+#define AKT_AUTO   1
+#define 

[RFC][PATCH 1/6] Tunable structure and registration routines

2007-01-15 Thread Nadia . Derbey
[PATCH 01/06]

Defines the auto_tune structure: this is the structure that contains the
information needed by the adjustment routine for a given tunable.
Also defines the registration routines.

The fork kernel component defines a tunable structure for the threads-max
tunable and registers it.


Signed-off-by: Nadia Derbey <[EMAIL PROTECTED]>


---
 Documentation/00-INDEX  |2 
 Documentation/auto_tune.txt |  333 
 fs/Kconfig  |2 
 include/linux/akt.h |  186 
 include/linux/akt_ops.h |  186 
 init/main.c |2 
 kernel/Makefile |1 
 kernel/autotune/Kconfig |   30 +++
 kernel/autotune/Makefile|7 
 kernel/autotune/akt.c   |  123 
 kernel/fork.c   |   18 ++
 11 files changed, 890 insertions(+)

Index: linux-2.6.20-rc4/Documentation/00-INDEX
===
--- linux-2.6.20-rc4.orig/Documentation/00-INDEX2007-01-15 
13:08:13.0 +0100
+++ linux-2.6.20-rc4/Documentation/00-INDEX 2007-01-15 14:17:22.0 
+0100
@@ -52,6 +52,8 @@ applying-patches.txt
- description of various trees and how to apply their patches.
 arm/
- directory with info about Linux on the ARM architecture.
+auto_tune.txt
+   - info on the Automatic Kernel Tunables (AKT) feature.
 basic_profiling.txt
- basic instructions for those who wants to profile Linux kernel.
 binfmt_misc.txt
Index: linux-2.6.20-rc4/Documentation/auto_tune.txt
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.20-rc4/Documentation/auto_tune.txt2007-01-15 
14:19:18.0 +0100
@@ -0,0 +1,333 @@
+   Automatic Kernel Tunables
+=
+
+  Nadia Derbey ([EMAIL PROTECTED])
+
+
+
+This feature aims at making the kernel automatically change the tunables
+values as it sees resources running out.
+
+The AKT framework is made of 2 parts:
+
+1) Kernel part:
+Interfaces are provided to the kernel subsystems, to (un)register the
+tunables that might be automatically tuned in the future.
+
+Registering a tunable consists in the following steps:
+- a structure is declared and filled by the kernel subsystem for the
+registered tunable
+- that tunable structure is registered into sysfs
+
+Registration should be done during the kernel subsystem initialization step.
+
+Unregistering a tunable is the reverse operation. It should not be necessary
+for the kernel subsystems: it is only useful when unloading modules that would
+have registered a tunable during their loading step.
+
+The routines interfaces are the following:
+
+1.1) Declaring a tunable:
+
+A tunable structure should be declared and defined by the kernel subsystems as
+follows:
+
+DEFINE_TUNABLE(structure_name, threshold, min, max,
+   tunable_variable_ptr, checked_variable_ptr,
+   tunable_variable_type);
+
+Parameters:
+- structure_name: this is the name of the tunable structure
+
+- threshold: percentage to apply to the tunable value to detect if adjustment
+is needed
+
+- min: minimum value the tunable can ever reach (needed when adjusting down
+the tunable)
+
+- max: maximum value the tunable can ever reach (needed when adjusting up the
+tunable)
+
+- tunable_variable_ptr: address of the tunable that will be adjusted if
+needed.
+(ex: in kernel/fork.c it is max_threads's address)
+
+- checked_variable_ptr: address of the variable that is controlled by the
+tunable. This is the calling subsystem's object counter.
+(ex: in kernel/fork.c it is nr_threads's address: nr_threads should
+always remain < max_threads)
+
+- tunable_variable_type: this type is important since it helps choosing the
+appropriate automatic tuning routine.
+It can be one of short / ushort / int / uint / size_t / long / ulong
+
+The automatic tuning routine (i.e. the routine that should be called when
+automatic tuning is activated) is set to the default one:
+default_auto_tuning_<type>().
+<type> is chosen according to the tunable_variable_type parameter.
+All the previously listed parameters are useful to this routine.
+Refer to the description of the automatic adjustment routine to see how
+these parameters are actually used.
+
+Refer to "Updating the auto-tuning function pointer" to know how to set
+this routine to another one.
+
+
+1.2) Updating a tunable's characteristics
+
+1.2.1) Updating min / max values:
+
+Sometimes, when calling DEFINE_TUNABLE(), the min and max values are not
+exactly known, yet. In that case, the following routine should be called
+once these values are known:
+
+set_tunable_min_max(structure_name, new_min, new_max)
+
+Parameters:
+- structure_name: this is the name of the tunable structure
+
+- new_min: minimum value the tunable can 

[RFC][PATCH 0/6] Automatic kernel tunables (AKT)

2007-01-15 Thread Nadia . Derbey
This is a series of patches introducing a feature that makes the kernel
automatically adjust tunable values when it sees resources running out.

The AKT framework is made of 2 parts:

1) Kernel part:
Interfaces are provided to the kernel subsystems, to (un)register the
tunables that might be automatically tuned in the future.

Registering a tunable consists of the following steps:
- a structure is declared and filled by the kernel subsystem for the
registered tunable
- that tunable structure is registered into sysfs

Registration should be done during the kernel subsystem initialization step.


Another interface is provided to the kernel subsystems, to activate the
automatic tuning for a registered tunable. It can be called during resource
allocation to tune up, and during resource freeing to tune down the registered
tunable. The automatic tuning routine is called only if the tunable has
been enabled for automatic tuning in sysfs.
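
The tune-up and tune-down calls described above can be modelled in userspace C.
This is only a sketch under stated assumptions: the field names follow the
auto_tune description (threshold, min, max, tunable, checked), but struct
akt_model, akt_tune_up(), akt_tune_down(), the auto_enabled flag and the step
size are all hypothetical and are not taken from the patches.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, int-only stand-in for the series' struct auto_tune. */
struct akt_model {
	int threshold;      /* percent of *tunable at which adjustment starts */
	int min, max;       /* bounds the tunable may never cross */
	int *tunable;       /* value being adjusted, e.g. max_threads */
	const int *checked; /* object counter it limits, e.g. nr_threads */
	bool auto_enabled;  /* models the per-tunable sysfs switch */
};

/* Assumed step: the headroom fraction (100 - threshold)% of the tunable. */
static long akt_step(const struct akt_model *t)
{
	long step = (long)*t->tunable * (100 - t->threshold) / 100;
	return step > 0 ? step : 1;
}

/* Allocation path: returns 1 if the tunable was raised, 0 otherwise. */
int akt_tune_up(struct akt_model *t)
{
	assert(t->threshold > 0 && t->threshold <= 100);
	if (!t->auto_enabled)
		return 0;
	if ((long)*t->checked * 100 < (long)*t->tunable * t->threshold)
		return 0; /* usage still below the threshold */
	long next = *t->tunable + akt_step(t);
	if (next > t->max)
		next = t->max;
	int changed = next != *t->tunable;
	*t->tunable = (int)next;
	return changed;
}

/* Freeing path: returns 1 if the tunable was lowered, 0 otherwise. */
int akt_tune_down(struct akt_model *t)
{
	assert(t->threshold > 0 && t->threshold <= 100);
	if (!t->auto_enabled)
		return 0;
	long next = *t->tunable - akt_step(t);
	if (next < t->min)
		next = t->min;
	/* Do not shrink to a value that would immediately trigger tune-up. */
	if ((long)*t->checked * 100 >= next * t->threshold)
		return 0;
	int changed = next != *t->tunable;
	*t->tunable = (int)next;
	return changed;
}
```

With threshold 80, min 50 and max 150, a limit of 100 and 85 objects in use,
akt_tune_up() leaves the limit alone while auto_enabled is false and raises it
to 120 once the flag is set.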

2) User part:

AKT uses sysfs to let user space manage the tunables (mainly switching them
between automatic and manual mode).

AKT uses sysfs in the following way:
- a tunables subsystem (tunables_subsys) is declared and registered during AKT
initialization.
- registering a tunable is equivalent to registering the corresponding kobject
within that subsystem.
- each tunable kobject has 3 associated attributes, all read-write (i.e.
the show() and store() methods are provided for them):
. autotune: activates or deactivates automatic tuning for the tunable
. max: sets a new maximum value for the tunable
. min: sets a new minimum value for the tunable

Automatic tuning can only be activated from the user side:
- the directory /sys/tunables is created during the init phase.
- each time a tunable is registered by a kernel subsystem, a directory is
created for it under /sys/tunables.
- this directory contains one file for each tunable kobject attribute.
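
The resulting sysfs layout can be illustrated with a small userspace helper
that builds the path of one attribute file. Only /sys/tunables and the
attribute names autotune, min and max come from the description above. The
helper name akt_attr_path() and the directory name used in the example
(max_threads) are assumptions, since the cover letter does not say how the
per-tunable directory is named.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "/sys/tunables/<tunable>/<attr>" into buf.
 * Returns 0 on success, -1 for an unknown attribute or a too-small buffer.
 */
int akt_attr_path(char *buf, size_t len, const char *tunable, const char *attr)
{
	/* The three attributes named in the cover letter. */
	static const char *const known[] = { "autotune", "min", "max" };
	const size_t nr_known = sizeof(known) / sizeof(known[0]);
	size_t i;
	int n;

	assert(buf && tunable && attr);
	for (i = 0; i < nr_known; i++)
		if (strcmp(attr, known[i]) == 0)
			break;
	if (i == nr_known)
		return -1;

	n = snprintf(buf, len, "/sys/tunables/%s/%s", tunable, attr);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A caller would then open the returned path and read or write it like any other
sysfs attribute. The value format each attribute accepts is not specified here.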



These patches should be applied to 2.6.20-rc4, in the following order:

[PATCH 1/6]: tunables_registration.patch
[PATCH 2/6]: auto_tuning_activation.patch
[PATCH 3/6]: auto_tuning_kobjects.patch
[PATCH 4/6]: tunable_min_max_kobjects.patch
[PATCH 5/6]: per_namespace_tunables.patch
[PATCH 6/6]: auto_tune_applied.patch

--
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.20-rc5: known unfixed regressions

2007-01-15 Thread David Chinner
On Sat, Jan 13, 2007 at 08:11:25AM +0100, Adrian Bunk wrote:
> On Fri, Jan 12, 2007 at 02:27:48PM -0500, Linus Torvalds wrote:
> >...
> > A lot of developers (including me) will be gone next week for 
> > Linux.Conf.Au, so you have a week of rest and quiet to test this, and 
> > report any problems. 
> > 
> > Not that there will be any, right? You all behave now!
> >...
> 
> This still leaves the old regressions we have not yet fixed...
> 
> 
> This email lists some known regressions in 2.6.20-rc5 compared to 2.6.19.
> 
> 
> Subject: BUG: at mm/truncate.c:60 cancel_dirty_page()  (XFS)
> References : http://lkml.org/lkml/2007/1/5/308
> Submitter  : Sami Farin <[EMAIL PROTECTED]>
> Handled-By : David Chinner <[EMAIL PROTECTED]>
> Status : problem is being discussed

I'm at LCA and have been having laptop dramas, so the fix is being held up at
this point. I am trying to test a change right now that adds an optional unmap
to truncate_inode_pages_range. XFS needs, in some circumstances, to toss
out dirty pages (with dirty bufferheads) and hence requires truncate semantics
that are currently missing unmap calls.

Semi-untested patch attached below.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


 fs/xfs/linux-2.6/xfs_fs_subr.c |6 ++--
 include/linux/mm.h |2 +
 mm/truncate.c  |   60 -
 3 files changed, 60 insertions(+), 8 deletions(-)

Index: linux-2.6.19/fs/xfs/linux-2.6/xfs_fs_subr.c
===
--- linux-2.6.19.orig/fs/xfs/linux-2.6/xfs_fs_subr.c2006-10-03 
23:22:36.0 +1000
+++ linux-2.6.19/fs/xfs/linux-2.6/xfs_fs_subr.c 2007-01-17 01:24:51.771273750 
+1100
@@ -32,7 +32,8 @@ fs_tosspages(
struct inode *ip = vn_to_inode(vp);
 
if (VN_CACHED(vp))
-   truncate_inode_pages(ip->i_mapping, first);
+   truncate_unmap_inode_pages_range(ip->i_mapping,
+first, last, 1);
 }
 
 void
@@ -49,7 +50,8 @@ fs_flushinval_pages(
if (VN_TRUNC(vp))
VUNTRUNCATE(vp);
filemap_write_and_wait(ip->i_mapping);
-   truncate_inode_pages(ip->i_mapping, first);
+   truncate_unmap_inode_pages_range(ip->i_mapping,
+first, last, 1);
}
 }
 
Index: linux-2.6.19/include/linux/mm.h
===
--- linux-2.6.19.orig/include/linux/mm.h2007-01-17 01:21:16.01779 
+1100
+++ linux-2.6.19/include/linux/mm.h 2007-01-17 01:24:51.775274000 +1100
@@ -1058,6 +1058,8 @@ extern unsigned long page_unuse(struct p
 extern void truncate_inode_pages(struct address_space *, loff_t);
 extern void truncate_inode_pages_range(struct address_space *,
   loff_t lstart, loff_t lend);
+extern void truncate_unmap_inode_pages_range(struct address_space *,
+  loff_t lstart, loff_t lend, int unmap);
 
 /* generic vm_area_ops exported for stackable file systems */
 extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, int 
*);
Index: linux-2.6.19/mm/truncate.c
===
--- linux-2.6.19.orig/mm/truncate.c 2007-01-17 01:21:23.074231000 +1100
+++ linux-2.6.19/mm/truncate.c  2007-01-17 01:24:51.779274250 +1100
@@ -59,7 +59,7 @@ void cancel_dirty_page(struct page *page
 
WARN_ON(++warncount < 5);
}
-   
+
if (TestClearPageDirty(page)) {
struct address_space *mapping = page->mapping;
if (mapping && mapping_cap_account_dirty(mapping)) {
@@ -122,16 +122,34 @@ invalidate_complete_page(struct address_
return ret;
 }
 
+/*
+ * This is a helper for truncate_unmap_inode_page. Unmap the page we
+ * are passed. Page must be locked by the caller.
+ */
+static void
+unmap_single_page(struct address_space *mapping, struct page *page)
+{
+   BUG_ON(!PageLocked(page));
+   while (page_mapped(page)) {
+   unmap_mapping_range(mapping,
+   (loff_t)page->index << PAGE_CACHE_SHIFT,
+   PAGE_CACHE_SIZE, 0);
+   }
+}
+
 /**
- * truncate_inode_pages - truncate range of pages specified by start and
+ * truncate_unmap_inode_pages_range - truncate range of pages specified by
+ * start and end byte offsets and optionally unmap them first.
  * end byte offsets
  * @mapping: mapping to truncate
  * @lstart: offset from which to truncate
  * @lend: offset to which to truncate
+ * @unmap: unmap whole truncated pages if non-zero
  *
  * Truncate the page cache, removing the pages that are between
  * specified offsets (and zeroing out partial page
- * (if lstart is not page aligned)).
+ * (if lstart is not page aligned)). If specified, unmap the pages

[PATCH] Remove a number of "dead" config variables.

2007-01-15 Thread Robert P. J. Day

  Remove Kconfig entries (and some documentation) for apparently
"dead" config variables.

Signed-off-by: Robert P. J. Day <[EMAIL PROTECTED]>

---

  A script I threw together identified the following as apparently
useless config variables.  By "useless," I mean that they:

  1) aren't consulted by any Makefile
  2) aren't checked by any source or header file
  3) don't further select any Kconfig settings

etc.  In short, they don't seem to be able to affect the build in any
way.
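
The first two criteria amount to searching the tree for whole-token
occurrences of CONFIG_<symbol>. A minimal C sketch of that check on one text
buffer follows. The function name config_referenced() is hypothetical, this is
not the script mentioned above, and criterion 3 (Kconfig "select" lines, which
name symbols without the CONFIG_ prefix) is not covered.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static bool is_ident_char(char c)
{
	return isalnum((unsigned char)c) || c == '_';
}

/* True if text contains CONFIG_<symbol> as a complete identifier. */
bool config_referenced(const char *text, const char *symbol)
{
	char needle[128];
	const char *p;
	int n;

	assert(text && symbol);
	n = snprintf(needle, sizeof(needle), "CONFIG_%s", symbol);
	if (n < 0 || (size_t)n >= sizeof(needle))
		return true; /* cannot check; treat as "in use" to stay safe */

	for (p = text; (p = strstr(p, needle)) != NULL; p++) {
		bool starts_token = (p == text) || !is_ident_char(p[-1]);
		bool ends_token = !is_ident_char(p[n]);

		if (starts_token && ends_token)
			return true;
	}
	return false;
}
```

The token-boundary test matters because several of the listed names are
prefixes or extensions of live symbols: a search for ZISOFS_FS must not be
satisfied by CONFIG_ZISOFS.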

  The variables that are being removed:

USB_SERIAL_SAFE_PADDED
AEDSP16_MPU401
X86_XADD
PARIDE_PARPORT
AIC7XXX_PROBE_EISA_VL
AIC79XX_ENABLE_RD_STRM
SCSI_NCR53C8XX_PROFILE
53C700_IO_MAPPED
ZISOFS_FS
DLCI_COUNT
MOUSE_ATIXL
LCD_DEVICE

  The removal was compile tested based on "make allyesconfig".  If any
of these variables are still being used in some way, they are keeping
it very well hidden.

 Documentation/scsi/ncr53c8xx.txt |5 -
 arch/arm/configs/pnx4008_defconfig   |1 -
 arch/i386/Kconfig.cpu|5 -
 arch/um/config.release   |1 -
 drivers/block/paride/Kconfig |8 +---
 drivers/input/mouse/Kconfig  |6 --
 drivers/net/wan/Kconfig  |   11 ---
 drivers/scsi/Kconfig |   16 
 drivers/scsi/aic7xxx/Kconfig.aic79xx |   12 
 drivers/scsi/aic7xxx/Kconfig.aic7xxx |   10 --
 drivers/usb/serial/Kconfig   |4 
 drivers/video/backlight/Kconfig  |5 -
 fs/Kconfig   |6 --
 sound/oss/Kconfig|   12 
 14 files changed, 1 insertion(+), 101 deletions(-)

diff --git a/Documentation/scsi/ncr53c8xx.txt b/Documentation/scsi/ncr53c8xx.txt
index caf10b1..88ef88b 100644
--- a/Documentation/scsi/ncr53c8xx.txt
+++ b/Documentation/scsi/ncr53c8xx.txt
@@ -562,11 +562,6 @@ if only one has a flaw for some SCSI feature, you can 
disable the
 support by the driver of this feature at linux start-up and enable
 this feature after boot-up only for devices that support it safely.

-CONFIG_SCSI_NCR53C8XX_PROFILE_SUPPORT  (default answer: n)
-This option must be set for profiling information to be gathered
-and printed out through the proc file system. This features may
-impact performances.
-
 CONFIG_SCSI_NCR53C8XX_IOMAPPED   (default answer: n)
 Answer "y" if you suspect your mother board to not allow memory mapped I/O.
 May slow down performance a little.  This option is required by
diff --git a/arch/arm/configs/pnx4008_defconfig 
b/arch/arm/configs/pnx4008_defconfig
index b5e11aa..268b292 100644
--- a/arch/arm/configs/pnx4008_defconfig
+++ b/arch/arm/configs/pnx4008_defconfig
@@ -1395,7 +1395,6 @@ CONFIG_AUTOFS4_FS=m
 CONFIG_ISO9660_FS=m
 CONFIG_JOLIET=y
 CONFIG_ZISOFS=y
-CONFIG_ZISOFS_FS=m
 CONFIG_UDF_FS=m
 CONFIG_UDF_NLS=y

diff --git a/arch/i386/Kconfig.cpu b/arch/i386/Kconfig.cpu
index 2aecfba..b99c0e2 100644
--- a/arch/i386/Kconfig.cpu
+++ b/arch/i386/Kconfig.cpu
@@ -226,11 +226,6 @@ config X86_CMPXCHG
depends on !M386
default y

-config X86_XADD
-   bool
-   depends on !M386
-   default y
-
 config X86_L1_CACHE_SHIFT
int
default "7" if MPENTIUM4 || X86_GENERIC
diff --git a/arch/um/config.release b/arch/um/config.release
index fc68bcb..861b59b 100644
--- a/arch/um/config.release
+++ b/arch/um/config.release
@@ -253,7 +253,6 @@ CONFIG_LOCKD_V4=y
 # CONFIG_NCPFS_SMALLDOS is not set
 # CONFIG_NCPFS_NLS is not set
 # CONFIG_NCPFS_EXTRAS is not set
-# CONFIG_ZISOFS_FS is not set
 CONFIG_ZLIB_FS_INFLATE=m

 #
diff --git a/drivers/block/paride/Kconfig b/drivers/block/paride/Kconfig
index c0d2854..28cf308 100644
--- a/drivers/block/paride/Kconfig
+++ b/drivers/block/paride/Kconfig
@@ -2,14 +2,8 @@
 # PARIDE configuration
 #
 # PARIDE doesn't need PARPORT, but if PARPORT is configured as a module,
-# PARIDE must also be a module.  The bogus CONFIG_PARIDE_PARPORT option
-# controls the choices given to the user ...
+# PARIDE must also be a module.
 # PARIDE only supports PC style parports. Tough for USB or other parports...
-config PARIDE_PARPORT
-   tristate
-   depends on PARIDE!=n
-   default m if PARPORT_PC=m
-   default y if PARPORT_PC!=m

 comment "Parallel IDE high-level drivers"
depends on PARIDE
diff --git a/drivers/input/mouse/Kconfig b/drivers/input/mouse/Kconfig
index 35d998c..0befb49 100644
--- a/drivers/input/mouse/Kconfig
+++ b/drivers/input/mouse/Kconfig
@@ -60,12 +60,6 @@ config MOUSE_INPORT
  To compile this driver as a module, choose M here: the
  module will be called inport.

-config MOUSE_ATIXL
-   bool "ATI XL variant"
-   depends on MOUSE_INPORT
-   help
- Say Y here if your mouse is of the ATI XL variety.
-
 config MOUSE_LOGIBM
tristate "Logitech busmouse"
depends on ISA
diff --git a/drivers/net/wan/Kconfig b/drivers/net/wan/Kconfig
index 21f76f5..b550b51 100644

[RFC 4/8] Per cpuset dirty ratio handling and writeout

2007-01-15 Thread Christoph Lameter
Make page writeback obey cpuset constraints

Currently dirty throttling does not work properly in a cpuset.

If, for example, a cpuset contains only 1/10th of available memory then all of the
memory of a cpuset can be dirtied without any writes being triggered. If we
are writing to a device that is mounted via NFS then the write operation
may be terminated with OOM since NFS is not allowed to allocate more pages
for writeout.

If all of the cpuset's memory is dirty then only 10% of total memory is dirty.
The background writeback threshold is usually set at 10% and the synchronous
threshold at 40%. So we are still below the global limits while the dirty
ratio in the cpuset is 100%!
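
The arithmetic can be checked with a small userspace model. The sketch assumes
ten equally sized nodes and represents a node set as a plain bitmask. The
helper names sum_over_nodes() and dirty_percent() are made up for
illustration and are not kernel functions.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Sum a per-node counter over the nodes whose bit is set in mask. */
unsigned long sum_over_nodes(const unsigned long *per_node, size_t nr_nodes,
			     unsigned long mask)
{
	unsigned long sum = 0;
	size_t node;

	assert(nr_nodes <= sizeof(mask) * CHAR_BIT);
	for (node = 0; node < nr_nodes; node++)
		if (mask & (1UL << node))
			sum += per_node[node];
	return sum;
}

/* Dirty pages as a percentage of total pages, rounded down. */
unsigned long dirty_percent(unsigned long dirty, unsigned long total)
{
	assert(total > 0);
	return dirty * 100 / total;
}
```

With 1000 pages per node and every page of node 3 dirty, the ratio over all
ten nodes is 10% while the ratio over the one-node cpuset is 100%.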

This patch makes dirty writeout cpuset aware. When determining the
dirty limits in get_dirty_limits() we calculate values based on the
nodes that are reachable from the current process (that has been
dirtying the page). Then we can trigger writeout based on the
dirty ratio of the memory in the cpuset.

We trigger writeout in a cpuset-specific way. We go through the dirty
inodes and search for inodes that have dirty pages on the nodes of the
active cpuset. If an inode fulfills that requirement then we begin writeout
of the dirty pages of that inode.
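
The inode selection step can be sketched as a mask intersection. This is a
userspace model, not the patch code: struct inode_model, its fields and
select_inodes_for_writeout() are hypothetical, node sets are plain bitmasks,
and treating a NULL node set as "no restriction" is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct inode_model {
	unsigned long dirty_nodes; /* bit n set: dirty pages on node n */
	bool writeout_started;
};

/* Start writeout on every dirty inode that touches the cpuset's nodes.
 * Returns the number of inodes selected. */
size_t select_inodes_for_writeout(struct inode_model *inodes, size_t count,
				  const unsigned long *cpuset_nodes)
{
	size_t i, selected = 0;

	assert(inodes || count == 0);
	for (i = 0; i < count; i++) {
		if (!inodes[i].dirty_nodes)
			continue; /* nothing dirty */
		if (cpuset_nodes && !(inodes[i].dirty_nodes & *cpuset_nodes))
			continue; /* dirty, but only on nodes outside the cpuset */
		inodes[i].writeout_started = true;
		selected++;
	}
	return selected;
}
```

An inode with dirty pages only on node 0 is skipped for a cpuset covering
nodes 2 and 3, while an inode dirty on nodes 0 and 2 is picked up.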

Adding up all the counters for each node in a cpuset may seem to be quite
an expensive operation (in particular for large cpusets with hundreds of
nodes) compared to just accessing the global counters if we do not have
a cpuset. However, please remember that I only recently introduced
the global counters. Before 2.6.18 we did add up per processor
counters for each processor on each invocation of get_dirty_limits().
We now add per node information which I think is equal or less effort
since there are fewer nodes than processors.

Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.20-rc5/include/linux/writeback.h
===
--- linux-2.6.20-rc5.orig/include/linux/writeback.h 2007-01-15 
21:34:43.0 -0600
+++ linux-2.6.20-rc5/include/linux/writeback.h  2007-01-15 21:37:05.209897874 
-0600
@@ -59,11 +59,12 @@ struct writeback_control {
unsigned for_reclaim:1; /* Invoked from the page allocator */
unsigned for_writepages:1;  /* This is a writepages() call */
unsigned range_cyclic:1;/* range_start is cyclic */
+   nodemask_t *nodes;  /* Set of nodes of interest */
 };
 
 /*
  * fs/fs-writeback.c
- */
+ */
 void writeback_inodes(struct writeback_control *wbc);
 void wake_up_inode(struct inode *inode);
 int inode_wait(void *);
Index: linux-2.6.20-rc5/mm/page-writeback.c
===
--- linux-2.6.20-rc5.orig/mm/page-writeback.c   2007-01-15 21:34:43.0 
-0600
+++ linux-2.6.20-rc5/mm/page-writeback.c2007-01-15 21:35:28.013794159 
-0600
@@ -103,6 +103,14 @@ EXPORT_SYMBOL(laptop_mode);
 
 static void background_writeout(unsigned long _min_pages, nodemask_t *nodes);
 
+struct dirty_limits {
+   long thresh_background;
+   long thresh_dirty;
+   unsigned long nr_dirty;
+   unsigned long nr_unstable;
+   unsigned long nr_writeback;
+};
+
 /*
  * Work out the current dirty-memory clamping and background writeout
  * thresholds.
@@ -120,31 +128,74 @@ static void background_writeout(unsigned
  * We make sure that the background writeout level is below the adjusted
  * clamping level.
  */
-static void
-get_dirty_limits(long *pbackground, long *pdirty,
-   struct address_space *mapping)
+static int
+get_dirty_limits(struct dirty_limits *dl, struct address_space *mapping,
+   nodemask_t *nodes)
 {
int background_ratio;   /* Percentages */
int dirty_ratio;
int unmapped_ratio;
long background;
long dirty;
-   unsigned long available_memory = vm_total_pages;
+   unsigned long available_memory;
+   unsigned long high_memory;
+   unsigned long nr_mapped;
struct task_struct *tsk;
+   int is_subset = 0;
 
+#ifdef CONFIG_CPUSETS
+   /*
+* Calculate the limits relative to the current cpuset if necessary.
+*/
+   if (unlikely(nodes &&
+   !nodes_subset(node_online_map, *nodes))) {
+   int node;
+
+   is_subset = 1;
+   memset(dl, 0, sizeof(struct dirty_limits));
+   available_memory = 0;
+   high_memory = 0;
+   nr_mapped = 0;
+   for_each_node_mask(node, *nodes) {
+   if (!node_online(node))
+   continue;
+   dl->nr_dirty += node_page_state(node, NR_FILE_DIRTY);
+   dl->nr_unstable +=
+   node_page_state(node, NR_UNSTABLE_NFS);
+   dl->nr_writeback +=
+   node_page_state(node, 

[RFC 6/8] Throttle vm writeout per cpuset

2007-01-15 Thread Christoph Lameter
Throttle VM writeout in a cpuset aware way

This bases the vm throttling from the reclaim path on the dirty ratio
of the cpuset. Note that a cpuset is only effective if shrink_zone is called
from direct reclaim.

kswapd has a cpuset context that includes the whole machine and will
therefore not throttle unless global limits are reached.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.20-rc5/include/linux/writeback.h
===
--- linux-2.6.20-rc5.orig/include/linux/writeback.h 2007-01-15 
21:37:05.209897874 -0600
+++ linux-2.6.20-rc5/include/linux/writeback.h  2007-01-15 21:37:33.283671963 
-0600
@@ -85,7 +85,7 @@ static inline void wait_on_inode(struct 
 int wakeup_pdflush(long nr_pages, nodemask_t *nodes);
 void laptop_io_completion(void);
 void laptop_sync_completion(void);
-void throttle_vm_writeout(void);
+void throttle_vm_writeout(nodemask_t *);
 
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
Index: linux-2.6.20-rc5/mm/page-writeback.c
===
--- linux-2.6.20-rc5.orig/mm/page-writeback.c   2007-01-15 21:35:28.013794159 
-0600
+++ linux-2.6.20-rc5/mm/page-writeback.c2007-01-15 21:37:33.302228293 
-0600
@@ -349,12 +349,12 @@ void balance_dirty_pages_ratelimited_nr(
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
-void throttle_vm_writeout(void)
+void throttle_vm_writeout(nodemask_t *nodes)
 {
struct dirty_limits dl;
 
 for ( ; ; ) {
-   get_dirty_limits(&dl, NULL, &node_online_map);
+   get_dirty_limits(&dl, NULL, nodes);
 
 /*
  * Boost the allowable dirty threshold a bit for page
Index: linux-2.6.20-rc5/mm/vmscan.c
===
--- linux-2.6.20-rc5.orig/mm/vmscan.c   2007-01-15 21:37:26.605346439 -0600
+++ linux-2.6.20-rc5/mm/vmscan.c2007-01-15 21:37:33.316878027 -0600
@@ -949,7 +949,7 @@ static unsigned long shrink_zone(int pri
}
}
 
-   throttle_vm_writeout();
+   throttle_vm_writeout(&cpuset_current_mems_allowed);
 
atomic_dec(&zone->reclaim_in_progress);
return nr_reclaimed;


[RFC 7/8] Exclude unreclaimable pages from dirty ratio calculation

2007-01-15 Thread Christoph Lameter
Consider unreclaimable pages during dirty limit calculation

Tracking unreclaimable pages helps us to calculate the dirty ratio
the right way. If a large number of unreclaimable pages are allocated
(through the slab or through huge pages) then write throttling will
no longer work since the limit cannot be reached anymore.

So we simply subtract the number of unreclaimable pages from the pages
considered for writeout threshold calculation.
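
The effect of the subtraction can be shown with a simplified calculation. The
sketch below is a userspace illustration only: dirty_threshold() is a
hypothetical helper, and it leaves out the other adjustments get_dirty_limits()
makes (highmem, the unmapped ratio, per-task boosts).

```c
#include <assert.h>

/* Dirty threshold in pages after removing unreclaimable memory. */
unsigned long dirty_threshold(unsigned long present_pages,
			      unsigned long unreclaimable_pages,
			      unsigned int dirty_ratio)
{
	unsigned long available;

	assert(dirty_ratio <= 100);
	available = present_pages > unreclaimable_pages ?
		    present_pages - unreclaimable_pages : 0;
	return available * dirty_ratio / 100;
}
```

For a node with 1000 pages, 700 of them huge pages, and a 40% ratio, the
uncorrected threshold is 400 pages. Only 300 pages can ever be dirty, so that
limit is never hit. With the subtraction the threshold becomes 120 pages.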

Other code that allocates significant amounts of memory for device
drivers etc could also be modified to take advantage of this functionality.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.20-rc5/include/linux/mmzone.h
===
--- linux-2.6.20-rc5.orig/include/linux/mmzone.h2007-01-12 
12:54:26.0 -0600
+++ linux-2.6.20-rc5/include/linux/mmzone.h 2007-01-15 21:37:37.579950696 
-0600
@@ -53,6 +53,7 @@ enum zone_stat_item {
NR_FILE_PAGES,
NR_SLAB_RECLAIMABLE,
NR_SLAB_UNRECLAIMABLE,
+   NR_UNRECLAIMABLE,
NR_PAGETABLE,   /* used for pagetables */
NR_FILE_DIRTY,
NR_WRITEBACK,
Index: linux-2.6.20-rc5/fs/proc/proc_misc.c
===
--- linux-2.6.20-rc5.orig/fs/proc/proc_misc.c   2007-01-12 12:54:26.0 
-0600
+++ linux-2.6.20-rc5/fs/proc/proc_misc.c2007-01-15 21:37:37.641479580 
-0600
@@ -174,6 +174,7 @@ static int meminfo_read_proc(char *page,
"Slab: %8lu kB\n"
"SReclaimable: %8lu kB\n"
"SUnreclaim:   %8lu kB\n"
+   "Unreclaimabl: %8lu kB\n"
"PageTables:   %8lu kB\n"
"NFS_Unstable: %8lu kB\n"
"Bounce:   %8lu kB\n"
@@ -205,6 +206,7 @@ static int meminfo_read_proc(char *page,
global_page_state(NR_SLAB_UNRECLAIMABLE)),
K(global_page_state(NR_SLAB_RECLAIMABLE)),
K(global_page_state(NR_SLAB_UNRECLAIMABLE)),
+   K(global_page_state(NR_UNRECLAIMABLE)),
K(global_page_state(NR_PAGETABLE)),
K(global_page_state(NR_UNSTABLE_NFS)),
K(global_page_state(NR_BOUNCE)),
Index: linux-2.6.20-rc5/mm/hugetlb.c
===
--- linux-2.6.20-rc5.orig/mm/hugetlb.c  2007-01-12 12:54:26.0 -0600
+++ linux-2.6.20-rc5/mm/hugetlb.c   2007-01-15 21:37:37.664919155 -0600
@@ -115,6 +115,8 @@ static int alloc_fresh_huge_page(void)
nr_huge_pages_node[page_to_nid(page)]++;
spin_unlock(&hugetlb_lock);
put_page(page); /* free it into the hugepage allocator */
+   mod_zone_page_state(page_zone(page), NR_UNRECLAIMABLE,
+   HPAGE_SIZE / PAGE_SIZE);
return 1;
}
return 0;
@@ -183,6 +185,8 @@ static void update_and_free_page(struct 
1 << PG_dirty | 1 << PG_active | 1 << 
PG_reserved |
1 << PG_private | 1<< PG_writeback);
}
+   mod_zone_page_state(page_zone(page), NR_UNRECLAIMABLE,
+   - (HPAGE_SIZE / PAGE_SIZE));
page[1].lru.next = NULL;
set_page_refcounted(page);
__free_pages(page, HUGETLB_PAGE_ORDER);
Index: linux-2.6.20-rc5/mm/vmstat.c
===
--- linux-2.6.20-rc5.orig/mm/vmstat.c   2007-01-12 12:54:26.0 -0600
+++ linux-2.6.20-rc5/mm/vmstat.c2007-01-15 21:37:37.686405431 -0600
@@ -459,6 +459,7 @@ static const char * const vmstat_text[] 
"nr_file_pages",
"nr_slab_reclaimable",
"nr_slab_unreclaimable",
+   "nr_unreclaimable",
"nr_page_table_pages",
"nr_dirty",
"nr_writeback",
Index: linux-2.6.20-rc5/mm/page-writeback.c
===
--- linux-2.6.20-rc5.orig/mm/page-writeback.c   2007-01-15 21:37:33.302228293 
-0600
+++ linux-2.6.20-rc5/mm/page-writeback.c2007-01-15 21:37:37.697148570 
-0600
@@ -165,7 +165,9 @@ get_dirty_limits(struct dirty_limits *dl
dl->nr_writeback +=
node_page_state(node, NR_WRITEBACK);
available_memory +=
-   NODE_DATA(node)->node_present_pages;
+   NODE_DATA(node)->node_present_pages
+   - node_page_state(node, NR_UNRECLAIMABLE)
+   - node_page_state(node, NR_SLAB_UNRECLAIMABLE);
 #ifdef CONFIG_HIGHMEM
high_memory += NODE_DATA(node)
->node_zones[ZONE_HIGHMEM]->present_pages;
@@ -180,7 +182,9 @@ get_dirty_limits(struct dirty_limits *dl
dl->nr_dirty = 

[RFC 3/8] Add a nodemask to pdflush functions

2007-01-15 Thread Christoph Lameter
pdflush: Allow the passing of a nodemask parameter

If we want to support nodeset specific writeout then we need a way
to communicate the set of nodes that an operation should affect.

So add a nodemask_t parameter to the pdflush functions and also
store the nodemask in the pdflush control structure.
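
The shape of the change can be modelled in userspace as a work item that
carries the node set alongside the existing argument. Everything named
*_model below is hypothetical, as is record_writeout(). Whether the real
control structure stores a copy of the mask or a pointer is not visible here,
so the copy plus "has_nodes" flag is an assumption.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long nodemask_model_t; /* bit n = node n; stand-in for nodemask_t */
typedef void (*pdflush_fn)(unsigned long arg0, nodemask_model_t *nodes);

struct pdflush_work_model {
	pdflush_fn fn;          /* pending callback, NULL when idle */
	unsigned long arg0;
	nodemask_model_t nodes; /* copy of the caller's mask (assumed) */
	int has_nodes;          /* 0 means "no node restriction" */
};

/* Returns 0 if the work was queued, -1 if the worker is busy. */
int pdflush_operation_model(struct pdflush_work_model *w, pdflush_fn fn,
			    unsigned long arg0, const nodemask_model_t *nodes)
{
	assert(w && fn);
	if (w->fn)
		return -1;
	w->fn = fn;
	w->arg0 = arg0;
	w->has_nodes = nodes != NULL;
	w->nodes = nodes ? *nodes : 0;
	return 0;
}

/* Run the pending callback, handing it the stored mask (or NULL). */
void pdflush_run_model(struct pdflush_work_model *w)
{
	pdflush_fn fn = w->fn;

	if (!fn)
		return;
	w->fn = NULL;
	fn(w->arg0, w->has_nodes ? &w->nodes : NULL);
}

/* Example callback that records what it was called with. */
unsigned long seen_arg0;
nodemask_model_t seen_nodes;
int seen_has_nodes;

void record_writeout(unsigned long arg0, nodemask_model_t *nodes)
{
	seen_arg0 = arg0;
	seen_has_nodes = nodes != NULL;
	seen_nodes = nodes ? *nodes : 0;
}
```

Callers that do not care about nodes pass NULL, which mirrors how the
converted call sites in this patch pass NULL for the new parameter.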

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.20-rc5/include/linux/writeback.h
===
--- linux-2.6.20-rc5.orig/include/linux/writeback.h 2007-01-15 
21:34:38.564104333 -0600
+++ linux-2.6.20-rc5/include/linux/writeback.h  2007-01-15 21:34:43.135798088 
-0600
@@ -81,7 +81,7 @@ static inline void wait_on_inode(struct 
 /*
  * mm/page-writeback.c
  */
-int wakeup_pdflush(long nr_pages);
+int wakeup_pdflush(long nr_pages, nodemask_t *nodes);
 void laptop_io_completion(void);
 void laptop_sync_completion(void);
 void throttle_vm_writeout(void);
@@ -109,7 +109,8 @@ balance_dirty_pages_ratelimited(struct a
balance_dirty_pages_ratelimited_nr(mapping, 1);
 }
 
-int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0);
+int pdflush_operation(void (*fn)(unsigned long, nodemask_t *nodes),
+   unsigned long arg0, nodemask_t *nodes);
 extern int generic_writepages(struct address_space *mapping,
  struct writeback_control *wbc);
 int do_writepages(struct address_space *mapping, struct writeback_control 
*wbc);
Index: linux-2.6.20-rc5/mm/page-writeback.c
===
--- linux-2.6.20-rc5.orig/mm/page-writeback.c   2007-01-15 21:34:38.573870823 
-0600
+++ linux-2.6.20-rc5/mm/page-writeback.c2007-01-15 21:34:43.150447823 
-0600
@@ -101,7 +101,7 @@ EXPORT_SYMBOL(laptop_mode);
 /* End of sysctl-exported parameters */
 
 
-static void background_writeout(unsigned long _min_pages);
+static void background_writeout(unsigned long _min_pages, nodemask_t *nodes);
 
 /*
  * Work out the current dirty-memory clamping and background writeout
@@ -244,7 +244,7 @@ static void balance_dirty_pages(struct a
 */
if ((laptop_mode && pages_written) ||
 (!laptop_mode && (nr_reclaimable > background_thresh)))
-   pdflush_operation(background_writeout, 0);
+   pdflush_operation(background_writeout, 0, NULL);
 }
 
 void set_page_dirty_balance(struct page *page)
@@ -325,7 +325,7 @@ void throttle_vm_writeout(void)
  * writeback at least _min_pages, and keep writing until the amount of dirty
  * memory is less than the background threshold, or until we're all clean.
  */
-static void background_writeout(unsigned long _min_pages)
+static void background_writeout(unsigned long _min_pages, nodemask_t *unused)
 {
long min_pages = _min_pages;
struct writeback_control wbc = {
@@ -365,12 +365,12 @@ static void background_writeout(unsigned
  * the whole world.  Returns 0 if a pdflush thread was dispatched.  Returns
  * -1 if all pdflush threads were busy.
  */
-int wakeup_pdflush(long nr_pages)
+int wakeup_pdflush(long nr_pages, nodemask_t *nodes)
 {
if (nr_pages == 0)
nr_pages = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
-   return pdflush_operation(background_writeout, nr_pages);
+   return pdflush_operation(background_writeout, nr_pages, nodes);
 }
 
 static void wb_timer_fn(unsigned long unused);
@@ -394,7 +394,7 @@ static DEFINE_TIMER(laptop_mode_wb_timer
  * older_than_this takes precedence over nr_to_write.  So we'll only write back
  * all dirty pages if they are all attached to "old" mappings.
  */
-static void wb_kupdate(unsigned long arg)
+static void wb_kupdate(unsigned long arg, nodemask_t *unused)
 {
unsigned long oldest_jif;
unsigned long start_jif;
@@ -454,18 +454,18 @@ int dirty_writeback_centisecs_handler(ct
 
 static void wb_timer_fn(unsigned long unused)
 {
-   if (pdflush_operation(wb_kupdate, 0) < 0)
+   if (pdflush_operation(wb_kupdate, 0, NULL) < 0)
mod_timer(&wb_timer, jiffies + HZ); /* delay 1 second */
 }
 
-static void laptop_flush(unsigned long unused)
+static void laptop_flush(unsigned long unused, nodemask_t *unused2)
 {
sys_sync();
 }
 
 static void laptop_timer_fn(unsigned long unused)
 {
-   pdflush_operation(laptop_flush, 0);
+   pdflush_operation(laptop_flush, 0, NULL);
 }
 
 /*
Index: linux-2.6.20-rc5/mm/pdflush.c
===
--- linux-2.6.20-rc5.orig/mm/pdflush.c  2007-01-15 21:34:38.582660664 -0600
+++ linux-2.6.20-rc5/mm/pdflush.c   2007-01-15 21:34:43.161190961 -0600
@@ -83,10 +83,12 @@ static unsigned long last_empty_jifs;
  */
 struct pdflush_work {
struct task_struct *who;/* The thread */
-   void (*fn)(unsigned long);  /* A callback function */
+   void (*fn)(unsigned long, nodemask_t *); /* A callback function 

[RFC 8/8] Reduce inode memory usage for systems with a high MAX_NUMNODES

2007-01-15 Thread Christoph Lameter
Dynamically reduce the size of the nodemask_t in struct inode

The nodemask_t in struct inode can potentially waste a lot of memory if
MAX_NUMNODES is high. For IA64 MAX_NUMNODES is 1024 by default which
results in 128 bytes to be used for the nodemask. This means that the
memory use for inodes may increase significantly since they all now
include a dirty_map. These may be unnecessarily large on smaller systems.

We placed the nodemask at the end of struct inode. This patch avoids
touching the later part of the nodemask if the actual maximum possible
node on the system is less than 1024. If MAX_NUMNODES is larger than
BITS_PER_LONG (and we may use more than one word for the nodemask) then
we calculate the number of bytes that may be taken off the end of
an inode. We can then create the inode caches without those bytes
effectively saving memory. On an IA64 system booting with a
maximum of 64 nodes we may save 120 of those 128 bytes per inode.
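
The arithmetic above can be checked with a small user-space sketch. The helper
names (nodemask_bytes, unused_nodemask_bytes) are hypothetical and not taken
from the patch; only the rule they implement comes from the description:
nothing is trimmed when the mask fits in one word, otherwise the words beyond
the last possible node are dropped.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))

/* Bytes occupied by a bitmap of nbits bits, rounded up to whole longs. */
static size_t nodemask_bytes(unsigned int nbits)
{
    return ((nbits + BITS_PER_LONG - 1) / BITS_PER_LONG) * sizeof(unsigned long);
}

/*
 * Bytes that could be cut off the end of the inode (hypothetical helper).
 * max_numnodes models MAX_NUMNODES, nr_ids models nr_node_ids.
 */
static size_t unused_nodemask_bytes(unsigned int max_numnodes, unsigned int nr_ids)
{
    assert(nr_ids >= 1 && nr_ids <= max_numnodes);
    if (max_numnodes <= BITS_PER_LONG)
        return 0;   /* single-word mask: nothing to trim */
    return nodemask_bytes(max_numnodes) - nodemask_bytes(nr_ids);
}
```

With max_numnodes = 1024 and nr_ids = 64 the full mask is 128 bytes and the
used part is 8 bytes, which gives the 120 bytes quoted above.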

This is only done for filesystems that are typically used for NUMA
systems: xfs, nfs, ext3, ext4 and reiserfs. Other filesystems will
always use the full length of the inode.

This solution may be a bit hokey. I tried other approaches but this
one seemed to be the simplest with the least complications. Maybe someone
else can come up with a better solution?

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.20-rc5/fs/xfs/linux-2.6/xfs_super.c
===
--- linux-2.6.20-rc5.orig/fs/xfs/linux-2.6/xfs_super.c  2007-01-15 22:33:55.0 -0600
+++ linux-2.6.20-rc5/fs/xfs/linux-2.6/xfs_super.c   2007-01-15 22:35:07.596529498 -0600
@@ -370,7 +370,9 @@ xfs_fs_inode_init_once(
 STATIC int
 xfs_init_zones(void)
 {
-   xfs_vnode_zone = kmem_zone_init_flags(sizeof(bhv_vnode_t), "xfs_vnode",
+   xfs_vnode_zone = kmem_zone_init_flags(sizeof(bhv_vnode_t)
+   - unused_numa_nodemask_bytes,
+   "xfs_vnode",
KM_ZONE_HWALIGN | KM_ZONE_RECLAIM |
KM_ZONE_SPREAD,
xfs_fs_inode_init_once);
Index: linux-2.6.20-rc5/include/linux/fs.h
===
--- linux-2.6.20-rc5.orig/include/linux/fs.h 2007-01-15 22:33:55.0 -0600
+++ linux-2.6.20-rc5/include/linux/fs.h 2007-01-15 22:35:07.621922373 -0600
@@ -591,6 +591,14 @@ struct inode {
void*i_private; /* fs or device private pointer */
 #ifdef CONFIG_CPUSETS
nodemask_t  dirty_nodes;/* Map of nodes with dirty pages */
+   /*
+* Note that we may only use a portion of the bitmap in dirty_nodes
+* if we have a large MAX_NUMNODES but the number of possible nodes
+* is small in order to reduce the size of the inode.
+*
+* Bits after nr_node_ids (one node beyond the last possible
+* node_id) may not be accessed.
+*/
 #endif
 };
 
Index: linux-2.6.20-rc5/fs/ext3/super.c
===
--- linux-2.6.20-rc5.orig/fs/ext3/super.c   2007-01-15 22:33:55.0 -0600
+++ linux-2.6.20-rc5/fs/ext3/super.c 2007-01-15 22:35:07.646338599 -0600
@@ -480,7 +480,8 @@ static void init_once(void * foo, struct
 static int init_inodecache(void)
 {
ext3_inode_cachep = kmem_cache_create("ext3_inode_cache",
-sizeof(struct ext3_inode_info),
+sizeof(struct ext3_inode_info)
+   - unused_numa_nodemask_bytes,
 0, (SLAB_RECLAIM_ACCOUNT|
SLAB_MEM_SPREAD),
 init_once, NULL);
Index: linux-2.6.20-rc5/fs/inode.c
===
--- linux-2.6.20-rc5.orig/fs/inode.c 2007-01-15 22:33:55.0 -0600
+++ linux-2.6.20-rc5/fs/inode.c 2007-01-15 22:35:07.661964984 -0600
@@ -1399,7 +1399,8 @@ void __init inode_init(unsigned long mem
 
/* inode slab cache */
inode_cachep = kmem_cache_create("inode_cache",
-sizeof(struct inode),
+sizeof(struct inode)
+   - unused_numa_nodemask_bytes,
 0,
 (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
 SLAB_MEM_SPREAD),
Index: linux-2.6.20-rc5/fs/reiserfs/super.c
===
--- linux-2.6.20-rc5.orig/fs/reiserfs/super.c   2007-01-15 22:33:55.0 -0600
+++ linux-2.6.20-rc5/fs/reiserfs/super.c 2007-01-15 

[RFC 2/8] Add a map to inodes to track dirty pages per node

2007-01-15 Thread Christoph Lameter
Add a dirty map to the inode

In a NUMA system it is helpful to know where the dirty pages of a mapping
are located. That way we will be able to implement writeout for applications
that are constrained to a portion of the memory of the system as required by
cpusets.

Two functions are introduced to manage the dirty node map:

cpuset_clear_dirty_nodes() and cpuset_update_dirty_nodes(). Both are defined using
macros since the definition of struct inode may not be available in cpuset.h.

The dirty map is cleared when the inode is cleared. There is no
synchronization (except for the atomic nature of node_set) for the dirty_map. The
only problem that can arise is that we do not write out an inode because a
node bit is not set. That is rare and becomes vanishingly rare if multiple pages
are involved. There is therefore a slight chance that we miss a dirty
node if it contains just a single page, which is likely tolerable.
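
To make the intended use of the map concrete, here is a simplified user-space
model. It is a sketch under assumptions, not the patch's implementation:
struct model_inode, model_update_dirty_nodes() and model_clear_dirty_nodes()
are hypothetical stand-ins for struct inode and the two cpuset macros, a plain
unsigned long replaces nodemask_t, and the bit set here is not atomic the way
node_set is.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MODEL_MAX_NODES ((int)(sizeof(unsigned long) * CHAR_BIT))

struct model_inode {
    unsigned long dirty_nodes;  /* bit n set: node n holds a dirty page */
};

/* Called when a page located on node nid is dirtied. */
static void model_update_dirty_nodes(struct model_inode *inode, int nid)
{
    assert(nid >= 0 && nid < MODEL_MAX_NODES);
    inode->dirty_nodes |= 1UL << nid;
}

/* Called when the inode is allocated or found to be clean. */
static void model_clear_dirty_nodes(struct model_inode *inode)
{
    inode->dirty_nodes = 0;
}

/* Query used by writeback to see whether node nid has dirty pages. */
static bool model_node_is_dirty(const struct model_inode *inode, int nid)
{
    assert(nid >= 0 && nid < MODEL_MAX_NODES);
    return (inode->dirty_nodes >> nid) & 1UL;
}
```

The map only ever grows while the inode is dirty and is reset as a whole once
the inode is clean, which is why no per-node counters are needed.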

This patch increases the size of struct inode for the NUMA case. For
most arches that only support up to 64 nodes this is simply adding one
unsigned long.

However, the default Itanium configuration allows for up to 1024 nodes.
On Itanium we add 128 bytes per inode. A later patch will make the size of
the per node bit array dynamic so that the inode slab caches
are properly sized.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.20-rc5/fs/fs-writeback.c
===
--- linux-2.6.20-rc5.orig/fs/fs-writeback.c 2007-01-12 12:54:26.0 -0600
+++ linux-2.6.20-rc5/fs/fs-writeback.c  2007-01-15 22:34:12.065241639 -0600
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include <linux/cpuset.h>
 #include "internal.h"
 
 /**
@@ -223,11 +224,13 @@ __sync_single_inode(struct inode *inode,
/*
 * The inode is clean, inuse
 */
+   cpuset_clear_dirty_nodes(inode);
list_move(&inode->i_list, &inode_in_use);
} else {
/*
 * The inode is clean, unused
 */
+   cpuset_clear_dirty_nodes(inode);
list_move(&inode->i_list, &inode_unused);
}
}
Index: linux-2.6.20-rc5/fs/inode.c
===
--- linux-2.6.20-rc5.orig/fs/inode.c 2007-01-12 12:54:26.0 -0600
+++ linux-2.6.20-rc5/fs/inode.c 2007-01-15 22:33:55.802081773 -0600
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include <linux/cpuset.h>
 
 /*
  * This is needed for the following functions:
@@ -134,6 +135,7 @@ static struct inode *alloc_inode(struct 
inode->i_cdev = NULL;
inode->i_rdev = 0;
inode->dirtied_when = 0;
+   cpuset_clear_dirty_nodes(inode);
if (security_inode_alloc(inode)) {
if (inode->i_sb->s_op->destroy_inode)
inode->i_sb->s_op->destroy_inode(inode);
Index: linux-2.6.20-rc5/include/linux/fs.h
===
--- linux-2.6.20-rc5.orig/include/linux/fs.h 2007-01-12 12:54:26.0 -0600
+++ linux-2.6.20-rc5/include/linux/fs.h 2007-01-15 22:33:55.876307100 -0600
@@ -589,6 +589,9 @@ struct inode {
void*i_security;
 #endif
void*i_private; /* fs or device private pointer */
+#ifdef CONFIG_CPUSETS
+   nodemask_t  dirty_nodes;/* Map of nodes with dirty pages */
+#endif
 };
 
 /*
Index: linux-2.6.20-rc5/mm/page-writeback.c
===
--- linux-2.6.20-rc5.orig/mm/page-writeback.c   2007-01-12 12:54:26.0 -0600
+++ linux-2.6.20-rc5/mm/page-writeback.c 2007-01-15 22:34:14.425802376 -0600
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include <linux/cpuset.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -780,6 +781,7 @@ int __set_page_dirty_nobuffers(struct pa
if (mapping->host) {
/* !PageAnon && !swapper_space */
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+   cpuset_update_dirty_nodes(mapping->host, page);
}
return 1;
}
Index: linux-2.6.20-rc5/fs/buffer.c
===
--- linux-2.6.20-rc5.orig/fs/buffer.c   2007-01-12 12:54:26.0 -0600
+++ linux-2.6.20-rc5/fs/buffer.c 2007-01-15 22:34:14.459008443 -0600
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include <linux/cpuset.h>
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
 static void invalidate_bh_lrus(void);
@@ -739,6 +740,7 @@ int __set_page_dirty_buffers(struct page
}
write_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, 

[RFC 5/8] Make writeout during reclaim cpuset aware

2007-01-15 Thread Christoph Lameter
Direct reclaim: cpuset aware writeout

During direct reclaim we traverse down a zonelist and carefully check
for each zone whether it is a member of the active cpuset. But then we call
pdflush without enforcing the same restrictions. On a larger system this
may result in a massive number of pages being dirtied and then either

A. No writeout occurs because global dirty limits have not been reached

or

B. Writeout starts randomly for some dirty inode in the system. Pdflush
   may just write out data for nodes in another cpuset and miss doing
   proper dirty handling for the current cpuset.

In both cases dirty pages in the zones of interest may not be affected
and writeout may not occur as necessary.

Fix that by restricting pdflush to the active cpuset. Writeout will occur
from direct reclaim as in an SMP system.
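
The selection rule this implies can be sketched in user space as follows. The
function name is hypothetical and plain bitmasks stand in for nodemask_t.
Treating a NULL node set as "no restriction" is an assumption, based on the
callers in this series that pass NULL for system-wide writeout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Decide whether an inode is a writeback candidate.
 * dirty_nodes: bitmask of nodes holding dirty pages of the inode.
 * allowed:     bitmask of nodes in the active cpuset, or NULL for
 *              system-wide writeback (assumed meaning of NULL).
 */
static bool model_inode_selected(unsigned long dirty_nodes,
                                 const unsigned long *allowed)
{
    if (allowed == NULL)
        return true;
    return (dirty_nodes & *allowed) != 0;
}
```

An inode whose dirty pages all sit outside the cpuset is skipped, so pdflush
spends its effort on pages that actually relieve the zones being reclaimed.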

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.20-rc5/mm/vmscan.c
===
--- linux-2.6.20-rc5.orig/mm/vmscan.c   2007-01-15 21:34:43.173887398 -0600
+++ linux-2.6.20-rc5/mm/vmscan.c 2007-01-15 21:37:26.605346439 -0600
@@ -1065,7 +1065,8 @@ unsigned long try_to_free_pages(struct z
 */
if (total_scanned > sc.swap_cluster_max +
sc.swap_cluster_max / 2) {
-   wakeup_pdflush(laptop_mode ? 0 : total_scanned, NULL);
+   wakeup_pdflush(laptop_mode ? 0 : total_scanned,
+   &cpuset_current_mems_allowed);
sc.may_writepage = 1;
}
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC 0/8] Cpuset aware writeback

2007-01-15 Thread Christoph Lameter
Currently cpusets are not able to do proper writeback since
dirty ratio calculations and writeback are all done for the system
as a whole. This may result in a large percentage of a cpuset
to become dirty without writeout being triggered. Under NFS
this can lead to OOM conditions.

Writeback will occur during the LRU scans. But such writeout
is not effective since we write page by page and not in inode page
order (regular writeback).

In order to fix the problem we first of all introduce a method to
establish a map of nodes that contain dirty pages for each
inode mapping.

Secondly we modify the dirty limit calculation to be based
on the active cpuset.

If we are in a cpuset then we select only inodes for writeback
that have pages on the nodes of the cpuset.

After we have the cpuset throttling in place we can then make
further fixups:

A. We can do inode based writeout from direct reclaim
   avoiding single page writes to the filesystem.

B. We add a new counter NR_UNRECLAIMABLE that is subtracted
   from the available pages in a node. This allows us to
   accurately calculate the dirty ratio even if large portions
   of the node have been allocated for huge pages or for
   slab pages.

There are a couple of points where some better ideas could be used:

1. The nodemask expands the inode structure significantly if the
architecture allows a high number of nodes. This is only an issue
for IA64. For that platform we expand the inode structure by 128 bytes
(to support 1024 nodes). The last patch attempts to address the issue
by using the knowledge about the maximum possible number of nodes
determined on bootup to shrink the nodemask.

2. The calculation of the per cpuset limits can require looping
over a number of nodes which may bring the performance of get_dirty_limits
near pre 2.6.18 performance (before the introduction of the ZVC counters)
(only for cpuset based limit calculation). There is no way of keeping these
counters per cpuset since cpusets may overlap.
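
A rough user-space sketch of the per-cpuset calculation described in B and in
point 2 follows. The structure and function names are hypothetical, and the
real get_dirty_limits() takes more factors into account; the sketch only shows
the loop over the nodes of the cpuset and the subtraction of unreclaimable
pages before the ratio is applied.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

struct model_node {
    unsigned long present;        /* pages on the node */
    unsigned long unreclaimable;  /* models NR_UNRECLAIMABLE */
};

/*
 * Dirty page limit for the nodes selected by the bitmask 'allowed'.
 * dirty_ratio is a percentage.
 */
static unsigned long model_cpuset_dirty_limit(const struct model_node *nodes,
                                              size_t nr_nodes,
                                              unsigned long allowed,
                                              unsigned int dirty_ratio)
{
    unsigned long available = 0;
    size_t nid;

    assert(dirty_ratio <= 100);
    assert(nr_nodes <= sizeof(unsigned long) * CHAR_BIT);
    for (nid = 0; nid < nr_nodes; nid++) {
        if (!((allowed >> nid) & 1UL))
            continue;   /* node is not part of this cpuset */
        available += nodes[nid].present - nodes[nid].unreclaimable;
    }
    return (available * dirty_ratio) / 100;
}
```

Because two cpusets may share a node, the sum has to be recomputed per call
rather than cached per cpuset, which is the cost point 2 refers to.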

Paul probably needs to go through this and may want additional fixes to
keep things in harmony with cpusets.

Tested on:
IA64 NUMA 128p, 12p

Compiles on:
i386 SMP
x86_64 UP




[RFC 1/8] Convert highest_possible_node_id() into nr_node_ids

2007-01-15 Thread Christoph Lameter
Replace highest_possible_node_id() with nr_node_ids

highest_possible_node_id() is used to calculate the last possible node id
so that the network subsystem can figure out how to size per node arrays.

I think having the ability to determine the maximum number of nodes in
a system at runtime is useful, but then we should name this entry
accordingly and also calculate the value only once on bootup.

This patch introduces nr_node_ids and replaces the use of
highest_possible_node_id(). nr_node_ids is calculated on bootup when
the page allocator's pagesets are initialized.
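
The difference between the old and the new value is an off-by-one worth
spelling out: highest_possible_node_id() returned the highest id, while
nr_node_ids is that id plus one and can be used directly as an array size.
A user-space model, with a plain bitmask standing in for node_possible_map and
a hypothetical function name:

```c
#include <assert.h>
#include <limits.h>

/* Number of node ids needed to cover every set bit in possible_map. */
static unsigned int model_nr_node_ids(unsigned long possible_map)
{
    unsigned int node, highest = 0;

    for (node = 0; node < sizeof(possible_map) * CHAR_BIT; node++)
        if ((possible_map >> node) & 1UL)
            highest = node;
    return highest + 1;
}
```

With possible nodes 0 and 2 the result is 3, not 2: node ids may be sparse, so
per node arrays must be sized by the highest id and not by the node count.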

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.20-rc4-mm1/include/linux/nodemask.h
===
--- linux-2.6.20-rc4-mm1.orig/include/linux/nodemask.h  2007-01-06 21:45:51.0 -0800
+++ linux-2.6.20-rc4-mm1/include/linux/nodemask.h   2007-01-12 12:59:50.0 -0800
@@ -352,7 +352,7 @@
 #define node_possible(node)node_isset((node), node_possible_map)
 #define first_online_node  first_node(node_online_map)
 #define next_online_node(nid)  next_node((nid), node_online_map)
-int highest_possible_node_id(void);
+extern int nr_node_ids;
 #else
 #define num_online_nodes() 1
 #define num_possible_nodes()   1
@@ -360,7 +360,7 @@
 #define node_possible(node)((node) == 0)
 #define first_online_node  0
 #define next_online_node(nid)  (MAX_NUMNODES)
-#define highest_possible_node_id() 0
+#define nr_node_ids1
 #endif
 
 #define any_online_node(mask)  \
Index: linux-2.6.20-rc4-mm1/mm/page_alloc.c
===
--- linux-2.6.20-rc4-mm1.orig/mm/page_alloc.c   2007-01-12 12:58:26.0 -0800
+++ linux-2.6.20-rc4-mm1/mm/page_alloc.c 2007-01-12 12:59:50.0 -0800
@@ -679,6 +679,26 @@
return i;
 }
 
+#if MAX_NUMNODES > 1
+int nr_node_ids __read_mostly;
+EXPORT_SYMBOL(nr_node_ids);
+
+/*
+ * Figure out the number of possible node ids.
+ */
+static void __init setup_nr_node_ids(void)
+{
+   unsigned int node;
+   unsigned int highest = 0;
+
+   for_each_node_mask(node, node_possible_map)
+   highest = node;
+   nr_node_ids = highest + 1;
+}
+#else
+static void __init setup_nr_node_ids(void) {}
+#endif
+
 #ifdef CONFIG_NUMA
 /*
  * Called from the slab reaper to drain pagesets on a particular node that
@@ -3318,6 +3338,7 @@
min_free_kbytes = 65536;
setup_per_zone_pages_min();
setup_per_zone_lowmem_reserve();
+   setup_nr_node_ids();
return 0;
 }
 module_init(init_per_zone_pages_min)
@@ -3519,18 +3540,4 @@
 EXPORT_SYMBOL(page_to_pfn);
 #endif /* CONFIG_OUT_OF_LINE_PFN_TO_PAGE */
 
-#if MAX_NUMNODES > 1
-/*
- * Find the highest possible node id.
- */
-int highest_possible_node_id(void)
-{
-   unsigned int node;
-   unsigned int highest = 0;
 
-   for_each_node_mask(node, node_possible_map)
-   highest = node;
-   return highest;
-}
-EXPORT_SYMBOL(highest_possible_node_id);
-#endif
Index: linux-2.6.20-rc4-mm1/net/sunrpc/svc.c
===
--- linux-2.6.20-rc4-mm1.orig/net/sunrpc/svc.c  2007-01-06 21:45:51.0 -0800
+++ linux-2.6.20-rc4-mm1/net/sunrpc/svc.c   2007-01-12 12:59:50.0 -0800
@@ -116,7 +116,7 @@
 static int
 svc_pool_map_init_percpu(struct svc_pool_map *m)
 {
-   unsigned int maxpools = highest_possible_processor_id()+1;
+   unsigned int maxpools = nr_node_ids;
unsigned int pidx = 0;
unsigned int cpu;
int err;
@@ -144,7 +144,7 @@
 static int
 svc_pool_map_init_pernode(struct svc_pool_map *m)
 {
-   unsigned int maxpools = highest_possible_node_id()+1;
+   unsigned int maxpools = nr_node_ids;
unsigned int pidx = 0;
unsigned int node;
int err;


Re: [PATCH -mm 2/10][RFC] aio: net use struct socket for io

2007-01-15 Thread Stephen Hemminger
On Mon, 15 Jan 2007 17:54:50 -0800
Nate Diller <[EMAIL PROTECTED]> wrote:

> Remove unused arg from socket operations
> 
> The sendmsg and recvmsg socket operations take a kiocb pointer, but none of
> the functions actually use it.  There's really no need even theoretically,
> it's really quite ugly having it there at all.  Also, removing it will pave
> the way for a more generic completion path in the file_operations.
> 
> ---

Would getting rid of these make later implementation of AIO networking
harder?


Re: [PATCH -mm 3/10][RFC] aio: use iov_length instead of ki_left

2007-01-15 Thread Nate Diller

On 1/15/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote:

On Mon, Jan 15, 2007 at 05:54:50PM -0800, Nate Diller wrote:
> Convert code using iocb->ki_left to use the more generic iov_length() call.

No way.  We need to reduce the number of iovec traversals, not add
more of them.


ok, I can work on a version of this that uses struct iodesc.  Maybe
something like this?

struct iodesc {
   struct iovec *iov;
   unsigned long nr_segs;
   size_t nbytes;
};

I suppose it's worth doing the iodesc thing along with this patchset
anyway, since it'll avoid an extra round of interface churn.
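
As a user-space illustration of why carrying nbytes in the descriptor avoids
repeated traversals, here is a minimal sketch. The struct layout is the one
proposed above; iodesc_init() is a hypothetical helper that does not exist in
any posted patch.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

struct iodesc {
    struct iovec *iov;
    unsigned long nr_segs;
    size_t nbytes;
};

/* Walk the iovec once and remember the total length in the descriptor. */
static void iodesc_init(struct iodesc *io, struct iovec *iov,
                        unsigned long nr_segs)
{
    unsigned long seg;

    io->iov = iov;
    io->nr_segs = nr_segs;
    io->nbytes = 0;
    for (seg = 0; seg < nr_segs; seg++)
        io->nbytes += iov[seg].iov_len;
}
```

Every layer below the point of initialization can then read io->nbytes instead
of calling iov_length() again.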

NATE


Re: [PATCH] flush_cpu_workqueue: don't flush an empty ->worklist

2007-01-15 Thread Srivatsa Vaddagiri
On Mon, Jan 15, 2007 at 07:55:16PM +0300, Oleg Nesterov wrote:
> > What if 'singlethread_cpu' dies?
> 
> Still can't understand you. Probably you missed what singlethread_cpu is.

oops yes ..I had mistakenly thought that create_workqueue_thread() will
bind worker thread to singlethread_cpu for single_threaded workqueue.
So it isn't a problem.
 
> > What abt __create_workqueue/schedule_on_each_cpu?
> 
> As I said already __create_workqueue() needs a fix, schedule_on_each_cpu()
> is already broken, and should be fixed as well.

__create_workqueue() creates worker threads for all online CPUs
currently. Accessing the online_map could be racy unless we 
serialize the access with hotplug event (thr' a mutex like workqueue
mutex held between LOCK_ACQ/LOCK_RELEASE messages or process freezer)
OR take special measures as was done in flush_workqueue. How were
you considering to deal with that raciness?

> > > The whole purpose of this change to avoid this!
> >
> > I guess it depends on how __create_workqueue/schedule_on_each_cpu is
> > modified (whether we take/release lock upon LOCK_ACQ/LOCK_RELEASE)
> 
> Sorry, can't understand this...

I meant to say that depending on how we modify
__create_workqueue/schedule_on_each_cpu to avoid racy-access to
online_map, we can debate whether workqueue mutex needs to be held
between LOCK_ACQ/LOCK_RELEASE messages in the callback.

> > What abt stopping that thread in CPU_DOWN_PREPARE (before freezing
> > processes)? I understand that it may add to the latency, but compared to
> > the overall latency of process freezer, I suspect it may not be much.
> 
> Srivatsa, why do you think this would be better?
> 
> It add to the complexity! What do you mean by "stopping that thread" ?
> Kill it? - this is wrong. 

I meant issuing kthread_stop() in DOWN_PREPARE so that worker
thread exits itself (much before CPU is actually brought down).

Do you see any problems with that?

Even if there are problems with it, how abt something like below:

workqueue_cpu_callback()
{

    CPU_DEAD:
        /* threads are still frozen at this point */
        take_over_work();
        kthread_mark_stop(worker_thread);
        break;

    CPU_CLEAN_THREADS:
        /* all threads resumed by now */
        kthread_stop(worker_thread); /* task_struct ref required? */
        break;

}

kthread_mark_stop() will mark somewhere in task_struct that the thread
should exit when it comes out of refrigerator.

worker_thread()
{

    while (!kthread_should_stop()) {
        if (cwq->freezeable)
            try_to_freeze();

        if (kthread_marked_stop(current))
            break;

        ...

    }
}

The advantage I see above is that, when take_over_work() is running, we won't 
race with functions like flush_workqueue() (threads still frozen at that
point) and hence we avoid hacks like migrate_sequence. This will also
let functions like flush_workqueue() easily access cpu_online_map as
below -without- any special locking/hacks (which I consider a great
benefit for programmers).

flush_workqueue()
{
    for_each_online_cpu(i)
        flush_cpu_workqueue(i);
}

Do you see any problems with this latter approach?

-- 
Regards,
vatsa


Re: allocation failed: out of vmalloc space error treating and VIDEO1394 IOC LISTEN CHANNEL ioctl failed problem

2007-01-15 Thread David Moore
On Mon, 2007-01-15 at 16:43 -0500, Kristian Høgsberg wrote:
> On 1/15/07, Arjan van de Ven <[EMAIL PROTECTED]> wrote:
> > again the best way is for you to provide an mmap method... you can then
> > fill in the pages and keep that in some sort of array; this is for
> > example also what the DRI/DRM layer does for textures etc...
> 
> That sounds a lot like what I have now (mmap method, array of pages)
> so I'll just stick with that.

It sounds like the distinction Arjan is getting at is that the buffer
should exist in the process's virtual address space instead of the
kernel's virtual address space so that we have plenty of space available
to us.

Thus, we should use get_user_pages() instead of vmalloc().  I think
get_user_pages() will also automatically pin the memory.  And we'll also
need to call get_user_pages() from a custom mmap() handler so that we
know what process virtual address to assign to the region.

Is that right Arjan?

Thanks,

David



Re: [PATCH] CPUSET related breakage of sys_mbind

2007-01-15 Thread Paul Jackson
Patch looks good - thanks, Bob.

Signed-off-by: Paul Jackson <[EMAIL PROTECTED]>

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401


Re: [PATCH -mm 0/10][RFC] aio: make struct kiocb private

2007-01-15 Thread Nate Diller

On 1/15/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote:

On Mon, Jan 15, 2007 at 05:54:50PM -0800, Nate Diller wrote:
> This series is an attempt to generalize the async I/O paths to be
> implementation agnostic.  It completely eliminates knowledge of
> the kiocb structure in the generic code and makes it private within the
> current aio code.  Things get noticeably cleaner without that layering
> violation.
>
> The new interface takes a file_endio_t function pointer, and a private data
> pointer, which would normally be aio_complete and a kiocb pointer,
> respectively.  If the aio submission function gets back EIOCBQUEUED, that is
> a guarantee that the endio function will be called, or *already has been
> called*.  If the file_endio_t pointer provided to aio_[read|write] is NULL,
> the FS must block on I/O completion, then return either the number of bytes
> read, or an error.

I don't really like this patchset at all.  At some point it's a lot nicer
to put a lot of parameters that are related and passed down a long
callchain into a structure, and I think the aio code is over that threshold.
The completion function cleanups look okay to me, but I'd rather add
that completion function to struct kiocb instead of removing kiocb use.

I have this slight feeling you want to use these completions for something
other than the current aio code; if that's the case it would help
if you could explain briefly in what direction you're heading.


Actually I agree with you more than you might think.  I had intended
this to mesh with your struct iodesc idea, where iodesc would contain
the iovec pointer, nr_segs, iov_length, and whatever else needs to be
there, potentially even the endio function and its private data, tying
those to the iovec instead of a separate structure that needs to be
kept in sync.  There's a distinct layering that should exist between
things that should accompany the iovec transparently, and private data
that should be attached opaquely by layers above.
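
To illustrate the layering I mean, here is a user-space sketch of a descriptor
that carries the completion callback and its private data. All names are
hypothetical (model_iodesc, model_endio_t, model_submit), the I/O itself is
faked, and MODEL_EIOCBQUEUED is a stand-in for the kernel's EIOCBQUEUED; the
only thing taken from the series is the contract: with a callback the result
arrives through it, without one the call returns the byte count itself.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_EIOCBQUEUED 529   /* stand-in for the kernel's EIOCBQUEUED */

typedef void (*model_endio_t)(void *private_data, long result);

struct model_iodesc {
    size_t nbytes;          /* total length of the request */
    model_endio_t endio;    /* NULL: caller wants a synchronous result */
    void *private_data;     /* opaque to the lower layers */
};

/* Sample completion handler: store the result where the caller asked. */
static void model_record_result(void *private_data, long result)
{
    *(long *)private_data = result;
}

/*
 * Pretend to perform the I/O. With an endio callback the result is
 * delivered through it and -MODEL_EIOCBQUEUED is returned; without one
 * the byte count is returned directly.
 */
static long model_submit(struct model_iodesc *io)
{
    long done = (long)io->nbytes;   /* the model always transfers everything */

    if (io->endio == NULL)
        return done;
    io->endio(io->private_data, done);  /* may run before we return */
    return -MODEL_EIOCBQUEUED;
}
```

The lower layer never looks inside private_data, which is the "attached
opaquely" part; everything it does need travels in the descriptor.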

The biggest thing I have in mind for this patch, actually, is to fix
up the *sync* paths.  I don't think we should be waiting on sync I/O
at the *top* of the call stack, like with wait_on_sync_kiocb(), I'd
say the best place to wait is at the *bottom*, down in the I/O
scheduler.  This would make it a lot easier to clean up the completion
paths, because in the sync case, you'd be right back in process
context again as you traverse upward through the RAID, encryption,
loopback, directIO, FS log commit, etc.  It doesn't by itself
eliminate the need for all the threads and workqueues and such that
those layers each own, but it is a step in the right direction.

Now if you want to talk about long-term vaporware style ideas, yeah, I
do have my own thoughts on how aio should work.  And from Agami's
perspective, this patch also makes it easier for us to do certain
debugging traces that we wish to hack together, in order to profile
performance on our platform.  But I'd be hesitant to make those
arguments, cause they are largely irrelevant (we can obviously carry
the patch for debugging without buy-in from the community).  This is
the right thing to do from a design perspective.  Hopefully it enables
a new architecture that can reduce context switches in I/O completion,
and reduce overhead.  That's the real motive ;)

NATE


Re: [PATCH] CPUSET related breakage of sys_mbind

2007-01-15 Thread Paul Jackson
Christoph wrote:
> Cpusets is your thing so I think you could fix this the right way.

But wasn't it your patch that broke ...

Actually, I'd have blessed Bob Picco's patch, as it's done the right
way, with a cpuset_* macro hook, defined twice in cpuset.h, with and
without CONFIG_CPUSETS, where the without case compiles to a no-op.
This is the same way as is used for the couple dozen other cpuset
kernel hooks.
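
For anyone who has not looked at how these hooks are built, the pattern is
roughly the following. This is a user-space sketch with made-up names
(MODEL_CONFIG_CPUSETS, hook_demo_inode, model_cpuset_update_dirty_nodes), not
a copy of anything in cpuset.h.

```c
#include <assert.h>

#define MODEL_CONFIG_CPUSETS 1  /* remove this line to build the no-op variant */

struct hook_demo_inode {
    unsigned long dirty_nodes;
};

#ifdef MODEL_CONFIG_CPUSETS
/* Real hook: record that node nid holds dirty pages. */
#define model_cpuset_update_dirty_nodes(inode, nid) \
    do { (inode)->dirty_nodes |= 1UL << (nid); } while (0)
#else
/* Compiled-out hook: callers need no #ifdef of their own. */
#define model_cpuset_update_dirty_nodes(inode, nid) \
    do { } while (0)
#endif

/* A caller that reads the same in both configurations. */
static void model_set_page_dirty(struct hook_demo_inode *inode, int nid)
{
    model_cpuset_update_dirty_nodes(inode, nid);
}
```

The point of defining the hook twice is that the call sites stay free of
conditionals, and the disabled configuration pays nothing for them.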

But I thought you were already signed up for this one, so I didn't want
to trample on your efforts.

And, perhaps more important, I understood you had some other patches in
the works that have cpuset hooks.  I'm thinking it would be a good idea
to learn how these hooks are done, so we don't have to come around here
again.

How about this ... you take another look at Bob's patch.  If it's ok by
you too, then we can both bless it, and that should do it.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401


Re: [PATCH] CPUSET related breakage of sys_mbind

2007-01-15 Thread Christoph Lameter
On Mon, 15 Jan 2007, Paul Jackson wrote:

> You're right about this problem.  I think that Christoph Lameter
> (added to cc list) is working on a fix for this.

Cpusets is your thing so I think you could fix this the right way. There 
are already two different patches fixing this. Just make it the way that 
it fits cpusets.




Re: Initramfs and /sbin/hotplug fun

2007-01-15 Thread Mark Rustad

On Jan 15, 2007, at 1:54 PM, Andrew Walrond wrote:


Olaf Hering wrote:

Why do you need /sbin/hotplug anyway, just for firmware loading for a
non-modular kernel?


I guess this is unusual, but FWIW...

I have a custom distro and I was just looking for the easiest way  
to create a bootable rescue pen-drive. So I just took a working  
distro, added an init->sbin/init symlink, cpio'ed it into an  
initramfs, and booted it up. Works a treat, except for the early  
hotplug calls.


I have a kernel that needs to have early hotplug calls to load  
firmware. I just rolled my own simple hotplug scripts to only address  
that issue and have not had a problem since. The mdev in busybox that  
is in the gentoo initramfs didn't seem to be able to handle it, so I  
just made my own scripts. In my case I needed QLogic firmware so root  
could be on FC.


FWIW, it is a real PITA to not be able to build a monolithic kernel  
that can bring up root on its own. I will stipulate that I am an old- 
school guy that likes monolithic kernels, but I do feel that  
something has been lost. Yes, I am aware of the reasons for the  
change, else I would have written something when I was fighting the  
battle, but I still don't have to like it.


--
Mark Rustad, [EMAIL PROTECTED]




Re: O_DIRECT question

2007-01-15 Thread Jörn Engel
On Fri, 12 January 2007 00:19:45 +0800, Aubrey wrote:
>
> Yes for desktop, server, but maybe not for embedded system, specially
> for no-mmu linux. In many embedded system cases, the whole system is
> running in the ram, including file system. So it's not necessary using
> page cache anymore. Page cache can't improve performance on these
> cases, but only fragment memory.

You were not very specific, so I have to guess that you're referring to
the problem of having two copies of the same file in RAM - one in the
page cache and one in the "backing store", which is just RAM.

There are two solutions to this problem.  One is tmpfs, which doesn't
use a backing store and keeps all data in the page cache.  The other is
xip, which doesn't use the page cache and goes directly to backing
store.  Unlike O_DIRECT, xip only works with a RAM or de-facto RAM
backing store (NOR flash works read-only).

So if you really care about memory waste in embedded systems, you should
have a look at mm/filemap_xip.c and continue Carsten Otte's work.

Jörn

-- 
Fantasy is more important than knowledge. Knowledge is limited,
while fantasy embraces the whole world.
-- Albert Einstein


Re: [PATCH] X.25 Add missing sock_put in x25_receive_data

2007-01-15 Thread David Miller
From: ahendry <[EMAIL PROTECTED]>
Date: Tue, 09 Jan 2007 09:32:17 +1100

> __x25_find_socket does a sock_hold.
> This adds a missing sock_put in x25_receive_data.
> 
> Signed-off-by: Andrew Hendry <[EMAIL PROTECTED]>

Applied, thanks a lot.


Re: [PATCH -mm 0/10][RFC] aio: make struct kiocb private

2007-01-15 Thread Christoph Hellwig
On Mon, Jan 15, 2007 at 05:54:50PM -0800, Nate Diller wrote:
> This series is an attempt to generalize the async I/O paths to be
> implementation agnostic.  It completely eliminates knowledge of
> the kiocb structure in the generic code and makes it private within the
> current aio code.  Things get noticeably cleaner without that layering
> violation.
> 
> The new interface takes a file_endio_t function pointer, and a private data
> pointer, which would normally be aio_complete and a kiocb pointer,
> respectively.  If the aio submission function gets back EIOCBQUEUED, that is
> a guarantee that the endio function will be called, or *already has been
> called*.  If the file_endio_t pointer provided to aio_[read|write] is NULL,
> the FS must block on I/O completion, then return either the number of bytes
> read, or an error.

I don't really like this patchset at all.  At some point it's a lot nicer
to put a lot of related parameters that are passed down a long
callchain into a structure, and I think the aio code is over that threshold.
The completion function cleanups look okay to me, but I'd rather add
that completion function to struct kiocb instead of removing kiocb use.
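
A minimal sketch of that alternative, assuming a simplified stand-in for
struct kiocb and hypothetical member names (ki_endio, ki_private); none of
this is the real kernel definition:

```c
/* Simplified stand-in for struct kiocb; not the kernel's definition. */
struct kiocb_sketch;

/* Hypothetical completion callback type, loosely modelled on the
 * file_endio_t idea from the patch description. */
typedef void (*endio_fn)(struct kiocb_sketch *iocb, long res);

struct kiocb_sketch {
	long     ki_nbytes;   /* related per-request parameters stay together */
	long     ki_pos;
	endio_fn ki_endio;    /* assumed new member: completion function */
	void    *ki_private;  /* data for whoever installed ki_endio */
};

/* Complete a request through the callback stored in the structure. */
static void iocb_complete(struct kiocb_sketch *iocb, long res)
{
	if (iocb->ki_endio)
		iocb->ki_endio(iocb, res);
}

/* Example callback: record the result in the caller's private slot. */
static void record_result(struct kiocb_sketch *iocb, long res)
{
	*(long *)iocb->ki_private = res;
}
```

The callchain still passes a single pointer, and a different completion
path only has to install a different ki_endio.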

I have a slight feeling you want to use these completions for something
other than the current aio code; if that's the case, it would help
if you could explain briefly in what direction you're heading.



Re: [Xen-devel] Re: [patch 20/20] XEN-paravirt: Add Xen virtual block device driver.

2007-01-15 Thread Mark Williamson
> > +
> > +   err = xenbus_printf(xbt, dev->nodename,
> > +   "ring-ref","%u", info->ring_ref);
>
> why do you need your own printf?

xenbus_printf isn't a printf replacement - it is used for writing a formatted 
string into XenStore (which contains  VM configuration data in a 
human-readable form).

Internally it does a vsnprintf into a buffer and writes the resulting string 
to the XenStore.
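
As a rough user-space illustration of that pattern (format with vsnprintf,
then write the finished string), here is a sketch. store_write, store_printf
and the buffer sizes are hypothetical stand-ins, not the xenbus API:

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the store: remembers the last value written. */
static char last_value[128];

static int store_write(const char *path, const char *value)
{
	(void)path;
	if (strlen(value) >= sizeof(last_value))
		return -1;
	strcpy(last_value, value);
	return 0;
}

/* Format into a local buffer, then hand the finished string to the store. */
static int store_printf(const char *path, const char *fmt, ...)
{
	char buf[128];
	va_list ap;
	int len;

	va_start(ap, fmt);
	len = vsnprintf(buf, sizeof(buf), fmt, ap);
	va_end(ap);

	if (len < 0 || (size_t)len >= sizeof(buf))
		return -1;	/* formatting error or truncated output */
	return store_write(path, buf);
}
```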

Cheers,
Mark

> > +static inline int GET_ID_FROM_FREELIST(
>
> does this really need screaming?
>
> > +
> > +int blkif_ioctl(struct inode *inode, struct file *filep,
> > +   unsigned command, unsigned long argument)
> > +{
> > +   int i;
> > +
> > +   DPRINTK_IOCTL("command: 0x%x, argument: 0x%lx, dev: 0x%04x\n",
> > + command, (long)argument, inode->i_rdev);
> > +
> > +   switch (command) {
> > +   case CDROMMULTISESSION:
> > +   DPRINTK("FIXME: support multisession CDs later\n");
> > +   for (i = 0; i < sizeof(struct cdrom_multisession); i++)
> > +   if (put_user(0, (char __user *)(argument + i)))
> > +   return -EFAULT;
> > +   return 0;
> > +
> > +   default:
> > +   /*printk(KERN_ALERT "ioctl %08x not supported by Xen blkdev\n",
> > + command);*/
> > +   return -EINVAL; /* same return as native Linux */
> > +   }
>
> eh so you implement no ioctls.. why then implement the ioctl method at
> all?

I'm not familiar with this code...  but perhaps the (fake) multisession 
handling is to keep userspace that queries this happy?  I can't really think 
of anywhere this would apply off the top of my head, though.

Cheers,
Mark

> > +static struct xenbus_driver blkfront = {
> > +   .name = "vbd",
> > +   .owner = THIS_MODULE,
> > +   .ids = blkfront_ids,
> > +   .probe = blkfront_probe,
> > +   .remove = blkfront_remove,
> > +   .resume = blkfront_resume,
> > +   .otherend_changed = backend_changed,
> > +};
>
> this can be const
>
> > +
> > +#define DPRINTK(_f, _a...) pr_debug(_f, ## _a)
>
> why this silly abstraction? Just use pr_debug in the code directly
>
>
>
>
> ___
> Xen-devel mailing list
> [EMAIL PROTECTED]
> http://lists.xensource.com/xen-devel

-- 
Dave: Just a question. What use is a unicyle with no seat?  And no pedals!
Mark: To answer a question with a question: What use is a skateboard?
Dave: Skateboards have wheels.
Mark: My wheel has a wheel!


Re: SATA exceptions with 2.6.20-rc5

2007-01-15 Thread Robert Hancock

Björn Steinbrink wrote:
> > It should be correct the way it is - that check is trying to prevent
> > ATAPI commands from using DMA until the slave_config function has been
> > called to set up the DMA parameters properly. When the
> > NV_ADMA_ATAPI_SETUP_COMPLETE flag is not set, this returns 1 which
> > disallows DMA transfers. Unless you were using an ATAPI (i.e. CD/DVD)
> > device on the channel this wouldn't affect you anyway.
>
> I wondered about it, because the flag is cleared when adma_enabled is 1,
> which seems to be consistent with everything but nv_adma_check_atapi_dma.

When ADMA is enabled we can't use ATAPI at all (or so says NVidia
anyway), so it has to be disabled when an ATAPI device is detected in
slave_config. Since doing that implies using the legacy BMDMA engine
with its greater restrictions, this is why we need to prevent DMA
transfers from being attempted until those restrictions have been set
properly. (Otherwise, the libata core will try to use PACKET commands on
an ATAPI device with DMA enabled before slave_config is even called.)

> Thus I thought that nv_adma_check_atapi_dma might be wrong, but maybe
> setting/clearing the flag is wrong instead? *feels lost*
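
A user-space model of the check as described in this thread, as a sketch
only: the flag name comes from the discussion, while its value, the
port_sketch structure and both helper functions are assumptions for
illustration, not the driver's code:

```c
/* Flag name taken from the discussion; the value is an assumption. */
#define NV_ADMA_ATAPI_SETUP_COMPLETE (1u << 0)

/* Simplified stand-in for the driver's per-port private data. */
struct port_sketch {
	unsigned int flags;
};

/* Nonzero return means "do not use DMA for this ATAPI command". */
static int check_atapi_dma(const struct port_sketch *pp)
{
	return !(pp->flags & NV_ADMA_ATAPI_SETUP_COMPLETE);
}

/* Models slave_config finding an ATAPI device and finishing the
 * legacy BMDMA setup, after which DMA may be attempted. */
static void slave_config_atapi(struct port_sketch *pp)
{
	pp->flags |= NV_ADMA_ATAPI_SETUP_COMPLETE;
}
```

Until the setup step has run, the flag is clear and the check refuses DMA,
which is the ordering the explanation above depends on.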


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/




Re: High lock spin time for zone->lru_lock under extreme conditions

2007-01-15 Thread Ravikiran G Thirumalai
On Sat, Jan 13, 2007 at 01:20:23PM -0800, Andrew Morton wrote:
> 
> Seeing the code helps.

But there was a subtle problem with hold time instrumentation here.
The code assumed the critical section exiting through 
spin_unlock_irq entered critical section with spin_lock_irq, but that
might not be the case always, and the instrumentation for hold time goes bad
when that happens (as in shrink_inactive_list)
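
One way around this is to take the timestamp at the lowest-level acquire
and release, so the pairing of lock/unlock variants no longer matters. A
minimal single-threaded sketch of that idea follows; the htsc and hold_time
names match the fields used in the patch further down, while lock_sketch
and the fake cycle counter are assumptions so the example runs anywhere:

```c
/* Fake cycle counter so the sketch runs without hardware support. */
static unsigned long long fake_cycles;

struct lock_sketch {
	int locked;
	unsigned long long htsc;       /* timestamp taken at acquisition */
	unsigned long long hold_time;  /* longest hold observed so far */
};

/* Lowest-level acquire: every lock variant funnels through here. */
static void raw_lock(struct lock_sketch *l)
{
	l->locked = 1;
	l->htsc = fake_cycles;
}

/* Lowest-level release: every unlock variant funnels through here. */
static void raw_unlock(struct lock_sketch *l)
{
	unsigned long long t = fake_cycles;

	if (l->hold_time < t - l->htsc)
		l->hold_time = t - l->htsc;
	l->locked = 0;
}
```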

> 
> >  The
> > instrumentation goes like this:
> > 
> > void __lockfunc _spin_lock_irq(spinlock_t *lock)
> > {
> > unsigned long long t1,t2;
> > local_irq_disable();
> > t1 = get_cycles_sync();
> > preempt_disable();
> > spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> > _raw_spin_lock(lock);
> > t2 = get_cycles_sync();
> > lock->raw_lock.htsc = t2;
> > if (lock->spin_time < (t2 - t1))
> > lock->spin_time = t2 - t1;
> > }
> > ...
> > 
> > void __lockfunc _spin_unlock_irq(spinlock_t *lock)
> > {
> > unsigned long long t1 ;
> > spin_release(&lock->dep_map, 1, _RET_IP_);
> > t1 = get_cycles_sync();
> > if (lock->cs_time < (t1 -  lock->raw_lock.htsc))
> > lock->cs_time = t1 -  lock->raw_lock.htsc;
> > _raw_spin_unlock(lock);
> > local_irq_enable();
> > preempt_enable();
> > }
> > 
...
> 
> OK, now we need to do a dump_stack() each time we discover a new max hold
> time.  That might a bit tricky: the printk code does spinlocking too so
> things could go recursively deadlocky.  Maybe make spin_unlock_irq() return
> the hold time then do:

What I found now after fixing the above is that hold time is not bad --
249461 cycles on the 2.6 GHz Opteron with powernow disabled in the BIOS.
The spin time is still on the order of seconds.

Hence this looks like a hardware fairness issue.

Attaching the instrumentation patch with this email FR.


Index: linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock.h
===
--- linux-2.6.20-rc4.spin_instru.orig/include/asm-x86_64/spinlock.h 
2007-01-14 22:36:46.694248000 -0800
+++ linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock.h  2007-01-15 
15:40:36.554248000 -0800
@@ -6,6 +6,18 @@
 #include 
 #include 
 
+/* Like get_cycles, but make sure the CPU is synchronized. */
+static inline unsigned long long get_cycles_sync2(void)
+{
+   unsigned long long ret;
+   unsigned eax;
+   /* Don't do an additional sync on CPUs where we know
+  RDTSC is already synchronous. */
+   alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC,
+ "=a" (eax), "0" (1) : "ebx","ecx","edx","memory");
+   rdtscll(ret);
+   return ret;
+}
 /*
  * Your basic SMP spinlocks, allowing only a single CPU anywhere
  *
@@ -34,6 +46,7 @@ static inline void __raw_spin_lock(raw_s
"jle 3b\n\t"
"jmp 1b\n"
"2:\t" : "=m" (lock->slock) : : "memory");
+   lock->htsc = get_cycles_sync2();
 }
 
 /*
@@ -62,6 +75,7 @@ static inline void __raw_spin_lock_flags
"jmp 4b\n"
"5:\n\t"
: "+m" (lock->slock) : "r" ((unsigned)flags) : "memory");
+   lock->htsc = get_cycles_sync2();
 }
 #endif
 
@@ -74,11 +88,16 @@ static inline int __raw_spin_trylock(raw
:"=q" (oldval), "=m" (lock->slock)
:"0" (0) : "memory");
 
+   if (oldval)
+   lock->htsc = get_cycles_sync2();
return oldval > 0;
 }
 
 static inline void __raw_spin_unlock(raw_spinlock_t *lock)
 {
+   unsigned long long t = get_cycles_sync2();
+   if (lock->hold_time <  t - lock->htsc)
+   lock->hold_time = t - lock->htsc;
asm volatile("movl $1,%0" :"=m" (lock->slock) :: "memory");
 }
 
Index: linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock_types.h
===
--- linux-2.6.20-rc4.spin_instru.orig/include/asm-x86_64/spinlock_types.h   
2007-01-14 22:36:46.714248000 -0800
+++ linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock_types.h
2007-01-15 14:23:37.204248000 -0800
@@ -7,9 +7,11 @@
 
 typedef struct {
unsigned int slock;
+   unsigned long long hold_time;
+   unsigned long long htsc;
 } raw_spinlock_t;
 
-#define __RAW_SPIN_LOCK_UNLOCKED   { 1 }
+#define __RAW_SPIN_LOCK_UNLOCKED   { 1,0,0 }
 
 typedef struct {
unsigned int lock;
Index: linux-2.6.20-rc4.spin_instru/include/linux/spinlock.h
===
--- linux-2.6.20-rc4.spin_instru.orig/include/linux/spinlock.h  2007-01-14 
22:36:48.464248000 -0800
+++ linux-2.6.20-rc4.spin_instru/include/linux/spinlock.h   2007-01-14 
22:41:30.964248000 -0800
@@ -231,8 +231,8 @@ do {
\
 # define spin_unlock(lock) 

Re: [stable] 2.6.19.2 regression introduced by "IPV4/IPV6: Fix inet{, 6} device initialization order."

2007-01-15 Thread David Miller
From: YOSHIFUJI Hideaki <[EMAIL PROTECTED]>
Date: Tue, 16 Jan 2007 11:06:30 +0900 (JST)

> In article <[EMAIL PROTECTED]> (at Tue, 16 Jan 2007 03:01:56 +0100), Gabriel 
> C <[EMAIL PROTECTED]> says:
> 
> > Should be the fix from http://bugzilla.kernel.org/show_bug.cgi?id=7817
> 
> I've resent the patch to <[EMAIL PROTECTED]>.

Thank you.


Re: [PATCH] CPUSET related breakage of sys_mbind

2007-01-15 Thread Paul Jackson
You're right about this problem.  I think that Christoph Lameter
(added to cc list) is working on a fix for this.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401


Re: [PATCH] Provide an interface to limit total page cache.

2007-01-15 Thread Roy Huang

A possible cause is a bug in the kswapd thread, or that shrink_all_memory
cannot be called from the kswapd thread.

On 1/15/07, Vaidyanathan Srinivasan <[EMAIL PROTECTED]> wrote:


Roy Huang wrote:
> A patch provide a interface to limit total page cache in
> /proc/sys/vm/pagecache_ratio. The default value is 90 percent. Any
> feedback is appreciated.

[snip]

I tried to run your patch on PPC64 SMP machine, unfortunately kswapd
crashes the kernel when the pagecache limit is exceeded!

->dd if=/dev/zero of=/tmp/foo bs=1M count=1200
cpu 0x0: Vector: 300 (Data Access) at [c12d7ad0]
pc: c00976ac: .kswapd+0x3a4/0x4f0
lr: c00976ac: .kswapd+0x3a4/0x4f0
sp: c12d7d50
   msr: 80009032
   dar: 0
 dsisr: 4200
  current = 0xcfed7040
  paca= 0xc063fb80
pid   = 134, comm = kswapd0
[ cut here ]
enter ? for help
[c12d7ee0] c0069150 .kthread+0x124/0x174
[c12d7f90] c00247b4 .kernel_thread+0x4c/0x68
0:mon>

Steps to recreate fail:

# sync
# echo 1 > /proc/sys/vm/drop_caches
MemTotal:  1014584 kB
MemFree:905536 kB
Buffers:  3232 kB
Cached:  57628 kB
SwapCached:  0 kB
Active:  47664 kB
Inactive:33160 kB
SwapTotal: 1526164 kB
SwapFree:  1526164 kB
Dirty: 108 kB
Writeback:   0 kB
AnonPages:   19976 kB
Mapped:  15084 kB
Slab:19724 kB
SReclaimable: 8536 kB
SUnreclaim:  11188 kB
PageTables:972 kB
NFS_Unstable:0 kB
Bounce:  0 kB
CommitLimit:   2033456 kB
Committed_AS:87884 kB
VmallocTotal: 8589934592 kB
VmallocUsed:  2440 kB
VmallocChunk: 8589932152 kB
HugePages_Total: 0
HugePages_Free:  0
HugePages_Rsvd:  0
Hugepagesize:16384 kB

# echo 50 > /proc/sys/vm/pagecache_ratio
# dd if=/dev/zero of=/tmp/foo bs=1M count=1200

Basically fill pagecache with overlimit dirty file pages and check
if the reclaim happened and the limit was not exceeded.

--Vaidy







Re: allocation failed: out of vmalloc space error treating and VIDEO1394 IOC LISTEN CHANNEL ioctl failed problem

2007-01-15 Thread Peter Antoniac
On Tuesday 16 January 2007 06:43, Kristian Høgsberg wrote:
> On 1/15/07, Arjan van de Ven <[EMAIL PROTECTED]> wrote:
> > there is a lot of pain involved with doing things this way, it is a TON
> > better if YOU provide the memory via a custom mmap handler for a device
> > driver.
> > (there are a lot of security nightmares involved with the opposite
> > model, like the user can put any kind of memory there, even pci mmio
> > space)
>
> OK, point taken.  I don't have a strong preference for the opposite
> model, it just seems elegant that you can let user space handle
> allocation and pin and map the pages as needed.  But you're right, it
> certainly is easier to give safe memory to user space in the first
> place rather than try to make sure user space isn't trying to trick
> us.

I am glad that the discussion is heading to the right place thanks to David.

Yes. Probably that is the best solution. In the case of the ring buffers,
based on my discussion with Damien, 4 buffers are probably optimal. If the
user is allocating them, in the case of normal cameras, this is somewhere
around 4 MiB, let's say at most 16 MiB. So everything should be OK for
normal users, at least for now. The problem arises when cameras require
bigger images (we are thinking about the future, right?) and maybe also more
buffers in the DMA ring buffer. If you leave that to the user, it will
require some hacking skills if we are using the current model from libdc1394
and video1394. Why? Because if you use 10 buffers with some big images, you
are likely to go over the 64 MiB. In that case, we were thinking of giving a
nice error (that is why we needed to know the amount available for
mmap/vmalloc) and instructing the user to change the kernel boot-time
allocation of memory so that it fits the range (the vmalloc=xxx at startup -
the "hacking"). So, in a way, it would be nice to have a solution close to
the one proposed by David.

Do you think that if the user allocates small buffers (instead of the big
ring buffer) and sends the list to the driver, this will help in breaking the
64 MiB limit? I have doubts about it, but I am not good at this level of VMA.
Anyway, I hope that something can be done to allow bigger DMA ring buffers
without the user needing to reboot the system with some parameter.
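
To make the arithmetic concrete, here is a small sketch of the kind of
up-front check that could produce such an error. The helper names are
hypothetical, and the per-frame sizes used with it are illustrative
assumptions rather than measurements:

```c
#define MIB (1024ull * 1024ull)

/* Total bytes needed for a DMA ring of n_buffers frames. */
static unsigned long long ring_bytes(unsigned int n_buffers,
				     unsigned long long frame_bytes)
{
	return (unsigned long long)n_buffers * frame_bytes;
}

/* Returns 1 if the ring fits under limit_bytes, 0 otherwise. */
static int ring_fits(unsigned int n_buffers, unsigned long long frame_bytes,
		     unsigned long long limit_bytes)
{
	return ring_bytes(n_buffers, frame_bytes) <= limit_bytes;
}
```

With 4 buffers of 1 MiB each the ring needs 4 MiB and fits easily; with 10
buffers of 8 MiB each it needs 80 MiB and no longer fits under 64 MiB.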

> > >   Then is does an ioctl() on the firewire control device
> >
> > ioctls are evil ;) esp an "mmap me" ioctl
>
> Ah, I'm not mmap'ing it from the ioctl, I do implement the mmap file
> operation for this.  However, you have to do an ioctl before mapping
> the device to configure the dma context.
>
> Other than that what is the problem with ioctls, and more interesting,
> what is the alternative?  I don't expect (or want) a bunch of syscalls
> to be added for this, so I don't really see what other mechanism I
> should use for this.
>
> > > It's not too difficult from what I'm doing now, I'd just like to give
> > > user space more control over the buffers it uses for streaming (i.e.
> > > letting user space allocate them).  What I'm missing here is: how do I
> > > actually pin a page in memory?  I'm sure it's not too difficult, but I
> > > haven't yet figured it out and I'm sure somebody knows it off the top
> > > of his head.
> >
> > again the best way is for you to provide an mmap method... you can then
> > fill in the pages and keep that in some sort of array; this is for
> > example also what the DRI/DRM layer does for textures etc...
>
> That sounds a lot like what I have now (mmap method, array of pages)
> so I'll just stick with that.
>
> thanks,
> Kristian


[PATCH -mm 2/10][RFC] aio: net use struct socket for io

2007-01-15 Thread Nate Diller
Remove unused arg from socket operations

The sendmsg and recvmsg socket operations take a kiocb pointer, but none of
the functions actually use it.  There's really no need even theoretically,
it's really quite ugly having it there at all.  Also, removing it will pave
the way for a more generic completion path in the file_operations.

---

 drivers/net/pppoe.c   |8 +++
 include/linux/net.h   |   18 +++--
 include/net/bluetooth/bluetooth.h |2 -
 include/net/inet_common.h |3 --
 include/net/sock.h|   19 --
 include/net/tcp.h |6 ++---
 include/net/udp.h |3 --
 net/appletalk/ddp.c   |5 +---
 net/atm/common.c  |6 +
 net/atm/common.h  |7 ++
 net/ax25/af_ax25.c|7 ++
 net/bluetooth/af_bluetooth.c  |4 +--
 net/bluetooth/hci_sock.c  |7 ++
 net/bluetooth/l2cap.c |2 -
 net/bluetooth/rfcomm/sock.c   |8 +++
 net/bluetooth/sco.c   |3 --
 net/core/sock.c   |   12 ---
 net/dccp/dccp.h   |8 +++
 net/dccp/probe.c  |3 --
 net/dccp/proto.c  |7 ++
 net/decnet/af_decnet.c|7 ++
 net/econet/af_econet.c|7 ++
 net/ipv4/af_inet.c|5 +---
 net/ipv4/raw.c|8 ++-
 net/ipv4/tcp.c|7 ++
 net/ipv4/tcp_probe.c  |3 --
 net/ipv4/udp.c|9 +++-
 net/ipv4/udp_impl.h   |2 -
 net/ipv6/raw.c|6 +
 net/ipv6/udp.c|   10 +++--
 net/ipv6/udp_impl.h   |6 +
 net/ipx/af_ipx.c  |7 ++
 net/irda/af_irda.c|   29 +---
 net/key/af_key.c  |6 +
 net/llc/af_llc.c  |7 ++
 net/netlink/af_netlink.c  |6 +
 net/netrom/af_netrom.c|7 ++
 net/packet/af_packet.c|   11 --
 net/rose/af_rose.c|7 ++
 net/sctp/socket.c |9 +++-
 net/socket.c  |   32 ++-
 net/tipc/socket.c |   28 +--
 net/unix/af_unix.c|   39 +++---
 net/wanrouter/af_wanpipe.c|7 ++
 net/x25/af_x25.c  |6 +
 45 files changed, 166 insertions(+), 243 deletions(-)

---

diff -urpN -X dontdiff a/drivers/net/pppoe.c b/drivers/net/pppoe.c
--- a/drivers/net/pppoe.c   2007-01-12 11:18:47.244855016 -0800
+++ b/drivers/net/pppoe.c   2007-01-12 11:29:21.179177108 -0800
@@ -746,8 +746,8 @@ static int pppoe_ioctl(struct socket *so
 }
 
 
-static int pppoe_sendmsg(struct kiocb *iocb, struct socket *sock,
- struct msghdr *m, size_t total_len)
+static int pppoe_sendmsg(struct socket *sock, struct msghdr *m,
+size_t total_len)
 {
struct sk_buff *skb = NULL;
struct sock *sk = sock->sk;
@@ -912,8 +912,8 @@ static struct ppp_channel_ops pppoe_chan
.start_xmit = pppoe_xmit,
 };
 
-static int pppoe_recvmsg(struct kiocb *iocb, struct socket *sock,
- struct msghdr *m, size_t total_len, int flags)
+static int pppoe_recvmsg(struct socket *sock, struct msghdr *m,
+size_t total_len, int flags)
 {
struct sock *sk = sock->sk;
struct sk_buff *skb = NULL;
diff -urpN -X dontdiff a/include/linux/net.h b/include/linux/net.h
--- a/include/linux/net.h   2007-01-12 11:18:56.683629587 -0800
+++ b/include/linux/net.h   2007-01-12 11:29:21.185175058 -0800
@@ -118,7 +118,6 @@ struct socket {
 
 struct vm_area_struct;
 struct page;
-struct kiocb;
 struct sockaddr;
 struct msghdr;
 struct module;
@@ -156,11 +155,10 @@ struct proto_ops {
  int optname, char __user *optval, int 
optlen);
int (*compat_getsockopt)(struct socket *sock, int level,
  int optname, char __user *optval, int 
__user *optlen);
-   int (*sendmsg)   (struct kiocb *iocb, struct socket *sock,
- struct msghdr *m, size_t total_len);
-   int (*recvmsg)   (struct kiocb *iocb, struct socket *sock,
- struct msghdr *m, size_t total_len,
- int flags);
+   int (*sendmsg)   (struct socket *sock, struct msghdr *m,
+ size_t total_len);
+   int (*recvmsg)   (struct socket *sock, struct msghdr *m,
+ size_t total_len, int flags);
int  

[PATCH -mm 8/10][RFC] aio: make direct_IO aops use file_endio_t

2007-01-15 Thread Nate Diller
This converts the _locking variant of blockdev_direct_IO to use a generic
endio function, and updates all the FS callsites.

---

 Documentation/filesystems/Locking |5 +++--
 Documentation/filesystems/vfs.txt |5 +++--
 fs/block_dev.c|9 -
 fs/ext2/inode.c   |   12 +---
 fs/ext3/inode.c   |   11 +--
 fs/ext4/inode.c   |   11 +--
 fs/fat/inode.c|   12 ++--
 fs/gfs2/ops_address.c |8 
 fs/hfs/inode.c|   13 ++---
 fs/hfsplus/inode.c|   13 ++---
 fs/jfs/inode.c|   12 +---
 fs/nfs/direct.c   |8 +---
 fs/ocfs2/aops.c   |9 +
 fs/reiserfs/inode.c   |   13 +
 fs/xfs/linux-2.6/xfs_aops.c   |   11 ++-
 fs/xfs/linux-2.6/xfs_lrw.c|4 ++--
 include/linux/fs.h|   28 +---
 include/linux/nfs_fs.h|4 ++--
 mm/filemap.c  |   34 ++
 19 files changed, 108 insertions(+), 114 deletions(-)

---

diff -urpN -X dontdiff a/Documentation/filesystems/Locking 
b/Documentation/filesystems/Locking
--- a/Documentation/filesystems/Locking 2007-01-12 20:26:06.0 -0800
+++ b/Documentation/filesystems/Locking 2007-01-12 20:42:37.0 -0800
@@ -169,8 +169,9 @@ prototypes:
sector_t (*bmap)(struct address_space *, sector_t);
int (*invalidatepage) (struct page *, unsigned long);
int (*releasepage) (struct page *, int);
-   int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
-   loff_t offset, unsigned long nr_segs);
+   int (*direct_IO)(int, struct file *, const struct iovec *iov,
+   loff_t offset, unsigned long nr_segs,
+   file_endio_t *endio, void *endio_data);
int (*launder_page) (struct page *);
 
 locking rules:
diff -urpN -X dontdiff a/Documentation/filesystems/vfs.txt 
b/Documentation/filesystems/vfs.txt
--- a/Documentation/filesystems/vfs.txt 2007-01-12 20:26:06.0 -0800
+++ b/Documentation/filesystems/vfs.txt 2007-01-12 20:42:37.0 -0800
@@ -537,8 +537,9 @@ struct address_space_operations {
sector_t (*bmap)(struct address_space *, sector_t);
int (*invalidatepage) (struct page *, unsigned long);
int (*releasepage) (struct page *, int);
-   ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
-   loff_t offset, unsigned long nr_segs);
+   ssize_t (*direct_IO)(int, struct file *, const struct iovec *iov,
+   loff_t offset, unsigned long nr_segs,
+   file_endio_t *endio, void *endio_data);
struct page* (*get_xip_page)(struct address_space *, sector_t,
int);
/* migrate the contents of a page to the specified target */
diff -urpN -X dontdiff a/fs/block_dev.c b/fs/block_dev.c
--- a/fs/block_dev.c2007-01-12 20:29:02.0 -0800
+++ b/fs/block_dev.c2007-01-12 20:42:37.0 -0800
@@ -222,10 +222,11 @@ static void blk_unget_page(struct page *
 }
 
 static ssize_t
-blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
-loff_t pos, unsigned long nr_segs)
+blkdev_direct_IO(int rw, struct file *file, const struct iovec *iov,
+loff_t pos, unsigned long nr_segs, file_endio_t *endio,
+void *endio_data)
 {
-   struct inode *inode = iocb->ki_filp->f_mapping->host;
+   struct inode *inode = file->f_mapping->host;
unsigned blkbits = blksize_bits(bdev_hardsect_size(I_BDEV(inode)));
unsigned blocksize_mask = (1 << blkbits) - 1;
unsigned long seg = 0;  /* iov segment iterator */
@@ -239,8 +240,6 @@ blkdev_direct_IO(int rw, struct kiocb *i
loff_t size;/* size of block device */
struct bio *bio;
struct bdev_aio stack_io, *io;
-   file_endio_t *endio = aio_complete;
-   void *endio_data = iocb;
struct page *page;
struct pvec pvec;
 
diff -urpN -X dontdiff a/fs/ext2/inode.c b/fs/ext2/inode.c
--- a/fs/ext2/inode.c   2007-01-12 20:26:06.0 -0800
+++ b/fs/ext2/inode.c   2007-01-12 20:42:37.0 -0800
@@ -752,14 +752,12 @@ static sector_t ext2_bmap(struct address
 }
 
 static ssize_t
-ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
-   loff_t offset, unsigned long nr_segs)
+ext2_direct_IO(int rw, struct file *file, const struct iovec *iov,
+  loff_t offset, unsigned long nr_segs, file_endio_t *endio,
+  void *endio_data)
 {
-   struct file *file = iocb->ki_filp;
-   struct inode *inode = file->f_mapping->host;
-
-   return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
-   

Re: [PATCH] Provide an interface to limit total page cache.

2007-01-15 Thread Roy Huang

Hi Balbir,

Thanks for your comment.

On 1/15/07, Balbir Singh <[EMAIL PROTECTED]> wrote:


> wakeup_kswapd and shrink_all_memory use swappiness to determine what to
> reclaim (mapped pages or page cache).  This patch does not ensure that only
> page cache is reclaimed/limited. If the swappiness value is high, mapped
> pages will be hit.


You are right, it is possible to release mapped pages. That can be
avoided by adding a field to "struct scan_control" that determines whether
mapped pages may be released.
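
A minimal sketch of that idea, using simplified stand-in structures. The
may_reclaim_mapped field and the can_reclaim helper are hypothetical names
chosen for illustration, not existing kernel members:

```c
/* Simplified stand-in for struct scan_control; may_reclaim_mapped is a
 * hypothetical field, not an existing kernel member. */
struct scan_control_sketch {
	int swappiness;
	int may_reclaim_mapped;  /* 0: only touch unmapped page cache */
};

struct page_sketch {
	int mapped;  /* nonzero if the page is mapped into a process */
};

/* Returns 1 if this page may be reclaimed under the given policy. */
static int can_reclaim(const struct scan_control_sketch *sc,
		       const struct page_sketch *page)
{
	if (page->mapped && !sc->may_reclaim_mapped)
		return 0;
	return 1;
}
```

With the field cleared, the decision no longer depends on swappiness, so a
page cache limit would not push mapped pages out.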


> One could get similar functionality by implementing resource management.
>
> Resource management splits tasks into groups and does management of
> resources for the groups rather than the whole system. Such a facility
> will come with a resource controller for memory (split into finer grain
> rss/page cache/mlock'ed memory, etc), one for cpu, etc.

Is there any more detailed information about the resource controller?
Even if there is a resource controller for tasks, it is still
possible for all memory to be eaten up by page cache.


Balbir




[RFC PATCH -rt] RCU priority boosting that survives moderate testing

2007-01-15 Thread Paul E. McKenney
Hello!

This is an updated version of the earlier RCU-boosting patch
(http://lkml.org/lkml/2007/1/2/347).  It boosts the priority of RCU
read-side critical sections in -rt kernels, and the context diff is
almost 300 lines shorter than its predecessor.  Simplifications were
inspired by the act of attempting to design enterprise-level testing
for this patch's predecessor -- after all, you don't have to write tests
for any code that you manage to eliminate!

Still lacks tie-in to OOM, and still needs more vigorous testing (though
less so than its predecessor).  However, a design doc is on its way.

This version permits the system administrator to manually adjust the
priority of the RCU-booster task, which will result in RCU boosting
to the priority one slot less-favored than the booster task itself.
Any tasks that have been previously boosted will have their priority
adjusted to align with the RCU-booster task's new priority.

As always, any and all comments appreciated!

Signed-off-by: Paul E. McKenney <[EMAIL PROTECTED]>
---

 include/linux/init_task.h  |   12 +
 include/linux/rcupdate.h   |   12 +
 include/linux/rcupreempt.h |   19 +
 include/linux/sched.h  |   16 +
 init/main.c|1 
 kernel/Kconfig.preempt |   32 ++
 kernel/fork.c  |6 
 kernel/rcupreempt.c|  536 +
 kernel/rtmutex.c   |9 
 kernel/sched.c |5 
 10 files changed, 645 insertions(+), 3 deletions(-)

diff -urpNa -X dontdiff linux-2.6.20-rc4-rt1/include/linux/init_task.h 
linux-2.6.20-rc4-rt1-rcub/include/linux/init_task.h
--- linux-2.6.20-rc4-rt1/include/linux/init_task.h  2007-01-09 
10:59:54.0 -0800
+++ linux-2.6.20-rc4-rt1-rcub/include/linux/init_task.h 2007-01-09 
11:01:12.0 -0800
@@ -87,6 +87,17 @@ extern struct nsproxy init_nsproxy;
.siglock= __SPIN_LOCK_UNLOCKED(sighand.siglock),\
 }
 
+#ifdef CONFIG_PREEMPT_RCU_BOOST
+#define INIT_RCU_BOOST_PRIO .rcu_prio  = MAX_PRIO,
+#define INIT_PREEMPT_RCU_BOOST(tsk)\
+   .rcub_rbdp  = NULL, \
+   .rcub_state = RCU_BOOST_IDLE,   \
+   .rcub_entry = LIST_HEAD_INIT(tsk.rcub_entry),
+#else /* #ifdef CONFIG_PREEMPT_RCU_BOOST */
+#define INIT_RCU_BOOST_PRIO
+#define INIT_PREEMPT_RCU_BOOST(tsk)
+#endif /* #else #ifdef CONFIG_PREEMPT_RCU_BOOST */
+
 extern struct group_info init_groups;
 
 /*
@@ -143,6 +154,7 @@ extern struct group_info init_groups;
.pi_lock= RAW_SPIN_LOCK_UNLOCKED(tsk.pi_lock),  \
INIT_TRACE_IRQFLAGS \
INIT_LOCKDEP\
+   INIT_PREEMPT_RCU_BOOST(tsk) \
 }
 
 
diff -urpNa -X dontdiff linux-2.6.20-rc4-rt1/include/linux/rcupdate.h 
linux-2.6.20-rc4-rt1-rcub/include/linux/rcupdate.h
--- linux-2.6.20-rc4-rt1/include/linux/rcupdate.h   2007-01-09 
10:59:54.0 -0800
+++ linux-2.6.20-rc4-rt1-rcub/include/linux/rcupdate.h  2007-01-09 
11:01:12.0 -0800
@@ -227,6 +227,18 @@ extern void rcu_barrier(void);
 extern void rcu_init(void);
 extern void rcu_advance_callbacks(int cpu, int user);
 extern void rcu_check_callbacks(int cpu, int user);
+#ifdef CONFIG_PREEMPT_RCU_BOOST
+extern void init_rcu_boost_late(void);
+extern void __rcu_preempt_boost(void);
+#define rcu_preempt_boost() \
+   do { \
+   if (unlikely(current->rcu_read_lock_nesting > 0)) \
+   __rcu_preempt_boost(); \
+   } while (0)
+#else /* #ifdef CONFIG_PREEMPT_RCU_BOOST */
+#define init_rcu_boost_late()
+#define rcu_preempt_boost()
+#endif /* #else #ifdef CONFIG_PREEMPT_RCU_BOOST */
 
 #endif /* __KERNEL__ */
 #endif /* __LINUX_RCUPDATE_H */
diff -urpNa -X dontdiff linux-2.6.20-rc4-rt1/include/linux/rcupreempt.h 
linux-2.6.20-rc4-rt1-rcub/include/linux/rcupreempt.h
--- linux-2.6.20-rc4-rt1/include/linux/rcupreempt.h 2007-01-09 
10:59:54.0 -0800
+++ linux-2.6.20-rc4-rt1-rcub/include/linux/rcupreempt.h2007-01-09 
11:01:12.0 -0800
@@ -42,6 +42,25 @@
 #include 
 #include 
 
+#ifdef CONFIG_PREEMPT_RCU_BOOST
+/*
+ * Task state with respect to being RCU-boosted.  This state is changed
+ * by the task itself in response to the following three events:
+ * 1. Preemption (or block on lock) while in RCU read-side critical section.
+ * 2. Outermost rcu_read_unlock() for blocked RCU read-side critical section.
+ *
+ * The RCU-boost task also updates the state when boosting priority.
+ */
+enum rcu_boost_state {
+   RCU_BOOST_IDLE = 0,/* Not yet blocked if in RCU read-side. */
+   RCU_BOOST_BLOCKED = 1, /* Blocked from RCU read-side. */
+   RCU_BOOSTED = 2,   /* Boosting complete. */
+};
+
+#define N_RCU_BOOST_STATE (RCU_BOOSTED + 1)
+
+#endif /* #ifdef CONFIG_PREEMPT_RCU_BOOST 

Re: [PATCH -mm 3/10][RFC] aio: use iov_length instead of ki_left

2007-01-15 Thread Christoph Hellwig
On Mon, Jan 15, 2007 at 05:54:50PM -0800, Nate Diller wrote:
> Convert code using iocb->ki_left to use the more generic iov_length() call. 

No way.  We need to reduce the number of iovec traversals, not add
more of them.
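
For context, a length helper of this kind has to walk the whole vector on
every call, whereas a cached byte count is a single field read.  A minimal
userspace sketch of the cost, assuming a plain summing loop
(demo_iov_length is a hypothetical stand-in, not the kernel's iov_length()):

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>   /* struct iovec */

/*
 * Userspace analogue of a length helper: sums iov_len over every segment.
 * demo_iov_length is a hypothetical name used only for this illustration.
 */
static size_t demo_iov_length(const struct iovec *iov, unsigned long nr_segs)
{
	size_t total = 0;
	unsigned long seg;

	assert(iov != NULL || nr_segs == 0);
	for (seg = 0; seg < nr_segs; seg++)
		total += iov[seg].iov_len;   /* one load per segment, on every call */
	return total;
}
```

Each extra call site therefore costs another pass over nr_segs entries,
which is the traversal count being objected to here.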

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -mm 7/10][RFC] aio: make __blockdev_direct_IO use file_endio_t

2007-01-15 Thread Nate Diller
This converts the internals of __blockdev_direct_IO in fs/direct-io.c to use
a generic endio function, instead of directly calling aio_complete.  It also
changes the semantics of dio_iodone to be more friendly to its only users,
xfs and ocfs2.  This allows the caller to know how to release locks and tear
down data structures on error.

It also converts the _own_locking and _no_locking variants of
blockdev_direct_IO to use a generic endio function.

---

 fs/direct-io.c  |   74 ++--
 fs/gfs2/ops_address.c   |6 +--
 fs/ocfs2/aops.c |   15 ++--
 fs/ocfs2/aops.h |8 
 fs/ocfs2/file.c |   18 --
 fs/ocfs2/inode.h|2 -
 fs/xfs/linux-2.6/xfs_aops.c |   33 +++
 include/linux/fs.h  |   57 ++---
 8 files changed, 104 insertions(+), 109 deletions(-)

---

diff -urpN -X dontdiff a/fs/direct-io.c b/fs/direct-io.c
--- a/fs/direct-io.c    2007-01-12 14:53:48.0 -0800
+++ b/fs/direct-io.c    2007-01-12 15:06:44.0 -0800
@@ -67,7 +67,7 @@ struct dio {
struct bio *bio;/* bio under assembly */
struct inode *inode;
int rw;
-   loff_t i_size;  /* i_size when submitted */
+   unsigned max_to_read;   /* (i_size when submitted) - offset */
int lock_type;  /* doesn't change */
unsigned blkbits;   /* doesn't change */
unsigned blkfactor; /* When we're using an alignment which
@@ -89,6 +89,7 @@ struct dio {
int reap_counter;   /* rate limit reaping */
get_block_t *get_block; /* block mapping function */
dio_iodone_t *end_io;   /* IO completion function */
+   void *destructor_data;  /* private data for completion fn */
sector_t final_block_in_bio;/* current final block in bio + 1 */
sector_t next_block_for_io; /* next block to be put under IO,
   in dio_blocks units */
@@ -127,7 +128,8 @@ struct dio {
struct task_struct *waiter; /* waiting task (NULL if none) */
 
/* AIO related stuff */
-   struct kiocb *iocb; /* kiocb */
+   file_endio_t *file_endio;   /* aio completion function */
+   void *endio_data;   /* private data for aio completion */
int is_async;   /* is IO async ? */
int io_error;   /* IO error in completion path */
ssize_t result; /* IO result */
@@ -222,7 +224,7 @@ static struct page *dio_get_page(struct 
  * filesystems can use it to hold additional state between get_block calls and
  * dio_complete.
  */
-static int dio_complete(struct dio *dio, loff_t offset, int ret)
+static int dio_complete(struct dio *dio, int ret)
 {
/*
 * AIO submission can race with bio completion to get here while
@@ -232,25 +234,21 @@ static int dio_complete(struct dio *dio,
 */
if (ret == -EIOCBQUEUED)
ret = 0;
+   if (ret == 0)
+   ret = dio->page_errors;
+   if (ret == 0)
+   ret = dio->io_error;
 
if (dio->result) {
/* Check for short read case */
-   if ((dio->rw == READ) && ((offset + dio->result) > dio->i_size))
-   dio->result = dio->i_size - offset;
+   if ((dio->rw == READ) && (dio->result > dio->max_to_read))
+   dio->result = dio->max_to_read;
}
 
-   if (dio->end_io && dio->result)
-   dio->end_io(dio->iocb, offset, dio->result,
-   dio->map_bh.b_private);
if (dio->lock_type == DIO_LOCKING)
/* lockdep: non-owner release */
up_read_non_owner(&dio->inode->i_alloc_sem);
 
-   if (ret == 0)
-   ret = dio->page_errors;
-   if (ret == 0)
-   ret = dio->io_error;
-
return ret;
 }
 
@@ -277,8 +275,11 @@ static int dio_bio_end_aio(struct bio *b
spin_unlock_irqrestore(&dio->bio_lock, flags);
 
if (remaining == 0) {
-   int err = dio_complete(dio, dio->iocb->ki_pos, 0);
-   aio_complete(dio->iocb, dio->result, err);
+   int err = dio_complete(dio, 0);
+   if (dio->end_io)
+   dio->end_io(dio->destructor_data, dio->result,
+   dio->map_bh.b_private);
+   dio->file_endio(dio->endio_data, dio->result, err);
kfree(dio);
}
 
@@ -944,10 +945,11 @@ out:
  * Releases both i_mutex and i_alloc_sem
  */
 static ssize_t
-direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, 
+direct_io_worker(int rw, struct file *file, struct inode *inode, 
const struct iovec *iov, loff_t offset, unsigned long nr_segs, 
unsigned 

[PATCH -mm 9/10][RFC] aio: usb gadget remove aio file ops

2007-01-15 Thread Nate Diller
This removes the aio implementation from the usb gadget file system.  Aside
from making very creative (!) use of the aio retry path, it can't be of any
use performance-wise because it always kmalloc()s a bounce buffer for the
*whole* I/O size.  Perhaps the only reason to keep it around is the ability
to cancel I/O requests, which only applies when using the user space async
I/O interface.  I highly doubt that is enough incentive to justify the extra
complexity here or in user-space, so I think it's a safe bet to remove this. 
If that feature is still desired, it would be possible to implement a sync
interface that does an interruptible sleep.

I can be convinced otherwise, but the alternatives are difficult.  See for
example the "fuse, get_user_pages, flush_anon_page, aliasing caches and all
that again" LKML thread recently for why it's waaay easier to kmalloc a
bounce buffer here, and (ab)use the retry interface.

---

diff -urpN -X dontdiff a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c
--- a/drivers/usb/gadget/inode.c    2007-01-10 13:23:46.0 -0800
+++ b/drivers/usb/gadget/inode.c    2007-01-10 16:56:09.0 -0800
@@ -527,218 +527,6 @@ static int ep_ioctl (struct inode *inode
 
 /*--*/
 
-/* ASYNCHRONOUS ENDPOINT I/O OPERATIONS (bulk/intr/iso) */
-
-struct kiocb_priv {
-   struct usb_request  *req;
-   struct ep_data  *epdata;
-   void*buf;
-   const struct iovec  *iv;
-   unsigned long   nr_segs;
-   unsigned actual;
-};
-
-static int ep_aio_cancel(struct kiocb *iocb, struct io_event *e)
-{
-   struct kiocb_priv   *priv = iocb->private;
-   struct ep_data  *epdata;
-   int value;
-
-   local_irq_disable();
-   epdata = priv->epdata;
-   // spin_lock(&epdata->dev->lock);
-   kiocbSetCancelled(iocb);
-   if (likely(epdata && epdata->ep && priv->req))
-   value = usb_ep_dequeue (epdata->ep, priv->req);
-   else
-   value = -EINVAL;
-   // spin_unlock(&epdata->dev->lock);
-   local_irq_enable();
-
-   aio_put_req(iocb);
-   return value;
-}
-
-static int ep_aio_read_retry(struct kiocb *iocb)
-{
-   struct kiocb_priv   *priv = iocb->private;
-   ssize_t total;
-   int i, err = 0;
-
-   /* we "retry" to get the right mm context for this: */
-
-   /* copy stuff into user buffers */
-   total = priv->actual;
-   for (i=0; i < priv->nr_segs; i++) {
-   ssize_t this = min((ssize_t)(priv->iv[i].iov_len), total);
-
-   if (copy_to_user(priv->iv[i].iov_base, priv->buf, this)) {
-   err = -EFAULT;
-   break;
-   }
-
-   total -= this;
-   if (total == 0)
-   break;
-   }
-   kfree(priv->buf);
-   kfree(priv);
-   aio_put_req(iocb);
-   return err;
-}
-
-static void ep_aio_complete(struct usb_ep *ep, struct usb_request *req)
-{
-   struct kiocb*iocb = req->context;
-   struct kiocb_priv   *priv = iocb->private;
-   struct ep_data  *epdata = priv->epdata;
-
-   /* lock against disconnect (and ideally, cancel) */
-   spin_lock(&epdata->dev->lock);
-   priv->req = NULL;
-   priv->epdata = NULL;
-   if (priv->iv == NULL
-   || unlikely(req->actual == 0)
-   || unlikely(kiocbIsCancelled(iocb))) {
-   kfree(req->buf);
-   kfree(priv);
-   iocb->private = NULL;
-   /* aio_complete() reports bytes-transferred _and_ faults */
-   if (unlikely(kiocbIsCancelled(iocb)))
-   aio_put_req(iocb);
-   else
-   aio_complete(iocb, req->actual, req->status);
-   } else {
-   /* retry() won't report both; so we hide some faults */
-   if (unlikely(0 != req->status))
-   DBG(epdata->dev, "%s fault %d len %d\n",
-   ep->name, req->status, req->actual);
-
-   priv->buf = req->buf;
-   priv->actual = req->actual;
-   kick_iocb(iocb);
-   }
-   spin_unlock(&epdata->dev->lock);
-
-   usb_ep_free_request(ep, req);
-   put_ep(epdata);
-}
-
-static ssize_t
-ep_aio_rwtail(
-   struct kiocb*iocb,
-   char*buf,
-   size_t  len,
-   struct ep_data  *epdata,
-   const struct iovec *iv,
-   unsigned long   nr_segs
-)
-{
-   struct kiocb_priv   *priv;
-   struct usb_request  *req;
-   ssize_t value;
-
-   priv = kmalloc(sizeof *priv, GFP_KERNEL);
-   if (!priv) {
-   value = -ENOMEM;
-fail:
-   kfree(buf);
-   return value;
-   }
-   

Re: [stable] 2.6.19.2 regression introduced by "IPV4/IPV6: Fix inet{, 6} device initialization order."

2007-01-15 Thread YOSHIFUJI Hideaki / 吉藤英明
In article <[EMAIL PROTECTED]> (at Tue, 16 Jan 2007 03:01:56 +0100), Gabriel C 
<[EMAIL PROTECTED]> says:

> Greg KH schrieb:
> > On Sun, Jan 14, 2007 at 09:30:08PM -0800, David Miller wrote:
> >   
> >> From: David Stevens <[EMAIL PROTECTED]>
> >> Date: Sun, 14 Jan 2007 19:47:49 -0800
> >>
> >> 
> >>> I think it's better to add the fix than withdraw this patch, since
> >>> the original bug is a crash.
> >>>   
> >> I completely agree.
> >> 
> >
> > Great, can someone forward the patch to us?
> >   
> 
> Should be the fix from http://bugzilla.kernel.org/show_bug.cgi?id=7817

I've resent the patch to <[EMAIL PROTECTED]>.

--yoshfuji


[PATCH -mm 3/10][RFC] aio: use iov_length instead of ki_left

2007-01-15 Thread Nate Diller
Convert code using iocb->ki_left to use the more generic iov_length() call. 

---

diff -urpN -X dontdiff a/fs/ocfs2/file.c b/fs/ocfs2/file.c
--- a/fs/ocfs2/file.c   2007-01-10 11:50:26.0 -0800
+++ b/fs/ocfs2/file.c   2007-01-10 12:42:09.0 -0800
@@ -1157,7 +1157,7 @@ static ssize_t ocfs2_file_aio_write(stru
   filp->f_path.dentry->d_name.name);
 
/* happy write of zero bytes */
-   if (iocb->ki_left == 0)
+   if (iov_length(iov, nr_segs) == 0)
return 0;
 
mutex_lock(&inode->i_mutex);
@@ -1177,7 +1177,7 @@ static ssize_t ocfs2_file_aio_write(stru
}
 
ret = ocfs2_prepare_inode_for_write(filp->f_path.dentry, &iocb->ki_pos,
-   iocb->ki_left, appending);
+   iov_length(iov, nr_segs), appending);
if (ret < 0) {
mlog_errno(ret);
goto out;
diff -urpN -X dontdiff a/fs/smbfs/file.c b/fs/smbfs/file.c
--- a/fs/smbfs/file.c   2007-01-10 11:50:28.0 -0800
+++ b/fs/smbfs/file.c   2007-01-10 12:42:09.0 -0800
@@ -222,7 +222,7 @@ smb_file_aio_read(struct kiocb *iocb, co
ssize_t status;
 
VERBOSE("file %s/%s, [EMAIL PROTECTED]", DENTRY_PATH(dentry),
-   (unsigned long) iocb->ki_left, (unsigned long) pos);
+   (unsigned long) iov_length(iov, nr_segs), (unsigned long) pos);
 
status = smb_revalidate_inode(dentry);
if (status) {
@@ -328,7 +328,7 @@ smb_file_aio_write(struct kiocb *iocb, c
 
VERBOSE("file %s/%s, [EMAIL PROTECTED]",
DENTRY_PATH(dentry),
-   (unsigned long) iocb->ki_left, (unsigned long) pos);
+   (unsigned long) iov_length(iov, nr_segs), (unsigned long) pos);
 
result = smb_revalidate_inode(dentry);
if (result) {
@@ -341,7 +341,7 @@ smb_file_aio_write(struct kiocb *iocb, c
if (result)
goto out;
 
-   if (iocb->ki_left > 0) {
+   if (iov_length(iov, nr_segs) > 0) {
result = generic_file_aio_write(iocb, iov, nr_segs, pos);
VERBOSE("pos=%ld, size=%ld, mtime=%ld, atime=%ld\n",
(long) file->f_pos, (long) dentry->d_inode->i_size,
diff -urpN -X dontdiff a/fs/udf/file.c b/fs/udf/file.c
--- a/fs/udf/file.c 2007-01-10 11:53:02.0 -0800
+++ b/fs/udf/file.c 2007-01-10 12:42:09.0 -0800
@@ -109,7 +109,7 @@ static ssize_t udf_file_aio_write(struct
struct file *file = iocb->ki_filp;
struct inode *inode = file->f_path.dentry->d_inode;
int err, pos;
-   size_t count = iocb->ki_left;
+   size_t count = iov_length(iov, nr_segs);
 
if (UDF_I_ALLOCTYPE(inode) == ICBTAG_FLAG_AD_IN_ICB)
{
diff -urpN -X dontdiff a/net/socket.c b/net/socket.c
--- a/net/socket.c  2007-01-10 12:40:54.0 -0800
+++ b/net/socket.c  2007-01-10 12:42:09.0 -0800
@@ -632,7 +632,7 @@ static ssize_t sock_aio_read(struct kioc
if (pos != 0)
return -ESPIPE;
 
-   if (iocb->ki_left == 0) /* Match SYS5 behaviour */
+   if (iov_length(iov, nr_segs) == 0)  /* Match SYS5 behaviour */
return 0;
 
for (i = 0; i < nr_segs; i++)
@@ -660,7 +660,7 @@ static ssize_t sock_aio_write(struct kio
if (pos != 0)
return -ESPIPE;
 
-   if (iocb->ki_left == 0) /* Match SYS5 behaviour */
+   if (iov_length(iov, nr_segs) == 0)  /* Match SYS5 behaviour */
return 0;
 
for (i = 0; i < nr_segs; i++)


[PATCH -mm 6/10][RFC] aio: make nfs_directIO use file_endio_t

2007-01-15 Thread Nate Diller
This converts the internals of nfs's directIO support to use a generic endio
function, instead of directly calling aio_complete.  It's pretty easy
because it already has a pretty abstracted completion path.

---

diff -urpN -X dontdiff a/fs/nfs/direct.c b/fs/nfs/direct.c
--- a/fs/nfs/direct.c   2007-01-12 14:53:48.0 -0800
+++ b/fs/nfs/direct.c   2007-01-12 15:02:30.0 -0800
@@ -68,7 +68,6 @@ struct nfs_direct_req {
 
/* I/O parameters */
struct nfs_open_context *ctx;   /* file open context info */
-   struct kiocb *  iocb;   /* controlling i/o request */
struct inode *  inode;  /* target file of i/o */
 
/* completion state */
@@ -77,6 +76,8 @@ struct nfs_direct_req {
ssize_t count,  /* bytes actually processed */
error;  /* any reported error */
struct completion   completion; /* wait for i/o completion */
+   file_endio_t*endio; /* async completion function */
+   void*endio_data;/* private completion data */
 
/* commit state */
struct list_head rewrite_list;   /* saved nfs_write_data structs */
@@ -151,7 +152,7 @@ static inline struct nfs_direct_req *nfs
kref_get(&dreq->kref);
init_completion(&dreq->completion);
INIT_LIST_HEAD(&dreq->rewrite_list);
-   dreq->iocb = NULL;
+   dreq->endio = NULL;
dreq->ctx = NULL;
spin_lock_init(&dreq->lock);
atomic_set(&dreq->io_count, 0);
@@ -179,7 +180,7 @@ static ssize_t nfs_direct_wait(struct nf
ssize_t result = -EIOCBQUEUED;
 
/* Async requests don't wait here */
-   if (dreq->iocb)
+   if (!dreq->endio)
goto out;
 
result = wait_for_completion_interruptible(&dreq->completion);
@@ -194,14 +195,10 @@ out:
return (ssize_t) result;
 }
 
-/*
- * Synchronous I/O uses a stack-allocated iocb.  Thus we can't trust
- * the iocb is still valid here if this is a synchronous request.
- */
 static void nfs_direct_complete(struct nfs_direct_req *dreq)
 {
-   if (dreq->iocb)
-   aio_complete(dreq->iocb, dreq->count, dreq->error);
+   if (dreq->endio)
+   dreq->endio(dreq->endio_data, dreq->count, dreq->error);
 
complete_all(&dreq->completion);
 
@@ -332,11 +329,13 @@ static ssize_t nfs_direct_read_schedule(
return result < 0 ? (ssize_t) result : -EFAULT;
 }
 
-static ssize_t nfs_direct_read(struct kiocb *iocb, unsigned long user_addr, size_t count, loff_t pos)
+static ssize_t nfs_direct_read(struct file *file, unsigned long user_addr,
+  size_t count, loff_t pos,
+  file_endio_t *endio, void *endio_data)
 {
ssize_t result = 0;
sigset_t oldset;
-   struct inode *inode = iocb->ki_filp->f_mapping->host;
+   struct inode *inode = file->f_mapping->host;
struct rpc_clnt *clnt = NFS_CLIENT(inode);
struct nfs_direct_req *dreq;
 
@@ -345,9 +344,9 @@ static ssize_t nfs_direct_read(struct ki
return -ENOMEM;
 
dreq->inode = inode;
-   dreq->ctx = get_nfs_open_context((struct nfs_open_context *)iocb->ki_filp->private_data);
-   if (!is_sync_kiocb(iocb))
-   dreq->iocb = iocb;
+   dreq->ctx = get_nfs_open_context((struct nfs_open_context *)file->private_data);
+   dreq->endio = endio;
+   dreq->endio_data = endio_data;
 
nfs_add_stats(inode, NFSIOS_DIRECTREADBYTES, count);
rpc_clnt_sigmask(clnt, &oldset);
@@ -663,11 +662,13 @@ static ssize_t nfs_direct_write_schedule
return result < 0 ? (ssize_t) result : -EFAULT;
 }
 
-static ssize_t nfs_direct_write(struct kiocb *iocb, unsigned long user_addr, size_t count, loff_t pos)
+static ssize_t nfs_direct_write(struct file *file, unsigned long user_addr,
+   size_t count, loff_t pos,
+   file_endio_t *endio, void *endio_data)
 {
ssize_t result = 0;
sigset_t oldset;
-   struct inode *inode = iocb->ki_filp->f_mapping->host;
+   struct inode *inode = file->f_mapping->host;
struct rpc_clnt *clnt = NFS_CLIENT(inode);
struct nfs_direct_req *dreq;
size_t wsize = NFS_SERVER(inode)->wsize;
@@ -682,9 +683,9 @@ static ssize_t nfs_direct_write(struct k
sync = FLUSH_STABLE;
 
dreq->inode = inode;
-   dreq->ctx = get_nfs_open_context((struct nfs_open_context *)iocb->ki_filp->private_data);
-   if (!is_sync_kiocb(iocb))
-   dreq->iocb = iocb;
+   dreq->ctx = get_nfs_open_context((struct nfs_open_context *)file->private_data);
+   dreq->endio = endio;
+   dreq->endio_data = endio_data;
 
nfs_add_stats(inode, NFSIOS_DIRECTWRITTENBYTES, count);
 
@@ -701,10 +702,12 @@ static ssize_t nfs_direct_write(struct k
 
 /**
  * nfs_file_direct_read - file direct read 

[PATCH -mm 4/10][RFC] aio: convert aio_complete to file_endio_t

2007-01-15 Thread Nate Diller
Define a new function typedef for I/O completion at the file/iovec level --

typedef void (file_endio_t)(void *endio_data, ssize_t count, int err);

and convert aio_complete and all its callers to this new prototype.
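
As a rough illustration of how a caller might use this prototype, here is a
minimal userspace sketch.  Only the typedef comes from this patch;
struct demo_request, demo_endio and demo_finish_io are hypothetical names
made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>   /* ssize_t */

/* The typedef quoted in the patch description. */
typedef void (file_endio_t)(void *endio_data, ssize_t count, int err);

/* Hypothetical caller-side state; not part of the patch. */
struct demo_request {
	ssize_t bytes;   /* bytes reported by the completion */
	int     err;     /* error reported by the completion */
	int     done;    /* set once the callback has run */
};

/* A completion function matching file_endio_t. */
static void demo_endio(void *endio_data, ssize_t count, int err)
{
	struct demo_request *req = endio_data;

	assert(req != NULL);
	req->bytes = count;
	req->err = err;
	req->done = 1;
}

/* Stand-in for an I/O layer that knows only the callback and its cookie. */
static void demo_finish_io(file_endio_t *endio, void *endio_data,
			   ssize_t count, int err)
{
	if (endio)
		endio(endio_data, count, err);
}
```

The I/O layer never looks inside endio_data, which is the point of the
conversion: byte count and error travel as separate arguments, and the
cookie stays opaque to everything but the completion function.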

---

 drivers/usb/gadget/inode.c |   24 +++---
 fs/aio.c   |   59 -
 fs/block_dev.c |8 +-
 fs/direct-io.c |   18 +
 fs/nfs/direct.c|9 ++
 include/linux/aio.h|   11 +++-
 include/linux/fs.h |2 +
 7 files changed, 61 insertions(+), 70 deletions(-)

---

diff -urpN -X dontdiff a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c
--- a/drivers/usb/gadget/inode.c    2007-01-12 14:42:29.0 -0800
+++ b/drivers/usb/gadget/inode.c    2007-01-12 14:25:34.0 -0800
@@ -559,35 +559,32 @@ static int ep_aio_cancel(struct kiocb *i
return value;
 }
 
-static ssize_t ep_aio_read_retry(struct kiocb *iocb)
+static int ep_aio_read_retry(struct kiocb *iocb)
 {
struct kiocb_priv   *priv = iocb->private;
-   ssize_t len, total;
-   int i;
+   ssize_t total;
+   int i, err = 0;
 
/* we "retry" to get the right mm context for this: */
 
/* copy stuff into user buffers */
total = priv->actual;
-   len = 0;
for (i=0; i < priv->nr_segs; i++) {
ssize_t this = min((ssize_t)(priv->iv[i].iov_len), total);
 
if (copy_to_user(priv->iv[i].iov_base, priv->buf, this)) {
-   if (len == 0)
-   len = -EFAULT;
+   err = -EFAULT;
break;
}
 
total -= this;
-   len += this;
if (total == 0)
break;
}
kfree(priv->buf);
kfree(priv);
aio_put_req(iocb);
-   return len;
+   return err;
 }
 
 static void ep_aio_complete(struct usb_ep *ep, struct usb_request *req)
@@ -610,9 +607,7 @@ static void ep_aio_complete(struct usb_e
if (unlikely(kiocbIsCancelled(iocb)))
aio_put_req(iocb);
else
-   aio_complete(iocb,
-   req->actual ? req->actual : req->status,
-   req->status);
+   aio_complete(iocb, req->actual, req->status);
} else {
/* retry() won't report both; so we hide some faults */
if (unlikely(0 != req->status))
@@ -702,16 +697,17 @@ ep_aio_read(struct kiocb *iocb, const st
 {
struct ep_data  *epdata = iocb->ki_filp->private_data;
char*buf;
+   size_t  len = iov_length(iov, nr_segs);
 
if (unlikely(epdata->desc.bEndpointAddress & USB_DIR_IN))
return -EINVAL;
 
-   buf = kmalloc(iocb->ki_left, GFP_KERNEL);
+   buf = kmalloc(len, GFP_KERNEL);
if (unlikely(!buf))
return -ENOMEM;
 
iocb->ki_retry = ep_aio_read_retry;
-   return ep_aio_rwtail(iocb, buf, iocb->ki_left, epdata, iov, nr_segs);
+   return ep_aio_rwtail(iocb, buf, len, epdata, iov, nr_segs);
 }
 
 static ssize_t
@@ -726,7 +722,7 @@ ep_aio_write(struct kiocb *iocb, const s
if (unlikely(!(epdata->desc.bEndpointAddress & USB_DIR_IN)))
return -EINVAL;
 
-   buf = kmalloc(iocb->ki_left, GFP_KERNEL);
+   buf = kmalloc(iov_length(iov, nr_segs), GFP_KERNEL);
if (unlikely(!buf))
return -ENOMEM;
 
diff -urpN -X dontdiff a/fs/aio.c b/fs/aio.c
--- a/fs/aio.c  2007-01-12 14:42:29.0 -0800
+++ b/fs/aio.c  2007-01-12 14:29:20.0 -0800
@@ -658,16 +658,16 @@ static inline int __queue_kicked_iocb(st
  * simplifies the coding of individual aio operations as
  * it avoids various potential races.
  */
-static ssize_t aio_run_iocb(struct kiocb *iocb)
+static void aio_run_iocb(struct kiocb *iocb)
 {
struct kioctx   *ctx = iocb->ki_ctx;
-   ssize_t (*retry)(struct kiocb *);
+   int (*retry)(struct kiocb *);
wait_queue_t *io_wait = current->io_wait;
-   ssize_t ret;
+   int err;
 
if (!(retry = iocb->ki_retry)) {
printk("aio_run_iocb: iocb->ki_retry = NULL\n");
-   return 0;
+   return;
}
 
/*
@@ -702,8 +702,8 @@ static ssize_t aio_run_iocb(struct kiocb
 
/* Quit retrying if the i/o has been cancelled */
if (kiocbIsCancelled(iocb)) {
-   ret = -EINTR;
-   aio_complete(iocb, ret, 0);
+   err = -EINTR;
+   aio_complete(iocb, iocb->ki_nbytes - iocb->ki_left, err);
/* must not access the iocb after this */
goto out;
}
@@ -720,17 +720,17 @@ static ssize_t 

[PATCH -mm 5/10][RFC] aio: make blk_directIO use file_endio_t

2007-01-15 Thread Nate Diller
Convert the internals of blkdev_direct_IO to use a generic endio function,
instead of directly calling aio_complete.  This may also fix some bugs/races
in this code, for instance it checks bio->bi_size instead of assuming it's
zero, and it atomically accumulates the bytes_done counter (assuming that
the bio completion handler can't race with itself *might* be valid here, but
the direct-io code makes no such assumption).  I'm also pretty sure that
the address_space->directIO functions aren't supposed to mess with the
iocb->ki_pos or ->ki_left.
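
The counting scheme described above can be sketched in userspace with C11
atomics.  This is only an analogue under stated assumptions: struct demo_io
and the demo_* helpers are hypothetical names, and a flag stands in for
calling the endio function.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userspace analogue of the per-request state. */
struct demo_io {
	atomic_int  iocount;     /* references: submitter plus in-flight bios */
	atomic_long bytes_done;  /* bytes completed so far */
	int         err;         /* last error seen, 0 if none */
	int         completed;   /* stands in for calling the endio function */
};

static void demo_io_init(struct demo_io *io)
{
	atomic_init(&io->iocount, 1);    /* the submitter holds one reference */
	atomic_init(&io->bytes_done, 0);
	io->err = 0;
	io->completed = 0;
}

/* Drop one reference; whoever drops the last one reports completion. */
static void demo_io_put(struct demo_io *io)
{
	/* atomic_fetch_sub returns the old value, so 1 means we were last. */
	if (atomic_fetch_sub(&io->iocount, 1) != 1)
		return;
	io->completed = 1;
}

/* Called once per bio before it is submitted. */
static void demo_io_get(struct demo_io *io)
{
	atomic_fetch_add(&io->iocount, 1);
}

/* Per-bio completion: accumulate bytes atomically, then drop the ref. */
static void demo_bio_done(struct demo_io *io, long bytes, int error)
{
	if (error)
		io->err = error;
	atomic_fetch_add(&io->bytes_done, bytes);
	demo_io_put(io);
}
```

Because the submitter keeps its own reference until every bio has been
queued, completion cannot be reported early even if all bios finish before
submission does.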

---

diff -urpN -X dontdiff a/fs/block_dev.c b/fs/block_dev.c
--- a/fs/block_dev.c    2007-01-12 20:26:25.0 -0800
+++ b/fs/block_dev.c    2007-01-12 20:23:55.0 -0800
@@ -131,10 +131,32 @@ blkdev_get_block(struct inode *inode, se
return 0;
 }
 
-static int blk_end_aio(struct bio *bio, unsigned int bytes_done, int error)
+struct bdev_aio {
+   atomic_t iocount;   /* refcount */
+   atomic_t bytes_done; /* byte counter */
+   int err;/* error handling */
+   file_endio_t*endio; /* end I/O notify fn */
+   void*endio_data;/* notify fn private data */
+};
+
+static void blk_io_put(struct bdev_aio *io)
+{
+   if (!atomic_dec_and_test(&io->iocount))
+   return;
+
+   if (!io->endio)
+   return complete((struct completion*)io->endio_data);
+
+   io->endio(io->endio_data, atomic_read(&io->bytes_done), io->err);
+   kfree(io);
+}
+
+static int blk_bio_endio(struct bio *bio, unsigned int bytes_done, int error)
 {
-   struct kiocb *iocb = bio->bi_private;
-   atomic_t *bio_count = &iocb->ki_bio_count;
+   struct bdev_aio *io = bio->bi_private;
+
+   if (bio->bi_size)
+   return 1;
 
if (bio_data_dir(bio) == READ)
bio_check_pages_dirty(bio);
@@ -143,16 +165,21 @@ static int blk_end_aio(struct bio *bio, 
bio_put(bio);
}
 
-   /* iocb->ki_nbytes stores error code from LLDD */
-   if (error)
-   iocb->ki_nbytes = -EIO;
-
-   if (atomic_dec_and_test(bio_count))
-   aio_complete(iocb, iocb->ki_left, iocb->ki_nbytes);
+   if (error)
+   io->err = error;
+   atomic_add(bytes_done, &io->bytes_done);
 
+   blk_io_put(io);
return 0;
 }
 
+static void blk_io_init(struct bdev_aio *io)
+{
+   atomic_set(&io->iocount, 1);
+   atomic_set(&io->bytes_done, 0);
+   io->err = 0;
+}
+
 #define VEC_SIZE   16
 struct pvec {
unsigned short nr;
@@ -208,24 +235,33 @@ blkdev_direct_IO(int rw, struct kiocb *i
 
unsigned long addr; /* user iovec address */
size_t count;   /* user iovec len */
-   size_t nbytes = iocb->ki_nbytes = iocb->ki_left; /* total xfer size */
+   size_t nbytes;   /* total xfer size */
loff_t size;/* size of block device */
struct bio *bio;
-   atomic_t *bio_count = &iocb->ki_bio_count;
+   struct bdev_aio stack_io, *io;
+   file_endio_t *endio = aio_complete;
+   void *endio_data = iocb;
struct page *page;
struct pvec pvec;
 
pvec.nr = 0;
pvec.idx = 0;
 
+   io = &stack_io;
+   if (endio) {
+   io = kmalloc(sizeof(struct bdev_aio), GFP_KERNEL);
+   if (!io)
+   return -ENOMEM;
+   }
+   blk_io_init(io);
+
if (pos & blocksize_mask)
return -EINVAL;
 
+   nbytes = iov_length(iov, nr_segs);
size = i_size_read(inode);
-   if (pos + nbytes > size) {
+   if (pos + nbytes > size)
nbytes = size - pos;
-   iocb->ki_left = nbytes;
-   }
 
/*
 * check first non-zero iov alignment, the remaining
@@ -237,7 +273,6 @@ blkdev_direct_IO(int rw, struct kiocb *i
if (addr & blocksize_mask || count & blocksize_mask)
return -EINVAL;
} while (!count && ++seg < nr_segs);
-   atomic_set(bio_count, 1);
 
while (nbytes) {
/* roughly estimate number of bio vec needed */
@@ -248,8 +283,8 @@ blkdev_direct_IO(int rw, struct kiocb *i
/* bio_alloc should not fail with GFP_KERNEL flag */
bio = bio_alloc(GFP_KERNEL, nvec);
bio->bi_bdev = I_BDEV(inode);
-   bio->bi_end_io = blk_end_aio;
-   bio->bi_private = iocb;
+   bio->bi_end_io = blk_bio_endio;
+   bio->bi_private = io;
bio->bi_sector = pos >> blkbits;
 same_bio:
cur_off = addr & ~PAGE_MASK;
@@ -289,18 +324,27 @@ same_bio:
/* bio is ready, submit it */
if (rw == READ)
bio_set_pages_dirty(bio);
-   atomic_inc(bio_count);
+   atomic_inc(&io->iocount);
submit_bio(rw, bio);
}
 
 

[PATCH -mm 1/10][RFC] aio: scm remove struct siocb

2007-01-15 Thread Nate Diller
This patch removes struct sock_iocb.

Its purpose seems to have dwindled to a mere container for struct
scm_cookie, and all of the users of scm_cookie seem to require
re-initializing it each time anyway.  Besides, keeping such data around from
one call to the next seems to me like a layering violation, if not a bug,
considering that the sync IO code can use this call path too.

All scm_cookie users are converted to unconditionally allocate on the stack,
and sock_iocb and all its helpers are removed.  This also simplifies the
socket aio submission path (is that even used?)

---

 include/net/scm.h|2 
 include/net/sock.h   |   26 -
 net/netlink/af_netlink.c |   18 ++
 net/socket.c |  131 +++
 net/unix/af_unix.c   |   77 ++-
 5 files changed, 68 insertions(+), 186 deletions(-)

---

diff -urpN -X dontdiff a/include/net/scm.h b/include/net/scm.h
--- a/include/net/scm.h 2006-11-29 13:57:37.0 -0800
+++ b/include/net/scm.h 2007-01-10 12:10:19.0 -0800
@@ -23,7 +23,6 @@ struct scm_cookie
 #ifdef CONFIG_SECURITY_NETWORK
u32 secid;  /* Passed security ID   */
 #endif
-   unsigned long   seq;/* Connection seqno */
 };
 
 extern void scm_detach_fds(struct msghdr *msg, struct scm_cookie *scm);
@@ -56,7 +55,6 @@ static __inline__ int scm_send(struct so
scm->creds.gid = p->gid;
scm->creds.pid = p->tgid;
scm->fp = NULL;
-   scm->seq = 0;
unix_get_peersec_dgram(sock, scm);
if (msg->msg_controllen <= 0)
return 0;
diff -urpN -X dontdiff a/include/net/sock.h b/include/net/sock.h
--- a/include/net/sock.h    2007-01-10 11:50:54.0 -0800
+++ b/include/net/sock.h    2007-01-10 12:15:35.0 -0800
@@ -75,10 +75,9 @@
  * between user contexts and software interrupt processing, whereas the
  * mini-semaphore synchronizes multiple users amongst themselves.
  */
-struct sock_iocb;
 typedef struct {
spinlock_t  slock;
-   struct sock_iocb*owner;
+   void*owner;
wait_queue_head_t   wq;
/*
 * We express the mutex-alike socket_lock semantics
@@ -656,29 +655,6 @@ static inline void __sk_prot_rehash(stru
 #define SOCK_BINDADDR_LOCK 4
 #define SOCK_BINDPORT_LOCK 8
 
-/* sock_iocb: used to kick off async processing of socket ios */
-struct sock_iocb {
-   struct list_headlist;
-
-   int flags;
-   int size;
-   struct socket   *sock;
-   struct sock *sk;
-   struct scm_cookie   *scm;
-   struct msghdr   *msg, async_msg;
-   struct kiocb*kiocb;
-};
-
-static inline struct sock_iocb *kiocb_to_siocb(struct kiocb *iocb)
-{
-   return (struct sock_iocb *)iocb->private;
-}
-
-static inline struct kiocb *siocb_to_kiocb(struct sock_iocb *si)
-{
-   return si->kiocb;
-}
-
 struct socket_alloc {
struct socket socket;
struct inode vfs_inode;
diff -urpN -X dontdiff a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
--- a/net/netlink/af_netlink.c  2007-01-10 11:53:12.0 -0800
+++ b/net/netlink/af_netlink.c  2007-01-10 12:10:19.0 -0800
@@ -1106,7 +1106,6 @@ static inline void netlink_rcv_wake(stru
 static int netlink_sendmsg(struct kiocb *kiocb, struct socket *sock,
   struct msghdr *msg, size_t len)
 {
-   struct sock_iocb *siocb = kiocb_to_siocb(kiocb);
struct sock *sk = sock->sk;
struct netlink_sock *nlk = nlk_sk(sk);
struct sockaddr_nl *addr=msg->msg_name;
@@ -1119,9 +1118,7 @@ static int netlink_sendmsg(struct kiocb 
if (msg->msg_flags&MSG_OOB)
return -EOPNOTSUPP;
 
-   if (NULL == siocb->scm)
-   siocb->scm = &scm;
-   err = scm_send(sock, msg, siocb->scm);
+   err = scm_send(sock, msg, &scm);
if (err < 0)
return err;
 
@@ -1155,7 +1152,7 @@ static int netlink_sendmsg(struct kiocb 
NETLINK_CB(skb).dst_group = dst_group;
NETLINK_CB(skb).loginuid = audit_get_loginuid(current->audit_context);
selinux_get_task_sid(current, &(NETLINK_CB(skb).sid));
-   memcpy(NETLINK_CREDS(skb), &siocb->scm->creds, sizeof(struct ucred));
+   memcpy(NETLINK_CREDS(skb), &scm.creds, sizeof(struct ucred));
 
/* What can I do? Netlink is asynchronous, so that
   we will have to save current capabilities to
@@ -1189,7 +1186,6 @@ static int netlink_recvmsg(struct kiocb 
   struct msghdr *msg, size_t len,
   int flags)
 {
-   struct sock_iocb *siocb = kiocb_to_siocb(kiocb);
struct scm_cookie scm;
struct sock *sk = sock->sk;
struct netlink_sock *nlk = nlk_sk(sk);
@@ -1230,17 +1226,15 @@ static int netlink_recvmsg(struct kiocb 
if 

[PATCH -mm 0/10][RFC] aio: make struct kiocb private

2007-01-15 Thread Nate Diller
This series is an attempt to generalize the async I/O paths to be
implementation agnostic.  It completely eliminates knowledge of
the kiocb structure in the generic code and makes it private within the
current aio code.  Things get noticeably cleaner without that layering
violation.

The new interface takes a file_endio_t function pointer, and a private data
pointer, which would normally be aio_complete and a kiocb pointer,
respectively.  If the aio submission function gets back EIOCBQUEUED, that is
a guarantee that the endio function will be called, or *already has been
called*.  If the file_endio_t pointer provided to aio_[read|write] is NULL,
the FS must block on I/O completion, then return either the number of bytes
read, or an error.
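
To make the contract concrete, here is a minimal userspace sketch of the two
cases.  It is an illustration only: demo_read, demo_record and
struct demo_result are hypothetical names, the I/O is faked, and
DEMO_EIOCBQUEUED is a stand-in constant rather than the kernel's definition.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>   /* ssize_t */

typedef void (file_endio_t)(void *endio_data, ssize_t count, int err);

/* Stand-in for the kernel-internal EIOCBQUEUED; the value is illustrative. */
#define DEMO_EIOCBQUEUED 529

/* Hypothetical record of what the completion reported. */
struct demo_result {
	ssize_t count;
	int     err;
	int     calls;
};

static void demo_record(void *endio_data, ssize_t count, int err)
{
	struct demo_result *res = endio_data;

	assert(res != NULL);
	res->count = count;
	res->err = err;
	res->calls++;
}

/*
 * Hypothetical read entry point following the contract described above.
 * The "I/O" is faked: it always transfers len bytes without error.
 */
static ssize_t demo_read(size_t len, file_endio_t *endio, void *endio_data)
{
	ssize_t done = (ssize_t)len;

	if (!endio)
		return done;            /* NULL endio: block, then return bytes */

	endio(endio_data, done, 0);     /* completion may run before we return */
	return -DEMO_EIOCBQUEUED;       /* promise: endio is or was called */
}
```

In this sketch the completion has always run by the time the queued return
value is seen, which is the "already has been called" case; a caller must
not touch state owned by the completion after getting that return value.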

I had to touch more areas than I had originally expected, so there are
changes in a corner of the socket code, and a slight behavior change in the
direct-io completion path which affects XFS and OCFS2.  I would appreciate
further review there, so I copied some extra people I hope can help.

This patch is against 2.6.20-rc4-mm1.  It has been compile-tested at each
stage.  It needs some runtime testing yet, but I prefer to get it out for
commentary and test later.

These patches are for RFC only and have not yet been signed off.

NATE

---

 Documentation/filesystems/Locking |   11 +
 Documentation/filesystems/vfs.txt |   11 +
 arch/s390/hypfs/inode.c           |   16 +-
 drivers/net/pppoe.c               |    8 -
 drivers/net/tun.c                 |   13 +-
 drivers/usb/gadget/inode.c        |  239 +-
 fs/aio.c                          |   74 ++-
 fs/bad_inode.c                    |   10 -
 fs/block_dev.c                    |  109 +++--
 fs/cifs/cifsfs.c                  |   10 -
 fs/compat.c                       |   56
 fs/direct-io.c                    |   92 --
 fs/ecryptfs/file.c                |   16 +-
 fs/ext2/inode.c                   |   12 -
 fs/ext3/file.c                    |    9 -
 fs/ext3/inode.c                   |   11 -
 fs/ext4/file.c                    |    9 -
 fs/ext4/inode.c                   |   11 -
 fs/fat/inode.c                    |   12 -
 fs/fuse/dev.c                     |   13 +-
 fs/gfs2/ops_address.c             |   14 +-
 fs/hfs/inode.c                    |   13 --
 fs/hfsplus/inode.c                |   13 --
 fs/jfs/inode.c                    |   12 -
 fs/nfs/direct.c                   |   92 +++---
 fs/nfs/file.c                     |   62 +
 fs/ntfs/file.c                    |   71 ++-
 fs/ocfs2/aops.c                   |   24 +--
 fs/ocfs2/aops.h                   |    8 -
 fs/ocfs2/file.c                   |   44 +++---
 fs/ocfs2/inode.h                  |    2
 fs/pipe.c                         |   12 -
 fs/read_write.c                   |  225 ---
 fs/read_write.h                   |    8 -
 fs/reiserfs/inode.c               |   13 --
 fs/smbfs/file.c                   |   28 ++--
 fs/udf/file.c                     |   13 +-
 fs/xfs/linux-2.6/xfs_aops.c       |   44 +++---
 fs/xfs/linux-2.6/xfs_file.c       |   58 +
 fs/xfs/linux-2.6/xfs_lrw.c        |   29 ++--
 fs/xfs/linux-2.6/xfs_lrw.h        |   10 -
 fs/xfs/linux-2.6/xfs_vnode.h      |   20 +--
 include/linux/aio.h               |   11 -
 include/linux/fs.h                |  114 +-
 include/linux/net.h               |   18 +-
 include/linux/nfs_fs.h            |   12 -
 include/net/bluetooth/bluetooth.h |    2
 include/net/inet_common.h         |    3
 include/net/scm.h                 |    2
 include/net/sock.h                |   45 +--
 include/net/tcp.h                 |    6
 include/net/udp.h                 |    3
 mm/filemap.c                      |  109 -
 net/appletalk/ddp.c               |    5
 net/atm/common.c                  |    6
 net/atm/common.h                  |    7 -
 net/ax25/af_ax25.c                |    7 -
 net/bluetooth/af_bluetooth.c      |    4
 net/bluetooth/hci_sock.c          |    7 -
 net/bluetooth/l2cap.c             |    2
 net/bluetooth/rfcomm/sock.c       |    8 -
 net/bluetooth/sco.c               |    3
 net/core/sock.c                   |   12 -
 net/dccp/dccp.h                   |    8 -
 net/dccp/probe.c                  |    3
 net/dccp/proto.c                  |    7 -
 net/decnet/af_decnet.c            |    7 -
 net/econet/af_econet.c            |    7 -
 net/ipv4/af_inet.c                |    5
 net/ipv4/raw.c                    |    8 -
 net/ipv4/tcp.c                    |    7 -
 net/ipv4/tcp_probe.c              |    3
 net/ipv4/udp.c                    |    9 -
 net/ipv4/udp_impl.h               |    2
 net/ipv6/raw.c                    |    6
 net/ipv6/udp.c                    |   10 -
 net/ipv6/udp_impl.h               |    6
 net/ipx/af_ipx.c                  |    7 -
 net/irda/af_irda.c                |   29 ++--
 net/key/af_key.c 

Re: [stable] 2.6.19.2 regression introduced by "IPV4/IPV6: Fix inet{, 6} device initialization order."

2007-01-15 Thread Gabriel C
Greg KH schrieb:
> On Sun, Jan 14, 2007 at 09:30:08PM -0800, David Miller wrote:
>   
>> From: David Stevens <[EMAIL PROTECTED]>
>> Date: Sun, 14 Jan 2007 19:47:49 -0800
>>
>> 
>>> I think it's better to add the fix than withdraw this patch, since
>>> the original bug is a crash.
>>>   
>> I completely agree.
>> 
>
> Great, can someone forward the patch to us?
>   

Should be the fix from http://bugzilla.kernel.org/show_bug.cgi?id=7817

> thanks,
>
> greg k-h
>   

Regards,

Gabriel

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-15 Thread Jeff Garzik

Robert Hancock wrote:
> I'll try your stress test when I get a chance, but I doubt I'll run into
> the same problem and I haven't seen any similar reports. Perhaps it's
> some kind of weird timing issue or incompatibility between the
> controller and that drive when running in ADMA mode? I seem to remember
> various reports of issues with certain Maxtor drives and some nForce
> SATA controllers under Windows at least..



Just to eliminate things, has disabling ADMA been attempted?

It can be disabled using the sata_nv.adma module parameter.

Jeff




Re: SATA exceptions with 2.6.20-rc5

2007-01-15 Thread Jeff Garzik

Robert Hancock wrote:
> Note that the ATA-7 spec for FLUSH CACHE says that "This command may
> take longer than 30 s to complete."


Yep...

Jeff




Re: SATA exceptions with 2.6.20-rc5

2007-01-15 Thread Jeff Garzik

Jens Axboe wrote:
> On Mon, Jan 15 2007, Jeff Garzik wrote:
> > Jens Axboe wrote:
> > > I'd be surprised if the device would not obey the 7 second timeout rule
> > > that seems to be set in stone and not allow more dirty in-drive cache
> > > than it could flush out in approximately that time.
> >
> > AFAIK Windows flush-cache timeout is 30 seconds, not 7 as with other
> > commands...
>
> Ok, 7 seconds for FLUSH_CACHE would have been nice for us too though, as
> it would pretty much guarantee lower latencies for random writes and
> write back caching. The concern is the barrier code, of course. I guess
> I should do some timings on potential worst case patterns some day. Alan
> may have done that sometime in the past, iirc.


FWIW:  According to the drive guys (Eric M, among others), FLUSH CACHE 
will "probably" be under 30 seconds, but pathological cases might even 
extend beyond that.


Definitely more than 7 seconds in less-than-pathological cases, 
unfortunately...


The SCSI layer /should/ already take this (30 second timeout) into 
account, for SYNCHRONIZE CACHE (and thus FLUSH CACHE for libata) but I'm 
too slack to check at the moment.


Jeff





[PATCH 2.6.20-rc3 01/01] usb: Sierra Wireless auto set D0

2007-01-15 Thread Kevin Lloyd

from: Kevin Lloyd <[EMAIL PROTECTED]>

This patch ensures that the device is turned on when inserted into the
system (which mostly affects the EM5625 and MC5720). It also adds more
VID/PIDs and matches the N_OUT_URB with the airprime driver.


Signed-off-by: Kevin Lloyd <[EMAIL PROTECTED]>

---

--- linux-2.6.20-rc5/drivers/usb/serial/sierra.c.orig   2007-01-15 15:17:15.0 -0800
+++ linux-2.6.20-rc5/drivers/usb/serial/sierra.c        2007-01-15 15:41:56.0 -0800
@@ -14,9 +14,31 @@
  Whom based his on the Keyspan driver by Hugh Blemings <[EMAIL PROTECTED]>

  History:
+v.1.0.6:
+ klloyd
+ Added more devices and added Vendor Specific USB message to make sure
+ that devices are in D0 state when they start. This is very important for
+ MC5720 and EM5625 modules that go between Windows and Non-Windows 
+ machines.

+v.1.0.5:
+ Greg KH
+ This saves over 30 lines and fixes a warning from sparse and allows
+ debugging to work dynamically like all other usb-serial drivers.
+ klloyd
+ Changed versioning to v.x.y.z
+v.1.04:
+ klloyd
+ Adds significant throughput increase to the Sierra driver (uses multiple
+ urgs for download link). This patch also updates the current sierra.c 
+ driver so that it supports both 3-port Sierra devices and 1-port legacy

+ devices and removes Sierra's references in other related files (Kconfig
+ and airprime.c).
+v.1.03
+ klloyd
+ Adds DTR line control support and impliments urb control.
*/

-#define DRIVER_VERSION "v.1.0.5"
+#define DRIVER_VERSION "v.1.0.6"
#define DRIVER_AUTHOR "Kevin Lloyd <[EMAIL PROTECTED]>"
#define DRIVER_DESC "USB Driver for Sierra Wireless USB modems"

@@ -31,14 +53,14 @@


static struct usb_device_id id_table [] = {
+   { USB_DEVICE(0x1199, 0x0017) }, /* Sierra Wireless EM5625 */
{ USB_DEVICE(0x1199, 0x0018) }, /* Sierra Wireless MC5720 */
{ USB_DEVICE(0x1199, 0x0020) }, /* Sierra Wireless MC5725 */
-   { USB_DEVICE(0x1199, 0x0017) }, /* Sierra Wireless EM5625 */
{ USB_DEVICE(0x1199, 0x0019) }, /* Sierra Wireless AirCard 595 */
-   { USB_DEVICE(0x1199, 0x0218) }, /* Sierra Wireless MC5720 */
+   { USB_DEVICE(0x1199, 0x0021) }, /* Sierra Wireless AirCard 597E */
{ USB_DEVICE(0x1199, 0x6802) }, /* Sierra Wireless MC8755 */
+   { USB_DEVICE(0x1199, 0x6804) }, /* Sierra Wireless MC8755 */
{ USB_DEVICE(0x1199, 0x6803) }, /* Sierra Wireless MC8765 */
-   { USB_DEVICE(0x1199, 0x6804) }, /* Sierra Wireless MC8755 for Europe */
{ USB_DEVICE(0x1199, 0x6812) }, /* Sierra Wireless MC8775 */
{ USB_DEVICE(0x1199, 0x6820) }, /* Sierra Wireless AirCard 875 */

@@ -55,14 +77,14 @@ static struct usb_device_id id_table_1po
};

static struct usb_device_id id_table_3port [] = {
+   { USB_DEVICE(0x1199, 0x0017) }, /* Sierra Wireless EM5625 */
{ USB_DEVICE(0x1199, 0x0018) }, /* Sierra Wireless MC5720 */
{ USB_DEVICE(0x1199, 0x0020) }, /* Sierra Wireless MC5725 */
-   { USB_DEVICE(0x1199, 0x0017) }, /* Sierra Wireless EM5625 */
{ USB_DEVICE(0x1199, 0x0019) }, /* Sierra Wireless AirCard 595 */
-   { USB_DEVICE(0x1199, 0x0218) }, /* Sierra Wireless MC5720 */
+   { USB_DEVICE(0x1199, 0x0021) }, /* Sierra Wireless AirCard 597E */
{ USB_DEVICE(0x1199, 0x6802) }, /* Sierra Wireless MC8755 */
+   { USB_DEVICE(0x1199, 0x6804) }, /* Sierra Wireless MC8755 */
{ USB_DEVICE(0x1199, 0x6803) }, /* Sierra Wireless MC8765 */
-   { USB_DEVICE(0x1199, 0x6804) }, /* Sierra Wireless MC8755 for Europe */
{ USB_DEVICE(0x1199, 0x6812) }, /* Sierra Wireless MC8775 */
{ USB_DEVICE(0x1199, 0x6820) }, /* Sierra Wireless AirCard 875 */
{ }
@@ -81,7 +103,7 @@ static int debug;

/* per port private data */
#define N_IN_URB    4
-#define N_OUT_URB  1
+#define N_OUT_URB  4
#define IN_BUFLEN   4096
#define OUT_BUFLEN  128

@@ -123,6 +145,7 @@ static int sierra_send_setup(struct usb_
return usb_control_msg(serial->dev,
usb_rcvctrlpipe(serial->dev, 0),
0x22,0x21,val,0,NULL,0,USB_CTRL_SET_TIMEOUT);
+
}

return 0;
@@ -396,6 +419,8 @@ static int sierra_open(struct usb_serial
struct usb_serial *serial = port->serial;
int i, err;
struct urb *urb;
+   int result;
+   __u16 set_mode_dzero = 0x0000; //Set mode to D0

portdata = usb_get_serial_port_data(port);

@@ -442,6 +467,11 @@ static int sierra_open(struct usb_serial

port->tty->low_latency = 1;

+   //set mode to D0
+   result = usb_control_msg(serial->dev,
+   usb_rcvctrlpipe(serial->dev, 0),
+   0x00,0x40,set_mode_dzero,0,NULL,0,USB_CTRL_SET_TIMEOUT);
+
sierra_send_setup(port);

return (0);




Problem with POSIX threads in latest kernel...

2007-01-15 Thread J.A. Magallón
Hi...

I run the (almost) latest -mm kernel (2.6.20-rc3-mm1), and see some strange
behaviour with POSIX threads (glibc-2.4).
I have reduced my test to a simple textbook example for an SMP-safe spool
queue, it's just a circular queue with a mutex and a condition variable for in
and out. I have seen the same structure in several places.

Well, it just sometimes gets blocked. GDB says it's stuck in pthread_wait().
I could swear it worked on previous kernels. It works as is on IRIX.
I will try to build an older kernel to test.
It takes a second to block it with something like while :; tst; done.

Any ideas ?

--
J.A. Magallon  \   Software is like sex:
 \ It's better when it's free
Mandriva Linux release 2007.1 (Cooker) for i586
Linux 2.6.19-jam04 (gcc 4.1.2 20061110 (prerelease) 
(4.1.2-0.20061110.2mdv2007.1)) #0 SMP PREEMPT
#include <stdio.h>
#include <pthread.h>

#define SIZE		16

int		jobs[SIZE];
int		in;
int		slots;
pthread_mutex_t	slots_mutex;
pthread_cond_t	slots_cond;
int		out;
int		items;
pthread_mutex_t	items_mutex;
pthread_cond_t	items_cond;

void put(int job);
void get(int* job);
void* prod(void* data);
void* cons(void* data);

int main(int argc,char** argv)
{
	pthread_t	prodid,consid;

	in = 0;
	slots = SIZE;
	pthread_mutex_init(&slots_mutex,0);
	pthread_cond_init(&slots_cond,0);
	out = 0;
	items = 0;
	pthread_mutex_init(&items_mutex,0);
	pthread_cond_init(&items_cond,0);

	pthread_setconcurrency(3);
	pthread_create(&prodid,0,prod,0);
	pthread_create(&consid,0,cons,0);

	pthread_join(prodid,0);
	pthread_join(consid,0);

	return 0;
}

void* prod(void* data)
{
	int	i;

	for (i=0; i<1000; i++)
	{
		if (!(i%100))
			printf("put %d\n",i);
		put(i);
	}
	put(-1);
	puts("prod done");

	return 0;
}

void* cons(void* data)
{
	int	i;
	do
	{
		get(&i);
		if (!(i%100))
			printf("got %d\n",i);
	}
	while (i>=0);
	puts("cons done");

	return 0;
}

void put(int job)
{
	pthread_mutex_lock(&slots_mutex);
		while (slots<=0)
			pthread_cond_wait(&slots_cond,&slots_mutex);
		jobs[in] = job;
		in++;
		in %= SIZE;
		slots--;
		items++;
	pthread_mutex_unlock(&slots_mutex);

	pthread_mutex_lock(&items_mutex);
		pthread_cond_signal(&items_cond);
	pthread_mutex_unlock(&items_mutex);
}

void get(int* job)
{
	pthread_mutex_lock(&items_mutex);
		while (items<=0)
			pthread_cond_wait(&items_cond,&items_mutex);
		*job = jobs[out];
		out++;
		out %= SIZE;
		items--;
		slots++;
	pthread_mutex_unlock(&items_mutex);

	pthread_mutex_lock(&slots_mutex);
		pthread_cond_signal(&slots_cond);
	pthread_mutex_unlock(&slots_mutex);
}
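
For comparison, one common way to structure such a spool queue is to keep
every piece of shared state behind a single mutex, with one condition
variable per direction. The following is only a sketch and not a diagnosis of
the program above; struct spool and the spool_* names are made up for this
example.

```c
#include <assert.h>
#include <pthread.h>

#define QSIZE 16

/* All queue state sits behind one mutex. */
struct spool {
    int jobs[QSIZE];
    int in, out, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full;
    pthread_cond_t not_empty;
};

static void spool_init(struct spool *q)
{
    q->in = q->out = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_full, NULL);
    pthread_cond_init(&q->not_empty, NULL);
}

static void spool_put(struct spool *q, int job)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QSIZE)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->jobs[q->in] = job;
    q->in = (q->in + 1) % QSIZE;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

static int spool_get(struct spool *q)
{
    int job;

    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    job = q->jobs[q->out];
    q->out = (q->out + 1) % QSIZE;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return job;
}

/* Producer thread body: pushes 0..999, then -1 as an end marker. */
static void *spool_producer(void *arg)
{
    struct spool *q = arg;
    int i;

    for (i = 0; i < 1000; i++)
        spool_put(q, i);
    spool_put(q, -1);
    return NULL;
}
```

Because count is only ever read or written with lock held, the producer and
consumer cannot update it concurrently.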



Re: SATA exceptions with 2.6.20-rc5

2007-01-15 Thread Björn Steinbrink
On 2007.01.15 18:34:43 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >>My latest bisection attempt actually led to your sata_nv ADMA commit. [1]
> >>I've now backed out that patch from 2.6.20-rc5 and have my stress test
> >>running for 20 minutes now ("record" for a bad kernel surviving that
> >>test is about 40 minutes IIRC). I'll keep it running for at least 2 more
> >>hours.
> >
> >Yep, that one seems to be guilty. 2.6.20-rc5 with that commit backed out
> >survived about 3 hours of testing, while the average was around 5
> >minutes for a failure, sometimes even before I could log in.
> >I took a look at the patch, but I can't really tell anything.
> >nv_adma_check_atapi_dma somehow looks like it should not negate its
> >return value, so that it returns 0 (atapi dma available) when
> >adma_enable was 1. But I'm not exactly confident about that either ;)
> >Will it hurt if I try to remove the negation?
> 
> It should be correct the way it is - that check is trying to prevent 
> ATAPI commands from using DMA until the slave_config function has been 
> called to set up the DMA parameters properly. When the 
> NV_ADMA_ATAPI_SETUP_COMPLETE flag is not set, this returns 1 which 
> disallows DMA transfers. Unless you were using an ATAPI (i.e. CD/DVD) 
> device on the channel this wouldn't affect you anyway.

I wondered about it, because the flag is cleared when adma_enabled is 1,
which seems to be consistent with everything but nv_adma_check_atapi_dma.
Thus I thought that nv_adma_check_atapi_dma might be wrong, but maybe
setting/clearing the flag is wrong instead? *feels lost*

> I'll try your stress test when I get a chance, but I doubt I'll run into 
> the same problem and I haven't seen any similar reports. Perhaps it's 
> some kind of weird timing issue or incompatibility between the 
> controller and that drive when running in ADMA mode? I seem to remember 
> various reports of issues with certain Maxtor drives and some nForce 
> SATA controllers under Windows at least..

I just checked Maxtor's knowledge base, that incompatibility does not
affect my drive.

Thanks,
Björn


2.6.20-rc5 nfs+krb => oops

2007-01-15 Thread syrius . ml

Hi there,

I've been curious enough to try 2.6.20-rc5 with nfs4/kerberos.
It was working fine before. I was using 2.6.18.1 on the client and
2.6.20-rc3-git4 on server and today i tried 2.6.20-rc5 on both client
and server. (both running up to date debian/sid)
Trying to mount a nfs4 or nfs3 share with krb5 (did try with krb5 and
krb5p) produces this oops on the client side:
(each time I tried i got the same oops)

[ cut here ]
kernel BUG at net/sunrpc/sched.c:902!
invalid opcode:  [#1]
PREEMPT 
Modules linked in: rpcsec_gss_spkm3 rfcomm l2cap bluetooth nfsd exportfs nsc_irc
c tun ipv6 dm_snapshot dm_mirror dm_mod eeprom i2c_isa eth1394 usbhid snd_intel8
x0 snd_ac97_codec ac97_bus snd_pcm_oss snd_pcm snd_mixer_oss snd_seq_oss snd_seq
_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device ohci1394 i
eee1394 ipw2200 snd ieee80211 ieee80211_crypt i2c_i801 psmouse ide_cd r8169 rtc 
irda ehci_hcd uhci_hcd serio_raw i2c_core cdrom snd_page_alloc usbcore evdev crc
_ccitt
CPU:0
EIP:0060:[]Not tainted VLI
EFLAGS: 00210297   (2.6.20-rc5 #3)
EIP is at rpc_release_task+0x8f/0xc0
eax: f7e40c80   ebx: f7e40c80   ecx: f51eaac0   edx: c03fcc80
esi: fff3   edi: f6f21c40   ebp: f6f21bf0   esp: f6f21be4
ds: 007b   es: 007b   ss: 0068
Process mount (pid: 4286, ti=f6f2 task=f6c52030 task.ti=f6f2)
Stack: f6f21bf0 c03f7a77 f7e40c80 f6f21c10 c03f7c0d  feff  
   f6f21c7c f76f1180  f6f21c30 c01fe0d6 f6f21c40 7ffbfaef fffe 
   f6f21c7c f6de1a40 f76f1b80 f6f21c58 c01fe436 0fff  c050a180 
Call Trace:
 [] show_trace_log_lvl+0x1a/0x30
 [] show_stack_log_lvl+0xa9/0xd0
 [] show_registers+0x1ef/0x360
 [] die+0x10b/0x210
 [] do_trap+0x82/0xb0
 [] do_invalid_op+0x97/0xb0
 [] error_code+0x74/0x7c
 [] rpc_call_sync+0x8d/0xb0
 [] nfs3_rpc_wrapper+0x46/0x70
 [] nfs3_proc_getattr+0x46/0x80
 [] nfs_create_server+0x2cf/0x520
 [] nfs_get_sb+0xbd/0x580
 [] vfs_kern_mount+0x40/0x90
 [] do_kern_mount+0x36/0x50
 [] do_mount+0x24e/0x690
 [] sys_mount+0x6f/0xb0
 [] sysenter_past_esp+0x5f/0x85
 ===
Code: d8 e8 86 fc ff ff c7 03 00 00 00 00 8d 43 68 0f ba 73 68 04 ba 04 00 00 00
 e8 5e 1d d3 ff 89 d8 e8 f7 fe ff ff 83 c4 08 5b 5d c3 <0f> 0b eb fe 0f 0b eb fe
 e8 84 2a 01 00 eb be 0f b7 80 94 00 00 
EIP: [] rpc_release_task+0x8f/0xc0 SS:ESP 0068:f6f21be4


( was a proto=udp mount )
I can provide more information if needed, but I'm pretty sure it would be
easily reproducible.

-- 


Re: [patch-mm] Workaround for RAID breakage

2007-01-15 Thread Jens Axboe
On Mon, Jan 15 2007, Thomas Gleixner wrote:
> On Mon, 2007-01-15 at 09:08 +0100, Thomas Gleixner wrote:
> > > Thomas saw something similar yesterday and he the partial results that 
> > > git.block (between rc2-mm1 and rc4-mm1) breaks certain disk drivers or 
> > > filesystems drivers. For me it worked fine, so it must be only on some 
> > > combinations. The changes to ll_rw_block.c look quite extensive.
> > 
> > Yes. Jens Axboe confirmed yesterday that the plug changes broke RAID.
> 
> I tracked this down and found two problems:
> 
> - The new plug/unplug code does not check for underruns. That allows the
> plug count (ioc->plugged) to become negative. This gets triggered from
> various places. 
>
> AFAICS this is intentional to avoid checks all over the place, but the
> underflow check is missing. All we need to do is make sure, that in case
> of ioc->plugged == 0 we return early and bug, if there is either a queue
> plugged in or the plugged_list is not empty.
> 
> Jens ?

It should not go negative, that would be a bug elsewhere. So it's
interesting if it does, we should definitely put a WARN_ON() check in
there for that.
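
As a rough userspace illustration of the kind of check being discussed (not
the actual block layer code): struct ioc_sketch, plug_sketch() and
unplug_sketch() are hypothetical names, and the fprintf() merely stands in
for a WARN_ON().

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical cut-down context; only the counter matters here. */
struct ioc_sketch {
    int plugged;      /* nesting depth of plug calls */
    int warnings;     /* how many unbalanced unplugs were seen */
};

static void plug_sketch(struct ioc_sketch *ioc)
{
    ioc->plugged++;
}

/* Returns 1 when this call dropped the last plug, 0 otherwise. */
static int unplug_sketch(struct ioc_sketch *ioc)
{
    if (ioc->plugged <= 0) {
        /* stand-in for WARN_ON(): report the imbalance and bail out */
        fprintf(stderr, "unbalanced unplug (plugged=%d)\n", ioc->plugged);
        ioc->warnings++;
        return 0;
    }
    return --ioc->plugged == 0;
}
```

With the guard in place an extra unplug leaves the counter at zero instead of
letting it go negative, and the imbalance is recorded rather than silently
ignored.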

> - The raid1 code has no bitmap set in remount r/w. So the
> pending_bio_list gets not processed for quite a time. The workaround is
> to kick mddev->thread, so the list is processed. Not sure about that.
> 
> Neil ?

Super, thanks for that Thomas! I'll merge it in the plug branch.

-- 
Jens Axboe



Re: [PATCH] adjust use of unplug in elevator code

2007-01-15 Thread Jens Axboe
On Mon, Jan 15 2007, Linas Vepstas wrote:
> 
> Hi Chris, Jens,
> Can you look at this, and push upstream if this looks reasonable
> to you? It fixes a bug I've been tripping over.
> 
> --linas
> 
> 
> A flag was recently added to the elevator code to avoid
> performing an unplug when requests are being re-queued.
> The goal of this flag was to avoid a deep recursion that
> can occur when re-queueing requests after a SCSI device/host 
> reset.  See http://lkml.org/lkml/2006/5/17/254
> 
> However, that fix added the flag near the bottom of a case
> statement, where an earlier break (in an if statement) could
> transport one out of the case, without setting the flag.
> This patch sets the flag earlier in the case statement.
> 
> I re-discovered the deep recursion recently during testing;
> I was told that it was a known problem, and the fix to it was
> in the kernel I was testing. Indeed it was ... but it didn't
> fix the bug. With the patch below, I no longer see the bug.
> 
> Signed-off by: Linas Vepstas <[EMAIL PROTECTED]>
> Cc: Jens Axboe <[EMAIL PROTECTED]>
> Cc: Chris Wright <[EMAIL PROTECTED]>
> 
> 
>  block/elevator.c |   11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> Index: linux-2.6.20-rc4/block/elevator.c
> ===
> --- linux-2.6.20-rc4.orig/block/elevator.c2007-01-15 14:16:03.0 
> -0600
> +++ linux-2.6.20-rc4/block/elevator.c 2007-01-15 14:20:04.0 -0600
> @@ -590,6 +590,12 @@ void elv_insert(request_queue_t *q, stru
>*/
>   rq->cmd_flags |= REQ_SOFTBARRIER;
>  
> + /*
> +  * Most requeues happen because of a busy condition,
> +  * don't force unplug of the queue for that case.
> +  */
> + unplug_it = 0;
> +
>   if (q->ordseq == 0) {
>   list_add(&rq->queuelist, &q->queue_head);
>   break;
> @@ -604,11 +610,6 @@ void elv_insert(request_queue_t *q, stru
>   }
>  
>   list_add_tail(&rq->queuelist, pos);
> - /*
> -  * most requeues happen because of a busy condition, don't
> -  * force unplug of the queue for that case.
> -  */
> - unplug_it = 0;
>   break;

Ah, yes it definitely should be moved up, thanks for that!

Acked-by: Jens Axboe <[EMAIL PROTECTED]>

I'll get this merged for 2.6.21.
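
The control-flow problem the patch addresses can be shown in isolation. The
sketch below is a standalone toy, not the elevator code: the enum and both
function names are invented, and ordseq is reduced to a plain int.

```c
#include <assert.h>

enum where_sketch { INSERT_BACK, INSERT_REQUEUE };

/* Flag cleared at the bottom of the case: the early break skips it. */
static int unplug_flag_late(enum where_sketch where, int ordseq)
{
    int unplug_it = 1;

    switch (where) {
    case INSERT_REQUEUE:
        if (ordseq == 0)
            break;          /* leaves unplug_it at 1 */
        unplug_it = 0;
        break;
    default:
        break;
    }
    return unplug_it;
}

/* Flag cleared before any break: every requeue path sees it cleared. */
static int unplug_flag_early(enum where_sketch where, int ordseq)
{
    int unplug_it = 1;

    switch (where) {
    case INSERT_REQUEUE:
        unplug_it = 0;
        if (ordseq == 0)
            break;
        break;
    default:
        break;
    }
    return unplug_it;
}
```

In the late variant a requeue with ordseq == 0 still reports that an unplug
is wanted, which corresponds to the path described in the patch text.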

-- 
Jens Axboe



Re: How to flush the disk write cache from userspace

2007-01-15 Thread Jens Axboe
On Sun, Jan 14 2007, Ricardo Correia wrote:
> Hi, (please CC: to my email address, I'm not subscribed)
> 
> Quick question: how can I flush the disk write cache from userspace?
> 
> Long question:
> 
> I'm porting the Solaris ZFS filesystem to the FUSE/Linux filesystem
> framework.  This is a copy-on-write, transactional filesystem and so
> it needs to ensure correct ordering of writes when transactions are
> written to disk.
> 
> At the moment, when transactions end, I'm using a fsync() on the block
> device followed by a ioctl(BLKFLSBUF).
> 
> This is because, according to the fsync manpage, even after fsync()
> returns, data might still be in the disk write cache, so fsync by
> itself doesn't guarantee data safety on power failure.

Depends. Only if the file system does the right thing here, iirc only
reiserfs with barriers enabled issue a real disk flush for fsync. So you
can't rely on it in general.

> I was looking for something like the Solaris
> ioctl(DKIOCFLUSHWRITECACHE), which does exactly what I need.
> 
> The most similar thing I could find was ioctl(BLKFLSBUF), however a
> search for BLKFLSBUF on the Linux 2.6.15 source doesn't seem to return
> anything related to IDE or SCSI disks.
> 
> Can I trust ioctl(BLKFLSBUF) to flush disks' write caches (for disks
> that follow the specs)?

BLKFLSBUF doesn't flush the disk cache either, it just flushes
every dirty page in the block device address space. It would not be very
hard to do, basically we have most of the support code in place for this
for IO barriers. Basically it would be something like:

blockdev_cache_flush(bdev)
{
	request_queue_t *q = bdev_get_queue(bdev);
	struct request *rq = blk_get_request(q, WRITE, GFP_WHATEVER);
	int ret;

	ret = blk_execute_rq(q, bdev->bd_disk, rq, 0);
	blk_put_request(rq);
	return ret;
}

Somewhat simplified of course, but it should get the point across.
Putting that in fs/buffer.c:sync_blockdev() would make BLKFLSBUF work.

As always with these things, the devil is in the details. It requires
the device to support a ->prepare_flush() queue hook, and not all
devices do that. It will work for IDE/SATA/SCSI, though. In some devices
you don't want/need to do a real disk flush, it depends on the write
cache settings, battery backing, etc.
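
For reference, the userspace sequence from the question (fsync() followed by
ioctl(BLKFLSBUF)) might look roughly like the sketch below. The helper name
sync_and_flush_buffers() is made up, and as explained above neither call can
be relied on to empty the drive's own write cache.

```c
#include <assert.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/fs.h>   /* BLKFLSBUF */

/*
 * fsync() the descriptor and, if it refers to a block device, also issue
 * BLKFLSBUF.  Returns 0 on success or a negative errno value.
 * This pushes dirty pages out of the kernel; it does not by itself
 * guarantee that the drive's write cache has been flushed.
 */
static int sync_and_flush_buffers(int fd)
{
    struct stat st;

    if (fsync(fd) < 0)
        return -errno;
    if (fstat(fd, &st) < 0)
        return -errno;
    if (S_ISBLK(st.st_mode) && ioctl(fd, BLKFLSBUF, 0) < 0)
        return -errno;
    return 0;
}
```

The S_ISBLK() check is only there so the helper can also be called on a
regular file, where BLKFLSBUF does not apply.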

-- 
Jens Axboe



Re: SATA exceptions with 2.6.20-rc5

2007-01-15 Thread Robert Hancock

Jens Axboe wrote:
> On Mon, Jan 15 2007, Jeff Garzik wrote:
> > Jens Axboe wrote:
> > > I'd be surprised if the device would not obey the 7 second timeout rule
> > > that seems to be set in stone and not allow more dirty in-drive cache
> > > than it could flush out in approximately that time.
> >
> > AFAIK Windows flush-cache timeout is 30 seconds, not 7 as with other
> > commands...
>
> Ok, 7 seconds for FLUSH_CACHE would have been nice for us too though, as
> it would pretty much guarantee lower latencies for random writes and
> write back caching. The concern is the barrier code, of course. I guess
> I should do some timings on potential worst case patterns some day. Alan
> may have done that sometime in the past, iirc.



Note that the ATA-7 spec for FLUSH CACHE says that "This command may 
take longer than 30 s to complete."


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: SATA exceptions with 2.6.20-rc5

2007-01-15 Thread Robert Hancock

Björn Steinbrink wrote:
> > My latest bisection attempt actually led to your sata_nv ADMA commit. [1]
> > I've now backed out that patch from 2.6.20-rc5 and have my stress test
> > running for 20 minutes now ("record" for a bad kernel surviving that
> > test is about 40 minutes IIRC). I'll keep it running for at least 2 more
> > hours.
>
> Yep, that one seems to be guilty. 2.6.20-rc5 with that commit backed out
> survived about 3 hours of testing, while the average was around 5
> minutes for a failure, sometimes even before I could log in.
> I took a look at the patch, but I can't really tell anything.
> nv_adma_check_atapi_dma somehow looks like it should not negate its
> return value, so that it returns 0 (atapi dma available) when
> adma_enable was 1. But I'm not exactly confident about that either ;)
> Will it hurt if I try to remove the negation?


It should be correct the way it is - that check is trying to prevent 
ATAPI commands from using DMA until the slave_config function has been 
called to set up the DMA parameters properly. When the 
NV_ADMA_ATAPI_SETUP_COMPLETE flag is not set, this returns 1 which 
disallows DMA transfers. Unless you were using an ATAPI (i.e. CD/DVD) 
device on the channel this wouldn't affect you anyway.
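
The return convention described here can be written down as a small
standalone sketch. Only the flag name comes from the discussion; the bit
value, struct port_sketch and check_atapi_dma_sketch() are assumptions made
for illustration and are not the sata_nv source.

```c
#include <assert.h>

/* Flag name taken from the thread; the bit position is a made-up value. */
#define NV_ADMA_ATAPI_SETUP_COMPLETE (1u << 0)

/* Hypothetical, trimmed-down per-port state. */
struct port_sketch {
    unsigned int flags;
};

/*
 * Returns nonzero to refuse DMA for an ATAPI command, zero to allow it.
 * DMA stays refused until the setup-complete flag has been set.
 */
static int check_atapi_dma_sketch(const struct port_sketch *pp)
{
    return !(pp->flags & NV_ADMA_ATAPI_SETUP_COMPLETE);
}
```

The negation is what turns "flag not set" into the nonzero "do not use DMA"
answer.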


I'll try your stress test when I get a chance, but I doubt I'll run into 
the same problem and I haven't seen any similar reports. Perhaps it's 
some kind of weird timing issue or incompatibility between the 
controller and that drive when running in ADMA mode? I seem to remember 
various reports of issues with certain Maxtor drives and some nForce 
SATA controllers under Windows at least..


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!

2007-01-15 Thread Robert Hancock

Christoph Anton Mitterer wrote:
> Sorry, as always I've forgotten some things... *g*
>
> Robert Hancock wrote:
> > If this is related to some problem with using the GART IOMMU with memory
> > hole remapping enabled
>
> What is that GART thing exactly? Is this the hardware IOMMU? I've always
> thought GART was something graphics card related,.. but if so,.. how
> could this solve our problem (that seems to occur mainly on harddisks)?


The GART built into the Athlon 64/Opteron CPUs is normally used for 
remapping graphics memory so that an AGP graphics card can see 
physically non-contiguous memory as one contiguous region. However, 
Linux can also use it as an IOMMU which allows devices which normally 
can't access memory above 4GB to see a mapping of that memory that 
resides below 4GB. In pre-2.6.20 kernels both the SATA and PATA 
controllers on the nForce 4 chipsets can only access memory below 4GB so 
transfers to memory above this mark have to go through the IOMMU. In 
2.6.20 this limitation is lifted on the nForce4 SATA controllers.
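
To make the 4GB limit concrete, here is a small standalone sketch of the
decision involved. It is not kernel code: needs_remap() and the two mask
macros are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Highest address reachable with 32-bit DMA (the "below 4GB" limit). */
#define DMA_MASK_32BIT UINT64_C(0xffffffff)
#define DMA_MASK_64BIT UINT64_MAX

/*
 * Returns 1 if a buffer of len bytes at physical address addr lies at least
 * partly outside what the device can address, so it would need remapping
 * (for example through an IOMMU), and 0 if the device can reach it directly.
 */
static int needs_remap(uint64_t addr, uint64_t len, uint64_t dma_mask)
{
    if (len == 0)
        return 0;
    /* the last byte must be <= dma_mask; written to avoid overflow */
    return addr > dma_mask || len - 1 > dma_mask - addr;
}
```

A controller limited to DMA_MASK_32BIT needs remapping for any buffer that
reaches above 4GB, while one that can use DMA_MASK_64BIT never does.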




> > then 2.6.20-rc kernels may avoid this problem on
> > nForce4 CK804/MCP04 chipsets as far as transfers to/from the SATA
> > controller are concerned
>
> Does this mean that PATA is not related? The corruption appears on PATA
> disks too, so why should it only solve the issue on SATA disks? Sounds a
> bit strange to me?


The PATA controller will still be using 32-bit DMA and so may also use 
the IOMMU, so this problem would not be avoided.




> > as the sata_nv driver now supports 64-bit DMA
> > on these chipsets and so no longer requires the IOMMU.
>
> Can you explain this a little bit more please? Is this a drawback (like
> a performance decrease)? Like under Windows where they never use the
> hardware iommu but always do it via software?


No, it shouldn't cause any performance loss. In previous kernels the 
nForce4 SATA controller was controlled using an interface quite similar 
to a PATA controller. In 2.6.20 kernels they use a more efficient 
interface that NVidia calls ADMA, which in addition to supporting NCQ 
also supports DMA without any 4GB limitations, so it can access all 
memory directly without requiring IOMMU assistance.


Note that if this corruption problem is, as has been suggested, related 
to memory hole remapping and the IOMMU, then this change only prevents 
the SATA controller transfers from experiencing this problem. Transfers 
on the PATA controller as well as any other devices with 32-bit DMA 
limitations might still have problems. As such this really just avoids 
the problem, not fixes it.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/




Re: SATA exceptions with 2.6.20-rc5

2007-01-15 Thread Jens Axboe
On Mon, Jan 15 2007, Jeff Garzik wrote:
> Jens Axboe wrote:
> >I'd be surprised if the device would not obey the 7 second timeout rule
> >that seems to be set in stone and not allow more dirty in-drive cache
> >than it could flush out in approximately that time.
> 
> AFAIK Windows flush-cache timeout is 30 seconds, not 7 as with other 
> commands...

Ok, 7 seconds for FLUSH_CACHE would have been nice for us too though, as
it would pretty much guarantee lower latencies for random writes and
write back caching. The concern is the barrier code, of course. I guess
I should do some timings on potential worst case patterns some day. Alan
may have done that sometime in the past, iirc.

-- 
Jens Axboe



Re: What does this scsi error mean ?

2007-01-15 Thread Olivier Galibert
On Mon, Jan 15, 2007 at 11:14:52PM +, Alan wrote:
> If you pull the drive and test it in another box does it show the same ?

I'm going to try that.  The problem requires 3-7 days to appear, so I
won't know immediately.


> And what does a scsi verify have to say ?

Running, looks like it's gonna take a little while.

  OG.


Re: I broke my port numbers :(

2007-01-15 Thread Sami Farin
On Mon, Jan 15, 2007 at 23:55:15 +0200, Sami Farin wrote:
> I know this may be entirely my fault but I have tried reversing
> all of my _own_ patches I applied to 2.6.19.2 but can't find what broke this.
> I did three times "netcat 127.0.0.69 42", notice the different
> port numbers.

Hmm...  when I do "rmmod iptable_nat ip_nat", it works.

# iptables -t nat --list -nvx
Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
    pkts      bytes target     prot opt in     out     source               destination

Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
    pkts      bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 0 packets, 0 bytes)
    pkts      bytes target     prot opt in     out     source               destination

I didn't know functions in ip_nat_proto_tcp.o were called
when I have an empty nat table.  Oops...

without iptable_nat ip_nat:
64 bytes from 127.0.0.1: icmp_seq=3 ttl=61 time=0.053 ms

with them:
64 bytes from 127.0.0.1: icmp_seq=3 ttl=61 time=0.065 ms

*shrug* live and learn.

2007-01-16 00:44:43.616266500 <4>[ 5672.924459]  [] dump_trace+0x215/0x21a
2007-01-16 00:44:43.616267500 <4>[ 5672.924492]  [] show_trace_log_lvl+0x1a/0x30
2007-01-16 00:44:43.616269500 <4>[ 5672.924511]  [] show_trace+0x12/0x14
2007-01-16 00:44:43.616270500 <4>[ 5672.924529]  [] dump_stack+0x19/0x1b
2007-01-16 00:44:43.616271500 <4>[ 5672.924547]  [] tcp_unique_tuple+0xd7/0x130 [ip_nat]
2007-01-16 00:44:43.616272500 <4>[ 5672.924585]  [] get_unique_tuple+0x5a/0x6e [ip_nat]
2007-01-16 00:44:43.616285500 <4>[ 5672.924593]  [] ip_nat_setup_info+0x73/0x1e6 [ip_nat]
2007-01-16 00:44:43.616287500 <4>[ 5672.924601]  [] ip_nat_rule_find+0x90/0xb0 [iptable_nat]
2007-01-16 00:44:43.616288500 <4>[ 5672.924610]  [] ip_nat_fn+0xd5/0x1ac [iptable_nat]
2007-01-16 00:44:43.616289500 <4>[ 5672.924617]  [] ip_nat_out+0x56/0xd3 [iptable_nat]
2007-01-16 00:44:43.616290500 <4>[ 5672.924624]  [] nf_iterate+0x4b/0x77
2007-01-16 00:44:43.616295500 <4>[ 5672.925610]  [] nf_hook_slow+0x58/0xdf
2007-01-16 00:44:43.617058500 <4>[ 5672.926562]  [] ip_output+0x187/0x26a
2007-01-16 00:44:43.618005500 <4>[ 5672.927511]  [] ip_queue_xmit+0x4c9/0x5a4
2007-01-16 00:44:43.618955500 <4>[ 5672.928461]  [] tcp_transmit_skb+0x25b/0x466
2007-01-16 00:44:43.619911500 <4>[ 5672.929417]  [] tcp_connect+0x133/0x1d1
2007-01-16 00:44:43.620865500 <4>[ 5672.930371]  [] tcp_v4_connect+0x404/0x750
2007-01-16 00:44:43.621821500 <4>[ 5672.931327]  [] inet_stream_connect+0x123/0x1b1
2007-01-16 00:44:43.622789500 <4>[ 5672.932295]  [] sys_connect+0x9c/0xbe
2007-01-16 00:44:43.623679500 <4>[ 5672.933185]  [] sys_socketcall+0xd2/0x272
2007-01-16 00:44:43.624612500 <4>[ 5672.934072]  [] syscall_call+0x7/0xb
2007-01-16 00:44:43.624614500 <4>[ 5672.934092]  [<00645410>] 0x645410
2007-01-16 00:44:43.624615500 <4>[ 5672.934116]  ===



Re: 2.6.20-rc4-mm1

2007-01-15 Thread Jens Axboe
On Mon, Jan 15 2007, Ingo Molnar wrote:
> 
> * Jens Axboe <[EMAIL PROTECTED]> wrote:
> 
> > > In a previous write invoked by: fsck.ext3(1896): WRITE block 8552 on 
> > > sdb1 end_buffer_async_write() is invoked.
> > > 
> > > sdb1 is not a part of a raid device.
> > 
> > When I briefly tested this before I left (and found it broken), doing 
> > a cat /proc/mdstat got things going again. Hard if that's your rootfs, 
> > it's just a hint :-)
> 
> hm, so you knew it's broken, still you let Andrew pick it up, or am i 
> misunderstanding something?

Well the raid issue wasn't known before it was in -mm.

-- 
Jens Axboe



Re: SATA exceptions with 2.6.20-rc5

2007-01-15 Thread Björn Steinbrink
On 2007.01.15 22:17:24 +0100, Björn Steinbrink wrote:
> On 2007.01.14 17:43:53 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > >Hi,
> > >
> > >with 2.6.20-rc{2,4,5} (no other tested yet) I see SATA exceptions quite
> > >often, with 2.6.19 there are no such exceptions. dmesg and lspci -v
> > >output follows. In the meantime, I'll start bisecting.
> > 
> > ...
> > 
> > >ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> > >ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 in
> > > res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> > >ata1: soft resetting port
> > >ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> > >ata1.00: configured for UDMA/133
> > >ata1: EH complete
> > >SCSI device sda: 160086528 512-byte hdwr sectors (81964 MB)
> > >sda: Write Protect is off
> > >sda: Mode Sense: 00 3a 00 00
> > >SCSI device sda: write cache: enabled, read cache: enabled, doesn't 
> > >support DPO or FUA
> > 
> > Looks like all of these errors are from a FLUSH CACHE command and the 
> > drive is indicating that it is no longer busy, so presumably done. 
> > That's not a DMA-mapped command, so it wouldn't go through the ADMA 
> > machinery and I wouldn't have expected this to be handled any 
> > differently from before. Curious..
> 
> My latest bisection attempt actually led to your sata_nv ADMA commit. [1]
> I've now backed out that patch from 2.6.20-rc5 and have my stress test
> running for 20 minutes now ("record" for a bad kernel surviving that
> test is about 40 minutes IIRC). I'll keep it running for at least 2 more
> hours.

Yep, that one seems to be guilty. 2.6.20-rc5 with that commit backed out
survived about 3 hours of testing, while the average was around 5
minutes for a failure, sometimes even before I could log in.
I took a look at the patch, but I can't really tell anything.
nv_adma_check_atapi_dma somehow looks like it should not negate its
return value, so that it returns 0 (atapi dma available) when
adma_enable was 1. But I'm not exactly confident about that either ;)
Will it hurt if I try to remove the negation?
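
For anyone following the question about the negation, a small sketch of the
return convention may help. As far as is known here, libata treats 0 from a
check_atapi_dma hook as "ATAPI DMA may be used" and nonzero as "fall back to
PIO". The struct, field and function below are hypothetical stand-ins and are
not taken from sata_nv; whether the negation in the real driver is correct
depends on what the tested flag actually means there.

```c
#include <stdbool.h>

/* Hypothetical port state; the real driver keeps its own private flags. */
struct demo_port {
	bool legacy_mode_ready;	/* true once plain (non-ADMA) DMA is usable */
};

/*
 * Follows the check_atapi_dma convention described above:
 * return 0 if ATAPI DMA may be used, nonzero to force PIO.
 */
static int demo_check_atapi_dma(const struct demo_port *port)
{
	return !port->legacy_mode_ready;
}
```

In this sketch the negation is what turns "flag set" into "0, DMA allowed".
If the flag had the opposite meaning, the negation would have to go, which is
the point being asked about.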

Thanks,
Björn


Re: What does this scsi error mean ?

2007-01-15 Thread Olivier Galibert
On Tue, Jan 16, 2007 at 12:27:17AM +0100, Stefan Richter wrote:
> On 15 Jan, Olivier Galibert wrote:
> > sd 0:0:0:0: SCSI error: return code = 0x0802
> > sda: Current: sense key: Hardware Error
> > ASC=0x42 ASCQ=0x0
> 
> The Additional Sense Code means "power-on or self-test failure" FWIW.
> (SPC-4 annex D)

Given that it happens anywhere from 3 days to a week after bootup on the root
drive, it's obviously not the "power on" part.  It's kinda annoying
that nothing appears in the smart logs though:

smartctl version 5.36 [x86_64-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: IBM-ESXS ST936701LCFN Version: B41D
Serial number: 3LC0C8P07647WLMV
Device type: disk
Transport protocol: Parallel SCSI (SPI-4)
Local Time is: Tue Jan 16 00:33:09 2007 CET
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature: 33 C
Drive Trip Temperature:60 C
Elements in grown defect list: 0
Vendor (Seagate) cache information
  Blocks sent to initiator = 16206797
  Blocks received from initiator = 83607272
  Blocks read from cache and sent to initiator = 3311410
  Number of read and write commands whose size <= segment size = 2801896
  Number of read and write commands whose size > segment size = 0
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 533.07
  number of minutes until next internal SMART test = 112

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      10474        0         0     10474      10474         61.360           0
write:         0        0         0         0          0         58.647           2

Non-medium error count:  1457822

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -     407                 - [-   -    -]
# 2  Background short  Completed                   -     243                 - [-   -    -]

Long (extended) Self Test duration: 793 seconds [13.2 minutes]


  OG.



[PATCH] CPUSET related breakage of sys_mbind

2007-01-15 Thread Bob Picco

current->mems_allowed is only defined for CONFIG_CPUSETS. This broke the
!CPUSETS build. I compile- and link-tested both variants.

Signed-off-by: Bob Picco <[EMAIL PROTECTED]>

 include/linux/cpuset.h |6 ++
 mm/mempolicy.c |2 +-
 2 files changed, 7 insertions(+), 1 deletion(-)

Index: linux-2.6.20-rc4-mm1/mm/mempolicy.c
===
--- linux-2.6.20-rc4-mm1.orig/mm/mempolicy.c	2007-01-15 09:21:58.0 -0500
+++ linux-2.6.20-rc4-mm1/mm/mempolicy.c	2007-01-15 17:51:15.0 -0500
@@ -882,9 +882,9 @@ asmlinkage long sys_mbind(unsigned long 
 	int err;
 
 	err = get_nodes(&nodes, nmask, maxnode);
-	nodes_and(nodes, nodes, current->mems_allowed);
 	if (err)
 		return err;
+	cpuset_nodes_allowed(&nodes);
 	return do_mbind(start, len, mode, &nodes, flags);
 }
 
Index: linux-2.6.20-rc4-mm1/include/linux/cpuset.h
===
--- linux-2.6.20-rc4-mm1.orig/include/linux/cpuset.h	2007-01-15 09:21:32.0 -0500
+++ linux-2.6.20-rc4-mm1/include/linux/cpuset.h	2007-01-15 14:01:30.0 -0500
@@ -75,6 +75,11 @@ static inline int cpuset_do_slab_mem_spr
 
 extern void cpuset_track_online_nodes(void);
 
+static inline void cpuset_nodes_allowed(nodemask_t *nodes)
+{
+   nodes_and(*nodes, *nodes, current->mems_allowed);
+}
+
 #else /* !CONFIG_CPUSETS */
 
 static inline int cpuset_init_early(void) { return 0; }
@@ -145,6 +150,7 @@ static inline int cpuset_do_slab_mem_spr
 }
 
 static inline void cpuset_track_online_nodes(void) {}
+static inline void cpuset_nodes_allowed(nodemask_t *nodes) {}
 
 #endif /* !CONFIG_CPUSETS */
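
The pattern the patch relies on, a real helper when the option is enabled and
an empty inline stub otherwise, can be sketched outside the kernel as follows.
All demo_ names and the DEMO_CONFIG_CPUSETS macro are stand-ins invented for
this illustration; the sketch only mirrors the shape of the change and is not
the kernel code itself.

```c
#include <stdint.h>

/* Stand-in for nodemask_t; illustration only. */
typedef uint64_t demo_nodemask_t;

#ifdef DEMO_CONFIG_CPUSETS
/* Pretend the current task may only use nodes 0 and 1. */
static demo_nodemask_t demo_mems_allowed = 0x3;

static inline void demo_nodes_allowed(demo_nodemask_t *nodes)
{
	*nodes &= demo_mems_allowed;
}
#else /* !DEMO_CONFIG_CPUSETS */
static inline void demo_nodes_allowed(demo_nodemask_t *nodes)
{
	(void)nodes;	/* no cpusets: every requested node stays allowed */
}
#endif

/* The caller never mentions the config option or the per-task field. */
static demo_nodemask_t demo_bind(demo_nodemask_t requested)
{
	demo_nodes_allowed(&requested);
	return requested;
}
```

Built without DEMO_CONFIG_CPUSETS, demo_bind() hands the mask back untouched;
built with it, the mask is restricted. The caller compiles in both cases
because it never touches the field that only exists in one configuration.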
 


Re: What does this scsi error mean ?

2007-01-15 Thread Stefan Richter
On 15 Jan, Olivier Galibert wrote:
> sd 0:0:0:0: SCSI error: return code = 0x0802
> sda: Current: sense key: Hardware Error
> ASC=0x42 ASCQ=0x0

The Additional Sense Code means "power-on or self-test failure" FWIW.
(SPC-4 annex D)
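
As a rough illustration of where those numbers come from, the sketch below
pulls the sense key, ASC and ASCQ out of fixed-format sense data (sense key in
the low nibble of byte 2, ASC in byte 12, ASCQ in byte 13). The demo_ names
are invented for this example, and the lookup only knows the single code
discussed in this thread; it is not a replacement for the full SPC tables.

```c
#include <stddef.h>
#include <stdint.h>

struct demo_sense {
	uint8_t key;	/* sense key, e.g. 0x4 = HARDWARE ERROR */
	uint8_t asc;	/* additional sense code */
	uint8_t ascq;	/* additional sense code qualifier */
};

/*
 * Decode fixed-format sense data (response code 0x70 or 0x71).
 * Returns 0 on success, -1 if the buffer is too short or not fixed format.
 */
static int demo_decode_fixed_sense(const uint8_t *buf, size_t len,
				   struct demo_sense *out)
{
	uint8_t code;

	if (len < 14)
		return -1;
	code = buf[0] & 0x7f;
	if (code != 0x70 && code != 0x71)
		return -1;
	out->key = buf[2] & 0x0f;
	out->asc = buf[12];
	out->ascq = buf[13];
	return 0;
}

/* Only the one code discussed in this thread is named here. */
static const char *demo_asc_name(uint8_t asc, uint8_t ascq)
{
	if (asc == 0x42 && ascq == 0x00)
		return "power-on or self-test failure";
	return "unknown (not in this sketch)";
}
```

Fed a buffer with sense key 0x4, ASC 0x42 and ASCQ 0x0, this yields the
"Hardware Error" key and the text quoted above.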
-- 
Stefan Richter
-=-=-=== ---= =
http://arcgraph.de/sr/



Some kind of 2.6.19 NFS regression

2007-01-15 Thread Daniel Drake

Hi,

Tim Ryan has reported the following bug at the Gentoo bugzilla:

https://bugs.gentoo.org/show_bug.cgi?id=162199

His home dir is mounted over NFS. 2.6.18 worked OK but 2.6.19 is very 
slow to load the desktop environment. NFS is suspected here as the 
problem does not exist for users with local homedirs. This might not be 
a straightforward performance issue as it does seem to perform OK on the 
console.


The bug still exists in unpatched 2.6.20-rc5.

Is this a known issue? Should we report a new bug on the kernel bugzilla?

Thanks,
Daniel


Re: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!

2007-01-15 Thread Christoph Anton Mitterer
Sorry, as always I've forgotten some things... *g*


Robert Hancock wrote:

> If this is related to some problem with using the GART IOMMU with memory 
> hole remapping enabled
What is that GART thing exactly? Is this the hardware IOMMU? I've always
thought GART was something graphics card related, but if so, how
could this solve our problem (which seems to occur mainly on hard disks)?

> then 2.6.20-rc kernels may avoid this problem on 
> nForce4 CK804/MCP04 chipsets as far as transfers to/from the SATA 
> controller are concerned
Does this mean that PATA is not related? The corruption appears on PATA
disks too, so why should it only solve the issue for SATA disks? Sounds a
bit strange to me.

> as the sata_nv driver now supports 64-bit DMA 
> on these chipsets and so no longer requires the IOMMU.
>   
Can you explain this a little bit more please? Is this a drawback (like
a performance decrease)? Like under Windows where they never use the
hardware iommu but always do it via software?


Best wishes,
Chris.



Re: What does this scsi error mean ?

2007-01-15 Thread Alan
> Both smart and the internal blade diagnostics say "everything is a-ok
> with the drive, there hasn't been any error ever except a bunch of
> corrected ECC ones, and no more than with a similar drive in another
> working blade".  Hence my initial post.  "Hardware error" is kinda
> imprecise, so I was wondering whether it was unexpected controller
> answer, detected transmission error, block write error, sector not
> found...  Is there a way to have more information?

Well the right place to look would indeed have been the SMART data
providing the drive didn't get into a state it couldn't update it.
Hardware error comes from the drive deciding something is wrong (or a
raid card faking it I guess). That covers everything from power
fluctuations and overheating through firmware consistency failures and
more.

If you pull the drive and test it in another box does it show the same ?
And what does a scsi verify have to say ?


Alan


Re: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!

2007-01-15 Thread Christoph Anton Mitterer
Hi everybody.

Sorry again for my late reply...

Robert gave us the following interesting information some days ago:

Robert Hancock wrote:
> If this is related to some problem with using the GART IOMMU with memory 
> hole remapping enabled, then 2.6.20-rc kernels may avoid this problem on 
> nForce4 CK804/MCP04 chipsets as far as transfers to/from the SATA 
> controller are concerned as the sata_nv driver now supports 64-bit DMA 
> on these chipsets and so no longer requires the IOMMU.
>   


I've just tested it with my "normal" BIOS settings, that is memhole
mapping = hardware, IOMMU = enabled and 64MB and _without_ (!)
iommu=soft as kernel parameters.
I only had the time for a small test (that is 3 passes, each with 10
complete sha512sum cycles over about 30GB of data)... but so far, no
corruption occurred.

It is surely far too early to tell whether our issue was solved by
2.6.20-rc-something, but I ask all of you who had systems that
suffered from the corruption to make _intensive_ tests with the most
recent rc of 2.6.20 (I've used 2.6.20-rc5) and report your results.
I'll do an extensive test tomorrow.

And of course (!!): Test without using iommu=soft and with enabled
memhole mapping (in the BIOS). (It won't make any sense to look if the
new kernel solves our problem while still applying one of our two
workarounds).


Please also note that there might be two completely different data
corruption problems: the one "solved" by iommu=soft and another reported
by Kurtis D. Rader.
I've asked him to clarify this in a post. :-)



Ok,... now if this (the new kernel) would really solve the issue... we
should try to find out what exactly was changed in the code, and if it
sounds logical that this solved the problem or not.
The new kernel could just make the corruption even more rare.


Best wishes,
Chris.





Re: [PATCH 2.6.19] USB HID: proper LED-mapping (support for SpaceNavigator)

2007-01-15 Thread Simon Budig
Jiri Kosina ([EMAIL PROTECTED]) wrote:
> On Mon, 15 Jan 2007, Simon Budig wrote:
> > Is it possible that there is a regression in the hid-debug stuff? The
> > mapping does not seem to appear in the dmesg-output. I unfortunately
> > don't have an earlier kernel available right now to verify, but now the
> > output on plugging in the device looks like this:
> 
[...]
> (after I check why the debug output seems to be broken),

Actually this might have been a false alarm. I remembered about
/var/log/messages and looked up how this looked like with earlier
kernels - turns out it looks exactly the same.

(the values dumped there seem to be the initial values of a given field
in a HID-Report)

So there is no regression there, sorry about the confusion.

Bye,
Simon
-- 
  [EMAIL PROTECTED]  http://simon.budig.de/


Re: [PATCH -mm] AVR32: fix build breakage

2007-01-15 Thread Ben Nizette

On Mon, 15 Jan 2007 09:37:35 +0100
Haavard Skinnemoen <[EMAIL PROTECTED]> wrote:
>
> On Mon, 15 Jan 2007 14:48:57 +1100
> Ben Nizette <[EMAIL PROTECTED]> wrote:
>
>> Remove an unwanted remnant of the recent revert of AVR32/AT91 SPI
>> patches in -mm.  Without this patch, the AVR32 build of
>> 2.6.20-rc[34]-mm1 breaks.

>
> Actually, this is broken in my tree. Wonder how I managed to do that
> and not even notice it.
>

Interestingly, git://www.atmel.no/~hskinnemoen/linux/kernel/avr32.git master
is still fine.


> I'll apply this patch and push out a new avr32-arch branch for Andrew.
> Thanks for testing.

Sounds good, no worries.

--Ben
>
> Haavard
>


Re: [PATCH] Cell SPU task notification -- updated patch: #1

2007-01-15 Thread Maynard Johnson

Attached is an updated patch that addresses Michael Ellerman's comments.

One comment made by Michael has not yet been addressed:
The comment was in regard to the for-loop in 
spufs/sched.c:notify_spus_active().  He wondered if the scheduler can 
swap a context from one node to another.  If so, there's a small window 
in this loop (where we switch the lock from one node's active list to 
the next) where it may be possible we might miss waking up a context and 
send a spurious wakeup to another.
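
To make the suspected window easier to picture, here is a deliberately
simplified, single-threaded model of the walk. It assumes the scheduler can
in fact move a context between nodes, which is exactly the open question, and
it only models the "missed" half of the concern, not the spurious wakeup. All
demo_ names are invented; nothing here is spufs code.

```c
#include <stdbool.h>

#define DEMO_NODES 2

/* Hypothetical model: each context sits on exactly one node's active list. */
struct demo_ctx {
	int node;
	bool notified;
};

/* Visit every context currently on the given node, as if holding its lock. */
static void demo_notify_node(struct demo_ctx *ctxs, int n, int node)
{
	for (int i = 0; i < n; i++)
		if (ctxs[i].node == node)
			ctxs[i].notified = true;
}

/*
 * Walk the nodes in order. If migrate_idx >= 0, that context moves to
 * node 0 in the gap after node 0's lock was dropped and before node 1's
 * lock is taken. Returns the number of contexts that were never notified.
 */
static int demo_walk(struct demo_ctx *ctxs, int n, int migrate_idx)
{
	int missed = 0;

	for (int node = 0; node < DEMO_NODES; node++) {
		if (node == 1 && migrate_idx >= 0)
			ctxs[migrate_idx].node = 0;	/* migration in the unlocked gap */
		demo_notify_node(ctxs, n, node);
	}
	for (int i = 0; i < n; i++)
		if (!ctxs[i].notified)
			missed++;
	return missed;
}
```

Without a migration every context is visited. With one context moving from
node 1 back to the already-walked node 0 during the gap, that context is
never marked, which is the missed wakeup described above.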

Arnd . . . can you comment on this question?

Thanks.
-Maynard

Subject: Enable SPU switch notification to detect currently active SPU tasks.

From: Maynard Johnson <[EMAIL PROTECTED]>

This patch adds to the capability of spu_switch_event_register so that the
caller is also notified of currently active SPU tasks.  It also exports
spu_switch_event_register and spu_switch_event_unregister.

Signed-off-by: Maynard Johnson <[EMAIL PROTECTED]>


Index: linux-2.6.19-rc6-arnd1+patches/arch/powerpc/platforms/cell/spufs/sched.c
===
--- linux-2.6.19-rc6-arnd1+patches.orig/arch/powerpc/platforms/cell/spufs/sched.c	2006-12-04 10:56:04.730698720 -0600
+++ linux-2.6.19-rc6-arnd1+patches/arch/powerpc/platforms/cell/spufs/sched.c	2007-01-15 16:22:31.808461448 -0600
@@ -84,15 +84,42 @@
 			ctx ? ctx->object_id : 0, spu);
 }
 
+static void notify_spus_active(void)
+{
+   int node;
+	/* Wake up the active spu_contexts. When the awakened processes 
+	 * sees their notify_active flag is set, they will call
+	 * spu_notify_already_active().
+	 */
+	for (node = 0; node < MAX_NUMNODES; node++) {
+		struct spu *spu;
+		mutex_lock(&spu_prio->active_mutex[node]);
+		list_for_each_entry(spu, &spu_prio->active_list[node], list) {
+			struct spu_context *ctx = spu->ctx;
+			spu->notify_active = 1;
+			wake_up_all(&ctx->stop_wq);
+			smp_wmb();
+		}
+		mutex_unlock(&spu_prio->active_mutex[node]);
+	}
+	yield();
+}
+
 int spu_switch_event_register(struct notifier_block * n)
 {
-	return blocking_notifier_chain_register(&spu_switch_notifier, n);
+	int ret;
+	ret = blocking_notifier_chain_register(&spu_switch_notifier, n);
+	if (!ret)
+		notify_spus_active();
+	return ret;
 }
+EXPORT_SYMBOL_GPL(spu_switch_event_register);
 
 int spu_switch_event_unregister(struct notifier_block * n)
 {
 	return blocking_notifier_chain_unregister(&spu_switch_notifier, n);
 }
+EXPORT_SYMBOL_GPL(spu_switch_event_unregister);
 
 
 static inline void bind_context(struct spu *spu, struct spu_context *ctx)
@@ -250,6 +277,14 @@
 	return spu_get_idle(ctx, flags);
 }
 
+void spu_notify_already_active(struct spu_context *ctx)
+{
+	struct spu *spu = ctx->spu;
+	if (!spu)
+		return;
+	spu_switch_notify(spu, ctx);
+}
+
 /* The three externally callable interfaces
  * for the scheduler begin here.
  *
Index: linux-2.6.19-rc6-arnd1+patches/arch/powerpc/platforms/cell/spufs/spufs.h
===
--- linux-2.6.19-rc6-arnd1+patches.orig/arch/powerpc/platforms/cell/spufs/spufs.h	2007-01-08 18:18:40.093354608 -0600
+++ linux-2.6.19-rc6-arnd1+patches/arch/powerpc/platforms/cell/spufs/spufs.h	2007-01-08 18:31:03.610345792 -0600
@@ -183,6 +183,7 @@
 void spu_yield(struct spu_context *ctx);
 int __init spu_sched_init(void);
 void __exit spu_sched_exit(void);
+void spu_notify_already_active(struct spu_context *ctx);
 
 extern char *isolated_loader;
 
Index: linux-2.6.19-rc6-arnd1+patches/arch/powerpc/platforms/cell/spufs/run.c
===
--- linux-2.6.19-rc6-arnd1+patches.orig/arch/powerpc/platforms/cell/spufs/run.c	2007-01-08 18:33:51.979311680 -0600
+++ linux-2.6.19-rc6-arnd1+patches/arch/powerpc/platforms/cell/spufs/run.c	2007-01-15 16:31:30.10442 -0600
@@ -45,9 +45,11 @@
 	u64 pte_fault;
 
 	*stat = ctx->ops->status_read(ctx);
-	if (ctx->state != SPU_STATE_RUNNABLE)
-		return 1;
+	smp_rmb();
+
 	spu = ctx->spu;
+	if (ctx->state != SPU_STATE_RUNNABLE || spu->notify_active)
+		return 1;
 	pte_fault = spu->dsisr &
 	(MFC_DSISR_PTE_NOT_FOUND | MFC_DSISR_ACCESS_DENIED);
 	return (!(*stat & 0x1) || pte_fault || spu->class_0_pending) ? 1 : 0;
@@ -304,6 +306,7 @@
 		   u32 *npc, u32 *event)
 {
 	int ret;
+	struct spu *spu;
 	u32 status;
 
 	if (down_interruptible(&ctx->run_sema))
@@ -317,8 +320,16 @@
 
 	do {
 		ret = spufs_wait(ctx->stop_wq, spu_stopped(ctx, &status));
+		spu = ctx->spu;
 		if (unlikely(ret))
 			break;
+		if (unlikely(spu->notify_active)) {
+			spu->notify_active = 0;
+			if (!(status & SPU_STATUS_STOPPED_BY_STOP)) {
+spu_notify_already_active(ctx);
+continue;
+			}
+		}
 		if ((status & SPU_STATUS_STOPPED_BY_STOP) &&
 		(status >> SPU_STOP_STATUS_SHIFT == 0x2104)) {
 			ret = spu_process_callback(ctx);


Re: High CPU usage with sata_nv

2007-01-15 Thread ris
On Mon, 15 Jan 2007 18:26:42 +, Frederik Deweerdt wrote
> On Mon, Jan 15, 2007 at 06:54:50PM +0200, ris wrote:
> > I have motherboard with nforce 590 SLI (MCP55) chipset.
> > On other systems all its ok.
> > 
> > But i tried a lot o kernels, configurations and always get cpu at 100% when
> > copying files.
> > I use SATA II samsung hard drive.
> > 
> Any dmesg complain? Could you send the hdparm -I  ?
> Regards,
> Frederik


Ok ... 

hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
Model Number:   SAMSUNG SP2504C
Serial Number:  S09QJ13LA07964
Firmware Revision:  VT100-50
Standards:
Used: ATA/ATAPI-7 T13 1532D revision 4a
Supported: 7 6 5 4
Configuration:
Logical         max     current
cylinders       16383   16383
heads           16      16
sectors/track   63      63
--
CHS current addressable sectors:   16514064
LBA    user addressable sectors:  268435455
LBA48  user addressable sectors:  488397168
device size with M = 1024*1024:      238475 MBytes
device size with M = 1000*1000:      250059 MBytes (250 GB)
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16  Current = 1
Recommended acoustic management value: 254, current value: 0
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 udma7
 Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
 Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
Enabled Supported:
   *SMART feature set
Security Mode feature set
   *Power Management feature set
   *Write cache
   *Look-ahead
   *Host Protected Area feature set
   *WRITE_BUFFER command
   *READ_BUFFER command
   *NOP cmd
   *DOWNLOAD_MICROCODE
SET_MAX security extension
Automatic Acoustic Management feature set
   *48-bit Address feature set
   *Device Configuration Overlay feature set
   *Mandatory FLUSH_CACHE
   *FLUSH_CACHE_EXT
   *SMART error logging
   *SMART self-test
   *General Purpose Logging feature set
   *Segmented DOWNLOAD_MICROCODE
   *SATA-I signaling speed (1.5Gb/s)
   *SATA-II signaling speed (3.0Gb/s)
   *Native Command Queueing (NCQ)
   *Host-initiated interface power management
   *Phy event counters
DMA Setup Auto-Activate optimization
Device-initiated interface power management
   *Software settings preservation
   *SMART Command Transport (SCT) feature set
   *SCT Long Sector Access (AC1)
   *SCT LBA Segment Access (AC2)
   *SCT Error Recovery Control (AC3)
   *SCT Features Control (AC4)
   *SCT Data Tables (AC5)
Security:
Master password revision code = 65534
supported
not enabled
not locked
not frozen
not expired: security count
supported: enhanced erase
88min for SECURITY ERASE UNIT. 88min for ENHANCED SECURITY ERASE UNIT.
Checksum: correct

and dmesg



Linux version 2.6.19-gentoo-r4 ([EMAIL PROTECTED]) (gcc version 4.1.1 (Gentoo
4.1.1-r3)) #2 SMP Mon Jan 15 15:14:18 CET 2007
Command line: BOOT_IMAGE=Gentoo root=802
BIOS-provided physical RAM map:
 BIOS-e820:  - 0009f000 (usable)
 BIOS-e820: 0009f000 - 000a (reserved)
 BIOS-e820: 000f - 0010 (reserved)
 BIOS-e820: 0010 - 3fee (usable)
 BIOS-e820: 3fee - 3fee3000 (ACPI NVS)
 BIOS-e820: 3fee3000 - 3fef (ACPI data)
 BIOS-e820: 3fef - 3ff0 (reserved)
 BIOS-e820: f000 - f400 (reserved)
 BIOS-e820: fec0 - 0001 (reserved)
Entering add_active_range(0, 0, 159) 0 entries of 256 used
Entering add_active_range(0, 256, 261856) 1 entries of 256 used
end_pfn_map = 1048576
DMI 2.4 present.
ACPI: RSDP (v002 Nvidia) @ 0x000f8040
ACPI: XSDT (v001 Nvidia ASUSACPI 0x42302e31 AWRD 0x) @ 0x3fee30c0
ACPI: FADT (v003 Nvidia ASUSACPI 0x42302e31 AWRD 0x) @ 0x3feed200
ACPI: HPET (v001 Nvidia ASUSACPI 0x42302e31 AWRD 0x0098) @ 0x3feed400
ACPI: MCFG (v001 Nvidia ASUSACPI 0x42302e31 AWRD 0x) @ 0x3feed480
ACPI: MADT (v001 Nvidia ASUSACPI 0x42302e31 AWRD 0x) @ 0x3feed340
ACPI: DSDT (v001 NVIDIA AWRDACPI 0x1000 MSFT 

Re: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!

2007-01-15 Thread Christoph Anton Mitterer
Hi.

Some days ago I received the following message from "Sunny Days". I
think he did not send it to lkml, so I am forwarding it now:

Sunny Days wrote:
> hello,
>
> i have done some extensive testing on this.
>
> various opterons, always single socket
> various dimms 1 and 2gb modules
> and hitachi+seagate disks with various firmwares and sizes
> but i am getting a different pattern in the corruption.
> My test file was 10gb.
>
> I have mapped the earliest corruption as low as 10mb in the written data.
> i have also monitored the address range used by the cp/md5sum process
> under /proc/$PID/maps to see if i could find a pattern but i was
> unable to.
>
> i also tested ext2 and lvm with similar results aka corruption.
> later on the week i should get a pci promise controller and test on that one.
>
> Things i have not tested is the patch that linus released 10 days ago
> and reiserfs3/4
>
> my nvidia chipset was ck804 (a3)
>
> Hope somehow we get to the bottom of this.
>
> Hope this helps
>
>
> btw amd errata that could possibly influence this are
>
> 115, 123, 156 with the latter being fascinating as the workaround
> suggested is 0x0 page entry.
>
>   

Does anyone have an opinion about this? Could you please read the
mentioned errata and tell me what you think?

Best wishes,
Chris.

@ Sunny Days: Thanks for your mail.



[PATCH] seq_file conversion: toshiba.c

2007-01-15 Thread Alexey Dobriyan
Compile-tested.

Signed-off-by: Alexey Dobriyan <[EMAIL PROTECTED]>
---

 drivers/char/toshiba.c |   35 +--
 1 file changed, 25 insertions(+), 10 deletions(-)

--- a/drivers/char/toshiba.c
+++ b/drivers/char/toshiba.c
@@ -68,6 +68,7 @@ #include 
 #include 
 #include 
 #include 
+#include <linux/seq_file.h>
 
 #include 
 
@@ -298,12 +299,10 @@ static int tosh_ioctl(struct inode *ip, 
  * Print the information for /proc/toshiba
  */
 #ifdef CONFIG_PROC_FS
-static int tosh_get_info(char *buffer, char **start, off_t fpos, int length)
+static int proc_toshiba_show(struct seq_file *m, void *v)
 {
-   char *temp;
int key;
 
-   temp = buffer;
key = tosh_fn_status();
 
/* Arguments
@@ -314,8 +313,7 @@ static int tosh_get_info(char *buffer, c
 4) BIOS date (in SCI date format)
 5) Fn Key status
*/
-
-   temp += sprintf(temp, "1.1 0x%04x %d.%d %d.%d 0x%04x 0x%02x\n",
+   seq_printf(m, "1.1 0x%04x %d.%d %d.%d 0x%04x 0x%02x\n",
tosh_id,
(tosh_sci & 0xff00)>>8,
tosh_sci & 0xff,
@@ -323,9 +321,21 @@ static int tosh_get_info(char *buffer, c
tosh_bios & 0xff,
tosh_date,
key);
+   return 0;
+}
 
-   return temp-buffer;
+static int proc_toshiba_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, proc_toshiba_show, NULL);
 }
+
+static const struct file_operations proc_toshiba_fops = {
+   .owner  = THIS_MODULE,
+   .open   = proc_toshiba_open,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
 #endif
 
 
@@ -508,10 +518,15 @@ static int __init toshiba_init(void)
return retval;
 
 #ifdef CONFIG_PROC_FS
-   /* register the proc entry */
-   if (create_proc_info_entry("toshiba", 0, NULL, tosh_get_info) == NULL) {
-   misc_deregister(&tosh_device);
-   return -ENOMEM;
+   {
+   struct proc_dir_entry *pde;
+
+   pde = create_proc_entry("toshiba", 0, NULL);
+   if (!pde) {
+   misc_deregister(&tosh_device);
+   return -ENOMEM;
+   }
+   pde->proc_fops = &proc_toshiba_fops;
}
 #endif
 



Re: [PATCH] sed s/gawk/awk/ scripts/gen_init_ramfs.sh

2007-01-15 Thread Sam Ravnborg
On Mon, Jan 15, 2007 at 04:24:17PM -0500, Rob Landley wrote:
> Signed-off-by: Rob Landley <[EMAIL PROTECTED]>
Acked-by: Sam Ravnborg <[EMAIL PROTECTED]>

PS
My dev machine is broken and I need a new one before kbuild.git will
be alive again.
Considering an AMD Athlon 64 X2 based one with Nvidia GeForce™ 6150LE:
http://h10010.www1.hp.com/wwpc/dk/da/ho/WF06b/34307-351123-1284187-1284187-1284187-12726540-78048221.html
Anyone with comments on this choice?

Sam
> 
> Use "awk" instead of "gawk".
> 
> -- 
> 
> There's a symlink from awk to gawk if you're using the gnu tools, but no
> symlink from gawk to awk if you're using BusyBox or some such.  (There's a
> reason for the existence of standard names.  Can we use them please?)
> 
> --- linux-2.6.19.2/scripts/gen_initramfs_list.sh  2007-01-10 14:10:37.0 -0500
> +++ linux-new/scripts/gen_initramfs_list.sh   2007-01-15 10:14:41.0 -0500
> @@ -121,9 +121,9 @@
>   "nod")
>   local dev_type=
>   local maj=$(LC_ALL=C ls -l "${location}" | \
> - gawk '{sub(/,/, "", $5); print $5}')
> + awk '{sub(/,/, "", $5); print $5}')
>   local min=$(LC_ALL=C ls -l "${location}" | \
> - gawk '{print $6}')
> + awk '{print $6}')
>  
>   if [ -b "${location}" ]; then
>   dev_type="b"
> @@ -134,7 +134,7 @@
>   ;;
>   "slink")
>   local target=$(LC_ALL=C ls -l "${location}" | \
> - gawk '{print $11}')
> + awk '{print $11}')
>   str="${ftype} ${name} ${target} ${str}"
>   ;;
>   *)
> 
> -- 
> "Perfection is reached, not when there is no longer anything to add, but
> when there is no longer anything to take away." - Antoine de Saint-Exupery


Re: umask ignored in mkdir(2)?

2007-01-15 Thread Hugh Dickins
[I've rearranged this to avoid a horrid mix of top and bottom posting]

On Sun, 14 Jan 2007, Tigran Aivazian wrote:
> On Sun, 14 Jan 2007, Tigran Aivazian wrote:
> > On Sun, 14 Jan 2007, Tigran Aivazian wrote:
> > > I think I may have found a bug --- on one of my machines the umask value
> > > is ignored by ext3 (but honoured on tmpfs) for mkdir system call:
> > > 
> > > $ cd /tmp
> > > $ df -T .
> > > Filesystem    Type   1K-blocks      Used Available Use% Mounted on
> > > /dev/hdf1     ext3   189238556 155721568  23749068  87% /
> > > $ rmdir ok ; mkdir ok ; ls -ld ok
> > > rmdir: ok: No such file or directory
> > > drwxrwxrwx 2 tigran tigran 4096 Jan 14 20:36 ok/
> > > $ umask
> > > 0022
> > > $ cd /dev/shm
> > > $ df -T .
> > > Filesystem    Type   1K-blocks      Used Available Use% Mounted on
> > > tmpfs        tmpfs      517988         0    517988   0% /dev/shm
> > > $ rmdir ok ; mkdir ok ; ls -ld ok
> > > rmdir: ok: No such file or directory
> > > drwxr-xr-x 2 tigran tigran 40 Jan 14 20:36 ok/
> > > $ uname -a
> > > Linux ws 2.6.19.1 #6 SMP Sun Jan 14 20:03:30 GMT 2007 i686 i686 i386
> > > GNU/Linux
> > > $ grep -i acl /usr/src/linux/.config
> > > # CONFIG_FS_POSIX_ACL is not set
> > > # CONFIG_TMPFS_POSIX_ACL is not set
> > > # CONFIG_NFS_V3_ACL is not set
> > > # CONFIG_NFSD_V3_ACL is not set
> > > 
> > > As you see, ACL is not configured in, and neither are extended attributes:
> > > 
> > > $ grep -i xattr /usr/src/linux/.config
> > > # CONFIG_EXT2_FS_XATTR is not set
> > > # CONFIG_EXT3_FS_XATTR is not set
> > > 
> > > So, this is something fs-specific. What do you think?
> >
> > I forgot to mention that on another machine running the same kernel version
> > with the same (as close as a UP machine can be to SMP) kernel configuration
> > the umask is honoured properly on ext3 filesystem.
> 
> I figured it out! I thought you might be interested --- the reason is the
> mismatch between the default mount options stored in the superblock on disk
> and the filesystem features compiled into the kernel.
> 
> Namely, dumpe2fs on the offending filesystems showed the following default
> mount options:
> 
> user_xattr acl
> 
> but on good filesystems it showed "(none)". So, I used "tune2fs -o ^acl"
> (and ^user_xattr) to clear these in the superblock and mounted the filesystem
> --- and now mkdir system call works as expected, i.e. honours the umask.
> 
> Maybe the ext3 filesystem should automatically detect this (the mismatch) and
> printk a warning so the user is told that his filesystem is mounted in
> extremely insecure way, i.e. making directories as root will result in lots of
> 0777 places (e.g. try "make modules_install" --- this will create lots of
> security holes in /lib/modules).
> 
> I cc'd linux-kernel as someone may wish to fix this.

Good find!  Though I suppose not much of a worry for distros,
whose kernels will always(?) have ACLs configured in.

I get sooo confused when there's multiple ways of switching something
on and off (at the ifdef level and at the mount opts level and at the
tuning level), looks like others do too.  Here's my third version of
a patch, already wondering if a fourth would be better (at the point
where they set s_flags) ... no, I think this one is more robust...


[PATCH] fix umask when noACL kernel meets extN tuned for ACLs

Fix insecure default behaviour reported by Tigran Aivazian: if an ext2
or ext3 or ext4 filesystem is tuned to mount with "acl", but mounted by
a kernel built without ACL support, then umask was ignored when creating
inodes - though root or user has umask 022, touch creates files as 0666,
and mkdir creates directories as 0777.

This appears to have worked right until 2.6.11, when a fix to the default
mode on symlinks (always 0777) assumed VFS applies umask: which it does,
unless the mount is marked for ACLs; but ext[234] set MS_POSIXACL in
s_flags according to s_mount_opt set according to def_mount_opts.

We could revert to the 2.6.10 ext[234]_init_acl (adding an S_ISLNK test);
but other filesystems only set MS_POSIXACL when ACLs are configured.  We
could fix this at another level; but it seems most robust to avoid setting
the s_mount_opt flag in the first place (at the expense of more ifdefs).

Likewise don't set the XATTR_USER flag when built without XATTR support.
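
As a rough illustration of the failure mode described above, here is a
simplified Python model. It is not kernel code, and `create_mode` and its
parameters are made-up names. The assumption is that the VFS skips the umask
whenever the mount is flagged for POSIX ACLs and leaves it to the
filesystem's ACL code; if that code is compiled out, nothing applies the
umask at all.

```python
def create_mode(requested: int, umask: int,
                mount_flagged_posixacl: bool,
                fs_acl_code_built_in: bool) -> int:
    """Hypothetical model of which layer applies the umask on create.

    requested              -- mode passed to mkdir()/open(), e.g. 0o777
    umask                  -- the process umask, e.g. 0o022
    mount_flagged_posixacl -- stands in for MS_POSIXACL being set in s_flags
    fs_acl_code_built_in   -- stands in for the filesystem's ACL support
                              being configured into the kernel
    """
    if not mount_flagged_posixacl:
        # No ACL flag on the mount: the VFS masks the mode itself.
        return requested & ~umask
    if fs_acl_code_built_in:
        # ACL flag set and ACL code present: with no default ACL on the
        # parent directory, the filesystem falls back to the umask.
        return requested & ~umask
    # ACL flag set but ACL code compiled out: nobody applies the umask.
    return requested


if __name__ == "__main__":
    for flagged, built_in in [(False, False), (True, True), (True, False)]:
        mode = create_mode(0o777, 0o022, flagged, built_in)
        print(f"flagged={flagged!s:5} built_in={built_in!s:5} -> {mode:04o}")
```

Only the last combination (flag set from the superblock defaults, ACL code
not built) leaves the directory at 0777, which matches what Tigran saw on
ext3. Not setting the flag in that configuration moves it back to the first
case.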

Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]>
---

 fs/ext2/super.c |4 
 fs/ext3/super.c |4 
 fs/ext4/super.c |4 
 3 files changed, 12 insertions(+)

--- 2.6.20-rc5/fs/ext2/super.c  2007-01-13 08:46:07.0 +
+++ linux/fs/ext2/super.c   2007-01-15 20:48:38.0 +
@@ -708,10 +708,14 @@ static int ext2_fill_super(struct super_
set_opt(sbi->s_mount_opt, GRPID);
if (def_mount_opts & EXT2_DEFM_UID16)
set_opt(sbi->s_mount_opt, NO_UID32);
+#ifdef CONFIG_EXT2_FS_XATTR
if (def_mount_opts & EXT2_DEFM_XATTR_USER)
set_opt(sbi->s_mount_opt, XATTR_USER);
+#endif
+#ifdef 

I broke my port numbers :(

2007-01-15 Thread Sami Farin
I know this may be entirely my fault but I have tried reversing
all of my _own_ patches I applied to 2.6.19.2 but can't find what broke this.
I did three times "netcat 127.0.0.69 42", notice the different
port numbers.

First, if someone could try this on 2.6.19.2 or 2.6.20-rc* and
tell me it works, I'll shut up.

2007-01-15 23:42:05.833636 IP (tos 0x0, ttl  61, id 34230, offset 0, flags [DF], proto: TCP (6), length: 60) 127.0.0.69.23287 > 127.0.0.69.42: SWE, cksum 0x0281 (correct), 674651575:674651575(0) win 32792
2007-01-15 23:42:05.833673 IP (tos 0x0, ttl  61, id 0, offset 0, flags [DF], proto: TCP (6), length: 40) 127.0.0.69.42 > 127.0.0.69.52935: R, cksum 0x5c66 (correct), 0:0(0) ack 674651576 win 0

2007-01-15 23:42:06.009245 IP (tos 0x0, ttl  61, id 11189, offset 0, flags [DF], proto: TCP (6), length: 60) 127.0.0.69.20161 > 127.0.0.69.42: SWE, cksum 0x96b3 (correct), 678941897:678941897(0) win 32792
2007-01-15 23:42:06.009289 IP (tos 0x0, ttl  61, id 0, offset 0, flags [DF], proto: TCP (6), length: 40) 127.0.0.69.42 > 127.0.0.69.52936: R, cksum 0xe511 (correct), 0:0(0) ack 678941898 win 0

2007-01-15 23:42:06.169587 IP (tos 0x0, ttl  61, id 36607, offset 0, flags [DF], proto: TCP (6), length: 60) 127.0.0.69.52470 > 127.0.0.69.42: SWE, cksum 0x15b5 (correct), 681498315:681498315(0) win 32792
2007-01-15 23:42:06.169624 IP (tos 0x0, ttl  61, id 0, offset 0, flags [DF], proto: TCP (6), length: 40) 127.0.0.69.42 > 127.0.0.69.52937: R, cksum 0xe2e7 (correct), 0:0(0) ack 681498316 win 0

If something was listening on port 42, it would see the wrong port,
e.g. 23287, 20161 or 52470, not 52935, 52936 or 52937.



[PATCH] seq_file conversion: coda

2007-01-15 Thread Alexey Dobriyan
Compile-tested.

Signed-off-by: Alexey Dobriyan <[EMAIL PROTECTED]>
---

 fs/coda/sysctl.c |   76 ---
 1 file changed, 39 insertions(+), 37 deletions(-)

--- a/fs/coda/sysctl.c
+++ b/fs/coda/sysctl.c
@@ -15,6 +15,7 @@ #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -84,15 +85,11 @@ static int do_reset_coda_cache_inv_stats
return 0;
 }
 
-static int coda_vfs_stats_get_info( char * buffer, char ** start,
-   off_t offset, int length)
+static int proc_vfs_stats_show(struct seq_file *m, void *v)
 {
-   int len=0;
-   off_t begin;
struct coda_vfs_stats * ps = & coda_vfs_stat;
   
-  /* this works as long as we are below 1024 characters! */
-   len += sprintf( buffer,
+   seq_printf(m,
"Coda VFS statistics\n"
"===\n\n"
"File Operations:\n"
@@ -132,28 +129,14 @@ static int coda_vfs_stats_get_info( char
ps->rmdir,
ps->rename,
ps->permission); 
-
-   begin = offset;
-   *start = buffer + begin;
-   len -= begin;
-
-   if ( len > length )
-   len = length;
-   if ( len < 0 )
-   len = 0;
-
-   return len;
+   return 0;
 }
 
-static int coda_cache_inv_stats_get_info( char * buffer, char ** start,
- off_t offset, int length)
+static int proc_cache_inv_stats_show(struct seq_file *m, void *v)
 {
-   int len=0;
-   off_t begin;
struct coda_cache_inv_stats * ps = & coda_cache_inv_stat;
   
-   /* this works as long as we are below 1024 characters! */
-   len += sprintf( buffer,
+   seq_printf(m,
"Coda cache invalidation statistics\n"
"==\n\n"
"flush\t\t%9d\n"
@@ -170,19 +153,35 @@ static int coda_cache_inv_stats_get_info
ps->zap_vnode,
ps->purge_fid,
ps->replace );
-  
-   begin = offset;
-   *start = buffer + begin;
-   len -= begin;
+   return 0;
+}
 
-   if ( len > length )
-   len = length;
-   if ( len < 0 )
-   len = 0;
+static int proc_vfs_stats_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, proc_vfs_stats_show, NULL);
+}
 
-   return len;
+static int proc_cache_inv_stats_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, proc_cache_inv_stats_show, NULL);
 }
 
+static const struct file_operations proc_vfs_stats_fops = {
+   .owner  = THIS_MODULE,
+   .open   = proc_vfs_stats_open,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
+
+static const struct file_operations proc_cache_inv_stats_fops = {
+   .owner  = THIS_MODULE,
+   .open   = proc_cache_inv_stats_open,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
+
 static ctl_table coda_table[] = {
{CODA_TIMEOUT, "timeout", _timeout, sizeof(int), 0644, NULL, 
_dointvec},
{CODA_HARD, "hard", _hard, sizeof(int), 0644, NULL, 
_dointvec},
@@ -212,9 +211,6 @@ static struct proc_dir_entry* proc_fs_co
 
 #endif
 
-#define coda_proc_create(name,get_info) \
-   create_proc_info_entry(name, 0, proc_fs_coda, get_info)
-
 void coda_sysctl_init(void)
 {
reset_coda_vfs_stats();
@@ -223,9 +219,15 @@ void coda_sysctl_init(void)
 #ifdef CONFIG_PROC_FS
proc_fs_coda = proc_mkdir("coda", proc_root_fs);
if (proc_fs_coda) {
+   struct proc_dir_entry *pde;
+
proc_fs_coda->owner = THIS_MODULE;
-   coda_proc_create("vfs_stats", coda_vfs_stats_get_info);
-   coda_proc_create("cache_inv_stats", 
coda_cache_inv_stats_get_info);
+   pde = create_proc_entry("vfs_stats", 0, proc_fs_coda);
+   if (pde)
+   pde->proc_fops = &proc_vfs_stats_fops;
+   pde = create_proc_entry("cache_inv_stats", 0, proc_fs_coda);
+   if (pde)
+   pde->proc_fops = &proc_cache_inv_stats_fops;
}
 #endif
 



[PATCH] adjust use of unplug in elevator code

2007-01-15 Thread Linas Vepstas

Hi Chris, Jens,
Can you look at this, and push upstream if this looks reasonable
to you? It fixes a bug I've been tripping over.

--linas


A flag was recently added to the elevator code to avoid
performing an unplug when requests are being re-queued.
The goal of this flag was to avoid a deep recursion that
can occur when re-queueing requests after a SCSI device/host 
reset.  See http://lkml.org/lkml/2006/5/17/254

However, that fix added the flag near the bottom of a case
statement, where an earlier break (in an if statement) could
transport one out of the case, without setting the flag.
This patch sets the flag earlier in the case statement.

I re-discovered the deep recursion recently during testing;
I was told that it was a known problem, and the fix to it was
in the kernel I was testing. Indeed it was ... but it didn't
fix the bug. With the patch below, I no longer see the bug.
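
The control-flow problem can be sketched outside the kernel as follows.
This is a minimal Python model, not the elevator code: the function names
are hypothetical, it assumes unplug_it starts out set, and an early return
stands in for the early break out of the case statement.

```python
def requeue_before_patch(ordseq: int) -> bool:
    """Model of the requeue case before the patch.

    Returns True if the queue would still be unplugged afterwards.
    """
    unplug_it = True          # assumed default on entry
    if ordseq == 0:
        # Early exit (the 'break' inside the if): the flag is never cleared.
        return unplug_it
    # ... position lookup elided ...
    unplug_it = False         # only reached on the late path
    return unplug_it


def requeue_after_patch(ordseq: int) -> bool:
    """Model of the requeue case with the flag cleared up front."""
    unplug_it = True          # assumed default on entry
    unplug_it = False         # cleared before any early exit can happen
    if ordseq == 0:
        return unplug_it
    # ... position lookup elided ...
    return unplug_it


if __name__ == "__main__":
    for ordseq in (0, 1):
        print(ordseq, requeue_before_patch(ordseq), requeue_after_patch(ordseq))
```

In the model, the ordseq == 0 path is the only one where the two versions
differ, which is the path the earlier fix missed.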

Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]>
Cc: Jens Axboe <[EMAIL PROTECTED]>
Cc: Chris Wright <[EMAIL PROTECTED]>


 block/elevator.c |   11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

Index: linux-2.6.20-rc4/block/elevator.c
===
--- linux-2.6.20-rc4.orig/block/elevator.c  2007-01-15 14:16:03.0 -0600
+++ linux-2.6.20-rc4/block/elevator.c   2007-01-15 14:20:04.0 -0600
@@ -590,6 +590,12 @@ void elv_insert(request_queue_t *q, stru
 */
rq->cmd_flags |= REQ_SOFTBARRIER;
 
+   /*
+* Most requeues happen because of a busy condition,
+* don't force unplug of the queue for that case.
+*/
+   unplug_it = 0;
+
if (q->ordseq == 0) {
list_add(&rq->queuelist, &q->queue_head);
break;
@@ -604,11 +610,6 @@ void elv_insert(request_queue_t *q, stru
}
 
list_add_tail(&rq->queuelist, pos);
-   /*
-* most requeues happen because of a busy condition, don't
-* force unplug of the queue for that case.
-*/
-   unplug_it = 0;
break;
 
default:


Re: What does this scsi error mean ?

2007-01-15 Thread Olivier Galibert
On Mon, Jan 15, 2007 at 06:45:40PM +, Alan wrote:
> On Mon, 15 Jan 2007 18:16:02 +0100
> Olivier Galibert <[EMAIL PROTECTED]> wrote:
> 
> > sd 0:0:0:0: SCSI error: return code = 0x0802
> > sda: Current: sense key: Hardware Error
> > ASC=0x42 ASCQ=0x0
> 
> I'll give you a clue: The words "Hardware Error".
> 
> Run a SCSI verify pass on the drive with some drive utilities and see
> what happens. If you are lucky it'll just reallocate blocks and decide
> the drive is ok, if not well see what the smart data thinks.

Both smart and the internal blade diagnostics say "everything is a-ok
with the drive, there hasn't been any error ever except a bunch of
corrected ECC ones, and no more than with a similar drive in another
working blade".  Hence my initial post.  "Hardware error" is kinda
imprecise, so I was wondering whether it was unexpected controller
answer, detected transmission error, block write error, sector not
found...  Is there a way to have more information?

  OG.



Re: allocation failed: out of vmalloc space error treating and VIDEO1394 IOC LISTEN CHANNEL ioctl failed problem

2007-01-15 Thread Kristian Høgsberg

On 1/15/07, Arjan van de Ven <[EMAIL PROTECTED]> wrote:


> However, what I'd really like to do is to leave it to user space to
> allocate the memory as David describes.  In the transmit case, user
> space allocates memory (malloc or mmap) and loads the payload into
> that buffer.

there is a lot of pain involved with doing things this way, it is a TON
better if YOU provide the memory via a custom mmap handler for a device
driver.
(there are a lot of security nightmares involved with the opposite
model, like the user can put any kind of memory there, even pci mmio
space)


OK, point taken.  I don't have a strong preference for the opposite
model, it just seems elegant that you can let user space handle
allocation and pin and map the pages as needed.  But you're right, it
certainly is easier to give safe memory to user space in the first
place rather than try to make sure user space isn't trying to trick
us.


>   Then it does an ioctl() on the firewire control device

ioctls are evil ;) esp an "mmap me" ioctl


Ah, I'm not mmap'ing it from the ioctl, I do implement the mmap file
operation for this.  However, you have to do an ioctl before mapping
the device to configure the dma context.

Other than that what is the problem with ioctls, and more interesting,
what is the alternative?  I don't expect (or want) a bunch of syscalls
to be added for this, so I don't really see what other mechanism I
should use for this.


> It's not too difficult from what I'm doing now, I'd just like to give
> user space more control over the buffers it uses for streaming (i.e.
> letting user space allocate them).  What I'm missing here is: how do I
> actually pin a page in memory?  I'm sure it's not too difficult, but I
> haven't yet figured it out and I'm sure somebody knows it off the top
> of his head.

again the best way is for you to provide an mmap method... you can then
fill in the pages and keep that in some sort of array; this is for
example also what the DRI/DRM layer does for textures etc...


That sounds a lot like what I have now (mmap method, array of pages)
so I'll just stick with that.

thanks,
Kristian


Re: allocation failed: out of vmalloc space error treating and VIDEO1394 IOC LISTEN CHANNEL ioctl failed problem

2007-01-15 Thread Arjan van de Ven

> However, what I'd really like to do is to leave it to user space to
> allocate the memory as David describes.  In the transmit case, user
> space allocates memory (malloc or mmap) and loads the payload into
> that buffer.

there is a lot of pain involved with doing things this way, it is a TON
better if YOU provide the memory via a custom mmap handler for a device
driver.
(there are a lot of security nightmares involved with the opposite
model, like the user can put any kind of memory there, even pci mmio
space)

>   Then it does an ioctl() on the firewire control device

ioctls are evil ;) esp an "mmap me" ioctl

> It's not too difficult from what I'm doing now, I'd just like to give
> user space more control over the buffers it uses for streaming (i.e.
> letting user space allocate them).  What I'm missing here is: how do I
> actually pin a page in memory?  I'm sure it's not too difficult, but I
> haven't yet figured it out and I'm sure somebody knows it off the top
> of his head.

again the best way is for you to provide an mmap method... you can then
fill in the pages and keep that in some sort of array; this is for
example also what the DRI/DRM layer does for textures etc...

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via 
http://www.linuxfirmwarekit.org



[PATCH] sed s/gawk/awk/ scripts/gen_init_ramfs.sh

2007-01-15 Thread Rob Landley
Signed-off-by: Rob Landley <[EMAIL PROTECTED]>

Use "awk" instead of "gawk".

-- 

There's a symlink from awk to gawk if you're using the gnu tools, but no
symlink from gawk to awk if you're using BusyBox or some such.  (There's a
reason for the existence of standard names.  Can we use them please?)
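
For reference, here is a minimal Python sketch of what the three awk
one-liners in the hunks below pull out of "ls -l" output. The helper names
and the sample lines are illustrative assumptions. The scripts use only
whitespace field splitting and sub(), both of which are in POSIX awk, so
nothing gawk-specific is needed.

```python
def device_numbers(ls_line: str) -> tuple[str, str]:
    """Mimic: awk '{sub(/,/, "", $5); print $5}' and awk '{print $6}'.

    For a device node, field 5 of `ls -l` is the major number followed by
    a comma and field 6 is the minor number (awk fields are 1-based).
    """
    fields = ls_line.split()              # same default splitting as awk
    major = fields[4].replace(",", "", 1) # sub() removes the first comma only
    minor = fields[5]
    return major, minor


def symlink_target(ls_line: str) -> str:
    """Mimic: awk '{print $11}' on an `ls -l` line for a symlink."""
    return ls_line.split()[10]


if __name__ == "__main__":
    # Sample lines are made up for illustration.
    print(device_numbers("crw-rw-rw- 1 root root 1, 3 Jan 15 10:00 /dev/null"))
    print(symlink_target("lrwxrwxrwx 1 root root 4 Jan 15 10:00 awk -> gawk"))
```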

--- linux-2.6.19.2/scripts/gen_initramfs_list.sh2007-01-10 14:10:37.0 -0500
+++ linux-new/scripts/gen_initramfs_list.sh 2007-01-15 10:14:41.0 -0500
@@ -121,9 +121,9 @@
"nod")
local dev_type=
local maj=$(LC_ALL=C ls -l "${location}" | \
-   gawk '{sub(/,/, "", $5); print $5}')
+   awk '{sub(/,/, "", $5); print $5}')
local min=$(LC_ALL=C ls -l "${location}" | \
-   gawk '{print $6}')
+   awk '{print $6}')
 
if [ -b "${location}" ]; then
dev_type="b"
@@ -134,7 +134,7 @@
;;
"slink")
local target=$(LC_ALL=C ls -l "${location}" | \
-   gawk '{print $11}')
+   awk '{print $11}')
str="${ftype} ${name} ${target} ${str}"
;;
*)

-- 
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery


Re: SATA exceptions with 2.6.20-rc5

2007-01-15 Thread Björn Steinbrink
On 2007.01.14 17:43:53 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >Hi,
> >
> >with 2.6.20-rc{2,4,5} (no other tested yet) I see SATA exceptions quite
> >often, with 2.6.19 there are no such exceptions. dmesg and lspci -v
> >output follows. In the meantime, I'll start bisecting.
> 
> ...
> 
> >ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> >ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 in
> > res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> >ata1: soft resetting port
> >ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> >ata1.00: configured for UDMA/133
> >ata1: EH complete
> >SCSI device sda: 160086528 512-byte hdwr sectors (81964 MB)
> >sda: Write Protect is off
> >sda: Mode Sense: 00 3a 00 00
> >SCSI device sda: write cache: enabled, read cache: enabled, doesn't 
> >support DPO or FUA
> 
> Looks like all of these errors are from a FLUSH CACHE command and the 
> drive is indicating that it is no longer busy, so presumably done. 
> That's not a DMA-mapped command, so it wouldn't go through the ADMA 
> machinery and I wouldn't have expected this to be handled any 
> differently from before. Curious..

My latest bisection attempt actually led to your sata_nv ADMA commit. [1]
I've now backed out that patch from 2.6.20-rc5 and have my stress test
running for 20 minutes now ("record" for a bad kernel surviving that
test is about 40 minutes IIRC). I'll keep it running for at least 2 more
hours.

The test is pretty simple:
while /bin/true; do ls -lR > /dev/null; done
while /bin/true; do echo 255 > /proc/sys/vm/drop_caches; sleep 1; done

running in parallel.

Björn

[1] 2dec7555e6bf2772749113ea0ad454fcdb8cf861


Re: [PATCH 2.6.19] USB HID: proper LED-mapping (support for SpaceNavigator)

2007-01-15 Thread Jiri Kosina
On Mon, 15 Jan 2007, Simon Budig wrote:

> Is it possible that there is a regression in the hid-debug stuff? The
> mapping does not seem to appear in the dmesg-output. I unfortunately
> don't have an earlier kernel available right now to verify, but now the
> output on plugging in the device looks like this:

Hi Simon,

thanks, I queued the LED mapping fix for upstream.

I agree with Vojtech and Marcel that it doesn't make much sense having the 
hid-debug as a header file - I will fix it, and apply your patch to it 
(after I check why the debug output seems to be broken), you don't have to 
resend it, thanks.

-- 
Jiri Kosina

