Re: [RFC 0/8] Cpuset aware writeback
On Mon, 2007-01-15 at 21:47 -0800, Christoph Lameter wrote:
> Currently cpusets are not able to do proper writeback since
> dirty ratio calculations and writeback are all done for the system
> as a whole. This may result in a large percentage of a cpuset
> becoming dirty without writeout being triggered. Under NFS
> this can lead to OOM conditions.
>
> Writeback will occur during the LRU scans. But such writeout
> is not effective since we write page by page and not in inode page
> order (regular writeback).
>
> In order to fix the problem we first of all introduce a method to
> establish a map of nodes that contain dirty pages for each
> inode mapping.
>
> Secondly we modify the dirty limit calculation to be based
> on the active cpuset.
>
> If we are in a cpuset then we select only inodes for writeback
> that have pages on the nodes of the cpuset.
>
> After we have the cpuset throttling in place we can then make
> further fixups:
>
> A. We can do inode based writeout from direct reclaim,
>    avoiding single page writes to the filesystem.
>
> B. We add a new counter NR_UNRECLAIMABLE that is subtracted
>    from the available pages in a node. This allows us to
>    accurately calculate the dirty ratio even if large portions
>    of the node have been allocated for huge pages or for
>    slab pages.

What about mlock'ed pages?

> There are a couple of points where some better ideas could be used:
>
> 1. The nodemask expands the inode structure significantly if the
>    architecture allows a high number of nodes. This is only an issue
>    for IA64. For that platform we expand the inode structure by 128 bytes
>    (to support 1024 nodes). The last patch attempts to address the issue
>    by using the knowledge about the maximum possible number of nodes
>    determined on bootup to shrink the nodemask.

Not the prettiest indeed; no ideas, though.

> 2. The calculation of the per cpuset limits can require looping
>    over a number of nodes, which may bring the performance of
>    get_dirty_limits near pre-2.6.18 performance (before the introduction
>    of the ZVC counters) (only for cpuset based limit calculation). There
>    is no way of keeping these counters per cpuset since cpusets may
>    overlap.

Well, you gain functionality, you lose some runtime; sad, but probably worth it.

Otherwise it all looks good.

Acked-by: Peter Zijlstra <[EMAIL PROTECTED]>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
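The per-cpuset limit calculation being discussed above can be sketched in userspace C. This is only an illustration of the idea, not the kernel code: all names (`fake_node`, `cpuset_dirty_limit`, the `mems_allowed` bitmask) are invented here. It shows both the loop over the cpuset's nodes whose cost Peter worries about, and the role of the proposed NR_UNRECLAIMABLE subtraction.

```c
#include <assert.h>

/*
 * Hypothetical sketch of the per-cpuset dirty limit idea: instead of using
 * global page counts, sum the counts of only the nodes in the cpuset's
 * mems_allowed mask, subtract the unreclaimable pages (the proposed
 * NR_UNRECLAIMABLE counter), then apply dirty_ratio.  All names here are
 * illustrative, not the actual kernel code.
 */
#define MAX_NODES 8

struct fake_node {
	unsigned long pages;         /* pages usable for the dirty calculation */
	unsigned long unreclaimable; /* e.g. slab / huge pages */
};

/* dirty limit in pages for the nodes set in the 'mems_allowed' bitmask */
static unsigned long cpuset_dirty_limit(const struct fake_node *nodes,
					unsigned int mems_allowed,
					unsigned int dirty_ratio)
{
	unsigned long total = 0;
	int n;

	/* this is the loop whose cost is discussed: O(nodes in the cpuset) */
	for (n = 0; n < MAX_NODES; n++)
		if (mems_allowed & (1u << n))
			total += nodes[n].pages - nodes[n].unreclaimable;

	return total * dirty_ratio / 100;
}

static unsigned long demo(void)
{
	static const struct fake_node nodes[MAX_NODES] = {
		{ 1000, 200 }, { 500, 0 },
	};

	/* cpuset spanning nodes 0 and 1, dirty_ratio of 40% */
	return cpuset_dirty_limit(nodes, 0x3, 40);
}
```

With 1300 usable pages across the two nodes, a 40% ratio yields a 520-page limit; a global calculation over all nodes would give a much larger, wrong limit for this cpuset.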
82571EB gigabit on e1000 in 2.6.20-rc5
I have a PCI-E Pro/1000 MT Quad Port adapter, which works quite well under
2.6.19.2 but fails to see link under 2.6.20-rc5. Earlier today I reported this
to [EMAIL PROTECTED], but thought I should get the word out in case someone
else is testing this kernel on this NIC chipset.

Due to changes between 2.6.19.2 and 2.6.20, Intel driver 7.3.20 will not
compile for 2.6.20, nor will the 2.6.19.2 in-tree driver. Error output:

  CC [M]  drivers/net/e1000/e1000_main.o
  drivers/net/e1000/e1000_main.c:1132:45: error: macro "INIT_WORK" passed 3 arguments, but takes just 2
  drivers/net/e1000/e1000_main.c: In function 'e1000_probe':
  drivers/net/e1000/e1000_main.c:1131: error: 'INIT_WORK' undeclared (first use in this function)
  drivers/net/e1000/e1000_main.c:1131: error: (Each undeclared identifier is reported only once
  drivers/net/e1000/e1000_main.c:1131: error: for each function it appears in.)
  make[3]: *** [drivers/net/e1000/e1000_main.o] Error 1

lspci -nn output (quad port):

  09:00.0 Ethernet controller [0200]: Intel Corporation 82571EB Gigabit Ethernet Controller [8086:10a4] (rev 06)
  09:00.1 Ethernet controller [0200]: Intel Corporation 82571EB Gigabit Ethernet Controller [8086:10a4] (rev 06)
  0a:00.0 Ethernet controller [0200]: Intel Corporation 82571EB Gigabit Ethernet Controller [8086:10a4] (rev 06)
  0a:00.1 Ethernet controller [0200]: Intel Corporation 82571EB Gigabit Ethernet Controller [8086:10a4] (rev 06)

lspci -nn output (dual port):

  07:00.0 Ethernet controller [0200]: Intel Corporation 82571EB Gigabit Ethernet Controller [8086:105e] (rev 06)
  07:00.1 Ethernet controller [0200]: Intel Corporation 82571EB Gigabit Ethernet Controller [8086:105e] (rev 06)

From what I've been able to gather, other Intel Pro/1000 chipsets work fine in
2.6.20-rc5. If the e1000 guys need any assistance testing, I'll be more than
happy to volunteer myself as a guinea pig for patches.
Allen Parker
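The "INIT_WORK passed 3 arguments" error above comes from the 2.6.20 workqueue API change: `INIT_WORK()` lost its `data` argument, and work handlers now take the `work_struct` pointer itself, recovering their private context with `container_of()`. The userspace mock below illustrates the new two-argument pattern out-of-tree drivers had to convert to; the types and the `fake_adapter` driver context are stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace illustration of the 2.6.20 workqueue API change: handlers take
 * the work_struct pointer and use container_of() to find the enclosing
 * driver structure, instead of receiving a void *data argument.
 */
struct work_struct {
	void (*func)(struct work_struct *);
};

#define INIT_WORK(w, f)	((w)->func = (f))	/* new style: 2 arguments */

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* mock driver-private context embedding the work item */
struct fake_adapter {
	int reset_count;
	struct work_struct reset_task;
};

static void reset_task_fn(struct work_struct *work)
{
	/* recover the enclosing structure from the work pointer */
	struct fake_adapter *a =
		container_of(work, struct fake_adapter, reset_task);

	a->reset_count++;
}

static int run_reset(void)
{
	static struct fake_adapter a;

	a.reset_count = 0;
	INIT_WORK(&a.reset_task, reset_task_fn);
	a.reset_task.func(&a.reset_task);	/* simulate the workqueue firing */
	return a.reset_count;
}
```

The compile failure in drivers written for 2.6.19 is exactly the old three-argument `INIT_WORK(&work, func, data)` form hitting this two-argument macro.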
Re: [PATCH -mm 9/10][RFC] aio: usb gadget remove aio file ops
On Monday 15 January 2007 5:54 pm, Nate Diller wrote:
> This removes the aio implementation from the usb gadget file system.

NAK. I see a deep misunderstanding here.

> Aside from making very creative (!) use of the aio retry path, it can't
> be of any use performance-wise

Other than the basic win of letting one userspace thread keep an I/O stream
active while at the same time processing the data it reads or writes? That's
the "async" part of AIO. There's a not-so-little thing called "I/O overlap"
... which is the only way to prevent wasting bandwidth between (non-cacheable)
I/O requests, and thus is the only way to let userspace code achieve anything
close to the maximum I/O bandwidth the hardware can achieve.

We want to see the host side "usbfs" evolve to support AIO like this too, for
the same reasons. (Currently it has fairly ugly AIO code that looks unlike any
other AIO code in Linux. Recent updates to support a file-per-endpoint device
model are a necessary precursor to switching over to standard AIO syscalls.)

> because it always kmalloc()s a bounce buffer for the *whole* I/O size.

By and large that's a negligible factor compared to being able to achieve I/O
overlap. ISTR the reason for not doing fancy DMA magic was that the cost of
this style of AIO was under 1 KByte of object code on ARM, which was easy to
justify ... while DMA magic to do that sort of stuff would be much fatter, as
well as more error prone.

(And that's why the "creative" use of the retry path. As I've observed before,
"retry" is a misnomer in the general sense of an async I/O framework. It's
more of a semi-completion callback; I/O can't in general be "retried" on error
or fault, and even in the current usage it's not really a "retry".)

Now that high speed peripheral hardware is becoming more common on embedded
Linuxes -- TI has DaVinci, OMAP 2430, TUSB6010 (as found in the new Nokia 800
tablets); Atmel has the AVR32 AP7000; at least a couple of parts should be
able to use the same musb_hdrc driver as those TI parts; and there are a few
other chips I've heard of -- there may be some virtue in eliminating the
memcpy, since those CPUs don't have many MIPS to waste. (Iff the memcpy turns
out to be a real issue...)

> Perhaps the only reason to keep it around is the ability to cancel I/O
> requests, which only applies when using the user space async I/O interface.

It's good to have almost the complete kernel API functionality exposed to
userspace, and having I/O cancelation is an inevitable consequence of a
complete AIO framework ... but that particular issue was not a driving
concern. The reason for AIO is to have a *STANDARD* userspace interface for
*ASYNC I/O* which otherwise can't exist. You know, the kind of I/O interface
that can't be implemented with read() and write() syscalls, which for
non-buffered I/O necessarily preclude all I/O overlap.

AIO itself is a direct match to most I/O frameworks' primitives. (An AIOCB is
directly analogous to the peripheral side "struct usb_request" and the host
side "struct urb".)

You know, I've always thought that one reason the AIO discussions seemed
strange is that they weren't really focused on I/O (the lowlevel
after-the-caches stuff) so much as filesystems (several layers up in the
stack, with intervening caching frameworks). The first several
implementations of AIO that I saw were restricted to "real" I/O and not
applicable to disk backed files. So while I was glad the Linux approach
didn't make that mistake, it's seemed that it might be wanting to make a
converse mistake: neglecting I/O that isn't aimed at data stored on disks.

> I highly doubt that is enough incentive to justify the extra complexity
> here or in user-space, so I think it's a safe bet to remove this.
> If that feature is still desired, it would be possible to implement a sync
> interface that does an interruptible sleep.

What's needed is an async, non-sleeping interface ... with I/O overlap.
That's antithetical to using read()/write() calls, so your proposed approach
couldn't possibly work.

- Dave
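The submit-then-complete model Dave describes -- an AIOCB being analogous to a `struct usb_request` or a `struct urb` -- can be modeled in a few lines of userspace C. Everything below (`fake_request`, `submit`, `hardware_drain`) is invented for illustration; the point is that one thread can keep several transfers in flight at once, which blocking read()/write() calls cannot express.

```c
#include <assert.h>
#include <string.h>

/*
 * Toy model of the completion-callback I/O style: a request is submitted,
 * the submitter regains control immediately, and a completion callback
 * fires later.  This is the "I/O overlap" the message is about.
 */
struct fake_request {
	char data[16];
	int  completed;
	void (*complete)(struct fake_request *req);
};

#define QUEUE_DEPTH 4

static struct fake_request *queue[QUEUE_DEPTH];
static int queued;

/* submit without blocking: the caller regains control immediately */
static int submit(struct fake_request *req)
{
	if (queued == QUEUE_DEPTH)
		return -1;		/* queue full */
	req->completed = 0;
	queue[queued++] = req;
	return 0;
}

/* pretend the hardware finished everything; run completion callbacks */
static void hardware_drain(void)
{
	int i;

	for (i = 0; i < queued; i++) {
		strcpy(queue[i]->data, "done");
		queue[i]->completed = 1;
		queue[i]->complete(queue[i]);
	}
	queued = 0;
}

static int completions;

static void count_completion(struct fake_request *req)
{
	(void)req;
	completions++;
}

/* one thread keeps three transfers in flight at once */
static int overlap_demo(void)
{
	static struct fake_request reqs[3];
	int i;

	completions = 0;
	for (i = 0; i < 3; i++) {
		reqs[i].complete = count_completion;
		assert(submit(&reqs[i]) == 0);
	}
	/* ... the submitter could process earlier data here ... */
	hardware_drain();
	return completions;
}
```

A sync interface with an interruptible sleep, as proposed, serializes at the `submit` step and loses exactly this overlap.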
Re: [PATCH -mm 4/10][RFC] aio: convert aio_complete to file_endio_t
On Monday 15 January 2007 5:54 pm, Nate Diller wrote:
> --- a/drivers/usb/gadget/inode.c	2007-01-12 14:42:29.0 -0800
> +++ b/drivers/usb/gadget/inode.c	2007-01-12 14:25:34.0 -0800
> @@ -559,35 +559,32 @@ static int ep_aio_cancel(struct kiocb *i
>  	return value;
>  }
>
> -static ssize_t ep_aio_read_retry(struct kiocb *iocb)
> +static int ep_aio_read_retry(struct kiocb *iocb)
>  {
>  	struct kiocb_priv *priv = iocb->private;
> -	ssize_t len, total;
> -	int i;
> +	ssize_t total;
> +	int i, err = 0;
>
>  	/* we "retry" to get the right mm context for this: */
>
>  	/* copy stuff into user buffers */
>  	total = priv->actual;
> -	len = 0;
>  	for (i=0; i < priv->nr_segs; i++) {
>  		ssize_t this = min((ssize_t)(priv->iv[i].iov_len), total);
>
>  		if (copy_to_user(priv->iv[i].iov_base, priv->buf, this)) {
> -			if (len == 0)
> -				len = -EFAULT;
> +			err = -EFAULT;

Discarding the capability to report partial success, e.g. that the first N
bytes were properly transferred? I don't see any virtue in that change. Quite
the opposite, in fact.

I think you're also expecting that if N bytes were requested, that's always
how many will be received. That's not true for packetized I/O such as USB
isochronous transfers ... where it's quite legit (and in some cases routine)
for the other end to send packets that are shorter than the maximum allowed.
Sending a zero length packet is not the same as sending no packet at all, for
another example.
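The partial-success semantics being defended here can be shown with a small userspace copy loop. This is not the gadget code: `scatter_copy`, the `seg` type, and the negative-capacity fault convention are all invented for illustration. On a fault partway through, it reports how many bytes made it; only a fault on the very first byte collapses to -EFAULT.

```c
#include <assert.h>
#include <string.h>

/*
 * Illustration of partial-success reporting: copy_out-style failure on a
 * later segment returns the byte count already transferred, not -EFAULT.
 * A negative 'cap' simulates a faulting user buffer (copy_to_user failing).
 */
#define EFAULT 14

struct seg {
	char buf[8];
	int  cap;	/* capacity; < 0 simulates a faulting user buffer */
};

static long scatter_copy(struct seg *segs, int nsegs,
			 const char *src, long total)
{
	long done = 0;
	int i;

	for (i = 0; i < nsegs && done < total; i++) {
		long this = total - done;

		if (this > (long)sizeof(segs[i].buf))
			this = sizeof(segs[i].buf);
		if (segs[i].cap < 0)			/* "copy_to_user" failed */
			return done ? done : -EFAULT;	/* partial count, or error */
		memcpy(segs[i].buf, src + done, this);
		done += this;
	}
	return done;
}

/* second segment faults: the first 8 bytes are still reported */
static long demo_partial(void)
{
	static struct seg segs[2] = { { "", 8 }, { "", -1 } };

	return scatter_copy(segs, 2, "abcdefghijkl", 12);
}

/* first segment faults immediately: nothing transferred, report the error */
static long demo_fault_first(void)
{
	static struct seg segs[1] = { { "", -1 } };

	return scatter_copy(segs, 1, "abcd", 4);
}
```

Replacing the running byte count with a bare error flag, as the patch hunk does, makes the first case indistinguishable from the second.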
Re: Some kind of 2.6.19 NFS regression
On Mon, 2007-01-15 at 18:26 -0500, Daniel Drake wrote:
> Hi,
>
> Tim Ryan has reported the following bug at the Gentoo bugzilla:
>
> https://bugs.gentoo.org/show_bug.cgi?id=162199
>
> His home dir is mounted over NFS. 2.6.18 worked OK but 2.6.19 is very
> slow to load the desktop environment. NFS is suspected here as the
> problem does not exist for users with local homedirs. This might not be
> a straightforward performance issue as it does seem to perform OK on the
> console.
>
> The bug still exists in unpatched 2.6.20-rc5.
>
> Is this a known issue? Should we report a new bug on the kernel bugzilla?
>
> Thanks,
> Daniel

I couldn't find any information whatsoever in that bug report as to what
mount options he is using, or what server export options are in use. No info
either about what networking hardware he is using (or what drivers are in
use).

I'd also recommend using something like ttcp to see if large packets (NFS
read/write packets are typically ~32k large) are being transmitted
efficiently.

Cheers,
  Trond
[PATCH] slip: Replace kmalloc() + memset() pairs with the appropriate kzalloc() calls
This patch replaces kmalloc() + memset() pairs with the appropriate kzalloc()
calls.

Signed-off-by: Joe Jin <[EMAIL PROTECTED]>

---
--- drivers/net/slip.c.orig	2007-01-16 14:21:52.0 +0800
+++ drivers/net/slip.c	2007-01-16 14:23:07.0 +0800
@@ -1343,15 +1343,12 @@
 	printk(KERN_INFO "SLIP linefill/keepalive option.\n");
 #endif
 
-	slip_devs = kmalloc(sizeof(struct net_device *)*slip_maxdev, GFP_KERNEL);
+	slip_devs = kzalloc(sizeof(struct net_device *)*slip_maxdev, GFP_KERNEL);
 	if (!slip_devs) {
 		printk(KERN_ERR "SLIP: Can't allocate slip devices array! Uaargh! (-> No SLIP available)\n");
 		return -ENOMEM;
 	}
 
-	/* Clear the pointer array, we allocate devices when we need them */
-	memset(slip_devs, 0, sizeof(struct net_device *)*slip_maxdev);
-
 	/* Fill in our line protocol discipline, and register it */
 	if ((status = tty_register_ldisc(N_SLIP, &sl_ldisc)) != 0) {
 		printk(KERN_ERR "SLIP: can't register line discipline (err = %d)\n", status);
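The transformation is safe because kzalloc(size, flags) is by definition kmalloc(size, flags) followed by zeroing the allocation. A minimal userspace model, with malloc standing in for kmalloc (`fake_kzalloc` and `all_slots_clear` are invented names):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Userspace model of what the patch relies on: kzalloc is an allocation
 * plus a zeroing memset, so the separate clearing pass can be dropped.
 */
static void *fake_kzalloc(size_t size)
{
	void *p = malloc(size);

	if (p)
		memset(p, 0, size);	/* what kzalloc folds into the allocation */
	return p;
}

/*
 * Returns 1 if all 'n' pointer slots start out NULL, as the slip_devs
 * array must.  (Relies on all-bits-zero being a null pointer, as the
 * kernel code does too.)
 */
static int all_slots_clear(size_t n)
{
	void **slots = fake_kzalloc(n * sizeof(void *));
	size_t i;
	int ok = 1;

	if (!slots)
		return 0;
	for (i = 0; i < n; i++)
		if (slots[i] != NULL)
			ok = 0;
	free(slots);
	return ok;
}
```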
[RFC][PATCH 5/6] per namespace tunables
[PATCH 05/06]

This patch introduces all that is needed to process per namespace tunables.

Signed-off-by: Nadia Derbey <[EMAIL PROTECTED]>

---
 include/linux/akt.h   |   12 +++
 kernel/autotune/akt.c |   80 ++
 2 files changed, 73 insertions(+), 19 deletions(-)

Index: linux-2.6.20-rc4/include/linux/akt.h
===================================================================
--- linux-2.6.20-rc4.orig/include/linux/akt.h	2007-01-15 15:21:47.0 +0100
+++ linux-2.6.20-rc4/include/linux/akt.h	2007-01-15 15:31:44.0 +0100
@@ -154,6 +154,7 @@ struct auto_tune {
  */
 #define AUTO_TUNE_ENABLE	0x01
 #define TUNABLE_REGISTERED	0x02
+#define TUNABLE_IPC_NS		0x04
 
 
 /*
@@ -204,6 +205,8 @@ static inline int is_tunable_registered(
 }
 
 
+#define DECLARE_TUNABLE(s)	struct auto_tune s;
+
 #define DEFINE_TUNABLE(s, thr, min, max, tun, chk, type) \
 	struct auto_tune s = TUNABLE_INIT(#s, thr, min, max, tun, chk, type)
 
@@ -215,6 +218,13 @@ static inline int is_tunable_registered(
 		(s).max.abs_value.val_##type = _max;	\
 	} while (0)
 
+#define init_tunable_ipcns(ns, s, thr, min, max, tun, chk, type) \
+	do {						\
+		DEFINE_TUNABLE(s, thr, min, max, tun, chk, type); \
+		s.flags |= TUNABLE_IPC_NS;		\
+		ns->s = s;				\
+	} while (0)
+
 
 static inline void set_autotuning_routine(struct auto_tune *tunable,
 					auto_tune_fn fn)
@@ -269,7 +279,9 @@ extern ssize_t store_tunable_max(struct 
 
 #else	/* CONFIG_AKT */
 
+#define DECLARE_TUNABLE(s)
 #define DEFINE_TUNABLE(s, thresh, min, max, tun, chk, type)
+#define init_tunable_ipcns(ns, s, th, m, M, tun, chk, type) do { } while (0)
 
 #define set_tunable_min_max(s, min, max, type)		do { } while (0)
 #define set_autotuning_routine(s, fn)			do { } while (0)

Index: linux-2.6.20-rc4/kernel/autotune/akt.c
===================================================================
--- linux-2.6.20-rc4.orig/kernel/autotune/akt.c	2007-01-15 15:25:35.0 +0100
+++ linux-2.6.20-rc4/kernel/autotune/akt.c	2007-01-15 15:37:16.0 +0100
@@ -32,6 +32,7 @@
  *	store_tunable_min	(exported)
  *	show_tunable_max	(exported)
  *	store_tunable_max	(exported)
+ *	get_ns_tunable		(static)
  */
 
 #include
 
@@ -45,6 +46,8 @@
 #define AKT_AUTO	1
 #define AKT_MANUAL	0
 
+static struct auto_tune *get_ns_tunable(struct auto_tune *);
+
 
 /*
@@ -142,6 +145,7 @@ int unregister_tunable(struct auto_tune 
 ssize_t show_tuning_mode(struct auto_tune *tun_addr, char *buf)
 {
 	int valid;
+	struct auto_tune *which;
 
 	if (tun_addr == NULL) {
 		printk(KERN_ERR
@@ -149,11 +153,13 @@ ssize_t show_tuning_mode(struct auto_tun
 		return -EINVAL;
 	}
 
-	spin_lock(&tun_addr->tunable_lck);
+	which = get_ns_tunable(tun_addr);
+
+	spin_lock(&which->tunable_lck);
 
-	valid = is_auto_tune_enabled(tun_addr);
+	valid = is_auto_tune_enabled(which);
 
-	spin_unlock(&tun_addr->tunable_lck);
+	spin_unlock(&which->tunable_lck);
 
 	return snprintf(buf, PAGE_SIZE, "%d\n", valid);
 }
@@ -176,6 +182,7 @@ ssize_t store_tuning_mode(struct auto_tu
 					size_t count)
 {
 	int new_value;
+	struct auto_tune *which;
 	int rc;
 
 	if ((rc = sscanf(buffer, "%d", &new_value)) != 1)
@@ -190,18 +197,20 @@ ssize_t store_tuning_mode(struct auto_tu
 		return -EINVAL;
 	}
 
-	spin_lock(&tun_addr->tunable_lck);
+	which = get_ns_tunable(tun_addr);
+
+	spin_lock(&which->tunable_lck);
 
 	switch (new_value) {
 	case AKT_AUTO:
-		tun_addr->flags |= AUTO_TUNE_ENABLE;
+		which->flags |= AUTO_TUNE_ENABLE;
 		break;
 	case AKT_MANUAL:
-		tun_addr->flags &= ~AUTO_TUNE_ENABLE;
+		which->flags &= ~AUTO_TUNE_ENABLE;
 		break;
 	}
 
-	spin_unlock(&tun_addr->tunable_lck);
+	spin_unlock(&which->tunable_lck);
 
 	return strnlen(buffer, PAGE_SIZE);
 }
@@ -218,6 +227,7 @@ ssize_t store_tuning_mode(struct auto_tu
 ssize_t show_tunable_min(struct auto_tune *tun_addr, char *buf)
 {
 	ssize_t rc;
+	struct auto_tune *which;
 
 	if (tun_addr == NULL) {
 		printk(KERN_ERR
@@ -225,11 +235,13 @@ ssize_t show_tunable_min(struct auto_tun
 		return -EINVAL;
 	}
 
-	spin_lock(&tun_addr->tunable_lck);
+	which = get_ns_tunable(tun_addr);
 
-	rc = tun_addr->min.show(tun_addr, buf);
+	spin_lock(&which->tunable_lck);
 
-	spin_unlock(&tun_addr->tunable_lck);
+	rc = which->min.show(which, buf);
+
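The redirection this patch adds -- sysfs handlers receive the global tunable's address, and `get_ns_tunable()` swaps in the current namespace's copy when the TUNABLE_IPC_NS flag is set -- can be modeled minimally. Everything below (`fake_tunable`, `current_ns_copy`, `read_value`) is a mock of that lookup, not the patch's code.

```c
#include <assert.h>

/*
 * Mock of the per-namespace tunable lookup: global (non-IPC) tunables are
 * used directly; IPC tunables resolve to the copy in the current namespace.
 */
#define TUNABLE_IPC_NS 0x04

struct fake_tunable {
	int flags;
	int value;
};

/* one "namespace" holding its private copy of the ipc tunables */
static struct fake_tunable current_ns_copy;

static struct fake_tunable *get_ns_tunable(struct fake_tunable *t)
{
	if (!(t->flags & TUNABLE_IPC_NS))
		return t;		/* global tunable: use as-is */
	return &current_ns_copy;	/* ipc tunable: namespace instance */
}

static int read_value(struct fake_tunable *t)
{
	return get_ns_tunable(t)->value;
}

static int demo(void)
{
	static struct fake_tunable global = { 0, 5 };
	static struct fake_tunable per_ns = { TUNABLE_IPC_NS, 99 };

	current_ns_copy.value = 7;
	/* the flagged tunable reads the namespace copy (7), never 99 */
	return read_value(&global) * 100 + read_value(&per_ns);
}
```

This is also why the patch takes the lock on `which` rather than `tun_addr`: the structure actually being read or written is the namespace-local one.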
[RFC][PATCH 6/6] automatic tuning applied to some kernel components
[PATCH 06/06]

The following kernel components register a tunable structure and call the
auto-tuning routine:
  . file system
  . shared memory (per namespace)
  . semaphore (per namespace)
  . message queues (per namespace)

Signed-off-by: Nadia Derbey <[EMAIL PROTECTED]>

---
 fs/file_table.c     |   81 
 include/linux/akt.h |    1 
 include/linux/ipc.h |    6 +++
 init/main.c         |    1 
 ipc/msg.c           |   19 
 ipc/sem.c           |   41 ++
 ipc/shm.c           |   74 ---
 7 files changed, 218 insertions(+), 5 deletions(-)

Index: linux-2.6.20-rc4/fs/file_table.c
===================================================================
--- linux-2.6.20-rc4.orig/fs/file_table.c	2007-01-15 13:08:14.0 +0100
+++ linux-2.6.20-rc4/fs/file_table.c	2007-01-15 15:44:39.0 +0100
@@ -21,6 +21,8 @@
 #include
 #include
 #include
+#include
+#include
 
 #include
 
@@ -34,6 +36,71 @@ __cacheline_aligned_in_smp DEFINE_SPINLO
 
 static struct percpu_counter nr_files __cacheline_aligned_in_smp;
 
+#ifdef CONFIG_AKT
+
+static int get_nr_files(void);
+
+/** automatic tuning **/
+#define FILPTHRESH	80	/* threshold = 80% */
+
+/*
+ * FUNCTION:	This is the routine called to accomplish auto tuning for the
+ *		max_files tunable.
+ *
+ *	Upwards adjustment:
+ *		Adjustment is needed if nr_files has reached
+ *			(threshold / 100 * max_files)
+ *		In that case, max_files is set to
+ *			(tunable + max_files * (100 - threshold) / 100)
+ *
+ *	Downwards adjustment:
+ *		Adjustment is needed if nr_files has fallen under
+ *			(threshold / 100 * max_files previous value)
+ *		In that case max_files is set back to its previous value,
+ *		i.e. to (max_files * 100 / (200 - threshold))
+ *
+ * PARAMETERS:	cmd: controls the adjustment direction (up / down)
+ *		params: pointer to the registered tunable structure
+ *
+ * EXECUTION ENVIRONMENT: This routine should be called with the
+ *		params->tunable_lck lock held
+ *
+ * RETURN VALUE: 1 if tunable has been adjusted
+ *		 0 else
+ */
+static inline int maxfiles_auto_tuning(int cmd, struct auto_tune *params)
+{
+	int thr = params->threshold;
+	int min = params->min.value.val_int;
+	int max = params->max.value.val_int;
+	int tun = files_stat.max_files;
+
+	if (cmd == AKT_UP) {
+		if (get_nr_files() >= tun * thr / 100 && tun < max) {
+			int new = tun * (200 - thr) / 100;
+
+			files_stat.max_files = min(max, new);
+			return 1;
+		} else
+			return 0;
+	}
+
+	if (get_nr_files() < tun * thr / (200 - thr) && tun > min) {
+		int new = tun * 100 / (200 - thr);
+
+		files_stat.max_files = max(min, new);
+		return 1;
+	} else
+		return 0;
+}
+
+#endif /* CONFIG_AKT */
+
+/* The maximum value will be known later on */
+DEFINE_TUNABLE(maxfiles_akt, FILPTHRESH, 0, 0, &files_stat.max_files,
+		&nr_files, int);
+
 
 static inline void file_free_rcu(struct rcu_head *head)
 {
 	struct file *f = container_of(head, struct file, f_u.fu_rcuhead);
@@ -44,6 +111,8 @@ static inline void file_free(struct file
 {
 	percpu_counter_dec(&nr_files);
 	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
+
+	activate_auto_tuning(AKT_DOWN, &maxfiles_akt);
 }
 
 /*
@@ -91,6 +160,8 @@ struct file *get_empty_filp(void)
 	static int old_max;
 	struct file * f;
 
+	activate_auto_tuning(AKT_UP, &maxfiles_akt);
+
 	/*
 	 * Privileged users can go above max_files
 	 */
@@ -299,6 +370,16 @@ void __init files_init(unsigned long mem
 	files_stat.max_files = n; 
 	if (files_stat.max_files < NR_FILE)
 		files_stat.max_files = NR_FILE;
+
+	set_tunable_min_max(maxfiles_akt, n, n * 2, int);
+	set_autotuning_routine(&maxfiles_akt, maxfiles_auto_tuning);
+
 	files_defer_init();
 	percpu_counter_init(&nr_files, 0);
 }
+
+void __init files_late_init(void)
+{
+	if (register_tunable(&maxfiles_akt))
+		printk(KERN_WARNING "Failed registering tunable file-max\n");
+}

Index: linux-2.6.20-rc4/include/linux/akt.h
===================================================================
--- linux-2.6.20-rc4.orig/include/linux/akt.h	2007-01-15 15:31:44.0 +0100
+++ linux-2.6.20-rc4/include/linux/akt.h	2007-01-15 15:45:29.0 +0100
@@ -295,5 +295,6 @@ static inline void init_auto_tuning(void
 #endif	/* CONFIG_AKT */
 
 extern void fork_late_init(void);
+extern void files_late_init(void);
 
 #endif	/* AKT_H */
Index:
[RFC][PATCH 4/6] min and max kobjects
[PATCH 04/06]

Introduces the kobjects associated to each tunable min and max value.

Signed-off-by: Nadia Derbey <[EMAIL PROTECTED]>

---
 include/linux/akt.h         |   30 
 include/linux/akt_ops.h     |  311 
 kernel/autotune/akt.c       |  120 
 kernel/autotune/akt_sysfs.c |    8 +
 4 files changed, 469 insertions(+)

Index: linux-2.6.20-rc4/include/linux/akt.h
===================================================================
--- linux-2.6.20-rc4.orig/include/linux/akt.h	2007-01-15 15:08:41.0 +0100
+++ linux-2.6.20-rc4/include/linux/akt.h	2007-01-15 15:21:47.0 +0100
@@ -62,6 +62,13 @@ struct tunable_kobject {
  * auto_tune structure.
  * These values are type dependent and are used as high / low boundaries when
  * tuning up or down.
+ * The show and store routines (that are type dependent too) are here for
+ * sysfs support (since the min and max can be updated through sysfs).
+ * The abs_value field is used to check that we are not:
+ *   . falling under the very 1st min value when updating the min value
+ *     through sysfs
+ *   . going over the very 1st max value when updating the max value
+ *     through sysfs
  * The type is known when the tunable is defined (see DEFINE_TUNABLE macro).
  */
 struct typed_value {
@@ -74,6 +81,17 @@ struct typed_value {
 		long	val_long;
 		ulong	val_ulong;
 	} value;
+	union {
+		short	val_short;
+		ushort	val_ushort;
+		int	val_int;
+		uint	val_uint;
+		size_t	val_size_t;
+		long	val_long;
+		ulong	val_ulong;
+	} abs_value;
+	ssize_t (*show)(struct auto_tune *, char *);
+	ssize_t (*store)(struct auto_tune *, const char *, size_t);
 };
 
 
@@ -170,9 +188,15 @@ static inline int is_tunable_registered(
 	.threshold	= (_thresh),			\
 	.min		= {				\
 		.value = { .val_##type = (_min), },	\
+		.abs_value = { .val_##type = (_min), },	\
+		.show = show_tunable_min_##type,	\
+		.store = store_tunable_min_##type,	\
 	},						\
 	.max		= {				\
 		.value = { .val_##type = (_max), },	\
+		.abs_value = { .val_##type = (_max), },	\
+		.show = show_tunable_max_##type,	\
+		.store = store_tunable_max_##type,	\
 	},						\
 	.tun_kobj	= { .tun = NULL, },		\
 	.tunable	= (_tun),			\
@@ -186,7 +210,9 @@ static inline int is_tunable_registered(
 #define set_tunable_min_max(s, _min, _max, type)	\
 	do {						\
 		(s).min.value.val_##type = _min;	\
+		(s).min.abs_value.val_##type = _min;	\
 		(s).max.value.val_##type = _max;	\
+		(s).max.abs_value.val_##type = _max;	\
 	} while (0)
 
 
@@ -234,6 +260,10 @@ extern int unregister_tunable(struct aut
 extern int tunable_sysfs_setup(struct auto_tune *);
 extern ssize_t show_tuning_mode(struct auto_tune *, char *);
 extern ssize_t store_tuning_mode(struct auto_tune *, const char *, size_t);
+extern ssize_t show_tunable_min(struct auto_tune *, char *);
+extern ssize_t store_tunable_min(struct auto_tune *, const char *, size_t);
+extern ssize_t show_tunable_max(struct auto_tune *, char *);
+extern ssize_t store_tunable_max(struct auto_tune *, const char *, size_t);
 
 #else	/* CONFIG_AKT */

Index: linux-2.6.20-rc4/include/linux/akt_ops.h
===================================================================
--- linux-2.6.20-rc4.orig/include/linux/akt_ops.h	2007-01-15 14:28:16.0 +0100
+++ linux-2.6.20-rc4/include/linux/akt_ops.h	2007-01-15 15:22:53.0 +0100
@@ -182,5 +182,316 @@ static inline int default_auto_tuning_ul
 }
 
 
+/*
+ * member can be one of min / max
+ */
+#define __show_tunable_member(member, p, type, buf, format, y)	\
+do {								\
+	type _xx = (type) p->member.value.val_##type;		\
+								\
+	y = snprintf(buf, PAGE_SIZE, format "\n", _xx);		\
+} while (0)
+
+/*
+ * Show routines for the min and max tunables values
+ */
+static inline ssize_t show_tunable_min_short(struct auto_tune *p, char *buf)
+{
+	ssize_t _count;
+	__show_tunable_member(min, p,
[RFC][PATCH 2/6] auto_tuning activation
[PATCH 02/06]

Introduces the auto-tuning activation routine. The auto-tuning routine is
called by the fork kernel component.

Signed-off-by: Nadia Derbey <[EMAIL PROTECTED]>

---
 include/linux/akt.h |   50 ++
 kernel/exit.c       |   11 +++
 kernel/fork.c       |    2 ++
 3 files changed, 63 insertions(+)

Index: linux-2.6.20-rc4/include/linux/akt.h
===================================================================
--- linux-2.6.20-rc4.orig/include/linux/akt.h	2007-01-15 14:26:24.0 +0100
+++ linux-2.6.20-rc4/include/linux/akt.h	2007-01-15 15:00:31.0 +0100
@@ -118,12 +118,22 @@ struct auto_tune {
 /*
  * Flags for a registered tunable
  */
+#define AUTO_TUNE_ENABLE	0x01
 #define TUNABLE_REGISTERED	0x02
 
 
 /*
  * When calling this routine the tunable lock should be held
  */
+static inline int is_auto_tune_enabled(struct auto_tune *tunable)
+{
+	return (tunable->flags & AUTO_TUNE_ENABLE) == AUTO_TUNE_ENABLE;
+}
+
+
+/*
+ * When calling this routine the tunable lock should be held
+ */
 static inline int is_tunable_registered(struct auto_tune *tunable)
 {
 	return (tunable->flags & TUNABLE_REGISTERED) == TUNABLE_REGISTERED;
@@ -163,6 +173,44 @@ static inline int is_tunable_registered(
 	} while (0)
 
 
+static inline void set_autotuning_routine(struct auto_tune *tunable,
+					auto_tune_fn fn)
+{
+	if (fn != NULL)
+		tunable->auto_tune = fn;
+}
+
+
+/*
+ * direction may be one of:
+ *	AKT_UP: adjust up (i.e. increase tunable value when needed)
+ *	AKT_DOWN: adjust down (i.e. decrease tunable value when needed)
+ */
+static inline int activate_auto_tuning(int direction,
+					struct auto_tune *tunable)
+{
+	int ret = 0;
+
+	BUG_ON(direction != AKT_UP && direction != AKT_DOWN);
+
+	if (tunable == NULL)
+		return 0;
+
+	spin_lock(&tunable->tunable_lck);
+
+	if (!is_auto_tune_enabled(tunable) ||
+			!is_tunable_registered(tunable)) {
+		spin_unlock(&tunable->tunable_lck);
+		return 0;
+	}
+
+	ret = tunable->auto_tune(direction, tunable);
+
+	spin_unlock(&tunable->tunable_lck);
+	return ret;
+}
+
+
 extern int register_tunable(struct auto_tune *);
 extern int unregister_tunable(struct auto_tune *);
 
@@ -173,7 +221,9 @@ extern int unregister_tunable(struct aut
 #define DEFINE_TUNABLE(s, thresh, min, max, tun, chk, type)
 
 #define set_tunable_min_max(s, min, max, type)		do { } while (0)
+#define set_autotuning_routine(s, fn)			do { } while (0)
 
+#define activate_auto_tuning(direction, tunable)	( { 0; } )
 #define register_tunable(a)	0
 #define unregister_tunable(a)	0

Index: linux-2.6.20-rc4/kernel/fork.c
===================================================================
--- linux-2.6.20-rc4.orig/kernel/fork.c	2007-01-15 14:36:48.0 +0100
+++ linux-2.6.20-rc4/kernel/fork.c	2007-01-15 14:57:28.0 +0100
@@ -995,6 +995,8 @@ static struct task_struct *copy_process(
 	if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM))
 		return ERR_PTR(-EINVAL);
 
+	activate_auto_tuning(AKT_UP, &max_threads_akt);
+
 	retval = security_task_create(clone_flags);
 	if (retval)
 		goto fork_out;

Index: linux-2.6.20-rc4/kernel/exit.c
===================================================================
--- linux-2.6.20-rc4.orig/kernel/exit.c	2007-01-15 13:08:15.0 +0100
+++ linux-2.6.20-rc4/kernel/exit.c	2007-01-15 14:58:23.0 +0100
@@ -42,12 +42,15 @@
 #include	/* for audit_free() */
 #include
 #include
+#include
 
 #include
 #include
 #include
 #include
 
+extern struct auto_tune max_threads_akt;
+
 extern void sem_exit (void);
 
 static void exit_mm(struct task_struct * tsk);
@@ -172,6 +175,14 @@ repeat:
 	sched_exit(p);
 	write_unlock_irq(&tasklist_lock);
+
+	/*
+	 * nr_threads has been decremented in __unhash_process: adjust
+	 * max_threads down if needed.
+	 * We do it here to avoid calling activate_auto_tuning under lock.
+	 */
+	activate_auto_tuning(AKT_DOWN, &max_threads_akt);
+
 	proc_flush_task(p);
 	release_thread(p);
 	call_rcu(&p->rcu, delayed_put_task_struct);
--
[RFC][PATCH 3/6] tunables associated kobjects
[PATCH 03/06]

Introduces the kobjects associated to each tunable and the sysfs
registration.

Signed-off-by: Nadia Derbey <[EMAIL PROTECTED]>

---
 include/linux/akt.h         |   25 -
 init/main.c                 |    1 
 kernel/autotune/Makefile    |    2 
 kernel/autotune/akt.c       |   86 +
 kernel/autotune/akt_sysfs.c |  214 
 5 files changed, 324 insertions(+), 4 deletions(-)

Index: linux-2.6.20-rc4/include/linux/akt.h
===================================================================
--- linux-2.6.20-rc4.orig/include/linux/akt.h	2007-01-15 15:00:31.0 +0100
+++ linux-2.6.20-rc4/include/linux/akt.h	2007-01-15 15:08:41.0 +0100
@@ -48,6 +48,16 @@ typedef int (*auto_tune_fn)(int, struct 
 
 
 /*
+ * for sysfs support
+ */
+struct tunable_kobject {
+	struct kobject		kobj;
+	struct auto_tune	*tun;
+};
+
+
+
+/*
  * Structure used to describe the min / max values for a tunable inside the
  * auto_tune structure.
  * These values are type dependent and are used as high / low boundaries when
@@ -73,7 +83,12 @@ struct typed_value {
  * allocated for each registered tunable, and the associated kobject exported
  * via sysfs.
  *
- * The structure lock (tunable_lck) protects
+ * This structure may be accessed in 2 ways:
+ *   . directly from inside the kernel subsystem that uses it (during
+ *     tunable automatic adjustment)
+ *   . from sysfs, while updating the kobject attributes
+ *
+ * In both cases, the structure lock (tunable_lck) is taken: it protects
  * against concurrent accesses to tunable and checked pointers
  *
  * A pointer to this structure is passed in to the automatic adjustment
@@ -108,6 +123,7 @@ struct auto_tune {
 				/* and associated show / store routines) */
 	struct typed_value max;	/* max value the tunable can ever reach */
 				/* and associated show / store routines) */
+	struct tunable_kobject	tun_kobj;	/* used for sysfs support */
 	void *tunable;	/* address of the tunable to adjust */
 	void *checked;	/* address of the variable that is controlled by */
 			/* the tunable. This is the calling subsystem's */
@@ -158,6 +174,7 @@ static inline int is_tunable_registered(
 	.max		= {				\
 		.value = { .val_##type = (_max), },	\
 	},						\
+	.tun_kobj	= { .tun = NULL, },		\
 	.tunable	= (_tun),			\
 	.checked	= (_chk),			\
 }
@@ -211,9 +228,12 @@ static inline int activate_auto_tuning(i
 }
 
 
-
+extern void init_auto_tuning(void);
 extern int register_tunable(struct auto_tune *);
 extern int unregister_tunable(struct auto_tune *);
+extern int tunable_sysfs_setup(struct auto_tune *);
+extern ssize_t show_tuning_mode(struct auto_tune *, char *);
+extern ssize_t store_tuning_mode(struct auto_tune *, const char *, size_t);
 
 #else	/* CONFIG_AKT */
 
@@ -228,6 +248,7 @@ extern int unregister_tunable(struct aut
 #define register_tunable(a)	0
 #define unregister_tunable(a)	0
 
+static inline void init_auto_tuning(void) { }
 
 #endif	/* CONFIG_AKT */

Index: linux-2.6.20-rc4/init/main.c
===================================================================
--- linux-2.6.20-rc4.orig/init/main.c	2007-01-15 14:29:17.0 +0100
+++ linux-2.6.20-rc4/init/main.c	2007-01-15 15:09:27.0 +0100
@@ -614,6 +614,7 @@ asmlinkage void __init start_kernel(void
 	signals_init();
 	/* rootfs populating might need page-writeback */
 	page_writeback_init();
+	init_auto_tuning();
 	fork_late_init();
 #ifdef CONFIG_PROC_FS
 	proc_root_init();

Index: linux-2.6.20-rc4/kernel/autotune/Makefile
===================================================================
--- linux-2.6.20-rc4.orig/kernel/autotune/Makefile	2007-01-15 14:31:57.0 +0100
+++ linux-2.6.20-rc4/kernel/autotune/Makefile	2007-01-15 15:09:57.0 +0100
@@ -2,6 +2,6 @@
 # Makefile for akt
 #
 
-obj-y := akt.o
+obj-y := akt.o akt_sysfs.o

Index: linux-2.6.20-rc4/kernel/autotune/akt.c
===================================================================
--- linux-2.6.20-rc4.orig/kernel/autotune/akt.c	2007-01-15 14:51:54.0 +0100
+++ linux-2.6.20-rc4/kernel/autotune/akt.c	2007-01-15 15:13:31.0 +0100
@@ -26,6 +26,8 @@
 * FUNCTIONS:
 *	register_tunable	(exported)
 *	unregister_tunable	(exported)
+ *	show_tuning_mode	(exported)
+ *	store_tuning_mode	(exported)
 */
 
 #include
 
@@ -36,6 +38,8 @@
 
+#define AKT_AUTO	1
+#define
[RFC][PATCH 1/6] Tunable structure and registration routines
[PATCH 01/06] Defines the auto_tune structure: this is the structure that contains the information needed by the adjustment routine for a given tunable. Also defines the registration routines. The fork kernel component defines a tunable structure for the threads-max tunable and registers it. Signed-off-by: Nadia Derbey <[EMAIL PROTECTED]> --- Documentation/00-INDEX |2 Documentation/auto_tune.txt | 333 fs/Kconfig |2 include/linux/akt.h | 186 include/linux/akt_ops.h | 186 init/main.c |2 kernel/Makefile |1 kernel/autotune/Kconfig | 30 +++ kernel/autotune/Makefile|7 kernel/autotune/akt.c | 123 kernel/fork.c | 18 ++ 11 files changed, 890 insertions(+) Index: linux-2.6.20-rc4/Documentation/00-INDEX === --- linux-2.6.20-rc4.orig/Documentation/00-INDEX2007-01-15 13:08:13.0 +0100 +++ linux-2.6.20-rc4/Documentation/00-INDEX 2007-01-15 14:17:22.0 +0100 @@ -52,6 +52,8 @@ applying-patches.txt - description of various trees and how to apply their patches. arm/ - directory with info about Linux on the ARM architecture. +auto_tune.txt + - info on the Automatic Kernel Tunables (AKT) feature. basic_profiling.txt - basic instructions for those who wants to profile Linux kernel. binfmt_misc.txt Index: linux-2.6.20-rc4/Documentation/auto_tune.txt === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6.20-rc4/Documentation/auto_tune.txt2007-01-15 14:19:18.0 +0100 @@ -0,0 +1,333 @@ + Automatic Kernel Tunables += + + Nadia Derbey ([EMAIL PROTECTED]) + + + +This feature aims at making the kernel automatically change the tunables +values as it sees resources running out. + +The AKT framework is made of 2 parts: + +1) Kernel part: +Interfaces are provided to the kernel subsystems, to (un)register the +tunables that might be automatically tuned in the future. 
+ +Registering a tunable consists in the following steps: +- a structure is declared and filled by the kernel subsystem for the +registered tunable +- that tunable structure is registered into sysfs + +Registration should be done during the kernel subsystem initialization step. + +Unregistering a tunable is the reverse operation. It should not be necessary +for the kernel subsystems: it is only useful when unloading modules that would +have registered a tunable during their loading step. + +The routines interfaces are the following: + +1.1) Declaring a tunable: + +A tunable structure should be declared and defined by the kernel subsystems as +follows: + +DEFINE_TUNABLE(structure_name, threshold, min, max, + tunable_variable_ptr, checked_variable_ptr, + tunable_variable_type); + +Parameters: +- structure_name: this is the name of the tunable structure + +- threshold: percentage to apply to the tunable value to detect if adjustment +is needed + +- min: minimum value the tunable can ever reach (needed when adjusting down +the tunable) + +- max: maximum value the tunable can ever reach (needed when adjusting up the +tunable) + +- tunable_variable_ptr: address of the tunable that will be adjusted if +needed. +(ex: in kernel/fork.c it is max_threads's address) + +- checked_variable_ptr: address of the variable that is controlled by the +tunable. This is the calling subsystem's object counter. +(ex: in kernel/fork.c it is nr_threads's address: nr_threads should +always remain < max_threads) + +- tunable_variable_type: this type is important since it helps choosing the +appropriate automatic tuning routine. +It can be one of short / ushort / int / uint / size_t / long / ulong + +The automatic tuning routine (i.e. the routine that should be called when +automatic tuning is activated) is set to the default one: +default_auto_tuning_(). + is chosen according to the tunable_variable_type parameters. +All the previously listed parameters are useful to this routine. 
+Refer to the description of the automatic adjustment routine to see how +these parameters are actually used. + +Refer to "Updating the auto-tuning function pointer" to know how to set +this routine to another one. + + +1.2) Updating a tunable's characteristics + +1.2.1) Updating min / max values: + +Sometimes, when calling DEFINE_TUNABLE(), the min and max values are not +exactly known, yet. In that case, the following routine should be called +once these values are known: + +set_tunable_min_max(structure_name, new_min, new_max) + +Parameters: +- structure_name: this is the name of the tunable structure + +- new_min: minimum value the tunable can
[RFC][PATCH 0/6] Automatic kernel tunables (AKT)
This is a series of patches that introduces a feature that makes the kernel automatically change tunable values as it sees resources running out.

The AKT framework is made of 2 parts:

1) Kernel part:
Interfaces are provided to the kernel subsystems to (un)register the tunables that might be automatically tuned in the future. Registering a tunable consists of the following steps:
- a structure is declared and filled by the kernel subsystem for the registered tunable
- that tunable structure is registered into sysfs
Registration should be done during the kernel subsystem initialization step.

Another interface is provided to the kernel subsystems to activate the automatic tuning for a registered tunable. It can be called during resource allocation to tune up, and during resource freeing to tune down, the registered tunable. The automatic tuning routine is called only if the tunable has been enabled for automatic tuning in sysfs.

2) User part:
AKT uses sysfs to enable tunable management from user space (mainly making tunables automatic or manual). akt uses sysfs in the following way:
- a tunables subsystem (tunables_subsys) is declared and registered during akt initialization.
- registering a tunable is equivalent to registering the corresponding kobject within that subsystem.
- each tunable kobject has 3 associated attributes, all with RW mode (i.e. the show() and store() methods are provided for them):
  . autotune: (de)activates automatic tuning for the tunable
  . max: sets a new maximum value for the tunable
  . min: sets a new minimum value for the tunable

The only way to activate automatic tuning is from the user side:
- the directory /sys/tunables is created during the init phase.
- each time a tunable is registered by a kernel subsystem, a directory is created for it under /sys/tunables.
- this directory contains 1 file for each tunable kobject attribute.

These patches should be applied to 2.6.20-rc4, in the following order:
[PATCH 1/6]: tunables_registration.patch
[PATCH 2/6]: auto_tuning_activation.patch
[PATCH 3/6]: auto_tuning_kobjects.patch
[PATCH 4/6]: tunable_min_max_kobjects.patch
[PATCH 5/6]: per_namespace_tunables.patch
[PATCH 6/6]: auto_tune_applied.patch

--
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.20-rc5: known unfixed regressions
On Sat, Jan 13, 2007 at 08:11:25AM +0100, Adrian Bunk wrote:
> On Fri, Jan 12, 2007 at 02:27:48PM -0500, Linus Torvalds wrote:
> >...
> > A lot of developers (including me) will be gone next week for
> > Linux.Conf.Au, so you have a week of rest and quiet to test this, and
> > report any problems.
> >
> > Not that there will be any, right? You all behave now!
> >...
>
> This still leaves the old regressions we have not yet fixed...
>
> This email lists some known regressions in 2.6.20-rc5 compared to 2.6.19.
>
> Subject    : BUG: at mm/truncate.c:60 cancel_dirty_page() (XFS)
> References : http://lkml.org/lkml/2007/1/5/308
> Submitter  : Sami Farin <[EMAIL PROTECTED]>
> Handled-By : David Chinner <[EMAIL PROTECTED]>
> Status     : problem is being discussed

I'm at LCA and have been having laptop dramas, so the fix is being held up at this point. I am trying to test a change right now that adds an optional unmap to truncate_inode_pages_range: XFS needs, in some circumstances, to toss out dirty pages (with dirty bufferheads) and hence requires truncate semantics that currently lack unmap calls. Semi-untested patch attached below.

Cheers, Dave.
-- Dave Chinner Principal Engineer SGI Australian Software Group fs/xfs/linux-2.6/xfs_fs_subr.c |6 ++-- include/linux/mm.h |2 + mm/truncate.c | 60 - 3 files changed, 60 insertions(+), 8 deletions(-) Index: linux-2.6.19/fs/xfs/linux-2.6/xfs_fs_subr.c === --- linux-2.6.19.orig/fs/xfs/linux-2.6/xfs_fs_subr.c2006-10-03 23:22:36.0 +1000 +++ linux-2.6.19/fs/xfs/linux-2.6/xfs_fs_subr.c 2007-01-17 01:24:51.771273750 +1100 @@ -32,7 +32,8 @@ fs_tosspages( struct inode*ip = vn_to_inode(vp); if (VN_CACHED(vp)) - truncate_inode_pages(ip->i_mapping, first); + truncate_unmap_inode_pages_range(ip->i_mapping, +first, last, 1); } void @@ -49,7 +50,8 @@ fs_flushinval_pages( if (VN_TRUNC(vp)) VUNTRUNCATE(vp); filemap_write_and_wait(ip->i_mapping); - truncate_inode_pages(ip->i_mapping, first); + truncate_unmap_inode_pages_range(ip->i_mapping, +first, last, 1); } } Index: linux-2.6.19/include/linux/mm.h === --- linux-2.6.19.orig/include/linux/mm.h2007-01-17 01:21:16.01779 +1100 +++ linux-2.6.19/include/linux/mm.h 2007-01-17 01:24:51.775274000 +1100 @@ -1058,6 +1058,8 @@ extern unsigned long page_unuse(struct p extern void truncate_inode_pages(struct address_space *, loff_t); extern void truncate_inode_pages_range(struct address_space *, loff_t lstart, loff_t lend); +extern void truncate_unmap_inode_pages_range(struct address_space *, + loff_t lstart, loff_t lend, int unmap); /* generic vm_area_ops exported for stackable file systems */ extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, int *); Index: linux-2.6.19/mm/truncate.c === --- linux-2.6.19.orig/mm/truncate.c 2007-01-17 01:21:23.074231000 +1100 +++ linux-2.6.19/mm/truncate.c 2007-01-17 01:24:51.779274250 +1100 @@ -59,7 +59,7 @@ void cancel_dirty_page(struct page *page WARN_ON(++warncount < 5); } - + if (TestClearPageDirty(page)) { struct address_space *mapping = page->mapping; if (mapping && mapping_cap_account_dirty(mapping)) { @@ -122,16 +122,34 @@ invalidate_complete_page(struct address_ return ret; } 
+/* + * This is a helper for truncate_unmap_inode_page. Unmap the page we + * are passed. Page must be locked by the caller. + */ +static void +unmap_single_page(struct address_space *mapping, struct page *page) +{ + BUG_ON(!PageLocked(page)); + while (page_mapped(page)) { + unmap_mapping_range(mapping, + (loff_t)page->index << PAGE_CACHE_SHIFT, + PAGE_CACHE_SIZE, 0); + } +} + /** - * truncate_inode_pages - truncate range of pages specified by start and + * truncate_unmap_inode_pages_range - truncate range of pages specified by + * start and end byte offsets and optionally unmap them first. * end byte offsets * @mapping: mapping to truncate * @lstart: offset from which to truncate * @lend: offset to which to truncate + * @unmap: unmap whole truncated pages if non-zero * * Truncate the page cache, removing the pages that are between * specified offsets (and zeroing out partial page - * (if lstart is not page aligned)). + * (if lstart is not page aligned)). If specified, unmap the pages
[PATCH] Remove a number of "dead" config variables.
Remove Kconfig entries (and some documentation) for apparently "dead" config variables. Signed-off-by: Robert P. J. Day <[EMAIL PROTECTED]> --- A script I threw together identified the following as apparently useless config variables. By "useless," I mean that they: 1) aren't consulted by any Makefile 2) aren't checked by any source or header file 3) don't further select any Kconfig settings etc. In short, they don't seem to be able to affect the build in any way. The variables that are being removed: USB_SERIAL_SAFE_PADDED AEDSP16_MPU401 X86_XADD PARIDE_PARPORT AIC7XXX_PROBE_EISA_VL AIC79XX_ENABLE_RD_STRM SCSI_NCR53C8XX_PROFILE 53C700_IO_MAPPED ZISOFS_FS DLCI_COUNT MOUSE_ATIXL LCD_DEVICE The removal was compile tested based on "make allyesconfig". If any of these variables are still being used in some way, they are keeping it very well hidden. Documentation/scsi/ncr53c8xx.txt |5 - arch/arm/configs/pnx4008_defconfig |1 - arch/i386/Kconfig.cpu|5 - arch/um/config.release |1 - drivers/block/paride/Kconfig |8 +--- drivers/input/mouse/Kconfig |6 -- drivers/net/wan/Kconfig | 11 --- drivers/scsi/Kconfig | 16 drivers/scsi/aic7xxx/Kconfig.aic79xx | 12 drivers/scsi/aic7xxx/Kconfig.aic7xxx | 10 -- drivers/usb/serial/Kconfig |4 drivers/video/backlight/Kconfig |5 - fs/Kconfig |6 -- sound/oss/Kconfig| 12 14 files changed, 1 insertion(+), 101 deletions(-) diff --git a/Documentation/scsi/ncr53c8xx.txt b/Documentation/scsi/ncr53c8xx.txt index caf10b1..88ef88b 100644 --- a/Documentation/scsi/ncr53c8xx.txt +++ b/Documentation/scsi/ncr53c8xx.txt @@ -562,11 +562,6 @@ if only one has a flaw for some SCSI feature, you can disable the support by the driver of this feature at linux start-up and enable this feature after boot-up only for devices that support it safely. -CONFIG_SCSI_NCR53C8XX_PROFILE_SUPPORT (default answer: n) -This option must be set for profiling information to be gathered -and printed out through the proc file system. This features may -impact performances. 
- CONFIG_SCSI_NCR53C8XX_IOMAPPED (default answer: n) Answer "y" if you suspect your mother board to not allow memory mapped I/O. May slow down performance a little. This option is required by diff --git a/arch/arm/configs/pnx4008_defconfig b/arch/arm/configs/pnx4008_defconfig index b5e11aa..268b292 100644 --- a/arch/arm/configs/pnx4008_defconfig +++ b/arch/arm/configs/pnx4008_defconfig @@ -1395,7 +1395,6 @@ CONFIG_AUTOFS4_FS=m CONFIG_ISO9660_FS=m CONFIG_JOLIET=y CONFIG_ZISOFS=y -CONFIG_ZISOFS_FS=m CONFIG_UDF_FS=m CONFIG_UDF_NLS=y diff --git a/arch/i386/Kconfig.cpu b/arch/i386/Kconfig.cpu index 2aecfba..b99c0e2 100644 --- a/arch/i386/Kconfig.cpu +++ b/arch/i386/Kconfig.cpu @@ -226,11 +226,6 @@ config X86_CMPXCHG depends on !M386 default y -config X86_XADD - bool - depends on !M386 - default y - config X86_L1_CACHE_SHIFT int default "7" if MPENTIUM4 || X86_GENERIC diff --git a/arch/um/config.release b/arch/um/config.release index fc68bcb..861b59b 100644 --- a/arch/um/config.release +++ b/arch/um/config.release @@ -253,7 +253,6 @@ CONFIG_LOCKD_V4=y # CONFIG_NCPFS_SMALLDOS is not set # CONFIG_NCPFS_NLS is not set # CONFIG_NCPFS_EXTRAS is not set -# CONFIG_ZISOFS_FS is not set CONFIG_ZLIB_FS_INFLATE=m # diff --git a/drivers/block/paride/Kconfig b/drivers/block/paride/Kconfig index c0d2854..28cf308 100644 --- a/drivers/block/paride/Kconfig +++ b/drivers/block/paride/Kconfig @@ -2,14 +2,8 @@ # PARIDE configuration # # PARIDE doesn't need PARPORT, but if PARPORT is configured as a module, -# PARIDE must also be a module. The bogus CONFIG_PARIDE_PARPORT option -# controls the choices given to the user ... +# PARIDE must also be a module. # PARIDE only supports PC style parports. Tough for USB or other parports... 
-config PARIDE_PARPORT - tristate - depends on PARIDE!=n - default m if PARPORT_PC=m - default y if PARPORT_PC!=m comment "Parallel IDE high-level drivers" depends on PARIDE diff --git a/drivers/input/mouse/Kconfig b/drivers/input/mouse/Kconfig index 35d998c..0befb49 100644 --- a/drivers/input/mouse/Kconfig +++ b/drivers/input/mouse/Kconfig @@ -60,12 +60,6 @@ config MOUSE_INPORT To compile this driver as a module, choose M here: the module will be called inport. -config MOUSE_ATIXL - bool "ATI XL variant" - depends on MOUSE_INPORT - help - Say Y here if your mouse is of the ATI XL variety. - config MOUSE_LOGIBM tristate "Logitech busmouse" depends on ISA diff --git a/drivers/net/wan/Kconfig b/drivers/net/wan/Kconfig index 21f76f5..b550b51 100644
[RFC 4/8] Per cpuset dirty ratio handling and writeout
Make page writeback obey cpuset constraints

Currently dirty throttling does not work properly in a cpuset. If, for example, a cpuset contains only 1/10th of available memory then all of the memory of the cpuset can be dirtied without any writes being triggered. If we are writing to a device that is mounted via NFS then the write operation may be terminated with OOM since NFS is not allowed to allocate more pages for writeout. If all of the cpuset's memory is dirty then only 10% of total memory is dirty. The background writeback threshold is usually set at 10% and the synchronous threshold at 40%. So we are still below the global limits while the dirty ratio in the cpuset is 100%!

This patch makes dirty writeout cpuset aware. When determining the dirty limits in get_dirty_limits() we calculate values based on the nodes that are reachable from the current process (which has been dirtying the page). Then we can trigger writeout based on the dirty ratio of the memory in the cpuset.

We trigger writeout in a cpuset-specific way. We go through the dirty inodes and search for inodes that have dirty pages on the nodes of the active cpuset. If an inode fulfills that requirement then we begin writeout of the dirty pages of that inode.

Adding up all the counters for each node in a cpuset may seem to be quite an expensive operation (in particular for large cpusets with hundreds of nodes) compared to just accessing the global counters if we do not have a cpuset. However, please remember that I only recently introduced the global counters. Before 2.6.18 we added up per processor counters for each processor on each invocation of get_dirty_limits(). We now add up per node information, which I think is equal or less effort since there are fewer nodes than processors.
Christoph Lameter <[EMAIL PROTECTED]> Index: linux-2.6.20-rc5/include/linux/writeback.h === --- linux-2.6.20-rc5.orig/include/linux/writeback.h 2007-01-15 21:34:43.0 -0600 +++ linux-2.6.20-rc5/include/linux/writeback.h 2007-01-15 21:37:05.209897874 -0600 @@ -59,11 +59,12 @@ struct writeback_control { unsigned for_reclaim:1; /* Invoked from the page allocator */ unsigned for_writepages:1; /* This is a writepages() call */ unsigned range_cyclic:1;/* range_start is cyclic */ + nodemask_t *nodes; /* Set of nodes of interest */ }; /* * fs/fs-writeback.c - */ + */ void writeback_inodes(struct writeback_control *wbc); void wake_up_inode(struct inode *inode); int inode_wait(void *); Index: linux-2.6.20-rc5/mm/page-writeback.c === --- linux-2.6.20-rc5.orig/mm/page-writeback.c 2007-01-15 21:34:43.0 -0600 +++ linux-2.6.20-rc5/mm/page-writeback.c2007-01-15 21:35:28.013794159 -0600 @@ -103,6 +103,14 @@ EXPORT_SYMBOL(laptop_mode); static void background_writeout(unsigned long _min_pages, nodemask_t *nodes); +struct dirty_limits { + long thresh_background; + long thresh_dirty; + unsigned long nr_dirty; + unsigned long nr_unstable; + unsigned long nr_writeback; +}; + /* * Work out the current dirty-memory clamping and background writeout * thresholds. @@ -120,31 +128,74 @@ static void background_writeout(unsigned * We make sure that the background writeout level is below the adjusted * clamping level. 
*/ -static void -get_dirty_limits(long *pbackground, long *pdirty, - struct address_space *mapping) +static int +get_dirty_limits(struct dirty_limits *dl, struct address_space *mapping, + nodemask_t *nodes) { int background_ratio; /* Percentages */ int dirty_ratio; int unmapped_ratio; long background; long dirty; - unsigned long available_memory = vm_total_pages; + unsigned long available_memory; + unsigned long high_memory; + unsigned long nr_mapped; struct task_struct *tsk; + int is_subset = 0; +#ifdef CONFIG_CPUSETS + /* +* Calculate the limits relative to the current cpuset if necessary. +*/ + if (unlikely(nodes && + !nodes_subset(node_online_map, *nodes))) { + int node; + + is_subset = 1; + memset(dl, 0, sizeof(struct dirty_limits)); + available_memory = 0; + high_memory = 0; + nr_mapped = 0; + for_each_node_mask(node, *nodes) { + if (!node_online(node)) + continue; + dl->nr_dirty += node_page_state(node, NR_FILE_DIRTY); + dl->nr_unstable += + node_page_state(node, NR_UNSTABLE_NFS); + dl->nr_writeback += + node_page_state(node,
[RFC 6/8] Throttle vm writeout per cpuset
Throttle VM writeout in a cpuset aware way

This bases the VM throttling from the reclaim path on the dirty ratio of the cpuset. Note that cpuset-based throttling is only effective if shrink_zone is called from direct reclaim: kswapd has a cpuset context that includes the whole machine and will therefore not throttle unless global limits are reached.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.20-rc5/include/linux/writeback.h === --- linux-2.6.20-rc5.orig/include/linux/writeback.h 2007-01-15 21:37:05.209897874 -0600 +++ linux-2.6.20-rc5/include/linux/writeback.h 2007-01-15 21:37:33.283671963 -0600 @@ -85,7 +85,7 @@ static inline void wait_on_inode(struct int wakeup_pdflush(long nr_pages, nodemask_t *nodes); void laptop_io_completion(void); void laptop_sync_completion(void); -void throttle_vm_writeout(void); +void throttle_vm_writeout(nodemask_t *); /* These are exported to sysctl. */ extern int dirty_background_ratio; Index: linux-2.6.20-rc5/mm/page-writeback.c === --- linux-2.6.20-rc5.orig/mm/page-writeback.c 2007-01-15 21:35:28.013794159 -0600 +++ linux-2.6.20-rc5/mm/page-writeback.c 2007-01-15 21:37:33.302228293 -0600 @@ -349,12 +349,12 @@ void balance_dirty_pages_ratelimited_nr( } EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr); -void throttle_vm_writeout(void) +void throttle_vm_writeout(nodemask_t *nodes) { struct dirty_limits dl; for ( ; ; ) { - get_dirty_limits(&dl, NULL, &node_online_map); + get_dirty_limits(&dl, NULL, nodes); /* * Boost the allowable dirty threshold a bit for page Index: linux-2.6.20-rc5/mm/vmscan.c === --- linux-2.6.20-rc5.orig/mm/vmscan.c 2007-01-15 21:37:26.605346439 -0600 +++ linux-2.6.20-rc5/mm/vmscan.c 2007-01-15 21:37:33.316878027 -0600 @@ -949,7 +949,7 @@ static unsigned long shrink_zone(int pri } } - throttle_vm_writeout(); + throttle_vm_writeout(&cpuset_current_mems_allowed); atomic_dec(&zone->reclaim_in_progress); return nr_reclaimed;
[RFC 7/8] Exclude unreclaimable pages from dirty ratio calculation
Consider unreclaimable pages during dirty limit calculation Tracking unreclaimable pages helps us to calculate the dirty ratio the right way. If a large number of unreclaimable pages are allocated (through the slab or through huge pages) then write throttling will no longer work since the limit cannot be reached anymore. So we simply subtract the number of unreclaimable pages from the pages considered for writeout threshold calculation. Other code that allocates significant amounts of memory for device drivers etc could also be modified to take advantage of this functionality. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Index: linux-2.6.20-rc5/include/linux/mmzone.h === --- linux-2.6.20-rc5.orig/include/linux/mmzone.h2007-01-12 12:54:26.0 -0600 +++ linux-2.6.20-rc5/include/linux/mmzone.h 2007-01-15 21:37:37.579950696 -0600 @@ -53,6 +53,7 @@ enum zone_stat_item { NR_FILE_PAGES, NR_SLAB_RECLAIMABLE, NR_SLAB_UNRECLAIMABLE, + NR_UNRECLAIMABLE, NR_PAGETABLE, /* used for pagetables */ NR_FILE_DIRTY, NR_WRITEBACK, Index: linux-2.6.20-rc5/fs/proc/proc_misc.c === --- linux-2.6.20-rc5.orig/fs/proc/proc_misc.c 2007-01-12 12:54:26.0 -0600 +++ linux-2.6.20-rc5/fs/proc/proc_misc.c2007-01-15 21:37:37.641479580 -0600 @@ -174,6 +174,7 @@ static int meminfo_read_proc(char *page, "Slab: %8lu kB\n" "SReclaimable: %8lu kB\n" "SUnreclaim: %8lu kB\n" + "Unreclaimabl: %8lu kB\n" "PageTables: %8lu kB\n" "NFS_Unstable: %8lu kB\n" "Bounce: %8lu kB\n" @@ -205,6 +206,7 @@ static int meminfo_read_proc(char *page, global_page_state(NR_SLAB_UNRECLAIMABLE)), K(global_page_state(NR_SLAB_RECLAIMABLE)), K(global_page_state(NR_SLAB_UNRECLAIMABLE)), + K(global_page_state(NR_UNRECLAIMABLE)), K(global_page_state(NR_PAGETABLE)), K(global_page_state(NR_UNSTABLE_NFS)), K(global_page_state(NR_BOUNCE)), Index: linux-2.6.20-rc5/mm/hugetlb.c === --- linux-2.6.20-rc5.orig/mm/hugetlb.c 2007-01-12 12:54:26.0 -0600 +++ linux-2.6.20-rc5/mm/hugetlb.c 2007-01-15 21:37:37.664919155 -0600 @@ -115,6 +115,8 @@ 
static int alloc_fresh_huge_page(void) nr_huge_pages_node[page_to_nid(page)]++; spin_unlock(_lock); put_page(page); /* free it into the hugepage allocator */ + mod_zone_page_state(page_zone(page), NR_UNRECLAIMABLE, + HPAGE_SIZE / PAGE_SIZE); return 1; } return 0; @@ -183,6 +185,8 @@ static void update_and_free_page(struct 1 << PG_dirty | 1 << PG_active | 1 << PG_reserved | 1 << PG_private | 1<< PG_writeback); } + mod_zone_page_state(page_zone(page), NR_UNRECLAIMABLE, + - (HPAGE_SIZE / PAGE_SIZE)); page[1].lru.next = NULL; set_page_refcounted(page); __free_pages(page, HUGETLB_PAGE_ORDER); Index: linux-2.6.20-rc5/mm/vmstat.c === --- linux-2.6.20-rc5.orig/mm/vmstat.c 2007-01-12 12:54:26.0 -0600 +++ linux-2.6.20-rc5/mm/vmstat.c2007-01-15 21:37:37.686405431 -0600 @@ -459,6 +459,7 @@ static const char * const vmstat_text[] "nr_file_pages", "nr_slab_reclaimable", "nr_slab_unreclaimable", + "nr_unreclaimable", "nr_page_table_pages", "nr_dirty", "nr_writeback", Index: linux-2.6.20-rc5/mm/page-writeback.c === --- linux-2.6.20-rc5.orig/mm/page-writeback.c 2007-01-15 21:37:33.302228293 -0600 +++ linux-2.6.20-rc5/mm/page-writeback.c2007-01-15 21:37:37.697148570 -0600 @@ -165,7 +165,9 @@ get_dirty_limits(struct dirty_limits *dl dl->nr_writeback += node_page_state(node, NR_WRITEBACK); available_memory += - NODE_DATA(node)->node_present_pages; + NODE_DATA(node)->node_present_pages + - node_page_state(node, NR_UNRECLAIMABLE) + - node_page_state(node, NR_SLAB_UNRECLAIMABLE); #ifdef CONFIG_HIGHMEM high_memory += NODE_DATA(node) ->node_zones[ZONE_HIGHMEM]->present_pages; @@ -180,7 +182,9 @@ get_dirty_limits(struct dirty_limits *dl dl->nr_dirty =
[RFC 3/8] Add a nodemask to pdflush functions
pdflush: Allow the passing of a nodemask parameter If we want to support nodeset specific writeout then we need a way to communicate the set of nodes that an operation should affect. So add a nodemask_t parameter to the pdflush functions and also store the nodemask in the pdflush control structure. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> Index: linux-2.6.20-rc5/include/linux/writeback.h === --- linux-2.6.20-rc5.orig/include/linux/writeback.h 2007-01-15 21:34:38.564104333 -0600 +++ linux-2.6.20-rc5/include/linux/writeback.h 2007-01-15 21:34:43.135798088 -0600 @@ -81,7 +81,7 @@ static inline void wait_on_inode(struct /* * mm/page-writeback.c */ -int wakeup_pdflush(long nr_pages); +int wakeup_pdflush(long nr_pages, nodemask_t *nodes); void laptop_io_completion(void); void laptop_sync_completion(void); void throttle_vm_writeout(void); @@ -109,7 +109,8 @@ balance_dirty_pages_ratelimited(struct a balance_dirty_pages_ratelimited_nr(mapping, 1); } -int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0); +int pdflush_operation(void (*fn)(unsigned long, nodemask_t *nodes), + unsigned long arg0, nodemask_t *nodes); extern int generic_writepages(struct address_space *mapping, struct writeback_control *wbc); int do_writepages(struct address_space *mapping, struct writeback_control *wbc); Index: linux-2.6.20-rc5/mm/page-writeback.c === --- linux-2.6.20-rc5.orig/mm/page-writeback.c 2007-01-15 21:34:38.573870823 -0600 +++ linux-2.6.20-rc5/mm/page-writeback.c2007-01-15 21:34:43.150447823 -0600 @@ -101,7 +101,7 @@ EXPORT_SYMBOL(laptop_mode); /* End of sysctl-exported parameters */ -static void background_writeout(unsigned long _min_pages); +static void background_writeout(unsigned long _min_pages, nodemask_t *nodes); /* * Work out the current dirty-memory clamping and background writeout @@ -244,7 +244,7 @@ static void balance_dirty_pages(struct a */ if ((laptop_mode && pages_written) || (!laptop_mode && (nr_reclaimable > background_thresh))) - 
pdflush_operation(background_writeout, 0); + pdflush_operation(background_writeout, 0, NULL); } void set_page_dirty_balance(struct page *page) @@ -325,7 +325,7 @@ void throttle_vm_writeout(void) * writeback at least _min_pages, and keep writing until the amount of dirty * memory is less than the background threshold, or until we're all clean. */ -static void background_writeout(unsigned long _min_pages) +static void background_writeout(unsigned long _min_pages, nodemask_t *unused) { long min_pages = _min_pages; struct writeback_control wbc = { @@ -365,12 +365,12 @@ static void background_writeout(unsigned * the whole world. Returns 0 if a pdflush thread was dispatched. Returns * -1 if all pdflush threads were busy. */ -int wakeup_pdflush(long nr_pages) +int wakeup_pdflush(long nr_pages, nodemask_t *nodes) { if (nr_pages == 0) nr_pages = global_page_state(NR_FILE_DIRTY) + global_page_state(NR_UNSTABLE_NFS); - return pdflush_operation(background_writeout, nr_pages); + return pdflush_operation(background_writeout, nr_pages, nodes); } static void wb_timer_fn(unsigned long unused); @@ -394,7 +394,7 @@ static DEFINE_TIMER(laptop_mode_wb_timer * older_than_this takes precedence over nr_to_write. So we'll only write back * all dirty pages if they are all attached to "old" mappings. 
*/ -static void wb_kupdate(unsigned long arg) +static void wb_kupdate(unsigned long arg, nodemask_t *unused) { unsigned long oldest_jif; unsigned long start_jif; @@ -454,18 +454,18 @@ int dirty_writeback_centisecs_handler(ct static void wb_timer_fn(unsigned long unused) { - if (pdflush_operation(wb_kupdate, 0) < 0) + if (pdflush_operation(wb_kupdate, 0, NULL) < 0) mod_timer(_timer, jiffies + HZ); /* delay 1 second */ } -static void laptop_flush(unsigned long unused) +static void laptop_flush(unsigned long unused, nodemask_t *unused2) { sys_sync(); } static void laptop_timer_fn(unsigned long unused) { - pdflush_operation(laptop_flush, 0); + pdflush_operation(laptop_flush, 0, NULL); } /* Index: linux-2.6.20-rc5/mm/pdflush.c === --- linux-2.6.20-rc5.orig/mm/pdflush.c 2007-01-15 21:34:38.582660664 -0600 +++ linux-2.6.20-rc5/mm/pdflush.c 2007-01-15 21:34:43.161190961 -0600 @@ -83,10 +83,12 @@ static unsigned long last_empty_jifs; */ struct pdflush_work { struct task_struct *who;/* The thread */ - void (*fn)(unsigned long); /* A callback function */ + void (*fn)(unsigned long, nodemask_t *); /* A callback function
[RFC 8/8] Reduce inode memory usage for systems with a high MAX_NUMNODES
Dynamically reduce the size of the nodemask_t in struct inode

The nodemask_t in struct inode can potentially waste a lot of memory if MAX_NUMNODES is high. For IA64, MAX_NUMNODES is 1024 by default, which results in 128 bytes being used for the nodemask. This means that the memory use of inodes may increase significantly since they all now include a dirty_map. These may be unnecessarily large on smaller systems.

We placed the nodemask at the end of struct inode. This patch avoids touching the later part of the nodemask if the actual maximum possible node on the system is less than 1024. If MAX_NUMNODES is larger than BITS_PER_LONG (and we may use more than one word for the nodemask) then we calculate the number of bytes that may be taken off the end of an inode. We can then create the inode caches without those bytes, effectively saving memory. On an IA64 system booting with a maximum of 64 nodes we may save 120 of those 128 bytes per inode.

This is only done for filesystems that are typically used for NUMA systems: xfs, nfs, ext3, ext4 and reiserfs. Other filesystems will always use the full length of the inode.

This solution may be a bit hokey. I tried other approaches but this one seemed to be the simplest with the least complications. Maybe someone else can come up with a better solution?
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.20-rc5/fs/xfs/linux-2.6/xfs_super.c
===================================================================
--- linux-2.6.20-rc5.orig/fs/xfs/linux-2.6/xfs_super.c	2007-01-15 22:33:55.0 -0600
+++ linux-2.6.20-rc5/fs/xfs/linux-2.6/xfs_super.c	2007-01-15 22:35:07.596529498 -0600
@@ -370,7 +370,9 @@ xfs_fs_inode_init_once(
 STATIC int
 xfs_init_zones(void)
 {
-	xfs_vnode_zone = kmem_zone_init_flags(sizeof(bhv_vnode_t), "xfs_vnode",
+	xfs_vnode_zone = kmem_zone_init_flags(sizeof(bhv_vnode_t)
+					- unused_numa_nodemask_bytes,
+					"xfs_vnode",
 					KM_ZONE_HWALIGN | KM_ZONE_RECLAIM |
 					KM_ZONE_SPREAD,
 					xfs_fs_inode_init_once);

Index: linux-2.6.20-rc5/include/linux/fs.h
===================================================================
--- linux-2.6.20-rc5.orig/include/linux/fs.h	2007-01-15 22:33:55.0 -0600
+++ linux-2.6.20-rc5/include/linux/fs.h	2007-01-15 22:35:07.621922373 -0600
@@ -591,6 +591,14 @@ struct inode {
 	void		*i_private;	/* fs or device private pointer */
 #ifdef CONFIG_CPUSETS
 	nodemask_t	dirty_nodes;	/* Map of nodes with dirty pages */
+	/*
+	 * Note that we may only use a portion of the bitmap in dirty_nodes
+	 * if we have a large MAX_NUMNODES but the number of possible nodes
+	 * is small in order to reduce the size of the inode.
+	 *
+	 * Bits after nr_node_ids (one node beyond the last possible
+	 * node_id) may not be accessed.
+	 */
 #endif
 };

Index: linux-2.6.20-rc5/fs/ext3/super.c
===================================================================
--- linux-2.6.20-rc5.orig/fs/ext3/super.c	2007-01-15 22:33:55.0 -0600
+++ linux-2.6.20-rc5/fs/ext3/super.c	2007-01-15 22:35:07.646338599 -0600
@@ -480,7 +480,8 @@ static void init_once(void * foo, struct
 static int init_inodecache(void)
 {
 	ext3_inode_cachep = kmem_cache_create("ext3_inode_cache",
-					sizeof(struct ext3_inode_info),
+					sizeof(struct ext3_inode_info)
+						- unused_numa_nodemask_bytes,
 					0, (SLAB_RECLAIM_ACCOUNT|
 						SLAB_MEM_SPREAD),
 					init_once, NULL);

Index: linux-2.6.20-rc5/fs/inode.c
===================================================================
--- linux-2.6.20-rc5.orig/fs/inode.c	2007-01-15 22:33:55.0 -0600
+++ linux-2.6.20-rc5/fs/inode.c	2007-01-15 22:35:07.661964984 -0600
@@ -1399,7 +1399,8 @@ void __init inode_init(unsigned long mem
 	/* inode slab cache */
 	inode_cachep = kmem_cache_create("inode_cache",
-					 sizeof(struct inode),
+					 sizeof(struct inode)
+						- unused_numa_nodemask_bytes,
 					 0,
 					 (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
 					 SLAB_MEM_SPREAD),

Index: linux-2.6.20-rc5/fs/reiserfs/super.c
===================================================================
--- linux-2.6.20-rc5.orig/fs/reiserfs/super.c	2007-01-15 22:33:55.0 -0600
+++ linux-2.6.20-rc5/fs/reiserfs/super.c	2007-01-15
[RFC 2/8] Add a map to inodes to track dirty pages per node
Add a dirty map to the inode

In a NUMA system it is helpful to know where the dirty pages of a mapping are located. That way we will be able to implement writeout for applications that are constrained to a portion of the memory of the system, as required by cpusets.

Two functions are introduced to manage the dirty node map: cpuset_clear_dirty_nodes() and cpuset_update_dirty_nodes(). Both are defined using macros since the definition of struct inode may not be available in cpuset.h.

The dirty map is cleared when the inode is cleared. There is no synchronization (except for the atomic nature of node_set) for the dirty_map. The only problem that could occur is that we do not write out an inode because a node bit is not set. That is rare, and becomes exceedingly rare if multiple pages are involved. There is therefore a slight chance that we miss a dirty node if the inode contains just a single dirty page, which is likely tolerable.

This patch increases the size of struct inode for the NUMA case. For most arches, which only support up to 64 nodes, this simply adds one unsigned long. However, the default Itanium configuration allows for up to 1024 nodes, so on Itanium we add 128 bytes per inode. A later patch will make the size of the per node bit array dynamic so that the inode slab caches are properly sized.
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.20-rc5/fs/fs-writeback.c
===================================================================
--- linux-2.6.20-rc5.orig/fs/fs-writeback.c	2007-01-12 12:54:26.0 -0600
+++ linux-2.6.20-rc5/fs/fs-writeback.c	2007-01-15 22:34:12.065241639 -0600
@@ -22,6 +22,7 @@
 #include
 #include
 #include
+#include
 #include "internal.h"

 /**
@@ -223,11 +224,13 @@ __sync_single_inode(struct inode *inode,
 			/*
 			 * The inode is clean, inuse
 			 */
+			cpuset_clear_dirty_nodes(inode);
 			list_move(&inode->i_list, &inode_in_use);
 		} else {
 			/*
 			 * The inode is clean, unused
 			 */
+			cpuset_clear_dirty_nodes(inode);
 			list_move(&inode->i_list, &inode_unused);
 		}
 	}

Index: linux-2.6.20-rc5/fs/inode.c
===================================================================
--- linux-2.6.20-rc5.orig/fs/inode.c	2007-01-12 12:54:26.0 -0600
+++ linux-2.6.20-rc5/fs/inode.c	2007-01-15 22:33:55.802081773 -0600
@@ -22,6 +22,7 @@
 #include
 #include
 #include
+#include

 /*
  * This is needed for the following functions:
@@ -134,6 +135,7 @@ static struct inode *alloc_inode(struct
 		inode->i_cdev = NULL;
 		inode->i_rdev = 0;
 		inode->dirtied_when = 0;
+		cpuset_clear_dirty_nodes(inode);
 		if (security_inode_alloc(inode)) {
 			if (inode->i_sb->s_op->destroy_inode)
 				inode->i_sb->s_op->destroy_inode(inode);

Index: linux-2.6.20-rc5/include/linux/fs.h
===================================================================
--- linux-2.6.20-rc5.orig/include/linux/fs.h	2007-01-12 12:54:26.0 -0600
+++ linux-2.6.20-rc5/include/linux/fs.h	2007-01-15 22:33:55.876307100 -0600
@@ -589,6 +589,9 @@ struct inode {
 	void		*i_security;
 #endif
 	void		*i_private;	/* fs or device private pointer */
+#ifdef CONFIG_CPUSETS
+	nodemask_t	dirty_nodes;	/* Map of nodes with dirty pages */
+#endif
 };

 /*

Index: linux-2.6.20-rc5/mm/page-writeback.c
===================================================================
--- linux-2.6.20-rc5.orig/mm/page-writeback.c	2007-01-12 12:54:26.0 -0600
+++ linux-2.6.20-rc5/mm/page-writeback.c	2007-01-15 22:34:14.425802376 -0600
@@ -33,6 +33,7 @@
 #include
 #include
 #include
+#include

 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -780,6 +781,7 @@ int __set_page_dirty_nobuffers(struct pa
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */
 			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+			cpuset_update_dirty_nodes(mapping->host, page);
 		}
 		return 1;
 	}

Index: linux-2.6.20-rc5/fs/buffer.c
===================================================================
--- linux-2.6.20-rc5.orig/fs/buffer.c	2007-01-12 12:54:26.0 -0600
+++ linux-2.6.20-rc5/fs/buffer.c	2007-01-15 22:34:14.459008443 -0600
@@ -42,6 +42,7 @@
 #include
 #include
 #include
+#include

 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
 static void invalidate_bh_lrus(void);
@@ -739,6 +740,7 @@ int __set_page_dirty_buffers(struct page
 	}
 	write_unlock_irq(&mapping->tree_lock);
 	__mark_inode_dirty(mapping->host,
[RFC 5/8] Make writeout during reclaim cpuset aware
Direct reclaim: cpuset aware writeout

During direct reclaim we traverse down a zonelist, carefully checking whether each zone is a member of the active cpuset. But then we call pdflush without enforcing the same restrictions. In a larger system this may have the effect of a massive amount of pages being dirtied, after which either

A. no writeout occurs because global dirty limits have not been reached, or

B. writeout starts randomly for some dirty inode in the system. Pdflush may just write out data for nodes in another cpuset and miss doing proper dirty handling for the current cpuset.

In both cases dirty pages in the zones of interest may not be affected and writeout may not occur as necessary. Fix that by restricting pdflush to the nodes of the active cpuset. Writeout will then occur from direct reclaim as in an SMP system.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.20-rc5/mm/vmscan.c
===================================================================
--- linux-2.6.20-rc5.orig/mm/vmscan.c	2007-01-15 21:34:43.173887398 -0600
+++ linux-2.6.20-rc5/mm/vmscan.c	2007-01-15 21:37:26.605346439 -0600
@@ -1065,7 +1065,8 @@ unsigned long try_to_free_pages(struct z
 	 */
 	if (total_scanned > sc.swap_cluster_max +
			sc.swap_cluster_max / 2) {
-		wakeup_pdflush(laptop_mode ? 0 : total_scanned, NULL);
+		wakeup_pdflush(laptop_mode ? 0 : total_scanned,
+				&cpuset_current_mems_allowed);
 		sc.may_writepage = 1;
 	}

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC 0/8] Cpuset aware writeback
Currently cpusets are not able to do proper writeback since dirty ratio calculations and writeback are all done for the system as a whole. This may result in a large percentage of a cpuset becoming dirty without writeout being triggered. Under NFS this can lead to OOM conditions.

Writeback will occur during the LRU scans. But such writeout is not effective since we write page by page and not in inode page order (regular writeback).

In order to fix the problem we first of all introduce a method to establish a map of nodes that contain dirty pages for each inode mapping.

Secondly we modify the dirty limit calculation to be based on the active cpuset.

If we are in a cpuset then we select only inodes for writeback that have pages on the nodes of the cpuset.

After we have the cpuset throttling in place we can then make further fixups:

A. We can do inode based writeout from direct reclaim, avoiding single page writes to the filesystem.

B. We add a new counter NR_UNRECLAIMABLE that is subtracted from the available pages in a node. This allows us to accurately calculate the dirty ratio even if large portions of the node have been allocated for huge pages or for slab pages.

There are a couple of points where some better ideas could be used:

1. The nodemask expands the inode structure significantly if the architecture allows a high number of nodes. This is only an issue for IA64. For that platform we expand the inode structure by 128 bytes (to support 1024 nodes). The last patch attempts to address the issue by using the knowledge about the maximum possible number of nodes determined at bootup to shrink the nodemask.

2. The calculation of the per cpuset limits can require looping over a number of nodes, which may bring the performance of get_dirty_limits near pre-2.6.18 performance (before the introduction of the ZVC counters) (only for cpuset based limit calculation). There is no way of keeping these counters per cpuset since cpusets may overlap.
Paul probably needs to go through this and may want additional fixes to keep things in harmony with cpusets.

Tested on: IA64 NUMA 128p, 12p
Compiles on: i386 SMP, x86_64 UP

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC 1/8] Convert highest_possible_node_id() into nr_node_ids
Replace highest_possible_node_id() with nr_node_ids

highest_possible_node_id() is used to calculate the last possible node id so that the network subsystem can figure out how to size per node arrays. I think having the ability to determine the maximum number of nodes in a system at runtime is useful, but then we should name this entry correspondingly and also only calculate the value once at bootup.

This patch introduces nr_node_ids and replaces the uses of highest_possible_node_id(). nr_node_ids is calculated at bootup when the page allocator's pagesets are initialized.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.20-rc4-mm1/include/linux/nodemask.h
===================================================================
--- linux-2.6.20-rc4-mm1.orig/include/linux/nodemask.h	2007-01-06 21:45:51.0 -0800
+++ linux-2.6.20-rc4-mm1/include/linux/nodemask.h	2007-01-12 12:59:50.0 -0800
@@ -352,7 +352,7 @@
 #define node_possible(node)	node_isset((node), node_possible_map)
 #define first_online_node	first_node(node_online_map)
 #define next_online_node(nid)	next_node((nid), node_online_map)
-int highest_possible_node_id(void);
+extern int nr_node_ids;
 #else
 #define num_online_nodes()	1
 #define num_possible_nodes()	1
@@ -360,7 +360,7 @@
 #define node_possible(node)	((node) == 0)
 #define first_online_node	0
 #define next_online_node(nid)	(MAX_NUMNODES)
-#define highest_possible_node_id()	0
+#define nr_node_ids	1
 #endif

 #define any_online_node(mask)	\

Index: linux-2.6.20-rc4-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.20-rc4-mm1.orig/mm/page_alloc.c	2007-01-12 12:58:26.0 -0800
+++ linux-2.6.20-rc4-mm1/mm/page_alloc.c	2007-01-12 12:59:50.0 -0800
@@ -679,6 +679,26 @@
 	return i;
 }

+#if MAX_NUMNODES > 1
+int nr_node_ids __read_mostly;
+EXPORT_SYMBOL(nr_node_ids);
+
+/*
+ * Figure out the number of possible node ids.
+ */
+static void __init setup_nr_node_ids(void)
+{
+	unsigned int node;
+	unsigned int highest = 0;
+
+	for_each_node_mask(node, node_possible_map)
+		highest = node;
+	nr_node_ids = highest + 1;
+}
+#else
+static void __init setup_nr_node_ids(void) {}
+#endif
+
 #ifdef CONFIG_NUMA
 /*
  * Called from the slab reaper to drain pagesets on a particular node that
@@ -3318,6 +3338,7 @@
 	min_free_kbytes = 65536;
 	setup_per_zone_pages_min();
 	setup_per_zone_lowmem_reserve();
+	setup_nr_node_ids();
 	return 0;
 }
 module_init(init_per_zone_pages_min)
@@ -3519,18 +3540,4 @@
 EXPORT_SYMBOL(page_to_pfn);
 #endif /* CONFIG_OUT_OF_LINE_PFN_TO_PAGE */

-#if MAX_NUMNODES > 1
-/*
- * Find the highest possible node id.
- */
-int highest_possible_node_id(void)
-{
-	unsigned int node;
-	unsigned int highest = 0;
-
-	for_each_node_mask(node, node_possible_map)
-		highest = node;
-	return highest;
-}
-EXPORT_SYMBOL(highest_possible_node_id);
-#endif

Index: linux-2.6.20-rc4-mm1/net/sunrpc/svc.c
===================================================================
--- linux-2.6.20-rc4-mm1.orig/net/sunrpc/svc.c	2007-01-06 21:45:51.0 -0800
+++ linux-2.6.20-rc4-mm1/net/sunrpc/svc.c	2007-01-12 12:59:50.0 -0800
@@ -116,7 +116,7 @@
 static int
 svc_pool_map_init_percpu(struct svc_pool_map *m)
 {
-	unsigned int maxpools = highest_possible_processor_id()+1;
+	unsigned int maxpools = nr_node_ids;
 	unsigned int pidx = 0;
 	unsigned int cpu;
 	int err;
@@ -144,7 +144,7 @@
 static int
 svc_pool_map_init_pernode(struct svc_pool_map *m)
 {
-	unsigned int maxpools = highest_possible_node_id()+1;
+	unsigned int maxpools = nr_node_ids;
 	unsigned int pidx = 0;
 	unsigned int node;
 	int err;

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm 2/10][RFC] aio: net use struct socket for io
On Mon, 15 Jan 2007 17:54:50 -0800 Nate Diller <[EMAIL PROTECTED]> wrote: > Remove unused arg from socket operations > > The sendmsg and recvmsg socket operations take a kiocb pointer, but none of > the functions actually use it. There's really no need even theoretically, > it's really quite ugly having it there at all. Also, removing it will pave > the way for a more generic completion path in the file_operations. > > --- Would getting rid of these make later implementation of AIO networking harder? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm 3/10][RFC] aio: use iov_length instead of ki_left
On 1/15/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote: On Mon, Jan 15, 2007 at 05:54:50PM -0800, Nate Diller wrote: > Convert code using iocb->ki_left to use the more generic iov_length() call. No way. We need to reduce the numer of iovec traversals, not adding more of them. ok, I can work on a version of this that uses struct iodesc. Maybe something like this? struct iodesc { struct iovec *iov; unsigned long nr_segs; size_t nbytes; }; I suppose it's worth doing the iodesc thing along with this patchset anyway, since it'll avoid an extra round of interface churn. NATE - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] flush_cpu_workqueue: don't flush an empty ->worklist
On Mon, Jan 15, 2007 at 07:55:16PM +0300, Oleg Nesterov wrote:
> > What if 'singlethread_cpu' dies?
>
> Still can't understand you. Probably you missed what singlethread_cpu is.

Oops, yes ... I had mistakenly thought that create_workqueue_thread() will bind the worker thread to singlethread_cpu for a single_threaded workqueue. So it isn't a problem.

> > What abt __create_workqueue/schedule_on_each_cpu?
>
> As I said already __create_workqueue() needs a fix, schedule_on_each_cpu()
> is already broken, and should be fixed as well.

__create_workqueue() creates worker threads for all online CPUs currently. Accessing the online_map could be racy unless we serialize the access with the hotplug event (through a mutex like the workqueue mutex held between LOCK_ACQ/LOCK_RELEASE messages, or the process freezer) OR take special measures as was done in flush_workqueue. How were you planning to deal with that raciness?

> > > The whole purpose of this change to avoid this!
> >
> > I guess it depends on how __create_workqueue/schedule_on_each_cpu is
> > modified (whether we take/release lock upon LOCK_ACQ/LOCK_RELEASE)
>
> Sorry, can't understand this...

I meant to say that, depending on how we modify __create_workqueue/schedule_on_each_cpu to avoid racy access to the online_map, we can debate whether the workqueue mutex needs to be held between LOCK_ACQ/LOCK_RELEASE messages in the callback.

> > What abt stopping that thread in CPU_DOWN_PREPARE (before freezing
> > processes)? I understand that it may add to the latency, but compared to
> > the overall latency of process freezer, I suspect it may not be much.
> > Srivatsa, why do you think this would be better?
>
> It add to the complexity! What do you mean by "stopping that thread" ?
> Kill it? - this is wrong.

I meant issuing kthread_stop() in DOWN_PREPARE so that the worker thread exits itself (much before the CPU is actually brought down). Do you see any problems with that?
Even if there are problems with it, how about something like below:

workqueue_cpu_callback()
{
	CPU_DEAD:
		/* threads are still frozen at this point */
		take_over_work();
		kthread_mark_stop(worker_thread);
		break;

	CPU_CLEAN_THREADS:
		/* all threads resumed by now */
		kthread_stop(worker_thread);	/* task_struct ref required? */
		break;
}

kthread_mark_stop() will mark somewhere in the task_struct that the thread should exit when it comes out of the refrigerator.

worker_thread()
{
	while (!kthread_should_stop()) {
		if (cwq->freezeable)
			try_to_freeze();

		if (kthread_marked_stop(current))
			break;

		...
	}
}

The advantage I see above is that, when take_over_work() is running, we won't race with functions like flush_workqueue() (threads are still frozen at that point) and hence we avoid hacks like migrate_sequence. This will also let functions like flush_workqueue() easily access cpu_online_map as below, without any special locking/hacks (which I consider a great benefit for programmers).

flush_workqueue()
{
	for_each_online_cpu(i)
		flush_cpu_workqueue(i);
}

Do you see any problems with this latter approach?

-- 
Regards,
vatsa

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: allocation failed: out of vmalloc space error treating and VIDEO1394 IOC LISTEN CHANNEL ioctl failed problem
On Mon, 2007-01-15 at 16:43 -0500, Kristian Høgsberg wrote: > On 1/15/07, Arjan van de Ven <[EMAIL PROTECTED]> wrote: > > again the best way is for you to provide an mmap method... you can then > > fill in the pages and keep that in some sort of array; this is for > > example also what the DRI/DRM layer does for textures etc... > > That sounds a lot like what I have now (mmap method, array of pages) > so I'll just stick with that. It sounds like the distinction Arjan is getting at is that the buffer should exist in the process's virtual address space instead of the kernel's virtual address space so that we have plenty of space available to us. Thus, we should use get_user_pages() instead of vmalloc(). I think get_user_pages() will also automatically pin the memory. And we'll also need to call get_user_pages() from a custom mmap() handler so that we know what process virtual address to assign to the region. Is that right Arjan? Thanks, David - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] CPUSET related breakage of sys_mbind
Patch looks good - thanks, Bob. Signed-off-by: Paul Jackson <[EMAIL PROTECTED]> -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm 0/10][RFC] aio: make struct kiocb private
On 1/15/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> On Mon, Jan 15, 2007 at 05:54:50PM -0800, Nate Diller wrote:
> > This series is an attempt to generalize the async I/O paths to be
> > implementation agnostic. It completely eliminates knowledge of the kiocb
> > structure in the generic code and makes it private within the current aio
> > code. Things get noticeably cleaner without that layering violation.
> >
> > The new interface takes a file_endio_t function pointer, and a private data
> > pointer, which would normally be aio_complete and a kiocb pointer,
> > respectively. If the aio submission function gets back EIOCBQUEUED, that is
> > a guarantee that the endio function will be called, or *already has been
> > called*. If the file_endio_t pointer provided to aio_[read|write] is NULL,
> > the FS must block on I/O completion, then return either the number of bytes
> > read, or an error.
>
> I don't really like this patchset at all. At some point it's a lot nicer to
> put a lot of parameters that are related and passed down a long callchain
> into a structure, and I think the aio code is over that threshold. The
> completion function cleanups look okay to me, but I'd rather add that
> completion function to struct kiocb instead of removing kiocb use.
>
> I have this slight feeling you want to use these completions for something
> else than the current aio code; if that's the case it would help if you
> could explain briefly in what direction you're heading.

Actually I agree with you more than you might think. I had intended this to mesh with your struct iodesc idea, where iodesc would contain the iovec pointer, nr_segs, iov_length, and whatever else needs to be there, potentially even the endio function and its private data, tying those to the iovec instead of a separate structure that needs to be kept in sync. There's a distinct layering that should exist between things that should accompany the iovec transparently, and private data that should be attached opaquely by layers above.
The biggest thing I have in mind for this patch, actually, is to fix up the *sync* paths. I don't think we should be waiting on sync I/O at the *top* of the call stack, as with wait_on_sync_kiocb(); I'd say the best place to wait is at the *bottom*, down in the I/O scheduler. This would make it a lot easier to clean up the completion paths, because in the sync case you'd be right back in process context again as you traverse upward through the RAID, encryption, loopback, directIO, FS log commit, etc. layers. It doesn't by itself eliminate the need for all the threads and workqueues and such that those layers each own, but it is a step in the right direction.

Now if you want to talk about long-term vaporware-style ideas, yeah, I do have my own thoughts on how aio should work. And from Agami's perspective, this patch also makes it easier for us to do certain debugging traces that we wish to hack together, in order to profile performance on our platform. But I'd be hesitant to make those arguments, because they are largely irrelevant (we can obviously carry the patch for debugging without buy-in from the community).

This is the right thing to do from a design perspective. Hopefully it enables a new architecture that can reduce context switches in I/O completion, and reduce overhead. That's the real motive ;)

NATE

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] CPUSET related breakage of sys_mbind
Christoph wrote: > Cpusets is your thing so I think you could fix this the right way. But wasn't it your patch that broke ... Actually, I'd have blessed Bob Picco's patch, as it's done the right way, with a cpuset_* macro hook, defined twice in cpuset.h, with and without CONFIG_CPUSET, where the without case compiles to a no-op. This is the same way as is used for the couple dozen other cpuset kernel hooks. But I thought you were already signed up for this one, so I didn't want to trample on your efforts. And, perhaps more important, I understood you had some other patches in the works that have cpuset hooks. I'm thinking it would be a good idea to learn how these hooks are done, so we don't have to come around here again. How about this ... you take another look at Bob's patch. If it's ok by you too, then we can both bless it, and that should do it. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] CPUSET related breakage of sys_mbind
On Mon, 15 Jan 2007, Paul Jackson wrote: > You're right about this problemI think that Christoph Lameter > (added to cc list) is working on a fix for this. Cpusets is your thing so I think you could fix this the right way. There are already two different patches fixing this. Just make it the way that it fits cpusets. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Initramfs and /sbin/hotplug fun
On Jan 15, 2007, at 1:54 PM, Andrew Walrond wrote:
> Olaf Hering wrote:
> > Why do you need /sbin/hotplug anyway, just for firmware loading for a
> > non-modular kernel?
>
> I guess this is unusual, but FWIW... I have a custom distro and I was just
> looking for the easiest way to create a bootable rescue pen-drive. So I
> just took a working distro, added an init->sbin/init symlink, cpio'ed it
> into an initramfs, and booted it up. Works a treat, except for the early
> hotplug calls.

I have a kernel that needs to have early hotplug calls to load firmware. I just rolled my own simple hotplug scripts to address only that issue and have not had a problem since. The mdev in busybox that is in the gentoo initramfs didn't seem to be able to handle it, so I just made my own scripts. In my case I needed QLogic firmware so root could be on FC.

FWIW, it is a real PITA to not be able to build a monolithic kernel that can bring up root on its own. I will stipulate that I am an old-school guy that likes monolithic kernels, but I do feel that something has been lost. Yes, I am aware of the reasons for the change, else I would have written something when I was fighting the battle, but I still don't have to like it.

-- 
Mark Rustad, [EMAIL PROTECTED]

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
On Fri, 12 January 2007 00:19:45 +0800, Aubrey wrote: > > Yes for desktop, server, but maybe not for embedded system, specially > for no-mmu linux. In many embedded system cases, the whole system is > running in the ram, including file system. So it's not necessary using > page cache anymore. Page cache can't improve performance on these > cases, but only fragment memory. You were not very specific, so I have to guess that you're referring to the problem of having two copies of the same file in RAM - one in the page cache and one in the "backing store", which is just RAM. There are two solutions to this problem. One is tmpfs, which doesn't use a backing store and keeps all data in the page cache. The other is xip, which doesn't use the page cache and goes directly to backing store. Unlike O_DIRECT, xip only works with a RAM or de-facto RAM backing store (NOR flash works read-only). So if you really care about memory waste in embedded systems, you should have a look at mm/filemap_xip.c and continue Carsten Otte's work. Jörn -- Fantasy is more important than knowledge. Knowledge is limited, while fantasy embraces the whole world. -- Albert Einstein - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] X.25 Add missing sock_put in x25_receive_data
From: ahendry <[EMAIL PROTECTED]> Date: Tue, 09 Jan 2007 09:32:17 +1100 > __x25_find_socket does a sock_hold. > This adds a missing sock_put in x25_receive_data. > > Signed-off-by: Andrew Hendry <[EMAIL PROTECTED]> Applied, thanks a lot. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm 0/10][RFC] aio: make struct kiocb private
On Mon, Jan 15, 2007 at 05:54:50PM -0800, Nate Diller wrote:
> This series is an attempt to generalize the async I/O paths to be
> implementation agnostic. It completely eliminates knowledge of
> the kiocb structure in the generic code and makes it private within the
> current aio code. Things get noticeably cleaner without that layering
> violation.
>
> The new interface takes a file_endio_t function pointer, and a private data
> pointer, which would normally be aio_complete and a kiocb pointer,
> respectively. If the aio submission function gets back EIOCBQUEUED, that is
> a guarantee that the endio function will be called, or *already has been
> called*. If the file_endio_t pointer provided to aio_[read|write] is NULL,
> the FS must block on I/O completion, then return either the number of bytes
> read, or an error.

I don't really like this patchset at all. At some point it's a lot nicer to put a lot of parameters that are related and passed down a long callchain into a structure, and I think the aio code is over that threshold. The completion function cleanups look okay to me, but I'd rather add that completion function to struct kiocb instead of removing kiocb use.

I have this slight feeling you want to use these completions for something else than the current aio code; if that's the case it would help if you could explain briefly in what direction you're heading.

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Xen-devel] Re: [patch 20/20] XEN-paravirt: Add Xen virtual block device driver.
> > + > > + err = xenbus_printf(xbt, dev->nodename, > > + "ring-ref","%u", info->ring_ref); > > why do you need your own printf? xenbus_printf isn't a printf replacement - it is used for writing a formatted string into XenStore (which contains VM configuration data in a human-readable form). Internally it does a vsnprintf into a buffer and writes the resulting string to the XenStore. Cheers, Mark > > +static inline int GET_ID_FROM_FREELIST( > > does this really need screaming? > > > + > > +int blkif_ioctl(struct inode *inode, struct file *filep, > > + unsigned command, unsigned long argument) > > +{ > > + int i; > > + > > + DPRINTK_IOCTL("command: 0x%x, argument: 0x%lx, dev: 0x%04x\n", > > + command, (long)argument, inode->i_rdev); > > + > > + switch (command) { > > + case CDROMMULTISESSION: > > + DPRINTK("FIXME: support multisession CDs later\n"); > > + for (i = 0; i < sizeof(struct cdrom_multisession); i++) > > + if (put_user(0, (char __user *)(argument + i))) > > + return -EFAULT; > > + return 0; > > + > > + default: > > + /*printk(KERN_ALERT "ioctl %08x not supported by Xen blkdev\n", > > + command);*/ > > + return -EINVAL; /* same return as native Linux */ > > + } > > eh so you implement no ioctls.. why then implement the ioctl method at > all? I'm not familiar with this code... but perhaps the (fake) multisession handling is to keep userspace that queries this happy? I can't really think of anywhere this would apply off the top of my head, though. Cheers, Mark > > +static struct xenbus_driver blkfront = { > > + .name = "vbd", > > + .owner = THIS_MODULE, > > + .ids = blkfront_ids, > > + .probe = blkfront_probe, > > + .remove = blkfront_remove, > > + .resume = blkfront_resume, > > + .otherend_changed = backend_changed, > > +}; > > this can be const > > > + > > +#define DPRINTK(_f, _a...) pr_debug(_f, ## _a) > > why this silly abstraction? 
> Just use pr_debug in the code directly
>
> ___
> Xen-devel mailing list
> [EMAIL PROTECTED]
> http://lists.xensource.com/xen-devel

--
Dave: Just a question. What use is a unicycle with no seat? And no pedals!
Mark: To answer a question with a question: What use is a skateboard?
Dave: Skateboards have wheels.
Mark: My wheel has a wheel!
Re: SATA exceptions with 2.6.20-rc5
Björn Steinbrink wrote:
>> It should be correct the way it is - that check is trying to prevent
>> ATAPI commands from using DMA until the slave_config function has been
>> called to set up the DMA parameters properly. When the
>> NV_ADMA_ATAPI_SETUP_COMPLETE flag is not set, this returns 1 which
>> disallows DMA transfers. Unless you were using an ATAPI (i.e. CD/DVD)
>> device on the channel this wouldn't affect you anyway.
>
> I wondered about it, because the flag is cleared when adma_enabled is 1,
> which seems to be consistent with everything but nv_adma_check_atapi_dma.

When ADMA is enabled we can't use ATAPI at all (or so says NVidia anyway),
so it has to be disabled when an ATAPI device is detected in slave_config.
Since doing that implies using the legacy BMDMA engine with its greater
restrictions, we need to prevent DMA transfers from being attempted until
those restrictions have been set properly. (Otherwise, the libata core
will try to use PACKET commands on an ATAPI device with DMA enabled before
slave_config is even called.)

> Thus I thought that nv_adma_check_atapi_dma might be wrong, but maybe
> setting/clearing the flag is wrong instead? *feels lost*

--
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
Re: High lock spin time for zone->lru_lock under extreme conditions
On Sat, Jan 13, 2007 at 01:20:23PM -0800, Andrew Morton wrote:
> > Seeing the code helps.

But there was a subtle problem with the hold-time instrumentation here.
The code assumed that a critical section exiting through spin_unlock_irq
entered the critical section with spin_lock_irq, but that is not always
the case, and the instrumentation for hold time goes bad when that
happens (as in shrink_inactive_list).

> > The instrumentation goes like this:
> >
> > void __lockfunc _spin_lock_irq(spinlock_t *lock)
> > {
> > 	unsigned long long t1, t2;
> > 	local_irq_disable();
> > 	t1 = get_cycles_sync();
> > 	preempt_disable();
> > 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> > 	_raw_spin_lock(lock);
> > 	t2 = get_cycles_sync();
> > 	lock->raw_lock.htsc = t2;
> > 	if (lock->spin_time < (t2 - t1))
> > 		lock->spin_time = t2 - t1;
> > }
> > ...
> >
> > void __lockfunc _spin_unlock_irq(spinlock_t *lock)
> > {
> > 	unsigned long long t1;
> > 	spin_release(&lock->dep_map, 1, _RET_IP_);
> > 	t1 = get_cycles_sync();
> > 	if (lock->cs_time < (t1 - lock->raw_lock.htsc))
> > 		lock->cs_time = t1 - lock->raw_lock.htsc;
> > 	_raw_spin_unlock(lock);
> > 	local_irq_enable();
> > 	preempt_enable();
> > }
> > ...
>
> OK, now we need to do a dump_stack() each time we discover a new max hold
> time. That might be a bit tricky: the printk code does spinlocking too, so
> things could go recursively deadlocky. Maybe make spin_unlock_irq() return
> the hold time then do:

What I found now after fixing the above is that the hold time is not bad
-- 249461 cycles on the 2.6 GHz Opteron with powernow disabled in the
BIOS. The spin time is still on the order of seconds. Hence this looks
like a hardware fairness issue. Attaching the instrumentation patch with
this email.

FR.
Index: linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock.h === --- linux-2.6.20-rc4.spin_instru.orig/include/asm-x86_64/spinlock.h 2007-01-14 22:36:46.694248000 -0800 +++ linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock.h 2007-01-15 15:40:36.554248000 -0800 @@ -6,6 +6,18 @@ #include #include +/* Like get_cycles, but make sure the CPU is synchronized. */ +static inline unsigned long long get_cycles_sync2(void) +{ + unsigned long long ret; + unsigned eax; + /* Don't do an additional sync on CPUs where we know + RDTSC is already synchronous. */ + alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC, + "=a" (eax), "0" (1) : "ebx","ecx","edx","memory"); + rdtscll(ret); + return ret; +} /* * Your basic SMP spinlocks, allowing only a single CPU anywhere * @@ -34,6 +46,7 @@ static inline void __raw_spin_lock(raw_s "jle 3b\n\t" "jmp 1b\n" "2:\t" : "=m" (lock->slock) : : "memory"); + lock->htsc = get_cycles_sync2(); } /* @@ -62,6 +75,7 @@ static inline void __raw_spin_lock_flags "jmp 4b\n" "5:\n\t" : "+m" (lock->slock) : "r" ((unsigned)flags) : "memory"); + lock->htsc = get_cycles_sync2(); } #endif @@ -74,11 +88,16 @@ static inline int __raw_spin_trylock(raw :"=q" (oldval), "=m" (lock->slock) :"0" (0) : "memory"); + if (oldval) + lock->htsc = get_cycles_sync2(); return oldval > 0; } static inline void __raw_spin_unlock(raw_spinlock_t *lock) { + unsigned long long t = get_cycles_sync2(); + if (lock->hold_time < t - lock->htsc) + lock->hold_time = t - lock->htsc; asm volatile("movl $1,%0" :"=m" (lock->slock) :: "memory"); } Index: linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock_types.h === --- linux-2.6.20-rc4.spin_instru.orig/include/asm-x86_64/spinlock_types.h 2007-01-14 22:36:46.714248000 -0800 +++ linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock_types.h 2007-01-15 14:23:37.204248000 -0800 @@ -7,9 +7,11 @@ typedef struct { unsigned int slock; + unsigned long long hold_time; + unsigned long long htsc; } raw_spinlock_t; -#define 
__RAW_SPIN_LOCK_UNLOCKED { 1 } +#define __RAW_SPIN_LOCK_UNLOCKED { 1,0,0 } typedef struct { unsigned int lock; Index: linux-2.6.20-rc4.spin_instru/include/linux/spinlock.h === --- linux-2.6.20-rc4.spin_instru.orig/include/linux/spinlock.h 2007-01-14 22:36:48.464248000 -0800 +++ linux-2.6.20-rc4.spin_instru/include/linux/spinlock.h 2007-01-14 22:41:30.964248000 -0800 @@ -231,8 +231,8 @@ do { \ # define spin_unlock(lock)
Re: [stable] 2.6.19.2 regression introduced by "IPV4/IPV6: Fix inet{, 6} device initialization order."
From: YOSHIFUJI Hideaki <[EMAIL PROTECTED]>
Date: Tue, 16 Jan 2007 11:06:30 +0900 (JST)

> In article <[EMAIL PROTECTED]> (at Tue, 16 Jan 2007 03:01:56 +0100),
> Gabriel C <[EMAIL PROTECTED]> says:
>
> > Should be the fix from http://bugzilla.kernel.org/show_bug.cgi?id=7817
>
> I've resent the patch to <[EMAIL PROTECTED]>.

Thank you.
Re: [PATCH] CPUSET related breakage of sys_mbind
You're right about this problem. I think that Christoph Lameter (added to
cc list) is working on a fix for this.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
Re: [PATCH] Provide an interface to limit total page cache.
The possible cause is a bug in the kswapd thread, or shrink_all_memory
cannot be called in the kswapd thread.

On 1/15/07, Vaidyanathan Srinivasan <[EMAIL PROTECTED]> wrote:

Roy Huang wrote:
> A patch provides an interface to limit total page cache in
> /proc/sys/vm/pagecache_ratio. The default value is 90 percent. Any
> feedback is appreciated.

[snip]

I tried to run your patch on a PPC64 SMP machine; unfortunately kswapd
crashes the kernel when the pagecache limit is exceeded!

-> dd if=/dev/zero of=/tmp/foo bs=1M count=1200

cpu 0x0: Vector: 300 (Data Access) at [c12d7ad0]
    pc: c00976ac: .kswapd+0x3a4/0x4f0
    lr: c00976ac: .kswapd+0x3a4/0x4f0
    sp: c12d7d50
   msr: 80009032
   dar: 0
 dsisr: 4200
current = 0xcfed7040
paca    = 0xc063fb80
pid = 134, comm = kswapd0
[ cut here ]
enter ? for help
[c12d7ee0] c0069150 .kthread+0x124/0x174
[c12d7f90] c00247b4 .kernel_thread+0x4c/0x68
0:mon>

Steps to recreate the failure:

# sync
# echo 1 > /proc/sys/vm/drop_caches

MemTotal:      1014584 kB
MemFree:        905536 kB
Buffers:          3232 kB
Cached:          57628 kB
SwapCached:          0 kB
Active:          47664 kB
Inactive:        33160 kB
SwapTotal:     1526164 kB
SwapFree:      1526164 kB
Dirty:             108 kB
Writeback:           0 kB
AnonPages:       19976 kB
Mapped:          15084 kB
Slab:            19724 kB
SReclaimable:     8536 kB
SUnreclaim:      11188 kB
PageTables:        972 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:   2033456 kB
Committed_AS:    87884 kB
VmallocTotal: 8589934592 kB
VmallocUsed:      2440 kB
VmallocChunk: 8589932152 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:    16384 kB

# echo 50 > /proc/sys/vm/pagecache_ratio
# dd if=/dev/zero of=/tmp/foo bs=1M count=1200

Basically, fill the pagecache with over-limit dirty file pages and check
whether the reclaim happened and the limit was not exceeded.

--Vaidy
Re: allocation failed: out of vmalloc space error treating and VIDEO1394 IOC LISTEN CHANNEL ioctl failed problem
On Tuesday 16 January 2007 06:43, Kristian Høgsberg wrote:
> On 1/15/07, Arjan van de Ven <[EMAIL PROTECTED]> wrote:
> > there is a lot of pain involved with doing things this way, it is a TON
> > better if YOU provide the memory via a custom mmap handler for a device
> > driver.
> > (there are a lot of security nightmares involved with the opposite
> > model, like the user can put any kind of memory there, even pci mmio
> > space)
>
> OK, point taken. I don't have a strong preference for the opposite
> model, it just seems elegant that you can let user space handle
> allocation and pin and map the pages as needed. But you're right, it
> certainly is easier to give safe memory to user space in the first
> place rather than try to make sure user space isn't trying to trick
> us.

I am glad that the discussion is heading to the right place thanks to
David. Yes, probably that is the best solution.

In the case of the ring buffers, based on my discussion with Damien, 4
buffers are probably optimal. If the user is allocating them, in the case
of normal cameras, this is somewhere around 4 MiB, let's say at most 16
MiB. So everything should be OK for normal people, at least for now.

The problem is when the cameras require bigger images (we are thinking
about the future, right?) and maybe also more buffers in the DMA ring
buffer. If you leave that to the user, it will require some hacking skills
if we are using the current model from libdc1394 and video1394. Why?
Because if you use 10 buffers with some big images, it is likely you are
going over the 64 MiB. In that case, we were thinking to give a nice error
(that is why we needed to know the amount available for mmap/vmalloc) and
instruct the user to change the kernel boot-time allocation of memory in a
way that will fit the range (the vmalloc=xxx at startup - the "hacking").
So, in a way, it would be nice to have a solution close to the one
proposed by David.

Do you think that if the user allocates small buffers (instead of the big
ring buffer) and sends the list to the driver, this will help in breaking
the 64 MiB limit? I have doubts about it, but I am not good at this level
of VMA. Anyway, I hope that something can be done to allow bigger DMA ring
buffers without the user needing to reboot the system with some parameter.

> > > Then it does an ioctl() on the firewire control device
> >
> > ioctls are evil ;) esp an "mmap me" ioctl
>
> Ah, I'm not mmap'ing it from the ioctl, I do implement the mmap file
> operation for this. However, you have to do an ioctl before mapping
> the device to configure the dma context.
>
> Other than that what is the problem with ioctls, and more interesting,
> what is the alternative? I don't expect (or want) a bunch of syscalls
> to be added for this, so I don't really see what other mechanism I
> should use for this.
>
> > > It's not too difficult from what I'm doing now, I'd just like to give
> > > user space more control over the buffers it uses for streaming (i.e.
> > > letting user space allocate them). What I'm missing here is: how do I
> > > actually pin a page in memory? I'm sure it's not too difficult, but I
> > > haven't yet figured it out and I'm sure somebody knows it off the top
> > > of his head.
> >
> > again the best way is for you to provide an mmap method... you can then
> > fill in the pages and keep that in some sort of array; this is for
> > example also what the DRI/DRM layer does for textures etc...
>
> That sounds a lot like what I have now (mmap method, array of pages)
> so I'll just stick with that.
>
> thanks,
> Kristian
[PATCH -mm 2/10][RFC] aio: net use struct socket for io
Remove unused arg from socket operations The sendmsg and recvmsg socket operations take a kiocb pointer, but none of the functions actually use it. There's really no need even theoretically, it's really quite ugly having it there at all. Also, removing it will pave the way for a more generic completion path in the file_operations. --- drivers/net/pppoe.c |8 +++ include/linux/net.h | 18 +++-- include/net/bluetooth/bluetooth.h |2 - include/net/inet_common.h |3 -- include/net/sock.h| 19 -- include/net/tcp.h |6 ++--- include/net/udp.h |3 -- net/appletalk/ddp.c |5 +--- net/atm/common.c |6 + net/atm/common.h |7 ++ net/ax25/af_ax25.c|7 ++ net/bluetooth/af_bluetooth.c |4 +-- net/bluetooth/hci_sock.c |7 ++ net/bluetooth/l2cap.c |2 - net/bluetooth/rfcomm/sock.c |8 +++ net/bluetooth/sco.c |3 -- net/core/sock.c | 12 --- net/dccp/dccp.h |8 +++ net/dccp/probe.c |3 -- net/dccp/proto.c |7 ++ net/decnet/af_decnet.c|7 ++ net/econet/af_econet.c|7 ++ net/ipv4/af_inet.c|5 +--- net/ipv4/raw.c|8 ++- net/ipv4/tcp.c|7 ++ net/ipv4/tcp_probe.c |3 -- net/ipv4/udp.c|9 +++- net/ipv4/udp_impl.h |2 - net/ipv6/raw.c|6 + net/ipv6/udp.c| 10 +++-- net/ipv6/udp_impl.h |6 + net/ipx/af_ipx.c |7 ++ net/irda/af_irda.c| 29 +--- net/key/af_key.c |6 + net/llc/af_llc.c |7 ++ net/netlink/af_netlink.c |6 + net/netrom/af_netrom.c|7 ++ net/packet/af_packet.c| 11 -- net/rose/af_rose.c|7 ++ net/sctp/socket.c |9 +++- net/socket.c | 32 ++- net/tipc/socket.c | 28 +-- net/unix/af_unix.c| 39 +++--- net/wanrouter/af_wanpipe.c|7 ++ net/x25/af_x25.c |6 + 45 files changed, 166 insertions(+), 243 deletions(-) --- diff -urpN -X dontdiff a/drivers/net/pppoe.c b/drivers/net/pppoe.c --- a/drivers/net/pppoe.c 2007-01-12 11:18:47.244855016 -0800 +++ b/drivers/net/pppoe.c 2007-01-12 11:29:21.179177108 -0800 @@ -746,8 +746,8 @@ static int pppoe_ioctl(struct socket *so } -static int pppoe_sendmsg(struct kiocb *iocb, struct socket *sock, - struct msghdr *m, size_t total_len) +static int pppoe_sendmsg(struct socket *sock, struct msghdr 
*m, +size_t total_len) { struct sk_buff *skb = NULL; struct sock *sk = sock->sk; @@ -912,8 +912,8 @@ static struct ppp_channel_ops pppoe_chan .start_xmit = pppoe_xmit, }; -static int pppoe_recvmsg(struct kiocb *iocb, struct socket *sock, - struct msghdr *m, size_t total_len, int flags) +static int pppoe_recvmsg(struct socket *sock, struct msghdr *m, +size_t total_len, int flags) { struct sock *sk = sock->sk; struct sk_buff *skb = NULL; diff -urpN -X dontdiff a/include/linux/net.h b/include/linux/net.h --- a/include/linux/net.h 2007-01-12 11:18:56.683629587 -0800 +++ b/include/linux/net.h 2007-01-12 11:29:21.185175058 -0800 @@ -118,7 +118,6 @@ struct socket { struct vm_area_struct; struct page; -struct kiocb; struct sockaddr; struct msghdr; struct module; @@ -156,11 +155,10 @@ struct proto_ops { int optname, char __user *optval, int optlen); int (*compat_getsockopt)(struct socket *sock, int level, int optname, char __user *optval, int __user *optlen); - int (*sendmsg) (struct kiocb *iocb, struct socket *sock, - struct msghdr *m, size_t total_len); - int (*recvmsg) (struct kiocb *iocb, struct socket *sock, - struct msghdr *m, size_t total_len, - int flags); + int (*sendmsg) (struct socket *sock, struct msghdr *m, + size_t total_len); + int (*recvmsg) (struct socket *sock, struct msghdr *m, + size_t total_len, int flags); int
[PATCH -mm 8/10][RFC] aio: make direct_IO aops use file_endio_t
This converts the _locking variant of blockdev_direct_IO to use a generic endio function, and updates all the FS callsites. --- Documentation/filesystems/Locking |5 +++-- Documentation/filesystems/vfs.txt |5 +++-- fs/block_dev.c|9 - fs/ext2/inode.c | 12 +--- fs/ext3/inode.c | 11 +-- fs/ext4/inode.c | 11 +-- fs/fat/inode.c| 12 ++-- fs/gfs2/ops_address.c |8 fs/hfs/inode.c| 13 ++--- fs/hfsplus/inode.c| 13 ++--- fs/jfs/inode.c| 12 +--- fs/nfs/direct.c |8 +--- fs/ocfs2/aops.c |9 + fs/reiserfs/inode.c | 13 + fs/xfs/linux-2.6/xfs_aops.c | 11 ++- fs/xfs/linux-2.6/xfs_lrw.c|4 ++-- include/linux/fs.h| 28 +--- include/linux/nfs_fs.h|4 ++-- mm/filemap.c | 34 ++ 19 files changed, 108 insertions(+), 114 deletions(-) --- diff -urpN -X dontdiff a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking --- a/Documentation/filesystems/Locking 2007-01-12 20:26:06.0 -0800 +++ b/Documentation/filesystems/Locking 2007-01-12 20:42:37.0 -0800 @@ -169,8 +169,9 @@ prototypes: sector_t (*bmap)(struct address_space *, sector_t); int (*invalidatepage) (struct page *, unsigned long); int (*releasepage) (struct page *, int); - int (*direct_IO)(int, struct kiocb *, const struct iovec *iov, - loff_t offset, unsigned long nr_segs); + int (*direct_IO)(int, struct file *, const struct iovec *iov, + loff_t offset, unsigned long nr_segs, + file_endio_t *endio, void *endio_data); int (*launder_page) (struct page *); locking rules: diff -urpN -X dontdiff a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt --- a/Documentation/filesystems/vfs.txt 2007-01-12 20:26:06.0 -0800 +++ b/Documentation/filesystems/vfs.txt 2007-01-12 20:42:37.0 -0800 @@ -537,8 +537,9 @@ struct address_space_operations { sector_t (*bmap)(struct address_space *, sector_t); int (*invalidatepage) (struct page *, unsigned long); int (*releasepage) (struct page *, int); - ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, - loff_t offset, unsigned long nr_segs); + ssize_t (*direct_IO)(int, 
struct file *, const struct iovec *iov, + loff_t offset, unsigned long nr_segs, + file_endio_t *endio, void *endio_data); struct page* (*get_xip_page)(struct address_space *, sector_t, int); /* migrate the contents of a page to the specified target */ diff -urpN -X dontdiff a/fs/block_dev.c b/fs/block_dev.c --- a/fs/block_dev.c2007-01-12 20:29:02.0 -0800 +++ b/fs/block_dev.c2007-01-12 20:42:37.0 -0800 @@ -222,10 +222,11 @@ static void blk_unget_page(struct page * } static ssize_t -blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, -loff_t pos, unsigned long nr_segs) +blkdev_direct_IO(int rw, struct file *file, const struct iovec *iov, +loff_t pos, unsigned long nr_segs, file_endio_t *endio, +void *endio_data) { - struct inode *inode = iocb->ki_filp->f_mapping->host; + struct inode *inode = file->f_mapping->host; unsigned blkbits = blksize_bits(bdev_hardsect_size(I_BDEV(inode))); unsigned blocksize_mask = (1 << blkbits) - 1; unsigned long seg = 0; /* iov segment iterator */ @@ -239,8 +240,6 @@ blkdev_direct_IO(int rw, struct kiocb *i loff_t size;/* size of block device */ struct bio *bio; struct bdev_aio stack_io, *io; - file_endio_t *endio = aio_complete; - void *endio_data = iocb; struct page *page; struct pvec pvec; diff -urpN -X dontdiff a/fs/ext2/inode.c b/fs/ext2/inode.c --- a/fs/ext2/inode.c 2007-01-12 20:26:06.0 -0800 +++ b/fs/ext2/inode.c 2007-01-12 20:42:37.0 -0800 @@ -752,14 +752,12 @@ static sector_t ext2_bmap(struct address } static ssize_t -ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, - loff_t offset, unsigned long nr_segs) +ext2_direct_IO(int rw, struct file *file, const struct iovec *iov, + loff_t offset, unsigned long nr_segs, file_endio_t *endio, + void *endio_data) { - struct file *file = iocb->ki_filp; - struct inode *inode = file->f_mapping->host; - - return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, -
Re: [PATCH] Provide an interface to limit total page cache.
Hi Balbir,

Thanks for your comment.

On 1/15/07, Balbir Singh <[EMAIL PROTECTED]> wrote:
> wakeup_kswapd and shrink_all_memory use swappiness to determine what to
> reclaim (mapped pages or page cache). This patch does not ensure that
> only page cache is reclaimed/limited. If the swappiness value is high,
> mapped pages will be hit.

You are right, it is possible to release mapped pages. It can be avoided
by adding a field in "struct scan_control" to determine whether mapped
pages will be released.

> One could get similar functionality by implementing resource management.
> Resource management splits tasks into groups and does management of
> resources for the groups rather than the whole system. Such a facility
> will come with a resource controller for memory (split into finer grain
> rss/page cache/mlock'ed memory, etc), one for cpu, etc.

Is there any more detailed information about the resource controller?
Even with a resource controller for tasks, it is still possible for all
memory to be eaten up by page cache.

> Balbir
[RFC PATCH -rt] RCU priority boosting that survives moderate testing
Hello! This is a updated version of the earlier RCU-boosting patch (http://lkml.org/lkml/2007/1/2/347). It boosts the priority of RCU read-side critical sections in -rt kernels, and the context diff is almost 300 lines shorter than its predecessor. Simplifications were inspired by the act of attempting to design enterprise-level testing for this patch's predecessor -- after all, you don't have to write tests for any code that you manage to eliminate! Still lacks tie-in to OOM, and still needs more vigorous testing (though less so than its predecessor). However, a design doc is on its way. This version permits the system administrator to manually adjust the priority of the RCU-booster task, which will result in RCU boosting to the priority one slot less-favored than the booster task itself. Any tasks that have been previously boosted will have their priority adjusted to align with the RCU-booster task's new priority. As always, any and all comments appreciated! Signed-off-by: Paul E. McKenney <[EMAIL PROTECTED]> --- include/linux/init_task.h | 12 + include/linux/rcupdate.h | 12 + include/linux/rcupreempt.h | 19 + include/linux/sched.h | 16 + init/main.c|1 kernel/Kconfig.preempt | 32 ++ kernel/fork.c |6 kernel/rcupreempt.c| 536 + kernel/rtmutex.c |9 kernel/sched.c |5 10 files changed, 645 insertions(+), 3 deletions(-) diff -urpNa -X dontdiff linux-2.6.20-rc4-rt1/include/linux/init_task.h linux-2.6.20-rc4-rt1-rcub/include/linux/init_task.h --- linux-2.6.20-rc4-rt1/include/linux/init_task.h 2007-01-09 10:59:54.0 -0800 +++ linux-2.6.20-rc4-rt1-rcub/include/linux/init_task.h 2007-01-09 11:01:12.0 -0800 @@ -87,6 +87,17 @@ extern struct nsproxy init_nsproxy; .siglock= __SPIN_LOCK_UNLOCKED(sighand.siglock),\ } +#ifdef CONFIG_PREEMPT_RCU_BOOST +#define INIT_RCU_BOOST_PRIO .rcu_prio = MAX_PRIO, +#define INIT_PREEMPT_RCU_BOOST(tsk)\ + .rcub_rbdp = NULL, \ + .rcub_state = RCU_BOOST_IDLE, \ + .rcub_entry = LIST_HEAD_INIT(tsk.rcub_entry), +#else /* #ifdef CONFIG_PREEMPT_RCU_BOOST 
*/ +#define INIT_RCU_BOOST_PRIO +#define INIT_PREEMPT_RCU_BOOST(tsk) +#endif /* #else #ifdef CONFIG_PREEMPT_RCU_BOOST */ + extern struct group_info init_groups; /* @@ -143,6 +154,7 @@ extern struct group_info init_groups; .pi_lock= RAW_SPIN_LOCK_UNLOCKED(tsk.pi_lock), \ INIT_TRACE_IRQFLAGS \ INIT_LOCKDEP\ + INIT_PREEMPT_RCU_BOOST(tsk) \ } diff -urpNa -X dontdiff linux-2.6.20-rc4-rt1/include/linux/rcupdate.h linux-2.6.20-rc4-rt1-rcub/include/linux/rcupdate.h --- linux-2.6.20-rc4-rt1/include/linux/rcupdate.h 2007-01-09 10:59:54.0 -0800 +++ linux-2.6.20-rc4-rt1-rcub/include/linux/rcupdate.h 2007-01-09 11:01:12.0 -0800 @@ -227,6 +227,18 @@ extern void rcu_barrier(void); extern void rcu_init(void); extern void rcu_advance_callbacks(int cpu, int user); extern void rcu_check_callbacks(int cpu, int user); +#ifdef CONFIG_PREEMPT_RCU_BOOST +extern void init_rcu_boost_late(void); +extern void __rcu_preempt_boost(void); +#define rcu_preempt_boost() \ + do { \ + if (unlikely(current->rcu_read_lock_nesting > 0)) \ + __rcu_preempt_boost(); \ + } while (0) +#else /* #ifdef CONFIG_PREEMPT_RCU_BOOST */ +#define init_rcu_boost_late() +#define rcu_preempt_boost() +#endif /* #else #ifdef CONFIG_PREEMPT_RCU_BOOST */ #endif /* __KERNEL__ */ #endif /* __LINUX_RCUPDATE_H */ diff -urpNa -X dontdiff linux-2.6.20-rc4-rt1/include/linux/rcupreempt.h linux-2.6.20-rc4-rt1-rcub/include/linux/rcupreempt.h --- linux-2.6.20-rc4-rt1/include/linux/rcupreempt.h 2007-01-09 10:59:54.0 -0800 +++ linux-2.6.20-rc4-rt1-rcub/include/linux/rcupreempt.h2007-01-09 11:01:12.0 -0800 @@ -42,6 +42,25 @@ #include #include +#ifdef CONFIG_PREEMPT_RCU_BOOST +/* + * Task state with respect to being RCU-boosted. This state is changed + * by the task itself in response to the following three events: + * 1. Preemption (or block on lock) while in RCU read-side critical section. + * 2. Outermost rcu_read_unlock() for blocked RCU read-side critical section. 
+ * + * The RCU-boost task also updates the state when boosting priority. + */ +enum rcu_boost_state { + RCU_BOOST_IDLE = 0,/* Not yet blocked if in RCU read-side. */ + RCU_BOOST_BLOCKED = 1, /* Blocked from RCU read-side. */ + RCU_BOOSTED = 2, /* Boosting complete. */ +}; + +#define N_RCU_BOOST_STATE (RCU_BOOSTED + 1) + +#endif /* #ifdef CONFIG_PREEMPT_RCU_BOOST
Re: [PATCH -mm 3/10][RFC] aio: use iov_length instead of ki_left
On Mon, Jan 15, 2007 at 05:54:50PM -0800, Nate Diller wrote:
> Convert code using iocb->ki_left to use the more generic iov_length()
> call.

No way. We need to reduce the number of iovec traversals, not add more of
them.
[PATCH -mm 7/10][RFC] aio: make __blockdev_direct_IO use file_endio_t
This converts the internals of __blockdev_direct_IO in fs/direct-io.c to use a generic endio function, instead of directly calling aio_complete. It also changes the semantics of dio_iodone to be more friendly to its only users, xfs and ocfs2. This allows the caller to know how to release locks and tear down data structures on error. It also converts the _own_locking and _no_locking variants of blockdev_direct_IO to use a generic endio function. --- fs/direct-io.c | 74 ++-- fs/gfs2/ops_address.c |6 +-- fs/ocfs2/aops.c | 15 ++-- fs/ocfs2/aops.h |8 fs/ocfs2/file.c | 18 -- fs/ocfs2/inode.h|2 - fs/xfs/linux-2.6/xfs_aops.c | 33 +++ include/linux/fs.h | 57 ++--- 8 files changed, 104 insertions(+), 109 deletions(-) --- diff -urpN -X dontdiff a/fs/direct-io.c b/fs/direct-io.c --- a/fs/direct-io.c2007-01-12 14:53:48.0 -0800 +++ b/fs/direct-io.c2007-01-12 15:06:44.0 -0800 @@ -67,7 +67,7 @@ struct dio { struct bio *bio;/* bio under assembly */ struct inode *inode; int rw; - loff_t i_size; /* i_size when submitted */ + unsigned max_to_read; /* (i_size when submitted) - offset */ int lock_type; /* doesn't change */ unsigned blkbits; /* doesn't change */ unsigned blkfactor; /* When we're using an alignment which @@ -89,6 +89,7 @@ struct dio { int reap_counter; /* rate limit reaping */ get_block_t *get_block; /* block mapping function */ dio_iodone_t *end_io; /* IO completion function */ + void *destructor_data; /* private data for completion fn */ sector_t final_block_in_bio;/* current final block in bio + 1 */ sector_t next_block_for_io; /* next block to be put under IO, in dio_blocks units */ @@ -127,7 +128,8 @@ struct dio { struct task_struct *waiter; /* waiting task (NULL if none) */ /* AIO related stuff */ - struct kiocb *iocb; /* kiocb */ + file_endio_t *file_endio; /* aio completion function */ + void *endio_data; /* private data for aio completion */ int is_async; /* is IO async ? 
*/ int io_error; /* IO error in completion path */ ssize_t result; /* IO result */ @@ -222,7 +224,7 @@ static struct page *dio_get_page(struct * filesystems can use it to hold additional state between get_block calls and * dio_complete. */ -static int dio_complete(struct dio *dio, loff_t offset, int ret) +static int dio_complete(struct dio *dio, int ret) { /* * AIO submission can race with bio completion to get here while @@ -232,25 +234,21 @@ static int dio_complete(struct dio *dio, */ if (ret == -EIOCBQUEUED) ret = 0; + if (ret == 0) + ret = dio->page_errors; + if (ret == 0) + ret = dio->io_error; if (dio->result) { /* Check for short read case */ - if ((dio->rw == READ) && ((offset + dio->result) > dio->i_size)) - dio->result = dio->i_size - offset; + if ((dio->rw == READ) && (dio->result > dio->max_to_read)) + dio->result = dio->max_to_read; } - if (dio->end_io && dio->result) - dio->end_io(dio->iocb, offset, dio->result, - dio->map_bh.b_private); if (dio->lock_type == DIO_LOCKING) /* lockdep: non-owner release */ up_read_non_owner(>inode->i_alloc_sem); - if (ret == 0) - ret = dio->page_errors; - if (ret == 0) - ret = dio->io_error; - return ret; } @@ -277,8 +275,11 @@ static int dio_bio_end_aio(struct bio *b spin_unlock_irqrestore(>bio_lock, flags); if (remaining == 0) { - int err = dio_complete(dio, dio->iocb->ki_pos, 0); - aio_complete(dio->iocb, dio->result, err); + int err = dio_complete(dio, 0); + if (dio->end_io) + dio->end_io(dio->destructor_data, dio->result, + dio->map_bh.b_private); + dio->file_endio(dio->endio_data, dio->result, err); kfree(dio); } @@ -944,10 +945,11 @@ out: * Releases both i_mutex and i_alloc_sem */ static ssize_t -direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, +direct_io_worker(int rw, struct file *file, struct inode *inode, const struct iovec *iov, loff_t offset, unsigned long nr_segs, unsigned
[PATCH -mm 9/10][RFC] aio: usb gadget remove aio file ops
This removes the aio implementation from the usb gadget file system. Aside from making very creative (!) use of the aio retry path, it can't be of any use performance-wise because it always kmalloc()s a bounce buffer for the *whole* I/O size. Perhaps the only reason to keep it around is the ability to cancel I/O requests, which only applies when using the user space async I/O interface. I highly doubt that is enough incentive to justify the extra complexity here or in user-space, so I think it's a safe bet to remove this. If that feature still desired, it would be possible to implement a sync interface that does an interruptible sleep. I can be convinced otherwise, but the alternatives are difficult. See for example the "fuse, get_user_pages, flush_anon_page, aliasing caches and all that again" LKML thread recently for why it's waaay easier to kmalloc a bounce buffer here, and (ab)use the retry interface. --- diff -urpN -X dontdiff a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c --- a/drivers/usb/gadget/inode.c2007-01-10 13:23:46.0 -0800 +++ b/drivers/usb/gadget/inode.c2007-01-10 16:56:09.0 -0800 @@ -527,218 +527,6 @@ static int ep_ioctl (struct inode *inode /*--*/ -/* ASYNCHRONOUS ENDPOINT I/O OPERATIONS (bulk/intr/iso) */ - -struct kiocb_priv { - struct usb_request *req; - struct ep_data *epdata; - void*buf; - const struct iovec *iv; - unsigned long nr_segs; - unsignedactual; -}; - -static int ep_aio_cancel(struct kiocb *iocb, struct io_event *e) -{ - struct kiocb_priv *priv = iocb->private; - struct ep_data *epdata; - int value; - - local_irq_disable(); - epdata = priv->epdata; - // spin_lock(>dev->lock); - kiocbSetCancelled(iocb); - if (likely(epdata && epdata->ep && priv->req)) - value = usb_ep_dequeue (epdata->ep, priv->req); - else - value = -EINVAL; - // spin_unlock(>dev->lock); - local_irq_enable(); - - aio_put_req(iocb); - return value; -} - -static int ep_aio_read_retry(struct kiocb *iocb) -{ - struct kiocb_priv *priv = iocb->private; - ssize_t 
total; - int i, err = 0; - - /* we "retry" to get the right mm context for this: */ - - /* copy stuff into user buffers */ - total = priv->actual; - for (i=0; i < priv->nr_segs; i++) { - ssize_t this = min((ssize_t)(priv->iv[i].iov_len), total); - - if (copy_to_user(priv->iv[i].iov_base, priv->buf, this)) { - err = -EFAULT; - break; - } - - total -= this; - if (total == 0) - break; - } - kfree(priv->buf); - kfree(priv); - aio_put_req(iocb); - return err; -} - -static void ep_aio_complete(struct usb_ep *ep, struct usb_request *req) -{ - struct kiocb*iocb = req->context; - struct kiocb_priv *priv = iocb->private; - struct ep_data *epdata = priv->epdata; - - /* lock against disconnect (and ideally, cancel) */ - spin_lock(>dev->lock); - priv->req = NULL; - priv->epdata = NULL; - if (priv->iv == NULL - || unlikely(req->actual == 0) - || unlikely(kiocbIsCancelled(iocb))) { - kfree(req->buf); - kfree(priv); - iocb->private = NULL; - /* aio_complete() reports bytes-transferred _and_ faults */ - if (unlikely(kiocbIsCancelled(iocb))) - aio_put_req(iocb); - else - aio_complete(iocb, req->actual, req->status); - } else { - /* retry() won't report both; so we hide some faults */ - if (unlikely(0 != req->status)) - DBG(epdata->dev, "%s fault %d len %d\n", - ep->name, req->status, req->actual); - - priv->buf = req->buf; - priv->actual = req->actual; - kick_iocb(iocb); - } - spin_unlock(>dev->lock); - - usb_ep_free_request(ep, req); - put_ep(epdata); -} - -static ssize_t -ep_aio_rwtail( - struct kiocb*iocb, - char*buf, - size_t len, - struct ep_data *epdata, - const struct iovec *iv, - unsigned long nr_segs -) -{ - struct kiocb_priv *priv; - struct usb_request *req; - ssize_t value; - - priv = kmalloc(sizeof *priv, GFP_KERNEL); - if (!priv) { - value = -ENOMEM; -fail: - kfree(buf); - return value; - } -
Re: [stable] 2.6.19.2 regression introduced by "IPV4/IPV6: Fix inet{, 6} device initialization order."
In article <[EMAIL PROTECTED]> (at Tue, 16 Jan 2007 03:01:56 +0100), Gabriel C <[EMAIL PROTECTED]> says: > Greg KH schrieb: > > On Sun, Jan 14, 2007 at 09:30:08PM -0800, David Miller wrote: > > > >> From: David Stevens <[EMAIL PROTECTED]> > >> Date: Sun, 14 Jan 2007 19:47:49 -0800 > >> > >> > >>> I think it's better to add the fix than withdraw this patch, since > >>> the original bug is a crash. > >>> > >> I completely agree. > >> > > > > Great, can someone forward the patch to us? > > > > Should be the fix from http://bugzilla.kernel.org/show_bug.cgi?id=7817 I've resent the patch to <[EMAIL PROTECTED]>. --yoshfuji - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm 3/10][RFC] aio: use iov_length instead of ki_left
Convert code using iocb->ki_left to use the more generic iov_length() call. --- diff -urpN -X dontdiff a/fs/ocfs2/file.c b/fs/ocfs2/file.c --- a/fs/ocfs2/file.c 2007-01-10 11:50:26.0 -0800 +++ b/fs/ocfs2/file.c 2007-01-10 12:42:09.0 -0800 @@ -1157,7 +1157,7 @@ static ssize_t ocfs2_file_aio_write(stru filp->f_path.dentry->d_name.name); /* happy write of zero bytes */ - if (iocb->ki_left == 0) + if (iov_length(iov, nr_segs) == 0) return 0; mutex_lock(>i_mutex); @@ -1177,7 +1177,7 @@ static ssize_t ocfs2_file_aio_write(stru } ret = ocfs2_prepare_inode_for_write(filp->f_path.dentry, >ki_pos, - iocb->ki_left, appending); + iov_length(iov, nr_segs), appending); if (ret < 0) { mlog_errno(ret); goto out; diff -urpN -X dontdiff a/fs/smbfs/file.c b/fs/smbfs/file.c --- a/fs/smbfs/file.c 2007-01-10 11:50:28.0 -0800 +++ b/fs/smbfs/file.c 2007-01-10 12:42:09.0 -0800 @@ -222,7 +222,7 @@ smb_file_aio_read(struct kiocb *iocb, co ssize_t status; VERBOSE("file %s/%s, [EMAIL PROTECTED]", DENTRY_PATH(dentry), - (unsigned long) iocb->ki_left, (unsigned long) pos); + (unsigned long) iov_length(iov, nr_segs), (unsigned long) pos); status = smb_revalidate_inode(dentry); if (status) { @@ -328,7 +328,7 @@ smb_file_aio_write(struct kiocb *iocb, c VERBOSE("file %s/%s, [EMAIL PROTECTED]", DENTRY_PATH(dentry), - (unsigned long) iocb->ki_left, (unsigned long) pos); + (unsigned long) iov_length(iov, nr_segs), (unsigned long) pos); result = smb_revalidate_inode(dentry); if (result) { @@ -341,7 +341,7 @@ smb_file_aio_write(struct kiocb *iocb, c if (result) goto out; - if (iocb->ki_left > 0) { + if (iov_length(iov, nr_segs) > 0) { result = generic_file_aio_write(iocb, iov, nr_segs, pos); VERBOSE("pos=%ld, size=%ld, mtime=%ld, atime=%ld\n", (long) file->f_pos, (long) dentry->d_inode->i_size, diff -urpN -X dontdiff a/fs/udf/file.c b/fs/udf/file.c --- a/fs/udf/file.c 2007-01-10 11:53:02.0 -0800 +++ b/fs/udf/file.c 2007-01-10 12:42:09.0 -0800 @@ -109,7 +109,7 @@ static ssize_t udf_file_aio_write(struct 
struct file *file = iocb->ki_filp; struct inode *inode = file->f_path.dentry->d_inode; int err, pos; - size_t count = iocb->ki_left; + size_t count = iov_length(iov, nr_segs); if (UDF_I_ALLOCTYPE(inode) == ICBTAG_FLAG_AD_IN_ICB) { diff -urpN -X dontdiff a/net/socket.c b/net/socket.c --- a/net/socket.c 2007-01-10 12:40:54.0 -0800 +++ b/net/socket.c 2007-01-10 12:42:09.0 -0800 @@ -632,7 +632,7 @@ static ssize_t sock_aio_read(struct kioc if (pos != 0) return -ESPIPE; - if (iocb->ki_left == 0) /* Match SYS5 behaviour */ + if (iov_length(iov, nr_segs) == 0) /* Match SYS5 behaviour */ return 0; for (i = 0; i < nr_segs; i++) @@ -660,7 +660,7 @@ static ssize_t sock_aio_write(struct kio if (pos != 0) return -ESPIPE; - if (iocb->ki_left == 0) /* Match SYS5 behaviour */ + if (iov_length(iov, nr_segs) == 0) /* Match SYS5 behaviour */ return 0; for (i = 0; i < nr_segs; i++) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH -mm 6/10][RFC] aio: make nfs_directIO use file_endio_t
This converts the iternals of nfs's directIO support to use a generic endio function, instead of directly calling aio_complete. It's pretty easy because it already has a pretty abstracted completion path. --- diff -urpN -X dontdiff a/fs/nfs/direct.c b/fs/nfs/direct.c --- a/fs/nfs/direct.c 2007-01-12 14:53:48.0 -0800 +++ b/fs/nfs/direct.c 2007-01-12 15:02:30.0 -0800 @@ -68,7 +68,6 @@ struct nfs_direct_req { /* I/O parameters */ struct nfs_open_context *ctx; /* file open context info */ - struct kiocb * iocb; /* controlling i/o request */ struct inode * inode; /* target file of i/o */ /* completion state */ @@ -77,6 +76,8 @@ struct nfs_direct_req { ssize_t count, /* bytes actually processed */ error; /* any reported error */ struct completion completion; /* wait for i/o completion */ + file_endio_t*endio; /* async completion function */ + void*endio_data;/* private completion data */ /* commit state */ struct list_headrewrite_list; /* saved nfs_write_data structs */ @@ -151,7 +152,7 @@ static inline struct nfs_direct_req *nfs kref_get(>kref); init_completion(>completion); INIT_LIST_HEAD(>rewrite_list); - dreq->iocb = NULL; + dreq->endio = NULL; dreq->ctx = NULL; spin_lock_init(>lock); atomic_set(>io_count, 0); @@ -179,7 +180,7 @@ static ssize_t nfs_direct_wait(struct nf ssize_t result = -EIOCBQUEUED; /* Async requests don't wait here */ - if (dreq->iocb) + if (!dreq->endio) goto out; result = wait_for_completion_interruptible(>completion); @@ -194,14 +195,10 @@ out: return (ssize_t) result; } -/* - * Synchronous I/O uses a stack-allocated iocb. Thus we can't trust - * the iocb is still valid here if this is a synchronous request. - */ static void nfs_direct_complete(struct nfs_direct_req *dreq) { - if (dreq->iocb) - aio_complete(dreq->iocb, dreq->count, dreq->error); + if (dreq->endio) + dreq->endio(dreq->endio_data, dreq->count, dreq->error); complete_all(>completion); @@ -332,11 +329,13 @@ static ssize_t nfs_direct_read_schedule( return result < 0 ? 
(ssize_t) result : -EFAULT; } -static ssize_t nfs_direct_read(struct kiocb *iocb, unsigned long user_addr, size_t count, loff_t pos) +static ssize_t nfs_direct_read(struct file *file, unsigned long user_addr, + size_t count, loff_t pos, + file_endio_t *endio, void *endio_data) { ssize_t result = 0; sigset_t oldset; - struct inode *inode = iocb->ki_filp->f_mapping->host; + struct inode *inode = file->f_mapping->host; struct rpc_clnt *clnt = NFS_CLIENT(inode); struct nfs_direct_req *dreq; @@ -345,9 +344,9 @@ static ssize_t nfs_direct_read(struct ki return -ENOMEM; dreq->inode = inode; - dreq->ctx = get_nfs_open_context((struct nfs_open_context *)iocb->ki_filp->private_data); - if (!is_sync_kiocb(iocb)) - dreq->iocb = iocb; + dreq->ctx = get_nfs_open_context((struct nfs_open_context *)file->private_data); + dreq->endio = endio; + dreq->endio_data = endio_data; nfs_add_stats(inode, NFSIOS_DIRECTREADBYTES, count); rpc_clnt_sigmask(clnt, ); @@ -663,11 +662,13 @@ static ssize_t nfs_direct_write_schedule return result < 0 ? 
(ssize_t) result : -EFAULT; } -static ssize_t nfs_direct_write(struct kiocb *iocb, unsigned long user_addr, size_t count, loff_t pos) +static ssize_t nfs_direct_write(struct file *file, unsigned long user_addr, + size_t count, loff_t pos, + file_endio_t *endio, void *endio_data) { ssize_t result = 0; sigset_t oldset; - struct inode *inode = iocb->ki_filp->f_mapping->host; + struct inode *inode = file->f_mapping->host; struct rpc_clnt *clnt = NFS_CLIENT(inode); struct nfs_direct_req *dreq; size_t wsize = NFS_SERVER(inode)->wsize; @@ -682,9 +683,9 @@ static ssize_t nfs_direct_write(struct k sync = FLUSH_STABLE; dreq->inode = inode; - dreq->ctx = get_nfs_open_context((struct nfs_open_context *)iocb->ki_filp->private_data); - if (!is_sync_kiocb(iocb)) - dreq->iocb = iocb; + dreq->ctx = get_nfs_open_context((struct nfs_open_context *)file->private_data); + dreq->endio = endio; + dreq->endio_data = endio_data; nfs_add_stats(inode, NFSIOS_DIRECTWRITTENBYTES, count); @@ -701,10 +702,12 @@ static ssize_t nfs_direct_write(struct k /** * nfs_file_direct_read - file direct read
[PATCH -mm 4/10][RFC] aio: convert aio_complete to file_endio_t
Define a new function typedef for I/O completion at the file/iovec level -- typedef void (file_endio_t)(void *endio_data, ssize_t count, int err); and convert aio_complete and all its callers to this new prototype. --- drivers/usb/gadget/inode.c | 24 +++--- fs/aio.c | 59 - fs/block_dev.c |8 +- fs/direct-io.c | 18 + fs/nfs/direct.c|9 ++ include/linux/aio.h| 11 +++- include/linux/fs.h |2 + 7 files changed, 61 insertions(+), 70 deletions(-) --- diff -urpN -X dontdiff a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c --- a/drivers/usb/gadget/inode.c2007-01-12 14:42:29.0 -0800 +++ b/drivers/usb/gadget/inode.c2007-01-12 14:25:34.0 -0800 @@ -559,35 +559,32 @@ static int ep_aio_cancel(struct kiocb *i return value; } -static ssize_t ep_aio_read_retry(struct kiocb *iocb) +static int ep_aio_read_retry(struct kiocb *iocb) { struct kiocb_priv *priv = iocb->private; - ssize_t len, total; - int i; + ssize_t total; + int i, err = 0; /* we "retry" to get the right mm context for this: */ /* copy stuff into user buffers */ total = priv->actual; - len = 0; for (i=0; i < priv->nr_segs; i++) { ssize_t this = min((ssize_t)(priv->iv[i].iov_len), total); if (copy_to_user(priv->iv[i].iov_base, priv->buf, this)) { - if (len == 0) - len = -EFAULT; + err = -EFAULT; break; } total -= this; - len += this; if (total == 0) break; } kfree(priv->buf); kfree(priv); aio_put_req(iocb); - return len; + return err; } static void ep_aio_complete(struct usb_ep *ep, struct usb_request *req) @@ -610,9 +607,7 @@ static void ep_aio_complete(struct usb_e if (unlikely(kiocbIsCancelled(iocb))) aio_put_req(iocb); else - aio_complete(iocb, - req->actual ? 
req->actual : req->status, - req->status); + aio_complete(iocb, req->actual, req->status); } else { /* retry() won't report both; so we hide some faults */ if (unlikely(0 != req->status)) @@ -702,16 +697,17 @@ ep_aio_read(struct kiocb *iocb, const st { struct ep_data *epdata = iocb->ki_filp->private_data; char*buf; + size_t len = iov_length(iov, nr_segs); if (unlikely(epdata->desc.bEndpointAddress & USB_DIR_IN)) return -EINVAL; - buf = kmalloc(iocb->ki_left, GFP_KERNEL); + buf = kmalloc(len, GFP_KERNEL); if (unlikely(!buf)) return -ENOMEM; iocb->ki_retry = ep_aio_read_retry; - return ep_aio_rwtail(iocb, buf, iocb->ki_left, epdata, iov, nr_segs); + return ep_aio_rwtail(iocb, buf, len, epdata, iov, nr_segs); } static ssize_t @@ -726,7 +722,7 @@ ep_aio_write(struct kiocb *iocb, const s if (unlikely(!(epdata->desc.bEndpointAddress & USB_DIR_IN))) return -EINVAL; - buf = kmalloc(iocb->ki_left, GFP_KERNEL); + buf = kmalloc(iov_length(iov, nr_segs), GFP_KERNEL); if (unlikely(!buf)) return -ENOMEM; diff -urpN -X dontdiff a/fs/aio.c b/fs/aio.c --- a/fs/aio.c 2007-01-12 14:42:29.0 -0800 +++ b/fs/aio.c 2007-01-12 14:29:20.0 -0800 @@ -658,16 +658,16 @@ static inline int __queue_kicked_iocb(st * simplifies the coding of individual aio operations as * it avoids various potential races. 
*/ -static ssize_t aio_run_iocb(struct kiocb *iocb) +static void aio_run_iocb(struct kiocb *iocb) { struct kioctx *ctx = iocb->ki_ctx; - ssize_t (*retry)(struct kiocb *); + int (*retry)(struct kiocb *); wait_queue_t *io_wait = current->io_wait; - ssize_t ret; + int err; if (!(retry = iocb->ki_retry)) { printk("aio_run_iocb: iocb->ki_retry = NULL\n"); - return 0; + return; } /* @@ -702,8 +702,8 @@ static ssize_t aio_run_iocb(struct kiocb /* Quit retrying if the i/o has been cancelled */ if (kiocbIsCancelled(iocb)) { - ret = -EINTR; - aio_complete(iocb, ret, 0); + err = -EINTR; + aio_complete(iocb, iocb->ki_nbytes - iocb->ki_left, err); /* must not access the iocb after this */ goto out; } @@ -720,17 +720,17 @@ static ssize_t
[PATCH -mm 5/10][RFC] aio: make blk_directIO use file_endio_t
Convert the internals of blkdev_direct_IO to use a generic endio function, instead of directly calling aio_complete. This may also fix some bugs/races in this code, for instance it checks bio->bi_size instead of assuming it's zero, and it atomically accumulates the bytes_done counter (assuming that the bio completion handler can't race with itself *might* be valid here, but the direct-io code makes no such assumption). I'm also pretty sure that the address_space->directIO functions aren't supposed to mess with the iocb->ki_pos or ->ki_left. --- diff -urpN -X dontdiff a/fs/block_dev.c b/fs/block_dev.c --- a/fs/block_dev.c2007-01-12 20:26:25.0 -0800 +++ b/fs/block_dev.c2007-01-12 20:23:55.0 -0800 @@ -131,10 +131,32 @@ blkdev_get_block(struct inode *inode, se return 0; } -static int blk_end_aio(struct bio *bio, unsigned int bytes_done, int error) +struct bdev_aio { + atomic_tiocount;/* refcount */ + atomic_tbytes_done; /* byte counter */ + int err;/* error handling */ + file_endio_t*endio; /* end I/O notify fn */ + void*endio_data;/* notify fn private data */ +}; + +static void blk_io_put(struct bdev_aio *io) +{ + if (!atomic_dec_and_test(>iocount)) + return; + + if (!io->endio) + return complete((struct completion*)io->endio_data); + + io->endio(io->endio_data, atomic_read(>bytes_done), io->err); + kfree(io); +} + +static int blk_bio_endio(struct bio *bio, unsigned int bytes_done, int error) { - struct kiocb *iocb = bio->bi_private; - atomic_t *bio_count = >ki_bio_count; + struct bdev_aio *io = bio->bi_private; + + if (bio->bi_size) + return 1; if (bio_data_dir(bio) == READ) bio_check_pages_dirty(bio); @@ -143,16 +165,21 @@ static int blk_end_aio(struct bio *bio, bio_put(bio); } - /* iocb->ki_nbytes stores error code from LLDD */ - if (error) - iocb->ki_nbytes = -EIO; - - if (atomic_dec_and_test(bio_count)) - aio_complete(iocb, iocb->ki_left, iocb->ki_nbytes); + if (error) + io->err = error; + atomic_add(bytes_done, >bytes_done); + blk_io_put(io); return 0; } +static 
void blk_io_init(struct bdev_aio *io) +{ + atomic_set(>iocount, 1); + atomic_set(>bytes_done, 0); + io->err = 0; +} + #define VEC_SIZE 16 struct pvec { unsigned short nr; @@ -208,24 +235,33 @@ blkdev_direct_IO(int rw, struct kiocb *i unsigned long addr; /* user iovec address */ size_t count; /* user iovec len */ - size_t nbytes = iocb->ki_nbytes = iocb->ki_left; /* total xfer size */ + size_t nbytes; /* total xfer size */ loff_t size;/* size of block device */ struct bio *bio; - atomic_t *bio_count = >ki_bio_count; + struct bdev_aio stack_io, *io; + file_endio_t *endio = aio_complete; + void *endio_data = iocb; struct page *page; struct pvec pvec; pvec.nr = 0; pvec.idx = 0; + io = _io; + if (endio) { + io = kmalloc(sizeof(struct bdev_aio), GFP_KERNEL); + if (!io) + return -ENOMEM; + } + blk_io_init(io); + if (pos & blocksize_mask) return -EINVAL; + nbytes = iov_length(iov, nr_segs); size = i_size_read(inode); - if (pos + nbytes > size) { + if (pos + nbytes > size) nbytes = size - pos; - iocb->ki_left = nbytes; - } /* * check first non-zero iov alignment, the remaining @@ -237,7 +273,6 @@ blkdev_direct_IO(int rw, struct kiocb *i if (addr & blocksize_mask || count & blocksize_mask) return -EINVAL; } while (!count && ++seg < nr_segs); - atomic_set(bio_count, 1); while (nbytes) { /* roughly estimate number of bio vec needed */ @@ -248,8 +283,8 @@ blkdev_direct_IO(int rw, struct kiocb *i /* bio_alloc should not fail with GFP_KERNEL flag */ bio = bio_alloc(GFP_KERNEL, nvec); bio->bi_bdev = I_BDEV(inode); - bio->bi_end_io = blk_end_aio; - bio->bi_private = iocb; + bio->bi_end_io = blk_bio_endio; + bio->bi_private = io; bio->bi_sector = pos >> blkbits; same_bio: cur_off = addr & ~PAGE_MASK; @@ -289,18 +324,27 @@ same_bio: /* bio is ready, submit it */ if (rw == READ) bio_set_pages_dirty(bio); - atomic_inc(bio_count); + atomic_inc(>iocount); submit_bio(rw, bio); }
[PATCH -mm 1/10][RFC] aio: scm remove struct siocb
this patch removes struct sock_iocb Its purpose seems to have dwindled to a mere container for struct scm_cookie, and all of the users of scm_cookie seem to require re-initializing it each time anyway. Besides, keeping such data around from one call to the next seems to me like a layering violation, if not a bug, considering that the sync IO code can use this call path too. All scm_cookie users are converted to unconditionally allocate on the stack, and sock_iocb and all its helpers are removed. This also simplifies the socket aio submission path (is that even used?) --- include/net/scm.h|2 include/net/sock.h | 26 - net/netlink/af_netlink.c | 18 ++ net/socket.c | 131 +++ net/unix/af_unix.c | 77 ++- 5 files changed, 68 insertions(+), 186 deletions(-) --- diff -urpN -X dontdiff a/include/net/scm.h b/include/net/scm.h --- a/include/net/scm.h 2006-11-29 13:57:37.0 -0800 +++ b/include/net/scm.h 2007-01-10 12:10:19.0 -0800 @@ -23,7 +23,6 @@ struct scm_cookie #ifdef CONFIG_SECURITY_NETWORK u32 secid; /* Passed security ID */ #endif - unsigned long seq;/* Connection seqno */ }; extern void scm_detach_fds(struct msghdr *msg, struct scm_cookie *scm); @@ -56,7 +55,6 @@ static __inline__ int scm_send(struct so scm->creds.gid = p->gid; scm->creds.pid = p->tgid; scm->fp = NULL; - scm->seq = 0; unix_get_peersec_dgram(sock, scm); if (msg->msg_controllen <= 0) return 0; diff -urpN -X dontdiff a/include/net/sock.h b/include/net/sock.h --- a/include/net/sock.h2007-01-10 11:50:54.0 -0800 +++ b/include/net/sock.h2007-01-10 12:15:35.0 -0800 @@ -75,10 +75,9 @@ * between user contexts and software interrupt processing, whereas the * mini-semaphore synchronizes multiple users amongst themselves. 
*/ -struct sock_iocb; typedef struct { spinlock_t slock; - struct sock_iocb*owner; + void*owner; wait_queue_head_t wq; /* * We express the mutex-alike socket_lock semantics @@ -656,29 +655,6 @@ static inline void __sk_prot_rehash(stru #define SOCK_BINDADDR_LOCK 4 #define SOCK_BINDPORT_LOCK 8 -/* sock_iocb: used to kick off async processing of socket ios */ -struct sock_iocb { - struct list_headlist; - - int flags; - int size; - struct socket *sock; - struct sock *sk; - struct scm_cookie *scm; - struct msghdr *msg, async_msg; - struct kiocb*kiocb; -}; - -static inline struct sock_iocb *kiocb_to_siocb(struct kiocb *iocb) -{ - return (struct sock_iocb *)iocb->private; -} - -static inline struct kiocb *siocb_to_kiocb(struct sock_iocb *si) -{ - return si->kiocb; -} - struct socket_alloc { struct socket socket; struct inode vfs_inode; diff -urpN -X dontdiff a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c --- a/net/netlink/af_netlink.c 2007-01-10 11:53:12.0 -0800 +++ b/net/netlink/af_netlink.c 2007-01-10 12:10:19.0 -0800 @@ -1106,7 +1106,6 @@ static inline void netlink_rcv_wake(stru static int netlink_sendmsg(struct kiocb *kiocb, struct socket *sock, struct msghdr *msg, size_t len) { - struct sock_iocb *siocb = kiocb_to_siocb(kiocb); struct sock *sk = sock->sk; struct netlink_sock *nlk = nlk_sk(sk); struct sockaddr_nl *addr=msg->msg_name; @@ -1119,9 +1118,7 @@ static int netlink_sendmsg(struct kiocb if (msg->msg_flags_OOB) return -EOPNOTSUPP; - if (NULL == siocb->scm) - siocb->scm = - err = scm_send(sock, msg, siocb->scm); + err = scm_send(sock, msg, ); if (err < 0) return err; @@ -1155,7 +1152,7 @@ static int netlink_sendmsg(struct kiocb NETLINK_CB(skb).dst_group = dst_group; NETLINK_CB(skb).loginuid = audit_get_loginuid(current->audit_context); selinux_get_task_sid(current, &(NETLINK_CB(skb).sid)); - memcpy(NETLINK_CREDS(skb), >scm->creds, sizeof(struct ucred)); + memcpy(NETLINK_CREDS(skb), , sizeof(struct ucred)); /* What can I do? 
Netlink is asynchronous, so that we will have to save current capabilities to @@ -1189,7 +1186,6 @@ static int netlink_recvmsg(struct kiocb struct msghdr *msg, size_t len, int flags) { - struct sock_iocb *siocb = kiocb_to_siocb(kiocb); struct scm_cookie scm; struct sock *sk = sock->sk; struct netlink_sock *nlk = nlk_sk(sk); @@ -1230,17 +1226,15 @@ static int netlink_recvmsg(struct kiocb if
[PATCH -mm 0/10][RFC] aio: make struct kiocb private
This series is an attempt to generalize the async I/O paths to be implementation agnostic. It completely eliminates knowledge of the kiocb structure in the generic code and makes it private within the current aio code. Things get noticeably cleaner without that layering violation. The new interface takes a file_endio_t function pointer, and a private data pointer, which would normally be aio_complete and a kiocb pointer, respectively. If the aio submission function gets back EIOCBQUEUED, that is a guarantee that the endio function will be called, or *already has been called*. If the file_endio_t pointer provided to aio_[read|write] is NULL, the FS must block on I/O completion, then return either the number of bytes read, or an error. I had to touch more areas than I had originally expected, so there are changes in a corner of the socket code, and a slight behavior change in the direct-io completion path which affects XFS and OCFS2. I would appreciate further review there, so I copied some extra people I hope can help. This patch is against 2.6.20-rc4-mm1. It has been compile-tested at each stage. It needs some runtime testing yet, but I prefer to get it out for commentary and test later. These patches are for RFC only and have not yet been signed off.
NATE --- Documentation/filesystems/Locking | 11 + Documentation/filesystems/vfs.txt | 11 + arch/s390/hypfs/inode.c | 16 +- drivers/net/pppoe.c |8 - drivers/net/tun.c | 13 +- drivers/usb/gadget/inode.c| 239 +- fs/aio.c | 74 ++- fs/bad_inode.c| 10 - fs/block_dev.c| 109 +++-- fs/cifs/cifsfs.c | 10 - fs/compat.c | 56 fs/direct-io.c| 92 -- fs/ecryptfs/file.c| 16 +- fs/ext2/inode.c | 12 - fs/ext3/file.c|9 - fs/ext3/inode.c | 11 - fs/ext4/file.c|9 - fs/ext4/inode.c | 11 - fs/fat/inode.c| 12 - fs/fuse/dev.c | 13 +- fs/gfs2/ops_address.c | 14 +- fs/hfs/inode.c| 13 -- fs/hfsplus/inode.c| 13 -- fs/jfs/inode.c| 12 - fs/nfs/direct.c | 92 +++--- fs/nfs/file.c | 62 + fs/ntfs/file.c| 71 ++- fs/ocfs2/aops.c | 24 +-- fs/ocfs2/aops.h |8 - fs/ocfs2/file.c | 44 +++--- fs/ocfs2/inode.h |2 fs/pipe.c | 12 - fs/read_write.c | 225 --- fs/read_write.h |8 - fs/reiserfs/inode.c | 13 -- fs/smbfs/file.c | 28 ++-- fs/udf/file.c | 13 +- fs/xfs/linux-2.6/xfs_aops.c | 44 +++--- fs/xfs/linux-2.6/xfs_file.c | 58 + fs/xfs/linux-2.6/xfs_lrw.c| 29 ++-- fs/xfs/linux-2.6/xfs_lrw.h| 10 - fs/xfs/linux-2.6/xfs_vnode.h | 20 +-- include/linux/aio.h | 11 - include/linux/fs.h| 114 +- include/linux/net.h | 18 +- include/linux/nfs_fs.h| 12 - include/net/bluetooth/bluetooth.h |2 include/net/inet_common.h |3 include/net/scm.h |2 include/net/sock.h| 45 +-- include/net/tcp.h |6 include/net/udp.h |3 mm/filemap.c | 109 - net/appletalk/ddp.c |5 net/atm/common.c |6 net/atm/common.h |7 - net/ax25/af_ax25.c|7 - net/bluetooth/af_bluetooth.c |4 net/bluetooth/hci_sock.c |7 - net/bluetooth/l2cap.c |2 net/bluetooth/rfcomm/sock.c |8 - net/bluetooth/sco.c |3 net/core/sock.c | 12 - net/dccp/dccp.h |8 - net/dccp/probe.c |3 net/dccp/proto.c |7 - net/decnet/af_decnet.c|7 - net/econet/af_econet.c|7 - net/ipv4/af_inet.c|5 net/ipv4/raw.c|8 - net/ipv4/tcp.c|7 - net/ipv4/tcp_probe.c |3 net/ipv4/udp.c|9 - net/ipv4/udp_impl.h |2 net/ipv6/raw.c|6 net/ipv6/udp.c| 10 - net/ipv6/udp_impl.h |6 net/ipx/af_ipx.c |7 - net/irda/af_irda.c| 29 ++-- 
net/key/af_key.c
Re: [stable] 2.6.19.2 regression introduced by "IPV4/IPV6: Fix inet{, 6} device initialization order."
Greg KH schrieb: > On Sun, Jan 14, 2007 at 09:30:08PM -0800, David Miller wrote: > >> From: David Stevens <[EMAIL PROTECTED]> >> Date: Sun, 14 Jan 2007 19:47:49 -0800 >> >> >>> I think it's better to add the fix than withdraw this patch, since >>> the original bug is a crash. >>> >> I completely agree. >> > > Great, can someone forward the patch to us? > Should be the fix from http://bugzilla.kernel.org/show_bug.cgi?id=7817 > thanks, > > greg k-h > Regards, Gabriel
Re: SATA exceptions with 2.6.20-rc5
Robert Hancock wrote: I'll try your stress test when I get a chance, but I doubt I'll run into the same problem and I haven't seen any similar reports. Perhaps it's some kind of weird timing issue or incompatibility between the controller and that drive when running in ADMA mode? I seem to remember various reports of issues with certain Maxtor drives and some nForce SATA controllers under Windows at least.. Just to eliminate things, has disabling ADMA been attempted? It can be disabled using the sata_nv.adma module parameter. Jeff
Re: SATA exceptions with 2.6.20-rc5
Robert Hancock wrote: Note that the ATA-7 spec for FLUSH CACHE says that "This command may take longer than 30 s to complete." Yep... Jeff
Re: SATA exceptions with 2.6.20-rc5
Jens Axboe wrote: On Mon, Jan 15 2007, Jeff Garzik wrote: Jens Axboe wrote: I'd be surprised if the device would not obey the 7 second timeout rule that seems to be set in stone and not allow more dirty in-drive cache than it could flush out in approximately that time. AFAIK Windows flush-cache timeout is 30 seconds, not 7 as with other commands... Ok, 7 seconds for FLUSH_CACHE would have been nice for us too though, as it would pretty much guarantee lower latencies for random writes and write back caching. The concern is the barrier code, of course. I guess I should do some timings on potential worst case patterns some day. Alan may have done that sometime in the past, iirc. FWIW: According to the drive guys (Eric M, among others), FLUSH CACHE will "probably" be under 30 seconds, but pathological cases might even extend beyond that. Definitely more than 7 seconds in less-than-pathological cases, unfortunately... The SCSI layer /should/ already take this (30 second timeout) into account, for SYNCHRONIZE CACHE (and thus FLUSH CACHE for libata) but I'm too slack to check at the moment. Jeff
[PATCH 2.6.20-rc3 01/01] usb: Sierra Wireless auto set D0
from: Kevin Lloyd <[EMAIL PROTECTED]> This patch ensures that the device is turned on when inserted into the system (which mostly affects the EM5725 and MC5720. It also adds more VID/PIDs and matches the N_OUT_URB with the airprime driver. Signed-off-by: Kevin Lloyd <[EMAIL PROTECTED]> --- --- linux-2.6.20-rc5/drivers/usb/serial/sierra.c.orig 2007-01-15 15:17:15.0 -0800 +++ linux-2.6.20-rc5/drivers/usb/serial/sierra.c2007-01-15 15:41:56.0 -0800 @@ -14,9 +14,31 @@ Whom based his on the Keyspan driver by Hugh Blemings <[EMAIL PROTECTED]> History: +v.1.0.6: + klloyd + Added more devices and added Vendor Specific USB message to make sure + that devices are in D0 state when they start. This is very important for + MC5720 and EM5625 modules that go between Windows and Non-Windows + machines. +v.1.0.5: + Greg KH + This saves over 30 lines and fixes a warning from sparse and allows + debugging to work dynamically like all other usb-serial drivers. + klloyd + Changed versioning to v.x.y.z +v.1.04: + klloyd + Adds significant throughput increase to the Sierra driver (uses multiple + urgs for download link). This patch also updates the current sierra.c + driver so that it supports both 3-port Sierra devices and 1-port legacy + devices and removes Sierra's references in other related files (Kconfig + and airprime.c). +v.1.03 + klloyd + Adds DTR line control support and impliments urb control. 
 */
-#define DRIVER_VERSION "v.1.0.5"
+#define DRIVER_VERSION "v.1.0.6"
 #define DRIVER_AUTHOR "Kevin Lloyd <[EMAIL PROTECTED]>"
 #define DRIVER_DESC "USB Driver for Sierra Wireless USB modems"
@@ -31,14 +53,14 @@ static struct usb_device_id id_table [] = {
+	{ USB_DEVICE(0x1199, 0x0017) },	/* Sierra Wireless EM5625 */
 	{ USB_DEVICE(0x1199, 0x0018) },	/* Sierra Wireless MC5720 */
 	{ USB_DEVICE(0x1199, 0x0020) },	/* Sierra Wireless MC5725 */
-	{ USB_DEVICE(0x1199, 0x0017) },	/* Sierra Wireless EM5625 */
 	{ USB_DEVICE(0x1199, 0x0019) },	/* Sierra Wireless AirCard 595 */
-	{ USB_DEVICE(0x1199, 0x0218) },	/* Sierra Wireless MC5720 */
+	{ USB_DEVICE(0x1199, 0x0021) },	/* Sierra Wireless AirCard 597E */
 	{ USB_DEVICE(0x1199, 0x6802) },	/* Sierra Wireless MC8755 */
+	{ USB_DEVICE(0x1199, 0x6804) },	/* Sierra Wireless MC8755 */
 	{ USB_DEVICE(0x1199, 0x6803) },	/* Sierra Wireless MC8765 */
-	{ USB_DEVICE(0x1199, 0x6804) },	/* Sierra Wireless MC8755 for Europe */
 	{ USB_DEVICE(0x1199, 0x6812) },	/* Sierra Wireless MC8775 */
 	{ USB_DEVICE(0x1199, 0x6820) },	/* Sierra Wireless AirCard 875 */
@@ -55,14 +77,14 @@ static struct usb_device_id id_table_1po
 };

 static struct usb_device_id id_table_3port [] = {
+	{ USB_DEVICE(0x1199, 0x0017) },	/* Sierra Wireless EM5625 */
 	{ USB_DEVICE(0x1199, 0x0018) },	/* Sierra Wireless MC5720 */
 	{ USB_DEVICE(0x1199, 0x0020) },	/* Sierra Wireless MC5725 */
-	{ USB_DEVICE(0x1199, 0x0017) },	/* Sierra Wireless EM5625 */
 	{ USB_DEVICE(0x1199, 0x0019) },	/* Sierra Wireless AirCard 595 */
-	{ USB_DEVICE(0x1199, 0x0218) },	/* Sierra Wireless MC5720 */
+	{ USB_DEVICE(0x1199, 0x0021) },	/* Sierra Wireless AirCard 597E */
 	{ USB_DEVICE(0x1199, 0x6802) },	/* Sierra Wireless MC8755 */
+	{ USB_DEVICE(0x1199, 0x6804) },	/* Sierra Wireless MC8755 */
 	{ USB_DEVICE(0x1199, 0x6803) },	/* Sierra Wireless MC8765 */
-	{ USB_DEVICE(0x1199, 0x6804) },	/* Sierra Wireless MC8755 for Europe */
 	{ USB_DEVICE(0x1199, 0x6812) },	/* Sierra Wireless MC8775 */
 	{ USB_DEVICE(0x1199, 0x6820) },	/* Sierra Wireless AirCard 875 */
 	{ }
@@ -81,7 +103,7 @@ static int debug;

 /* per port private data */
 #define N_IN_URB	4
-#define N_OUT_URB	1
+#define N_OUT_URB	4
 #define IN_BUFLEN	4096
 #define OUT_BUFLEN	128
@@ -123,6 +145,7 @@ static int sierra_send_setup(struct usb_
 		return usb_control_msg(serial->dev,
 			usb_rcvctrlpipe(serial->dev, 0),
 			0x22,0x21,val,0,NULL,0,USB_CTRL_SET_TIMEOUT);
+	}

 	return 0;
@@ -396,6 +419,8 @@ static int sierra_open(struct usb_serial
 	struct usb_serial *serial = port->serial;
 	int i, err;
 	struct urb *urb;
+	int result;
+	__u16 set_mode_dzero = 0x;	//Set mode to D0

 	portdata = usb_get_serial_port_data(port);
@@ -442,6 +467,11 @@ static int sierra_open(struct usb_serial
 	port->tty->low_latency = 1;

+	//set mode to D0
+	result = usb_control_msg(serial->dev,
+			usb_rcvctrlpipe(serial->dev, 0),
+			0x00,0x40,set_mode_dzero,0,NULL,0,USB_CTRL_SET_TIMEOUT);
+
 	sierra_send_setup(port);

 	return (0);

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at
Problem with POSIX threads in latest kernel...
Hi... I run the (almost) latest -mm kernel (2.6.20-rc3-mm1), and see some strange behaviour with POSIX threads (glibc-2.4). I have downgraded my test to a simple textbook example for an SMP-safe spool queue; it's just a circular queue with a mutex and a condition variable for in and out. I have seen the same structure in several places. Well, it just sometimes gets blocked. GDB says it's stuck in pthread_wait(). I could swear it worked on previous kernels. It works as is on IRIX. I will try to build an older kernel to test. It takes a second to block it with something like "while :; do tst; done". Any ideas ?

-- 
J.A. Magallon  \  Software is like sex: it's better when it's free
Mandriva Linux release 2007.1 (Cooker) for i586
Linux 2.6.19-jam04 (gcc 4.1.2 20061110 (prerelease) (4.1.2-0.20061110.2mdv2007.1)) #0 SMP PREEMPT

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

#define SIZE 16

int jobs[SIZE];
int in;
int slots;
pthread_mutex_t slots_mutex;
pthread_cond_t  slots_cond;
int out;
int items;
pthread_mutex_t items_mutex;
pthread_cond_t  items_cond;

void put(int job);
void get(int* job);
void* prod(void* data);
void* cons(void* data);

int main(int argc,char** argv)
{
    pthread_t prodid,consid;

    in = 0;
    slots = SIZE;
    pthread_mutex_init(&slots_mutex,0);
    pthread_cond_init(&slots_cond,0);
    out = 0;
    items = 0;
    pthread_mutex_init(&items_mutex,0);
    pthread_cond_init(&items_cond,0);

    pthread_setconcurrency(3);
    pthread_create(&prodid,0,prod,0);
    pthread_create(&consid,0,cons,0);
    pthread_join(prodid,0);
    pthread_join(consid,0);
    return 0;
}

void* prod(void* data)
{
    int i;
    for (i=0; i<1000; i++) {
        if (!(i%100)) printf("put %d\n",i);
        put(i);
    }
    put(-1);
    puts("prod done");
    return 0;
}

void* cons(void* data)
{
    int i;
    do {
        get(&i);
        if (!(i%100)) printf("got %d\n",i);
    } while (i>=0);
    puts("cons done");
    return 0;
}

void put(int job)
{
    pthread_mutex_lock(&slots_mutex);
    while (slots<=0)
        pthread_cond_wait(&slots_cond,&slots_mutex);
    jobs[in] = job;
    in++; in %= SIZE;
    slots--;
    items++;
    pthread_mutex_unlock(&slots_mutex);
    pthread_mutex_lock(&items_mutex);
    pthread_cond_signal(&items_cond);
    pthread_mutex_unlock(&items_mutex);
}

void get(int* job)
{
    pthread_mutex_lock(&items_mutex);
    while (items<=0)
        pthread_cond_wait(&items_cond,&items_mutex);
    *job = jobs[out];
    out++; out %= SIZE;
    items--;
    slots++;
    pthread_mutex_unlock(&items_mutex);
    pthread_mutex_lock(&slots_mutex);
    pthread_cond_signal(&slots_cond);
    pthread_mutex_unlock(&slots_mutex);
}
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.15 18:34:43 -0600, Robert Hancock wrote: > Björn Steinbrink wrote: > >>My latest bisection attempt actually led to your sata_nv ADMA commit. [1] > >>I've now backed out that patch from 2.6.20-rc5 and have my stress test > >>running for 20 minutes now ("record" for a bad kernel surviving that > >>test is about 40 minutes IIRC). I'll keep it running for at least 2 more > >>hours. > > > >Yep, that one seems to be guilty. 2.6.20-rc5 with that commit backed out > >survived about 3 hours of testing, while the average was around 5 > >minutes for a failure, sometimes even before I could log in. > >I took a look at the patch, but I can't really tell anything. > >nv_adma_check_atapi_dma somehow looks like it should not negate its > >return value, so that it returns 0 (atapi dma available) when > >adma_enable was 1. But I'm not exactly confident about that either ;) > >Will it hurt if I try to remove the negation? > > It should be correct the way it is - that check is trying to prevent > ATAPI commands from using DMA until the slave_config function has been > called to set up the DMA parameters properly. When the > NV_ADMA_ATAPI_SETUP_COMPLETE flag is not set, this returns 1 which > disallows DMA transfers. Unless you were using an ATAPI (i.e. CD/DVD) > device on the channel this wouldn't affect you anyway. I wondered about it, because the flag is cleared when adma_enabled is 1, which seems to be consistent with everything but nv_adma_check_atapi_dma. Thus I thought that nv_adma_check_atapi_dma might be wrong, but maybe setting/clearing the flag is wrong instead? *feels lost* > I'll try your stress test when I get a chance, but I doubt I'll run into > the same problem and I haven't seen any similar reports. Perhaps it's > some kind of wierd timing issue or incompatibility between the > controller and that drive when running in ADMA mode? 
I seem to remember > various reports of issues with certain Maxtor drives and some nForce > SATA controllers under Windows at least.. I just checked Maxtor's knowledge base, that incompatibility does not affect my drive. Thanks, Björn
2.6.20-rc5 nfs+krb => oops
Hi there, I've been curious enough to try 2.6.20-rc5 with nfs4/kerberos. It was working fine before. I was using 2.6.18.1 on the client and 2.6.20-rc3-git4 on the server, and today I tried 2.6.20-rc5 on both client and server (both running up-to-date debian/sid). Trying to mount a nfs4 or nfs3 share with krb5 (did try with krb5 and krb5p) produces this oops on the client side (each time I tried I got the same oops):

[ cut here ]
kernel BUG at net/sunrpc/sched.c:902!
invalid opcode: [#1]
PREEMPT
Modules linked in: rpcsec_gss_spkm3 rfcomm l2cap bluetooth nfsd exportfs nsc_ircc tun ipv6 dm_snapshot dm_mirror dm_mod eeprom i2c_isa eth1394 usbhid snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm_oss snd_pcm snd_mixer_oss snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device ohci1394 ieee1394 ipw2200 snd ieee80211 ieee80211_crypt i2c_i801 psmouse ide_cd r8169 rtc irda ehci_hcd uhci_hcd serio_raw i2c_core cdrom snd_page_alloc usbcore evdev crc_ccitt
CPU: 0
EIP: 0060:[] Not tainted VLI
EFLAGS: 00210297 (2.6.20-rc5 #3)
EIP is at rpc_release_task+0x8f/0xc0
eax: f7e40c80 ebx: f7e40c80 ecx: f51eaac0 edx: c03fcc80
esi: fff3 edi: f6f21c40 ebp: f6f21bf0 esp: f6f21be4
ds: 007b es: 007b ss: 0068
Process mount (pid: 4286, ti=f6f2 task=f6c52030 task.ti=f6f2)
Stack: f6f21bf0 c03f7a77 f7e40c80 f6f21c10 c03f7c0d feff f6f21c7c f76f1180 f6f21c30 c01fe0d6 f6f21c40 7ffbfaef fffe f6f21c7c f6de1a40 f76f1b80 f6f21c58 c01fe436 0fff c050a180
Call Trace:
 [] show_trace_log_lvl+0x1a/0x30
 [] show_stack_log_lvl+0xa9/0xd0
 [] show_registers+0x1ef/0x360
 [] die+0x10b/0x210
 [] do_trap+0x82/0xb0
 [] do_invalid_op+0x97/0xb0
 [] error_code+0x74/0x7c
 [] rpc_call_sync+0x8d/0xb0
 [] nfs3_rpc_wrapper+0x46/0x70
 [] nfs3_proc_getattr+0x46/0x80
 [] nfs_create_server+0x2cf/0x520
 [] nfs_get_sb+0xbd/0x580
 [] vfs_kern_mount+0x40/0x90
 [] do_kern_mount+0x36/0x50
 [] do_mount+0x24e/0x690
 [] sys_mount+0x6f/0xb0
 [] sysenter_past_esp+0x5f/0x85
 ===
Code: d8 e8 86 fc ff ff c7 03 00 00 00 00 8d 43 68 0f ba 73 68 04 ba 04 00 00 00 e8 5e 1d d3 ff 89 d8 e8 f7 fe ff ff 83 c4 08 5b 5d c3 <0f> 0b eb fe 0f 0b eb fe e8 84 2a 01 00 eb be 0f b7 80 94 00 00
EIP: [] rpc_release_task+0x8f/0xc0 SS:ESP 0068:f6f21be4

(it was a proto=udp mount)

I can provide more information if needed, but I'm pretty sure it would be easily reproducible.

--
Re: [patch-mm] Workaround for RAID breakage
On Mon, Jan 15 2007, Thomas Gleixner wrote: > On Mon, 2007-01-15 at 09:08 +0100, Thomas Gleixner wrote: > > > Thomas saw something similar yesterday and he had partial results indicating that > > > git-block (between rc2-mm1 and rc4-mm1) breaks certain disk drivers or > > > filesystem drivers. For me it worked fine, so it must be only on some > > > combinations. The changes to ll_rw_block.c look quite extensive. > > > > Yes. Jens Axboe confirmed yesterday that the plug changes broke RAID. > > I tracked this down and found two problems: > > - The new plug/unplug code does not check for underruns. That allows the > plug count (ioc->plugged) to become negative. This gets triggered from > various places. > > AFAICS this is intentional to avoid checks all over the place, but the > underflow check is missing. All we need to do is make sure, that in case > of ioc->plugged == 0 we return early and bug, if there is either a queue > plugged in or the plugged_list is not empty. > > Jens ? It should not go negative, that would be a bug elsewhere. So it's interesting if it does, we should definitely put a WARN_ON() check in there for that. > - The raid1 code has no bitmap set in remount r/w. So the > pending_bio_list does not get processed for quite some time. The workaround is > to kick mddev->thread, so the list is processed. Not sure about that. > > Neil ? Super, thanks for that Thomas! I'll merge it in the plug branch. -- Jens Axboe
Re: [PATCH] adjust use of unplug in elevator code
On Mon, Jan 15 2007, Linas Vepstas wrote:
> 
> Hi Chris, Jens,
> Can you look at this, and push upstream if this looks reasonable
> to you? It fixes a bug I've been tripping over.
> 
> --linas
> 
> 
> A flag was recently added to the elevator code to avoid
> performing an unplug when requests are being re-queued.
> The goal of this flag was to avoid a deep recursion that
> can occur when re-queueing requests after a SCSI device/host
> reset. See http://lkml.org/lkml/2006/5/17/254
> 
> However, that fix added the flag near the bottom of a case
> statement, where an earlier break (in an if statement) could
> transport one out of the case, without setting the flag.
> This patch sets the flag earlier in the case statement.
> 
> I re-discovered the deep recursion recently during testing;
> I was told that it was a known problem, and the fix to it was
> in the kernel I was testing. Indeed it was ... but it didn't
> fix the bug. With the patch below, I no longer see the bug.
> 
> Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]>
> Cc: Jens Axboe <[EMAIL PROTECTED]>
> Cc: Chris Wright <[EMAIL PROTECTED]>
> 
> 
>  block/elevator.c |   11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> Index: linux-2.6.20-rc4/block/elevator.c
> ===
> --- linux-2.6.20-rc4.orig/block/elevator.c	2007-01-15 14:16:03.0 -0600
> +++ linux-2.6.20-rc4/block/elevator.c	2007-01-15 14:20:04.0 -0600
> @@ -590,6 +590,12 @@ void elv_insert(request_queue_t *q, stru
>  		 */
>  		rq->cmd_flags |= REQ_SOFTBARRIER;
>  
> +		/*
> +		 * Most requeues happen because of a busy condition,
> +		 * don't force unplug of the queue for that case.
> +		 */
> +		unplug_it = 0;
> +
>  		if (q->ordseq == 0) {
>  			list_add(&rq->queuelist, &q->queue_head);
>  			break;
>  		}
> @@ -604,11 +610,6 @@ void elv_insert(request_queue_t *q, stru
>  		}
>  
>  		list_add_tail(&rq->queuelist, pos);
> -		/*
> -		 * most requeues happen because of a busy condition, don't
> -		 * force unplug of the queue for that case.
> -		 */
> -		unplug_it = 0;
>  		break;

Ah, yes it definitely should be moved up, thanks for that!

Acked-by: Jens Axboe <[EMAIL PROTECTED]>

I'll get this merged for 2.6.21.

-- 
Jens Axboe
Re: How to flush the disk write cache from userspace
On Sun, Jan 14 2007, Ricardo Correia wrote:
> Hi, (please CC: to my email address, I'm not subscribed)
> 
> Quick question: how can I flush the disk write cache from userspace?
> 
> Long question:
> 
> I'm porting the Solaris ZFS filesystem to the FUSE/Linux filesystem
> framework. This is a copy-on-write, transactional filesystem and so
> it needs to ensure correct ordering of writes when transactions are
> written to disk.
> 
> At the moment, when transactions end, I'm using a fsync() on the block
> device followed by a ioctl(BLKFLSBUF).
> 
> This is because, according to the fsync manpage, even after fsync()
> returns, data might still be in the disk write cache, so fsync by
> itself doesn't guarantee data safety on power failure.

Depends. Only if the file system does the right thing here; iirc only reiserfs with barriers enabled issues a real disk flush for fsync. So you can't rely on it in general.

> I was looking for something like the Solaris
> ioctl(DKIOCFLUSHWRITECACHE), which does exactly what I need.
> 
> The most similar thing I could find was ioctl(BLKFLSBUF), however a
> search for BLKFLSBUF on the Linux 2.6.15 source doesn't seem to return
> anything related to IDE or SCSI disks.
> 
> Can I trust ioctl(BLKFLSBUF) to flush disks' write caches (for disks
> that follow the specs)?

BLKFLSBUF doesn't flush the disk cache either, it just flushes every dirty page in the block device address space. It would not be very hard to do, basically we have most of the support code in place for this for IO barriers. Basically it would be something like:

blockdev_cache_flush(bdev)
{
	request_queue_t *q = bdev_get_queue(bdev);
	struct request *rq = blk_get_request(q, WRITE, GFP_WHATEVER);
	int ret;

	ret = blk_execute_rq(q, bdev->bd_disk, rq, 0);
	blk_put_request(rq);
	return ret;
}

Somewhat simplified of course, but it should get the point across. Putting that in fs/buffer.c:sync_blockdev() would make BLKFLSBUF work. As always with these things, the devil is in the details. It requires the device to support a ->prepare_flush() queue hook, and not all devices do that. It will work for IDE/SATA/SCSI, though. In some devices you don't want/need to do a real disk flush, it depends on the write cache settings, battery backing, etc.

-- 
Jens Axboe
Re: SATA exceptions with 2.6.20-rc5
Jens Axboe wrote: On Mon, Jan 15 2007, Jeff Garzik wrote: Jens Axboe wrote: I'd be surprised if the device would not obey the 7 second timeout rule that seems to be set in stone and not allow more dirty in-drive cache than it could flush out in approximately that time. AFAIK Windows flush-cache timeout is 30 seconds, not 7 as with other commands... Ok, 7 seconds for FLUSH_CACHE would have been nice for us too though, as it would pretty much guarantee lower latencies for random writes and write back caching. The concern is the barrier code, of course. I guess I should do some timings on potential worst case patterns some day. Alan may have done that sometime in the past, iirc. Note that the ATA-7 spec for FLUSH CACHE says that "This command may take longer than 30 s to complete." -- Robert Hancock Saskatoon, SK, Canada To email, remove "nospam" from [EMAIL PROTECTED] Home Page: http://www.roberthancock.com/
Re: SATA exceptions with 2.6.20-rc5
Björn Steinbrink wrote: My latest bisection attempt actually led to your sata_nv ADMA commit. [1] I've now backed out that patch from 2.6.20-rc5 and have my stress test running for 20 minutes now ("record" for a bad kernel surviving that test is about 40 minutes IIRC). I'll keep it running for at least 2 more hours. Yep, that one seems to be guilty. 2.6.20-rc5 with that commit backed out survived about 3 hours of testing, while the average was around 5 minutes for a failure, sometimes even before I could log in. I took a look at the patch, but I can't really tell anything. nv_adma_check_atapi_dma somehow looks like it should not negate its return value, so that it returns 0 (atapi dma available) when adma_enable was 1. But I'm not exactly confident about that either ;) Will it hurt if I try to remove the negation? It should be correct the way it is - that check is trying to prevent ATAPI commands from using DMA until the slave_config function has been called to set up the DMA parameters properly. When the NV_ADMA_ATAPI_SETUP_COMPLETE flag is not set, this returns 1 which disallows DMA transfers. Unless you were using an ATAPI (i.e. CD/DVD) device on the channel this wouldn't affect you anyway. I'll try your stress test when I get a chance, but I doubt I'll run into the same problem and I haven't seen any similar reports. Perhaps it's some kind of weird timing issue or incompatibility between the controller and that drive when running in ADMA mode? I seem to remember various reports of issues with certain Maxtor drives and some nForce SATA controllers under Windows at least.. -- Robert Hancock Saskatoon, SK, Canada To email, remove "nospam" from [EMAIL PROTECTED] Home Page: http://www.roberthancock.com/
Re: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!
Christoph Anton Mitterer wrote: Sorry, as always I've forgot some things... *g* Robert Hancock wrote: If this is related to some problem with using the GART IOMMU with memory hole remapping enabled What is that GART thing exactly? Is this the hardware IOMMU? I've always thought GART was something graphics card related,.. but if so,.. how could this solve our problem (that seems to occur mainly on harddisks)? The GART built into the Athlon 64/Opteron CPUs is normally used for remapping graphics memory so that an AGP graphics card can see physically non-contiguous memory as one contiguous region. However, Linux can also use it as an IOMMU which allows devices which normally can't access memory above 4GB to see a mapping of that memory that resides below 4GB. In pre-2.6.20 kernels both the SATA and PATA controllers on the nForce 4 chipsets can only access memory below 4GB so transfers to memory above this mark have to go through the IOMMU. In 2.6.20 this limitation is lifted on the nForce4 SATA controllers. then 2.6.20-rc kernels may avoid this problem on nForce4 CK804/MCP04 chipsets as far as transfers to/from the SATA controller are concerned Does this mean that PATA is no related? The corruption appears on PATA disks to, so why should it only solve the issue at SATA disks? Sounds a bit strange to me? The PATA controller will still be using 32-bit DMA and so may also use the IOMMU, so this problem would not be avoided. as the sata_nv driver now supports 64-bit DMA on these chipsets and so no longer requires the IOMMU. Can you explain this a little bit more please? Is this a drawback (like a performance decrease)? Like under Windows where they never use the hardware iommu but always do it via software? No, it shouldn't cause any performance loss. In previous kernels the nForce4 SATA controller was controlled using an interface quite similar to a PATA controller. 
In 2.6.20 kernels they use a more efficient interface that NVidia calls ADMA, which in addition to supporting NCQ also supports DMA without any 4GB limitations, so it can access all memory directly without requiring IOMMU assistance. Note that if this corruption problem is, as has been suggested, related to memory hole remapping and the IOMMU, then this change only prevents the SATA controller transfers from experiencing this problem. Transfers on the PATA controller as well as any other devices with 32-bit DMA limitations might still have problems. As such this really just avoids the problem, not fixes it. -- Robert Hancock Saskatoon, SK, Canada To email, remove "nospam" from [EMAIL PROTECTED] Home Page: http://www.roberthancock.com/
Re: SATA exceptions with 2.6.20-rc5
On Mon, Jan 15 2007, Jeff Garzik wrote: > Jens Axboe wrote: > >I'd be surprised if the device would not obey the 7 second timeout rule > >that seems to be set in stone and not allow more dirty in-drive cache > >than it could flush out in approximately that time. > > AFAIK Windows flush-cache timeout is 30 seconds, not 7 as with other > commands... Ok, 7 seconds for FLUSH_CACHE would have been nice for us too though, as it would pretty much guarantee lower latencies for random writes and write back caching. The concern is the barrier code, of course. I guess I should do some timings on potential worst case patterns some day. Alan may have done that sometime in the past, iirc. -- Jens Axboe
Re: What does this scsi error mean ?
On Mon, Jan 15, 2007 at 11:14:52PM +, Alan wrote: > If you pull the drive and test it in another box does it show the same ? I'm going to try that. The problem requires 3-7 days to appear, so I won't know immediately. > And what does a scsi verify have to say ? Running, looks like it's gonna take a little while. OG.
Re: I broke my port numbers :(
On Mon, Jan 15, 2007 at 23:55:15 +0200, Sami Farin wrote:
> I know this may be entirely my fault but I have tried reversing
> all of my _own_ patches I applied to 2.6.19.2 but can't find what broke this.
> I did three times "netcat 127.0.0.69 42", notice the different
> port numbers.

Hmm... when I do "rmmod iptable_nat ip_nat", it works.

# iptables -t nat --list -nvx
Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target prot opt in out source destination

Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target prot opt in out source destination

Chain OUTPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target prot opt in out source destination

I didn't know functions in ip_nat_proto_tcp.o were called when I have an empty nat table. Oops...

without iptable_nat ip_nat:
64 bytes from 127.0.0.1: icmp_seq=3 ttl=61 time=0.053 ms
with them:
64 bytes from 127.0.0.1: icmp_seq=3 ttl=61 time=0.065 ms

*shrug* live and learn.

2007-01-16 00:44:43.616266500 <4>[ 5672.924459] [] dump_trace+0x215/0x21a
2007-01-16 00:44:43.616267500 <4>[ 5672.924492] [] show_trace_log_lvl+0x1a/0x30
2007-01-16 00:44:43.616269500 <4>[ 5672.924511] [] show_trace+0x12/0x14
2007-01-16 00:44:43.616270500 <4>[ 5672.924529] [] dump_stack+0x19/0x1b
2007-01-16 00:44:43.616271500 <4>[ 5672.924547] [] tcp_unique_tuple+0xd7/0x130 [ip_nat]
2007-01-16 00:44:43.616272500 <4>[ 5672.924585] [] get_unique_tuple+0x5a/0x6e [ip_nat]
2007-01-16 00:44:43.616285500 <4>[ 5672.924593] [] ip_nat_setup_info+0x73/0x1e6 [ip_nat]
2007-01-16 00:44:43.616287500 <4>[ 5672.924601] [] ip_nat_rule_find+0x90/0xb0 [iptable_nat]
2007-01-16 00:44:43.616288500 <4>[ 5672.924610] [] ip_nat_fn+0xd5/0x1ac [iptable_nat]
2007-01-16 00:44:43.616289500 <4>[ 5672.924617] [] ip_nat_out+0x56/0xd3 [iptable_nat]
2007-01-16 00:44:43.616290500 <4>[ 5672.924624] [] nf_iterate+0x4b/0x77
2007-01-16 00:44:43.616295500 <4>[ 5672.925610] [] nf_hook_slow+0x58/0xdf
2007-01-16 00:44:43.617058500 <4>[ 5672.926562] [] ip_output+0x187/0x26a
2007-01-16 00:44:43.618005500 <4>[ 5672.927511] [] ip_queue_xmit+0x4c9/0x5a4
2007-01-16 00:44:43.618955500 <4>[ 5672.928461] [] tcp_transmit_skb+0x25b/0x466
2007-01-16 00:44:43.619911500 <4>[ 5672.929417] [] tcp_connect+0x133/0x1d1
2007-01-16 00:44:43.620865500 <4>[ 5672.930371] [] tcp_v4_connect+0x404/0x750
2007-01-16 00:44:43.621821500 <4>[ 5672.931327] [] inet_stream_connect+0x123/0x1b1
2007-01-16 00:44:43.622789500 <4>[ 5672.932295] [] sys_connect+0x9c/0xbe
2007-01-16 00:44:43.623679500 <4>[ 5672.933185] [] sys_socketcall+0xd2/0x272
2007-01-16 00:44:43.624612500 <4>[ 5672.934072] [] syscall_call+0x7/0xb
2007-01-16 00:44:43.624614500 <4>[ 5672.934092] [<00645410>] 0x645410
2007-01-16 00:44:43.624615500 <4>[ 5672.934116] ===

--
Re: 2.6.20-rc4-mm1
On Mon, Jan 15 2007, Ingo Molnar wrote: > > * Jens Axboe <[EMAIL PROTECTED]> wrote: > > > > In a previous write invoked by: fsck.ext3(1896): WRITE block 8552 on > > > sdb1 end_buffer_async_write() is invoked. > > > > > > sdb1 is not a part of a raid device. > > > > When I briefly tested this before I left (and found it broken), doing > > a cat /proc/mdstat got things going again. Hard if that's your rootfs, > > it's just a hint :-) > > hm, so you knew it's broken, still you let Andrew pick it up, or am i > misunderstanding something? Well the raid issue wasn't known before it was in -mm. -- Jens Axboe
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.15 22:17:24 +0100, Björn Steinbrink wrote: > On 2007.01.14 17:43:53 -0600, Robert Hancock wrote: > > Björn Steinbrink wrote: > > >Hi, > > > > > >with 2.6.20-rc{2,4,5} (no other tested yet) I see SATA exceptions quite > > >often, with 2.6.19 there are no such exceptions. dmesg and lspci -v > > >output follows. In the meantime, I'll start bisecting. > > > > ... > > > > >ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen > > >ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 in > > > res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) > > >ata1: soft resetting port > > >ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) > > >ata1.00: configured for UDMA/133 > > >ata1: EH complete > > >SCSI device sda: 160086528 512-byte hdwr sectors (81964 MB) > > >sda: Write Protect is off > > >sda: Mode Sense: 00 3a 00 00 > > >SCSI device sda: write cache: enabled, read cache: enabled, doesn't > > >support DPO or FUA > > > > Looks like all of these errors are from a FLUSH CACHE command and the > > drive is indicating that it is no longer busy, so presumably done. > > That's not a DMA-mapped command, so it wouldn't go through the ADMA > > machinery and I wouldn't have expected this to be handled any > > differently from before. Curious.. > > My latest bisection attempt actually led to your sata_nv ADMA commit. [1] > I've now backed out that patch from 2.6.20-rc5 and have my stress test > running for 20 minutes now ("record" for a bad kernel surviving that > test is about 40 minutes IIRC). I'll keep it running for at least 2 more > hours. Yep, that one seems to be guilty. 2.6.20-rc5 with that commit backed out survived about 3 hours of testing, while the average was around 5 minutes for a failure, sometimes even before I could log in. I took a look at the patch, but I can't really tell anything. 
nv_adma_check_atapi_dma somehow looks like it should not negate its return value, so that it returns 0 (atapi dma available) when adma_enable was 1. But I'm not exactly confident about that either ;) Will it hurt if I try to remove the negation? Thanks, Björn
Re: What does this scsi error mean ?
On Tue, Jan 16, 2007 at 12:27:17AM +0100, Stefan Richter wrote:
> On 15 Jan, Olivier Galibert wrote:
> > sd 0:0:0:0: SCSI error: return code = 0x0802
> > sda: Current: sense key: Hardware Error
> > ASC=0x42 ASCQ=0x0
>
> The Additional Sense Code means "power-on or self-test failure" FWIW.
> (SPC-4 annex D)

Given that it happens between three days and a week after bootup, on the root drive, it's obviously not the "power on" part. It's kinda annoying that nothing appears in the SMART logs though:

smartctl version 5.36 [x86_64-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: IBM-ESXS ST936701LCFN Version: B41D
Serial number: 3LC0C8P07647WLMV
Device type: disk
Transport protocol: Parallel SCSI (SPI-4)
Local Time is: Tue Jan 16 00:33:09 2007 CET
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature: 33 C
Drive Trip Temperature: 60 C
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 16206797
  Blocks received from initiator = 83607272
  Blocks read from cache and sent to initiator = 3311410
  Number of read and write commands whose size <= segment size = 2801896
  Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 533.07
  number of minutes until next internal SMART test = 112

Error counter log (ECC corrected fast | delayed, rewrites, total errors corrected, correction algorithm invocations, gigabytes processed [10^9 bytes], total uncorrected errors):
  read:  104740 0 10474 10474 61.360 0
  write: 00 0 0 0 58.647 2

Non-medium error count: 1457822

SMART Self-test log
Num  Test Description   Status     segment number  LifeTime (hours)  LBA_first_err  [SK ASC ASQ]
# 1  Background long    Completed  -               407               -              [- --]
# 2  Background short   Completed  -               243               -              [- --]

Long (extended) Self Test duration: 793 seconds [13.2 minutes]

  OG.
[PATCH] CPUSET related breakage of sys_mbind
current->mems_allowed is defined for CONFIG_CPUSETS. This broke the !CPUSETS build. I compile and link tested both variants.

Signed-off-by: Bob Picco <[EMAIL PROTECTED]>

 include/linux/cpuset.h |    6 ++
 mm/mempolicy.c         |    2 +-
 2 files changed, 7 insertions(+), 1 deletion(-)

Index: linux-2.6.20-rc4-mm1/mm/mempolicy.c
===
--- linux-2.6.20-rc4-mm1.orig/mm/mempolicy.c	2007-01-15 09:21:58.0 -0500
+++ linux-2.6.20-rc4-mm1/mm/mempolicy.c	2007-01-15 17:51:15.0 -0500
@@ -882,9 +882,9 @@ asmlinkage long sys_mbind(unsigned long
 	int err;
 
 	err = get_nodes(&nodes, nmask, maxnode);
-	nodes_and(nodes, nodes, current->mems_allowed);
 	if (err)
 		return err;
+	cpuset_nodes_allowed(&nodes);
 	return do_mbind(start, len, mode, &nodes, flags);
 }
 
Index: linux-2.6.20-rc4-mm1/include/linux/cpuset.h
===
--- linux-2.6.20-rc4-mm1.orig/include/linux/cpuset.h	2007-01-15 09:21:32.0 -0500
+++ linux-2.6.20-rc4-mm1/include/linux/cpuset.h	2007-01-15 14:01:30.0 -0500
@@ -75,6 +75,11 @@ static inline int cpuset_do_slab_mem_spr
 
 extern void cpuset_track_online_nodes(void);
 
+static inline void cpuset_nodes_allowed(nodemask_t *nodes)
+{
+	nodes_and(*nodes, *nodes, current->mems_allowed);
+}
+
 #else /* !CONFIG_CPUSETS */
 
 static inline int cpuset_init_early(void) { return 0; }
@@ -145,6 +150,7 @@ static inline int cpuset_do_slab_mem_spr
 }
 
 static inline void cpuset_track_online_nodes(void) {}
+static inline void cpuset_nodes_allowed(nodemask_t *nodes) {}
 
 #endif /* !CONFIG_CPUSETS */
Re: What does this scsi error mean ?
On 15 Jan, Olivier Galibert wrote:
> sd 0:0:0:0: SCSI error: return code = 0x0802
> sda: Current: sense key: Hardware Error
> ASC=0x42 ASCQ=0x0

The Additional Sense Code means "power-on or self-test failure" FWIW. (SPC-4 annex D)
--
Stefan Richter
-=-=-=== ---= =
http://arcgraph.de/sr/
Some kind of 2.6.19 NFS regression
Hi,

Tim Ryan has reported the following bug at the Gentoo bugzilla: https://bugs.gentoo.org/show_bug.cgi?id=162199

His home dir is mounted over NFS. 2.6.18 worked OK but 2.6.19 is very slow to load the desktop environment. NFS is suspected here as the problem does not exist for users with local homedirs. This might not be a straightforward performance issue as it does seem to perform OK on the console.

The bug still exists in unpatched 2.6.20-rc5. Is this a known issue? Should we report a new bug on the kernel bugzilla?

Thanks,
Daniel
Re: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!
Sorry, as always I've forgotten some things... *g*

Robert Hancock wrote:
> If this is related to some problem with using the GART IOMMU with memory
> hole remapping enabled

What is that GART thing exactly? Is it the hardware IOMMU? I've always thought GART was something graphics-card related... but if so, how could this solve our problem (which seems to occur mainly on hard disks)?

> then 2.6.20-rc kernels may avoid this problem on
> nForce4 CK804/MCP04 chipsets as far as transfers to/from the SATA
> controller are concerned

Does this mean that PATA is not affected? The corruption appears on PATA disks too, so why should it only solve the issue for SATA disks? Sounds a bit strange to me.

> as the sata_nv driver now supports 64-bit DMA
> on these chipsets and so no longer requires the IOMMU.

Can you explain this a little more, please? Is this a drawback (like a performance decrease)? Like under Windows, where they never use the hardware IOMMU but always do it via software?

Best wishes,
Chris.

begin:vcard
fn:Mitterer, Christoph Anton
n:Mitterer;Christoph Anton
email;internet:[EMAIL PROTECTED]
x-mozilla-html:TRUE
version:2.1
end:vcard
Re: What does this scsi error mean ?
> Both smart and the internal blade diagnostics say "everything is a-ok
> with the drive, there hasn't been any error ever except a bunch of
> corrected ECC ones, and no more than with a similar drive in another
> working blade". Hence my initial post. "Hardware error" is kinda
> imprecise, so I was wondering whether it was an unexpected controller
> answer, detected transmission error, block write error, sector not
> found... Is there a way to have more information?

Well, the right place to look would indeed have been the SMART data, provided the drive didn't get into a state where it couldn't update it. "Hardware error" comes from the drive deciding something is wrong (or a raid card faking it, I guess). That covers everything from power fluctuations and overheating through firmware consistency failures and more.

If you pull the drive and test it in another box, does it show the same? And what does a scsi verify have to say?

Alan
Re: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!
Hi everybody. Sorry again for my late reply...

Robert gave us the following interesting information some days ago:

Robert Hancock wrote:
> If this is related to some problem with using the GART IOMMU with memory
> hole remapping enabled, then 2.6.20-rc kernels may avoid this problem on
> nForce4 CK804/MCP04 chipsets as far as transfers to/from the SATA
> controller are concerned as the sata_nv driver now supports 64-bit DMA
> on these chipsets and so no longer requires the IOMMU.

I've just tested it with my "normal" BIOS settings, that is memhole mapping = hardware, IOMMU = enabled and 64MB, and _without_ (!) iommu=soft as a kernel parameter. I only had time for a small test (3 passes of 10 complete sha512sum cycles each, over about 30GB of data)... but so far, no corruption occurred.

It is surely far too early to say that our issue was solved by 2.6.20-rc-something, but I ask all of you whose systems suffered from the corruption to make _intensive_ tests with the most recent rc of 2.6.20 (I've used 2.6.20-rc5) and report your results. I'll do an extensive test tomorrow.

And of course (!!): test without using iommu=soft and with memhole mapping enabled (in the BIOS). (It won't make any sense to check whether the new kernel solves our problem while still applying one of our two workarounds.)

Please also note that there might be two completely different data corruption problems: the one "solved" by iommu=soft, and another reported by Kurtis D. Rader. I've asked him to clarify this in a post. :-)

Ok... now if this (the new kernel) really solves the issue... we should try to find out what exactly was changed in the code, and whether it is plausible that this solved the problem or not. The new kernel could just make the corruption even more rare.

Best wishes,
Chris.
Re: [PATCH 2.6.19] USB HID: proper LED-mapping (support for SpaceNavigator)
Jiri Kosina ([EMAIL PROTECTED]) wrote:
> On Mon, 15 Jan 2007, Simon Budig wrote:
> > Is it possible that there is a regression in the hid-debug stuff? The
> > mapping does not seem to appear in the dmesg-output. I unfortunately
> > don't have an earlier kernel available right now to verify, but now the
> > output on plugging in the device looks like this:
> [...]
> (after I check why the debug output seems to be broken),

Actually this might have been a false alarm. I remembered about /var/log/messages and looked up how this looked with earlier kernels - turns out it looks exactly the same. (The values dumped there seem to be the initial values of a given field in a HID report.)

So there is no regression there, sorry about the confusion.

Bye,
Simon
--
[EMAIL PROTECTED]    http://simon.budig.de/
Re: [PATCH -mm] AVR32: fix build breakage
On Mon, 15 Jan 2007 09:37:35 +0100, Haavard Skinnemoen <[EMAIL PROTECTED]> wrote:
> On Mon, 15 Jan 2007 14:48:57 +1100
> Ben Nizette <[EMAIL PROTECTED]> wrote:
> > Remove an unwanted remnant of the recent revert of AVR32/AT91 SPI patches
> > in -mm. Without this patch, the AVR32 build of 2.6.20-rc[34]-mm1 breaks.
>
> Actually, this is broken in my tree. Wonder how I managed to do that
> and not even notice it.

Interestingly, git://www.atmel.no/~hskinnemoen/linux/kernel/avr32.git master is still fine.

> I'll apply this patch and push out a new avr32-arch branch for Andrew.
> Thanks for testing.

Sounds good, no worries.

--Ben

> Haavard
Re: [PATCH] Cell SPU task notification -- updated patch: #1
Attached is an updated patch that addresses Michael Ellerman's comments. One comment made by Michael has not yet been addressed: it was in regard to the for-loop in spufs/sched.c:notify_spus_active(). He wondered whether the scheduler can swap a context from one node to another. If so, there's a small window in this loop (where we switch the lock from one node's active list to the next) where we might miss waking up a context and send a spurious wakeup to another.

Arnd . . . can you comment on this question? Thanks.

-Maynard

Subject: Enable SPU switch notification to detect currently active SPU tasks.
From: Maynard Johnson <[EMAIL PROTECTED]>

This patch adds to the capability of spu_switch_event_register so that the caller is also notified of currently active SPU tasks. It also exports spu_switch_event_register and spu_switch_event_unregister.

Signed-off-by: Maynard Johnson <[EMAIL PROTECTED]>

Index: linux-2.6.19-rc6-arnd1+patches/arch/powerpc/platforms/cell/spufs/sched.c
===
--- linux-2.6.19-rc6-arnd1+patches.orig/arch/powerpc/platforms/cell/spufs/sched.c	2006-12-04 10:56:04.730698720 -0600
+++ linux-2.6.19-rc6-arnd1+patches/arch/powerpc/platforms/cell/spufs/sched.c	2007-01-15 16:22:31.808461448 -0600
@@ -84,15 +84,42 @@
 		ctx ? ctx->object_id : 0, spu);
 }
 
+static void notify_spus_active(void)
+{
+	int node;
+
+	/* Wake up the active spu_contexts. When the awakened processes
+	 * see that their notify_active flag is set, they will call
+	 * spu_notify_already_active().
+	 */
+	for (node = 0; node < MAX_NUMNODES; node++) {
+		struct spu *spu;
+		mutex_lock(&spu_prio->active_mutex[node]);
+		list_for_each_entry(spu, &spu_prio->active_list[node], list) {
+			struct spu_context *ctx = spu->ctx;
+			spu->notify_active = 1;
+			wake_up_all(&ctx->stop_wq);
+			smp_wmb();
+		}
+		mutex_unlock(&spu_prio->active_mutex[node]);
+	}
+	yield();
+}
+
 int spu_switch_event_register(struct notifier_block * n)
 {
-	return blocking_notifier_chain_register(&spu_switch_notifier, n);
+	int ret;
+	ret = blocking_notifier_chain_register(&spu_switch_notifier, n);
+	if (!ret)
+		notify_spus_active();
+	return ret;
 }
+EXPORT_SYMBOL_GPL(spu_switch_event_register);
 
 int spu_switch_event_unregister(struct notifier_block * n)
 {
 	return blocking_notifier_chain_unregister(&spu_switch_notifier, n);
 }
+EXPORT_SYMBOL_GPL(spu_switch_event_unregister);
 
 static inline void bind_context(struct spu *spu, struct spu_context *ctx)
@@ -250,6 +277,14 @@
 	return spu_get_idle(ctx, flags);
 }
 
+void spu_notify_already_active(struct spu_context *ctx)
+{
+	struct spu *spu = ctx->spu;
+	if (!spu)
+		return;
+	spu_switch_notify(spu, ctx);
+}
+
 /* The three externally callable interfaces
  * for the scheduler begin here.
  */

Index: linux-2.6.19-rc6-arnd1+patches/arch/powerpc/platforms/cell/spufs/spufs.h
===
--- linux-2.6.19-rc6-arnd1+patches.orig/arch/powerpc/platforms/cell/spufs/spufs.h	2007-01-08 18:18:40.093354608 -0600
+++ linux-2.6.19-rc6-arnd1+patches/arch/powerpc/platforms/cell/spufs/spufs.h	2007-01-08 18:31:03.610345792 -0600
@@ -183,6 +183,7 @@
 void spu_yield(struct spu_context *ctx);
 int __init spu_sched_init(void);
 void __exit spu_sched_exit(void);
+void spu_notify_already_active(struct spu_context *ctx);
 
 extern char *isolated_loader;

Index: linux-2.6.19-rc6-arnd1+patches/arch/powerpc/platforms/cell/spufs/run.c
===
--- linux-2.6.19-rc6-arnd1+patches.orig/arch/powerpc/platforms/cell/spufs/run.c	2007-01-08 18:33:51.979311680 -0600
+++ linux-2.6.19-rc6-arnd1+patches/arch/powerpc/platforms/cell/spufs/run.c	2007-01-15 16:31:30.10442 -0600
@@ -45,9 +45,11 @@
 	u64 pte_fault;
 
 	*stat = ctx->ops->status_read(ctx);
-	if (ctx->state != SPU_STATE_RUNNABLE)
-		return 1;
+	smp_rmb();
+	spu = ctx->spu;
+	if (ctx->state != SPU_STATE_RUNNABLE || spu->notify_active)
+		return 1;
 	pte_fault = spu->dsisr & (MFC_DSISR_PTE_NOT_FOUND | MFC_DSISR_ACCESS_DENIED);
 	return (!(*stat & 0x1) || pte_fault || spu->class_0_pending) ? 1 : 0;
@@ -304,6 +306,7 @@
 	u32 *npc, u32 *event)
 {
 	int ret;
+	struct spu *spu;
 	u32 status;
 
 	if (down_interruptible(&ctx->run_sema))
@@ -317,8 +320,16 @@
 	do {
 		ret = spufs_wait(ctx->stop_wq, spu_stopped(ctx, &status));
+		spu = ctx->spu;
 		if (unlikely(ret))
 			break;
+		if (unlikely(spu->notify_active)) {
+			spu->notify_active = 0;
+			if (!(status & SPU_STATUS_STOPPED_BY_STOP)) {
+				spu_notify_already_active(ctx);
+				continue;
+			}
+		}
 		if ((status & SPU_STATUS_STOPPED_BY_STOP) &&
 		    (status >> SPU_STOP_STATUS_SHIFT == 0x2104)) {
 			ret = spu_process_callback(ctx);
Re: High CPU usage with sata_nv
On Mon, 15 Jan 2007 18:26:42 +, Frederik Deweerdt wrote:
> On Mon, Jan 15, 2007 at 06:54:50PM +0200, ris wrote:
> > I have a motherboard with nforce 590 SLI (MCP55) chipset.
> > On other systems all is ok.
> >
> > But I tried a lot of kernels and configurations and always get cpu at 100% when
> > copying files. I use a SATA II samsung hard drive.
>
> Any dmesg complaint? Could you send the hdparm -I ?
> Regards,
> Frederik

Ok ... hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
	Model Number:       SAMSUNG SP2504C
	Serial Number:      S09QJ13LA07964
	Firmware Revision:  VT100-50
Standards:
	Used: ATA/ATAPI-7 T13 1532D revision 4a
	Supported: 7 6 5 4
Configuration:
	Logical		max	current
	cylinders	16383	16383
	heads		16	16
	sectors/track	63	63
	--
	CHS current addressable sectors:   16514064
	LBA    user addressable sectors:  268435455
	LBA48  user addressable sectors:  488397168
	device size with M = 1024*1024:      238475 MBytes
	device size with M = 1000*1000:      250059 MBytes (250 GB)
Capabilities:
	LBA, IORDY(can be disabled)
	Queue depth: 32
	Standby timer values: spec'd by Standard, no device specific minimum
	R/W multiple sector transfer: Max = 16	Current = 1
	Recommended acoustic management value: 254, current value: 0
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 udma7
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
		Security Mode feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	Host Protected Area feature set
	   *	WRITE_BUFFER command
	   *	READ_BUFFER command
	   *	NOP cmd
	   *	DOWNLOAD_MICROCODE
		SET_MAX security extension
		Automatic Acoustic Management feature set
	   *	48-bit Address feature set
	   *	Device Configuration Overlay feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART error logging
	   *	SMART self-test
	   *	General Purpose Logging feature set
	   *	Segmented DOWNLOAD_MICROCODE
	   *	SATA-I signaling speed (1.5Gb/s)
	   *	SATA-II signaling speed (3.0Gb/s)
	   *	Native Command Queueing (NCQ)
	   *	Host-initiated interface power management
	   *	Phy event counters
		DMA Setup Auto-Activate optimization
		Device-initiated interface power management
	   *	Software settings preservation
	   *	SMART Command Transport (SCT) feature set
	   *	SCT Long Sector Access (AC1)
	   *	SCT LBA Segment Access (AC2)
	   *	SCT Error Recovery Control (AC3)
	   *	SCT Features Control (AC4)
	   *	SCT Data Tables (AC5)
Security:
	Master password revision code = 65534
		supported
	not	enabled
	not	locked
	not	frozen
	not	expired: security count
		supported: enhanced erase
	88min for SECURITY ERASE UNIT. 88min for ENHANCED SECURITY ERASE UNIT.
Checksum: correct

and dmesg

Linux version 2.6.19-gentoo-r4 ([EMAIL PROTECTED]) (gcc version 4.1.1 (Gentoo 4.1.1-r3)) #2 SMP Mon Jan 15 15:14:18 CET 2007
Command line: BOOT_IMAGE=Gentoo root=802
BIOS-provided physical RAM map:
 BIOS-e820: - 0009f000 (usable)
 BIOS-e820: 0009f000 - 000a (reserved)
 BIOS-e820: 000f - 0010 (reserved)
 BIOS-e820: 0010 - 3fee (usable)
 BIOS-e820: 3fee - 3fee3000 (ACPI NVS)
 BIOS-e820: 3fee3000 - 3fef (ACPI data)
 BIOS-e820: 3fef - 3ff0 (reserved)
 BIOS-e820: f000 - f400 (reserved)
 BIOS-e820: fec0 - 0001 (reserved)
Entering add_active_range(0, 0, 159) 0 entries of 256 used
Entering add_active_range(0, 256, 261856) 1 entries of 256 used
end_pfn_map = 1048576
DMI 2.4 present.
ACPI: RSDP (v002 Nvidia) @ 0x000f8040
ACPI: XSDT (v001 Nvidia ASUSACPI 0x42302e31 AWRD 0x) @ 0x3fee30c0
ACPI: FADT (v003 Nvidia ASUSACPI 0x42302e31 AWRD 0x) @ 0x3feed200
ACPI: HPET (v001 Nvidia ASUSACPI 0x42302e31 AWRD 0x0098) @ 0x3feed400
ACPI: MCFG (v001 Nvidia ASUSACPI 0x42302e31 AWRD 0x) @ 0x3feed480
ACPI: MADT (v001 Nvidia ASUSACPI 0x42302e31 AWRD 0x) @ 0x3feed340
ACPI: DSDT (v001 NVIDIA AWRDACPI 0x1000 MSFT
Re: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!
Hi. Some days ago I received the following message from "Sunny Days". I think he did not send it to lkml, so I forward it now:

Sunny Days wrote:
> hello,
>
> i have done some extensive testing on this.
>
> various opterons, always single socket
> various dimms, 1 and 2gb modules
> and hitachi+seagate disks with various firmwares and sizes
> but i am getting a different pattern in the corruption.
> My test file was 10gb.
>
> I have mapped the earliest corruption as low as 10mb into the written data.
> i have also monitored the address range used by the cp/md5sum process
> under /proc/$PID/maps to see if i could find a pattern but i was
> unable to.
>
> i also tested ext2 and lvm with similar results, aka corruption.
> later in the week i should get a pci promise controller and test on that one.
>
> Things i have not tested are the patch that linus released 10 days ago
> and reiserfs3/4.
>
> my nvidia chipset was ck804 (a3)
>
> Hope somehow we get to the bottom of this.
>
> Hope this helps
>
> btw, amd errata that could possibly influence this are
> 115, 123 and 156, with the latter being fascinating as the workaround
> suggested is a 0x0 page entry.
>
> Does anyone have any opinions about this?

Could you please read the mentioned errata and tell me what you think?

Best wishes,
Chris.

@ Sunny Days: Thanks for your mail.
[PATCH] seq_file conversion: toshiba.c
Compile-tested.

Signed-off-by: Alexey Dobriyan <[EMAIL PROTECTED]>
---

 drivers/char/toshiba.c | 35 +--
 1 file changed, 25 insertions(+), 10 deletions(-)

--- a/drivers/char/toshiba.c
+++ b/drivers/char/toshiba.c
@@ -68,6 +68,7 @@
 #include
 #include
 #include
 #include
+#include <linux/seq_file.h>
 #include
@@ -298,12 +299,10 @@ static int tosh_ioctl(struct inode *ip,
  * Print the information for /proc/toshiba
  */
 #ifdef CONFIG_PROC_FS
-static int tosh_get_info(char *buffer, char **start, off_t fpos, int length)
+static int proc_toshiba_show(struct seq_file *m, void *v)
 {
-	char *temp;
 	int key;
 
-	temp = buffer;
 	key = tosh_fn_status();
 
 	/* Arguments
@@ -314,8 +313,7 @@ static int tosh_get_info(char *buffer, c
 	   4) BIOS date (in SCI date format)
 	   5) Fn Key status
 	*/
-
-	temp += sprintf(temp, "1.1 0x%04x %d.%d %d.%d 0x%04x 0x%02x\n",
+	seq_printf(m, "1.1 0x%04x %d.%d %d.%d 0x%04x 0x%02x\n",
 		tosh_id,
 		(tosh_sci & 0xff00)>>8,
 		tosh_sci & 0xff,
@@ -323,9 +321,21 @@ static int tosh_get_info(char *buffer, c
 		tosh_bios & 0xff,
 		tosh_date,
 		key);
+	return 0;
+}
 
-	return temp-buffer;
+static int proc_toshiba_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, proc_toshiba_show, NULL);
 }
+
+static const struct file_operations proc_toshiba_fops = {
+	.owner		= THIS_MODULE,
+	.open		= proc_toshiba_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
 #endif
@@ -508,10 +518,15 @@ static int __init toshiba_init(void)
 		return retval;
 
 #ifdef CONFIG_PROC_FS
-	/* register the proc entry */
-	if (create_proc_info_entry("toshiba", 0, NULL, tosh_get_info) == NULL) {
-		misc_deregister(&tosh_device);
-		return -ENOMEM;
+	{
+		struct proc_dir_entry *pde;
+
+		pde = create_proc_entry("toshiba", 0, NULL);
+		if (!pde) {
+			misc_deregister(&tosh_device);
+			return -ENOMEM;
+		}
+		pde->proc_fops = &proc_toshiba_fops;
 	}
 #endif
Re: [PATCH] sed s/gawk/awk/ scripts/gen_init_ramfs.sh
On Mon, Jan 15, 2007 at 04:24:17PM -0500, Rob Landley wrote:
> Signed-off-by: Rob Landley <[EMAIL PROTECTED]>

Acked-by: Sam Ravnborg <[EMAIL PROTECTED]>

PS: My dev machine is broken and I need a new one before kbuild.git will be alive again. Considering an AMD Athlon 64 X2 based one with Nvidia GeForce™ 6150LE: http://h10010.www1.hp.com/wwpc/dk/da/ho/WF06b/34307-351123-1284187-1284187-1284187-12726540-78048221.html

Anyone with comments on this choice?

Sam

> Use "awk" instead of "gawk".
>
> There's a symlink from awk to gawk if you're using the gnu tools, but no
> symlink from gawk to awk if you're using BusyBox or some such. (There's a
> reason for the existence of standard names. Can we use them please?)
>
> --- linux-2.6.19.2/scripts/gen_initramfs_list.sh	2007-01-10 14:10:37.0 -0500
> +++ linux-new/scripts/gen_initramfs_list.sh	2007-01-15 10:14:41.0 -0500
> @@ -121,9 +121,9 @@
>  		"nod")
>  			local dev_type=
>  			local maj=$(LC_ALL=C ls -l "${location}" | \
> -				gawk '{sub(/,/, "", $5); print $5}')
> +				awk '{sub(/,/, "", $5); print $5}')
>  			local min=$(LC_ALL=C ls -l "${location}" | \
> -				gawk '{print $6}')
> +				awk '{print $6}')
>
>  			if [ -b "${location}" ]; then
>  				dev_type="b"
> @@ -134,7 +134,7 @@
>  			;;
>  		"slink")
>  			local target=$(LC_ALL=C ls -l "${location}" | \
> -				gawk '{print $11}')
> +				awk '{print $11}')
>  			str="${ftype} ${name} ${target} ${str}"
>  			;;
>  		*)
>
> --
> "Perfection is reached, not when there is no longer anything to add, but
> when there is no longer anything to take away." - Antoine de Saint-Exupery
Re: umask ignored in mkdir(2)?
[I've rearranged this to avoid a horrid mix of top and bottom posting]

On Sun, 14 Jan 2007, Tigran Aivazian wrote:
> On Sun, 14 Jan 2007, Tigran Aivazian wrote:
> > On Sun, 14 Jan 2007, Tigran Aivazian wrote:
> > > I think I may have found a bug --- on one of my machines the umask value
> > > is ignored by ext3 (but honoured on tmpfs) for the mkdir system call:
> > >
> > > $ cd /tmp
> > > $ df -T .
> > > Filesystem    Type   1K-blocks       Used  Available Use% Mounted on
> > > /dev/hdf1     ext3   189238556  155721568   23749068  87% /
> > > $ rmdir ok ; mkdir ok ; ls -ld ok
> > > rmdir: ok: No such file or directory
> > > drwxrwxrwx 2 tigran tigran 4096 Jan 14 20:36 ok/
> > > $ umask
> > > 0022
> > > $ cd /dev/shm
> > > $ df -T .
> > > Filesystem    Type   1K-blocks  Used  Available Use% Mounted on
> > > tmpfs         tmpfs     517988     0     517988   0% /dev/shm
> > > $ rmdir ok ; mkdir ok ; ls -ld ok
> > > rmdir: ok: No such file or directory
> > > drwxr-xr-x 2 tigran tigran 40 Jan 14 20:36 ok/
> > > $ uname -a
> > > Linux ws 2.6.19.1 #6 SMP Sun Jan 14 20:03:30 GMT 2007 i686 i686 i386 GNU/Linux
> > > $ grep -i acl /usr/src/linux/.config
> > > # CONFIG_FS_POSIX_ACL is not set
> > > # CONFIG_TMPFS_POSIX_ACL is not set
> > > # CONFIG_NFS_V3_ACL is not set
> > > # CONFIG_NFSD_V3_ACL is not set
> > >
> > > As you see, ACL is not configured in, and neither are extended attributes:
> > >
> > > $ grep -i xattr /usr/src/linux/.config
> > > # CONFIG_EXT2_FS_XATTR is not set
> > > # CONFIG_EXT3_FS_XATTR is not set
> > >
> > > So, this is something fs-specific. What do you think?
> >
> > I forgot to mention that on another machine running the same kernel version
> > with the same (as close as a UP machine can be to SMP) kernel configuration
> > the umask is honoured properly on the ext3 filesystem.
>
> I figured it out! I thought you might be interested --- the reason is the
> mismatch between the default mount options stored in the superblock on disk
> and the filesystem features compiled into the kernel.
>
> Namely, dumpe2fs on the offending filesystems showed the following default
> mount options:
>
>     user_xattr acl
>
> but on good filesystems it showed "(none)". So, I used "tune2fs -o ^acl"
> (and ^user_xattr) to clear these in the superblock and mounted the filesystem
> --- and now the mkdir system call works as expected, i.e. honours the umask.
>
> Maybe the ext3 filesystem should automatically detect this (the mismatch) and
> printk a warning so the user is told that his filesystem is mounted in an
> extremely insecure way, i.e. making directories as root will result in lots of
> 0777 places (e.g. try "make modules_install" --- this will create lots of
> security holes in /lib/modules).
>
> I cc'd linux-kernel as someone may wish to fix this.

Good find! Though I suppose not much of a worry for distros, whose kernels will always(?) have ACLs configured in.

I get sooo confused when there's multiple ways of switching something on and off (at the ifdef level and at the mount opts level and at the tuning level), looks like others do too. Here's my third version of a patch, already wondering if a fourth would be better (at the point where they set s_flags) ... no, I think this one is more robust...

[PATCH] fix umask when noACL kernel meets extN tuned for ACLs

Fix insecure default behaviour reported by Tigran Aivazian: if an ext2 or ext3 or ext4 filesystem is tuned to mount with "acl", but mounted by a kernel built without ACL support, then umask was ignored when creating inodes - though root or user has umask 022, touch creates files as 0666, and mkdir creates directories as 0777.

This appears to have worked right until 2.6.11, when a fix to the default mode on symlinks (always 0777) assumed VFS applies umask: which it does, unless the mount is marked for ACLs; but ext[234] set MS_POSIXACL in s_flags according to s_mount_opt set according to def_mount_opts.

We could revert to the 2.6.10 ext[234]_init_acl (adding an S_ISLNK test); but other filesystems only set MS_POSIXACL when ACLs are configured. We could fix this at another level; but it seems most robust to avoid setting the s_mount_opt flag in the first place (at the expense of more ifdefs). Likewise don't set the XATTR_USER flag when built without XATTR support.

Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]>
---

 fs/ext2/super.c | 4
 fs/ext3/super.c | 4
 fs/ext4/super.c | 4
 3 files changed, 12 insertions(+)

--- 2.6.20-rc5/fs/ext2/super.c	2007-01-13 08:46:07.0 +
+++ linux/fs/ext2/super.c	2007-01-15 20:48:38.0 +
@@ -708,10 +708,14 @@ static int ext2_fill_super(struct super_
 		set_opt(sbi->s_mount_opt, GRPID);
 	if (def_mount_opts & EXT2_DEFM_UID16)
 		set_opt(sbi->s_mount_opt, NO_UID32);
+#ifdef CONFIG_EXT2_FS_XATTR
 	if (def_mount_opts & EXT2_DEFM_XATTR_USER)
 		set_opt(sbi->s_mount_opt, XATTR_USER);
+#endif
+#ifdef
I broke my port numbers :(
I know this may be entirely my fault, but I have tried reversing all of my _own_ patches applied to 2.6.19.2 and can't find what broke this. I did "netcat 127.0.0.69 42" three times; notice the different port numbers. First, if someone could attempt this on 2.6.19.2 or 2.6.20-rc* and tell me it works, I'll shut up.

2007-01-15 23:42:05.833636 IP (tos 0x0, ttl 61, id 34230, offset 0, flags [DF], proto: TCP (6), length: 60) 127.0.0.69.23287 > 127.0.0.69.42: SWE, cksum 0x0281 (correct), 674651575:674651575(0) win 32792
2007-01-15 23:42:05.833673 IP (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto: TCP (6), length: 40) 127.0.0.69.42 > 127.0.0.69.52935: R, cksum 0x5c66 (correct), 0:0(0) ack 674651576 win 0
2007-01-15 23:42:06.009245 IP (tos 0x0, ttl 61, id 11189, offset 0, flags [DF], proto: TCP (6), length: 60) 127.0.0.69.20161 > 127.0.0.69.42: SWE, cksum 0x96b3 (correct), 678941897:678941897(0) win 32792
2007-01-15 23:42:06.009289 IP (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto: TCP (6), length: 40) 127.0.0.69.42 > 127.0.0.69.52936: R, cksum 0xe511 (correct), 0:0(0) ack 678941898 win 0
2007-01-15 23:42:06.169587 IP (tos 0x0, ttl 61, id 36607, offset 0, flags [DF], proto: TCP (6), length: 60) 127.0.0.69.52470 > 127.0.0.69.42: SWE, cksum 0x15b5 (correct), 681498315:681498315(0) win 32792
2007-01-15 23:42:06.169624 IP (tos 0x0, ttl 61, id 0, offset 0, flags [DF], proto: TCP (6), length: 40) 127.0.0.69.42 > 127.0.0.69.52937: R, cksum 0xe2e7 (correct), 0:0(0) ack 681498316 win 0

If something was listening on port 42, it would see the wrong port, e.g. 23287, 20161 or 52470, not 52935, 52936 or 52937.
[PATCH] seq_file conversion: coda
Compile-tested.

Signed-off-by: Alexey Dobriyan <[EMAIL PROTECTED]>
---

 fs/coda/sysctl.c | 76 ---
 1 file changed, 39 insertions(+), 37 deletions(-)

--- a/fs/coda/sysctl.c
+++ b/fs/coda/sysctl.c
@@ -15,6 +15,7 @@
 #include
 #include
 #include
 #include
+#include <linux/seq_file.h>
 #include
 #include
 #include
@@ -84,15 +85,11 @@ static int do_reset_coda_cache_inv_stats
 	return 0;
 }
 
-static int coda_vfs_stats_get_info( char * buffer, char ** start,
-				    off_t offset, int length)
+static int proc_vfs_stats_show(struct seq_file *m, void *v)
 {
-	int len=0;
-	off_t begin;
 	struct coda_vfs_stats * ps = & coda_vfs_stat;
 
-	/* this works as long as we are below 1024 characters! */
-	len += sprintf( buffer,
+	seq_printf(m,
 		"Coda VFS statistics\n"
 		"===\n\n"
 		"File Operations:\n"
@@ -132,28 +129,14 @@ static int coda_vfs_stats_get_info( char
 		ps->rmdir,
 		ps->rename,
 		ps->permission);
-
-	begin = offset;
-	*start = buffer + begin;
-	len -= begin;
-
-	if ( len > length )
-		len = length;
-	if ( len < 0 )
-		len = 0;
-
-	return len;
+	return 0;
 }
 
-static int coda_cache_inv_stats_get_info( char * buffer, char ** start,
-					  off_t offset, int length)
+static int proc_cache_inv_stats_show(struct seq_file *m, void *v)
 {
-	int len=0;
-	off_t begin;
 	struct coda_cache_inv_stats * ps = & coda_cache_inv_stat;
 
-	/* this works as long as we are below 1024 characters! */
-	len += sprintf( buffer,
+	seq_printf(m,
 		"Coda cache invalidation statistics\n"
 		"==\n\n"
 		"flush\t\t%9d\n"
@@ -170,19 +153,35 @@ static int coda_cache_inv_stats_get_info
 		ps->zap_vnode,
 		ps->purge_fid,
 		ps->replace );
-
-	begin = offset;
-	*start = buffer + begin;
-	len -= begin;
+	return 0;
+}
 
-	if ( len > length )
-		len = length;
-	if ( len < 0 )
-		len = 0;
+static int proc_vfs_stats_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, proc_vfs_stats_show, NULL);
+}
 
-	return len;
+static int proc_cache_inv_stats_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, proc_cache_inv_stats_show, NULL);
 }
 
+static const struct file_operations proc_vfs_stats_fops = {
+	.owner		= THIS_MODULE,
+	.open		= proc_vfs_stats_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static const struct file_operations proc_cache_inv_stats_fops = {
+	.owner		= THIS_MODULE,
+	.open		= proc_cache_inv_stats_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 static ctl_table coda_table[] = {
 	{CODA_TIMEOUT, "timeout", &coda_timeout, sizeof(int), 0644, NULL, &proc_dointvec},
 	{CODA_HARD, "hard", &coda_hard, sizeof(int), 0644, NULL, &proc_dointvec},
@@ -212,9 +211,6 @@ static struct proc_dir_entry* proc_fs_co
 #endif
 
-#define coda_proc_create(name,get_info) \
-	create_proc_info_entry(name, 0, proc_fs_coda, get_info)
-
 void coda_sysctl_init(void)
 {
 	reset_coda_vfs_stats();
@@ -223,9 +219,15 @@ void coda_sysctl_init(void)
 #ifdef CONFIG_PROC_FS
 	proc_fs_coda = proc_mkdir("coda", proc_root_fs);
 	if (proc_fs_coda) {
+		struct proc_dir_entry *pde;
+
 		proc_fs_coda->owner = THIS_MODULE;
-		coda_proc_create("vfs_stats", coda_vfs_stats_get_info);
-		coda_proc_create("cache_inv_stats", coda_cache_inv_stats_get_info);
+		pde = create_proc_entry("vfs_stats", 0, proc_fs_coda);
+		if (pde)
+			pde->proc_fops = &proc_vfs_stats_fops;
+		pde = create_proc_entry("cache_inv_stats", 0, proc_fs_coda);
+		if (pde)
+			pde->proc_fops = &proc_cache_inv_stats_fops;
 	}
 #endif
[PATCH] adjust use of unplug in elevator code
Hi Chris, Jens,

Can you look at this, and push upstream if this looks reasonable to you?
It fixes a bug I've been tripping over.

--linas

A flag was recently added to the elevator code to avoid performing
an unplug when requests are being re-queued. The goal of this flag
was to avoid a deep recursion that can occur when re-queueing requests
after a SCSI device/host reset. See http://lkml.org/lkml/2006/5/17/254

However, that fix added the flag near the bottom of a case statement,
where an earlier break (in an if statement) could transport one out of
the case, without setting the flag. This patch sets the flag earlier in
the case statement.

I re-discovered the deep recursion recently during testing; I was told
that it was a known problem, and the fix to it was in the kernel I was
testing. Indeed it was ... but it didn't fix the bug. With the patch
below, I no longer see the bug.

Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]>
Cc: Jens Axboe <[EMAIL PROTECTED]>
Cc: Chris Wright <[EMAIL PROTECTED]>

 block/elevator.c |   11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

Index: linux-2.6.20-rc4/block/elevator.c
===
--- linux-2.6.20-rc4.orig/block/elevator.c	2007-01-15 14:16:03.0 -0600
+++ linux-2.6.20-rc4/block/elevator.c	2007-01-15 14:20:04.0 -0600
@@ -590,6 +590,12 @@ void elv_insert(request_queue_t *q, stru
 		 */
 		rq->cmd_flags |= REQ_SOFTBARRIER;
 
+		/*
+		 * Most requeues happen because of a busy condition,
+		 * don't force unplug of the queue for that case.
+		 */
+		unplug_it = 0;
+
 		if (q->ordseq == 0) {
 			list_add(&rq->queuelist, &q->queue_head);
 			break;
@@ -604,11 +610,6 @@ void elv_insert(request_queue_t *q, stru
 		}
 		list_add_tail(&rq->queuelist, pos);
-		/*
-		 * most requeues happen because of a busy condition, don't
-		 * force unplug of the queue for that case.
-		 */
-		unplug_it = 0;
 		break;
 
 	default:
Re: What does this scsi error mean ?
On Mon, Jan 15, 2007 at 06:45:40PM +, Alan wrote:
> On Mon, 15 Jan 2007 18:16:02 +0100
> Olivier Galibert <[EMAIL PROTECTED]> wrote:
>
> > sd 0:0:0:0: SCSI error: return code = 0x0802
> > sda: Current: sense key: Hardware Error
> >     ASC=0x42 ASCQ=0x0
>
> I'll give you a clue: The words "Hardware Error".
>
> Run a SCSI verify pass on the drive with some drive utilities and see
> what happens. If you are lucky it'll just reallocate blocks and decide
> the drive is ok, if not well see what the smart data thinks.

Both smart and the internal blade diagnostics say "everything is a-ok
with the drive, there hasn't been any error ever except a bunch of
corrected ECC ones, and no more than with a similar drive in another
working blade". Hence my initial post.

"Hardware error" is kinda imprecise, so I was wondering whether it was
unexpected controller answer, detected transmission error, block write
error, sector not found... Is there a way to have more information?

  OG.
Re: allocation failed: out of vmalloc space error treating and VIDEO1394 IOC LISTEN CHANNEL ioctl failed problem
On 1/15/07, Arjan van de Ven <[EMAIL PROTECTED]> wrote:
> > However, what I'd really like to do is to leave it to user space to
> > allocate the memory as David describes. In the transmit case, user
> > space allocates memory (malloc or mmap) and loads the payload into
> > that buffer.
>
> there is a lot of pain involved with doing things this way, it is a TON
> better if YOU provide the memory via a custom mmap handler for a device
> driver. (there are a lot of security nightmares involved with the
> opposite model, like the user can put any kind of memory there, even
> pci mmio space)

OK, point taken. I don't have a strong preference for the opposite
model, it just seems elegant that you can let user space handle
allocation and pin and map the pages as needed. But you're right, it
certainly is easier to give safe memory to user space in the first
place rather than try to make sure user space isn't trying to trick us.

> > Then it does an ioctl() on the firewire control device
>
> ioctls are evil ;) esp an "mmap me" ioctl

Ah, I'm not mmap'ing it from the ioctl, I do implement the mmap file
operation for this. However, you have to do an ioctl before mapping the
device to configure the dma context. Other than that, what is the
problem with ioctls, and more interesting, what is the alternative? I
don't expect (or want) a bunch of syscalls to be added for this, so I
don't really see what other mechanism I should use for this.

> > It's not too difficult from what I'm doing now, I'd just like to give
> > user space more control over the buffers it uses for streaming (i.e.
> > letting user space allocate them). What I'm missing here is: how do I
> > actually pin a page in memory? I'm sure it's not too difficult, but I
> > haven't yet figured it out and I'm sure somebody knows it off the top
> > of his head.
>
> again the best way is for you to provide an mmap method... you can then
> fill in the pages and keep that in some sort of array; this is for
> example also what the DRI/DRM layer does for textures etc...

That sounds a lot like what I have now (mmap method, array of pages) so
I'll just stick with that.

thanks,
Kristian
[PATCH] sed s/gawk/awk/ scripts/gen_init_ramfs.sh
Signed-off-by: Rob Landley <[EMAIL PROTECTED]>

Use "awk" instead of "gawk".
--
There's a symlink from awk to gawk if you're using the gnu tools, but no
symlink from gawk to awk if you're using BusyBox or some such. (There's
a reason for the existence of standard names. Can we use them please?)

--- linux-2.6.19.2/scripts/gen_initramfs_list.sh	2007-01-10 14:10:37.0 -0500
+++ linux-new/scripts/gen_initramfs_list.sh	2007-01-15 10:14:41.0 -0500
@@ -121,9 +121,9 @@
 		"nod")
 			local dev_type=
 			local maj=$(LC_ALL=C ls -l "${location}" | \
-					gawk '{sub(/,/, "", $5); print $5}')
+					awk '{sub(/,/, "", $5); print $5}')
 			local min=$(LC_ALL=C ls -l "${location}" | \
-					gawk '{print $6}')
+					awk '{print $6}')
 
 			if [ -b "${location}" ]; then
 				dev_type="b"
@@ -134,7 +134,7 @@
 			;;
 		"slink")
 			local target=$(LC_ALL=C ls -l "${location}" | \
-					gawk '{print $11}')
+					awk '{print $11}')
 			str="${ftype} ${name} ${target} ${str}"
 			;;
 		*)

--
"Perfection is reached, not when there is no longer anything to add, but
when there is no longer anything to take away." - Antoine de Saint-Exupery
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.14 17:43:53 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >Hi,
> >
> >with 2.6.20-rc{2,4,5} (no other tested yet) I see SATA exceptions quite
> >often, with 2.6.19 there are no such exceptions. dmesg and lspci -v
> >output follows. In the meantime, I'll start bisecting.
>
> ...
>
> >ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> >ata1.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 in
> >         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> >ata1: soft resetting port
> >ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> >ata1.00: configured for UDMA/133
> >ata1: EH complete
> >SCSI device sda: 160086528 512-byte hdwr sectors (81964 MB)
> >sda: Write Protect is off
> >sda: Mode Sense: 00 3a 00 00
> >SCSI device sda: write cache: enabled, read cache: enabled, doesn't
> >support DPO or FUA
>
> Looks like all of these errors are from a FLUSH CACHE command and the
> drive is indicating that it is no longer busy, so presumably done.
> That's not a DMA-mapped command, so it wouldn't go through the ADMA
> machinery and I wouldn't have expected this to be handled any
> differently from before. Curious..

My latest bisection attempt actually led to your sata_nv ADMA commit. [1]
I've now backed out that patch from 2.6.20-rc5 and have my stress test
running for 20 minutes now ("record" for a bad kernel surviving that
test is about 40 minutes IIRC). I'll keep it running for at least 2
more hours.

The test is pretty simple:

while /bin/true; do ls -lR > /dev/null; done
while /bin/true; do echo 255 > /proc/sys/vm/drop_caches; sleep 1; done

running in parallel.

Björn

[1] 2dec7555e6bf2772749113ea0ad454fcdb8cf861
Re: [PATCH 2.6.19] USB HID: proper LED-mapping (support for SpaceNavigator)
On Mon, 15 Jan 2007, Simon Budig wrote:

> Is it possible that there is a regression in the hid-debug stuff? The
> mapping does not seem to appear in the dmesg-output. I unfortunately
> don't have an earlier kernel available right now to verify, but now the
> output on plugging in the device looks like this:

Hi Simon,

thanks, I queued the LED mapping fix for upstream.

I agree with Vojtech and Marcel that it doesn't make much sense having
the hid-debug as a header file - I will fix it, and apply your patch to
it (after I check why the debug output seems to be broken), you don't
have to resend it, thanks.

-- 
Jiri Kosina