Re: libata extension
Vitaliyi wrote:
> Good day.
>
> Say I want to implement an extended set of ATA commands available to userspace for building diagnostic tools. I need 0x40 (read verify) and 0x32 (write long) with error handling, for example. I tried the ide driver through ioctls, but it seems to lack functionality and is full of gotchas. Furthermore, it oopses sometimes.
>
> Is it possible to use libata for this purpose, or do I need to write a separate IDE driver? By the way, I'm sure it should be done in kernel space, since I'm going to deal with some HDD manufacturer commands.
>
> P.S. I was looking through the libata and ide sources and documentation but still don't have the broad picture.

I believe you should be able to do this by sending ATA pass-through SCSI commands to the device using SG_IO, without any kernel changes. It's really the mechanism that's meant for this.

--
Robert Hancock Saskatoon, SK, Canada
To email, remove nospam from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
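As a rough illustration of the SG_IO route suggested above, the sketch below fills in a 16-byte ATA PASS-THROUGH CDB for the 0x40 (read verify) command from the question. The byte layout follows the SCSI/ATA Translation (SAT) standard as I understand it; build_read_verify_cdb() is a hypothetical helper name, not a kernel or library API, and only 28-bit LBAs are handled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* SCSI opcode for ATA PASS-THROUGH (16), as defined by SAT. */
#define ATA_16            0x85
/* ATA command quoted in the original question (READ VERIFY SECTOR(S)). */
#define ATA_READ_VERIFY   0x40
/* SAT protocol value for commands that transfer no data. */
#define SAT_PROTO_NONDATA 3

/*
 * Fill a 16-byte CDB that asks the translation layer to issue a 28-bit
 * READ VERIFY SECTOR(S). Hypothetical helper, for illustration only.
 */
static void build_read_verify_cdb(uint8_t cdb[16], uint32_t lba, uint8_t count)
{
	assert((lba >> 28) == 0);              /* 28-bit addressing only */

	memset(cdb, 0, 16);
	cdb[0]  = ATA_16;
	cdb[1]  = SAT_PROTO_NONDATA << 1;      /* protocol field, EXTEND = 0 */
	cdb[2]  = 0x20;                        /* CK_COND: ask for ATA status back */
	cdb[6]  = count;                       /* sector count (0 means 256) */
	cdb[8]  = lba & 0xff;                  /* LBA bits 7:0 */
	cdb[10] = (lba >> 8) & 0xff;           /* LBA bits 15:8 */
	cdb[12] = (lba >> 16) & 0xff;          /* LBA bits 23:16 */
	cdb[13] = 0x40 | ((lba >> 24) & 0x0f); /* LBA mode + LBA bits 27:24 */
	cdb[14] = ATA_READ_VERIFY;
}
```

To actually send it, the CDB would be placed in a struct sg_io_hdr (cmdp pointing at the buffer, cmd_len = 16, dxfer_direction = SG_DXFER_NONE) and passed to ioctl(fd, SG_IO, &hdr) on the opened device node; the ATA status and error registers then come back in the sense buffer. Vendor-specific commands would follow the same pattern with a different command byte and protocol.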
Re: [RFC PATCH 1/3] Add ability to keep track of callers of symbol_(get|put)
On Sat, 10 Mar 2007 02:31:35 -0200 Mauro Carvalho Chehab [EMAIL PROTECTED] wrote:

> From: Trent Piepho [EMAIL PROTECTED]
>
> When a module uses symbol_get() to increase the ref count of another module, there is no record of which module called symbol_get(). A module can show up as having other users, but there is no way to tell who those users are.
>
> This adds that ability to symbol_put() and symbol_get().

One day I'll write a script which unwordwraps patches and then you'll all need to find new ways of torturing me.

This patch needed rather a lot of help in the coding-style department. Hopefully Rusty can comment on the content, because I'm all exhausted from cleaning it up.
Re: [2/6] 2.6.21-rc2: known regressions
* Pavel Machek [EMAIL PROTECTED] wrote:

> > Probably tweaking the webpage doesn't help because people don't get there - as the results plainly show. Maybe some more automation would be useful too, a tool that detects failed resume and tries all those options that make sense on that box or something?
>
> It's not like that
>
> Unfortunately, these tend to crash the box when you pass wrong options, and I do not see an easy way to test automatically whether the user can see what's on the display.

you could perhaps try what X's modesetting utility does: display a dialog box that times out if it does not get clicked on, and reboot if it did not get clicked on. Likewise, detect upon the next bootup that a suspend-test was in progress (and didn't get back via normal resume), via some temporary file. That way both the 'did not resume and i had to power-cycle' and the 'resume did not restore my X' problems can be handled.

Finally, when the correct options have been established (worst case with a small number of reboots and "yes, indeed the resume did not work fine" clicks done upon bootup by the user), automatically fill in a webform in firefox and ask the user to do a single click to submit that form.

techniques like that have more chance i think to get Linux suspend/resume anywhere near to working. The current 'rely on the developer' technique apparently does not work.

Ingo
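The "temporary file" idea above could look roughly like the following userspace sketch: a marker is written (and synced) before the suspend attempt, removed after a successful resume, and checked on the next boot. All function names, the marker path and the option string are made up for illustration; nothing here is an existing tool.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

enum resume_result { RESUME_OK_OR_UNTESTED, RESUME_FAILED };

/* Record that a suspend test with the given options is about to start. */
static int suspend_test_begin(const char *marker, const char *options)
{
	FILE *f;

	assert(marker && options);
	f = fopen(marker, "w");
	if (!f)
		return -1;
	fprintf(f, "%s\n", options);
	fflush(f);
	fsync(fileno(f));   /* must hit the disk before the box may hang */
	return fclose(f);
}

/* Called after a successful resume: the test finished normally. */
static int suspend_test_end(const char *marker)
{
	return unlink(marker);
}

/* Called at boot: a leftover marker means the last test never resumed. */
static enum resume_result suspend_test_check(const char *marker,
					     char *options, size_t len)
{
	FILE *f = fopen(marker, "r");

	if (!f)
		return RESUME_OK_OR_UNTESTED;
	if (!fgets(options, (int)len, f))
		options[0] = '\0';
	options[strcspn(options, "\n")] = '\0';   /* drop trailing newline */
	fclose(f);
	return RESUME_FAILED;
}
```

A boot-time script would call suspend_test_check() first; if it reports RESUME_FAILED, the recorded options are known to be bad on this machine and the next candidate set can be tried.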
Re: [RFC][PATCH 6/7] Account for the number of tasks within container
Paul Menage wrote:
> On 3/6/07, Pavel Emelianov [EMAIL PROTECTED] wrote:
>> The idea is:
>>
>> A task may be the entity that allocates the resources and the entity that is a resource allocated.
>>
>> When the task is the first entity it may move across containers (that is implemented in your patches). When the task is a resource it shouldn't move across containers, like files or pages do.
>>
>> More generally - allocated resources hold a reference to the original container till they die. No resource migration is performed.
>>
>> Did I express my idea cleanly?
>
> Yes, but I disagree with the premise. The title of your patch is "Account for the number of tasks within container", but that's not what the subsystem does, it accounts for the number of forks within the container that aren't directly accompanied by an exit.
>
> Ideally, resources like files and pages would be able to follow tasks as well. The reason that files and pages aren't easily migrated from one container to another is that there could be sharing involved; figuring out the sharing can be expensive, and it's not clear what to do if two users are in different containers.
>
> But in the case of a task count, there are no such issues with sharing, so it seems to me to be more sensible (and more efficient) to just limit the number of tasks in a container. i.e. when moving a task into a container or forking a task within a container, increment the count; when moving a task out of a container or when it exits, decrement the count.

Sounds reasonable. I'll take this into account when I make the next iteration. Thanks.

> With your approach, if you were to set the task limit of an empty container A to 1, and then move a process P from B into A, P would be able to fork a new child, since the task count would be 0 (as P was being charged to B still). Surely the fact that there's 1 process in A should prevent P from forking?
>
> Paul
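The counting rule Paul describes (charge on fork or on moving in, uncharge on exit or on moving out) can be sketched in a few lines of plain C. This is only a model of the accounting, assuming a hypothetical struct task_counter and helper names; locking and the real container API are left out.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical per-container task counter, for illustration only. */
struct task_counter {
	unsigned long usage;
	unsigned long limit;
};

/* Charge one task: used for a fork inside the container and for attach. */
static int task_charge(struct task_counter *c)
{
	assert(c);
	if (c->usage >= c->limit)
		return -EAGAIN;
	c->usage++;
	return 0;
}

/* Uncharge one task: used for exit and for detach. */
static void task_uncharge(struct task_counter *c)
{
	assert(c && c->usage > 0);
	c->usage--;
}

/* Moving a task makes the charge follow it to the new container. */
static int task_move(struct task_counter *from, struct task_counter *to)
{
	int ret = task_charge(to);

	if (ret)
		return ret;
	task_uncharge(from);
	return 0;
}
```

With this rule, the scenario from the message behaves as Paul expects: after moving P from B into an empty container A with limit 1, A's usage is 1 and a further task_charge() on A (the fork) fails.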
Re: [RFC][PATCH 1/7] Resource counters
Herbert Poetzl wrote:
> On Wed, Mar 07, 2007 at 10:19:05AM +0300, Pavel Emelianov wrote:
>> Balbir Singh wrote:
>>> Pavel Emelianov wrote:
>>>> Introduce generic structures and routines for resource accounting.
>>>>
>>>> Each resource accounting container is supposed to aggregate it, container_subsystem_state and its resource-specific members within.
>>>>
>>>> diff -upr linux-2.6.20.orig/include/linux/res_counter.h linux-2.6.20-0/include/linux/res_counter.h
>>>> --- linux-2.6.20.orig/include/linux/res_counter.h 2007-03-06 13:39:17.0 +0300
>>>> +++ linux-2.6.20-0/include/linux/res_counter.h 2007-03-06 13:33:28.0 +0300
>>>> @@ -0,0 +1,83 @@
>>>> +#ifndef __RES_COUNTER_H__
>>>> +#define __RES_COUNTER_H__
>>>> +/*
>>>> + * resource counters
>>>> + *
>>>> + * Copyright 2007 OpenVZ SWsoft Inc
>>>> + *
>>>> + * Author: Pavel Emelianov [EMAIL PROTECTED]
>>>> + *
>>>> + */
>>>> +
>>>> +#include <linux/container.h>
>>>> +
>>>> +struct res_counter {
>>>> +	unsigned long usage;
>>>> +	unsigned long limit;
>>>> +	unsigned long failcnt;
>>>> +	spinlock_t lock;
>>>> +};
>>>> +
>>>> +enum {
>>>> +	RES_USAGE,
>>>> +	RES_LIMIT,
>>>> +	RES_FAILCNT,
>>>> +};
>>>> +
>>>> +ssize_t res_counter_read(struct res_counter *cnt, int member,
>>>> +		const char __user *buf, size_t nbytes, loff_t *pos);
>>>> +ssize_t res_counter_write(struct res_counter *cnt, int member,
>>>> +		const char __user *buf, size_t nbytes, loff_t *pos);
>>>> +
>>>> +static inline void res_counter_init(struct res_counter *cnt)
>>>> +{
>>>> +	spin_lock_init(&cnt->lock);
>>>> +	cnt->limit = (unsigned long)LONG_MAX;
>>>> +}
>>>> +
>>>
>>> Is there any way to indicate that there are no limits on this container.
>>
>> Yes - LONG_MAX is essentially a no limit value as no container will ever have such many files :)
>
> -1 or ~0 is a viable choice for userspace to communicate 'infinite' or 'unlimited'

OK, I'll make ULONG_MAX :)

>>> LONG_MAX is quite huge, but still when the administrator wants to configure a container to *un-limited usage*, it becomes hard for the administrator.
>>>
>>>> +static inline int res_counter_charge_locked(struct res_counter *cnt,
>>>> +		unsigned long val)
>>>> +{
>>>> +	if (cnt->usage <= cnt->limit - val) {
>>>> +		cnt->usage += val;
>>>> +		return 0;
>>>> +	}
>>>> +
>>>> +	cnt->failcnt++;
>>>> +	return -ENOMEM;
>>>> +}
>>>> +
>>>> +static inline int res_counter_charge(struct res_counter *cnt,
>>>> +		unsigned long val)
>>>> +{
>>>> +	int ret;
>>>> +	unsigned long flags;
>>>> +
>>>> +	spin_lock_irqsave(&cnt->lock, flags);
>>>> +	ret = res_counter_charge_locked(cnt, val);
>>>> +	spin_unlock_irqrestore(&cnt->lock, flags);
>>>> +	return ret;
>>>> +}
>>>> +
>>>
>>> Will atomic counters help here.
>>
>> I'm afraid no. We have to atomically check for limit and alter one of usage or failcnt depending on the checking result. Making this with atomic_xxx ops will require at least two ops.
>
> Linux-VServer does the accounting with atomic counters, so that works quite fine, just do the checks at the beginning of whatever resource allocation and the accounting once the resource is acquired ...

This works quite fine on non-preempted kernels. From the time you checked for the resource till you really account it, the kernel may preempt and let another process pass through the vx_anything_avail() check.

>> If we'll remove failcnt this would look like
>>
>>	while (atomic_cmpxchg(...))
>>
>> which is also not that good.
>>
>> Moreover - in RSS accounting patches I perform page list manipulations under this lock, so this also saves one atomic op.
>
> it still hasn't been shown that this kind of RSS limit doesn't add big time overhead to normal operations (inside and outside of such a resource container)
>
> note that the 'usual' memory accounting is much more lightweight and serves similar purposes ...

It OOM-kills current in case of a limit hit instead of reclaiming pages or killing the *memory eater* to free memory.

> best,
> Herbert
>
>>>> +static inline void res_counter_uncharge_locked(struct res_counter *cnt,
>>>> +		unsigned long val)
>>>> +{
>>>> +	if (unlikely(cnt->usage < val)) {
>>>> +		WARN_ON(1);
>>>> +		val = cnt->usage;
>>>> +	}
>>>> +
>>>> +	cnt->usage -= val;
>>>> +}
>>>> +
>>>> +static inline void res_counter_uncharge(struct res_counter *cnt,
>>>> +		unsigned long val)
>>>> +{
>>>> +	unsigned long flags;
>>>> +
>>>> +	spin_lock_irqsave(&cnt->lock, flags);
>>>> +	res_counter_uncharge_locked(cnt, val);
>>>> +	spin_unlock_irqrestore(&cnt->lock, flags);
>>>> +}
>>>> +
>>>> +#endif
>>>> diff -upr linux-2.6.20.orig/init/Kconfig linux-2.6.20-0/init/Kconfig
>>>> --- linux-2.6.20.orig/init/Kconfig 2007-03-06 13:33:28.0 +0300
>>>> +++ linux-2.6.20-0/init/Kconfig 2007-03-06 13:33:28.0 +0300
>>>> @@ -265,6 +265,10 @@ config CPUSETS
>>>>  	  Say N if unsure.
>>>>
>>>> +config RESOURCE_COUNTERS
>>>> +	bool
>>>> +	select CONTAINERS
>>>> +
>>>>  config SYSFS_DEPRECATED
>>>>  	bool "Create deprecated sysfs files"
>>>>  	default y
>>>> diff -upr linux-2.6.20.orig/kernel/Makefile linux-2.6.20-0/kernel/Makefile
>>>> --- linux-2.6.20.orig/kernel/Makefile 2007-03-06 13:33:28.0 +0300
>>>> +++ linux-2.6.20-0/kernel/Makefile 2007-03-06 13:33:28.0 +0300
>>>> @@ -51,6
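For readers following the lock-versus-atomics argument in this message, here is a userspace sketch of the compare-and-exchange loop that was mentioned as the alternative to the spinlock. It uses C11 <stdatomic.h> rather than the kernel's atomic_cmpxchg(), and struct atomic_counter and its helpers are hypothetical names. It also illustrates the point made in the reply: the limit check and the usage update are done in one atomic step, but failcnt needs a second, separate atomic operation.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical lock-free counterpart of struct res_counter. */
struct atomic_counter {
	atomic_ulong usage;
	unsigned long limit;
	atomic_ulong failcnt;
};

static void atomic_counter_init(struct atomic_counter *cnt, unsigned long limit)
{
	atomic_init(&cnt->usage, 0);
	atomic_init(&cnt->failcnt, 0);
	cnt->limit = limit;
}

/* Charge val units, or fail with -ENOMEM if that would exceed the limit. */
static int atomic_counter_charge(struct atomic_counter *cnt, unsigned long val)
{
	unsigned long old;

	assert(cnt);
	old = atomic_load(&cnt->usage);
	do {
		if (val > cnt->limit || old > cnt->limit - val) {
			/* second atomic op, not tied to the check above */
			atomic_fetch_add(&cnt->failcnt, 1);
			return -ENOMEM;
		}
		/* on failure 'old' is refreshed with the current usage; retry */
	} while (!atomic_compare_exchange_weak(&cnt->usage, &old, old + val));

	return 0;
}
```

Unlike the "check first, account later" scheme, no other thread can slip in between the limit check and the update, because the exchange only succeeds if usage is still the value that was checked. It does not help when other state (such as the page lists in the RSS patches) has to change under the same protection.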
Re: [RFC][PATCH 2/7] RSS controller core
Herbert Poetzl wrote:
> On Tue, Mar 06, 2007 at 02:00:36PM -0800, Andrew Morton wrote:
>> On Tue, 06 Mar 2007 17:55:29 +0300 Pavel Emelianov [EMAIL PROTECTED] wrote:
>>
>>> +struct rss_container {
>>> +	struct res_counter res;
>>> +	struct list_head page_list;
>>> +	struct container_subsys_state css;
>>> +};
>>> +
>>> +struct page_container {
>>> +	struct page *page;
>>> +	struct rss_container *cnt;
>>> +	struct list_head list;
>>> +};
>>
>> ah. This looks good. I'll find a hunk of time to go through this work and through Paul's patches. It'd be good to get both patchsets lined up in -mm within a couple of weeks. But..
>
> doesn't look so good for me, mainly because of the additional per page data and per page processing
>
> on 4GB memory, with 100 guests, 50% shared for each guest, this basically means ~1mio pages, 500k shared and 1500k x sizeof(page_container) entries, which roughly boils down to ~25MB of wasted memory ...
>
> increase the amount of shared pages and it starts getting worse, but maybe I'm missing something here

You are. Each page has only one page_container associated with it, regardless of the number of containers it is shared between.

>> We need to decide whether we want to do per-container memory limitation via these data structures, or whether we do it via a physical scan of some software zone, possibly based on Mel's patches.
>
> why not do simple page accounting (as done currently in Linux) and use that for the limits, without keeping the reference from container to page?

As I've already answered in my previous letter, simple limiting without per-container reclamation and a per-container OOM killer isn't good memory management. It doesn't allow handling a resource shortage gracefully.

This patchset provides a more graceful way to handle this, but full memory management includes accounting of VMA length as well (returning ENOMEM from a system call), but we've decided to start with RSS.

> best,
> Herbert
[BIG] Re: sched rsdl fix for 0.28
On Sunday 11 March 2007 at 11:07 +1100, Con Kolivas wrote:
> sched rsdl fix

Doesn't change a thing. Always breaks at the same place (though, depending on hardware timings?, the trace is not always the same). Pretty sure nothing happens before this failure.

--
Nicolas Mailhot
Re: [BIG] Re: sched rsdl fix for 0.28
On Sunday 11 March 2007 20:10, Nicolas Mailhot wrote:
> On Sunday 11 March 2007 at 11:07 +1100, Con Kolivas wrote:
> > sched rsdl fix
>
> Doesn't change a thing. Always breaks at the same place (though, depending on hardware timings?, the trace is not always the same). Pretty sure nothing happens before this failure.

Bummer. The only other thing to try is v0.29 posted recently. I still haven't got a good way to reproduce this locally but I'll keep trying. Thanks for testing.

--
-ck
Re: [BIG] Re: sched rsdl fix for 0.28
On Sunday 11 March 2007 20:21, Con Kolivas wrote:
> On Sunday 11 March 2007 20:10, Nicolas Mailhot wrote:
> > On Sunday 11 March 2007 at 11:07 +1100, Con Kolivas wrote:
> > > sched rsdl fix
> >
> > Doesn't change a thing. Always breaks at the same place (though, depending on hardware timings?, the trace is not always the same). Pretty sure nothing happens before this failure.
>
> Bummer. The only other thing to try is v0.29 posted recently. I still haven't got a good way to reproduce this locally but I'll keep trying. Thanks for testing.

Oh and if that oopses and you still have the time, could you please test 0.29 on 2.6.20.2 (available from the same directory).

--
-ck
Re: PROBLEM: Make nenuconfig does not save parameters.
[Sam Ravnborg - Sat, Mar 10, 2007 at 11:45:34PM +0100]
| On Sat, Mar 10, 2007 at 10:34:41PM +0100, Jan Engelhardt wrote:
| >
| > On Mar 10 2007 22:27, Sam Ravnborg wrote:
| > >On Sat, Mar 10, 2007 at 07:23:41PM +0100, Jan Engelhardt wrote:
| > >>
| > >> Whether the 'working config file path' should change when you do
| > >> 'Save as Alternate' or not, is a menuconfig axiom. Ask Sam Ravnborg
| > >> if you want it changed :-)
| > >
| > >Current behaviour is not logical but on the other hand I do not
| > >see a big need to make it so.
| > >It seems that people very seldom use save alternate anyway.
| > >
| > >But patches are welcome.
| >
| > ^_^ The patch has already been posted, has it not?
| No.
| Either we keep current behaviour or we change to the normal
| behaviour with a "Save as..." as known from all other programs.
|
| Sam
|

Hi Sam,

here is a patch for menuconfig that shows the current configuration file. I think menuconfig does its work well; the only thing we need is to show the location of the _active_ configuration.

Any comments are welcome (and you may swear at me too :)

Cyrill

diff --git a/scripts/kconfig/mconf.c b/scripts/kconfig/mconf.c
index 3f9a132..cde6792 100644
--- a/scripts/kconfig/mconf.c
+++ b/scripts/kconfig/mconf.c
@@ -602,6 +602,12 @@ static void conf(struct menu *menu)
 			item_set_tag('L');
 			item_make(_("Save an Alternate Configuration File"));
 			item_set_tag('S');
+			item_make("--- ");
+			item_set_tag(':');
+			item_make(_("Current Configuration File: "));
+			item_set_tag(':');
+			item_add_str("%s", filename);
+
 		}
 		dialog_clear();
 		res = dialog_menu(prompt ? prompt : _("Main Menu"),
@@ -816,8 +822,11 @@ static void conf_load(void)
 		case 0:
 			if (!dialog_input_result[0])
 				return;
-			if (!conf_read(dialog_input_result))
+			if (!conf_read(dialog_input_result)) {
+				memset(filename, 0x0, PATH_MAX+1);
+				strncpy(filename, dialog_input_result, PATH_MAX);
 				return;
+			}
 			show_textbox(NULL, _("File does not exist!"), 5, 38);
 			break;
 		case 1:
@@ -840,8 +849,11 @@ static void conf_save(void)
 		case 0:
 			if (!dialog_input_result[0])
 				return;
-			if (!conf_write(dialog_input_result))
+			if (!conf_write(dialog_input_result)) {
+				memset(filename, 0x0, PATH_MAX+1);
+				strncpy(filename, dialog_input_result, PATH_MAX);
 				return;
+			}
 			show_textbox(NULL, _("Can't create file! Probably a nonexistent directory."), 5, 60);
 			break;
 		case 1:
@@ -903,7 +915,7 @@ int main(int ac, char **av)
 	switch (res) {
 	case 0:
-		if (conf_write(NULL)) {
+		if (conf_write(filename)) {
 			fprintf(stderr, _("\n\n"
 				"Error during writing of the kernel configuration.\n"
 				"Your kernel configuration changes were NOT saved."
Re: Use of absolute timeouts for oneshot timers
On Sat, 2007-03-10 at 16:42 -0800, Jeremy Fitzhardinge wrote:
> Thomas Gleixner wrote:
> > It's simply enforced in NO_HZ, HIGHRES mode as we operate in absolute time, which is read back from the clocksource, even if we use a relative value for real hardware clock event devices to program the next event. We calculate the delta between the absolute event and now. So we never get an accumulating error.
> >
> > What problem are you observing ?
>
> Actually, two things. There were the unexpected pauses during boot, which is trivially fixable by not using the Xen periodic timer, and using the single-shot fallback.
>
> But I'm making the more general observation that if you use an absolute rather than relative time to set the single-shot timeout, then you have to deal with a long-term cumulative drift between the kernel's monotonic time and the hypervisor's monotonic time. This can happen even if your clocksource is derived directly from the hypervisor monotonic time, because running ntp will warp the kernel's time, and so it will drift with respect to the hypervisor clock. You can only avoid this by 1) not allowing adjtime, or 2) making those same adjtime warps to the hypervisor time. Neither of these is a good general solution.

Sigh, yes. Using a relative time for the next event is probably the least ugly solution.

	tglx
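The conclusion of this exchange, programming the device with "absolute event minus now" instead of handing the absolute value to a second clock, can be shown with a small calculation. The helper below is a hypothetical sketch, not the kernel's clockevents code; the min_delta_ns clamp is an assumption modelled on the usual "do not program an event in the past" guard.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert an absolute expiry (kernel monotonic time, ns) into the relative
 * value handed to a one-shot timer. Both timestamps come from the same
 * clock, so an offset or drift against the hypervisor clock cancels out.
 */
static uint64_t oneshot_delta_ns(int64_t expires_ns, int64_t now_ns,
				 uint64_t min_delta_ns)
{
	int64_t delta = expires_ns - now_ns;

	assert(min_delta_ns <= INT64_MAX);
	if (delta < (int64_t)min_delta_ns)
		return min_delta_ns;   /* already due, or too close: fire ASAP */
	return (uint64_t)delta;
}
```

If the hypervisor only accepts absolute deadlines, the relative value can be added to the hypervisor's own current time at the moment of programming, so any error is limited to one timer period rather than accumulating with the drift between the two clocks.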
Re: [RFC][PATCH 5/7] Per-container OOM killer and page reclamation
Balbir Singh wrote:
> Hi, Pavel,
>
> Please find my patch to add LRU behaviour to your latest RSS controller.

Thanks for participation and additional testing :) I'll include this into the next generation of patches.

> Balbir Singh
> Linux Technology Center
> IBM, ISTL
>
> Add LRU behaviour to the RSS controller patches posted by Pavel Emelianov
>
> http://lkml.org/lkml/2007/3/6/198
>
> which was in turn similar to the RSS controller posted by me
>
> http://lkml.org/lkml/2007/2/26/8
>
> Pavel's patches have a per container list of pages, which helps reduce reclaim time of the RSS controller, but the per container list of pages is in FIFO order. I've implemented active and inactive lists per container to help select the right set of pages to reclaim when the container is under memory pressure.
>
> I've tested these patches on a ppc64 machine and they work fine for the minimal testing I've done.
>
> Pavel, would you please include these patches in your next iteration?
>
> Comments, suggestions and further improvements are as always welcome!
>
> Signed-off-by: [EMAIL PROTECTED]
> ---
>
>  include/linux/rss_container.h |    1
>  mm/rss_container.c            |   47 +++---
>  mm/swap.c                     |    5
>  mm/vmscan.c                   |    3 ++
>  4 files changed, 44 insertions(+), 12 deletions(-)
>
> diff -puN include/linux/rss_container.h~rss-container-lru2 include/linux/rss_container.h
> --- linux-2.6.20/include/linux/rss_container.h~rss-container-lru2 2007-03-09 22:52:56.0 +0530
> +++ linux-2.6.20-balbir/include/linux/rss_container.h 2007-03-10 00:39:59.0 +0530
> @@ -19,6 +19,7 @@ int container_rss_prepare(struct page *,
>  void container_rss_add(struct page_container *);
>  void container_rss_del(struct page_container *);
>  void container_rss_release(struct page_container *);
> +void container_rss_move_lists(struct page *pg, bool active);
>
>  int mm_init_container(struct mm_struct *mm, struct task_struct *tsk);
>  void mm_free_container(struct mm_struct *mm);
> diff -puN mm/rss_container.c~rss-container-lru2 mm/rss_container.c
> --- linux-2.6.20/mm/rss_container.c~rss-container-lru2 2007-03-09 22:52:56.0 +0530
> +++ linux-2.6.20-balbir/mm/rss_container.c 2007-03-10 02:42:54.0 +0530
> @@ -17,7 +17,8 @@ static struct container_subsys rss_subsy
>
>  struct rss_container {
>  	struct res_counter res;
> -	struct list_head page_list;
> +	struct list_head inactive_list;
> +	struct list_head active_list;
>  	struct container_subsys_state css;
>  };
>
> @@ -96,6 +97,26 @@ void container_rss_release(struct page_c
>  	kfree(pc);
>  }
>
> +void container_rss_move_lists(struct page *pg, bool active)
> +{
> +	struct rss_container *rss;
> +	struct page_container *pc;
> +
> +	if (!page_mapped(pg))
> +		return;
> +
> +	pc = page_container(pg);
> +	BUG_ON(!pc);
> +	rss = pc->cnt;
> +
> +	spin_lock_irq(&rss->res.lock);
> +	if (active)
> +		list_move(&pc->list, &rss->active_list);
> +	else
> +		list_move(&pc->list, &rss->inactive_list);
> +	spin_unlock_irq(&rss->res.lock);
> +}
> +
>  void container_rss_add(struct page_container *pc)
>  {
>  	struct page *pg;
> @@ -105,7 +126,7 @@ void container_rss_add(struct page_conta
>  	rss = pc->cnt;
>
>  	spin_lock(&rss->res.lock);
> -	list_add(&pc->list, &rss->page_list);
> +	list_add(&pc->list, &rss->active_list);
>  	spin_unlock(&rss->res.lock);
>
>  	page_container(pg) = pc;
> @@ -141,7 +162,10 @@ unsigned long container_isolate_pages(un
>  	struct zone *z;
>
>  	spin_lock_irq(&rss->res.lock);
> -	src = &rss->page_list;
> +	if (active)
> +		src = &rss->active_list;
> +	else
> +		src = &rss->inactive_list;
>
>  	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
>  		pc = list_entry(src->prev, struct page_container, list);
> @@ -152,13 +176,10 @@ unsigned long container_isolate_pages(un
>
>  		spin_lock(&z->lru_lock);
>  		if (PageLRU(page)) {
> -			if ((active && PageActive(page)) ||
> -			    (!active && !PageActive(page))) {
> -				if (likely(get_page_unless_zero(page))) {
> -					ClearPageLRU(page);
> -					nr_taken++;
> -					list_move(&page->lru, dst);
> -				}
> +			if (likely(get_page_unless_zero(page))) {
> +				ClearPageLRU(page);
> +				nr_taken++;
> +				list_move(&page->lru, dst);
>  			}
>  		}
>  		spin_unlock(&z->lru_lock);
> @@ -212,7 +233,8 @@ static int rss_create(struct container_s
>  		return -ENOMEM;
>
>  	res_counter_init(&rss->res);
> -
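To make the active/inactive split in the patch above easier to picture, here is a self-contained userspace model of it. The list helpers imitate the kernel's list_head but are written from scratch, and every name (toy_container, toy_page, toy_move_lists, toy_pick_victim) is hypothetical; locking and the interaction with the zone LRU are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_node { struct list_node *prev, *next; };

static void list_init(struct list_node *h) { h->prev = h->next = h; }
static bool list_is_empty(const struct list_node *h) { return h->next == h; }

static void list_del_node(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

static void list_add_head(struct list_node *n, struct list_node *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

/* Hypothetical stand-ins for rss_container and page_container. */
struct toy_container { struct list_node active, inactive; };
struct toy_page { struct list_node link; int id; };

/* Same idea as container_rss_move_lists(): requeue at the list head. */
static void toy_move_lists(struct toy_container *c, struct toy_page *pg,
			   bool active)
{
	assert(c && pg);
	list_del_node(&pg->link);
	list_add_head(&pg->link, active ? &c->active : &c->inactive);
}

/* Reclaim scans from the tail, i.e. the entry that was moved longest ago. */
static struct toy_page *toy_pick_victim(struct toy_container *c)
{
	if (list_is_empty(&c->inactive))
		return NULL;
	return (struct toy_page *)((char *)c->inactive.prev -
				   offsetof(struct toy_page, link));
}
```

New pages start on the active list, as in the patch; only pages that have been moved to the inactive list become reclaim candidates, oldest first.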
Re: [PATCH] [scsi]: Add offline state checking while dispatch a scsi cmd
On Fri, 9 Mar 2007 09:40:40 +0800 Joe Jin [EMAIL PROTECTED] wrote:

> > What's the error you're trying to fix? scsi_dispatch_cmd() is only called from scsi_request_fn() which already has an equivalent of this check in it just prior to calling dispatch.
>
> Yeah, I have seen the checking in scsi_request_fn(); recently we got crash info as follows on rhel4 2.6.9-42.0.2.ELsmp,

The 2.6.9 base is very old in mainline terms. Are you sure the bug hasn't been fixed in mainline by other means?
[RFC][PATCH 0/3] swsusp: Stop using page flags
Hi,

The following three patches make swsusp use its own data structures for memory management instead of special page flags. Thus the page flags used so far by swsusp (PG_nosave, PG_nosave_free) can be used for other purposes, and I believe there is some urgent need for them. :-)

Last week I sent these patches to the linux-pm and linux-mm lists and there were no negative comments. Also, I've been testing them on my x86_64 boxes for a few days and apparently they don't break anything. I think they can go into -mm for testing.

Comments are welcome.

Greetings,
Rafael

--
If you don't have the time to read, you don't have the time or the tools to write.
- Stephen King
[RFC][PATCH 1/3] swsusp: Use inline functions for changing page flags
From: Rafael J. Wysocki [EMAIL PROTECTED] Replace direct invocations of SetPageNosave(), SetPageNosaveFree() etc. with calls to inline functions that can be changed in subsequent patches without modifying the code calling them. Signed-off-by: Rafael J. Wysocki [EMAIL PROTECTED] --- include/linux/suspend.h | 33 + kernel/power/snapshot.c | 48 +--- mm/page_alloc.c |6 +++--- 3 files changed, 61 insertions(+), 26 deletions(-) Index: linux-2.6.21-rc2/include/linux/suspend.h === --- linux-2.6.21-rc2.orig/include/linux/suspend.h 2007-03-02 09:05:53.0 +0100 +++ linux-2.6.21-rc2/include/linux/suspend.h2007-03-02 09:24:02.0 +0100 @@ -8,6 +8,7 @@ #include linux/notifier.h #include linux/init.h #include linux/pm.h +#include linux/mm.h /* struct pbe is used for creating lists of pages that should be restored * atomically during the resume from disk, because the page frames they have @@ -49,6 +50,38 @@ void __save_processor_state(struct saved void __restore_processor_state(struct saved_context *ctxt); unsigned long get_safe_page(gfp_t gfp_mask); +/* Page management functions for the software suspend (swsusp) */ + +static inline void swsusp_set_page_forbidden(struct page *page) +{ + SetPageNosave(page); +} + +static inline int swsusp_page_is_forbidden(struct page *page) +{ + return PageNosave(page); +} + +static inline void swsusp_unset_page_forbidden(struct page *page) +{ + ClearPageNosave(page); +} + +static inline void swsusp_set_page_free(struct page *page) +{ + SetPageNosaveFree(page); +} + +static inline int swsusp_page_is_free(struct page *page) +{ + return PageNosaveFree(page); +} + +static inline void swsusp_unset_page_free(struct page *page) +{ + ClearPageNosaveFree(page); +} + /* * XXX: We try to keep some more pages free so that I/O operations succeed * without paging. Might this be more? 
Index: linux-2.6.21-rc2/kernel/power/snapshot.c === --- linux-2.6.21-rc2.orig/kernel/power/snapshot.c 2007-03-02 09:05:53.0 +0100 +++ linux-2.6.21-rc2/kernel/power/snapshot.c2007-03-02 09:27:06.0 +0100 @@ -67,15 +67,15 @@ static void *get_image_page(gfp_t gfp_ma res = (void *)get_zeroed_page(gfp_mask); if (safe_needed) - while (res PageNosaveFree(virt_to_page(res))) { + while (res swsusp_page_is_free(virt_to_page(res))) { /* The page is unsafe, mark it for swsusp_free() */ - SetPageNosave(virt_to_page(res)); + swsusp_set_page_forbidden(virt_to_page(res)); allocated_unsafe_pages++; res = (void *)get_zeroed_page(gfp_mask); } if (res) { - SetPageNosave(virt_to_page(res)); - SetPageNosaveFree(virt_to_page(res)); + swsusp_set_page_forbidden(virt_to_page(res)); + swsusp_set_page_free(virt_to_page(res)); } return res; } @@ -91,8 +91,8 @@ static struct page *alloc_image_page(gfp page = alloc_page(gfp_mask); if (page) { - SetPageNosave(page); - SetPageNosaveFree(page); + swsusp_set_page_forbidden(page); + swsusp_set_page_free(page); } return page; } @@ -110,9 +110,9 @@ static inline void free_image_page(void page = virt_to_page(addr); - ClearPageNosave(page); + swsusp_unset_page_forbidden(page); if (clear_nosave_free) - ClearPageNosaveFree(page); + swsusp_unset_page_free(page); __free_page(page); } @@ -615,7 +615,8 @@ static struct page *saveable_highmem_pag BUG_ON(!PageHighMem(page)); - if (PageNosave(page) || PageReserved(page) || PageNosaveFree(page)) + if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page) || + PageReserved(page)) return NULL; return page; @@ -681,7 +682,7 @@ static struct page *saveable_page(unsign BUG_ON(PageHighMem(page)); - if (PageNosave(page) || PageNosaveFree(page)) + if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page)) return NULL; if (PageReserved(page) pfn_is_nosave(pfn)) @@ -821,9 +822,10 @@ void swsusp_free(void) if (pfn_valid(pfn)) { struct page *page = pfn_to_page(pfn); - if (PageNosave(page) PageNosaveFree(page)) { - 
ClearPageNosave(page); - ClearPageNosaveFree(page); + if (swsusp_page_is_forbidden(page) +
[RFC][PATCH 3/3] mm: Remove unused page flags
From: Rafael J. Wysocki [EMAIL PROTECTED]

Remove the two page flags that were previously used by swsusp and are no longer needed.

Signed-off-by: Rafael J. Wysocki [EMAIL PROTECTED]
---
 include/linux/page-flags.h |   12 ------------
 1 file changed, 12 deletions(-)

Index: linux-2.6.21-rc3/include/linux/page-flags.h
===================================================================
--- linux-2.6.21-rc3.orig/include/linux/page-flags.h
+++ linux-2.6.21-rc3/include/linux/page-flags.h
@@ -82,13 +82,11 @@
 #define PG_private		11	/* If pagecache, has fs-private data */

 #define PG_writeback		12	/* Page is under writeback */
-#define PG_nosave		13	/* Used for system suspend/resume */
 #define PG_compound		14	/* Part of a compound page */
 #define PG_swapcache		15	/* Swap page: swp_entry_t in private */

 #define PG_mappedtodisk		16	/* Has blocks allocated on-disk */
 #define PG_reclaim		17	/* To be reclaimed asap */
-#define PG_nosave_free		18	/* Used for system suspend/resume */
 #define PG_buddy		19	/* Page is free, on buddy lists */

 /* PG_owner_priv_1 users should have descriptive aliases */
@@ -214,16 +212,6 @@ static inline void SetPageUptodate(struc
 		ret;							\
 	})

-#define PageNosave(page)		test_bit(PG_nosave, &(page)->flags)
-#define SetPageNosave(page)		set_bit(PG_nosave, &(page)->flags)
-#define TestSetPageNosave(page)		test_and_set_bit(PG_nosave, &(page)->flags)
-#define ClearPageNosave(page)		clear_bit(PG_nosave, &(page)->flags)
-#define TestClearPageNosave(page)	test_and_clear_bit(PG_nosave, &(page)->flags)
-
-#define PageNosaveFree(page)		test_bit(PG_nosave_free, &(page)->flags)
-#define SetPageNosaveFree(page)		set_bit(PG_nosave_free, &(page)->flags)
-#define ClearPageNosaveFree(page)	clear_bit(PG_nosave_free, &(page)->flags)
-
 #define PageBuddy(page)		test_bit(PG_buddy, &(page)->flags)
 #define __SetPageBuddy(page)	__set_bit(PG_buddy, &(page)->flags)
 #define __ClearPageBuddy(page)	__clear_bit(PG_buddy, &(page)->flags)
[RFC][PATCH 2/3] swsusp: Do not use page flags
From: Rafael J. Wysocki [EMAIL PROTECTED] Make swsusp use memory bitmaps instead of page flags for marking 'nosave' and free pages. This allows us to 'recycle' two page flags that can be used for other purposes. Also, the memory needed to store the bitmaps is allocated when necessary (ie. before the suspend) and freed after the resume which is more reasonable. The patch is designed to minimize the amount of changes and there are some nice simplifications and optimizations possible on top of it. I am going to implement them separately in the future. Signed-off-by: Rafael J. Wysocki [EMAIL PROTECTED] --- arch/x86_64/kernel/e820.c | 26 +--- include/linux/suspend.h | 58 +++--- kernel/power/disk.c | 23 +++- kernel/power/power.h |2 kernel/power/snapshot.c | 250 +++--- kernel/power/user.c |4 6 files changed, 281 insertions(+), 82 deletions(-) Index: linux-2.6.21-rc3/include/linux/suspend.h === --- linux-2.6.21-rc3.orig/include/linux/suspend.h +++ linux-2.6.21-rc3/include/linux/suspend.h @@ -24,63 +24,41 @@ struct pbe { extern void drain_local_pages(void); extern void mark_free_pages(struct zone *zone); -#ifdef CONFIG_PM -/* kernel/power/swsusp.c */ -extern int software_suspend(void); - -#if defined(CONFIG_VT) defined(CONFIG_VT_CONSOLE) +#if defined(CONFIG_PM) defined(CONFIG_VT) defined(CONFIG_VT_CONSOLE) extern int pm_prepare_console(void); extern void pm_restore_console(void); #else static inline int pm_prepare_console(void) { return 0; } static inline void pm_restore_console(void) {} -#endif /* defined(CONFIG_VT) defined(CONFIG_VT_CONSOLE) */ +#endif + +#if defined(CONFIG_PM) defined(CONFIG_SOFTWARE_SUSPEND) +/* kernel/power/swsusp.c */ +extern int software_suspend(void); +/* kernel/power/snapshot.c */ +extern void __init register_nosave_region(unsigned long, unsigned long); +extern int swsusp_page_is_forbidden(struct page *); +extern void swsusp_set_page_free(struct page *); +extern void swsusp_unset_page_free(struct page *); +extern unsigned long get_safe_page(gfp_t 
gfp_mask); #else static inline int software_suspend(void) { printk(Warning: fake suspend called\n); return -ENOSYS; } -#endif /* CONFIG_PM */ + +static inline void register_nosave_region(unsigned long b, unsigned long e) {} +static inline int swsusp_page_is_forbidden(struct page *p) { return 0; } +static inline void swsusp_set_page_free(struct page *p) {} +static inline void swsusp_unset_page_free(struct page *p) {} +#endif /* defined(CONFIG_PM) defined(CONFIG_SOFTWARE_SUSPEND) */ void save_processor_state(void); void restore_processor_state(void); struct saved_context; void __save_processor_state(struct saved_context *ctxt); void __restore_processor_state(struct saved_context *ctxt); -unsigned long get_safe_page(gfp_t gfp_mask); - -/* Page management functions for the software suspend (swsusp) */ - -static inline void swsusp_set_page_forbidden(struct page *page) -{ - SetPageNosave(page); -} - -static inline int swsusp_page_is_forbidden(struct page *page) -{ - return PageNosave(page); -} - -static inline void swsusp_unset_page_forbidden(struct page *page) -{ - ClearPageNosave(page); -} - -static inline void swsusp_set_page_free(struct page *page) -{ - SetPageNosaveFree(page); -} - -static inline int swsusp_page_is_free(struct page *page) -{ - return PageNosaveFree(page); -} - -static inline void swsusp_unset_page_free(struct page *page) -{ - ClearPageNosaveFree(page); -} /* * XXX: We try to keep some more pages free so that I/O operations succeed Index: linux-2.6.21-rc3/kernel/power/snapshot.c === --- linux-2.6.21-rc3.orig/kernel/power/snapshot.c +++ linux-2.6.21-rc3/kernel/power/snapshot.c @@ -21,6 +21,7 @@ #include linux/kernel.h #include linux/pm.h #include linux/device.h +#include linux/init.h #include linux/bootmem.h #include linux/syscalls.h #include linux/console.h @@ -34,6 +35,10 @@ #include power.h +static int swsusp_page_is_free(struct page *); +static void swsusp_set_page_forbidden(struct page *); +static void swsusp_unset_page_forbidden(struct page *); 
+ /* List of PBEs needed for restoring the pages that were allocated before * the suspend and included in the suspend image, but have also been * allocated by the resume kernel, so their contents cannot be written @@ -224,11 +229,6 @@ static void chain_free(struct chain_allo * of type unsigned long each). It also contains the pfns that * correspond to the start and end of the represented memory area and * the number of bit chunks in the block. - * - * NOTE: Memory bitmaps are used for two types of operations only: - * set a bit and find the next bit set. Moreover, the searching - * is always carried out after all of the set a bit
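The idea described in the changelog above, marking 'nosave' and free pages in separately allocated bitmaps instead of in per-page flags, can be illustrated in user space. What follows is only a minimal sketch under stated assumptions: it is not the memory bitmap code from kernel/power/snapshot.c, and struct pfn_bitmap together with the pfn_bitmap_*() helpers are hypothetical names invented for this illustration. It covers a single contiguous pfn range and allocates its storage only when it is needed, which mirrors the "allocated before the suspend and freed after the resume" point.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical bitmap covering the pfn range [start_pfn, end_pfn). */
struct pfn_bitmap {
	unsigned long start_pfn;
	unsigned long end_pfn;		/* exclusive */
	unsigned long *words;
};

/* Allocate a zeroed bitmap for the range; returns 0 on success, -1 on failure. */
int pfn_bitmap_init(struct pfn_bitmap *bm, unsigned long start, unsigned long end)
{
	size_t nwords;

	assert(end > start);
	nwords = (end - start + BITS_PER_WORD - 1) / BITS_PER_WORD;
	bm->words = calloc(nwords, sizeof(unsigned long));
	if (!bm->words)
		return -1;
	bm->start_pfn = start;
	bm->end_pfn = end;
	return 0;
}

/* Release the storage once the marks are no longer needed. */
void pfn_bitmap_free(struct pfn_bitmap *bm)
{
	free(bm->words);
	bm->words = NULL;
}

/* Mark a page frame (the analogue of setting a page flag). */
void pfn_bitmap_set(struct pfn_bitmap *bm, unsigned long pfn)
{
	unsigned long idx;

	assert(pfn >= bm->start_pfn && pfn < bm->end_pfn);
	idx = pfn - bm->start_pfn;
	bm->words[idx / BITS_PER_WORD] |= 1UL << (idx % BITS_PER_WORD);
}

/* Remove the mark from a page frame. */
void pfn_bitmap_clear(struct pfn_bitmap *bm, unsigned long pfn)
{
	unsigned long idx;

	assert(pfn >= bm->start_pfn && pfn < bm->end_pfn);
	idx = pfn - bm->start_pfn;
	bm->words[idx / BITS_PER_WORD] &= ~(1UL << (idx % BITS_PER_WORD));
}

/* Return 1 if the page frame is marked, 0 otherwise. */
int pfn_bitmap_test(const struct pfn_bitmap *bm, unsigned long pfn)
{
	unsigned long idx;

	assert(pfn >= bm->start_pfn && pfn < bm->end_pfn);
	idx = pfn - bm->start_pfn;
	return (int)((bm->words[idx / BITS_PER_WORD] >> (idx % BITS_PER_WORD)) & 1UL);
}
```

One bit per page frame costs far less than a word in every struct page, and the bits only exist while a suspend/resume cycle is in progress, which is what lets the two page flags be reused.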
[PATCH] drivers/isdn/hardware/eicon/: remove unused header files
Hi all, as pointed out by Robert P. J. Day, here is a patch to remove unused header files from Eicon/Dialogic ISDN driver. Signed-off-by: Armin Schindler [EMAIL PROTECTED] --- diff -Nur linux-2.6.20.1.orig/drivers/isdn/hardware/eicon/dbgioctl.h linux-2.6.20.1/drivers/isdn/hardware/eicon/dbgioctl.h --- linux-2.6.20.1.orig/drivers/isdn/hardware/eicon/dbgioctl.h 2007-03-10 11:21:15.0 +0100 +++ linux-2.6.20.1/drivers/isdn/hardware/eicon/dbgioctl.h 1970-01-01 01:00:00.0 +0100 @@ -1,198 +0,0 @@ - -/* - * - Copyright (c) Eicon Technology Corporation, 2000. - * - This source file is supplied for the use with Eicon - Technology Corporation's range of DIVA Server Adapters. - * - This program is free software; you can redistribute it and/or modify - it under the terms of the GNU General Public License as published by - the Free Software Foundation; either version 2, or (at your option) - any later version. - * - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY OF ANY KIND WHATSOEVER INCLUDING ANY - implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. - See the GNU General Public License for more details. - * - You should have received a copy of the GNU General Public License - along with this program; if not, write to the Free Software - Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. - * - */ -/*--*/ -/* file: dbgioctl.h */ -/*--*/ - -#if !defined(__DBGIOCTL_H__) - -#define __DBGIOCTL_H__ - -#ifdef NOT_YET_NEEDED -/* - * The requested operation is passed in arg0 of DbgIoctlArgs, - * additional arguments (if any) in arg1, arg2 and arg3. 
- */ - -typedef struct -{ ULONG arg0 ; - ULONG arg1 ; - ULONG arg2 ; - ULONG arg3 ; -} DbgIoctlArgs ; - -#defineDBG_COPY_LOGS 0 /* copy debugs to user until buffer full*/ - /* arg1: size threshold */ - /* arg2: timeout in milliseconds*/ - -#define DBG_FLUSH_LOGS 1 /* flush pending debugs to user buffer */ - /* arg1: internal driver id */ - -#define DBG_LIST_DRVS 2 /* return the list of registered drivers */ - -#defineDBG_GET_MASK3 /* get current debug mask of driver */ - /* arg1: internal driver id */ - -#defineDBG_SET_MASK4 /* set/change debug mask of driver */ - /* arg1: internal driver id */ - /* arg2: new debug mask */ - -#defineDBG_GET_BUFSIZE 5 /* get current buffer size of driver */ - /* arg1: internal driver id */ - /* arg2: new debug mask */ - -#defineDBG_SET_BUFSIZE 6 /* set new buffer size of driver */ - /* arg1: new buffer size*/ - -/* - * common internal debug message structure - */ - -typedef struct -{ unsigned short id ; /* virtual driver id */ - unsigned short type ; /* special message type */ - unsigned long seq ;/* sequence number of message */ - unsigned long size ; /* size of message in bytes */ - unsigned long next ; /* offset to next buffered message*/ - LARGE_INTEGER NTtime ; /* 100 ns since 1.1.1601 */ - unsigned char data[4] ;/* message data */ -} OldDbgMessage ; - -typedef struct -{ LARGE_INTEGER NTtime ; /* 100 ns since 1.1.1601 */ - unsigned short size ; /* size of message in bytes */ - unsigned short ; /* always 0x to indicate new msg */ - unsigned short id ; /* virtual driver id */ - unsigned short type ; /* special message type */ - unsigned long seq ;/* sequence number of message */ - unsigned char data[4] ;/* message data */ -} DbgMessage ; - -#endif - -#define DRV_ID_UNKNOWN 0x0C/* for messages via
Re: [RFC][PATCH 0/3] swsusp: Stop using page flags
On Sun, 2007-03-11 at 11:17 +0100, Rafael J. Wysocki wrote:
> Hi,
>
> The following three patches make swsusp use its own data structures for
> memory management instead of special page flags. Thus the page flags used
> so far by swsusp (PG_nosave, PG_nosave_free) can be used for other purposes
> and I believe there are some urgent needs of them. :-)
>
> Last week I sent these patches to the linux-pm and linux-mm lists and there
> were no negative comments. Also I've been testing them on my x86_64 boxes
> for a few days and apparently they don't break anything. I think they can
> go into -mm for testing.
>
> Comments are welcome.

These patches have my blessing, they look good to me, but I'm not much involved with the swsusp code, so I won't ACK them.

Again, thanks a bunch for freeing up 2 page flags :-)

Peter
Re: SATA resume slowness, e1000 MSI warning
Michael S. Tsirkin [EMAIL PROTECTED] writes:

>> The only case I can see which might trigger this is if we saved pci-X state
>> and then didn't restore it because we could not find the capability on
>> restore.
>
> Hmm. pci_save_pcix_state/pci_restore_pcix_state seem to only handle
> regular devices and seem to ignore the fact that for bridge PCI-X
> capability has a different structure. Is this intentional?

Probably not as such. I don't think we have any drivers for bridge devices so I don't think it matters. It likely wouldn't hurt to fix it just in case though.

Do any of the mellanox cards do anything with the bridge on the card?

> If not, here's a patch to fix this.
> Warning: completely untested.

If you fix the offsets and diff this against my last fix (to never free the buffer) I think your patch makes sense.

> PCI: restore bridge PCI-X capability registers after PM event
>
> Restore PCI-X bridge up/downstream capability registers after PM event.
> This includes maximum split transaction commitment limit which might be
> vital for PCI X.
>
> Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED]
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index df49530..4b788ef 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -597,14 +597,19 @@ static int pci_save_pcix_state(struct pci_dev *dev)
>  	if (pos <= 0)
>  		return 0;
>
> -	save_state = kzalloc(sizeof(*save_state) + sizeof(u16), GFP_KERNEL);
> +	save_state = kzalloc(sizeof(*save_state) + sizeof(u16) * 2, GFP_KERNEL);
>  	if (!save_state) {
> -		dev_err(&dev->dev, "Out of memory in pci_save_pcie_state\n");
> +		dev_err(&dev->dev, "Out of memory in pci_save_pcix_state\n");
>  		return -ENOMEM;
>  	}
>  	cap = (u16 *)&save_state->data[0];
>
> -	pci_read_config_word(dev, pos + PCI_X_CMD, &cap[i++]);
> +	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {

This appears to be the proper test.

> +		pci_read_config_word(dev, pos + PCI_X_BRIDGE_UP_SPL_CTL, &cap[i++]);
> +		pci_read_config_word(dev, pos + PCI_X_BRIDGE_DN_SPL_CTL, &cap[i++]);
> +	} else
> +		pci_read_config_word(dev, pos + PCI_X_CMD, &cap[i++]);
> +
>  	pci_add_saved_cap(dev, save_state);
>  	return 0;
>  }
> @@ -621,7 +626,11 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
>  		return;
>  	cap = (u16 *)&save_state->data[0];
>
> -	pci_write_config_word(dev, pos + PCI_X_CMD, cap[i++]);
> +	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
> +		pci_write_config_word(dev, pos + PCI_X_BRIDGE_UP_SPL_CTL, cap[i++]);
> +		pci_write_config_word(dev, pos + PCI_X_BRIDGE_DN_SPL_CTL, cap[i++]);

These look like the proper two registers to save.

> +	} else
> +		pci_write_config_word(dev, pos + PCI_X_CMD, cap[i++]);
>  	pci_remove_saved_cap(save_state);
>  	kfree(save_state);
>  }
> diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
> index f09cce2..fb7eefd 100644
> --- a/include/linux/pci_regs.h
> +++ b/include/linux/pci_regs.h
> @@ -332,6 +332,8 @@
>  #define  PCI_X_STATUS_SPL_ERR	0x20000000	/* Rcvd Split Completion Error Msg */
>  #define  PCI_X_STATUS_266MHZ	0x40000000	/* 266 MHz capable */
>  #define  PCI_X_STATUS_533MHZ	0x80000000	/* 533 MHz capable */
> +#define PCI_X_BRIDGE_UP_SPL_CTL 10 /* PCI-X upstream split transaction limit */
> +#define PCI_X_BRIDGE_DN_SPL_CTL 14 /* PCI-X downstream split transaction limit */

Unless I am completely misreading the spec, while you have picked the right registers to save, the offsets should be 0x08 and 0x0c, or 8 and 12.

Eric
Linux 2.6.16.44-rc1
Security fixes since 2.6.16.43:
- CVE-2007-0005: Fix buffer overflow in Omnikey CardMan 4040 driver
- CVE-2007-1000: [IPV6]: Handle np->opt being NULL in ipv6_getsockopt_sticky().

Location:
ftp://ftp.kernel.org/pub/linux/kernel/people/bunk/linux-2.6.16.y/testing/

git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.16.y.git

Changes since 2.6.16.43:

Adrian Bunk (1):
      Linux 2.6.16.44-rc1

Ang Way Chuang (1):
      dvb-core: fix bug in CRC-32 checking on 64-bit systems

Arnaldo Carvalho de Melo (1):
      [TCP]: Fix minisock tcp_create_openreq_child() typo.

Arthur Kepner (1):
      IB/mthca: Use mmiowb after doorbell ring

Chris Wright (1):
      [IPV6] fix ipv6_getsockopt_sticky copy_to_user leak

Dan Yeisley (1):
      init_reap_node() initialization fix

David Moore (1):
      Missing critical phys_to_virt in lib/swiotlb.c

David S. Miller (4):
      video/aty/mach64_ct.c: fix bogus delay loop
      [SPARC64] bbc_i2c: Fix kenvctrld eating %100 cpu.
      [IPV6]: Handle np->opt being NULL in ipv6_getsockopt_sticky(). (CVE-2007-1000)
      SPARC64: Fix memory corruption in pci_4u_free_consistent()

David Stevens (1):
      [IPV6]: /proc/net/anycast6 unbalanced inet6_dev refcnt

Eli Cohen (1):
      IPoIB: Rejoin all multicast groups after a port event

Eric Dumazet (1):
      [INET]: twcal_jiffie should be unsigned long, not int

Herbert Xu (1):
      [UDP]: Reread uh pointer after pskb_trim

Hugh Dickins (1):
      make ppc64 current preempt-safe

Jin-Bong lee (1):
      DVB: cxusb: fix firmware patch for big endian systems

Komuro (1):
      modify 3c589_cs to be SMP safe

Marcel Holtmann (1):
      Fix buffer overflow in Omnikey CardMan 4040 driver (CVE-2007-0005)

Michael S. Tsirkin (1):
      IB/mthca: Fix off-by-one in FMR handling on memfree

Michal Wrobel (1):
      [IPV6]: anycast refcnt fix

Olaf Kirch (1):
      [IPV6]: Fix for ipv6_setsockopt NULL dereference

Sergey Vlasov (1):
      Input: psmouse - fix attribute access on 64-bit systems

 Makefile                                    |    2 +-
 arch/sparc64/kernel/pci_iommu.c             |    2 +-
 drivers/char/pcmcia/cm4040_cs.c             |    3 ++-
 drivers/infiniband/hw/mthca/mthca_cq.c      |    7 +++
 drivers/infiniband/hw/mthca/mthca_memfree.c |    2 +-
 drivers/infiniband/hw/mthca/mthca_qp.c      |   19 +++
 drivers/infiniband/hw/mthca/mthca_srq.c     |    8
 drivers/infiniband/ulp/ipoib/ipoib_ib.c     |    4 +++-
 drivers/input/mouse/psmouse-base.c          |    8 +---
 drivers/media/dvb/dvb-core/dvb_net.c        |    4 ++--
 drivers/media/dvb/dvb-usb/cxusb.c           |    4 ++--
 drivers/net/pcmcia/3c589_cs.c               |    7 +--
 drivers/sbus/char/bbc_i2c.c                 |   17 +
 drivers/video/aty/mach64_ct.c               |    4 ++--
 include/asm-powerpc/current.h               |   12 +++-
 include/net/inet_timewait_sock.h            |    2 +-
 lib/swiotlb.c                               |    2 +-
 mm/slab.c                                   |    2 +-
 net/ipv4/tcp_minisocks.c                    |    2 +-
 net/ipv4/udp.c                              |    1 +
 net/ipv6/addrconf.c                         |    2 ++
 net/ipv6/anycast.c                          |    1 +
 net/ipv6/ipv6_sockglue.c                    |   14 +-
 23 files changed, 95 insertions(+), 34 deletions(-)
Re: [PATCH 3/7] revoke: core code
On Fri, 2007-03-09 at 10:15 +0200, Pekka J Enberg wrote:
> + again:
> +	restart_addr = zap_page_range(vma, start_addr, end_addr - start_addr,
> +				      details);
> +
> +	need_break = need_resched() || need_lockbreak(details->i_mmap_lock);
> +	if (need_break)
> +		goto out_need_break;
> +
> +	if (restart_addr < end_addr) {
> +		start_addr = restart_addr;
> +		goto again;
> +	}
> +	return 0;
> +
> + out_need_break:
> +	spin_unlock(details->i_mmap_lock);
> +	cond_resched();
> +	spin_lock(details->i_mmap_lock);
> +	return -EINTR;

On Fri, 2007-03-09 at 13:30 +0100, Peter Zijlstra wrote:
> I'm not sure this scheme works, given a sufficiently loaded machine,
> this might never complete.

Hmm, so what's the alternative? It's better to fail revoke than lock up the box.

On Fri, 2007-03-09 at 13:30 +0100, Peter Zijlstra wrote:
> I'm never sure of operator precedence and prefer:
>
>  (vma->vm_flags & VM_SHARED) ...
>
> which leaves no room for error.

Thanks, fixed.
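The trade-off discussed in this message (give up with -EINTR when the lock has to be dropped, rather than spin until the work is finished) can be modelled in user space. The following is a rough sketch only, not the revoke code: struct range_work, process_range() and process_range_bounded() are hypothetical names, the "budget" stands in for need_resched()/need_lockbreak(), and the actual locking is left out entirely. It assumes progress is remembered across breaks, which may or may not match what the real patch does.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical description of a range that is processed unit by unit. */
struct range_work {
	unsigned long next;	/* first unit not yet processed */
	unsigned long end;	/* one past the last unit */
	unsigned long budget;	/* units allowed before a break is forced */
};

/*
 * Process units until the range is done or the budget is used up.
 * Returns 0 when the whole range has been processed and -EINTR when
 * the work had to be interrupted; progress is kept in w->next.
 */
int process_range(struct range_work *w)
{
	unsigned long done = 0;

	while (w->next < w->end) {
		if (done == w->budget)
			return -EINTR;	/* stand-in for the need_break path */
		w->next++;		/* stand-in for zapping one chunk */
		done++;
	}
	return 0;
}

/*
 * Retry an interrupted run a limited number of times, so that a busy
 * system makes the operation fail instead of looping forever.
 */
int process_range_bounded(struct range_work *w, unsigned int max_retries)
{
	unsigned int tries;

	for (tries = 0; tries <= max_retries; tries++) {
		int err = process_range(w);

		if (err != -EINTR)
			return err;
	}
	return -EINTR;
}
```

With a bounded retry count the caller sees a failure on a sufficiently loaded machine, which is the "better to fail revoke than lock up the box" behaviour; whether a fixed bound is the right policy is a separate question.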
[PATCH 1/5] revoke: special mmap handling
From: Pekka Enberg [EMAIL PROTECTED] This adds special handling for revoked memory mappings. We want to raise SIGBUS when accessing revoked mappings and return ENODEV when trying to remap with mmap(2). Acked-by: Alan Cox [EMAIL PROTECTED] Signed-off-by: Pekka Enberg [EMAIL PROTECTED] --- include/linux/mm.h |1 + mm/memory.c|3 +++ mm/mmap.c | 12 3 files changed, 12 insertions(+), 4 deletions(-) Index: uml-2.6/include/linux/mm.h === --- uml-2.6.orig/include/linux/mm.h 2007-03-11 13:07:57.0 +0200 +++ uml-2.6/include/linux/mm.h 2007-03-11 13:09:19.0 +0200 @@ -169,6 +169,7 @@ #define VM_NONLINEAR0x0080 /* Is no #define VM_MAPPED_COPY 0x0100 /* T if mapped copy of data (nommu mmap) */ #define VM_INSERTPAGE 0x0200 /* The vma has had vm_insert_page() done on it */ #define VM_ALWAYSDUMP 0x0400 /* Always include in core dumps */ +#define VM_REVOKED 0x0800 /* Mapping has been revoked */ #ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */ #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS Index: uml-2.6/mm/memory.c === --- uml-2.6.orig/mm/memory.c2007-03-11 13:07:57.0 +0200 +++ uml-2.6/mm/memory.c 2007-03-11 13:09:19.0 +0200 @@ -2504,6 +2504,9 @@ int __handle_mm_fault(struct mm_struct * if (unlikely(is_vm_hugetlb_page(vma))) return hugetlb_fault(mm, vma, address, write_access); + if (unlikely(vma-vm_flags VM_REVOKED)) + return VM_FAULT_SIGBUS; + pgd = pgd_offset(mm, address); pud = pud_alloc(mm, pgd, address); if (!pud) Index: uml-2.6/mm/mmap.c === --- uml-2.6.orig/mm/mmap.c 2007-03-11 13:07:57.0 +0200 +++ uml-2.6/mm/mmap.c 2007-03-11 13:09:19.0 +0200 @@ -1030,10 +1030,14 @@ accountable = 0; error = -ENOMEM; munmap_back: vma = find_vma_prepare(mm, addr, prev, rb_link, rb_parent); - if (vma vma-vm_start addr + len) { - if (do_munmap(mm, addr, len)) - return -ENOMEM; - goto munmap_back; + if (vma) { + if (unlikely(vma-vm_flags VM_REVOKED)) + return -ENODEV; + if (vma-vm_start addr + len) { + if (do_munmap(mm, addr, len)) + return -ENOMEM; + goto munmap_back; + } } /* 
Check against address space limit. */
[PATCH 2/5] revoke: core code
From: Pekka Enberg [EMAIL PROTECTED] The revokeat(2) and frevoke(2) system calls invalidate open file descriptors and shared mappings of an inode. After an successful revocation, operations on file descriptors fail with the EBADF or ENXIO error code for regular and device files, respectively. Attempting to read from or write to a revoked mapping causes SIGBUS. The actual operation is done in two passes: 1. Revoke all file descriptors that point to the given inode. We do this under tasklist_lock so that after this pass, we don't need to worry about racing with close(2) or dup(2). 2. Take down shared memory mappings of the inode and close all file pointers. The file descriptors and memory mapping ranges are preserved until the owning task does close(2) and munmap(2), respectively. Signed-off-by: Pekka Enberg [EMAIL PROTECTED] --- fs/Makefile |2 fs/revoke.c | 588 +++ fs/revoked_inode.c | 378 +++ include/linux/fs.h |4 include/linux/revoked_fs_i.h | 20 + include/linux/syscalls.h |3 6 files changed, 994 insertions(+), 1 deletion(-) Index: uml-2.6/fs/Makefile === --- uml-2.6.orig/fs/Makefile2007-03-11 13:07:57.0 +0200 +++ uml-2.6/fs/Makefile 2007-03-11 13:09:20.0 +0200 @@ -11,7 +11,7 @@ obj-y := open.o read_write.o file_table. 
attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \ seq_file.o xattr.o libfs.o fs-writeback.o \ pnode.o drop_caches.o splice.o sync.o utimes.o \ - stack.o + stack.o revoke.o revoked_inode.o ifeq ($(CONFIG_BLOCK),y) obj-y += buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o Index: uml-2.6/include/linux/syscalls.h === --- uml-2.6.orig/include/linux/syscalls.h 2007-03-11 13:07:57.0 +0200 +++ uml-2.6/include/linux/syscalls.h2007-03-11 13:09:20.0 +0200 @@ -605,4 +605,7 @@ asmlinkage long sys_getcpu(unsigned __us int kernel_execve(const char *filename, char *const argv[], char *const envp[]); +asmlinkage int sys_revokeat(int dfd, const char __user *filename); +asmlinkage int sys_frevoke(unsigned int fd); + #endif Index: uml-2.6/include/linux/fs.h === --- uml-2.6.orig/include/linux/fs.h 2007-03-11 13:07:57.0 +0200 +++ uml-2.6/include/linux/fs.h 2007-03-11 13:09:20.0 +0200 @@ -1100,6 +1100,7 @@ struct file_operations { int (*flock) (struct file *, int, struct file_lock *); ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int); ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int); + int (*revoke)(struct file *); }; struct inode_operations { @@ -1739,6 +1740,9 @@ extern ssize_t generic_splice_sendpage(s extern long do_splice_direct(struct file *in, loff_t *ppos, struct file *out, size_t len, unsigned int flags); +/* fs/revoke.c */ +extern int generic_file_revoke(struct file *); + extern void file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping); extern loff_t no_llseek(struct file *file, loff_t offset, int origin); Index: uml-2.6/fs/revoke.c === --- /dev/null 1970-01-01 00:00:00.0 + +++ uml-2.6/fs/revoke.c 2007-03-11 13:14:42.0 +0200 @@ -0,0 +1,588 @@ +/* + * fs/revoke.c - Invalidate all current open file descriptors of an inode. + * + * Copyright (C) 2006-2007 Pekka Enberg + * + * This file is released under the GPLv2. 
+ */ + +#include linux/file.h +#include linux/fs.h +#include linux/namei.h +#include linux/mm.h +#include linux/mman.h +#include linux/module.h +#include linux/mount.h +#include linux/sched.h +#include linux/revoked_fs_i.h + +/* + * This is used for pre-allocating an array of file pointers so that we don't + * have to do memory allocation under tasklist_lock. + */ +struct revoke_table { + struct file **files; + unsigned long size; + unsigned long end; + unsigned long restore_start; +}; + +struct kmem_cache *revokefs_inode_cache; + +/* + * Revoked file descriptors point to inodes in the revokefs filesystem. + */ +static struct vfsmount *revokefs_mnt; + +static struct file *get_revoked_file(void) +{ + struct dentry *dentry; + struct inode *inode; + struct file *filp; + struct qstr name; + + filp = get_empty_filp(); + if (!filp) + goto err; + + inode = new_inode(revokefs_mnt-mnt_sb); + if (!inode) + goto err_inode; + + name.name = revoked_file; + name.len = strlen(name.name); + dentry =
[PATCH 3/5] revoke: support for ext2 and ext3
From: Pekka Enberg [EMAIL PROTECTED]

Add revoke support to ext2 and ext3 by wiring f_ops->revoke with generic_file_revoke.

Signed-off-by: Pekka Enberg [EMAIL PROTECTED]
---
 fs/ext2/file.c |    1 +
 fs/ext3/file.c |    1 +
 2 files changed, 2 insertions(+)

Index: uml-2.6/fs/ext2/file.c
===================================================================
--- uml-2.6.orig/fs/ext2/file.c 2007-03-11 13:05:33.0 +0200
+++ uml-2.6/fs/ext2/file.c 2007-03-11 13:09:21.0 +0200
@@ -56,6 +56,7 @@ const struct file_operations ext2_file_o
 	.sendfile	= generic_file_sendfile,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+	.revoke		= generic_file_revoke,
 };
 
 #ifdef CONFIG_EXT2_FS_XIP
Index: uml-2.6/fs/ext3/file.c
===================================================================
--- uml-2.6.orig/fs/ext3/file.c 2007-03-11 13:05:33.0 +0200
+++ uml-2.6/fs/ext3/file.c 2007-03-11 13:09:21.0 +0200
@@ -123,6 +123,7 @@ const struct file_operations ext3_file_o
 	.sendfile	= generic_file_sendfile,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= generic_file_splice_write,
+	.revoke		= generic_file_revoke,
 };
 
 const struct inode_operations ext3_file_inode_operations = {
[PATCH 4/5] revoke: add documentation
From: Pekka Enberg [EMAIL PROTECTED]

This documents revoke file operation in Documentation/filesystems/vfs.txt.

Acked-by: Alan Cox [EMAIL PROTECTED]
Signed-off-by: Pekka Enberg [EMAIL PROTECTED]
---
 Documentation/filesystems/vfs.txt |    5 +++++
 1 file changed, 5 insertions(+)

Index: uml-2.6/Documentation/filesystems/vfs.txt
===================================================================
--- uml-2.6.orig/Documentation/filesystems/vfs.txt 2007-03-11 13:05:33.0 +0200
+++ uml-2.6/Documentation/filesystems/vfs.txt 2007-03-11 13:09:22.0 +0200
@@ -732,6 +732,7 @@ struct file_operations {
 int);
 	ssize_t (*splice_read)(struct file *, struct pipe_inode_info *,
 size_t, unsigned int);
+	int (*revoke)(struct file *);
 };
 
 Again, all methods are called without any locks being held, unless
@@ -805,6 +806,10 @@ otherwise noted.
   splice_read: called by the VFS to splice data from file to a pipe. This
 	       method is used by the splice(2) system call
 
+  revoke: called by revokeat(2) and frevoke(2) system calls to revoke access
+	to an open file. This method must ensure that all currently blocked
+	writes are flushed and reads will fail.
+
 Note that the file operations are implemented by the specific
 filesystem in which the inode resides. When opening a device node
 (character or block special) most filesystems will call special
[PATCH 5/5] revoke: wire up i386 system calls
From: Pekka Enberg [EMAIL PROTECTED]

Make revokeat and frevoke system calls available to user-space on i386.

Acked-by: Alan Cox [EMAIL PROTECTED]
Signed-off-by: Pekka Enberg [EMAIL PROTECTED]
---
 arch/i386/kernel/syscall_table.S |    3 +++
 include/asm-i386/unistd.h        |    4 +++-
 2 files changed, 6 insertions(+), 1 deletion(-)

Index: uml-2.6/arch/i386/kernel/syscall_table.S
===================================================================
--- uml-2.6.orig/arch/i386/kernel/syscall_table.S 2007-03-11 13:05:32.0 +0200
+++ uml-2.6/arch/i386/kernel/syscall_table.S 2007-03-11 13:09:23.0 +0200
@@ -319,3 +319,6 @@ .long sys_unshare /* 310 */
 	.long sys_move_pages
 	.long sys_getcpu
 	.long sys_epoll_pwait
+	.long sys_revokeat		/* 320 */
+	.long sys_frevoke
+
Index: uml-2.6/include/asm-i386/unistd.h
===================================================================
--- uml-2.6.orig/include/asm-i386/unistd.h 2007-03-11 13:05:33.0 +0200
+++ uml-2.6/include/asm-i386/unistd.h 2007-03-11 13:09:23.0 +0200
@@ -325,10 +325,12 @@ #define __NR_unshare 310
 #define __NR_move_pages		317
 #define __NR_getcpu		318
 #define __NR_epoll_pwait	319
+#define __NR_revokeat		320
+#define __NR_frevoke		321
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 320
+#define NR_syscalls 322
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Re: SATA resume slowness, e1000 MSI warning
Quoting Eric W. Biederman [EMAIL PROTECTED]:
> Subject: Re: SATA resume slowness, e1000 MSI warning
>
> Michael S. Tsirkin [EMAIL PROTECTED] writes:
>
>>> The only case I can see which might trigger this is if we saved pci-X state
>>> and then didn't restore it because we could not find the capability on
>>> restore.
>>
>> Hmm. pci_save_pcix_state/pci_restore_pcix_state seem to only handle
>> regular devices and seem to ignore the fact that for bridge PCI-X
>> capability has a different structure. Is this intentional?
>
> Probably not as such. I don't think we have any drivers for bridge
> devices so I don't think it matters. It likely wouldn't hurt to fix
> it just in case though.
>
> Do any of the mellanox cards do anything with the bridge on the card?

Yes but they do their own thing wrt saving/restoring registers. Look at drivers/infiniband/hw/mthca/mthca_reset.c

>> If not, here's a patch to fix this.
>> Warning: completely untested.
>
> If you fix the offsets and diff this against my last fix (to never
> free the buffer) I think your patch makes sense.

Let's agree what the correct offsets are.

>> PCI: restore bridge PCI-X capability registers after PM event
>>
>> Restore PCI-X bridge up/downstream capability registers after PM event.
>> This includes maximum split transaction commitment limit which might be
>> vital for PCI X.
>>
>> Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED]
>>
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index df49530..4b788ef 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -597,14 +597,19 @@ static int pci_save_pcix_state(struct pci_dev *dev)
>>  	if (pos <= 0)
>>  		return 0;
>>
>> -	save_state = kzalloc(sizeof(*save_state) + sizeof(u16), GFP_KERNEL);
>> +	save_state = kzalloc(sizeof(*save_state) + sizeof(u16) * 2, GFP_KERNEL);
>>  	if (!save_state) {
>> -		dev_err(&dev->dev, "Out of memory in pci_save_pcie_state\n");
>> +		dev_err(&dev->dev, "Out of memory in pci_save_pcix_state\n");
>>  		return -ENOMEM;
>>  	}
>>  	cap = (u16 *)&save_state->data[0];
>>
>> -	pci_read_config_word(dev, pos + PCI_X_CMD, &cap[i++]);
>> +	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
>
> This appears to be the proper test.
>
>> +		pci_read_config_word(dev, pos + PCI_X_BRIDGE_UP_SPL_CTL, &cap[i++]);
>> +		pci_read_config_word(dev, pos + PCI_X_BRIDGE_DN_SPL_CTL, &cap[i++]);
>> +	} else
>> +		pci_read_config_word(dev, pos + PCI_X_CMD, &cap[i++]);
>> +
>>  	pci_add_saved_cap(dev, save_state);
>>  	return 0;
>>  }
>> @@ -621,7 +626,11 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
>>  		return;
>>  	cap = (u16 *)&save_state->data[0];
>>
>> -	pci_write_config_word(dev, pos + PCI_X_CMD, cap[i++]);
>> +	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
>> +		pci_write_config_word(dev, pos + PCI_X_BRIDGE_UP_SPL_CTL, cap[i++]);
>> +		pci_write_config_word(dev, pos + PCI_X_BRIDGE_DN_SPL_CTL, cap[i++]);
>
> These look like the proper two registers to save.
>
>> +	} else
>> +		pci_write_config_word(dev, pos + PCI_X_CMD, cap[i++]);
>>  	pci_remove_saved_cap(save_state);
>>  	kfree(save_state);
>>  }
>> diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
>> index f09cce2..fb7eefd 100644
>> --- a/include/linux/pci_regs.h
>> +++ b/include/linux/pci_regs.h
>> @@ -332,6 +332,8 @@
>>  #define  PCI_X_STATUS_SPL_ERR	0x20000000	/* Rcvd Split Completion Error Msg */
>>  #define  PCI_X_STATUS_266MHZ	0x40000000	/* 266 MHz capable */
>>  #define  PCI_X_STATUS_533MHZ	0x80000000	/* 533 MHz capable */
>> +#define PCI_X_BRIDGE_UP_SPL_CTL 10 /* PCI-X upstream split transaction limit */
>> +#define PCI_X_BRIDGE_DN_SPL_CTL 14 /* PCI-X downstream split transaction limit */
>
> Unless I am completely misreading the spec, while you have picked the
> right registers to save, the offsets should be 0x08 and 0x0c, or 8 and 12.

No, the spec is written in terms of dwords (32 bit), we are storing words (16 bits). The data at offsets 8 and 12 is read-only split transaction capacity. Split transaction limit starts at bit 16 so you need to add 2 to byte offset. Right?

-- 
MST
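The offset arithmetic in this reply can be worked through on an emulated, byte-addressed, little-endian config space. This is only a sketch of the reasoning, taking the register layout as described in the message (capacity in bits 15:0, limit in bits 31:16 of the dwords at byte offsets 8 and 12) rather than checking it against the PCI-X specification; cfg_read_word(), cfg_read_dword(), cfg_write_dword() and word_offset() are hypothetical helpers and not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Read a 16-bit little-endian word from an emulated config space. */
uint16_t cfg_read_word(const uint8_t *cfg, unsigned int off)
{
	return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

/* Read a 32-bit little-endian dword from an emulated config space. */
uint32_t cfg_read_dword(const uint8_t *cfg, unsigned int off)
{
	return (uint32_t)cfg[off] |
	       ((uint32_t)cfg[off + 1] << 8) |
	       ((uint32_t)cfg[off + 2] << 16) |
	       ((uint32_t)cfg[off + 3] << 24);
}

/* Store a 32-bit value in little-endian byte order. */
void cfg_write_dword(uint8_t *cfg, unsigned int off, uint32_t val)
{
	cfg[off] = (uint8_t)(val & 0xff);
	cfg[off + 1] = (uint8_t)((val >> 8) & 0xff);
	cfg[off + 2] = (uint8_t)((val >> 16) & 0xff);
	cfg[off + 3] = (uint8_t)((val >> 24) & 0xff);
}

/*
 * Byte offset of the 16-bit field that starts at bit 'bit' (0 or 16)
 * of the dword located at byte offset 'dword_off'.
 */
unsigned int word_offset(unsigned int dword_off, unsigned int bit)
{
	assert(bit == 0 || bit == 16);
	return dword_off + bit / 8;
}
```

Under that layout, a field starting at bit 16 of the dword at byte offset 8 is the 16-bit word at byte offset 8 + 16/8 = 10, and the one in the dword at offset 12 is at 12 + 2 = 14, which are the values used in the patch. Reading words at 8 and 12 would return the low halves instead.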
Re: [PATCH] CIRRUS: Delete unused header file.
On Sat, 10 Mar 2007, Andrew Morton wrote:

> On Sat, 10 Mar 2007 17:27:44 -0500 (EST) Robert P. J. Day [EMAIL PROTECTED] wrote:
>
>> Delete apparently unused header file sound/pci/cs46xx/imgs/cwcemb80.h.
>
> That patch series was rather a mess
>
> - Multiple patches with the same Subject: (I might have lost some as
>   a result)

yes, that was a bad decision on my part, sorry.

> - Several patches which tried to remove the same header file

*that* shouldn't have happened, those patches were designed to be independent of one another and, AFAIK, i submitted them only once. i have no idea how the above might have happened.

> - Several patches which simply didn't apply

hm ... they were created against the latest git tree, i don't know why they wouldn't apply.

...

> - Useless indenting in changelog text which I have to edit away.

ah, i'll remember to not indent the changelog text next time, sorry.

rday

-- 
Robert P. J. Day
Linux Consulting, Training and Annoying Kernel Pedantry
Waterloo, Ontario, CANADA

http://fsdev.net/wiki/index.php?title=Main_Page
Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2
Hi Con,

On Sun, 2007-03-11 at 14:57 +1100, Con Kolivas wrote:
> What follows this email is a patch series for the latest version of the RSDL
> cpu scheduler (ie v0.29). I have addressed all bugs that I am able to
> reproduce in this version so if some people would be kind enough to test if
> there are any hidden bugs or oops lurking, it would be nice to know in
> anticipation of putting this back in -mm. Thanks.
>
> Full patch for 2.6.21-rc3-mm2:
> http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc3-mm2-rsdl-0.29.patch

I'm seeing a cpu distribution problem running this on my P4 box.

Scenario: listening to music collection (mp3) via Amarok. Enable Amarok visualization gforce, and size such that X and gforce each use ~50% cpu. Start rip/encode of new CD with grip/lame encoder. Lame is set to use both cpus, at nice 5. Once the encoders start, they receive considerably more cpu than nice 0 X/Gforce, taking ~120% and leaving the remaining 80% for X/Gforce and Amarok (when it updates its ~12k entry database) to squabble over.

With 2.6.21-rc3, X/Gforce maintain their ~50% cpu (remain smooth), and the encoders (100% cpu bound) get what's left when Amarok isn't eating it.

I plunked the above patch into plain 2.6.21-rc3 and retested to eliminate other mm tree differences, and it's repeatable. The nice 5 cpu hogs always receive considerably more than the nice 0 sleepers.

	-Mike
Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2
On Sunday 11 March 2007 22:39, Mike Galbraith wrote: Hi Con, On Sun, 2007-03-11 at 14:57 +1100, Con Kolivas wrote: What follows this email is a patch series for the latest version of the RSDL cpu scheduler (ie v0.29). I have addressed all bugs that I am able to reproduce in this version so if some people would be kind enough to test if there are any hidden bugs or oops lurking, it would be nice to know in anticipation of putting this back in -mm. Thanks. Full patch for 2.6.21-rc3-mm2: http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc3-mm2-rsdl-0.29 .patch I'm seeing a cpu distribution problem running this on my P4 box. Scenario: listening to music collection (mp3) via Amarok. Enable Amarok visualization gforce, and size such that X and gforce each use ~50% cpu. Start rip/encode of new CD with grip/lame encoder. Lame is set to use both cpus, at nice 5. Once the encoders start, they receive considerable more cpu than nice 0 X/Gforce, taking ~120% and leaving the remaining 80% for X/Gforce and Amarok (when it updates it's ~12k entry database) to squabble over. With 2.6.21-rc3, X/Gforce maintain their ~50% cpu (remain smooth), and the encoders (100%cpu bound) get whats left when Amarok isn't eating it. I plunked the above patch into plain 2.6.21-rc3 and retested to eliminate other mm tree differences, and it's repeatable. The nice 5 cpu hogs always receive considerably more that the nice 0 sleepers. Thanks for the report. I'm assuming you're describing a single hyperthread P4 here in SMP mode so 2 logical cores. Can you elaborate on whether there is any difference as to which cpu things are bound to as well? Can you also see what happens with lame not niced to +5 (ie at 0) and with lame at nice +19. Thanks. -- -ck - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/9] signalfd/timerfd - signalfd core ...
On 03/10, Davide Libenzi wrote:

> +static void signalfd_put_sighand(struct signalfd_ctx *ctx,
> +                                 struct sighand_struct *sighand,
> +                                 unsigned long *flags)
> +{
> +        unlock_task_sighand(ctx->tsk, flags);
> +}

Note that signalfd_put_sighand() doesn't need sighand parameter, please see below.

> +int signalfd_deliver(struct sighand_struct *sighand, int sig,
> +                     struct siginfo *info)
> +{
> +        int nsig = 0;
> +        struct signalfd_ctx *ctx, *tmp;
> +
> +        list_for_each_entry_safe(ctx, tmp, &sighand->sfdlist, lnk) {
> +                /*
> +                 * We use a negative signal value as a way to broadcast that the
> +                 * sighand has been orphaned, so that we can notify all the
> +                 * listeners about this. Remeber the ctx->sigmask is inverted,
> +                 * so if the user is interested in a signal, that corresponding
> +                 * bit will be zero.
> +                 */
> +                if (sig < 0)
> +                        list_del_init(&ctx->lnk);

I'm afraid this is not right. This should be per-thread.

Suppose we have threads T1 and T2 from the same thread group. sighand->sfdlist contains ctx1 and ctx2 linked to T1 and T2. Now, T1 exits, __exit_signal() does signalfd_notify(sighand, -1), and unlinks all threads, not just T1.

IOW, we should do

        if (ctx->tsk == current) {
                list_del_init(&ctx->lnk);
                wake_up(&ctx->wqh);
        }

Perhaps it makes sense to not re-use signalfd_deliver(), but introduce a new signalfd_xxx(sighand, tsk) helper for de_thread/exit_signal.

Btw, signalfd_deliver() doesn't use info parameter.

> +                if (sig < 0 || !sigismember(&ctx->sigmask, sig)) {
> +                        wake_up(&ctx->wqh);

Minor nit. Perhaps it makes sense to do

        void signalfd_deliver(struct task_struct *tsk, int sig,
                              struct sigpending *pending)
        {
                struct sighand_struct *sighand = tsk->sighand;
                int private = (&tsk->pending == pending);

                list_for_each_entry_safe(ctx, tmp, &sighand->sfdlist, lnk) {
                        if (private && ctx->tsk != tsk)
                                continue;
                        if (!sigismember(&ctx->sigmask, sig))
                                wake_up(&ctx->wqh);
                }
        }

Even better: signalfd_deliver(struct task_struct *tsk, int sig, int private). This way specific_send_sig_info/send_sigqueue won't do a false wakeup.

> +asmlinkage long sys_signalfd(int ufd, sigset_t __user *user_mask, size_t sizemask)
> +{
> ...
> +        if ((sighand = signalfd_get_sighand(ctx, &flags)) != NULL) {
> +                ctx->sigmask = sigmask;
> +                signalfd_put_sighand(ctx, sighand, &flags);
> +        }

This looks like unneeded complication to me, I'd suggest

        if (signalfd_get_sighand(ctx, &flags)) {
                ctx->sigmask = sigmask;
                signalfd_put_sighand(ctx, &flags);
        }

unlock_task_sighand() (and thus signalfd_put_sighand) doesn't need sighand parameter.

signalfd_get_sighand() is in fact boolean. It makes sense to return sighand, it may be useful, but this patch only needs != NULL. Every usage of signalfd_get_sighand() could be simplified accordingly.

> --- linux-2.6.20.ep2.orig/fs/exec.c 2007-03-10 15:57:00.0 -0800
> +++ linux-2.6.20.ep2/fs/exec.c 2007-03-10 15:57:51.0 -0800
> @@ -50,6 +50,7 @@
>  #include <linux/tsacct_kern.h>
>  #include <linux/cn_proc.h>
>  #include <linux/audit.h>
> +#include <linux/signalfd.h>
>
>  #include <asm/uaccess.h>
>  #include <asm/mmu_context.h>
> @@ -583,6 +584,17 @@
>          int count;
>
>          /*
> +         * Tell all the sighand listeners that this sighand has
> +         * been detached. Needs to be called with the sighand lock
> +         * held.
> +         */
> +        if (unlikely(!list_empty(&oldsighand->sfdlist))) {
> +                spin_lock_irq(&oldsighand->siglock);
> +                signalfd_notify(oldsighand, -1, NULL);
> +                spin_unlock_irq(&oldsighand->siglock);
> +        }

Very minor nit. I'd suggest to make a new helper and put it in signalfd.h (like signalfd_notify()). This will help CONFIG_SIGNALFD.

I still think that we should do this only for suid-exec. If application passes a signalfd to another process with unix socket, it should know what it does. But yes, I agree, we can change this later if needed. (in that case the caller of the above helper should be flush_old_exec).

Oleg.
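The per-thread unlink asked for in this review can be illustrated with a small user-space sketch. It only models the list logic: `struct task`, `struct sfd_ctx` and `detach_task()` are hypothetical stand-ins rather than the kernel's signalfd structures, and a plain singly linked list replaces `list_head`.

```c
#include <stddef.h>

/* Hypothetical stand-ins for task_struct and signalfd_ctx. */
struct task {
	int pid;
};

struct sfd_ctx {
	struct task *tsk;       /* thread this context is attached to */
	int woken;              /* set where the kernel would call wake_up() */
	struct sfd_ctx *next;   /* simple list instead of list_head */
};

/*
 * Unlink and "wake" only the contexts that belong to the exiting
 * thread; contexts attached to other threads stay on the list.
 * Returns the number of contexts detached.
 */
int detach_task(struct sfd_ctx **head, const struct task *exiting)
{
	struct sfd_ctx **pp = head;
	int detached = 0;

	while (*pp) {
		struct sfd_ctx *ctx = *pp;

		if (ctx->tsk == exiting) {
			*pp = ctx->next;    /* unlink this entry only */
			ctx->next = NULL;
			ctx->woken = 1;
			detached++;
		} else {
			pp = &ctx->next;
		}
	}
	return detached;
}
```

With ctx1 attached to T1 and ctx2 attached to T2, detaching T1 leaves ctx2 linked and not woken, which is the behaviour the review argues for.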
Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2
On Sun, 2007-03-11 at 22:48 +1100, Con Kolivas wrote:
> Thanks for the report. I'm assuming you're describing a single hyperthread P4 here in SMP mode so 2 logical cores. Can you elaborate on whether there is any difference as to which cpu things are bound to as well? Can you also see what happens with lame not niced to +5 (ie at 0) and with lame at nice +19.

Yes, one P4/HT/SMP. No change at nice 0, but setting the encoders to nice 19 did put X/gforce ~back where they were with 2.6.21-rc3. Tasks don't seem to be bound to any particular cpu; placement relies on load balancing (which appears to be working).

-Mike
Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2
* Mike Galbraith [EMAIL PROTECTED] wrote:

> > Full patch for 2.6.21-rc3-mm2:
> > http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc3-mm2-rsdl-0.29.patch
>
> I'm seeing a cpu distribution problem running this on my P4 box.
>
> With 2.6.21-rc3, X/Gforce maintain their ~50% cpu (remain smooth), and the encoders (100%cpu bound) get whats left when Amarok isn't eating it.
>
> I plunked the above patch into plain 2.6.21-rc3 and retested to eliminate other mm tree differences, and it's repeatable. The nice 5 cpu hogs always receive considerably more that the nice 0 sleepers.

hm. Do you get the same problem on UP too? (i.e. let's eliminate any SMP/HT artifacts)

Ingo
Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2
On Sun, 2007-03-11 at 13:10 +0100, Ingo Molnar wrote:
> * Mike Galbraith [EMAIL PROTECTED] wrote:
>
> > I'm seeing a cpu distribution problem running this on my P4 box.
> >
> > With 2.6.21-rc3, X/Gforce maintain their ~50% cpu (remain smooth), and the encoders (100%cpu bound) get whats left when Amarok isn't eating it.
> >
> > I plunked the above patch into plain 2.6.21-rc3 and retested to eliminate other mm tree differences, and it's repeatable. The nice 5 cpu hogs always receive considerably more that the nice 0 sleepers.
>
> hm. Do you get the same same problem on UP too? (i.e. lets eliminate any SMP/HT artifacts)

I'll boot up nosmp and report back (but now it's time to take Opa to the Gasthaus for his Sunday afternoon brewskies;)

-Mike
RE: [git patches] libata fixes
Hello,

> It seems like IRQ is not getting through. The first IRQ driven command is failing for you.

Extract is:

ata7: PATA max UDMA/100 cmd 0x00019c00 ctl 0x00019882 bmdma 0x00019400 irq 16
ata8: PATA max UDMA/100 cmd 0x00019800 ctl 0x00019482 bmdma 0x00019408 irq 16

IRQ 16 is IO-APIC-fasteoi for libata, and is not shared... but all the other libata IRQs are IO-APIC-edge.

> * Does giving 'acpi=off' or 'irqpoll' make any difference?
> * Can you connect a harddisk to the channel and see whether that works?

Tried that. The disk is identified as ATA-7: Mastor 6Y080L0, YAR41BW0, max UDMA/13 and then it times out again... Then tried with acpi=off: same result (identify is OK, but then timeout). With irqpoll it was OK.

Let's then go back to my DVD-RW and test irqpoll... and... Yes, got it! It is identified, it can be mounted, and read as /dev/sr1!

/proc/interrupts shows a count of 0 for IRQ 16, so yes, the interrupt goes somewhere else... Diffing copies of /proc/interrupts while accessing the DVD gives two possibilities: IRQ14 or IRQ18, but both also count when not accessing the DVD...

Question: does running with irqpoll affect performance?

Paul
Re: libata extension
> I believe you should be able to do this by sending ATA pass-through SCSI commands into the device using SG_IO, without any kernel changes. It's really the mechanism that's meant for this..

It should work, but Mark Lord reported some problems with READ_LONG on PIIX/ICH Intel chipsets. I don't know if he ever resolved them, but if not I have a patch that ought to.

Alan
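For readers who have not used the SG_IO route, here is a minimal user-space sketch of what sending READ VERIFY SECTORS (0x40) through an ATA PASS-THROUGH (16) command might look like. The function names `build_read_verify_cdb()` and `send_read_verify()` are hypothetical, the CDB layout follows my reading of the SAT specification (28-bit, non-data protocol), and the sketch is untested against real hardware; whether a given controller and libata version handle it is exactly the open question in this thread.

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

/*
 * Fill a 16-byte ATA PASS-THROUGH (16) CDB for a 28-bit
 * READ VERIFY SECTORS command. No data is transferred.
 */
void build_read_verify_cdb(uint8_t cdb[16], uint32_t lba, uint8_t count)
{
	memset(cdb, 0, 16);
	cdb[0]  = 0x85;                         /* ATA PASS-THROUGH (16) opcode */
	cdb[1]  = 3 << 1;                       /* protocol 3: non-data, EXTEND = 0 */
	cdb[2]  = 0x20;                         /* CK_COND: return ATA registers in sense */
	cdb[6]  = count;                        /* sector count */
	cdb[8]  = lba & 0xff;                   /* LBA bits 7:0 */
	cdb[10] = (lba >> 8) & 0xff;            /* LBA bits 15:8 */
	cdb[12] = (lba >> 16) & 0xff;           /* LBA bits 23:16 */
	cdb[13] = 0x40 | ((lba >> 24) & 0x0f);  /* LBA mode + LBA bits 27:24 */
	cdb[14] = 0x40;                         /* ATA command: READ VERIFY SECTORS */
}

/*
 * Submit the command on an open device node. Returns the ioctl()
 * result; the caller still has to inspect the sense buffer for the
 * ATA status and error registers.
 */
int send_read_verify(int fd, uint32_t lba, uint8_t count, uint8_t sense[32])
{
	uint8_t cdb[16];
	sg_io_hdr_t hdr;

	build_read_verify_cdb(cdb, lba, count);

	memset(&hdr, 0, sizeof(hdr));
	hdr.interface_id = 'S';
	hdr.dxfer_direction = SG_DXFER_NONE;    /* non-data command */
	hdr.cmd_len = sizeof(cdb);
	hdr.cmdp = cdb;
	hdr.mx_sb_len = 32;
	hdr.sbp = sense;
	hdr.timeout = 10000;                    /* milliseconds */

	return ioctl(fd, SG_IO, &hdr);
}
```

A diagnostic tool would open the disk node, call `send_read_verify()`, and decode the returned sense data for the error handling the original poster asked about.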
[PATCH] driver core: fix device_add error path
Dmitriy Monakhov [EMAIL PROTECTED] writes:

> Greg Kroah-Hartman [EMAIL PROTECTED] writes:
>
> From: James Simmons [EMAIL PROTECTED]
>
> When a device fails to register the class symlinks where not cleaned up. This left a symlink in the /sys/class/device/ directory that pointed to no where. This caused the sysfs_follow_link Oops I reported earlier. This patch cleanups up the symlink. Please apply. Thank you.
>
> Signed-Off: James Simmons [EMAIL PROTECTED]
> Signed-off-by: Greg Kroah-Hartman [EMAIL PROTECTED]
> ---
>  drivers/base/core.c | 31 ++-
>  1 files changed, 30 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index d04fd33..cf2a398 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -637,12 +637,41 @@ int device_add(struct device *dev)
>                          BUS_NOTIFY_DEL_DEVICE, dev);
>          device_remove_groups(dev);
>   GroupError:
> -        device_remove_attrs(dev);
> +        device_remove_attrs(dev);
>   AttrsError:
>          if (dev->devt_attr) {
>                  device_remove_file(dev, dev->devt_attr);
>                  kfree(dev->devt_attr);
>          }
> +
> +        if (dev->class) {
> +                sysfs_remove_link(&dev->kobj, "subsystem");
> +                /* If this is not a "fake" compatible device, remove the
> +                 * symlink from the class to the device. */
> +                if (dev->kobj.parent != &dev->class->subsys.kset.kobj)
> +                        sysfs_remove_link(&dev->class->subsys.kset.kobj,
> +                                          dev->bus_id);
> +#ifdef CONFIG_SYSFS_DEPRECATED
> +                if (parent) {
> +                        char *class_name = make_class_name(dev->class->name,
> +                                                           &dev->kobj);
> +                        if (class_name)
> +                                sysfs_remove_link(&dev->parent->kobj,
> +                                                  class_name);
> +                        kfree(class_name);
> +                        sysfs_remove_link(&dev->kobj, "device");
> +                }
> +#endif
> +
block begin
> +                down(&dev->class->sem);
> +                /* notify any interfaces that the device is now gone */
> +                list_for_each_entry(class_intf, &dev->class->interfaces, node)
> +                        if (class_intf->remove_dev)
> +                                class_intf->remove_dev(dev, class_intf);
> +                /* remove the device from the class list */
> +                list_del_init(&dev->node);
> +                up(&dev->class->sem);
block end

Maybe I've missed something, but I'm a little bit confused. For example, if an error happens in device_pm_add() we jump to label PMError and the code from the block above will be executed (the device will be removed from the list), but this device wasn't added to this list yet!

I've checked it one more time: the code really is broken, and I think I understand how this happened. It looks like the whole code chunk was copy-pasted from device_del(), but in the device_add() error path the device hasn't been added to the dev->class->devices list yet. The following patch fixes this copy-paste error:

[PATCH] driver core: fix device_add error path

- At the moment we jump here, the device hasn't been added to the dev->class->devices list yet.

Signed-off-by: Monakhov Dmitriy [EMAIL PROTECTED]
---
 drivers/base/core.c |9 -
 1 files changed, 0 insertions(+), 9 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 142c222..7d2459b 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -684,15 +684,6 @@ int device_add(struct device *dev)
 #endif
                 sysfs_remove_link(&dev->kobj, "device");
         }
-
-        down(&dev->class->sem);
-        /* notify any interfaces that the device is now gone */
-        list_for_each_entry(class_intf, &dev->class->interfaces, node)
-                if (class_intf->remove_dev)
-                        class_intf->remove_dev(dev, class_intf);
-        /* remove the device from the class list */
-        list_del_init(&dev->node);
-        up(&dev->class->sem);
         }
  ueventattrError:
         device_remove_file(dev, &dev->uevent_attr);
--
1.5.0.1
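The rule behind this fix (an error label may only undo steps that had already completed when the jump happened) can be shown with a small stand-alone sketch. Everything here is hypothetical: the step names, `struct dev_state` and `add_device()` merely mimic the shape of the device_add() error path and are not driver-core code.

```c
/* Hypothetical setup steps, performed in this order. */
enum { STEP_ATTRS, STEP_LINKS, STEP_CLASS_LIST, STEP_COUNT };

struct dev_state {
	unsigned int done;  /* bitmask of steps currently in effect */
	int bad_undo;       /* undo attempts for steps that never ran */
};

/* Perform one step, or fail without side effects if step == fail_at. */
static int do_step(struct dev_state *s, int step, int fail_at)
{
	if (step == fail_at)
		return -1;
	s->done |= 1u << step;
	return 0;
}

/* Undo one step and record the bug if that step was never done. */
static void undo_step(struct dev_state *s, int step)
{
	if (!(s->done & (1u << step)))
		s->bad_undo++;
	s->done &= ~(1u << step);
}

/*
 * Each label undoes only what completed before the failing step, in
 * reverse order. The failed step itself is never undone, which is the
 * property the copied device_del() block violated.
 */
int add_device(struct dev_state *s, int fail_at)
{
	if (do_step(s, STEP_ATTRS, fail_at))
		goto attrs_error;
	if (do_step(s, STEP_LINKS, fail_at))
		goto links_error;
	if (do_step(s, STEP_CLASS_LIST, fail_at))
		goto list_error;
	return 0;

list_error:
	undo_step(s, STEP_LINKS);
links_error:
	undo_step(s, STEP_ATTRS);
attrs_error:
	return -1;
}
```

If the `list_error` label also tried to undo STEP_CLASS_LIST, `bad_undo` would become non-zero, which is the analogue of calling list_del_init() on a device that was never put on the class list.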
Re: 2.6.21-rc3-mm1 RSDL results
| See: | http://webcvs.freedesktop.org/mesa/Mesa/src/mesa/drivers/dri/r200/r200_ioctl.c?revision=1.37view=markup OK. Mesa is in git, now, but that still applies. The gitweb url is: http://gitweb.freedesktop.org/?p=mesa/mesa.git and for the version of the above file in the master branch: http://gitweb.freedesktop.org/?p=mesa/mesa.git;a=blob;f=src/mesa/drivers/dri/r200/r200_ioctl.c The recursive grep(1) on mesa shows: ,[grep -r sched_yield mesa] | mesa/mesa/src/mesa/drivers/dri/r300/radeon_ioctl.c: sched_yield(); | mesa/mesa/src/mesa/drivers/dri/i915tex/intel_batchpool.c: sched_yield(); | mesa/mesa/src/mesa/drivers/dri/i915tex/intel_batchbuffer.c: sched_yield(); | mesa/mesa/src/mesa/drivers/dri/common/vblank.h:#include sched.h /* for sched_yield() */ | mesa/mesa/src/mesa/drivers/dri/common/vblank.h:#include sched.h /* for sched_yield() */ | mesa/mesa/src/mesa/drivers/dri/common/vblank.h: sched_yield(); \ | mesa/mesa/src/mesa/drivers/dri/unichrome/via_ioctl.c: sched_yield(); | mesa/mesa/src/mesa/drivers/dri/i915/intel_ioctl.c: sched_yield(); | mesa/mesa/src/mesa/drivers/dri/r200/r200_ioctl.c: sched_yield(); ` Thanks for the heads up. I must've grep(1)ed the xorg subdir rather than the parent dir, and so missed mesa. -JimC -- James Cloos [EMAIL PROTECTED] OpenPGP: 1024D/ED7DAEA6 - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.21-rc3-mm1 RSDL results
On Sunday 11 March 2007 23:38, James Cloos wrote: | See: | http://webcvs.freedesktop.org/mesa/Mesa/src/mesa/drivers/dri/r200/r200_i |octl.c?revision=1.37view=markup OK. Mesa is in git, now, but that still applies. The gitweb url is: http://gitweb.freedesktop.org/?p=mesa/mesa.git and for the version of the above file in the master branch: http://gitweb.freedesktop.org/?p=mesa/mesa.git;a=blob;f=src/mesa/drivers/dr i/r200/r200_ioctl.c The recursive grep(1) on mesa shows: ,[grep -r sched_yield mesa] | mesa/mesa/src/mesa/drivers/dri/r300/radeon_ioctl.c: sched_yield(); | mesa/mesa/src/mesa/drivers/dri/i915tex/intel_batchpool.c: | sched_yield(); | mesa/mesa/src/mesa/drivers/dri/i915tex/intel_batchbuffer.c: | sched_yield(); mesa/mesa/src/mesa/drivers/dri/common/vblank.h:#include | sched.h /* for sched_yield() */ | mesa/mesa/src/mesa/drivers/dri/common/vblank.h:#include sched.h /* | for sched_yield() */ mesa/mesa/src/mesa/drivers/dri/common/vblank.h: | sched_yield(); \ | mesa/mesa/src/mesa/drivers/dri/unichrome/via_ioctl.c: sched_yield(); | mesa/mesa/src/mesa/drivers/dri/i915/intel_ioctl.c: sched_yield(); | mesa/mesa/src/mesa/drivers/dri/r200/r200_ioctl.c: sched_yield(); ` Thanks for the heads up. I must've grep(1)ed the xorg subdir rather than the parent dir, and so missed mesa. I just wonder what the heck all these will do to testing when using any of these drivers. Whether or not we do no yield, mild yield or full blown expiration yield, somehow or other I can't get over the feeling that if the code relies on yield() we can't really trust them to be meaningful cpu scheduler tests. This means most 3d apps out there that aren't using binary drivers, whether they be (fscking) glxgears, audio app visualisations or what... -- -ck - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Locking interrupt handler in L1 cache
Hi,

I have an MPC8548-based firewall running Linux 2.6.x that spends about 80% of its time on packet processing, so most of the time it is receiving and transmitting packets through the gianfar ethernet driver. I want to lock this driver's interrupt handler in the L1 cache.

1. Is there any kernel API for locking functions and data in the L1/L2 cache?
2. How can I use icbtls (Instruction Cache Block Touch and Lock Set) to lock my interrupt handler?
3. Is icbtls the correct instruction for this?
4. How do I find the end address of the interrupt handler function, and how do I pass it to the cache locking instructions? (The handler may be larger than a cache line, may not be aligned, etc.)
5. Can we enhance request_irq() to take an additional parameter to lock the interrupt handler in the cache?

I understand that if my interrupt handler is called most of the time, it is very likely to stay in the cache anyway, but there is no guarantee that it won't be evicted.

Regards,
Parav Pandit
Re: libata extension
Hi,

On Sunday 11 March 2007, Vitaliyi wrote:
> Good Day
>
> Say i want to implement extended set of ATA commands available to userspace for building diagnostic tools. I need 0x40 -- read verify and 0x32 -- write long with error handling,

Mark Lord is working on READ/WRITE_LONG support for libata; he recently posted a draft patch on the linux-ide mailing list.

[ Please consider reading/joining the linux-ide@vger.kernel.org ML, it is where Linux ATA discussion happens... ]

> for example. I was trying ide driver through ioctl's, but seems it lack of functionality and full of gotchas. Furthermore it oopses sometimes.

READ/WRITE_LONG is unsupported and, as you've already noticed, the TASKFILE ioctls are full of gotchas...

> Is it possible to use libata for such purpose or i need to write separate IDE driver ?

It should be possible using ATA pass-through; some libata changes may be required, but it is the right way to go IMO.

Bart
Re: [PATCH] lpfc: avoid double-free during PCI error failure
ACK... Looks good...

-- james s

Linas Vepstas wrote:
> Bino, James,
>
> Please review, sign-off and forward upstream.
> --linas
>
> If a PCI error is detected that cannot be recovered from, there will be a double call of lpfc_pci_remove_one(), with the second call resulting in a null-pointer dereference. The first call occurs in lpfc_io_error_detected(), and the second call during pci device remove.
>
> This patch eliminates the first call; it's unneeded.
>
> Signed-off-by: Linas Vepstas [EMAIL PROTECTED]
>
>  drivers/scsi/lpfc/lpfc_init.c |5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> Index: linux-2.6.20-git16/drivers/scsi/lpfc/lpfc_init.c
> ===
> --- linux-2.6.20-git16.orig/drivers/scsi/lpfc/lpfc_init.c 2007-03-08 15:57:40.0 -0600
> +++ linux-2.6.20-git16/drivers/scsi/lpfc/lpfc_init.c 2007-03-08 16:03:18.0 -0600
> @@ -1817,10 +1817,9 @@ static pci_ers_result_t lpfc_io_error_de
>          struct lpfc_sli *psli = &phba->sli;
>          struct lpfc_sli_ring *pring;
>
> -        if (state == pci_channel_io_perm_failure) {
> -                lpfc_pci_remove_one(pdev);
> +        if (state == pci_channel_io_perm_failure)
>                  return PCI_ERS_RESULT_DISCONNECT;
> -        }
> +
>          pci_disable_device(pdev);
>          /*
>           * There may be I/Os dropped by the firmware.
Style Question
Hi, list!

I have a question about coding style in the Linux kernel. Documentation/CodingStyle says: "Linux style for comments is the C89 "/* ... */" style. Don't use C99-style "// ..." comments." _But_ I see a lot of '//' style comments in current kernel code. Which is wrong: the documentation, the code, or neither? And why?

Another question is about NULL. AFAIK, in user space, using NULL is better than directly using 0 in C. The kernel, I know, uses its own NULL, which may be defined as ((void*)0), but that's _still_ different from a raw zero. So can I say using NULL is better than 0 in the kernel?

Any reply is welcome. Thanks and have a nice day!
Re: Style Question
On Sun, 2007-03-11 at 22:15 +0800, Cong WANG wrote:
[...]
> Another question is about NULL. AFAIK, in user space, using NULL is better than directly using 0 in C. In kernel, I know it used its own NULL, which may be defined as ((void*)0),

Userspace usually has the same definition.

> but it's _still_ different from raw zero.

It is different in that 0 as such has the type int. But in a pointer context this int constant is automatically converted to a null pointer.

> So can I say using NULL is better than 0 in kernel?

Yes, because it is immediately clear that a pointer is (or should be) there (and not an int). And the same holds for userspace since this is a pure C question.

Bernd
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services
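A tiny sketch of the readability point, with hypothetical function names: both functions compile to the same comparison, because the constant 0 is converted to a null pointer, but only the first tells the reader at a glance that a pointer is being tested. As far as I know, the sparse checker used on kernel code also flags the second form ("Using plain integer as NULL pointer").

```c
#include <stddef.h>

/* Returns 1 when no name was supplied; clearly a pointer test. */
int name_is_unset(const char *name)
{
	return name == NULL;
}

/*
 * Same result, since the constant 0 becomes a null pointer here,
 * but the reader has to look up the type of name to know that
 * this is not an integer comparison.
 */
int name_is_unset_zero(const char *name)
{
	return name == 0;
}
```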
Re: [RFC][PATCH 2/7] RSS controller core
On Sun, Mar 11, 2007 at 12:08:16PM +0300, Pavel Emelianov wrote: Herbert Poetzl wrote: On Tue, Mar 06, 2007 at 02:00:36PM -0800, Andrew Morton wrote: On Tue, 06 Mar 2007 17:55:29 +0300 Pavel Emelianov [EMAIL PROTECTED] wrote: +struct rss_container { + struct res_counter res; + struct list_head page_list; + struct container_subsys_state css; +}; + +struct page_container { + struct page *page; + struct rss_container *cnt; + struct list_head list; +}; ah. This looks good. I'll find a hunk of time to go through this work and through Paul's patches. It'd be good to get both patchsets lined up in -mm within a couple of weeks. But.. doesn't look so good for me, mainly becaus of the additional per page data and per page processing on 4GB memory, with 100 guests, 50% shared for each guest, this basically means ~1mio pages, 500k shared and 1500k x sizeof(page_container) entries, which roughly boils down to ~25MB of wasted memory ... increase the amount of shared pages and it starts getting worse, but maybe I'm missing something here You are. Each page has only one page_container associated with it despite the number of containers it is shared between. We need to decide whether we want to do per-container memory limitation via these data structures, or whether we do it via a physical scan of some software zone, possibly based on Mel's patches. why not do simple page accounting (as done currently in Linux) and use that for the limits, without keeping the reference from container to page? As I've already answered in my previous letter simple limiting w/o per-container reclamation and per-container oom killer isn't a good memory management. It doesn't allow to handle resource shortage gracefully. 
per container OOM killer does not require any container page reference: you know _what_ tasks belong to the container, and you know their _badness_ from the normal OOM calculations, so doing them for a container is really straightforward without having any page 'tagging'

for the reclamation part, please elaborate how that will differ in a (shared memory) guest from what the kernel currently does ...

TIA,
Herbert

> This patchset provides a more graceful way to handle this, but full memory management includes accounting of VMA-length as well (returning ENOMEM from system call) but we've decided to start with RSS.
>
> > best,
> > Herbert
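The two overhead estimates in this exchange can be checked with a few lines of arithmetic. The sketch below assumes 4 KiB pages and 16-byte entries (four 32-bit pointers); neither number is stated in the thread, and on a 64-bit build the same structure takes 32 bytes. The struct mirrors the field layout of the quoted `struct page_container` with opaque pointers, and the helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Same field layout as the quoted struct page_container. */
struct page_container {
	void *page;
	void *cnt;
	struct list_head list;
};

/* Number of pages needed to cover mem_bytes of memory. */
uint64_t page_count(uint64_t mem_bytes, uint64_t page_size)
{
	return mem_bytes / page_size;
}

/* Bytes consumed by the given number of tracking entries. */
uint64_t overhead_bytes(uint64_t entries, uint64_t entry_size)
{
	return entries * entry_size;
}
```

Under these assumptions, 1500k entries at 16 bytes come to 24,000,000 bytes, in line with the "~25MB" worry, while one entry per page on a 4 GiB machine is 1,048,576 entries, or 16 MiB, regardless of how many containers share each page.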
Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2
On Sunday 11 March 2007, Mike Galbraith wrote: Hi Con, On Sun, 2007-03-11 at 14:57 +1100, Con Kolivas wrote: What follows this email is a patch series for the latest version of the RSDL cpu scheduler (ie v0.29). I have addressed all bugs that I am able to reproduce in this version so if some people would be kind enough to test if there are any hidden bugs or oops lurking, it would be nice to know in anticipation of putting this back in -mm. Thanks. Full patch for 2.6.21-rc3-mm2: http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc3-mm2-rsdl-0 .29.patch I'm seeing a cpu distribution problem running this on my P4 box. Scenario: listening to music collection (mp3) via Amarok. Enable Amarok visualization gforce, and size such that X and gforce each use ~50% cpu. Start rip/encode of new CD with grip/lame encoder. Lame is set to use both cpus, at nice 5. Once the encoders start, they receive considerable more cpu than nice 0 X/Gforce, taking ~120% and leaving the remaining 80% for X/Gforce and Amarok (when it updates it's ~12k entry database) to squabble over. With 2.6.21-rc3, X/Gforce maintain their ~50% cpu (remain smooth), and the encoders (100%cpu bound) get whats left when Amarok isn't eating it. I plunked the above patch into plain 2.6.21-rc3 and retested to eliminate other mm tree differences, and it's repeatable. The nice 5 cpu hogs always receive considerably more that the nice 0 sleepers. -Mike Just to comment, I've been running one of the patches between 20-ck1 and this latest one, which is building as I type, but I also run gkrellm here, version 2.2.9. Since I have been running this middle of this series patch, something is killing gkrellm about once a day, and there is nothing in the logs to indicate a problem. I see a blink out of the corner of my eye, and its gone. And it always starts right back up from a kmenu click. No idea if anyone else is experiencing this or not. 
-- Cheers, Gene There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order. -Ed Howdershelt (Author) You scratch my tape, and I'll scratch yours. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 2/7] RSS controller core
Herbert Poetzl wrote: On Sun, Mar 11, 2007 at 12:08:16PM +0300, Pavel Emelianov wrote: Herbert Poetzl wrote: On Tue, Mar 06, 2007 at 02:00:36PM -0800, Andrew Morton wrote: On Tue, 06 Mar 2007 17:55:29 +0300 Pavel Emelianov [EMAIL PROTECTED] wrote: +struct rss_container { + struct res_counter res; + struct list_head page_list; + struct container_subsys_state css; +}; + +struct page_container { + struct page *page; + struct rss_container *cnt; + struct list_head list; +}; ah. This looks good. I'll find a hunk of time to go through this work and through Paul's patches. It'd be good to get both patchsets lined up in -mm within a couple of weeks. But.. doesn't look so good for me, mainly becaus of the additional per page data and per page processing on 4GB memory, with 100 guests, 50% shared for each guest, this basically means ~1mio pages, 500k shared and 1500k x sizeof(page_container) entries, which roughly boils down to ~25MB of wasted memory ... increase the amount of shared pages and it starts getting worse, but maybe I'm missing something here You are. Each page has only one page_container associated with it despite the number of containers it is shared between. We need to decide whether we want to do per-container memory limitation via these data structures, or whether we do it via a physical scan of some software zone, possibly based on Mel's patches. why not do simple page accounting (as done currently in Linux) and use that for the limits, without keeping the reference from container to page? As I've already answered in my previous letter simple limiting w/o per-container reclamation and per-container oom killer isn't a good memory management. It doesn't allow to handle resource shortage gracefully. 
> per container OOM killer does not require any container page reference, you know _what_ tasks belong to the container, and you know their _badness_ from the normal OOM calculations, so doing them for a container is really straight forward without having any page 'tagging'

That's true. If you look at the patches you'll find out that no code in the oom killer uses the page 'tag'.

> for the reclamation part, please elaborate how that will differ in a (shared memory) guest from what the kernel currently does ...

This is all described in the code and in the discussions we had before.

> TIA,
> Herbert
>
> > This patchset provides more grace way to handle this, but full memory management includes accounting of VMA-length as well (returning ENOMEM from system call) but we've decided to start with RSS.
Re: [patch 6/9] signalfd/timerfd - timerfd core ...
Davide, On Sat, 2007-03-10 at 18:22 -0800, Davide Libenzi wrote: Some remarks: + +asmlinkage long sys_timerfd(int ufd, int clockid, int tmrtype, + const struct timespec __user *utmr) +{ + int error; + struct timerfd_ctx *ctx; + struct file *file; + struct inode *inode; + ktime_t tval, tnow; + struct timespec ktmr, tmrnow; + + error = -EFAULT; + if (copy_from_user(ktmr, utmr, sizeof(ktmr))) + goto err_exit; Please do not use goto for a simple return -EFAULT; Please validate the timespec before converting it. if (!timespec_valid(ktmr)) return -EINVAL; + tval = timespec_to_ktime(ktmr); + error = -EINVAL; + if (clockid != CLOCK_MONOTONIC + clockid != CLOCK_REALTIME) + goto err_exit; + switch (tmrtype) { + case TFD_TIMER_REL: + case TFD_TIMER_SEQ: + break; + case TFD_TIMER_ABS: + getnstimeofday(tmrnow); + tnow = timespec_to_ktime(tmrnow); tnow = ktime_get(); + if (ktime_to_ns(tval) = ktime_to_ns(tnow)) + goto err_exit; + tval = ktime_sub(tval, tnow); Why do you want to do that ? hrtimers handle relative and absolute expiry times. You break down everything to relative time and lose the accuracy for absolute timers. + break; + default: + goto err_exit; + } + + if (ufd == -1) { + error = -ENOMEM; + ctx = kmem_cache_alloc(timerfd_ctx_cachep, GFP_KERNEL); + if (!ctx) + goto err_exit; + + init_waitqueue_head(ctx-wqh); + spin_lock_init(ctx-lock); + ctx-ticks = 0; + ctx-tmrtype = tmrtype; + ctx-clockid = clockid; + ctx-tval = tval; + hrtimer_init(ctx-tmr, ctx-clockid, HRTIMER_REL); + ctx-tmr.expires = ctx-tval; + ctx-tmr.function = timerfd_tmrproc; + + hrtimer_start(ctx-tmr, ctx-tval, HRTIMER_REL); + + /* + * When we call this, the initialization must be complete, since + * aino_getfd() will install the fd. + */ + error = aino_getfd(ufd, inode, file, [timerfd], +timerfd_fops, ctx); + if (error) + goto err_fdalloc; Why is the timer started before we have everything in place ? Also if you turn it around then the (re)programming part of the timer can be shared. 
+ } else { + error = -EBADF; + file = fget(ufd); + if (!file) + goto err_exit; + ctx = file-private_data; + error = -EINVAL; + if (file-f_op != timerfd_fops) { + fput(file); + goto err_exit; + } + + /* + * We need to stop the exiting timer before. We call + * hrtimer_cancel() w/out holding our lock. + */ + spin_lock_irq(ctx-lock); + while (hrtimer_active(ctx-tmr)) { + spin_unlock_irq(ctx-lock); + hrtimer_cancel(ctx-tmr); + spin_lock_irq(ctx-lock); + } Please use hrtimer_try_to_cancel() retry: spin_lock_irq(): if (hrtimer_try_to_cancel(ctx-tmr) 0) { spin_unlock_irq(); cpu_relax(); goto retry; } + +static unsigned int timerfd_poll(struct file *file, poll_table *wait) +{ + struct timerfd_ctx *ctx = file-private_data; + + poll_wait(file, ctx-wqh, wait); + + return ctx-ticks ? POLLIN: 0; This is racy: timer is set up (non periodic) timer expires poll now poll is stuck for ever ! tglx - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
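For reference, the timespec check requested earlier in this review can be sketched in plain userspace C. This is only an illustration of the intent (reject negative seconds and out-of-range nanoseconds before converting to ktime_t); demo_timespec_valid() and DEMO_NSEC_PER_SEC are made-up names for this note and are not claimed to be the kernel helper itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define DEMO_NSEC_PER_SEC 1000000000L	/* hypothetical name, illustration only */

/*
 * Userspace sketch: a timespec is accepted only if the seconds are
 * non-negative and the nanoseconds lie in [0, 1e9).
 */
bool demo_timespec_valid(const struct timespec *ts)
{
	assert(ts != NULL);
	if (ts->tv_sec < 0)
		return false;
	if (ts->tv_nsec < 0 || ts->tv_nsec >= DEMO_NSEC_PER_SEC)
		return false;
	return true;
}
```

With a check of this kind placed right after copy_from_user(), a bad value can be rejected with a direct return -EINVAL before any conversion happens.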
Re: Style Question
Cong WANG wrote:

Hi, list! I have a question about coding style in the Linux kernel. In Documentation/CodingStyle, it is said that Linux style for comments is the C89 /* ... */ style. Don't use C99-style // ... comments. _But_ I see a lot of '//' style comments in current kernel code. Which is wrong? The documentation or the code, or neither? And why?

The code. As with a lot of coding style issues, it's likely just that nobody saw it and bothered to complain when it went in.

Another question is about NULL. AFAIK, in user space, using NULL is better than directly using 0 in C. In the kernel, I know it uses its own NULL, which may be defined as ((void*)0), but it's _still_ different from raw zero. So can I say using NULL is better than 0 in the kernel?

It's the preferred style; Sparse, for example, will complain about using 0 for a null pointer.

-- Robert Hancock Saskatoon, SK, Canada To email, remove nospam from [EMAIL PROTECTED] Home Page: http://www.roberthancock.com/
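A small illustration of the NULL point, as a userspace sketch rather than kernel code. The helper demo_find_first() is hypothetical and exists only to show the preferred spelling of a null pointer return.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns a pointer to the first element equal to
 * key, or NULL when no element matches.
 */
const int *demo_find_first(const int *arr, size_t n, int key)
{
	size_t i;

	assert(arr != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (arr[i] == key)
			return &arr[i];
	return NULL;	/* preferred over "return 0;" when the type is a pointer */
}
```

Callers then compare against NULL as well, which keeps pointer tests visibly distinct from integer tests.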
[PATCH] KVM: MMU: Fix host memory corruption on i386 with >= 4GB ram
PAGE_MASK is an unsigned long, so using it to mask physical addresses on i386 (which are 64-bit wide) leads to truncation. This can result in page->private of unrelated memory pages being modified, with disastrous results.

Fix by not using PAGE_MASK for physical addresses; instead calculate the correct value directly from PAGE_SIZE. Also fix a similar BUG_ON().

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/mmu.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/kvm/mmu.c b/drivers/kvm/mmu.c
index 2cb4893..e85b4c7 100644
--- a/drivers/kvm/mmu.c
+++ b/drivers/kvm/mmu.c
@@ -131,7 +131,7 @@ static int dbg = 1;
 	(((address) >> PT32_LEVEL_SHIFT(level)) & ((1 << PT32_LEVEL_BITS) - 1))
 
 
-#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & PAGE_MASK)
+#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
 #define PT64_DIR_BASE_ADDR_MASK \
 	(PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + PT64_LEVEL_BITS)) - 1))
 
@@ -406,8 +406,8 @@ static void rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
 			spte = desc->shadow_ptes[0];
 		}
 		BUG_ON(!spte);
-		BUG_ON((*spte & PT64_BASE_ADDR_MASK) !=
-		       page_to_pfn(page) << PAGE_SHIFT);
+		BUG_ON((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT
+		       != page_to_pfn(page));
 		BUG_ON(!(*spte & PT_PRESENT_MASK));
 		BUG_ON(!(*spte & PT_WRITABLE_MASK));
 		rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
-- 
1.5.0.2
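The truncation described in this patch can be reproduced in userspace. The sketch below is only an illustration under stated assumptions: uint32_t stands in for the 32-bit unsigned long of i386, a 4KB page is assumed, and all DEMO_ and demo_ names are invented for this note.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12			/* assumed 4KB pages */
#define DEMO_PAGE_SIZE  (1UL << DEMO_PAGE_SHIFT)

/* Stand-in for PAGE_MASK on i386, where unsigned long is 32 bits wide. */
#define DEMO_PAGE_MASK_32 ((uint32_t)~(DEMO_PAGE_SIZE - 1))

/*
 * Old-style mask: the 32-bit mask is zero-extended to 64 bits, so
 * address bits 32..51 are silently cleared.
 */
uint64_t demo_base_addr_old(uint64_t pte)
{
	return pte & (((1ULL << 52) - 1) & DEMO_PAGE_MASK_32);
}

/* Fixed mask: widen to 64 bits before inverting, as the patch does. */
uint64_t demo_base_addr_new(uint64_t pte)
{
	return pte & (((1ULL << 52) - 1) & ~(uint64_t)(DEMO_PAGE_SIZE - 1));
}
```

For a pte that points just above 4GB, such as 0x123456007, the old form yields 0x23456000, which is the address of an unrelated page below 4GB, while the fixed form yields 0x123456000.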
[PATCH] KVM: MMU: Fix guest writes to nonpae pde
KVM shadow page tables are always in pae mode, regardless of the guest setting. This means that a guest pde (mapping 4MB of memory) is mapped to two shadow pdes (mapping 2MB each). When the guest writes to a pte or pde, we intercept the write and emulate it. We also remove any shadowed mappings corresponding to the write. Since the mmu did not account for the doubling in the number of pdes, it removed the wrong entry, resulting in a mismatch between shadow page tables and guest page tables, followed shortly by guest memory corruption. This patch fixes the problem by detecting the special case of writing to a non-pae pde and adjusting the address and number of shadow pdes zapped accordingly. Signed-off-by: Avi Kivity [EMAIL PROTECTED] --- drivers/kvm/mmu.c | 46 ++ 1 files changed, 34 insertions(+), 12 deletions(-) diff --git a/drivers/kvm/mmu.c b/drivers/kvm/mmu.c index a1a9336..2cb4893 100644 --- a/drivers/kvm/mmu.c +++ b/drivers/kvm/mmu.c @@ -1093,22 +1093,40 @@ out: return r; } +static void mmu_pre_write_zap_pte(struct kvm_vcpu *vcpu, + struct kvm_mmu_page *page, + u64 *spte) +{ + u64 pte; + struct kvm_mmu_page *child; + + pte = *spte; + if (is_present_pte(pte)) { + if (page-role.level == PT_PAGE_TABLE_LEVEL) + rmap_remove(vcpu, spte); + else { + child = page_header(pte PT64_BASE_ADDR_MASK); + mmu_page_remove_parent_pte(vcpu, child, spte); + } + } + *spte = 0; +} + void kvm_mmu_pre_write(struct kvm_vcpu *vcpu, gpa_t gpa, int bytes) { gfn_t gfn = gpa PAGE_SHIFT; struct kvm_mmu_page *page; - struct kvm_mmu_page *child; struct hlist_node *node, *n; struct hlist_head *bucket; unsigned index; u64 *spte; - u64 pte; unsigned offset = offset_in_page(gpa); unsigned pte_size; unsigned page_offset; unsigned misaligned; int level; int flooded = 0; + int npte; pgprintk(%s: gpa %llx bytes %d\n, __FUNCTION__, gpa, bytes); if (gfn == vcpu-last_pt_write_gfn) { @@ -1144,22 +1162,26 @@ void kvm_mmu_pre_write(struct kvm_vcpu *vcpu, gpa_t gpa, int bytes) } page_offset = offset; level 
= page-role.level; + npte = 1; if (page-role.glevels == PT32_ROOT_LEVEL) { - page_offset = 1; /* 32-64 */ + page_offset = 1; /* 32-64 */ + /* +* A 32-bit pde maps 4MB while the shadow pdes map +* only 2MB. So we need to double the offset again +* and zap two pdes instead of one. +*/ + if (level == PT32_ROOT_LEVEL) { + page_offset = 1; + npte = 2; + } page_offset = ~PAGE_MASK; } spte = __va(page-page_hpa); spte += page_offset / sizeof(*spte); - pte = *spte; - if (is_present_pte(pte)) { - if (level == PT_PAGE_TABLE_LEVEL) - rmap_remove(vcpu, spte); - else { - child = page_header(pte PT64_BASE_ADDR_MASK); - mmu_page_remove_parent_pte(vcpu, child, spte); - } + while (npte--) { + mmu_pre_write_zap_pte(vcpu, page, spte); + ++spte; } - *spte = 0; } } -- 1.5.0.2 - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
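The offset arithmetic in this patch can be followed with a userspace sketch. It assumes a non-PAE guest (4-byte guest entries shadowed by 8-byte entries), 4KB pages, and that the page-directory level has the value 2; the struct and function names are invented for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE       4096u
#define DEMO_PT32_ROOT_LEVEL 2	/* assumed: level of a non-PAE page directory */

struct demo_zap_range {
	unsigned first;	/* index of the first 8-byte shadow entry to zap */
	unsigned count;	/* number of consecutive shadow entries to zap */
};

/*
 * guest_offset is the byte offset of the written entry inside the guest
 * page table page; level is the level of that guest page.
 */
struct demo_zap_range demo_shadow_zap_range(unsigned guest_offset, int level)
{
	struct demo_zap_range r;
	unsigned page_offset = guest_offset;

	assert(guest_offset < DEMO_PAGE_SIZE);
	r.count = 1;
	page_offset <<= 1;		/* 4-byte guest entry -> 8-byte shadow entry */
	if (level == DEMO_PT32_ROOT_LEVEL) {
		page_offset <<= 1;	/* one 4MB pde -> two 2MB shadow pdes */
		r.count = 2;
	}
	page_offset &= DEMO_PAGE_SIZE - 1;	/* wrap into a single shadow page */
	r.first = page_offset / sizeof(uint64_t);
	return r;
}
```

For example, a write to guest pde 3 (byte offset 12) maps to shadow pdes 6 and 7, whereas a write to guest pte 3 maps to the single shadow pte 3.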
[PATCH 0/2] KVM: More fixes for 2.6.21-rc3
This patchset contains fixes I plan to submit pre 2.6.21: a fix for large memory 32-bit hosts, and a fix for non-pae 32-bit guests.
Re: [kvm-devel] [PATCH] KVM: MMU: Fix guest writes to nonpae pde
* Avi Kivity [EMAIL PROTECTED] wrote:

KVM shadow page tables are always in pae mode, regardless of the guest setting. This means that a guest pde (mapping 4MB of memory) is mapped to two shadow pdes (mapping 2MB each). When the guest writes to a pte or pde, we intercept the write and emulate it. We also remove any shadowed mappings corresponding to the write. Since the mmu did not account for the doubling in the number of pdes, it removed the wrong entry, resulting in a mismatch between shadow page tables and guest page tables, followed shortly by guest memory corruption. This patch fixes the problem by detecting the special case of writing to a non-pae pde and adjusting the address and number of shadow pdes zapped accordingly.

Signed-off-by: Avi Kivity [EMAIL PROTECTED]

tested this with both PAE and non-PAE Linux host and guest - works fine.

Acked-by: Ingo Molnar [EMAIL PROTECTED]

Ingo
Re: [kvm-devel] [PATCH] KVM: MMU: Fix host memory corruption on i386 with >= 4GB ram
* Avi Kivity [EMAIL PROTECTED] wrote:

PAGE_MASK is an unsigned long, so using it to mask physical addresses on i386 (which are 64-bit wide) leads to truncation. This can result in page->private of unrelated memory pages being modified, with disastrous results. Fix by not using PAGE_MASK for physical addresses; instead calculate the correct value directly from PAGE_SIZE. Also fix a similar BUG_ON().

Signed-off-by: Avi Kivity [EMAIL PROTECTED]

i have tested this, albeit with less than 4GB RAM.

Acked-by: Ingo Molnar [EMAIL PROTECTED]

Ingo
[patch] KVM: always reload segment selectors
Subject: [patch] KVM: always reload segment selectors From: Ingo Molnar [EMAIL PROTECTED] failed VM entry on VMX might still change %fs or %gs, thus make sure that KVM always reloads the segment selectors. This is crutial on both x86 and x86_64: x86 has __KERNEL_PDA in %fs on which things like 'current' depends and x86_64 has 0 there and needs MSR_GS_BASE to work. Signed-off-by: Ingo Molnar [EMAIL PROTECTED] --- drivers/kvm/vmx.c | 37 + 1 file changed, 21 insertions(+), 16 deletions(-) Index: linux/drivers/kvm/vmx.c === --- linux.orig/drivers/kvm/vmx.c +++ linux/drivers/kvm/vmx.c @@ -1896,6 +1896,27 @@ again: [cr2]i(offsetof(struct kvm_vcpu, cr2)) : cc, memory ); + /* +* Reload segment selectors ASAP. (it's needed for a functional +* kernel: x86 relies on having __KERNEL_PDA in %fs and x86_64 +* relies on having 0 in %gs for the CPU PDA to work.) +*/ + if (fs_gs_ldt_reload_needed) { + load_ldt(ldt_sel); + load_fs(fs_sel); + /* +* If we have to reload gs, we must take care to +* preserve our gs base. +*/ + local_irq_disable(); + load_gs(gs_sel); +#ifdef CONFIG_X86_64 + wrmsrl(MSR_GS_BASE, vmcs_readl(HOST_GS_BASE)); +#endif + local_irq_enable(); + + reload_tss(); + } ++kvm_stat.exits; save_msrs(vcpu-guest_msrs, NR_BAD_MSRS); @@ -1913,22 +1934,6 @@ again: kvm_run-exit_reason = vmcs_read32(VM_INSTRUCTION_ERROR); r = 0; } else { - if (fs_gs_ldt_reload_needed) { - load_ldt(ldt_sel); - load_fs(fs_sel); - /* -* If we have to reload gs, we must take care to -* preserve our gs base. -*/ - local_irq_disable(); - load_gs(gs_sel); -#ifdef CONFIG_X86_64 - wrmsrl(MSR_GS_BASE, vmcs_readl(HOST_GS_BASE)); -#endif - local_irq_enable(); - - reload_tss(); - } /* * Profile KVM exit RIPs: */ - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] KVM: always reload segment selectors
Ingo Molnar wrote: Subject: [patch] KVM: always reload segment selectors From: Ingo Molnar [EMAIL PROTECTED] failed VM entry on VMX might still change %fs or %gs, thus make sure that KVM always reloads the segment selectors. This is crutial on both x86 and x86_64: x86 has __KERNEL_PDA in %fs on which things like 'current' depends and x86_64 has 0 there and needs MSR_GS_BASE to work. Signed-off-by: Ingo Molnar [EMAIL PROTECTED] --- drivers/kvm/vmx.c | 37 + 1 file changed, 21 insertions(+), 16 deletions(-) Index: linux/drivers/kvm/vmx.c === --- linux.orig/drivers/kvm/vmx.c +++ linux/drivers/kvm/vmx.c @@ -1896,6 +1896,27 @@ again: [cr2]i(offsetof(struct kvm_vcpu, cr2)) : cc, memory ); + /* +* Reload segment selectors ASAP. (it's needed for a functional +* kernel: x86 relies on having __KERNEL_PDA in %fs and x86_64 +* relies on having 0 in %gs for the CPU PDA to work.) +*/ + if (fs_gs_ldt_reload_needed) { + load_ldt(ldt_sel); + load_fs(fs_sel); + /* +* If we have to reload gs, we must take care to +* preserve our gs base. +*/ + local_irq_disable(); + load_gs(gs_sel); +#ifdef CONFIG_X86_64 + wrmsrl(MSR_GS_BASE, vmcs_readl(HOST_GS_BASE)); +#endif + local_irq_enable(); + + reload_tss(); + } ++kvm_stat.exits; save_msrs(vcpu-guest_msrs, NR_BAD_MSRS); btw, looking at the code, we could just remove fs from the fs_gs_reload_needed and make in unconditional. VT knows how to reload segments, except if they're user segments (groan). In the case of fs, if it's used for the pda, it's obviously a kernel segment. gs is different: since only the segment base is loaded (via swapgs), the selector part could well be a userspace selector, and thus the irq-protected reload is needed. Anyway, I'm applying the patch as the above discourse is irrelevant to the fix. 
-- error compiling committee.c: too many arguments to function
[PATCH 0/15] KVM userspace interface updates
This patchset updates the kvm userspace interface to what I hope will be the long-term stable interface. Provisions are included for extending the interface later. The patches address performance and cleanliness concerns.

One patch is missing -- I'd like the string pio transfers not to include guest virtual addresses. To date all my attempts to write the patch ended with me losing consciousness. Hopefully I'll manage it soon.

I'd like to submit the patchset post 2.6.21. Comments are welcome.
[PATCH 03/15] KVM: Initialize PIO I/O count
This allows userspace to ignore the io.rep field. Not a big deal, but friendly.

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/svm.c |    1 +
 drivers/kvm/vmx.c |    1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c
index b176f5a..c35b8c8 100644
--- a/drivers/kvm/svm.c
+++ b/drivers/kvm/svm.c
@@ -1037,6 +1037,7 @@ static int io_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	kvm_run->io.size = ((io_info & SVM_IOIO_SIZE_MASK) >> SVM_IOIO_SIZE_SHIFT);
 	kvm_run->io.string = (io_info & SVM_IOIO_STR_MASK) != 0;
 	kvm_run->io.rep = (io_info & SVM_IOIO_REP_MASK) != 0;
+	kvm_run->io.count = 1;
 	if (kvm_run->io.string) {
 		unsigned addr_mask;
 
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index 7fd572a..d4c9f33 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -1459,6 +1459,7 @@ static int handle_io(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 		= (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_DF) != 0;
 	kvm_run->io.rep = (exit_qualification & 32) != 0;
 	kvm_run->io.port = exit_qualification >> 16;
+	kvm_run->io.count = 1;
 	if (kvm_run->io.string) {
 		if (!get_io_count(vcpu, &kvm_run->io.count))
 			return 1;
-- 
1.5.0.2
[PATCH 04/15] KVM: Handle cpuid in the kernel instead of punting to userspace
KVM used to handle cpuid by letting userspace decide what values to return to the guest. We now handle cpuid completely in the kernel. We still let userspace decide which values the guest will see by having userspace set up the value table beforehand (this is necessary to allow management software to set the cpu features to the least common denominator, so that live migration can work). The motivation for the change is that kvm kernel code can be impacted by cpuid features, for example the x86 emulator. Signed-off-by: Avi Kivity [EMAIL PROTECTED] --- drivers/kvm/kvm.h |5 +++ drivers/kvm/kvm_main.c | 69 drivers/kvm/svm.c |4 +- drivers/kvm/vmx.c |4 +- include/linux/kvm.h| 18 - 5 files changed, 95 insertions(+), 5 deletions(-) diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h index 59cbc5b..be3a0e7 100644 --- a/drivers/kvm/kvm.h +++ b/drivers/kvm/kvm.h @@ -55,6 +55,7 @@ #define KVM_NUM_MMU_PAGES 256 #define KVM_MIN_FREE_MMU_PAGES 5 #define KVM_REFILL_PAGES 25 +#define KVM_MAX_CPUID_ENTRIES 40 #define FX_IMAGE_SIZE 512 #define FX_IMAGE_ALIGN 16 @@ -286,6 +287,9 @@ struct kvm_vcpu { u32 ar; } tr, es, ds, fs, gs; } rmode; + + int cpuid_nent; + struct kvm_cpuid_entry cpuid_entries[KVM_MAX_CPUID_ENTRIES]; }; struct kvm_memory_slot { @@ -446,6 +450,7 @@ void realmode_set_cr(struct kvm_vcpu *vcpu, int cr, unsigned long value, struct x86_emulate_ctxt; +void kvm_emulate_cpuid(struct kvm_vcpu *vcpu); int emulate_invlpg(struct kvm_vcpu *vcpu, gva_t address); int emulate_clts(struct kvm_vcpu *vcpu); int emulator_get_dr(struct x86_emulate_ctxt* ctxt, int dr, diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c index 8a4984d..347467e 100644 --- a/drivers/kvm/kvm_main.c +++ b/drivers/kvm/kvm_main.c @@ -1504,6 +1504,43 @@ void save_msrs(struct vmx_msr_entry *e, int n) } EXPORT_SYMBOL_GPL(save_msrs); +void kvm_emulate_cpuid(struct kvm_vcpu *vcpu) +{ + int i; + u32 function; + struct kvm_cpuid_entry *e, *best; + + kvm_arch_ops-cache_regs(vcpu); + function = 
vcpu-regs[VCPU_REGS_RAX]; + vcpu-regs[VCPU_REGS_RAX] = 0; + vcpu-regs[VCPU_REGS_RBX] = 0; + vcpu-regs[VCPU_REGS_RCX] = 0; + vcpu-regs[VCPU_REGS_RDX] = 0; + best = NULL; + for (i = 0; i vcpu-cpuid_nent; ++i) { + e = vcpu-cpuid_entries[i]; + if (e-function == function) { + best = e; + break; + } + /* +* Both basic or both extended? +*/ + if (((e-function ^ function) 0x8000) == 0) + if (!best || e-function best-function) + best = e; + } + if (best) { + vcpu-regs[VCPU_REGS_RAX] = best-eax; + vcpu-regs[VCPU_REGS_RBX] = best-ebx; + vcpu-regs[VCPU_REGS_RCX] = best-ecx; + vcpu-regs[VCPU_REGS_RDX] = best-edx; + } + kvm_arch_ops-decache_regs(vcpu); + kvm_arch_ops-skip_emulated_instruction(vcpu); +} +EXPORT_SYMBOL_GPL(kvm_emulate_cpuid); + static void complete_pio(struct kvm_vcpu *vcpu) { struct kvm_io *io = vcpu-run-io; @@ -2075,6 +2112,26 @@ out: return r; } +static int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu, + struct kvm_cpuid *cpuid, + struct kvm_cpuid_entry __user *entries) +{ + int r; + + r = -E2BIG; + if (cpuid-nent KVM_MAX_CPUID_ENTRIES) + goto out; + r = -EFAULT; + if (copy_from_user(vcpu-cpuid_entries, entries, + cpuid-nent * sizeof(struct kvm_cpuid_entry))) + goto out; + vcpu-cpuid_nent = cpuid-nent; + return 0; + +out: + return r; +} + static long kvm_vcpu_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) { @@ -2181,6 +2238,18 @@ static long kvm_vcpu_ioctl(struct file *filp, case KVM_SET_MSRS: r = msr_io(vcpu, argp, do_set_msr, 0); break; + case KVM_SET_CPUID: { + struct kvm_cpuid __user *cpuid_arg = argp; + struct kvm_cpuid cpuid; + + r = -EFAULT; + if (copy_from_user(cpuid, cpuid_arg, sizeof cpuid)) + goto out; + r = kvm_vcpu_ioctl_set_cpuid(vcpu, cpuid, cpuid_arg-entries); + if (r) + goto out; + break; + } default: ; } diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c index c35b8c8..d4b2936 100644 --- a/drivers/kvm/svm.c +++ b/drivers/kvm/svm.c @@ -1101,8 +1101,8 @@ static int task_switch_interception(struct kvm_vcpu *vcpu, struct kvm_run 
*kvm_r static int
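The table lookup performed by kvm_emulate_cpuid() in the patch above can be sketched outside the kernel as follows. This is a reading of the posted code, not the code itself: the struct and function names are invented, and the range test uses the mask 0x80000000, which is an assumption about what the archived text (which shows a shorter constant) originally contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct demo_cpuid_entry {
	uint32_t function;
	uint32_t eax, ebx, ecx, edx;
};

/*
 * Return the entry for 'function'.  If there is no exact match, fall
 * back to the highest-numbered entry in the same range (basic or
 * extended).  Returns NULL when nothing in that range exists.
 */
const struct demo_cpuid_entry *
demo_cpuid_lookup(const struct demo_cpuid_entry *tbl, size_t n,
		  uint32_t function)
{
	const struct demo_cpuid_entry *best = NULL;
	size_t i;

	assert(tbl != NULL || n == 0);
	for (i = 0; i < n; i++) {
		const struct demo_cpuid_entry *e = &tbl[i];

		if (e->function == function)
			return e;	/* exact match wins */
		/* Both basic or both extended?  (assumed mask) */
		if (((e->function ^ function) & 0x80000000u) == 0)
			if (!best || e->function > best->function)
				best = e;
	}
	return best;
}
```

When the lookup returns NULL, the caller would leave all four result registers at zero, which matches the zeroing done at the top of the posted function.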
[PATCH 01/15] KVM: Use a shared page for kernel/user communication when runing a vcpu
Instead of passing a 'struct kvm_run' back and forth between the kernel and userspace, allocate a page and allow the user to mmap() it. This reduces needless copying and makes the interface expandable by providing lots of free space. Signed-off-by: Avi Kivity [EMAIL PROTECTED] --- drivers/kvm/kvm.h |1 + drivers/kvm/kvm_main.c | 54 +++ include/linux/kvm.h|6 ++-- 3 files changed, 44 insertions(+), 17 deletions(-) mode change 100755 = 100644 drivers/kvm/kvm_main.c diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h index 0d122bf..901b8d9 100644 --- a/drivers/kvm/kvm.h +++ b/drivers/kvm/kvm.h @@ -228,6 +228,7 @@ struct kvm_vcpu { struct mutex mutex; int cpu; int launched; + struct kvm_run *run; int interrupt_window_open; unsigned long irq_summary; /* bit vector: 1 per word in irq_pending */ #define NR_IRQ_WORDS KVM_IRQ_BITMAP_SIZE(unsigned long) diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c old mode 100755 new mode 100644 index 946ed86..42be8a8 --- a/drivers/kvm/kvm_main.c +++ b/drivers/kvm/kvm_main.c @@ -355,6 +355,8 @@ static void kvm_free_vcpu(struct kvm_vcpu *vcpu) kvm_mmu_destroy(vcpu); vcpu_put(vcpu); kvm_arch_ops-vcpu_free(vcpu); + free_page((unsigned long)vcpu-run); + vcpu-run = NULL; } static void kvm_free_vcpus(struct kvm *kvm) @@ -1887,6 +1889,33 @@ static int kvm_vcpu_ioctl_debug_guest(struct kvm_vcpu *vcpu, return r; } +static struct page *kvm_vcpu_nopage(struct vm_area_struct *vma, + unsigned long address, + int *type) +{ + struct kvm_vcpu *vcpu = vma-vm_file-private_data; + unsigned long pgoff; + struct page *page; + + *type = VM_FAULT_MINOR; + pgoff = ((address - vma-vm_start) PAGE_SHIFT) + vma-vm_pgoff; + if (pgoff != 0) + return NOPAGE_SIGBUS; + page = virt_to_page(vcpu-run); + get_page(page); + return page; +} + +static struct vm_operations_struct kvm_vcpu_vm_ops = { + .nopage = kvm_vcpu_nopage, +}; + +static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma) +{ + vma-vm_ops = kvm_vcpu_vm_ops; + return 0; +} + static int 
kvm_vcpu_release(struct inode *inode, struct file *filp) { struct kvm_vcpu *vcpu = filp-private_data; @@ -1899,6 +1928,7 @@ static struct file_operations kvm_vcpu_fops = { .release= kvm_vcpu_release, .unlocked_ioctl = kvm_vcpu_ioctl, .compat_ioctl = kvm_vcpu_ioctl, + .mmap = kvm_vcpu_mmap, }; /* @@ -1947,6 +1977,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, int n) { int r; struct kvm_vcpu *vcpu; + struct page *page; r = -EINVAL; if (!valid_vcpu(n)) @@ -1961,6 +1992,12 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, int n) return -EEXIST; } + page = alloc_page(GFP_KERNEL | __GFP_ZERO); + r = -ENOMEM; + if (!page) + goto out_unlock; + vcpu-run = page_address(page); + vcpu-host_fx_image = (char*)ALIGN((hva_t)vcpu-fx_buf, FX_IMAGE_ALIGN); vcpu-guest_fx_image = vcpu-host_fx_image + FX_IMAGE_SIZE; @@ -1990,6 +2027,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, int n) out_free_vcpus: kvm_free_vcpu(vcpu); +out_unlock: mutex_unlock(vcpu-mutex); out: return r; @@ -2003,21 +2041,9 @@ static long kvm_vcpu_ioctl(struct file *filp, int r = -EINVAL; switch (ioctl) { - case KVM_RUN: { - struct kvm_run kvm_run; - - r = -EFAULT; - if (copy_from_user(kvm_run, argp, sizeof kvm_run)) - goto out; - r = kvm_vcpu_ioctl_run(vcpu, kvm_run); - if (r 0 r != -EINTR) - goto out; - if (copy_to_user(argp, kvm_run, sizeof kvm_run)) { - r = -EFAULT; - goto out; - } + case KVM_RUN: + r = kvm_vcpu_ioctl_run(vcpu, vcpu-run); break; - } case KVM_GET_REGS: { struct kvm_regs kvm_regs; diff --git a/include/linux/kvm.h b/include/linux/kvm.h index 275354f..d88e750 100644 --- a/include/linux/kvm.h +++ b/include/linux/kvm.h @@ -11,7 +11,7 @@ #include asm/types.h #include linux/ioctl.h -#define KVM_API_VERSION 4 +#define KVM_API_VERSION 5 /* * Architectural interrupt line count, and the size of the bitmap needed @@ -49,7 +49,7 @@ enum kvm_exit_reason { KVM_EXIT_SHUTDOWN = 8, }; -/* for KVM_RUN */ +/* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */ struct kvm_run { /* in */ 
__u32 emulated; /* skip current instruction */ @@ -233,7 +233,7 @@ struct kvm_dirty_log
[PATCH 12/15] KVM: Initialize the apic_base msr on svm too
Older userspace didn't care, but newer userspace (with the cpuid changes) does. Signed-off-by: Avi Kivity [EMAIL PROTECTED] --- drivers/kvm/svm.c |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c index 0311665..2396ada 100644 --- a/drivers/kvm/svm.c +++ b/drivers/kvm/svm.c @@ -582,6 +582,9 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu) init_vmcb(vcpu-svm-vmcb); fx_init(vcpu); + vcpu-apic_base = 0xfee0 | + /*for vcpu 0*/ MSR_IA32_APICBASE_BSP | + MSR_IA32_APICBASE_ENABLE; return 0; -- 1.5.0.2 - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 07/15] KVM: Renumber ioctls
The recent changes have left the ioctl numbers in complete disarray. Signed-off-by: Avi Kivity [EMAIL PROTECTED] --- include/linux/kvm.h | 34 +- 1 files changed, 17 insertions(+), 17 deletions(-) diff --git a/include/linux/kvm.h b/include/linux/kvm.h index d89189a..93472da 100644 --- a/include/linux/kvm.h +++ b/include/linux/kvm.h @@ -229,34 +229,34 @@ struct kvm_cpuid { /* * ioctls for /dev/kvm fds: */ -#define KVM_GET_API_VERSION _IO(KVMIO, 1) -#define KVM_CREATE_VM _IO(KVMIO, 2) /* returns a VM fd */ -#define KVM_GET_MSR_INDEX_LIST_IOWR(KVMIO, 15, struct kvm_msr_list) +#define KVM_GET_API_VERSION _IO(KVMIO, 0x00) +#define KVM_CREATE_VM _IO(KVMIO, 0x01) /* returns a VM fd */ +#define KVM_GET_MSR_INDEX_LIST_IOWR(KVMIO, 0x02, struct kvm_msr_list) /* * ioctls for VM fds */ -#define KVM_SET_MEMORY_REGION _IOW(KVMIO, 10, struct kvm_memory_region) +#define KVM_SET_MEMORY_REGION _IOW(KVMIO, 0x40, struct kvm_memory_region) /* * KVM_CREATE_VCPU receives as a parameter the vcpu slot, and returns * a vcpu fd. 
*/ -#define KVM_CREATE_VCPU _IO(KVMIO, 11) -#define KVM_GET_DIRTY_LOG _IOW(KVMIO, 12, struct kvm_dirty_log) +#define KVM_CREATE_VCPU _IO(KVMIO, 0x41) +#define KVM_GET_DIRTY_LOG _IOW(KVMIO, 0x42, struct kvm_dirty_log) /* * ioctls for vcpu fds */ -#define KVM_RUN _IO(KVMIO, 16) -#define KVM_GET_REGS _IOR(KVMIO, 3, struct kvm_regs) -#define KVM_SET_REGS _IOW(KVMIO, 4, struct kvm_regs) -#define KVM_GET_SREGS _IOR(KVMIO, 5, struct kvm_sregs) -#define KVM_SET_SREGS _IOW(KVMIO, 6, struct kvm_sregs) -#define KVM_TRANSLATE _IOWR(KVMIO, 7, struct kvm_translation) -#define KVM_INTERRUPT _IOW(KVMIO, 8, struct kvm_interrupt) -#define KVM_DEBUG_GUEST _IOW(KVMIO, 9, struct kvm_debug_guest) -#define KVM_GET_MSRS _IOWR(KVMIO, 13, struct kvm_msrs) -#define KVM_SET_MSRS _IOW(KVMIO, 14, struct kvm_msrs) -#define KVM_SET_CPUID _IOW(KVMIO, 17, struct kvm_cpuid) +#define KVM_RUN _IO(KVMIO, 0x80) +#define KVM_GET_REGS _IOR(KVMIO, 0x81, struct kvm_regs) +#define KVM_SET_REGS _IOW(KVMIO, 0x82, struct kvm_regs) +#define KVM_GET_SREGS _IOR(KVMIO, 0x83, struct kvm_sregs) +#define KVM_SET_SREGS _IOW(KVMIO, 0x84, struct kvm_sregs) +#define KVM_TRANSLATE _IOWR(KVMIO, 0x85, struct kvm_translation) +#define KVM_INTERRUPT _IOW(KVMIO, 0x86, struct kvm_interrupt) +#define KVM_DEBUG_GUEST _IOW(KVMIO, 0x87, struct kvm_debug_guest) +#define KVM_GET_MSRS _IOWR(KVMIO, 0x88, struct kvm_msrs) +#define KVM_SET_MSRS _IOW(KVMIO, 0x89, struct kvm_msrs) +#define KVM_SET_CPUID _IOW(KVMIO, 0x8a, struct kvm_cpuid) #endif -- 1.5.0.2 - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 06/15] KVM: Remove minor wart from KVM_CREATE_VCPU ioctl
That ioctl does not transfer any data, so it should be an _IO rather than an _IOW.

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 include/linux/kvm.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index c6dd4a7..d89189a 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -241,7 +241,7 @@ struct kvm_cpuid {
  * KVM_CREATE_VCPU receives as a parameter the vcpu slot, and returns
  * a vcpu fd.
  */
-#define KVM_CREATE_VCPU           _IOW(KVMIO, 11, int)
+#define KVM_CREATE_VCPU           _IO(KVMIO, 11)
 #define KVM_GET_DIRTY_LOG         _IOW(KVMIO, 12, struct kvm_dirty_log)
 
 /*
-- 
1.5.0.2
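What the _IO versus _IOW distinction encodes can be checked from userspace with the decoding macros in linux/ioctl.h. The sketch below is illustrative only: the magic number 0xAE and all DEMO_ and demo_ names are assumptions made for this note, not definitions taken from the patch.

```c
#include <linux/ioctl.h>

#define DEMO_KVMIO 0xAE	/* assumed magic number, for illustration only */

#define DEMO_CREATE_VCPU_OLD _IOW(DEMO_KVMIO, 11, int)	/* claims an int payload */
#define DEMO_CREATE_VCPU_NEW _IO(DEMO_KVMIO, 11)		/* no payload encoded */

/* Returns nonzero when the ioctl number claims to carry a data payload. */
int demo_ioctl_transfers_data(unsigned int cmd)
{
	return _IOC_DIR(cmd) != _IOC_NONE || _IOC_SIZE(cmd) != 0;
}
```

Both numbers share the same type and sequence number; only the direction and size fields differ, which is why the change alters the ioctl value seen by userspace even though the argument handling stays the same.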
[PATCH 08/15] KVM: Add method to check for backwards-compatible API extensions
Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/kvm_main.c |    6 ++++++
 include/linux/kvm.h    |    5 +++++
 2 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 747966e..376538c 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -2416,6 +2416,12 @@ static long kvm_dev_ioctl(struct file *filp,
 		r = 0;
 		break;
 	}
+	case KVM_CHECK_EXTENSION:
+		/*
+		 * No extensions defined at present.
+		 */
+		r = 0;
+		break;
 	default:
 		;
 	}
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 93472da..c93cf53 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -232,6 +232,11 @@ struct kvm_cpuid {
 #define KVM_GET_API_VERSION       _IO(KVMIO, 0x00)
 #define KVM_CREATE_VM             _IO(KVMIO, 0x01) /* returns a VM fd */
 #define KVM_GET_MSR_INDEX_LIST    _IOWR(KVMIO, 0x02, struct kvm_msr_list)
+/*
+ * Check if a kvm extension is available.  Argument is extension number,
+ * return is 1 (yes) or 0 (no, sorry).
+ */
+#define KVM_CHECK_EXTENSION       _IO(KVMIO, 0x03)
 
 /*
  * ioctls for VM fds
-- 
1.5.0.2
[PATCH 14/15] KVM: Allow kernel to select size of mmap() buffer
This allows us to store offsets in the kernel/user kvm_run area, and be sure that userspace has them mapped. As offsets can be outside the kvm_run struct, userspace has no way of knowing how much to mmap.

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/kvm_main.c |    8 +++++++-
 include/linux/kvm.h    |    4 ++++
 2 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index ed95c9b..b81f007 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -2436,7 +2436,7 @@ static long kvm_dev_ioctl(struct file *filp,
 			  unsigned int ioctl, unsigned long arg)
 {
 	void __user *argp = (void __user *)arg;
-	int r = -EINVAL;
+	long r = -EINVAL;
 
 	switch (ioctl) {
 	case KVM_GET_API_VERSION:
@@ -2478,6 +2478,12 @@ static long kvm_dev_ioctl(struct file *filp,
 		 */
 		r = 0;
 		break;
+	case KVM_GET_VCPU_MMAP_SIZE:
+		r = -EINVAL;
+		if (arg)
+			goto out;
+		r = PAGE_SIZE;
+		break;
 	default:
 		;
 	}
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index c0d10cd..dad9081 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -253,6 +253,10 @@ struct kvm_signal_mask {
  * return is 1 (yes) or 0 (no, sorry).
  */
 #define KVM_CHECK_EXTENSION       _IO(KVMIO, 0x03)
+/*
+ * Get size for mmap(vcpu_fd)
+ */
+#define KVM_GET_VCPU_MMAP_SIZE    _IO(KVMIO, 0x04) /* in bytes */
 
 /*
  * ioctls for VM fds
-- 
1.5.0.2
[PATCH 13/15] KVM: Add guest mode signal mask
Allow a special signal mask to be used while executing in guest mode. This allows signals to be used to interrupt a vcpu without requiring signal delivery to a userspace handler, which is quite expensive. Userspace still receives -EINTR and can get the signal via sigwait().

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/kvm.h      |    3 +++
 drivers/kvm/kvm_main.c |   41 +++++++++++++++++++++++++++++++++++++++++
 include/linux/kvm.h    |    7 +++++++
 3 files changed, 51 insertions(+), 0 deletions(-)

diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index be3a0e7..1c4a581 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -277,6 +277,9 @@ struct kvm_vcpu {
 	gpa_t mmio_phys_addr;
 	int pio_pending;
 
+	int sigset_active;
+	sigset_t sigset;
+
 	struct {
 		int active;
 		u8 save_iopl;
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 0e28f58..ed95c9b 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -1591,9 +1591,13 @@ static void complete_pio(struct kvm_vcpu *vcpu)
 static int kvm_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 {
 	int r;
+	sigset_t sigsaved;
 
 	vcpu_load(vcpu);
 
+	if (vcpu->sigset_active)
+		sigprocmask(SIG_SETMASK, &vcpu->sigset, &sigsaved);
+
 	/* re-sync apic's tpr */
 	vcpu->cr8 = kvm_run->cr8;
 
@@ -1616,6 +1620,9 @@ static int kvm_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 
 	r = kvm_arch_ops->run(vcpu, kvm_run);
 
+	if (vcpu->sigset_active)
+		sigprocmask(SIG_SETMASK, &sigsaved, NULL);
+
 	vcpu_put(vcpu);
 	return r;
 }
@@ -2142,6 +2149,17 @@ out:
 	return r;
 }
 
+static int kvm_vcpu_ioctl_set_sigmask(struct kvm_vcpu *vcpu, sigset_t *sigset)
+{
+	if (sigset) {
+		sigdelsetmask(sigset, sigmask(SIGKILL)|sigmask(SIGSTOP));
+		vcpu->sigset_active = 1;
+		vcpu->sigset = *sigset;
+	} else
+		vcpu->sigset_active = 0;
+	return 0;
+}
+
 static long kvm_vcpu_ioctl(struct file *filp,
 			   unsigned int ioctl, unsigned long arg)
 {
@@ -2260,6 +2278,29 @@ static long kvm_vcpu_ioctl(struct file *filp,
 			goto out;
 		break;
 	}
+	case KVM_SET_SIGNAL_MASK: {
+		struct kvm_signal_mask __user *sigmask_arg = argp;
+		struct kvm_signal_mask kvm_sigmask;
+		sigset_t sigset, *p;
+
+		p = NULL;
+		if (argp) {
+			r = -EFAULT;
+			if (copy_from_user(&kvm_sigmask, argp,
+					   sizeof kvm_sigmask))
+				goto out;
+			r = -EINVAL;
+			if (kvm_sigmask.len != sizeof sigset)
+				goto out;
+			r = -EFAULT;
+			if (copy_from_user(&sigset, sigmask_arg->sigset,
+					   sizeof sigset))
+				goto out;
+			p = &sigset;
+		}
+		r = kvm_vcpu_ioctl_set_sigmask(vcpu, &sigset);
+		break;
+	}
 	default:
 		;
 	}
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index b3af92e..c0d10cd 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -234,6 +234,12 @@ struct kvm_cpuid {
 	struct kvm_cpuid_entry entries[0];
 };
 
+/* for KVM_SET_SIGNAL_MASK */
+struct kvm_signal_mask {
+	__u32 len;
+	__u8  sigset[0];
+};
+
 #define KVMIO 0xAE
 
 /*
@@ -273,5 +279,6 @@ struct kvm_cpuid {
 #define KVM_GET_MSRS              _IOWR(KVMIO, 0x88, struct kvm_msrs)
 #define KVM_SET_MSRS              _IOW(KVMIO,  0x89, struct kvm_msrs)
 #define KVM_SET_CPUID             _IOW(KVMIO,  0x8a, struct kvm_cpuid)
+#define KVM_SET_SIGNAL_MASK       _IOW(KVMIO,  0x8b, struct kvm_signal_mask)
 
 #endif
-- 
1.5.0.2
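The userspace half of this scheme can be sketched with plain POSIX calls. The sketch below is an assumption about how a caller might consume such a signal, not code from the patch: it keeps SIGUSR1 blocked, lets it become pending, and collects it synchronously with sigwait() instead of running a handler. The function name is hypothetical and nothing here touches KVM itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/*
 * Block SIGUSR1, make it pending, then dequeue it with sigwait().
 * Returns the collected signal number, or -1 on error.
 */
static int demo_collect_blocked_signal(void)
{
	sigset_t set, saved;
	int sig = 0;

	sigemptyset(&set);
	sigaddset(&set, SIGUSR1);

	/* Keep the signal blocked so no handler runs for it. */
	if (sigprocmask(SIG_BLOCK, &set, &saved) != 0)
		return -1;

	/* Stands in for another thread kicking the vcpu thread. */
	raise(SIGUSR1);

	/* The signal is pending; sigwait() removes it without a handler. */
	if (sigwait(&set, &sig) != 0)
		sig = -1;

	/* Restore the mask that was in effect on entry. */
	sigprocmask(SIG_SETMASK, &saved, NULL);
	return sig;
}
```

In the scheme the patch describes, the mask installed with KVM_SET_SIGNAL_MASK would presumably leave this signal unblocked only while the guest runs, so KVM_RUN returns -EINTR and the caller then picks the signal up as above.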
[PATCH 05/15] KVM: Remove the 'emulated' field from the userspace interface
We no longer emulate single instructions in userspace. Instead, we service mmio or pio requests.

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/kvm_main.c |    5 -----
 include/linux/kvm.h    |    3 +--
 2 files changed, 1 insertions(+), 7 deletions(-)

diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 347467e..747966e 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -1588,11 +1588,6 @@ static int kvm_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	/* re-sync apic's tpr */
 	vcpu->cr8 = kvm_run->cr8;
 
-	if (kvm_run->emulated) {
-		kvm_arch_ops->skip_emulated_instruction(vcpu);
-		kvm_run->emulated = 0;
-	}
-
 	if (kvm_run->io_completed) {
 		if (vcpu->pio_pending)
 			complete_pio(vcpu);
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 15e23bc..c6dd4a7 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -51,10 +51,9 @@ enum kvm_exit_reason {
 /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
 struct kvm_run {
 	/* in */
-	__u32 emulated;  /* skip current instruction */
 	__u32 io_completed; /* mmio/pio request completed */
 	__u8 request_interrupt_window;
-	__u8 padding1[7];
+	__u8 padding1[3];
 
 	/* out */
 	__u32 exit_type;
-- 
1.5.0.2
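The padding change from 7 to 3 bytes follows from the field sizes. As a small sketch (the struct names are hypothetical and copy only the "in" part of struct kvm_run shown in the hunk), the input block shrinks from 16 to 8 bytes and the fields after it stay aligned:

```c
#include <stddef.h>
#include <stdint.h>

/* "in" fields before the patch: 4 + 4 + 1 + 7 = 16 bytes. */
struct demo_run_in_old {
	uint32_t emulated;
	uint32_t io_completed;
	uint8_t  request_interrupt_window;
	uint8_t  padding1[7];
};

/* "in" fields after the patch: 4 + 1 + 3 = 8 bytes. */
struct demo_run_in_new {
	uint32_t io_completed;
	uint8_t  request_interrupt_window;
	uint8_t  padding1[3];
};

static size_t demo_old_in_size(void) { return sizeof(struct demo_run_in_old); }
static size_t demo_new_in_size(void) { return sizeof(struct demo_run_in_new); }
```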
[PATCH 11/15] KVM: Add a special exit reason when exiting due to an interrupt
This is redundant, as we also return -EINTR from the ioctl, but it allows us to examine the exit_reason field on resume without seeing old data.

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/svm.c   |    2 ++
 drivers/kvm/vmx.c   |    2 ++
 include/linux/kvm.h |    3 ++-
 3 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c
index b09928f..0311665 100644
--- a/drivers/kvm/svm.c
+++ b/drivers/kvm/svm.c
@@ -1619,12 +1619,14 @@ again:
 		if (signal_pending(current)) {
 			++kvm_stat.signal_exits;
 			post_kvm_run_save(vcpu, kvm_run);
+			kvm_run->exit_reason = KVM_EXIT_INTR;
 			return -EINTR;
 		}
 
 		if (dm_request_for_irq_injection(vcpu, kvm_run)) {
 			++kvm_stat.request_irq_exits;
 			post_kvm_run_save(vcpu, kvm_run);
+			kvm_run->exit_reason = KVM_EXIT_INTR;
 			return -EINTR;
 		}
 		kvm_resched(vcpu);
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index ba7a98b..0d1c8cf 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -1936,12 +1936,14 @@ again:
 		if (signal_pending(current)) {
 			++kvm_stat.signal_exits;
 			post_kvm_run_save(vcpu, kvm_run);
+			kvm_run->exit_reason = KVM_EXIT_INTR;
 			return -EINTR;
 		}
 
 		if (dm_request_for_irq_injection(vcpu, kvm_run)) {
 			++kvm_stat.request_irq_exits;
 			post_kvm_run_save(vcpu, kvm_run);
+			kvm_run->exit_reason = KVM_EXIT_INTR;
 			return -EINTR;
 		}
 
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 57f47ef..b3af92e 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -11,7 +11,7 @@
 #include <asm/types.h>
 #include <linux/ioctl.h>
 
-#define KVM_API_VERSION 8
+#define KVM_API_VERSION 9
 
 /*
  * Architectural interrupt line count, and the size of the bitmap needed
@@ -45,6 +45,7 @@ enum kvm_exit_reason {
 	KVM_EXIT_IRQ_WINDOW_OPEN  = 7,
 	KVM_EXIT_SHUTDOWN         = 8,
 	KVM_EXIT_FAIL_ENTRY       = 9,
+	KVM_EXIT_INTR             = 10,
 };
 
 /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
-- 
1.5.0.2
[PATCH 09/15] KVM: Allow userspace to process hypercalls which have no kernel handler
This is useful for paravirtualized graphics devices, for example.

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/kvm_main.c |   18 +++++++++++++++++-
 include/linux/kvm.h    |   10 +++++++++-
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 376538c..2220e49 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -1203,7 +1203,16 @@ int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run)
 	}
 	switch (nr) {
 	default:
-		;
+		run->hypercall.args[0] = a0;
+		run->hypercall.args[1] = a1;
+		run->hypercall.args[2] = a2;
+		run->hypercall.args[3] = a3;
+		run->hypercall.args[4] = a4;
+		run->hypercall.args[5] = a5;
+		run->hypercall.ret = ret;
+		run->hypercall.longmode = is_long_mode(vcpu);
+		kvm_arch_ops->decache_regs(vcpu);
+		return 0;
 	}
 	vcpu->regs[VCPU_REGS_RAX] = ret;
 	kvm_arch_ops->decache_regs(vcpu);
@@ -1599,6 +1608,13 @@ static int kvm_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 
 	vcpu->mmio_needed = 0;
 
+	if (kvm_run->exit_type == KVM_EXIT_TYPE_VM_EXIT &&
+	    kvm_run->exit_type == KVM_EXIT_HYPERCALL) {
+		kvm_arch_ops->cache_regs(vcpu);
+		vcpu->regs[VCPU_REGS_RAX] = kvm_run->hypercall.ret;
+		kvm_arch_ops->decache_regs(vcpu);
+	}
+
 	r = kvm_arch_ops->run(vcpu, kvm_run);
 
 	vcpu_put(vcpu);
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index c93cf53..9151ebf 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -11,7 +11,7 @@
 #include <asm/types.h>
 #include <linux/ioctl.h>
 
-#define KVM_API_VERSION 6
+#define KVM_API_VERSION 7
 
 /*
  * Architectural interrupt line count, and the size of the bitmap needed
@@ -41,6 +41,7 @@ enum kvm_exit_reason {
 	KVM_EXIT_UNKNOWN          = 0,
 	KVM_EXIT_EXCEPTION        = 1,
 	KVM_EXIT_IO               = 2,
+	KVM_EXIT_HYPERCALL        = 3,
 	KVM_EXIT_DEBUG            = 4,
 	KVM_EXIT_HLT              = 5,
 	KVM_EXIT_MMIO             = 6,
@@ -103,6 +104,13 @@ struct kvm_run {
 			__u32 len;
 			__u8  is_write;
 		} mmio;
+		/* KVM_EXIT_HYPERCALL */
+		struct {
+			__u64 args[6];
+			__u64 ret;
+			__u32 longmode;
+			__u32 pad;
+		} hypercall;
 	};
 };
 
-- 
1.5.0.2
[PATCH 02/15] KVM: Do not communicate to userspace through cpu registers during PIO
Currently when passing a PIO emulation request to userspace, we rely on userspace updating %rax (on 'in' instructions) and %rsi/%rdi/%rcx (on string instructions). This (a) requires two extra ioctls for getting and setting the registers and (b) is unfriendly to non-x86 archs, when they get kvm ports.

So fix by doing the register fixups in the kernel and passing to userspace only an abstract description of the PIO to be done.

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/kvm.h      |    1 +
 drivers/kvm/kvm_main.c |   48 +++++++++++++++++++++++++++++++++++++++++++++---
 drivers/kvm/svm.c      |    1 +
 drivers/kvm/vmx.c      |    1 +
 include/linux/kvm.h    |    6 +++---
 5 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index 901b8d9..59cbc5b 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -274,6 +274,7 @@ struct kvm_vcpu {
 	int mmio_size;
 	unsigned char mmio_data[8];
 	gpa_t mmio_phys_addr;
+	int pio_pending;
 
 	struct {
 		int active;
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 42be8a8..8a4984d 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -1504,6 +1504,44 @@ void save_msrs(struct vmx_msr_entry *e, int n)
 }
 EXPORT_SYMBOL_GPL(save_msrs);
 
+static void complete_pio(struct kvm_vcpu *vcpu)
+{
+	struct kvm_io *io = &vcpu->run->io;
+	long delta;
+
+	kvm_arch_ops->cache_regs(vcpu);
+
+	if (!io->string) {
+		if (io->direction == KVM_EXIT_IO_IN)
+			memcpy(&vcpu->regs[VCPU_REGS_RAX], &io->value,
+			       io->size);
+	} else {
+		delta = 1;
+		if (io->rep) {
+			delta *= io->count;
+			/*
+			 * The size of the register should really depend on
+			 * current address size.
+			 */
+			vcpu->regs[VCPU_REGS_RCX] -= delta;
+		}
+		if (io->string_down)
+			delta = -delta;
+		delta *= io->size;
+		if (io->direction == KVM_EXIT_IO_IN)
+			vcpu->regs[VCPU_REGS_RDI] += delta;
+		else
+			vcpu->regs[VCPU_REGS_RSI] += delta;
+	}
+
+	vcpu->pio_pending = 0;
+	vcpu->run->io_completed = 0;
+
+	kvm_arch_ops->decache_regs(vcpu);
+
+	kvm_arch_ops->skip_emulated_instruction(vcpu);
+}
+
 static int kvm_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 {
 	int r;
@@ -1518,9 +1556,13 @@ static int kvm_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 		kvm_run->emulated = 0;
 	}
 
-	if (kvm_run->mmio_completed) {
-		memcpy(vcpu->mmio_data, kvm_run->mmio.data, 8);
-		vcpu->mmio_read_completed = 1;
+	if (kvm_run->io_completed) {
+		if (vcpu->pio_pending)
+			complete_pio(vcpu);
+		else {
+			memcpy(vcpu->mmio_data, kvm_run->mmio.data, 8);
+			vcpu->mmio_read_completed = 1;
+		}
 	}
 
 	vcpu->mmio_needed = 0;
diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c
index 6787f11..b176f5a 100644
--- a/drivers/kvm/svm.c
+++ b/drivers/kvm/svm.c
@@ -1056,6 +1056,7 @@ static int io_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 		}
 	} else
 		kvm_run->io.value = vcpu->svm->vmcb->save.rax;
+	vcpu->pio_pending = 1;
 	return 0;
 }
 
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index 910535d..7fd572a 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -1465,6 +1465,7 @@ static int handle_io(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 		kvm_run->io.address = vmcs_readl(GUEST_LINEAR_ADDRESS);
 	} else
 		kvm_run->io.value = vcpu->regs[VCPU_REGS_RAX]; /* rax */
+	vcpu->pio_pending = 1;
 	return 0;
 }
 
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index d88e750..19aeb33 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -11,7 +11,7 @@
 #include <asm/types.h>
 #include <linux/ioctl.h>
 
-#define KVM_API_VERSION 5
+#define KVM_API_VERSION 6
 
 /*
  * Architectural interrupt line count, and the size of the bitmap needed
@@ -53,7 +53,7 @@ enum kvm_exit_reason {
 struct kvm_run {
 	/* in */
 	__u32 emulated;  /* skip current instruction */
-	__u32 mmio_completed; /* mmio request completed */
+	__u32 io_completed; /* mmio/pio request completed */
 	__u8 request_interrupt_window;
 	__u8 padding1[7];
 
@@ -80,7 +80,7 @@ struct kvm_run {
 			__u32 error_code;
 		} ex;
 		/* KVM_EXIT_IO */
-		struct {
+		struct kvm_io {
 #define KVM_EXIT_IO_IN  0
 #define KVM_EXIT_IO_OUT 1
 			__u8
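The string-instruction fixup in complete_pio() can be followed with a small standalone model. This is a sketch under simplifying assumptions: the demo_* types are hypothetical, registers are plain 64-bit integers, and the non-string 'in' case that copies into %rax is left out.

```c
#include <stdint.h>

/* Hypothetical description of a completed string PIO request. */
struct demo_pio {
	int string;       /* string instruction (ins/outs)? */
	int rep;          /* rep prefix present? */
	int string_down;  /* direction flag set (addresses decrease)? */
	int is_in;        /* 1 for 'in', 0 for 'out' */
	uint32_t size;    /* bytes per element */
	uint32_t count;   /* number of elements transferred */
};

/* Hypothetical register file holding only what the fixup touches. */
struct demo_regs {
	uint64_t rcx, rsi, rdi;
};

/* Mirrors the else-branch of complete_pio() in the patch. */
static void demo_complete_string_pio(const struct demo_pio *io,
				     struct demo_regs *r)
{
	int64_t delta = 1;

	if (!io->string)
		return;
	if (io->rep) {
		delta *= (int64_t)io->count;
		r->rcx -= (uint64_t)delta;    /* elements consumed */
	}
	if (io->string_down)
		delta = -delta;
	delta *= (int64_t)io->size;           /* elements -> bytes */
	if (io->is_in)
		r->rdi += (uint64_t)delta;    /* 'ins' writes through %rdi */
	else
		r->rsi += (uint64_t)delta;    /* 'outs' reads through %rsi */
}
```

For example, a rep outs of four 2-byte elements with the direction flag clear leaves %rcx reduced by 4 and %rsi advanced by 8.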
[PATCH 15/15] KVM: Future-proof argument-less ioctls
Some ioctls ignore their arguments. By requiring them to be zero now, we allow a nonzero value to have some special meaning in the future.

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/kvm_main.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index b81f007..bf8403e 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -2169,6 +2169,9 @@ static long kvm_vcpu_ioctl(struct file *filp,
 
 	switch (ioctl) {
 	case KVM_RUN:
+		r = -EINVAL;
+		if (arg)
+			goto out;
 		r = kvm_vcpu_ioctl_run(vcpu, vcpu->run);
 		break;
 	case KVM_GET_REGS: {
@@ -2440,9 +2443,15 @@ static long kvm_dev_ioctl(struct file *filp,
 
 	switch (ioctl) {
 	case KVM_GET_API_VERSION:
+		r = -EINVAL;
+		if (arg)
+			goto out;
 		r = KVM_API_VERSION;
 		break;
 	case KVM_CREATE_VM:
+		r = -EINVAL;
+		if (arg)
+			goto out;
 		r = kvm_dev_ioctl_create_vm();
 		break;
 	case KVM_GET_MSR_INDEX_LIST: {
-- 
1.5.0.2
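The pattern is easy to see in a userspace-sized model. The dispatcher below is a hypothetical sketch, not the kernel function: the command numbers and the version value are made up, and it only shows that a nonzero argument is rejected today so that it can be given a meaning later.

```c
#include <errno.h>

/* Made-up command numbers and version for the sketch. */
enum { DEMO_GET_API_VERSION = 0, DEMO_CREATE_VM = 1 };
#define DEMO_API_VERSION 9

static long demo_dev_ioctl(unsigned int cmd, unsigned long arg)
{
	long r = -EINVAL;

	switch (cmd) {
	case DEMO_GET_API_VERSION:
		/* Argument is unused today, so insist that it is zero. */
		if (arg)
			goto out;
		r = DEMO_API_VERSION;
		break;
	case DEMO_CREATE_VM:
		if (arg)
			goto out;
		r = 0;   /* stands in for the new VM fd */
		break;
	default:
		break;
	}
out:
	return r;
}
```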
[PATCH 10/15] KVM: Fold kvm_run::exit_type into kvm_run::exit_reason
Currently, userspace is told about the nature of the last exit from the guest using two fields, exit_type and exit_reason, where exit_type has just two enumerations (and no need for more). So fold exit_type into exit_reason, reducing the complexity of determining what really happened.

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/kvm_main.c |    3 +--
 drivers/kvm/svm.c      |    7 +++----
 drivers/kvm/vmx.c      |    7 +++----
 include/linux/kvm.h    |   15 ++++++++-------
 4 files changed, 15 insertions(+), 17 deletions(-)

diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 2220e49..0e28f58 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -1608,8 +1608,7 @@ static int kvm_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 
 	vcpu->mmio_needed = 0;
 
-	if (kvm_run->exit_type == KVM_EXIT_TYPE_VM_EXIT &&
-	    kvm_run->exit_type == KVM_EXIT_HYPERCALL) {
+	if (kvm_run->exit_reason == KVM_EXIT_HYPERCALL) {
 		kvm_arch_ops->cache_regs(vcpu);
 		vcpu->regs[VCPU_REGS_RAX] = kvm_run->hypercall.ret;
 		kvm_arch_ops->decache_regs(vcpu);
diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c
index d4b2936..b09928f 100644
--- a/drivers/kvm/svm.c
+++ b/drivers/kvm/svm.c
@@ -1298,8 +1298,6 @@ static int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 {
 	u32 exit_code = vcpu->svm->vmcb->control.exit_code;
 
-	kvm_run->exit_type = KVM_EXIT_TYPE_VM_EXIT;
-
 	if (is_external_interrupt(vcpu->svm->vmcb->control.exit_int_info) &&
 	    exit_code != SVM_EXIT_EXCP_BASE + PF_VECTOR)
 		printk(KERN_ERR "%s: unexpected exit_ini_info 0x%x "
@@ -1609,8 +1607,9 @@ again:
 	vcpu->svm->next_rip = 0;
 
 	if (vcpu->svm->vmcb->control.exit_code == SVM_EXIT_ERR) {
-		kvm_run->exit_type = KVM_EXIT_TYPE_FAIL_ENTRY;
-		kvm_run->exit_reason = vcpu->svm->vmcb->control.exit_code;
+		kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY;
+		kvm_run->fail_entry.hardware_entry_failure_reason
+			= vcpu->svm->vmcb->control.exit_code;
 		post_kvm_run_save(vcpu, kvm_run);
 		return 0;
 	}
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index e093892..ba7a98b 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -1901,10 +1901,10 @@ again:
 
 	asm ("mov %0, %%ds; mov %0, %%es" : : "r"(__USER_DS));
 
-	kvm_run->exit_type = 0;
 	if (fail) {
-		kvm_run->exit_type = KVM_EXIT_TYPE_FAIL_ENTRY;
-		kvm_run->exit_reason = vmcs_read32(VM_INSTRUCTION_ERROR);
+		kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY;
+		kvm_run->fail_entry.hardware_entry_failure_reason
+			= vmcs_read32(VM_INSTRUCTION_ERROR);
 		r = 0;
 	} else {
 		if (fs_gs_ldt_reload_needed) {
@@ -1930,7 +1930,6 @@ again:
 		profile_hit(KVM_PROFILING, (void *)vmcs_readl(GUEST_RIP));
 
 		vcpu->launched = 1;
-		kvm_run->exit_type = KVM_EXIT_TYPE_VM_EXIT;
 		r = kvm_handle_exit(kvm_run, vcpu);
 		if (r > 0) {
 			/* Give scheduler a change to reschedule. */
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 9151ebf..57f47ef 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -11,7 +11,7 @@
 #include <asm/types.h>
 #include <linux/ioctl.h>
 
-#define KVM_API_VERSION 7
+#define KVM_API_VERSION 8
 
 /*
  * Architectural interrupt line count, and the size of the bitmap needed
@@ -34,9 +34,6 @@ struct kvm_memory_region {
 
 #define KVM_MEM_LOG_DIRTY_PAGES  1UL
 
-#define KVM_EXIT_TYPE_FAIL_ENTRY 1
-#define KVM_EXIT_TYPE_VM_EXIT    2
-
 enum kvm_exit_reason {
 	KVM_EXIT_UNKNOWN          = 0,
 	KVM_EXIT_EXCEPTION        = 1,
@@ -47,6 +44,7 @@ enum kvm_exit_reason {
 	KVM_EXIT_MMIO             = 6,
 	KVM_EXIT_IRQ_WINDOW_OPEN  = 7,
 	KVM_EXIT_SHUTDOWN         = 8,
+	KVM_EXIT_FAIL_ENTRY       = 9,
 };
 
 /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
@@ -57,12 +55,11 @@ struct kvm_run {
 	__u8 padding1[3];
 
 	/* out */
-	__u32 exit_type;
 	__u32 exit_reason;
 	__u32 instruction_length;
 	__u8 ready_for_interrupt_injection;
 	__u8 if_flag;
-	__u16 padding2;
+	__u8 padding2[6];
 
 	/* in (pre_kvm_run), out (post_kvm_run) */
 	__u64 cr8;
@@ -71,8 +68,12 @@ struct kvm_run {
 	union {
 		/* KVM_EXIT_UNKNOWN */
 		struct {
-			__u32 hardware_exit_reason;
+			__u64 hardware_exit_reason;
 		} hw;
+		/* KVM_EXIT_FAIL_ENTRY */
+		struct {
+			__u64 hardware_entry_failure_reason;
+		} fail_entry;
 		/* KVM_EXIT_EXCEPTION */
 		struct {
 			__u32 exception;
-- 
1.5.0.2
[PATCH -mm] Fix race between proc_readdir and remove_proc_entry
-procfs-fix-race-between-proc_readdir-and-remove_proc_entry.patch
+fix-race-between-proc_get_inode-and-remove_proc_entry.patch

Updated.

Looks sane. Why have you dropped the first patch? Resending slightly fixed version of it.

[PATCH -mm] Fix race between proc_readdir and remove_proc_entry

From: Darrick J. Wong [EMAIL PROTECTED]

Fix the following race:

proc_readdir                            remove_proc_entry
========================================================================
spin_lock(&proc_subdir_lock);
[choose PDE to start filldir from]
spin_unlock(&proc_subdir_lock);
                                        spin_lock(&proc_subdir_lock);
                                        [find PDE]
                                        [free PDE, refcount is 0]
                                        spin_unlock(&proc_subdir_lock);
/* boom */
if (filldir(dirent, de->name, ...

[de_put on error path --adobriyan]

Signed-off-by: Darrick J. Wong [EMAIL PROTECTED]
Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
---
 fs/proc/generic.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -478,14 +478,21 @@ int proc_readdir(struct file * filp,
 			}
 
 			do {
+				struct proc_dir_entry *next;
+
 				/* filldir passes info to user space */
+				de_get(de);
 				spin_unlock(&proc_subdir_lock);
 				if (filldir(dirent, de->name, de->namelen, filp->f_pos,
-					    de->low_ino, de->mode >> 12) < 0)
+					    de->low_ino, de->mode >> 12) < 0) {
+					de_put(de);
 					goto out;
+				}
 				spin_lock(&proc_subdir_lock);
 				filp->f_pos++;
-				de = de->next;
+				next = de->next;
+				de_put(de);
+				de = next;
 			} while (de);
 			spin_unlock(&proc_subdir_lock);
 	}
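The shape of the fix (pin the entry before dropping the lock, read next only after re-taking it, then unpin) can be tried out in a single-threaded model. This is a sketch with hypothetical demo_* names: the lock is only marked in comments, "freeing" just sets a flag so it can be inspected, and unlike the real remove_proc_entry() the simulated removal does not unlink the entry from the list.

```c
#include <stddef.h>

struct demo_entry {
	struct demo_entry *next;
	int refcount;   /* starts at 1: the reference held by the list */
	int freed;      /* set instead of calling free() */
};

static int demo_saw_freed_early;  /* set if an entry dies while in use */

static void demo_get(struct demo_entry *e) { e->refcount++; }

static void demo_put(struct demo_entry *e)
{
	if (--e->refcount == 0)
		e->freed = 1;
}

/* Plays the role of a concurrent removal during the unlocked window. */
static void demo_remove_during_visit(struct demo_entry *e)
{
	demo_put(e);              /* drop the list's reference */
	if (e->freed)
		demo_saw_freed_early = 1;
}

/* Walks the list the way the patched loop does; returns entries visited. */
static int demo_walk(struct demo_entry *de,
		     void (*visit)(struct demo_entry *))
{
	int visited = 0;

	while (de) {
		struct demo_entry *next;

		demo_get(de);     /* pin before "unlocking" */
		visit(de);        /* filldir() would run here, unlocked */
		visited++;
		next = de->next;  /* "locked" again: read next while pinned */
		demo_put(de);     /* may free the entry now, after use */
		de = next;
	}
	return visited;
}
```

With the demo_get()/demo_put() pair removed from demo_walk(), the simulated removal would drop the count to zero while the entry is still being visited, which is the "boom" in the diagram above.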
Re: [RFC][PATCH 2/7] RSS controller core
Andrew Morton wrote:

On Tue, 06 Mar 2007 17:55:29 +0300 Pavel Emelianov [EMAIL PROTECTED] wrote:

+struct rss_container {
+	struct res_counter res;
+	struct list_head page_list;
+	struct container_subsys_state css;
+};
+
+struct page_container {
+	struct page *page;
+	struct rss_container *cnt;
+	struct list_head list;
+};

ah. This looks good. I'll find a hunk of time to go through this work and through Paul's patches. It'd be good to get both patchsets lined up in -mm within a couple of weeks.

But.. We need to decide whether we want to do per-container memory limitation via these data structures, or whether we do it via a physical scan of some software zone, possibly based on Mel's patches.

i.e. a separate memzone for each container?

imho the memzone approach is inconvenient for page sharing and shares accounting. It also makes memory management more strict, forbids overcommitting per-container, etc.

Maybe you have some ideas how we can decide on this?

Thanks,
Kirill
Re: [RFC][PATCH 2/7] RSS controller core
On Sun, 11 Mar 2007 15:26:41 +0300 Kirill Korotaev [EMAIL PROTECTED] wrote:

Andrew Morton wrote:

On Tue, 06 Mar 2007 17:55:29 +0300 Pavel Emelianov [EMAIL PROTECTED] wrote:

+struct rss_container {
+	struct res_counter res;
+	struct list_head page_list;
+	struct container_subsys_state css;
+};
+
+struct page_container {
+	struct page *page;
+	struct rss_container *cnt;
+	struct list_head list;
+};

ah. This looks good. I'll find a hunk of time to go through this work and through Paul's patches. It'd be good to get both patchsets lined up in -mm within a couple of weeks.

But.. We need to decide whether we want to do per-container memory limitation via these data structures, or whether we do it via a physical scan of some software zone, possibly based on Mel's patches.

i.e. a separate memzone for each container?

Yep. Straightforward machine partitioning. An attractive thing is that it 100% reuses existing page reclaim, unaltered.

imho the memzone approach is inconvenient for page sharing and shares accounting. It also makes memory management more strict, forbids overcommitting per-container, etc.

umm, who said they were requirements?

Maybe you have some ideas how we can decide on this?

We need to work out what the requirements are before we can settle on an implementation.

Sigh. Who is running this show? Anyone?

You can actually do a form of overcommitment by allowing multiple containers to share one or more of the zones. Whether that is sufficient or suitable I don't know. That depends on the requirements, and we haven't even discussed those, let alone agreed to them.
Re: [PATCH] [scsi]: Add offline state checking while dispatch a scsi cmd
On Fri, 2007-03-09 at 09:40 +0800, Joe Jin wrote:

What's the error you're trying to fix? scsi_dispatch_cmd() is only called from scsi_request_fn() which already has an equivalent of this check in it just prior to calling dispatch.

Yeah, I have seen the check in scsi_request_fn(). Recently we got the following crash info on rhel4 2.6.9-42.0.2.ELsmp,

This kernel is way too old to debug ... However:

scsi0 (0:0): rejecting I/O to offline device
...
EXT3-fs error (device sda8) in start_transaction: Journal has aborted
Unable to handle kernel NULL pointer dereference at RIP:
a0031e66{:megaraid_mbox:megaraid_queue_command+2634}

This is a bug actually in the megaraid.

PML4 21a25d067 PGD 2170ac067 PMD 0
Oops: 0002 [1] SMP
CPU 0
Modules linked in: hangcheck_timer mptctl mptbase ipmi_devintf ipmi_si ipmi_msghandler dell_rbu netconsole netdump autofs4 i2c_dev i2c_core ocfs2(U) debugfs(U) nfs lockd nfs_acl ocfs2_dlmfs(U) ocfs2_dlm(U) ocfs2_nodemanager(U) configfs(U) sunrpc ds yenta_socket pcmcia_core ide_dump scsi_dump diskdump zlib_deflate dm_mirror dm_multipath dm_mod emcphr(U) emcpmpap(U) emcpmpaa(U) emcpmpc(U) emcpmp(U) emcp(U) emcplib(U) button battery ac joydev uhci_hcd ehci_hcd hw_random tg3 e1000 bond0(U) floppy sg ext3 jbd lpfc scsi_transport_fc megaraid_mbox megaraid_mm sd_mod scsi_mod
Pid: 13238, comm: emagent Tainted: P 2.6.9-42.0.2.ELsmp
RIP: 0010:[a0031e66] a0031e66{:megaraid_mbox:megaraid_queue_command+2634}
RSP: 0018:01019b5a9b48 EFLAGS: 00010002
RAX: 000220b8e000 RBX: 0102ffd1b048 RCX:
RDX: RSI: 0001 RDI: 010431124bf0
RBP: 0001 R08: R09: 010133ce5b80
R10: 0102ffd3e5a0 R11: 0060 R12: 010133ce5b80
R13: 0102ffd3e480 R14: 0100bfb4c8b8 R15: 0101ffcf4000
FS: () GS:804e5180(005b) knlGS:f47ffbb0
CS: 0010 DS: 002b ES: 002b CR0: 8005003b
CR2: CR3: 00101000 CR4: 06e0
Process emagent (pid: 13238, threadinfo 01019b5a8000, task 01003e5a8030)
Stack: 0046 0046 0102ffd3e480 0101fff73980 8015cb38 0100bfb4d4aa 0100bfb4d4a2 0100bfb4c8b8 01010080
Call Trace:
8015cb38{mempool_alloc+129}
a0002874{:scsi_mod:scsi_done+0}
8013fc00{__mod_timer+113}
a0002adf{:scsi_mod:scsi_dispatch_cmd+595}
a0007a72{:scsi_mod:scsi_request_fn+990}
8024e385{generic_unplug_device+24}
8017a6d3{__wait_on_buffer+120}
8017a55e{bh_wake_function+0}
8017a55e{bh_wake_function+0}
a00877fe{:ext3:ext3_bread+96}
a008935c{:ext3:htree_dirblock_to_tree+50}
a008952c{:ext3:ext3_htree_fill_tree+295}
8018b232{filldir64+122}
8018b1b8{filldir64+0}
a0083ace{:ext3:ext3_readdir+371}
8018f019{dput+56}
8018b1b8{filldir64+0}
8018599c{path_release+12}
8019e335{compat_sys_statfs+105}
8018b1b8{filldir64+0}
8018aef7{vfs_readdir+155}
8018b2e8{sys_getdents64+118}
80125bbb{sysenter_do_call+27}

And this is a direct command submission path: it already passed both online check gates in this path *after* the device was offlined, so adding a third won't fix this.

Firstly, I'm assuming you have only a single disk, so the I/O was definitely bound for sda? Secondly, can you reproduce with a modern (2.6.20) kernel?

Your trace strongly suggests that the device came back online for some reason and then the megaraid driver died.

James
Re: [RFC][PATCH 2/7] RSS controller core
On 3/11/07, Andrew Morton [EMAIL PROTECTED] wrote:

On Sun, 11 Mar 2007 15:26:41 +0300 Kirill Korotaev [EMAIL PROTECTED] wrote:

Andrew Morton wrote:

On Tue, 06 Mar 2007 17:55:29 +0300 Pavel Emelianov [EMAIL PROTECTED] wrote:

+struct rss_container {
+	struct res_counter res;
+	struct list_head page_list;
+	struct container_subsys_state css;
+};
+
+struct page_container {
+	struct page *page;
+	struct rss_container *cnt;
+	struct list_head list;
+};

ah. This looks good. I'll find a hunk of time to go through this work and through Paul's patches. It'd be good to get both patchsets lined up in -mm within a couple of weeks.

But.. We need to decide whether we want to do per-container memory limitation via these data structures, or whether we do it via a physical scan of some software zone, possibly based on Mel's patches.

i.e. a separate memzone for each container?

Yep. Straightforward machine partitioning. An attractive thing is that it 100% reuses existing page reclaim, unaltered.

We discussed zones for resource control and some of the disadvantages at http://lkml.org/lkml/2006/10/30/222

I need to look at Mel's patches to determine if they are suitable for control. But in a thread of discussion on those patches, it was agreed that memory fragmentation and resource control are independent issues.

imho the memzone approach is inconvenient for page sharing and shares accounting. It also makes memory management more strict, forbids overcommitting per-container, etc.

umm, who said they were requirements?

We discussed some of the requirements in the "RFC: Memory Controller requirements" thread http://lkml.org/lkml/2006/10/30/51

Maybe you have some ideas how we can decide on this?

We need to work out what the requirements are before we can settle on an implementation.

Sigh. Who is running this show? Anyone?

All the stakeholders involved in the RFC discussion :-) We've been talking and building on top of each other's patches. I hope that was a good answer ;)

You can actually do a form of overcommitment by allowing multiple containers to share one or more of the zones. Whether that is sufficient or suitable I don't know. That depends on the requirements, and we haven't even discussed those, let alone agreed to them.

There are other things like resizing a zone, finding the right size, etc. I'll look at Mel's patches to see what is supported.

Warm Regards,
Balbir Singh
Re: [PATCH] Use more gcc extensions in the Linux headers
On Fri, 09 Mar 2007 20:24:42 PST, Randy Dunlap said:

On Fri, 09 Mar 2007 23:03:05 -0500 [EMAIL PROTECTED] wrote:

-/* GCC is awesome. */
+/* GCC leaves me speechless. */

awesome can mean inspiring awe or admiration or wonder (amazing) or it can mean awful (as in terrifying). 8)

And as those who know me well will attest, it takes something well down the road of either definition to render me actually speechless.. :)
Re: [patch 6/9] signalfd/timerfd v1 - timerfd core ...
On Sat, 10 Mar 2007, Nicholas Miell wrote: UNIX has pid's for process handles, and file descriptors for just about everything else. And I imagine that somebody will come up with way of getting a fd for a process sooner or later. Well, /proc/pid/ is about as close as you get. And that's largely inspired by a Plan-9'ish thing that does indeed expose processes as files. The problem with processes is that they are actually so *complicated* that trying to describe them with a single file isn't all that useful (you could use tons of different ioctl's to do different operations, but that's against the stream of bytes model in UNIX, and even more so against the whole Plan-9 model). Actually, I was thinking reducing struct file to the bare minimum, and then using that as the common header shared by object-specific structures. I don't know how unpleasant that would be from a memory allocation perspective, though. It would probably not be a bad idea, but I just doubt that it makes much of a difference, at least not for timerfd/signalfd files. There likely just won't be that many of them (I'd expect that processes that use them would normally just have one or two of each). It might be more relevant for things like sockets and pty's: do a ls -l /proc/*/fd and see what kind of files you have open, and I suspect most of the files will actually be sockets on a normal desktop setup, and even more so on some network server thing. And yes, it might be nice to avoid allocating memory for the (unnecessary) readahead and f_pos state, but in the end you seldom really have all *that* much memory allocated for file descriptors. The real memory use ends up being elsewhere.. IOW, I don't think it's a bad idea per se, I just doubt that it is worth the complexity and effort. 
Linus
Re: SATA resume slowness, e1000 MSI warning
Michael S. Tsirkin [EMAIL PROTECTED] writes:
Quoting Eric W. Biederman [EMAIL PROTECTED]:
Subject: Re: SATA resume slowness, e1000 MSI warning
Michael S. Tsirkin [EMAIL PROTECTED] writes:

The only case I can see which might trigger this is if we saved pci-X state and then didn't restore it because we could not find the capability on restore.

Hmm. pci_save_pcix_state/pci_restore_pcix_state seem to only handle regular devices and seem to ignore the fact that for bridge PCI-X capability has a different structure. Is this intentional?

Probably not as such. I don't think we have any drivers for bridge devices so I don't think it matters. It likely wouldn't hurt to fix it just in case though. Do any of the mellanox cards do anything with the bridge on the card?

Yes but they do their own thing wrt saving/restoring registers. Look at drivers/infiniband/hw/mthca/mthca_reset.c

If not, here's a patch to fix this. Warning: completely untested.

If you fix the offsets and diff this against my last fix (to never free the buffer) I think your patch makes sense.

Let's agree what the correct offsets are.

PCI: restore bridge PCI-X capability registers after PM event

Restore PCI-X bridge up/downstream capability registers after PM event. This includes maximum split transaction commitment limit which might be vital for PCI X.

Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED]

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index df49530..4b788ef 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -597,14 +597,19 @@ static int pci_save_pcix_state(struct pci_dev *dev)
 	if (pos <= 0)
 		return 0;

-	save_state = kzalloc(sizeof(*save_state) + sizeof(u16), GFP_KERNEL);
+	save_state = kzalloc(sizeof(*save_state) + sizeof(u16) * 2, GFP_KERNEL);
 	if (!save_state) {
-		dev_err(&dev->dev, "Out of memory in pci_save_pcie_state\n");
+		dev_err(&dev->dev, "Out of memory in pci_save_pcix_state\n");
 		return -ENOMEM;
 	}
 	cap = (u16 *)&save_state->data[0];

-	pci_read_config_word(dev, pos + PCI_X_CMD, &cap[i++]);
+	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {

This appears to be the proper test.

+		pci_read_config_word(dev, pos + PCI_X_BRIDGE_UP_SPL_CTL, &cap[i++]);
+		pci_read_config_word(dev, pos + PCI_X_BRIDGE_DN_SPL_CTL, &cap[i++]);
+	} else
+		pci_read_config_word(dev, pos + PCI_X_CMD, &cap[i++]);
+
 	pci_add_saved_cap(dev, save_state);
 	return 0;
 }
@@ -621,7 +626,11 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
 		return;
 	cap = (u16 *)&save_state->data[0];

-	pci_write_config_word(dev, pos + PCI_X_CMD, cap[i++]);
+	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+		pci_write_config_word(dev, pos + PCI_X_BRIDGE_UP_SPL_CTL, cap[i++]);
+		pci_write_config_word(dev, pos + PCI_X_BRIDGE_DN_SPL_CTL, cap[i++]);

These look like the proper two registers to save.

+	} else
+		pci_write_config_word(dev, pos + PCI_X_CMD, cap[i++]);
 	pci_remove_saved_cap(save_state);
 	kfree(save_state);
 }
diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
index f09cce2..fb7eefd 100644
--- a/include/linux/pci_regs.h
+++ b/include/linux/pci_regs.h
@@ -332,6 +332,8 @@
 #define PCI_X_STATUS_SPL_ERR	0x2000	/* Rcvd Split Completion Error Msg */
 #define PCI_X_STATUS_266MHZ	0x4000	/* 266 MHz capable */
 #define PCI_X_STATUS_533MHZ	0x8000	/* 533 MHz capable */
+#define PCI_X_BRIDGE_UP_SPL_CTL	10	/* PCI-X upstream split transaction limit */
+#define PCI_X_BRIDGE_DN_SPL_CTL	14	/* PCI-X downstream split transaction limit */

Unless I am completely misreading the spec. While you have picked the right register to save the offsets should be 0x08 and 0x0c or 8 and 12

No, the spec is written in terms of dwords (32 bit), we are storing words (16 bits). The data at offsets 8 and 12 is read-only split transaction capacity. Split transaction limit starts at bit 16 so you need to add 2 to byte offset. Right?

From that perspective it makes sense. So I will agree with the way you are thinking the code works. The read-only and the read-write part are all defined as part of the same register so I didn't expect them to be separate. And I hadn't paid attention enough to see that the code was only saving 16bit values. Rumor has it that some pci devices can't tolerate 32bit accesses. Although I have never met one. The two factors together suggest that for generic code it probably makes sense to operate on 32bit quantities, and just to ignore the read-only portion.

Eric
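To make the offset arithmetic agreed on above concrete, here is a minimal userspace sketch in C. It assumes, as stated in the thread, that each split transaction register is one little-endian dword with the read-only capacity in bits 0-15 and the read-write limit in bits 16-31. The helper names (upper_word_offset, spl_capacity, spl_limit, spl_merge_limit) are made up for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Dword offsets of the two split transaction registers inside the
 * PCI-X bridge capability, as quoted from the spec in the thread. */
#define PCIX_BRIDGE_UP_SPL_DWORD 0x08
#define PCIX_BRIDGE_DN_SPL_DWORD 0x0c

/* Byte offset of the 16-bit word that holds bits 16-31 of a
 * little-endian dword located at dword_offset. */
static inline unsigned int upper_word_offset(unsigned int dword_offset)
{
	assert((dword_offset & 3) == 0);	/* must be dword aligned */
	return dword_offset + 2;
}

/* Read-only split transaction capacity: low 16 bits of the dword. */
static inline uint16_t spl_capacity(uint32_t reg)
{
	return (uint16_t)(reg & 0xffffu);
}

/* Read-write split transaction limit: high 16 bits of the dword. */
static inline uint16_t spl_limit(uint32_t reg)
{
	return (uint16_t)(reg >> 16);
}

/* Hypothetical 32-bit restore: take the limit from the saved value and
 * keep whatever the read-only capacity half currently reads as. */
static inline uint32_t spl_merge_limit(uint32_t current, uint32_t saved)
{
	return (current & 0x0000ffffu) | (saved & 0xffff0000u);
}
```

upper_word_offset(0x08) gives 10 and upper_word_offset(0x0c) gives 14, matching the PCI_X_BRIDGE_UP_SPL_CTL and PCI_X_BRIDGE_DN_SPL_CTL values in the patch. spl_merge_limit() only illustrates one way the 32-bit variant Eric suggests could ignore the read-only half; it is not taken from any posted patch.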
[PATCH] kthread_should_stop_check_freeze (was: Re: [PATCH -mm 3/7] Freezer: Remove PF_NOFREEZE from rcutorture thread)
On Saturday, 3 March 2007 18:32, Oleg Nesterov wrote:
On 03/02, Paul E. McKenney wrote:
On Sat, Mar 03, 2007 at 02:33:37AM +0300, Oleg Nesterov wrote:
On 03/02, Paul E. McKenney wrote:

One way to embed try_to_freeze() into kthread_should_stop() might be as follows:

int kthread_should_stop(void)
{
	if (kthread_stop_info.k == current)
		return 1;
	try_to_freeze();
	return 0;
}

I think this is dangerous. For example, worker_thread() will probably need some special actions after return from refrigerator. Also, a kernel thread may check kthread_should_stop() in the place where try_to_freeze() is not safe. Perhaps we should introduce a new helper which does this.

Good point -- the return value from try_to_freeze() is lost if one uses the above approach. About one third of the calls to try_to_freeze() in 2.6.20 pay attention to the return value. One approach would be to have a kthread_should_stop_nofreeze() for those cases, and let the default be to try to freeze.

I personally think we should do the opposite, add kthread_should_stop_check_freeze() or something. kthread_should_stop() is like signal_pending(), we can use it under spin_lock (and it is probably used this way by some out-of-tree driver). The new helper is obviously might_sleep().

Something like this, perhaps:

 include/linux/kthread.h |    1 +
 kernel/kthread.c        |   16 ++++++++++++++++
 kernel/rcutorture.c     |    5 ++---
 3 files changed, 19 insertions(+), 3 deletions(-)

Index: linux-2.6.21-rc3-mm2/kernel/kthread.c
===
--- linux-2.6.21-rc3-mm2.orig/kernel/kthread.c 2007-03-08 21:58:48.0 +0100
+++ linux-2.6.21-rc3-mm2/kernel/kthread.c 2007-03-11 18:32:59.0 +0100
@@ -13,6 +13,7 @@
 #include <linux/file.h>
 #include <linux/module.h>
 #include <linux/mutex.h>
+#include <linux/freezer.h>
 #include <asm/semaphore.h>

 /*
@@ -60,6 +61,21 @@ int kthread_should_stop(void)
 }
 EXPORT_SYMBOL(kthread_should_stop);

+/**
+ * kthread_should_stop_check_freeze - check if the thread should return now and
+ * if not, check if there is a freezing request pending for it.
+ */
+int kthread_should_stop_check_freeze(void)
+{
+	might_sleep();
+	if (kthread_stop_info.k == current)
+		return 1;
+
+	try_to_freeze();
+	return 0;
+}
+EXPORT_SYMBOL(kthread_should_stop_check_freeze);
+
 static void kthread_exit_files(void)
 {
 	struct fs_struct *fs;
Index: linux-2.6.21-rc3-mm2/include/linux/kthread.h
===
--- linux-2.6.21-rc3-mm2.orig/include/linux/kthread.h 2007-02-04 19:44:54.0 +0100
+++ linux-2.6.21-rc3-mm2/include/linux/kthread.h 2007-03-11 18:37:10.0 +0100
@@ -29,5 +29,6 @@ struct task_struct *kthread_create(int (
 void kthread_bind(struct task_struct *k, unsigned int cpu);
 int kthread_stop(struct task_struct *k);
 int kthread_should_stop(void);
+int kthread_should_stop_check_freeze(void);

 #endif /* _LINUX_KTHREAD_H */
Index: linux-2.6.21-rc3-mm2/kernel/rcutorture.c
===
--- linux-2.6.21-rc3-mm2.orig/kernel/rcutorture.c 2007-03-11 11:39:06.0 +0100
+++ linux-2.6.21-rc3-mm2/kernel/rcutorture.c 2007-03-11 18:45:00.0 +0100
@@ -540,10 +540,9 @@ rcu_torture_writer(void *arg)
 		}
 		rcu_torture_current_version++;
 		oldbatch = cur_ops->completed();
-		try_to_freeze();
-	} while (!kthread_should_stop() && !fullstop);
+	} while (!kthread_should_stop_check_freeze() && !fullstop);
 	VERBOSE_PRINTK_STRING("rcu_torture_writer task stopping");
-	while (!kthread_should_stop())
+	while (!kthread_should_stop_check_freeze())
 		schedule_timeout_uninterruptible(1);
 	return 0;
 }
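The difference between the two helpers discussed above can be modelled in plain userspace C. This is only a sketch: struct model_thread and every model_* function are hypothetical stand-ins invented for illustration, the assert() merely plays the role that might_sleep() has in the patch, and nothing here is kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the per-thread state involved. */
struct model_thread {
	bool stop_requested;	/* models kthread_stop_info.k == current */
	bool freeze_requested;	/* models a pending freezing request */
	int  times_frozen;	/* how often the "refrigerator" was entered */
	bool holds_spinlock;	/* models running in atomic context */
};

/* Models try_to_freeze(): returns 1 if the thread was frozen, else 0. */
static int model_try_to_freeze(struct model_thread *t)
{
	if (!t->freeze_requested)
		return 0;
	t->freeze_requested = false;
	t->times_frozen++;
	return 1;
}

/* Models kthread_should_stop(): a pure check, usable under a lock. */
static int model_should_stop(const struct model_thread *t)
{
	return t->stop_requested;
}

/* Models kthread_should_stop_check_freeze(): may sleep, so it must not
 * be called in atomic context, and it drops try_to_freeze()'s result. */
static int model_should_stop_check_freeze(struct model_thread *t)
{
	assert(!t->holds_spinlock);	/* stands in for might_sleep() */
	if (t->stop_requested)
		return 1;
	model_try_to_freeze(t);
	return 0;
}
```

In this model a caller that needs "special actions after return from refrigerator" cannot use model_should_stop_check_freeze(), because the return value of model_try_to_freeze() is not visible to it; such a caller would keep the plain check and call the freeze step itself.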
RE: [git patches] libata fixes
Paul, do I understand correctly that the *only* difference between the working setup is that you applied (by hand) the libata patch that Jeff sent out? So plain 2.6.21-rc2 works fine, but with the patch applied, you get no interrupts on the DVD drive?

On Sun, 11 Mar 2007, Paul Rolland wrote:

It seems like IRQ is not getting through. The first IRQ driven command is failing for you.

H Extract is :
ata7: PATA max UDMA/100 cmd 0x00019c00 ctl 0x00019882 bmdma 0x00019400 irq 16
ata8: PATA max UDMA/100 cmd 0x00019800 ctl 0x00019482 bmdma 0x00019408 irq 16
IRQ 16 is IO-APIC-fasteoi for libata, and is not shared... but all the others libata IRQ are IO-APIC-edge.

Ok, that's interesting, although IO-APIC-fasteoi certainly works for others (eg me), but it's still useful.

* Does giving 'acpi=off' or 'irqpoll' make any difference?
* Can you connect a harddisk to the channel and see whether that works?

Tried that.. Disk is identified as ATA-7: Mastor 6Y080L0, YAR41BW0, max UDMA/13 and then timeout again... Tried then with acpi=off, same result (identify is OK, but then timeout), and irqpoll and then it was OK

Whee... There were no changes that looked interrupt-related there..

Linus
Re: [QUICKLIST 4/6] x86_64: Single Quicklist
On Sun, 11 Mar 2007, Andi Kleen wrote:

This and i386 version are ok to me, although it might be better to just finish __GFP_ZERO support to do this.

This would not work for pgds on i386 and x86_64. GFP_ZERO support the way I have done it in the past would mean another set of buddy lists in the page allocator and another issue with fragmentation. So I have stayed away from it although patches exist in my archives (See my ftp.kernel.org archive).

Maybe we could implement limited GFP_ZERO support by just keeping an additional per cpu list of pages? The issue with that one is that a page may grow cold on that list. One usually wants the page to be hot in the cache when it is allocated. This is different for page table pages. Page table pages are typically sparsely accessed.
lockdep question (was Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!)
Quoting Roland Dreier [EMAIL PROTECTED]:
Subject: Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!

Feb 27 17:47:52 sw169 kernel: [8053aaf1] _spin_lock_irqsave+0x15/0x24
Feb 27 17:47:52 sw169 kernel: [88067a23] :ib_ipoib:ipoib_neigh_destructor+0xc2/0x139

It looks like this is deadlocking trying to take priv->lock in ipoib_neigh_destructor(). One idea I just had would be to build a kernel with CONFIG_PROVE_LOCKING turned on, and then rerun this test. There's a good chance that this would diagnose the deadlock. (I don't have good access to my test machines right now, or else I would do it myself)

OK, I did that. But I get

[13440.761857] INFO: trying to register non-static key.
[13440.766903] the code is fine but needs lockdep annotation.
[13440.772455] turning off the locking correctness validator.

and I am not sure what triggers this, or how to fix it to have the validator actually do its job.

Ingo, what key does the message refer to? The stack dump seems to point to drivers/infiniband/ulp/ipoib/ipoib_main.c line 829. Full message below:

[13440.761857] INFO: trying to register non-static key.
[13440.766903] the code is fine but needs lockdep annotation.
[13440.772455] turning off the locking correctness validator.
[13440.778008] [c023c082] __lock_acquire+0xae4/0xbb9
[13440.783078] [c023c43d] lock_acquire+0x56/0x71
[13440.787784] [f899bff2] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib]
[13440.794412] [c051ad41] _spin_lock_irqsave+0x32/0x41
[13440.799649] [f899bff2] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib]
[13440.806275] [f899bff2] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib]
[13440.812897] [c04a1c1b] dst_run_gc+0xc/0x118
[13440.817439] [c022af6e] run_timer_softirq+0x37/0x16b
[13440.822673] [c04a1c0f] dst_run_gc+0x0/0x118
[13440.827221] [c04a3eab] neigh_destroy+0xbe/0x104
[13440.832114] [c04a1bb1] dst_destroy+0x4d/0xab
[13440.836751] [c04a1c64] dst_run_gc+0x55/0x118
[13440.841384] [c022b03f] run_timer_softirq+0x108/0x16b
[13440.846711] [c0227634] __do_softirq+0x5a/0xd5
[13440.851427] [c023b435] trace_hardirqs_on+0x106/0x141
[13440.856754] [c0227643] __do_softirq+0x69/0xd5
[13440.861470] [c02276e6] do_softirq+0x37/0x4d
[13440.866016] [c02167b0] smp_apic_timer_interrupt+0x6b/0x77
[13440.871774] [c02029ef] default_idle+0x3b/0x54
[13440.876491] [c02029ef] default_idle+0x3b/0x54
[13440.881211] [c0204c33] apic_timer_interrupt+0x33/0x38
[13440.886624] [c02029ef] default_idle+0x3b/0x54
[13440.891342] [c02029f1] default_idle+0x3d/0x54
[13440.896061] [c0202aaa] cpu_idle+0xa2/0xbb
[13440.900436] ===
[13768.711447] BUG: spinlock lockup on CPU#1, swapper/0, c0687880
[13768.717353] [c031f919] _raw_spin_lock+0xda/0xfd
[13768.722247] [c051ad48] _spin_lock_irqsave+0x39/0x41
[13768.727486] [f899bff2] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib]
[13768.734110] [f899bff2] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib]
[13768.740735] [c04a1c1b] dst_run_gc+0xc/0x118
[13768.745276] [c022af6e] run_timer_softirq+0x37/0x16b
[13768.750517] [c04a1c0f] dst_run_gc+0x0/0x118
[13768.755061] [c04a3eab] neigh_destroy+0xbe/0x104
[13768.759955] [c04a1bb1] dst_destroy+0x4d/0xab
[13768.764586] [c04a1c64] dst_run_gc+0x55/0x118
[13768.769218] [c022b03f] run_timer_softirq+0x108/0x16b
[13768.774542] [c0227634] __do_softirq+0x5a/0xd5
[13768.779261] [c023b435] trace_hardirqs_on+0x106/0x141
[13768.784588] [c0227643] __do_softirq+0x69/0xd5
[13768.789308] [c02276e6] do_softirq+0x37/0x4d
[13768.793851] [c02167b0] smp_apic_timer_interrupt+0x6b/0x77
[13768.799609] [c02029ef] default_idle+0x3b/0x54
[13768.804326] [c02029ef] default_idle+0x3b/0x54
[13768.809054] [c0204c33] apic_timer_interrupt+0x33/0x38
[13768.814471] [c02029ef] default_idle+0x3b/0x54
[13768.819187] [c02029f1] default_idle+0x3d/0x54
[13768.823903] [c0202aaa] cpu_idle+0xa2/0xbb
[13768.828279] ===

-- MST
Re: lockdep question (was Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!)
Quoting Michael S. Tsirkin [EMAIL PROTECTED]:
Subject: Re: lockdep question (was Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!)

After adding some printks, I started getting these:

[ 597.036720] BUG: MAX_STACK_TRACE_ENTRIES too low!
[ 597.041546] turning off the locking correctness validator.

I looked at our stack usage a bit. It seems some work is in order.

$ make checkstack | grep ib_
0x0603 mthca_init_hca [ib_mthca]: 764
0x14ed mthca_init_hca [ib_mthca]: 764
0x65ae ipoib_cm_tx_start [ib_ipoib]: 368
0x6b0b ipoib_cm_tx_start [ib_ipoib]: 368
0x135f ib_uverbs_query_device [ib_uverbs]: 348
0x15f9 ib_uverbs_query_device [ib_uverbs]: 348
0x05d0 ib_ucm_init_qp_attr [ib_ucm]: 300
0x0697 ib_ucm_init_qp_attr [ib_ucm]: 300
0x7f9e ipoib_path_seq_show [ib_ipoib]: 264
0x8092 ipoib_path_seq_show [ib_ipoib]: 264
0x5b56 ipoib_cm_rx_handler [ib_ipoib]: 220
0x5eec ipoib_cm_rx_handler [ib_ipoib]: 220
0x7934 ipoib_cm_tx_handler [ib_ipoib]: 208
0x7ce0 ipoib_cm_tx_handler [ib_ipoib]: 208
0x32fe ib_uverbs_create_qp [ib_uverbs]: 192
0x36fd ib_uverbs_create_qp [ib_uverbs]: 192
0x28a9 srp_reset_host [ib_srp]: 192
0x2a96 srp_reset_host [ib_srp]: 192
0x1c99 show_sys_image_guid [ib_core]: 188
0x1d2b show_sys_image_guid [ib_core]: 188
0x01f9 ib_sa_service_rec_callback [ib_sa]: 180
0x0234 ib_sa_service_rec_callback [ib_sa]: 180
0x1b3c path_rec_completion [ib_ipoib]: 180
0x2020 path_rec_completion [ib_ipoib]: 180
0x70cf ipoib_cm_handle_rx_wc [ib_ipoib]: 180
0x7402 ipoib_cm_handle_rx_wc [ib_ipoib]: 180
0x09a7 srp_create_target [ib_srp]: 176
0x125f srp_create_target [ib_srp]: 176
0x0d9d ib_cm_listen [ib_cm]: 172
0x10b3 ib_cm_listen [ib_cm]: 172
0x4455 ipoib_mcast_send [ib_ipoib]: 172
0x48e0 ipoib_mcast_send [ib_ipoib]: 172
0x15c1 ipoib_start_xmit [ib_ipoib]: 164
0x1b2d ipoib_start_xmit [ib_ipoib]: 164
0x56c8 mthca_make_profile [ib_mthca]: 160
0x6051 mthca_make_profile [ib_mthca]: 160
0x2abb ipoib_ib_dev_stop [ib_ipoib]: 160
0x2d19 ipoib_ib_dev_stop [ib_ipoib]: 160
0x202b ib_uverbs_query_qp [ib_uverbs]: 156
0x22c0 ib_uverbs_query_qp [ib_uverbs]: 156
0x5269 ipoib_init_qp [ib_ipoib]: 152
0x53bc ipoib_init_qp [ib_ipoib]: 152
0x327f ipoib_mcast_join [ib_ipoib]: 144
0x349d ipoib_mcast_join [ib_ipoib]: 144
0x2092 ib_find_send_mad [ib_mad]: 140
0x23fa ib_find_send_mad [ib_mad]: 140
0x22cf ib_uverbs_modify_qp [ib_uverbs]: 140
0x24f2 ib_uverbs_modify_qp [ib_uverbs]: 140
0xbc8e mthca_modify_qp [ib_mthca]: 136
0xc9cc mthca_modify_qp [ib_mthca]: 136
0x00010cb1 mthca_reg_phys_mr [ib_mthca]: 136
0x0001117a mthca_reg_phys_mr [ib_mthca]: 136
0x35b4 ipoib_mcast_join_finish [ib_ipoib]: 136
0x3a33 ipoib_mcast_join_finish [ib_ipoib]: 136
0x0793 iser_cma_handler [ib_iser]: 132
0x0bc1 iser_cma_handler [ib_iser]: 132
0x1e37 srp_queuecommand [ib_srp]: 132
0x273b srp_queuecommand [ib_srp]: 132
0x8a5a mthca_poll_cq [ib_mthca]: 128
0x9204 mthca_poll_cq [ib_mthca]: 128
0x3a42 ipoib_mcast_join_complete [ib_ipoib]: 128
0x3e6e ipoib_mcast_join_complete [ib_ipoib]: 128
0x4a58 ipoib_mcast_restart_task [ib_ipoib]: 128
0x4eb8 ipoib_mcast_restart_task [ib_ipoib]: 128
0x38e6 ib_uverbs_create_ah [ib_uverbs]: 116
0x3ac4 ib_uverbs_create_ah [ib_uverbs]: 116
0xf6a5 mthca_process_mad [ib_mthca]: 116
0xfa93 mthca_process_mad [ib_mthca]: 116
0x11ef mcast_work_handler [ib_sa]: 112
0x16e6 mcast_work_handler [ib_sa]: 112
0x0a20 ib_ucm_send_req [ib_ucm]: 112
0x0b7c ib_ucm_send_req [ib_ucm]: 112
0x1697 ib_post_send_mad [ib_mad]: 112
0x1b05 ib_post_send_mad [ib_mad]: 112
0x030e iser_post_send [ib_iser]: 112
0x03c5 iser_post_send [ib_iser]: 112
0x1605
Re: lockdep question (was Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!)
On Sun, 2007-03-11 at 15:50 +0200, Michael S. Tsirkin wrote:
Quoting Roland Dreier [EMAIL PROTECTED]:
Subject: Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!

Feb 27 17:47:52 sw169 kernel: [8053aaf1] _spin_lock_irqsave+0x15/0x24
Feb 27 17:47:52 sw169 kernel: [88067a23] :ib_ipoib:ipoib_neigh_destructor+0xc2/0x139

It looks like this is deadlocking trying to take priv->lock in ipoib_neigh_destructor(). One idea I just had would be to build a kernel with CONFIG_PROVE_LOCKING turned on, and then rerun this test. There's a good chance that this would diagnose the deadlock. (I don't have good access to my test machines right now, or else I would do it myself)

OK, I did that. But I get

[13440.761857] INFO: trying to register non-static key.
[13440.766903] the code is fine but needs lockdep annotation.
[13440.772455] turning off the locking correctness validator.

and I am not sure what triggers this, or how to fix it to have the validator actually do its job.

It usually indicates a spinlock is not properly initialized. Like __SPIN_LOCK_UNLOCKED() used in a non-static context, use spin_lock_init() in these cases. However looking at the code, ipoib_neigh_destructor only uses priv->lock, and that seems to get properly initialized in ipoib_setup() using spin_lock_init(). So either there are other sites that instantiate those objects and forget about the lock init, or the object is corrupted (use after free?)

Ingo, what key does the message refer to? The stack dump seems to point to drivers/infiniband/ulp/ipoib/ipoib_main.c line 829. Full message below:

[13440.761857] INFO: trying to register non-static key.
[13440.766903] the code is fine but needs lockdep annotation.
[13440.772455] turning off the locking correctness validator.
Re: lockdep question (was Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!)
After adding some printks, I started getting these:

[ 597.036720] BUG: MAX_STACK_TRACE_ENTRIES too low!
[ 597.041546] turning off the locking correctness validator.
[ 597.047135] [c023a922] save_trace+0x8a/0x8f
[ 597.051751] [c023ae8c] mark_lock+0x65/0x3ff
[ 597.056366] [c023a8d6] save_trace+0x3e/0x8f
[ 597.060980] [c023a9f0] add_lock_to_list+0x62/0x85
[ 597.066116] [c023b992] __lock_acquire+0x3f4/0xbb9
[ 597.071252] [f89da11f] send_mad+0x79/0x103 [ib_sa]
[ 597.076474] [c031a475] idr_get_new_above_int+0x13c/0x216
[ 597.082225] [c023c43d] lock_acquire+0x56/0x71
[ 597.087018] [f89da11f] send_mad+0x79/0x103 [ib_sa]
[ 597.092240] [c051ad41] _spin_lock_irqsave+0x32/0x41
[ 597.097547] [f89da11f] send_mad+0x79/0x103 [ib_sa]
[ 597.102770] [f89da11f] send_mad+0x79/0x103 [ib_sa]
[ 597.107989] [f89da8d9] ib_sa_path_rec_get+0x134/0x172 [ib_sa]
[ 597.114166] [f899b73f] path_rec_start+0x115/0x143 [ib_ipoib]
[ 597.120254] [f899cb38] path_rec_completion+0x0/0x4f4 [ib_ipoib]
[ 597.126610] [f899b874] path_rec_create+0x77/0x9d [ib_ipoib]
[ 597.132617] [f899c9fe] ipoib_start_xmit+0x441/0x57b [ib_ipoib]
[ 597.13] [c051ae06] _spin_unlock_irqrestore+0x34/0x39
[ 597.144635] [c023b435] trace_hardirqs_on+0x106/0x141
[ 597.150035] [c04a058b] dev_queue_xmit+0x109/0x245
[ 597.155167] [c022ae27] __mod_timer+0x94/0x9e
[ 597.159871] [c04a0423] dev_hard_start_xmit+0x1be/0x21d
[ 597.165438] [c04a9fa9] __qdisc_run+0xd7/0x190
[ 597.170226] [c04a05b7] dev_queue_xmit+0x135/0x245
[ 597.175360] [c04ce267] arp_process+0x2c0/0x512
[ 597.180234] [f8954346] mthca_tavor_interrupt+0xf3/0x12b [ib_mthca]
[ 597.186855] [c04a088b] netif_receive_skb+0x1c4/0x1da
[ 597.192254] [c023b435] trace_hardirqs_on+0x106/0x141
[ 597.197648] [c04a0935] process_backlog+0x94/0x107
[ 597.202785] [c049f02b] net_rx_action+0x9a/0x15e
[ 597.207743] [c0227643] __do_softirq+0x69/0xd5
[ 597.212530] [c02276e6] do_softirq+0x37/0x4d
[ 597.217147] [c020617e] do_IRQ+0x5c/0x72
[ 597.221415] [c0204b52] common_interrupt+0x2e/0x34
[ 597.226549] [c02029ef] default_idle+0x3b/0x54
[ 597.231337] [c02029f1] default_idle+0x3d/0x54
[ 597.236124] [c0202aaa] cpu_idle+0xa2/0xbb
[ 597.240567] ===

And sometimes these:

[ 404.493572] KERNEL: assertion (!timer_pending(&dev->watchdog_timer)) failed at net/sched/sch_generic.c (608)

-- MST
Re: lockdep question (was Re: IPoIB caused a kernel: BUG: softlockup detected on CPU#0!)
Quoting Peter Zijlstra [EMAIL PROTECTED]:
Subject: Re: lockdep question (was Re: IPoIB caused a kernel: BUG: softlockup detected on CPU#0!)
On Sun, 2007-03-11 at 15:50 +0200, Michael S. Tsirkin wrote:
Quoting Roland Dreier [EMAIL PROTECTED]:
Subject: Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!

Feb 27 17:47:52 sw169 kernel: [8053aaf1] _spin_lock_irqsave+0x15/0x24
Feb 27 17:47:52 sw169 kernel: [88067a23] :ib_ipoib:ipoib_neigh_destructor+0xc2/0x139

It looks like this is deadlocking trying to take priv->lock in ipoib_neigh_destructor(). One idea I just had would be to build a kernel with CONFIG_PROVE_LOCKING turned on, and then rerun this test. There's a good chance that this would diagnose the deadlock. (I don't have good access to my test machines right now, or else I would do it myself)

OK, I did that. But I get

[13440.761857] INFO: trying to register non-static key.
[13440.766903] the code is fine but needs lockdep annotation.
[13440.772455] turning off the locking correctness validator.

and I am not sure what triggers this, or how to fix it to have the validator actually do its job.

It usually indicates a spinlock is not properly initialized. Like __SPIN_LOCK_UNLOCKED() used in a non-static context, use spin_lock_init() in these cases. However looking at the code, ipoib_neigh_destructor only uses priv->lock, and that seems to get properly initialized in ipoib_setup() using spin_lock_init(). So either there are other sites that instantiate those objects and forget about the lock init, or the object is corrupted (use after free?)

OK, thanks for the hint. So I added this:

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index f9dbc6f..2eea467 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -821,8 +821,15 @@ static void ipoib_neigh_destructor(struct neighbour *n)
 	unsigned long flags;
 	struct ipoib_ah *ah = NULL;

+	if (n->dev->type != ARPHRD_INFINIBAND) {
+		printk(KERN_ERR "ipoib_neigh_destructor lock %p wrong type %d !!\n",
+		       &priv->lock, n->dev->type);
+		BUG_ON(n->dev->type != ARPHRD_INFINIBAND);
+		return;
+	}
+
 	ipoib_dbg(priv,
 		  "neigh_destructor for %06x " IPOIB_GID_FMT "\n",
 		  IPOIB_QPN(n->ha),
 		  IPOIB_GID_RAW_ARG(n->ha + 4));

And sure enough it triggers:

[ 858.503010] ipoib_neigh_destructor lock c0687880 wrong type 772 !!
[ 858.510036] [ cut here ]
[ 858.514723] kernel BUG at drivers/infiniband/ulp/ipoib/ipoib_main.c:827!
[ 858.521486] invalid opcode: [#1]
[ 858.525212] SMP
[ 858.527173] Modules linked in: rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_sa ib_uverbs ibv
[ 858.538736] CPU:0
[ 858.538737] EIP:0060:[f899bfa5]Not tainted VLI
[ 858.538738] EFLAGS: 00010206 (2.6.21-rc3-i686-dbg #4)
[ 858.551755] EIP is at ipoib_neigh_destructor+0x40/0x178 [ib_ipoib]
[ 858.557996] eax: c0687300 ebx: f240e880 ecx: c0223114 edx: c064f280
[ 858.564851] esi: f240e880 edi: f240e880 ebp: c0687880 esp: c06c7e9c
[ 858.571702] ds: 007b es: 007b fs: 00d8 gs: ss: 0068
[ 858.577602] Process swapper (pid: 0, ti=c06c6000 task=c064f280 task.ti=c06c6000)
[ 858.584883] Stack: f89a37be c0687880 0304 c022af6e c064f280
[ 858.593573] c06a2554 c064f280 0001 c064f280
[ 858.602259]c0860be0 c2a1fba0 0246 c06a2554 f240e880 f240e880 c04a
[ 858.610946] Call Trace:
[ 858.613723] [c022af6e] run_timer_softirq+0x37/0x16b
[ 858.618959] [c04a1c0f] dst_run_gc+0x0/0x118
[ 858.623498] [c04a3eab] neigh_destroy+0xbe/0x104
[ 858.628382] [c04a1bb1] dst_destroy+0x4d/0xab
[ 858.632998] [c04a1c64] dst_run_gc+0x55/0x118
[ 858.637620] [c022b03f] run_timer_softirq+0x108/0x16b
[ 858.642934] [c0227634] __do_softirq+0x5a/0xd5
[ 858.647648] [c023b435] trace_hardirqs_on+0x106/0x141
[ 858.652970] [c0227643] __do_softirq+0x69/0xd5
[ 858.657677] [c02276e6] do_softirq+0x37/0x4d
[ 858.662210] [c02167b0] smp_apic_timer_interrupt+0x6b/0x77
[ 858.667965] [c02029ef] default_idle+0x3b/0x54
[ 858.672681] [c02029ef] default_idle+0x3b/0x54
[ 858.677391] [c0204c33] apic_timer_interrupt+0x33/0x38
[ 858.682796] [c02029ef] default_idle+0x3b/0x54
[ 858.687505] [c02029f1] default_idle+0x3d/0x54
[ 858.692211] [c0202aaa] cpu_idle+0xa2/0xbb
[ 858.696569] [c06cd7c3] start_kernel+0x40b/0x413
[ 858.701453] [c06cd1b3] unknown_bootoption+0x0/0x205
[ 858.706678] ===
[ 858.710321] Code: 66 83 f8 20 74 29 0f b7 c0 89 44 24 08 89 6c 24 04 c7 04 24 be 37 9a
[ 858.730997] EIP: [f899bfa5] ipoib_neigh_destructor+0x40/0x178 [ib_ipoib] SS:ESP 0068c
[ 858.740271] Kernel
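For readers decoding the "wrong type 772" line in the report above: the number is a hardware type as listed in the kernel's include/linux/if_arp.h, where 772 is ARPHRD_LOOPBACK and 32 is ARPHRD_INFINIBAND. The following userspace sketch in C names the value and repeats the guard from the debugging patch; arphrd_name() and neigh_belongs_to_ipoib() are hypothetical helpers for illustration, and only the three constants are taken from the kernel header.

```c
#include <assert.h>
#include <string.h>

/* Hardware type values as defined in include/linux/if_arp.h. */
#define ARPHRD_ETHER		1
#define ARPHRD_INFINIBAND	32
#define ARPHRD_LOOPBACK		772

/* Hypothetical helper: name the few hardware types relevant here. */
static const char *arphrd_name(unsigned short type)
{
	switch (type) {
	case ARPHRD_ETHER:
		return "ARPHRD_ETHER";
	case ARPHRD_INFINIBAND:
		return "ARPHRD_INFINIBAND";
	case ARPHRD_LOOPBACK:
		return "ARPHRD_LOOPBACK";
	default:
		return "unknown";
	}
}

/* Same condition as the check added to ipoib_neigh_destructor(): only
 * neighbours on an InfiniBand device carry IPoIB private data. */
static int neigh_belongs_to_ipoib(unsigned short dev_type)
{
	return dev_type == ARPHRD_INFINIBAND;
}
```

Read this way, the log suggests the destructor was invoked for a neighbour whose device is the loopback interface rather than an IPoIB one, which would explain why the lock pointer it derives does not refer to an initialized priv->lock.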
[PATCH v5] Fix rmmod/read/write races in /proc entries
Differences from version 4: Updated in-code comments. Largely rewritten changelog. Lockdep please. --akpm -read_proc, -write_proc aren't special, Extend protection to most methods for regular /proc files. Mentioned by viro. Differences from version 3: Use completion instead of unlock/schedule/lock Move refcount waiting business after removing PDE from lists, so that *cough* possible concurrent remove_proc_entry() will work. Fix following races: === 1. Write via -write_proc sleeps in copy_from_user(). Module disappears meanwhile. Or, more generically, system call done on /proc file, method supplied by module is called, module dissapeares meanwhile. pde = create_proc_entry() if (!pde) return -ENOMEM; pde-write_proc = ... open write copy_from_user pde = create_proc_entry(); if (!pde) { remove_proc_entry(); return -ENOMEM; /* module unloaded */ } *boom* == 2. bogo-revoke aka proc_kill_inodes() remove_proc_entry vfs_read proc_kill_inodes [check -f_op validness] [check -f_op-read validness] [verify_area, security permissions checks] -f_op = NULL; if (file-f_op-read) /* -f_op dereference, boom */ NOTE, NOTE, NOTE: file_operations are proxied for regular files only. Let's see how this scheme behaves, then extend if needed for directories. Directories creators in /proc only set -owner for them, so proxying for directories may be unneeded. NOTE, NOTE, NOTE: methods being proxied are -llseek, -read, -write, -poll, -unlocked_ioctl, -ioctl, -compat_ioctl, -open, -release. If your in-tree module uses something else, yell on me. Full audit pending. 
Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
---

 fs/proc/generic.c | 32 +
 fs/proc/inode.c | 279 +++-
 include/linux/proc_fs.h | 13 ++
 3 files changed, 321 insertions(+), 3 deletions(-)

--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -20,6 +20,7 @@
 #include <linux/idr.h>
 #include <linux/namei.h>
 #include <linux/bitops.h>
 #include <linux/spinlock.h>
+#include <linux/completion.h>
 #include <asm/uaccess.h>
 #include "internal.h"
@@ -613,6 +614,9 @@ static struct proc_dir_entry *proc_creat
 	ent->namelen = len;
 	ent->mode = mode;
 	ent->nlink = nlink;
+	ent->pde_users = 0;
+	spin_lock_init(&ent->pde_unload_lock);
+	ent->pde_unload_completion = NULL;
  out:
 	return ent;
 }
@@ -734,9 +738,35 @@ void remove_proc_entry(const char *name,
 		de = *p;
 		*p = de->next;
 		de->next = NULL;
+
+		spin_lock(&de->pde_unload_lock);
+		/*
+		 * Stop accepting new callers into module. If you're
+		 * dynamically allocating ->proc_fops, save a pointer somewhere.
+		 */
+		de->proc_fops = NULL;
+		/* Wait until all existing callers into module are done. */
+		if (de->pde_users > 0) {
+			DECLARE_COMPLETION_ONSTACK(c);
+
+			if (!de->pde_unload_completion)
+				de->pde_unload_completion = &c;
+
+			spin_unlock(&de->pde_unload_lock);
+			spin_unlock(&proc_subdir_lock);
+
+			wait_for_completion(de->pde_unload_completion);
+
+			spin_lock(&proc_subdir_lock);
+			goto continue_removing;
+		}
+		spin_unlock(&de->pde_unload_lock);
+
+continue_removing:
 		if (S_ISDIR(de->mode))
 			parent->nlink--;
-		proc_kill_inodes(de);
+		if (!S_ISREG(de->mode))
+			proc_kill_inodes(de);
 		de->nlink = 0;
 		WARN_ON(de->subdir);
 		if (!atomic_read(&de->count))
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -142,6 +142,277 @@ static const struct super_operations pro
 	.remount_fs = proc_remount,
 };
+static loff_t proc_reg_llseek(struct file *file, loff_t offset, int whence)
+{
+	struct proc_dir_entry *pde = PDE(file->f_path.dentry->d_inode);
+	loff_t rv = -EINVAL;
+	loff_t (*llseek)(struct file *, loff_t, int);
+
+	spin_lock(&pde->pde_unload_lock);
+	/*
+	 * remove_proc_entry() is going to delete PDE (as part of module
+	 * cleanup sequence). No new callers into module allowed.
+	 */
+	if (!pde->proc_fops)
+		goto out_unlock;
+	/*
+	 * Bump refcount so that remove_proc_entry will wait for ->llseek to
+
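The scheme described in the changelog (refuse new callers once the method pointer is cleared, count callers already inside, and have the remover wait for that count to reach zero) can be modelled in userspace. The following is only a sketch under stated assumptions: `struct entry`, `entry_enter()`, `entry_leave()`, `entry_remove()` and `entry_call()` are hypothetical names, and a pthread mutex plus condition variable is swapped in for the kernel spinlock and completion used by the patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a proc entry; not the kernel struct. */
struct entry {
	pthread_mutex_t lock;     /* plays the role of pde_unload_lock */
	pthread_cond_t  idle;     /* swapped in for the unload completion */
	int             users;    /* callers currently inside a method */
	int           (*op)(int); /* NULL once removal has started */
};

#define ENTRY_INIT(fn) \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, (fn) }

/* Example method that a "module" might supply. */
static int twice(int x) { return 2 * x; }

/* Take a reference, or refuse if removal has already begun. */
static bool entry_enter(struct entry *e, int (**op)(int))
{
	bool ok = false;

	pthread_mutex_lock(&e->lock);
	if (e->op) {
		*op = e->op;
		e->users++;
		ok = true;
	}
	pthread_mutex_unlock(&e->lock);
	return ok;
}

/* Drop the reference and wake the remover when the last caller leaves. */
static void entry_leave(struct entry *e)
{
	pthread_mutex_lock(&e->lock);
	assert(e->users > 0);
	if (--e->users == 0)
		pthread_cond_broadcast(&e->idle);
	pthread_mutex_unlock(&e->lock);
}

/* Stop accepting new callers, then wait for existing ones to finish. */
static void entry_remove(struct entry *e)
{
	pthread_mutex_lock(&e->lock);
	e->op = NULL;
	while (e->users > 0)
		pthread_cond_wait(&e->idle, &e->lock);
	pthread_mutex_unlock(&e->lock);
}

/* Proxy: call the method only while holding a reference. */
static int entry_call(struct entry *e, int arg, int *result)
{
	int (*op)(int);

	if (!entry_enter(e, &op))
		return -1; /* entry is going away */
	*result = op(arg);
	entry_leave(e);
	return 0;
}
```

With an entry initialised as `ENTRY_INIT(twice)`, `entry_call()` succeeds until `entry_remove()` has run and fails with -1 afterwards, which is the property the proxied file_operations are meant to provide.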
Re: SATA resume slowness, e1000 MSI warning
> Rumor has it that some pci devices can't tolerate 32bit accesses.
> Although I have never met one.

hopefully not bridge devices?

> The two factors together suggest that for generic code it probably
> makes sense to operate on 32bit quantities, and just to ignore the
> read-only portion.

The code for regular devices seems to use 16-bit accesses, so I think it's best to stay consistent. Or do you want to change this too?

--
MST
Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler
Al Boldi wrote: BTW, another way to show these hickups would be through some kind of a cpu/proc timing-tracer. Do we have something like that? Here is something like a tracer. Original idea by Chris Friesen, thanks, from this post: http://marc.theaimsgroup.com/?l=linux-kernelm=117331003029329w=4 Try attached chew.c like this: Boot into /bin/sh. Run chew in one console. Run nice chew in another console. Watch timings. Console 1: ./chew pid 655, prio 0, out for6 ms pid 655, prio 0, out for6 ms pid 655, prio 0, out for6 ms pid 655, prio 0, out for5 ms pid 655, prio 0, out for6 ms pid 655, prio 0, out for6 ms pid 655, prio 0, out for5 ms pid 655, prio 0, out for5 ms pid 655, prio 0, out for5 ms pid 655, prio 0, out for6 ms pid 655, prio 0, out for5 ms pid 655, prio 0, out for5 ms pid 655, prio 0, out for5 ms pid 655, prio 0, out for6 ms pid 655, prio 0, out for5 ms pid 655, prio 0, out for6 ms pid 655, prio 0, out for6 ms pid 655, prio 0, out for6 ms pid 655, prio 0, out for5 ms pid 655, prio 0, out for6 ms pid 655, prio 0, out for6 ms pid 655, prio 0, out for5 ms Console 2: nice -10 ./chew pid 669, prio 10, out for6 ms pid 669, prio 10, out for6 ms pid 669, prio 10, out for6 ms pid 669, prio 10, out for5 ms pid 669, prio 10, out for 65 ms pid 669, prio 10, out for6 ms pid 669, prio 10, out for6 ms pid 669, prio 10, out for6 ms pid 669, prio 10, out for5 ms pid 669, prio 10, out for6 ms pid 669, prio 10, out for6 ms pid 669, prio 10, out for5 ms pid 669, prio 10, out for6 ms pid 669, prio 10, out for6 ms pid 669, prio 10, out for 65 ms pid 669, prio 10, out for6 ms pid 669, prio 10, out for6 ms pid 669, prio 10, out for6 ms pid 669, prio 10, out for6 ms pid 669, prio 10, out for6 ms Console 2: nice -15 ./chew pid 673, prio 15, out for5 ms pid 673, prio 15, out for6 ms pid 673, prio 15, out for 95 ms pid 673, prio 15, out for5 ms pid 673, prio 15, out for5 ms pid 673, prio 15, out for6 ms pid 673, prio 15, out for5 ms pid 673, prio 15, out for 95 ms pid 673, prio 15, 
out for5 ms pid 673, prio 15, out for5 ms pid 673, prio 15, out for6 ms pid 673, prio 15, out for6 ms pid 673, prio 15, out for 95 ms pid 673, prio 15, out for6 ms pid 673, prio 15, out for6 ms pid 673, prio 15, out for6 ms pid 673, prio 15, out for6 ms pid 673, prio 15, out for 95 ms pid 673, prio 15, out for5 ms pid 673, prio 15, out for5 ms pid 673, prio 15, out for6 ms pid 673, prio 15, out for5 ms Console 2: nice -18 ./chew pid 677, prio 18, out for 113 ms pid 677, prio 18, out for6 ms pid 677, prio 18, out for 113 ms pid 677, prio 18, out for6 ms pid 677, prio 18, out for 113 ms pid 677, prio 18, out for5 ms pid 677, prio 18, out for 113 ms pid 677, prio 18, out for6 ms pid 677, prio 18, out for 113 ms pid 677, prio 18, out for5 ms pid 677, prio 18, out for 113 ms pid 677, prio 18, out for5 ms pid 677, prio 18, out for 113 ms pid 677, prio 18, out for5 ms pid 677, prio 18, out for 113 ms pid 677, prio 18, out for6 ms pid 677, prio 18, out for 113 ms pid 677, prio 18, out for6 ms pid 677, prio 18, out for 113 ms pid 677, prio 18, out for5 ms pid 677, prio 18, out for 113 ms pid 677, prio 18, out for5 ms Console 2: nice -19 ./chew pid 679, prio 19, out for 119 ms pid 679, prio 19, out for 119 ms pid 679, prio 19, out for 119 ms pid 679, prio 19, out for 119 ms pid 679, prio 19, out for 119 ms pid 679, prio 19, out for 119 ms pid 679, prio 19, out for 119 ms pid 679, prio 19, out for 119 ms pid 679, prio 19, out for 119 ms pid 679, prio 19, out for 119 ms pid 679, prio 19, out for 119 ms pid 679, prio 19, out for 119 ms pid 679, prio 19, out for 119 ms pid 679, prio 19, out for 119 ms pid 679, prio 19, out for 119 ms pid 679, prio 19, out for 119 ms pid 679, prio 19, out for 119 ms pid 679, prio 19, out for 119 ms pid 679, prio 19, out for 119 ms pid 679, prio 19, out for 119 ms pid 679, prio 19, out for 119 ms pid 679, prio 19, out for 119 ms Now with negative nice: Console 1: ./chew pid 674, prio 0, out for6 ms pid 674, prio 0, out for 125 ms pid 674, prio 0, 
out for6 ms pid 674, prio 0, out for5 ms pid 674, prio 0, out for6 ms pid 674, prio 0, out for6 ms pid 674, prio 0, out for5 ms pid 674, prio 0, out for6 ms pid 674, prio 0, out for6 ms pid 674, prio 0, out for6 ms pid 674, prio 0, out for5 ms pid 674, prio 0, out for5 ms pid 674, prio 0, out for5 ms pid 674, prio 0, out for5 ms pid 674, prio 0, out for5 ms pid 674, prio 0, out for5 ms pid 674, prio 0, out for5 ms pid 674, prio 0, out
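The attached chew.c is not reproduced in this archive. As a rough illustration of the idea only (not the original program), a tracer of this kind can spin on a clock and report whenever the gap between two consecutive readings exceeds a threshold, which means the process was off the CPU for about that long. The names `now_us()`, `gap_us()` and `trace_gaps()` are assumptions made for this sketch, and the loop is bounded so that it terminates.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Current monotonic time in microseconds. */
static uint64_t now_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

/* Gap between two readings if it exceeds the threshold, else 0. */
static uint64_t gap_us(uint64_t prev_us, uint64_t cur_us, uint64_t thresh_us)
{
	uint64_t delta;

	assert(cur_us >= prev_us); /* a monotonic clock never goes backwards */
	delta = cur_us - prev_us;
	return delta > thresh_us ? delta : 0;
}

/*
 * Busy-loop for a fixed number of iterations and print one line per
 * detected gap. Returns how many gaps were seen.
 */
static unsigned trace_gaps(unsigned long iterations, uint64_t thresh_us)
{
	unsigned gaps = 0;
	uint64_t prev = now_us();

	for (unsigned long i = 0; i < iterations; i++) {
		uint64_t cur = now_us();
		uint64_t out = gap_us(prev, cur, thresh_us);

		if (out) {
			printf("pid %d, out for %4" PRIu64 " ms\n",
			       (int)getpid(), out / 1000u);
			gaps++;
		}
		prev = cur;
	}
	return gaps;
}
```

Running `trace_gaps()` in two competing processes, one of them niced, should show longer gaps for the niced one in the same spirit as the console output above; the exact numbers depend on the scheduler and the threshold chosen.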
SwSusp to disk doesn't work - Try 2
Suspend to disk doesn't work on my laptop. The suspend seems to hang while enabling the non-boot cpus again.

with platform = test and state = disk i get this:

[cut]
acpi device:02: freeze
video video:00: freeze
acpi device:01: freeze
acpi PNP0C02:00: freeze
pci_root PNP0A08:00: freeze
button PNP0C0E:00: freeze
button PNP0C0C:00: freeze
acpi APP0002:00: freeze
button PNP0C0D:00: freeze
ac ACPI0003:00: freeze
acpi device:00: freeze
processor ACPI0007:01: freeze
processor ACPI0007:00: freeze
button button_power:00: freeze
acpi acpi_system:00: freeze
Disabling non-boot CPUs ...
CPU 1 is now offline
SMP alternatives: switching to UP code
PM: Removing info for No Bus:cpu1
PM: Removing info for No Bus:msr1
CPU1 is down
swsusp debug: Waiting for 5 seconds.
Enabling non-boot CPUs ...

Here the process hangs. But a fortunate coincidence showed me that an acpi event continues the process (pressing the power off button a few times... (2x - 4x) ).

SMP alternatives: switching to SMP code
Booting processor 1/1 eip 3000
CPU 1 irqstacks, hard=c0389000 soft=c0387000
Initializing CPU#1
Calibrating delay using timer specific routine.. 3663.73 BogoMIPS (lpj=6103576)
CPU: After generic identify, caps: bfe9fbff 0010 c1a9
monitor/mwait feature present.
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 2048K
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 1
CPU: After all inits, caps: bfe9fbff 0010 2940 c1a9
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#1.
CPU1: Intel Genuine Intel(R) CPU T2400 @ 1.83GHz stepping 08
PM: Adding info for No Bus:cpu1
PM: Adding info for No Bus:msr1
CPU1 is up
acpi acpi_system:00: resuming
button button_power:00: resuming
processor ACPI0007:00: resuming
processor ACPI0007:01: resuming
acpi device:00: resuming
ac ACPI0003:00: resuming
button PNP0C0D:00: resuming
acpi APP0002:00: resuming
button PNP0C0C:00: resuming
button PNP0C0E:00: resuming
pci_root PNP0A08:00: resuming

Any ideas? The same is true for disk = platform.

With kind regards
thomas
Re: [patch 2/9] signalfd/timerfd - signalfd core ...
On Sun, 11 Mar 2007, Oleg Nesterov wrote:

> On 03/10, Davide Libenzi wrote:
>
> > +static void signalfd_put_sighand(struct signalfd_ctx *ctx,
> > +				 struct sighand_struct *sighand,
> > +				 unsigned long *flags)
> > +{
> > +	unlock_task_sighand(ctx->tsk, flags);
> > +}
>
> Note that signalfd_put_sighand() doesn't need sighand parameter, please
> see below.

I want it to return the sighand, and for symmetry I prefer the put to be passed the parameter back too. Even if not used.

> > +int signalfd_deliver(struct sighand_struct *sighand, int sig,
> > +		     struct siginfo *info)
> > +{
> > +	int nsig = 0;
> > +	struct signalfd_ctx *ctx, *tmp;
> > +
> > +	list_for_each_entry_safe(ctx, tmp, &sighand->sfdlist, lnk) {
> > +		/*
> > +		 * We use a negative signal value as a way to broadcast that the
> > +		 * sighand has been orphaned, so that we can notify all the
> > +		 * listeners about this. Remember the ctx->sigmask is inverted,
> > +		 * so if the user is interested in a signal, that corresponding
> > +		 * bit will be zero.
> > +		 */
> > +		if (sig < 0)
> > +			list_del_init(&ctx->lnk);
>
> I'm afraid this is not right. This should be per-thread. Suppose we have
> threads T1 and T2 from the same thread group. sighand->sfdlist contains
> ctx1 and ctx2 linked to T1 and T2. Now, T1 exits, __exit_signal() does
> signalfd_notify(sighand, -1), and unlinks all threads, not just T1.
>
> IOW, we should do
>
> 	if (ctx->tsk == current) {
> 		list_del_init(&ctx->lnk);
> 		wake_up(&ctx->wqh);
> 	}

Yes, of course. Dunno why the change got lost.

> Perhaps it makes sense to not re-use signalfd_deliver(), but introduce
> a new signalfd_xxx(sighand, tsk) helper for de_thread/exit_signal.
>
> Btw, signalfd_deliver() doesn't use info parameter.
>
> > +		if (sig < 0 || !sigismember(&ctx->sigmask, sig)) {
> > +			wake_up(&ctx->wqh);
>
> Minor nit. Perhaps it makes sense to do
>
> 	void signalfd_deliver(struct task_struct *tsk, int sig, struct sigpending *pending)
> 	{
> 		struct sighand_struct *sighand = tsk->sighand;
> 		int private = (&tsk->pending == pending);
>
> 		list_for_each_entry_safe(ctx, tmp, &sighand->sfdlist, lnk) {
> 			if (private && ctx->tsk != tsk)
> 				continue;
>
> 			if (!sigismember(&ctx->sigmask, sig))
> 				wake_up(&ctx->wqh);
> 		}
> 	}
>
> Even better: signalfd_deliver(struct task_struct *tsk, int sig, int private).
> This way specific_send_sig_info/send_sigqueue won't do a false wakeup.

I agree on the latter.

> > +asmlinkage long sys_signalfd(int ufd, sigset_t __user *user_mask, size_t sizemask)
> > +{
> ...
> > +	if ((sighand = signalfd_get_sighand(ctx, &flags)) != NULL) {
> > +		ctx->sigmask = sigmask;
> > +		signalfd_put_sighand(ctx, sighand, &flags);
> > +	}
>
> This looks like unneeded complication to me, I'd suggest
>
> 	if (signalfd_get_sighand(ctx, &flags)) {
> 		ctx->sigmask = sigmask;
> 		signalfd_put_sighand(ctx, &flags);
> 	}
>
> unlock_task_sighand() (and thus signalfd_put_sighand) doesn't need
> sighand parameter. signalfd_get_sighand() is in fact boolean. It makes
> sense to return sighand, it may be useful, but this patch only needs
> != NULL.
>
> Every usage of signalfd_get_sighand() could be simplified accordingly.

As I said before, I prefer that way.

> > +		 * Tell all the sighand listeners that this sighand has
> > +		 * been detached. Needs to be called with the sighand lock
> > +		 * held.
> > +		 */
> > +		if (unlikely(!list_empty(&oldsighand->sfdlist))) {
> > +			spin_lock_irq(&oldsighand->siglock);
> > +			signalfd_notify(oldsighand, -1, NULL);
> > +			spin_unlock_irq(&oldsighand->siglock);
> > +		}
>
> Very minor nit. I'd suggest to make a new helper and put it in signalfd.h
> (like signalfd_notify()). This will help CONFIG_SIGNALFD.

Yes, makes sense.

- Davide
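The per-thread objection can be illustrated with a small userspace model. This is a sketch under assumptions, not the signalfd code: `struct ctx` and `detach_task()` are hypothetical, task identity is reduced to an integer, and a `linked` flag stands in for list membership. It shows that when one thread exits, only the contexts owned by that thread are unlinked and woken.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a signalfd context hanging off a shared sighand. */
struct ctx {
	int tsk;          /* id of the owning thread */
	int linked;       /* 1 while still on the shared list */
	int woken;        /* 1 once its waiters have been woken */
	struct ctx *next; /* next context on the shared list */
};

/*
 * Called when thread 'exiting_tsk' goes away: unlink and wake only the
 * contexts that belong to it, leaving other threads' contexts untouched.
 * Returns the number of contexts detached.
 */
static int detach_task(struct ctx *head, int exiting_tsk)
{
	int detached = 0;

	assert(exiting_tsk >= 0);
	for (struct ctx *c = head; c != NULL; c = c->next) {
		if (!c->linked || c->tsk != exiting_tsk)
			continue;
		c->linked = 0;
		c->woken = 1;
		detached++;
	}
	return detached;
}
```

For two contexts owned by T1 and T2, calling `detach_task()` for T1 leaves the T2 context linked, whereas the unconditional `list_del_init()` in the quoted patch would have unlinked both.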
Re: SwSusp to disk doesn't work - Try 2
On Sunday, 11 March 2007 19:08, Thomas Meyer wrote:

> Suspend to disk doesn't work on my laptop. The suspend seems to hang
> while enabling the non-boot cpus again.
>
> with platform = test and state = disk i get this:
>
> [cut]
> acpi device:02: freeze
> video video:00: freeze
> acpi device:01: freeze
> acpi PNP0C02:00: freeze
> pci_root PNP0A08:00: freeze
> button PNP0C0E:00: freeze
> button PNP0C0C:00: freeze
> acpi APP0002:00: freeze
> button PNP0C0D:00: freeze
> ac ACPI0003:00: freeze
> acpi device:00: freeze
> processor ACPI0007:01: freeze
> processor ACPI0007:00: freeze
> button button_power:00: freeze
> acpi acpi_system:00: freeze
> Disabling non-boot CPUs ...
> CPU 1 is now offline
> SMP alternatives: switching to UP code
> PM: Removing info for No Bus:cpu1
> PM: Removing info for No Bus:msr1
> CPU1 is down
> swsusp debug: Waiting for 5 seconds.
> Enabling non-boot CPUs ...
>
> Here the process hangs. But a fortunate coincidence showed me that an
> acpi event continues the process (pressing the power off button a few
> times... (2x - 4x) ).

Hm, interesting.

> SMP alternatives: switching to SMP code
> Booting processor 1/1 eip 3000
> CPU 1 irqstacks, hard=c0389000 soft=c0387000
> Initializing CPU#1
> Calibrating delay using timer specific routine.. 3663.73 BogoMIPS (lpj=6103576)
> CPU: After generic identify, caps: bfe9fbff 0010 c1a9
> monitor/mwait feature present.
> CPU: L1 I cache: 32K, L1 D cache: 32K
> CPU: L2 cache: 2048K
> CPU: Physical Processor ID: 0
> CPU: Processor Core ID: 1
> CPU: After all inits, caps: bfe9fbff 0010 2940 c1a9
> Intel machine check architecture supported.
> Intel machine check reporting enabled on CPU#1.
> CPU1: Intel Genuine Intel(R) CPU T2400 @ 1.83GHz stepping 08
> PM: Adding info for No Bus:cpu1
> PM: Adding info for No Bus:msr1
> CPU1 is up
> acpi acpi_system:00: resuming
> button button_power:00: resuming
> processor ACPI0007:00: resuming
> processor ACPI0007:01: resuming
> acpi device:00: resuming
> ac ACPI0003:00: resuming
> button PNP0C0D:00: resuming
> acpi APP0002:00: resuming
> button PNP0C0C:00: resuming
> button PNP0C0E:00: resuming
> pci_root PNP0A08:00: resuming
>
> Any ideas?

Could you please put some printk()s in kernel/cpu.c:_cpu_up() to see where it gets stuck? I bet one of the notifiers goes to sleep (cpufreq, maybe).

Greetings,
Rafael

--
If you don't have the time to read, you don't have the time or the tools to write.
- Stephen King
[WATCHDOG] i8xx_tco - mark for removal patch
Hi all,

I'm planning to remove the i8xx_tco watchdog driver (since we now have the iTCO_wdt driver that has a broader scope). If no-one objects I will send the below patch to Linus for inclusion. (It adds the driver to the feature-removal-schedule list and defaults CONFIG_I8XX_TCO to n.)

Thanks,
Wim.

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index c3b1430..0bc8b0b 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -316,3 +316,11 @@ Why: The option/code is
 Who: Johannes Berg [EMAIL PROTECTED]
 ---
+
+What: i8xx_tco watchdog driver
+When: in 2.6.22
+Why: the i8xx_tco watchdog driver has been replaced by the iTCO_wdt
+ watchdog driver.
+Who: Wim Van Sebroeck [EMAIL PROTECTED]
+
+---
diff --git a/drivers/char/watchdog/Kconfig b/drivers/char/watchdog/Kconfig
index ea09d0c..e812aa1 100644
--- a/drivers/char/watchdog/Kconfig
+++ b/drivers/char/watchdog/Kconfig
@@ -301,6 +301,7 @@ config I6300ESB_WDT
 config I8XX_TCO
 	tristate "Intel i8xx TCO Timer/Watchdog"
 	depends on WATCHDOG && (X86 || IA64) && PCI
+	default n
 	---help---
 	  Hardware driver for the TCO timer built into the Intel 82801
 	  I/O Controller Hub family. The TCO (Total Cost of Ownership)
Re: [patch 6/9] signalfd/timerfd - timerfd core ...
On Sun, 11 Mar 2007, Thomas Gleixner wrote:

> Davide,
>
> On Sat, 2007-03-10 at 18:22 -0800, Davide Libenzi wrote:
>
> Some remarks:
>
> > +
> > +asmlinkage long sys_timerfd(int ufd, int clockid, int tmrtype,
> > +			    const struct timespec __user *utmr)
> > +{
> > +	int error;
> > +	struct timerfd_ctx *ctx;
> > +	struct file *file;
> > +	struct inode *inode;
> > +	ktime_t tval, tnow;
> > +	struct timespec ktmr, tmrnow;
> > +
> > +	error = -EFAULT;
> > +	if (copy_from_user(&ktmr, utmr, sizeof(ktmr)))
> > +		goto err_exit;
>
> Please do not use goto for a simple
> 	return -EFAULT;
>
> Please validate the timespec before converting it.
>
> 	if (!timespec_valid(&ktmr))
> 		return -EINVAL;

Ack.

> > +	tval = timespec_to_ktime(ktmr);
> > +	error = -EINVAL;
> > +	if (clockid != CLOCK_MONOTONIC &&
> > +	    clockid != CLOCK_REALTIME)
> > +		goto err_exit;
> > +	switch (tmrtype) {
> > +	case TFD_TIMER_REL:
> > +	case TFD_TIMER_SEQ:
> > +		break;
> > +	case TFD_TIMER_ABS:
> > +		getnstimeofday(&tmrnow);
> > +		tnow = timespec_to_ktime(tmrnow);
>
> 	tnow = ktime_get();

Ok, I think this is the weird function that is declared static, whose symbol is exported, but is not declared in any .h file :) I used that before, because I saw it inside the hrtimer.c file, but then gcc was puking on me, and I noticed the weirdness.

> > +		if (ktime_to_ns(tval) <= ktime_to_ns(tnow))
> > +			goto err_exit;
> > +		tval = ktime_sub(tval, tnow);
>
> Why do you want to do that ? hrtimers handle relative and absolute
> expiry times. You break down everything to relative time and lose the
> accuracy for absolute timers.

Yes. Those were in need of fixing. The first code I had was not working correctly with abs timers. Didn't have time to dig into it yet. Will verify and fix today...

> > +
> > +	hrtimer_start(&ctx->tmr, ctx->tval, HRTIMER_REL);
> > +
> > +	/*
> > +	 * When we call this, the initialization must be complete, since
> > +	 * aino_getfd() will install the fd.
> > +	 */
> > +	error = aino_getfd(&ufd, &inode, &file, "[timerfd]",
> > +			   &timerfd_fops, ctx);
> > +	if (error)
> > +		goto err_fdalloc;
>
> Why is the timer started before we have everything in place ?

I simplify the error path. The fd does not need to be in place for the timer function to be correctly triggered.

> Also if you turn it around then the (re)programming part of the timer
> can be shared.

The two error/exit paths are different. One needs to free the ctx, while the other one simply needs to do an fput().

> Please use hrtimer_try_to_cancel()
>
> retry:
> 	spin_lock_irq();
> 	if (hrtimer_try_to_cancel(&ctx->tmr) < 0) {
> 		spin_unlock_irq();
> 		cpu_relax();
> 		goto retry;
> 	}

Ok, I will.

> > +static unsigned int timerfd_poll(struct file *file, poll_table *wait)
> > +{
> > +	struct timerfd_ctx *ctx = file->private_data;
> > +
> > +	poll_wait(file, &ctx->wqh, wait);
> > +
> > +	return ctx->ticks ? POLLIN: 0;
>
> This is racy:
>
> 	timer is set up (non periodic)
> 	timer expires
> 	poll
>
> 	now poll is stuck for ever !

Duh, yeah. I use the locked version of wakeups. Will fix.

- Davide
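The review asks for the argument checks to run, and to fail with a plain return, before anything is converted or allocated. A userspace sketch of that ordering follows. It is an illustration under assumptions: `timer_args_check()` is a hypothetical helper, and its range test (non-negative seconds, nanoseconds below one second) is written out by hand rather than taken from the kernel's `timespec_valid()`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>

/*
 * Validate the user-supplied arguments up front and return an error code
 * directly, so no cleanup label is needed for these early failures.
 */
static int timer_args_check(clockid_t clockid, const struct timespec *ts)
{
	assert(ts != NULL);

	/* Reject malformed time values before converting them. */
	if (ts->tv_sec < 0 || ts->tv_nsec < 0 || ts->tv_nsec >= 1000000000L)
		return -EINVAL;

	/* Only the two clocks accepted by the quoted patch are allowed. */
	if (clockid != CLOCK_MONOTONIC && clockid != CLOCK_REALTIME)
		return -EINVAL;

	return 0;
}
```

A caller would run this check first and only then convert the value and set up the timer, which keeps the later error paths limited to the resources that have actually been acquired.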
Re: SATA resume slowness, e1000 MSI warning
Michael S. Tsirkin [EMAIL PROTECTED] writes:

> > Rumor has it that some pci devices can't tolerate 32bit accesses.
> > Although I have never met one.
>
> hopefully not bridge devices?
>
> > The two factors together suggest that for generic code it probably
> > makes sense to operate on 32bit quantities, and just to ignore the
> > read-only portion.
>
> The code for regular devices seems to use 16-bit accesses, so I think
> it's best to stay consistent. Or do you want to change this too?

If we are stomping rare probabilities we might as well change that too.

The code to save pci-x state is relatively recent. So it probably just hasn't met a problem device yet (assuming they exist).

Eric
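The "operate on 32bit quantities and ignore the read-only portion" argument can be made concrete with a toy register model. Everything here is an assumption made for illustration: `struct cfg_dword` and the `cfg_*` helpers are hypothetical and are not a PCI API, and the model only covers well-behaved devices that silently ignore writes to read-only bits; it says nothing about devices that cannot tolerate 32-bit accesses at all.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 32-bit config dword; not a real PCI API. */
struct cfg_dword {
	uint32_t value;   /* current register contents */
	uint32_t ro_mask; /* bits set here ignore writes */
};

static uint32_t cfg_read32(const struct cfg_dword *r)
{
	return r->value;
}

/* A 32-bit write only changes the writable bits. */
static void cfg_write32(struct cfg_dword *r, uint32_t v)
{
	r->value = (r->value & r->ro_mask) | (v & ~r->ro_mask);
}

/* 16-bit read of the half selected by 'upper' (0 = low, 1 = high). */
static uint16_t cfg_read16(const struct cfg_dword *r, int upper)
{
	assert(upper == 0 || upper == 1);
	return (uint16_t)(r->value >> (16 * upper));
}

/* 16-bit write: touch one half, still honouring the read-only mask. */
static void cfg_write16(struct cfg_dword *r, int upper, uint16_t v)
{
	uint32_t shift, lane;

	assert(upper == 0 || upper == 1);
	shift = 16u * (uint32_t)upper;
	lane = 0xFFFFu << shift;
	cfg_write32(r, (r->value & ~lane) | ((uint32_t)v << shift));
}
```

In this model, saving with `cfg_read32()` and restoring with `cfg_write32()` ends in the same state as saving and restoring only the writable half with the 16-bit helpers, which is why the choice between the two is mostly about consistency and about hardware that misbehaves outside the model.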
Re: SwSusp to disk doesn't work - Try 2
Rafael J. Wysocki wrote:

> Could you please put some printk()s in kernel/cpu.c:_cpu_up() to see where it
> gets stuck? I bet one of the notifiers goes to sleep (cpufreq, maybe).

Here we go (ok. i forgot __FUNCTION__ ...):

Mar 11 19:31:33 [kernel] ac ACPI0003:00: freeze
Mar 11 19:31:33 [kernel] acpi device:00: freeze
Mar 11 19:31:33 [kernel] processor ACPI0007:01: freeze
Mar 11 19:31:33 [kernel] processor ACPI0007:00: freeze
Mar 11 19:31:33 [kernel] button button_power:00: freeze
Mar 11 19:31:33 [kernel] acpi acpi_system:00: freeze
Mar 11 19:31:33 [kernel] Disabling non-boot CPUs ...
Mar 11 19:31:33 [kernel] kvm: disabling virtualization on CPU1
Mar 11 19:31:33 [kernel] CPU 1 is now offline
Mar 11 19:31:33 [kernel] SMP alternatives: switching to UP code
Mar 11 19:31:33 [kernel] PM: Removing info for No Bus:cpu1
Mar 11 19:31:33 [kernel] PM: Removing info for No Bus:msr1
Mar 11 19:31:33 [kernel] CPU1 is down
Mar 11 19:31:33 [kernel] swsusp debug: Waiting for 5 seconds.
Mar 11 19:31:33 [kernel] Enabling non-boot CPUs ...
Mar 11 19:31:33 [kernel] NULL: before notifier CPU_UP_PREPARE.

Hung here.

Mar 11 19:31:33 [kernel] NULL: after notifier CPU_UP_PREPARE.
Mar 11 19:31:33 [kernel] SMP alternatives: switching to SMP code
Mar 11 19:31:33 [kernel] Booting processor 1/1 eip 3000
Mar 11 19:31:33 [kernel] CPU 1 irqstacks, hard=c0388000 soft=c0386000
Mar 11 19:31:33 [kernel] Initializing CPU#1
Mar 11 19:31:33 [kernel] Calibrating delay using timer specific routine.. 3663.72 BogoMIPS (lpj=6103555)
Mar 11 19:31:33 [kernel] CPU: After generic identify, caps: bfe9fbff 0010 c1a9
Mar 11 19:31:33 [kernel] monitor/mwait feature present.
Mar 11 19:31:33 [kernel] CPU: L1 I cache: 32K, L1 D cache: 32K
Mar 11 19:31:33 [kernel] CPU: L2 cache: 2048K
Mar 11 19:31:33 [kernel] CPU: Physical Processor ID: 0
Mar 11 19:31:33 [kernel] CPU: Processor Core ID: 1
Mar 11 19:31:33 [kernel] CPU: After all inits, caps: bfe9fbff 0010 2940 c1a9
Mar 11 19:31:33 [kernel] CPU1: Intel Genuine Intel(R) CPU T2400 @ 1.83GHz stepping 08
Mar 11 19:31:33 [kernel] NULL: after __cpu_up
Mar 11 19:31:33 [kernel] NULL: before notifier CPU_ONLINE.
Mar 11 19:31:33 [kernel] kvm: enabling virtualization on CPU1
Mar 11 19:31:33 [kernel] Switched to high resolution mode on CPU 1
Mar 11 19:31:33 [kernel] PM: Adding info for No Bus:cpu1
Mar 11 19:31:33 [kernel] PM: Adding info for No Bus:msr1
Mar 11 19:31:33 [kernel] NULL: after notifier CPU_ONLINE.
Mar 11 19:31:33 [kernel] CPU1 is up
Mar 11 19:31:33 [kernel] acpi acpi_system:00: resuming
Mar 11 19:31:33 [kernel] button button_power:00: resuming
Mar 11 19:31:33 [kernel] processor ACPI0007:00: resuming
Mar 11 19:31:33 [kernel] processor ACPI0007:01: resuming
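For readers who want to reproduce this kind of before/after instrumentation, tagging each message with the enclosing function name makes the log easier to follow. The sketch below is a userspace analogue only: `TRACE_FMT` and `cpu_up_demo()` are hypothetical, `snprintf()` stands in for printk(), and the standard `__func__` identifier is used where kernel code of that era would typically write `__FUNCTION__`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Format "<function>: <message>" into buf; a userspace stand-in for printk(). */
#define TRACE_FMT(buf, len, msg) \
	snprintf((buf), (len), "%s: %s", __func__, (msg))

/* Hypothetical instrumented function; the real target would be _cpu_up(). */
static int cpu_up_demo(char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	return TRACE_FMT(buf, len, "before notifier CPU_UP_PREPARE");
}
```

Because `__func__` expands inside the function where the macro is used, the same macro can be dropped before and after each notifier call without editing the message prefix by hand.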
Re: [RFC][PATCH 1/7] Resource counters
Herbert Poetzl [EMAIL PROTECTED] writes:

> Linux-VServer does the accounting with atomic counters,
> so that works quite fine, just do the checks at the
> beginning of whatever resource allocation and the
> accounting once the resource is acquired ...

Atomic operations versus locks is only a granularity thing. You still need the cache line which is the cost on SMP.

Are you using atomic_add_return or atomic_add_unless, or are you performing your actions in two separate steps, which is racy? What I have seen indicates you are using a racy two separate operation form.

> > If we'll remove failcnt this would look like
> >    while (atomic_cmpxchg(...))
> > which is also not that good. Moreover - in RSS accounting
> > patches I perform page list manipulations under this lock,
> > so this also saves one atomic op.
>
> it still hasn't been shown that this kind of RSS limit
> doesn't add big time overhead to normal operations
> (inside and outside of such a resource container)
>
> note that the 'usual' memory accounting is much more
> lightweight and serves similar purposes ...

Perhaps

Eric
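The difference between a racy check-then-account sequence and a single atomic step can be sketched with C11 atomics. This is an illustration under assumptions, not code from either patch set: `struct res_counter` here is a hypothetical two-field structure, and `res_charge()` uses a compare-and-exchange loop, which is the `while (atomic_cmpxchg(...))` shape mentioned in the thread. Overflow of `usage + val` is not handled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counter: current usage plus a fixed limit. */
struct res_counter {
	atomic_long usage;
	long limit;
};

/*
 * Charge 'val' units only if the result stays within the limit.
 * The limit check and the update happen as one atomic step, so two
 * concurrent callers cannot both pass the check and overshoot.
 */
static bool res_charge(struct res_counter *rc, long val)
{
	long old = atomic_load(&rc->usage);

	assert(val >= 0);
	do {
		if (old + val > rc->limit)
			return false; /* would exceed the limit */
		/* On failure, 'old' is refreshed with the current usage. */
	} while (!atomic_compare_exchange_weak(&rc->usage, &old, old + val));
	return true;
}

/* Give 'val' units back. */
static void res_uncharge(struct res_counter *rc, long val)
{
	atomic_fetch_sub(&rc->usage, val);
}
```

A two-step variant that first reads the usage, compares it with the limit and only later adds to it leaves a window in which another CPU can charge in between; the loop above closes that window at the cost of retrying under contention, and it still bounces the same cache line that a lock would.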