Re: libata extension

2007-03-11 Thread Robert Hancock

Vitaliyi wrote:

Good Day

Say I want to make an extended set of ATA commands available to
userspace for building diagnostic tools.
For example, I need 0x40 (read verify) and 0x32 (write long) with error
handling. I tried the IDE driver through ioctls, but it seems to
lack functionality and is full of gotchas. Furthermore, it sometimes
oopses.

Is it possible to use libata for this purpose, or do I need to write a
separate IDE driver?
By the way, I'm sure it should be done in kernel space, since I'm going
to deal with some HDD manufacturer commands.

P.S. I have been looking through the libata and IDE sources and
documentation but still don't have the broad picture.


I believe you should be able to do this by sending ATA pass-through SCSI
commands to the device using SG_IO, without any kernel changes. That is
really the mechanism that's meant for this.
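
Editorial sketch of the suggestion above, not code from the thread: one way a userspace tool might wrap READ VERIFY SECTOR(S) (0x40) in a SCSI ATA PASS-THROUGH (16) command and submit it with SG_IO. The function names, the 10 second timeout, the 32-byte sense buffer and the choice to set CK_COND are assumptions made for illustration. The CDB layout follows the SAT definition of ATA PASS-THROUGH (16) as best understood here, so check it against the specification before relying on it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

#define ATA_16_OPCODE        0x85  /* SCSI ATA PASS-THROUGH (16) */
#define ATA_PROTO_NON_DATA   3     /* SAT protocol value for non-data commands */
#define ATA_CMD_READ_VERIFY  0x40  /* READ VERIFY SECTOR(S), 28-bit LBA */

/* Fill a 16-byte CDB that asks the SAT layer to issue READ VERIFY. */
void build_read_verify_cdb(unsigned char cdb[16], uint32_t lba, uint8_t nsect)
{
    assert((lba >> 28) == 0);              /* 28-bit command only */
    memset(cdb, 0, 16);
    cdb[0]  = ATA_16_OPCODE;
    cdb[1]  = ATA_PROTO_NON_DATA << 1;     /* protocol lives in bits 4:1 */
    cdb[2]  = 0x20;                        /* CK_COND: return ATA registers in sense data */
    cdb[6]  = nsect;                       /* sector count (7:0) */
    cdb[8]  = lba & 0xff;                  /* LBA (7:0) */
    cdb[10] = (lba >> 8) & 0xff;           /* LBA (15:8) */
    cdb[12] = (lba >> 16) & 0xff;          /* LBA (23:16) */
    cdb[13] = 0x40 | ((lba >> 24) & 0x0f); /* device: LBA mode plus LBA (27:24) */
    cdb[14] = ATA_CMD_READ_VERIFY;
}

/*
 * Submit the command on an already opened device node.
 * Returns the ioctl() result; the caller inspects sense[] for ATA status.
 */
int send_read_verify(int fd, uint32_t lba, uint8_t nsect,
                     unsigned char sense[32])
{
    unsigned char cdb[16];
    struct sg_io_hdr io;

    build_read_verify_cdb(cdb, lba, nsect);
    memset(&io, 0, sizeof(io));
    io.interface_id    = 'S';
    io.dxfer_direction = SG_DXFER_NONE;    /* no data phase */
    io.cmdp            = cdb;
    io.cmd_len         = sizeof(cdb);
    io.sbp             = sense;
    io.mx_sb_len       = 32;
    io.timeout         = 10000;            /* milliseconds (assumed value) */
    return ioctl(fd, SG_IO, &io);
}
```

A caller would open the disk's device node, call send_read_verify(), and then decode the returned sense data to see the ATA status and error registers. Commands that transfer data, such as write long, would need a different protocol value and a data buffer, which this sketch does not cover.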


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove nospam from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 1/3] Add ability to keep track of callers of symbol_(get|put)

2007-03-11 Thread Andrew Morton
 On Sat, 10 Mar 2007 02:31:35 -0200 Mauro Carvalho Chehab [EMAIL PROTECTED] 
 wrote:
 From: Trent Piepho [EMAIL PROTECTED]
 
 When a module uses symbol_get() to increase the ref count of another
 module, there is no record what module called symbol_get().  A module
 can
 show up as having other users, but there is no way to tell who those
 users are.
 
 This adds that ability to symbol_put() and symbol_get().

One day I'll write a script which unwordwraps patches and then you'll all
need to find new ways of torturing me.

This patch needed rather a lot of help in the coding-style department. 
Hopefully Rusty can comment on the content, because I'm all exhausted from
cleaning it up.



Re: [2/6] 2.6.21-rc2: known regressions

2007-03-11 Thread Ingo Molnar

* Pavel Machek [EMAIL PROTECTED] wrote:

  Probably tweaking the webpage doesn't help because people don't get
  there, as the results plainly show. Maybe some more automation
  would be useful too: a tool that detects a failed resume and tries all
  the options that make sense on that box, or something? It's not
  like that
 
 Unfortunately, these tend to crash the box when you pass the wrong
 options, and I do not see an easy way to test automatically whether
 the user can see what's on the display.

You could perhaps try what X's modesetting utility does: display a
dialog box that times out if it does not get clicked on, and reboot if
it did not get clicked on. Likewise, detect upon the next bootup that a
suspend test was in progress (and didn't get back via normal resume), via
some temporary file. That way both the 'did not resume and I had to
power-cycle' and the 'resume did not restore my X' problems can be
handled.

Finally, when the correct options have been established (worst case with
a small number of reboots and 'yes, indeed the resume did not work fine'
clicks done upon bootup by the user), automatically fill in a web form in
Firefox and ask the user to do a single click to submit that form.

Techniques like that have more chance, I think, of getting Linux
suspend/resume anywhere near to working. The current 'rely on the
developer' technique apparently does not work.

Ingo


Re: [RFC][PATCH 6/7] Account for the number of tasks within container

2007-03-11 Thread Pavel Emelianov
Paul Menage wrote:
 On 3/6/07, Pavel Emelianov [EMAIL PROTECTED] wrote:
 The idea is:

 Task may be the entity that allocates the resources and the
 entity that is a resource allocated.

 When task is the first entity it may move across containers
 (that is implemented in your patches). When task is a resource
 it shouldn't move across containers like files or pages do.

 More generally - allocated resources hold reference to original
 container till they die. No resource migration is performed.

 Did I express my idea clearly?
 
 Yes, but I disagree with the premise. The title of your patch is
 "Account for the number of tasks within container", but that's not
 what the subsystem does; it accounts for the number of forks within
 the container that aren't directly accompanied by an exit.
 
 Ideally, resources like files and pages would be able to follow tasks
 as well. The reason that files and pages aren't easily migrated from
 one container to another is that there could be sharing involved;
 figuring out the sharing can be expensive, and it's not clear what to
 do if two users are in different containers.
 
 But in the case of a task count, there are no such issues with
 sharing, so it seems to me to be more sensible (and more efficient) to
 just limit the number of tasks in a container.
 
 i.e. when moving a task into a container or forking a task within a
 container, increment the count; when moving a task out of a container
 or when it exits, decrement the count.

Sounds reasonable.
I'll take this into account when I make the next iteration.
Thanks.

 With your approach, if you were to set the task limit of an empty
 container A to 1, and then move a process P from B into A, P would be
 able to fork a new child, since the task count would be 0 (as P was
 being charged to B still). Surely the fact that there's 1 process in A
 should prevent P from forking?
 
 Paul
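
Editorial sketch of the counting scheme Paul describes above (increment on fork or move-in, decrement on exit or move-out). This is plain userspace C with hypothetical names, not code from either patch set; locking is omitted, and refusing a move into a full container is an assumption, since the thread only says the count is incremented on a move.

```c
#include <assert.h>
#include <errno.h>

struct task_counter {
    unsigned long count;  /* tasks currently in the container */
    unsigned long limit;  /* maximum number of tasks allowed */
};

/* Charge one task (fork or move-in); fails when the container is full. */
int task_charge(struct task_counter *c)
{
    if (c->count >= c->limit)
        return -ENOMEM;
    c->count++;
    return 0;
}

/* Uncharge one task (exit or move-out). */
void task_uncharge(struct task_counter *c)
{
    assert(c->count > 0);
    c->count--;
}

/* Move a task: charge the destination first, then release the source. */
int task_move(struct task_counter *from, struct task_counter *to)
{
    int ret = task_charge(to);

    if (ret == 0)
        task_uncharge(from);
    return ret;
}
```

With container A limited to one task and process P moved from B into A, the count in A becomes 1, so a later task_charge() for a fork inside A fails. That is the behaviour Paul argues for in the example above.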
 



Re: [RFC][PATCH 1/7] Resource counters

2007-03-11 Thread Pavel Emelianov
Herbert Poetzl wrote:
 On Wed, Mar 07, 2007 at 10:19:05AM +0300, Pavel Emelianov wrote:
 Balbir Singh wrote:
 Pavel Emelianov wrote:
 Introduce generic structures and routines for
 resource accounting.

 Each resource accounting container is supposed to
 aggregate it, container_subsystem_state and its
 resource-specific members within.


 

 diff -upr linux-2.6.20.orig/include/linux/res_counter.h
 linux-2.6.20-0/include/linux/res_counter.h
 --- linux-2.6.20.orig/include/linux/res_counter.h2007-03-06
 13:39:17.0 +0300
 +++ linux-2.6.20-0/include/linux/res_counter.h2007-03-06
 13:33:28.0 +0300
 @@ -0,0 +1,83 @@
 +#ifndef __RES_COUNTER_H__
 +#define __RES_COUNTER_H__
 +/*
 + * resource counters
 + *
 + * Copyright 2007 OpenVZ SWsoft Inc
 + *
 + * Author: Pavel Emelianov [EMAIL PROTECTED]
 + *
 + */
 +
 +#include <linux/container.h>
 +
 +struct res_counter {
 +unsigned long usage;
 +unsigned long limit;
 +unsigned long failcnt;
 +spinlock_t lock;
 +};
 +
 +enum {
 +RES_USAGE,
 +RES_LIMIT,
 +RES_FAILCNT,
 +};
 +
 +ssize_t res_counter_read(struct res_counter *cnt, int member,
 +const char __user *buf, size_t nbytes, loff_t *pos);
 +ssize_t res_counter_write(struct res_counter *cnt, int member,
 +const char __user *buf, size_t nbytes, loff_t *pos);
 +
 +static inline void res_counter_init(struct res_counter *cnt)
 +{
 +spin_lock_init(&cnt->lock);
 +cnt->limit = (unsigned long)LONG_MAX;
 +}
 +
 Is there any way to indicate that there are no limits on this container?
 Yes - LONG_MAX is essentially a no-limit value, as no
 container will ever have that many files :)
 
 -1 or ~0 is a viable choice for userspace to
 communicate 'infinite' or 'unlimited'

OK, I'll make ULONG_MAX :)

 LONG_MAX is quite huge, but still when the administrator wants to
 configure a container to *un-limited usage*, it becomes hard for
 the administrator.

 +static inline int res_counter_charge_locked(struct res_counter *cnt,
 +unsigned long val)
 +{
 +if (cnt->usage <= cnt->limit - val) {
 +cnt->usage += val;
 +return 0;
 +}
 +
 +cnt->failcnt++;
 +return -ENOMEM;
 +}
 +
 +static inline int res_counter_charge(struct res_counter *cnt,
 +unsigned long val)
 +{
 +int ret;
 +unsigned long flags;
 +
 +spin_lock_irqsave(&cnt->lock, flags);
 +ret = res_counter_charge_locked(cnt, val);
 +spin_unlock_irqrestore(&cnt->lock, flags);
 +return ret;
 +}
 +
 Will atomic counters help here?
 I'm afraid not. We have to atomically check for the limit and alter
 one of usage or failcnt depending on the result of the check. Doing
 this with atomic_xxx ops will require at least two ops.
 
 Linux-VServer does the accounting with atomic counters,
 so that works quite fine, just do the checks at the
 beginning of whatever resource allocation and the
 accounting once the resource is acquired ...

This works quite fine on non-preemptible kernels.
Between the time you check for the resource and the time you actually
account it, the kernel may preempt and let another process
pass through the vx_anything_avail() check.

 If we remove failcnt, this would look like
while (atomic_cmpxchg(...))
 which is also not that good.

 Moreover - in RSS accounting patches I perform page list
 manipulations under this lock, so this also saves one atomic op.
 
 it still hasn't been shown that this kind of RSS limit
 doesn't add big time overhead to normal operations
 (inside and outside of such a resource container)
 
 note that the 'usual' memory accounting is much more
 lightweight and serves similar purposes ...

It OOM-kills current in case of a limit hit instead of
reclaiming pages or killing the *memory eater* to free memory.

 best,
 Herbert
 
 +static inline void res_counter_uncharge_locked(struct res_counter *cnt,
 +unsigned long val)
 +{
 +if (unlikely(cnt->usage < val)) {
 +WARN_ON(1);
 +val = cnt->usage;
 +}
 +
 +cnt->usage -= val;
 +}
 +
 +static inline void res_counter_uncharge(struct res_counter *cnt,
 +unsigned long val)
 +{
 +unsigned long flags;
 +
 +spin_lock_irqsave(&cnt->lock, flags);
 +res_counter_uncharge_locked(cnt, val);
 +spin_unlock_irqrestore(&cnt->lock, flags);
 +}
 +
 +#endif
 diff -upr linux-2.6.20.orig/init/Kconfig linux-2.6.20-0/init/Kconfig
 --- linux-2.6.20.orig/init/Kconfig2007-03-06 13:33:28.0 +0300
 +++ linux-2.6.20-0/init/Kconfig2007-03-06 13:33:28.0 +0300
 @@ -265,6 +265,10 @@ config CPUSETS

Say N if unsure.

 +config RESOURCE_COUNTERS
 +bool
 +select CONTAINERS
 +
  config SYSFS_DEPRECATED
  bool "Create deprecated sysfs files"
  default y
 diff -upr linux-2.6.20.orig/kernel/Makefile
 linux-2.6.20-0/kernel/Makefile
 --- linux-2.6.20.orig/kernel/Makefile2007-03-06 13:33:28.0
 +0300
 +++ linux-2.6.20-0/kernel/Makefile2007-03-06 13:33:28.0 +0300
 @@ -51,6 

Re: [RFC][PATCH 2/7] RSS controller core

2007-03-11 Thread Pavel Emelianov
Herbert Poetzl wrote:
 On Tue, Mar 06, 2007 at 02:00:36PM -0800, Andrew Morton wrote:
 On Tue, 06 Mar 2007 17:55:29 +0300
 Pavel Emelianov [EMAIL PROTECTED] wrote:

 +struct rss_container {
 +   struct res_counter res;
 +   struct list_head page_list;
 +   struct container_subsys_state css;
 +};
 +
 +struct page_container {
 +   struct page *page;
 +   struct rss_container *cnt;
 +   struct list_head list;
 +};
 ah. This looks good. I'll find a hunk of time to go through this work
 and through Paul's patches. It'd be good to get both patchsets lined
 up in -mm within a couple of weeks. But..
 
 doesn't look so good to me, mainly because of the
 additional per-page data and per-page processing
 
 on 4GB memory, with 100 guests, 50% shared for each
 guest, this basically means ~1 million pages, 500k shared
 and 1500k x sizeof(page_container) entries, which
 roughly boils down to ~25MB of wasted memory ...
 
 increase the amount of shared pages and it starts
 getting worse, but maybe I'm missing something here

You are. Each page has only one page_container associated
with it, regardless of the number of containers it is shared
between.
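
Editorial back-of-the-envelope check of the overhead being discussed, assuming one page_container per page as Pavel states, a 4 KiB page size, and that every page is tracked. The structs below are userspace stand-ins that copy only the member layout quoted earlier in this message; the helper name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins with the same member layout as the quoted structs. */
struct list_head { struct list_head *next, *prev; };
struct page;           /* opaque here */
struct rss_container;  /* opaque here */

struct page_container {
    struct page *page;
    struct rss_container *cnt;
    struct list_head list;
};

/* Bytes of page_container metadata for mem_bytes of RAM, one entry per page. */
unsigned long long page_container_overhead(unsigned long long mem_bytes,
                                           unsigned long page_size)
{
    unsigned long long pages;

    assert(page_size != 0);
    pages = mem_bytes / page_size;
    return pages * sizeof(struct page_container);
}
```

For 4 GiB and 4 KiB pages this is 1,048,576 entries of four pointers each: 16 MiB with 4-byte pointers and 32 MiB with 8-byte pointers. Under these assumptions the figure does not grow with the number of containers sharing a page.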

 We need to decide whether we want to do per-container memory
 limitation via these data structures, or whether we do it via a
 physical scan of some software zone, possibly based on Mel's patches.
 
 why not do simple page accounting (as done currently
 in Linux) and use that for the limits, without
 keeping the reference from container to page?

As I've already answered in my previous letter, simple
limiting without per-container reclamation and a per-container
OOM killer isn't good memory management. It doesn't allow
resource shortage to be handled gracefully.

This patchset provides a more graceful way to handle this. Full
memory management also includes accounting of VMA length
(returning ENOMEM from system calls), but we've decided
to start with RSS.

 best,
 Herbert
 
 ___
 Containers mailing list
 [EMAIL PROTECTED]
 https://lists.osdl.org/mailman/listinfo/containers
 



[BIG] Re: sched rsdl fix for 0.28

2007-03-11 Thread Nicolas Mailhot
Le dimanche 11 mars 2007 à 11:07 +1100, Con Kolivas a écrit :
 sched rsdl fix

Doesn't change a thing. Always breaks at the same place (though
depending on hardware timings? the trace is not always the same). Pretty
sure nothing happens before this failure

-- 
Nicolas Mailhot




Re: [BIG] Re: sched rsdl fix for 0.28

2007-03-11 Thread Con Kolivas
On Sunday 11 March 2007 20:10, Nicolas Mailhot wrote:
 Le dimanche 11 mars 2007 à 11:07 +1100, Con Kolivas a écrit :
  sched rsdl fix

 Doesn't change a thing. Always breaks at the same place (though
 depending on hardware timings? the trace is not always the same). Pretty
 sure nothing happens before this failure

Bummer. The only other thing to try is v0.29 posted recently. I still haven't 
got a good way to reproduce this locally but I'll keep trying. Thanks for 
testing.

-- 
-ck


Re: [BIG] Re: sched rsdl fix for 0.28

2007-03-11 Thread Con Kolivas
On Sunday 11 March 2007 20:21, Con Kolivas wrote:
 On Sunday 11 March 2007 20:10, Nicolas Mailhot wrote:
  Le dimanche 11 mars 2007 à 11:07 +1100, Con Kolivas a écrit :
   sched rsdl fix
 
  Doesn't change a thing. Always breaks at the same place (though
  depending on hardware timings? the trace is not always the same). Pretty
  sure nothing happens before this failure

 Bummer. The only other thing to try is v0.29 posted recently. I still
 haven't got a good way to reproduce this locally but I'll keep trying.
 Thanks for testing.

Oh and if that oopses and you still have the time, could you please test 0.29 
on 2.6.20.2 (available from same directory).

-- 
-ck


Re: PROBLEM: Make nenuconfig does not save parameters.

2007-03-11 Thread Cyrill Gorcunov
[Sam Ravnborg - Sat, Mar 10, 2007 at 11:45:34PM +0100]
| On Sat, Mar 10, 2007 at 10:34:41PM +0100, Jan Engelhardt wrote:
|  
|  On Mar 10 2007 22:27, Sam Ravnborg wrote:
|  On Sat, Mar 10, 2007 at 07:23:41PM +0100, Jan Engelhardt wrote:
|   
|   Whether the 'working config file path' should change when you do
|   'Save as Alternate' or not, is a menuconfig axiom. Ask Sam Ravnborg
|   if you want it changed :-)
|  
|  Current behaviour is not logical, but on the other hand I do not
|  see a big need to make it so.
|  It seems that people very seldom use save alternate anyway.
|  
|  But patches are welcome.
|  
|  ^_^ The patch has already been posted, hasn't it?
| No.
| Either we keep the current behaviour or we change to the normal
| behaviour with a Save as... as known from all other programs.
| 
|   Sam
| 

Hi Sam,

here is a patch for menuconfig that shows the current configuration
file. I think menuconfig does its work well; the only
thing we need is to show the location of the _active_ configuration.

Any comments are welcome (and you may swear at me too :)

Cyrill

diff --git a/scripts/kconfig/mconf.c b/scripts/kconfig/mconf.c
index 3f9a132..cde6792 100644
--- a/scripts/kconfig/mconf.c
+++ b/scripts/kconfig/mconf.c
@@ -602,6 +602,12 @@ static void conf(struct menu *menu)
item_set_tag('L');
item_make(_("Save an Alternate Configuration File"));
item_set_tag('S');
+   item_make("--- ");
+   item_set_tag(':');
+   item_make(_("Current Configuration File: "));
+   item_set_tag(':');
+   item_add_str("%s", filename);
+
}
dialog_clear();
res = dialog_menu(prompt ? prompt : _("Main Menu"),
@@ -816,8 +822,11 @@ static void conf_load(void)
case 0:
if (!dialog_input_result[0])
return;
-   if (!conf_read(dialog_input_result))
+   if (!conf_read(dialog_input_result)) {
+   memset(filename, 0x0, PATH_MAX+1);
+   strncpy(filename, dialog_input_result, PATH_MAX);
return;
+   }
show_textbox(NULL, _("File does not exist!"), 5, 38);
break;
case 1:
@@ -840,8 +849,11 @@ static void conf_save(void)
case 0:
if (!dialog_input_result[0])
return;
-   if (!conf_write(dialog_input_result))
+   if (!conf_write(dialog_input_result)) {
+   memset(filename, 0x0, PATH_MAX+1);
+   strncpy(filename, dialog_input_result, PATH_MAX);
return;
+   }
show_textbox(NULL, _("Can't create file!  Probably a nonexistent directory."), 5, 60);
break;
case 1:
@@ -903,7 +915,7 @@ int main(int ac, char **av)
 
switch (res) {
case 0:
-   if (conf_write(NULL)) {
+   if (conf_write(filename)) {
fprintf(stderr, _("\n\n"
"Error during writing of the kernel configuration.\n"
"Your kernel configuration changes were NOT saved."


Re: Use of absolute timeouts for oneshot timers

2007-03-11 Thread Thomas Gleixner
On Sat, 2007-03-10 at 16:42 -0800, Jeremy Fitzhardinge wrote:
 Thomas Gleixner wrote:
  It's simply enforced in NO_HZ, HIGHRES mode as we operate in absolute
  time, which is read back from the clocksource, even if we use a relative
  value for real hardware clock event devices to program the next event.
  We calculate the delta between the absolute event and now. So we never
  get an accumulating error.
 
  What problem are you observing ?
 
 Actually, two things.  There was the unexpected pauses during boot,
 which is trivially fixable by not using the Xen periodic timer, and
 using the single-shot fallback.
 
 But I'm making the more general observation that if you use an absolute
 rather than relative time to set the single-shot timeout, then you have
 to deal with a long-term cumulative drift between the kernel's monotonic
 time and the hypervisor's monotonic time.  This can happen even if your
 clocksource is derived directly from the hypervisor monotonic time,
 because running ntp will warp the kernel's time, and so it will drift
 with respect to the hypervisor clock.  You can only avoid this by 1) not
 allowing adjtime, or 2) making those same adjtime warps to the
 hypervisor time.  Neither of these is a good general solution.

Sigh, yes. Using a relative time for the next event is probably the
least ugly solution
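
Editorial sketch of the relative-timeout approach agreed on above: the expiry is kept as an absolute value on the kernel's own clock, and only the difference to "now" on that same clock is handed to the event device. The function name and the min_delta_ns clamp are assumptions for illustration, not code from the thread.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert an absolute expiry into a relative delay. Both expires_ns and
 * now_ns are read from the same (kernel) clock, so any drift between that
 * clock and the hypervisor's clock never enters the calculation.
 */
int64_t next_event_delta_ns(int64_t expires_ns, int64_t now_ns,
                            int64_t min_delta_ns)
{
    int64_t delta;

    assert(min_delta_ns >= 0);
    delta = expires_ns - now_ns;
    /* Already expired or too close: program the shortest delay allowed. */
    if (delta < min_delta_ns)
        delta = min_delta_ns;
    return delta;
}
```

Because the delta is recomputed from the kernel's clock on every reprogramming, the error does not accumulate, which matches Thomas's earlier description of how the delta is calculated.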

tglx





Re: [RFC][PATCH 5/7] Per-container OOM killer and page reclamation

2007-03-11 Thread Pavel Emelianov
Balbir Singh wrote:
 Hi, Pavel,
 
 Please find my patch to add LRU behaviour to your latest RSS controller.

Thanks for participation and additional testing :)
I'll include this into next generation of patches.

 Balbir Singh
 Linux Technology Center
 IBM, ISTL
 
 
 
 
 Add LRU behaviour to the RSS controller patches posted by Pavel Emelianov
 
   http://lkml.org/lkml/2007/3/6/198
 
 which was in turn similar to the RSS controller posted by me
 
   http://lkml.org/lkml/2007/2/26/8
 
 Pavel's patches have a per container list of pages, which helps reduce
 reclaim time of the RSS controller but the per container list of pages is
 in FIFO order. I've implemented active and inactive lists per container to
 help select the right set of pages to reclaim when the container is under
 memory pressure.
 
 I've tested these patches on a ppc64 machine and they work fine for
 the minimal testing I've done.
 
 Pavel would you please include these patches in your next iteration.
 
 Comments, suggestions and further improvements are as always welcome!
 
 Signed-off-by: [EMAIL PROTECTED]
 ---
 
  include/linux/rss_container.h |1 
  mm/rss_container.c|   47 
 +++---
  mm/swap.c |5 
  mm/vmscan.c   |3 ++
  4 files changed, 44 insertions(+), 12 deletions(-)
 
 diff -puN include/linux/rss_container.h~rss-container-lru2 
 include/linux/rss_container.h
 --- linux-2.6.20/include/linux/rss_container.h~rss-container-lru2 
 2007-03-09 22:52:56.0 +0530
 +++ linux-2.6.20-balbir/include/linux/rss_container.h 2007-03-10 
 00:39:59.0 +0530
 @@ -19,6 +19,7 @@ int container_rss_prepare(struct page *,
  void container_rss_add(struct page_container *);
  void container_rss_del(struct page_container *);
  void container_rss_release(struct page_container *);
 +void container_rss_move_lists(struct page *pg, bool active);
  
  int mm_init_container(struct mm_struct *mm, struct task_struct *tsk);
  void mm_free_container(struct mm_struct *mm);
 diff -puN mm/rss_container.c~rss-container-lru2 mm/rss_container.c
 --- linux-2.6.20/mm/rss_container.c~rss-container-lru22007-03-09 
 22:52:56.0 +0530
 +++ linux-2.6.20-balbir/mm/rss_container.c2007-03-10 02:42:54.0 
 +0530
 @@ -17,7 +17,8 @@ static struct container_subsys rss_subsy
  
  struct rss_container {
   struct res_counter res;
 - struct list_head page_list;
 + struct list_head inactive_list;
 + struct list_head active_list;
   struct container_subsys_state css;
  };
  
 @@ -96,6 +97,26 @@ void container_rss_release(struct page_c
   kfree(pc);
  }
  
 +void container_rss_move_lists(struct page *pg, bool active)
 +{
 + struct rss_container *rss;
 + struct page_container *pc;
 +
 + if (!page_mapped(pg))
 + return;
 +
 + pc = page_container(pg);
 + BUG_ON(!pc);
 + rss = pc->cnt;
 +
 + spin_lock_irq(&rss->res.lock);
 + if (active)
 + list_move(&pc->list, &rss->active_list);
 + else
 + list_move(&pc->list, &rss->inactive_list);
 + spin_unlock_irq(&rss->res.lock);
 +}
 +
  void container_rss_add(struct page_container *pc)
  {
   struct page *pg;
 @@ -105,7 +126,7 @@ void container_rss_add(struct page_conta
   rss = pc->cnt;
  
   spin_lock(&rss->res.lock);
 - list_add(&pc->list, &rss->page_list);
 + list_add(&pc->list, &rss->active_list);
   spin_unlock(&rss->res.lock);
  
   page_container(pg) = pc;
 @@ -141,7 +162,10 @@ unsigned long container_isolate_pages(un
   struct zone *z;
  
   spin_lock_irq(&rss->res.lock);
 - src = &rss->page_list;
 + if (active)
 + src = &rss->active_list;
 + else
 + src = &rss->inactive_list;
  
   for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
   pc = list_entry(src->prev, struct page_container, list);
 @@ -152,13 +176,10 @@ unsigned long container_isolate_pages(un
  
   spin_lock(&z->lru_lock);
   if (PageLRU(page)) {
 - if ((active && PageActive(page)) ||
 - (!active && !PageActive(page))) {
 - if (likely(get_page_unless_zero(page))) {
 - ClearPageLRU(page);
 - nr_taken++;
 - list_move(&page->lru, dst);
 - }
 + if (likely(get_page_unless_zero(page))) {
 + ClearPageLRU(page);
 + nr_taken++;
 + list_move(&page->lru, dst);
   }
   }
   spin_unlock(&z->lru_lock);
 @@ -212,7 +233,8 @@ static int rss_create(struct container_s
   return -ENOMEM;
  
   res_counter_init(&rss->res);
 - 

Re: [PATCH] [scsi]: Add offline state checking while dispatch a scsi cmd

2007-03-11 Thread Andrew Morton
 On Fri, 9 Mar 2007 09:40:40 +0800 Joe Jin [EMAIL PROTECTED] wrote:
  What's the error you're trying to fix?  scsi_dispatch_cmd() is only
  called from scsi_request_fn() which already has an equivalent of this
  check in it just prior to calling dispatch.
 
 Yeah, I have seen the check in scsi_request_fn(); recently we got the
 following crash info on RHEL4 2.6.9-42.0.2.ELsmp,

The 2.6.9 base is very old in mainline terms.  Are you sure the bug hasn't
been fixed in mainline by other means?



[RFC][PATCH 0/3] swsusp: Stop using page flags

2007-03-11 Thread Rafael J. Wysocki
Hi,

The following three patches make swsusp use its own data structures for memory
management instead of special page flags.  Thus the page flags used so far by
swsusp (PG_nosave, PG_nosave_free) can be used for other purposes and I believe
there are some urgent needs for them. :-)

Last week I sent these patches to the linux-pm and linux-mm lists and there
were no negative comments.  Also I've been testing them on my x86_64 boxes for
a few days and apparently they don't break anything.  I think they can go into
-mm for testing.

Comments are welcome.

Greetings,
Rafael


-- 
If you don't have the time to read,
you don't have the time or the tools to write.
- Stephen King



[RFC][PATCH 1/3] swsusp: Use inline functions for changing page flags

2007-03-11 Thread Rafael J. Wysocki
From: Rafael J. Wysocki [EMAIL PROTECTED]

Replace direct invocations of SetPageNosave(), SetPageNosaveFree() etc. with
calls to inline functions that can be changed in subsequent patches without
modifying the code calling them.

Signed-off-by: Rafael J. Wysocki [EMAIL PROTECTED]
---
 include/linux/suspend.h |   33 +
 kernel/power/snapshot.c |   48 +---
 mm/page_alloc.c |6 +++---
 3 files changed, 61 insertions(+), 26 deletions(-)

Index: linux-2.6.21-rc2/include/linux/suspend.h
===
--- linux-2.6.21-rc2.orig/include/linux/suspend.h   2007-03-02 
09:05:53.0 +0100
+++ linux-2.6.21-rc2/include/linux/suspend.h2007-03-02 09:24:02.0 
+0100
@@ -8,6 +8,7 @@
 #include <linux/notifier.h>
 #include <linux/init.h>
 #include <linux/pm.h>
+#include <linux/mm.h>
 
 /* struct pbe is used for creating lists of pages that should be restored
  * atomically during the resume from disk, because the page frames they have
@@ -49,6 +50,38 @@ void __save_processor_state(struct saved
 void __restore_processor_state(struct saved_context *ctxt);
 unsigned long get_safe_page(gfp_t gfp_mask);
 
+/* Page management functions for the software suspend (swsusp) */
+
+static inline void swsusp_set_page_forbidden(struct page *page)
+{
+   SetPageNosave(page);
+}
+
+static inline int swsusp_page_is_forbidden(struct page *page)
+{
+   return PageNosave(page);
+}
+
+static inline void swsusp_unset_page_forbidden(struct page *page)
+{
+   ClearPageNosave(page);
+}
+
+static inline void swsusp_set_page_free(struct page *page)
+{
+   SetPageNosaveFree(page);
+}
+
+static inline int swsusp_page_is_free(struct page *page)
+{
+   return PageNosaveFree(page);
+}
+
+static inline void swsusp_unset_page_free(struct page *page)
+{
+   ClearPageNosaveFree(page);
+}
+
 /*
  * XXX: We try to keep some more pages free so that I/O operations succeed
  * without paging. Might this be more?
Index: linux-2.6.21-rc2/kernel/power/snapshot.c
===
--- linux-2.6.21-rc2.orig/kernel/power/snapshot.c   2007-03-02 
09:05:53.0 +0100
+++ linux-2.6.21-rc2/kernel/power/snapshot.c2007-03-02 09:27:06.0 
+0100
@@ -67,15 +67,15 @@ static void *get_image_page(gfp_t gfp_ma
 
res = (void *)get_zeroed_page(gfp_mask);
if (safe_needed)
-   while (res && PageNosaveFree(virt_to_page(res))) {
+   while (res && swsusp_page_is_free(virt_to_page(res))) {
/* The page is unsafe, mark it for swsusp_free() */
-   SetPageNosave(virt_to_page(res));
+   swsusp_set_page_forbidden(virt_to_page(res));
allocated_unsafe_pages++;
res = (void *)get_zeroed_page(gfp_mask);
}
if (res) {
-   SetPageNosave(virt_to_page(res));
-   SetPageNosaveFree(virt_to_page(res));
+   swsusp_set_page_forbidden(virt_to_page(res));
+   swsusp_set_page_free(virt_to_page(res));
}
return res;
 }
@@ -91,8 +91,8 @@ static struct page *alloc_image_page(gfp
 
page = alloc_page(gfp_mask);
if (page) {
-   SetPageNosave(page);
-   SetPageNosaveFree(page);
+   swsusp_set_page_forbidden(page);
+   swsusp_set_page_free(page);
}
return page;
 }
@@ -110,9 +110,9 @@ static inline void free_image_page(void 
 
page = virt_to_page(addr);
 
-   ClearPageNosave(page);
+   swsusp_unset_page_forbidden(page);
if (clear_nosave_free)
-   ClearPageNosaveFree(page);
+   swsusp_unset_page_free(page);
 
__free_page(page);
 }
@@ -615,7 +615,8 @@ static struct page *saveable_highmem_pag
 
BUG_ON(!PageHighMem(page));
 
-   if (PageNosave(page) || PageReserved(page) || PageNosaveFree(page))
+   if (swsusp_page_is_forbidden(page) ||  swsusp_page_is_free(page) ||
+   PageReserved(page))
return NULL;
 
return page;
@@ -681,7 +682,7 @@ static struct page *saveable_page(unsign
 
BUG_ON(PageHighMem(page));
 
-   if (PageNosave(page) || PageNosaveFree(page))
+   if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page))
return NULL;
 
if (PageReserved(page) && pfn_is_nosave(pfn))
@@ -821,9 +822,10 @@ void swsusp_free(void)
if (pfn_valid(pfn)) {
struct page *page = pfn_to_page(pfn);
 
-   if (PageNosave(page) && PageNosaveFree(page)) {
-   ClearPageNosave(page);
-   ClearPageNosaveFree(page);
+   if (swsusp_page_is_forbidden(page) &&
+  

[RFC][PATCH 3/3] mm: Remove unused page flags

2007-03-11 Thread Rafael J. Wysocki
From: Rafael J. Wysocki [EMAIL PROTECTED]

Remove the two page flags that were previously used by swsusp and are no longer
needed.

Signed-off-by: Rafael J. Wysocki [EMAIL PROTECTED]
---
 include/linux/page-flags.h |   12 
 1 file changed, 12 deletions(-)

Index: linux-2.6.21-rc3/include/linux/page-flags.h
===
--- linux-2.6.21-rc3.orig/include/linux/page-flags.h
+++ linux-2.6.21-rc3/include/linux/page-flags.h
@@ -82,13 +82,11 @@
 #define PG_private 11  /* If pagecache, has fs-private data */
 
 #define PG_writeback   12  /* Page is under writeback */
-#define PG_nosave  13  /* Used for system suspend/resume */
 #define PG_compound 14  /* Part of a compound page */
 #define PG_swapcache   15  /* Swap page: swp_entry_t in private */
 
 #define PG_mappedtodisk 16  /* Has blocks allocated on-disk */
 #define PG_reclaim 17  /* To be reclaimed asap */
-#define PG_nosave_free 18  /* Used for system suspend/resume */
 #define PG_buddy   19  /* Page is free, on buddy lists */
 
 /* PG_owner_priv_1 users should have descriptive aliases */
@@ -214,16 +212,6 @@ static inline void SetPageUptodate(struc
ret;\
})
 
-#define PageNosave(page)   test_bit(PG_nosave, &(page)->flags)
-#define SetPageNosave(page)    set_bit(PG_nosave, &(page)->flags)
-#define TestSetPageNosave(page)    test_and_set_bit(PG_nosave, &(page)->flags)
-#define ClearPageNosave(page)  clear_bit(PG_nosave, &(page)->flags)
-#define TestClearPageNosave(page)  test_and_clear_bit(PG_nosave, &(page)->flags)
-
-#define PageNosaveFree(page)   test_bit(PG_nosave_free, &(page)->flags)
-#define SetPageNosaveFree(page)    set_bit(PG_nosave_free, &(page)->flags)
-#define ClearPageNosaveFree(page)  clear_bit(PG_nosave_free, &(page)->flags)
-
 #define PageBuddy(page)    test_bit(PG_buddy, &(page)->flags)
 #define __SetPageBuddy(page)   __set_bit(PG_buddy, &(page)->flags)
 #define __ClearPageBuddy(page) __clear_bit(PG_buddy, &(page)->flags)


[RFC][PATCH 2/3] swsusp: Do not use page flags

2007-03-11 Thread Rafael J. Wysocki
From: Rafael J. Wysocki [EMAIL PROTECTED]

Make swsusp use memory bitmaps instead of page flags for marking 'nosave' and
free pages.  This allows us to 'recycle' two page flags that can be used for
other purposes.  Also, the memory needed to store the bitmaps is allocated when
necessary (i.e. before the suspend) and freed after the resume, which is more
reasonable.

The patch is designed to minimize the number of changes, and there are some nice
simplifications and optimizations possible on top of it.  I am going to
implement them separately in the future.
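
To illustrate the idea of tracking per-page state in a separately allocated
bitmap rather than in page flags, here is a minimal user-space sketch.  It is
not the code from this patch: struct pfn_bitmap and the pfn_bitmap_*() helpers
are hypothetical names, and the real memory bitmaps in kernel/power/snapshot.c
are organized differently.  It only shows one bit per page frame number,
allocated on demand and freed afterwards.

```c
#include <limits.h>
#include <stdlib.h>

/* Illustrative names only; these are not the structures used by the patch. */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

struct pfn_bitmap {
	unsigned long start_pfn;	/* first page frame number covered */
	unsigned long nr_pfns;		/* number of page frames covered */
	unsigned long *words;		/* one bit per page frame */
};

/* Allocate a zeroed bitmap for [start_pfn, start_pfn + nr_pfns); 0 on success. */
int pfn_bitmap_init(struct pfn_bitmap *bm, unsigned long start_pfn,
		    unsigned long nr_pfns)
{
	size_t nwords = (nr_pfns + BITS_PER_WORD - 1) / BITS_PER_WORD;

	bm->words = calloc(nwords ? nwords : 1, sizeof(unsigned long));
	if (!bm->words)
		return -1;
	bm->start_pfn = start_pfn;
	bm->nr_pfns = nr_pfns;
	return 0;
}

/* Release the storage once it is no longer needed (e.g. after resume). */
void pfn_bitmap_free(struct pfn_bitmap *bm)
{
	free(bm->words);
	bm->words = NULL;
	bm->nr_pfns = 0;
}

/* Nonzero if pfn lies inside the range the bitmap describes. */
static int pfn_covered(const struct pfn_bitmap *bm, unsigned long pfn)
{
	return pfn >= bm->start_pfn && pfn - bm->start_pfn < bm->nr_pfns;
}

/* Mark a page frame; frames outside the covered range are ignored. */
void pfn_bitmap_set(struct pfn_bitmap *bm, unsigned long pfn)
{
	unsigned long bit;

	if (!pfn_covered(bm, pfn))
		return;
	bit = pfn - bm->start_pfn;
	bm->words[bit / BITS_PER_WORD] |= 1UL << (bit % BITS_PER_WORD);
}

/* Unmark a page frame; frames outside the covered range are ignored. */
void pfn_bitmap_clear(struct pfn_bitmap *bm, unsigned long pfn)
{
	unsigned long bit;

	if (!pfn_covered(bm, pfn))
		return;
	bit = pfn - bm->start_pfn;
	bm->words[bit / BITS_PER_WORD] &= ~(1UL << (bit % BITS_PER_WORD));
}

/* Return 1 if the page frame is marked, 0 otherwise (or if out of range). */
int pfn_bitmap_test(const struct pfn_bitmap *bm, unsigned long pfn)
{
	unsigned long bit;

	if (!pfn_covered(bm, pfn))
		return 0;
	bit = pfn - bm->start_pfn;
	return (bm->words[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1UL;
}
```

With two such bitmaps, one for 'forbidden' pages and one for 'free' pages, the
role of the two page flags is covered without touching struct page.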

Signed-off-by: Rafael J. Wysocki [EMAIL PROTECTED]
---
 arch/x86_64/kernel/e820.c |   26 +---
 include/linux/suspend.h   |   58 +++---
 kernel/power/disk.c   |   23 +++-
 kernel/power/power.h  |2 
 kernel/power/snapshot.c   |  250 +++---
 kernel/power/user.c   |4 
 6 files changed, 281 insertions(+), 82 deletions(-)

Index: linux-2.6.21-rc3/include/linux/suspend.h
===
--- linux-2.6.21-rc3.orig/include/linux/suspend.h
+++ linux-2.6.21-rc3/include/linux/suspend.h
@@ -24,63 +24,41 @@ struct pbe {
 extern void drain_local_pages(void);
 extern void mark_free_pages(struct zone *zone);
 
-#ifdef CONFIG_PM
-/* kernel/power/swsusp.c */
-extern int software_suspend(void);
-
-#if defined(CONFIG_VT) && defined(CONFIG_VT_CONSOLE)
+#if defined(CONFIG_PM) && defined(CONFIG_VT) && defined(CONFIG_VT_CONSOLE)
 extern int pm_prepare_console(void);
 extern void pm_restore_console(void);
 #else
 static inline int pm_prepare_console(void) { return 0; }
 static inline void pm_restore_console(void) {}
-#endif /* defined(CONFIG_VT) && defined(CONFIG_VT_CONSOLE) */
+#endif
+
+#if defined(CONFIG_PM) && defined(CONFIG_SOFTWARE_SUSPEND)
+/* kernel/power/swsusp.c */
+extern int software_suspend(void);
+/* kernel/power/snapshot.c */
+extern void __init register_nosave_region(unsigned long, unsigned long);
+extern int swsusp_page_is_forbidden(struct page *);
+extern void swsusp_set_page_free(struct page *);
+extern void swsusp_unset_page_free(struct page *);
+extern unsigned long get_safe_page(gfp_t gfp_mask);
 #else
 static inline int software_suspend(void)
 {
printk("Warning: fake suspend called\n");
return -ENOSYS;
 }
-#endif /* CONFIG_PM */
+
+static inline void register_nosave_region(unsigned long b, unsigned long e) {}
+static inline int swsusp_page_is_forbidden(struct page *p) { return 0; }
+static inline void swsusp_set_page_free(struct page *p) {}
+static inline void swsusp_unset_page_free(struct page *p) {}
+#endif /* defined(CONFIG_PM) && defined(CONFIG_SOFTWARE_SUSPEND) */
 
 void save_processor_state(void);
 void restore_processor_state(void);
 struct saved_context;
 void __save_processor_state(struct saved_context *ctxt);
 void __restore_processor_state(struct saved_context *ctxt);
-unsigned long get_safe_page(gfp_t gfp_mask);
-
-/* Page management functions for the software suspend (swsusp) */
-
-static inline void swsusp_set_page_forbidden(struct page *page)
-{
-   SetPageNosave(page);
-}
-
-static inline int swsusp_page_is_forbidden(struct page *page)
-{
-   return PageNosave(page);
-}
-
-static inline void swsusp_unset_page_forbidden(struct page *page)
-{
-   ClearPageNosave(page);
-}
-
-static inline void swsusp_set_page_free(struct page *page)
-{
-   SetPageNosaveFree(page);
-}
-
-static inline int swsusp_page_is_free(struct page *page)
-{
-   return PageNosaveFree(page);
-}
-
-static inline void swsusp_unset_page_free(struct page *page)
-{
-   ClearPageNosaveFree(page);
-}
 
 /*
  * XXX: We try to keep some more pages free so that I/O operations succeed
Index: linux-2.6.21-rc3/kernel/power/snapshot.c
===
--- linux-2.6.21-rc3.orig/kernel/power/snapshot.c
+++ linux-2.6.21-rc3/kernel/power/snapshot.c
@@ -21,6 +21,7 @@
 #include <linux/kernel.h>
 #include <linux/pm.h>
 #include <linux/device.h>
+#include <linux/init.h>
 #include <linux/bootmem.h>
 #include <linux/syscalls.h>
 #include <linux/console.h>
@@ -34,6 +35,10 @@
 
 #include "power.h"
 
+static int swsusp_page_is_free(struct page *);
+static void swsusp_set_page_forbidden(struct page *);
+static void swsusp_unset_page_forbidden(struct page *);
+
 /* List of PBEs needed for restoring the pages that were allocated before
  * the suspend and included in the suspend image, but have also been
  * allocated by the resume kernel, so their contents cannot be written
@@ -224,11 +229,6 @@ static void chain_free(struct chain_allo
  * of type unsigned long each).  It also contains the pfns that
  * correspond to the start and end of the represented memory area and
  * the number of bit chunks in the block.
- *
- * NOTE: Memory bitmaps are used for two types of operations only:
- * set a bit and find the next bit set.  Moreover, the searching
- * is always carried out after all of the set a bit 

[PATCH] drivers/isdn/hardware/eicon/: remove unused header files

2007-03-11 Thread Armin Schindler
Hi all,

as pointed out by Robert P. J. Day, here is a patch to remove unused header
files from Eicon/Dialogic ISDN driver.


Signed-off-by: Armin Schindler [EMAIL PROTECTED]

---

diff -Nur linux-2.6.20.1.orig/drivers/isdn/hardware/eicon/dbgioctl.h linux-2.6.20.1/drivers/isdn/hardware/eicon/dbgioctl.h
--- linux-2.6.20.1.orig/drivers/isdn/hardware/eicon/dbgioctl.h  2007-03-10 11:21:15.0 +0100
+++ linux-2.6.20.1/drivers/isdn/hardware/eicon/dbgioctl.h   1970-01-01 01:00:00.0 +0100
@@ -1,198 +0,0 @@
-
-/*
- *
-  Copyright (c) Eicon Technology Corporation, 2000.
- *
-  This source file is supplied for the use with Eicon
-  Technology Corporation's range of DIVA Server Adapters.
- *
-  This program is free software; you can redistribute it and/or modify
-  it under the terms of the GNU General Public License as published by
-  the Free Software Foundation; either version 2, or (at your option)
-  any later version.
- *
-  This program is distributed in the hope that it will be useful,
-  but WITHOUT ANY WARRANTY OF ANY KIND WHATSOEVER INCLUDING ANY
-  implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
-  See the GNU General Public License for more details.
- *
-  You should have received a copy of the GNU General Public License
-  along with this program; if not, write to the Free Software
-  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- *
- */
-/*--*/
-/* file: dbgioctl.h */
-/*--*/
-
-#if !defined(__DBGIOCTL_H__)
-
-#define __DBGIOCTL_H__
-
-#ifdef NOT_YET_NEEDED
-/*
- * The requested operation is passed in arg0 of DbgIoctlArgs,
- * additional arguments (if any) in arg1, arg2 and arg3.
- */
-
-typedef struct
-{  ULONG   arg0 ;
-   ULONG   arg1 ;
-   ULONG   arg2 ;
-   ULONG   arg3 ;
-} DbgIoctlArgs ;
-
-#define DBG_COPY_LOGS   0   /* copy debugs to user until buffer full */
-                            /* arg1: size threshold */
-                            /* arg2: timeout in milliseconds */
-
-#define DBG_FLUSH_LOGS  1   /* flush pending debugs to user buffer */
-                            /* arg1: internal driver id */
-
-#define DBG_LIST_DRVS   2   /* return the list of registered drivers */
-
-#define DBG_GET_MASK    3   /* get current debug mask of driver */
-                            /* arg1: internal driver id */
-
-#define DBG_SET_MASK    4   /* set/change debug mask of driver */
-                            /* arg1: internal driver id */
-                            /* arg2: new debug mask */
-
-#define DBG_GET_BUFSIZE 5   /* get current buffer size of driver */
-                            /* arg1: internal driver id */
-                            /* arg2: new debug mask */
-
-#define DBG_SET_BUFSIZE 6   /* set new buffer size of driver */
-                            /* arg1: new buffer size */
-
-/*
- * common internal debug message structure
- */
-
-typedef struct
-{  unsigned short id ; /* virtual driver id  */
-   unsigned short type ;   /* special message type   */
-   unsigned long  seq ;/* sequence number of message */
-   unsigned long  size ;   /* size of message in bytes   */
-   unsigned long  next ;   /* offset to next buffered message*/
-   LARGE_INTEGER  NTtime ; /* 100 ns  since 1.1.1601 */
-   unsigned char  data[4] ;/* message data   */
-} OldDbgMessage ;
-
-typedef struct
-{  LARGE_INTEGER  NTtime ; /* 100 ns  since 1.1.1601 */
-   unsigned short size ;   /* size of message in bytes   */
-   unsigned short  ;   /* always 0x to indicate new msg  */
-   unsigned short id ; /* virtual driver id  */
-   unsigned short type ;   /* special message type   */
-   unsigned long  seq ;/* sequence number of message */
-   unsigned char  data[4] ;/* message data   */
-} DbgMessage ;
-
-#endif
-
-#define DRV_ID_UNKNOWN 0x0C/* for messages via 

Re: [RFC][PATCH 0/3] swsusp: Stop using page flags

2007-03-11 Thread Peter Zijlstra
On Sun, 2007-03-11 at 11:17 +0100, Rafael J. Wysocki wrote:
> Hi,
>
> The following three patches make swsusp use its own data structures for memory
> management instead of special page flags.  Thus the page flags used so far by
> swsusp (PG_nosave, PG_nosave_free) can be used for other purposes and I believe
> there are some urgent needs for them. :-)
>
> Last week I sent these patches to the linux-pm and linux-mm lists and there
> were no negative comments.  Also I've been testing them on my x86_64 boxes for
> a few days and apparently they don't break anything.  I think they can go into
> -mm for testing.
>
> Comments are welcome.

These patches have my blessing, they look good to me, but I'm not much
involved with the swsusp code, so I won't ACK them.

Again, thanks a bunch for freeing up 2 page flags :-)

Peter



Re: SATA resume slowness, e1000 MSI warning

2007-03-11 Thread Eric W. Biederman
Michael S. Tsirkin [EMAIL PROTECTED] writes:

> The only case I can see which might trigger this is if we saved
> pci-X state and then didn't restore it because we could not find
> the capability on restore.
>
> Hmm. pci_save_pcix_state/pci_restore_pcix_state seem to only handle
> regular devices and seem to ignore the fact that for bridge PCI-X
> capability has a different structure.
>
> Is this intentional?

Probably not as such.  I don't think we have any drivers for bridge
devices so I don't think it matters.  It likely wouldn't hurt to fix
it just in case though.

Do any of the mellanox cards do anything with the bridge on the card?

> If not, here's a patch to fix this. Warning: completely untested.

If you fix the offsets and diff this against my last fix (to never
free the buffer) I think your patch makes sense.

> PCI: restore bridge PCI-X capability registers after PM event
>
> Restore PCI-X bridge up/downstream capability registers
> after PM event.  This includes maximum split transaction
> commitment limit which might be vital for PCI X.
>
> Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED]
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index df49530..4b788ef 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -597,14 +597,19 @@ static int pci_save_pcix_state(struct pci_dev *dev)
>  	if (pos <= 0)
>  		return 0;
>
> -	save_state = kzalloc(sizeof(*save_state) + sizeof(u16), GFP_KERNEL);
> +	save_state = kzalloc(sizeof(*save_state) + sizeof(u16) * 2, GFP_KERNEL);
>  	if (!save_state) {
> -		dev_err(&dev->dev, "Out of memory in pci_save_pcie_state\n");
> +		dev_err(&dev->dev, "Out of memory in pci_save_pcix_state\n");
>  		return -ENOMEM;
>  	}
>  	cap = (u16 *)&save_state->data[0];
>
> -	pci_read_config_word(dev, pos + PCI_X_CMD, &cap[i++]);
> +	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {

This appears to be the proper test.

> +		pci_read_config_word(dev, pos + PCI_X_BRIDGE_UP_SPL_CTL, &cap[i++]);
> +		pci_read_config_word(dev, pos + PCI_X_BRIDGE_DN_SPL_CTL, &cap[i++]);
> +	} else
> +		pci_read_config_word(dev, pos + PCI_X_CMD, &cap[i++]);
> +
>  	pci_add_saved_cap(dev, save_state);
>  	return 0;
>  }
> @@ -621,7 +626,11 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
>  		return;
>  	cap = (u16 *)&save_state->data[0];
>
> -	pci_write_config_word(dev, pos + PCI_X_CMD, cap[i++]);
> +	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
> +		pci_write_config_word(dev, pos + PCI_X_BRIDGE_UP_SPL_CTL, cap[i++]);
> +		pci_write_config_word(dev, pos + PCI_X_BRIDGE_DN_SPL_CTL, cap[i++]);

These look like the proper two registers to save.

> +	} else
> +		pci_write_config_word(dev, pos + PCI_X_CMD, cap[i++]);
>  	pci_remove_saved_cap(save_state);
>  	kfree(save_state);
>  }
> diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
> index f09cce2..fb7eefd 100644
> --- a/include/linux/pci_regs.h
> +++ b/include/linux/pci_regs.h
> @@ -332,6 +332,8 @@
>  #define PCI_X_STATUS_SPL_ERR 0x2000 /* Rcvd Split Completion Error Msg */
>  #define  PCI_X_STATUS_266MHZ 0x4000  /* 266 MHz capable */
>  #define  PCI_X_STATUS_533MHZ 0x8000  /* 533 MHz capable */
> +#define PCI_X_BRIDGE_UP_SPL_CTL 10 /* PCI-X upstream split transaction limit */
> +#define PCI_X_BRIDGE_DN_SPL_CTL 14 /* PCI-X downstream split transaction limit */

Unless I am completely misreading the spec, while you have picked the
right registers to save, the offsets should be 0x08 and 0x0c (8 and 12).

Eric


Linux 2.6.16.44-rc1

2007-03-11 Thread Adrian Bunk
Security fixes since 2.6.16.43:
- CVE-2007-0005: Fix buffer overflow in Omnikey CardMan 4040 driver
- CVE-2007-1000: [IPV6]: Handle np->opt being NULL in ipv6_getsockopt_sticky().


Location:
ftp://ftp.kernel.org/pub/linux/kernel/people/bunk/linux-2.6.16.y/testing/

git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.16.y.git


Changes since 2.6.16.43:

Adrian Bunk (1):
  Linux 2.6.16.44-rc1

Ang Way Chuang (1):
  dvb-core: fix bug in CRC-32 checking on 64-bit systems

Arnaldo Carvalho de Melo (1):
  [TCP]: Fix minisock tcp_create_openreq_child() typo.

Arthur Kepner (1):
  IB/mthca: Use mmiowb after doorbell ring

Chris Wright (1):
  [IPV6] fix ipv6_getsockopt_sticky copy_to_user leak

Dan Yeisley (1):
  init_reap_node() initialization fix

David Moore (1):
  Missing critical phys_to_virt in lib/swiotlb.c

David S. Miller (4):
  video/aty/mach64_ct.c: fix bogus delay loop
  [SPARC64] bbc_i2c: Fix kenvctrld eating %100 cpu.
  [IPV6]: Handle np->opt being NULL in ipv6_getsockopt_sticky(). (CVE-2007-1000)
  SPARC64: Fix memory corruption in pci_4u_free_consistent()

David Stevens (1):
  [IPV6]: /proc/net/anycast6 unbalanced inet6_dev refcnt

Eli Cohen (1):
  IPoIB: Rejoin all multicast groups after a port event

Eric Dumazet (1):
  [INET]: twcal_jiffie should be unsigned long, not int

Herbert Xu (1):
  [UDP]: Reread uh pointer after pskb_trim

Hugh Dickins (1):
  make ppc64 current preempt-safe

Jin-Bong lee (1):
  DVB: cxusb: fix firmware patch for big endian systems

Komuro (1):
  modify 3c589_cs to be SMP safe

Marcel Holtmann (1):
  Fix buffer overflow in Omnikey CardMan 4040 driver (CVE-2007-0005)

Michael S. Tsirkin (1):
  IB/mthca: Fix off-by-one in FMR handling on memfree

Michal Wrobel (1):
  [IPV6]: anycast refcnt fix

Olaf Kirch (1):
  [IPV6]: Fix for ipv6_setsockopt NULL dereference

Sergey Vlasov (1):
  Input: psmouse - fix attribute access on 64-bit systems


 Makefile|2 +-
 arch/sparc64/kernel/pci_iommu.c |2 +-
 drivers/char/pcmcia/cm4040_cs.c |3 ++-
 drivers/infiniband/hw/mthca/mthca_cq.c  |7 +++
 drivers/infiniband/hw/mthca/mthca_memfree.c |2 +-
 drivers/infiniband/hw/mthca/mthca_qp.c  |   19 +++
 drivers/infiniband/hw/mthca/mthca_srq.c |8 
 drivers/infiniband/ulp/ipoib/ipoib_ib.c |4 +++-
 drivers/input/mouse/psmouse-base.c  |8 +---
 drivers/media/dvb/dvb-core/dvb_net.c|4 ++--
 drivers/media/dvb/dvb-usb/cxusb.c   |4 ++--
 drivers/net/pcmcia/3c589_cs.c   |7 +--
 drivers/sbus/char/bbc_i2c.c |   17 +
 drivers/video/aty/mach64_ct.c   |4 ++--
 include/asm-powerpc/current.h   |   12 +++-
 include/net/inet_timewait_sock.h|2 +-
 lib/swiotlb.c   |2 +-
 mm/slab.c   |2 +-
 net/ipv4/tcp_minisocks.c|2 +-
 net/ipv4/udp.c  |1 +
 net/ipv6/addrconf.c |2 ++
 net/ipv6/anycast.c  |1 +
 net/ipv6/ipv6_sockglue.c|   14 +-
 23 files changed, 95 insertions(+), 34 deletions(-)




Re: [PATCH 3/7] revoke: core code

2007-03-11 Thread Pekka Enberg
On Fri, 2007-03-09 at 10:15 +0200, Pekka J Enberg wrote:
> > +  again:
> > +	restart_addr = zap_page_range(vma, start_addr, end_addr - start_addr,
> > +				      details);
> > +
> > +	need_break = need_resched() || need_lockbreak(details->i_mmap_lock);
> > +	if (need_break)
> > +		goto out_need_break;
> > +
> > +	if (restart_addr < end_addr) {
> > +		start_addr = restart_addr;
> > +		goto again;
> > +	}
> > +	return 0;
> > +
> > +  out_need_break:
> > +	spin_unlock(details->i_mmap_lock);
> > +	cond_resched();
> > +	spin_lock(details->i_mmap_lock);
> > +	return -EINTR;

On Fri, 2007-03-09 at 13:30 +0100, Peter Zijlstra wrote:
> I'm not sure this scheme works, given a sufficiently loaded machine,
> this might never complete.

Hmm, so what's the alternative? It's better to fail revoke than lock up
the box.

On Fri, 2007-03-09 at 13:30 +0100, Peter Zijlstra wrote:
> I'm never sure of operator precedence and prefer:
>
>   (vma->vm_flags & VM_SHARED) && ...
>
> which leaves no room for error.

Thanks, fixed.
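
For readers wondering why the explicit parentheses are worth having, a small
stand-alone sketch follows.  It is not taken from the revoke patch; the
function names and arguments are made up for illustration.  In C, "==" binds
tighter than "&", so a mask test written without parentheses silently compares
first and masks second, while "&" does bind tighter than "&&".

```c
/*
 * How the compiler groups the unparenthesized expression
 *     flags & mask == want
 * The comparison is evaluated first, then ANDed with flags.
 */
int match_as_parsed(unsigned long flags, unsigned long mask,
		    unsigned long want)
{
	return (flags & (mask == want)) != 0;
}

/* What is normally meant: mask first, then compare. */
int match_as_intended(unsigned long flags, unsigned long mask,
		      unsigned long want)
{
	return (flags & mask) == want;
}

/*
 * The form preferred in the review: "&" already binds tighter than "&&",
 * but the parentheses make that obvious to the reader.
 */
int flag_set_and(unsigned long flags, unsigned long mask, int other)
{
	return (flags & mask) && other;
}
```

With flags = 0x08, mask = 0x08 and want = 0x08 the first function returns 0
and the second returns 1, which is the kind of surprise the parentheses avoid.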



[PATCH 1/5] revoke: special mmap handling

2007-03-11 Thread Pekka J Enberg
From: Pekka Enberg [EMAIL PROTECTED]

This adds special handling for revoked memory mappings.  We want to
raise SIGBUS when accessing revoked mappings and return ENODEV when
trying to remap with mmap(2).
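
As a rough user-space model of the two checks described above (the real ones
are in the diff below), consider the following sketch.  All names prefixed
with sketch_ or SKETCH_ are hypothetical and the flag value is arbitrary; only
the shape of the logic mirrors the patch: a fault on a revoked mapping yields
SIGBUS, and mapping over a revoked vma is refused with -ENODEV.

```c
#include <errno.h>
#include <stddef.h>

/* Arbitrary illustrative bit, not the kernel's VM_REVOKED value. */
#define SKETCH_VM_REVOKED 0x1UL

/* Stripped-down stand-in for struct vm_area_struct. */
struct sketch_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_flags;
};

enum sketch_fault { SKETCH_FAULT_OK, SKETCH_FAULT_SIGBUS };

/* Fault path: a revoked mapping never gets a page, the task gets SIGBUS. */
enum sketch_fault sketch_handle_fault(const struct sketch_vma *vma)
{
	if (vma->vm_flags & SKETCH_VM_REVOKED)
		return SKETCH_FAULT_SIGBUS;
	return SKETCH_FAULT_OK;
}

/* mmap path: refuse to map over a revoked vma, otherwise allow it. */
int sketch_check_remap(const struct sketch_vma *vma)
{
	if (vma == NULL)
		return 0;
	if (vma->vm_flags & SKETCH_VM_REVOKED)
		return -ENODEV;
	return 0;
}
```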

Acked-by: Alan Cox [EMAIL PROTECTED]
Signed-off-by: Pekka Enberg [EMAIL PROTECTED]
---
 include/linux/mm.h |1 +
 mm/memory.c|3 +++
 mm/mmap.c  |   12 
 3 files changed, 12 insertions(+), 4 deletions(-)

Index: uml-2.6/include/linux/mm.h
===
--- uml-2.6.orig/include/linux/mm.h 2007-03-11 13:07:57.0 +0200
+++ uml-2.6/include/linux/mm.h  2007-03-11 13:09:19.0 +0200
@@ -169,6 +169,7 @@ #define VM_NONLINEAR0x0080  /* Is no
 #define VM_MAPPED_COPY 0x0100  /* T if mapped copy of data (nommu mmap) */
 #define VM_INSERTPAGE  0x0200  /* The vma has had vm_insert_page() done on it */
 #define VM_ALWAYSDUMP  0x0400  /* Always include in core dumps */
+#define VM_REVOKED 0x0800  /* Mapping has been revoked */
 
 #ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
Index: uml-2.6/mm/memory.c
===
--- uml-2.6.orig/mm/memory.c2007-03-11 13:07:57.0 +0200
+++ uml-2.6/mm/memory.c 2007-03-11 13:09:19.0 +0200
@@ -2504,6 +2504,9 @@ int __handle_mm_fault(struct mm_struct *
if (unlikely(is_vm_hugetlb_page(vma)))
return hugetlb_fault(mm, vma, address, write_access);
 
+   if (unlikely(vma->vm_flags & VM_REVOKED))
+   return VM_FAULT_SIGBUS;
+
pgd = pgd_offset(mm, address);
pud = pud_alloc(mm, pgd, address);
if (!pud)
Index: uml-2.6/mm/mmap.c
===
--- uml-2.6.orig/mm/mmap.c  2007-03-11 13:07:57.0 +0200
+++ uml-2.6/mm/mmap.c   2007-03-11 13:09:19.0 +0200
@@ -1030,10 +1030,14 @@ accountable = 0;
error = -ENOMEM;
 munmap_back:
 	vma = find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
-	if (vma && vma->vm_start < addr + len) {
-		if (do_munmap(mm, addr, len))
-			return -ENOMEM;
-		goto munmap_back;
+	if (vma) {
+		if (unlikely(vma->vm_flags & VM_REVOKED))
+			return -ENODEV;
+		if (vma->vm_start < addr + len) {
+			if (do_munmap(mm, addr, len))
+				return -ENOMEM;
+			goto munmap_back;
+		}
 	}
 
/* Check against address space limit. */


[PATCH 2/5] revoke: core code

2007-03-11 Thread Pekka J Enberg
From: Pekka Enberg [EMAIL PROTECTED]

The revokeat(2) and frevoke(2) system calls invalidate open file
descriptors and shared mappings of an inode. After a successful
revocation, operations on file descriptors fail with the EBADF or
ENXIO error code for regular and device files, respectively.
Attempting to read from or write to a revoked mapping causes SIGBUS.

The actual operation is done in two passes:

 1. Revoke all file descriptors that point to the given inode. We do
this under tasklist_lock so that after this pass, we don't need
to worry about racing with close(2) or dup(2).
   
 2. Take down shared memory mappings of the inode and close all file
pointers.

The file descriptors and memory mapping ranges are preserved until the
owning task does close(2) and munmap(2), respectively.

Signed-off-by: Pekka Enberg [EMAIL PROTECTED]
---
 fs/Makefile  |2 
 fs/revoke.c  |  588 +++
 fs/revoked_inode.c   |  378 +++
 include/linux/fs.h   |4 
 include/linux/revoked_fs_i.h |   20 +
 include/linux/syscalls.h |3 
 6 files changed, 994 insertions(+), 1 deletion(-)

Index: uml-2.6/fs/Makefile
===
--- uml-2.6.orig/fs/Makefile2007-03-11 13:07:57.0 +0200
+++ uml-2.6/fs/Makefile 2007-03-11 13:09:20.0 +0200
@@ -11,7 +11,7 @@ obj-y :=  open.o read_write.o file_table.
attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o drop_caches.o splice.o sync.o utimes.o \
-   stack.o
+   stack.o revoke.o revoked_inode.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=   buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
Index: uml-2.6/include/linux/syscalls.h
===
--- uml-2.6.orig/include/linux/syscalls.h   2007-03-11 13:07:57.0 +0200
+++ uml-2.6/include/linux/syscalls.h2007-03-11 13:09:20.0 +0200
@@ -605,4 +605,7 @@ asmlinkage long sys_getcpu(unsigned __us
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
+asmlinkage int sys_revokeat(int dfd, const char __user *filename);
+asmlinkage int sys_frevoke(unsigned int fd);
+
 #endif
Index: uml-2.6/include/linux/fs.h
===
--- uml-2.6.orig/include/linux/fs.h 2007-03-11 13:07:57.0 +0200
+++ uml-2.6/include/linux/fs.h  2007-03-11 13:09:20.0 +0200
@@ -1100,6 +1100,7 @@ struct file_operations {
int (*flock) (struct file *, int, struct file_lock *);
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
+   int (*revoke)(struct file *);
 };
 
 struct inode_operations {
@@ -1739,6 +1740,9 @@ extern ssize_t generic_splice_sendpage(s
 extern long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
size_t len, unsigned int flags);
 
+/* fs/revoke.c */
+extern int generic_file_revoke(struct file *);
+
 extern void
 file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping);
 extern loff_t no_llseek(struct file *file, loff_t offset, int origin);
Index: uml-2.6/fs/revoke.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ uml-2.6/fs/revoke.c 2007-03-11 13:14:42.0 +0200
@@ -0,0 +1,588 @@
+/*
+ * fs/revoke.c - Invalidate all current open file descriptors of an inode.
+ *
+ * Copyright (C) 2006-2007  Pekka Enberg
+ *
+ * This file is released under the GPLv2.
+ */
+
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/sched.h>
+#include <linux/revoked_fs_i.h>
+
+/*
+ * This is used for pre-allocating an array of file pointers so that we don't
+ * have to do memory allocation under tasklist_lock.
+ */
+struct revoke_table {
+   struct file **files;
+   unsigned long size;
+   unsigned long end;
+   unsigned long restore_start;
+};
+
+struct kmem_cache *revokefs_inode_cache;
+
+/*
+ * Revoked file descriptors point to inodes in the revokefs filesystem.
+ */
+static struct vfsmount *revokefs_mnt;
+
+static struct file *get_revoked_file(void)
+{
+   struct dentry *dentry;
+   struct inode *inode;
+   struct file *filp;
+   struct qstr name;
+
+   filp = get_empty_filp();
+   if (!filp)
+   goto err;
+
+   inode = new_inode(revokefs_mnt->mnt_sb);
+   if (!inode)
+   goto err_inode;
+
+   name.name = "revoked_file";
+   name.len = strlen(name.name);
+   dentry = 

[PATCH 3/5] revoke: support for ext2 and ext3

2007-03-11 Thread Pekka J Enberg
From: Pekka Enberg [EMAIL PROTECTED]

Add revoke support to ext2 and ext3 by wiring f_ops->revoke to
generic_file_revoke.

Signed-off-by: Pekka Enberg [EMAIL PROTECTED]
---
 fs/ext2/file.c |1 +
 fs/ext3/file.c |1 +
 2 files changed, 2 insertions(+)

Index: uml-2.6/fs/ext2/file.c
===
--- uml-2.6.orig/fs/ext2/file.c 2007-03-11 13:05:33.0 +0200
+++ uml-2.6/fs/ext2/file.c  2007-03-11 13:09:21.0 +0200
@@ -56,6 +56,7 @@ const struct file_operations ext2_file_o
.sendfile   = generic_file_sendfile,
.splice_read= generic_file_splice_read,
.splice_write   = generic_file_splice_write,
+   .revoke = generic_file_revoke,
 };
 
 #ifdef CONFIG_EXT2_FS_XIP
Index: uml-2.6/fs/ext3/file.c
===
--- uml-2.6.orig/fs/ext3/file.c 2007-03-11 13:05:33.0 +0200
+++ uml-2.6/fs/ext3/file.c  2007-03-11 13:09:21.0 +0200
@@ -123,6 +123,7 @@ const struct file_operations ext3_file_o
.sendfile   = generic_file_sendfile,
.splice_read= generic_file_splice_read,
.splice_write   = generic_file_splice_write,
+   .revoke = generic_file_revoke,
 };
 
 const struct inode_operations ext3_file_inode_operations = {


[PATCH 4/5] revoke: add documentation

2007-03-11 Thread Pekka J Enberg
From: Pekka Enberg [EMAIL PROTECTED]

This documents the revoke file operation in Documentation/filesystems/vfs.txt.

Acked-by: Alan Cox [EMAIL PROTECTED]
Signed-off-by: Pekka Enberg [EMAIL PROTECTED]
---
 Documentation/filesystems/vfs.txt |5 +
 1 file changed, 5 insertions(+)

Index: uml-2.6/Documentation/filesystems/vfs.txt
===
--- uml-2.6.orig/Documentation/filesystems/vfs.txt  2007-03-11 13:05:33.0 +0200
+++ uml-2.6/Documentation/filesystems/vfs.txt   2007-03-11 13:09:22.0 +0200
@@ -732,6 +732,7 @@ struct file_operations {
 int);
ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, 
unsigned  
 int);
+   int (*revoke)(struct file *);
 };
 
 Again, all methods are called without any locks being held, unless
@@ -805,6 +806,10 @@ otherwise noted.
   splice_read: called by the VFS to splice data from file to a pipe. This
   method is used by the splice(2) system call
 
+  revoke: called by revokeat(2) and frevoke(2) system calls to revoke access
+ to an open file. This method must ensure that all currently blocked
+ writes are flushed and reads will fail.
+
 Note that the file operations are implemented by the specific
 filesystem in which the inode resides. When opening a device node
 (character or block special) most filesystems will call special


[PATCH 5/5] revoke: wire up i386 system calls

2007-03-11 Thread Pekka J Enberg
From: Pekka Enberg [EMAIL PROTECTED]

Make revokeat and frevoke system calls available to user-space on i386.

Acked-by: Alan Cox [EMAIL PROTECTED]
Signed-off-by: Pekka Enberg [EMAIL PROTECTED]
---
 arch/i386/kernel/syscall_table.S |3 +++
 include/asm-i386/unistd.h|4 +++-
 2 files changed, 6 insertions(+), 1 deletion(-)

Index: uml-2.6/arch/i386/kernel/syscall_table.S
===
--- uml-2.6.orig/arch/i386/kernel/syscall_table.S   2007-03-11 13:05:32.0 +0200
+++ uml-2.6/arch/i386/kernel/syscall_table.S    2007-03-11 13:09:23.0 +0200
@@ -319,3 +319,6 @@ .long sys_unshare   /* 310 */
.long sys_move_pages
.long sys_getcpu
.long sys_epoll_pwait
+   .long sys_revokeat  /* 320 */
+   .long sys_frevoke
+
Index: uml-2.6/include/asm-i386/unistd.h
===
--- uml-2.6.orig/include/asm-i386/unistd.h  2007-03-11 13:05:33.0 +0200
+++ uml-2.6/include/asm-i386/unistd.h   2007-03-11 13:09:23.0 +0200
@@ -325,10 +325,12 @@ #define __NR_unshare  310
 #define __NR_move_pages317
 #define __NR_getcpu318
 #define __NR_epoll_pwait   319
+#define __NR_revokeat  320
+#define __NR_frevoke   321
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 320
+#define NR_syscalls 322
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR


Re: SATA resume slowness, e1000 MSI warning

2007-03-11 Thread Michael S. Tsirkin
> Quoting Eric W. Biederman [EMAIL PROTECTED]:
> Subject: Re: SATA resume slowness, e1000 MSI warning
>
> Michael S. Tsirkin [EMAIL PROTECTED] writes:
>
> > The only case I can see which might trigger this is if we saved
> > pci-X state and then didn't restore it because we could not find
> > the capability on restore.
> >
> > Hmm. pci_save_pcix_state/pci_restore_pcix_state seem to only handle
> > regular devices and seem to ignore the fact that for bridge PCI-X
> > capability has a different structure.
> >
> > Is this intentional?
>
> Probably not as such.  I don't think we have any drivers for bridge
> devices so I don't think it matters.  It likely wouldn't hurt to fix
> it just in case though.
>
> Do any of the mellanox cards do anything with the bridge on the card?

Yes but they do their own thing wrt saving/restoring registers.
Look at drivers/infiniband/hw/mthca/mthca_reset.c

> > If not, here's a patch to fix this. Warning: completely untested.
>
> If you fix the offsets and diff this against my last fix (to never
> free the buffer) I think your patch makes sense.

Let's agree what the correct offsets are.

  PCI: restore bridge PCI-X capability registers after PM event
 
  Restore PCI-X bridge up/downstream capability registers
  after PM event.  This includes maxumum split transaction
  commitment limit which might be vital for PCI X.
 
  Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED]
 
  diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
  index df49530..4b788ef 100644
  --- a/drivers/pci/pci.c
  +++ b/drivers/pci/pci.c
  @@ -597,14 +597,19 @@ static int pci_save_pcix_state(struct pci_dev *dev)
  if (pos = 0)
  return 0;
   
  -   save_state = kzalloc(sizeof(*save_state) + sizeof(u16), GFP_KERNEL);
  + save_state = kzalloc(sizeof(*save_state) + sizeof(u16) * 2, GFP_KERNEL);
  if (!save_state) {
  -   dev_err(dev-dev, Out of memory in pci_save_pcie_state\n);
  +   dev_err(dev-dev, Out of memory in pci_save_pcix_state\n);
  return -ENOMEM;
  }
  cap = (u16 *)save_state-data[0];
   
  -   pci_read_config_word(dev, pos + PCI_X_CMD, cap[i++]);
  +   if (dev-hdr_type == PCI_HEADER_TYPE_BRIDGE) {
 
 This appears to be the proper test.
 
  + pci_read_config_word(dev, pos + PCI_X_BRIDGE_UP_SPL_CTL, cap[i++]);
  + pci_read_config_word(dev, pos + PCI_X_BRIDGE_DN_SPL_CTL, cap[i++]);
  +   } else
  +   pci_read_config_word(dev, pos + PCI_X_CMD, cap[i++]);
  +
  pci_add_saved_cap(dev, save_state);
  return 0;
   }
  @@ -621,7 +626,11 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
  return;
  cap = (u16 *)&save_state->data[0];
   
  -   pci_write_config_word(dev, pos + PCI_X_CMD, cap[i++]);
  +   if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
  + pci_write_config_word(dev, pos + PCI_X_BRIDGE_UP_SPL_CTL, cap[i++]);
  + pci_write_config_word(dev, pos + PCI_X_BRIDGE_DN_SPL_CTL, cap[i++]);
 
 These look like the proper two registers to save.
 
  +   } else
  +   pci_write_config_word(dev, pos + PCI_X_CMD, cap[i++]);
  pci_remove_saved_cap(save_state);
  kfree(save_state);
   }
  diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
  index f09cce2..fb7eefd 100644
  --- a/include/linux/pci_regs.h
  +++ b/include/linux/pci_regs.h
  @@ -332,6 +332,8 @@
   #define PCI_X_STATUS_SPL_ERR 0x2000 /* Rcvd Split Completion Error Msg */
   #define  PCI_X_STATUS_266MHZ   0x4000  /* 266 MHz capable */
   #define  PCI_X_STATUS_533MHZ   0x8000  /* 533 MHz capable */
  +#define PCI_X_BRIDGE_UP_SPL_CTL 10 /* PCI-X upstream split transaction limit */
  +#define PCI_X_BRIDGE_DN_SPL_CTL 14 /* PCI-X downstream split transaction limit */
 
 Unless I am completely misreading the spec, while you have picked the
 right registers to save, the offsets should be 0x08 and 0x0c, or 8 and 12.

No, the spec is written in terms of dwords (32 bit), we are storing words (16 
bits).
The data at offsets 8 and 12 is read-only split transaction capacity.
Split transaction limit starts at bit 16 so you need to add 2 to byte offset.

Right?
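
The offset arithmetic can be checked with a small C sketch. It assumes the
layout described in this message: each split transaction register is a 32-bit
little-endian dword at byte offset 8 (upstream) or 12 (downstream) from the
start of the capability, with the read-only capacity in bits 15:0 and the
limit in bits 31:16. The macro and helper names are hypothetical and are not
part of the patch.

```c
#include <stdint.h>

/* Byte offsets of the 32-bit split transaction registers, relative to
 * the start of the PCI-X bridge capability (as discussed above). */
#define UP_SPL_DWORD_OFF 8   /* upstream split transaction register */
#define DN_SPL_DWORD_OFF 12  /* downstream split transaction register */

/* Hypothetical helper: byte offset of the 16-bit field that starts at
 * bit `first_bit` of a little-endian dword located at `dword_off`. */
static unsigned int word_offset(unsigned int dword_off, unsigned int first_bit)
{
	return dword_off + first_bit / 8;
}

/* Extract the limit (bits 31:16) from a full 32-bit register value. */
static uint16_t spl_limit(uint32_t reg)
{
	return (uint16_t)(reg >> 16);
}
```

With these assumptions, word_offset(UP_SPL_DWORD_OFF, 16) and
word_offset(DN_SPL_DWORD_OFF, 16) give the 10 and 14 used in the patch.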


-- 
MST


Re: [PATCH] CIRRUS: Delete unused header file.

2007-03-11 Thread Robert P. J. Day
On Sat, 10 Mar 2007, Andrew Morton wrote:

  On Sat, 10 Mar 2007 17:27:44 -0500 (EST) Robert P. J. Day [EMAIL 
  PROTECTED] wrote:
 
Delete apparently unused header file
  sound/pci/cs46xx/imgs/cwcemb80.h.
 

 That patch series was rather a mess

 - Multiple patches with the same Subject: (I might have lost some as a result)

yes, that was a bad decision on my part, sorry.

 - Several patches which tried to remove the same header file

*that* shouldn't have happened, those patches were designed to be
independent of one another and, AFAIK, i submitted them only once.  i
have no idea how the above might have happened.

 - Several patches which simply didn't apply

hm ... they were created against the latest git tree, i don't know
why they wouldn't apply.

...

 - Useless indenting in changelog text which I have to edit away.

ah, i'll remember to not indent the changelog text next time, sorry.

rday

-- 

Robert P. J. Day
Linux Consulting, Training and Annoying Kernel Pedantry
Waterloo, Ontario, CANADA

http://fsdev.net/wiki/index.php?title=Main_Page



Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-11 Thread Mike Galbraith
Hi Con,

On Sun, 2007-03-11 at 14:57 +1100, Con Kolivas wrote:
 What follows this email is a patch series for the latest version of the RSDL 
 cpu scheduler (ie v0.29). I have addressed all bugs that I am able to 
 reproduce in this version so if some people would be kind enough to test if 
 there are any hidden bugs or oops lurking, it would be nice to know in 
 anticipation of putting this back in -mm. Thanks.
 
 Full patch for 2.6.21-rc3-mm2:
 http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc3-mm2-rsdl-0.29.patch

I'm seeing a cpu distribution problem running this on my P4 box.

Scenario:
listening to music collection (mp3) via Amarok.  Enable Amarok
visualization gforce, and size such that X and gforce each use ~50% cpu.
Start rip/encode of new CD with grip/lame encoder.  Lame is set to use
both cpus, at nice 5.  Once the encoders start, they receive
considerably more cpu than nice 0 X/Gforce, taking ~120% and leaving the
remaining 80% for X/Gforce and Amarok (when it updates its ~12k entry
database) to squabble over.

With 2.6.21-rc3,  X/Gforce maintain their ~50% cpu (remain smooth), and
the encoders (100%cpu bound) get what's left when Amarok isn't eating it.

I plunked the above patch into plain 2.6.21-rc3 and retested to
eliminate other mm tree differences, and it's repeatable.  The nice 5
cpu hogs always receive considerably more than the nice 0 sleepers.

-Mike



Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-11 Thread Con Kolivas
On Sunday 11 March 2007 22:39, Mike Galbraith wrote:
 Hi Con,

 On Sun, 2007-03-11 at 14:57 +1100, Con Kolivas wrote:
  What follows this email is a patch series for the latest version of the
  RSDL cpu scheduler (ie v0.29). I have addressed all bugs that I am able
  to reproduce in this version so if some people would be kind enough to
  test if there are any hidden bugs or oops lurking, it would be nice to
  know in anticipation of putting this back in -mm. Thanks.
 
  Full patch for 2.6.21-rc3-mm2:
  http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc3-mm2-rsdl-0.29.patch

 I'm seeing a cpu distribution problem running this on my P4 box.

 Scenario:
 listening to music collection (mp3) via Amarok.  Enable Amarok
 visualization gforce, and size such that X and gforce each use ~50% cpu.
 Start rip/encode of new CD with grip/lame encoder.  Lame is set to use
 both cpus, at nice 5.  Once the encoders start, they receive
 considerably more cpu than nice 0 X/Gforce, taking ~120% and leaving the
 remaining 80% for X/Gforce and Amarok (when it updates its ~12k entry
 database) to squabble over.

 With 2.6.21-rc3,  X/Gforce maintain their ~50% cpu (remain smooth), and
 the encoders (100%cpu bound) get what's left when Amarok isn't eating it.

 I plunked the above patch into plain 2.6.21-rc3 and retested to
 eliminate other mm tree differences, and it's repeatable.  The nice 5
 cpu hogs always receive considerably more than the nice 0 sleepers.

Thanks for the report. I'm assuming you're describing a single hyperthread P4 
here in SMP mode so 2 logical cores. Can you elaborate on whether there is 
any difference as to which cpu things are bound to as well? Can you also see 
what happens with lame not niced to +5 (ie at 0) and with lame at nice +19.

Thanks.

-- 
-ck


Re: [patch 2/9] signalfd/timerfd - signalfd core ...

2007-03-11 Thread Oleg Nesterov
On 03/10, Davide Libenzi wrote:

 +static void signalfd_put_sighand(struct signalfd_ctx *ctx,
 +  struct sighand_struct *sighand,
 +  unsigned long *flags)
 +{
 + unlock_task_sighand(ctx->tsk, flags);
 +}

Note that signalfd_put_sighand() doesn't need sighand parameter, please
see below.

 +int signalfd_deliver(struct sighand_struct *sighand, int sig,
 +  struct siginfo *info)
 +{
 + int nsig = 0;
 + struct signalfd_ctx *ctx, *tmp;
 +
 + list_for_each_entry_safe(ctx, tmp, &sighand->sfdlist, lnk) {
 + /*
 +  * We use a negative signal value as a way to broadcast that the
 +  * sighand has been orphaned, so that we can notify all the
 +  * listeners about this. Remember the ctx->sigmask is inverted,
 +  * so if the user is interested in a signal, that corresponding
 +  * bit will be zero.
 +  */
 + if (sig < 0)
 + list_del_init(&ctx->lnk);

I'm afraid this is not right. This should be per-thread.

Suppose we have threads T1 and T2 from the same thread group. sighand->sfdlist
contains ctx1 and ctx2 linked to T1 and T2. Now, T1 exits, __exit_signal()
does signalfd_notify(sighand, -1), and unlinks all threads, not just T1.

IOW, we should do

if (ctx->tsk == current) {
list_del_init(&ctx->lnk);
wake_up(&ctx->wqh);
}

Perhaps it makes sense to not re-use signalfd_deliver(), but introduce
a new signalfd_xxx(sighand, tsk) helper for de_thread/exit_signal.
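
The per-thread behaviour asked for above can be modelled in plain userspace C.
This is only a sketch under simplifying assumptions: the list is an array, the
task is an integer id, and struct ctx and detach_thread() are hypothetical
stand-ins, not the kernel structures or the proposed helper itself.

```c
#include <stddef.h>

/* Simplified stand-in for struct signalfd_ctx: only what the sketch needs. */
struct ctx {
	int tsk;     /* id of the owning thread (stands in for ctx->tsk) */
	int linked;  /* 1 while the context is on the sighand's sfdlist */
	int woken;   /* number of wake-ups delivered to this context */
};

/* Detach and wake only the contexts that belong to the exiting thread.
 * Contexts owned by other threads of the group stay linked. */
static int detach_thread(struct ctx *list, size_t n, int exiting)
{
	int detached = 0;

	for (size_t i = 0; i < n; i++) {
		if (!list[i].linked || list[i].tsk != exiting)
			continue;
		list[i].linked = 0;  /* models list_del_init() */
		list[i].woken++;     /* models wake_up() */
		detached++;
	}
	return detached;
}
```

With ctx1 owned by T1 and ctx2 owned by T2, detaching T1 leaves ctx2 linked
and unwoken, which is the difference from the broadcast with sig < 0.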

Btw, signalfd_deliver() doesn't use info parameter.

 + if (sig < 0 || !sigismember(&ctx->sigmask, sig)) {
 + wake_up(&ctx->wqh);

Minor nit. Perhaps it makes sense to do

void signalfd_deliver(struct task_struct *tsk, int sig, struct sigpending *pending)
{
struct sighand_struct *sighand = tsk->sighand;
int private = (&tsk->pending == pending);

list_for_each_entry_safe(ctx, tmp, &sighand->sfdlist, lnk) {
if (private && ctx->tsk != tsk)
continue;
if (!sigismember(&ctx->sigmask, sig))
wake_up(&ctx->wqh);
}
}

Even better: signalfd_deliver(struct task_struct *tsk, int sig, int private).
This way specific_send_sig_info/send_sigqueue won't do a false wakeup.

 +asmlinkage long sys_signalfd(int ufd, sigset_t __user *user_mask, size_t sizemask)
 +{
 ...
 + if ((sighand = signalfd_get_sighand(ctx, &flags)) != NULL) {
 + ctx->sigmask = sigmask;
 + signalfd_put_sighand(ctx, sighand, &flags);
 + }

This looks like unneeded complication to me, I'd suggest

if (signalfd_get_sighand(ctx, &flags)) {
ctx->sigmask = sigmask;
signalfd_put_sighand(ctx, &flags);
}

unlock_task_sighand() (and thus signalfd_put_sighand) doesn't need sighand
parameter. signalfd_get_sighand() is in fact boolean. It makes sense to return
sighand, it may be useful, but this patch only needs != NULL.

Every usage of signalfd_get_sighand() could be simplified accordingly.

 --- linux-2.6.20.ep2.orig/fs/exec.c   2007-03-10 15:57:00.0 -0800
 +++ linux-2.6.20.ep2/fs/exec.c2007-03-10 15:57:51.0 -0800
 @@ -50,6 +50,7 @@
  #include <linux/tsacct_kern.h>
  #include <linux/cn_proc.h>
  #include <linux/audit.h>
 +#include <linux/signalfd.h>
  
  #include <asm/uaccess.h>
  #include <asm/mmu_context.h>
 @@ -583,6 +584,17 @@
   int count;
  
   /*
 +  * Tell all the sighand listeners that this sighand has
 +  * been detached. Needs to be called with the sighand lock
 +  * held.
 +  */
 + if (unlikely(!list_empty(&oldsighand->sfdlist))) {
 + spin_lock_irq(&oldsighand->siglock);
 + signalfd_notify(oldsighand, -1, NULL);
 + spin_unlock_irq(&oldsighand->siglock);
 + }

Very minor nit. I'd suggest to make a new helper and put it in signalfd.h
(like signalfd_notify()). This will help CONFIG_SIGNALFD.

I still think that we should do this only for suid-exec. If application
passes a signalfd to another process with unix socket, it should know
what it does. But yes, I agree, we can change this later if needed.
(in that case the caller of the above helper should be flush_old_exec).

Oleg.



Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-11 Thread Mike Galbraith
On Sun, 2007-03-11 at 22:48 +1100, Con Kolivas wrote:
 
 Thanks for the report. I'm assuming you're describing a single hyperthread P4 
 here in SMP mode so 2 logical cores. Can you elaborate on whether there is 
 any difference as to which cpu things are bound to as well? Can you also see 
 what happens with lame not niced to +5 (ie at 0) and with lame at nice +19.

Yes, one P4/HT/SMP. No change at nice 0, but setting the encoders to
nice 19 did put X/gforce ~back where they were with 2.6.21-rc3.  Tasks
don't seem to be bound to any particular cpu, relies on load balancing
(which appears to be working).

-Mike



Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-11 Thread Ingo Molnar

* Mike Galbraith [EMAIL PROTECTED] wrote:

  Full patch for 2.6.21-rc3-mm2: 
  http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc3-mm2-rsdl-0.29.patch
 
 I'm seeing a cpu distribution problem running this on my P4 box.

 With 2.6.21-rc3, X/Gforce maintain their ~50% cpu (remain smooth), and 
 the encoders (100%cpu bound) get what's left when Amarok isn't eating 
 it.
 
 I plunked the above patch into plain 2.6.21-rc3 and retested to 
 eliminate other mm tree differences, and it's repeatable.  The nice 5 
 cpu hogs always receive considerably more than the nice 0 sleepers.

hm. Do you get the same problem on UP too? (i.e. let's eliminate any 
SMP/HT artifacts)

Ingo


Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-11 Thread Mike Galbraith
On Sun, 2007-03-11 at 13:10 +0100, Ingo Molnar wrote:
 * Mike Galbraith [EMAIL PROTECTED] wrote:
 
   Full patch for 2.6.21-rc3-mm2: 
   http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc3-mm2-rsdl-0.29.patch
  
  I'm seeing a cpu distribution problem running this on my P4 box.
 
  With 2.6.21-rc3, X/Gforce maintain their ~50% cpu (remain smooth), and 
  the encoders (100%cpu bound) get what's left when Amarok isn't eating 
  it.
  
  I plunked the above patch into plain 2.6.21-rc3 and retested to 
  eliminate other mm tree differences, and it's repeatable.  The nice 5 
  cpu hogs always receive considerably more than the nice 0 sleepers.
 
 hm. Do you get the same problem on UP too? (i.e. let's eliminate any 
 SMP/HT artifacts)

I'll boot up nosmp and report back (but now it's time to take Opa to the
Gasthaus for his Sunday afternoon brewskies;)

-Mike



RE: [git patches] libata fixes

2007-03-11 Thread Paul Rolland
Hello,

 It seems like IRQ is not getting through.  The first IRQ 
 driven command is failing for you.

H 
 Extract is :
 ata7: PATA max UDMA/100 cmd 0x00019c00 ctl 0x00019882 bmdma
 0x00019400 irq 16
 ata8: PATA max UDMA/100 cmd 0x00019800 ctl 0x00019482 bmdma
 0x00019408 irq 16

IRQ 16 is IO-APIC-fasteoi for libata, and is not shared... but all the
other libata IRQs are IO-APIC-edge.

 * Does giving 'acpi=off' or 'irqpoll' make any difference?
 
 * Can you connect a harddisk to the channel and see whether 
 that works?
Tried that.. Disk is identified as ATA-7: Maxtor 6Y080L0, YAR41BW0, max
UDMA/13
and then timeout again...

Tried then with acpi=off, same result (identify is OK, but then timeout),
and irqpoll and then it was OK 

Let's then go back to my DVD-RW and test irqpoll...
and ... Yes Got it !
It is identified, it can be mounted, and read as /dev/sr1 !

/proc/interrupts show a count of 0 for IRQ 16, so yes, it goes somewhere
else...

Doing some diffs on copy of /proc/interrupts while accessing the DVD
gives two possibilities : IRQ14 or IRQ18, but both are also counting
when not accessing the DVD...

Question : does running with irqpoll affect performance ?

Paul
 



Re: libata extension

2007-03-11 Thread Alan Cox
 I believe you should be able to do this by sending ATA pass-through SCSI 
 commands into the device using SG_IO, without any kernel changes. It's 
 really the mechanism that's meant for this..

It should work, but Mark Lord reported some problems with READ_LONG on
PIIX/ICH intel chipsets. I don't know if he ever resolved them but if not
I have a patch that ought to.
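
For the READ VERIFY (0x40) case from the original question, the pass-through
request could look roughly like the sketch below. It only builds the 16-byte
CDB. The field layout is an assumption taken from the SCSI/ATA Translation
(SAT) definition of ATA PASS-THROUGH (16), not something stated in this
thread, and build_read_verify_cdb() is a hypothetical helper name.

```c
#include <stdint.h>
#include <string.h>

#define ATA_16             0x85  /* ATA PASS-THROUGH (16) opcode */
#define ATA_PROTO_NON_DATA 3     /* SAT protocol value for non-data commands */
#define ATA_CMD_VERIFY     0x40  /* READ VERIFY SECTOR(S), 28-bit LBA */

/* Fill `cdb` with a pass-through request for READ VERIFY SECTOR(S). */
static void build_read_verify_cdb(uint8_t cdb[16], uint32_t lba, uint8_t count)
{
	memset(cdb, 0, 16);
	cdb[0]  = ATA_16;
	cdb[1]  = ATA_PROTO_NON_DATA << 1;      /* PROTOCOL field, bits 4:1 */
	cdb[2]  = 0x20;                         /* CK_COND: ask for ATA status back */
	cdb[6]  = count;                        /* sector count, bits 7:0 */
	cdb[8]  = lba & 0xff;                   /* LBA bits 7:0 */
	cdb[10] = (lba >> 8) & 0xff;            /* LBA bits 15:8 */
	cdb[12] = (lba >> 16) & 0xff;           /* LBA bits 23:16 */
	cdb[13] = 0x40 | ((lba >> 24) & 0x0f);  /* LBA mode bit + LBA bits 27:24 */
	cdb[14] = ATA_CMD_VERIFY;               /* ATA command register */
}
```

The resulting buffer would then be handed to the device through an
sg_io_hdr_t (interface_id 'S', dxfer_direction SG_DXFER_NONE, cmdp pointing at
the CDB) with ioctl(fd, SG_IO, &hdr). Error details come back in the sense
buffer.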

Alan


[PATCH] driver core: fix device_add error path

2007-03-11 Thread Dmitriy Monakhov
Dmitriy Monakhov [EMAIL PROTECTED] writes:

 Greg Kroah-Hartman [EMAIL PROTECTED] writes:

 From: James Simmons [EMAIL PROTECTED]

 When a device fails to register the class symlinks were not cleaned up.
 This left a symlink in the /sys/class/device/ directory that pointed
 to nowhere. This caused the sysfs_follow_link Oops I reported earlier.
 This patch cleans up the symlink. Please apply. Thank you.

 Signed-Off: James Simmons [EMAIL PROTECTED]
 Signed-off-by: Greg Kroah-Hartman [EMAIL PROTECTED]
 ---
  drivers/base/core.c |   31 ++-
  1 files changed, 30 insertions(+), 1 deletions(-)

 diff --git a/drivers/base/core.c b/drivers/base/core.c
 index d04fd33..cf2a398 100644
 --- a/drivers/base/core.c
 +++ b/drivers/base/core.c
 @@ -637,12 +637,41 @@ int device_add(struct device *dev)
   BUS_NOTIFY_DEL_DEVICE, dev);
  device_remove_groups(dev);
   GroupError:
 -device_remove_attrs(dev);
 +device_remove_attrs(dev);
   AttrsError:
  if (dev->devt_attr) {
  device_remove_file(dev, dev->devt_attr);
  kfree(dev->devt_attr);
  }
 +
 +if (dev->class) {
 +sysfs_remove_link(&dev->kobj, "subsystem");
 +/* If this is not a fake compatible device, remove the
 + * symlink from the class to the device. */
 +if (dev->kobj.parent != &dev->class->subsys.kset.kobj)
 +sysfs_remove_link(&dev->class->subsys.kset.kobj,
 +  dev->bus_id);
 +#ifdef CONFIG_SYSFS_DEPRECATED
 +if (parent) {
 +char *class_name = make_class_name(dev->class->name,
 +   &dev->kobj);
 +if (class_name)
 +sysfs_remove_link(&dev->parent->kobj,
 +  class_name);
 +kfree(class_name);
 +sysfs_remove_link(&dev->kobj, "device");
 +}
 +#endif
 +
  block begin
 +down(&dev->class->sem);
 +/* notify any interfaces that the device is now gone */
 +list_for_each_entry(class_intf, &dev->class->interfaces, node)
 +if (class_intf->remove_dev)
 +class_intf->remove_dev(dev, class_intf);
 +/* remove the device from the class list */
 +list_del_init(&dev->node);
 +up(&dev->class->sem);
  block end 
 Maybe I've missed something, but I'm a little bit confused.
 For example, if an error happens in device_pm_add() we jump to label PMError
 and the code from the block above will be executed (the device will be removed
 from the list), but this device wasn't added to this list yet!
I've checked it one more time: the code really is broken, and I think I
understand how this can happen.
It looks like the full code chunk was copy-pasted from device_del(), but in the
device_add() error path the device wasn't added to the dev->class->devices list yet.
The following patch fixes this copy-paste error:

 [PATCH] driver core: fix device_add error path

 - At the moment we jump here, the device wasn't added to the
   dev->class->devices list yet.

Signed-off-by: Monakhov Dmitriy [EMAIL PROTECTED]
---
 drivers/base/core.c |9 -
 1 files changed, 0 insertions(+), 9 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 142c222..7d2459b 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -684,15 +684,6 @@ int device_add(struct device *dev)
 #endif
sysfs_remove_link(&dev->kobj, "device");
}
-
-   down(&dev->class->sem);
-   /* notify any interfaces that the device is now gone */
-   list_for_each_entry(class_intf, &dev->class->interfaces, node)
-   if (class_intf->remove_dev)
-   class_intf->remove_dev(dev, class_intf);
-   /* remove the device from the class list */
-   list_del_init(&dev->node);
-   up(&dev->class->sem);
}
  ueventattrError:
device_remove_file(dev, &dev->uevent_attr);
-- 
1.5.0.1
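
The rule behind this fix, that an error path should undo only the steps that
actually completed, can be shown with a small standalone sketch. The steps,
labels and add_device() below are hypothetical and only loosely mirror
device_add(). They are not the kernel code.

```c
/* Bits recording which setup steps the error path has undone. */
enum { STEP_ATTRS = 1, STEP_LINKS = 2, STEP_CLASS_LIST = 4 };

/* `fail_at` selects the step that fails (0 means no failure).
 * On failure, `*undone` reports which completed steps were rolled back. */
static int add_device(int fail_at, unsigned int *undone)
{
	*undone = 0;

	if (fail_at == 1)
		goto attrs_error;   /* nothing has been set up yet */
	/* step 1 done: attributes created */
	if (fail_at == 2)
		goto links_error;
	/* step 2 done: symlinks created */
	if (fail_at == 3)
		goto pm_error;      /* fails before the class list is touched */
	/* step 3 done: device added to the class list */
	return 0;

pm_error:
	*undone |= STEP_LINKS;  /* undo step 2 */
links_error:
	*undone |= STEP_ATTRS;  /* undo step 1 */
attrs_error:
	return -1;
}
```

No label undoes STEP_CLASS_LIST, because no failure point comes after that
step. That is the property the removed block violated.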




Re: 2.6.21-rc3-mm1 RSDL results

2007-03-11 Thread James Cloos
| See:
| 
http://webcvs.freedesktop.org/mesa/Mesa/src/mesa/drivers/dri/r200/r200_ioctl.c?revision=1.37&view=markup

OK.

Mesa is in git, now, but that still applies.  The gitweb url is:

http://gitweb.freedesktop.org/?p=mesa/mesa.git

and for the version of the above file in the master branch:

http://gitweb.freedesktop.org/?p=mesa/mesa.git;a=blob;f=src/mesa/drivers/dri/r200/r200_ioctl.c

The recursive grep(1) on mesa shows:

,[grep -r sched_yield mesa]
| mesa/mesa/src/mesa/drivers/dri/r300/radeon_ioctl.c:   sched_yield();
| mesa/mesa/src/mesa/drivers/dri/i915tex/intel_batchpool.c:  sched_yield();
| mesa/mesa/src/mesa/drivers/dri/i915tex/intel_batchbuffer.c: sched_yield();
| mesa/mesa/src/mesa/drivers/dri/common/vblank.h:#include <sched.h>   /* for sched_yield() */
| mesa/mesa/src/mesa/drivers/dri/common/vblank.h:#include <sched.h>   /* for sched_yield() */
| mesa/mesa/src/mesa/drivers/dri/common/vblank.h:  sched_yield();   \
| mesa/mesa/src/mesa/drivers/dri/unichrome/via_ioctl.c:  sched_yield();
| mesa/mesa/src/mesa/drivers/dri/i915/intel_ioctl.c: sched_yield();
| mesa/mesa/src/mesa/drivers/dri/r200/r200_ioctl.c:   sched_yield();
`

Thanks for the heads up.  I must've grep(1)ed the xorg subdir rather
than the parent dir, and so missed mesa.

-JimC
-- 
James Cloos [EMAIL PROTECTED] OpenPGP: 1024D/ED7DAEA6


Re: 2.6.21-rc3-mm1 RSDL results

2007-03-11 Thread Con Kolivas
On Sunday 11 March 2007 23:38, James Cloos wrote:
 | See:
 | http://webcvs.freedesktop.org/mesa/Mesa/src/mesa/drivers/dri/r200/r200_ioctl.c?revision=1.37&view=markup

 OK.

 Mesa is in git, now, but that still applies.  The gitweb url is:

 http://gitweb.freedesktop.org/?p=mesa/mesa.git

 and for the version of the above file in the master branch:

 http://gitweb.freedesktop.org/?p=mesa/mesa.git;a=blob;f=src/mesa/drivers/dri/r200/r200_ioctl.c

 The recursive grep(1) on mesa shows:

 ,[grep -r sched_yield mesa]
 | mesa/mesa/src/mesa/drivers/dri/r300/radeon_ioctl.c: sched_yield();
 | mesa/mesa/src/mesa/drivers/dri/i915tex/intel_batchpool.c: sched_yield();
 | mesa/mesa/src/mesa/drivers/dri/i915tex/intel_batchbuffer.c: sched_yield();
 | mesa/mesa/src/mesa/drivers/dri/common/vblank.h:#include <sched.h>   /* for sched_yield() */
 | mesa/mesa/src/mesa/drivers/dri/common/vblank.h:#include <sched.h>   /* for sched_yield() */
 | mesa/mesa/src/mesa/drivers/dri/common/vblank.h: sched_yield();  \
 | mesa/mesa/src/mesa/drivers/dri/unichrome/via_ioctl.c:  sched_yield();
 | mesa/mesa/src/mesa/drivers/dri/i915/intel_ioctl.c:   sched_yield();
 | mesa/mesa/src/mesa/drivers/dri/r200/r200_ioctl.c:   sched_yield();
 `

 Thanks for the heads up.  I must've grep(1)ed the xorg subdir rather
 than the parent dir, and so missed mesa.

I just wonder what the heck all these will do to testing when using any of 
these drivers. Whether or not we do no yield, mild yield or full blown 
expiration yield, somehow or other I can't get over the feeling that if the 
code relies on yield() we can't really trust them to be meaningful cpu 
scheduler tests. This means most 3d apps out there that aren't using binary 
drivers, whether they be (fscking) glxgears, audio app visualisations or 
what...

-- 
-ck


Locking interrupt handler in L1 cache

2007-03-11 Thread Parav Pandit
Hi,

I have MPC 8548 Linux 2.6.x based firewall which will
mostly do packet processing for 80% time.
So obviously most of the time it will RX and TX
packets through gianfar ethernet driver.

I want to lock my interrupt handler of this driver in
the L1 cache.

1. Is there any kernel API for locking function and
data to lock them in the L1/L2 cache?

2. How can I use icbtls - Instruction Cache Block
Touch and Lock Set for locking my interrupt handler?

3. Is icbtls the correct instruction for me to be
looking at?

4. How do I find end address of the interrupt handler
function and how do we pass it to cache locking
instructions? (Because it can happen that interrupt
handler size is more than a cache line, not aligned
etc)?

5. Can we enhance request_irq() function to take an
additional parameter to lock the interrupt handler in
the cache?

I understand that if my interrupt handler is going to
be called most of the time then it is very likely to
happen that OS will flush the same, but there is no
guarantee for it.

Regards,
Parav Pandit



 



Re: libata extension

2007-03-11 Thread Bartlomiej Zolnierkiewicz

Hi,

On Sunday 11 March 2007, Vitaliyi wrote:
 Good Day
 
 Say i want to implement extended set of ATA commands available to
 userspace for building diagnostic tools.
 I need 0x40 -- read verify and 0x32 -- write long with error handling,

Mark Lord is working on READ/WRITE_LONG support for libata;
he recently posted a draft patch to the linux-ide mailing list.

[ Please consider reading/joining linux-ide@vger.kernel.org ML,
  it is where Linux ATA discussion happens... ]

 for example. I was trying ide driver through ioctl's, but seems it
 lack of functionality and full of gotchas. Furthermore it oopses
 sometimes.

READ/WRITE_LONG is unsupported and as you've already noticed
TASKFILE ioctls are full of gotchas...

 Is it possible to use libata for such purpose or i need to write
 separate IDE driver ?

It should be possible using ATA pass-through, some libata changes
may be required but it is the right way to go IMO.

Bart


Re: [PATCH] lpfc: avoid double-free during PCI error failure

2007-03-11 Thread James Smart

ACK...  Looks good...

-- james s


Linas Vepstas wrote:

Bino, James,
Please review, sign-off and forward upstream.

--linas


If a PCI error is detected that cannot be recovered from, there
will be a double call of lpfc_pci_remove_one(), with the second call
resulting in a null-pointer dereference. The first call occurs in 
lpfc_io_error_detected(), and the second call during pci device 
remove. This patch eliminates the first call; it's unneeded.


Signed-off-by: Linas Vepstas [EMAIL PROTECTED]


 drivers/scsi/lpfc/lpfc_init.c |5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Index: linux-2.6.20-git16/drivers/scsi/lpfc/lpfc_init.c
===
--- linux-2.6.20-git16.orig/drivers/scsi/lpfc/lpfc_init.c   2007-03-08 15:57:40.0 -0600
+++ linux-2.6.20-git16/drivers/scsi/lpfc/lpfc_init.c    2007-03-08 16:03:18.0 -0600
@@ -1817,10 +1817,9 @@ static pci_ers_result_t lpfc_io_error_de
struct lpfc_sli *psli = &phba->sli;
struct lpfc_sli_ring  *pring;
 
-   if (state == pci_channel_io_perm_failure) {
-   lpfc_pci_remove_one(pdev);
+   if (state == pci_channel_io_perm_failure)
return PCI_ERS_RESULT_DISCONNECT;
-   }
+
pci_disable_device(pdev);
/*
 * There may be I/Os dropped by the firmware.




Style Question

2007-03-11 Thread Cong WANG

Hi, list!

I have a question about coding style in the Linux kernel. In
Documentation/CodingStyle, it is said that Linux style for comments is
the C89 /* ... */ style. Don't use C99-style // ... comments.
_But_ I see a lot of '//' style comments in current kernel code.

Which is wrong? The documentation or the code, or neither? And why?

Another question is about NULL. AFAIK, in user space, using NULL is
better than directly using 0 in C. In kernel, I know it used its own
NULL, which may be defined as ((void*)0), but it's _still_ different
from raw zero. So can I say using NULL is better than 0 in kernel?

Any reply is welcome. Thanks and have a nice day!


Re: Style Question

2007-03-11 Thread Bernd Petrovitsch
On Sun, 2007-03-11 at 22:15 +0800, Cong WANG wrote:
[...]
 Another question is about NULL. AFAIK, in user space, using NULL is
 better than directly using 0 in C. In kernel, I know it used its own
 NULL, which may be defined as ((void*)0),

Userspace usually has the same definition.

   but it's _still_ different
 from raw zero.

It is different in that 0 as such has the type int. But this int constant is
automatically converted to a null pointer when it is used as a pointer.

 So can I say using NULL is better than 0 in kernel?

Yes, because it is immediately clear that a pointer is (or should be)
there (and not an int).
And the same holds for userspace since this is a pure C question.
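
A short C sketch of the point: both spellings produce the same null pointer,
so the difference is readability only. The function names are made up for the
illustration.

```c
#include <stddef.h>

/* Returns 1 if p is a null pointer, 0 otherwise. */
static int is_null(const char *p)
{
	return p == NULL;
}

/* The integer constant 0 and NULL both convert to a null pointer,
 * so the two variables compare equal. */
static int zero_is_null(void)
{
	const char *a = 0;     /* legal, but reads like an integer */
	const char *b = NULL;  /* same value, clearly a pointer */

	return a == b;
}
```

Calling is_null(0) and is_null(NULL) gives the same result. Only the second
tells the reader that a pointer is expected.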

Bernd
-- 
Firmix Software GmbH   http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
  Embedded Linux Development and Services



Re: [RFC][PATCH 2/7] RSS controller core

2007-03-11 Thread Herbert Poetzl
On Sun, Mar 11, 2007 at 12:08:16PM +0300, Pavel Emelianov wrote:
 Herbert Poetzl wrote:
 On Tue, Mar 06, 2007 at 02:00:36PM -0800, Andrew Morton wrote:
 On Tue, 06 Mar 2007 17:55:29 +0300
 Pavel Emelianov [EMAIL PROTECTED] wrote:

 +struct rss_container {
 +  struct res_counter res;
 +  struct list_head page_list;
 +  struct container_subsys_state css;
 +};
 +
 +struct page_container {
 +  struct page *page;
 +  struct rss_container *cnt;
 +  struct list_head list;
 +};
 ah. This looks good. I'll find a hunk of time to go through this
 work and through Paul's patches. It'd be good to get both patchsets
 lined up in -mm within a couple of weeks. But..
 
 doesn't look so good for me, mainly because of the 
 additional per page data and per page processing
 
 on 4GB memory, with 100 guests, 50% shared for each
 guest, this basically means ~1mio pages, 500k shared
 and 1500k x sizeof(page_container) entries, which
 roughly boils down to ~25MB of wasted memory ...
 
 increase the amount of shared pages and it starts
 getting worse, but maybe I'm missing something here
 
 You are. Each page has only one page_container associated
 with it despite the number of containers it is shared
 between.
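
The two estimates can be compared with a little arithmetic. The sketch below
assumes 4 KiB pages and a 16-byte struct page_container (two pointers plus a
two-pointer list_head on a 32-bit kernel). Both figures are assumptions and
are not stated in the thread, and the helper names are hypothetical.

```c
#include <stdint.h>

#define PAGE_SIZE_BYTES      4096u  /* assumption: 4 KiB pages */
#define PAGE_CONTAINER_BYTES 16u    /* assumption: 4 pointers of 4 bytes each */

/* Number of pages in a given amount of memory. */
static uint64_t pages_in(uint64_t mem_bytes)
{
	return mem_bytes / PAGE_SIZE_BYTES;
}

/* Memory used by `entries` page_container structures. */
static uint64_t overhead_bytes(uint64_t entries)
{
	return entries * PAGE_CONTAINER_BYTES;
}
```

Under these assumptions 4 GB holds 1048576 pages. 1500k entries would cost
24000000 bytes, in line with the ~25MB estimate, while one entry per page
costs 16777216 bytes (16 MiB) no matter how many containers share the page.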
 
 We need to decide whether we want to do per-container memory
 limitation via these data structures, or whether we do it via
 a physical scan of some software zone, possibly based on Mel's
 patches.
 
 why not do simple page accounting (as done currently
 in Linux) and use that for the limits, without
 keeping the reference from container to page?
 
 As I've already answered in my previous letter, simple
 limiting w/o per-container reclamation and a per-container
 oom killer isn't good memory management. It doesn't allow
 handling resource shortage gracefully.

per container OOM killer does not require any container
page reference, you know _what_ tasks belong to the 
container, and you know their _badness_ from the normal
OOM calculations, so doing them for a container is really
straight forward without having any page 'tagging'
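
A rough userspace sketch of that idea, for illustration only: pick the
task with the highest badness among the tasks of one container, without
touching any per-page data. The struct layout, the container_id field
and the function name are hypothetical and are not taken from the
patchset; the badness value is assumed to come from the normal OOM
heuristics.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified task record; not a kernel structure. */
struct task {
	int container_id;       /* assumed identifier of the owning container */
	unsigned long badness;  /* score from the usual OOM calculation */
};

/*
 * Return the task with the highest badness inside one container,
 * or NULL if the container has no tasks. No page "tagging" is needed:
 * only task membership and the per-task score are consulted.
 */
static inline const struct task *
container_oom_victim(const struct task *tasks, size_t n, int container_id)
{
	const struct task *victim = NULL;
	size_t i;

	assert(tasks != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (tasks[i].container_id != container_id)
			continue;       /* task belongs to another container */
		if (!victim || tasks[i].badness > victim->badness)
			victim = &tasks[i];
	}
	return victim;
}
```

Ties are resolved in favour of the first task seen, which is an
arbitrary choice made for this sketch.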

for the reclamation part, please elaborate how that will
differ in a (shared memory) guest from what the kernel
currently does ...

TIA,
Herbert

 This patchset provides a more graceful way to handle this, but
 full memory management includes accounting of VMA-length
 as well (returning ENOMEM from system call) but we've decided
 to start with RSS.
 
 best,
 Herbert
 
 


Re: [PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-11 Thread Gene Heskett
On Sunday 11 March 2007, Mike Galbraith wrote:
Hi Con,

On Sun, 2007-03-11 at 14:57 +1100, Con Kolivas wrote:
 What follows this email is a patch series for the latest version of
 the RSDL cpu scheduler (ie v0.29). I have addressed all bugs that I am
 able to reproduce in this version so if some people would be kind
 enough to test if there are any hidden bugs or oops lurking, it would
 be nice to know in anticipation of putting this back in -mm. Thanks.

 Full patch for 2.6.21-rc3-mm2:
 http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc3-mm2-rsdl-0.29.patch

I'm seeing a cpu distribution problem running this on my P4 box.

Scenario:
listening to music collection (mp3) via Amarok.  Enable Amarok
visualization gforce, and size such that X and gforce each use ~50% cpu.
Start rip/encode of new CD with grip/lame encoder.  Lame is set to use
both cpus, at nice 5.  Once the encoders start, they receive
considerably more cpu than nice 0 X/Gforce, taking ~120% and leaving the
remaining 80% for X/Gforce and Amarok (when it updates its ~12k entry
database) to squabble over.

With 2.6.21-rc3,  X/Gforce maintain their ~50% cpu (remain smooth), and
the encoders (100%cpu bound) get what's left when Amarok isn't eating it.

I plunked the above patch into plain 2.6.21-rc3 and retested to
eliminate other mm tree differences, and it's repeatable.  The nice 5
cpu hogs always receive considerably more than the nice 0 sleepers.

   -Mike

Just to comment, I've been running one of the patches between 20-ck1 and 
this latest one, which is building as I type, but I also run gkrellm 
here, version 2.2.9.

Since I have been running this middle-of-the-series patch, something is 
killing gkrellm about once a day, and there is nothing in the logs to 
indicate a problem.  I see a blink out of the corner of my eye, and it's 
gone.  And it always starts right back up from a kmenu click.

No idea if anyone else is experiencing this or not.

-- 
Cheers, Gene
There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order.
-Ed Howdershelt (Author)
You scratch my tape, and I'll scratch yours.


Re: [RFC][PATCH 2/7] RSS controller core

2007-03-11 Thread Pavel Emelianov
Herbert Poetzl wrote:
 On Sun, Mar 11, 2007 at 12:08:16PM +0300, Pavel Emelianov wrote:
 Herbert Poetzl wrote:
 On Tue, Mar 06, 2007 at 02:00:36PM -0800, Andrew Morton wrote:
 On Tue, 06 Mar 2007 17:55:29 +0300
 Pavel Emelianov [EMAIL PROTECTED] wrote:

 +struct rss_container {
 + struct res_counter res;
 + struct list_head page_list;
 + struct container_subsys_state css;
 +};
 +
 +struct page_container {
 + struct page *page;
 + struct rss_container *cnt;
 + struct list_head list;
 +};
 ah. This looks good. I'll find a hunk of time to go through this
 work and through Paul's patches. It'd be good to get both patchsets
 lined up in -mm within a couple of weeks. But..
 doesn't look so good for me, mainly because of the 
 additional per page data and per page processing

 on 4GB memory, with 100 guests, 50% shared for each
 guest, this basically means ~1mio pages, 500k shared
 and 1500k x sizeof(page_container) entries, which
 roughly boils down to ~25MB of wasted memory ...

 increase the amount of shared pages and it starts
 getting worse, but maybe I'm missing something here
 You are. Each page has only one page_container associated
 with it despite the number of containers it is shared
 between.

 We need to decide whether we want to do per-container memory
 limitation via these data structures, or whether we do it via
 a physical scan of some software zone, possibly based on Mel's
 patches.
 why not do simple page accounting (as done currently
 in Linux) and use that for the limits, without
 keeping the reference from container to page?
 As I've already answered in my previous letter simple
 limiting w/o per-container reclamation and per-container
 oom killer isn't a good memory management. It doesn't allow
 to handle resource shortage gracefully.
 
 per container OOM killer does not require any container
 page reference, you know _what_ tasks belong to the 
 container, and you know their _badness_ from the normal
 OOM calculations, so doing them for a container is really
 straight forward without having any page 'tagging'

That's true. If you look at the patches you'll
find out that no code in oom killer uses page 'tag'.

 for the reclamation part, please elaborate how that will
 differ in a (shared memory) guest from what the kernel
 currently does ...

This is all described in the code and in the
discussions we had before.

 TIA,
 Herbert
 
 This patchset provides a more graceful way to handle this, but
 full memory management includes accounting of VMA-length
 as well (returning ENOMEM from system call) but we've decided
 to start with RSS.

 best,
 Herbert


 



Re: [patch 6/9] signalfd/timerfd - timerfd core ...

2007-03-11 Thread Thomas Gleixner
Davide,

On Sat, 2007-03-10 at 18:22 -0800, Davide Libenzi wrote:

Some remarks:

 +
 +asmlinkage long sys_timerfd(int ufd, int clockid, int tmrtype,
 + const struct timespec __user *utmr)
 +{
 + int error;
 + struct timerfd_ctx *ctx;
 + struct file *file;
 + struct inode *inode;
 + ktime_t tval, tnow;
 + struct timespec ktmr, tmrnow;
 +
 + error = -EFAULT;
 + if (copy_from_user(&ktmr, utmr, sizeof(ktmr)))
 + goto err_exit;

Please do not use goto for a simple
return -EFAULT;

Please validate the timespec before converting it.

if (!timespec_valid(&ktmr))
return -EINVAL;
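
For illustration, a minimal userspace sketch of the check asked for
above. It is an assumption about what "valid" means here (non-negative
seconds, nanoseconds within [0, 1e9)), and timespec_is_valid() and
check_timer_arg() are hypothetical names, not the kernel helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Userspace stand-in for a timespec sanity check (hypothetical helper). */
static inline bool timespec_is_valid(const struct timespec *ts)
{
	return ts->tv_sec >= 0 &&
	       ts->tv_nsec >= 0 && ts->tv_nsec < NSEC_PER_SEC;
}

/* Reject a bad timespec up front, before any conversion is attempted. */
static inline int check_timer_arg(const struct timespec *ts)
{
	assert(ts != NULL);
	if (!timespec_is_valid(ts))
		return -EINVAL;
	return 0;
}
```

Doing the check first means the conversion to ktime_t never sees a
negative or out-of-range value.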


 + tval = timespec_to_ktime(ktmr);
 + error = -EINVAL;
 + if (clockid != CLOCK_MONOTONIC &&
 + clockid != CLOCK_REALTIME)
 + goto err_exit;
 + switch (tmrtype) {
 + case TFD_TIMER_REL:
 + case TFD_TIMER_SEQ:
 + break;
 + case TFD_TIMER_ABS:
 + getnstimeofday(&tmrnow);
 + tnow = timespec_to_ktime(tmrnow);

tnow = ktime_get();

 + if (ktime_to_ns(tval) <= ktime_to_ns(tnow))
 + goto err_exit;
 + tval = ktime_sub(tval, tnow);

Why do you want to do that ? hrtimers handle relative and absolute
expiry times. You break down everything to relative time and lose the
accuracy for absolute timers. 

 + break;
 + default:
 + goto err_exit;
 + }
 +
 + if (ufd == -1) {
 + error = -ENOMEM;
 + ctx = kmem_cache_alloc(timerfd_ctx_cachep, GFP_KERNEL);
 + if (!ctx)
 + goto err_exit;
 +
 + init_waitqueue_head(&ctx->wqh);
 + spin_lock_init(&ctx->lock);
 + ctx->ticks = 0;
 + ctx->tmrtype = tmrtype;
 + ctx->clockid = clockid;
 + ctx->tval = tval;
 + hrtimer_init(&ctx->tmr, ctx->clockid, HRTIMER_REL);
 + ctx->tmr.expires = ctx->tval;
 + ctx->tmr.function = timerfd_tmrproc;
 +
 + hrtimer_start(&ctx->tmr, ctx->tval, HRTIMER_REL);
 +
 + /*
 +  * When we call this, the initialization must be complete, since
 +  * aino_getfd() will install the fd.
 +  */
 + error = aino_getfd(&ufd, &inode, &file, "[timerfd]",
 +&timerfd_fops, ctx);
 + if (error)
 + goto err_fdalloc;

Why is the timer started before we have everything in place ? 

Also if you turn it around then the (re)programming part of the timer
can be shared.

 + } else {
 + error = -EBADF;
 + file = fget(ufd);
 + if (!file)
 + goto err_exit;
 + ctx = file->private_data;
 + error = -EINVAL;
 + if (file->f_op != &timerfd_fops) {
 + fput(file);
 + goto err_exit;
 + }
 +
 + /*
 +  * We need to stop the existing timer before. We call
 +  * hrtimer_cancel() w/out holding our lock.
 +  */
 + spin_lock_irq(&ctx->lock);
 + while (hrtimer_active(&ctx->tmr)) {
 + spin_unlock_irq(&ctx->lock);
 + hrtimer_cancel(&ctx->tmr);
 + spin_lock_irq(&ctx->lock);
 + }

Please use hrtimer_try_to_cancel()

retry:
spin_lock_irq();
if (hrtimer_try_to_cancel(&ctx->tmr) < 0) {
spin_unlock_irq();
cpu_relax();
goto retry;
}

 +
 +static unsigned int timerfd_poll(struct file *file, poll_table *wait)
 +{
 + struct timerfd_ctx *ctx = file->private_data;
 +
 + poll_wait(file, &ctx->wqh, wait);
 +
 + return ctx->ticks ? POLLIN: 0;

This is racy:

timer is set up (non periodic)
timer expires
poll 

now poll is stuck for ever !


tglx




Re: Style Question

2007-03-11 Thread Robert Hancock

Cong WANG wrote:

Hi, list!

I have a question about coding style in the linux kernel. In
Documentation/CodingStyle, it is said that "Linux style for comments is
the C89 "/* ... */" style. Don't use C99-style "// ..." comments."
_But_ I see a lot of '//' style comments in current kernel code.

Which is wrong? The documentation or the code, or neither? And why?


The code.. As with a lot of coding style issues, it's likely just that 
nobody saw it and bothered to complain when it went in.



Another question is about NULL. AFAIK, in user space, using NULL is
better than directly using 0 in C. In kernel, I know it used its own
NULL, which may be defined as ((void*)0), but it's _still_ different
from raw zero. So can I say using NULL is better than 0 in kernel?


It's the preferred style; sparse will complain about using 0 for a null 
pointer for example..
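
A small illustration of both points, as a sketch only: C89-style
comments, and NULL rather than a bare 0 wherever a pointer is meant.
The struct and function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal singly linked list node, for illustration only. */
struct node {
	struct node *next;
	int value;
};

/*
 * Count the nodes in a list. The loop condition spells out NULL;
 * writing "head != 0" compiles too, but sparse flags a plain integer
 * used as a null pointer.
 */
static inline int list_length(const struct node *head)
{
	int n = 0;

	while (head != NULL) {
		n++;
		head = head->next;
	}
	return n;
}

/* Push a node onto the front of a list and return the new head. */
static inline struct node *list_push(struct node *head, struct node *n)
{
	assert(n != NULL);
	n->next = head;
	return n;
}
```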


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove nospam from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



[PATCH] KVM: MMU: Fix host memory corruption on i386 with >= 4GB ram

2007-03-11 Thread Avi Kivity
PAGE_MASK is an unsigned long, so using it to mask physical addresses on
i386 (which are 64-bit wide) leads to truncation.  This can result in
page->private of unrelated memory pages being modified, with disastrous
results.

Fix by not using PAGE_MASK for physical addresses; instead calculate
the correct value directly from PAGE_SIZE.  Also fix a similar BUG_ON().

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/mmu.c |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/kvm/mmu.c b/drivers/kvm/mmu.c
index 2cb4893..e85b4c7 100644
--- a/drivers/kvm/mmu.c
+++ b/drivers/kvm/mmu.c
@@ -131,7 +131,7 @@ static int dbg = 1;
 	(((address) >> PT32_LEVEL_SHIFT(level)) & ((1 << PT32_LEVEL_BITS) - 1))
 
 
-#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & PAGE_MASK)
+#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
 #define PT64_DIR_BASE_ADDR_MASK \
 	(PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + PT64_LEVEL_BITS)) - 1))
 
@@ -406,8 +406,8 @@ static void rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
 		spte = desc->shadow_ptes[0];
 	}
 	BUG_ON(!spte);
-	BUG_ON((*spte & PT64_BASE_ADDR_MASK) !=
-	       page_to_pfn(page) << PAGE_SHIFT);
+	BUG_ON((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT
+	       != page_to_pfn(page));
 	BUG_ON(!(*spte & PT_PRESENT_MASK));
 	BUG_ON(!(*spte & PT_WRITABLE_MASK));
 	rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
-- 
1.5.0.2
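
The truncation described in this patch can be reproduced in userspace.
The following is a sketch under stated assumptions: uint32_t stands in
for a 32-bit unsigned long, and the helper names are invented for the
example. Combining a 32-bit mask with a 64-bit value zero-extends the
mask, so every address bit above bit 31 is cleared.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Stand-in for PAGE_MASK on a 32-bit host (unsigned long is 32 bits). */
#define PAGE_MASK_32 ((uint32_t)~(PAGE_SIZE - 1))

/*
 * Old-style mask: the 32-bit PAGE_MASK is zero-extended to 64 bits,
 * so bits 32..51 of the physical address are lost.
 */
static inline uint64_t base_addr_truncating(uint64_t pte)
{
	return pte & ((1ULL << 52) - 1) & PAGE_MASK_32;
}

/* Fixed mask: computed as a 64-bit value directly from PAGE_SIZE. */
static inline uint64_t base_addr_fixed(uint64_t pte)
{
	return pte & ((1ULL << 52) - 1) & ~(uint64_t)(PAGE_SIZE - 1);
}

/* Usage example: a pte pointing just above 4GB, with low flag bits set. */
static inline void demo_truncation(void)
{
	uint64_t pte = 0x0000000123456067ULL;

	assert(base_addr_truncating(pte) == 0x23456000ULL);  /* wrong page */
	assert(base_addr_fixed(pte) == 0x123456000ULL);      /* intended page */
}
```

With less than 4GB of RAM both helpers agree, which is consistent with
the bug only showing up on large-memory hosts.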



[PATCH] KVM: MMU: Fix guest writes to nonpae pde

2007-03-11 Thread Avi Kivity
KVM shadow page tables are always in pae mode, regardless of the guest
setting.  This means that a guest pde (mapping 4MB of memory) is mapped
to two shadow pdes (mapping 2MB each).

When the guest writes to a pte or pde, we intercept the write and emulate it.
We also remove any shadowed mappings corresponding to the write.  Since the
mmu did not account for the doubling in the number of pdes, it removed the
wrong entry, resulting in a mismatch between shadow page tables and guest
page tables, followed shortly by guest memory corruption.

This patch fixes the problem by detecting the special case of writing to
a non-pae pde and adjusting the address and number of shadow pdes zapped
accordingly.

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/mmu.c |   46 ++
 1 files changed, 34 insertions(+), 12 deletions(-)

diff --git a/drivers/kvm/mmu.c b/drivers/kvm/mmu.c
index a1a9336..2cb4893 100644
--- a/drivers/kvm/mmu.c
+++ b/drivers/kvm/mmu.c
@@ -1093,22 +1093,40 @@ out:
 	return r;
 }
 
+static void mmu_pre_write_zap_pte(struct kvm_vcpu *vcpu,
+				  struct kvm_mmu_page *page,
+				  u64 *spte)
+{
+	u64 pte;
+	struct kvm_mmu_page *child;
+
+	pte = *spte;
+	if (is_present_pte(pte)) {
+		if (page->role.level == PT_PAGE_TABLE_LEVEL)
+			rmap_remove(vcpu, spte);
+		else {
+			child = page_header(pte & PT64_BASE_ADDR_MASK);
+			mmu_page_remove_parent_pte(vcpu, child, spte);
+		}
+	}
+	*spte = 0;
+}
+
 void kvm_mmu_pre_write(struct kvm_vcpu *vcpu, gpa_t gpa, int bytes)
 {
 	gfn_t gfn = gpa >> PAGE_SHIFT;
 	struct kvm_mmu_page *page;
-	struct kvm_mmu_page *child;
 	struct hlist_node *node, *n;
 	struct hlist_head *bucket;
 	unsigned index;
 	u64 *spte;
-	u64 pte;
 	unsigned offset = offset_in_page(gpa);
 	unsigned pte_size;
 	unsigned page_offset;
 	unsigned misaligned;
 	int level;
 	int flooded = 0;
+	int npte;
 
 	pgprintk("%s: gpa %llx bytes %d\n", __FUNCTION__, gpa, bytes);
 	if (gfn == vcpu->last_pt_write_gfn) {
@@ -1144,22 +1162,26 @@ void kvm_mmu_pre_write(struct kvm_vcpu *vcpu, gpa_t gpa, int bytes)
 		}
 		page_offset = offset;
 		level = page->role.level;
+		npte = 1;
 		if (page->role.glevels == PT32_ROOT_LEVEL) {
-			page_offset <<= 1;          /* 32->64 */
+			page_offset <<= 1;	/* 32->64 */
+			/*
+			 * A 32-bit pde maps 4MB while the shadow pdes map
+			 * only 2MB.  So we need to double the offset again
+			 * and zap two pdes instead of one.
+			 */
+			if (level == PT32_ROOT_LEVEL) {
+				page_offset <<= 1;
+				npte = 2;
+			}
 			page_offset &= ~PAGE_MASK;
 		}
 		spte = __va(page->page_hpa);
 		spte += page_offset / sizeof(*spte);
-		pte = *spte;
-		if (is_present_pte(pte)) {
-			if (level == PT_PAGE_TABLE_LEVEL)
-				rmap_remove(vcpu, spte);
-			else {
-				child = page_header(pte & PT64_BASE_ADDR_MASK);
-				mmu_page_remove_parent_pte(vcpu, child, spte);
-			}
+		while (npte--) {
+			mmu_pre_write_zap_pte(vcpu, page, spte);
+			++spte;
 		}
-		*spte = 0;
 	}
 }
 
-- 
1.5.0.2
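
The offset arithmetic in this patch can be tried out in isolation. The
sketch below is a userspace approximation, not the kernel code: struct
zap_range and shadow_zap_range() are invented names, and the value 2
for PT32_ROOT_LEVEL is an assumption made for the example. It covers
only the case of a 32-bit non-pae guest, where guest entries are 4
bytes and shadow entries are 8 bytes.

```c
#include <assert.h>

#define PAGE_SIZE       4096u
#define PT32_ROOT_LEVEL 2   /* assumed level of a non-pae guest page directory */

struct zap_range {
	unsigned offset;  /* byte offset of the first shadow entry in its page */
	unsigned npte;    /* number of 8-byte shadow entries to clear */
};

/* Map a write at guest_offset in a 32-bit guest table to shadow entries. */
static inline struct zap_range shadow_zap_range(unsigned guest_offset, int level)
{
	struct zap_range r = { guest_offset, 1 };

	assert(guest_offset < PAGE_SIZE);
	r.offset <<= 1;                 /* 4-byte guest entry -> 8-byte shadow entry */
	if (level == PT32_ROOT_LEVEL) { /* one 4MB pde -> two 2MB shadow pdes */
		r.offset <<= 1;
		r.npte = 2;
	}
	r.offset &= PAGE_SIZE - 1;      /* keep the offset inside one shadow page */
	return r;
}
```

For example, a write to the guest pde at byte offset 0x10 (index 4)
maps to shadow pdes 8 and 9, that is byte offset 0x40 with two entries
zapped, whereas a pte at the same offset maps to a single shadow entry
at 0x20.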



[PATCH 0/2] KVM: More fixes for 2.6.21-rc3

2007-03-11 Thread Avi Kivity
This patchset contains fixes I plan to submit pre 2.6.21: a fix for
large memory 32-bit hosts, and a fix for non-pae 32-bit guests.




Re: [kvm-devel] [PATCH] KVM: MMU: Fix guest writes to nonpae pde

2007-03-11 Thread Ingo Molnar

* Avi Kivity [EMAIL PROTECTED] wrote:

 KVM shadow page tables are always in pae mode, regardless of the guest 
 setting.  This means that a guest pde (mapping 4MB of memory) is 
 mapped to two shadow pdes (mapping 2MB each).
 
 When the guest writes to a pte or pde, we intercept the write and 
 emulate it. We also remove any shadowed mappings corresponding to the 
 write.  Since the mmu did not account for the doubling in the number 
 of pdes, it removed the wrong entry, resulting in a mismatch between 
 shadow page tables and guest page tables, followed shortly by guest 
 memory corruption.
 
 This patch fixes the problem by detecting the special case of writing 
 to a non-pae pde and adjusting the address and number of shadow pdes 
 zapped accordingly.
 
 Signed-off-by: Avi Kivity [EMAIL PROTECTED]

tested this with both PAE and non-PAE Linux host and guest - works fine.

Acked-by: Ingo Molnar [EMAIL PROTECTED]

Ingo


Re: [kvm-devel] [PATCH] KVM: MMU: Fix host memory corruption on i386 with >= 4GB ram

2007-03-11 Thread Ingo Molnar

* Avi Kivity [EMAIL PROTECTED] wrote:

 PAGE_MASK is an unsigned long, so using it to mask physical addresses 
 on i386 (which are 64-bit wide) leads to truncation.  This can result 
 in page->private of unrelated memory pages being modified, with 
 disastrous results.
 
 Fix by not using PAGE_MASK for physical addresses; instead calculate 
 the correct value directly from PAGE_SIZE.  Also fix a similar 
 BUG_ON().
 
 Signed-off-by: Avi Kivity [EMAIL PROTECTED]

i have tested this, albeit with less than 4GB RAM.

Acked-by: Ingo Molnar [EMAIL PROTECTED]

Ingo


[patch] KVM: always reload segment selectors

2007-03-11 Thread Ingo Molnar
Subject: [patch] KVM: always reload segment selectors
From: Ingo Molnar [EMAIL PROTECTED]

failed VM entry on VMX might still change %fs or %gs, thus make sure 
that KVM always reloads the segment selectors. This is crucial on both 
x86 and x86_64: x86 has __KERNEL_PDA in %fs on which things like 
'current' depends and x86_64 has 0 there and needs MSR_GS_BASE to work.

Signed-off-by: Ingo Molnar [EMAIL PROTECTED]
---
 drivers/kvm/vmx.c |   37 +
 1 file changed, 21 insertions(+), 16 deletions(-)

Index: linux/drivers/kvm/vmx.c
===
--- linux.orig/drivers/kvm/vmx.c
+++ linux/drivers/kvm/vmx.c
@@ -1896,6 +1896,27 @@ again:
 		  [cr2]"i"(offsetof(struct kvm_vcpu, cr2))
 	      : "cc", "memory" );
 
+   /*
+* Reload segment selectors ASAP. (it's needed for a functional
+* kernel: x86 relies on having __KERNEL_PDA in %fs and x86_64
+* relies on having 0 in %gs for the CPU PDA to work.)
+*/
+   if (fs_gs_ldt_reload_needed) {
+   load_ldt(ldt_sel);
+   load_fs(fs_sel);
+   /*
+* If we have to reload gs, we must take care to
+* preserve our gs base.
+*/
+   local_irq_disable();
+   load_gs(gs_sel);
+#ifdef CONFIG_X86_64
+   wrmsrl(MSR_GS_BASE, vmcs_readl(HOST_GS_BASE));
+#endif
+   local_irq_enable();
+
+   reload_tss();
+   }
++kvm_stat.exits;
 
 	save_msrs(vcpu->guest_msrs, NR_BAD_MSRS);
@@ -1913,22 +1934,6 @@ again:
 		kvm_run->exit_reason = vmcs_read32(VM_INSTRUCTION_ERROR);
r = 0;
} else {
-   if (fs_gs_ldt_reload_needed) {
-   load_ldt(ldt_sel);
-   load_fs(fs_sel);
-   /*
-* If we have to reload gs, we must take care to
-* preserve our gs base.
-*/
-   local_irq_disable();
-   load_gs(gs_sel);
-#ifdef CONFIG_X86_64
-   wrmsrl(MSR_GS_BASE, vmcs_readl(HOST_GS_BASE));
-#endif
-   local_irq_enable();
-
-   reload_tss();
-   }
/*
 * Profile KVM exit RIPs:
 */


Re: [patch] KVM: always reload segment selectors

2007-03-11 Thread Avi Kivity

Ingo Molnar wrote:

Subject: [patch] KVM: always reload segment selectors
From: Ingo Molnar [EMAIL PROTECTED]

failed VM entry on VMX might still change %fs or %gs, thus make sure 
that KVM always reloads the segment selectors. This is crucial on both 
x86 and x86_64: x86 has __KERNEL_PDA in %fs on which things like 
'current' depends and x86_64 has 0 there and needs MSR_GS_BASE to work.


Signed-off-by: Ingo Molnar [EMAIL PROTECTED]
---
 drivers/kvm/vmx.c |   37 +
 1 file changed, 21 insertions(+), 16 deletions(-)

Index: linux/drivers/kvm/vmx.c
===
--- linux.orig/drivers/kvm/vmx.c
+++ linux/drivers/kvm/vmx.c
@@ -1896,6 +1896,27 @@ again:
 		  [cr2]"i"(offsetof(struct kvm_vcpu, cr2))
 	      : "cc", "memory" );
 
+	/*
+* Reload segment selectors ASAP. (it's needed for a functional
+* kernel: x86 relies on having __KERNEL_PDA in %fs and x86_64
+* relies on having 0 in %gs for the CPU PDA to work.)
+*/
+   if (fs_gs_ldt_reload_needed) {
+   load_ldt(ldt_sel);
+   load_fs(fs_sel);
+   /*
+* If we have to reload gs, we must take care to
+* preserve our gs base.
+*/
+   local_irq_disable();
+   load_gs(gs_sel);
+#ifdef CONFIG_X86_64
+   wrmsrl(MSR_GS_BASE, vmcs_readl(HOST_GS_BASE));
+#endif
+   local_irq_enable();
+
+   reload_tss();
+   }
++kvm_stat.exits;
 
 	save_msrs(vcpu->guest_msrs, NR_BAD_MSRS);


btw, looking at the code, we could just remove fs from the 
fs_gs_reload_needed and make it unconditional.  VT knows how to reload 
segments, except if they're user segments (groan).  In the case of fs, 
if it's used for the pda, it's obviously a kernel segment.


gs is different: since only the segment base is loaded (via swapgs), the 
selector part could well be a userspace selector, and thus the 
irq-protected reload is needed.


Anyway, I'm applying the patch as the above discourse is irrelevant to 
the fix.



--
error compiling committee.c: too many arguments to function



[PATCH 0/15] KVM userspace interface updates

2007-03-11 Thread Avi Kivity
This patchset updates the kvm userspace interface to what I hope will
be the long-term stable interface.  Provisions are included for extending
the interface later.  The patches address performance and cleanliness
concerns.

One patch is missing -- I'd like the string pio transfers not to include
guest virtual addresses.  To date all my attempts to write the patch ended
with me losing consciousness.  Hopefully I'll manage it soon.

I'd like to submit the patchset post 2.6.21.  Comments are welcome.


[PATCH 03/15] KVM: Initialize PIO I/O count

2007-03-11 Thread Avi Kivity
This allows userspace to ignore the io.rep field.  Not a big deal, but
friendly.

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/svm.c |1 +
 drivers/kvm/vmx.c |1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c
index b176f5a..c35b8c8 100644
--- a/drivers/kvm/svm.c
+++ b/drivers/kvm/svm.c
@@ -1037,6 +1037,7 @@ static int io_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	kvm_run->io.size = ((io_info & SVM_IOIO_SIZE_MASK) >> SVM_IOIO_SIZE_SHIFT);
 	kvm_run->io.string = (io_info & SVM_IOIO_STR_MASK) != 0;
 	kvm_run->io.rep = (io_info & SVM_IOIO_REP_MASK) != 0;
+	kvm_run->io.count = 1;
 
 	if (kvm_run->io.string) {
 		unsigned addr_mask;
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index 7fd572a..d4c9f33 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -1459,6 +1459,7 @@ static int handle_io(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 		= (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_DF) != 0;
 	kvm_run->io.rep = (exit_qualification & 32) != 0;
 	kvm_run->io.port = exit_qualification >> 16;
+	kvm_run->io.count = 1;
 	if (kvm_run->io.string) {
 		if (!get_io_count(vcpu, &kvm_run->io.count))
 			return 1;
-- 
1.5.0.2



[PATCH 04/15] KVM: Handle cpuid in the kernel instead of punting to userspace

2007-03-11 Thread Avi Kivity
KVM used to handle cpuid by letting userspace decide what values to
return to the guest.  We now handle cpuid completely in the kernel.  We
still let userspace decide which values the guest will see by having
userspace set up the value table beforehand (this is necessary to allow
management software to set the cpu features to the least common denominator,
so that live migration can work).

The motivation for the change is that kvm kernel code can be impacted by
cpuid features, for example the x86 emulator.
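
As a rough userspace illustration of the table lookup this patch
introduces: an exact match on the function number wins, otherwise the
highest entry from the same range (basic or extended) is used as a
fallback. The names below are invented for the sketch, and using bit 31
(0x80000000) to tell basic from extended leaves is an assumption based
on how CPUID leaf numbers are laid out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a userspace-provided cpuid table entry. */
struct cpuid_entry {
	uint32_t function;
	uint32_t eax, ebx, ecx, edx;
};

/*
 * Find the entry to report for "function": an exact match if present,
 * else the highest-numbered entry in the same range, else NULL.
 */
static inline const struct cpuid_entry *
cpuid_lookup(const struct cpuid_entry *tbl, size_t n, uint32_t function)
{
	const struct cpuid_entry *best = NULL;
	size_t i;

	assert(tbl != NULL || n == 0);
	for (i = 0; i < n; ++i) {
		const struct cpuid_entry *e = &tbl[i];

		if (e->function == function)
			return e;       /* exact match wins */
		/* Both basic or both extended? Keep the highest leaf seen. */
		if (((e->function ^ function) & 0x80000000u) == 0)
			if (!best || e->function > best->function)
				best = e;
	}
	return best;
}
```

When nothing matches at all, the caller would leave the four registers
at zero, as the patch does by clearing them before the search.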

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/kvm.h  |5 +++
 drivers/kvm/kvm_main.c |   69 
 drivers/kvm/svm.c  |4 +-
 drivers/kvm/vmx.c  |4 +-
 include/linux/kvm.h|   18 -
 5 files changed, 95 insertions(+), 5 deletions(-)

diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index 59cbc5b..be3a0e7 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -55,6 +55,7 @@
 #define KVM_NUM_MMU_PAGES 256
 #define KVM_MIN_FREE_MMU_PAGES 5
 #define KVM_REFILL_PAGES 25
+#define KVM_MAX_CPUID_ENTRIES 40
 
 #define FX_IMAGE_SIZE 512
 #define FX_IMAGE_ALIGN 16
@@ -286,6 +287,9 @@ struct kvm_vcpu {
u32 ar;
} tr, es, ds, fs, gs;
} rmode;
+
+   int cpuid_nent;
+   struct kvm_cpuid_entry cpuid_entries[KVM_MAX_CPUID_ENTRIES];
 };
 
 struct kvm_memory_slot {
@@ -446,6 +450,7 @@ void realmode_set_cr(struct kvm_vcpu *vcpu, int cr, 
unsigned long value,
 
 struct x86_emulate_ctxt;
 
+void kvm_emulate_cpuid(struct kvm_vcpu *vcpu);
 int emulate_invlpg(struct kvm_vcpu *vcpu, gva_t address);
 int emulate_clts(struct kvm_vcpu *vcpu);
 int emulator_get_dr(struct x86_emulate_ctxt* ctxt, int dr,
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 8a4984d..347467e 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -1504,6 +1504,43 @@ void save_msrs(struct vmx_msr_entry *e, int n)
 }
 EXPORT_SYMBOL_GPL(save_msrs);
 
+void kvm_emulate_cpuid(struct kvm_vcpu *vcpu)
+{
+   int i;
+   u32 function;
+   struct kvm_cpuid_entry *e, *best;
+
+	kvm_arch_ops->cache_regs(vcpu);
+	function = vcpu->regs[VCPU_REGS_RAX];
+	vcpu->regs[VCPU_REGS_RAX] = 0;
+	vcpu->regs[VCPU_REGS_RBX] = 0;
+	vcpu->regs[VCPU_REGS_RCX] = 0;
+	vcpu->regs[VCPU_REGS_RDX] = 0;
+	best = NULL;
+	for (i = 0; i < vcpu->cpuid_nent; ++i) {
+		e = &vcpu->cpuid_entries[i];
+		if (e->function == function) {
+			best = e;
+			break;
+		}
+		/*
+		 * Both basic or both extended?
+		 */
+		if (((e->function ^ function) & 0x8000) == 0)
+			if (!best || e->function > best->function)
+				best = e;
+	}
+	if (best) {
+		vcpu->regs[VCPU_REGS_RAX] = best->eax;
+		vcpu->regs[VCPU_REGS_RBX] = best->ebx;
+		vcpu->regs[VCPU_REGS_RCX] = best->ecx;
+		vcpu->regs[VCPU_REGS_RDX] = best->edx;
+	}
+	kvm_arch_ops->decache_regs(vcpu);
+	kvm_arch_ops->skip_emulated_instruction(vcpu);
+}
+EXPORT_SYMBOL_GPL(kvm_emulate_cpuid);
+
 static void complete_pio(struct kvm_vcpu *vcpu)
 {
struct kvm_io *io = vcpu-run-io;
@@ -2075,6 +2112,26 @@ out:
return r;
 }
 
+static int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu,
+   struct kvm_cpuid *cpuid,
+   struct kvm_cpuid_entry __user *entries)
+{
+   int r;
+
+   r = -E2BIG;
+	if (cpuid->nent > KVM_MAX_CPUID_ENTRIES)
+		goto out;
+	r = -EFAULT;
+	if (copy_from_user(&vcpu->cpuid_entries, entries,
+			   cpuid->nent * sizeof(struct kvm_cpuid_entry)))
+		goto out;
+	vcpu->cpuid_nent = cpuid->nent;
+   return 0;
+
+out:
+   return r;
+}
+
 static long kvm_vcpu_ioctl(struct file *filp,
   unsigned int ioctl, unsigned long arg)
 {
@@ -2181,6 +2238,18 @@ static long kvm_vcpu_ioctl(struct file *filp,
case KVM_SET_MSRS:
r = msr_io(vcpu, argp, do_set_msr, 0);
break;
+   case KVM_SET_CPUID: {
+   struct kvm_cpuid __user *cpuid_arg = argp;
+   struct kvm_cpuid cpuid;
+
+   r = -EFAULT;
+		if (copy_from_user(&cpuid, cpuid_arg, sizeof cpuid))
+			goto out;
+		r = kvm_vcpu_ioctl_set_cpuid(vcpu, &cpuid, cpuid_arg->entries);
+   if (r)
+   goto out;
+   break;
+   }
default:
;
}
diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c
index c35b8c8..d4b2936 100644
--- a/drivers/kvm/svm.c
+++ b/drivers/kvm/svm.c
@@ -1101,8 +1101,8 @@ static int task_switch_interception(struct kvm_vcpu 
*vcpu, struct kvm_run *kvm_r
 static int 

[PATCH 01/15] KVM: Use a shared page for kernel/user communication when running a vcpu

2007-03-11 Thread Avi Kivity
Instead of passing a 'struct kvm_run' back and forth between the kernel and
userspace, allocate a page and allow the user to mmap() it.  This reduces
needless copying and makes the interface expandable by providing lots of
free space.

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/kvm.h  |1 +
 drivers/kvm/kvm_main.c |   54 +++
 include/linux/kvm.h|6 ++--
 3 files changed, 44 insertions(+), 17 deletions(-)
 mode change 100755 => 100644 drivers/kvm/kvm_main.c

diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index 0d122bf..901b8d9 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -228,6 +228,7 @@ struct kvm_vcpu {
struct mutex mutex;
int   cpu;
int   launched;
+   struct kvm_run *run;
int interrupt_window_open;
unsigned long irq_summary; /* bit vector: 1 per word in irq_pending */
 #define NR_IRQ_WORDS KVM_IRQ_BITMAP_SIZE(unsigned long)
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
old mode 100755
new mode 100644
index 946ed86..42be8a8
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -355,6 +355,8 @@ static void kvm_free_vcpu(struct kvm_vcpu *vcpu)
kvm_mmu_destroy(vcpu);
vcpu_put(vcpu);
 	kvm_arch_ops->vcpu_free(vcpu);
+	free_page((unsigned long)vcpu->run);
+	vcpu->run = NULL;
 }
 
 static void kvm_free_vcpus(struct kvm *kvm)
@@ -1887,6 +1889,33 @@ static int kvm_vcpu_ioctl_debug_guest(struct kvm_vcpu 
*vcpu,
return r;
 }
 
+static struct page *kvm_vcpu_nopage(struct vm_area_struct *vma,
+   unsigned long address,
+   int *type)
+{
+   struct kvm_vcpu *vcpu = vma-vm_file-private_data;
+   unsigned long pgoff;
+   struct page *page;
+
+   *type = VM_FAULT_MINOR;
+   pgoff = ((address - vma-vm_start)  PAGE_SHIFT) + vma-vm_pgoff;
+   if (pgoff != 0)
+   return NOPAGE_SIGBUS;
+   page = virt_to_page(vcpu-run);
+   get_page(page);
+   return page;
+}
+
+static struct vm_operations_struct kvm_vcpu_vm_ops = {
+   .nopage = kvm_vcpu_nopage,
+};
+
+static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
+{
+   vma->vm_ops = &kvm_vcpu_vm_ops;
+   return 0;
+}
+
 static int kvm_vcpu_release(struct inode *inode, struct file *filp)
 {
struct kvm_vcpu *vcpu = filp->private_data;
@@ -1899,6 +1928,7 @@ static struct file_operations kvm_vcpu_fops = {
.release= kvm_vcpu_release,
.unlocked_ioctl = kvm_vcpu_ioctl,
.compat_ioctl   = kvm_vcpu_ioctl,
+   .mmap   = kvm_vcpu_mmap,
 };
 
 /*
@@ -1947,6 +1977,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, int n)
 {
int r;
struct kvm_vcpu *vcpu;
+   struct page *page;
 
r = -EINVAL;
if (!valid_vcpu(n))
@@ -1961,6 +1992,12 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, int n)
return -EEXIST;
}
 
+   page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+   r = -ENOMEM;
+   if (!page)
+   goto out_unlock;
+   vcpu->run = page_address(page);
+
vcpu->host_fx_image = (char*)ALIGN((hva_t)vcpu->fx_buf,
   FX_IMAGE_ALIGN);
vcpu->guest_fx_image = vcpu->host_fx_image + FX_IMAGE_SIZE;
@@ -1990,6 +2027,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, int n)
 
 out_free_vcpus:
kvm_free_vcpu(vcpu);
+out_unlock:
mutex_unlock(&vcpu->mutex);
 out:
return r;
@@ -2003,21 +2041,9 @@ static long kvm_vcpu_ioctl(struct file *filp,
int r = -EINVAL;
 
switch (ioctl) {
-   case KVM_RUN: {
-   struct kvm_run kvm_run;
-
-   r = -EFAULT;
-   if (copy_from_user(&kvm_run, argp, sizeof kvm_run))
-   goto out;
-   r = kvm_vcpu_ioctl_run(vcpu, &kvm_run);
-   if (r < 0 && r != -EINTR)
-   goto out;
-   if (copy_to_user(argp, &kvm_run, sizeof kvm_run)) {
-   r = -EFAULT;
-   goto out;
-   }
+   case KVM_RUN:
+   r = kvm_vcpu_ioctl_run(vcpu, vcpu->run);
break;
-   }
case KVM_GET_REGS: {
struct kvm_regs kvm_regs;
 
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 275354f..d88e750 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -11,7 +11,7 @@
 #include <asm/types.h>
 #include <linux/ioctl.h>
 
-#define KVM_API_VERSION 4
+#define KVM_API_VERSION 5
 
 /*
  * Architectural interrupt line count, and the size of the bitmap needed
@@ -49,7 +49,7 @@ enum kvm_exit_reason {
KVM_EXIT_SHUTDOWN = 8,
 };
 
-/* for KVM_RUN */
+/* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
 struct kvm_run {
/* in */
__u32 emulated;  /* skip current instruction */
@@ -233,7 +233,7 @@ struct kvm_dirty_log 

[PATCH 12/15] KVM: Initialize the apic_base msr on svm too

2007-03-11 Thread Avi Kivity
Older userspace didn't care, but newer userspace (with the cpuid changes)
does.

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/svm.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c
index 0311665..2396ada 100644
--- a/drivers/kvm/svm.c
+++ b/drivers/kvm/svm.c
@@ -582,6 +582,9 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
init_vmcb(vcpu->svm->vmcb);
 
fx_init(vcpu);
+   vcpu->apic_base = 0xfee00000 |
+   /*for vcpu 0*/ MSR_IA32_APICBASE_BSP |
+   MSR_IA32_APICBASE_ENABLE;
 
return 0;
 
-- 
1.5.0.2



[PATCH 07/15] KVM: Renumber ioctls

2007-03-11 Thread Avi Kivity
The recent changes have left the ioctl numbers in complete disarray.

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 include/linux/kvm.h |   34 +-
 1 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index d89189a..93472da 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -229,34 +229,34 @@ struct kvm_cpuid {
 /*
  * ioctls for /dev/kvm fds:
  */
-#define KVM_GET_API_VERSION   _IO(KVMIO, 1)
-#define KVM_CREATE_VM _IO(KVMIO, 2) /* returns a VM fd */
-#define KVM_GET_MSR_INDEX_LIST _IOWR(KVMIO, 15, struct kvm_msr_list)
+#define KVM_GET_API_VERSION   _IO(KVMIO,   0x00)
+#define KVM_CREATE_VM _IO(KVMIO,   0x01) /* returns a VM fd */
+#define KVM_GET_MSR_INDEX_LIST _IOWR(KVMIO, 0x02, struct kvm_msr_list)
 
 /*
  * ioctls for VM fds
  */
-#define KVM_SET_MEMORY_REGION _IOW(KVMIO, 10, struct kvm_memory_region)
+#define KVM_SET_MEMORY_REGION _IOW(KVMIO, 0x40, struct kvm_memory_region)
 /*
  * KVM_CREATE_VCPU receives as a parameter the vcpu slot, and returns
  * a vcpu fd.
  */
-#define KVM_CREATE_VCPU   _IO(KVMIO, 11)
-#define KVM_GET_DIRTY_LOG _IOW(KVMIO, 12, struct kvm_dirty_log)
+#define KVM_CREATE_VCPU   _IO(KVMIO,  0x41)
+#define KVM_GET_DIRTY_LOG _IOW(KVMIO, 0x42, struct kvm_dirty_log)
 
 /*
  * ioctls for vcpu fds
  */
-#define KVM_RUN   _IO(KVMIO, 16)
-#define KVM_GET_REGS  _IOR(KVMIO, 3, struct kvm_regs)
-#define KVM_SET_REGS  _IOW(KVMIO, 4, struct kvm_regs)
-#define KVM_GET_SREGS _IOR(KVMIO, 5, struct kvm_sregs)
-#define KVM_SET_SREGS _IOW(KVMIO, 6, struct kvm_sregs)
-#define KVM_TRANSLATE _IOWR(KVMIO, 7, struct kvm_translation)
-#define KVM_INTERRUPT _IOW(KVMIO, 8, struct kvm_interrupt)
-#define KVM_DEBUG_GUEST   _IOW(KVMIO, 9, struct kvm_debug_guest)
-#define KVM_GET_MSRS  _IOWR(KVMIO, 13, struct kvm_msrs)
-#define KVM_SET_MSRS  _IOW(KVMIO, 14, struct kvm_msrs)
-#define KVM_SET_CPUID _IOW(KVMIO, 17, struct kvm_cpuid)
+#define KVM_RUN   _IO(KVMIO,   0x80)
+#define KVM_GET_REGS  _IOR(KVMIO,  0x81, struct kvm_regs)
+#define KVM_SET_REGS  _IOW(KVMIO,  0x82, struct kvm_regs)
+#define KVM_GET_SREGS _IOR(KVMIO,  0x83, struct kvm_sregs)
+#define KVM_SET_SREGS _IOW(KVMIO,  0x84, struct kvm_sregs)
+#define KVM_TRANSLATE _IOWR(KVMIO, 0x85, struct kvm_translation)
+#define KVM_INTERRUPT _IOW(KVMIO,  0x86, struct kvm_interrupt)
+#define KVM_DEBUG_GUEST   _IOW(KVMIO,  0x87, struct kvm_debug_guest)
+#define KVM_GET_MSRS  _IOWR(KVMIO, 0x88, struct kvm_msrs)
+#define KVM_SET_MSRS  _IOW(KVMIO,  0x89, struct kvm_msrs)
+#define KVM_SET_CPUID _IOW(KVMIO,  0x8a, struct kvm_cpuid)
 
 #endif
-- 
1.5.0.2
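
A note on the renumbering above: the new sequence numbers appear to be grouped by the kind of fd they are issued on (0x00 and up for /dev/kvm, 0x40 and up for VM fds, 0x80 and up for vcpu fds). The patch does not state those boundaries, so treat them as an inference from the table. A minimal sketch, with a hypothetical helper kvm_ioctl_fd_kind() and only three of the definitions copied over:

```c
#include <linux/ioctl.h>

#define KVMIO 0xAE

/* Three request numbers copied from the patch, one per fd type. */
#define KVM_GET_API_VERSION _IO(KVMIO, 0x00)  /* /dev/kvm fd */
#define KVM_CREATE_VCPU     _IO(KVMIO, 0x41)  /* VM fd */
#define KVM_RUN             _IO(KVMIO, 0x80)  /* vcpu fd */

enum kvm_fd_kind { KVM_FD_DEV, KVM_FD_VM, KVM_FD_VCPU };

/*
 * Hypothetical helper: guess which fd type a request belongs to from its
 * sequence number.  The 0x40 and 0x80 boundaries are inferred from the
 * renumbered table; the patch itself does not promise them.
 */
static enum kvm_fd_kind kvm_ioctl_fd_kind(unsigned int request)
{
	unsigned int nr = _IOC_NR(request);

	if (nr < 0x40)
		return KVM_FD_DEV;
	if (nr < 0x80)
		return KVM_FD_VM;
	return KVM_FD_VCPU;
}
```

Leaving gaps between the groups means a new /dev/kvm or VM ioctl can be added later without disturbing the vcpu numbers.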



[PATCH 06/15] KVM: Remove minor wart from KVM_CREATE_VCPU ioctl

2007-03-11 Thread Avi Kivity
That ioctl does not transfer any data, so it should be an _IO rather than an
_IOW.

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 include/linux/kvm.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index c6dd4a7..d89189a 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -241,7 +241,7 @@ struct kvm_cpuid {
  * KVM_CREATE_VCPU receives as a parameter the vcpu slot, and returns
  * a vcpu fd.
  */
-#define KVM_CREATE_VCPU   _IOW(KVMIO, 11, int)
+#define KVM_CREATE_VCPU   _IO(KVMIO, 11)
 #define KVM_GET_DIRTY_LOG _IOW(KVMIO, 12, struct kvm_dirty_log)
 
 /*
-- 
1.5.0.2



[PATCH 08/15] KVM: Add method to check for backwards-compatible API extensions

2007-03-11 Thread Avi Kivity
Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/kvm_main.c |6 ++
 include/linux/kvm.h|5 +
 2 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 747966e..376538c 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -2416,6 +2416,12 @@ static long kvm_dev_ioctl(struct file *filp,
r = 0;
break;
}
+   case KVM_CHECK_EXTENSION:
+   /*
+* No extensions defined at present.
+*/
+   r = 0;
+   break;
default:
;
}
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 93472da..c93cf53 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -232,6 +232,11 @@ struct kvm_cpuid {
 #define KVM_GET_API_VERSION   _IO(KVMIO,   0x00)
 #define KVM_CREATE_VM _IO(KVMIO,   0x01) /* returns a VM fd */
 #define KVM_GET_MSR_INDEX_LIST _IOWR(KVMIO, 0x02, struct kvm_msr_list)
+/*
+ * Check if a kvm extension is available.  Argument is extension number,
+ * return is 1 (yes) or 0 (no, sorry).
+ */
+#define KVM_CHECK_EXTENSION   _IO(KVMIO,   0x03)
 
 /*
  * ioctls for VM fds
-- 
1.5.0.2
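
For reference, a userspace probe for the new ioctl might look roughly like the sketch below. kvm_has_extension() is a hypothetical name; the request number is the one defined in the patch, and treating an ioctl failure as "not available" is a choice made for this example.

```c
#include <sys/ioctl.h>

#define KVMIO 0xAE
#define KVM_CHECK_EXTENSION _IO(KVMIO, 0x03)  /* number taken from the patch */

/*
 * Returns 1 if the kernel reports the extension, 0 otherwise.  A failed
 * ioctl (a bad fd, or a kernel that rejects the request) is treated the
 * same as "extension not available".
 */
static int kvm_has_extension(int kvm_fd, unsigned long extension)
{
	int r = ioctl(kvm_fd, KVM_CHECK_EXTENSION, extension);

	return r > 0;
}
```

With the kernel side as posted, every extension number yields 0, so callers written this way fall back to the baseline API.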



[PATCH 14/15] KVM: Allow kernel to select size of mmap() buffer

2007-03-11 Thread Avi Kivity
This allows us to store offsets in the kernel/user kvm_run area, and be
sure that userspace has them mapped.  As offsets can be outside the
kvm_run struct, userspace has no way of knowing how much to mmap.

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/kvm_main.c |8 +++-
 include/linux/kvm.h|4 
 2 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index ed95c9b..b81f007 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -2436,7 +2436,7 @@ static long kvm_dev_ioctl(struct file *filp,
  unsigned int ioctl, unsigned long arg)
 {
void __user *argp = (void __user *)arg;
-   int r = -EINVAL;
+   long r = -EINVAL;
 
switch (ioctl) {
case KVM_GET_API_VERSION:
@@ -2478,6 +2478,12 @@ static long kvm_dev_ioctl(struct file *filp,
 */
r = 0;
break;
+   case KVM_GET_VCPU_MMAP_SIZE:
+   r = -EINVAL;
+   if (arg)
+   goto out;
+   r = PAGE_SIZE;
+   break;
default:
;
}
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index c0d10cd..dad9081 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -253,6 +253,10 @@ struct kvm_signal_mask {
  * return is 1 (yes) or 0 (no, sorry).
  */
 #define KVM_CHECK_EXTENSION   _IO(KVMIO,   0x03)
+/*
+ * Get size for mmap(vcpu_fd)
+ */
+#define KVM_GET_VCPU_MMAP_SIZE _IO(KVMIO,   0x04) /* in bytes */
 
 /*
  * ioctls for VM fds
-- 
1.5.0.2
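
Taken together with patch 01 (mmap() of the vcpu fd at offset 0), userspace would presumably ask /dev/kvm for the size first and then map that many bytes. A minimal sketch under that assumption; kvm_map_vcpu_run() is a hypothetical helper, not something from the series:

```c
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

#define KVMIO 0xAE
#define KVM_GET_VCPU_MMAP_SIZE _IO(KVMIO, 0x04)  /* number taken from the patch */

/*
 * Map the shared kvm_run area of a vcpu.  kvm_fd is the /dev/kvm descriptor,
 * vcpu_fd the descriptor returned by KVM_CREATE_VCPU.  Returns NULL on
 * failure; on success stores the mapping length in *len.
 */
static void *kvm_map_vcpu_run(int kvm_fd, int vcpu_fd, size_t *len)
{
	/* The argument must be zero, as the kernel side rejects anything else. */
	long size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
	void *run;

	if (size <= 0)
		return NULL;
	run = mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE, MAP_SHARED,
		   vcpu_fd, 0);
	if (run == MAP_FAILED)
		return NULL;
	*len = (size_t)size;
	return run;
}
```

Asking the kernel for the size, rather than using sizeof(struct kvm_run), is what lets later kernels grow the area without breaking older binaries.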



[PATCH 13/15] KVM: Add guest mode signal mask

2007-03-11 Thread Avi Kivity
Allow a special signal mask to be used while executing in guest mode.  This
allows signals to be used to interrupt a vcpu without requiring signal
delivery to a userspace handler, which is quite expensive.  Userspace still
receives -EINTR and can get the signal via sigwait().

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/kvm.h  |3 +++
 drivers/kvm/kvm_main.c |   41 +
 include/linux/kvm.h|7 +++
 3 files changed, 51 insertions(+), 0 deletions(-)

diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index be3a0e7..1c4a581 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -277,6 +277,9 @@ struct kvm_vcpu {
gpa_t mmio_phys_addr;
int pio_pending;
 
+   int sigset_active;
+   sigset_t sigset;
+
struct {
int active;
u8 save_iopl;
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 0e28f58..ed95c9b 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -1591,9 +1591,13 @@ static void complete_pio(struct kvm_vcpu *vcpu)
 static int kvm_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 {
int r;
+   sigset_t sigsaved;
 
vcpu_load(vcpu);
 
+   if (vcpu->sigset_active)
+   sigprocmask(SIG_SETMASK, &vcpu->sigset, &sigsaved);
+
/* re-sync apic's tpr */
vcpu->cr8 = kvm_run->cr8;
 
@@ -1616,6 +1620,9 @@ static int kvm_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 
r = kvm_arch_ops->run(vcpu, kvm_run);
 
+   if (vcpu->sigset_active)
+   sigprocmask(SIG_SETMASK, &sigsaved, NULL);
+
vcpu_put(vcpu);
return r;
 }
@@ -2142,6 +2149,17 @@ out:
return r;
 }
 
+static int kvm_vcpu_ioctl_set_sigmask(struct kvm_vcpu *vcpu, sigset_t *sigset)
+{
+   if (sigset) {
+   sigdelsetmask(sigset, sigmask(SIGKILL)|sigmask(SIGSTOP));
+   vcpu->sigset_active = 1;
+   vcpu->sigset = *sigset;
+   } else
+   vcpu->sigset_active = 0;
+   return 0;
+}
+
 static long kvm_vcpu_ioctl(struct file *filp,
   unsigned int ioctl, unsigned long arg)
 {
@@ -2260,6 +2278,29 @@ static long kvm_vcpu_ioctl(struct file *filp,
goto out;
break;
}
+   case KVM_SET_SIGNAL_MASK: {
+   struct kvm_signal_mask __user *sigmask_arg = argp;
+   struct kvm_signal_mask kvm_sigmask;
+   sigset_t sigset, *p;
+
+   p = NULL;
+   if (argp) {
+   r = -EFAULT;
+   if (copy_from_user(&kvm_sigmask, argp,
+  sizeof kvm_sigmask))
+   goto out;
+   r = -EINVAL;
+   if (kvm_sigmask.len != sizeof sigset)
+   goto out;
+   r = -EFAULT;
+   if (copy_from_user(&sigset, sigmask_arg->sigset,
+  sizeof sigset))
+   goto out;
+   p = &sigset;
+   }
+   r = kvm_vcpu_ioctl_set_sigmask(vcpu, &sigset);
+   break;
+   }
default:
;
}
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index b3af92e..c0d10cd 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -234,6 +234,12 @@ struct kvm_cpuid {
struct kvm_cpuid_entry entries[0];
 };
 
+/* for KVM_SET_SIGNAL_MASK */
+struct kvm_signal_mask {
+   __u32 len;
+   __u8  sigset[0];
+};
+
 #define KVMIO 0xAE
 
 /*
@@ -273,5 +279,6 @@ struct kvm_cpuid {
 #define KVM_GET_MSRS  _IOWR(KVMIO, 0x88, struct kvm_msrs)
 #define KVM_SET_MSRS  _IOW(KVMIO,  0x89, struct kvm_msrs)
 #define KVM_SET_CPUID _IOW(KVMIO,  0x8a, struct kvm_cpuid)
+#define KVM_SET_SIGNAL_MASK   _IOW(KVMIO,  0x8b, struct kvm_signal_mask)
 
 #endif
-- 
1.5.0.2



[PATCH 05/15] KVM: Remove the 'emulated' field from the userspace interface

2007-03-11 Thread Avi Kivity
We no longer emulate single instructions in userspace.  Instead, we service
mmio or pio requests.

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/kvm_main.c |5 -
 include/linux/kvm.h|3 +--
 2 files changed, 1 insertions(+), 7 deletions(-)

diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 347467e..747966e 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -1588,11 +1588,6 @@ static int kvm_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
/* re-sync apic's tpr */
vcpu->cr8 = kvm_run->cr8;
 
-   if (kvm_run->emulated) {
-   kvm_arch_ops->skip_emulated_instruction(vcpu);
-   kvm_run->emulated = 0;
-   }
-
if (kvm_run->io_completed) {
if (vcpu->pio_pending)
complete_pio(vcpu);
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 15e23bc..c6dd4a7 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -51,10 +51,9 @@ enum kvm_exit_reason {
 /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
 struct kvm_run {
/* in */
-   __u32 emulated;  /* skip current instruction */
__u32 io_completed; /* mmio/pio request completed */
__u8 request_interrupt_window;
-   __u8 padding1[7];
+   __u8 padding1[3];
 
/* out */
__u32 exit_type;
-- 
1.5.0.2



[PATCH 11/15] KVM: Add a special exit reason when exiting due to an interrupt

2007-03-11 Thread Avi Kivity
This is redundant, as we also return -EINTR from the ioctl, but it
allows us to examine the exit_reason field on resume without seeing
old data.

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/svm.c   |2 ++
 drivers/kvm/vmx.c   |2 ++
 include/linux/kvm.h |3 ++-
 3 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c
index b09928f..0311665 100644
--- a/drivers/kvm/svm.c
+++ b/drivers/kvm/svm.c
@@ -1619,12 +1619,14 @@ again:
if (signal_pending(current)) {
++kvm_stat.signal_exits;
post_kvm_run_save(vcpu, kvm_run);
+   kvm_run->exit_reason = KVM_EXIT_INTR;
return -EINTR;
}
 
if (dm_request_for_irq_injection(vcpu, kvm_run)) {
++kvm_stat.request_irq_exits;
post_kvm_run_save(vcpu, kvm_run);
+   kvm_run->exit_reason = KVM_EXIT_INTR;
return -EINTR;
}
kvm_resched(vcpu);
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index ba7a98b..0d1c8cf 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -1936,12 +1936,14 @@ again:
if (signal_pending(current)) {
++kvm_stat.signal_exits;
post_kvm_run_save(vcpu, kvm_run);
+   kvm_run->exit_reason = KVM_EXIT_INTR;
return -EINTR;
}
 
if (dm_request_for_irq_injection(vcpu, kvm_run)) {
++kvm_stat.request_irq_exits;
post_kvm_run_save(vcpu, kvm_run);
+   kvm_run->exit_reason = KVM_EXIT_INTR;
return -EINTR;
}
 
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 57f47ef..b3af92e 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -11,7 +11,7 @@
 #include <asm/types.h>
 #include <linux/ioctl.h>
 
-#define KVM_API_VERSION 8
+#define KVM_API_VERSION 9
 
 /*
  * Architectural interrupt line count, and the size of the bitmap needed
@@ -45,6 +45,7 @@ enum kvm_exit_reason {
KVM_EXIT_IRQ_WINDOW_OPEN  = 7,
KVM_EXIT_SHUTDOWN = 8,
KVM_EXIT_FAIL_ENTRY   = 9,
+   KVM_EXIT_INTR = 10,
 };
 
 /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
-- 
1.5.0.2



[PATCH 09/15] KVM: Allow userspace to process hypercalls which have no kernel handler

2007-03-11 Thread Avi Kivity
This is useful for paravirtualized graphics devices, for example.

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/kvm_main.c |   18 +-
 include/linux/kvm.h|   10 +-
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 376538c..2220e49 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -1203,7 +1203,16 @@ int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run)
}
switch (nr) {
default:
-   ;
+   run->hypercall.args[0] = a0;
+   run->hypercall.args[1] = a1;
+   run->hypercall.args[2] = a2;
+   run->hypercall.args[3] = a3;
+   run->hypercall.args[4] = a4;
+   run->hypercall.args[5] = a5;
+   run->hypercall.ret = ret;
+   run->hypercall.longmode = is_long_mode(vcpu);
+   kvm_arch_ops->decache_regs(vcpu);
+   return 0;
}
vcpu->regs[VCPU_REGS_RAX] = ret;
kvm_arch_ops->decache_regs(vcpu);
@@ -1599,6 +1608,13 @@ static int kvm_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 
vcpu->mmio_needed = 0;
 
+   if (kvm_run->exit_type == KVM_EXIT_TYPE_VM_EXIT
+   && kvm_run->exit_type == KVM_EXIT_HYPERCALL) {
+   kvm_arch_ops->cache_regs(vcpu);
+   vcpu->regs[VCPU_REGS_RAX] = kvm_run->hypercall.ret;
+   kvm_arch_ops->decache_regs(vcpu);
+   }
+
r = kvm_arch_ops->run(vcpu, kvm_run);
 
vcpu_put(vcpu);
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index c93cf53..9151ebf 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -11,7 +11,7 @@
 #include <asm/types.h>
 #include <linux/ioctl.h>
 
-#define KVM_API_VERSION 6
+#define KVM_API_VERSION 7
 
 /*
  * Architectural interrupt line count, and the size of the bitmap needed
@@ -41,6 +41,7 @@ enum kvm_exit_reason {
KVM_EXIT_UNKNOWN  = 0,
KVM_EXIT_EXCEPTION= 1,
KVM_EXIT_IO   = 2,
+   KVM_EXIT_HYPERCALL= 3,
KVM_EXIT_DEBUG= 4,
KVM_EXIT_HLT  = 5,
KVM_EXIT_MMIO = 6,
@@ -103,6 +104,13 @@ struct kvm_run {
__u32 len;
__u8  is_write;
} mmio;
+   /* KVM_EXIT_HYPERCALL */
+   struct {
+   __u64 args[6];
+   __u64 ret;
+   __u32 longmode;
+   __u32 pad;
+   } hypercall;
};
 };
 
-- 
1.5.0.2
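
On the userspace side, servicing such an exit comes down to reading args[] and storing a result in ret, which the kernel then copies into the guest's RAX on the next KVM_RUN. A minimal sketch follows; the struct mirrors the new hypercall member, while EXAMPLE_HC_ADD, handle_hypercall() and the "unsupported" return value are invented for illustration. Note that the struct does not carry the hypercall number, so the caller has to obtain nr some other way.

```c
#include <stdint.h>

/* Mirrors the hypercall member added to struct kvm_run by the patch. */
struct kvm_hypercall_exit {
	uint64_t args[6];
	uint64_t ret;
	uint32_t longmode;
	uint32_t pad;
};

/* Hypothetical hypercall number, chosen only for this example. */
#define EXAMPLE_HC_ADD 100

/*
 * Hypothetical userspace handler.  nr is the hypercall number, obtained
 * separately by the caller.  Whatever ends up in hc->ret is what the guest
 * sees in RAX after the next KVM_RUN.
 */
static void handle_hypercall(uint64_t nr, struct kvm_hypercall_exit *hc)
{
	switch (nr) {
	case EXAMPLE_HC_ADD:
		hc->ret = hc->args[0] + hc->args[1];
		break;
	default:
		/* Arbitrary "unsupported" marker for this sketch. */
		hc->ret = (uint64_t)-1;
		break;
	}
}
```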



[PATCH 02/15] KVM: Do not communicate to userspace through cpu registers during PIO

2007-03-11 Thread Avi Kivity
Currently, when passing a PIO emulation request to userspace, we
rely on userspace updating %rax (on 'in' instructions) and %rsi/%rdi/%rcx
(on string instructions).  This (a) requires two extra ioctls for getting
and setting the registers and (b) is unfriendly to non-x86 archs, when
they get kvm ports.

So fix by doing the register fixups in the kernel and passing to userspace
only an abstract description of the PIO to be done.

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/kvm.h  |1 +
 drivers/kvm/kvm_main.c |   48 +---
 drivers/kvm/svm.c  |1 +
 drivers/kvm/vmx.c  |1 +
 include/linux/kvm.h|6 +++---
 5 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h
index 901b8d9..59cbc5b 100644
--- a/drivers/kvm/kvm.h
+++ b/drivers/kvm/kvm.h
@@ -274,6 +274,7 @@ struct kvm_vcpu {
int mmio_size;
unsigned char mmio_data[8];
gpa_t mmio_phys_addr;
+   int pio_pending;
 
struct {
int active;
diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 42be8a8..8a4984d 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -1504,6 +1504,44 @@ void save_msrs(struct vmx_msr_entry *e, int n)
 }
 EXPORT_SYMBOL_GPL(save_msrs);
 
+static void complete_pio(struct kvm_vcpu *vcpu)
+{
+   struct kvm_io *io = &vcpu->run->io;
+   long delta;
+
+   kvm_arch_ops->cache_regs(vcpu);
+
+   if (!io->string) {
+   if (io->direction == KVM_EXIT_IO_IN)
+   memcpy(&vcpu->regs[VCPU_REGS_RAX], &io->value,
+  io->size);
+   } else {
+   delta = 1;
+   if (io->rep) {
+   delta *= io->count;
+   /*
+* The size of the register should really depend on
+* current address size.
+*/
+   vcpu->regs[VCPU_REGS_RCX] -= delta;
+   }
+   if (io->string_down)
+   delta = -delta;
+   delta *= io->size;
+   if (io->direction == KVM_EXIT_IO_IN)
+   vcpu->regs[VCPU_REGS_RDI] += delta;
+   else
+   vcpu->regs[VCPU_REGS_RSI] += delta;
+   }
+
+   vcpu->pio_pending = 0;
+   vcpu->run->io_completed = 0;
+
+   kvm_arch_ops->decache_regs(vcpu);
+
+   kvm_arch_ops->skip_emulated_instruction(vcpu);
+}
+
 static int kvm_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 {
int r;
@@ -1518,9 +1556,13 @@ static int kvm_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
kvm_run->emulated = 0;
}
 
-   if (kvm_run->mmio_completed) {
-   memcpy(vcpu->mmio_data, kvm_run->mmio.data, 8);
-   vcpu->mmio_read_completed = 1;
+   if (kvm_run->io_completed) {
+   if (vcpu->pio_pending)
+   complete_pio(vcpu);
+   else {
+   memcpy(vcpu->mmio_data, kvm_run->mmio.data, 8);
+   vcpu->mmio_read_completed = 1;
+   }
}
 
vcpu->mmio_needed = 0;
diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c
index 6787f11..b176f5a 100644
--- a/drivers/kvm/svm.c
+++ b/drivers/kvm/svm.c
@@ -1056,6 +1056,7 @@ static int io_interception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
}
} else
kvm_run->io.value = vcpu->svm->vmcb->save.rax;
+   vcpu->pio_pending = 1;
return 0;
 }
 
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index 910535d..7fd572a 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -1465,6 +1465,7 @@ static int handle_io(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
kvm_run->io.address = vmcs_readl(GUEST_LINEAR_ADDRESS);
} else
kvm_run->io.value = vcpu->regs[VCPU_REGS_RAX]; /* rax */
+   vcpu->pio_pending = 1;
return 0;
 }
 
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index d88e750..19aeb33 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -11,7 +11,7 @@
 #include <asm/types.h>
 #include <linux/ioctl.h>
 
-#define KVM_API_VERSION 5
+#define KVM_API_VERSION 6
 
 /*
  * Architectural interrupt line count, and the size of the bitmap needed
@@ -53,7 +53,7 @@ enum kvm_exit_reason {
 struct kvm_run {
/* in */
__u32 emulated;  /* skip current instruction */
-   __u32 mmio_completed; /* mmio request completed */
+   __u32 io_completed; /* mmio/pio request completed */
__u8 request_interrupt_window;
__u8 padding1[7];
 
@@ -80,7 +80,7 @@ struct kvm_run {
__u32 error_code;
} ex;
/* KVM_EXIT_IO */
-   struct {
+   struct kvm_io {
 #define KVM_EXIT_IO_IN  0
 #define KVM_EXIT_IO_OUT 1
__u8 
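
The string-instruction branch of complete_pio() above is easier to follow as a standalone model. The sketch below reproduces only that arithmetic in plain C; struct pio_desc and struct pio_regs are reduced stand-ins with assumed field types (the real struct kvm_io is cut off in this archive), and pio_string_fixup() is a hypothetical name.

```c
#include <stdint.h>

#define KVM_EXIT_IO_IN  0
#define KVM_EXIT_IO_OUT 1

/* Only the fields the string branch reads; field types are assumptions. */
struct pio_desc {
	uint8_t  direction;   /* KVM_EXIT_IO_IN or KVM_EXIT_IO_OUT */
	uint8_t  size;        /* bytes per element */
	uint8_t  rep;         /* REP prefix present */
	uint8_t  string_down; /* direction flag set: addresses decrease */
	uint64_t count;       /* elements transferred */
};

struct pio_regs {
	uint64_t rcx, rsi, rdi;
};

/* Same arithmetic as the string branch of complete_pio() in the patch. */
static void pio_string_fixup(const struct pio_desc *io, struct pio_regs *regs)
{
	int64_t delta = 1;

	if (io->rep) {
		delta *= (int64_t)io->count;
		regs->rcx -= (uint64_t)delta;   /* REP consumes the count */
	}
	if (io->string_down)
		delta = -delta;
	delta *= io->size;                      /* elements to bytes */
	if (io->direction == KVM_EXIT_IO_IN)
		regs->rdi += (uint64_t)delta;   /* INS writes through RDI */
	else
		regs->rsi += (uint64_t)delta;   /* OUTS reads through RSI */
}
```

For example, a "rep outsw" of four words moving upward leaves RCX reduced by 4 and RSI advanced by 8 bytes, with RDI untouched.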

[PATCH 15/15] KVM: Future-proof argument-less ioctls

2007-03-11 Thread Avi Kivity
Some ioctls ignore their arguments.  By requiring them to be zero now,
we allow a nonzero value to have some special meaning in the future.

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/kvm_main.c |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index b81f007..bf8403e 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -2169,6 +2169,9 @@ static long kvm_vcpu_ioctl(struct file *filp,
 
switch (ioctl) {
case KVM_RUN:
+   r = -EINVAL;
+   if (arg)
+   goto out;
r = kvm_vcpu_ioctl_run(vcpu, vcpu->run);
break;
case KVM_GET_REGS: {
@@ -2440,9 +2443,15 @@ static long kvm_dev_ioctl(struct file *filp,
 
switch (ioctl) {
case KVM_GET_API_VERSION:
+   r = -EINVAL;
+   if (arg)
+   goto out;
r = KVM_API_VERSION;
break;
case KVM_CREATE_VM:
+   r = -EINVAL;
+   if (arg)
+   goto out;
r = kvm_dev_ioctl_create_vm();
break;
case KVM_GET_MSR_INDEX_LIST: {
-- 
1.5.0.2



[PATCH 10/15] KVM: Fold kvm_run::exit_type into kvm_run::exit_reason

2007-03-11 Thread Avi Kivity
Currently, userspace is told about the nature of the last exit from the
guest using two fields, exit_type and exit_reason, where exit_type has
just two enumerations (and no need for more).  So fold exit_type into
exit_reason, reducing the complexity of determining what really happened.

Signed-off-by: Avi Kivity [EMAIL PROTECTED]
---
 drivers/kvm/kvm_main.c |3 +--
 drivers/kvm/svm.c  |7 +++
 drivers/kvm/vmx.c  |7 +++
 include/linux/kvm.h|   15 ---
 4 files changed, 15 insertions(+), 17 deletions(-)

diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index 2220e49..0e28f58 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -1608,8 +1608,7 @@ static int kvm_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 
vcpu->mmio_needed = 0;
 
-   if (kvm_run->exit_type == KVM_EXIT_TYPE_VM_EXIT
-   && kvm_run->exit_type == KVM_EXIT_HYPERCALL) {
+   if (kvm_run->exit_reason == KVM_EXIT_HYPERCALL) {
kvm_arch_ops->cache_regs(vcpu);
vcpu->regs[VCPU_REGS_RAX] = kvm_run->hypercall.ret;
kvm_arch_ops->decache_regs(vcpu);
diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c
index d4b2936..b09928f 100644
--- a/drivers/kvm/svm.c
+++ b/drivers/kvm/svm.c
@@ -1298,8 +1298,6 @@ static int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 {
u32 exit_code = vcpu->svm->vmcb->control.exit_code;
 
-   kvm_run->exit_type = KVM_EXIT_TYPE_VM_EXIT;
-
if (is_external_interrupt(vcpu->svm->vmcb->control.exit_int_info) &&
exit_code != SVM_EXIT_EXCP_BASE + PF_VECTOR)
printk(KERN_ERR "%s: unexpected exit_ini_info 0x%x "
@@ -1609,8 +1607,9 @@ again:
vcpu->svm->next_rip = 0;
 
if (vcpu->svm->vmcb->control.exit_code == SVM_EXIT_ERR) {
-   kvm_run->exit_type = KVM_EXIT_TYPE_FAIL_ENTRY;
-   kvm_run->exit_reason = vcpu->svm->vmcb->control.exit_code;
+   kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY;
+   kvm_run->fail_entry.hardware_entry_failure_reason
+   = vcpu->svm->vmcb->control.exit_code;
post_kvm_run_save(vcpu, kvm_run);
return 0;
}
diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c
index e093892..ba7a98b 100644
--- a/drivers/kvm/vmx.c
+++ b/drivers/kvm/vmx.c
@@ -1901,10 +1901,10 @@ again:
 
asm ("mov %0, %%ds; mov %0, %%es" : : "r"(__USER_DS));
 
-   kvm_run->exit_type = 0;
if (fail) {
-   kvm_run->exit_type = KVM_EXIT_TYPE_FAIL_ENTRY;
-   kvm_run->exit_reason = vmcs_read32(VM_INSTRUCTION_ERROR);
+   kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY;
+   kvm_run->fail_entry.hardware_entry_failure_reason
+   = vmcs_read32(VM_INSTRUCTION_ERROR);
r = 0;
} else {
if (fs_gs_ldt_reload_needed) {
@@ -1930,7 +1930,6 @@ again:
profile_hit(KVM_PROFILING, (void *)vmcs_readl(GUEST_RIP));
 
vcpu->launched = 1;
-   kvm_run->exit_type = KVM_EXIT_TYPE_VM_EXIT;
r = kvm_handle_exit(kvm_run, vcpu);
if (r > 0) {
/* Give scheduler a change to reschedule. */
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 9151ebf..57f47ef 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -11,7 +11,7 @@
 #include <asm/types.h>
 #include <linux/ioctl.h>
 
-#define KVM_API_VERSION 7
+#define KVM_API_VERSION 8
 
 /*
  * Architectural interrupt line count, and the size of the bitmap needed
@@ -34,9 +34,6 @@ struct kvm_memory_region {
 #define KVM_MEM_LOG_DIRTY_PAGES  1UL
 
 
-#define KVM_EXIT_TYPE_FAIL_ENTRY 1
-#define KVM_EXIT_TYPE_VM_EXIT2
-
 enum kvm_exit_reason {
KVM_EXIT_UNKNOWN  = 0,
KVM_EXIT_EXCEPTION= 1,
@@ -47,6 +44,7 @@ enum kvm_exit_reason {
KVM_EXIT_MMIO = 6,
KVM_EXIT_IRQ_WINDOW_OPEN  = 7,
KVM_EXIT_SHUTDOWN = 8,
+   KVM_EXIT_FAIL_ENTRY   = 9,
 };
 
 /* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
@@ -57,12 +55,11 @@ struct kvm_run {
__u8 padding1[3];
 
/* out */
-   __u32 exit_type;
__u32 exit_reason;
__u32 instruction_length;
__u8 ready_for_interrupt_injection;
__u8 if_flag;
-   __u16 padding2;
+   __u8 padding2[6];
 
/* in (pre_kvm_run), out (post_kvm_run) */
__u64 cr8;
@@ -71,8 +68,12 @@ struct kvm_run {
union {
/* KVM_EXIT_UNKNOWN */
struct {
-   __u32 hardware_exit_reason;
+   __u64 hardware_exit_reason;
} hw;
+   /* KVM_EXIT_FAIL_ENTRY */
+   struct {
+   __u64 hardware_entry_failure_reason;
+   } fail_entry;
/* KVM_EXIT_EXCEPTION */
struct {
__u32 exception;
-- 
1.5.0.2
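
With exit_type gone, a userspace run loop can dispatch on exit_reason alone. The sketch below lists the enum values as they appear across patches 09 to 11 of this series; enum run_action and classify_exit() are a hypothetical policy made up for illustration, not how any particular userspace handles these exits.

```c
#include <stdint.h>

/* Values as they appear in the include/linux/kvm.h hunks of this series. */
enum kvm_exit_reason {
	KVM_EXIT_UNKNOWN         = 0,
	KVM_EXIT_EXCEPTION       = 1,
	KVM_EXIT_IO              = 2,
	KVM_EXIT_HYPERCALL       = 3,
	KVM_EXIT_DEBUG           = 4,
	KVM_EXIT_HLT             = 5,
	KVM_EXIT_MMIO            = 6,
	KVM_EXIT_IRQ_WINDOW_OPEN = 7,
	KVM_EXIT_SHUTDOWN        = 8,
	KVM_EXIT_FAIL_ENTRY      = 9,
	KVM_EXIT_INTR            = 10, /* added by patch 11 */
};

/* Hypothetical policy for a userspace run loop; not part of the patch. */
enum run_action {
	RUN_REENTER,          /* nothing to emulate, call KVM_RUN again */
	RUN_SERVICE_REQUEST,  /* emulate the request, then re-enter */
	RUN_STOP,             /* guest halted or shut down */
	RUN_REPORT_ERROR,     /* unexpected exit, inspect the union */
};

/* A single switch now covers what used to need exit_type plus exit_reason. */
static enum run_action classify_exit(uint32_t exit_reason)
{
	switch (exit_reason) {
	case KVM_EXIT_IO:
	case KVM_EXIT_MMIO:
	case KVM_EXIT_HYPERCALL:
		return RUN_SERVICE_REQUEST;
	case KVM_EXIT_INTR:
	case KVM_EXIT_IRQ_WINDOW_OPEN:
		return RUN_REENTER;
	case KVM_EXIT_HLT:
	case KVM_EXIT_SHUTDOWN:
		return RUN_STOP;
	default:
		return RUN_REPORT_ERROR;
	}
}
```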


[PATCH -mm] Fix race between proc_readdir and remove_proc_entry

2007-03-11 Thread Alexey Dobriyan
> -procfs-fix-race-between-proc_readdir-and-remove_proc_entry.patch
> +fix-race-between-proc_get_inode-and-remove_proc_entry.patch
>
>  Updated.  Looks sane.

Why have you dropped the first patch? Resending slightly fixed version
of it.


[PATCH -mm] Fix race between proc_readdir and remove_proc_entry

From: Darrick J. Wong [EMAIL PROTECTED]

Fix the following race:

proc_readdir                            remove_proc_entry
============                            =================

spin_lock(&proc_subdir_lock);
[choose PDE to start filldir from]
spin_unlock(&proc_subdir_lock);
                                        spin_lock(&proc_subdir_lock);
                                        [find PDE]
                                        [free PDE, refcount is 0]
                                        spin_unlock(&proc_subdir_lock);
/* boom */
if (filldir(dirent, de->name, ...

[de_put on error path --adobriyan]

Signed-off-by: Darrick J. Wong [EMAIL PROTECTED]
Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
---

 fs/proc/generic.c |   11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -478,14 +478,21 @@ int proc_readdir(struct file * filp,
}
 
do {
+   struct proc_dir_entry *next;
+
/* filldir passes info to user space */
+   de_get(de);
spin_unlock(&proc_subdir_lock);
if (filldir(dirent, de->name, de->namelen, filp->f_pos,
-   de->low_ino, de->mode >> 12) < 0)
+   de->low_ino, de->mode >> 12) < 0) {
+   de_put(de);
goto out;
+   }
spin_lock(&proc_subdir_lock);
filp->f_pos++;
-   de = de->next;
+   next = de->next;
+   de_put(de);
+   de = next;
} while (de);
spin_unlock(&proc_subdir_lock);
}
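
The pattern in the fix (take a reference before dropping the lock, read ->next only while still holding that reference, then put) can be shown in a small single-threaded userspace model. Everything below is a simplified stand-in: there is no real lock, the callback dropping the list's reference plays the role of a concurrent remove_proc_entry(), and all names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

struct entry {
	int refcount;
	int id;
	struct entry *next;
};

static int freed_count;  /* instrumentation for the example only */

static void entry_get(struct entry *e)
{
	e->refcount++;
}

static void entry_put(struct entry *e)
{
	if (--e->refcount == 0) {
		free(e);
		freed_count++;
	}
}

/* Build n entries, each holding one reference on behalf of the list. */
static struct entry *make_list(int n)
{
	struct entry *head = NULL;

	while (n-- > 0) {
		struct entry *e = malloc(sizeof(*e));

		if (!e)
			abort();
		e->refcount = 1;
		e->id = n;
		e->next = head;
		head = e;
	}
	return head;
}

/* Stand-in for a concurrent removal: drops the list's own reference. */
static int drop_list_ref(struct entry *e)
{
	entry_put(e);
	return 0;
}

/*
 * Walk the list.  The pin taken before cb() keeps the entry alive even if
 * cb() drops the last other reference, so reading e->next afterwards is safe.
 */
static int walk(struct entry *e, int (*cb)(struct entry *))
{
	int visited = 0;

	while (e) {
		struct entry *next;

		entry_get(e);           /* pin before "unlocking" */
		if (cb(e) < 0) {
			entry_put(e);   /* the error path must drop the pin too */
			return -1;
		}
		visited++;
		next = e->next;         /* e is still pinned here */
		entry_put(e);
		e = next;
	}
	return visited;
}
```

Without the entry_get()/entry_put() pair, the callback in this model would free the entry and the following e->next read would touch freed memory, which is the "boom" in the race diagram.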



Re: [RFC][PATCH 2/7] RSS controller core

2007-03-11 Thread Kirill Korotaev
Andrew Morton wrote:
> On Tue, 06 Mar 2007 17:55:29 +0300
> Pavel Emelianov [EMAIL PROTECTED] wrote:
> 
> 
>>+struct rss_container {
>>+ struct res_counter res;
>>+ struct list_head page_list;
>>+ struct container_subsys_state css;
>>+};
>>+
>>+struct page_container {
>>+ struct page *page;
>>+ struct rss_container *cnt;
>>+ struct list_head list;
>>+};
> 
> 
> ah.  This looks good.  I'll find a hunk of time to go through this work
> and through Paul's patches.  It'd be good to get both patchsets lined
> up in -mm within a couple of weeks.  But..
> 
> We need to decide whether we want to do per-container memory limitation via
> these data structures, or whether we do it via a physical scan of some
> software zone, possibly based on Mel's patches.
i.e. a separate memzone for each container?
imho memzone approach is inconvenient for page sharing and shares accounting.
it also makes memory management more strict, forbids overcommitting
per-container etc.
Maybe you have some ideas how we can decide on this?

Thanks,
Kirill
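
To make the data-structure approach under discussion a little more concrete, here is a minimal user-space sketch in C of per-container charging against a limit. It is not the code from the patch set: the `_sketch` types and the `rss_charge_page()` / `rss_uncharge_page()` helpers are hypothetical names, the counter is reduced to a usage/limit pair, and the per-page list is replaced by a simple page count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced model of a resource counter: usage against a limit. */
struct res_counter_sketch {
    unsigned long usage;
    unsigned long limit;
};

/* Hypothetical, reduced model of an RSS container. */
struct rss_container_sketch {
    struct res_counter_sketch res;
    size_t nr_pages;            /* stands in for the container's page list */
};

/* Charge val units; fail without side effects if the limit would be exceeded. */
static int res_counter_charge_sketch(struct res_counter_sketch *c,
                                     unsigned long val)
{
    if (val > c->limit - c->usage)
        return -1;
    c->usage += val;
    return 0;
}

static void res_counter_uncharge_sketch(struct res_counter_sketch *c,
                                        unsigned long val)
{
    if (val > c->usage)
        val = c->usage;
    c->usage -= val;
}

/* Account one page to the container; the caller would reclaim on failure. */
static int rss_charge_page(struct rss_container_sketch *cnt)
{
    if (res_counter_charge_sketch(&cnt->res, 1) < 0)
        return -1;
    cnt->nr_pages++;
    return 0;
}

static void rss_uncharge_page(struct rss_container_sketch *cnt)
{
    if (cnt->nr_pages == 0)
        return;
    cnt->nr_pages--;
    res_counter_uncharge_sketch(&cnt->res, 1);
}

/* Limit of two pages: the third charge is refused until one page is released. */
int rss_demo(void)
{
    struct rss_container_sketch cnt = { { 0, 2 }, 0 };
    int ok = 0;

    ok += rss_charge_page(&cnt) == 0;
    ok += rss_charge_page(&cnt) == 0;
    ok += rss_charge_page(&cnt) == 0;   /* over the limit, refused */
    rss_uncharge_page(&cnt);
    ok += rss_charge_page(&cnt) == 0;
    return ok;
}
```

The point of the sketch is only that the limit lives in a counter attached to the container rather than in the size of a memory zone, so the sum of all limits may exceed physical memory. Whether that kind of overcommit is wanted is exactly the requirements question raised in this thread.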



Re: [RFC][PATCH 2/7] RSS controller core

2007-03-11 Thread Andrew Morton
 On Sun, 11 Mar 2007 15:26:41 +0300 Kirill Korotaev [EMAIL PROTECTED] wrote:
 Andrew Morton wrote:
  On Tue, 06 Mar 2007 17:55:29 +0300
  Pavel Emelianov [EMAIL PROTECTED] wrote:
  
  
 +struct rss_container {
 +   struct res_counter res;
 +   struct list_head page_list;
 +   struct container_subsys_state css;
 +};
 +
 +struct page_container {
 +   struct page *page;
 +   struct rss_container *cnt;
 +   struct list_head list;
 +};
  
  
  ah.  This looks good.  I'll find a hunk of time to go through this work
  and through Paul's patches.  It'd be good to get both patchsets lined
  up in -mm within a couple of weeks.  But..
  
  We need to decide whether we want to do per-container memory limitation via
  these data structures, or whether we do it via a physical scan of some
  software zone, possibly based on Mel's patches.
 i.e. a separate memzone for each container?

Yep.  Straightforward machine partitioning.  An attractive thing is that it
100% reuses existing page reclaim, unaltered.

 imho memzone approach is inconvenient for page sharing and shares accounting.
 it also makes memory management more strict, forbids overcommitting
 per-container etc.

umm, who said they were requirements?

 Maybe you have some ideas how we can decide on this?

We need to work out what the requirements are before we can settle on an
implementation.

Sigh.  Who is running this show?   Anyone?

You can actually do a form of overcommitment by allowing multiple
containers to share one or more of the zones.  Whether that is sufficient
or suitable I don't know.  That depends on the requirements, and we haven't
even discussed those, let alone agreed to them.  



Re: [PATCH] [scsi]: Add offline state checking while dispatch a scsi cmd

2007-03-11 Thread James Bottomley
On Fri, 2007-03-09 at 09:40 +0800, Joe Jin wrote:
  What's the error you're trying to fix?  scsi_dispatch_cmd() is only
  called from scsi_request_fn() which already has an equivalent of this
  check in it just prior to calling dispatch.
 
 Yeah, I have seen the checking in scsi_request_fn(); recently we got the
 following crash info on rhel4 2.6.9-42.0.2.ELsmp,

This kernel is way too old to debug ...

However: 
 scsi0 (0:0): rejecting I/O to offline device
 ...
 EXT3-fs error (device sda8) in start_transaction: Journal has aborted
 
 Unable to handle kernel NULL pointer dereference at  RIP: 
 a0031e66{:megaraid_mbox:megaraid_queue_command+2634}

This is actually a bug in the megaraid driver.

 PML4 21a25d067 PGD 2170ac067 PMD 0 
 Oops: 0002 [1] SMP 
 CPU 0 
 Modules linked in: hangcheck_timer mptctl mptbase ipmi_devintf ipmi_si 
 ipmi_msghandler dell_rbu netconsole netdump autofs4 i2c_dev i2c_core ocfs2(U) 
 debugfs(U) nfs lockd nfs_acl ocfs2_dlmfs(U) ocfs2_dlm(U) ocfs2_nodemanager(U) 
 configfs(U) sunrpc ds yenta_socket pcmcia_core ide_dump scsi_dump diskdump 
 zlib_deflate dm_mirror dm_multipath dm_mod emcphr(U) emcpmpap(U) emcpmpaa(U) 
 emcpmpc(U) emcpmp(U) emcp(U) emcplib(U) button battery ac joydev uhci_hcd 
 ehci_hcd hw_random tg3 e1000 bond0(U) floppy sg ext3 jbd lpfc 
 scsi_transport_fc megaraid_mbox megaraid_mm sd_mod scsi_mod
 Pid: 13238, comm: emagent Tainted: P  2.6.9-42.0.2.ELsmp
 RIP: 0010:[a0031e66] 
 a0031e66{:megaraid_mbox:megaraid_queue_command+2634}
 RSP: 0018:01019b5a9b48  EFLAGS: 00010002
 RAX: 000220b8e000 RBX: 0102ffd1b048 RCX: 
 RDX:  RSI: 0001 RDI: 010431124bf0
 RBP: 0001 R08:  R09: 010133ce5b80
 R10: 0102ffd3e5a0 R11: 0060 R12: 010133ce5b80
 R13: 0102ffd3e480 R14: 0100bfb4c8b8 R15: 0101ffcf4000
 FS:  () GS:804e5180(005b) knlGS:f47ffbb0
 CS:  0010 DS: 002b ES: 002b CR0: 8005003b
 CR2:  CR3: 00101000 CR4: 06e0
 Process emagent (pid: 13238, threadinfo 01019b5a8000, task 
 01003e5a8030)
 Stack:  0046 0046 0102ffd3e480 
0101fff73980 8015cb38 0100bfb4d4aa 0100bfb4d4a2 
0100bfb4c8b8 01010080 
 Call Trace:8015cb38{mempool_alloc+129} 
 a0002874{:scsi_mod:scsi_done+0} 
8013fc00{__mod_timer+113} 
 a0002adf{:scsi_mod:scsi_dispatch_cmd+595} 
a0007a72{:scsi_mod:scsi_request_fn+990} 
 8024e385{generic_unplug_device+24} 
8017a6d3{__wait_on_buffer+120} 
 8017a55e{bh_wake_function+0} 
8017a55e{bh_wake_function+0} 
 a00877fe{:ext3:ext3_bread+96} 
a008935c{:ext3:htree_dirblock_to_tree+50} 
a008952c{:ext3:ext3_htree_fill_tree+295} 
8018b232{filldir64+122} 8018b1b8{filldir64+0} 
a0083ace{:ext3:ext3_readdir+371} 8018f019{dput+56} 
8018b1b8{filldir64+0} 8018599c{path_release+12} 
8019e335{compat_sys_statfs+105} 
 8018b1b8{filldir64+0} 
8018aef7{vfs_readdir+155} 
 8018b2e8{sys_getdents64+118} 
80125bbb{sysenter_do_call+27} 

And this is a direct command submission path:  it already passed both
online check gates in this path *after* the device was offlined, so
adding a third won't fix this.  Firstly, I'm assuming you have only a
single disk, so the I/O was definitely bound for sda?  Secondly, can you
reproduce with a modern (2.6.20) kernel?  Your trace strongly suggests
that the device came back online for some reason and then the megaraid
driver died.

James




Re: [RFC][PATCH 2/7] RSS controller core

2007-03-11 Thread Balbir Singh

On 3/11/07, Andrew Morton [EMAIL PROTECTED] wrote:

 On Sun, 11 Mar 2007 15:26:41 +0300 Kirill Korotaev [EMAIL PROTECTED] wrote:
 Andrew Morton wrote:
  On Tue, 06 Mar 2007 17:55:29 +0300
  Pavel Emelianov [EMAIL PROTECTED] wrote:
 
 
 +struct rss_container {
 +   struct res_counter res;
 +   struct list_head page_list;
 +   struct container_subsys_state css;
 +};
 +
 +struct page_container {
 +   struct page *page;
 +   struct rss_container *cnt;
 +   struct list_head list;
 +};
 
 
  ah.  This looks good.  I'll find a hunk of time to go through this work
  and through Paul's patches.  It'd be good to get both patchsets lined
  up in -mm within a couple of weeks.  But..
 
  We need to decide whether we want to do per-container memory limitation via
  these data structures, or whether we do it via a physical scan of some
  software zone, possibly based on Mel's patches.
 i.e. a separate memzone for each container?

Yep.  Straightforward machine partitioning.  An attractive thing is that it
100% reuses existing page reclaim, unaltered.


We discussed zones for resource control and some of the disadvantages at
 http://lkml.org/lkml/2006/10/30/222

I need to look at Mel's patches to determine if they are suitable for
control. But in a thread of discussion on those patches, it was agreed
that memory fragmentation and resource control are independent issues.




 imho memzone approach is inconvenient for page sharing and shares accounting.
 it also makes memory management more strict, forbids overcommitting
 per-container etc.

umm, who said they were requirements?



We discussed some of the requirements in the RFC: Memory Controller
requirements thread
http://lkml.org/lkml/2006/10/30/51


 Maybe you have some ideas how we can decide on this?

We need to work out what the requirements are before we can settle on an
implementation.

Sigh.  Who is running this show?   Anyone?



All the stakeholders involved in the RFC discussion :-) We've been
talking and building on top of each other's patches. I hope that was a
good answer ;)


You can actually do a form of overcommitment by allowing multiple
containers to share one or more of the zones.  Whether that is sufficient
or suitable I don't know.  That depends on the requirements, and we haven't
even discussed those, let alone agreed to them.



There are other things like resizing a zone, finding the right size,
etc. I'll look at Mel's patches to see what is supported.

Warm Regards,
Balbir Singh


Re: [PATCH] Use more gcc extensions in the Linux headers

2007-03-11 Thread Valdis . Kletnieks
On Fri, 09 Mar 2007 20:24:42 PST, Randy Dunlap said:
 On Fri, 09 Mar 2007 23:03:05 -0500 [EMAIL PROTECTED] wrote:
  -/* GCC is awesome. */
  +/* GCC leaves me speechless. */
 
 awesome can mean inspiring awe or admiration or wonder (amazing)
 or it can mean awful (as in terrifying).  8)

And as those who know me well will attest, it takes something well down the
road of either definition to render me actually speechless.. :)




Re: [patch 6/9] signalfd/timerfd v1 - timerfd core ...

2007-03-11 Thread Linus Torvalds


On Sat, 10 Mar 2007, Nicholas Miell wrote:
  
  UNIX has pid's for process handles, and file descriptors for just 
  about everything else.
 
 And I imagine that somebody will come up with way of getting a fd for a
 process sooner or later. 

Well, /proc/<pid>/ is about as close as you get. And that's largely 
inspired by a Plan-9'ish thing that does indeed expose processes as files.

The problem with processes is that they are actually so *complicated* that 
trying to describe them with a single file isn't all that useful (you 
could use tons of different ioctl's to do different operations, but that's 
against the stream of bytes model in UNIX, and even more so against the 
whole Plan-9 model).

 Actually, I was thinking reducing struct file to the bare minimum, and
 then using that as the common header shared by object-specific
 structures. I don't know how unpleasant that would be from a memory
 allocation perspective, though.

It would probably not be a bad idea, but I just doubt that it makes much 
of a difference, at least not for timerfd/signalfd files. There likely 
just won't be that many of them (I'd expect that processes that use them 
would normally just have one or two of each).

It might be more relevant for things like sockets and pty's: do a

ls -l /proc/*/fd

and see what kind of files you have open, and I suspect most of the files 
will actually be sockets on a normal desktop setup, and even more so on 
some network server thing. And yes, it might be nice to avoid allocating 
memory for the (unnecessary) readahead and f_pos state, but in the end 
you seldom really have all *that* much memory allocated for file 
descriptors. The real memory use ends up being elsewhere..

IOW, I don't think it's a bad idea per se, I just doubt that it is worth 
the complexity and effort.

Linus
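
The "common header shared by object-specific structures" idea discussed above can be illustrated with a small user-space C sketch. Everything here is hypothetical: `struct file_hdr`, the two object types and the `obj_of()` macro are invented for the illustration and say nothing about how `struct file` is actually laid out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bare-minimum header that every object type would embed. */
struct file_hdr {
    int type;
    int refcount;
};

enum { OBJ_TIMER = 1, OBJ_REGULAR = 2 };

/* A timer-like object has no use for a file position or readahead state. */
struct timer_obj {
    struct file_hdr hdr;
    long expires;
};

/* A regular-file-like object carries the extra per-file state. */
struct regular_obj {
    struct file_hdr hdr;
    long pos;
    long readahead_window;
};

/* Recover the enclosing object from a pointer to its embedded header. */
#define obj_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Generic code passes headers around; type-specific code converts back. */
static long timer_expiry(struct file_hdr *h)
{
    assert(h->type == OBJ_TIMER);
    return obj_of(h, struct timer_obj, hdr)->expires;
}
```

The saving per object is only the size of the state that is left out, which matches the point made above: with one or two such descriptors per process, the gain is small compared with the added complexity.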


Re: SATA resume slowness, e1000 MSI warning

2007-03-11 Thread Eric W. Biederman
Michael S. Tsirkin [EMAIL PROTECTED] writes:

 Quoting Eric W. Biederman [EMAIL PROTECTED]:
 Subject: Re: SATA resume slowness, e1000 MSI warning
 
 Michael S. Tsirkin [EMAIL PROTECTED] writes:
 
  The only case I can see which might trigger this is if we saved
  pci-X state and then didn't restore it because we could not find
  the capability on restore.
 
  Hmm. pci_save_pcix_state/pci_restore_pcix_state seem to only handle
  regular devices and seem to ignore the fact that for bridge PCI-X
  capability has a different structure.
 
  Is this intentional? 
 
 Probably not as such.  I don't think we have any drivers for bridge
 devices so I don't think it matters.  It likely wouldn't hurt to fix
 it just in case though.
 
 Do any of the mellanox cards do anything with the bridge on the card?

 Yes but they do their own thing wrt saving/restoring registers.
 Look at drivers/infiniband/hw/mthca/mthca_reset.c

  If not, here's a patch to fix this. Warning: completely untested.
 
 If you fix the offsets and diff this against my last fix (to never
 free the buffer) I think your patch makes sense.

 Let's agree what the correct offsets are.

  PCI: restore bridge PCI-X capability registers after PM event
 
  Restore PCI-X bridge up/downstream capability registers
  after PM event.  This includes maximum split transaction
  commitment limit which might be vital for PCI-X.
 
  Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED]
 
  diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
  index df49530..4b788ef 100644
  --- a/drivers/pci/pci.c
  +++ b/drivers/pci/pci.c
  @@ -597,14 +597,19 @@ static int pci_save_pcix_state(struct pci_dev *dev)
          if (pos <= 0)
                  return 0;
   
  -       save_state = kzalloc(sizeof(*save_state) + sizeof(u16), GFP_KERNEL);
  +       save_state = kzalloc(sizeof(*save_state) + sizeof(u16) * 2, GFP_KERNEL);
          if (!save_state) {
  -               dev_err(&dev->dev, "Out of memory in pci_save_pcie_state\n");
  +               dev_err(&dev->dev, "Out of memory in pci_save_pcix_state\n");
                  return -ENOMEM;
          }
          cap = (u16 *)&save_state->data[0];
   
  -       pci_read_config_word(dev, pos + PCI_X_CMD, &cap[i++]);
  +       if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
 
 This appears to be the proper test.
 
  +               pci_read_config_word(dev, pos + PCI_X_BRIDGE_UP_SPL_CTL, &cap[i++]);
  +               pci_read_config_word(dev, pos + PCI_X_BRIDGE_DN_SPL_CTL, &cap[i++]);
  +       } else
  +               pci_read_config_word(dev, pos + PCI_X_CMD, &cap[i++]);
  +
          pci_add_saved_cap(dev, save_state);
          return 0;
   }
  @@ -621,7 +626,11 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
                  return;
          cap = (u16 *)&save_state->data[0];
   
  -       pci_write_config_word(dev, pos + PCI_X_CMD, cap[i++]);
  +       if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
  +               pci_write_config_word(dev, pos + PCI_X_BRIDGE_UP_SPL_CTL, cap[i++]);
  +               pci_write_config_word(dev, pos + PCI_X_BRIDGE_DN_SPL_CTL, cap[i++]);
 
 These look like the proper two registers to save.
 
  +       } else
  +               pci_write_config_word(dev, pos + PCI_X_CMD, cap[i++]);
          pci_remove_saved_cap(save_state);
          kfree(save_state);
   }
  diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
  index f09cce2..fb7eefd 100644
  --- a/include/linux/pci_regs.h
  +++ b/include/linux/pci_regs.h
  @@ -332,6 +332,8 @@
   #define  PCI_X_STATUS_SPL_ERR 0x20000000 /* Rcvd Split Completion Error Msg */
   #define  PCI_X_STATUS_266MHZ  0x40000000 /* 266 MHz capable */
   #define  PCI_X_STATUS_533MHZ  0x80000000 /* 533 MHz capable */
  +#define PCI_X_BRIDGE_UP_SPL_CTL 10 /* PCI-X upstream split transaction limit */
  +#define PCI_X_BRIDGE_DN_SPL_CTL 14 /* PCI-X downstream split transaction limit */
 
 Unless I am completely misreading the spec: while you have picked the
 right registers to save, the offsets should be 0x08 and 0x0c, i.e. 8 and 12.

 No, the spec is written in terms of dwords (32 bit), we are storing words (16
 bits).
 The data at offsets 8 and 12 is read-only split transaction capacity.
 Split transaction limit starts at bit 16 so you need to add 2 to byte offset.

 Right?

From that perspective it makes sense.  So I will agree with the way you are
thinking the code works.

The read-only and the read-write part are all defined as part of the
same register so I didn't expect them to be separate.  And I hadn't
paid attention enough to see that the code was only saving 16bit
values.

Rumor has it that some pci devices can't tolerate < 32bit accesses.
Although I have never met one.  The two factors together suggest that
for generic code it probably makes sense to operate on 32bit
quantities, and just to ignore the read-only portion.

Eric
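
The offset arithmetic agreed on above (a 32-bit register at byte offset 8 or 12 whose upper 16 bits hold the writable limit, so the 16-bit accesses go to offsets 10 and 14) can be checked with a small user-space C sketch. This is only a model under stated assumptions: configuration space is represented as a little-endian byte array, offsets are relative to the start of the capability, and the `SKETCH_*` names and `cfg_*` helpers are hypothetical, not kernel API.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets relative to the start of the capability, as discussed above. */
enum {
    SKETCH_UP_SPL_DWORD = 0x08,                     /* upstream, 32-bit view */
    SKETCH_DN_SPL_DWORD = 0x0c,                     /* downstream, 32-bit view */
    SKETCH_UP_SPL_LIMIT = SKETCH_UP_SPL_DWORD + 2,  /* upper 16 bits: limit */
    SKETCH_DN_SPL_LIMIT = SKETCH_DN_SPL_DWORD + 2
};

/* Little-endian 16-bit read from the modelled configuration space. */
static uint16_t cfg_read16(const uint8_t *cfg, unsigned int off)
{
    return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

/* Little-endian 32-bit read built from two 16-bit reads. */
static uint32_t cfg_read32(const uint8_t *cfg, unsigned int off)
{
    return (uint32_t)cfg_read16(cfg, off) |
           ((uint32_t)cfg_read16(cfg, off + 2) << 16);
}

/* Little-endian 32-bit write. */
static void cfg_write32(uint8_t *cfg, unsigned int off, uint32_t val)
{
    unsigned int i;

    for (i = 0; i < 4; i++)
        cfg[off + i] = (uint8_t)(val >> (8 * i));
}

/* Pack the read-only capacity (low half) and writable limit (high half). */
static uint32_t make_spl_ctl(uint16_t capacity, uint16_t limit)
{
    return (uint32_t)capacity | ((uint32_t)limit << 16);
}
```

Reading 16 bits at offset 10 returns the same value as the upper half of a 32-bit read at offset 8, which is why both the 16-bit save/restore in the patch and the 32-bit variant suggested above reach the same writable field.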


[PATCH] kthread_should_stop_check_freeze (was: Re: [PATCH -mm 3/7] Freezer: Remove PF_NOFREEZE from rcutorture thread)

2007-03-11 Thread Rafael J. Wysocki
On Saturday, 3 March 2007 18:32, Oleg Nesterov wrote:
 On 03/02, Paul E. McKenney wrote:
 
  On Sat, Mar 03, 2007 at 02:33:37AM +0300, Oleg Nesterov wrote:
   On 03/02, Paul E. McKenney wrote:
   
One way to embed try_to_freeze() into kthread_should_stop() might be
as follows:

int kthread_should_stop(void)
{
if (kthread_stop_info.k == current)
return 1;
try_to_freeze();
return 0;
}
   
   I think this is dangerous. For example, worker_thread() will probably
   need some special actions after return from refrigerator. Also, a kernel
   thread may check kthread_should_stop() in the place where try_to_freeze()
   is not safe.
   
   Perhaps we should introduce a new helper which does this.
  
  Good point -- the return value from try_to_freeze() is lost if one uses
  the above approach.  About one third of the calls to try_to_freeze()
  in 2.6.20 pay attention to the return value.
  
  One approach would be to have a kthread_should_stop_nofreeze() for those
  cases, and let the default be to try to freeze.
 
 I personally think we should do the opposite, add 
 kthread_should_stop_check_freeze()
 or something. kthread_should_stop() is like signal_pending(), we can use
 it under spin_lock (and it is probably used this way by some out-of-tree
 driver). The new helper is obviously might_sleep().

Something like this, perhaps:

 include/linux/kthread.h |    1 +
 kernel/kthread.c        |   16 ++++++++++++++++
 kernel/rcutorture.c     |    5 ++---
 3 files changed, 19 insertions(+), 3 deletions(-)

Index: linux-2.6.21-rc3-mm2/kernel/kthread.c
===================================================================
--- linux-2.6.21-rc3-mm2.orig/kernel/kthread.c  2007-03-08 21:58:48.000000000 +0100
+++ linux-2.6.21-rc3-mm2/kernel/kthread.c       2007-03-11 18:32:59.000000000 +0100
@@ -13,6 +13,7 @@
 #include <linux/file.h>
 #include <linux/module.h>
 #include <linux/mutex.h>
+#include <linux/freezer.h>
 #include <asm/semaphore.h>
 
 /*
@@ -60,6 +61,21 @@ int kthread_should_stop(void)
 }
 EXPORT_SYMBOL(kthread_should_stop);
 
+/**
+ * kthread_should_stop_check_freeze - check if the thread should return now and
+ * if not, check if there is a freezing request pending for it.
+ */
+int kthread_should_stop_check_freeze(void)
+{
+       might_sleep();
+       if (kthread_stop_info.k == current)
+               return 1;
+
+       try_to_freeze();
+       return 0;
+}
+EXPORT_SYMBOL(kthread_should_stop_check_freeze);
+
 static void kthread_exit_files(void)
 {
        struct fs_struct *fs;
Index: linux-2.6.21-rc3-mm2/include/linux/kthread.h
===================================================================
--- linux-2.6.21-rc3-mm2.orig/include/linux/kthread.h   2007-02-04 19:44:54.000000000 +0100
+++ linux-2.6.21-rc3-mm2/include/linux/kthread.h        2007-03-11 18:37:10.000000000 +0100
@@ -29,5 +29,6 @@ struct task_struct *kthread_create(int (
 void kthread_bind(struct task_struct *k, unsigned int cpu);
 int kthread_stop(struct task_struct *k);
 int kthread_should_stop(void);
+int kthread_should_stop_check_freeze(void);
 
 #endif /* _LINUX_KTHREAD_H */
Index: linux-2.6.21-rc3-mm2/kernel/rcutorture.c
===================================================================
--- linux-2.6.21-rc3-mm2.orig/kernel/rcutorture.c       2007-03-11 11:39:06.000000000 +0100
+++ linux-2.6.21-rc3-mm2/kernel/rcutorture.c    2007-03-11 18:45:00.000000000 +0100
@@ -540,10 +540,9 @@ rcu_torture_writer(void *arg)
                }
                rcu_torture_current_version++;
                oldbatch = cur_ops->completed();
-               try_to_freeze();
-       } while (!kthread_should_stop() && !fullstop);
+       } while (!kthread_should_stop_check_freeze() && !fullstop);
        VERBOSE_PRINTK_STRING("rcu_torture_writer task stopping");
-       while (!kthread_should_stop())
+       while (!kthread_should_stop_check_freeze())
                schedule_timeout_uninterruptible(1);
        return 0;
 }
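
The control flow of the proposed helper (report a pending stop first, otherwise honour a pending freeze request and carry on) can be modelled in user-space C. This is a single-threaded sketch with hypothetical names: `struct kthread_model` and the `model_*` functions are invented here, the freezer is reduced to a flag and a counter, and the requests are raised from inside the loop purely so the example is deterministic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state a kernel thread would poll. */
struct kthread_model {
    bool stop_requested;     /* stands in for kthread_stop_info.k == current */
    bool freeze_requested;   /* stands in for a pending freezing request */
    int times_frozen;        /* how often the model "entered the refrigerator" */
};

/* Stop has priority; otherwise handle a pending freeze and keep running. */
static int model_should_stop_check_freeze(struct kthread_model *t)
{
    if (t->stop_requested)
        return 1;

    if (t->freeze_requested) {   /* stands in for try_to_freeze() */
        t->times_frozen++;
        t->freeze_requested = false;
    }
    return 0;
}

/* Worker loop shaped like the rcutorture change: one combined check per pass. */
static int model_writer_loop(struct kthread_model *t, int stop_after)
{
    int iterations = 0;

    do {
        iterations++;
        if (iterations == 2)
            t->freeze_requested = true;   /* demo: freeze arrives on pass 2 */
        if (iterations == stop_after)
            t->stop_requested = true;     /* demo: stop arrives later */
    } while (!model_should_stop_check_freeze(t));

    return iterations;
}
```

Because the combined check may block in the freezer, it only fits call sites that are allowed to sleep, which is the reason given above for keeping the plain `kthread_should_stop()` unchanged and adding a separate helper.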


RE: [git patches] libata fixes

2007-03-11 Thread Linus Torvalds

Paul,
 do I understand correctly that the *only* difference between the working 
and the non-working setup is that you applied (by hand) the libata patch
that Jeff sent out?

So plain 2.6.21-rc2 works fine, but with the patch applied, you get no 
interrupts on the DVD drive?

On Sun, 11 Mar 2007, Paul Rolland wrote:
 
  It seems like IRQ is not getting through.  The first IRQ 
  driven command is failing for you.
 
 H 
  Extract is :
  ata7: PATA max UDMA/100 cmd 0x00019c00 ctl 0x00019882 bmdma 
  0x00019400 irq 16
  ata8: PATA max UDMA/100 cmd 0x00019800 ctl 0x00019482 bmdma 
  0x00019408 irq 16
 
 IRQ 16 is IO-APIC-fasteoi for libata, and is not shared... but all the
 others libata IRQ are IO-APIC-edge.

Ok, that's interesting, although IO-APIC-fasteoi certainly works for 
others (eg me), but it's still useful.

  * Does giving 'acpi=off' or 'irqpoll' make any difference?
  
  * Can you connect a harddisk to the channel and see whether 
  that works?

 Tried that.. Disk is identified as ATA-7: Maxtor 6Y080L0, YAR41BW0, max
 UDMA/13 and then timeout again...
 
 Tried then with acpi=off, same result (identify is OK, but then timeout),
 and irqpoll and then it was OK 

Whee... There were no changes that looked interrupt-related there..

Linus


Re: [QUICKLIST 4/6] x86_64: Single Quicklist

2007-03-11 Thread Christoph Lameter
On Sun, 11 Mar 2007, Andi Kleen wrote:

 This and i386 version are ok to me, although it might be better to just
 finish __GFP_ZERO support to do this.

This would not work for pgds on i386 and x86_64

GFP_ZERO support the way I have done it in the past would mean another set 
of buddy lists in the page allocator and another issue with fragmentation. 
So I have stayed away from it although patches exist in my archives (See 
my ftp.kernel.org archive).

Maybe we could implement limited GFP_ZERO support by just keeping an 
additional per cpu list of pages? The issue with that one is that a page
may grow cold on that list. One usually wants the page to be hot in the 
cache when it is allocated. This is different for page table pages. Page 
table pages are typically sparsely accessed.
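
A list of pre-zeroed pages of the kind described above can be sketched in user-space C. This is only an illustration under stated assumptions: there is a single list rather than one per CPU, `calloc()` stands in for the page allocator, the page size is fixed at 4096 bytes, and `struct quicklist_sketch`, `ql_alloc()` and `ql_free()` are hypothetical names. The sketch clears the page when it is returned to the list, which is a choice made here for simplicity.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/* Hypothetical list of zeroed pages; the link lives in the page itself. */
struct quicklist_sketch {
    void *head;
    size_t nr;
};

/* Hand out a zeroed page: reuse a cached one if possible, else allocate. */
static void *ql_alloc(struct quicklist_sketch *q)
{
    void *page = q->head;

    if (page != NULL) {
        q->head = *(void **)page;   /* follow the link stored in the page */
        *(void **)page = NULL;      /* clear the link so the page is all zero */
        q->nr--;
        return page;
    }
    return calloc(1, SKETCH_PAGE_SIZE);
}

/* Return a page to the list, zeroing it first so ql_alloc() can skip that. */
static void ql_free(struct quicklist_sketch *q, void *page)
{
    memset(page, 0, SKETCH_PAGE_SIZE);
    *(void **)page = q->head;
    q->head = page;
    q->nr++;
}

/* Helper for checking that a page really is all zero. */
static int page_is_zero(const unsigned char *p)
{
    size_t i;

    for (i = 0; i < SKETCH_PAGE_SIZE; i++)
        if (p[i] != 0)
            return 0;
    return 1;
}
```

A page sitting on such a list may have dropped out of the CPU cache by the time it is reused, which is the cold-page concern raised above; for sparsely accessed page table pages that cost is argued to matter less.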


lockdep question (was Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!)

2007-03-11 Thread Michael S. Tsirkin
 Quoting Roland Dreier [EMAIL PROTECTED]:
 Subject: Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!
 
 Feb 27 17:47:52 sw169 kernel:  [8053aaf1] 
 _spin_lock_irqsave+0x15/0x24
 Feb 27 17:47:52 sw169 kernel:  [88067a23] 
 :ib_ipoib:ipoib_neigh_destructor+0xc2/0x139
 
 It looks like this is deadlocking trying to take priv->lock in 
 ipoib_neigh_destructor().
 One idea I just had would be to build a kernel with CONFIG_PROVE_LOCKING 
 turned on, and then rerun this test.  There's a good chance that this would
 diagnose the deadlock.  (I don't have good access to my test machines right 
 now, or
 else I would do it myself)

OK, I did that. But I get
[13440.761857] INFO: trying to register non-static key.
[13440.766903] the code is fine but needs lockdep annotation.
[13440.772455] turning off the locking correctness validator.
and I am not sure what triggers this, or how to fix it to have the
validator actually do its job.

Ingo, what key does the message refer to?

The stack dump seems to point to drivers/infiniband/ulp/ipoib/ipoib_main.c line
829.

Full message below:

[13440.761857] INFO: trying to register non-static key.
[13440.766903] the code is fine but needs lockdep annotation.
[13440.772455] turning off the locking correctness validator.
[13440.778008]  [c023c082] __lock_acquire+0xae4/0xbb9
[13440.783078]  [c023c43d] lock_acquire+0x56/0x71
[13440.787784]  [f899bff2] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib]
[13440.794412]  [c051ad41] _spin_lock_irqsave+0x32/0x41
[13440.799649]  [f899bff2] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib]
[13440.806275]  [f899bff2] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib]
[13440.812897]  [c04a1c1b] dst_run_gc+0xc/0x118
[13440.817439]  [c022af6e] run_timer_softirq+0x37/0x16b
[13440.822673]  [c04a1c0f] dst_run_gc+0x0/0x118
[13440.827221]  [c04a3eab] neigh_destroy+0xbe/0x104
[13440.832114]  [c04a1bb1] dst_destroy+0x4d/0xab
[13440.836751]  [c04a1c64] dst_run_gc+0x55/0x118
[13440.841384]  [c022b03f] run_timer_softirq+0x108/0x16b
[13440.846711]  [c0227634] __do_softirq+0x5a/0xd5
[13440.851427]  [c023b435] trace_hardirqs_on+0x106/0x141
[13440.856754]  [c0227643] __do_softirq+0x69/0xd5
[13440.861470]  [c02276e6] do_softirq+0x37/0x4d
[13440.866016]  [c02167b0] smp_apic_timer_interrupt+0x6b/0x77
[13440.871774]  [c02029ef] default_idle+0x3b/0x54
[13440.876491]  [c02029ef] default_idle+0x3b/0x54
[13440.881211]  [c0204c33] apic_timer_interrupt+0x33/0x38
[13440.886624]  [c02029ef] default_idle+0x3b/0x54
[13440.891342]  [c02029f1] default_idle+0x3d/0x54
[13440.896061]  [c0202aaa] cpu_idle+0xa2/0xbb
[13440.900436]  ===
[13768.711447] BUG: spinlock lockup on CPU#1, swapper/0, c0687880
[13768.717353]  [c031f919] _raw_spin_lock+0xda/0xfd
[13768.722247]  [c051ad48] _spin_lock_irqsave+0x39/0x41
[13768.727486]  [f899bff2] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib]
[13768.734110]  [f899bff2] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib]
[13768.740735]  [c04a1c1b] dst_run_gc+0xc/0x118
[13768.745276]  [c022af6e] run_timer_softirq+0x37/0x16b
[13768.750517]  [c04a1c0f] dst_run_gc+0x0/0x118
[13768.755061]  [c04a3eab] neigh_destroy+0xbe/0x104
[13768.759955]  [c04a1bb1] dst_destroy+0x4d/0xab
[13768.764586]  [c04a1c64] dst_run_gc+0x55/0x118
[13768.769218]  [c022b03f] run_timer_softirq+0x108/0x16b
[13768.774542]  [c0227634] __do_softirq+0x5a/0xd5
[13768.779261]  [c023b435] trace_hardirqs_on+0x106/0x141
[13768.784588]  [c0227643] __do_softirq+0x69/0xd5
[13768.789308]  [c02276e6] do_softirq+0x37/0x4d
[13768.793851]  [c02167b0] smp_apic_timer_interrupt+0x6b/0x77
[13768.799609]  [c02029ef] default_idle+0x3b/0x54
[13768.804326]  [c02029ef] default_idle+0x3b/0x54
[13768.809054]  [c0204c33] apic_timer_interrupt+0x33/0x38
[13768.814471]  [c02029ef] default_idle+0x3b/0x54
[13768.819187]  [c02029f1] default_idle+0x3d/0x54
[13768.823903]  [c0202aaa] cpu_idle+0xa2/0xbb
[13768.828279]  ===


-- 
MST


Re: lockdep question (was Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!)

2007-03-11 Thread Michael S. Tsirkin
 Quoting Michael S. Tsirkin [EMAIL PROTECTED]:
 Subject: Re: lockdep question (was Re: IPoIB caused a kernel: BUG: soft 
 lockup detected on CPU#0!)
 
 
 After adding some printks, I started getting these:
 
 [  597.036720] BUG: MAX_STACK_TRACE_ENTRIES too low!
 [  597.041546] turning off the locking correctness validator.

I looked at our stack usage a bit. It seems some work is in order.

$ make checkstack | grep ib_
0x00000603 mthca_init_hca [ib_mthca]:              764
0x000014ed mthca_init_hca [ib_mthca]:              764
0x000065ae ipoib_cm_tx_start [ib_ipoib]:           368
0x00006b0b ipoib_cm_tx_start [ib_ipoib]:           368
0x0000135f ib_uverbs_query_device [ib_uverbs]:     348
0x000015f9 ib_uverbs_query_device [ib_uverbs]:     348
0x000005d0 ib_ucm_init_qp_attr [ib_ucm]:           300
0x00000697 ib_ucm_init_qp_attr [ib_ucm]:           300
0x00007f9e ipoib_path_seq_show [ib_ipoib]:         264
0x00008092 ipoib_path_seq_show [ib_ipoib]:         264
0x00005b56 ipoib_cm_rx_handler [ib_ipoib]:         220
0x00005eec ipoib_cm_rx_handler [ib_ipoib]:         220
0x00007934 ipoib_cm_tx_handler [ib_ipoib]:         208
0x00007ce0 ipoib_cm_tx_handler [ib_ipoib]:         208
0x000032fe ib_uverbs_create_qp [ib_uverbs]:        192
0x000036fd ib_uverbs_create_qp [ib_uverbs]:        192
0x000028a9 srp_reset_host [ib_srp]:                192
0x00002a96 srp_reset_host [ib_srp]:                192
0x00001c99 show_sys_image_guid [ib_core]:          188
0x00001d2b show_sys_image_guid [ib_core]:          188
0x000001f9 ib_sa_service_rec_callback [ib_sa]:     180
0x00000234 ib_sa_service_rec_callback [ib_sa]:     180
0x00001b3c path_rec_completion [ib_ipoib]:         180
0x00002020 path_rec_completion [ib_ipoib]:         180
0x000070cf ipoib_cm_handle_rx_wc [ib_ipoib]:       180
0x00007402 ipoib_cm_handle_rx_wc [ib_ipoib]:       180
0x000009a7 srp_create_target [ib_srp]:             176
0x0000125f srp_create_target [ib_srp]:             176
0x00000d9d ib_cm_listen [ib_cm]:                   172
0x000010b3 ib_cm_listen [ib_cm]:                   172
0x00004455 ipoib_mcast_send [ib_ipoib]:            172
0x000048e0 ipoib_mcast_send [ib_ipoib]:            172
0x000015c1 ipoib_start_xmit [ib_ipoib]:            164
0x00001b2d ipoib_start_xmit [ib_ipoib]:            164
0x000056c8 mthca_make_profile [ib_mthca]:          160
0x00006051 mthca_make_profile [ib_mthca]:          160
0x00002abb ipoib_ib_dev_stop [ib_ipoib]:           160
0x00002d19 ipoib_ib_dev_stop [ib_ipoib]:           160
0x0000202b ib_uverbs_query_qp [ib_uverbs]:         156
0x000022c0 ib_uverbs_query_qp [ib_uverbs]:         156
0x00005269 ipoib_init_qp [ib_ipoib]:               152
0x000053bc ipoib_init_qp [ib_ipoib]:               152
0x0000327f ipoib_mcast_join [ib_ipoib]:            144
0x0000349d ipoib_mcast_join [ib_ipoib]:            144
0x00002092 ib_find_send_mad [ib_mad]:              140
0x000023fa ib_find_send_mad [ib_mad]:              140
0x000022cf ib_uverbs_modify_qp [ib_uverbs]:        140
0x000024f2 ib_uverbs_modify_qp [ib_uverbs]:        140
0x0000bc8e mthca_modify_qp [ib_mthca]:             136
0x0000c9cc mthca_modify_qp [ib_mthca]:             136
0x00010cb1 mthca_reg_phys_mr [ib_mthca]:           136
0x0001117a mthca_reg_phys_mr [ib_mthca]:           136
0x000035b4 ipoib_mcast_join_finish [ib_ipoib]:     136
0x00003a33 ipoib_mcast_join_finish [ib_ipoib]:     136
0x00000793 iser_cma_handler [ib_iser]:             132
0x00000bc1 iser_cma_handler [ib_iser]:             132
0x00001e37 srp_queuecommand [ib_srp]:              132
0x0000273b srp_queuecommand [ib_srp]:              132
0x00008a5a mthca_poll_cq [ib_mthca]:               128
0x00009204 mthca_poll_cq [ib_mthca]:               128
0x00003a42 ipoib_mcast_join_complete [ib_ipoib]:   128
0x00003e6e ipoib_mcast_join_complete [ib_ipoib]:   128
0x00004a58 ipoib_mcast_restart_task [ib_ipoib]:    128
0x00004eb8 ipoib_mcast_restart_task [ib_ipoib]:    128
0x000038e6 ib_uverbs_create_ah [ib_uverbs]:        116
0x00003ac4 ib_uverbs_create_ah [ib_uverbs]:        116
0x0000f6a5 mthca_process_mad [ib_mthca]:           116
0x0000fa93 mthca_process_mad [ib_mthca]:           116
0x000011ef mcast_work_handler [ib_sa]:             112
0x000016e6 mcast_work_handler [ib_sa]:             112
0x00000a20 ib_ucm_send_req [ib_ucm]:               112
0x00000b7c ib_ucm_send_req [ib_ucm]:               112
0x00001697 ib_post_send_mad [ib_mad]:              112
0x00001b05 ib_post_send_mad [ib_mad]:              112
0x0000030e iser_post_send [ib_iser]:               112
0x000003c5 iser_post_send [ib_iser]:               112
0x1605 

Re: lockdep question (was Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!)

2007-03-11 Thread Peter Zijlstra
On Sun, 2007-03-11 at 15:50 +0200, Michael S. Tsirkin wrote:
  Quoting Roland Dreier [EMAIL PROTECTED]:
  Subject: Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!
  
  Feb 27 17:47:52 sw169 kernel:  [8053aaf1] 
  _spin_lock_irqsave+0x15/0x24
  Feb 27 17:47:52 sw169 kernel:  [88067a23] 
  :ib_ipoib:ipoib_neigh_destructor+0xc2/0x139
  
  It looks like this is deadlocking trying to take priv->lock in 
  ipoib_neigh_destructor().
  One idea I just had would be to build a kernel with CONFIG_PROVE_LOCKING 
  turned on, and then rerun this test.  There's a good chance that this would
  diagnose the deadlock.  (I don't have good access to my test machines right 
  now, or
  else I would do it myself)
 
 OK, I did that. But I get
   [13440.761857] INFO: trying to register non-static key.
   [13440.766903] the code is fine but needs lockdep annotation.
   [13440.772455] turning off the locking correctness validator.
 and I am not sure what triggers this, or how to fix it to have the
 validator actually do its job.

It usually indicates a spinlock that is not properly initialized, for
example __SPIN_LOCK_UNLOCKED() used in a non-static context; use
spin_lock_init() in these cases.

However, looking at the code, ipoib_neigh_destructor() only uses
priv->lock, and that seems to get properly initialized in ipoib_setup()
using spin_lock_init().

So either there are other sites that instantiate those objects and
forget about the lock init, or the object is corrupted (use after free?)

 Ingo, what key does the message refer to?
 
 The stack dump seems to point to drivers/infiniband/ulp/ipoib/ipoib_main.c 
 line
 829.
 
 Full message below:
   
 [13440.761857] INFO: trying to register non-static key.
 [13440.766903] the code is fine but needs lockdep annotation.
 [13440.772455] turning off the locking correctness validator.
 [13440.778008]  [c023c082] __lock_acquire+0xae4/0xbb9
 [13440.783078]  [c023c43d] lock_acquire+0x56/0x71
 [13440.787784]  [f899bff2] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib]
 [13440.794412]  [c051ad41] _spin_lock_irqsave+0x32/0x41
 [13440.799649]  [f899bff2] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib]
 [13440.806275]  [f899bff2] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib]
 [13440.812897]  [c04a1c1b] dst_run_gc+0xc/0x118
 [13440.817439]  [c022af6e] run_timer_softirq+0x37/0x16b
 [13440.822673]  [c04a1c0f] dst_run_gc+0x0/0x118
 [13440.827221]  [c04a3eab] neigh_destroy+0xbe/0x104
 [13440.832114]  [c04a1bb1] dst_destroy+0x4d/0xab
 [13440.836751]  [c04a1c64] dst_run_gc+0x55/0x118
 [13440.841384]  [c022b03f] run_timer_softirq+0x108/0x16b
 [13440.846711]  [c0227634] __do_softirq+0x5a/0xd5
 [13440.851427]  [c023b435] trace_hardirqs_on+0x106/0x141
 [13440.856754]  [c0227643] __do_softirq+0x69/0xd5
 [13440.861470]  [c02276e6] do_softirq+0x37/0x4d
 [13440.866016]  [c02167b0] smp_apic_timer_interrupt+0x6b/0x77
 [13440.871774]  [c02029ef] default_idle+0x3b/0x54
 [13440.876491]  [c02029ef] default_idle+0x3b/0x54
 [13440.881211]  [c0204c33] apic_timer_interrupt+0x33/0x38
 [13440.886624]  [c02029ef] default_idle+0x3b/0x54
 [13440.891342]  [c02029f1] default_idle+0x3d/0x54
 [13440.896061]  [c0202aaa] cpu_idle+0xa2/0xbb
 [13440.900436]  ===
 [13768.711447] BUG: spinlock lockup on CPU#1, swapper/0, c0687880
 [13768.717353]  [c031f919] _raw_spin_lock+0xda/0xfd
 [13768.722247]  [c051ad48] _spin_lock_irqsave+0x39/0x41
 [13768.727486]  [f899bff2] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib]
 [13768.734110]  [f899bff2] ipoib_neigh_destructor+0xd0/0x132 [ib_ipoib]
 [13768.740735]  [c04a1c1b] dst_run_gc+0xc/0x118
 [13768.745276]  [c022af6e] run_timer_softirq+0x37/0x16b
 [13768.750517]  [c04a1c0f] dst_run_gc+0x0/0x118
 [13768.755061]  [c04a3eab] neigh_destroy+0xbe/0x104
 [13768.759955]  [c04a1bb1] dst_destroy+0x4d/0xab
 [13768.764586]  [c04a1c64] dst_run_gc+0x55/0x118
 [13768.769218]  [c022b03f] run_timer_softirq+0x108/0x16b
 [13768.774542]  [c0227634] __do_softirq+0x5a/0xd5
 [13768.779261]  [c023b435] trace_hardirqs_on+0x106/0x141
 [13768.784588]  [c0227643] __do_softirq+0x69/0xd5
 [13768.789308]  [c02276e6] do_softirq+0x37/0x4d
 [13768.793851]  [c02167b0] smp_apic_timer_interrupt+0x6b/0x77
 [13768.799609]  [c02029ef] default_idle+0x3b/0x54
 [13768.804326]  [c02029ef] default_idle+0x3b/0x54
 [13768.809054]  [c0204c33] apic_timer_interrupt+0x33/0x38
 [13768.814471]  [c02029ef] default_idle+0x3b/0x54
 [13768.819187]  [c02029f1] default_idle+0x3d/0x54
 [13768.823903]  [c0202aaa] cpu_idle+0xa2/0xbb
 [13768.828279]  ===
 
 
 -- 
 MST
 -
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/


Re: lockdep question (was Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!)

2007-03-11 Thread Michael S. Tsirkin

After adding some printks, I started getting these:

[  597.036720] BUG: MAX_STACK_TRACE_ENTRIES too low!
[  597.041546] turning off the locking correctness validator.
[  597.047135]  [c023a922] save_trace+0x8a/0x8f
[  597.051751]  [c023ae8c] mark_lock+0x65/0x3ff
[  597.056366]  [c023a8d6] save_trace+0x3e/0x8f
[  597.060980]  [c023a9f0] add_lock_to_list+0x62/0x85
[  597.066116]  [c023b992] __lock_acquire+0x3f4/0xbb9
[  597.071252]  [f89da11f] send_mad+0x79/0x103 [ib_sa]
[  597.076474]  [c031a475] idr_get_new_above_int+0x13c/0x216
[  597.082225]  [c023c43d] lock_acquire+0x56/0x71
[  597.087018]  [f89da11f] send_mad+0x79/0x103 [ib_sa]
[  597.092240]  [c051ad41] _spin_lock_irqsave+0x32/0x41
[  597.097547]  [f89da11f] send_mad+0x79/0x103 [ib_sa]
[  597.102770]  [f89da11f] send_mad+0x79/0x103 [ib_sa]
[  597.107989]  [f89da8d9] ib_sa_path_rec_get+0x134/0x172 [ib_sa]
[  597.114166]  [f899b73f] path_rec_start+0x115/0x143 [ib_ipoib]
[  597.120254]  [f899cb38] path_rec_completion+0x0/0x4f4 [ib_ipoib]
[  597.126610]  [f899b874] path_rec_create+0x77/0x9d [ib_ipoib]
[  597.132617]  [f899c9fe] ipoib_start_xmit+0x441/0x57b [ib_ipoib]
[  597.13]  [c051ae06] _spin_unlock_irqrestore+0x34/0x39
[  597.144635]  [c023b435] trace_hardirqs_on+0x106/0x141
[  597.150035]  [c04a058b] dev_queue_xmit+0x109/0x245
[  597.155167]  [c022ae27] __mod_timer+0x94/0x9e
[  597.159871]  [c04a0423] dev_hard_start_xmit+0x1be/0x21d
[  597.165438]  [c04a9fa9] __qdisc_run+0xd7/0x190
[  597.170226]  [c04a05b7] dev_queue_xmit+0x135/0x245
[  597.175360]  [c04ce267] arp_process+0x2c0/0x512
[  597.180234]  [f8954346] mthca_tavor_interrupt+0xf3/0x12b [ib_mthca]
[  597.186855]  [c04a088b] netif_receive_skb+0x1c4/0x1da
[  597.192254]  [c023b435] trace_hardirqs_on+0x106/0x141
[  597.197648]  [c04a0935] process_backlog+0x94/0x107
[  597.202785]  [c049f02b] net_rx_action+0x9a/0x15e
[  597.207743]  [c0227643] __do_softirq+0x69/0xd5
[  597.212530]  [c02276e6] do_softirq+0x37/0x4d
[  597.217147]  [c020617e] do_IRQ+0x5c/0x72
[  597.221415]  [c0204b52] common_interrupt+0x2e/0x34
[  597.226549]  [c02029ef] default_idle+0x3b/0x54
[  597.231337]  [c02029f1] default_idle+0x3d/0x54
[  597.236124]  [c0202aaa] cpu_idle+0xa2/0xbb
[  597.240567]  ===

And sometimes these:

[  404.493572] KERNEL: assertion (!timer_pending(&dev->watchdog_timer)) failed at net/sched/sch_generic.c (608)

-- 
MST


Re: lockdep question (was Re: IPoIB caused a kernel: BUG: softlockup detected on CPU#0!)

2007-03-11 Thread Michael S. Tsirkin
Quoting Peter Zijlstra [EMAIL PROTECTED]:
Subject: Re: lockdep question (was Re: IPoIB caused a kernel: BUG: softlockup 
detected on CPU#0!)

 On Sun, 2007-03-11 at 15:50 +0200, Michael S. Tsirkin wrote:
   Quoting Roland Dreier [EMAIL PROTECTED]:
   Subject: Re: IPoIB caused a kernel: BUG: soft lockup detected on CPU#0!
   
   Feb 27 17:47:52 sw169 kernel:  [8053aaf1] 
   _spin_lock_irqsave+0x15/0x24
   Feb 27 17:47:52 sw169 kernel:  [88067a23] 
   :ib_ipoib:ipoib_neigh_destructor+0xc2/0x139
   
   It looks like this is deadlocking trying to take priv->lock in 
   ipoib_neigh_destructor().
   One idea I just had would be to build a kernel with CONFIG_PROVE_LOCKING 
   turned on, and then rerun this test.  There's a good chance that this 
   would
   diagnose the deadlock.  (I don't have good access to my test machines 
   right now, or
   else I would do it myself)
  
  OK, I did that. But I get
  [13440.761857] INFO: trying to register non-static key.
  [13440.766903] the code is fine but needs lockdep annotation.
  [13440.772455] turning off the locking correctness validator.
  and I am not sure what triggers this, or how to fix it to have the
  validator actually do its job.
 
 It usually indicates a spinlock that is not properly initialized, for
 example __SPIN_LOCK_UNLOCKED() used in a non-static context; use
 spin_lock_init() in these cases.
 
 However, looking at the code, ipoib_neigh_destructor() only uses
 priv->lock, and that seems to get properly initialized in ipoib_setup()
 using spin_lock_init().
 
 So either there are other sites that instantiate those objects and
 forget about the lock init, or the object is corrupted (use after free?)

OK, thanks for the hint. So I added this:

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index f9dbc6f..2eea467 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -821,8 +821,15 @@ static void ipoib_neigh_destructor(struct neighbour *n)
 	unsigned long flags;
 	struct ipoib_ah *ah = NULL;
 
+	if (n->dev->type != ARPHRD_INFINIBAND) {
+		printk(KERN_ERR "ipoib_neigh_destructor lock %p wrong type %d !!\n",
+		       &priv->lock, n->dev->type);
+		BUG_ON(n->dev->type != ARPHRD_INFINIBAND);
+		return;
+	}
+
 	ipoib_dbg(priv,
 		  "neigh_destructor for %06x " IPOIB_GID_FMT "\n",
 		  IPOIB_QPN(n->ha),
 		  IPOIB_GID_RAW_ARG(n->ha + 4));
 
And sure enough it triggers:

[  858.503010] ipoib_neigh_destructor lock c0687880 wrong type 772 !!
[  858.510036] [ cut here ]
[  858.514723] kernel BUG at drivers/infiniband/ulp/ipoib/ipoib_main.c:827!
[  858.521486] invalid opcode:  [#1]
[  858.525212] SMP
[  858.527173] Modules linked in: rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_sa 
ib_uverbs ibv
[  858.538736] CPU:0
[  858.538737] EIP:0060:[f899bfa5]Not tainted VLI
[  858.538738] EFLAGS: 00010206   (2.6.21-rc3-i686-dbg #4)
[  858.551755] EIP is at ipoib_neigh_destructor+0x40/0x178 [ib_ipoib]
[  858.557996] eax: c0687300   ebx: f240e880   ecx: c0223114   edx: c064f280
[  858.564851] esi: f240e880   edi: f240e880   ebp: c0687880   esp: c06c7e9c
[  858.571702] ds: 007b   es: 007b   fs: 00d8  gs:   ss: 0068
[  858.577602] Process swapper (pid: 0, ti=c06c6000 task=c064f280 
task.ti=c06c6000)
[  858.584883] Stack: f89a37be c0687880 0304 c022af6e c064f280  
 
[  858.593573] c06a2554  c064f280 0001  
c064f280 
[  858.602259]c0860be0 c2a1fba0 0246 c06a2554 f240e880  
f240e880 c04a
[  858.610946] Call Trace:
[  858.613723]  [c022af6e] run_timer_softirq+0x37/0x16b
[  858.618959]  [c04a1c0f] dst_run_gc+0x0/0x118
[  858.623498]  [c04a3eab] neigh_destroy+0xbe/0x104
[  858.628382]  [c04a1bb1] dst_destroy+0x4d/0xab
[  858.632998]  [c04a1c64] dst_run_gc+0x55/0x118
[  858.637620]  [c022b03f] run_timer_softirq+0x108/0x16b
[  858.642934]  [c0227634] __do_softirq+0x5a/0xd5
[  858.647648]  [c023b435] trace_hardirqs_on+0x106/0x141
[  858.652970]  [c0227643] __do_softirq+0x69/0xd5
[  858.657677]  [c02276e6] do_softirq+0x37/0x4d
[  858.662210]  [c02167b0] smp_apic_timer_interrupt+0x6b/0x77
[  858.667965]  [c02029ef] default_idle+0x3b/0x54
[  858.672681]  [c02029ef] default_idle+0x3b/0x54
[  858.677391]  [c0204c33] apic_timer_interrupt+0x33/0x38
[  858.682796]  [c02029ef] default_idle+0x3b/0x54
[  858.687505]  [c02029f1] default_idle+0x3d/0x54
[  858.692211]  [c0202aaa] cpu_idle+0xa2/0xbb
[  858.696569]  [c06cd7c3] start_kernel+0x40b/0x413
[  858.701453]  [c06cd1b3] unknown_bootoption+0x0/0x205
[  858.706678]  ===
[  858.710321] Code: 66 83 f8 20 74 29 0f b7 c0 89 44 24 08 89 6c 24 04 c7 04 
24 be 37 9a
[  858.730997] EIP: [f899bfa5] ipoib_neigh_destructor+0x40/0x178 [ib_ipoib] 
SS:ESP 0068c
[  858.740271] Kernel 

[PATCH v5] Fix rmmod/read/write races in /proc entries

2007-03-11 Thread Alexey Dobriyan
Differences from version 4:
    Updated in-code comments. Largely rewritten changelog.
    Lockdep please. --akpm
    ->read_proc, ->write_proc aren't special. Extend protection to
    most methods for regular /proc files. Mentioned by viro.
Differences from version 3:
    Use completion instead of unlock/schedule/lock
    Move refcount waiting business after removing PDE from lists,
    so that *cough* possible concurrent remove_proc_entry() will
    work.

Fix following races:
===
1. Write via ->write_proc sleeps in copy_from_user(). Module disappears
   meanwhile. Or, more generically, system call done on /proc file, method
   supplied by module is called, module disappears meanwhile.

   pde = create_proc_entry()
   if (!pde)
        return -ENOMEM;
   pde->write_proc = ...
                                open
                                write
                                copy_from_user
   pde = create_proc_entry();
   if (!pde) {
        remove_proc_entry();
        return -ENOMEM;
        /* module unloaded */
   }
                                *boom*
==
2. bogo-revoke aka proc_kill_inodes()

  remove_proc_entry             vfs_read
  proc_kill_inodes              [check ->f_op validness]
                                [check ->f_op->read validness]
                                [verify_area, security permissions checks]
        ->f_op = NULL;
                                if (file->f_op->read)
                                        /* ->f_op dereference, boom */

NOTE, NOTE, NOTE: file_operations are proxied for regular files only. Let's
see how this scheme behaves, then extend if needed for directories.
Directory creators in /proc only set ->owner for them, so proxying for
directories may be unneeded.

NOTE, NOTE, NOTE: methods being proxied are ->llseek, ->read, ->write,
->poll, ->unlocked_ioctl, ->ioctl, ->compat_ioctl, ->open, ->release.
If your in-tree module uses something else, yell at me. Full audit pending.
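
The scheme described above (refuse new callers, then wait for the ones
already inside the module) can be modelled in userspace. What follows is a
rough sketch and not the patch itself: struct entry, entry_call(),
entry_remove() and twice() are hypothetical names, and a pthread mutex plus
condition variable stand in for pde_unload_lock and the completion.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a /proc entry; not kernel code. */
struct entry {
	pthread_mutex_t lock;    /* plays the role of pde_unload_lock */
	pthread_cond_t drained;  /* plays the role of the completion */
	int users;               /* plays the role of pde_users */
	int (*op)(int);          /* plays the role of a ->proc_fops method */
};

static void entry_init(struct entry *e, int (*op)(int))
{
	pthread_mutex_init(&e->lock, NULL);
	pthread_cond_init(&e->drained, NULL);
	e->users = 0;
	e->op = op;
}

/* Call the method unless the entry was removed; false means "gone". */
static bool entry_call(struct entry *e, int arg, int *result)
{
	int (*op)(int);

	pthread_mutex_lock(&e->lock);
	op = e->op;
	if (!op) {
		pthread_mutex_unlock(&e->lock);
		return false;
	}
	e->users++;                 /* the remover must now wait for us */
	pthread_mutex_unlock(&e->lock);

	*result = op(arg);          /* may block; the lock is not held */

	pthread_mutex_lock(&e->lock);
	if (--e->users == 0 && !e->op)
		pthread_cond_broadcast(&e->drained);
	pthread_mutex_unlock(&e->lock);
	return true;
}

/* Stop new callers, then wait until in-flight callers are done. */
static void entry_remove(struct entry *e)
{
	pthread_mutex_lock(&e->lock);
	e->op = NULL;
	while (e->users > 0)
		pthread_cond_wait(&e->drained, &e->lock);
	pthread_mutex_unlock(&e->lock);
}

/* Example method, used only for illustration. */
static int twice(int x)
{
	return 2 * x;
}
```

Once entry_remove() returns, no caller is inside op() and none can enter,
which is the property the module unload path needs.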

Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
---

 fs/proc/generic.c   |   32 +
 fs/proc/inode.c |  279 +++-
 include/linux/proc_fs.h |   13 ++
 3 files changed, 321 insertions(+), 3 deletions(-)

--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -20,6 +20,7 @@ #include <linux/idr.h>
 #include <linux/namei.h>
 #include <linux/bitops.h>
 #include <linux/spinlock.h>
+#include <linux/completion.h>
 #include <asm/uaccess.h>
 
 #include "internal.h"
@@ -613,6 +614,9 @@ static struct proc_dir_entry *proc_creat
 	ent->namelen = len;
 	ent->mode = mode;
 	ent->nlink = nlink;
+	ent->pde_users = 0;
+	spin_lock_init(&ent->pde_unload_lock);
+	ent->pde_unload_completion = NULL;
  out:
 	return ent;
 }
@@ -734,9 +738,35 @@ void remove_proc_entry(const char *name,
 		de = *p;
 		*p = de->next;
 		de->next = NULL;
+
+		spin_lock(&de->pde_unload_lock);
+		/*
+		 * Stop accepting new callers into module. If you're
+		 * dynamically allocating ->proc_fops, save a pointer somewhere.
+		 */
+		de->proc_fops = NULL;
+		/* Wait until all existing callers into module are done. */
+		if (de->pde_users > 0) {
+			DECLARE_COMPLETION_ONSTACK(c);
+
+			if (!de->pde_unload_completion)
+				de->pde_unload_completion = &c;
+
+			spin_unlock(&de->pde_unload_lock);
+			spin_unlock(&proc_subdir_lock);
+
+			wait_for_completion(de->pde_unload_completion);
+
+			spin_lock(&proc_subdir_lock);
+			goto continue_removing;
+		}
+		spin_unlock(&de->pde_unload_lock);
+
+continue_removing:
 		if (S_ISDIR(de->mode))
 			parent->nlink--;
-		proc_kill_inodes(de);
+		if (!S_ISREG(de->mode))
+			proc_kill_inodes(de);
 		de->nlink = 0;
 		WARN_ON(de->subdir);
 		if (!atomic_read(&de->count))
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -142,6 +142,277 @@ static const struct super_operations pro
 	.remount_fs = proc_remount,
 };
 
+static loff_t proc_reg_llseek(struct file *file, loff_t offset, int whence)
+{
+	struct proc_dir_entry *pde = PDE(file->f_path.dentry->d_inode);
+	loff_t rv = -EINVAL;
+	loff_t (*llseek)(struct file *, loff_t, int);
+
+	spin_lock(&pde->pde_unload_lock);
+	/*
+	 * remove_proc_entry() is going to delete PDE (as part of module
+	 * cleanup sequence). No new callers into module allowed.
+	 */
+	if (!pde->proc_fops)
+		goto out_unlock;
+	/*
+	 * Bump refcount so that remove_proc_entry will wait for ->llseek to
+   

Re: SATA resume slowness, e1000 MSI warning

2007-03-11 Thread Michael S. Tsirkin
 Rumor has it that some pci devices can't tolerate < 32bit accesses.
 Although I have never met one.

hopefully not bridge devices?

 The two factors together suggest that
 for generic code it probably makes sense to operate on 32bit
 quantities, and just to ignore the read-only portion.

The code for regular devices seems to use 16-bit accesses, so
I think it's best to stay consistent. Or do you want to change this too?

-- 
MST


Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler

2007-03-11 Thread Al Boldi
Al Boldi wrote:
 BTW, another way to show these hiccups would be through some kind of a
 cpu/proc timing-tracer.  Do we have something like that?

Here is something like a tracer.

Original idea by Chris Friesen, thanks, from this post:
http://marc.theaimsgroup.com/?l=linux-kernel&m=117331003029329&w=4

Try attached chew.c like this:
Boot into /bin/sh.
Run chew in one console.
Run nice chew in another console.
Watch timings.

Console 1: ./chew
pid 655, prio   0, out for    6 ms
pid 655, prio   0, out for    6 ms
pid 655, prio   0, out for    6 ms
pid 655, prio   0, out for    5 ms
pid 655, prio   0, out for    6 ms
pid 655, prio   0, out for    6 ms
pid 655, prio   0, out for    5 ms
pid 655, prio   0, out for    5 ms
pid 655, prio   0, out for    5 ms
pid 655, prio   0, out for    6 ms
pid 655, prio   0, out for    5 ms
pid 655, prio   0, out for    5 ms
pid 655, prio   0, out for    5 ms
pid 655, prio   0, out for    6 ms
pid 655, prio   0, out for    5 ms
pid 655, prio   0, out for    6 ms
pid 655, prio   0, out for    6 ms
pid 655, prio   0, out for    6 ms
pid 655, prio   0, out for    5 ms
pid 655, prio   0, out for    6 ms
pid 655, prio   0, out for    6 ms
pid 655, prio   0, out for    5 ms

Console 2: nice -10 ./chew
pid 669, prio  10, out for    6 ms
pid 669, prio  10, out for    6 ms
pid 669, prio  10, out for    6 ms
pid 669, prio  10, out for    5 ms
pid 669, prio  10, out for   65 ms
pid 669, prio  10, out for    6 ms
pid 669, prio  10, out for    6 ms
pid 669, prio  10, out for    6 ms
pid 669, prio  10, out for    5 ms
pid 669, prio  10, out for    6 ms
pid 669, prio  10, out for    6 ms
pid 669, prio  10, out for    5 ms
pid 669, prio  10, out for    6 ms
pid 669, prio  10, out for    6 ms
pid 669, prio  10, out for   65 ms
pid 669, prio  10, out for    6 ms
pid 669, prio  10, out for    6 ms
pid 669, prio  10, out for    6 ms
pid 669, prio  10, out for    6 ms
pid 669, prio  10, out for    6 ms

Console 2: nice -15 ./chew
pid 673, prio  15, out for    5 ms
pid 673, prio  15, out for    6 ms
pid 673, prio  15, out for   95 ms
pid 673, prio  15, out for    5 ms
pid 673, prio  15, out for    5 ms
pid 673, prio  15, out for    6 ms
pid 673, prio  15, out for    5 ms
pid 673, prio  15, out for   95 ms
pid 673, prio  15, out for    5 ms
pid 673, prio  15, out for    5 ms
pid 673, prio  15, out for    6 ms
pid 673, prio  15, out for    6 ms
pid 673, prio  15, out for   95 ms
pid 673, prio  15, out for    6 ms
pid 673, prio  15, out for    6 ms
pid 673, prio  15, out for    6 ms
pid 673, prio  15, out for    6 ms
pid 673, prio  15, out for   95 ms
pid 673, prio  15, out for    5 ms
pid 673, prio  15, out for    5 ms
pid 673, prio  15, out for    6 ms
pid 673, prio  15, out for    5 ms

Console 2: nice -18 ./chew
pid 677, prio  18, out for  113 ms
pid 677, prio  18, out for    6 ms
pid 677, prio  18, out for  113 ms
pid 677, prio  18, out for    6 ms
pid 677, prio  18, out for  113 ms
pid 677, prio  18, out for    5 ms
pid 677, prio  18, out for  113 ms
pid 677, prio  18, out for    6 ms
pid 677, prio  18, out for  113 ms
pid 677, prio  18, out for    5 ms
pid 677, prio  18, out for  113 ms
pid 677, prio  18, out for    5 ms
pid 677, prio  18, out for  113 ms
pid 677, prio  18, out for    5 ms
pid 677, prio  18, out for  113 ms
pid 677, prio  18, out for    6 ms
pid 677, prio  18, out for  113 ms
pid 677, prio  18, out for    6 ms
pid 677, prio  18, out for  113 ms
pid 677, prio  18, out for    5 ms
pid 677, prio  18, out for  113 ms
pid 677, prio  18, out for    5 ms

Console 2: nice -19 ./chew
pid 679, prio  19, out for  119 ms
pid 679, prio  19, out for  119 ms
pid 679, prio  19, out for  119 ms
pid 679, prio  19, out for  119 ms
pid 679, prio  19, out for  119 ms
pid 679, prio  19, out for  119 ms
pid 679, prio  19, out for  119 ms
pid 679, prio  19, out for  119 ms
pid 679, prio  19, out for  119 ms
pid 679, prio  19, out for  119 ms
pid 679, prio  19, out for  119 ms
pid 679, prio  19, out for  119 ms
pid 679, prio  19, out for  119 ms
pid 679, prio  19, out for  119 ms
pid 679, prio  19, out for  119 ms
pid 679, prio  19, out for  119 ms
pid 679, prio  19, out for  119 ms
pid 679, prio  19, out for  119 ms
pid 679, prio  19, out for  119 ms
pid 679, prio  19, out for  119 ms
pid 679, prio  19, out for  119 ms
pid 679, prio  19, out for  119 ms

Now with negative nice:
Console 1: ./chew
pid 674, prio   0, out for    6 ms
pid 674, prio   0, out for  125 ms
pid 674, prio   0, out for    6 ms
pid 674, prio   0, out for    5 ms
pid 674, prio   0, out for    6 ms
pid 674, prio   0, out for    6 ms
pid 674, prio   0, out for    5 ms
pid 674, prio   0, out for    6 ms
pid 674, prio   0, out for    6 ms
pid 674, prio   0, out for    6 ms
pid 674, prio   0, out for    5 ms
pid 674, prio   0, out for    5 ms
pid 674, prio   0, out for    5 ms
pid 674, prio   0, out for    5 ms
pid 674, prio   0, out for    5 ms
pid 674, prio   0, out for    5 ms
pid 674, prio   0, out for    5 ms
pid 674, prio   0, out 

SwSusp to disk doesn't work - Try 2

2007-03-11 Thread Thomas Meyer

Suspend to disk doesn't work on my laptop.

The suspend seems to hang while enabling the non-boot cpus again.

With platform = test and state = disk I get this:

[cut]
acpi device:02: freeze
video video:00: freeze
acpi device:01: freeze
acpi PNP0C02:00: freeze
pci_root PNP0A08:00: freeze
button PNP0C0E:00: freeze
button PNP0C0C:00: freeze
acpi APP0002:00: freeze
button PNP0C0D:00: freeze
ac ACPI0003:00: freeze
acpi device:00: freeze
processor ACPI0007:01: freeze
processor ACPI0007:00: freeze
button button_power:00: freeze
acpi acpi_system:00: freeze
Disabling non-boot CPUs ...
CPU 1 is now offline
SMP alternatives: switching to UP code
PM: Removing info for No Bus:cpu1
PM: Removing info for No Bus:msr1
CPU1 is down
swsusp debug: Waiting for 5 seconds.
Enabling non-boot CPUs ...

 Here the process hangs. But a fortunate coincidence showed me that 
an acpi event continues the process (pressing the power off button a few 
times... (2x - 4x)  ).


SMP alternatives: switching to SMP code
Booting processor 1/1 eip 3000
CPU 1 irqstacks, hard=c0389000 soft=c0387000
Initializing CPU#1
Calibrating delay using timer specific routine.. 3663.73 BogoMIPS 
(lpj=6103576)
CPU: After generic identify, caps: bfe9fbff 0010   
c1a9  

monitor/mwait feature present.
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 2048K
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 1
CPU: After all inits, caps: bfe9fbff 0010  2940 c1a9 
 

Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#1.
CPU1: Intel Genuine Intel(R) CPU   T2400  @ 1.83GHz stepping 08
PM: Adding info for No Bus:cpu1
PM: Adding info for No Bus:msr1
CPU1 is up
acpi acpi_system:00: resuming
button button_power:00: resuming
processor ACPI0007:00: resuming
processor ACPI0007:01: resuming
acpi device:00: resuming
ac ACPI0003:00: resuming
button PNP0C0D:00: resuming
acpi APP0002:00: resuming
button PNP0C0C:00: resuming
button PNP0C0E:00: resuming
pci_root PNP0A08:00: resuming


Any ideas?

The same is true for disk = platform.

With kind regards
thomas





Re: [patch 2/9] signalfd/timerfd - signalfd core ...

2007-03-11 Thread Davide Libenzi
On Sun, 11 Mar 2007, Oleg Nesterov wrote:

 On 03/10, Davide Libenzi wrote:
 
  +static void signalfd_put_sighand(struct signalfd_ctx *ctx,
  +struct sighand_struct *sighand,
  +unsigned long *flags)
  +{
  +   unlock_task_sighand(ctx->tsk, flags);
  +}
 
 Note that signalfd_put_sighand() doesn't need sighand parameter, please
 see below.

I want it to return the sighand, and for symmetry I prefer the put to be 
passed the parameter back too, even if it is not used.



  +int signalfd_deliver(struct sighand_struct *sighand, int sig,
  +struct siginfo *info)
  +{
  +   int nsig = 0;
  +   struct signalfd_ctx *ctx, *tmp;
  +
  +   list_for_each_entry_safe(ctx, tmp, &sighand->sfdlist, lnk) {
  +           /*
  +            * We use a negative signal value as a way to broadcast that the
  +            * sighand has been orphaned, so that we can notify all the
  +            * listeners about this. Remember the ctx->sigmask is inverted,
  +            * so if the user is interested in a signal, that corresponding
  +            * bit will be zero.
  +            */
  +           if (sig < 0)
  +                   list_del_init(&ctx->lnk);
 
 I'm afraid this is not right. This should be per-thread.
 
 Suppose we have threads T1 and T2 from the same thread group. sighand->sfdlist
 contains ctx1 and ctx2 linked to T1 and T2. Now, T1 exits, __exit_signal()
 does signalfd_notify(sighand, -1), and unlinks all threads, not just T1.
 
 IOW, we should do
 
   if (ctx->tsk == current) {
           list_del_init(&ctx->lnk);
           wake_up(&ctx->wqh);
   }

Yes, of course. Dunno why the change got lost.
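
The per-thread unlink discussed here can be illustrated with a small
userspace model. It is only a sketch under stated assumptions: struct
listener and detach_for_thread() are hypothetical names, a plain array
stands in for sighand->sfdlist, and an integer thread id stands in for
ctx->tsk.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one signalfd context on the sighand list. */
struct listener {
	int owner_tid;   /* stands in for ctx->tsk */
	bool linked;     /* still on the list? */
	bool woken;      /* stands in for wake_up(&ctx->wqh) */
};

/*
 * Unlink and wake only the listeners that belong to the exiting thread.
 * Listeners owned by other threads of the group stay linked.
 * Returns how many listeners were detached.
 */
static size_t detach_for_thread(struct listener *ls, size_t n,
				int exiting_tid)
{
	size_t detached = 0;

	for (size_t i = 0; i < n; i++) {
		if (!ls[i].linked || ls[i].owner_tid != exiting_tid)
			continue;
		ls[i].linked = false;
		ls[i].woken = true;
		detached++;
	}
	return detached;
}
```

With two listeners owned by T1 and T2, detaching for T1 leaves the T2
listener linked, which is the behaviour asked for above.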



 Perhaps it makes sense to not re-use signalfd_deliver(), but introduce
 a new signalfd_xxx(sighand, tsk) helper for de_thread/exit_signal.
 
 Btw, signalfd_deliver() doesn't use info parameter.
 
  +   if (sig < 0 || !sigismember(&ctx->sigmask, sig)) {
  +           wake_up(&ctx->wqh);
 
 Minor nit. Perhaps it makes sense to do
 
   void signalfd_deliver(struct task_struct *tsk, int sig,
                         struct sigpending *pending)
   {
           struct sighand_struct *sighand = tsk->sighand;
           int private = (&tsk->pending == pending);
 
           list_for_each_entry_safe(ctx, tmp, &sighand->sfdlist, lnk) {
                   if (private && ctx->tsk != tsk)
                           continue;
                   if (!sigismember(&ctx->sigmask, sig))
                           wake_up(&ctx->wqh);
           }
   }
 
 Even better: signalfd_deliver(struct task_struct *tsk, int sig, int private).
 This way specific_send_sig_info/send_sigqueue won't do a false wakeup.

I agree with the latter.



  +asmlinkage long sys_signalfd(int ufd, sigset_t __user *user_mask, size_t sizemask)
  +{
  ...
  +   if ((sighand = signalfd_get_sighand(ctx, &flags)) != NULL) {
  +           ctx->sigmask = sigmask;
  +           signalfd_put_sighand(ctx, sighand, &flags);
  +   }
 
 This looks like unneeded complication to me, I'd suggest
 
   if (signalfd_get_sighand(ctx, &flags)) {
           ctx->sigmask = sigmask;
           signalfd_put_sighand(ctx, &flags);
   }
 
 unlock_task_sighand() (and thus signalfd_put_sighand) doesn't need sighand
 parameter. signalfd_get_sighand() is in fact boolean. It makes sense to return
 sighand, it may be useful, but this patch only needs != NULL.
 
 Every usage of signalfd_get_sighand() could be simplified accordingly.

As I said before, I prefer that way.


  +* Tell all the sighand listeners that this sighand has
  +* been detached. Needs to be called with the sighand lock
  +* held.
  +*/
  +   if (unlikely(!list_empty(&oldsighand->sfdlist))) {
  +           spin_lock_irq(&oldsighand->siglock);
  +           signalfd_notify(oldsighand, -1, NULL);
  +           spin_unlock_irq(&oldsighand->siglock);
  +   }
 
 Very minor nit. I'd suggest to make a new helper and put it in signalfd.h
 (like signalfd_notify()). This will help CONFIG_SIGNALFD.

Yes, makes sense.



- Davide




Re: SwSusp to disk doesn't work - Try 2

2007-03-11 Thread Rafael J. Wysocki
On Sunday, 11 March 2007 19:08, Thomas Meyer wrote:
 Suspend to disk doesn't work on my laptop.
 
 The suspend seems to hang while enabling the non-boot cpus again.
 
 with platform = test and state = disk i get this:
 
 [cut]
 acpi device:02: freeze
 video video:00: freeze
 acpi device:01: freeze
 acpi PNP0C02:00: freeze
 pci_root PNP0A08:00: freeze
 button PNP0C0E:00: freeze
 button PNP0C0C:00: freeze
 acpi APP0002:00: freeze
 button PNP0C0D:00: freeze
 ac ACPI0003:00: freeze
 acpi device:00: freeze
 processor ACPI0007:01: freeze
 processor ACPI0007:00: freeze
 button button_power:00: freeze
 acpi acpi_system:00: freeze
 Disabling non-boot CPUs ...
 CPU 1 is now offline
 SMP alternatives: switching to UP code
 PM: Removing info for No Bus:cpu1
 PM: Removing info for No Bus:msr1
 CPU1 is down
 swsusp debug: Waiting for 5 seconds.
 Enabling non-boot CPUs ...
 
  Here the process hangs. But a fortunate coincidence showed me that 
 an acpi event continues the process (pressing the power off button a few 
 times... (2x - 4x)  ).

Hm, interesting.

 SMP alternatives: switching to SMP code
 Booting processor 1/1 eip 3000
 CPU 1 irqstacks, hard=c0389000 soft=c0387000
 Initializing CPU#1
 Calibrating delay using timer specific routine.. 3663.73 BogoMIPS 
 (lpj=6103576)
 CPU: After generic identify, caps: bfe9fbff 0010   
 c1a9  
 monitor/mwait feature present.
 CPU: L1 I cache: 32K, L1 D cache: 32K
 CPU: L2 cache: 2048K
 CPU: Physical Processor ID: 0
 CPU: Processor Core ID: 1
 CPU: After all inits, caps: bfe9fbff 0010  2940 c1a9 
  
 Intel machine check architecture supported.
 Intel machine check reporting enabled on CPU#1.
 CPU1: Intel Genuine Intel(R) CPU   T2400  @ 1.83GHz stepping 08
 PM: Adding info for No Bus:cpu1
 PM: Adding info for No Bus:msr1
 CPU1 is up
 acpi acpi_system:00: resuming
 button button_power:00: resuming
 processor ACPI0007:00: resuming
 processor ACPI0007:01: resuming
 acpi device:00: resuming
 ac ACPI0003:00: resuming
 button PNP0C0D:00: resuming
 acpi APP0002:00: resuming
 button PNP0C0C:00: resuming
 button PNP0C0E:00: resuming
 pci_root PNP0A08:00: resuming
 
 
 Any ideas?

Could you please put some printk()s in kernel/cpu.c:_cpu_up() to see where
it gets stuck?  I bet one of the notifiers goes to sleep (cpufreq, maybe).

Greetings,
Rafael
-- 
If you don't have the time to read,
you don't have the time or the tools to write.
- Stephen King


[WATCHDOG] i8xx_tco - mark for removal patch

2007-03-11 Thread Wim Van Sebroeck
Hi all,

I'm planning to remove the i8xx_tco watchdog driver
(since we now have the iTCO_wdt driver that has a broader scope).

If no-one objects I will send the below patch to Linus for inclusion.
(it adds the driver to the feature-removal-schedule list and defaults
CONFIG_I8XX_TCO to n).

Thanks,
Wim.


diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index c3b1430..0bc8b0b 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -316,3 +316,11 @@ Why:   The option/code is
 Who:   Johannes Berg [EMAIL PROTECTED]
 
 ---
+
+What:  i8xx_tco watchdog driver
+When:  in 2.6.22
+Why:   the i8xx_tco watchdog driver has been replaced by the iTCO_wdt
+   watchdog driver.
+Who:   Wim Van Sebroeck [EMAIL PROTECTED]
+
+---
diff --git a/drivers/char/watchdog/Kconfig b/drivers/char/watchdog/Kconfig
index ea09d0c..e812aa1 100644
--- a/drivers/char/watchdog/Kconfig
+++ b/drivers/char/watchdog/Kconfig
@@ -301,6 +301,7 @@ config I6300ESB_WDT
 config I8XX_TCO
 	tristate "Intel i8xx TCO Timer/Watchdog"
 	depends on WATCHDOG && (X86 || IA64) && PCI
+	default n
 	---help---
 	  Hardware driver for the TCO timer built into the Intel 82801
 	  I/O Controller Hub family.  The TCO (Total Cost of Ownership)


Re: [patch 6/9] signalfd/timerfd - timerfd core ...

2007-03-11 Thread Davide Libenzi
On Sun, 11 Mar 2007, Thomas Gleixner wrote:

> Davide,
>
> On Sat, 2007-03-10 at 18:22 -0800, Davide Libenzi wrote:
>
> Some remarks:
>
> > +
> > +asmlinkage long sys_timerfd(int ufd, int clockid, int tmrtype,
> > +			    const struct timespec __user *utmr)
> > +{
> > +	int error;
> > +	struct timerfd_ctx *ctx;
> > +	struct file *file;
> > +	struct inode *inode;
> > +	ktime_t tval, tnow;
> > +	struct timespec ktmr, tmrnow;
> > +
> > +	error = -EFAULT;
> > +	if (copy_from_user(&ktmr, utmr, sizeof(ktmr)))
> > +		goto err_exit;
>
> Please do not use goto for a simple
> 	return -EFAULT;
>
> Please validate the timespec before converting it.
>
> 	if (!timespec_valid(&ktmr))
> 		return -EINVAL;

Ack.
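
For readers following along outside the kernel tree, the check Thomas asks for can be sketched in userspace C. This only approximates what the kernel's timespec_valid() tests (non-negative seconds, nanoseconds within one second); the helper name timespec_is_valid and the constant NSEC_PER_SEC_SKETCH are made up for this illustration and are not part of the patch.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical name; stands in for the kernel's NSEC_PER_SEC. */
#define NSEC_PER_SEC_SKETCH 1000000000L

/*
 * Userspace approximation of a timespec sanity check: reject negative
 * seconds and a nanosecond field outside [0, 1 second).
 */
static bool timespec_is_valid(const struct timespec *ts)
{
	if (ts->tv_sec < 0)
		return false;
	if (ts->tv_nsec < 0 || ts->tv_nsec >= NSEC_PER_SEC_SKETCH)
		return false;
	return true;
}
```

Doing the check right after the copy from userspace means a bad value is rejected with -EINVAL before it is ever converted to a ktime_t.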


> > +	tval = timespec_to_ktime(ktmr);
> > +	error = -EINVAL;
> > +	if (clockid != CLOCK_MONOTONIC &&
> > +	    clockid != CLOCK_REALTIME)
> > +		goto err_exit;
> > +	switch (tmrtype) {
> > +	case TFD_TIMER_REL:
> > +	case TFD_TIMER_SEQ:
> > +		break;
> > +	case TFD_TIMER_ABS:
> > +		getnstimeofday(&tmrnow);
> > +		tnow = timespec_to_ktime(tmrnow);
>
> 	tnow = ktime_get();

Ok, I think this is the weird function that is declared static, whose 
symbol is exported, but is not declared in any .h file :)
I used that before, because I saw it inside the hrtimer.c file, but then 
gcc was puking on me, and I noticed the weirdness.



> > +		if (ktime_to_ns(tval) <= ktime_to_ns(tnow))
> > +			goto err_exit;
> > +		tval = ktime_sub(tval, tnow);
>
> Why do you want to do that ? hrtimers handle relative and absolute
> expiry times. You break down everything to relative time and lose the
> accuracy for absolute timers.

Yes. That was in need of fixing. The first code I had was not working 
correctly with abs timers. Didn't have time to dig into it yet.
Will verify and fix today...



> > +
> > +	hrtimer_start(&ctx->tmr, ctx->tval, HRTIMER_REL);
> > +
> > +	/*
> > +	 * When we call this, the initialization must be complete, since
> > +	 * aino_getfd() will install the fd.
> > +	 */
> > +	error = aino_getfd(&ufd, &inode, &file, "[timerfd]",
> > +			   &timerfd_fops, ctx);
> > +	if (error)
> > +		goto err_fdalloc;
>
> Why is the timer started before we have everything in place ?

It simplifies the error path. The fd does not need to be in place for the 
timer function to be correctly triggered.


> Also if you turn it around then the (re)programming part of the timer
> can be shared.

The two error/exit paths are different. One needs to free the ctx, while 
the other one simply needs to do an fput().



> Please use hrtimer_try_to_cancel()
>
> retry:
> 	spin_lock_irq();
> 	if (hrtimer_try_to_cancel(&ctx->tmr) < 0) {
> 		spin_unlock_irq();
> 		cpu_relax();
> 		goto retry;
> 	}

Ok, I will.



> > +static unsigned int timerfd_poll(struct file *file, poll_table *wait)
> > +{
> > +	struct timerfd_ctx *ctx = file->private_data;
> > +
> > +	poll_wait(file, &ctx->wqh, wait);
> > +
> > +	return ctx->ticks ? POLLIN: 0;
>
> This is racy:
>
> 	timer is set up (non periodic)
> 	timer expires
> 	poll
>
> 	now poll is stuck for ever !

Duh, yeah. I use the locked version of wakeups. Will fix.



- Davide




Re: SATA resume slowness, e1000 MSI warning

2007-03-11 Thread Eric W. Biederman
Michael S. Tsirkin [EMAIL PROTECTED] writes:

> > Rumor has it that some pci devices can't tolerate < 32bit accesses.
> > Although I have never met one.
>
> hopefully not bridge devices?
>
> > The two factors together suggest that
> > for generic code it probably makes sense to operate on 32bit
> > quantities, and just to ignore the read-only portion.
>
> The code for regular devices seems to use 16-bit accesses, so
> I think it's best to stay consistent. Or do you want to change this too?

If we are stomping rare probabilities we might as well change that too.
The code to save pci-x state is relatively recent.  So it probably just
hasn't met a problem device yet (assuming they exist).
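
As a rough illustration of the 32-bit approach discussed here, a 16-bit register can be pulled out of, or merged into, an aligned 32-bit config-space dword with plain shifts and masks. This is a standalone sketch, not the kernel's save/restore code; the helper names field16_from_dword and dword_with_field16 are hypothetical, and it assumes the usual little-endian layout of PCI configuration space.

```c
#include <stdint.h>

/*
 * Extract the 16-bit register at byte offset `offset` from the aligned
 * 32-bit dword that contains it. Offsets with bit 1 set select the
 * upper half of the dword (little-endian layout assumed).
 */
static uint16_t field16_from_dword(uint32_t dword, unsigned int offset)
{
	unsigned int shift = (offset & 2u) * 8u;

	return (uint16_t)(dword >> shift);
}

/*
 * Return `dword` with the 16-bit register at `offset` replaced by `val`,
 * leaving the other half of the dword untouched.
 */
static uint32_t dword_with_field16(uint32_t dword, unsigned int offset,
				   uint16_t val)
{
	unsigned int shift = (offset & 2u) * 8u;
	uint32_t mask = (uint32_t)0xffffu << shift;

	return (dword & ~mask) | ((uint32_t)val << shift);
}
```

With helpers like these, a device only ever sees 32-bit accesses, and any read-only half of the dword is simply written back as it was read.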

Eric


Re: SwSusp to disk doesn't work - Try 2

2007-03-11 Thread Thomas Meyer

Rafael J. Wysocki wrote:


Could you please put some printk()s in kernel/cpu.c:_cpu_up() to see where
it gets stuck?  I bet one of the notifiers goes to sleep (cpufreq, maybe).
  

Here we go (OK, I forgot __FUNCTION__ ...):

Mar 11 19:31:33 [kernel] ac ACPI0003:00: freeze
Mar 11 19:31:33 [kernel] acpi device:00: freeze
Mar 11 19:31:33 [kernel] processor ACPI0007:01: freeze
Mar 11 19:31:33 [kernel] processor ACPI0007:00: freeze
Mar 11 19:31:33 [kernel] button button_power:00: freeze
Mar 11 19:31:33 [kernel] acpi acpi_system:00: freeze
Mar 11 19:31:33 [kernel] Disabling non-boot CPUs ...
Mar 11 19:31:33 [kernel] kvm: disabling virtualization on CPU1
Mar 11 19:31:33 [kernel] CPU 1 is now offline
Mar 11 19:31:33 [kernel] SMP alternatives: switching to UP code
Mar 11 19:31:33 [kernel] PM: Removing info for No Bus:cpu1
Mar 11 19:31:33 [kernel] PM: Removing info for No Bus:msr1
Mar 11 19:31:33 [kernel] CPU1 is down
Mar 11 19:31:33 [kernel] swsusp debug: Waiting for 5 seconds.
Mar 11 19:31:33 [kernel] Enabling non-boot CPUs ...
Mar 11 19:31:33 [kernel] NULL: before notifier CPU_UP_PREPARE.

Hung here.

Mar 11 19:31:33 [kernel] NULL: after notifier CPU_UP_PREPARE.
Mar 11 19:31:33 [kernel] SMP alternatives: switching to SMP code
Mar 11 19:31:33 [kernel] Booting processor 1/1 eip 3000
Mar 11 19:31:33 [kernel] CPU 1 irqstacks, hard=c0388000 soft=c0386000
Mar 11 19:31:33 [kernel] Initializing CPU#1
Mar 11 19:31:33 [kernel] Calibrating delay using timer specific 
routine.. 3663.72 BogoMIPS (lpj=6103555)
Mar 11 19:31:33 [kernel] CPU: After generic identify, caps: bfe9fbff 
0010   c1a9  

Mar 11 19:31:33 [kernel] monitor/mwait feature present.
Mar 11 19:31:33 [kernel] CPU: L1 I cache: 32K, L1 D cache: 32K
Mar 11 19:31:33 [kernel] CPU: L2 cache: 2048K
Mar 11 19:31:33 [kernel] CPU: Physical Processor ID: 0
Mar 11 19:31:33 [kernel] CPU: Processor Core ID: 1
Mar 11 19:31:33 [kernel] CPU: After all inits, caps: bfe9fbff 0010 
 2940 c1a9  
Mar 11 19:31:33 [kernel] CPU1: Intel Genuine Intel(R) CPU   
T2400  @ 1.83GHz stepping 08

Mar 11 19:31:33 [kernel] NULL: after __cpu_up
Mar 11 19:31:33 [kernel] NULL: before notifier CPU_ONLINE.
Mar 11 19:31:33 [kernel] kvm: enabling virtualization on CPU1
Mar 11 19:31:33 [kernel] Switched to high resolution mode on CPU 1
Mar 11 19:31:33 [kernel] PM: Adding info for No Bus:cpu1
Mar 11 19:31:33 [kernel] PM: Adding info for No Bus:msr1
Mar 11 19:31:33 [kernel] NULL: after notifier CPU_ONLINE.
Mar 11 19:31:33 [kernel] CPU1 is up
Mar 11 19:31:33 [kernel] acpi acpi_system:00: resuming
Mar 11 19:31:33 [kernel] button button_power:00: resuming
Mar 11 19:31:33 [kernel] processor ACPI0007:00: resuming
Mar 11 19:31:33 [kernel] processor ACPI0007:01: resuming



Re: [RFC][PATCH 1/7] Resource counters

2007-03-11 Thread Eric W. Biederman
Herbert Poetzl [EMAIL PROTECTED] writes:


> Linux-VServer does the accounting with atomic counters,
> so that works quite fine, just do the checks at the
> beginning of whatever resource allocation and the
> accounting once the resource is acquired ...

Atomic operations versus locks is only a granularity thing.
You still need the cache line which is the cost on SMP.

Are you using atomic_add_return or atomic_add_unless, or
are you performing your actions in two separate steps, which
is racy?  What I have seen indicates you are using the racy form
with two separate operations.
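
To make the distinction concrete, here is a minimal userspace sketch using C11 atomics rather than the kernel's atomic_t API. It is not taken from Linux-VServer or from the resource counter patches; struct res_counter_sketch, charge_racy and charge_atomic are hypothetical names. The first function does the check and the accounting as two separate steps, so two concurrent callers can both pass the check and together exceed the limit. The second folds both into one compare-and-exchange loop.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counter: current usage plus a fixed limit. */
struct res_counter_sketch {
	atomic_long usage;
	long limit;
};

/*
 * Racy form: the limit check and the add are separate operations, so
 * another thread may charge in between and push usage past the limit.
 */
static bool charge_racy(struct res_counter_sketch *c, long val)
{
	if (atomic_load(&c->usage) + val > c->limit)
		return false;
	atomic_fetch_add(&c->usage, val);
	return true;
}

/*
 * Single-step form: the new value is only stored if usage is still the
 * value the limit check was made against; otherwise retry with the
 * value that compare-exchange wrote back into `old`.
 */
static bool charge_atomic(struct res_counter_sketch *c, long val)
{
	long old = atomic_load(&c->usage);

	do {
		if (old + val > c->limit)
			return false;
	} while (!atomic_compare_exchange_weak(&c->usage, &old, old + val));

	return true;
}
```

Either way the counter's cache line still has to be owned by the charging CPU, which is the SMP cost mentioned above; the loop only removes the window between check and update.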

> > If we'll remove failcnt this would look like
> >    while (atomic_cmpxchg(...))
> > which is also not that good.
> >
> > Moreover - in RSS accounting patches I perform page list
> > manipulations under this lock, so this also saves one atomic op.
>
> it still hasn't been shown that this kind of RSS limit
> doesn't add big time overhead to normal operations
> (inside and outside of such a resource container)
>
> note that the 'usual' memory accounting is much more
> lightweight and serves similar purposes ...

Perhaps

Eric

