date:20070607

Re: [RFC][PATCH] /proc/pid/maps doesn't match "ipcs -m" shmid

2007-06-07 Thread Eric W. Biederman

"Albert Cahalan" <[EMAIL PROTECTED]> writes:

> On 6/7/07, Eric W. Biederman <[EMAIL PROTECTED]> wrote:
>
>> So it looks to me like we need to do three things:
>> - Fix the inode number
>> - Fix the name on the hugetlbfs dentry to hold the key
>> - Add a big fat comment that user space programs depend on this
>>   behavior of both the dentry name and the inode number.
>
> Assuming that this proposed fix goes in:
>
> Since the inode number is the shmid, and this is a number
> that the kernel randomly chooses AFAIK, there should be
> no need to have different shm segments sharing the same
> inode number.

Where we run into inode number confusion is that all of
these shm segments are actually files on a tmpfs filesystem
somewhere, and by making the inode number the shmid we loose
the tmpfs inode number.  So it is possible we get tmpfs inode
number conflicts.  However the inode number is not used for
anything, and the files are not visible in any other way except
as shm segments so it doesn't matter.

There is another case with ipc namespaces where we ultimately need
to support duplicate shmids on the same machine (so migration
is a possibility).  However by and large the user space
processes with duplicate ids should be invisible to each other.

> The situation with the key is a bit more disturbing, though
> we already hit that anyway when IPC_PRIVATE is used.
> (why anybody would NOT use IPC_PRIVATE is a mystery)
> So having the key in the name doesn't make things worse.

Having "SYSV" in the name appears mandatory.  Otherwise you
don't even know it is a shm file. Although I may be confused.

> I have some concern about the device minor number.
> This should be the same for all shm mappings; I do not
> know if the behavior changed.

So I haven't changed anything here.  But I haven't really
looked either.

I don't have a clue if hugetlbfs files use the same device minor
number as tmpfs files.

Hmm.  Thinking about this I have just realized that we may want
to approach this a little differently.  Currently I am reusing
the dentry and inode structure that hugetlbfs and tmpfs return
me, and simply have a distinct struct file for each shm mapping.

There is a little more cost but it may actually make sense to have
a dentry and inode that is specific to shm.c so we can do whatever
we need to without adding requirements to the normal tmpfs or hugtlb
code.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Bus hidden behind transparent bridge.

2007-06-07 Thread Greg KH

On Wed, Jun 06, 2007 at 02:33:11AM -0400, Dave Jones wrote:
> We've had this printk in drivers/pci/probe.c asking people
> to report it if they see it to linux-kernel for a long time.
> 
> google finds hundreds of instances of this being hit.
> There are a bunch in bugzilla too..
> 
> http://bugzilla.kernel.org/show_bug.cgi?id=6783
> http://bugzilla.kernel.org/show_bug.cgi?id=7403
> http://bugzilla.kernel.org/show_bug.cgi?id=7575
> http://bugzilla.kernel.org/show_bug.cgi?id=7664
> http://bugzilla.kernel.org/show_bug.cgi?id=8074
> http://bugzilla.kernel.org/show_bug.cgi?id=8480
> 
> (possibly others, these are just a quick skim based on summary)
> 
> I don't recall ever seeing anything happen any time someone has
> reported this.  Is it worth keeping the printk there?
> It only seems to cause confusion (users think its an error,
> especially as we ask them to do something).
> 
> Better yet, I don't think I've seen a report of this where
> the user is actually experiencing a problem.
> 
> Who is the intended recipient of these reports?

They should all go to [EMAIL PROTECTED] as he is the one who added that line.
In fact, if he doesn't fix it up soon, I'm really tempted to just add
his email address there instead, as I'm really tired of getting these
emails and he said it would be fixed up by now...

thanks,

greg k-h
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [BUG] ptraced process waiting on syscall may return kernel internal errnos

2007-06-07 Thread Benjamin Herrenschmidt

On Thu, 2007-06-07 at 20:18 -0700, Linus Torvalds wrote:
> 
> > Oh and Roland patch doesn't prevent signalfd() from stealing synchronous
> > signals such as SIGSEGV etc... which I think would result in random
> > behaviour do you want a patch for that or we just consider it broken
> > API usage and let it as it is ?
> 
> I think Davide ack'ed your patch, but I also think that one clashes with 
> the one I actually ended up applying. If you were to send me the signalfd 
> part (tested, preferably), I'd apply it, considering that it seems to be 
> the right thign to do, and it already got acked by Davide. 

Well, it's less urgent imho now that the real problems are fixed but
here it is, totally untested patch :-)
---

Don't let signalfd dequeue private signals off other threads

Signed-off-by: Benjamin Herrenschmidt <[EMAIL PROTECTED]>
---

Index: linux-work/kernel/signal.c
===
--- linux-work.orig/kernel/signal.c 2007-06-08 15:28:00.0 +1000
+++ linux-work/kernel/signal.c  2007-06-08 15:28:44.0 +1000
@@ -363,7 +363,13 @@ static int __dequeue_signal(struct sigpe
  */
 int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
 {
-   int signr = __dequeue_signal(>pending, mask, info);
+   int signr = 0;
+
+   /* We only dequeue private signals from ourselves, we don't let
+* signalfd steal them
+*/
+   if (tsk == current)
+   signr = __dequeue_signal(>pending, mask, info);
if (!signr) {
signr = __dequeue_signal(>signal->shared_pending,
 mask, info);


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Another version of cleanfile/cleanpatch

2007-06-07 Thread Jan Engelhardt



On Jun 8 2007 01:06, Oleg Verych wrote:

>- empty lines in the end of file (patches can't be handled, or can? :).

Yes it can.

>Body -- is a commented sed script with shell variables for source/patch
>handling switch and compatibility with other versions of sed, not only GNU.
>If you like more tabs, then i stated in whitespace damage, just use
>"unexpand".

sed just does not cut it anymore. Perl regexes win in the long term.

>> Yes, UNIX was designed to handle fork-exec efficiently, thank God. But
>> still.
>[]
>> >efforts to remove bashizms...
>> 
>> I prefer bashisms over using a shell [referring to original sh or ksh]
>> that can't do a sane thing.
>
>I would like to know cases. Just to try to solve them.
>
>Two from my head are:
>
>- `set pipefail' option -- not problem at all [0]
>- arrays.
>
>Arrays. Well, that depends. My option is as follows.

If you need arrays or a lot of substr magic, well, it's perhaps time to
consider a switch to a scripted language (perl, python, php, whatever comes
around). But I really meant these handy bash features:

  for i in {1..5}; do ...; done;
  if [ ]; then ...; fi;
  echo $[ math expr ];
  echo `backtick` and $(backtick with dollar-parentheses);

  The possibility to say if [ "$any" == "" ] (not having to use
  crap like [ "x$any" == "x" ].

  The possibility to say if [ -z "$any" ]

On the other hand, if you wanted to extinguish bashisms, then you'd
also need to do so for kshisms like

  if [[ ]]

>OK, to not to go offtopic, i would say here, that if that temp file on the
>tmpfs, then Linux directly helps you with its efficient memory
>management, not libc (good addition to fork/clone-execve, isn't it? ;)

And don't assume everything is a UNIX. CreateProcess() is particularly
expensive on Windows, burdening even Cygwin's fork()/exec() emulation.



Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 1/8] fdmap v2 - fdmap core

2007-06-07 Thread Davide Libenzi

On Fri, 8 Jun 2007, Eric Dumazet wrote:

> struct fd_map {
> /*
>  * read mostly part
>  */
> unsigned int base;   /* 0x00 */
> unsigned int size;   /* 0x04 */
> struct list_head slist; /* 0x08 */
> struct list_head *slots; /* 0x18 */
> unsigned long *map; /* 0x20 */
> void (*freecb)(void *, struct fd_map *); /* 0x28 */
> void *freecb_priv; /* 0x30 */
>   /* one 8 bytes hole */
> 
> /*
>  * written part on a separate cache line in SMP
>  */
> unsigned int fdnext cacheline_aligned_in_smp; /* 0x40 */
> 
> struct fd_map *next;  /* 0x48 */
> struct rcu_head rcu; /* 0x50 */
> }; /* size 0x60 , aligned to 0x80 */

Wait, that's wrong. Mostly written fields are fdnext and slist, and now 
count. The rcu and next are written only on free. I'd move next down, move 
slist up, and place count up.



- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 2.6.22-rc: regression: no irda0 interface (2.6.21 was OK), smsc does not find chip

2007-06-07 Thread Andrey Borzenkov

On Thursday 07 June 2007, Bjorn Helgaas wrote:
> In 2.6.21, smsc-ircc2 at least found the device.  So we definitely have
> a problem because in 2.6.22-rc, we don't find the device.
>
> What laptop do you have?  Maybe I can find one to play with.
>

This is Toshiba Portege 4000. The rest of your questions later when I have 
time :)

signature.asc
Description: This is a digitally signed message part.

Re: [patch 1/8] fdmap v2 - fdmap core

2007-06-07 Thread Davide Libenzi

On Fri, 8 Jun 2007, Eric Dumazet wrote:

> Well, offsets are wrong but layout OK
> 
> struct fd_map {
> /*
>  * read mostly part
>  */
> unsigned int base;   /* 0x00 */
> unsigned int size;   /* 0x04 */
> struct list_head slist; /* 0x08 */
> struct list_head *slots; /* 0x18 */
> unsigned long *map; /* 0x20 */
> void (*freecb)(void *, struct fd_map *); /* 0x28 */
> void *freecb_priv; /* 0x30 */
>   /* one 8 bytes hole */
> 
> /*
>  * written part on a separate cache line in SMP
>  */
> unsigned int fdnext cacheline_aligned_in_smp; /* 0x40 */
> 
> struct fd_map *next;  /* 0x48 */
> struct rcu_head rcu; /* 0x50 */
> }; /* size 0x60 , aligned to 0x80 */

There's a new "unsigned int count" now. That hole just calls for it ;)


- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 7/8] fdmap v2 - implement sys_socket2

2007-06-07 Thread Davide Libenzi

On Fri, 8 Jun 2007, Eric Dumazet wrote:

> Davide Libenzi a écrit :
> > On Thu, 7 Jun 2007, Eric Dumazet wrote:
> > > I am afraid randomization wont really work if /sbin/init or /bin/bash for
> > > example uses one (or more) unseq fd :
> > > The 'random base' will be propagated at fork()/exec() time ?
> > 
> > As I said to Uli, we can't move the base while fds are in there. We can
> > re-randomize it when it's empty. This can also be done (it's a trivial and
> > fast operation - just set fmap->base to a new value) even every time the fd
> > count on the map touches zero.
> > 
> 
> Hum, I think it would be better to free fmap if it's empty, instead of change
> fmap->base. (Only in fork() after removal of O_CLOFORK file handles, and in
> exec() after removal of O_CLOEXEC file handles)

That can be done too. When it'll be re-created will be randomized anyway.
I'll probably be doing it everytime fmap->count touches zero in 
__put_unused_fd(), so even if the map is not empty at fork and/or exec 
time, the program have other chances of randomize in the middle of its 
lifetime.

- Davide

Re: [patch 7/8] fdmap v2 - implement sys_socket2

2007-06-07 Thread Davide Libenzi

On Thu, 7 Jun 2007, Ulrich Drepper wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> Andrew Morton wrote:
> > Absolutely.  Let's get them into their final form now, rather than letting
> > some intermediate interface escape out into a major kernel release.
> 
> Even if Linus' redirection is adopted, these interfaces should still be
> fixed up.  No need to rely on hacks if we can still fix up the interfaces.

Indeed. I still think that adding O_NONSEQFD is a good idea, independently 
on the other interfaces.


- Davide


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] update checkpatch.pl to version 0.03

2007-06-07 Thread Jon Masters

On Fri, 2007-06-08 at 02:04 +0200, Adrian Bunk wrote:

> The difference is that "ls" expects and handles such issues while 
> "lsmod" (and most likely also other userspace working with kernel 
> output) does not yet handle it resulting in problems if bytes are 
> wrongly interpreted as control codes.

I'm happy to modify module-init-tools for Unicode support, I just didn't
think this was a huge issue - but now there's a test case so... :-)

Jon.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 1/8] fdmap v2 - fdmap core

2007-06-07 Thread Eric Dumazet


Eric Dumazet a écrit :


struct fd_map {
/*
 * read mostly part
 */
unsigned int base;   /* 0x00 */
unsigned int size;   /* 0x04 */
struct list_head slist; /* 0x08 */
struct list_head *slots; /* 0x18 */
unsigned long *map; /* 0x28 */
void (*freecb)(void *, struct fd_map *); /* 0x30 */
void *freecb_priv; /* 0x38 */

/*
 * written part on a separate cache line in SMP
 */
unsigned int fdnext cacheline_aligned_in_smp; /* 0x40 */

struct fd_map *next;  /* 0x48 */
struct rcu_head rcu; /* 0x50 */
};


Well, offsets are wrong but layout OK

struct fd_map {
/*
 * read mostly part
 */
unsigned int base;   /* 0x00 */
unsigned int size;   /* 0x04 */
struct list_head slist; /* 0x08 */
struct list_head *slots; /* 0x18 */
unsigned long *map; /* 0x20 */
void (*freecb)(void *, struct fd_map *); /* 0x28 */
void *freecb_priv; /* 0x30 */
/* one 8 bytes hole */

/*
 * written part on a separate cache line in SMP
 */
unsigned int fdnext cacheline_aligned_in_smp; /* 0x40 */

struct fd_map *next;  /* 0x48 */
struct rcu_head rcu; /* 0x50 */
}; /* size 0x60 , aligned to 0x80 */

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 1/8] fdmap v2 - fdmap core

2007-06-07 Thread Eric Dumazet


Davide Libenzi a écrit :

+struct fd_map {
+   struct fd_map *next;
+   struct rcu_head rcu;
+   unsigned int base;
+   unsigned int size;
+   struct list_head slist;
+   struct list_head *slots;
+   unsigned int fdnext;
+   unsigned long *map;
+   void (*freecb)(void *, struct fd_map *);
+   void *freecb_priv;
+};



On x86_64 that would mean folowing offsets

struct fd_map {
struct fd_map *next;  /* 0 */
struct rcu_head rcu; /* 8 */
unsigned int base;   /* 0x18 */
unsigned int size;   /* 0x1c */
struct list_head slist; /* 0x20 */
struct list_head *slots; /* 0x30 */
unsigned int fdnext; /* 0x38 */
unsigned long *map; /* 0x40 */
void (*freecb)(void *, struct fd_map *); /* 0x48 */
void *freecb_priv; /* 0x50 */
}; /* size 0x58 */

As L1_CACHE_BYTES is 64, two cache lines are necessary to get base,size,map
And a change of fdnext dirties one cache line that should be kept read only to 
avoid false sharing.


I suggest to reorder to move rcu and next at the end (since they are seldom 
used). Also move fdnext on a separate cache line on SMP


struct fd_map {
/*
 * read mostly part
 */
unsigned int base;   /* 0x00 */
unsigned int size;   /* 0x04 */
struct list_head slist; /* 0x08 */
struct list_head *slots; /* 0x18 */
unsigned long *map; /* 0x28 */
void (*freecb)(void *, struct fd_map *); /* 0x30 */
void *freecb_priv; /* 0x38 */

/*
 * written part on a separate cache line in SMP
 */
unsigned int fdnext cacheline_aligned_in_smp; /* 0x40 */

struct fd_map *next;  /* 0x48 */
struct rcu_head rcu; /* 0x50 */
};

Thank you
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [-mm patch] make drivers/firmware/dmi-id.c:dmi_id_init() static

2007-06-07 Thread Greg KH

On Sun, Jun 03, 2007 at 10:54:04PM +0200, Adrian Bunk wrote:
> This patch makes the needlessly global dmi_id_init() static.
> 
> Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>

Thanks, I've merged this with the original.

greg k-h
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH]: add parameter "struct bin_attribute *" in .read/.write methods for sysfs binary attributes

2007-06-07 Thread Greg KH

On Thu, Jun 07, 2007 at 05:31:35PM +0800, Zhang Rui wrote:
> From: Zhang Rui <[EMAIL PROTECTED]>
> 
> Well, first of all, I don't want to change so many files either.
> 
> What I do:
> Adding a new parameter "struct bin_attribute *" in the
> .read/.write methods for the sysfs binary attributes.
> 
> In fact, only the four lines change in fs/sysfs/bin.c and
> include/linux/sysfs.h do the real work.
> But I have to update all the files that use binary attributes
> to make them compatible with the new .read and .write methods.
> I'm not sure if I missed any. :(
> 
> Why I do this:
> For a sysfs attribute, we can get a pointer pointing to the
> struct attribute in the .show/.store method,
> while we can't do this for the binary attributes.
> I don't know why this is different, but this does make it not
> so handy to use the binary attributes as the regular ones.
> So I think this patch is reasonable. :)
> 
> Who benefits from it:
> The patch that exposes ACPI tables in sysfs
> requires such an improvement.
> All the table binary attributes share the same .read method.
> Parameter "struct bin_attribute *" is used to get
> the table signature and instance number which are used to
> distinguish different ACPI table binary attributes.
> 
> Without this parameter, we need to offer different .read methods
> for different ACPI table binary attributes.
> This is impossible as there are various ACPI tables on different
> platforms, and we don't know what they are until they are loaded.

I have no objection to this patch but it failes the checkpatch.pl script
pretty badly with too long of lines.  Care to fix it up and resend it?

thanks,

greg k-h
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC][PATCH] /proc/pid/maps doesn't match "ipcs -m" shmid

2007-06-07 Thread Albert Cahalan


On 6/7/07, Eric W. Biederman <[EMAIL PROTECTED]> wrote:


So it looks to me like we need to do three things:
- Fix the inode number
- Fix the name on the hugetlbfs dentry to hold the key
- Add a big fat comment that user space programs depend on this
  behavior of both the dentry name and the inode number.


Assuming that this proposed fix goes in:

Since the inode number is the shmid, and this is a number
that the kernel randomly chooses AFAIK, there should be
no need to have different shm segments sharing the same
inode number.

The situation with the key is a bit more disturbing, though
we already hit that anyway when IPC_PRIVATE is used.
(why anybody would NOT use IPC_PRIVATE is a mystery)
So having the key in the name doesn't make things worse.

I have some concern about the device minor number.
This should be the same for all shm mappings; I do not
know if the behavior changed.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 2.6.22-rc4-mm1

2007-06-07 Thread WANG Cong

On Thu, Jun 07, 2007 at 11:59:16AM -0500, Matt Mackall wrote:
>On Fri, Jun 08, 2007 at 12:39:30AM +0800, WANG Cong wrote:
>> >Ketchup doesn't even look inside patches, and patch doesn't invent
>> >names, so something in the bzip2 -> patch(1) -> filesystem chain got
>> >corrupted. Probably not bzip2, as it has CRCs.
>> >
>> 
>> Do you mean ketchup doesn't do anything if a file is corrupted?
>
>Ketchup never even sees the filenames. It just calls bzip2 | patch. So
>it can't be responsible for damaging the filename.
> 
>> >Do you have ECC memory?
>>
>> No. Do you mean it's an error of my RAM? I have never met such things before,
>> how often does such kind of things happen? May be less often than a bug in
>> a stable kernel?
>
>The best studies I've seen suggest so-called "soft errors" in DRAM
>happen at a rate of once a week to once a day per gigabyte of RAM at
>sea-level. It's unknown how many of these errors manifest by visibly
>corrupting data, but it wouldn't be surprising if it were
>significantly less than 10%. But ECC is definitely not just for the
>paranoid!
>
>So if I were to rank the reliability of everything, it'd look
>something like this, highest to lowest:
>
> bzip: simple, stable and heavily-used codebase, built-in safeguards like CRC
> patch: simple, stable, heavily-used, limited detection of input errors
> CPU: heavily used, very low non-catastrophic failure rate
> disk: heavily used, CRC on cable, ECC on disk
> kernel: complex, rapidly-changing, but heavily-used
> Non-ECC DRAM: significant known transient failure rate
> 
>When the error rate for the kernel approaches that of DRAM, it gets
>very hard to assign blame.
>
>(And of course, there's the user, who tends to be near the bottom of
>this range, but I'll let you judge that.)

Good explanation! Thanks!

That the RAM error occurs so often really surprises me.
I think it might be RAM's fault, because, at least, it's less reproduceable than
a bug in a stable kernel.


Regards!

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 7/8] fdmap v2 - implement sys_socket2

2007-06-07 Thread Eric Dumazet


Davide Libenzi a écrit :

On Thu, 7 Jun 2007, Eric Dumazet wrote:

I am afraid randomization wont really work if /sbin/init or /bin/bash for
example uses one (or more) unseq fd :
The 'random base' will be propagated at fork()/exec() time ?


As I said to Uli, we can't move the base while fds are in there. We can 
re-randomize it when it's empty. This can also be done (it's a trivial and 
fast operation - just set fmap->base to a new value) even every time the 
fd count on the map touches zero.




Hum, I think it would be better to free fmap if it's empty, instead of change 
fmap->base. (Only in fork() after removal of O_CLOFORK file handles, and in 
exec() after removal of O_CLOEXEC file handles)



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH, v2] Audit: Add TTY input auditing

2007-06-07 Thread Miloslav Trmac

From: Miloslav Trmac <[EMAIL PROTECTED]>

Add TTY input auditing, used to audit system administrator's actions.
TTY input auditing works on a higher level than auditing all system
calls within the session, which would produce an overwhelming amount of
mostly useless audit events.

Add an "audit_tty" attribute, inherited across fork ().  Data read from
TTYs by process with the attribute is sent to the audit subsystem by the
kernel.  The audit netlink interface is extended to allow modifying the
audit_tty attribute, and to allow sending explanatory audit events from
user-space (for example, a shell might send an event containing the
final command, after the interactive command-line editing and history
expansion is performed, which might be difficult to decipher from the
TTY input alone).

Because the "audit_tty" attribute is inherited across fork (), it would
be set e.g. for sshd restarted within an audited session.  To prevent
this, the audit_tty attribute is cleared when a process with no open TTY
file descriptors (e.g. after daemon startup) opens a TTY.

See https://www.redhat.com/archives/linux-audit/2007-June/msg0.html
for a more detailed rationale document for an older version of this patch.

---
Changes since the previous patch:
* use spin_lock_irq() for siglock
* add an is_tty() function instead of checking f_op->read from n_tty.c;
  handle hung TTYs
* replace the audit_tty bit field by a whole word to avoid the risk of
  incorrect locking
* move most new code from n_tty.c to a separate file
* fix coding style violations
* fix compilation with !CONFIG_AUDIT

From: Miloslav Trmac <[EMAIL PROTECTED]>

Add TTY input auditing, used to audit system administrator's actions.
TTY input auditing works on a higher level than auditing all system
calls within the session, which would produce an overwhelming amount of
mostly useless audit events.

Add an "audit_tty" attribute, inherited across fork ().  Data read from
TTYs by process with the attribute is sent to the audit subsystem by the
kernel.  The audit netlink interface is extended to allow modifying the
audit_tty attribute, and to allow sending explanatory audit events from
user-space (for example, a shell might send an event containing the
final command, after the interactive command-line editing and history
expansion is performed, which might be difficult to decipher from the
TTY input alone).

Because the "audit_tty" attribute is inherited across fork (), it would
be set e.g. for sshd restarted within an audited session.  To prevent
this, the audit_tty attribute is cleared when a process with no open TTY
file descriptors (e.g. after daemon startup) opens a TTY.

See https://www.redhat.com/archives/linux-audit/2007-June/msg0.html
for a more detailed rationale document for an older version of this patch.

Signed-off-by: Miloslav Trmac <[EMAIL PROTECTED]>
---
Changes since the previous patch:
* use spin_lock_irq() for siglock
* add an is_tty() function instead of checking f_op->read from n_tty.c; handle
  hung TTYs
* replace the audit_tty bit field by a whole word to avoid the risk of
  incorrect locking
* most code from n_tty.c split to a separate file
* fix coding style violations
* fix compilation with !CONFIG_AUDIT

 drivers/char/Makefile|1 
 drivers/char/n_tty.c |   20 +
 drivers/char/tty_audit.c |  332 +
 drivers/char/tty_io.c|   14 +
 include/linux/audit.h|   11 +
 include/linux/sched.h|2 
 include/linux/tty.h  |   28 ++
 kernel/audit.c   |   96 -
 kernel/audit.h   |1 
 kernel/auditsc.c |3 
 kernel/exit.c|2 
 kernel/fork.c|5 
 net/netlabel/netlabel_user.c |4 
 security/selinux/nlmsgtab.c  |2 

diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index 2f56ecc..f2996a9 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -15,6 +15,7 @@ obj-y+= misc.o
 obj-$(CONFIG_VT)		+= vt_ioctl.o vc_screen.o consolemap.o \
    consolemap_deftbl.o selection.o keyboard.o
 obj-$(CONFIG_HW_CONSOLE)	+= vt.o defkeymap.o
+obj-$(CONFIG_AUDIT)		+= tty_audit.o
 obj-$(CONFIG_MAGIC_SYSRQ)	+= sysrq.o
 obj-$(CONFIG_ESPSERIAL)		+= esp.o
 obj-$(CONFIG_MVME147_SCC)	+= generic_serial.o vme_scc.o
diff --git a/drivers/char/n_tty.c b/drivers/char/n_tty.c
index 154f422..4e71791 100644
--- a/drivers/char/n_tty.c
+++ b/drivers/char/n_tty.c
@@ -45,6 +45,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -78,6 +80,13 @@ static inline void free_buf(unsigned char *buf)
 		free_page((unsigned long) buf);
 }
 
+static inline int tty_put_user(struct tty_struct *tty, unsigned char x,
+			   unsigned char __user *ptr)
+{
+	tty_audit_add_data(tty, , 1);
+	return put_user(x, ptr);
+}
+
 /**
  *	n_tty_set__room	-	receive space
  *	@tty: terminal
@@ -1153,6 +1162,7 @@ static int copy_from_read_buf(struct tty_struct *tty,
 	if (n) {

Re: [linux-usb-devel] 2.6.22-rc4-mm1

2007-06-07 Thread Greg KH

On Thu, Jun 07, 2007 at 10:53:29AM -0400, Alan Stern wrote:
> To tell you the truth, I rather think there's not much point in keeping
> usb-try-to-debug-bug-8561.patch around.  Anything seriously wrong that
> it could catch ought to have shown up long ago.  And it is now clear
> that bug 8561 has nothing to do with this; it is a programming error
> common to many of the USB serial drivers.  (Still waiting to hear back 
> from Paulo Pereira whether the fix to the USB Option driver works...)

What error in the usb-serial drivers are you speaking about?

thanks,

greg k-h
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Audit: Add TTY input auditing

2007-06-07 Thread Miloslav Trmac

Thanks for the review.
Andrew Morton napsal(a):
> On Wed, 06 Jun 2007 12:10:28 +0200 Miloslav Trmac <[EMAIL PROTECTED]> wrote:
>> +/**
>> + *  tty_audit_opening   -   A TTY is being opened.
>> + *
>> + *  As a special hack, tasks that close all their TTYs and open new ones
>> + *  are assumed to be system daemons (e.g. getty) and auditing is
>> + *  automatically disabled for them.
>> + */
>> +void
>> +tty_audit_opening(void)
>> +{
>> +int disable;
>> +
>> +disable = 1;
>> +spin_lock(>sighand->siglock);
>> +if (current->signal->audit_tty == 0)
>> +disable = 0;
>> +spin_unlock(>sighand->siglock);
>> +if (!disable)
>> +return;
>> +
>> +task_lock(current);
>> +if (current->files) {
>> +struct fdtable *fdt;
>> +unsigned i;
>> +
>> +/*
>> + * We don't take a ref to the file, so we must hold ->file_lock
>> + * instead.
>> + */
>> +spin_lock(>files->file_lock);
> 
> So we make file_lock nest inside task_lock().  Was that lock ranking
> already being used elsewhere in the kernel, or is it a new association?
It is used in __do_SAK ().

> Has this code had full coverage testing with all lockdep features enabled?
> 
> (I suspect not - lockdep should have gone wild over the siglock thing)
It was not.  The new version will be.


>> diff --git a/kernel/audit.c b/kernel/audit.c
>> index d13276d..a071a96 100644
>> --- a/kernel/audit.c
>> +++ b/kernel/audit.c
>> @@ -423,6 +424,32 @@ static int kauditd_thread(void *dummy)
>>  return 0;
>>  }
>>  
>> +static int
>> +audit_prepare_user_tty(pid_t pid, uid_t loginuid)
>> +{
>> +struct task_struct *tsk;
>> +int err;
>> +
>> +read_lock(_lock);
>> +tsk = find_task_by_pid(pid);
>> +err = -ESRCH;
>> +if (!tsk)
>> +goto out;
>> +err = 0;
>> +
>> +spin_lock(>sighand->siglock);
>> +if (!tsk->signal->audit_tty)
>> +err = -EPERM;
>> +spin_unlock(>sighand->siglock);
> So siglock nests inside tasklist_lock?  Sounds reasonable.  Is this a
> preexisting association, or did this patch just create it?
This is used in send_sig_info() and several other functions in
kernel/signal.c.


>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index d58e74b..3ae4904 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -506,6 +506,8 @@ struct signal_struct {
>>  #ifdef CONFIG_TASKSTATS
>>  struct taskstats *stats;
>>  #endif
>> +unsigned audit_tty:1;
>> +struct tty_audit_buf *tty_audit_buf;
>>  };
> 
> hm, bitfields are risky.  If someone adds another one, it will land in
> the same word and external locking will be needed.  You do seem to be using
> ->siglock to cover this - was that to address the bitfield non-atomicity
> problem?
I don't know what the memory access atomicity assumptions are in the
kernel, so I have used the basic rule that any write<->read conflict on
a variable with type other than atomic_t must be prevented by a lock.
This happens to work for the bit field as well.

> A suitable (but somewhat less pretty) way to resolve all this is to not use
> bitfields at all: add `unsigned long flags' and use set_bit/clear_bit/etc.
The new patch replaces the bit field by a simple "unsigned", a whole
word is allocated for the bit field anyway.

>>
>>  break;
>> +case AUDIT_TTY_GET: {
>> +struct audit_tty_status s;
>> +struct task_struct *tsk;
>> +
>> +read_lock(_lock);
>> +tsk = find_task_by_pid(pid);
>> +if (!tsk)
>> +err = -ESRCH;
>> +else {
>> +spin_lock(>sighand->siglock);
>> +s.enabled = tsk->signal->audit_tty != 0;
>> +spin_unlock(>sighand->siglock);
> 
> The locking here looks dubious.  tsk->signal->audit_tty can change state
> the instant ->siglock gets unlocked, in which case s.enabled is now wrong.
The user-space process must avoid concurrent AUDIT_TTY_SET to get
reasonable results.  There's nothing better the kernel can do.

> If that is acceptable then we didn't need that locking at all.
So I can assume that int-sized reads are always atomic with respect to
concurrent writes?
Mirek
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RT] convert RCU Preempt tasklet into softirq.

2007-06-07 Thread Paul E. McKenney

On Thu, Jun 07, 2007 at 04:16:09PM -0400, Steven Rostedt wrote:
> On Thu, 2007-06-07 at 14:51 -0400, Steven Rostedt wrote:
> 
> > There might still be an issue here. With the patch I'm getting a really
> > slow response time on networking. But that be because of other patches I
> > have applied.
> 
> I removed this patch and I can still get the network slowdown/hang. So
> this patch is unrelated to this issue. RCU on the otherhand has not been
> cleared of suspicion.

Might the slowdown be due to different kernel threads running at different
priorities?  I could easily believe that changing the priority of the
kernel thread processing RCU callbacks could have a noticeable effect
on performance in some cases.

In other news, passed a set of kernbenches at this end, so starting
an rcutorture.

Thanx, Paul
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC 6/7] i386: make the bzImage payload an ELF file

2007-06-07 Thread H. Peter Anvin

Vivek Goyal wrote:
> 
> One would not know highest used address until ELF headers have been 
> parsed. May be it is two step movement. First decompress ELF.gz and 
> ELF parser can be at the end of decompressed data. Then it can parse
> the ELF headers and move itself out of the ELF header destination memory
> and then load the elf segments at appropriate place.
> 
> One will have to be little careful while moving ELF parser or while
> decompressing the file to a temporary buffer so that we don't stomp over
> any other data loaded by boot-loader (like kexec does) or we don't go beyond
> the memory bounds which might have been created in the case of using kdump.
> 

The easiest is probably to decode the ELF headers (which can be done in
O(1) space), relocate, reset the decompressor and restart.

Relocation is currently done in the decompressor, but it could also be
done at the kernel entrypoint, as long as the kernel entrypoint code is
all PIC.

-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G

2007-06-07 Thread William Lee Irwin III

William Lee Irwin III wrote:
>> Beg your pardon? Are you reading the patch description correctly?

On Thu, Jun 07, 2007 at 08:44:09PM -0700, H. Peter Anvin wrote:
> I mean, with your patch CONFIG_HIGHMEM4G versus CONFIG_HIGHMEM64G really
> don't make sense as separate selections anymore.

I thought about sweeping those up, but defaulted to minimal diffsize.
I can sweep them up given more votes in favor of doing so.


-- wli
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Make i386 kernel show the segfaults in kernel dmesg, like x86_64.

2007-06-07 Thread Masoud Sharbiani


On 6/7/07, Robert Hancock <[EMAIL PROTECTED]> wrote:

Masoud Asgharifard Sharbiani wrote:
> Hello,
> This patch makes the i386 behave the same way that x86_64 does when a
> segfault happens. A line gets printed to the kernel log so that tools
> that need to check for failures can behave more uniformly between
> different kernels. Like x86_64, it can be disabled by setting
> debug.exception-trace sysctl variable to 0 (or by doing
> echo 0 > /proc/sys/debug/exception-trace)
>
> Same behaviour can be extended to other architectures, if needed.
> cheers,
> Masoud.
>
> Signed-off-by: Masoud Sharbiani <[EMAIL PROTECTED]>
>
> diff --git a/arch/i386/mm/fault.c b/arch/i386/mm/fault.c
> index 29d7d61..6aa56db 100644
> --- a/arch/i386/mm/fault.c
> +++ b/arch/i386/mm/fault.c
> @@ -283,6 +283,8 @@ static inline int vmalloc_fault(unsigned long address)
>   return 0;
>  }
>
> +int exception_trace = 1;
> +
>  /*
>   * This routine handles page faults.  It determines the address,
>   * and the problem, and then passes it off to one of the appropriate
> @@ -464,7 +466,14 @@ bad_area_nosemaphore:
>*/
>   if (is_prefetch(regs, address, error_code))
>   return;
> -
> + if (exception_trace && unhandled_signal(tsk, SIGSEGV)) {
> + printk(
> +"%s%s[%d]: segfault at %08lx eip %08lx esp %08lx error 
%lx\n",
> + tsk->pid > 1 ? KERN_INFO : KERN_EMERG,
> + tsk->comm, tsk->pid, address, regs->eip,
> + regs->esp, error_code);

Shouldn't we use printk_ratelimit() here, to prevent some nasty person
from creating some rapidly-segfaulting process that floods the kernel
logs? (Same with the x86_64 version if it doesn't already..)


Good call.

Masoud


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC 6/7] i386: make the bzImage payload an ELF file

2007-06-07 Thread Vivek Goyal

On Wed, Jun 06, 2007 at 05:42:35PM -0700, H. Peter Anvin wrote:
> Jeremy Fitzhardinge wrote:
> > 
> > Certainly, but much harder to implement.  The ELF parser needs to be
> > prepared to move itself around to get out of the way of the ELF file. 
> > It's a fairly large change from how it works now.
> > 
> 
> It doesn't if we simply declare that a certain chunk of memory is
> available to it, for the case where it runs in the native configuration.
> Since it doesn't have to support *any* ELF file, just the kernel one,
> that's an option.
> 
> On the other hand, I guess with the decompressor/ELF parser being PIC,
> one would simply look for the highest used address, and relocate itself
> above that point.  It's not really all that different from what the
> decompressor does today, except that it knows the address a priori.
> 

One would not know highest used address until ELF headers have been 
parsed. May be it is two step movement. First decompress ELF.gz and 
ELF parser can be at the end of decompressed data. Then it can parse
the ELF headers and move itself out of the ELF header destination memory
and then load the elf segments at appropriate place.

One will have to be little careful while moving ELF parser or while
decompressing the file to a temporary buffer so that we don't stomp over
any other data loaded by boot-loader (like kexec does) or we don't go beyond
the memory bounds which might have been created in the case of using kdump.

Thanks
Vivek
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC][PATCH] /proc/pid/maps doesn't match "ipcs -m" shmid

2007-06-07 Thread Eric W. Biederman

"Serge E. Hallyn" <[EMAIL PROTECTED]> writes:

> Ok, so IIUC the problem was that inode->i_ino was being set to the id,
> and the id can be the same for different things in two namespaces.

There is nothing preventing inode number collisions in this code even
without multiple namespaces, and even when it was functioning
correctly.  However as it does not seem possible to find these files
through normal filesystem operations that does not seem to be a problem.

> So aside from not using the id as inode->i_ino, an alternative is to use
> a separate superblock, spearate mqeueue fs, for each ipc ns.
>
> I haven't looked at that enough to see whether it's feasible, i.e. I 
> don't know what else mqueue fs is used for.  Eric, does that sound
> reasonable to you?

At this point given that we actually have a small user space dependency
and the fact that after I have reviewed the code it looks harmless to
change the inode number of those inodes, in both cases they are just
anonymous inodes generated with new_inode, and anything that we wrap
is likely to be equally so.

So it looks to me like we need to do three things:
- Fix the inode number
- Fix the name on the hugetlbfs dentry to hold the key
- Add a big fat comment that user space programs depend on this
  behavior of both the dentry name and the inode number.

So Badari it looks like your original patch plus a little bit is
what we need.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G

2007-06-07 Thread H. Peter Anvin

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>>> !CONFIG_X86_PAE && CONFIG_HIGHMEM64G doesn't make sense and is not allowed
>>> by this patch. CONFIG_X86_PAE && !CONFIG_HIGHMEM64G works here.
> 
> 
> On Thu, Jun 07, 2007 at 08:38:22PM -0700, H. Peter Anvin wrote:
>> But what's the point?
>> If you're going to divorce these, at least do it in a way that makes
>> sense, specifically the two independent variables are PAE and HIGHMEM.
>> PAE and !HIGHMEM does make (some amount of) sense, due to no kmap overhead.
> 
> Beg your pardon? Are you reading the patch description correctly?
> 

I mean, with your patch CONFIG_HIGHMEM4G versus CONFIG_HIGHMEM64G really
don't make sense as separate selections anymore.

-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH]: complete cleanup of check_region]

2007-06-07 Thread Surya

re-sending with a minor correction...


> >
> > @@ -5640,6 +5645,7 @@ int __init sbpcd_init(void)
> > int i=0, j=0;
> > int addr[2]={1, CDROM_PORT};
> > int port_index;
> > +   int request_region_flag;
> >
> One could argue that it would be possible to just use the 'j' variable
> for this since it's not used for anything else until the *_region()
> stuff is all done, that would save adding a new variable...
completely eliminated the variable, haven't felt the need for it.


> > msg(DBG_INI,"check_drives done.\n");
> > +   if(request_region_flag==0)
> > +   release_region(addr[1],4);
> 
> I know the driver in its current version just tests if the region is
> available and doesn't hang on to it, but I'm wondering if that's not a
> bug.  Shouldn't we actually hold on to this region as long as the
> driver is loaded and only release the region in the sbpcd_exit()
> function, just loke we do with the "CDo_command" region?  
But if any of the conditions fail after the request_region code, it is
returning an error and exiting out of init. Dont we need to cleanup
request_region's allocation before we exit. Infact I had a feeling that
CDo_command region cleanup was not perfect. Did a cleanup of both the
regions wherever we have a return out of the function.

Please have a look at the below patch.

diff --git a/drivers/cdrom/sbpcd.c b/drivers/cdrom/sbpcd.c
index a1283b1..5078211 100644
--- a/drivers/cdrom/sbpcd.c
+++ b/drivers/cdrom/sbpcd.c
@@ -358,6 +358,10 @@
  * Add bio/kdev_t changes for 2.5.x required to make it work again. 
  * Still room for improvement in the request handling here if anyone
  * actually cares.  Bring your own chainsaw.Paul G.  02/2002
+ *
+ *
+ * Cleaned up the reference for deprecated check_region to 
+ * request_region.- Surya Prabhakar N  08/07/2007
  */
 
 
@@ -555,6 +559,8 @@ static struct cdrom_read_audio read_audio;
 static unsigned char msgnum;
 static char msgbuf[80];
 
+static int addr[2]={1, CDROM_PORT};
+
 static int max_drives = MAX_DRIVES;
 module_param(max_drives, int, 0);
 #ifndef MODULE
@@ -5638,7 +5644,6 @@ int __init sbpcd_init(void)
 #endif
 {
int i=0, j=0;
-   int addr[2]={1, CDROM_PORT};
int port_index;
 
sti();
@@ -5670,9 +5675,9 @@ int __init sbpcd_init(void)
{
addr[1]=sbpcd[port_index];
if (addr[1]==0) break;
-   if (check_region(addr[1],4))
+   if (!request_region(addr[1],4, "sbpcd driver"))
{
-   msg(DBG_INF,"check_region: %03X is not 
free.\n",addr[1]);
+   msg(DBG_INF,"request_region: %03X is not 
free.\n",addr[1]);
continue;
}
if (sbpcd[port_index+1]==2) type=str_sp;
@@ -5699,6 +5704,7 @@ int __init sbpcd_init(void)
if (ndrives==0)
{
msg(DBG_INF, "No drive found.\n");
+   release_region(addr[1],4);
 #ifdef MODULE
return -EIO;
 #else
@@ -5775,6 +5781,7 @@ int __init sbpcd_init(void)
if (!request_region(CDo_command,4,major_name))
{
printk(KERN_WARNING "sbpcd: Unable to request region 0x%x\n", 
CDo_command);
+   release_region(addr[1],4);
return -EIO;
}
 
@@ -5788,6 +5795,8 @@ int __init sbpcd_init(void)
 #endif /* SOUND_BASE */
 
if (register_blkdev(MAJOR_NR, major_name)) {
+   release_region(CDo_command,4);
+   release_region(addr[1],4);
 #ifdef MODULE
return -EIO;
 #else
@@ -5801,6 +5810,7 @@ int __init sbpcd_init(void)
sbpcd_queue = blk_init_queue(do_sbpcd_request, _lock);
if (!sbpcd_queue) {
release_region(CDo_command,4);
+   release_region(addr[1],4);
unregister_blkdev(MAJOR_NR, major_name);
return -ENOMEM;
}
@@ -5834,6 +5844,7 @@ int __init sbpcd_init(void)
printk("Can't unregister %s\n", major_name);
}
release_region(CDo_command,4);
+   release_region(addr[1],4);
blk_cleanup_queue(sbpcd_queue);
return -EIO;
}
@@ -5850,6 +5861,7 @@ int __init sbpcd_init(void)
if (sbpcd_infop == NULL)
{
 release_region(CDo_command,4);
+   release_region(addr[1],4);
blk_cleanup_queue(sbpcd_queue);
 return -ENOMEM;
}
@@ -5894,6 +5906,7 @@ static void sbpcd_exit(void)
return;
}
release_region(CDo_command,4);
+   release_region(addr[1],4);
blk_cleanup_queue(sbpcd_queue);
for (j=0;jhttp://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G

2007-06-07 Thread William Lee Irwin III

William Lee Irwin III wrote:
>> !CONFIG_X86_PAE && CONFIG_HIGHMEM64G doesn't make sense and is not allowed
>> by this patch. CONFIG_X86_PAE && !CONFIG_HIGHMEM64G works here.


On Thu, Jun 07, 2007 at 08:38:22PM -0700, H. Peter Anvin wrote:
> But what's the point?
> If you're going to divorce these, at least do it in a way that makes
> sense, specifically the two independent variables are PAE and HIGHMEM.
> PAE and !HIGHMEM does make (some amount of) sense, due to no kmap overhead.

Beg your pardon? Are you reading the patch description correctly?


-- wli
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G

2007-06-07 Thread H. Peter Anvin

William Lee Irwin III wrote:
> 
> !CONFIG_X86_PAE && CONFIG_HIGHMEM64G doesn't make sense and is not allowed
> by this patch. CONFIG_X86_PAE && !CONFIG_HIGHMEM64G works here.
> 

But what's the point?

If you're going to divorce these, at least do it in a way that makes
sense, specifically the two independent variables are PAE and HIGHMEM.
PAE and !HIGHMEM does make (some amount of) sense, due to no kmap overhead.

-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [BUG] via c3/2.6.20 -- ps/2 keyboard doesn't work with console

2007-06-07 Thread Dmitry Torokhov

Hi,

On Thursday 07 June 2007 13:17, Paul Albrecht wrote:
> Hi,
> 
> I'm trying use the 2.6.20 kernel and have run into a problem using the
> ps/2 keyboard with console. When I type input in response to console
> output the characters aren't echo'ed and the console program doesn't
> receive the character input. It seems like the problem is specific to
> the ps/2 interface because a usb keyboard works fine.

I think this bug is fixed in newer kernels. Please try the latest 2.6.21.y
and if it still does not work please send me full dmesg after booting with
"i8042.debug log_buf_len=131072". Thanks!

-- 
Dmitry
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [BUG] ptraced process waiting on syscall may return kernel internal errnos

2007-06-07 Thread Linus Torvalds

On Fri, 8 Jun 2007, Benjamin Herrenschmidt wrote:
> 
> Looks good to me. Do you think we need to do something about the DRM
> notifier thingy too ?

No. I think that anybody who uses that gets whatever he deserves. It's 
designed for one particular usage scenario (where it should work fine), if 
you try to use it for something else across threads, you've already lost.

> Oh and Roland patch doesn't prevent signalfd() from stealing synchronous
> signals such as SIGSEGV etc... which I think would result in random
> behaviour do you want a patch for that or we just consider it broken
> API usage and let it as it is ?

I think Davide ack'ed your patch, but I also think that one clashes with 
the one I actually ended up applying. If you were to send me the signalfd 
part (tested, preferably), I'd apply it, considering that it seems to be 
the right thign to do, and it already got acked by Davide.

Or did I get confused by some other patches flying around?

(Gaah, there's been _way_ too much patching going on today, I'm not happy 
about how -rc4 seems to not be calming down)

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 7/7] lguest documentation:Chapter VII

2007-06-07 Thread Rusty Russell

Documentation: The FIXMEs

Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>
---
 Documentation/lguest/lguest.c |   12 
 drivers/char/hvc_lguest.c |3 +++
 drivers/lguest/interrupts_and_traps.c |   14 ++
 drivers/lguest/io.c   |   10 ++
 drivers/lguest/lguest.c   |8 
 drivers/lguest/lguest_asm.S   |   14 ++
 drivers/lguest/page_tables.c  |5 +
 drivers/lguest/segments.c |4 
 drivers/net/lguest_net.c  |   19 +++
 9 files changed, 89 insertions(+)

===
--- a/Documentation/lguest/lguest.c
+++ b/Documentation/lguest/lguest.c
@@ -1536,3 +1536,15 @@ int main(int argc, char *argv[])
/* Finally, run the Guest.  This doesn't return. */
run_guest(lguest_fd, _list);
 }
+/*:*/
+
+/*M:999
+ * Mastery is done: you now know everything I do.
+ *
+ * But surely you have seen code, features and bugs in your wanderings which
+ * you now yearn to attack?  That is the real game, and I look forward to you
+ * patching and forking lguest into the Your-Name-Here-visor.
+ *
+ * Farewell, and good coding!
+ * Rusty Russell.
+ */
===
--- a/drivers/char/hvc_lguest.c
+++ b/drivers/char/hvc_lguest.c
@@ -13,6 +13,9 @@
  * functions.
  :*/
 
+/*M:002 The console can be flooded: while the Guest is processing input the
+ * Host can send more.  Buffering in the Host could alleviate this, but it is a
+ * difficult problem in general. :*/
 /* Copyright (C) 2006 Rusty Russell, IBM Corporation
  *
  * This program is free software; you can redistribute it and/or modify
===
--- a/drivers/lguest/interrupts_and_traps.c
+++ b/drivers/lguest/interrupts_and_traps.c
@@ -231,6 +231,20 @@ static int direct_trap(const struct lgue
 * go direct, of course 8) */
return idt_type(trap->a, trap->b) == 0xF;
 }
+/*:*/
+
+/*M:005 The Guest has the ability to turn its interrupt gates into trap gates,
+ * if it is careful.  The Host will let trap gates can go directly to the
+ * Guest, but the Guest needs the interrupts atomically disabled for an
+ * interrupt gate.  It can do this by pointing the trap gate at instructions
+ * within noirq_start and noirq_end, where it can safely disable interrupts. */
+
+/*M:006 The Guests do not use the sysenter (fast system call) instruction,
+ * because it's hardcoded to enter privilege level 0 and so can't go direct.
+ * It's about twice as fast as the older "int 0x80" system call, so it might
+ * still be worthwhile to handle it in the Switcher and lcall down to the
+ * Guest.  The sysenter semantics are hairy tho: search for that keyword in
+ * entry.S :*/
 
 /*H:260 When we make traps go directly into the Guest, we need to make sure
  * the kernel stack is valid (ie. mapped in the page tables).  Otherwise, the
===
--- a/drivers/lguest/io.c
+++ b/drivers/lguest/io.c
@@ -570,6 +570,16 @@ void set_wakeup_process(struct lguest *l
get_task_struct(lg->wake);
 }
 
+/*M:007 We only return a single DMA buffer to the Launcher, but it would be
+ * more efficient to return a pointer to the entire array of DMA buffers, which
+ * it can cache and choose one whenever it wants.
+ *
+ * Currently the Launcher uses a write to /dev/lguest, and the return value is
+ * the address of the DMA structure with the interrupt number placed in
+ * dma->used_len.  If we wanted to return the entire array, we need to return
+ * the address, array size and interrupt number: this seems to require an
+ * ioctl(). :*/
+
 /*L:320 This routine looks for a DMA buffer registered by the Guest on the
  * given key (using the BIND_DMA hypercall). */
 unsigned long get_dma_buffer(struct lguest *lg,
===
--- a/drivers/lguest/lguest.c
+++ b/drivers/lguest/lguest.c
@@ -251,6 +251,14 @@ static void irq_enable(void)
 {
lguest_data.irq_enabled = X86_EFLAGS_IF;
 }
+/*:*/
+/*M:003 Note that we don't check for outstanding interrupts when we re-enable
+ * them (or when we unmask an interrupt).  This seems to work for the moment,
+ * since interrupts are rare and we'll just get the interrupt on the next timer
+ * tick, but when we turn on CONFIG_NO_HZ, we should revisit this.  One way
+ * would be to put the "irq_enabled" field in a page by itself, and have the
+ * Host write-protect it when an interrupt comes in when irqs are disabled.
+ * There will then be a page fault as soon as interrupts are re-enabled. :*/
 
 /*G:034
  * The Interrupt Descriptor Table (IDT).
===
--- a/drivers/lguest/lguest_asm.S
+++ b/drivers/lguest/lguest_asm.S
@@ -41,6 +41,20 @@

Re: [Bug 8473] New: Oops: 0010 [1] SMP

2007-06-07 Thread Andrew Morton

On Fri, 8 Jun 2007 05:06:29 +0200 Björn Steinbrink <[EMAIL PROTECTED]> wrote:

> On 2007.05.26 21:10:15 +0200, Nicolas Mailhot wrote:
> > Le jeudi 17 mai 2007 à 18:59 +0200, Nicolas Mailhot a écrit :
> > > Le jeudi 17 mai 2007 à 09:45 -0700, Randy Dunlap a écrit :
> > > 
> > > > Can you boot with "kstack=32" so that we can see more of the stack?
> > > 
> > > I can try. It's not triggering quickly though
> > 
> > Seems I was completely wrong about the trigger, but anyway it happened
> > again, this time on 2.6.22-rc2.mm1.cfs14 (and I had kept kstack=32)
> > 
> >  BUG: using smp_processor_id() in preemptible [0001] code: bash/3857
> >  caller is oops_begin+0xb/0x6f
> >  
> >  Call Trace:
> >  [] show_trace+0x34/0x4f
> >  [] dump_stack+0x12/0x17
> >  [] debug_smp_processor_id+0xad/0xbc
> >  [] oops_begin+0xb/0x6f
> >  [] do_page_fault+0x66a/0x7c0
> >  [] error_exit+0x0/0x84

hm that was dumb.  I'll stick a raw_smp_processor_id() in there.

> >  Unable to handle kernel NULL pointer dereference at  RIP: 
> >  [<>]
> >  PGD bdd2067 PUD c133067 PMD 0 
> >  Oops: 0010 [1] PREEMPT SMP 
> >  CPU 1 
> >  Pid: 3857, comm: bash Not tainted 2.6.22-0.8.rc2.mm1.cfs14.fc8.nim #1
> >  RIP: 0010:[<>]  [<>]
> >  RSP: 0018:81000cb03ee0  EFLAGS: 00010296
> >  RAX: 8044dbc0 RBX: 81000c3aa8c0 RCX: 7fff549dcae4
> >  RDX: 5410 RSI: 81000c3aa8c0 RDI: 81000ba913d8
> >  RBP: 7fff549dcae4 R08:  R09: 
> >  R10: 0008 R11: 0246 R12: 5410
> >  R13: 00ff R14: 00ff R15: 
> >  FS:  2b06560d8f40() GS:810004017180() 
> > knlGS:
> >  CS:  0010 DS:  ES:  CR0: 8005003b
> >  CR2:  CR3: 0bc55000 CR4: 06e0
> >  Process bash (pid: 3857, threadinfo 81000cb02000, task 
> > 81000adc59a0)
> >  Stack:  8028ada9 81000c3aa8c0 7fff549dcae4 7fff549dcae4
> >  8028b016 5410 00ff 81000c3aa8c0
> >   7fff549dcae4 5410 00ff
> >  8028b088  80209571 
> >  7fff549dce87 0f11 7fff549dcfb8 7fff549dddb0
> >  802095dc 0246 0008 
> >   ffda  7fff549dcae4
> >  5410 00ff 0010 003d340c9117
> >  Call Trace:
> >  Inexact backtrace:
> >  [] do_ioctl+0x55/0x6b
> >  [] vfs_ioctl+0x257/0x270
> >  [] sys_ioctl+0x59/0x79
> >  [] tracesys+0xdc/0xe1
> >  
> >  INFO: lockdep is turned off.
> >  
> >  Code:  Bad RIP value.
> >  RIP  [<>]
> >  RSP 
> >  CR2: 
> 
> This is do_tty_hangup() exchanging the fops while we're waiting for the
> lock. The new fops (hung_up_tty_fops) only have the unlocked variant and
> thus we Oops our way.

ah, yes, that explains it, thanks.  Culprit:

commit e10cc1df1d2014f68a4bdcf73f6dd122c4561f94
Author: Paul Fulghum <[EMAIL PROTECTED]>
Date:   Thu May 10 22:22:50 2007 -0700

tty: add compat_ioctl

Add compat_ioctl method for tty code to allow processing of 32 bit ioctl
calls on 64 bit systems by tty core, tty drivers, and line disciplines.

Based on patch by Arnd Bergmann:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0511.0/1732.html

[EMAIL PROTECTED]: make things static]
Signed-off-by: Paul Fulghum <[EMAIL PROTECTED]>
Acked-by: Arnd Bergmann <[EMAIL PROTECTED]>
Cc: Alan Cox <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>


> The following program reproduces it quite easily on a SMP box. I'm
> running it from X as root like this:
> while true; do xterm /path/to/program; done
> 
> #include 
> #include 
> #include 
> 
> #include 
> 
> pid_t pid;
> 
> void *thread(void *arg)
> {
>   while (1)
>   ioctl(0, TIOCSPGRP, );
> }
> 
> int main()
> {
>   pthread_t t;
> 
>   pid = getpid();
> 
>   pthread_create(, NULL, thread, NULL);
>   sleep(1);
>   vhangup();
>   perror("vhangup");
>   return 0;
> }
> 
> I'm not exactly sure how to solve that in a clean way, though. Moving
> the call to lock_kernel() up would make the Oops go away, but could
> result in the wrong error code being returned. Checking for ioctl first
> and unlocked_ioctl last would cause useless locking. And retrying the
> unlocked ioctl doesn't look nice either :-(
> 
> Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 6/7] lguest documentation:Chapter VI

2007-06-07 Thread Rusty Russell

Documentation: The Switcher

Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>
---
 drivers/lguest/core.c |   51 +++-
 drivers/lguest/switcher.S |  271 ++---
 2 files changed, 276 insertions(+), 46 deletions(-)

===
--- a/drivers/lguest/core.c
+++ b/drivers/lguest/core.c
@@ -394,46 +394,89 @@ static void set_ts(void)
write_cr0(cr0|8);
 }
 
+/*S:010
+ * We are getting close to the Switcher.
+ *
+ * Remember that each CPU has two pages which are visible to the Guest when it
+ * runs on that CPU.  This has to contain the state for that Guest: we copy the
+ * state in just before we run the Guest.
+ *
+ * Each Guest has "changed" flags which indicate what has changed in the Guest
+ * since it last ran.  We saw this set in interrupts_and_traps.c and
+ * segments.c.
+ */
 static void copy_in_guest_info(struct lguest *lg, struct lguest_pages *pages)
 {
+   /* Copying all this data can be quite expensive.  We usually run the
+* same Guest we ran last time (and that Guest hasn't run anywhere else
+* meanwhile).  If that's not the case, we pretend everything in the
+* Guest has changed. */
if (__get_cpu_var(last_guest) != lg || lg->last_pages != pages) {
__get_cpu_var(last_guest) = lg;
lg->last_pages = pages;
lg->changed = CHANGED_ALL;
}
 
-   /* These are pretty cheap, so we do them unconditionally. */
+   /* These copies are pretty cheap, so we do them unconditionally: */
+   /* Save the current Host top-level page directory. */
pages->state.host_cr3 = __pa(current->mm->pgd);
+   /* Set up the Guest's page tables to see this CPU's pages (and no
+* other CPU's pages). */
map_switcher_in_guest(lg, pages);
+   /* Set up the two "TSS" members which tell the CPU what stack to use
+* for traps which do directly into the Guest (ie. traps at privilege
+* level 1). */
pages->state.guest_tss.esp1 = lg->esp1;
pages->state.guest_tss.ss1 = lg->ss1;
 
-   /* Copy direct trap entries. */
+   /* Copy direct-to-Guest trap entries. */
if (lg->changed & CHANGED_IDT)
copy_traps(lg, pages->state.guest_idt, default_idt_entries);
 
-   /* Copy all GDT entries but the TSS. */
+   /* Copy all GDT entries which the Guest can change. */
if (lg->changed & CHANGED_GDT)
copy_gdt(lg, pages->state.guest_gdt);
/* If only the TLS entries have changed, copy them. */
else if (lg->changed & CHANGED_GDT_TLS)
copy_gdt_tls(lg, pages->state.guest_gdt);
 
+   /* Mark the Guest as unchanged for next time. */
lg->changed = 0;
 }
 
+/* Finally: the code to actually call into the Switcher to run the Guest. */
 static void run_guest_once(struct lguest *lg, struct lguest_pages *pages)
 {
+   /* This is a dummy value we need for GCC's sake. */
unsigned int clobber;
 
+   /* Copy the guest-specific information into this CPU's "struct
+* lguest_pages". */
copy_in_guest_info(lg, pages);
 
-   /* Put eflags on stack, lcall does rest: suitable for iret return. */
+   /* Now: we push the "eflags" register on the stack, then do an "lcall".
+* This is how we change from using the kernel code segment to using
+* the dedicated lguest code segment, as well as jumping into the
+* Switcher.
+*
+* The lcall also pushes the old code segment (KERNEL_CS) onto the
+* stack, then the address of this call.  This stack layout happens to
+* exactly match the stack of an interrupt... */
asm volatile("pushf; lcall *lguest_entry"
+/* This is how we tell GCC that %eax ("a") and %ebx ("b")
+ * are changed by this routine.  The "=" means output. */
 : "=a"(clobber), "=b"(clobber)
+/* %eax contains the pages pointer.  ("0" refers to the
+ * 0-th argument above, ie "a").  %ebx contains the
+ * physical address of the Guest's top-level page
+ * directory. */
 : "0"(pages), "1"(__pa(lg->pgdirs[lg->pgdidx].pgdir))
+/* We tell gcc that all these registers could change,
+ * which means we don't have to save and restore them in
+ * the Switcher. */
 : "memory", "%edx", "%ecx", "%edi", "%esi");
 }
+/*:*/
 
 /*H:030 Let's jump straight to the the main loop which runs the Guest.
  * Remember, this is called by the Launcher reading /dev/lguest, and we keep
===
--- a/drivers/lguest/switcher.S
+++ b/drivers/lguest/switcher.S
@@ -6,41 +6,131 @@
  * are feeling invigorated and refreshed then the next, more

[PATCH 3/7] lguest documentation:Chapter III

2007-06-07 Thread Rusty Russell

Documentation: The Drivers

Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>
---
 drivers/block/lguest_blk.c  |  171 +++---
 drivers/char/hvc_lguest.c   |   77 +
 drivers/lguest/lguest_bus.c |   72 
 drivers/net/lguest_net.c|  222 +++
 include/linux/lguest_bus.h  |5 
 include/linux/lguest_launcher.h |   60 ++
 6 files changed, 565 insertions(+), 42 deletions(-)

===
--- a/drivers/block/lguest_blk.c
+++ b/drivers/block/lguest_blk.c
@@ -1,6 +1,12 @@
-/* A simple block driver for lguest.
- *
- * Copyright 2006 Rusty Russell <[EMAIL PROTECTED]> IBM Corporation
+/*D:400
+ * The Guest block driver
+ *
+ * This is a simple block driver, which appears as /dev/lgba, lgbb, lgbc etc.
+ * The mechanism is simple: we place the information about the request in the
+ * device page, then use SEND_DMA (containing the data for a write, or an empty
+ * "ping" DMA for a read).
+ :*/
+/* Copyright 2006 Rusty Russell <[EMAIL PROTECTED]> IBM Corporation
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -25,27 +31,50 @@
 
 static char next_block_index = 'a';
 
+/*D:420 Here is the structure which holds all the information we need about
+ * each Guest block device.
+ *
+ * I'm sure at this stage, you're wondering "hey, where was the adventure I was
+ * promised?" and thinking "Rusty sucks, I shall say nasty things about him on
+ * my blog".  I think Real adventures have boring bits, too, and you're in the
+ * middle of one.  But it gets better.  Just not quite yet. */
 struct blockdev
 {
+   /* The block queue infrastructure wants a spinlock: it is held while it
+* calls our block request function.  We grab it in our interrupt
+* handler so the responses don't mess with new requests. */
spinlock_t lock;
 
-   /* The disk structure for the kernel. */
+   /* The disk structure registered with kernel. */
struct gendisk *disk;
 
-   /* The major number for this disk. */
+   /* The major device number for this disk, and the interrupt.  We only
+* really keep them here for completeness; we'd need them if we
+* supported device unplugging. */
int major;
int irq;
 
+   /* The physical address of this device's memory page */
unsigned long phys_addr;
-   /* The mapped block page. */
+   /* The mapped memory page for convenient acces. */
struct lguest_block_page *lb_page;
 
-   /* We only have a single request outstanding at a time. */
+   /* We only have a single request outstanding at a time: this is it. */
struct lguest_dma dma;
struct request *req;
 };
 
-/* Jens gave me this nice helper to end all chunks of a request. */
+/*D:495 We originally used end_request() throughout the driver, but it turns
+ * out that end_request() is deprecated, and doesn't actually end the request
+ * (which seems like a good reason to deprecate it!).  It simply ends the first
+ * bio.  So if we had 3 bios in a "struct request" we would do all 3,
+ * end_request(), do 2, end_request(), do 1 and end_request(): twice as much
+ * work as we needed to do.
+ *
+ * This reinforced to me that I do not understand the block layer.
+ *
+ * Nonetheless, Jens Axboe gave me this nice helper to end all chunks of a
+ * request.  This improved disk speed by 130%. */
 static void end_entire_request(struct request *req, int uptodate)
 {
if (end_that_request_first(req, uptodate, req->hard_nr_sectors))
@@ -55,30 +84,62 @@ static void end_entire_request(struct re
end_that_request_last(req, uptodate);
 }
 
+/* I'm told there are only two stories in the world worth telling: love and
+ * hate.  So there used to be a love scene here like this:
+ *
+ *  Launcher:  We could make beautiful I/O together, you and I.
+ *  Guest: My, that's a big disk!
+ *
+ * Unfortunately, it was just too raunchy for our otherwise-gentle tale. */
+
+/*D:490 This is the interrupt handler, called when a block read or write has
+ * been completed for us. */
 static irqreturn_t lgb_irq(int irq, void *_bd)
 {
+   /* We handed our "struct blockdev" as the argument to request_irq(), so
+* it is passed through to us here.  This tells us which device we're
+* dealing with in case we have more than one. */
struct blockdev *bd = _bd;
unsigned long flags;
 
+   /* We weren't doing anything?  Strange, but could happen if we shared
+* interrupts (we don't!). */
if (!bd->req) {
pr_debug("No work!\n");
return IRQ_NONE;
}
 
+   /* Not done yet?  That's equally strange. */
if (!bd->lb_page->result) {
pr_debug("No result!\n");
return IRQ_NONE;

[PATCH 2/7] lguest documentation:Chapter II

2007-06-07 Thread Rusty Russell

Documentation: The Guest

Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>
---
 drivers/lguest/lguest.c |  456 ---
 drivers/lguest/lguest_asm.S |   57 +++--
 include/linux/lguest.h  |   47 +++-
 3 files changed, 510 insertions(+), 50 deletions(-)

===
--- a/drivers/lguest/lguest.c
+++ b/drivers/lguest/lguest.c
@@ -68,6 +68,12 @@
 #include 
 #include 
 
+/*G:010 Welcome to the Guest!
+ *
+ * The Guest in our tale is a simple creature: identical to the Host but
+ * behaving in simplified but equivalent ways.  In particular, the Guest is the
+ * same kernel as the Host (or at least, built from the same source code). :*/
+
 /* Declarations for definitions in lguest_guest.S */
 extern char lguest_noirq_start[], lguest_noirq_end[];
 extern const char lgstart_cli[], lgend_cli[];
@@ -85,7 +91,26 @@ struct lguest_data lguest_data = {
 };
 struct lguest_device_desc *lguest_devices;
 
-static enum paravirt_lazy_mode lazy_mode;
+/*G:035 Notice the lazy_hcall() above, rather than hcall().  This is our first
+ * real optimization trick!
+ *
+ * When lazy_mode is set, it means we're allowed to defer all hypercalls and do
+ * them as a batch when lazy_mode is eventually turned off.  Because hypercalls
+ * are reasonably expensive, batching them up makes sense.  For example, a
+ * large mmap might update dozens of page table entries: that code calls
+ * lguest_lazy_mode(PARAVIRT_LAZY_MMU), does the dozen updates, then calls
+ * lguest_lazy_mode(PARAVIRT_LAZY_NONE).
+ *
+ * So, when we're in lazy mode, we call async_hypercall() to store the call for
+ * future processing.  When lazy mode is turned off we issue a hypercall to
+ * flush the stored calls.
+ *
+ * There's also a hack where "mode" is set to "PARAVIRT_LAZY_FLUSH" which
+ * indicates we're to flush any outstanding calls immediately.  This is used
+ * when an interrupt handler does a kmap_atomic(): the page table changes must
+ * happen immediately even if we're in the middle of a batch.  Usually we're
+ * not, though, so there's nothing to do. */
+static enum paravirt_lazy_mode lazy_mode; /* Note: not SMP-safe! */
 static void lguest_lazy_mode(enum paravirt_lazy_mode mode)
 {
if (mode == PARAVIRT_LAZY_FLUSH) {
@@ -109,6 +134,16 @@ static void lazy_hcall(unsigned long cal
async_hcall(call, arg1, arg2, arg3);
 }
 
+/* async_hcall() is pretty simple: I'm quite proud of it really.  We have a
+ * ring buffer of stored hypercalls which the Host will run though next time we
+ * do a normal hypercall.  Each entry in the ring has 4 slots for the hypercall
+ * arguments, and a "hcall_status" word which is 0 if the call is ready to go,
+ * and 255 once the Host has finished with it.
+ *
+ * If we come around to a slot which hasn't been finished, then the table is
+ * full and we just make the hypercall directly.  This has the nice side
+ * effect of causing the Host to run all the stored calls in the ring buffer
+ * which empties it for next time! */
 void async_hcall(unsigned long call,
 unsigned long arg1, unsigned long arg2, unsigned long arg3)
 {
@@ -116,6 +151,9 @@ void async_hcall(unsigned long call,
static unsigned int next_call;
unsigned long flags;
 
+   /* Disable interrupts if not already disabled: we don't want an
+* interrupt handler making a hypercall while we're already doing
+* one! */
local_irq_save(flags);
if (lguest_data.hcall_status[next_call] != 0xFF) {
/* Table full, so do normal hcall which will flush table. */
@@ -125,7 +163,7 @@ void async_hcall(unsigned long call,
lguest_data.hcalls[next_call].edx = arg1;
lguest_data.hcalls[next_call].ebx = arg2;
lguest_data.hcalls[next_call].ecx = arg3;
-   /* Make sure host sees arguments before "valid" flag. */
+   /* Arguments must all be written before we mark it to go */
wmb();
lguest_data.hcall_status[next_call] = 0;
if (++next_call == LHCALL_RING_SIZE)
@@ -133,9 +171,14 @@ void async_hcall(unsigned long call,
}
local_irq_restore(flags);
 }
-
+/*:*/
+
+/* Wrappers for the SEND_DMA and BIND_DMA hypercalls.  This is mainly because
+ * Jeff Garzik complained that __pa() should never appear in drivers, and this
+ * helps remove most of them.   But also, it wraps some ugliness. */
 void lguest_send_dma(unsigned long key, struct lguest_dma *dma)
 {
+   /* The hcall might not write this if something goes wrong */
dma->used_len = 0;
hcall(LHCALL_SEND_DMA, key, __pa(dma), 0);
 }
@@ -143,11 +186,16 @@ int lguest_bind_dma(unsigned long key, s
 int lguest_bind_dma(unsigned long key, struct lguest_dma *dmas,
unsigned int num, u8 irq)
 {
+   /* This is the only hypercall which actually wants 5 arguments, and we
+* only

[PATCH 1/7] lguest documentation: infrastructure and Chapter I

2007-06-07 Thread Rusty Russell

The netfilter code had very good documentation: the Netfilter Hacking
HOWTO.  Noone ever read it.

So this time I'm trying something different, using a bit of
Knuthiness.

Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>
---
 Documentation/lguest/extract  |   58 +
 Documentation/lguest/lguest.c |9 +++--
 drivers/lguest/Makefile   |   12 ++
 drivers/lguest/README |   47 ++
 drivers/lguest/core.c |7 ++-
 drivers/lguest/hypercalls.c   |9 +++--
 drivers/lguest/interrupts_and_traps.c |   13 +++
 drivers/lguest/io.c   |8 +++-
 drivers/lguest/lguest.c   |   31 -
 drivers/lguest/lguest_bus.c   |3 +
 drivers/lguest/lguest_user.c  |7 +++
 drivers/lguest/page_tables.c  |   10 -
 drivers/lguest/segments.c |   11 ++
 drivers/lguest/switcher.S |   13 +++
 14 files changed, 219 insertions(+), 19 deletions(-)

===
--- /dev/null
+++ b/Documentation/lguest/extract
@@ -0,0 +1,58 @@
+#! /bin/sh
+
+set -e
+
+PREFIX=$1
+shift
+
+trap 'rm -r $TMPDIR' 0
+TMPDIR=`mktemp -d`
+
+exec 3>/dev/null
+for f; do
+while IFS="
+" read -r LINE; do
+   case "$LINE" in
+   *$PREFIX:[0-9]*:\**)
+   NUM=`echo "$LINE" | sed "s/.*$PREFIX:\([0-9]*\).*/\1/"`
+   if [ -f $TMPDIR/$NUM ]; then
+   echo "$TMPDIR/$NUM already exits prior to $f"
+   exit 1
+   fi
+   exec 3>>$TMPDIR/$NUM
+   echo $f | sed 's,\.\./,,g' > $TMPDIR/.$NUM
+   /bin/echo "$LINE" | sed -e "s/$PREFIX:[0-9]*//" -e "s/:\*/*/" 
>&3
+   ;;
+   *$PREFIX:[0-9]*)
+   NUM=`echo "$LINE" | sed "s/.*$PREFIX:\([0-9]*\).*/\1/"`
+   if [ -f $TMPDIR/$NUM ]; then
+   echo "$TMPDIR/$NUM already exits prior to $f"
+   exit 1
+   fi
+   exec 3>>$TMPDIR/$NUM
+   echo $f | sed 's,\.\./,,g' > $TMPDIR/.$NUM
+   /bin/echo "$LINE" | sed "s/$PREFIX:[0-9]*//" >&3
+   ;;
+   *:\**)
+   /bin/echo "$LINE" | sed -e "s/:\*/*/" -e "s,/\*\*/,," >&3
+   echo >&3
+   exec 3>/dev/null
+   ;;
+   *)
+   /bin/echo "$LINE" >&3
+   ;;
+   esac
+done < $f
+echo >&3
+exec 3>/dev/null
+done
+
+LASTFILE=""
+for f in $TMPDIR/*; do
+if [ "$LASTFILE" != $(cat $TMPDIR/.$(basename $f) ) ]; then
+   LASTFILE=$(cat $TMPDIR/.$(basename $f) )
+   echo "[ $LASTFILE ]"
+fi
+cat $f
+done
+
===
--- a/Documentation/lguest/lguest.c
+++ b/Documentation/lguest/lguest.c
@@ -1,5 +1,10 @@
-/* Simple program to layout "physical" memory for new lguest guest.
- * Linked high to avoid likely physical memory.  */
+/*P:100 This is the Launcher code, a simple program which lays out the
+ * "physical" memory for the new Guest by mapping the kernel image and the
+ * virtual devices, then reads repeatedly from /dev/lguest to run the Guest.
+ *
+ * The only trick: the Makefile links it statically at a high address, so it
+ * will be clear of the guest memory region.  It means that each Guest cannot
+ * have more than 2.5G of memory on a normally configured Host. :*/
 #define _LARGEFILE64_SOURCE
 #define _GNU_SOURCE
 #include 
===
--- a/drivers/lguest/Makefile
+++ b/drivers/lguest/Makefile
@@ -5,3 +5,15 @@ obj-$(CONFIG_LGUEST)   += lg.o
 obj-$(CONFIG_LGUEST)   += lg.o
 lg-y := core.o hypercalls.o page_tables.o interrupts_and_traps.o \
segments.o io.o lguest_user.o switcher.o
+
+Preparation Preparation!: PREFIX=P
+Guest: PREFIX=G
+Drivers: PREFIX=D
+Launcher: PREFIX=L
+Host: PREFIX=H
+Switcher: PREFIX=S
+Mastery: PREFIX=M
+Beer:
+   @for f in Preparation Guest Drivers Launcher Host Switcher Mastery; do 
echo "{==- $$f -==}"; make -s $$f; done; echo "{==-==}"
+Preparation Preparation! Guest Drivers Launcher Host Switcher Mastery:
+   @sh ../../Documentation/lguest/extract $(PREFIX) `find ../../* -name 
'*.[chS]' -wholename '*lguest*'`
===
--- /dev/null
+++ b/drivers/lguest/README
@@ -0,0 +1,47 @@
+Welcome, friend reader, to lguest.
+
+Lguest is an adventure, with you, the reader, as Hero.  I can't think of many
+5000-line projects which offer both such capability and glimpses of future
+potential; it is an exciting time to be delving into the source!
+
+But be warned; this is an arduous journey of several hours or more!  And as we
+know, all true Heroes are driven by a Noble Goal.  Thus I offer a Beer (or
+equivalent) to anyone I meet who has completed this

Re: [PATCH 001 of 2] Fix read/truncate race.

2007-06-07 Thread Neil Brown

On Thursday June 7, [EMAIL PROTECTED] wrote:
> 
> Is this really worth fixing?

Hard to say.
One could argue that a read should never return data that was never in
the file.  One could also argue that you should always use locking...
But people often look for lock-less solutions.

> 
> The code still has races there - for example, another process can come
> after we've got the page, truncate it off the file then write new data.  So
> this thread ends up copying out-of-date page contents out to userspace.

I don't think "out of date" is a problem.  As long as the data
returned by the "read" matches what was in the file at some moment
between when the syscall started and when the syscall completed, it is
correct.

> 
> The application is being silly, and the silly application gets wrong data:
> either out-of-date or zeroed.

"was in the file when syscall commenced" and "was never in the file"
are two very different things.
"silly" ??  Maybe it is "silly" to return data that was never in the
file.

> 
> afacit the patch shrinks the timing windows significantly, but at a cost:
> moving a bunch of 64-bit arithmetic and seqlock operations into the inner
> loop.

For common cases where truncation is to a page boundary, the window
for seeing data that was never in the file is shrunk to zero.  Partial
truncates can still cause a problem as the changelog comment suggests,
and require a different solution.

Most of the 64-bit arithmetic is already in the main loop, though not
quite all.  In most cases the seqlock will not loop - how expensive is
that?
The following patch will remove the extra seqlock except when we
actually need it and remove the extra arithmetic - but I haven't
tested it or reviewed it properly.  I can do that if you think it is
the right direction.

> 
> We could plug all this in a more predictable fashion by locking the page,
> but then we have another instance of the
> read-a-page-into-a-mmapped-copy-of-itself deadlock (I think - maybe it
> can't happen because of page uptodateness checks).

But do we really want exclusion between multiple readers of the same
page?  Yes, I expect locking the page would remove all the races, but
I suspect it is more expensive that the current alternative.

NeilBrown

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./mm/filemap.c |   34 --
 1 file changed, 28 insertions(+), 6 deletions(-)

diff .prev/mm/filemap.c ./mm/filemap.c
--- .prev/mm/filemap.c  2007-06-07 16:23:37.0 +1000
+++ ./mm/filemap.c  2007-06-08 12:43:36.0 +1000
@@ -876,6 +876,7 @@ void do_generic_mapping_read(struct addr
unsigned long next_index;
unsigned long prev_index;
unsigned int prev_offset;
+   loff_t isize;
int error;
struct file_ra_state ra = *_ra;

@@ -886,13 +887,22 @@ void do_generic_mapping_read(struct addr
last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> 
PAGE_CACHE_SHIFT;
offset = *ppos & ~PAGE_CACHE_MASK;

+   isize = i_size_read(inode);
+   if (!isize)
+   goto out;
+
+   end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
for (;;) {
struct page *page;
unsigned long end_index;
-   loff_t isize;
unsigned long nr, ret;

cond_resched();
+
+   if (unlikely(index > end_index)) {
+   page_cache_release(page);
+   goto out;
+   }
 find_page:
page = find_get_page(mapping, index);
if (!page) {
@@ -918,13 +928,25 @@ page_ok:
 * the correct value for "nr", which means the zero-filled
 * part of the page is not copied back to userspace (unless
 * another truncate extends the file - this is desired though).
+*
+* NOTE: This access of inode->i_size is not protected
+*  and if there is a concurrent update on a 32bit machine,
+*  it could return the wrong value.  This could only be a 
problem
+*  if i_size has actually changed to a smaller value before the
+*  page became uptodate, and at this point it still has a 
smaller
+*  value, but due to a race while reading, it appears 
unchanged.
+*  The chances of this happening are so small and the 
consequence
+*  sufficiently minor, that the cost of the seqlock seems
+*  not to be justified.
 */

-   isize = i_size_read(inode);
-   end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
-   if (unlikely(!isize || index > end_index)) {
-   page_cache_release(page);
-   goto out;
+   if (unlikely(isize != inode->i_size)) {
+   isize = i_size_read(inode);
+   end_index = (isize - 1)

[PATCH] driver core: multithreaded device matching with dependency

2007-06-07 Thread Huang, Ying

From: Huang Ying <[EMAIL PROTECTED]>

This is another solution to implement multithreaded device matching
(probing). The device matching is delayed until all drivers are
registered. The driver registering is executed one by one, this
eliminates the potential of interdependency between driver. All devices
are matched to drivers afterwards, multithreaded. The parent <->
children relationship between devices is used as the dependency between
devices, that is, the children devices matching will not begin until
parent device matching finishes.

Signed-off-by: Huang Ying <[EMAIL PROTECTED]>
---
The patch has been tested on a thinkpad T42. The system boot-up time can
be reduced effectively. To take advantage of the multithreaded matching
with the patch, following option must be added to kernel command line:

dev_match_thread=

For many distributions, the drivers are not compiled in kernel directly,
instead, they are compiled as modules and inserted into kernel by a
initramfs. To take advanage of multithreaded matching in this situation,
the device_match_freeze and device_match_thaw can be encapsulated into
two syscall, and the device probing sequence of initramfs should be:

1. device_match_freeze
2. modprobe 
3. modprobe 
...
. device_match_thaw.

 drivers/base/dd.c  |  168 ---
 include/linux/device.h |5 +
 init/main.c|3
 3 files changed, 165 insertions(+), 11 deletions(-)
Index: linux-2.6.22-rc4/init/main.c
===
--- linux-2.6.22-rc4.orig/init/main.c   2007-06-08 18:26:11.0
+0800
+++ linux-2.6.22-rc4/init/main.c2007-06-08 18:27:01.0 +0800
@@ -652,6 +652,7 @@
initcall_t *call;
int count = preempt_count();
 
+   device_match_freeze();
for (call = __initcall_start; call < __initcall_end; call++) {
ktime_t t0, t1, delta;
char *msg = NULL;
@@ -703,6 +704,8 @@
}
}
 
+   device_match_thaw(0);
+
/* Make sure there is no pending stuff from the initcall sequence */
flush_scheduled_work();
 }
Index: linux-2.6.22-rc4/drivers/base/dd.c
===
--- linux-2.6.22-rc4.orig/drivers/base/dd.c 2007-06-08
18:26:11.0 +0800
+++ linux-2.6.22-rc4/drivers/base/dd.c  2007-06-08 10:41:18.0
+0800
@@ -25,6 +25,7 @@
 
 #define to_drv(node) container_of(node, struct device_driver,
kobj.entry)
 
+static atomic_t device_match_freezed = ATOMIC_INIT(0);
 
 static void driver_bound(struct device *dev)
 {
@@ -220,6 +221,25 @@
return ret;
 }
 
+/* This function must be called with dev->sem held. */
+static inline int real_device_attach(struct device * dev)
+{
+   int ret = 0;
+
+   if (dev->driver) {
+   ret = device_bind_driver(dev);
+   if (ret == 0)
+   ret = 1;
+   else {
+   dev->driver = NULL;
+   ret = 0;
+   }
+   } else {
+   ret = bus_for_each_drv(dev->bus, NULL, dev, __device_attach);
+   }
+   return ret;
+}
+
 /**
  * device_attach - try to attach device to a driver.
  * @dev:   device.
@@ -238,18 +258,10 @@
 {
int ret = 0;
 
+   if (atomic_read(_match_freezed))
+   return 0;
down(>sem);
-   if (dev->driver) {
-   ret = device_bind_driver(dev);
-   if (ret == 0)
-   ret = 1;
-   else {
-   dev->driver = NULL;
-   ret = 0;
-   }
-   } else {
-   ret = bus_for_each_drv(dev->bus, NULL, dev, __device_attach);
-   }
+   ret = real_device_attach(dev);
up(>sem);
return ret;
 }
@@ -291,6 +303,8 @@
  */
 int driver_attach(struct device_driver * drv)
 {
+   if (atomic_read(_match_freezed))
+   return 0;
return bus_for_each_dev(drv->bus, NULL, drv, __driver_attach);
 }
 
@@ -380,3 +394,135 @@
 EXPORT_SYMBOL_GPL(device_attach);
 EXPORT_SYMBOL_GPL(driver_attach);
 
+#define to_dev(obj) container_of(obj, struct device, kobj)
+
+static void init_devices_check(void)
+{
+   struct kobject *kobj;
+   struct device *dev;
+
+   spin_lock(_subsys.list_lock);
+   list_for_each_entry(kobj, &(devices_subsys.list), entry) {
+   dev = to_dev(kobj);
+   dev = get_device(dev);
+   /* class device and root device need not be checked */
+   if (dev->class || !dev->parent) {
+   dev->is_checked = 1;
+   dev->is_matched = 1;
+   } else
+   dev->is_checked = 0;
+   dev->is_checking = 0;
+   put_device(dev);
+   }
+   spin_unlock(_subsys.list_lock);
+}
+
+#define CONTINUE_FOUND  1
+#define CONTINUE_WAIT   2
+
+static int

Re: [BUG] ptraced process waiting on syscall may return kernel internal errnos

2007-06-07 Thread Satoru Takeuchi

Hi Linus,

At Thu, 7 Jun 2007 08:54:33 -0700 (PDT),
Linus Torvalds wrote:
> 
> 
> 
> On Thu, 7 Jun 2007, Satoru Takeuchi wrote:
> > 
> > I tested your patch and my problem didn't occur again, so it seems to my
> > this case at least.
> 
> Ok. I applied Roland's slightly bigger patch instead, since Ben and Paul
> pointed out that some kernel threads depend on dequeue_signal clearing the 
> bit. You might want to test that one too (or just get current git), just 
> in case there are any differences in behaviour (not bloody likely, but 
> still..)

I tested 2.6.22-rc4-git1 and it works fine.

Thanks,

Satoru
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Bug 8473] New: Oops: 0010 [1] SMP

2007-06-07 Thread Björn Steinbrink

On 2007.05.26 21:10:15 +0200, Nicolas Mailhot wrote:
> Le jeudi 17 mai 2007 à 18:59 +0200, Nicolas Mailhot a écrit :
> > Le jeudi 17 mai 2007 à 09:45 -0700, Randy Dunlap a écrit :
> > 
> > > Can you boot with "kstack=32" so that we can see more of the stack?
> > 
> > I can try. It's not triggering quickly though
> 
> Seems I was completely wrong about the trigger, but anyway it happened
> again, this time on 2.6.22-rc2.mm1.cfs14 (and I had kept kstack=32)
> 
>  BUG: using smp_processor_id() in preemptible [0001] code: bash/3857
>  caller is oops_begin+0xb/0x6f
>  
>  Call Trace:
>  [] show_trace+0x34/0x4f
>  [] dump_stack+0x12/0x17
>  [] debug_smp_processor_id+0xad/0xbc
>  [] oops_begin+0xb/0x6f
>  [] do_page_fault+0x66a/0x7c0
>  [] error_exit+0x0/0x84
>  
>  Unable to handle kernel NULL pointer dereference at  RIP: 
>  [<>]
>  PGD bdd2067 PUD c133067 PMD 0 
>  Oops: 0010 [1] PREEMPT SMP 
>  CPU 1 
>  Pid: 3857, comm: bash Not tainted 2.6.22-0.8.rc2.mm1.cfs14.fc8.nim #1
>  RIP: 0010:[<>]  [<>]
>  RSP: 0018:81000cb03ee0  EFLAGS: 00010296
>  RAX: 8044dbc0 RBX: 81000c3aa8c0 RCX: 7fff549dcae4
>  RDX: 5410 RSI: 81000c3aa8c0 RDI: 81000ba913d8
>  RBP: 7fff549dcae4 R08:  R09: 
>  R10: 0008 R11: 0246 R12: 5410
>  R13: 00ff R14: 00ff R15: 
>  FS:  2b06560d8f40() GS:810004017180() knlGS:
>  CS:  0010 DS:  ES:  CR0: 8005003b
>  CR2:  CR3: 0bc55000 CR4: 06e0
>  Process bash (pid: 3857, threadinfo 81000cb02000, task 81000adc59a0)
>  Stack:  8028ada9 81000c3aa8c0 7fff549dcae4 7fff549dcae4
>  8028b016 5410 00ff 81000c3aa8c0
>   7fff549dcae4 5410 00ff
>  8028b088  80209571 
>  7fff549dce87 0f11 7fff549dcfb8 7fff549dddb0
>  802095dc 0246 0008 
>   ffda  7fff549dcae4
>  5410 00ff 0010 003d340c9117
>  Call Trace:
>  Inexact backtrace:
>  [] do_ioctl+0x55/0x6b
>  [] vfs_ioctl+0x257/0x270
>  [] sys_ioctl+0x59/0x79
>  [] tracesys+0xdc/0xe1
>  
>  INFO: lockdep is turned off.
>  
>  Code:  Bad RIP value.
>  RIP  [<>]
>  RSP 
>  CR2: 

This is do_tty_hangup() exchanging the fops while we're waiting for the
lock. The new fops (hung_up_tty_fops) only have the unlocked variant and
thus we Oops our way.

The following program reproduces it quite easily on a SMP box. I'm
running it from X as root like this:
while true; do xterm /path/to/program; done

#include 
#include 
#include 

#include 

pid_t pid;

void *thread(void *arg)
{
while (1)
ioctl(0, TIOCSPGRP, );
}

int main()
{
pthread_t t;

pid = getpid();

pthread_create(, NULL, thread, NULL);
sleep(1);
vhangup();
perror("vhangup");
return 0;
}

I'm not exactly sure how to solve that in a clean way, though. Moving
the call to lock_kernel() up would make the Oops go away, but could
result in the wrong error code being returned. Checking for ioctl first
and unlocked_ioctl last would cause useless locking. And retrying the
unlocked ioctl doesn't look nice either :-(

Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G

2007-06-07 Thread William Lee Irwin III

On Thu, 7 Jun 2007 19:35:51 -0700 William Lee Irwin III <[EMAIL PROTECTED]> 
wrote:
>> PAE is useful for more than supporting more than 4GB RAM. It supports
>> expanded swapspace and NX executable protections. Some users may want
>> NX or expanded swapspace support without the overhead or instability
>> of highmem. For these reasons, the following patch divorces
>> CONFIG_X86_PAE from CONFIG_HIGHMEM64G.

On Thu, Jun 07, 2007 at 07:41:56PM -0700, Andrew Morton wrote:
> Do (CONFIG_X86_PAE && !CONFIG_HIGHMEM64G) and (!CONFIG_X86_PAE && 
> CONFIG_HIGHMEM64G)
> kernels actually work?  I wouldn't be surprised if there are places where we 
> used
> the incorrect one.

!CONFIG_X86_PAE && CONFIG_HIGHMEM64G doesn't make sense and is not allowed
by this patch. CONFIG_X86_PAE && !CONFIG_HIGHMEM64G works here.


-- wli
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 7/8] fdmap v2 - implement sys_socket2

2007-06-07 Thread Ulrich Drepper

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Andrew Morton wrote:
> Absolutely.  Let's get them into their final form now, rather than letting
> some intermediate interface escape out into a major kernel release.

Even if Linus' redirection is adopted, these interfaces should still be
fixed up.  No need to rely on hacks if we can still fix up the interfaces.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFGaMU42ijCOnn/RHQRAm0MAJ9pprTVUD/2YvDsL6CPL21WvZSkEACeN3r8
TT8a5FyF2rCct+8gBKyWF/s=
=fs4f
-END PGP SIGNATURE-
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH]: complete cleanup of check_region

2007-06-07 Thread Surya


> >
> > @@ -5640,6 +5645,7 @@ int __init sbpcd_init(void)
> > int i=0, j=0;
> > int addr[2]={1, CDROM_PORT};
> > int port_index;
> > +   int request_region_flag;
> >
> One could argue that it would be possible to just use the 'j' variable
> for this since it's not used for anything else until the *_region()
> stuff is all done, that would save adding a new variable...
completely eliminated the variable, haven't felt the need for it.


> > msg(DBG_INI,"check_drives done.\n");
> > +   if(request_region_flag==0)
> > +   release_region(addr[1],4);
> 
> I know the driver in its current version just tests if the region is
> available and doesn't hang on to it, but I'm wondering if that's not a
> bug.  Shouldn't we actually hold on to this region as long as the
> driver is loaded and only release the region in the sbpcd_exit()
> function, just loke we do with the "CDo_command" region?  
But if any of the conditions fail after the request_region code, it is
returning an error and exiting out of init. Dont we need to cleanup
request_region's allocation before we exit. Infact I had a feeling that
CDo_command region cleanup was not perfect. Did a cleanup of both the
regions wherever we have a return out of the function.

Please have a look at the below patch.


diff --git a/drivers/cdrom/sbpcd.c b/drivers/cdrom/sbpcd.c
index a1283b1..0a85b1b 100644
--- a/drivers/cdrom/sbpcd.c
+++ b/drivers/cdrom/sbpcd.c
@@ -358,6 +358,10 @@
  * Add bio/kdev_t changes for 2.5.x required to make it work again. 
  * Still room for improvement in the request handling here if anyone
  * actually cares.  Bring your own chainsaw.Paul G.  02/2002
+ *
+ *
+ * Cleaned up the reference for deprecated check_region to 
+ * request_region.- Surya Prabhakar N  08/07/2007
  */
 
 
@@ -555,6 +559,8 @@ static struct cdrom_read_audio read_audio;
 static unsigned char msgnum;
 static char msgbuf[80];
 
+static int addr[2];
+
 static int max_drives = MAX_DRIVES;
 module_param(max_drives, int, 0);
 #ifndef MODULE
@@ -5638,7 +5644,7 @@ int __init sbpcd_init(void)
 #endif
 {
int i=0, j=0;
-   int addr[2]={1, CDROM_PORT};
+   addr[2]={1, CDROM_PORT};
int port_index;
 
sti();
@@ -5670,9 +5676,9 @@ int __init sbpcd_init(void)
{
addr[1]=sbpcd[port_index];
if (addr[1]==0) break;
-   if (check_region(addr[1],4))
+   if (!request_region(addr[1],4, "sbpcd driver"))
{
-   msg(DBG_INF,"check_region: %03X is not 
free.\n",addr[1]);
+   msg(DBG_INF,"request_region: %03X is not 
free.\n",addr[1]);
continue;
}
if (sbpcd[port_index+1]==2) type=str_sp;
@@ -5699,6 +5705,7 @@ int __init sbpcd_init(void)
if (ndrives==0)
{
msg(DBG_INF, "No drive found.\n");
+   release_region(addr[1],4);
 #ifdef MODULE
return -EIO;
 #else
@@ -5775,6 +5782,7 @@ int __init sbpcd_init(void)
if (!request_region(CDo_command,4,major_name))
{
printk(KERN_WARNING "sbpcd: Unable to request region 0x%x\n", 
CDo_command);
+   release_region(addr[1],4);
return -EIO;
}
 
@@ -5788,6 +5796,8 @@ int __init sbpcd_init(void)
 #endif /* SOUND_BASE */
 
if (register_blkdev(MAJOR_NR, major_name)) {
+   release_region(CDo_command,4);
+   release_region(addr[1],4);
 #ifdef MODULE
return -EIO;
 #else
@@ -5801,6 +5811,7 @@ int __init sbpcd_init(void)
sbpcd_queue = blk_init_queue(do_sbpcd_request, _lock);
if (!sbpcd_queue) {
release_region(CDo_command,4);
+   release_region(addr[1],4);
unregister_blkdev(MAJOR_NR, major_name);
return -ENOMEM;
}
@@ -5834,6 +5845,7 @@ int __init sbpcd_init(void)
printk("Can't unregister %s\n", major_name);
}
release_region(CDo_command,4);
+   release_region(addr[1],4);
blk_cleanup_queue(sbpcd_queue);
return -EIO;
}
@@ -5850,6 +5862,7 @@ int __init sbpcd_init(void)
if (sbpcd_infop == NULL)
{
 release_region(CDo_command,4);
+   release_region(addr[1],4);
blk_cleanup_queue(sbpcd_queue);
 return -ENOMEM;
}
@@ -5894,6 +5907,7 @@ static void sbpcd_exit(void)
return;
}
release_region(CDo_command,4);
+   release_region(addr[1],4);
blk_cleanup_queue(sbpcd_queue);
for (j=0;jhttp://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Per container statistics (containerstats)

2007-06-07 Thread Balbir Singh

Andrew Morton wrote:
> On Fri, 08 Jun 2007 07:51:12 +0530 Balbir Singh <[EMAIL PROTECTED]> wrote:
> 
>> Andrew Morton wrote:
>>> I'd have hoped to see containerstats.c in here.
>>>
>> The current statistics code is really small, so it fit into taskstats.c.
>> May be in the future, we could re-factor it and move it out.
> 
> I was referring to your userspace tool which reads this stuff.  The one
> which you described in the changelog.
> 

Ahh.. yes.. I'll make the changes and repost with the tool.

 +  rcu_read_lock();
 +
 +  for_each_root(root) {
 +  if (!root->subsys_bits)
 +  continue;
 +  root_cont = >top_container;
 +  get_first_subsys(root_cont, NULL, _id);
 +  do_each_thread(g, p) {
>>> this needs tasklist_lock?
>>>
>> rcu_read_lock() should be fine. From Eric's patch at
>>
>> 2.6.17-mm2 - proc-remove-tasklist_lock-from-proc_pid_readdir.patch
>>
>> The patch mentions that "We don't need the tasklist_lock to safely
>> iterate through processes anymore."
>>
> 
> oh, OK.  rcu_read_lock() is the new lock_kernel() - always hard to tell
> what it's locking.
> 

:-) I'll add a comment to make it clearer.


-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G

2007-06-07 Thread Andrew Morton

On Thu, 7 Jun 2007 19:35:51 -0700 William Lee Irwin III <[EMAIL PROTECTED]> 
wrote:

> PAE is useful for more than supporting more than 4GB RAM. It supports
> expanded swapspace and NX executable protections. Some users may want
> NX or expanded swapspace support without the overhead or instability
> of highmem. For these reasons, the following patch divorces
> CONFIG_X86_PAE from CONFIG_HIGHMEM64G.

Do (CONFIG_X86_PAE && !CONFIG_HIGHMEM64G) and (!CONFIG_X86_PAE && 
CONFIG_HIGHMEM64G)
kernels actually work?  I wouldn't be surprised if there are places where we 
used
the incorrect one.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Per container statistics (containerstats)

2007-06-07 Thread Andrew Morton

On Fri, 08 Jun 2007 07:51:12 +0530 Balbir Singh <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > 
> > I'd have hoped to see containerstats.c in here.
> > 
> 
> The current statistics code is really small, so it fit into taskstats.c.
> May be in the future, we could re-factor it and move it out.

I was referring to your userspace tool which reads this stuff.  The one
which you described in the changelog.

> >> +  rcu_read_lock();
> >> +
> >> +  for_each_root(root) {
> >> +  if (!root->subsys_bits)
> >> +  continue;
> >> +  root_cont = >top_container;
> >> +  get_first_subsys(root_cont, NULL, _id);
> >> +  do_each_thread(g, p) {
> > 
> > this needs tasklist_lock?
> > 
> 
> rcu_read_lock() should be fine. From Eric's patch at
> 
> 2.6.17-mm2 - proc-remove-tasklist_lock-from-proc_pid_readdir.patch
> 
> The patch mentions that "We don't need the tasklist_lock to safely
> iterate through processes anymore."
> 

oh, OK.  rcu_read_lock() is the new lock_kernel() - always hard to tell
what it's locking.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Make i386 kernel show the segfaults in kernel dmesg, like x86_64.

2007-06-07 Thread Andrew Morton

On Thu, 07 Jun 2007 20:04:41 -0600 Robert Hancock <[EMAIL PROTECTED]> wrote:

> Masoud Asgharifard Sharbiani wrote:
> > Hello,
> > This patch makes the i386 behave the same way that x86_64 does when a
> > segfault happens. A line gets printed to the kernel log so that tools
> > that need to check for failures can behave more uniformly between
> > different kernels. Like x86_64, it can be disabled by setting
> > debug.exception-trace sysctl variable to 0 (or by doing
> > echo 0 > /proc/sys/debug/exception-trace)
> > 
> > Same behaviour can be extended to other architectures, if needed.
> > cheers,
> > Masoud.
> > 
> > Signed-off-by: Masoud Sharbiani <[EMAIL PROTECTED]>
> > 
> > diff --git a/arch/i386/mm/fault.c b/arch/i386/mm/fault.c
> > index 29d7d61..6aa56db 100644
> > --- a/arch/i386/mm/fault.c
> > +++ b/arch/i386/mm/fault.c
> > @@ -283,6 +283,8 @@ static inline int vmalloc_fault(unsigned long address)
> > return 0;
> >  }
> >  
> > +int exception_trace = 1;
> > +
> >  /*
> >   * This routine handles page faults.  It determines the address,
> >   * and the problem, and then passes it off to one of the appropriate
> > @@ -464,7 +466,14 @@ bad_area_nosemaphore:
> >  */
> > if (is_prefetch(regs, address, error_code))
> > return;
> > -
> > +   if (exception_trace && unhandled_signal(tsk, SIGSEGV)) {
> > +   printk(
> > +  "%s%s[%d]: segfault at %08lx eip %08lx esp %08lx error 
> > %lx\n",
> > +   tsk->pid > 1 ? KERN_INFO : KERN_EMERG,
> > +   tsk->comm, tsk->pid, address, regs->eip,
> > +   regs->esp, error_code);
> 
> Shouldn't we use printk_ratelimit() here, to prevent some nasty person 
> from creating some rapidly-segfaulting process that floods the kernel 
> logs? (Same with the x86_64 version if it doesn't already..)

Yes.  In fact I thought that was in there, but I must have dreamed it.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

divorce CONFIG_X86_PAE from CONFIG_HIGHMEM64G

2007-06-07 Thread William Lee Irwin III

PAE is useful for more than supporting more than 4GB RAM. It supports
expanded swapspace and NX executable protections. Some users may want
NX or expanded swapspace support without the overhead or instability
of highmem. For these reasons, the following patch divorces
CONFIG_X86_PAE from CONFIG_HIGHMEM64G.

vs. 2.6.22-rc4-mm2

Cc: Mark Lord <[EMAIL PROTECTED]>
Cc: Andi Kleen <[EMAIL PROTECTED]>
Cc: Andrew Morton <[EMAIL PROTECTED]>
Signed-off-by: William Irwin <[EMAIL PROTECTED]>


Index: mm-2.6.22-rc4-2/arch/i386/Kconfig
===
--- mm-2.6.22-rc4-2.orig/arch/i386/Kconfig  2007-06-07 00:05:53.609599701 
-0700
+++ mm-2.6.22-rc4-2/arch/i386/Kconfig   2007-06-07 17:02:24.333262965 -0700
@@ -544,6 +544,7 @@
 config HIGHMEM64G
bool "64GB"
depends on X86_CMPXCHG64
+   select X86_PAE
help
  Select this if you have a 32-bit processor and more than 4
  gigabytes of physical RAM.
@@ -573,12 +574,12 @@
config VMSPLIT_3G
bool "3G/1G user/kernel split"
config VMSPLIT_3G_OPT
-   depends on !HIGHMEM
+   depends on !X86_PAE
bool "3G/1G user/kernel split (for full 1G low memory)"
config VMSPLIT_2G
bool "2G/2G user/kernel split"
config VMSPLIT_2G_OPT
-   depends on !HIGHMEM
+   depends on !X86_PAE
bool "2G/2G user/kernel split (for full 2G low memory)"
config VMSPLIT_1G
bool "1G/3G user/kernel split"
@@ -598,10 +599,15 @@
default y
 
 config X86_PAE
-   bool
-   depends on HIGHMEM64G
-   default y
+   bool "PAE (Physical Address Extension) Support"
+   default n
+   depends on !HIGHMEM4G
select RESOURCES_64BIT
+   help
+ PAE is required for NX support, and furthermore enables
+ larger swapspace support for non-overcommit purposes. It
+ has the cost of more pagetable lookup overhead, and also
+ consumes more pagetable space per process.
 
 # Common NUMA Features
 config NUMA
Index: mm-2.6.22-rc4-2/arch/i386/kernel/setup.c
===
--- mm-2.6.22-rc4-2.orig/arch/i386/kernel/setup.c   2007-06-06 
23:52:18.839168580 -0700
+++ mm-2.6.22-rc4-2/arch/i386/kernel/setup.c2007-06-07 17:02:24.349263876 
-0700
@@ -273,18 +273,18 @@
printk(KERN_WARNING "Warning only %ldMB will be used.\n",
MAXMEM>>20);
if (max_pfn > MAX_NONPAE_PFN)
-   printk(KERN_WARNING "Use a PAE enabled kernel.\n");
+   printk(KERN_WARNING "Use a HIGHMEM64G enabled 
kernel.\n");
else
printk(KERN_WARNING "Use a HIGHMEM enabled kernel.\n");
max_pfn = MAXMEM_PFN;
 #else /* !CONFIG_HIGHMEM */
-#ifndef CONFIG_X86_PAE
+#ifndef CONFIG_HIGHMEM64G
if (max_pfn > MAX_NONPAE_PFN) {
max_pfn = MAX_NONPAE_PFN;
printk(KERN_WARNING "Warning only 4GB will be used.\n");
-   printk(KERN_WARNING "Use a PAE enabled kernel.\n");
+   printk(KERN_WARNING "Use a HIGHMEM64G enabled 
kernel.\n");
}
-#endif /* !CONFIG_X86_PAE */
+#endif /* !CONFIG_HIGHMEM64G */
 #endif /* !CONFIG_HIGHMEM */
} else {
if (highmem_pages == -1)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] enable interrupts in user path of page fault.

2007-06-07 Thread Linus Torvalds

On Thu, 7 Jun 2007, Steven Rostedt wrote:
> 
> I didn't realize that userspace was allowed to run with interrupts
> disabled.

It isn't, normally. You *can* try to do it by using iopl(), but it's not 
practical. It's certainly not practical to expect a page fault to not 
enable them again, since we obviously need interrupts enabled just to 
handle any IO. 

So don't worry about it. 

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] enable interrupts in user path of page fault.

2007-06-07 Thread Linus Torvalds

On Thu, 7 Jun 2007, Andrew Morton wrote:
> 
> Interrupts got disabled here because do_page_fault() is an
> interrupt-disabling trap, yes?

Yes - and it has to be: we want to disable preemption and interrupts that 
can fault on the vmalloc space, until we've at least saved away %cr2. We 
had bugs in that area before.

> The patch looks reasonable to me: a slight reduction in interrupt-off
> latency when really weird things are happening.

I applied it as obviously correct.

> The patch also breaks things, I think: if userspace is running with
> interrupts disabled and tries to access kernel memory it will presently
> whizz through the kernel without ever enabling interrupts.  With this
> change, the kernel will now enable interrupts, which is presumably not what
> the application wanted.

Well, we *do* enable interrupts for real page faults anyway, and this 
whole code just triggers for the case where we'd send a SIGSEGV. If some 
silly app really thought it could do that with interrupts disabled, it was 
wrong before too (we'd hit a reschedule point and enable them there 
anyway).

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Per container statistics (containerstats)

2007-06-07 Thread Paul Menage

On 6/7/07, Balbir Singh <[EMAIL PROTECTED]> wrote:

> this needs tasklist_lock?
>

rcu_read_lock() should be fine. From Eric's patch at

2.6.17-mm2 - proc-remove-tasklist_lock-from-proc_pid_readdir.patch

The patch mentions that "We don't need the tasklist_lock to safely
iterate through processes anymore."

Containers V10 includes an iterator interface for listing the member
tasks of a container, which avoids scanning the entire tasklist.
Downside is that it currently requires taking a read_lock on a global
lock, but I hope to improve on that in the future.

Paul
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 7/8] fdmap v2 - implement sys_socket2

2007-06-07 Thread Linus Torvalds

On Thu, 7 Jun 2007, Matt Mackall wrote:
> 
> First, how does this work in-kernel? Does it set a flag in the thread
> struct that magically gets used in the actual syscall? Or do we pass
> flags down to the sys_foo() function in some manner?

Set a flag in the thread-struct.

In fact, that's how "access()" already works.

And yes, syslets would need to have their own thread-structs and/or 
save/restore the thing.

> Second, I think we're likely to run out of flag bits really quickly as
> this is a good dumping spot for patching up our many slightly
> brain-damaged APIs (be they POSIX or Linux-specific).

Well, I do suspect that we'd need to basically make the flags be 
per-system call. With "common features" (ie a system call that doesn't 
return a file descriptor would re-use the bit for "nonlinear-fd" for 
something else, while a system call that doesn't do path lookup would use 
all the LOOKUP_xyzzy bits for something else).

I agree that if we kept flags _totally_ separate, we'd run out of them 
really quickly. But I don't think we want to ever be in the situation 
where _one_ set of system calls would need that many flags. If we get 
there, we'd really be much better off with a new system call!

> Third, can I do sys_indirect(sys_indirect(foo, args), flags1), flags2)?

I'd say no.

> Fourth, can we do sys_indirect(foo, args, flags | ASYNC) and get most
> of the way to merging this with the syslet proposal?

I think that may well be a really good idea.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Per container statistics (containerstats)

2007-06-07 Thread Balbir Singh

Andrew Morton wrote:
> 
> I'd have hoped to see containerstats.c in here.
> 

The current statistics code is really small, so it fit into taskstats.c.
May be in the future, we could re-factor it and move it out.

>> +#ifndef _LINUX_CONTAINERSTATS_H
>> +#define _LINUX_CONTAINERSTATS_H
>> +
>> +#include 
> 
> I don't understand the relationship between containerstats and taskstats. 
> afacit it's using the same genetlink channel?
> 

I've registered containerstats_ops with the same family as taskstats to
reuse the taskstats interface.

>> +enum {
>> +CONTAINERSTATS_CMD_UNSPEC = __TASKSTATS_CMD_MAX,/* Reserved */
> 
> This seems to mean that the containerstats commands all get renumbered if
> we add new taskstats commands.  That would be bad?
> 

As per comment above, since we register containerstats_ops with the taskstats
family, the commands need to be unique, hence we start where taskstats ended
(left off)

>> + */
>> +int containerstats_build(struct containerstats *stats, struct dentry 
>> *dentry)
>> +{
>> +int ret = -EINVAL;
>> +struct task_struct *g, *p;
>> +struct container *cont, *root_cont;
>> +struct container *src_cont;
>> +int subsys_id;
>> +struct containerfs_root *root;
>> +
>> +/*
>> + * Validate dentry by checking the superblock operations
>> + */
>> +if (dentry->d_sb->s_op != _ops)
>> + goto err;
>> +
>> +ret = 0;
>> +src_cont = (struct container *)dentry->d_fsdata;
> 
> Unneeded cast.
> 

Will remove

>> +rcu_read_lock();
>> +
>> +for_each_root(root) {
>> +if (!root->subsys_bits)
>> +continue;
>> +root_cont = >top_container;
>> +get_first_subsys(root_cont, NULL, _id);
>> +do_each_thread(g, p) {
> 
> this needs tasklist_lock?
> 

rcu_read_lock() should be fine. From Eric's patch at

2.6.17-mm2 - proc-remove-tasklist_lock-from-proc_pid_readdir.patch

The patch mentions that "We don't need the tasklist_lock to safely
iterate through processes anymore."

>> +cont = task_container(p, subsys_id);
>> +if (cont == src_cont) {
>> +switch (p->state) {
>> +case TASK_RUNNING:
>> +stats->nr_running++;
>> +break;
>> +case TASK_INTERRUPTIBLE:
>> +stats->nr_sleeping++;
>> +break;
>> +case TASK_UNINTERRUPTIBLE:
>> +stats->nr_uninterruptible++;
>> +break;
>> +case TASK_STOPPED:
>> +stats->nr_stopped++;
>> +break;
>> +default:
>> +if (delayacct_is_task_waiting_on_io(p))
>> +stats->nr_io_wait++;
>> +break;
>> +}
>> +}
>> +} while_each_thread(g, p);
>> +}
>> +rcu_read_unlock();
>> +err:
>> +return ret;
>> +}
>> +
>>  static int cmppid(const void *a, const void *b)
>>  {
>>  return *(pid_t *)a - *(pid_t *)b;
>> diff -puN kernel/sched.c~containers-taskstats kernel/sched.c
>> --- linux-2.6.22-rc2-mm1/kernel/sched.c~containers-taskstats 2007-06-05 
>> 17:21:57.0 +0530
>> +++ linux-2.6.22-rc2-mm1-balbir/kernel/sched.c   2007-06-05 
>> 17:21:57.0 +0530
>> @@ -4280,11 +4280,13 @@ void __sched io_schedule(void)
>>  {
>>  struct rq *rq = &__raw_get_cpu_var(runqueues);
>>  
>> +delayacct_set_flag(DELAYACCT_PF_BLKIO);
>>  delayacct_blkio_start();
> 
> Would it be suitable and appropriate to embed the delayacct_set_flag() call
> inside delayacct_blkio_start()?
> 

Yes, I should have done that, will do.


-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] intel-rng: Undo mess made by an 80 column extremist

2007-06-07 Thread John Stoffel

Jeff> On Thu, Jun 07, 2007 at 09:56:06PM -0400, John Stoffel wrote:
>> Thinking about it more, I wonder if Krysztof is bitching more about
>> the tab width of 8 characters?   I know that it ticks me off,

Jeff> Even if he is, _that_ is definitely not getting changed.

Oh sure... I know that part is written in stone.  

Jeff> That was the very first item Linus wrote in
Jeff> Documentation/CodingStyle, and for good reasons.

I could argue that 4 spaces would be a better level, but it's a
personal preference and not worth the battle.  

Jeff> If code starts creeping way right due to indentation levels,
Jeff> create a new function.

Sure... compilers are good, us humans haven't gotten much better, make
it easier on us and harder on the computer.   

John
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] intel-rng: Undo mess made by an 80 column extremist

2007-06-07 Thread Jeff Garzik

On Thu, Jun 07, 2007 at 09:56:06PM -0400, John Stoffel wrote:
> Thinking about it more, I wonder if Krysztof is bitching more about
> the tab width of 8 characters?   I know that it ticks me off,

Even if he is, _that_ is definitely not getting changed.

That was the very first item Linus wrote in Documentation/CodingStyle,
and for good reasons.

If code starts creeping way right due to indentation levels, create a
new function.

Jeff

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] enable interrupts in user path of page fault.

2007-06-07 Thread Steven Rostedt

On Thu, 2007-06-07 at 18:58 -0700, Andrew Morton wrote:
> On Wed, 06 Jun 2007 23:34:04 -0400

> Interrupts got disabled here because do_page_fault() is an
> interrupt-disabling trap, yes?

Correct.

> 
> The patch looks reasonable to me: a slight reduction in interrupt-off
> latency when really weird things are happening.
> 
> The patch also breaks things, I think: if userspace is running with
> interrupts disabled and tries to access kernel memory it will presently
> whizz through the kernel without ever enabling interrupts.  With this
> change, the kernel will now enable interrupts, which is presumably not what
> the application wanted.

I didn't realize that userspace was allowed to run with interrupts
disabled. If this becomes a problem, we can add the same check that's
above where do_page_fault does enable interrupts, but is skipped because
the faulting address was above PAGE_OFFSET.


ie. (i386)
if (regs->eflags & (X86_EFLAGS_IF|VM_MASK))
local_irq_enable();


> 
> However it's surely already the case that most pagefaults will go and
> enable interrupts on this process anyway, so no big loss there.  I'd expect
> the kernel to spit piles of might_sleep() warnings when all this happens, so
> maybe it just doesn't happen for some reason.

Actually it does on the RT kernel. Hence why I found it ;-)

-- Steve


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Make i386 kernel show the segfaults in kernel dmesg, like x86_64.

2007-06-07 Thread Robert Hancock


Masoud Asgharifard Sharbiani wrote:

Hello,
This patch makes the i386 behave the same way that x86_64 does when a
segfault happens. A line gets printed to the kernel log so that tools
that need to check for failures can behave more uniformly between
different kernels. Like x86_64, it can be disabled by setting
debug.exception-trace sysctl variable to 0 (or by doing
echo 0 > /proc/sys/debug/exception-trace)

Same behaviour can be extended to other architectures, if needed.
cheers,
Masoud.

Signed-off-by: Masoud Sharbiani <[EMAIL PROTECTED]>

diff --git a/arch/i386/mm/fault.c b/arch/i386/mm/fault.c
index 29d7d61..6aa56db 100644
--- a/arch/i386/mm/fault.c
+++ b/arch/i386/mm/fault.c
@@ -283,6 +283,8 @@ static inline int vmalloc_fault(unsigned long address)
return 0;
 }
 
+int exception_trace = 1;

+
 /*
  * This routine handles page faults.  It determines the address,
  * and the problem, and then passes it off to one of the appropriate
@@ -464,7 +466,14 @@ bad_area_nosemaphore:
 */
if (is_prefetch(regs, address, error_code))
return;
-
+   if (exception_trace && unhandled_signal(tsk, SIGSEGV)) {
+   printk(
+  "%s%s[%d]: segfault at %08lx eip %08lx esp %08lx error 
%lx\n",
+   tsk->pid > 1 ? KERN_INFO : KERN_EMERG,
+   tsk->comm, tsk->pid, address, regs->eip,
+   regs->esp, error_code);


Shouldn't we use printk_ratelimit() here, to prevent some nasty person 
from creating some rapidly-segfaulting process that floods the kernel 
logs? (Same with the x86_64 version if it doesn't already..)


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] enable interrupts in user path of page fault.

2007-06-07 Thread Andrew Morton

On Wed, 06 Jun 2007 23:34:04 -0400
Steven Rostedt <[EMAIL PROTECTED]> wrote:

> This is a minor fix, but what is currently there is essentially wrong.
> In do_page_fault, if the faulting address from user code happens to be
> in kernel address space (int *p = (int*)-1; p = 0xbed;)  then the
> do_page_fault handler will jump over the local_irq_enable with the
> 
>   goto bad_area_nosemaphore;
> 
> But the first line there sees this is user code and goes through the
> process of sending a signal to send SIGSEGV to the user task. This whole
> time interrupts are disabled and the task can not be preempted by a
> higher priority task.
> 
> This patch always enables interrupts in the user path of the
> bad_area_nosemaphore.
> 
> Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
> 
> diff --git a/arch/i386/mm/fault.c b/arch/i386/mm/fault.c
> index 29d7d61..1ecb3e4 100644
> --- a/arch/i386/mm/fault.c
> +++ b/arch/i386/mm/fault.c
> @@ -458,6 +458,11 @@ bad_area:
>  bad_area_nosemaphore:
>   /* User mode accesses just cause a SIGSEGV */
>   if (error_code & 4) {
> + /*
> +  * It's possible to have interrupts off here.
> +  */
> + local_irq_enable();
> +

Interrupts got disabled here because do_page_fault() is an
interrupt-disabling trap, yes?

The patch looks reasonable to me: a slight reduction in interrupt-off
latency when really weird things are happening.

The patch also breaks things, I think: if userspace is running with
interrupts disabled and tries to access kernel memory it will presently
whizz through the kernel without ever enabling interrupts.  With this
change, the kernel will now enable interrupts, which is presumably not what
the application wanted.

However it's surely already the case that most pagefaults will go and
enable interrupts on this process anyway, so no big loss there.  I'd expect
the kernel to spit piles of might_sleep() warnings when all this happens, so
maybe it just doesn't happen for some reason.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] trim memory not covered by WB MTRRs

2007-06-07 Thread Robert Hancock


Justin Piszcz wrote:

Note that your boot also mentions this:

[  106.449661] mtrr: no more MTRRs available

which indicates that things like X may not be able to map the
framebuffer with the 'write-combine' attribute, which will hurt
performance.  I've heard reports that turning of 'Intel QST fan
control' in your BIOS settings will prevent all your MTRRs from being
used (improperly, probably another BIOS bug) so that X will perform
well.  But if you don't use X on this machine, you don't have to worry
about it.  The other option would be to remap your MTRRs by hand to
free one up for X, you can do that by combining the last one or two
entries into a single MTRR using the API described in
Documentation/mtrr.txt before you start X.

Jesse



FYI--

[  106.449661] mtrr: no more MTRRs available

This has always occurred, even with mem=8832M setting.

Justin.


Yes, it's another consequence of the way your BIOS configured the MTRRs 
(wastefully, using more of the precious register entries than it needed 
to, in addition to not covering all of the RAM).


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] intel-rng: Undo mess made by an 80 column extremist

2007-06-07 Thread John Stoffel

Jeff> On Thu, Jun 07, 2007 at 09:44:57PM -0400, John Stoffel wrote:
>> I don't.  I use a pair of 80x48 xterms or emacs windows side by side
>> on my monitor with a nice big clear easy to read font.  Itty bitty

Jeff> That's pretty much what I do.  For me at least, it's a more
Jeff> efficient use of screen real estate, and of course the
Jeff> overwhelming majority of source code looks great in an 80-column
Jeff> text window.

Thinking about it more, I wonder if Krysztof is bitching more about
the tab width of 8 characters?   I know that it ticks me off,
indenting by two spaces is plenty for me to follow the flow, along
with Emacs and braces closure indication.  

I'm constantly un-tabifying code and making my indents be just two
spaces, though I guess I should really just change my tab-width to be
2 consistently.

What do you think Krysztof?

John
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Bad behaviour after hdparm -M 128

2007-06-07 Thread Henrique de Moraes Holschuh

On Thu, 07 Jun 2007, Mark Lord wrote:
> Henrique de Moraes Holschuh wrote:
> >On Wed, 06 Jun 2007, Björn Steinbrink wrote:
> >>HDAPS support is available, revision is MB2IA60A.
> >
> >That would usually rule out the possibility of it being the firmware, but 
> >we
> >have different disks, so different firmware.
> >
> >It looks like I will have to try 2.6.22 to know for sure.
> 
> "hdparm -I" should list "Automatic Acoustic Management feature set"
> (near the bottom) for any drive that *does* support the -M feature.

I just tracked it down to hdparm.  Version 6.9 (the one in Debian stable)
doesn't work right with libata.  Version 7.5 (the one in Debian unstable)
works fine.

So, at least in my side, there are *no* kernel bugs.  Maybe this is also the
case for the poster that reported the problem?

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] fix race in AF_UNIX

2007-06-07 Thread David Miller

From: Miklos Szeredi <[EMAIL PROTECTED]>
Date: Wed, 06 Jun 2007 10:08:29 +0200

> There are races involving the garbage collector, that can throw away
> perfectly good packets with AF_UNIX sockets in them.
> 
> The problems arise when a socket goes from installed to in-flight or
> vice versa during garbage collection.  Since gc is done with a
> spinlock held, this only shows up on SMP.
> 
> Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>

I'm going to hold off on this one for now.

Holding all of the read locks kind of defeats the purpose of using
the per-socket lock.

Can't you just lock purely around the receive queue operation?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] intel-rng: Undo mess made by an 80 column extremist

2007-06-07 Thread Jeff Garzik

On Thu, Jun 07, 2007 at 09:44:57PM -0400, John Stoffel wrote:
> I don't.  I use a pair of 80x48 xterms or emacs windows side by side
> on my monitor with a nice big clear easy to read font.  Itty bitty

That's pretty much what I do.  For me at least, it's a more efficient
use of screen real estate, and of course the overwhelming majority of
source code looks great in an 80-column text window.

Jeff

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] intel-rng: Undo mess made by an 80 column extremist

2007-06-07 Thread John Stoffel


Krzysztof> Perhaps we should drop that 80-column style and use some
Krzysztof> 120+?  X or no X, almost all people now have more lines and
Krzysztof> columns on their displays than MDA 20 years ago.

I don't.  I use a pair of 80x48 xterms or emacs windows side by side
on my monitor with a nice big clear easy to read font.  Itty bitty
fonts are a total pain...

John

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 2.6.22-rc4-mm2: Assigning IP address fails

2007-06-07 Thread David Miller

From: Herbert Xu <[EMAIL PROTECTED]>
Date: Fri, 8 Jun 2007 08:54:52 +1000

> On Thu, Jun 07, 2007 at 03:06:53PM -0700, Andrew Morton wrote:
> >
> > > I'm not able to bring an ethernet interface down and back up again
> > > with this if avahi-autoipd is installed on my Ubuntu boxes.  I've seen
> > > it on three different computers with different NIC hardware.
> 
> Sorry, it was my patch.  This patch should fix it.
> 
> [IPV4]: Do not remove idev when addresses are cleared
> 
> Now that we create idev before addresses are added, it no longer makes
> sense to remove them when addresses are all deleted.
> 
> Signed-off-by: Herbert Xu <[EMAIL PROTECTED]>

Applied, thanks Herbert.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] trim memory not covered by WB MTRRs

2007-06-07 Thread Jesse Barnes

On Thursday, June 7, 2007 5:20:50 Andrew Morton wrote:
> > -unsigned int *usage_table;
> > +unsigned int usage_table[NUM_VAR_RANGES];
> >  static DEFINE_MUTEX(mtrr_mutex);
>
> didn't it feel all dirty when you had to do that?

Hey, this was already there... I didn't want to rewrite the whole thing at 
once. :)  Patch looks fine though.

Thanks,
Jesse
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] intel-rng: Undo mess made by an 80 column extremist

2007-06-07 Thread Jeff Garzik

On Thu, Jun 07, 2007 at 06:08:32PM -0700, H. Peter Anvin wrote:
> My big concern with the 80-column rule is that it discourages commenting.

My concern with that logic is that encourages random, super-wide code
lines that varies with each coder.  You are left to the mercy of he with
the widest text window.

The 80-column rule is good as a general guideline, though there are
obvious exceptions.  Comments IMO are not one of them.  That rapidly
creates unreadable code.

Jeff



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Another version of cleanfile/cleanpatch

2007-06-07 Thread Oleg Verych

On Thu, Jun 07, 2007 at 04:19:56PM -0700, H. Peter Anvin wrote:
> Oleg Verych wrote:
> > 
> > Because of that, i think, following is redundant:
> > 
> > - to check for binary files
> 
> find . -type f | xargs cleanfile

What about patches?

Anyway, by agreement (with myself), i've stopped on having per-file-name
division (prev. message first patch, and that was last design remaining
from cleanfile/cleanpatch). So:

for f in $*
do clean-whitespace $f 2>&1 >/dev/null
done

But this doesn't look like interactive usage, which i've concluded.
Plus copy is saved in $f.clean file, so user can `diff -u` to see any
destruction and possibly report a bug.

[] 
> > - scan whole file for long lines, with useless bunch of messages about
> >   ones. Useless, because script doesn't fix that, it can't do that!
> 
> Still useful to let the human know what is going on, and why.

What i've done was `cleanpatch patch-2.6.21-rc4-rc5`
That's where usefulness comes from ;)

>   -hpa

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 001 of 2] Fix read/truncate race.

2007-06-07 Thread Andrew Morton

On Thu, 7 Jun 2007 11:46:53 +1000
NeilBrown <[EMAIL PROTECTED]> wrote:

> do_generic_mapping_read currently samples the i_size at the start
> and doesn't do so again unless it needs to call ->readpage to load
> a page.  After ->readpage it has to re-sample i_size as a truncate
> may have caused that page to be filled with zeros, and the read()
> call should not see these.
> 
> However there are other activities that might cause ->readpage to be
> called on a page between the time that do_generic_mapping_read
> samples i_size and when it finds that it has an uptodate page.  These
> include at least read-ahead and possibly another thread performing a
> read.
> 
> So do_generic_mapping_read must sample i_size *after* it has an
> uptodate page.  Thus the current sampling at the start and after a read
> can be replaced with a sampling before the copy-out.
> 
> The same change applied to __generic_file_splice_read.
> 
> Note that this fixes any race with truncate_complete_page, but does
> not fix a possible race with truncate_partial_page.  If a partial
> truncate happens after do_generic_mapping_read samples i_size and
> before the copy_out, the nuls that truncate_partial_page place in the
> page could be copied out incorrectly.
> 
> I think the best fix for that is to *not* zero out parts of the page
> in truncate_partial_page, but rather to zero out the tail of a page
> when increasing i_size.

Is this really worth fixing?

The code still has races there - for example, another process can come
after we've got the page, truncate it off the file then write new data.  So
this thread ends up copying out-of-date page contents out to userspace.

The application is being silly, and the silly application gets wrong data:
either out-of-date or zeroed.

afacit the patch shrinks the timing windows significantly, but at a cost:
moving a bunch of 64-bit arithmetic and seqlock operations into the inner
loop.

We could plug all this in a more predictable fashion by locking the page,
but then we have another instance of the
read-a-page-into-a-mmapped-copy-of-itself deadlock (I think - maybe it
can't happen because of page uptodateness checks).

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] intel-rng: Undo mess made by an 80 column extremist

2007-06-07 Thread H. Peter Anvin

Krzysztof Halasa wrote:
> Alistair John Strachan <[EMAIL PROTECTED]> writes:
> 
>> I personally buy the argument that 80 cols helps remind people that they've 
>> used too many indentation depths and should redesign their code.
>> I think it's 
>> a good thing to stick to where possible, even if just from a design 
>> perspective.
> 
> How many is too many? A function with 2 "if" means 3 tabs = 24 characters
> for just indentation. Add a printk(KERN_DEBUG "\n"); and you can print
> ca. 30 characters without having to break the string. This is crazy.
> 

My big concern with the 80-column rule is that it discourages commenting.

-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Device hang when offlining a CPU due to IRQ misrouting

2007-06-07 Thread Siddha, Suresh B

On Wed, Jun 06, 2007 at 04:16:42PM -0700, Darrick J. Wong wrote:
> On Wed, Jun 06, 2007 at 12:35:14PM -0700, Siddha, Suresh B wrote:
> 
> > Weird. Then the bug can only happen if for some reason, "mask = map"
> > didn't happen in fixup_irqs(). Can you send us the disassembly of the
> > fixup_irqs()?
> 
> Attached.

hmm.. Darrick, can't find anything wrong in there.

I am very much puzzled and the main thing I am confused about is, that
how come "/proc/irq//smp_affinity" is still pointing at the old
offlined cpu, while calls to set_affinity() with cpu_online_map mask
in fixup_irqs() don't show any failure..

As you have the failing system, you need to do more detective work and
help me out. Can you try this debug patch and send across the dmesg after the
bug happens and also can you try different compiler to see if something
changes..

diff --git a/arch/x86_64/kernel/irq.c b/arch/x86_64/kernel/irq.c
index 3eaceac..fc2a576 100644
--- a/arch/x86_64/kernel/irq.c
+++ b/arch/x86_64/kernel/irq.c
@@ -152,9 +152,11 @@ void fixup_irqs(cpumask_t map)
printk("Breaking affinity for irq %i\n", irq);
mask = map;
}
-   if (irq_desc[irq].chip->set_affinity)
+   if (irq_desc[irq].chip->set_affinity) {
+   printk("calling set affinity for %i, with mask %lx\n",
+   irq, cpus_addr(mask)[0]);
irq_desc[irq].chip->set_affinity(irq, mask);
-   else if (irq_desc[irq].action && !(warned++))
+   } else if (irq_desc[irq].action && !(warned++))
printk("Cannot set affinity for irq %i\n", irq);
}

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 7/8] fdmap v2 - implement sys_socket2

2007-06-07 Thread Matt Mackall

On Thu, Jun 07, 2007 at 01:10:23PM -0700, Linus Torvalds wrote:
> What do people think about that kind of approach? It has the advantage 
> that it does *not* involve multiple kernel entries (just a single entry to 
> a small wrapper that sets some process state temporarily), and that it 
> doesn't have any sticky state that might confuse a library (or a signal 
> handler: even if you end up doing "prctrl(ON) ; syscall(); prctrl(OFF)", a 
> signal handler that happens in between the prctrl's would see unexpected 
> behaviour).

First, how does this work in-kernel? Does it set a flag in the thread
struct that magically gets used in the actual syscall? Or do we pass
flags down to the sys_foo() function in some manner?

If the former, there is potential for interaction with asynchronous
code running on behalf of the thread, no? Especially if we ever adopt
one of the syslet schemes.

Second, I think we're likely to run out of flag bits really quickly as
this is a good dumping spot for patching up our many slightly
brain-damaged APIs (be they POSIX or Linux-specific).

Third, can I do sys_indirect(sys_indirect(foo, args), flags1),
flags2)?

Fourth, can we do sys_indirect(foo, args, flags | ASYNC) and get most
of the way to merging this with the syslet proposal?

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [BUG] sysrq-m oops

2007-06-07 Thread Bob Picco

john stultz wrote:  [Thu Jun 07 2007, 03:54:41PM EDT]
[snip]
john you are welcome.

We aren't sampling for holes in memory. Thus we encounter a section hole with
empty section map pointer for SPARSEMEM and OOPs for show_mem. This issue
has been seen in 2.6.21, current git and current mm. The patch below
is for mainline and mm. It was boot tested for SPARSEMEM, current
VMEMMAP of Andy's in mm ml and DISCONTIGMEM. A slightly different patch
will be posted to stable for 2.6.21.

Previous to commit f0a5a58aa812b31fd9f197c4ba48245942364eae memory_present
was called for node_start_pfn to node_end_pfn. This would cover the hole(s)
with reserved pages and valid sections. Most SPARSEMEM supported arches
do a pfn_valid check in show_mem before computing the page structure address.

This issue was brought to my attention on IRC by Arnaldo Carvalho de Melo at
[EMAIL PROTECTED] Thanks to Arnaldo for testing.

Signed-off-by: Bob Picco <[EMAIL PROTECTED]>

 arch/x86_64/mm/init.c |2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6.22-rc4-mm1/arch/x86_64/mm/init.c
===
--- linux-2.6.22-rc4-mm1.orig/arch/x86_64/mm/init.c 2007-06-06 
12:59:26.0 -0400
+++ linux-2.6.22-rc4-mm1/arch/x86_64/mm/init.c  2007-06-07 11:14:31.0 
-0400
@@ -79,6 +79,8 @@ void show_mem(void)
if (unlikely(i % MAX_ORDER_NR_PAGES == 0)) {
touch_nmi_watchdog();
}
+   if (!pfn_valid(pgdat->node_start_pfn + i))
+   continue;
page = pfn_to_page(pgdat->node_start_pfn + i);
total++;
if (PageReserved(page))
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

get_xip_page() uncertainity

2007-06-07 Thread Jared Hulbert


I am trying to create valid "struct page* (*get_xip_page)(struct
address_space *, sector_t, int)" to use the filemap_xip.c.

I've been trying to do it as follows:

virtual = ioremap(physical,size);

struct page* my_get_xip_page(struct address_space *mapping, sector_t
sector, int create)
{
 unsigned long offset;
 /*extract offset from mapping and sector*/
 return virt_to_page(virtual + offset);
}

I believe this to be fundamentally flawed.  While this works for
xip_file_read(), it does not work for xip_file_mmap().  I'm not sure I
understand the correct way to do this.  But I assume the problem, and
have some evidence to support it, is that virt_to_page() is not
returning a vaild page struct.

How can I get a valid page struct?  The memory is not RAM but Flash.
It is addressable like RAM and I want userspace to use it like
readonly RAM.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 4/4] FUTEX: Tidy up the code

2007-06-07 Thread Thomas Gleixner

The recent PRIVATE and REQUEUE_PI changes to the futex code made it hard to 
read.
Tidy it up.

Signed-off-by: Thomas Gleixner <[EMAIL PROTECTED]>

---
 kernel/futex.c  |  192 
 kernel/rtmutex-debug.c  |6 -
 kernel/rtmutex.c|6 -
 kernel/rtmutex_common.h |7 +
 4 files changed, 88 insertions(+), 123 deletions(-)

Index: linux-2.6.22-rc4/kernel/rtmutex-debug.c
===
--- linux-2.6.22-rc4.orig/kernel/rtmutex-debug.c2007-06-08 
01:42:29.0 +0200
+++ linux-2.6.22-rc4/kernel/rtmutex-debug.c 2007-06-08 01:48:02.0 
+0200
@@ -29,12 +29,6 @@
 
 #include "rtmutex_common.h"
 
-#ifdef CONFIG_DEBUG_RT_MUTEXES
-# include "rtmutex-debug.h"
-#else
-# include "rtmutex.h"
-#endif
-
 # define TRACE_WARN_ON(x)  WARN_ON(x)
 # define TRACE_BUG_ON(x)   BUG_ON(x)
 
Index: linux-2.6.22-rc4/kernel/rtmutex.c
===
--- linux-2.6.22-rc4.orig/kernel/rtmutex.c  2007-06-08 01:44:54.0 
+0200
+++ linux-2.6.22-rc4/kernel/rtmutex.c   2007-06-08 01:48:02.0 +0200
@@ -17,12 +17,6 @@
 
 #include "rtmutex_common.h"
 
-#ifdef CONFIG_DEBUG_RT_MUTEXES
-# include "rtmutex-debug.h"
-#else
-# include "rtmutex.h"
-#endif
-
 /*
  * lock->owner state tracking:
  *
Index: linux-2.6.22-rc4/kernel/rtmutex_common.h
===
--- linux-2.6.22-rc4.orig/kernel/rtmutex_common.h   2007-06-08 
01:42:29.0 +0200
+++ linux-2.6.22-rc4/kernel/rtmutex_common.h2007-06-08 01:48:02.0 
+0200
@@ -154,4 +154,11 @@ extern int rt_mutex_adjust_prio_chain(st
  struct task_struct *top_task);
 extern void remove_waiter(struct rt_mutex *lock,
  struct rt_mutex_waiter *waiter);
+
+#ifdef CONFIG_DEBUG_RT_MUTEXES
+# include "rtmutex-debug.h"
+#else
+# include "rtmutex.h"
+#endif
+
 #endif
Index: linux-2.6.22-rc4/kernel/futex.c
===
--- linux-2.6.22-rc4.orig/kernel/futex.c2007-06-08 01:46:59.0 
+0200
+++ linux-2.6.22-rc4/kernel/futex.c 2007-06-08 01:48:02.0 +0200
@@ -56,12 +56,6 @@
 
 #include "rtmutex_common.h"
 
-#ifdef CONFIG_DEBUG_RT_MUTEXES
-# include "rtmutex-debug.h"
-#else
-# include "rtmutex.h"
-#endif
-
 #define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
 
 /*
@@ -133,6 +127,24 @@ static struct futex_hash_bucket futex_qu
 static struct vfsmount *futex_mnt;
 
 /*
+ * Take mm->mmap_sem, when futex is shared
+ */
+static inline void futex_lock_mm(struct rw_semaphore *fshared)
+{
+   if (fshared)
+   down_read(fshared);
+}
+
+/*
+ * Release mm->mmap_sem, when the futex is shared
+ */
+static inline void futex_unlock_mm(struct rw_semaphore *fshared)
+{
+   if (fshared)
+   up_read(fshared);
+}
+
+/*
  * We hash on the keys returned from get_futex_key (see below).
  */
 static struct futex_hash_bucket *hash_futex(union futex_key *key)
@@ -302,7 +314,18 @@ void drop_futex_key_refs(union futex_key
 }
 EXPORT_SYMBOL_GPL(drop_futex_key_refs);
 
-static inline int get_futex_value_locked(u32 *dest, u32 __user *from)
+static u32 cmpxchg_futex_value_locked(u32 __user *uaddr, u32 uval, u32 newval)
+{
+   u32 curval;
+
+   pagefault_disable();
+   curval = futex_atomic_cmpxchg_inatomic(uaddr, uval, newval);
+   pagefault_enable();
+
+   return curval;
+}
+
+static int get_futex_value_locked(u32 *dest, u32 __user *from)
 {
int ret;
 
@@ -424,14 +447,12 @@ static struct task_struct * futex_find_g
 
rcu_read_lock();
p = find_task_by_pid(pid);
-   if (!p)
-   goto out_unlock;
-   if ((current->euid != p->euid) && (current->euid != p->uid)) {
-   p = NULL;
-   goto out_unlock;
-   }
-   get_task_struct(p);
-out_unlock:
+
+   if (!p || ((current->euid != p->euid) && (current->euid != p->uid)))
+   p = ERR_PTR(-ESRCH);
+   else
+   get_task_struct(p);
+
rcu_read_unlock();
 
return p;
@@ -639,9 +660,7 @@ static int wake_futex_pi(u32 __user *uad
/* Keep the FUTEX_WAITER_REQUEUED flag if it was set */
newval |= (uval & FUTEX_WAITER_REQUEUED);
 
-   pagefault_disable();
-   curval = futex_atomic_cmpxchg_inatomic(uaddr, uval, newval);
-   pagefault_enable();
+   curval = cmpxchg_futex_value_locked(uaddr, uval, newval);
 
if (curval == -EFAULT)
ret = -EFAULT;
@@ -678,9 +697,7 @@ static int unlock_futex_pi(u32 __user *u
 * There is no waiter, so we unlock the futex. The owner died
 * bit has not to be preserved here. We are the owner:
 */
-   pagefault_disable();
-   oldval =

Re: 2.6.22-rc4-mm1

2007-06-07 Thread KAMEZAWA Hiroyuki

On Thu, 7 Jun 2007 08:34:58 -0700
Andrew Morton <[EMAIL PROTECTED]> wrote:
> 
> I assume the above is your code - it's not in the tree?
> 
Ah, that code was disappeared in -mm2. 

But it informed me that I should consider memory unplug v.s. sys_mremap case... 

Thanks, anyway.

-Kame

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 3/4] pi-futex: Fix exit races and locking problems

2007-06-07 Thread Thomas Gleixner


From: Alexey Kuznetsov <[EMAIL PROTECTED]>

1. New entries can be added to tsk->pi_state_list after task completed
   exit_pi_state_list(). The result is memory leakage and deadlocks.

2. handle_mm_fault() is called under spinlock. The result is obvious.

3. results in self-inflicted deadlock inside glibc.
   Sometimes futex_lock_pi returns -ESRCH, when it is not expected
   and glibc enters to for(;;) sleep() to simulate deadlock. This problem
   is quite obvious and I think the patch is right. Though it looks like
   each "if" in futex_lock_pi() got some stupid special case "else if". :-)

4. sometimes futex_lock_pi() returns -EDEADLK,
   when nobody has the lock. The reason is also obvious (see comment
   in the patch), but correct fix is far beyond my comprehension.
   I guess someone already saw this, the chunk:

if (rt_mutex_trylock(_state->pi_mutex))
ret = 0;

   is obviously from the same opera. But it does not work, because the
   rtmutex is really taken at this point: wake_futex_pi() of previous
   owner reassigned it to us. My fix works. But it looks very stupid.
   I would think about removal of shift of ownership in wake_futex_pi()
   and making all the work in context of process taking lock.

From: Thomas Gleixner <[EMAIL PROTECTED]>

Fix 1) Avoid the tasklist lock variant of the exit race fix by adding
an additional state transition to the exit code.

This fixes also the issue, when a task with recursive segfaults
is not able to release the futexes.

Fix 2) Cleanup the lookup_pi_state() failure path and solve the -ESRCH
problem finally.

Fix 3) Solve the fixup_pi_state_owner() problem which needs to do the fixup
in the lock protected section by using the in_atomic userspace access
functions.

This removes also the ugly lock drop / unqueue inside of fixup_pi_state()

Fix 4) Fix a stale lock in the error path of futex_wake_pi()

Added some error checks for verification.

The -EDEADLK problem is solved by the rtmutex fixups.

Signed-off-by: Thomas Gleixner <[EMAIL PROTECTED]>

---
 include/linux/sched.h |1 
 kernel/exit.c |   24 
 kernel/futex.c|  271 +-
 3 files changed, 184 insertions(+), 112 deletions(-)

Index: linux-2.6.22-rc4/kernel/futex.c
===
--- linux-2.6.22-rc4.orig/kernel/futex.c2007-06-08 01:42:29.0 
+0200
+++ linux-2.6.22-rc4/kernel/futex.c 2007-06-08 01:46:59.0 +0200
@@ -430,10 +430,6 @@ static struct task_struct * futex_find_g
p = NULL;
goto out_unlock;
}
-   if (p->exit_state != 0) {
-   p = NULL;
-   goto out_unlock;
-   }
get_task_struct(p);
 out_unlock:
rcu_read_unlock();
@@ -502,7 +498,7 @@ lookup_pi_state(u32 uval, struct futex_h
struct futex_q *this, *next;
struct plist_head *head;
struct task_struct *p;
-   pid_t pid;
+   pid_t pid = uval & FUTEX_TID_MASK;
 
head = >chain;
 
@@ -520,6 +516,8 @@ lookup_pi_state(u32 uval, struct futex_h
return -EINVAL;
 
WARN_ON(!atomic_read(_state->refcount));
+   WARN_ON(pid && pi_state->owner &&
+   pi_state->owner->pid != pid);
 
atomic_inc(_state->refcount);
*ps = pi_state;
@@ -530,15 +528,33 @@ lookup_pi_state(u32 uval, struct futex_h
 
/*
 * We are the first waiter - try to look up the real owner and attach
-* the new pi_state to it, but bail out when the owner died bit is set
-* and TID = 0:
+* the new pi_state to it, but bail out when TID = 0
 */
-   pid = uval & FUTEX_TID_MASK;
-   if (!pid && (uval & FUTEX_OWNER_DIED))
+   if (!pid)
return -ESRCH;
p = futex_find_get_task(pid);
-   if (!p)
-   return -ESRCH;
+   if (IS_ERR(p))
+   return PTR_ERR(p);
+
+   /*
+* We need to look at the task state flags to figure out,
+* whether the task is exiting. To protect against the do_exit
+* change of the task flags, we do this protected by
+* p->pi_lock:
+*/
+   spin_lock_irq(>pi_lock);
+   if (unlikely(p->flags & PF_EXITING)) {
+   /*
+* The task is on the way out. When PF_EXITPIDONE is
+* set, we know that the task has finished the
+* cleanup:
+*/
+   int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
+
+   spin_unlock_irq(>pi_lock);
+   put_task_struct(p);
+   return ret;
+   }
 
pi_state = alloc_pi_state();
 
@@ -551,7 +567,6 @@ lookup_pi_state(u32 uval, struct futex_h
/* Store the key for possible

[patch 2/4] rt-mutex: Fix chain walk early wakeup bug

2007-06-07 Thread Thomas Gleixner

Alexey Kuznetsov found some problems in the pi-futex code. 

One of the root causes is:

When a wakeup happens, we do not to stop the chain walk so
we follow a not longer relevant locking chain.

Drop out when this happens.

Signed-off-by: Thomas Gleixner <[EMAIL PROTECTED]>

---
 kernel/rtmutex.c |   13 +
 1 file changed, 13 insertions(+)

Index: linux-2.6.22-rc4/kernel/rtmutex.c
===
--- linux-2.6.22-rc4.orig/kernel/rtmutex.c  2007-06-08 01:39:38.0 
+0200
+++ linux-2.6.22-rc4/kernel/rtmutex.c   2007-06-08 01:39:38.0 +0200
@@ -189,6 +189,19 @@ int rt_mutex_adjust_prio_chain(struct ta
if (!waiter || !waiter->task)
goto out_unlock_pi;
 
+   /*
+* Check the orig_waiter state. After we dropped the locks,
+* the previous owner of the lock might have released the lock
+* and made us the pending owner:
+*/
+   if (orig_waiter && !orig_waiter->task)
+   goto out_unlock_pi;
+
+   /*
+* Drop out, when the task has no waiters. Note,
+* top_waiter can be NULL, when we are in the deboosting
+* mode!
+*/
if (top_waiter && (!task_has_pi_waiters(task) ||
   top_waiter != task_top_pi_waiter(task)))
goto out_unlock_pi;

-- 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 1/4] rt-mutex: Fix stale return value

2007-06-07 Thread Thomas Gleixner

Alexey Kuznetsov found some problems in the pi-futex code. 

The major problem is a stale return value in rt_mutex_slowlock():

When the pi chain walk returns -EDEADLK, but the waiter was woken up 
during the phases where the locks were dropped, the rtmutex could be
acquired, but due to the stale return value -EDEADLK returned to the
caller.

Reset the return value in the retry path.

Signed-off-by: Thomas Gleixner <[EMAIL PROTECTED]>

---
 kernel/rtmutex.c |   11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

Index: linux-2.6.22-rc4/kernel/rtmutex.c
===
--- linux-2.6.22-rc4.orig/kernel/rtmutex.c  2007-06-08 01:39:38.0 
+0200
+++ linux-2.6.22-rc4/kernel/rtmutex.c   2007-06-08 01:39:38.0 +0200
@@ -636,9 +636,16 @@ rt_mutex_slowlock(struct rt_mutex *lock,
 * all over without going into schedule to try
 * to get the lock now:
 */
-   if (unlikely(!waiter.task))
+   if (unlikely(!waiter.task)) {
+   /*
+* Reset the return value. We might
+* have returned with -EDEADLK and the
+* owner released the lock while we
+* were walking the pi chain.
+*/
+   ret = 0;
continue;
-
+   }
if (unlikely(ret))
break;
}

-- 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[patch 0/4] PI-Futex bugfixes and cleanups

2007-06-07 Thread Thomas Gleixner

Andrew,

the following patch series solves the issues with pi-futexes which
were reported from Alexey.

While the first three patches should go mainline ASAP, the last patch
is not necessarily 2.6.22 material. I did this cleanup to make the
code more readable again.

@stable: 
The rtmutex patches (1,2) should apply to stable as is, but
the futex patch will not. I'm going to backport the fixes tomorrow and
send a seperate series with all fixes.

Pending problems:

There are also the pending REQUEUE_PI issues. I have seen some
problems there, but I did not address them in this patch set. I have
no test program to verify any changes in that code.

I'm not sure, whether we should keep the REQUEUE_PI stuff for
2.6.22. I'd prefer to back it out or at least disable it, so we can
fix it proper for 2.6.23.

tglx

Sorry for the resend. My mailer did not notice the CC wreckage

-- 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: How to access correctly serial port inside module?

2007-06-07 Thread Tilman Schmidt

Am 08.06.2007 00:25 schrieb Lars K.W. Gohlke:
>>> a) With a line discipline, [...]
>>> b) With the "serio" interface, [...]
>>>
>> ok because I want to handle it in kernel space I will no take option a)
> 
>> I will try it the way of b)

> I asked [EMAIL PROTECTED] as one of the driver developer
> 
> this is how I did it (just prototype)
> 
> (neccessary just one folder and two files)
> 1. copy template linux-source/drivers/input/joystick/magellan.c
> 2. insert in line 90  :
>   printk("%s:\"%s\"\n",__FUNCTION__, data);
> 3. make and insert it
> 4. inputattach --magellan /dev/ttyS0 &

I see. So it turns out the "serio" interface finally goes through a
line discipline (N_MOUSE) and consequently also needs a userspace
daemon (inputattach). That doesn't give me any advantage compared
to attaching directly to the line discipline interface, so I'll
stay with the latter approach.

But glad to hear it worked out for you.

Take care,
Tilman

-- 
Tilman Schmidt  E-Mail: [EMAIL PROTECTED]
Bonn, Germany
Diese Nachricht besteht zu 100% aus wiederverwerteten Bits.
Ungeöffnet mindestens haltbar bis: (siehe Rückseite)



signature.asc
Description: OpenPGP digital signature

Re: [PATCH] trim memory not covered by WB MTRRs

2007-06-07 Thread Andrew Morton

On Wed, 6 Jun 2007 12:29:23 -0700
Jesse Barnes <[EMAIL PROTECTED]> wrote:

> --- a/arch/i386/kernel/cpu/mtrr/if.c
> +++ b/arch/i386/kernel/cpu/mtrr/if.c
> @@ -12,7 +12,7 @@
>  #include "mtrr.h"
>  
>  /* RED-PEN: this is accessed without any locking */
> -extern unsigned int *usage_table;
> +extern unsigned int usage_table[];
>  
>  
> --- a/arch/i386/kernel/cpu/mtrr/main.c
> +++ b/arch/i386/kernel/cpu/mtrr/main.c
> @@ -47,7 +47,7 @@
>  
>  u32 num_var_ranges = 0;
>  
> -unsigned int *usage_table;
> +unsigned int usage_table[NUM_VAR_RANGES];
>  static DEFINE_MUTEX(mtrr_mutex);

didn't it feel all dirty when you had to do that?



From: Andrew Morton <[EMAIL PROTECTED]>

- Move the declaration into a header file

- "usage_table" is a dumb name for an mtrr-specific kernel-wide identifier.

  There appear to beseveral other poorly-chosen identifiers in mtrr.

Cc: Andi Kleen <[EMAIL PROTECTED]>
Cc: Jesse Barnes <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 arch/i386/kernel/cpu/mtrr/if.c   |8 ++--
 arch/i386/kernel/cpu/mtrr/main.c |   17 +
 arch/i386/kernel/cpu/mtrr/mtrr.h |2 ++
 3 files changed, 13 insertions(+), 14 deletions(-)

diff -puN 
arch/i386/kernel/cpu/mtrr/if.c~i386-x86_64-trim-memory-not-covered-by-wb-mtrrs-fix
 arch/i386/kernel/cpu/mtrr/if.c
--- 
a/arch/i386/kernel/cpu/mtrr/if.c~i386-x86_64-trim-memory-not-covered-by-wb-mtrrs-fix
+++ a/arch/i386/kernel/cpu/mtrr/if.c
@@ -11,10 +11,6 @@
 #include 
 #include "mtrr.h"
 
-/* RED-PEN: this is accessed without any locking */
-extern unsigned int usage_table[];
-
-
 #define FILE_FCOUNT(f) (((struct seq_file *)((f)->private_data))->private)
 
 static const char *const mtrr_strings[MTRR_NUM_TYPES] =
@@ -396,7 +392,7 @@ static int mtrr_seq_show(struct seq_file
for (i = 0; i < max; i++) {
mtrr_if->get(i, , , );
if (size == 0)
-   usage_table[i] = 0;
+   mtrr_usage_table[i] = 0;
else {
if (size < (0x10 >> PAGE_SHIFT)) {
/* less than 1MB */
@@ -410,7 +406,7 @@ static int mtrr_seq_show(struct seq_file
len += seq_printf(seq, 
   "reg%02i: base=0x%05lx000 (%4luMB), 
size=%4lu%cB: %s, count=%d\n",
 i, base, base >> (20 - PAGE_SHIFT), size, factor,
-mtrr_attrib_to_str(type), usage_table[i]);
+mtrr_attrib_to_str(type), mtrr_usage_table[i]);
}
}
return 0;
diff -puN 
arch/i386/kernel/cpu/mtrr/main.c~i386-x86_64-trim-memory-not-covered-by-wb-mtrrs-fix
 arch/i386/kernel/cpu/mtrr/main.c
--- 
a/arch/i386/kernel/cpu/mtrr/main.c~i386-x86_64-trim-memory-not-covered-by-wb-mtrrs-fix
+++ a/arch/i386/kernel/cpu/mtrr/main.c
@@ -47,7 +47,7 @@
 
 u32 num_var_ranges = 0;
 
-unsigned int usage_table[NUM_VAR_RANGES];
+unsigned int mtrr_usage_table[NUM_VAR_RANGES];
 static DEFINE_MUTEX(mtrr_mutex);
 
 u64 size_or_mask, size_and_mask;
@@ -127,7 +127,7 @@ static void __init init_table(void)
 
max = num_var_ranges;
for (i = 0; i < max; i++)
-   usage_table[i] = 1;
+   mtrr_usage_table[i] = 1;
 }
 
 struct set_mtrr_data {
@@ -381,7 +381,7 @@ int mtrr_add_page(unsigned long base, un
goto out;
}
if (increment)
-   ++usage_table[i];
+   ++mtrr_usage_table[i];
error = i;
goto out;
}
@@ -390,12 +390,13 @@ int mtrr_add_page(unsigned long base, un
if (i >= 0) {
set_mtrr(i, base, size, type);
if (likely(replace < 0))
-   usage_table[i] = 1;
+   mtrr_usage_table[i] = 1;
else {
-   usage_table[i] = usage_table[replace] + !!increment;
+   mtrr_usage_table[i] = mtrr_usage_table[replace] +
+   !!increment;
if (unlikely(replace != i)) {
set_mtrr(replace, 0, 0, 0);
-   usage_table[replace] = 0;
+   mtrr_usage_table[replace] = 0;
}
}
} else
@@ -525,11 +526,11 @@ int mtrr_del_page(int reg, unsigned long
printk(KERN_WARNING "mtrr: MTRR %d not used\n", reg);
goto out;
}
-   if (usage_table[reg] < 1) {
+   if (mtrr_usage_table[reg] < 1) {
printk(KERN_WARNING "mtrr: reg: %d has count=0\n", reg);
goto out;
}
-   if (--usage_table[reg] < 1)
+   if (--mtrr_usage_table[reg] < 1)
set_mtrr(reg, 0, 0, 0);
error = reg;
  out:
diff -puN

Re: [PATCH 1/3] [PATCH i386] during VM oom condition, kill all threads in process group

2007-06-07 Thread Andrew Morton

On Thu, 7 Jun 2007 18:16:21 -0500
Anton Blanchard <[EMAIL PROTECTED]> wrote:

>  
> Hi,
> 
> > zap_other_threads() requires tasklist_lock.
> > 
> > If we're going to do this then we should probably create some new function
> > (with a better name) which takes tasklsit_lock and then calls
> > zap_other_threads().
> > 
> > Does this patch fix any observed-in-the-real-world problem?  If so, please
> > describe it.
> 
> Yeah we have had complaints where threaded apps have only one thread
> shot down instead of the entire process. This leaves the application in
> a bad state, whereas if it had been killed cleanly the application could
> have restarted.
> 
> My understanding is that fatal signals should kill all threads in the
> group.
> 

OK, well could we please get all that info appropriatelt captured in #2's
changelog?

Other architectures will probably need to implement this.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] update checkpatch.pl to version 0.03

2007-06-07 Thread Adrian Bunk

On Fri, Jun 08, 2007 at 12:41:17AM +0100, Alan Cox wrote:
> > I added a MODULE_AUTHOR("J. Ørsted <[EMAIL PROTECTED]>") into the "raw" 
> > module:
> > 
> > # echo $LANG
> > C
> > # modinfo --version
> > module-init-tools version 3.3-pre11
> > # modinfo raw
> > filename:   /lib/modules/2.6.21.2/kernel/drivers/char/raw.ko
> > author: J. Ã
> > ^ the cursor hangs here
> 
> "Mummy if I deliberately shoot myself in the head it hurts"
> 
> Distro's don't ship in C locale and haven't for years. And the worst case
> effect you can engineer by trying is to display some slightly odd symbols

If it would only display some slightly odd symbols I wouldn't complain.

The problem is that the second byte is interpreted as a control code.

Is there any trick to get the shell working again in this situation?
The cursor hangs, and I've not yet found a trick to do anything in this
xterm again (except for killing it from another xterm).

> (And incidentially since the Linux fs has been defined to be utf-8 for
> naming for many years you'll find the same problem using "ls")

No, "ls" can handle it perfectly:

# echo $LANG
C
# ls
??rsted
# 

Or:

$ echo $LANG
en_US
$ ls
Ã?rsted
$ 

Different from the lsmod example, the cursor doesn't hang and the shell 
is usable after this command.

The difference is that "ls" expects and handles such issues while 
"lsmod" (and most likely also other userspace working with kernel 
output) does not yet handle it resulting in problems if bytes are 
wrongly interpreted as control codes.

> > - get module-init-tools fixed and
> > - document that 2.6.23 (or whichever will be the first kernel to support 
> >   UTF-8 in MODULE_AUTHOR) will require updated module-init-tools.
> 
> "Require" is a rather strong word for a print formatting issue specific
> to obscure setups.

See obove, it's not only "print formatting", it's "kills my shell".

> Alan

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 7/8] fdmap v2 - implement sys_socket2

2007-06-07 Thread Davide Libenzi

On Thu, 7 Jun 2007, Linus Torvalds wrote:

> 
> 
> On Thu, 7 Jun 2007, Davide Libenzi wrote:
> > 
> > We'd still need sys_nonseqfd() though, to move/dup legacy fds into the 
> > non-sequential area.
> 
> Umm. No we don't. Because it's no more than 
> 
>   indirect_syscall(dup, FD_NONSEQ)
> 
> isn't it?

Hmm, ok. It need some changes since sys_dup() and F_DUPFD uses common code 
at the moment, but it'd ok.
Basically, everything that calls get_unused_fd() can get the magic 
indirect_syscall() settings. I was just planning to localize the 
sequential/non-sequential behaviour just in there.
The sys_dup(), sys_dup2() and F_DUPFD have some custom code, although 
sys_dup() should really use get_unused_fd() in any way.

- Davide

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Intel-IOMMU 09/10] Iommu Gfx workaround

2007-06-07 Thread Andrew Morton

On Wed, 06 Jun 2007 11:57:07 -0700
[EMAIL PROTECTED] wrote:

> +#ifdef CONFIG_DMAR_GFX_WA
> + iommu_prepare_gfx_mapping();
> +#endif

Please do

#ifndef CONFIG_DMAR_GFX_WA
static inline void iommu_prepare_gfx_mapping(void)
{
}
#endif

in the head file instead (whole patchset)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Intel-IOMMU 06/10] Intel IOMMU driver

2007-06-07 Thread Andrew Morton

On Wed, 06 Jun 2007 11:57:04 -0700
[EMAIL PROTECTED] wrote:

>   Actual intel IOMMU driver. Hardware spec can be found at:
> http://www.intel.com/technology/virtualization
> 
> This driver sets X86_64 'dma_ops', so hook into standard DMA APIs. In this 
> way,
> PCI driver will get virtual DMA address. This change is transparent to PCI
> drivers.
> 
> ...
>  
> +#ifdef CONFIG_DMAR
> + detect_intel_iommu();
> +#endif
> +
>  #ifdef CONFIG_SWIOTLB
>   pci_swiotlb_init();
>  #endif
> @@ -314,6 +319,10 @@
>   calgary_iommu_init();
>  #endif
>  
> +#ifdef CONFIG_DMAR
> + intel_iommu_init();
> +#endif

We'd prefer that the header file have suitable #ifndef CONFIG_DMAR stubs,
so the ifdefs here become unneeded.

> +/* context entry handling */
> +static struct context_entry * device_to_context_entry(struct intel_iommu 
> *iommu,
> + u8 bus, u8 devfn)
> +{
> + struct root_entry *root;
> + struct context_entry *context;
> + unsigned long phy_addr;
> + unsigned long flags;
> +
> + spin_lock_irqsave(>lock, flags);
> + root = >root_entry[bus];
> + if (!(context = get_context_addr_from_root(*root))) {
> + context = (struct context_entry *)alloc_pgtable_page();
> + if (!context) {
> + spin_unlock_irqrestore(>lock, flags);
> + return NULL;
> + }
> + __iommu_flush_cache(iommu, (void *)context, PAGE_SIZE_4K);
> + phy_addr = virt_to_phys((void *)context);
> + set_root_value(*root, phy_addr);
> + set_root_present(*root);
> + __iommu_flush_cache(iommu, root, sizeof(*root));
> + }
> + spin_unlock_irqrestore(>lock, flags);
> + return [devfn];
> +}

checkpatch.pl has lots of fun with this patch.

> +/* page table handling */
> +#define LEVEL_STRIDE (9)
> +#define LEVEL_MASK   (((u64)1 << LEVEL_STRIDE) - 1)
> +#define agaw_to_level(val) ((val) + 2)
> +#define agaw_to_width(val) (30 + val * LEVEL_STRIDE)
> +#define width_to_agaw(w)  ((w - 30)/LEVEL_STRIDE)
> +#define level_to_offset_bits(l) (12 + (l - 1) * LEVEL_STRIDE)
> +#define address_level_offset(addr, level) \
> + ((addr >> level_to_offset_bits(level)) & LEVEL_MASK)
> +#define level_mask(l) (((u64)(-1)) << level_to_offset_bits(l))
> +#define level_size(l) ((u64)1 << level_to_offset_bits(l))
> +#define align_to_level(addr, l) ((addr + level_size(l) - 1) & level_mask(l))

static inlines are better than macros - please consider.

If you're going to stick with macros here then you'll find that many of the
above macro's arguments are insufficiently parenthesised.

> +#define IOMMU_WAIT_OP(iommu, offset, op, cond, sts) \
> +{\
> + unsigned long start_time = jiffies;\
> + while (1) {\
> + sts = op (iommu->reg, offset);\
> + if (cond)\
> + break;\
> + if (time_after(jiffies, start_time + DMAR_OPERATION_TIMEOUT))\
> + panic("DMAR hardware is malfunctional, please disable 
> IOMMU\n");\
> + cpu_relax();\
> + }\
> +}

wow, harsh treatment.

"malfunctioning" might parse better here.

> +static int inline get_alignment(u64 base, unsigned int size)
> +{
> + int t = 0;
> + u64 end;
> +
> + end = base + size - 1;
> + while (base != end) {
> + t++;
> + base >>= 1;
> + end >>= 1;
> + }
> + return t;
> +}

What's this (too large to inline) function doing?  I suspect we might have
helper functions which already do it...  If not, perhasp we should.


> +static int inline iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
> + u64 addr, unsigned int pages, int non_present_entry_flush)
> +{
> + unsigned int align;
> +
> + BUG_ON(addr & (~PAGE_MASK_4K));
> + BUG_ON(pages == 0);
> +
> + /* Fallback to domain selective flush if no PSI support */
> + if (!cap_pgsel_inv(iommu->cap))
> + return iommu_flush_iotlb_dsi(iommu, did,
> + non_present_entry_flush);
> +
> + /*
> +  * PSI requires page size is 2 ^ x, and the base address is naturally
> +  * aligned to the size
> +  */
> + align = get_alignment(addr >> PAGE_SHIFT_4K, pages);
> + /* Fallback to domain selective flush if size is too big */
> + if (align > cap_max_amask_val(iommu->cap))
> + return iommu_flush_iotlb_dsi(iommu, did,
> + non_present_entry_flush);
> +
> + addr >>= PAGE_SHIFT_4K + align;
> + addr <<= PAGE_SHIFT_4K + align;
> +
> + return __iommu_flush_iotlb(iommu, did, addr, align,
> + DMA_TLB_PSI_FLUSH, non_present_entry_flush);
> +}

too large for inlining.

> +static int iommu_enable_translation(struct intel_iommu *iommu)
> +{
> + u32 sts;
> + unsigned long flag;

we conventionally use "flags" for this.

> + spin_lock_irqsave(>register_lock, flag);
> + dmar_writel(iommu->reg, DMAR_GCMD_REG,

Re: [Intel-IOMMU 07/10] Intel iommu cmdline option - forcedac

2007-06-07 Thread Andrew Morton

On Wed, 06 Jun 2007 11:57:05 -0700
[EMAIL PROTECTED] wrote:

> --- linux-2.6.22-rc3.orig/Documentation/kernel-parameters.txt 2007-06-04 
> 12:40:29.0 -0700
> +++ linux-2.6.22-rc3/Documentation/kernel-parameters.txt  2007-06-04 
> 12:40:41.0 -0700
> @@ -785,6 +785,13 @@
>   bypassed by not enabling DMAR with this option. In
>   this case, gfx device will use physical address for
>   DMA.
> + forcedac

You want
forcedac [X86-64]

> + With this option iommu will not optimize to look
> + for io virtual address below 32 bit forcing dual
> + address cycle on pci bus for cards supporting greater
> + than 32 bit addressing. The default is to look
> + for translation below 32 bit and if not available
> + then look in the higher range.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC][PATCH] /proc/pid/maps doesn't match "ipcs -m" shmid

2007-06-07 Thread Badari Pulavarty




Serge E. Hallyn wrote:


Quoting Serge E. Hallyn ([EMAIL PROTECTED]):


Quoting Badari Pulavarty ([EMAIL PROTECTED]):


On Thu, 2007-06-07 at 15:37 -0500, Serge E. Hallyn wrote:


Quoting Badari Pulavarty ([EMAIL PROTECTED]):


On Thu, 2007-06-07 at 12:48 -0700, Andrew Morton wrote:


On Thu, 07 Jun 2007 10:06:37 -0700
Badari Pulavarty <[EMAIL PROTECTED]> wrote:


On Thu, 2007-06-07 at 12:43 -0400, Albert Cahalan wrote:


On 6/7/07, Badari Pulavarty <[EMAIL PROTECTED]> wrote:


BTW, I agree with Eric that its would be nice to use shmid as part
of name instead of forcing to be as inode number. It should be
possible for pmap to workout shmid from "key" or name. Isn't it ?


It is not at all nice.

1. it's incompatible ABI breakage
2. where will you put the key then, in the inode? :-)


Nope. Currently "key" is part of the name (but its not unique).


Changing to "SYSVID%d" is no good either. Look, people
are ***parsing*** this stuff in /proc. The /proc filesystem
is not some random sandbox to be playing in.

Before you go messing with it, note that the device number
also matters. (it's per-boot dynamic, but that's OK)
That's how one knows that /SYSV is not just
a regular file; sadly these didn't get a non-/ prefix.
(and no you can't fix that now; it's way too late)

Next time you feel like breaking an ABI, mind putting
"LET'S BREAK AN ABI!" in the subject of your email?


I am not breaking ABI. Its already broken in the current
mainline. I am trying to fix it by putting back the ino#
as shmid. Eric had a suggestion that, instead of depending
on the inode# to be shmid, we could embed shmid into name
(instead of "key" which is currently not unique).


BTW, I suspect this kind of thing also breaks:
a. fuser, lsof, and other resource usage display tools
b. various obscure emulators (similar to valgrind)

If you strongly feel that "old" behaviour needs to be retained, 


yup, we should put it back.  The change was, afaik, accidental.


here is the patch I originally suggested.


Confused.  Will this one-liner fix all the userspace breakage to which
Albert refers?


Yes. Albert, please correct me if I am wrong.


It will, but could lead to two different inodes with the same i_ino,
right?


Only if we generate same ID in two different namespaces. Is it currently
possible ? 


Should be nothing stopping it.



(just to be more certain, a quick test showed I can get id 0 for
different keys, and different ids for the same key 0xff, in different
ipc namespaces)


Funny. I played with it and decided that it can happen :)

Thanks,
Badari



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] update checkpatch.pl to version 0.03

2007-06-07 Thread Adrian Bunk

On Fri, Jun 08, 2007 at 01:21:52AM +0200, Adrian Bunk wrote:
>...
> I added a MODULE_AUTHOR("J. Ørsted <[EMAIL PROTECTED]>") into the "raw" 
> module:
> 
> # echo $LANG
> C
> # modinfo --version
> module-init-tools version 3.3-pre11
> # modinfo raw
> filename:   /lib/modules/2.6.21.2/kernel/drivers/char/raw.ko
> author: J. Ã
> ^ the cursor hangs here
>...

If anyone's wondering what's happening:

The UTF-8 representation of the character Ø consists of the two bytes
  0xC3 0x98

In the ISO/IEC 8859 encodings where every character is represented by 
one bytes this corresponds to two characters:

In ISO/IEC 8859-1 the byte 0xC3 represents the character Ã resulting in 
the (harmless) display of this wrong character.

But in all the ISO/IEC 8859 encodings, the byte 0x98 is the 
_control code_ "Start of String".

Therefore, if we want start using UTF-8 anywhere into the kernel, we 
must ensure that all applications correctly convert all characters 
if running in a non-UTF-8 environment.

I'm not sure that's worth the hassle.

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC][PATCH] /proc/pid/maps doesn't match "ipcs -m" shmid

2007-06-07 Thread Serge E. Hallyn

Quoting Serge E. Hallyn ([EMAIL PROTECTED]):
> Quoting Badari Pulavarty ([EMAIL PROTECTED]):
> > On Thu, 2007-06-07 at 15:37 -0500, Serge E. Hallyn wrote:
> > > Quoting Badari Pulavarty ([EMAIL PROTECTED]):
> > > > On Thu, 2007-06-07 at 12:48 -0700, Andrew Morton wrote:
> > > > > On Thu, 07 Jun 2007 10:06:37 -0700
> > > > > Badari Pulavarty <[EMAIL PROTECTED]> wrote:
> > > > > 
> > > > > > On Thu, 2007-06-07 at 12:43 -0400, Albert Cahalan wrote:
> > > > > > > On 6/7/07, Badari Pulavarty <[EMAIL PROTECTED]> wrote:
> > > > > > > 
> > > > > > > > BTW, I agree with Eric that its would be nice to use shmid as 
> > > > > > > > part
> > > > > > > > of name instead of forcing to be as inode number. It should be
> > > > > > > > possible for pmap to workout shmid from "key" or name. Isn't it 
> > > > > > > > ?
> > > > > > > 
> > > > > > > It is not at all nice.
> > > > > > > 
> > > > > > > 1. it's incompatible ABI breakage
> > > > > > > 2. where will you put the key then, in the inode? :-)
> > > > > > 
> > > > > > Nope. Currently "key" is part of the name (but its not unique).
> > > > > > 
> > > > > > > 
> > > > > > > Changing to "SYSVID%d" is no good either. Look, people
> > > > > > > are ***parsing*** this stuff in /proc. The /proc filesystem
> > > > > > > is not some random sandbox to be playing in.
> > > > > > > 
> > > > > > > Before you go messing with it, note that the device number
> > > > > > > also matters. (it's per-boot dynamic, but that's OK)
> > > > > > > That's how one knows that /SYSV is not just
> > > > > > > a regular file; sadly these didn't get a non-/ prefix.
> > > > > > > (and no you can't fix that now; it's way too late)
> > > > > > > 
> > > > > > > Next time you feel like breaking an ABI, mind putting
> > > > > > > "LET'S BREAK AN ABI!" in the subject of your email?
> > > > > > 
> > > > > > I am not breaking ABI. Its already broken in the current
> > > > > > mainline. I am trying to fix it by putting back the ino#
> > > > > > as shmid. Eric had a suggestion that, instead of depending
> > > > > > on the inode# to be shmid, we could embed shmid into name
> > > > > > (instead of "key" which is currently not unique).
> > > > > > 
> > > > > > > BTW, I suspect this kind of thing also breaks:
> > > > > > > a. fuser, lsof, and other resource usage display tools
> > > > > > > b. various obscure emulators (similar to valgrind)
> > > > > > 
> > > > > > If you strongly feel that "old" behaviour needs to be retained, 
> > > > > 
> > > > > yup, we should put it back.  The change was, afaik, accidental.
> > > > > 
> > > > > > here is the patch I originally suggested.
> > > > > 
> > > > > Confused.  Will this one-liner fix all the userspace breakage to which
> > > > > Albert refers?
> > > > 
> > > > Yes. Albert, please correct me if I am wrong.
> > > 
> > > It will, but could lead to two different inodes with the same i_ino,
> > > right?
> > 
> > Only if we generate same ID in two different namespaces. Is it currently
> > possible ? 
> 
> Should be nothing stopping it.

(just to be more certain, a quick test showed I can get id 0 for
different keys, and different ids for the same key 0xff, in different
ipc namespaces)

-serge
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 7/8] fdmap v2 - implement sys_socket2

2007-06-07 Thread Linus Torvalds



On Thu, 7 Jun 2007, Davide Libenzi wrote:
> 
> We'd still need sys_nonseqfd() though, to move/dup legacy fds into the 
> non-sequential area.

Umm. No we don't. Because it's no more than 

indirect_syscall(dup, FD_NONSEQ)

isn't it?

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Some ATARI config option confusion

2007-06-07 Thread Jesper Juhl


FYI

I just finished configuring my freshly git pull'ed kernel source and
then when I started building it some ATARI related config symbols
yelled at me :(

$ make -j 3
scripts/kconfig/conf -s arch/i386/Kconfig
drivers/input/keyboard/Kconfig:170:warning: 'select' used by config
symbol 'KEYBOARD_ATARI' refers to undefined symbol 'ATARI_KBD_CORE'
drivers/input/mouse/Kconfig:182:warning: 'select' used by config
symbol 'MOUSE_ATARI' refers to undefined symbol 'ATARI_KBD_CORE'
...

--
Jesper Juhl <[EMAIL PROTECTED]>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please  http://www.expita.com/nomime.html
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] update checkpatch.pl to version 0.03

2007-06-07 Thread Alan Cox

> I added a MODULE_AUTHOR("J. Ørsted <[EMAIL PROTECTED]>") into the "raw" 
> module:
> 
> # echo $LANG
> C
> # modinfo --version
> module-init-tools version 3.3-pre11
> # modinfo raw
> filename:   /lib/modules/2.6.21.2/kernel/drivers/char/raw.ko
> author: J. Ã
> ^ the cursor hangs here

"Mummy if I deliberately shoot myself in the head it hurts"

Distro's don't ship in C locale and haven't for years. And the worst case
effect you can engineer by trying is to display some slightly odd symbols

(And incidentially since the Linux fs has been defined to be utf-8 for
naming for many years you'll find the same problem using "ls")

> - get module-init-tools fixed and
> - document that 2.6.23 (or whichever will be the first kernel to support 
>   UTF-8 in MODULE_AUTHOR) will require updated module-init-tools.

"Require" is a rather strong word for a print formatting issue specific
to obscure setups.

Alan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Intel-IOMMU 05/10] IOVA allocation and management routines

2007-06-07 Thread Andrew Morton

On Wed, 06 Jun 2007 11:57:03 -0700
[EMAIL PROTECTED] wrote:

>   This code implements a generic IOVA allocation and 
> management. As per Dave's suggestion we are now allocating
> IO virtual address from Higher DMA limit address rather
> than lower end address and this eliminated the need to preserve
> the IO virtual address for multiple devices sharing the same
> domain virtual address.
> 
> Also this code uses red black trees to store the allocated and
> reserved iova nodes. This showed a good performance improvements
> over previous linear linked list.
> 
> ...
>
> +
> +/**
> + * alloc_iova - allocates an iova
> + * @iovad - iova domain in question
> + * @size - size of page frames to allocate
> + * @limit_pfn - max limit address
> + * This function allocates an iova in the range limit_pfn to IOVA_START_PFN
> + * looking from limit_pfn instead from IOVA_START_PFN.
> + */
> +
> +struct iova *
> +alloc_iova(struct iova_domain *iovad, unsigned long size, unsigned long 
> limit_pfn)

Generally we omit the blank line between the end-of-comment and the
function definition.

> +{
> + unsigned long flags,flags1;
> + struct iova *new_iova;
> + int ret;
> +
> + new_iova = alloc_iova_mem();
> + if (!new_iova)
> + return NULL;
> +
> + spin_lock_irqsave(>iova_alloc_lock, flags1);
> + ret = __alloc_iova_range(iovad, size, limit_pfn, new_iova);
> +
> + if (ret) {
> + spin_unlock_irqrestore(>iova_alloc_lock, flags1);
> + free_iova_mem(new_iova);
> + return NULL;
> + }
> +
> + /* Insert the new_iova into domain rbtree by holding writer lock */
> + spin_lock_irqsave(>iova_rbtree_lock, flags);
> + iova_insert_rbtree(>rbroot, new_iova);
> + __cached_rbnode_insert_update(iovad, limit_pfn, new_iova);
> + spin_unlock_irqrestore(>iova_rbtree_lock, flags);
> +
> + spin_unlock_irqrestore(>iova_alloc_lock, flags1);

spin_unlock_irqrestore() within spin_unlock_irqrestore() is fairly
pointless.  You can just use spin_lock()/spin_unlock() for the innermost
pair.

> + return new_iova;
> +}
>
> ...
>
> +__is_range_overlap(struct rb_node *node, unsigned long pfn_lo, unsigned long 
> pfn_hi)
> +{
> + struct iova * iova = container_of(node, struct iova, node);

run checkpatch.pl, please.

> + if ((pfn_lo <= iova->pfn_hi) && (pfn_hi >= iova->pfn_lo))
> + return 1;
> + return 0;
> +}
> +
>
> ...
>
> +struct iova *
> +reserve_iova(struct iova_domain *iovad, unsigned long pfn_lo, unsigned long 
> pfn_hi)
> +{
> + struct rb_node *node;
> + unsigned long flags, flags1;
> + struct iova *iova;
> + unsigned int overlap = 0;
> +
> + spin_lock_irqsave(>iova_alloc_lock, flags);
> + spin_lock_irqsave(>iova_rbtree_lock, flags1);
> + for (node = rb_first(>rbroot); node; node = rb_next(node)) {
> + if (__is_range_overlap(node, pfn_lo, pfn_hi)) {
> + iova = container_of(node, struct iova, node);
> + __adjust_overlap_range(iova, _lo, _hi);
> + if ((pfn_lo >= iova->pfn_lo) &&
> + (pfn_hi <= iova->pfn_hi))
> + goto finish;
> + overlap = 1;
> +
> + } else if (overlap)
> + break;
> + }
> +
> + /* We are here either becasue this is the first reserver node
> +  * or need to insert remaining non overlap addr range
> +  */
> + iova = __insert_new_range(iovad, pfn_lo, pfn_hi);
> +finish:
> +
> + spin_unlock_irqrestore(>iova_rbtree_lock, flags1);
> + spin_unlock_irqrestore(>iova_alloc_lock, flags);

ditto

> + return iova;
> +}
> +
> +/**
> + * copy_reserved_iova - copies the reserved between domains
> + * @from: - source doamin from where to copy
> + * @to: - destination domin where to copy
> + * This function copies reserved iova's from one doamin to
> + * other.
> + */
> +void
> +copy_reserved_iova(struct iova_domain *from, struct iova_domain *to)
> +{
> + unsigned long flags, flags1;
> + struct rb_node *node;
> + spin_lock_irqsave(>iova_alloc_lock, flags);
> + spin_lock_irqsave(>iova_rbtree_lock, flags1);

Add a blank line betwee the end-of-locals and the start-of-statements

> + for (node = rb_first(>rbroot); node; node = rb_next(node)) {
> + struct iova *iova = container_of(node, struct iova, node);
> + struct iova *new_iova;
> + new_iova = reserve_iova(to, iova->pfn_lo, iova->pfn_hi);
> + if (!new_iova)
> + printk(KERN_ERR "Reserve iova range [EMAIL PROTECTED] 
> failed\n",
> + iova->pfn_lo, iova->pfn_lo);
> + }
> + spin_unlock_irqrestore(>iova_rbtree_lock, flags1);
> + spin_unlock_irqrestore(>iova_alloc_lock, flags);

ditto

> +}
> Index: linux-2.6.22-rc3/drivers/pci/iova.h
> ===
> ---

Re: [PATCH 2/2] LEDS: generic driver

2007-06-07 Thread Richard Purdie

On Wed, 2007-06-06 at 19:02 +0200, Robin Farine wrote: 
> From: Robin Farine <[EMAIL PROTECTED]>
> 
> This generic LED driver implements the platform independent part of a
> LED driver letting platform specific code focus on the hardware
> details. The driver binds to platform devices named "Generic-LED"
> which provide the platform specific data and code needed to act on an
> LED.
> 
> This is useful for platforms with exotic ways of controlling LEDs
> which preclude the use of a common LED driver such as GPIO based
> drivers.
> 
> Signed-off-by: Robin Farine <[EMAIL PROTECTED]>

I'm not sure about this to be honest. I can see a case for perhaps
having a couple of standard suspend/resume functions for platform device
based LED drivers as those functions are often identical. I'm not sure
whether there is going to be much need for more than that.

Having looked through a few of the LED drivers, even the suspend/resume
functions are often different...

>From a style point of view, why not s/pdev_to_led/platform_get_drvdata/?

What are your thoughts on multiple LEDs on a single device?

Given the LED class is going to get a conversion to struct device soon,
I'd prefer to put this on hold until after I've made that conversion at
which point I'll reconsider this.

Regards,

Richard

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

2007-06-07 Thread Andrew Morton

On Wed, 06 Jun 2007 11:57:00 -0700
[EMAIL PROTECTED] wrote:

> Signed-off-by: Anil S Keshavamurthy <[EMAIL PROTECTED]>

That was a terse changelog.

Obvious question: how does this differ from mempools, and would it be
better to fill in any gaps in mempool functionality instead of
implementing something similar-looking?

The changelog very much should describe all this, as well as explaining
what the dynamic behaviour of this new thing is, and what applications are
envisaged, what problems it solves, etc, etc.


> --- /dev/null 1970-01-01 00:00:00.0 +
> +++ linux-2.6.22-rc3/lib/respool.c2007-06-06 11:34:46.0 -0700

There are a number of coding-style glitches in here, but
scripts/checkpatch.pl catches most of them.  Please run it, and fix.

> @@ -0,0 +1,222 @@
> +/*
> + * respool.c - library routines for handling generic pre-allocated pool of 
> objects
> + *
> + * Copyright (c) 2006, Intel Corporation.
> + *
> + * This file is released under the GPLv2.
> + *
> + * Copyright (C) 2006 Anil S Keshavamurthy <[EMAIL PROTECTED]>
> + */
> +
> +#include 
> +
> +/**
> + * get_resource_pool_obj - gets an object from the pool
> + * @ppool - resource pool in question
> + * This function gets an object from the pool and
> + * if the pool count drops below min_count, this
> + * function schedules work to grow the pool. If
> + * no elements are fount in the pool then this function
> + * tries to get memory from kernel.
> + */
> +void * get_resource_pool_obj(struct resource_pool *ppool)
> +{
> + unsigned long   flags;
> + struct list_head *plist;
> + bool queue_work = 0;
> +
> + spin_lock_irqsave(>pool_lock, flags);
> + if (!list_empty(>pool_head)) {
> + plist = ppool->pool_head.next;
> + list_del(plist);
> + ppool->curr_count--;
> + } else {
> + /*Making sure that curr_count is 0 when list is empty */
> + plist = NULL;
> + BUG_ON(ppool->curr_count != 0);
> + }
> +
> + /* Check if pool needs to grow */
> + if (ppool->curr_count <= ppool->min_count)
> + queue_work = 1;
> + spin_unlock_irqrestore(>pool_lock, flags);
> +
> + if (queue_work)
> + schedule_work(>work); /* queue work to grow the pool */
> +
> +
> + if (plist) {
> + memset(plist, 0, ppool->alloc_size); /* Zero out memory */
> + return plist;
> + }
> +
> + /* Out of luck, try to get memory from kernel */
> + plist = (struct list_head *)ppool->alloc_mem(ppool->alloc_size,
> + GFP_ATOMIC);
> +
> + return plist;
> +}

A function like this should take a gfp_t from the caller, and pass it on.

> +/**
> + * put_resource_pool_obj - puts an object back to the pool
> + * @vaddr - object's address
> + * @ppool - resource pool in question.
> + * This function puts an object back to the pool.
> + */
> +void put_resource_pool_obj(void * vaddr, struct resource_pool *ppool)
> +{
> + unsigned long   flags;
> + struct list_head *plist = (struct list_head *)vaddr;
> + bool queue_work = 0;
> +
> + BUG_ON(!vaddr);
> + BUG_ON(!ppool);
> +
> + spin_lock_irqsave(>pool_lock, flags);
> + list_add(plist, >pool_head);
> + ppool->curr_count++;
> + if (ppool->curr_count > (ppool->min_count +
> + ppool->grow_count * 2))
> + queue_work = 1;

Some of the indenting is a bit funny-looking in here.

> + spin_unlock_irqrestore(>pool_lock, flags);
> +
> + if (queue_work)
> + schedule_work(>work); /* queue work to shrink the pool */
> +}
> +
> +void
> +__grow_resource_pool(struct resource_pool *ppool,
> + unsigned int grow_count)
> +{
> + unsigned long   flags;
> + struct list_head *plist;
> +
> + while(grow_count) {
> + plist = (struct list_head *)ppool->alloc_mem(ppool->alloc_size,
> + GFP_KERNEL);

resource_pool.alloc_mem() already returns void *, so there is never a need
to cast its return value.

> + if (!plist)
> + break;
> +
> + /* Add the element to the list */
> + spin_lock_irqsave(>pool_lock, flags);
> + list_add(plist, >pool_head);
> + ppool->curr_count++;
> + spin_unlock_irqrestore(>pool_lock, flags);
> + grow_count--;
> + }
> +}
> +

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Syslets - Fix cachemiss_thread return value

2007-06-07 Thread Zach Brown

On Thu, May 31, 2007 at 02:19:23PM -0400, Jeff Dike wrote:
> cachemiss_thread should explicitly return 0 or error instead of
> task_ret_reg(current) (which is -ENOSYS anyway) because
> async_thread_helper is careful to put the return value in eax anyway.

Here's the fix I came up with for this.  Untested (my test boxes are
busy doing some aio+syslet perf runs :)).  What do you guys think?

- z

- - -

async: return task_ret_reg() from async_cachemiss_loop()

When a thread blocks in __async_schedule() it finds a thread waiting in
async_cachemiss_loop() to return to userspace.  The return value of
async_cachemiss_loop() is propagated to userspace as the return code of the
call which the thread blocked in.

Previously this return code was hard-coded to 0 and -1, which worked fine for
the first syslet syscall APIs.  sys_io_submit() wants a different return call
convention when it uses syslets.  So cachemiss_thread() was changed to return
task_ret_reg(), sys_io_submit() would set it to the value it wanted returned to
userspace.

But this, by itself, was buggy.  The syslet syscall APIs didn't set
task_ret_reg() like sys_io_submit() did, so they got garbage returned.  Both
Jeff Dike and Chris Mason saw this in testing, though I was lucky enough to
have my returned garbage pass the tests I initially tried.  It also didn't
catch the other callers of async_cachemiss_loop() which would be returning to
userspace on behalf of calls which blocked in __async_schedule().  This would
be a problem, say, if people mixed sys_threadlet_on() and sys_io_submit() in
the same task.

So this patch fixes those up.

We always return task_ret_reg() from async_cachemiss_loop().  If we're
returning to the user's stack and IP instead of on behalf of a blocking call we
set task_ret_reg() to -1 and then return it.

__exec_atom() sets task_ret_reg() to NULL if there's a chance that it will
block while executing the syscall in the atom.

Signed-off-by: Zach Brown <[EMAIL PROTECTED]>

diff -r f0d8ee165e2e kernel/async.c
--- a/kernel/async.cThu Jun 07 14:32:31 2007 -0700
+++ b/kernel/async.cThu Jun 07 16:14:16 2007 -0700
@@ -206,6 +206,13 @@ static long __exec_atom(struct task_stru

if (unlikely(atom->nr >= atom->nr_syscalls))
return -ENOSYS;
+
+   /*
+* If the syslet goes asynchronous then we want the cachemiss
+* thread to return NULL to userspace on our behalf.
+*/
+   if (likely(t->async_ready))
+   task_ret_reg(t) = 0;

ret = atom->call_table[atom->nr](atom->args[0], atom->args[1],
 atom->args[2], atom->args[3],
@@ -515,7 +522,22 @@ exec_atom(struct async_head *ah, struct 

return last_uatom;
 }
-
+/**
+ * async_cachemiss_loop - wait as a "cachemiss" thread
+ * @at:this thread's async_thread in its task_struct
+ * @ah:The async_head for this thread's group of threads
+ * @t: this thread's task_struct
+ *
+ * Sleep as a ready async "cachemiss" thread.  If another thread in our
+ * async_head blocks then __async_schedule may block that thread and transfer
+ * its userspace over to us to return in to.
+ *
+ * Before threads get to __async_schedule() they set task_ret_reg() to the
+ * value that should be returned to userspace if they block.  We return it so
+ * that our caller can return it to userspace, either through the usual syscall
+ * return path or through the helper which cachemiss_thread() returns to after
+ * being created.
+ */
 long async_cachemiss_loop(struct async_thread *at, struct async_head *ah,
  struct task_struct *t)
 {
@@ -542,10 +564,13 @@ long async_cachemiss_loop(struct async_t
 */
set_task_stack_reg(t, at->user_stack);
task_ip_reg(t) = at->user_ip;
-
-   return -1;
-   }
-   return 0;
+   /*
+* We're not returning on behalf of some call, communicate
+* that.
+*/
+   task_ret_reg(t) = -1;
+   }
+   return task_ret_reg(t);
 }

 struct cachemiss_args {
@@ -559,14 +584,6 @@ struct cachemiss_args {
  * first time: initialize, pick up the user stack/IP addresses from
  * the head and then execute the cachemiss loop. If the cachemiss
  * loop returns then we return back to user-space.
- *
- * Our return code overwrites the task_ret_reg() in ptregs in the assembly
- * stub which execution returns to before returning to userspace.  We may
- * be returning to userspace on behalf of an interrupted call which might
- * want us to return something specific to user space.  We return 
- * task_ret_reg() from this function so that the tasks we're returning
- * on behalf of can use it to pass return codes to userspace.  Perhaps
- * we shouldn't have the asm stub overwrite task_ret_reg() instead..
  */
 static long cachemiss_thread(void *data)
 {
@@ -606,8 +623,7 @@ static long cachemiss_thread(void *data)
}

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 1140 matches

Mail list logo