Re: PROBLEM: "Make nenuconfig" does not save parameters.

2007-03-10 Thread Cyrill Gorcunov
[Bodo Eggert - Sun, Mar 11, 2007 at 06:21:59AM +0100]
| Sam Ravnborg <[EMAIL PROTECTED]> wrote:
| > On Sat, Mar 10, 2007 at 10:34:41PM +0100, Jan Engelhardt wrote:
| >> On Mar 10 2007 22:27, Sam Ravnborg wrote:
| >> >On Sat, Mar 10, 2007 at 07:23:41PM +0100, Jan Engelhardt wrote:
| 
| >> >> Whether the 'working config file path' should change when you do
| >> >> 'Save as Alternate' or not, is a menuconfig axiom. Ask Sam Ravnborg
| >> >> if you want it changed :-)
| >> >
| >> >Current behaviour is not logical but on the other hand I do not
| >> >see a big need to make it so.
| >> >It seems that people very seldom uses "save alternate" anyway.
| >> >
| >> >But patches are welcome.
| >> 
| >> ^_^ The patch has already been posted, has not it?
| > No.
| > Either we keep current behaviour
| 
| , which is misleading,
| 
| > or we change to the "normal"
| > behaviour with a "Save as..." as know from all other programs.
| 
| , which is not desirable, as long as there is no "open" and "save" option
| also working as "normal".
| 
| IMO the option should have the "Save a copy" semantics, since that's what the
| name suggests.

Please describe how it should work. I mean, should we work with
a single _active_ file, so that "Save a copy" just puts a config snapshot
into some other file without affecting the original file? Or should we work
with any config file the way text editors do? Just write up your point of view
in detail and give the kernel community time to review...
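
To make the two options concrete, here is a minimal sketch in C. It is
not kconfig code: cfg_session, cfg_save_copy() and cfg_save_as() are
names made up for illustration, and the actual file writing is left out
so that only the effect on the "active" path is shown.

```c
#include <assert.h>
#include <string.h>

#define CFG_PATH_MAX 256

/* Hypothetical editing session; not a structure from kconfig. */
struct cfg_session {
    char active_path[CFG_PATH_MAX];  /* file a plain "Save" would write to */
    char last_written[CFG_PATH_MAX]; /* where the most recent write went */
};

void cfg_init(struct cfg_session *s, const char *path)
{
    assert(strlen(path) < CFG_PATH_MAX);
    strcpy(s->active_path, path);
    s->last_written[0] = '\0';
}

/* "Save a copy": write a snapshot elsewhere, keep working on the same file. */
void cfg_save_copy(struct cfg_session *s, const char *path)
{
    assert(strlen(path) < CFG_PATH_MAX);
    strcpy(s->last_written, path);   /* stands in for the real write */
}

/* "Save As...": write elsewhere and make that file the active one. */
void cfg_save_as(struct cfg_session *s, const char *path)
{
    cfg_save_copy(s, path);
    strcpy(s->active_path, path);
}
```

With "Save a copy" a later plain save still goes to the original file;
with "Save As..." it goes to the newly named one.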

| -- 
| Top 100 things you don't want the sysadmin to say:
| 51. YEEEHA!!!  What a CRASH!!!
| 
| Friß, Spammer: [EMAIL PROTECTED] [EMAIL PROTECTED]
| 


Cyrill

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [QUICKLIST 4/6] x86_64: Single Quicklist

2007-03-10 Thread Andi Kleen
On Sunday 11 March 2007 03:09, Christoph Lameter wrote:
> x86_64: Convert to use a single quicklists
> 
> This adds caching of pgds and puds, pmds, pte. That way we can
> avoid costly zeroing and initialization of special mappings in the
> pgd.
> 
> The first patch just adds a simple implementation using a single
> quicklist. As a consequence we need to zero a pgd before returning
> it to the pool.


This and the i386 version are OK with me, although it might be better to just
finish __GFP_ZERO support to do this.
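
For readers following along, the "zero before returning it to the pool"
idea can be sketched in userspace C roughly as below. This is only an
illustration under my own assumptions: struct quicklist, ql_alloc() and
ql_free() here are hypothetical names, calloc() stands in for the page
allocator, and this is not the code from the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define QL_PAGE_SIZE 4096

/* Hypothetical pool of pre-zeroed pages, linked through their first word. */
struct quicklist {
    void *head;
    size_t nr;
};

/* Pop a cached page if there is one; it is already zeroed. */
void *ql_alloc(struct quicklist *q)
{
    void **page = q->head;

    if (page) {
        q->head = *page;                 /* next pointer lives in word 0 */
        memset(page, 0, sizeof(void *)); /* clear the link, page is all zero */
        q->nr--;
        return page;
    }
    return calloc(1, QL_PAGE_SIZE);      /* slow path: fresh zeroed page */
}

/* Zero the page first, then cache it for the next allocation. */
void ql_free(struct quicklist *q, void *page)
{
    assert(page != NULL);
    memset(page, 0, QL_PAGE_SIZE);
    *(void **)page = q->head;
    q->head = page;
    q->nr++;
}
```

The cost of zeroing moves from allocation time to free time, which is
the trade-off the changelog describes.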

-Andi


Re: netconsole system freeze when cable unplugged

2007-03-10 Thread Andi Kleen
On Sat, Mar 10, 2007 at 02:06:28PM +, Simon Arlott wrote:
> On 10/03/07 13:38, Andi Kleen wrote:
> >Simon Arlott <[EMAIL PROTECTED]> writes:
> >
> >>On 09/03/07 20:42, Francois Romieu wrote:
> >>>Simon Arlott <[EMAIL PROTECTED]> :
> When I unplug the cable the system just stops responding to
> anything, at all. No message is printed to the console when the
> cable is plugged back in.
> >>>rtl8139_interrupt (spin_lock(>lock))
> >>>-> rtl8139_weird_interrupt
> >>>   -> rtl_check_media
> >>>  -> mii_check_media (printk(KERN_INFO "%s: link down\n", ...))
> >>> [netpoll stuff here]
> >>> -> rtl8139_poll_controller
> >>>-> rtl8139_interrupt
> >>>   *deadlock*
> >>>See below for my random stuff of the day. Feel free to open a PR at
> >>>bugzilla.kernel.org if the issue does not go away.
> >>The patch doesn't fix it, nothing changes. I'm not sure how this can
> >>be debugged if printk won't work...
> >
> >earlyprintk can be called directly (early_printk()) and should
> >work. It won't log over the network of course.
> 
> It also won't log over the serial console either :(

It does, you just have to configure it properly.

earlyprintk=serial,ttySx,baud

-Andi


Re: PROBLEM: "Make nenuconfig" does not save parameters.

2007-03-10 Thread Cyrill Gorcunov
[Sam Ravnborg - Sat, Mar 10, 2007 at 11:45:34PM +0100]
| On Sat, Mar 10, 2007 at 10:34:41PM +0100, Jan Engelhardt wrote:
| > 
| > On Mar 10 2007 22:27, Sam Ravnborg wrote:
| > >On Sat, Mar 10, 2007 at 07:23:41PM +0100, Jan Engelhardt wrote:
| > >> 
| > >> Whether the 'working config file path' should change when you do
| > >> 'Save as Alternate' or not, is a menuconfig axiom. Ask Sam Ravnborg
| > >> if you want it changed :-)
| > >
| > >Current behaviour is not logical but on the other hand I do not
| > >see a big need to make it so.
| > >It seems that people very seldom uses "save alternate" anyway.
| > >
| > >But patches are welcome.
| > 
| > ^_^ The patch has already been posted, has not it?
| No.
| Either we keep current behaviour or we change to the "normal"
| behaviour with a "Save as..." as know from all other programs.
| 
|   Sam
| 

Hi Sam,

I think we should use the "Save As..." idea. menuconfig, qconfig and
gconfig will all be affected by that. Please give me time to make patches.
(I'm a little busy now, so I hope to make them within a week :)

The patch I sent to Vladimir does not normalize the behaviour of the config
process; it just makes the alternate file a snapshot of the current config
state. You may review it as a temporary solution only.

Cyrill



Re: RSDL-mm 0.28

2007-03-10 Thread Willy Tarreau
On Sat, Mar 10, 2007 at 07:35:06PM -0600, Matt Mackall wrote:
> I've tested -mm2 against -mm2+noyield and -mm2+rsdl+noyield. The
> noyield patch simply makes the sched_yield syscall return immediately.
> Xorg and all tests are run at nice 0.

[skipped long and precise test report]

> Also note I could occassionally trigger nasty multi-second pauses with
> -mm2+noyield under exectest that didn't show up elsewhere. That's
> probably a bug in the mainline scheduler.

This is not a bug per se, but more a design problem. This is caused by
the interactivity booster which is unfair. Mike Galbraith and others
spent a lot of time trying to get rid of those problems a few versions
ago. In early kernels (around 2.6.11), I could trivially cause pauses
more than 30 seconds long by running a few tasks simulating an interactive
workload. It is much more difficult to achieve this with recent kernels,
and it has absolutely no effect on RSDL, which is one of the reasons
I find it great!

Regards,
Willy



Re: [PATCH] CIRRUS: Delete unused header file.

2007-03-10 Thread Andrew Morton
> On Sat, 10 Mar 2007 17:27:44 -0500 (EST) "Robert P. J. Day" <[EMAIL PROTECTED]> wrote:
> 
>   Delete apparently unused header file
> sound/pci/cs46xx/imgs/cwcemb80.h.
> 

That patch series was rather a mess

- Multiple patches with the same Subject: (I might have lost some as a result)

- Several patches which tried to remove the same header file

- Several patches which simply didn't apply

- Inconsistent changelogging, inconsistent titling

- Lack of sequence numbering (again, contributes to possible patch loss)

- Useless indenting in changelog text which I have to edit away.

We have good tools (ie: quilt) which make this sort of thing nice and easy
to get right - please use them.

I didn't check that these headers are indeed unused.  I hope you got that
minor part right..



libata extension

2007-03-10 Thread Vitaliyi

Good Day

Say I want to implement an extended set of ATA commands available to
userspace for building diagnostic tools.
For example, I need 0x40 (read verify) and 0x32 (write long), with error
handling. I tried the ide driver through ioctls, but it seems to
lack functionality and is full of gotchas. Furthermore, it sometimes
oopses.

Is it possible to use libata for this purpose, or do I need to write
a separate IDE driver?
By the way, I'm sure it should be done in kernel space, since I'm going
to deal with some HDD manufacturer commands.

P.S. I was looking through the libata and ide sources and documentation
but still don't have the broad picture.


Thanks


Re: [SLUB 0/3] SLUB: The unqueued slab allocator V5

2007-03-10 Thread Christoph Lameter
On Sat, 10 Mar 2007, Andrew Morton wrote:

> Is this safe to think about applying yet?

It's safe. By default kernels will be built with SLAB. SLUB becomes only a 
selectable alternative. It should not become the primary slab until we 
know that it's really superior overall and have thoroughly tested it in
a variety of workloads.

> We lost the leak detector feature.

There will be numerous small things that will have to be addressed. There
is also some minor work to be done for tracking callers better.
 
> It might be nice to create synonyms for PageActive, PageReferenced and
> PageError, to make things clearer in the slub core.   At the expense of
> making things less clear globally.  Am unsure.

I have been back and forth on doing that. They are somewhat similar 
in what they mean for SLUB. But creating synonyms may be confusing to 
those checking how page flags are being used.



Re: [SLUB 0/3] SLUB: The unqueued slab allocator V5

2007-03-10 Thread Andrew Morton
Is this safe to think about applying yet?

We lost the leak detector feature.

It might be nice to create synonyms for PageActive, PageReferenced and
PageError, to make things clearer in the slub core.   At the expense of
making things less clear globally.  Am unsure.


Re: RSDL-mm 0.28

2007-03-10 Thread Con Kolivas
On Sunday 11 March 2007 15:03, Matt Mackall wrote:
> On Sat, Mar 10, 2007 at 10:01:32PM -0600, Matt Mackall wrote:
> > On Sun, Mar 11, 2007 at 01:28:22PM +1100, Con Kolivas wrote:
> > > Ok I don't think there's any actual accounting problem here per se
> > > (although I did just recently post a bugfix for rsdl however I think
> > > that's unrelated). What I think is going on in the ccache testcase is
> > > that all the work is being offloaded to kernel threads reading/writing
> > > to/from the filesystem and the make is not getting any actual cpu
> > > time.
> >
> > I don't see significant system time while this is happening.
>
> Also, it's running pretty much entirely out of page cache so there
> wouldn't be a whole lot for kernel threads to do.

Well I can't reproduce that behaviour here at all whether from disk or the 
pagecache with ccache, so I'm not entirely sure what's different at your end. 
However both you and the other person reporting bad behaviour were using ATI 
drivers. That's about the only commonality? I wonder if they do need to 
yield... somewhat instead of not at all.

-- 
-ck


Re: [patch 6/9] signalfd/timerfd v1 - timerfd core ...

2007-03-10 Thread Nicholas Miell
On Sat, 2007-03-10 at 21:31 -0800, Linus Torvalds wrote:
> 
> On Sat, 10 Mar 2007, Nicholas Miell wrote:
> > 
> > Ah, I see. You're just interested in fds as a generic handle concept,
> > and not a more Plan 9 type thing.
> 
> Indeed. It's a "handle".
> 
> UNIX has pid's for "process" handles, and "file descriptors" for just 
> about everything else.

And I imagine that somebody will come up with a way of getting an fd for a
process sooner or later. 

> > If that's the goal, somebody should start thinking about reducing the
> > contents of struct file to the bare minimum (i.e. not much more than a
> > file_operations pointer).
> 
> Well, there's more there, but it really is fairly close. If you look at 
> it, a "struct file" ends up not having a lot more than the minimal stuff 
> required to use it as a a handle: it really isn't a very big structure. 
> 
> The biggest part is actually the read-ahead state, which is arguably a 
> generic thing for a file handle, even though not all kinds will be able to 
> use it. We *could* make that be behind a pointer (along with the "f_pos" 
> thing, that really logically goes along with the read-ahead thing), of 
> course, but since most files probably do end up being "traditional file" 
> structures, it's probably not wrong to just have it in the file.
> 

Actually, I was thinking of reducing struct file to the bare minimum, and
then using that as the common header shared by object-specific
structures. I don't know how unpleasant that would be from a memory
allocation perspective, though.
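
Roughly what I have in mind, as a userspace sketch only. The names
(struct handle, handle_ops, timer_handle) are invented for this mail and
are not existing kernel structures; the point is just the "common header
embedded in an object-specific structure" layout.

```c
#include <assert.h>
#include <stddef.h>

struct handle;

/* Per-object operations, in the spirit of a file_operations table. */
struct handle_ops {
    long (*read)(struct handle *h);
};

/* The bare-minimum common header: little more than the ops pointer. */
struct handle {
    const struct handle_ops *ops;
};

/* Recover the enclosing object from a pointer to its embedded header. */
#define handle_container(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* An object-specific structure that starts with the common header. */
struct timer_handle {
    struct handle base;
    long overruns;
};

/* Reading a timer handle returns and clears its overrun count. */
long timer_read(struct handle *h)
{
    struct timer_handle *t = handle_container(h, struct timer_handle, base);
    long n = t->overruns;

    t->overruns = 0;
    return n;
}

const struct handle_ops timer_ops = { .read = timer_read };

void timer_handle_init(struct timer_handle *t)
{
    t->base.ops = &timer_ops;
    t->overruns = 0;
}

/* Generic code only ever sees struct handle and dispatches through ops. */
long handle_read(struct handle *h)
{
    assert(h && h->ops && h->ops->read);
    return h->ops->read(h);
}
```

Each object type would then be allocated at its own size, which is where
the memory allocation question comes from.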

-- 
Nicholas Miell <[EMAIL PROTECTED]>



Re: [patch 6/9] signalfd/timerfd v1 - timerfd core ...

2007-03-10 Thread Davide Libenzi
On Sat, 10 Mar 2007, Linus Torvalds wrote:

> > Actually, the only place where I can find the itimerspec usefull, is 
> > indeed with TFD_TIMER_SEQ. In cases where you want you clock starting at a 
> > given time (it_value) *and* with the given frequency (it_interval).
> 
> .. and this is where itimerspec is even better: once you have absolute 
> time, *and* a process that might miss ticks (because it does something 
> else), the "absolute time start + interval" thing can avoid drifting 
> (which a "relative interval" has a really hard time doing).
> 
> So if you want a "timer tick every second, *on* the second" kind of 
> interface, you really do want a absolute time starting point, and then a 
> fixed interval. Two different times.

Alrighty, I'll use an itimerspec ...



- Davide




Re: [patch 6/9] signalfd/timerfd v1 - timerfd core ...

2007-03-10 Thread Linus Torvalds


On Sat, 10 Mar 2007, Davide Libenzi wrote:

> On Sat, 10 Mar 2007, Linus Torvalds wrote:
> 
> > (That said, using "struct itimerspec" might be a good idea. That would 
> > also obviate the need for TFD_TIMER_SEQ, since an itimerspec automatically 
> > has both "base" and "incremental" parts).
> 
> But TFD_TIMER_SEQ is a simple auto-rearm case of TFD_TIMER_REL. So the 
> timespec is sufficent too (in all three cases we just need *one* time). 

Well, people actually do use itimers like "give me a timer every second, 
starting five seconds from now".

> Actually, the only place where I can find the itimerspec usefull, is 
> indeed with TFD_TIMER_SEQ. In cases where you want you clock starting at a 
> given time (it_value) *and* with the given frequency (it_interval).

.. and this is where itimerspec is even better: once you have absolute 
time, *and* a process that might miss ticks (because it does something 
else), the "absolute time start + interval" thing can avoid drifting 
(which a "relative interval" has a really hard time doing).

So if you want a "timer tick every second, *on* the second" kind of 
interface, you really do want a absolute time starting point, and then a 
fixed interval. Two different times.
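
To put numbers on the drift argument, here is a small sketch in plain C.
It does not use any timer API; next_abs_expiry() and next_rel_expiry()
are hypothetical helpers, and times are plain nanosecond counts. The
first keeps expiries on the grid "start + k * interval" (the it_value
plus it_interval model), the second re-arms relative to whenever the
process happens to get around to it.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Absolute start + fixed interval: the next expiry is the first grid
 * point start + k * interval that lies strictly after "now", no matter
 * how late the process woke up.
 */
int64_t next_abs_expiry(int64_t start_ns, int64_t interval_ns, int64_t now_ns)
{
    assert(interval_ns > 0);
    if (now_ns < start_ns)
        return start_ns;
    int64_t ticks = (now_ns - start_ns) / interval_ns + 1;
    return start_ns + ticks * interval_ns;
}

/*
 * Relative re-arm: the next expiry is "now + interval", so any lateness
 * in handling one tick is carried into every later tick.
 */
int64_t next_rel_expiry(int64_t interval_ns, int64_t now_ns)
{
    assert(interval_ns > 0);
    return now_ns + interval_ns;
}
```

If the process handles a tick 300 units late, the absolute variant still
lands on the next grid point, while the relative variant is now 300
units off and stays that way.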

Linus


Re: [patch 6/9] signalfd/timerfd v1 - timerfd core ...

2007-03-10 Thread Linus Torvalds


On Sat, 10 Mar 2007, Nicholas Miell wrote:
> 
> Ah, I see. You're just interested in fds as a generic handle concept,
> and not a more Plan 9 type thing.

Indeed. It's a "handle".

UNIX has pid's for "process" handles, and "file descriptors" for just 
about everything else.

> If that's the goal, somebody should start thinking about reducing the
> contents of struct file to the bare minimum (i.e. not much more than a
> file_operations pointer).

Well, there's more there, but it really is fairly close. If you look at 
it, a "struct file" ends up not having a lot more than the minimal stuff 
required to use it as a a handle: it really isn't a very big structure. 

The biggest part is actually the read-ahead state, which is arguably a 
generic thing for a file handle, even though not all kinds will be able to 
use it. We *could* make that be behind a pointer (along with the "f_pos" 
thing, that really logically goes along with the read-ahead thing), of 
course, but since most files probably do end up being "traditional file" 
structures, it's probably not wrong to just have it in the file.

> It'd be useful if the polling interfaces could return small datums
> beyond just the POLL* flags -- having to do a read on timerfd just to
> get the overrun count has a lot of overhead for just an integer, and I
> imagine other things would like to pass back stuff too.

Well, since a lot of the interfaces harken back to "select()", we really 
are stuck with basically a couple of bits total (poll extends on the 
number of bits, but not a whole lot). So right now we have just "an event 
happened", and if you want to know more, you do need to do a read() or 
similar. That's true of all the traditional file descriptors too, of 
course.

> You still want timeouts, creating/setting/destroying at timer just for
> a single call to select/poll/epoll is probably too heavy weight.

Well, since the interfaces for that already exists, I'm certainly not 
going to disagree.

> timerfd() still leaves out the basic clock selection functionality
> provided by both setitimer() and timer_create().

Well, the setitimer ones do not really make sense for a timer that isn't 
directly associated with one particular process. Once it's associated with 
a file descriptor, it really isn't bound to any particular execution 
context, and as such, virtual and profiling timers really don't make any 
sense any more!

The only thing that exists outside of an execution context is really just 
"relative" and "absolute". Of course, you could still specify just what 
you want your timers to be based on (ie the "realtime" vs "monotonic" 
thing), and possibly the resolution, but it really does boil down to just 
those two choices (and the rest is just confusion).

So I really don't think you lose a lot by just limiting it to "real time" 
vs "relative time". Those really *are* the choices.

Linus


Re: PROBLEM: "Make nenuconfig" does not save parameters.

2007-03-10 Thread Bodo Eggert
Sam Ravnborg <[EMAIL PROTECTED]> wrote:
> On Sat, Mar 10, 2007 at 10:34:41PM +0100, Jan Engelhardt wrote:
>> On Mar 10 2007 22:27, Sam Ravnborg wrote:
>> >On Sat, Mar 10, 2007 at 07:23:41PM +0100, Jan Engelhardt wrote:

>> >> Whether the 'working config file path' should change when you do
>> >> 'Save as Alternate' or not, is a menuconfig axiom. Ask Sam Ravnborg
>> >> if you want it changed :-)
>> >
>> >Current behaviour is not logical but on the other hand I do not
>> >see a big need to make it so.
>> >It seems that people very seldom uses "save alternate" anyway.
>> >
>> >But patches are welcome.
>> 
>> ^_^ The patch has already been posted, has not it?
> No.
> Either we keep current behaviour

, which is misleading,

> or we change to the "normal"
> behaviour with a "Save as..." as know from all other programs.

, which is not desirable, as long as there is no "open" and "save" option
also working as "normal".

IMO the option should have the "Save a copy" semantics, since that's what the
name suggests.
-- 
Top 100 things you don't want the sysadmin to say:
51. YEEEHA!!!  What a CRASH!!!

Friß, Spammer: [EMAIL PROTECTED] [EMAIL PROTECTED]


RSDL v0.29 backport to 2.6.18.8

2007-03-10 Thread Veronique & Vincent
Hello, again,

I just saw that my 0.28 patch file was wrongly named 0.26 and that a new
version of RSDL, 0.29, has just come out... so here is RSDL 0.29 backported
to a 2.6.18.8 kernel.

This does compile, but I have not had time to fully test it yet.

> 
> Here is an update for RSDL to version 0.28
> 
> Full patch:
> http://ck.kolivas.org/patches/staircase-deadline/2.6.20-sched-rsdl-0.28.patch
> 
> Series:
> http://ck.kolivas.org/patches/staircase-deadline/2.6.20/
> 
> The patch to get you from 0.26 to 0.28:
> http://ck.kolivas.org/patches/staircase-deadline/2.6.20/sched-rsdl-0.26-0.28.patch
> 
> A similar patch and directories will be made for 2.6.21-rc3 
> without further announcement
> 

Once again, thanx Con for this nice piece of code.

Also note that this patch already includes a few other patches from the
2.6.19+ kernels, and there might also be other small pieces of code coming
from a 2.6.19+ kernel:

PATCH 1:
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.20.y.git;a=commit;h=ece8a684c75df215320b4155944979e3f78c5c93

PATCH 2:
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.20.y.git;a=commit;h=08c183f31bdbb709f177f6d3110d5f288ea33933

PATCH 3:
Original RSDL patches (thnx again Con)
http://ck.kolivas.org/patches/staircase-deadline/

Due to the project I'm currently working on, this will, in the next few
weeks, help me out comparing heavy loads on a Debian Sarge/Etch 32/64
platform.  Suggestions on benchmark tools would greatly be appreciated.

Again, I don't know if this will be helpful to anybody... but who knows!

- vin


sched-rsdl-0.29-backport-kernel-2.6.18.patch.gz
Description: GNU Zip compressed data


RSDL v0.28 for 2.6.20 -> backport to 2.6.18.8

2007-03-10 Thread Veronique & Vincent
Hi all,

> 
> Here is an update for RSDL to version 0.28
> 
> Full patch:
> http://ck.kolivas.org/patches/staircase-deadline/2.6.20-sched-rsdl-0.28.patch
> 
> Series:
> http://ck.kolivas.org/patches/staircase-deadline/2.6.20/
> 
> The patch to get you from 0.26 to 0.28:
> http://ck.kolivas.org/patches/staircase-deadline/2.6.20/sched-rsdl-0.26-0.28.patch
> 
> A similar patch and directories will be made for 2.6.21-rc3 
> without further announcement
> 

First of all, thanx Con for this nice piece of code.

I've spent the last few days trying to backport this new scheduler to
a 2.6.18 kernel.  After a lot of effort I have finally been able to
compile and run an RSDL-patched 2.6.18.8 kernel on an x86_64 arch, and
my test PC actually booted 2-3 seconds faster with it compared to a
vanilla 2.6.18.8 kernel.

This patch includes a few other patches from 2.6.19+ kernel:

PATCH 1:
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.20.y.git;a=commit;h=ece8a684c75df215320b4155944979e3f78c5c93

PATCH 2:
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.20.y.git;a=commit;h=08c183f31bdbb709f177f6d3110d5f288ea33933

PATCH 3:
The patch to get you from 0.26 to 0.28:
http://ck.kolivas.org/patches/staircase-deadline/2.6.20/sched-rsdl-0.26-0.28.patch

There might also be other small pieces of code coming from a 2.6.19+
kernel.

Due to the project I'm currently working on, this will, in the next few
weeks, help me out comparing heavy loads on a Debian Sarge/Etch 32/64
platform.  Suggestions on benchmark tools would greatly be appreciated.

I don't know if this will be helpful to anybody, but I thought it would be
nice to give it back to the lkml community.

- vin


sched-rsdl-0.26-backport-kernel-2.6.18.patch.gz
Description: sched-rsdl-0.26-backport-kernel-2.6.18.patch.gz


Re: [PATCH v2] Bitbanging i2c bus driver using the GPIO API

2007-03-10 Thread David Brownell
On Saturday 10 March 2007 5:13 am, Haavard Skinnemoen wrote:
> This is a very simple bitbanging i2c bus driver utilizing the new
> arch-neutral GPIO API. ...
> ---
> This patch is different from the first patch in the following ways:
>   * Handles pins set up as open drain (aka multidrive) by toggling
> the output value instead of the direction
>   * Handles output-only SCL pins the same way, and also does not
> install a getscl() callback for such pins
>   * Does not add anything to include/linux/i2c-ids.h
>   * Sets the output value explicitly after changing the direction to
> output.
>   * Plugs a memory leak in remove() -- algo_data wasn't freed.
>   * Prints out the pin IDs in decimal, with an extra note when clock
> stretching isn't supported
> 
> This version has been compile-tested only. I'll give it a spin when I
> get back to work on monday.
> 
> Dave, does this address your concerns?

Yes, though see my followup to Jean's note.  Unless I make time
to test this out on some system, the issues seem to be:

 (a) will need to change once gpio_direction_output() gains
 that second argument;

 (b) i2c-gpio.h could stand one minor comment addition to highlight
 an assumption.

Looking good!

- Dave


Re: [PATCH v2] Bitbanging i2c bus driver using the GPIO API

2007-03-10 Thread David Brownell
On Saturday 10 March 2007 12:15 pm, Jean Delvare wrote:
> Hi Haavard,
> 
> On Sat, 10 Mar 2007 14:13:28 +0100, Haavard Skinnemoen wrote:
> > This is a very simple bitbanging i2c bus driver utilizing the new
> > arch-neutral GPIO API. Useful for chips that don't have a built-in
> > i2c controller, additional i2c busses, or testing purposes.

This updated version looks a lot better.  However it doesn't address
the API change -- gpio_direction_output(gpio, initial_value) -- which
is understandable since that patch hasn't yet merged.


> I like the idea very much. Would this let us get rid of i2c-ixp2000?
> i2c-ixp4xx? scx200_i2c? Other drivers?

There's CONFIG_GENERIC_GPIO support for ixp4xx (nyet upstream, ISTR it's
waiting on the gpio_direction_output update), so that one should be
particularly easy to replace.  Presumably some other bitbang drivers
could vanish before long too.


> What value will you get if the SDA pin is open-drain and currently in
> output mode?

For output GPIOs, gpio_get_value() is specified to either return the
actual value at the pin ... or zero, if the hardware can't do that.
Most GPIO pins *can* do that.  (Specifically, that's how AT91 GPIOs
work, open drain or otherwise.)

(However, there can be various latencies involved.  On one chip
when I wrote the output value, then immediately read it back, I
got the old value.  Reason:  the GPIO controller clock needed
to tick first in order to latch the new input value!  It was only
about 30 MHz, so the back-to-back instructions were too fast.  You
can also sometimes notice capacitance causing similar delays.
Of course those latencies apply regardless of pin direction.)

I think Haavard is assuming the GPIO actually returns that value,
since otherwise there'd be no point in trying to use the open drain
mode.  It'd be worth capturing that in the i2c-gpio.h definition
for that struct.


> Are such GPIO pins actually able to detect that the pin is 
> low while they are not themselves driving it low?

Given a "yes" to the above, then clearly "yes" here too.  As
I noted, if it can't actually sense the value at the pin, that
function should always return zero.
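
As a toy model of why the read-back matters: on an open-drain wire the
level is high only while nobody is pulling it low, so a master that has
released the line can still see a slave holding it down. The sketch
below is plain userspace C with hypothetical names (od_wire, od_drive,
od_sense); it is not the GPIO API and ignores the latencies mentioned
above.

```c
#include <assert.h>
#include <stdbool.h>

enum { OD_MAX_DRIVERS = 4 };

/* One open-drain wire shared by several devices (e.g. SDA or SCL). */
struct od_wire {
    bool pulls_low[OD_MAX_DRIVERS]; /* true: that device is sinking the line */
};

/* A device either pulls the wire low or releases it; it never drives high. */
void od_drive(struct od_wire *w, int dev, bool pull_low)
{
    assert(dev >= 0 && dev < OD_MAX_DRIVERS);
    w->pulls_low[dev] = pull_low;
}

/* Level sensed at the pin: high only if no device pulls low (wired-AND). */
int od_sense(const struct od_wire *w)
{
    for (int i = 0; i < OD_MAX_DRIVERS; i++)
        if (w->pulls_low[i])
            return 0;
    return 1;
}
```

If device 0 is the master and has released the wire while device 1
pulls it low, od_sense() reports 0, which is exactly the case Jean asked
about.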

- Dave



Re: and try remove another quirk on this computers Re: [3/6] 2.6.21-rc2: known regressions

2007-03-10 Thread Sergio Monteiro Basto
On Fri, 2007-03-09 at 21:41 -0800, Linus Torvalds wrote:
> 
> On Sat, 10 Mar 2007, Sergio Monteiro Basto wrote:
> > 
> > With this quirk I got this oops on hibernate (but computer still
> > working) 
> 
> Well, strictly speaking it's a warning, not an oops per se. 
> 
> What happens is that the quirk wants to do an "ioremap_nocache()", which 
> allocates memory, and that happens very early during initialization when 
> interrupts are disabled.
> 
> And you're really not supposed to allocate memory, except using 
> GFP_ATOMIC. But we've always been lax about that during early boot, so we 
> have stuff that does. And resume ends up doing a lot of the same things 
> early boot does, and shows issues like this.
> 
> So the quirk is probably still a good idea, and the warning message is 
> just that - a very scary warning message, but not an indicator that 
> anything is seriously screwed up for you.
> 
> (It is an indication of a real bug, though, even though it's harmless in 
> practice in this case)

Hi, thanks.
Just to report: I tested the latest Fedora kernel (2.6.20-1.2981.fc7), which
is based on 2.6.21-rc3-git5, without any problems other than the scary
warning discussed in this email :)

Best regards,
-- 
Sérgio M. B.




Re: RSDL-mm 0.28

2007-03-10 Thread Matt Mackall
On Sat, Mar 10, 2007 at 10:01:32PM -0600, Matt Mackall wrote:
> On Sun, Mar 11, 2007 at 01:28:22PM +1100, Con Kolivas wrote:
> > Ok I don't think there's any actual accounting problem here per se
> > (although I did just recently post a bugfix for rsdl however I think
> > that's unrelated). What I think is going on in the ccache testcase is
> > that all the work is being offloaded to kernel threads reading/writing
> > to/from the filesystem and the make is not getting any actual cpu
> > time.
> 
> I don't see significant system time while this is happening.

Also, it's running pretty much entirely out of page cache so there
wouldn't be a whole lot for kernel threads to do.

-- 
Mathematics is the supreme nostalgia of our time.


Re: RSDL-mm 0.28

2007-03-10 Thread Matt Mackall
On Sun, Mar 11, 2007 at 01:28:22PM +1100, Con Kolivas wrote:
> >make -j 5 ccache
> > beryl     ok    good   awful
> > galeon    good  good   bad
> > mp3       good  good   bad
> > terminal  good  good   bad/ok
> > mouse     good  good   bad/ok
...
> >RSDL makes most of the noyield hit back in normal make and then some
> >with ccache. Impressive. But ccache is still destroying interactivity
> >somehow. The ccache effect is fairly visible even with non-parallel
> >'make'.
> 
> Ok I don't think there's any actual accounting problem here per se
> (although I did just recently post a bugfix for rsdl however I think
> that's unrelated). What I think is going on in the ccache testcase is
> that all the work is being offloaded to kernel threads reading/writing
> to/from the filesystem and the make is not getting any actual cpu
> time.

I don't see significant system time while this is happening.

> This is "worked around" in mainline thanks to the testing for
> sleeping on uninterruptible sleep in the interactivity estimator. What
> I suspect is happening is kernel threads that are running nice -5 are
> doing all the work on make's behalf in the setting of ccache since it
> is mostly i/o bound. The reason for -nice values on kernel threads is
> questionable anyway. Can you try renicing your kernel threads all to
> nice 0 and see what effect that has? Obviously this doesn't need a
> recompile, but is simple enough to implement in kthread code as a new
> default.

Sorry, little to no benefit.

-- 
Mathematics is the supreme nostalgia of our time.


[PATCH][RSDL-mm 7/7] sched: document rsdl cpu scheduler

2007-03-10 Thread Con Kolivas
From: Con Kolivas <[EMAIL PROTECTED]>

Add comprehensive documentation of the RSDL cpu scheduler design.

Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>
Cc: Ingo Molnar <[EMAIL PROTECTED]>
Cc: Nick Piggin <[EMAIL PROTECTED]>
Cc: "Siddha, Suresh B" <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 Documentation/sched-design.txt |  273 -
 1 file changed, 267 insertions(+), 6 deletions(-)

Index: linux-2.6.21-rc3-mm2/Documentation/sched-design.txt
===
--- linux-2.6.21-rc3-mm2.orig/Documentation/sched-design.txt 2007-03-11 14:47:57.0 +1100
+++ linux-2.6.21-rc3-mm2/Documentation/sched-design.txt 2007-03-11 14:48:00.0 +1100
@@ -1,11 +1,14 @@
-  Goals, Design and Implementation of the
- new ultra-scalable O(1) scheduler
+ Goals, Design and Implementation of the ultra-scalable O(1) scheduler by
+ Ingo Molnar and the Rotating Staircase Deadline cpu scheduler policy
+ designed by Con Kolivas.
 
 
-  This is an edited version of an email Ingo Molnar sent to
-  lkml on 4 Jan 2002.  It describes the goals, design, and
-  implementation of Ingo's new ultra-scalable O(1) scheduler.
-  Last Updated: 18 April 2002.
+  This was originally an edited version of an email Ingo Molnar sent to
+  lkml on 4 Jan 2002.  It describes the goals, design, and implementation
+  of Ingo's ultra-scalable O(1) scheduler. It now contains a description
+  of the Rotating Staircase Deadline priority scheduler that was built on
+  this design.
+  Last Updated: Sun Feb 25 2007
 
 
 Goal
@@ -163,3 +166,261 @@ certain code paths and data constructs. 
 code is smaller than the old one.
 
Ingo
+
+
+Rotating Staircase Deadline cpu scheduler policy
+
+
+Design summary
+==
+
+A novel design which incorporates a foreground-background descending priority
+system (the staircase) with runqueue managed minor and major epochs (rotation
+and deadline).
+
+
+Features
+
+
+A starvation free, strict fairness O(1) scalable design with interactivity
+as good as the above restrictions can provide. There is no interactivity
+estimator, no sleep/run measurements and only simple fixed accounting.
+The design and its accounting are strict enough that task behaviour
+can be modelled and maximum scheduling latencies can be predicted by
+the virtual deadline mechanism that manages runqueues. The prime concern
+in this design is to maintain fairness at all costs determined by nice level,
+yet to maintain as good interactivity as can be allowed within the
+constraints of strict fairness.
+
+
+Design description
+==
+
+RSDL works off the principle of providing each task a quota of runtime that
+it is allowed to run at each priority level equal to its static priority
+(ie. its nice level) and every priority below that. When each task is queued,
+the cpu that it is queued onto also keeps a record of that quota. If the
+task uses up its quota it is decremented one priority level. Also, if the cpu
+notices a full quota has been used for that priority level, it pushes
+everything remaining at that priority level to the next lowest priority
+level. Once every runtime quota has been consumed of every priority level,
+a task is queued on the "expired" array. When no other tasks exist with
+quota, the expired array is activated and fresh quotas are handed out. This
+is all done in O(1).
+
+
+Design details
+==
+
+Each cpu has its own runqueue which micromanages its own epochs, and each
+task keeps a record of its own entitlement of cpu time. Most of the rest
+of these details apply to non-realtime tasks as rt task management is
+straight forward.
+
+Each runqueue keeps a record of what major epoch it is up to in the
+rq->prio_rotation field which is incremented on each major epoch. It also
+keeps a record of quota available to each priority value valid for that
+major epoch in rq->prio_quota[].
+
+Each task keeps a record of what major runqueue epoch it was last running
+on in p->rotation. It also keeps a record of what priority levels it has
+already been allocated quota from during this epoch in a bitmap p->bitmap.
+
+The only tunable that determines all other details is the RR_INTERVAL. This
+is set to 6ms (minimum on 1000HZ, higher at different HZ values).
+
+All tasks are initially given a quota based on RR_INTERVAL. This is equal to
+RR_INTERVAL between nice values of 0 and 19, and progressively larger for
+nice values from -1 to -20. This is assigned to p->quota and only changes
+with changes in nice level.
+
+As a task is first queued, it checks in recalc_task_prio to see if it has
+run at this runqueue's current priority rotation. If it has not, it will
+have its p->prio level set to equal its p->static_prio (nice level) and will
+be given a p->time_slice equal to the p->quota, and 
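
A minimal userspace sketch of the per-task half of the description above may
help when reading it. It is not code from the patch: the struct, the function
names and the figure of 40 priority levels are assumptions made for
illustration, and the per-cpu quota in rq->prio_quota[] is left out entirely.
It only shows a task stepping down one priority level each time it uses up
its quota, and being marked expired once no levels remain.

```c
#include <assert.h>

/* Hypothetical number of non-realtime priority levels (nice -20..19). */
#define SKETCH_PRIO_LEVELS 40

struct sketch_task {
	int static_prio;  /* starting level, derived from the nice value */
	int prio;         /* current level; a larger number is lower priority */
	int quota;        /* ticks the task may run at each level */
	int time_slice;   /* ticks left at the current level */
	int expired;      /* set once every level has been used up */
};

/* Start a fresh major epoch: back to static priority with a full slice. */
static void sketch_task_reset(struct sketch_task *p)
{
	assert(p->static_prio >= 0 && p->static_prio < SKETCH_PRIO_LEVELS);
	assert(p->quota > 0);
	p->prio = p->static_prio;
	p->time_slice = p->quota;
	p->expired = 0;
}

/* Account one tick of runtime; step down one level when the quota is gone. */
static void sketch_task_tick(struct sketch_task *p)
{
	if (p->expired)
		return;
	if (--p->time_slice > 0)
		return;
	if (++p->prio >= SKETCH_PRIO_LEVELS) {
		/* No lower level left: this is where the task would be
		 * queued on the "expired" array. */
		p->prio = SKETCH_PRIO_LEVELS - 1;
		p->expired = 1;
		return;
	}
	p->time_slice = p->quota;
}
```

With static_prio 38 and a quota of 6 ticks, such a task runs 6 ticks at level
38, 6 more at level 39, and is then expired until the next major epoch.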

[PATCH][RSDL-mm 3/7] sched: remove noninteractive flag

2007-03-10 Thread Con Kolivas
From: Con Kolivas <[EMAIL PROTECTED]>

Remove the TASK_NONINTERACTIVE flag as it will no longer be used.

Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>
Cc: Ingo Molnar <[EMAIL PROTECTED]>
Cc: Nick Piggin <[EMAIL PROTECTED]>
Cc: "Siddha, Suresh B" <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 fs/pipe.c |7 +--
 include/linux/sched.h |3 +--
 2 files changed, 2 insertions(+), 8 deletions(-)

Index: linux-2.6.21-rc3-mm2/fs/pipe.c
===
--- linux-2.6.21-rc3-mm2.orig/fs/pipe.c 2007-03-11 14:47:57.0 +1100
+++ linux-2.6.21-rc3-mm2/fs/pipe.c  2007-03-11 14:47:59.0 +1100
@@ -41,12 +41,7 @@ void pipe_wait(struct pipe_inode_info *p
 {
DEFINE_WAIT(wait);
 
-   /*
-* Pipes are system-local resources, so sleeping on them
-* is considered a noninteractive wait:
-*/
-   prepare_to_wait(&pipe->wait, &wait,
-   TASK_INTERRUPTIBLE | TASK_NONINTERACTIVE);
+   prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE);
if (pipe->inode)
mutex_unlock(&pipe->inode->i_mutex);
schedule();
Index: linux-2.6.21-rc3-mm2/include/linux/sched.h
===
--- linux-2.6.21-rc3-mm2.orig/include/linux/sched.h 2007-03-11 14:47:57.0 +1100
+++ linux-2.6.21-rc3-mm2/include/linux/sched.h  2007-03-11 14:47:59.0 +1100
@@ -150,8 +150,7 @@ extern unsigned long weighted_cpuload(co
 #define EXIT_ZOMBIE16
 #define EXIT_DEAD  32
 /* in tsk->state again */
-#define TASK_NONINTERACTIVE64
-#define TASK_DEAD  128
+#define TASK_DEAD  64
 
 #define __set_task_state(tsk, state_value) \
do { (tsk)->state = (state_value); } while (0)

-- 
-ck


[PATCH][RSDL-mm 4/7] sched: implement 180 bit sched bitmap

2007-03-10 Thread Con Kolivas
From: Con Kolivas <[EMAIL PROTECTED]>

Modify the sched_find_first_bit function to work on a 180bit long bitmap.

Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>
Cc: Ingo Molnar <[EMAIL PROTECTED]>
Cc: Nick Piggin <[EMAIL PROTECTED]>
Cc: "Siddha, Suresh B" <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 include/asm-generic/bitops/sched.h |   10 ++
 include/asm-s390/bitops.h  |   12 +---
 2 files changed, 7 insertions(+), 15 deletions(-)

Index: linux-2.6.21-rc3-mm2/include/asm-generic/bitops/sched.h
===
--- linux-2.6.21-rc3-mm2.orig/include/asm-generic/bitops/sched.h 2007-03-11 14:47:57.0 +1100
+++ linux-2.6.21-rc3-mm2/include/asm-generic/bitops/sched.h 2007-03-11 14:47:59.0 +1100
@@ -6,8 +6,8 @@
 
 /*
  * Every architecture must define this function. It's the fastest
- * way of searching a 140-bit bitmap where the first 100 bits are
- * unlikely to be set. It's guaranteed that at least one of the 140
+ * way of searching a 180-bit bitmap where the first 100 bits are
+ * unlikely to be set. It's guaranteed that at least one of the 180
  * bits is cleared.
  */
 static inline int sched_find_first_bit(const unsigned long *b)
@@ -15,7 +15,7 @@ static inline int sched_find_first_bit(c
 #if BITS_PER_LONG == 64
if (unlikely(b[0]))
return __ffs(b[0]);
-   if (likely(b[1]))
+   if (b[1])
return __ffs(b[1]) + 64;
return __ffs(b[2]) + 128;
 #elif BITS_PER_LONG == 32
@@ -27,7 +27,9 @@ static inline int sched_find_first_bit(c
return __ffs(b[2]) + 64;
if (b[3])
return __ffs(b[3]) + 96;
-   return __ffs(b[4]) + 128;
+   if (b[4])
+   return __ffs(b[4]) + 128;
+   return __ffs(b[5]) + 160;
 #else
 #error BITS_PER_LONG not defined
 #endif
Index: linux-2.6.21-rc3-mm2/include/asm-s390/bitops.h
===
--- linux-2.6.21-rc3-mm2.orig/include/asm-s390/bitops.h 2007-03-11 14:47:57.0 +1100
+++ linux-2.6.21-rc3-mm2/include/asm-s390/bitops.h  2007-03-11 14:47:59.0 +1100
@@ -729,17 +729,7 @@ find_next_bit (const unsigned long * add
return offset + find_first_bit(p, size);
 }
 
-/*
- * Every architecture must define this function. It's the fastest
- * way of searching a 140-bit bitmap where the first 100 bits are
- * unlikely to be set. It's guaranteed that at least one of the 140
- * bits is cleared.
- */
-static inline int sched_find_first_bit(unsigned long *b)
-{
-   return find_first_bit(b, 140);
-}
-
+#include 
 #include 
 
 #include 

-- 
-ck
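
For readers who want to try the search outside the kernel, here is a small
userspace sketch of finding the first set bit in a 180-bit bitmap. It is an
illustration under assumptions, not the patched function: the names are made
up, a portable loop stands in for __ffs(), and it walks the words in order
instead of using the unrolled likely/unlikely form. Unlike the kernel
version, which relies on at least one bit being set, it returns 180 when the
bitmap is empty.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_PRIO_BITS 180
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_LONGS \
	((SKETCH_PRIO_BITS + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG)

/* Index of the lowest set bit in a non-zero word (stand-in for __ffs). */
static int sketch_ffs(unsigned long word)
{
	int bit = 0;

	assert(word != 0);
	while (!(word & 1UL)) {
		word >>= 1;
		bit++;
	}
	return bit;
}

/* Lowest set bit in the bitmap, or SKETCH_PRIO_BITS if no bit is set. */
static int sketch_find_first_bit(const unsigned long *b)
{
	for (size_t i = 0; i < SKETCH_LONGS; i++)
		if (b[i])
			return (int)(i * SKETCH_BITS_PER_LONG) + sketch_ffs(b[i]);
	return SKETCH_PRIO_BITS;
}
```

On a 64-bit build SKETCH_LONGS works out to 3 words and on a 32-bit build to
6, which matches the b[2] and b[5] cases in the patch.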


[PATCH][RSDL-mm 5/7] sched dont renice kernel threads

2007-03-10 Thread Con Kolivas
The practice of renicing kernel threads to negative nice values is of
questionable benefit at best, and at worst leads to larger latencies when
kernel threads are busy on behalf of other tasks.

Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>

---
 kernel/workqueue.c |1 -
 1 file changed, 1 deletion(-)

Index: linux-2.6.21-rc3-mm2/kernel/workqueue.c
===
--- linux-2.6.21-rc3-mm2.orig/kernel/workqueue.c 2007-03-11 14:47:57.0 +1100
+++ linux-2.6.21-rc3-mm2/kernel/workqueue.c 2007-03-11 14:47:59.0 +1100
@@ -294,7 +294,6 @@ static int worker_thread(void *__cwq)
if (!cwq->wq->freezeable)
current->flags |= PF_NOFREEZE;
 
-   set_user_nice(current, -5);
/*
 * We inherited MPOL_INTERLEAVE from the booting kernel.
 * Set MPOL_DEFAULT to insure node local allocations.

-- 
-ck


[PATCH][RSDL-mm 1/7] lists: add list splice tail

2007-03-10 Thread Con Kolivas
From: Con Kolivas <[EMAIL PROTECTED]>

Add a list_splice_tail variant of list_splice.

Patch-by: Peter Zijlstra <[EMAIL PROTECTED]>
Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>
Cc: Ingo Molnar <[EMAIL PROTECTED]>
Cc: Nick Piggin <[EMAIL PROTECTED]>
Cc: "Siddha, Suresh B" <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 include/linux/list.h |   42 ++
 1 file changed, 42 insertions(+)

Index: linux-2.6.21-rc3-mm2/include/linux/list.h
===
--- linux-2.6.21-rc3-mm2.orig/include/linux/list.h  2007-03-11 14:47:57.0 +1100
+++ linux-2.6.21-rc3-mm2/include/linux/list.h   2007-03-11 14:47:59.0 +1100
@@ -333,6 +333,20 @@ static inline void __list_splice(struct 
at->prev = last;
 }
 
+static inline void __list_splice_tail(struct list_head *list,
+ struct list_head *head)
+{
+   struct list_head *first = list->next;
+   struct list_head *last = list->prev;
+   struct list_head *at = head->prev;
+
+   first->prev = at;
+   at->next = first;
+
+   last->next = head;
+   head->prev = last;
+}
+
 /**
  * list_splice - join two lists
  * @list: the new list to add.
@@ -345,6 +359,18 @@ static inline void list_splice(struct li
 }
 
 /**
+ * list_splice_tail - join two lists at one's tail
+ * @list: the new list to add.
+ * @head: the place to add it in the first list.
+ */
+static inline void list_splice_tail(struct list_head *list,
+   struct list_head *head)
+{
+   if (!list_empty(list))
+   __list_splice_tail(list, head);
+}
+
+/**
  * list_splice_init - join two lists and reinitialise the emptied list.
  * @list: the new list to add.
  * @head: the place to add it in the first list.
@@ -417,6 +443,22 @@ static inline void list_splice_init_rcu(
 }
 
 /**
+ * list_splice_tail_init - join 2 lists at one's tail & reinitialise emptied
+ * @list: the new list to add.
+ * @head: the place to add it in the first list.
+ *
+ * The list at @list is reinitialised
+ */
+static inline void list_splice_tail_init(struct list_head *list,
+struct list_head *head)
+{
+   if (!list_empty(list)) {
+   __list_splice_tail(list, head);
+   INIT_LIST_HEAD(list);
+   }
+}
+
+/**
  * list_entry - get the struct for this entry
  * @ptr:   the  list_head pointer.
  * @type:  the type of the struct this is embedded in.

-- 
-ck
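
The pointer surgery in __list_splice_tail can be tried in userspace with a
stripped-down list type. The sketch below mirrors the logic of the patch, but
struct sketch_list and the sketch_* helpers are stand-ins invented here, not
the kernel's struct list_head API.

```c
#include <assert.h>

/* Minimal circular doubly linked list, in the style of struct list_head. */
struct sketch_list {
	struct sketch_list *next, *prev;
};

static void sketch_list_init(struct sketch_list *h)
{
	h->next = h;
	h->prev = h;
}

static int sketch_list_empty(const struct sketch_list *h)
{
	return h->next == h;
}

static void sketch_list_add_tail(struct sketch_list *node,
				 struct sketch_list *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Append every entry of @list after the last entry of @head, then
 * reinitialise @list so it is empty again. */
static void sketch_list_splice_tail_init(struct sketch_list *list,
					 struct sketch_list *head)
{
	struct sketch_list *first, *last, *at;

	if (sketch_list_empty(list))
		return;

	first = list->next;
	last = list->prev;
	at = head->prev;      /* current tail of the destination */

	first->prev = at;
	at->next = first;

	last->next = head;
	head->prev = last;

	sketch_list_init(list);
}
```

Splicing a list holding n2, n3 onto one holding n1 leaves the destination as
n1, n2, n3 and the source empty, which is the ordering a runqueue wants when
it pushes leftover tasks behind those already queued.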


[PATCH][RSDL-mm 2/7] sched: remove sleepavg from proc

2007-03-10 Thread Con Kolivas
From: Con Kolivas <[EMAIL PROTECTED]>

Remove the sleep_avg field from proc output as it will be removed from the
task_struct.

Signed-off-by: Con Kolivas <[EMAIL PROTECTED]>
Cc: Ingo Molnar <[EMAIL PROTECTED]>
Cc: Nick Piggin <[EMAIL PROTECTED]>
Cc: "Siddha, Suresh B" <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 fs/proc/array.c |2 --
 1 file changed, 2 deletions(-)

Index: linux-2.6.21-rc3-mm2/fs/proc/array.c
===
--- linux-2.6.21-rc3-mm2.orig/fs/proc/array.c   2007-03-11 14:47:57.0 +1100
+++ linux-2.6.21-rc3-mm2/fs/proc/array.c    2007-03-11 14:47:59.0 +1100
@@ -171,7 +171,6 @@ static inline char * task_state(struct t
 
buffer += sprintf(buffer,
"State:\t%s\n"
-   "SleepAVG:\t%lu%%\n"
"Tgid:\t%d\n"
"Pid:\t%d\n"
"PPid:\t%d\n"
@@ -179,7 +178,6 @@ static inline char * task_state(struct t
"Uid:\t%d\t%d\t%d\t%d\n"
"Gid:\t%d\t%d\t%d\t%d\n",
get_task_state(p),
-   (p->sleep_avg/1024)*100/(102000/1024),
p->tgid, p->pid,
pid_alive(p) ? rcu_dereference(p->parent)->tgid : 0,
tracer_pid,

-- 
-ck


[PATCH][RSDL-mm 0/7] RSDL cpu scheduler for 2.6.21-rc3-mm2

2007-03-10 Thread Con Kolivas
What follows this email is a patch series for the latest version of the RSDL 
cpu scheduler (ie v0.29). I have addressed all bugs that I am able to 
reproduce in this version so if some people would be kind enough to test if 
there are any hidden bugs or oops lurking, it would be nice to know in 
anticipation of putting this back in -mm. Thanks.

Full patch for 2.6.21-rc3-mm2:
http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc3-mm2-rsdl-0.29.patch

Patch series (which will follow this email):
http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc3-mm2/

Changelog:
- Fixed the longstanding buggy bitmap problem which occurred due to swapping 
arrays when there were still tasks on the active array.
- Fixed preemption of realtime tasks when rt prio inheritance elevated their 
priority.
- Made kernel threads not be reniced to -5 by default
- Changed sched_yield behaviour of SCHED_NORMAL (SCHED_OTHER) to resemble 
realtime task yielding.

-- 
-ck


Re: RSDL-mm 0.28

2007-03-10 Thread Con Kolivas
On Sunday 11 March 2007 14:39, Andrew Morton wrote:
> > On Sun, 11 Mar 2007 14:59:28 +1100 Con Kolivas <[EMAIL PROTECTED]> wrote:
> > > Bottom line: we've had a _lot_ of problems with the new yield()
> > > semantics. We effectively broke back-compatibility by changing its
> > > behaviour a lot, and we can't really turn around and blame application
> > > developers for that.
> >
> > So... I would take it that's a yes for a recommendation with respect to
> > implementing a new yield() ? A new scheduler is as good a time as any to
> > do it.
>
> I guess so.  We'd, err, need to gather Ingo's input ;)

cc'ed. Don't you hate timezones?

> Perhaps a suitable way of doing this would be to characterise then emulate
> the 2.4 behaviour.  As long as it turns out to be vaguely sensible.

It's really very simple. We just go to the end of the current queued priority on 
the same array instead of swapping to the expired array; ie we do what 
realtime tasks currently do. It works fine here locally afaict.
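
As a rough illustration of that behaviour (a sketch with made-up names, not
the scheduler code): the yielding task is requeued at the tail of its own
priority queue and stays on the active array. A plain array stands in for the
kernel's per-priority list here, so the shift is O(n) purely for readability.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_QUEUE_MAX 8

/* One priority level of the active array; pid[0] is the running task. */
struct sketch_queue {
	int pid[SKETCH_QUEUE_MAX];
	size_t len;
};

/* Move the running task to the end of its own priority queue. */
static void sketch_yield(struct sketch_queue *q)
{
	int running;

	assert(q->len <= SKETCH_QUEUE_MAX);
	if (q->len < 2)
		return;         /* nobody else to yield to at this level */

	running = q->pid[0];
	for (size_t i = 1; i < q->len; i++)
		q->pid[i - 1] = q->pid[i];
	q->pid[q->len - 1] = running;
}
```

A queue of 1, 2, 3 becomes 2, 3, 1 after one yield, so the yielder runs again
as soon as its peers at the same priority have had a turn.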

-- 
-ck


Re: RSDL-mm 0.28

2007-03-10 Thread William Lee Irwin III
On Sun, 11 Mar 2007 13:28:22 +1100 "Con Kolivas" <[EMAIL PROTECTED]> wrote:
>> Well... are you advocating we change sched_yield semantics to a
>> gentler form?

On Sat, Mar 10, 2007 at 07:16:14PM -0800, Andrew Morton wrote:
> From a practical POV: our present yield() behaviour is so truly awful that
> it's basically always a bug to use it.  This probably isn't a good thing.
> So yes, I do think that we should have a rethink and try to come up with
> behaviour which is more in accord with what application developers expect
> yield() to do.

ISTR that apps varied wrt. their expectations for yield(). Some,
particularly those using it to implement multi-tiered userspace locks,
really did expect to go all the way to the back of the queue. (Rumor has
it that realtime apps break otherwise.) Others wanted a kinder, gentler,
"mistress, please hit me, but not too hard" opportunity to let someone
else have a little cpu time, particularly when userspace is spinning in
some sort of busywait. In both scenarios something very much against
the latest trends of Linux kernel politics is done by userspace.


On Sat, Mar 10, 2007 at 07:16:14PM -0800, Andrew Morton wrote:
> otoh,
> a) we should have done this five years ago.  Instead, we've spent that
>time training userspace programmers to not use yield(), so perhaps
>there's little to be gained in changing it now.
> b) if we _were_ to change yield(), people would use it more, and their
>applications would of course suck bigtime when run on earlier 2.6
>kernels.
> Bottom line: we've had a _lot_ of problems with the new yield() semantics. 
> We effectively broke back-compatibility by changing its behaviour a lot,
> and we can't really turn around and blame application developers for that.

My dumb idea would be to break out new syscall. One for the kinder,
gentler version, one for the serious version. Or otherwise pass an
argument indicating the expected behavior. Then a dumb app can be
LD_PRELOAD'd into calling whatever makes it run fastest.


-- wli


Re: [patch 6/9] signalfd/timerfd v1 - timerfd core ...

2007-03-10 Thread Davide Libenzi
On Sat, 10 Mar 2007, Linus Torvalds wrote:

> (That said, using "struct itimerspec" might be a good idea. That would 
> also obviate the need for TFD_TIMER_SEQ, since an itimerspec automatically 
> has both "base" and "incremental" parts).

But TFD_TIMER_SEQ is a simple auto-rearm case of TFD_TIMER_REL. So the 
timespec is sufficient too (in all three cases we just need *one* time). 
Actually, the only place where I can find the itimerspec useful is 
indeed with TFD_TIMER_SEQ, in cases where you want your clock starting at a 
given time (it_value) *and* with the given frequency (it_interval).
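
To make the two parts concrete, here is a small sketch that fills in a POSIX
struct itimerspec with a first expiry and a repeat period. The helper name
and the whole-second granularity are assumptions for illustration; nothing
here is taken from the timerfd patch itself.

```c
#define _POSIX_C_SOURCE 200809L  /* for struct itimerspec in <time.h> */
#include <assert.h>
#include <time.h>

/* First expiry after first_sec seconds, then every period_sec seconds.
 * A period of 0 leaves it_interval zeroed, i.e. no repeat. */
static struct itimerspec sketch_make_timer(time_t first_sec, time_t period_sec)
{
	struct itimerspec its;

	assert(first_sec >= 0 && period_sec >= 0);
	its.it_value.tv_sec = first_sec;       /* the "base" part */
	its.it_value.tv_nsec = 0;
	its.it_interval.tv_sec = period_sec;   /* the "incremental" part */
	its.it_interval.tv_nsec = 0;
	return its;
}
```

Calling sketch_make_timer(5, 2) describes a timer that first fires after 5
seconds and then every 2 seconds, which is the start-time-plus-frequency case
described above.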



- Davide




Re: RSDL-mm 0.28

2007-03-10 Thread Andrew Morton
> On Sun, 11 Mar 2007 14:59:28 +1100 Con Kolivas <[EMAIL PROTECTED]> wrote:
> > Bottom line: we've had a _lot_ of problems with the new yield() semantics.
> > We effectively broke back-compatibility by changing its behaviour a lot,
> > and we can't really turn around and blame application developers for that.
> 
> So... I would take it that's a yes for a recommendation with respect to 
> implementing a new yield() ? A new scheduler is as good a time as any to do 
> it.

I guess so.  We'd, err, need to gather Ingo's input ;)

Perhaps a suitable way of doing this would be to characterise then emulate
the 2.4 behaviour.  As long as it turns out to be vaguely sensible.


ATA: abnormal status 0x7F on port 0xNNNN since 2.6.20

2007-03-10 Thread Gerardo Exequiel Pozzi

Hi,

Since Linux 2.6.20 the kernel log shows these errors at boot time. The
system is stable, but 2.6.19.N does not show them.
(Please CC my email; I am currently not subscribed to lkml.)

Thanks,


Linux version 2.6.20.2 ([EMAIL PROTECTED]) (gcc version 3.4.6) #1 PREEMPT Fri
Mar 9 21:43:32 ART 2007
BIOS-provided physical RAM map:
sanitize start
sanitize end
copy_e820_map() start:  size: 0009fc00 end:
0009fc00 type: 1
copy_e820_map() type is E820_RAM
copy_e820_map() start: 0009fc00 size: 0400 end:
000a type: 2
copy_e820_map() start: 000f size: 0001 end:
0010 type: 2
copy_e820_map() start: 0010 size: 1fef end:
1fff type: 1
copy_e820_map() type is E820_RAM
copy_e820_map() start: 1fff size: 8000 end:
1fff8000 type: 3
copy_e820_map() start: 1fff8000 size: 8000 end:
2000 type: 4
copy_e820_map() start: fec0 size: 1000 end:
fec01000 type: 2
copy_e820_map() start: fee0 size: 1000 end:
fee01000 type: 2
copy_e820_map() start: fff8 size: 0008 end:
0001 type: 2
BIOS-e820:  - 0009fc00 (usable)
BIOS-e820: 0009fc00 - 000a (reserved)
BIOS-e820: 000f - 0010 (reserved)
BIOS-e820: 0010 - 1fff (usable)
BIOS-e820: 1fff - 1fff8000 (ACPI data)
BIOS-e820: 1fff8000 - 2000 (ACPI NVS)
BIOS-e820: fec0 - fec01000 (reserved)
BIOS-e820: fee0 - fee01000 (reserved)
BIOS-e820: fff8 - 0001 (reserved)
511MB LOWMEM available.
Entering add_active_range(0, 0, 131056) 0 entries of 256 used
Zone PFN ranges:
DMA 0 -> 4096
Normal   4096 ->   131056
early_node_map[1] active PFN ranges
  0:0 ->   131056
On node 0 totalpages: 131056
DMA zone: 32 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 4064 pages, LIFO batch:0
Normal zone: 991 pages used for memmap
Normal zone: 125969 pages, LIFO batch:31
DMI 2.3 present.
ACPI: RSDP (v000 AMI   ) @ 0x000fa8c0
ACPI: RSDT (v001 AMIINT VIA_K7   0x0010 MSFT 0x0097) @ 0x1fff
ACPI: FADT (v001 AMIINT VIA_K7   0x0011 MSFT 0x0097) @ 0x1fff0030
ACPI: MADT (v001 AMIINT VIA_K7   0x0009 MSFT 0x0097) @ 0x1fff00c0
ACPI: DSDT (v001VIAK7VT4 0x1000 INTL 0x02002024) @ 0x
ACPI: PM-Timer IO Port: 0x808
Allocating PCI resources starting at 3000 (gap: 2000:dec0)
Detected 1666.250 MHz processor.
Built 1 zonelists.  Total pages: 130033
Kernel command line: BOOT_IMAGE=2.6.20.2 ro root=305
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
PID hash table entries: 2048 (order: 11, 8192 bytes)
Console: colour VGA+ 80x25
Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
Memory: 516100k/524224k available (2321k kernel code, 7572k reserved,
462k data, 144k init, 0k highmem)
virtual kernel memory layout:
  fixmap  : 0x8000 - 0xf000   (  28 kB)
  vmalloc : 0xe080 - 0x6000   ( 503 MB)
  lowmem  : 0xc000 - 0xdfff   ( 511 MB)
.init : 0xc03bc000 - 0xc03e   ( 144 kB)
.data : 0xc0344554 - 0xc03b806c   ( 462 kB)
.text : 0xc010 - 0xc0344554   (2321 kB)
Checking if this processor honours the WP bit even in supervisor mode...
Ok.
Calibrating delay using timer specific routine.. 3334.64 BogoMIPS
(lpj=1667322)
Mount-cache hash table entries: 512
CPU: After generic identify, caps: 0383fbff c1cbfbff  
  
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 256K (64 bytes/line)
CPU: After all inits, caps: 0383fbff c1cbfbff  0420 
 
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU: AMD Sempron(tm) 2400+ stepping 01
Checking 'hlt' instruction... OK.
ACPI: Core revision 20060707
ACPI: setting ELCR to 0200 (from 0c88)
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: PCI BIOS revision 2.10 entry at 0xfda71, last bus=1
PCI: Using configuration type 1
Setting up standard PCI resources
ACPI: Interpreter enabled
ACPI: Using PIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (:00)
PCI: Probing PCI hardware (bus 00)
Boot video device is :01:00.0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: Power Resource [URP1] (off)
ACPI: Power Resource [URP2] (off)
ACPI: Power Resource [FDDP] (off)
ACPI: Power Resource [LPTP] (off)
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 7 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 7 *10 11 12 14 15)
ACPI: PCI Interrupt 

Re: RSDL-mm 0.28

2007-03-10 Thread Con Kolivas
On Sunday 11 March 2007 14:16, Andrew Morton wrote:
> > On Sun, 11 Mar 2007 13:28:22 +1100 "Con Kolivas" <[EMAIL PROTECTED]> wrote:
> > Well... are you advocating we change sched_yield semantics to a
> > gentler form?
>
> From a practical POV: our present yield() behaviour is so truly awful that
> it's basically always a bug to use it.  This probably isn't a good thing.
>
> So yes, I do think that we should have a rethink and try to come up with
> behaviour which is more in accord with what application developers expect
> yield() to do.
>
> otoh,
>
> a) we should have done this five years ago.  Instead, we've spent that
>time training userspace programmers to not use yield(), so perhaps
>there's little to be gained in changing it now.
>
> b) if we _were_ to change yield(), people would use it more, and their
>applications would of course suck bigtime when run on earlier 2.6
>kernels.
>
>
> Bottom line: we've had a _lot_ of problems with the new yield() semantics.
> We effectively broke back-compatibility by changing its behaviour a lot,
> and we can't really turn around and blame application developers for that.

So... I would take it that's a yes for a recommendation with respect to 
implementing a new yield() ? A new scheduler is as good a time as any to do 
it.

-- 
-ck


Re: [QUICKLIST 2/6] i386: quicklist support

2007-03-10 Thread William Lee Irwin III
On Sat, Mar 10, 2007 at 06:09:34PM -0800, Christoph Lameter wrote:
> i386: Convert to quicklists
> Implement the i386 management of pgd and pmds using quicklists.

I approve, though it would be nice if ptes had an interface operating
on struct page * to use.


On Sat, Mar 10, 2007 at 06:09:34PM -0800, Christoph Lameter wrote:
> The i386 management of page table pages currently uses page sized slabs.
> The page state is therefore mainly determined by the slab code. However,
> i386 also uses its own fields in the page struct to mark special pages
> and to build a list of pgds using the ->private and ->index field (yuck!).
> This has been finely tuned to work right with SLAB but SLUB needs more
> control over the page struct. Currently the only way for SLUB to support
> these slabs is through special casing PAGE_SIZE slabs.
> If we use quicklists instead then we can avoid the mess, and also the
> overhead of manipulating page sized objects through slab.

Hey! I did quite well given the constraints under which I was operating.


-- wli


Re: [PATCH] Use more gcc extensions in the Linux headers

2007-03-10 Thread Rusty Russell
On Sun, 2007-03-11 at 03:58 +0100, Jan Engelhardt wrote:
> >-#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
> >+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + 
> >__must_be_array(arr))
> >+
> 80 cols *cough* :)

I think your cough added a column?

Rusty.
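
For context, one way such a check can be built from gcc extensions is
sketched below. The definition of __must_be_array is not shown in this
message, so the SKETCH_* macros are an assumed reconstruction rather than the
patch's code: the idea is that an array and a pointer to its first element
have different types, while a plain pointer and &p[0] do not, so passing a
pointer makes the build fail.

```c
#include <assert.h>
#include <stddef.h>

/* Evaluates to 0 when cond is false; a true cond gives a negative array
 * size and therefore a compile error. */
#define SKETCH_BUILD_BUG_ON_ZERO(cond) (sizeof(char[1 - 2 * !!(cond)]) - 1)

/* gcc extension: compile-time type comparison of a and &a[0]. */
#define SKETCH_MUST_BE_ARRAY(a) \
	SKETCH_BUILD_BUG_ON_ZERO( \
		__builtin_types_compatible_p(__typeof__(a), __typeof__(&(a)[0])))

#define SKETCH_ARRAY_SIZE(arr) \
	(sizeof(arr) / sizeof((arr)[0]) + SKETCH_MUST_BE_ARRAY(arr))

/* Element count of a real array; handing SKETCH_ARRAY_SIZE an int *
 * instead would not compile. */
static size_t sketch_count(void)
{
	int values[5] = {0};

	return SKETCH_ARRAY_SIZE(values);
}
```

The extra term always adds 0 at run time; its only job is to turn the classic
sizeof(ptr) / sizeof(ptr[0]) mistake into a build failure.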





Re: RSDL-mm 0.28

2007-03-10 Thread Andrew Morton
> On Sun, 11 Mar 2007 13:28:22 +1100 "Con Kolivas" <[EMAIL PROTECTED]> wrote:
> Well... are you advocating we change sched_yield semantics to a
> gentler form?

From a practical POV: our present yield() behaviour is so truly awful that
it's basically always a bug to use it.  This probably isn't a good thing.

So yes, I do think that we should have a rethink and try to come up with
behaviour which is more in accord with what application developers expect
yield() to do.

otoh,

a) we should have done this five years ago.  Instead, we've spent that
   time training userspace programmers to not use yield(), so perhaps
   there's little to be gained in changing it now.

b) if we _were_ to change yield(), people would use it more, and their
   applications would of course suck bigtime when run on earlier 2.6
   kernels.


Bottom line: we've had a _lot_ of problems with the new yield() semantics. 
We effectively broke back-compatibility by changing its behaviour a lot,
and we can't really turn around and blame application developers for that.



Re: libata-acpi: allow _GTF on SATA, but disable on PATA for now

2007-03-10 Thread Len Brown
On Saturday 10 March 2007 06:30, Jeff Garzik wrote:
> Linux Kernel Mailing List wrote:
> > Gitweb: 
> > http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=df33c77e3981e71afc8727ee5c432ba1a1bba68c
> > Commit: df33c77e3981e71afc8727ee5c432ba1a1bba68c
> > Parent: 908e0a8a265fe8057604a9a30aec3f0be7bb5ebb
> > Author: Kristen Accardi <[EMAIL PROTECTED]>
> > AuthorDate: Fri Mar 9 18:15:33 2007 -0500
> > Committer:  Len Brown <[EMAIL PROTECTED]>
> > CommitDate: Fri Mar 9 18:15:33 2007 -0500
> > 
> > libata-acpi: allow _GTF on SATA, but disable on PATA for now
> > 
> > The ACPI specification states, and BIOS implementations depend on,
> > _STM being called before _GTF.
> > 
> > SATA does this, but PATA does not.  So for now, simply
> > prevent execution of _GTF on PATA devices.  Longer term we
> > should implement ACPI support for PATA devices in libata.
> > 
> > Signed-off-by: Kristen Accardi <[EMAIL PROTECTED]>
> > Signed-off-by: Len Brown <[EMAIL PROTECTED]>
> > ---
> >  drivers/ata/libata-acpi.c |7 +++
> >  1 files changed, 7 insertions(+), 0 deletions(-)
> > 
> > diff --git a/drivers/ata/libata-acpi.c b/drivers/ata/libata-acpi.c
> > index d14a48e..89aaf74 100644
> > --- a/drivers/ata/libata-acpi.c
> > +++ b/drivers/ata/libata-acpi.c
> > @@ -561,6 +561,13 @@ int ata_acpi_exec_tfs(struct ata_port *ap)
> >  
> > if (noacpi)
> > return 0;
> > +   /*
> > +* TBD - implement PATA support.  For now,
> > +* we should not run GTF on PATA devices since some
> > +* PATA require execution of GTM/STM before GTF.
> > +*/
> > +   if (!(ap->cbl == ATA_CBL_SATA))
> > +   return 0;
> >  
> > for (ix = 0; ix < ATA_MAX_DEVICES; ix++) {
> > if (!ata_dev_enabled(&ap->device[ix]))
> 
> Grumble!
> 
> This /really/ should have gone through me and linux-ide first.

Back at you Jeff,
This feature /really/ should have never gone upstream in the first place,
as this failure was reported and isolated to git-libata-all.patch
back in 2.6.20-rc6-mm3:

http://bugzilla.kernel.org/show_bug.cgi?id=7907

It then went on to become the most widely reported "ACPI related"
regression in the 2.6.21-rc series -- for which ACPI gets smeared.
Thank you ATA...

> Alan has been actively working on PATA ACPI, and we have been debugging 
> ACPI issues as well.  PLEASE coordinate with the maintainer, when 
> touching code outside of drivers/acpi!

And PLEASE coordinate with the maintainer when invoking methods
that provoke errors in other sub-systems!

Re: "debugging ACPI issues as well"

What issues?
Why haven't I seen any mention of them on linux-acpi?
Coordination and communication is a two-way street, Jeff.

> AFAICS this patch went in with zero appearance on LKML or another 
> related list, until submission.  This is /not/ how we do Linux development.

I proudly take credit+blame for shipping Kristen's patch with no delay.
It did appear on linux-acpi, as do all the patches I ship -- though
I admit it was the same day it went upstream.
I'm sorry I didn't CC linux-ide -- I'll get that part right next time.

However, I believe that late -rc3 is _well_ past the time to be developing
new code real-time in the upstream tree; and is instead time to
shut the damn thing off and set sights on the next release.

If you disagree with me, I'm not going to object
when you send a better fix to Linus for 2.6.21-rc4.

However, I do request that you first either start responding
to bugzilla traffic, or delete your account from bugzilla
so that people don't get the false impression that you're
paying attention.

thanks,
-Len


Re: [PATCH] Use more gcc extensions in the Linux headers

2007-03-10 Thread Jan Engelhardt

On Mar 11 2007 13:50, Rusty Russell wrote:
>On Sat, 2007-03-10 at 02:04 +0100, Jan Engelhardt wrote:
>> Getting back at the macro, how would you like to have it merged?
>
>Well, this is what I sent to Linus and Andrew (many thanks to those who
>made appropriately whimsical *or* useful comments):
>
>diff -r 1ccdf46b0f41 include/linux/compiler-gcc.h
>--- a/include/linux/compiler-gcc.h Sat Mar 10 09:55:29 2007 +1100
>+++ b/include/linux/compiler-gcc.h Sat Mar 10 11:06:35 2007 +1100
>@@ -22,6 +22,9 @@
> __asm__ ("" : "=r"(__ptr) : "0"(ptr));\
> (typeof(ptr)) (__ptr + (off)); })
> 
>+/* &a[0] degrades to a pointer: a different type from an array */
>+#define __must_be_array(a) \
>+  BUILD_BUG_ON_ZERO(__builtin_types_compatible_p(typeof(a), typeof(&a[0])))

This looks _much_ nicer! (And BUILD_BUG_ON is also appropriately
commented.)

> 
> #define inline inline  __attribute__((always_inline))
> #define __inline__ __inline__  __attribute__((always_inline))
>diff -r 1ccdf46b0f41 include/linux/compiler-intel.h
>--- a/include/linux/compiler-intel.h   Sat Mar 10 09:55:29 2007 +1100
>+++ b/include/linux/compiler-intel.h   Sat Mar 10 11:06:25 2007 +1100
>@@ -21,4 +21,7 @@
>  __ptr = (unsigned long) (ptr);   \
> (typeof(ptr)) (__ptr + (off)); })
> 
>+/* Intel ECC compiler doesn't support __builtin_types_compatible_p() */
>+#define __must_be_array(a) 0
>+
> #endif
>diff -r 1ccdf46b0f41 include/linux/kernel.h
>--- a/include/linux/kernel.h   Sat Mar 10 09:55:29 2007 +1100
>+++ b/include/linux/kernel.h   Sat Mar 10 11:06:16 2007 +1100
>@@ -35,7 +35,8 @@ extern const char linux_proc_banner[];
> #define ALIGN(x,a)__ALIGN_MASK(x,(typeof(x))(a)-1)
> #define __ALIGN_MASK(x,mask)  (((x)+(mask))&~(mask))
> 
>-#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
>+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + 
>__must_be_array(arr))
>+
> #define FIELD_SIZEOF(t, f) (sizeof(((t*)0)->f))
> #define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
> #define roundup(x, y) ((((x) + ((y) - 1)) / (y)) * (y))

80 cols *cough* :)


Jan
-- 


Re: [PATCH] Use more gcc extensions in the Linux headers

2007-03-10 Thread Rusty Russell
On Sat, 2007-03-10 at 02:04 +0100, Jan Engelhardt wrote:
> Getting back at the macro, how would you like to have it merged?

Well, this is what I sent to Linus and Andrew (many thanks to those who
made appropriately whimsical *or* useful comments):

diff -r 1ccdf46b0f41 include/linux/compiler-gcc.h
--- a/include/linux/compiler-gcc.h  Sat Mar 10 09:55:29 2007 +1100
+++ b/include/linux/compiler-gcc.h  Sat Mar 10 11:06:35 2007 +1100
@@ -22,6 +22,9 @@
 __asm__ ("" : "=r"(__ptr) : "0"(ptr)); \
 (typeof(ptr)) (__ptr + (off)); })
 
+/* &a[0] degrades to a pointer: a different type from an array */
+#define __must_be_array(a) \
+  BUILD_BUG_ON_ZERO(__builtin_types_compatible_p(typeof(a), typeof(&a[0])))
 
 #define inline inline  __attribute__((always_inline))
 #define __inline__ __inline__  __attribute__((always_inline))
diff -r 1ccdf46b0f41 include/linux/compiler-intel.h
--- a/include/linux/compiler-intel.h  Sat Mar 10 09:55:29 2007 +1100
+++ b/include/linux/compiler-intel.h  Sat Mar 10 11:06:25 2007 +1100
@@ -21,4 +21,7 @@
  __ptr = (unsigned long) (ptr);\
 (typeof(ptr)) (__ptr + (off)); })
 
+/* Intel ECC compiler doesn't support __builtin_types_compatible_p() */
+#define __must_be_array(a) 0
+
 #endif
diff -r 1ccdf46b0f41 include/linux/kernel.h
--- a/include/linux/kernel.h  Sat Mar 10 09:55:29 2007 +1100
+++ b/include/linux/kernel.h  Sat Mar 10 11:06:16 2007 +1100
@@ -35,7 +35,8 @@ extern const char linux_proc_banner[];
 #define ALIGN(x,a) __ALIGN_MASK(x,(typeof(x))(a)-1)
 #define __ALIGN_MASK(x,mask)   (((x)+(mask))&~(mask))
 
-#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + __must_be_array(arr))
+
 #define FIELD_SIZEOF(t, f) (sizeof(((t*)0)->f))
 #define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
 #define roundup(x, y) ((((x) + ((y) - 1)) / (y)) * (y))
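
For anyone wanting to see the array check in isolation, here is a minimal userspace sketch of the same trick. It assumes gcc or clang (for __builtin_types_compatible_p and __typeof__), and it assumes a BUILD_BUG_ON_ZERO built from a negative array size, which may differ from the definition in the tree this patch applies to. The names table, table_len() and check_table_len() are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed definition: evaluates to 0 when e is false and fails to
 * compile (negative array size) when e is true. */
#define BUILD_BUG_ON_ZERO(e) (sizeof(char[1 - 2 * !!(e)]) - 1)

/* &a[0] degrades to a pointer: a different type from an array.
 * If "a" is already a pointer the two types match and the build breaks. */
#define __must_be_array(a) \
	BUILD_BUG_ON_ZERO(__builtin_types_compatible_p(__typeof__(a), \
						       __typeof__(&(a)[0])))

#define ARRAY_SIZE(arr) \
	(sizeof(arr) / sizeof((arr)[0]) + __must_be_array(arr))

static int table[7];

/* Real array: __must_be_array() contributes 0, so this is plain division. */
size_t table_len(void)
{
	return ARRAY_SIZE(table);
}

/* Small self-check a caller could run. */
void check_table_len(void)
{
	assert(table_len() == 7);
}

/*
 * Passing a pointer instead is rejected at compile time, e.g.:
 *
 *	int *p = table;
 *	size_t n = ARRAY_SIZE(p);	// error: size of array is negative
 */
```

The point of adding the check to the quotient rather than wrapping it is that ARRAY_SIZE() stays a constant expression usable in initializers.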




Re: RSDL-mm 0.28

2007-03-10 Thread Con Kolivas

On 11/03/07, Matt Mackall <[EMAIL PROTECTED]> wrote:

I've tested -mm2 against -mm2+noyield and -mm2+rsdl+noyield. The
noyield patch simply makes the sched_yield syscall return immediately.
Xorg and all tests are run at nice 0.

Loads:
 memload: constant memcpy of 16MB buffer
 execload: constant re-exec of a trivial shell script
 forkload: constant fork and exit of a trivial shell script
 make -j 5: hot-cache kernel build without ccache
 make -j 5 ccache: hot-cache kernel build with ccache

Tests:
 beryl - 3D window manager, wiggle windows, spin desktop, etc.
 galeon - web browser, rapidly scrolling long web pages by grabbing
the scroll bar
 mp3 - XMMS on a FUSE sshfs over wireless (during all tests)
 terminal - responsiveness of ssh and local terminal sessions
 mouse - responsiveness of mouse pointer

Results:
 great = completely smooth
 good = fully responsive
 ok = visible latency
 bad = becomes difficult to use (or mp3 skips)
 awful = make it stop, please

                  -mm2        -mm2+noyield   rsdl+noyield
no load
 beryl            great       great          great
 galeon           good        good           good
 mp3              good        good           good
 terminal         good        good           good
 mouse            good        good           good
memload x10
 beryl            awful/bad   great          good
 galeon           good        good           ok/good
 mp3              good        good           good
 terminal         good        good           good
 mouse            good        good           good
execload x10
 beryl            awful/bad   bad/good       good
 galeon           good        bad/good       ok/good
 mp3              good        bad            good
 terminal         good        bad/good       good
 mouse            good        bad/good       good
forkload x10
 beryl            good        good           great
 galeon           good        good           ok/good
 mp3              good        good           good
 terminal         good        good           ok/good
 mouse            good        good           good
make -j 5
 beryl            ok          good           good/great
 galeon           good        good           ok/good
 mp3              good        good           good
 terminal         good        good           good
 mouse            good        good           good
make -j 5 ccache
 beryl            ok          good           awful
 galeon           good        good           bad
 mp3              good        good           bad
 terminal         good        good           bad/ok
 mouse            good        good           bad/ok

make -j 5
 real             8m1.857s    8m50.659s      8m9.282s
 user             7m19.127s   8m3.494s       7m30.740s
 sys              0m30.910s   0m33.722s      0m29.542s

make -j 5 ccache
 real             2m6.182s    2m19.032s      2m1.832s
 user             1m39.466s   1m48.787s      1m37.250s
 sys              0m19.741s   0m22.993s      0m20.109s


Thanks very much for that comprehensive summary and testing!


There's a substantial performance hit for not yield, so we probably
want to investigate alternate semantics for it. It seems reasonable
for apps to say "let me not hog the CPU" without completely expiring
them. Imagine you're in the front of the line (aka queue) and you
spend a moment fumbling for your wallet. The polite thing to do is to
let the next guy in front. But with the current sched_yield, you go
all the way to the back of the line.


Well... are you advocating we change sched_yield semantics to a
gentler form? This is a cinch to implement but I know how Ingo feels
about this. It will only encourage more lax coding using sched_yield
instead of proper blocking (see huge arguments with the ldap people on
this one who insist it's impossible not to use yield).


RSDL makes most of the noyield hit back in normal make and then some
with ccache. Impressive. But ccache is still destroying interactivity
somehow. The ccache effect is fairly visible even with non-parallel
'make'.


Ok I don't think there's any actual accounting problem here per se
(although I did just recently post a bugfix for rsdl however I think
that's unrelated). What I think is going on in the ccache testcase is
that all the work is being offloaded to kernel threads reading/writing
to/from the filesystem and the make is not getting any actual cpu
time. This is "worked around" in mainline thanks to the testing for
sleeping on uninterruptible sleep in the interactivity estimator. What
I suspect is happening is kernel threads that are running nice -5 are
doing all the work on make's behalf in the setting of ccache since it
is mostly i/o bound. The reason for -nice values on kernel threads is
questionable anyway. Can you try renicing your kernel threads all to
nice 0 and see what effect that has? Obviously this doesn't need a
recompile, but is simple enough to implement in kthread code as a new
default.


Also note I could occasionally trigger nasty multi-second pauses with
-mm2+noyield under exectest that didn't show up elsewhere. That's
probably a bug in the mainline scheduler.


Ew. It's probably not a bug but a good example of some of the
starvation scenarios we're hitting on mainline (hence the need for a
rewrite ;))


Re: [RFC PATCH 1/3] Add ability to keep track of callers of symbol_(get|put)

2007-03-10 Thread Rusty Russell
On Sat, 2007-03-10 at 02:31 -0200, Mauro Carvalho Chehab wrote:
> From: Trent Piepho <[EMAIL PROTECTED]>
> 
> When a module uses symbol_get() to increase the ref count of another
> module, there is no record what module called symbol_get().  A module
> can
> show up as having other users, but there is no way to tell who those
> users are.

Hi Mauro,

Interesting: in general you cannot tell who is using a module, but for
this case it makes sense.  Your patch was linewrapped here, but it all
looks fine.

Thanks,
Rusty.



[patch 7/9] signalfd/timerfd - timerfd wire up i386 arch ...

2007-03-10 Thread Davide Libenzi
This patch wires the timerfd system call to the i386 architecture.



Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.20.ep2/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.20.ep2.orig/arch/i386/kernel/syscall_table.S  2007-03-10 
15:57:58.0 -0800
+++ linux-2.6.20.ep2/arch/i386/kernel/syscall_table.S   2007-03-10 
15:58:08.0 -0800
@@ -320,3 +320,4 @@
.long sys_getcpu
.long sys_epoll_pwait
.long sys_signalfd  /* 320 */
+   .long sys_timerfd
Index: linux-2.6.20.ep2/include/asm-i386/unistd.h
===
--- linux-2.6.20.ep2.orig/include/asm-i386/unistd.h 2007-03-10 
15:57:58.0 -0800
+++ linux-2.6.20.ep2/include/asm-i386/unistd.h  2007-03-10 15:58:08.0 
-0800
@@ -326,10 +326,11 @@
 #define __NR_getcpu318
 #define __NR_epoll_pwait   319
 #define __NR_signalfd  320
+#define __NR_timerfd   321
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 321
+#define NR_syscalls 322
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
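
Until libc grows a wrapper, a freshly wired syscall is normally reached from userspace through syscall(2) with the number from unistd.h. A minimal sketch of that calling pattern follows. It uses SYS_getpid as a stand-in, because the timerfd number added above only exists in a kernel carrying this series; raw_getpid() and check_raw_getpid() are hypothetical helpers, not part of the patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Invoke a system call by number, bypassing the libc wrapper.
 * SYS_getpid is only a stand-in for a newly assigned number. */
long raw_getpid(void)
{
	return syscall(SYS_getpid);
}

/* The raw call and the libc wrapper should agree. */
void check_raw_getpid(void)
{
	assert(raw_getpid() == (long)getpid());
}
```

A test program for a new syscall would pass its own number and arguments to syscall() in the same way.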


[patch 8/9] signalfd/timerfd - timerfd wire up x86_64 arch ...

2007-03-10 Thread Davide Libenzi
This patch wires the timerfd system call to the x86_64 architecture.



Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.20.ep2/arch/x86_64/ia32/ia32entry.S
===
--- linux-2.6.20.ep2.orig/arch/x86_64/ia32/ia32entry.S  2007-03-10 
15:58:00.0 -0800
+++ linux-2.6.20.ep2/arch/x86_64/ia32/ia32entry.S   2007-03-10 
15:58:10.0 -0800
@@ -720,4 +720,5 @@
.quad sys_getcpu
.quad sys_epoll_pwait
.quad sys_signalfd  /* 320 */
+   .quad sys_timerfd
 ia32_syscall_end:
Index: linux-2.6.20.ep2/include/asm-x86_64/unistd.h
===
--- linux-2.6.20.ep2.orig/include/asm-x86_64/unistd.h   2007-03-10 
15:58:00.0 -0800
+++ linux-2.6.20.ep2/include/asm-x86_64/unistd.h2007-03-10 
15:58:10.0 -0800
@@ -621,8 +621,10 @@
 __SYSCALL(__NR_move_pages, sys_move_pages)
 #define __NR_signalfd  280
 __SYSCALL(__NR_signalfd, sys_signalfd)
+#define __NR_timerfd   281
+__SYSCALL(__NR_timerfd, sys_timerfd)
 
-#define __NR_syscall_max __NR_signalfd
+#define __NR_syscall_max __NR_timerfd
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR


[patch 9/9] signalfd/timerfd - timerfd compat code ...

2007-03-10 Thread Davide Libenzi
This patch implements the necessary compat code for the timerfd system call.


Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.20.ep2/fs/compat.c
===
--- linux-2.6.20.ep2.orig/fs/compat.c   2007-03-10 15:58:03.0 -0800
+++ linux-2.6.20.ep2/fs/compat.c2007-03-10 15:58:12.0 -0800
@@ -2257,3 +2257,23 @@
return sys_signalfd(ufd, ksigmask, sizeof(sigset_t));
 }
 
+
+asmlinkage long compat_sys_timerfd(int ufd, int clockid, int tmrtype,
+  const struct compat_timespec __user *utmr)
+{
+   long res;
+   struct timespec t;
+   struct timespec __user *ut;
+
+   res = -EFAULT;
+   if (get_compat_timespec(&t, utmr))
+   goto err_exit;
+   ut = compat_alloc_user_space(sizeof(*ut));
+   if (copy_to_user(ut, &t, sizeof(t)))
+   goto err_exit;
+
+   res = sys_timerfd(ufd, clockid, tmrtype, ut);
+err_exit:
+   return res;
+}
+


[patch 4/9] signalfd/timerfd - signalfd wire up x86_64 arch ...

2007-03-10 Thread Davide Libenzi
This patch wires the signalfd system call to the x86_64 architecture.



Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.20.ep2/include/asm-x86_64/unistd.h
===
--- linux-2.6.20.ep2.orig/include/asm-x86_64/unistd.h   2007-03-10 
15:57:00.0 -0800
+++ linux-2.6.20.ep2/include/asm-x86_64/unistd.h2007-03-10 
15:58:00.0 -0800
@@ -619,8 +619,10 @@
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_signalfd  280
+__SYSCALL(__NR_signalfd, sys_signalfd)
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_signalfd
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.20.ep2/arch/x86_64/ia32/ia32entry.S
===
--- linux-2.6.20.ep2.orig/arch/x86_64/ia32/ia32entry.S  2007-03-10 
15:57:00.0 -0800
+++ linux-2.6.20.ep2/arch/x86_64/ia32/ia32entry.S   2007-03-10 
15:58:00.0 -0800
@@ -714,8 +714,10 @@
.quad compat_sys_get_robust_list
.quad sys_splice
.quad sys_sync_file_range
-   .quad sys_tee
+   .quad sys_tee   /* 315 */
.quad compat_sys_vmsplice
.quad compat_sys_move_pages
.quad sys_getcpu
-ia32_syscall_end:  
+   .quad sys_epoll_pwait
+   .quad sys_signalfd  /* 320 */
+ia32_syscall_end:


[patch 6/9] signalfd/timerfd - timerfd core ...

2007-03-10 Thread Davide Libenzi
This patch introduces a new system call for timer events delivered
through file descriptors. This allows timer events to be used with
standard POSIX poll(2), select(2) and read(2). As a consequence of
supporting the Linux f_op->poll subsystem, they can be used with
epoll(2) too.
The system call is defined as:

int timerfd(int ufd, int clockid, int tmrtype, const struct timespec *utmr);

The "ufd" parameter allows for re-use (re-programming) of an existing
timerfd without going through the close/open cycle (same as signalfd).
If "ufd" is -1, a new file descriptor will be created, otherwise the
existing "ufd" will be re-programmed.
The "clockid" parameter is either CLOCK_MONOTONIC or CLOCK_REALTIME.
The "tmrtype" parameter specifies the timer type. The following
values are supported:

TFD_TIMER_REL
The time specified in the "utmr" parameter is a relative time
from NOW.

TFD_TIMER_ABS
The time specified in the "utmr" parameter is an absolute time.

TFD_TIMER_SEQ
The time specified in the "utmr" parameter is an interval at
which a continuous clock rate will be generated.

The function returns the new file descriptor (or the same one, if "ufd"
is a valid timerfd descriptor), or -1 in case of error.
As stated before, the timerfd file descriptor supports poll(2), select(2)
and epoll(2). When a timer event happens on the timerfd, a POLLIN mask
will be returned.
The read(2) call can be used, and it will return a u32 variable holding
the number of "ticks" that happened on the interface since the last call
to read(2). The read(2) call supports the O_NONBLOCK flag too, and EAGAIN
will be returned if no ticks happened.
A quick test program shows timerfd working correctly on my amd64 box:

http://www.xmailserver.org/timerfd-test.c
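
For readers who want to try the read(2)-on-a-timer idea without building this series: the sketch below does not use the timerfd() call proposed here. It swaps in the timerfd_create(2)/timerfd_settime(2) interface that current glibc exposes through <sys/timerfd.h>, where read(2) returns a 64-bit expiration count instead of the u32 described above. The function name wait_one_tick() is made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

/* Arm a one-shot relative timer on CLOCK_MONOTONIC and block in read(2)
 * until it fires. Returns the expiration count, or -1 on error. */
long wait_one_tick(long delay_ns)
{
	struct itimerspec its = {0};
	uint64_t ticks = 0;
	long result = -1;
	int fd;

	/* A zero it_value would disarm the timer and read() would block. */
	assert(delay_ns > 0);

	fd = timerfd_create(CLOCK_MONOTONIC, 0);
	if (fd < 0)
		return -1;

	its.it_value.tv_sec = delay_ns / 1000000000L;
	its.it_value.tv_nsec = delay_ns % 1000000000L;
	/* it_interval stays zero: the timer fires once, like a relative
	 * one-shot timer, so a successful read reports one expiration. */
	if (timerfd_settime(fd, 0, &its, NULL) == 0 &&
	    read(fd, &ticks, sizeof(ticks)) == (ssize_t)sizeof(ticks))
		result = (long)ticks;

	close(fd);
	return result;
}
```

Because the descriptor is an ordinary file, the blocking read() here could equally be replaced by adding fd to a poll(2) or epoll(2) set and reading once POLLIN is reported.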




Signed-off-by: Davide Libenzi 



- Davide



Index: linux-2.6.20.ep2/fs/timerfd.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.20.ep2/fs/timerfd.c   2007-03-10 15:58:05.0 -0800
@@ -0,0 +1,268 @@
+/*
+ *  fs/timerfd.c
+ *
+ *  Copyright (C) 2007  Davide Libenzi 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+
+
+struct timerfd_ctx {
+   struct hrtimer tmr;
+   int clockid;
+   ktime_t tval;
+   int tmrtype;
+   spinlock_t lock;
+   wait_queue_head_t wqh;
+   unsigned long ticks;
+};
+
+
+static int timerfd_tmrproc(struct hrtimer *htmr);
+static void timerfd_cleanup(struct timerfd_ctx *ctx);
+static int timerfd_close(struct inode *inode, struct file *file);
+static unsigned int timerfd_poll(struct file *file, poll_table *wait);
+static ssize_t timerfd_read(struct file *file, char __user *buf, size_t count,
+   loff_t *ppos);
+
+
+
+static const struct file_operations timerfd_fops = {
+   .release= timerfd_close,
+   .poll   = timerfd_poll,
+   .read   = timerfd_read,
+};
+static struct kmem_cache *timerfd_ctx_cachep;
+
+
+
+static int timerfd_tmrproc(struct hrtimer *htmr)
+{
+   struct timerfd_ctx *ctx = container_of(htmr, struct timerfd_ctx, tmr);
+   int rval = HRTIMER_NORESTART;
+   unsigned long flags;
+
+   spin_lock_irqsave(&ctx->lock, flags);
+   ctx->ticks++;
+   wake_up_locked(&ctx->wqh);
+   if (ctx->tmrtype == TFD_TIMER_SEQ) {
+   hrtimer_forward(htmr, htmr->base->softirq_time, ctx->tval);
+   rval = HRTIMER_RESTART;
+   }
+   spin_unlock_irqrestore(&ctx->lock, flags);
+
+   return rval;
+}
+
+
+asmlinkage long sys_timerfd(int ufd, int clockid, int tmrtype,
+   const struct timespec __user *utmr)
+{
+   int error;
+   struct timerfd_ctx *ctx;
+   struct file *file;
+   struct inode *inode;
+   ktime_t tval, tnow;
+   struct timespec ktmr, tmrnow;
+
+   error = -EFAULT;
+   if (copy_from_user(&ktmr, utmr, sizeof(ktmr)))
+   goto err_exit;
+
+   tval = timespec_to_ktime(ktmr);
+   error = -EINVAL;
+   if (clockid != CLOCK_MONOTONIC &&
+   clockid != CLOCK_REALTIME)
+   goto err_exit;
+   switch (tmrtype) {
+   case TFD_TIMER_REL:
+   case TFD_TIMER_SEQ:
+   break;
+   case TFD_TIMER_ABS:
+   getnstimeofday(&tmrnow);
+   tnow = timespec_to_ktime(tmrnow);
+   if (ktime_to_ns(tval) <= ktime_to_ns(tnow))
+   goto err_exit;
+   tval = ktime_sub(tval, tnow);
+   break;
+   default:
+   goto err_exit;
+   }
+
+   if (ufd == -1) {
+   error = -ENOMEM;
+   ctx = kmem_cache_alloc(timerfd_ctx_cachep, GFP_KERNEL);
+   if (!ctx)
+   goto err_exit;
+
+   

[patch 5/9] signalfd/timerfd - signalfd compat code ...

2007-03-10 Thread Davide Libenzi
This patch implements the necessary compat code for the signalfd system call.


Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.20.ep2/fs/compat.c
===
--- linux-2.6.20.ep2.orig/fs/compat.c   2007-03-10 15:57:00.0 -0800
+++ linux-2.6.20.ep2/fs/compat.c2007-03-10 15:58:03.0 -0800
@@ -46,6 +46,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -2235,3 +2236,24 @@
return sys_ni_syscall();
 }
 #endif
+
+asmlinkage long compat_sys_signalfd(int ufd,
+   const compat_sigset_t __user *sigmask,
+   compat_size_t sigsetsize)
+{
+   compat_sigset_t ss32;
+   sigset_t tmp;
+   sigset_t __user *ksigmask;
+
+   if (sigsetsize != sizeof(compat_sigset_t))
+   return -EINVAL;
+   if (copy_from_user(&ss32, sigmask, sizeof(ss32)))
+   return -EFAULT;
+   sigset_from_compat(&tmp, &ss32);
+   ksigmask = compat_alloc_user_space(sizeof(sigset_t));
+   if (copy_to_user(ksigmask, &tmp, sizeof(sigset_t)))
+   return -EFAULT;
+
+   return sys_signalfd(ufd, ksigmask, sizeof(sigset_t));
+}
+


[patch 1/9] signalfd/timerfd - anonymous inode source ...

2007-03-10 Thread Davide Libenzi
This patch adds an anonymous inode source, to be used for files that need
an inode only in order to create a file*. We do not care about having an
inode for each file, and we do not even care about having different names in
the associated dentries (dentry names will be the same for classes of file*).
This allows code reuse, and will be used by epoll, signalfd and timerfd
(and whatever else there'll be).



Signed-off-by: Davide Libenzi 



- Davide



Index: linux-2.6.20.ep2/fs/anon_inodes.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.20.ep2/fs/anon_inodes.c   2007-03-10 15:57:47.0 -0800
@@ -0,0 +1,203 @@
+/*
+ *  fs/anon_inodes.c
+ *
+ *  Copyright (C) 2007  Davide Libenzi 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+
+
+static int ainofs_delete_dentry(struct dentry *dentry);
+static struct inode *aino_getinode(void);
+static struct inode *aino_mkinode(void);
+static int ainofs_get_sb(struct file_system_type *fs_type, int flags,
+const char *dev_name, void *data, struct vfsmount *mnt);
+
+
+
+static struct vfsmount *aino_mnt __read_mostly;
+static struct inode *aino_inode;
+static struct file_operations aino_fops = { };
+static struct file_system_type aino_fs_type = {
+   .name   = "ainofs",
+   .get_sb = ainofs_get_sb,
+   .kill_sb= kill_anon_super,
+};
+static struct dentry_operations ainofs_dentry_operations = {
+   .d_delete   = ainofs_delete_dentry,
+};
+
+
+
+int aino_getfd(int *pfd, struct inode **pinode, struct file **pfile,
+  char const *name, const struct file_operations *fops, void *priv)
+{
+   struct qstr this;
+   struct dentry *dentry;
+   struct inode *inode;
+   struct file *file;
+   int error, fd;
+
+   error = -ENFILE;
+   file = get_empty_filp();
+   if (!file)
+   goto eexit_1;
+
+   inode = aino_getinode();
+   if (IS_ERR(inode)) {
+   error = PTR_ERR(inode);
+   goto eexit_2;
+   }
+
+   error = get_unused_fd();
+   if (error < 0)
+   goto eexit_3;
+   fd = error;
+
+   /*
+* Link the inode to a directory entry by creating a unique name
+* using the inode sequence number.
+*/
+   error = -ENOMEM;
+   this.name = name;
+   this.len = strlen(name);
+   this.hash = 0;
+   dentry = d_alloc(aino_mnt->mnt_sb->s_root, &this);
+   if (!dentry)
+   goto eexit_4;
+   dentry->d_op = &ainofs_dentry_operations;
+   /* Do not publish this dentry inside the global dentry hash table */
+   dentry->d_flags &= ~DCACHE_UNHASHED;
+   d_instantiate(dentry, inode);
+
+   file->f_path.mnt = mntget(aino_mnt);
+   file->f_path.dentry = dentry;
+   file->f_mapping = inode->i_mapping;
+
+   file->f_pos = 0;
+   file->f_flags = O_RDONLY;
+   file->f_op = fops;
+   file->f_mode = FMODE_READ;
+   file->f_version = 0;
+   file->private_data = priv;
+
+   fd_install(fd, file);
+
+   *pfd = fd;
+   *pinode = inode;
+   *pfile = file;
+   return 0;
+
+eexit_4:
+   put_unused_fd(fd);
+eexit_3:
+   iput(inode);
+eexit_2:
+   put_filp(file);
+eexit_1:
+   return error;
+}
+
+
+static int ainofs_delete_dentry(struct dentry *dentry)
+{
+   /*
+* We faked vfs to believe the dentry was hashed when we created it.
+* Now we restore the flag so that dput() will work correctly.
+*/
+   dentry->d_flags |= DCACHE_UNHASHED;
+   return 1;
+}
+
+
+static struct inode *aino_getinode(void)
+{
+   return igrab(aino_inode);
+}
+
+
+/*
+ * A single inode exists for all aino files. Unlike pipes,
+ * aino inodes have no per-instance data associated, so we can avoid
+ * allocating more than one of them.
+ */
+static struct inode *aino_mkinode(void)
+{
+   int error = -ENOMEM;
+   struct inode *inode = new_inode(aino_mnt->mnt_sb);
+
+   if (!inode)
+   goto eexit_1;
+
+   inode->i_fop = &aino_fops;
+
+   /*
+* Mark the inode dirty from the very beginning,
+* that way it will never be moved to the dirty
+* list because mark_inode_dirty() will think
+* that it already _is_ on the dirty list.
+*/
+   inode->i_state = I_DIRTY;
+   inode->i_mode = S_IRUSR | S_IWUSR;
+   inode->i_uid = current->fsuid;
+   inode->i_gid = current->fsgid;
+   inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+   return inode;
+
+eexit_1:
+   return ERR_PTR(error);
+}
+
+
+static int ainofs_get_sb(struct file_system_type *fs_type, int flags,
+const char *dev_name, void *data, struct vfsmount *mnt)
+{
+   return get_sb_pseudo(fs_type, "aino:", NULL, AINOFS_MAGIC, mnt);
+}
+
+

[patch 2/9] signalfd/timerfd - signalfd core ...

2007-03-10 Thread Davide Libenzi
This patch series implements the new signalfd() system call.
I took part of the original Linus code (and you know how
badly it can be broken :), and I added even more breakage ;)
Signals are fetched from the same signal queue used by the process,
so signalfd will compete with standard kernel delivery in dequeue_signal().
If you want to reliably fetch signals on the signalfd file, you need to
block them with sigprocmask(SIG_BLOCK).
This seems to be working fine on my Dual Opteron machine. I made a quick 
test program for it:

http://www.xmailserver.org/signafd-test.c

The signalfd() system call implements signal delivery into a file
descriptor receiver. The signalfd file descriptor is created with the
following API:

int signalfd(int ufd, const sigset_t *mask, size_t masksize);

The "ufd" parameter allows changing an existing signalfd sigmask without
going through a close/create cycle (Linus' idea). Use "ufd" == -1 if you
want a brand new signalfd file.
The "mask" parameter specifies the set of signals that we are
interested in. The "masksize" parameter is the size of "mask".
The signalfd fd supports the poll(2) and read(2) system calls. The poll(2)
call will return POLLIN when signals are available to be dequeued. As a
direct consequence of supporting the Linux poll subsystem, the signalfd fd
can be used together with epoll(2) too.
The read(2) system call will return a "struct signalfd_siginfo" structure
in the userspace supplied buffer. The return value is the number of bytes
copied into the supplied buffer, or -1 in case of error. The read(2) call
can also return 0, in case the sighand structure to which the signalfd
was attached has been orphaned. The O_NONBLOCK flag is also supported, and
read(2) will return -EAGAIN in case no signal is available.
The format of struct signalfd_siginfo is shown below; which fields are
valid depends on the (->code & __SI_MASK) value, in the same way as for a
struct siginfo:

struct signalfd_siginfo {
        __u32 signo;    /* si_signo */
        __s32 err;      /* si_errno */
        __s32 code;     /* si_code */
        __u32 pid;      /* si_pid */
        __u32 uid;      /* si_uid */
        __s32 fd;       /* si_fd */
        __u32 tid;      /* si_tid */
        __u32 band;     /* si_band */
        __u32 overrun;  /* si_overrun */
        __u32 trapno;   /* si_trapno */
        __s32 status;   /* si_status */
        __s32 svint;    /* si_int */
        __u64 svptr;    /* si_ptr */
        __u64 utime;    /* si_utime */
        __u64 stime;    /* si_stime */
        __u64 addr;     /* si_addr */
};
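
To make the block-then-read pattern concrete, here is a minimal sketch. It does not use the three-argument form proposed in this patch: it assumes the signalfd(2) wrapper shipped by current glibc in <sys/signalfd.h>, whose third parameter is a flags word rather than "masksize" and whose structure fields carry an ssi_ prefix (ssi_signo instead of signo). read_one_signal() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/signalfd.h>
#include <unistd.h>

/* Block signo, raise it, and fetch it through a signalfd instead of a
 * handler. Returns the signal number read, or -1 on error. */
int read_one_signal(int signo)
{
	sigset_t mask, old;
	struct signalfd_siginfo info;
	int fd, result = -1;

	assert(signo > 0);

	sigemptyset(&mask);
	sigaddset(&mask, signo);
	/* Without SIG_BLOCK the normal delivery path would win the race
	 * for the signal, as the description above points out. */
	if (sigprocmask(SIG_BLOCK, &mask, &old) < 0)
		return -1;

	fd = signalfd(-1, &mask, 0);	/* -1: create a brand new descriptor */
	if (fd >= 0) {
		if (raise(signo) == 0 &&
		    read(fd, &info, sizeof(info)) == (ssize_t)sizeof(info))
			result = (int)info.ssi_signo;
		close(fd);
	}

	/* The signal has been consumed by read(), so restoring the old
	 * mask does not deliver it a second time. */
	sigprocmask(SIG_SETMASK, &old, NULL);
	return result;
}
```

The same descriptor could be handed to poll(2) or epoll(2) alongside sockets, which is the main attraction over a signal handler.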



Signed-off-by: Davide Libenzi 



- Davide



Index: linux-2.6.20.ep2/fs/signalfd.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.20.ep2/fs/signalfd.c  2007-03-10 15:57:51.0 -0800
@@ -0,0 +1,381 @@
+/*
+ *  fs/signalfd.c
+ *
+ *  Copyright (C) 2003  Linus Torvalds
+ *
+ *  Mon Mar 5, 2007: Davide Libenzi 
+ *  Changed ->read() to return a siginfo structure instead of signal number.
+ *  Fixed locking in ->poll().
+ *  Added sighand-detach notification.
+ *  Added fd re-use in sys_signalfd() syscall.
+ *  Now using anonymous inode source.
+ *  Thanks to Oleg Nesterov for useful code review and suggestions.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+
+
+struct signalfd_ctx {
+   struct list_head lnk;
+   wait_queue_head_t wqh;
+   sigset_t sigmask;
+   struct task_struct *tsk;
+};
+
+
+
+static struct sighand_struct *signalfd_get_sighand(struct signalfd_ctx *ctx,
+  unsigned long *flags);
+static void signalfd_put_sighand(struct signalfd_ctx *ctx,
+struct sighand_struct *sighand,
+unsigned long *flags);
+static void signalfd_cleanup(struct signalfd_ctx *ctx);
+static int signalfd_close(struct inode *inode, struct file *file);
+static unsigned int signalfd_poll(struct file *file, poll_table *wait);
+static int signalfd_copyinfo(struct signalfd_siginfo __user *uinfo,
+siginfo_t const *kinfo);
+static ssize_t signalfd_read(struct file *file, char __user *buf, size_t count,
+loff_t *ppos);
+
+
+
+static const struct file_operations signalfd_fops = {
+   .release= signalfd_close,
+   .poll   = signalfd_poll,
+   .read   = signalfd_read,
+};
+static struct kmem_cache *signalfd_ctx_cachep;
+
+
+
+static struct sighand_struct *signalfd_get_sighand(struct signalfd_ctx *ctx,
+  unsigned long *flags)
+{
+   struct sighand_struct *sighand;
+
+   rcu_read_lock();
+   sighand = lock_task_sighand(ctx->tsk, flags);
+   rcu_read_unlock();
+
+   if (sighand && list_empty(&ctx->lnk)) {
+

[patch 3/9] signalfd/timerfd - signalfd wire up i386 arch ...

2007-03-10 Thread Davide Libenzi
This patch wires the signalfd system call to the i386 architecture.



Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.20.ep2/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.20.ep2.orig/arch/i386/kernel/syscall_table.S  2007-03-10 15:57:00.0 -0800
+++ linux-2.6.20.ep2/arch/i386/kernel/syscall_table.S   2007-03-10 15:57:58.0 -0800
@@ -319,3 +319,4 @@
.long sys_move_pages
.long sys_getcpu
.long sys_epoll_pwait
+   .long sys_signalfd  /* 320 */
Index: linux-2.6.20.ep2/include/asm-i386/unistd.h
===
--- linux-2.6.20.ep2.orig/include/asm-i386/unistd.h 2007-03-10 15:57:00.0 -0800
+++ linux-2.6.20.ep2/include/asm-i386/unistd.h  2007-03-10 15:57:58.0 -0800
@@ -325,10 +325,11 @@
 #define __NR_move_pages317
 #define __NR_getcpu318
 #define __NR_epoll_pwait   319
+#define __NR_signalfd  320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 320
+#define NR_syscalls 321
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR


[QUICKLIST 3/6] i386: Use standard list manipulators for pgd_list

2007-03-10 Thread Christoph Lameter
i386: Use standard list macros.

Get rid of generating a list via page->index and page->private. Use
page->lru instead.
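
To illustrate what the standard list macros buy us, here is a small
userspace sketch of an embedded doubly linked list. It is only a model:
struct fake_page and sum_ids() are invented for the example, and the
list helpers are simplified re-implementations rather than the kernel's
<linux/list.h>.

```c
#include <stddef.h>

/* Simplified model of the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }

/* Hypothetical stand-in for struct page: only the embedded link matters. */
struct fake_page {
	int id;
	struct list_head lru;
};

/* Insert `entry` right after `head`. */
static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink `entry`; the kernel poisons the pointers, this sketch clears them. */
static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = entry->prev = NULL;
}

/* Recover the containing structure from a pointer to its link member. */
#define list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Walk the list the way list_for_each_entry() would and sum the ids. */
static int sum_ids(struct list_head *head)
{
	struct list_head *pos;
	int sum = 0;

	for (pos = head->next; pos != head; pos = pos->next)
		sum += list_entry(pos, struct fake_page, lru)->id;
	return sum;
}
```

Because the link lives inside the object, removal needs no knowledge of
the list head, which is what lets the open-coded ->index / ->private
bookkeeping go away.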

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc3/arch/i386/mm/pgtable.c
===
--- linux-2.6.21-rc3.orig/arch/i386/mm/pgtable.c 2007-03-10 17:42:08.0 -0800
+++ linux-2.6.21-rc3/arch/i386/mm/pgtable.c 2007-03-10 17:44:23.0 -0800
@@ -213,31 +213,12 @@ struct page *pte_alloc_one(struct mm_str
  * -- wli
  */
 DEFINE_SPINLOCK(pgd_lock);
-struct page *pgd_list;
-
-static inline void pgd_list_add(pgd_t *pgd)
-{
-   struct page *page = virt_to_page(pgd);
-   page->index = (unsigned long)pgd_list;
-   if (pgd_list)
-   set_page_private(pgd_list, (unsigned long)&page->index);
-   pgd_list = page;
-   set_page_private(page, (unsigned long)&pgd_list);
-}
-
-static inline void pgd_list_del(pgd_t *pgd)
-{
-   struct page *next, **pprev, *page = virt_to_page(pgd);
-   next = (struct page *)page->index;
-   pprev = (struct page **)page_private(page);
-   *pprev = next;
-   if (next)
-   set_page_private(next, (unsigned long)pprev);
-}
+LIST_HEAD(pgd_list);
 
 void pgd_ctor(void *pgd)
 {
unsigned long flags;
+   struct page *page = virt_to_page(pgd);
 
if (PTRS_PER_PMD == 1) {
memset(pgd, 0, USER_PTRS_PER_PGD*sizeof(pgd_t));
@@ -256,7 +237,7 @@ void pgd_ctor(void *pgd)
__pa(swapper_pg_dir) >> PAGE_SHIFT,
USER_PTRS_PER_PGD, PTRS_PER_PGD - USER_PTRS_PER_PGD);
 
-   pgd_list_add(pgd);
+   list_add(&page->lru, &pgd_list);
spin_unlock_irqrestore(&pgd_lock, flags);
 }
 
@@ -264,10 +245,11 @@ void pgd_ctor(void *pgd)
 void pgd_dtor(void *pgd)
 {
unsigned long flags; /* can be called from interrupt context */
+   struct page *page = virt_to_page(pgd);
 
paravirt_release_pd(__pa(pgd) >> PAGE_SHIFT);
spin_lock_irqsave(&pgd_lock, flags);
-   pgd_list_del(pgd);
+   list_del(&page->lru);
spin_unlock_irqrestore(&pgd_lock, flags);
 }
 
Index: linux-2.6.21-rc3/include/asm-i386/pgtable.h
===
--- linux-2.6.21-rc3.orig/include/asm-i386/pgtable.h 2007-03-10 17:41:48.0 -0800
+++ linux-2.6.21-rc3/include/asm-i386/pgtable.h 2007-03-10 17:42:00.0 -0800
@@ -39,7 +39,7 @@ extern pgd_t swapper_pg_dir[1024];
 void check_pgt_cache(void);
 
 extern spinlock_t pgd_lock;
-extern struct page *pgd_list;
+extern struct list_head pgd_list;
 static inline void pgtable_cache_init(void) {};
 void paging_init(void);
 
Index: linux-2.6.21-rc3/arch/i386/mm/fault.c
===
--- linux-2.6.21-rc3.orig/arch/i386/mm/fault.c  2007-03-10 17:48:04.0 -0800
+++ linux-2.6.21-rc3/arch/i386/mm/fault.c   2007-03-10 17:49:30.0 -0800
@@ -608,11 +608,10 @@ void vmalloc_sync_all(void)
struct page *page;
 
spin_lock_irqsave(&pgd_lock, flags);
-   for (page = pgd_list; page; page =
-   (struct page *)page->index)
+   list_for_each_entry(page, &pgd_list, lru)
if (!vmalloc_sync_one(page_address(page),
address)) {
-   BUG_ON(page != pgd_list);
+   BUG();
break;
}
spin_unlock_irqrestore(&pgd_lock, flags);
Index: linux-2.6.21-rc3/arch/i386/mm/pageattr.c
===
--- linux-2.6.21-rc3.orig/arch/i386/mm/pageattr.c   2007-03-10 17:49:44.0 -0800
+++ linux-2.6.21-rc3/arch/i386/mm/pageattr.c 2007-03-10 17:50:14.0 -0800
@@ -95,7 +95,7 @@ static void set_pmd_pte(pte_t *kpte, uns
return;
 
spin_lock_irqsave(&pgd_lock, flags);
-   for (page = pgd_list; page; page = (struct page *)page->index) {
+   list_for_each_entry(page, &pgd_list, lru) {
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;


[SLUB 0/3] SLUB: The unqueued slab allocator V5

2007-03-10 Thread Christoph Lameter
[PATCH] SLUB The unqueued slab allocator v5

V4->V5:

- Single object slabs only for slabs > slub_max_order otherwise generate
  sufficient objects to avoid frequent use of the page allocator. This is
  necessary to compensate for fragmentation caused by frequent uses of
  the page allocator. We exempt slabs of PAGE_SIZE from this rule since
  multi object slabs require uses of fields that are in use on i386 and
  x86_64. See the quicklist patchset for a way to fix that issue
  and a patch to get rid of the PAGE_SIZE special casing.

- Drop pass through to page allocator due to page allocator fragmenting
  memory. The buffering through large order allocations is done in SLUB.
  Infrequent larger order allocations cause less fragmentation
  than frequent small order allocations.

- We need to update object sizes when merging slabs otherwise kzalloc
  will not initialize the full object (this caused the failure on
  various platforms).

- Padding checks before redzone checks so that we get messages about
  the corruption of whole slab and not about a single object.

Note that SLUB will warn on zero sized allocations. SLAB just allocates
some memory. So some traces from the usb subsystem etc should be expected.

Note that the definition of the return type of ksize() is currently
different between mm and Linus tree. Patch is conforming to mm.

V3->V4
- Rename /proc/slabinfo to /proc/slubinfo. We have a different format after
  all.
- More bug fixes and stabilization of diagnostic functions. This seems
  to be finally something that works wherever we test it.
- Serialize kmem_cache_create and kmem_cache_destroy via slub_lock (Adrian's
  idea)
- Add two new modifications (separate patches) to guarantee
  a minimum number of objects per slab and to pass through large
  allocations.

V2->V3
- Debugging and diagnostic support. This is runtime enabled and not compile
  time enabled. Runtime debugging can be controlled via kernel boot options
  on an individual slab cache basis or globally.
- Slab Trace support (For individual slab caches).
- Resiliency support: If basic sanity checks are enabled (via F f.e.)
  (boot option) then SLUB will do its best to perform diagnostics and
  then continue (i.e. mark corrupted objects as used).
- Fix up numerous issues including clash of SLUB's use of page
  flags with i386 arch use for pmd and pgds (which are managed
  as slab caches, sigh).
- Dynamic per CPU array sizing.
- Explain SLUB slabcache flags

V1->V2
- Fix up various issues. Tested on i386 UP, X86_64 SMP, ia64 NUMA.
- Provide NUMA support by splitting partial lists per node.
- Better Slab cache merge support (now at around 50% of slabs)
- List slab cache aliases if slab caches are merged.
- Updated description of /proc/slabinfo output

This is a new slab allocator which was motivated by the complexity of the
existing code in mm/slab.c. It attempts to address a variety of concerns
with the existing implementation.

A. Management of object queues

   A particular concern was the complex management of the numerous object
   queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for
   each allocating CPU and use objects from a slab directly instead of
   queueing them up.

B. Storage overhead of object queues

   SLAB Object queues exist per node, per CPU. The alien cache queue even
   has a queue array that contains a queue for each processor on each
   node. For very large systems the number of queues and the number of
   objects that may be caught in those queues grows exponentially. On our
   systems with 1k nodes / processors we have several gigabytes just tied up
   for storing references to objects for those queues. This does not include
   the objects that could be on those queues. One fears that the whole
   memory of the machine could one day be consumed by those queues.

C. SLAB meta data overhead

   SLAB has overhead at the beginning of each slab. This means that data
   cannot be naturally aligned at the beginning of a slab block. SLUB keeps
   all meta data in the corresponding page_struct. Objects can be naturally
   aligned in the slab. F.e. a 128 byte object will be aligned at 128 byte
   boundaries and can fit tightly into a 4k page with no bytes left over.
   SLAB cannot do this.

D. SLAB has a complex cache reaper

   SLUB does not need a cache reaper for UP systems. On SMP systems
   the per CPU slab may be pushed back into partial list but that
   operation is simple and does not require an iteration over a list
   of objects. SLAB expires per CPU, shared and alien object queues
   during cache reaping which may cause strange hold offs.

E. SLAB has complex NUMA policy layer support

   SLUB pushes NUMA policy handling into the page allocator. This means that
   allocation is coarser (SLUB does interleave on a page level) but that
   situation was also present before 2.6.13. SLAB's application of
   policies to individual slab objects allocated in SLAB is
   certainly a performance concern 
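
The alignment argument in point C can be checked with a bit of
arithmetic. The sketch below is only a model: fit_objects() is a made-up
helper, and the 64 byte in-slab header used in the second case is an
arbitrary illustrative figure, not the real size of SLAB's management
structure.

```c
#include <stddef.h>

struct slab_fit {
	size_t objects;		/* naturally aligned objects that fit */
	size_t leftover;	/* bytes used by neither header nor objects */
};

/*
 * Count how many obj_size-aligned objects fit into a slab of slab_size
 * bytes when the first `header` bytes hold in-slab metadata.
 * Assumes header <= slab_size and obj_size > 0.
 */
static struct slab_fit fit_objects(size_t slab_size, size_t header,
				   size_t obj_size)
{
	struct slab_fit fit = { 0, slab_size - header };
	/* The first object starts at the next obj_size boundary. */
	size_t first = (header + obj_size - 1) / obj_size * obj_size;

	if (first < slab_size) {
		fit.objects = (slab_size - first) / obj_size;
		fit.leftover = slab_size - header - fit.objects * obj_size;
	}
	return fit;
}
```

With no in-slab header, fit_objects(4096, 0, 128) gives 32 objects and
nothing left over. With the hypothetical 64 byte header, the first
object moves to offset 128 and one object slot is lost.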

[SLUB 2/3] Enable poisoning for RCU and constructors

2007-03-10 Thread Christoph Lameter
Enable poisoning / redzoning for slabs with constructors or SLAB_DESTROY_BY_RCU

We cannot poison the object itself but we can poison padding spaces and do
the redzoning. For that we introduce another flag controlling object
poisoning.
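
As a rough userspace illustration of the split between "poisoning is
enabled" and "the object body itself may be poisoned", consider the
sketch below. The SK_* flag values and the resolve_poison_flags() helper
are invented for the example; only the 0x6b / 0xa5 poison bytes match
what the kernel uses for POISON_FREE and POISON_END.

```c
#include <stddef.h>
#include <string.h>

#define SK_POISON_FREE 0x6b	/* fill byte for a free object */
#define SK_POISON_END  0xa5	/* marker in the last byte */

/* Illustrative flag bits; NOT the kernel's real values. */
#define SK_SLAB_POISON         0x1u
#define SK_SLAB_DESTROY_BY_RCU 0x2u
#define SK_OBJECT_POISON       0x4u

/*
 * Object bodies may only be overwritten when nothing relies on their
 * contents surviving a free: no RCU freeing, no constructor/destructor.
 */
static unsigned int resolve_poison_flags(unsigned int flags,
					 int has_ctor, int has_dtor)
{
	if ((flags & SK_SLAB_POISON) && !(flags & SK_SLAB_DESTROY_BY_RCU) &&
	    !has_ctor && !has_dtor)
		flags |= SK_OBJECT_POISON;
	return flags;
}

/* Fill a free object with the poison pattern. */
static void poison_object(unsigned char *p, size_t objsize)
{
	memset(p, SK_POISON_FREE, objsize - 1);
	p[objsize - 1] = SK_POISON_END;
}

/* Return 1 if the pattern is intact, 0 if something wrote to the object. */
static int poison_intact(const unsigned char *p, size_t objsize)
{
	size_t i;

	for (i = 0; i + 1 < objsize; i++)
		if (p[i] != SK_POISON_FREE)
			return 0;
	return p[objsize - 1] == SK_POISON_END;
}
```

A cache with a constructor keeps SK_SLAB_POISON set (so padding checks
and redzoning can still run) but never gains SK_OBJECT_POISON.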

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc3/mm/slub.c
===
--- linux-2.6.21-rc3.orig/mm/slub.c 2007-03-09 21:13:02.0 -0800
+++ linux-2.6.21-rc3/mm/slub.c  2007-03-09 21:13:44.0 -0800
@@ -80,6 +80,9 @@
 #define ARCH_SLAB_MINALIGN sizeof(void *)
 #endif
 
+/* Internal SLUB flags */
+#define __OBJECT_POISON 0x8000 /* Poison object */
+
 static int kmem_size = sizeof(struct kmem_cache);
 
 #ifdef CONFIG_SMP
@@ -247,8 +250,8 @@
if (s->objects == 1)
return;
 
-   if (s->flags & SLAB_POISON) {
-   memset(p, POISON_FREE, s->objsize -1);
+   if (s->flags & __OBJECT_POISON) {
+   memset(p, POISON_FREE, s->objsize - 1);
p[s->objsize -1] = POISON_END;
}
 
@@ -388,7 +391,8 @@
object_err(s, page, p, "Alignment padding check fails");
 
if (s->flags & SLAB_POISON) {
-   if (!active && (!check_bytes(p, POISON_FREE, s->objsize - 1) ||
+   if (!active && (s->flags & __OBJECT_POISON)
+   && (!check_bytes(p, POISON_FREE, s->objsize - 1) ||
p[s->objsize -1] != POISON_END)) {
object_err(s, page, p, "Poison");
return 0;
@@ -1371,14 +1375,9 @@
strncmp(slub_debug_slabs, name, strlen(slub_debug_slabs)) == 0))
flags |= slub_debug;
 
-   if ((flags & SLAB_POISON) &&((flags & SLAB_DESTROY_BY_RCU) ||
-   ctor || dtor)) {
-   if (!(slub_debug & SLAB_POISON))
-   printk(KERN_WARNING "SLUB %s: Clearing SLAB_POISON "
-   "because de/constructor exists.\n",
-   s->name);
-   flags &= ~SLAB_POISON;
-   }
+   if ((flags & SLAB_POISON) && !(flags & SLAB_DESTROY_BY_RCU) &&
+   !ctor && !dtor)
+   flags |= __OBJECT_POISON;
 
tentative_size = ALIGN(size, calculate_alignment(align, flags));
 
@@ -1389,7 +1388,7 @@
 */
if (size == PAGE_SIZE)
flags &= ~(SLAB_RED_ZONE| SLAB_DEBUG_FREE | \
-   SLAB_STORE_USER | SLAB_POISON);
+   SLAB_STORE_USER | SLAB_POISON | __OBJECT_POISON);
 
s->name = name;
s->ctor = ctor;


[QUICKLIST 5/6] x86_64: Separate quicklist for pgds

2007-03-10 Thread Christoph Lameter
x86_64: Add quicklist for pgd.

A second quicklist is useful to separate out PGD handling. We can carry
the initialized pgds over to the next process needing them. This
avoids the zeroing of the pgds on free that we had to introduce
in the last patch.

Also clean up the pgd_list handling to use regular list macros.
There is no need anymore to avoid the lru field.

Move the add/removal of the pgds to the pgdlist into the
constructor / destructor. That way the implementation is
congruent with i386.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc3/arch/x86_64/Kconfig
===
--- linux-2.6.21-rc3.orig/arch/x86_64/Kconfig   2007-03-10 14:00:52.0 -0800
+++ linux-2.6.21-rc3/arch/x86_64/Kconfig 2007-03-10 14:00:53.0 -0800
@@ -58,7 +58,7 @@
 
 config NR_QUICK
int
-   default 1
+   default 2
 
 config ISA
bool
Index: linux-2.6.21-rc3/arch/x86_64/mm/fault.c
===
--- linux-2.6.21-rc3.orig/arch/x86_64/mm/fault.c 2007-03-10 14:00:29.0 -0800
+++ linux-2.6.21-rc3/arch/x86_64/mm/fault.c 2007-03-10 14:00:53.0 -0800
@@ -585,7 +585,7 @@
 }
 
 DEFINE_SPINLOCK(pgd_lock);
-struct page *pgd_list;
+LIST_HEAD(pgd_list);
 
 void vmalloc_sync_all(void)
 {
@@ -605,8 +605,7 @@
if (pgd_none(*pgd_ref))
continue;
spin_lock(&pgd_lock);
-   for (page = pgd_list; page;
-page = (struct page *)page->index) {
+   list_for_each_entry(page, &pgd_list, lru) {
pgd_t *pgd;
pgd = (pgd_t *)page_address(page) + pgd_index(address);
if (pgd_none(*pgd))
Index: linux-2.6.21-rc3/include/asm-x86_64/pgalloc.h
===
--- linux-2.6.21-rc3.orig/include/asm-x86_64/pgalloc.h  2007-03-10 14:00:52.0 -0800
+++ linux-2.6.21-rc3/include/asm-x86_64/pgalloc.h   2007-03-10 14:00:53.0 -0800
@@ -7,6 +7,9 @@
 #include 
 #include 
 
+#define QUICK_PGD 0 /* We preserve special mappings over free */
+#define QUICK_PT 1  /* Other page table pages that are zero on free */
+
 #define pmd_populate_kernel(mm, pmd, pte) \
set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
 #define pud_populate(mm, pud, pmd) \
@@ -22,88 +25,77 @@
 static inline void pmd_free(pmd_t *pmd)
 {
BUG_ON((unsigned long)pmd & (PAGE_SIZE-1));
-   quicklist_free(0, NULL, pmd);
+   quicklist_free(QUICK_PT, NULL, pmd);
 }
 
 static inline pmd_t *pmd_alloc_one (struct mm_struct *mm, unsigned long addr)
 {
-   return (pmd_t *)quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL);
+   return (pmd_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
 }
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-   return (pud_t *)quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL);
+   return (pud_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
 }
 
 static inline void pud_free (pud_t *pud)
 {
BUG_ON((unsigned long)pud & (PAGE_SIZE-1));
-   quicklist_free(0, NULL, pud);
+   quicklist_free(QUICK_PT, NULL, pud);
 }
 
-static inline void pgd_list_add(pgd_t *pgd)
+static inline void pgd_ctor(void *x)
 {
+   unsigned boundary;
+   pgd_t *pgd = x;
struct page *page = virt_to_page(pgd);
 
+   /*
+* Copy kernel pointers in from init.
+*/
+   boundary = pgd_index(__PAGE_OFFSET);
+   memcpy(pgd + boundary,
+   init_level4_pgt + boundary,
+   (PTRS_PER_PGD - boundary) * sizeof(pgd_t));
+
spin_lock(&pgd_lock);
-   page->index = (pgoff_t)pgd_list;
-   if (pgd_list)
-   pgd_list->private = (unsigned long)&page->index;
-   pgd_list = page;
-   page->private = (unsigned long)&pgd_list;
+   list_add(&page->lru, &pgd_list);
spin_unlock(&pgd_lock);
 }
 
-static inline void pgd_list_del(pgd_t *pgd)
+static inline void pgd_dtor(void *x)
 {
-   struct page *next, **pprev, *page = virt_to_page(pgd);
+   pgd_t *pgd = x;
+   struct page *page = virt_to_page(pgd);
 
spin_lock(&pgd_lock);
-   next = (struct page *)page->index;
-   pprev = (struct page **)page->private;
-   *pprev = next;
-   if (next)
-   next->private = (unsigned long)pprev;
+   list_del(&page->lru);
spin_unlock(&pgd_lock);
 }
 
+
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-   unsigned boundary;
-   pgd_t *pgd = (pgd_t *)quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL);
-   if (!pgd)
-   return NULL;
+   pgd_t *pgd = (pgd_t *)quicklist_alloc(QUICK_PGD,
+GFP_KERNEL|__GFP_REPEAT, pgd_ctor);
 
-   pgd_list_add(pgd);
-   /*
- 

Re: [patch 6/9] signalfd/timerfd v1 - timerfd core ...

2007-03-10 Thread Nicholas Miell
On Sat, 2007-03-10 at 17:57 -0800, Davide Libenzi wrote:
> On Sat, 10 Mar 2007, Nicholas Miell wrote:
> 
> > If that's the goal, somebody should start thinking about reducing the
> > contents of struct file to the bare minimum (i.e. not much more than a
> > file_operations pointer).
> 
> That's already pretty small, and the single inode (and maybe dentry) will 
> make it even smaller. Unless you want to create brazillions of signalfds,
> timerfds or asyncfds.
> 

Timers don't need dentry or inode pointers or readahead state, etc., do
they? (Beyond the existing VFS expectation, that is.)

> > > And the real point of the whole signalfd() is that there really *are* a 
> > > lot of UNIX interfaces that basically only work with file descriptors. 
> > > Not 
> > > just read, but select/poll/epoll.
> > 
> > It'd be useful if the polling interfaces could return small datums
> > beyond just the POLL* flags -- having to do a read on timerfd just to
> > get the overrun count has a lot of overhead for just an integer, and I
> > imagine other things would like to pass back stuff too.
> ...
> 
> > You still want timeouts, creating/setting/destroying at timer just for
> > a single call to select/poll/epoll is probably too heavy weight.
> 
> Take a look at what timerfd does and what posix timers has to do to 
> implement the interface. You'll prolly stop trolling with things like "a 
> lot of overhead" or "too heavy weight".

That wasn't a troll. I was talking about the timerfd()/close() overhead
and the corresponding bookkeeping necessary to keep that fd around
compared to just passing a struct timespec to poll or a millisecond
count to epoll_wait.

> > timerfd() still leaves out the basic clock selection functionality
> > provided by both setitimer() and timer_create().
> 
> That is coming as soon as I fixed my send-serie script ...

Nice.

-- 
Nicholas Miell <[EMAIL PROTECTED]>



[SLUB 3/3] Configurable slub_max_order

2007-03-10 Thread Christoph Lameter
Add slub_max_order

Avoid slabs getting too large. No longer enforce slub_min_objects
if the slab gets bigger than slub_max_order.

I am not sure if we really want this. Maybe we should make the
selection of the base page size depend on page allocator
defrag behavior? I.e. try to restrict allocations to order 0 and order 2
so that we can limit fragmentation?
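
The effect of the cap can be sketched in userspace as follows. This is a
simplified model, not the function from the patch: calc_order() is a
made-up name, the search starts at order 0 instead of at
max(slub_min_order, fls(size - 1) - PAGE_SHIFT), the check on wasted
space is left out, and the SK_* constants are assumed values.

```c
/* Assumed values for the sketch; real ones depend on the architecture. */
#define SK_PAGE_SIZE 4096UL
#define SK_MAX_ORDER 11

/*
 * Pick the smallest page order for objects of `size` bytes.
 * Below max_order we insist on at least min_objects per slab; from
 * max_order upwards a slab that holds a single object is accepted.
 * Returns -1 if no order below SK_MAX_ORDER fits.
 */
static int calc_order(unsigned long size, unsigned long min_objects,
		      int max_order)
{
	int order;

	for (order = 0; order < SK_MAX_ORDER; order++) {
		unsigned long slab_size = SK_PAGE_SIZE << order;

		/* Still under the cap: demand the minimum object count. */
		if (max_order > order && slab_size < min_objects * size)
			continue;

		/* At or above the cap: one object must at least fit. */
		if (slab_size < size)
			continue;

		return order;
	}
	return -1;
}
```

For 16k objects with a minimum of 8 objects per slab, an uncapped search
would settle on order 5, while a cap of 4 stops at order 4 with only 4
objects per slab.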

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc3/mm/slub.c
===
--- linux-2.6.21-rc3.orig/mm/slub.c 2007-03-10 13:14:06.0 -0800
+++ linux-2.6.21-rc3/mm/slub.c  2007-03-10 13:14:11.0 -0800
@@ -1211,6 +1211,7 @@ static __always_inline struct page *get_
  * take the list_lock.
  */
 static int slub_min_order = 0;
+static int slub_max_order = 4;
 
 /*
  * Minimum number of objects per slab. This is necessary in order to
@@ -1249,7 +1250,11 @@ static int calculate_order(int size)
order < MAX_ORDER; order++) {
unsigned long slab_size = PAGE_SIZE << order;
 
-   if (slab_size < slub_min_objects * size)
+   if (slub_max_order > order &&
+   slab_size < slub_min_objects * size)
+   continue;
+
+   if (slab_size < size)
continue;
 
rem = slab_size % size;
@@ -1637,6 +1642,15 @@ static int __init setup_slub_min_order(c
 
 __setup("slub_min_order=", setup_slub_min_order);
 
+static int __init setup_slub_max_order(char *str)
+{
+   get_option (&str, &slub_max_order);
+
+   return 1;
+}
+
+__setup("slub_max_order=", setup_slub_max_order);
+
 static int __init setup_slub_min_objects(char *str)
 {
get_option (&str, &slub_min_objects);


[QUICKLIST 6/6] slub: remove special casing for PAGE_SIZE slabs

2007-03-10 Thread Christoph Lameter
Slub: Remove special casing for page sized slabs

After we have used quicklist so that arches can avoid using the slab
allocator to manage page table pages we can now remove the special
casing from slub.

This is against SLUB V5

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc3/mm/slub.c
===
--- linux-2.6.21-rc3.orig/mm/slub.c 2007-03-09 21:23:39.0 -0800
+++ linux-2.6.21-rc3/mm/slub.c  2007-03-09 21:24:23.0 -0800
@@ -1236,16 +1236,6 @@
int order;
int rem;
 
-   /*
-* If this is an order 0 page then there are no issues with
-* fragmentation. We can then create a slab with a single object.
-* We need this to support the i386 arch code that uses our
-* freelist field (index field) for a list pointer. We neveri
-* touch the freelist pointer if we just have one object
-*/
-   if (size == PAGE_SIZE)
-   return 0;
-
for (order = max(slub_min_order, fls(size - 1) - PAGE_SHIFT);
order < MAX_ORDER; order++) {
unsigned long slab_size = PAGE_SIZE << order;
@@ -1386,15 +1376,6 @@
 
tentative_size = ALIGN(size, calculate_alignment(align, flags));
 
-   /*
-* PAGE_SIZE slabs are special in that they are passed through
-* to the page allocator. Do not do any debugging in order to avoid
-* increasing the size of the object.
-*/
-   if (size == PAGE_SIZE)
-   flags &= ~(SLAB_RED_ZONE| SLAB_DEBUG_FREE | \
-   SLAB_STORE_USER | SLAB_POISON | __OBJECT_POISON);
-
s->name = name;
s->ctor = ctor;
s->dtor = dtor;


[QUICKLIST 2/6] i386: quicklist support

2007-03-10 Thread Christoph Lameter
i386: Convert to quicklists

Implement the i386 management of pgd and pmds using quicklists.

The i386 management of page table pages currently uses page sized slabs.
The page state is therefore mainly determined by the slab code. However,
i386 also uses its own fields in the page struct to mark special pages
and to build a list of pgds using the ->private and ->index field (yuck!).
This has been finely tuned to work right with SLAB but SLUB needs more
control over the page struct. Currently the only way for SLUB to support
these slabs is through special casing PAGE_SIZE slabs.

If we use quicklists instead then we can avoid the mess, and also the
overhead of manipulating page sized objects through slab.

It also allows us to use standard list manipulation macros for the
pgd list using page->lru thereby simplifying the code.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc3/arch/i386/mm/init.c
===
--- linux-2.6.21-rc3.orig/arch/i386/mm/init.c   2007-03-10 13:13:32.0 -0800
+++ linux-2.6.21-rc3/arch/i386/mm/init.c 2007-03-10 13:39:23.0 -0800
@@ -695,31 +695,6 @@ int remove_memory(u64 start, u64 size)
 EXPORT_SYMBOL_GPL(remove_memory);
 #endif
 
-struct kmem_cache *pgd_cache;
-struct kmem_cache *pmd_cache;
-
-void __init pgtable_cache_init(void)
-{
-   if (PTRS_PER_PMD > 1) {
-   pmd_cache = kmem_cache_create("pmd",
-   PTRS_PER_PMD*sizeof(pmd_t),
-   PTRS_PER_PMD*sizeof(pmd_t),
-   0,
-   pmd_ctor,
-   NULL);
-   if (!pmd_cache)
-   panic("pgtable_cache_init(): cannot create pmd cache");
-   }
-   pgd_cache = kmem_cache_create("pgd",
-   PTRS_PER_PGD*sizeof(pgd_t),
-   PTRS_PER_PGD*sizeof(pgd_t),
-   0,
-   pgd_ctor,
-   PTRS_PER_PMD == 1 ? pgd_dtor : NULL);
-   if (!pgd_cache)
-   panic("pgtable_cache_init(): Cannot create pgd cache");
-}
-
 /*
  * This function cannot be __init, since exceptions don't work in that
  * section.  Put this after the callers, so that it cannot be inlined.
Index: linux-2.6.21-rc3/arch/i386/mm/pgtable.c
===
--- linux-2.6.21-rc3.orig/arch/i386/mm/pgtable.c 2007-03-10 13:13:32.0 -0800
+++ linux-2.6.21-rc3/arch/i386/mm/pgtable.c 2007-03-10 13:43:39.0 -0800
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -181,9 +182,12 @@ void reserve_top_address(unsigned long r
 #endif
 }
 
+#define QUICK_PGD 0
+#define QUICK_PT 1
+
 pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
 {
-   return (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO);
+   return (pte_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL, NULL);
 }
 
 struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
@@ -198,11 +202,6 @@ struct page *pte_alloc_one(struct mm_str
return pte;
 }
 
-void pmd_ctor(void *pmd, struct kmem_cache *cache, unsigned long flags)
-{
-   memset(pmd, 0, PTRS_PER_PMD*sizeof(pmd_t));
-}
-
 /*
  * List of all pgd's needed for non-PAE so it can invalidate entries
  * in both cached and uncached pgd's; not needed for PAE since the
@@ -211,8 +210,6 @@ void pmd_ctor(void *pmd, struct kmem_cac
  * against pageattr.c; it is the unique case in which a valid change
  * of kernel pagetables can't be lazily synchronized by vmalloc faults.
  * vmalloc faults work because attached pagetables are never freed.
- * The locking scheme was chosen on the basis of manfred's
- * recommendations and having no core impact whatsoever.
  * -- wli
  */
 DEFINE_SPINLOCK(pgd_lock);
@@ -238,7 +235,7 @@ static inline void pgd_list_del(pgd_t *p
set_page_private(next, (unsigned long)pprev);
 }
 
-void pgd_ctor(void *pgd, struct kmem_cache *cache, unsigned long unused)
+void pgd_ctor(void *pgd)
 {
unsigned long flags;
 
@@ -264,7 +261,7 @@ void pgd_ctor(void *pgd, struct kmem_cac
 }
 
 /* never called when PTRS_PER_PMD > 1 */
-void pgd_dtor(void *pgd, struct kmem_cache *cache, unsigned long unused)
+void pgd_dtor(void *pgd)
 {
unsigned long flags; /* can be called from interrupt context */
 
@@ -277,13 +274,13 @@ void pgd_dtor(void *pgd, struct kmem_cac
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
int i;
-   pgd_t *pgd = kmem_cache_alloc(pgd_cache, GFP_KERNEL);
+   pgd_t *pgd = quicklist_alloc(QUICK_PGD, GFP_KERNEL, pgd_ctor);
 
if (PTRS_PER_PMD == 1 || !pgd)
return pgd;
 
for (i = 0; i < USER_PTRS_PER_PGD; ++i) {
-   pmd_t *pmd = 

[QUICKLIST 1/6] Extract quicklist implementation from IA64

2007-03-10 Thread Christoph Lameter
Abstract quicklist from the IA64 implementation

Extract the quicklist implementation for IA64, clean it up
and generalize it to:

1. Allow multiple quicklists

2. Add support for constructors and destructors.

Quicklist allocation and frees occur inline. The support
for constructors / destructors and multiple quicklists
can therefore be optimized out of the final code for an
arch.
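
For readers unfamiliar with the idea, here is a userspace sketch of what
a quicklist amounts to. It is a model under stated assumptions, not the
code from this patch: the gfp argument and the per-CPU handling are
dropped, calloc()/free() stand in for the page allocator, and it assumes
a constructor runs only when a page is first obtained and a destructor
only when a page leaves the cache. quicklist_trim() and counting_ctor()
are names chosen for the example.

```c
#include <stdlib.h>

#define SK_PAGE_SIZE 4096
#define SK_NR_QUICK  2		/* number of independent quicklists */

struct quicklist {
	void *page;		/* head of the free list, NULL when empty */
	int nr_pages;		/* pages currently cached */
};

static struct quicklist quicklists[SK_NR_QUICK];

/* Example constructor: just counts how often a page is set up. */
static int ctor_calls;
static void counting_ctor(void *page)
{
	(void)page;
	ctor_calls++;
}

static void *quicklist_alloc(int nr, void (*ctor)(void *))
{
	struct quicklist *q = &quicklists[nr];
	void **p = q->page;

	if (p) {
		/* Fast path: pop a cached page; the link lives in word 0. */
		q->page = p[0];
		p[0] = NULL;
		q->nr_pages--;
		return p;
	}

	/* Slow path: get a fresh zeroed page and construct it once. */
	p = calloc(1, SK_PAGE_SIZE);
	if (p && ctor)
		ctor(p);
	return p;
}

/* Return a page to the cache; the caller leaves it in "clean" state. */
static void quicklist_free(int nr, void *page)
{
	struct quicklist *q = &quicklists[nr];
	void **p = page;

	p[0] = q->page;
	q->page = p;
	q->nr_pages++;
}

/* Hand all cached pages back, running the destructor on each. */
static void quicklist_trim(int nr, void (*dtor)(void *))
{
	struct quicklist *q = &quicklists[nr];

	while (q->page) {
		void **p = q->page;

		q->page = p[0];
		q->nr_pages--;
		if (dtor)
			dtor(p);
		free(p);
	}
}
```

A page that is freed and then allocated again comes straight off the
list without a second constructor call, which is where the saving over
zeroing or re-initializing a page table page would come from.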

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc3/arch/ia64/mm/init.c
===
--- linux-2.6.21-rc3.orig/arch/ia64/mm/init.c   2007-03-10 11:34:00.0 -0800
+++ linux-2.6.21-rc3/arch/ia64/mm/init.c 2007-03-10 11:50:46.0 -0800
@@ -39,9 +39,6 @@
 
 DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
 
-DEFINE_PER_CPU(unsigned long *, __pgtable_quicklist);
-DEFINE_PER_CPU(long, __pgtable_quicklist_size);
-
 extern void ia64_tlb_init (void);
 
 unsigned long MAX_DMA_ADDRESS = PAGE_OFFSET + 0x1UL;
@@ -56,54 +53,6 @@ EXPORT_SYMBOL(vmem_map);
 struct page *zero_page_memmap_ptr; /* map entry for zero page */
 EXPORT_SYMBOL(zero_page_memmap_ptr);
 
-#define MIN_PGT_PAGES  25UL
-#define MAX_PGT_FREES_PER_PASS 16L
-#define PGT_FRACTION_OF_NODE_MEM   16
-
-static inline long
-max_pgt_pages(void)
-{
-   u64 node_free_pages, max_pgt_pages;
-
-#ifndef CONFIG_NUMA
-   node_free_pages = nr_free_pages();
-#else
-   node_free_pages = node_page_state(numa_node_id(), NR_FREE_PAGES);
-#endif
-   max_pgt_pages = node_free_pages / PGT_FRACTION_OF_NODE_MEM;
-   max_pgt_pages = max(max_pgt_pages, MIN_PGT_PAGES);
-   return max_pgt_pages;
-}
-
-static inline long
-min_pages_to_free(void)
-{
-   long pages_to_free;
-
-   pages_to_free = pgtable_quicklist_size - max_pgt_pages();
-   pages_to_free = min(pages_to_free, MAX_PGT_FREES_PER_PASS);
-   return pages_to_free;
-}
-
-void
-check_pgt_cache(void)
-{
-   long pages_to_free;
-
-   if (unlikely(pgtable_quicklist_size <= MIN_PGT_PAGES))
-   return;
-
-   preempt_disable();
-   while (unlikely((pages_to_free = min_pages_to_free()) > 0)) {
-   while (pages_to_free--) {
-   free_page((unsigned long)pgtable_quicklist_alloc());
-   }
-   preempt_enable();
-   preempt_disable();
-   }
-   preempt_enable();
-}
-
 void
 lazy_mmu_prot_update (pte_t pte)
 {
Index: linux-2.6.21-rc3/include/asm-ia64/pgalloc.h
===
--- linux-2.6.21-rc3.orig/include/asm-ia64/pgalloc.h 2007-03-10 11:34:00.0 -0800
+++ linux-2.6.21-rc3/include/asm-ia64/pgalloc.h 2007-03-10 12:37:56.0 -0800
@@ -18,71 +18,18 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
-DECLARE_PER_CPU(unsigned long *, __pgtable_quicklist);
-#define pgtable_quicklist __ia64_per_cpu_var(__pgtable_quicklist)
-DECLARE_PER_CPU(long, __pgtable_quicklist_size);
-#define pgtable_quicklist_size __ia64_per_cpu_var(__pgtable_quicklist_size)
-
-static inline long pgtable_quicklist_total_size(void)
-{
-   long ql_size = 0;
-   int cpuid;
-
-   for_each_online_cpu(cpuid) {
-   ql_size += per_cpu(__pgtable_quicklist_size, cpuid);
-   }
-   return ql_size;
-}
-
-static inline void *pgtable_quicklist_alloc(void)
-{
-   unsigned long *ret = NULL;
-
-   preempt_disable();
-
-   ret = pgtable_quicklist;
-   if (likely(ret != NULL)) {
-   pgtable_quicklist = (unsigned long *)(*ret);
-   ret[0] = 0;
-   --pgtable_quicklist_size;
-   preempt_enable();
-   } else {
-   preempt_enable();
-   ret = (unsigned long *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
-   }
-
-   return ret;
-}
-
-static inline void pgtable_quicklist_free(void *pgtable_entry)
-{
-#ifdef CONFIG_NUMA
-   int nid = page_to_nid(virt_to_page(pgtable_entry));
-
-   if (unlikely(nid != numa_node_id())) {
-   free_page((unsigned long)pgtable_entry);
-   return;
-   }
-#endif
-
-   preempt_disable();
-   *(unsigned long *)pgtable_entry = (unsigned long)pgtable_quicklist;
-   pgtable_quicklist = (unsigned long *)pgtable_entry;
-   ++pgtable_quicklist_size;
-   preempt_enable();
-}
-
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-   return pgtable_quicklist_alloc();
+   return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline void pgd_free(pgd_t * pgd)
 {
-   pgtable_quicklist_free(pgd);
+   quicklist_free(0, NULL, pgd);
 }
 
 #ifdef CONFIG_PGTABLE_4
@@ -94,12 +41,12 @@ pgd_populate(struct mm_struct *mm, pgd_t
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-   return pgtable_quicklist_alloc();
+   return quicklist_alloc(0, GFP_KERNEL, NULL);
 }
 
 static inline void pud_free(pud_t * 

[QUICKLIST 4/6] x86_64: Single Quicklist

2007-03-10 Thread Christoph Lameter
x86_64: Convert to use a single quicklist

This adds caching of pgds, puds, pmds and ptes. That way we can
avoid costly zeroing and initialization of special mappings in the
pgd.

The first patch just adds a simple implementation using a single
quicklist. As a consequence we need to zero a pgd before returning
it to the pool.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc3/arch/x86_64/Kconfig
===
--- linux-2.6.21-rc3.orig/arch/x86_64/Kconfig   2007-03-10 10:45:38.0 
-0800
+++ linux-2.6.21-rc3/arch/x86_64/Kconfig2007-03-10 12:50:47.0 
-0800
@@ -56,6 +56,10 @@ config ZONE_DMA
bool
default y
 
+config NR_QUICK
+   int
+   default 1
+
 config ISA
bool
 
Index: linux-2.6.21-rc3/include/asm-x86_64/pgalloc.h
===
--- linux-2.6.21-rc3.orig/include/asm-x86_64/pgalloc.h  2007-03-10 
10:45:39.0 -0800
+++ linux-2.6.21-rc3/include/asm-x86_64/pgalloc.h   2007-03-10 
12:52:14.0 -0800
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define pmd_populate_kernel(mm, pmd, pte) \
set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
@@ -21,23 +22,23 @@ static inline void pmd_populate(struct m
 static inline void pmd_free(pmd_t *pmd)
 {
BUG_ON((unsigned long)pmd & (PAGE_SIZE-1));
-   free_page((unsigned long)pmd);
+   quicklist_free(0, NULL, pmd);
 }
 
 static inline pmd_t *pmd_alloc_one (struct mm_struct *mm, unsigned long addr)
 {
-   return (pmd_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+   return (pmd_t *)quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL);
 }
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-   return (pud_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+   return (pud_t *)quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL);
 }
 
 static inline void pud_free (pud_t *pud)
 {
BUG_ON((unsigned long)pud & (PAGE_SIZE-1));
-   free_page((unsigned long)pud);
+   quicklist_free(0, NULL, pud);
 }
 
 static inline void pgd_list_add(pgd_t *pgd)
@@ -69,9 +70,10 @@ static inline void pgd_list_del(pgd_t *p
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
unsigned boundary;
-   pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
+   pgd_t *pgd = (pgd_t *)quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL);
if (!pgd)
return NULL;
+
pgd_list_add(pgd);
/*
 * Copy kernel pointers in from init.
@@ -90,17 +92,18 @@ static inline void pgd_free(pgd_t *pgd)
 {
BUG_ON((unsigned long)pgd & (PAGE_SIZE-1));
pgd_list_del(pgd);
-   free_page((unsigned long)pgd);
+   memset(pgd, 0, PAGE_SIZE);
+   quicklist_free(0, NULL, pgd);
 }
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long 
address)
 {
-   return (pte_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+   return (pte_t *)quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL);
 }
 
 static inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long 
address)
 {
-   void *p = (void *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+   void *p = (void *)quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL);
if (!p)
return NULL;
return virt_to_page(p);
@@ -112,17 +115,21 @@ static inline struct page *pte_alloc_one
 static inline void pte_free_kernel(pte_t *pte)
 {
BUG_ON((unsigned long)pte & (PAGE_SIZE-1));
-   free_page((unsigned long)pte); 
+   quicklist_free(0, NULL, pte);
 }
 
 static inline void pte_free(struct page *pte)
 {
__free_page(pte);
-} 
+}
 
 #define __pte_free_tlb(tlb,pte) tlb_remove_page((tlb),(pte))
 
 #define __pmd_free_tlb(tlb,x)   tlb_remove_page((tlb),virt_to_page(x))
 #define __pud_free_tlb(tlb,x)   tlb_remove_page((tlb),virt_to_page(x))
 
+static inline void check_pgt_cache(void)
+{
+   quicklist_check(0, NULL);
+}
 #endif /* _X86_64_PGALLOC_H */
Index: linux-2.6.21-rc3/mm/Kconfig
===
--- linux-2.6.21-rc3.orig/mm/Kconfig2007-03-10 11:50:46.0 -0800
+++ linux-2.6.21-rc3/mm/Kconfig 2007-03-10 12:50:47.0 -0800
@@ -168,3 +168,8 @@ config QUICKLIST
default y if NR_QUICK != 0
 
 
+config QUICKLIST
+   bool
+   default y if NR_QUICK != 0
+
+
Index: linux-2.6.21-rc3/arch/x86_64/kernel/process.c
===
--- linux-2.6.21-rc3.orig/arch/x86_64/kernel/process.c  2007-03-10 
10:45:38.0 -0800
+++ linux-2.6.21-rc3/arch/x86_64/kernel/process.c   2007-03-10 
12:52:46.0 -0800
@@ -207,6 +207,7 @@ void cpu_idle (void)
if (__get_cpu_var(cpu_idle_state))
__get_cpu_var(cpu_idle_state) = 0;
 
+   

[QUICKLIST 0/6] Arch independent quicklists V1

2007-03-10 Thread Christoph Lameter
This patchset introduces an arch independent framework to handle lists
of recently used page table pages.

Page table pages have the characteristic that they are typically zero
or in a known state when they are freed. This is usually exactly the
same state as needed after allocation. So it makes sense to build a list
of freed page table pages and then consume the pages already in use
first. Those pages have already been initialized correctly (thus no
need to zero them) and are likely already cached in such a way that
the MMU can use them most effectively.

Such an implementation already exists for ia64. If I remember correctly
it was done by Robin Holt. However, that implementation did not support
constructors and destructors as needed by i386 / x86_64. It also only
supported a single quicklist. The implementation here has constructor
and destructor support as well as the ability for an arch to specify
how many quicklists are needed.

Quicklists are defined by an arch defining the necessary number
of quicklists in arch//Kconfig. F.e. i386 needs two and thus
has 

config NR_QUICK
int
default 2

If an arch has requested quicklist support then pages can be allocated
from the quicklist (or from the page allocator if the quicklist is
empty) via:

quicklist_alloc(, , )

Page table pages can be freed using:

quicklist_free(, , )

Pages must have a definite state after allocation and before
they are freed. If no constructor is specified then pages
will be zeroed on allocation and must be zeroed before they are
freed.

If a constructor is used then the constructor will establish
a definite page state. F.e. the i386 and x86_64 pgd constructors
establish certain mappings.

Constructors and destructors can also be used to track the pages.
i386 and x86_64 use a list of pgds in order to be able to dynamically
update standard mappings.
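
As a rough illustration of the idea described above, here is a minimal
userspace sketch of a single quicklist without constructor or destructor
support. It is not the code from this patchset: the names
sketch_quicklist, sketch_quicklist_alloc, sketch_quicklist_free and
SKETCH_PAGE_SIZE are made up for the example, and calloc() stands in for
the page allocator. The free list is threaded through the first word of
each free page, the same way the ia64 code removed in patch 1/6 did it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096   /* hypothetical stand-in for PAGE_SIZE */

/* Hypothetical model of one quicklist: a LIFO of already-zeroed pages. */
struct sketch_quicklist {
    void *head;         /* most recently freed page, or NULL if empty */
    size_t nr_pages;    /* number of pages currently cached */
};

/*
 * Take a page from the quicklist if one is cached; otherwise fall back
 * to the allocator and ask for a zeroed page.
 */
static void *sketch_quicklist_alloc(struct sketch_quicklist *q)
{
    void **page = q->head;

    if (page) {
        q->head = page[0];  /* first word links to the next free page */
        page[0] = NULL;     /* clear the link so the page is all zero again */
        q->nr_pages--;
        return page;
    }
    return calloc(1, SKETCH_PAGE_SIZE);
}

/*
 * Put a page back on the quicklist. The caller must have returned the
 * page to its all-zero state first; only the link word is written here.
 */
static void sketch_quicklist_free(struct sketch_quicklist *q, void *p)
{
    void **page = p;

    page[0] = q->head;
    q->head = page;
    q->nr_pages++;
}
```

A page that is freed and then allocated again comes back without being
zeroed a second time, which is the saving the text above is after. The
sketch ignores per-cpu lists, preemption and trimming of the list.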

6 patches follow this message:

[QUICKLIST 1/6] Extract quicklist implementation from IA64
[QUICKLIST 2/6] i386: quicklist support
[QUICKLIST 3/6] i386: Use standard list manipulators for pgd_list
[QUICKLIST 4/6] x86_64: Single quicklist
[QUICKLIST 5/6] x86_64: Separate quicklist for pgds
[QUICKLIST 6/6] slub: remove special casing for PAGE_SIZE slabs



Re: [patch 6/9] signalfd/timerfd v1 - timerfd core ...

2007-03-10 Thread Davide Libenzi
On Sat, 10 Mar 2007, Nicholas Miell wrote:

> If that's the goal, somebody should start thinking about reducing the
> contents of struct file to the bare minimum (i.e. not much more than a
> file_operations pointer).

That's already pretty small, and the single inode (and maybe dentry) will 
make it even smaller. Unless you want to create brazillions of signalfds,
timerfds or asyncfds.



> > And the real point of the whole signalfd() is that there really *are* a 
> > lot of UNIX interfaces that basically only work with file descriptors. Not 
> > just read, but select/poll/epoll.
> 
> It'd be useful if the polling interfaces could return small datums
> beyond just the POLL* flags -- having to do a read on timerfd just to
> get the overrun count has a lot of overhead for just an integer, and I
> imagine other things would like to pass back stuff too.
...

> You still want timeouts, creating/setting/destroying a timer just for
> a single call to select/poll/epoll is probably too heavy weight.

Take a look at what timerfd does and what posix timers has to do to 
implement the interface. You'll prolly stop trolling with things like "a 
lot of overhead" or "too heavy weight".



> timerfd() still leaves out the basic clock selection functionality
> provided by both setitimer() and timer_create().

That is coming as soon as I fixed my send-serie script ...




- Davide




Re: [PATCH] proc: maps protection

2007-03-10 Thread Kees Cook
On Sat, Mar 10, 2007 at 04:21:01PM -0800, Andrew Morton wrote:
> We'd be needing a changelog for that.

Done; sent separately from this email.

> Please update the procfs documentation.

Done.

> Does the patch also cover /proc/pid/smaps?

Yes, and numa_maps.

Thanks!

-- 
Kees Cook


Re: [patch 6/9] signalfd/timerfd v1 - timerfd core ...

2007-03-10 Thread Nicholas Miell
On Sat, 2007-03-10 at 16:35 -0800, Linus Torvalds wrote:
> 
> On Sat, 10 Mar 2007, Nicholas Miell wrote:
> > > 
> > > I'd actually much rather do POSIX timers the other way around: associate 
> > > a 
> > > generic notification mechanism with the file descriptor, and then 
> > > implement posix_timer_create() on top of timerfd. Now THAT sounds like a 
> > > clean unix-like interface ("everything is a file") and would imply that 
> > > you'd be able to do the same kind of notification for any file 
> > > descriptor, 
> > > not just timers.
> > > 
> > 
> > But timers aren't files or even remotely file-like
> 
> What do you think "a file" is?
> 
> In UNIX, a file descriptor is pretty much anything. You could say that 
> sockets aren't remotely file-like, and you'd be right. What's your point? 
> If you can read on it, it's a file.

Ah, I see. You're just interested in fds as a generic handle concept,
and not a more Plan 9 type thing.

If that's the goal, somebody should start thinking about reducing the
contents of struct file to the bare minimum (i.e. not much more than a
file_operations pointer).

> 
> And the real point of the whole signalfd() is that there really *are* a 
> lot of UNIX interfaces that basically only work with file descriptors. Not 
> just read, but select/poll/epoll.

It'd be useful if the polling interfaces could return small datums
beyond just the POLL* flags -- having to do a read on timerfd just to
get the overrun count has a lot of overhead for just an integer, and I
imagine other things would like to pass back stuff too.


> They currently have just one timeout, but the thing is, if UNIX had just 
> had "timer file descriptors", they'd not need even that one. And even with 
> the timeout, Davide's patch actually makes for a *better* timeout than the 
> ones provided by select/poll/epoll, exactly because you can do things like 
> repeating timers and absolute time etc.
> 
> Much more naturally than the timer interface we currently have for those 
> system calls.
> 

You still want timeouts, creating/setting/destroying a timer just for
a single call to select/poll/epoll is probably too heavy weight.

timerfd() still leaves out the basic clock selection functionality
provided by both setitimer() and timer_create().

> The same goes for signals. The whole "pselect()" thing shows that signals 
> really *should* have been file descriptors, and suddenly you don't need 
> "pselect()" at all.
> 
> So the "not remotely file-like" is not actually a real argument. One of 
> the big *points* of UNIX was that it unified a lot under the general 
> umbrella of a "file descriptor". Davide just unifies even more.
>
>   Linus
-- 
Nicholas Miell <[EMAIL PROTECTED]>



RSDL-mm 0.28

2007-03-10 Thread Matt Mackall
I've tested -mm2 against -mm2+noyield and -mm2+rsdl+noyield. The
noyield patch simply makes the sched_yield syscall return immediately.
Xorg and all tests are run at nice 0.

Loads:
 memload: constant memcpy of 16MB buffer
 execload: constant re-exec of a trivial shell script
 forkload: constant fork and exit of a trivial shell script
 make -j 5: hot-cache kernel build without ccache
 make -j 5 ccache: hot-cache kernel build with ccache

Tests:
 beryl - 3D window manager, wiggle windows, spin desktop, etc.
 galeon - web browser, rapidly scrolling long web pages by grabbing
the scroll bar
 mp3 - XMMS on a FUSE sshfs over wireless (during all tests)
 terminal - responsiveness of ssh and local terminal sessions
 mouse - responsiveness of mouse pointer

Results:
 great = completely smooth
 good = fully responsive
 ok = visible latency
 bad = becomes difficult to use (or mp3 skips)
 awful = make it stop, please

                 -mm2        -mm2+noyield   rsdl+noyield
no load
 beryl           great       great          great
 galeon          good        good           good
 mp3             good        good           good
 terminal        good        good           good
 mouse           good        good           good
memload x10
 beryl           awful/bad   great          good
 galeon          good        good           ok/good
 mp3             good        good           good
 terminal        good        good           good
 mouse           good        good           good
execload x10
 beryl           awful/bad   bad/good       good
 galeon          good        bad/good       ok/good
 mp3             good        bad            good
 terminal        good        bad/good       good
 mouse           good        bad/good       good
forkload x10
 beryl           good        good           great
 galeon          good        good           ok/good
 mp3             good        good           good
 terminal        good        good           ok/good
 mouse           good        good           good
make -j 5
 beryl           ok          good           good/great
 galeon          good        good           ok/good
 mp3             good        good           good
 terminal        good        good           good
 mouse           good        good           good
make -j 5 ccache
 beryl           ok          good           awful
 galeon          good        good           bad
 mp3             good        good           bad
 terminal        good        good           bad/ok
 mouse           good        good           bad/ok

make -j 5
 real            8m1.857s    8m50.659s      8m9.282s
 user            7m19.127s   8m3.494s       7m30.740s
 sys             0m30.910s   0m33.722s      0m29.542s

make -j 5 ccache
 real            2m6.182s    2m19.032s      2m1.832s
 user            1m39.466s   1m48.787s      1m37.250s
 sys             0m19.741s   0m22.993s      0m20.109s

There's a substantial performance hit for not yielding, so we probably
want to investigate alternate semantics for it. It seems reasonable
for apps to say "let me not hog the CPU" without completely expiring
them. Imagine you're in the front of the line (aka queue) and you
spend a moment fumbling for your wallet. The polite thing to do is to
let the next guy in front. But with the current sched_yield, you go
all the way to the back of the line.
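
To make the line analogy concrete, here is a toy model of the two
semantics. It is only a sketch: the run queue is a plain array of task
ids with the head at index 0, and yield_to_back and yield_one_place are
made-up names, not anything in the scheduler.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Semantics as described above for the current sched_yield: the task at
 * the head of the line goes all the way to the back.
 */
static void yield_to_back(int *queue, size_t n)
{
    int self;
    size_t i;

    if (n < 2)
        return;
    self = queue[0];
    for (i = 1; i < n; i++)
        queue[i - 1] = queue[i];    /* everyone else moves up one place */
    queue[n - 1] = self;
}

/*
 * Hypothetical "polite" alternative: let only the next task go in
 * front, and keep second place in the line.
 */
static void yield_one_place(int *queue, size_t n)
{
    int self;

    if (n < 2)
        return;
    self = queue[0];
    queue[0] = queue[1];
    queue[1] = self;
}
```

Starting from the order 1 2 3 4, yield_to_back leaves 2 3 4 1 while
yield_one_place leaves 2 1 3 4, so the yielding task waits behind one
task rather than behind all of them.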

RSDL makes back most of the noyield hit in a normal make, and then some
with ccache. Impressive. But ccache is still destroying interactivity
somehow. The ccache effect is fairly visible even with non-parallel
'make'.

Also note I could occasionally trigger nasty multi-second pauses with
-mm2+noyield under exectest that didn't show up elsewhere. That's
probably a bug in the mainline scheduler.

-- 
Mathematics is the supreme nostalgia of our time.


Re: [PATCH] proc: maps protection

2007-03-10 Thread Kees Cook
The /proc/pid/ "maps", "smaps", and "numa_maps" files contain sensitive 
information about the memory location and usage of processes.  Issues:

- maps should not be world-readable, especially if programs expect any 
  kind of ASLR protection from local attackers.
- maps cannot just be 0400 because "-D_FORTIFY_SOURCE=2 -O2" makes glibc
  check the maps when %n is in a *printf call, and a setuid(getuid()) 
  process wouldn't be able to read its own maps file.  (For reference
  see http://lkml.org/lkml/2006/1/22/150)
- a system-wide toggle is needed to allow prior behavior in the case of
  non-root applications that depend on access to the maps contents.

This change implements a check using "ptrace_may_attach" before allowing 
access to read the maps contents.  To control this protection, the new 
knob /proc/sys/kernel/maps_protect has been added, with corresponding 
updates to the procfs documentation.
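
The access decision itself is small. As a sketch only, it can be
modelled in userspace like this, with the result of ptrace_may_attach()
passed in as a plain boolean; sketch_maps_access is a made-up name for
the example, not a function in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Model of the check added to the show routines: when the maps_protect
 * knob is set, a reader that may not ptrace the target gets -EACCES.
 * With the knob clear, the prior behaviour (always allowed) is kept.
 */
static int sketch_maps_access(bool maps_protect, bool may_ptrace)
{
    if (maps_protect && !may_ptrace)
        return -EACCES;
    return 0;
}
```

Only one of the four combinations is refused: protection enabled and
the reader not allowed to ptrace the process.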

Signed-off-by: Kees Cook <[EMAIL PROTECTED]>
---
 CREDITS|2 +-
 Documentation/filesystems/proc.txt |7 +++
 fs/proc/base.c |3 +++
 fs/proc/internal.h |2 ++
 fs/proc/task_mmu.c |   16 +++-
 fs/proc/task_nommu.c   |6 ++
 include/linux/sysctl.h |1 +
 kernel/sysctl.c|9 +
 8 files changed, 44 insertions(+), 2 deletions(-)
---
diff --git a/CREDITS b/CREDITS
index 6bd8ab8..38c3ada 100644
--- a/CREDITS
+++ b/CREDITS
@@ -655,7 +655,7 @@ N: Kees Cook
 E: [EMAIL PROTECTED]
 W: http://outflux.net/
 P: 1024D/17063E6D 9FA3 C49C 23C9 D1BC 2E30  1975 1FFF 4BA9 1706 3E6D
-D: Minor updates to SCSI code for the Communications type
+D: Minor updates to SCSI types, added /proc/pid/maps protection
 S: (ask for current address)
 S: USA
 
diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index 5484ab5..d9b06b5 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1137,6 +1137,13 @@ determine whether or not they are still functioning 
properly.
 Because the NMI watchdog shares registers with oprofile, by disabling the NMI
 watchdog, oprofile may have more registers to utilize.
 
+maps_protect
+
+
+Enables/Disables the protection of the per-process proc entries "maps" and
+"smaps".  When enabled, the contents of these files are visible only to
+readers that are allowed to ptrace() the given process.
+
 
 2.4 /proc/sys/vm - The virtual memory subsystem
 ---
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 01f7769..6feccbc 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -123,6 +123,9 @@ struct pid_entry {
NULL, _info_file_operations,   \
{ .proc_read = _##OTYPE } )
 
+int maps_protect = 0;
+EXPORT_SYMBOL(maps_protect);
+
 static struct fs_struct *get_fs_struct(struct task_struct *task)
 {
struct fs_struct *fs;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index c932aa6..2c65b6e 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -33,6 +33,8 @@ do {  \
 extern int nommu_vma_show(struct seq_file *, struct vm_area_struct *);
 #endif
 
+extern int maps_protect;
+
 extern void create_seq_entry(char *name, mode_t mode, const struct 
file_operations *f);
 extern int proc_exe_link(struct inode *, struct dentry **, struct vfsmount **);
 extern int proc_tid_stat(struct task_struct *,  char *);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 7445980..45a0f3e 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -134,6 +134,9 @@ static int show_map_internal(struct seq_file *m, void *v, 
struct mem_size_stats
dev_t dev = 0;
int len;
 
+   if (maps_protect && !ptrace_may_attach(task))
+   return -EACCES;
+
if (file) {
struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
dev = inode->i_sb->s_dev;
@@ -444,11 +447,22 @@ const struct file_operations proc_maps_operations = {
 #ifdef CONFIG_NUMA
 extern int show_numa_map(struct seq_file *m, void *v);
 
+static int show_numa_map_checked(struct seq_file *m, void *v)
+{
+   struct proc_maps_private *priv = m->private;
+   struct task_struct *task = priv->task;
+
+   if (maps_protect && !ptrace_may_attach(task))
+   return -EACCES;
+   
+   return show_numa_map(m, v);
+}
+
 static struct seq_operations proc_pid_numa_maps_op = {
 .start  = m_start,
 .next   = m_next,
 .stop   = m_stop,
-.show   = show_numa_map
+.show   = show_numa_map_checked
 };
 
 static int numa_maps_open(struct inode *inode, struct file *file)
diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index 7cddf6b..c2747c9 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -143,6 +143,12 @@ out:
 static int show_map(struct seq_file *m, void *_vml)
 {

Re: [PATCH] proc: maps protection

2007-03-10 Thread Matt Mackall
On Sat, Mar 10, 2007 at 04:21:01PM -0800, Andrew Morton wrote:
> > On Sat, 10 Mar 2007 10:33:41 -0800 Kees Cook <[EMAIL PROTECTED]> wrote:
> > Here's another revision, with both the "can ptrace" and the global /proc 
> > knob;
> 
> We'd be needing a changelog for that.
> 
> Please update the procfs documentation.
> 
> Does the patch also cover /proc/pid/smaps?

Also, we ought to revisit /proc/pid/mem write, which is currently disabled.
Either drop the code, fix it, or make it root only.

-- 
Mathematics is the supreme nostalgia of our time.


Re: [ck] Re: RSDL v0.28 for 2.6.20

2007-03-10 Thread Thibaut VARENE

On 3/11/07, Thibaut VARENE <[EMAIL PROTECTED]> wrote:
> On 3/11/07, Con Kolivas <[EMAIL PROTECTED]> wrote:
> > Has anyone had any trouble with RSDL on the stable kernels (ie not -mm)?
>
> Tested fine so far on ppc, ia64 and (mostly) parisc.

I meant ppc64, actually.
Gomen.

--
Thibaut VARENE
http://www.parisc-linux.org/~varenet/


Re: [ck] Re: RSDL v0.28 for 2.6.20

2007-03-10 Thread Thibaut VARENE

On 3/11/07, Con Kolivas <[EMAIL PROTECTED]> wrote:
> Has anyone had any trouble with RSDL on the stable kernels (ie not -mm)?

Tested fine so far on ppc, ia64 and (mostly) parisc.

HTH

--
Thibaut VARENE
http://www.parisc-linux.org/~varenet/


Re: [PATCH] SCSI: Delete unused header file.

2007-03-10 Thread James Bottomley
On Sat, 2007-03-10 at 17:16 -0500, Robert P. J. Day wrote:
>   Delete apparently unused header file drivers/scsi/pci2000.h.

This was apparently missed by Christoph when he removed the driver ...
I'll add it to the queue.  For future SCSI work, could you cc
linux-scsi@vger.kernel.org please?  That way, any interested parties are
more likely to see the patch.

Thanks,

James



Re: Use of absolute timeouts for oneshot timers

2007-03-10 Thread Jeremy Fitzhardinge
Thomas Gleixner wrote:
> It's simply enforced in NO_HZ, HIGHRES mode as we operate in absolute
> time, which is read back from the clocksource, even if we use a relative
> value for real hardware clock event devices to program the next event.
> We calculate the delta between the absolute event and now. So we never
> get an accumulating error.
>
> What problem are you observing ?

Actually, two things.  There were the unexpected pauses during boot,
which is trivially fixable by not using the Xen periodic timer, and
using the single-shot fallback.

But I'm making the more general observation that if you use an absolute
rather than relative time to set the single-shot timeout, then you have
to deal with a long-term cumulative drift between the kernel's monotonic
time and the hypervisor's monotonic time.  This can happen even if your
clocksource is derived directly from the hypervisor monotonic time,
because running ntp will warp the kernel's time, and so it will drift
with respect to the hypervisor clock.  You can only avoid this by 1) not
allowing adjtime, or 2) making those same adjtime warps to the
hypervisor time.  Neither of these is a good general solution.

Therefore, the only useful way to set a single-shot timer is by using
relative rather than absolute time, and making sure the delta is not too
large.  The guest and hypervisor may (and in general, will) have
drifting clocks, but the error will never be too large to deal with.
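
A small arithmetic sketch of the difference, under the simplifying
assumption that the guest clock is just the hypervisor clock plus an
accumulated offset. The function names and the one-second cap
SKETCH_MAX_DELTA_NS are made up for the illustration; this is not the
Xen interface.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_DELTA_NS INT64_C(1000000000)   /* hypothetical cap: 1s */

/*
 * Absolute programming: the guest's deadline is handed over unchanged
 * and interpreted on the hypervisor's clock, so the whole accumulated
 * offset between the two clocks shows up as timer error.
 */
static int64_t fire_time_absolute(int64_t guest_deadline_ns)
{
    return guest_deadline_ns;
}

/*
 * Relative programming: compute the delta on the guest clock, clamp
 * it, and add it to the hypervisor's "now". Only the drift that occurs
 * during the (bounded) delta can show up as error.
 */
static int64_t fire_time_relative(int64_t guest_deadline_ns,
                                  int64_t guest_now_ns,
                                  int64_t hv_now_ns)
{
    int64_t delta = guest_deadline_ns - guest_now_ns;

    if (delta < 0)
        delta = 0;
    if (delta > SKETCH_MAX_DELTA_NS)
        delta = SKETCH_MAX_DELTA_NS;
    return hv_now_ns + delta;
}
```

With the guest clock 5s ahead of the hypervisor clock and a deadline
10ms in the future, the absolute variant fires 5s late on the
hypervisor clock while the relative variant fires after 10ms.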

J


Re: [patch 6/9] signalfd/timerfd v1 - timerfd core ...

2007-03-10 Thread Linus Torvalds


On Sat, 10 Mar 2007, Nicholas Miell wrote:
> > 
> > I'd actually much rather do POSIX timers the other way around: associate a 
> > generic notification mechanism with the file descriptor, and then 
> > implement posix_timer_create() on top of timerfd. Now THAT sounds like a 
> > clean unix-like interface ("everything is a file") and would imply that 
> > you'd be able to do the same kind of notification for any file descriptor, 
> > not just timers.
> > 
> 
> But timers aren't files or even remotely file-like

What do you think "a file" is?

In UNIX, a file descriptor is pretty much anything. You could say that 
sockets aren't remotely file-like, and you'd be right. What's your point? 
If you can read on it, it's a file.

And the real point of the whole signalfd() is that there really *are* a 
lot of UNIX interfaces that basically only work with file descriptors. Not 
just read, but select/poll/epoll.

They currently have just one timeout, but the thing is, if UNIX had just 
had "timer file descriptors", they'd not need even that one. And even with 
the timeout, Davide's patch actually makes for a *better* timeout than the 
ones provided by select/poll/epoll, exactly because you can do things like 
repeating timers and absolute time etc.

Much more naturally than the timer interface we currently have for those 
system calls.

The same goes for signals. The whole "pselect()" thing shows that signals 
really *should* have been file descriptors, and suddenly you don't need 
"pselect()" at all.

So the "not remotely file-like" is not actually a real argument. One of 
the big *points* of UNIX was that it unified a lot under the general 
umbrella of a "file descriptor". Davide just unifies even more.

Linus


Re: [patch 6/9] signalfd/timerfd v1 - timerfd core ...

2007-03-10 Thread Nicholas Miell
On Sat, 2007-03-10 at 14:42 -0800, Linus Torvalds wrote:
> 
> On Sat, 10 Mar 2007, Nicholas Miell wrote:
> > 
> > Care to elaborate on why they're a horrible crock?
> 
> It's a *classic* case of an interface that tries to do everything under 
> the sun.
> 
> Here's a clue: look at any system call that takes a union as part of its 
> arguments. Count them. I think we have two:
>  - struct siginfo

No argument here -- just about everything related to signals is stupidly
complex.

>  - struct sigevent

However, this I take issue with.

Conceptually (and what the user ends up actually using), struct sigevent
is just:

struct sigevent
{
    int sigev_notify;                        /* delivery method */
    sigval_t sigev_value;                    /* user cookie */
    int sigev_signo;                         /* signal number */
    void (*sigev_notify_function)(sigval_t); /* thread fn */
    pthread_attr_t *sigev_notify_attributes; /* thread attr */
};

You could complain about sigval_t being a union, but that's probably
just because it predates uintptr_t. (Plus, no ugly casting.)

You also could complain that the above isn't what you actually see when
you look at /usr/include/bits/siginfo.h -- there's a union involved and
some macros to hide the fact, but that's just internal implementation
details related to how threads are created and padding out the struct
for any future expansion. 

The actual complexity for understanding and using struct sigevent isn't
all that much, and once you've figured that out, you know how to
configure event delivery for AIO completion, DNS resolution, and
message queues, not just timers.

> and they are both broken horrible interfaces where the data structures 
> depend on various flags.
> 
> It's just not the UNIX system call way. And none of it really makes sense 
> if you already have a file descriptor, since at that point you know what 
> the notification mechanism is.
> 
> I'd actually much rather do POSIX timers the other way around: associate a 
> generic notification mechanism with the file descriptor, and then 
> implement posix_timer_create() on top of timerfd. Now THAT sounds like a 
> clean unix-like interface ("everything is a file") and would imply that 
> you'd be able to do the same kind of notification for any file descriptor, 
> not just timers.
> 

But timers aren't files or even remotely file-like -- if they were
real files, you could just
open /dev/timers/realtime/2007/June/3rd/half-past-teatime and get a
timer. (Or, more realistically, open /dev/timer and use ioctl().)

timerfd() had to be created to coerce them into some semblance of
filehood just to make them work with existing (and new) polling/queuing
interfaces just because those interfaces can only deal with file
descriptors.

Making non-file things look like files just because that's what poll()
and friends can deal with isn't much different from holding a hammer in
your hand and looking for what you have to do in order to turn every
problem into a nail.

Sometimes you need to go back to your toolbox for a screwdriver or a
saw.


> But posix timers as they are done now are just an abomination. They are 
> not unix-like at all.
> 
> > And are the bugs fixed? If so, why replace them? They work now.
> 
> .. but the reason for the bugs was largely a very baroque interface, which 
> didn't get fixed (because it's specified by the standard).
>

But the API isn't baroque.

There's a veritable boutique of clock sources to choose from, but they
all serve specific needs, it's just one parameter to timer_create, and
you probably want CLOCK_MONOTONIC anyway.

struct sigevent might be a bit complex, but the difficulty in learning
that is amortized across all the other APIs that also use it to specify
how their events are delivered.

Delivering via signals and dealing with struct siginfo is painful, but
everything related to signals is painful. This is what you get when you
take an interface designed essentially for exception handling and start
abusing it for general information delivery. But, hey!, that's what
SIGEV_THREAD and SIGEV_PORT are for.[1]

About the worst that can be said of it is that using timer_settime to
both arm and disarm the timer and set the interval is awkward.
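
To make that last point concrete, here is a minimal sketch of how one
struct itimerspec ends up carrying all three jobs. The helper names
timer_arm_spec() and timer_disarm_spec() are made up for illustration;
only struct itimerspec and the timer_settime() call mentioned afterwards
are real POSIX.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Hypothetical helper: build the value that arms a timer. The first
 * expiry is first_ns from now; interval_ns == 0 means one-shot. */
static struct itimerspec timer_arm_spec(long long first_ns, long long interval_ns)
{
    struct itimerspec its;

    /* A zero it_value would disarm instead of arm, so reject it here. */
    assert(first_ns > 0 && interval_ns >= 0);
    its.it_value.tv_sec = first_ns / 1000000000LL;
    its.it_value.tv_nsec = first_ns % 1000000000LL;
    its.it_interval.tv_sec = interval_ns / 1000000000LL;
    its.it_interval.tv_nsec = interval_ns % 1000000000LL;
    return its;
}

/* Hypothetical helper: build the value that disarms a timer, which is
 * simply an it_value of zero. */
static struct itimerspec timer_disarm_spec(void)
{
    struct itimerspec its = { {0, 0}, {0, 0} };
    return its;
}
```

Either result would be passed as timer_settime(id, 0, &spec, NULL) for a
timer id obtained from timer_create(). Whether that call arms or disarms
depends only on whether it_value is zero, which is the awkward part.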






[1] A SIGEV_FUNCTION which skips all the signal baggage and just passes
a supplied cookie and a purpose-specific struct pointer to an
object-specific user-supplied function pointer might be interesting, but
then you run into all of the reentrancy/masking/choosing which thread to
deliver to and other issues that signals already have without the
benefit of the existing signal infrastructure for all that stuff. Gah, I
don't want to think about this anymore.

-- 
Nicholas Miell <[EMAIL PROTECTED]>


Re: [PATCH] proc: maps protection

2007-03-10 Thread Andrew Morton
> On Sat, 10 Mar 2007 10:33:41 -0800 Kees Cook <[EMAIL PROTECTED]> wrote:
> Here's another revision, with both the "can ptrace" and the global /proc 
> knob;

We'd be needing a changelog for that.

Please update the procfs documentation.

Does the patch also cover /proc/pid/smaps?


[PATCH] MMC: Clean up low voltage range handling

2007-03-10 Thread Philip Langdale
Clean up the handling of low voltage MMC cards.

The latest MMC and SD specs both agree that the low
voltage range is defined as 1.65-1.95V and is signified
by bit 7 in the OCR. An old Sandisk spec implied that
bits 7-0 represented voltages below 2.0V in 1V increments,
and the code was accordingly written with that expectation.

This confusion meant that host drivers attempting to support
the typical low voltage (1.8V) would set the wrong bits in
the host OCR mask (usually bits 5 and 6), resulting in the
low voltage mode never being used.

This change switches the code to conform to the specs and
fixes the SDHCI driver. It also removes the explicit
defines for the host vdd and updates the SDHCI driver
to convert the bit number back to the mask value
for comparisons. Having only a single set of defines
ensures there's nothing to get out of sync.

Signed-off-by: Philip Langdale <[EMAIL PROTECTED]>
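
The bit handling described in the changelog can be sketched in plain C
as follows. This is an illustration only, not the kernel code: the
function and macro names are made up, and the only values taken from
the text are that bits 0-6 are undefined and bit 7 is the 1.65-1.95V
range.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants, named here for the sketch only. */
#define OCR_RESERVED_LOW 0x0000007Fu   /* bits 0-6: below the defined range */
#define OCR_VDD_165_195  0x00000080u   /* bit 7: low voltage range */

/* Drop undefined bits, intersect with what the host offers, and return
 * the bit number of the lowest usable range, or -1 if there is none. */
static int select_vdd_bit(uint32_t card_ocr, uint32_t host_ocr_avail)
{
    uint32_t ocr = card_ocr & ~OCR_RESERVED_LOW;
    int bit;

    ocr &= host_ocr_avail;
    for (bit = 0; bit < 32; bit++)
        if (ocr & ((uint32_t)1 << bit))
            return bit;
    return -1;
}

/* Convert a stored bit number back to its OCR mask, in the spirit of
 * the "1 << power" comparison in the SDHCI hunk. */
static uint32_t vdd_bit_to_mask(int bit)
{
    assert(bit >= 0 && bit < 32);
    return (uint32_t)1 << bit;
}
```

With a host mask that sets only bits 5 and 6, select_vdd_bit() finds
nothing in common with a card advertising bit 7, which is the failure
the changelog describes.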

diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index c87ce56..74ebd97 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -317,6 +317,24 @@ static u32 mmc_select_voltage(struct mmc
 {
int bit;

+   /*
+* Sanity check the voltages that the card claims to
+* support.
+*/
+   if (ocr & 0x7F) {
+   printk("%s: card claims to support voltages below "
+  "the defined range. These will be ignored.\n",
+  mmc_hostname(host));
+   ocr &= ~0x7F;
+   }
+
+   if (host->mode == MMC_MODE_SD && (ocr & MMC_VDD_165_195)) {
+   printk("%s: SD card claims to support the incompletely "
+  "defined 'low voltage range'. This will be ignored.\n",
+  mmc_hostname(host));
+   ocr &= ~MMC_VDD_165_195;
+   }
+
ocr &= host->ocr_avail;

bit = ffs(ocr);
diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index 86d0957..a80c043 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -668,20 +668,16 @@ static void sdhci_set_power(struct sdhci

pwr = SDHCI_POWER_ON;

-   switch (power) {
-   case MMC_VDD_170:
-   case MMC_VDD_180:
-   case MMC_VDD_190:
+   switch (1 << power) {
+   case MMC_VDD_165_195:
pwr |= SDHCI_POWER_180;
break;
-   case MMC_VDD_290:
-   case MMC_VDD_300:
-   case MMC_VDD_310:
+   case MMC_VDD_29_30:
+   case MMC_VDD_30_31:
pwr |= SDHCI_POWER_300;
break;
-   case MMC_VDD_320:
-   case MMC_VDD_330:
-   case MMC_VDD_340:
+   case MMC_VDD_32_33:
+   case MMC_VDD_33_34:
pwr |= SDHCI_POWER_330;
break;
default:
@@ -1293,7 +1289,7 @@ static int __devinit sdhci_probe_slot(st
if (caps & SDHCI_CAN_VDD_300)
mmc->ocr_avail |= MMC_VDD_29_30|MMC_VDD_30_31;
if (caps & SDHCI_CAN_VDD_180)
-   mmc->ocr_avail |= MMC_VDD_17_18|MMC_VDD_18_19;
+   mmc->ocr_avail |= MMC_VDD_165_195;

if (mmc->ocr_avail == 0) {
printk(KERN_ERR "%s: Hardware doesn't report any "
diff --git a/include/linux/mmc/host.h b/include/linux/mmc/host.h
index 43bf6a5..89dbb91 100644
--- a/include/linux/mmc/host.h
+++ b/include/linux/mmc/host.h
@@ -16,30 +16,7 @@ struct mmc_ios {
unsigned int clock;  /* clock rate */
unsigned short  vdd;

-#define MMC_VDD_150 0
-#define MMC_VDD_155 1
-#define MMC_VDD_160 2
-#define MMC_VDD_165 3
-#define MMC_VDD_170 4
-#define MMC_VDD_180 5
-#define MMC_VDD_190 6
-#define MMC_VDD_200 7
-#define MMC_VDD_210 8
-#define MMC_VDD_220 9
-#define MMC_VDD_230 10
-#define MMC_VDD_240 11
-#define MMC_VDD_250 12
-#define MMC_VDD_260 13
-#define MMC_VDD_270 14
-#define MMC_VDD_280 15
-#define MMC_VDD_290 16
-#define MMC_VDD_300 17
-#define MMC_VDD_310 18
-#define MMC_VDD_320 19
-#define MMC_VDD_330 20
-#define MMC_VDD_340 21
-#define MMC_VDD_350 22
-#define MMC_VDD_360 23
+/* vdd stores the bit number of the selected voltage range from protocol.h */

unsigned char   bus_mode;   /* command output mode */

@@ -88,14 +65,7 @@ struct mmc_host {
unsigned int f_max;
u32 ocr_avail;

-#define MMC_VDD_145_150 0x0001  /* VDD voltage 1.45 - 1.50 */
-#define MMC_VDD_150_155 0x0002  /* VDD voltage 1.50 - 1.55 */
-#define MMC_VDD_155_160 0x0004  /* VDD voltage 1.55 - 1.60 */
-#define MMC_VDD_160_165 0x0008  /* VDD voltage 1.60 - 1.65 */
-#define MMC_VDD_165_1700x0010  /* VDD voltage 

Re: IP Defragmentation

2007-03-10 Thread Jan Engelhardt

On Mar 8 2007 11:45, Kanhu Rauta wrote:
>
> 1>in case of fragmention i am getting only one packet at the
> hook,While analyzing the ip header it says this is the assembled
> packet(skb->len=1528,offset=0,MF=0).

conntrack reassembles fragmented packets.

> While dumping the data(for 0 to 1528 print skb->data[i]) it shows that
> only 1472 bytes are valid data and rest 28 bytes are something
> garbage.

Have you forgotten to use skb_header_pointer()?



Jan
-- 


sched rsdl fix for 0.28

2007-03-10 Thread Con Kolivas
Here's a big bugfix for sched rsdl 0.28

---
 kernel/sched.c |7 +++
 1 file changed, 7 insertions(+)

Index: linux-2.6.21-rc3-mm2/kernel/sched.c
===
--- linux-2.6.21-rc3-mm2.orig/kernel/sched.c 2007-03-11 11:04:38.0 +1100
+++ linux-2.6.21-rc3-mm2/kernel/sched.c 2007-03-11 11:05:46.0 +1100
@@ -3328,6 +3328,13 @@ static inline void rotate_runqueue_prior
int new_prio_level, remaining_quota = rq_quota(rq, rq->prio_level);
struct prio_array *array = rq->active;
 
+   /*
+* Make sure we don't have tasks still on the active array that
+* haven't run due to not preempting (merging or smp balancing)
+*/
+   if (find_next_bit(rq->dyn_bitmap, MAX_PRIO, MAX_RT_PRIO) <
+   rq->prio_level)
+   return;
if (rq->prio_level > MAX_PRIO - 2) {
/* Major rotation required */
struct prio_array *new_queue = rq->expired;

-- 
-ck


Re: Use of absolute timeouts for oneshot timers

2007-03-10 Thread Jeremy Fitzhardinge
Thomas Gleixner wrote:
> The clocksource is not used until the clocksource is installed. Also the
> periodic mode during boot, when the clock event device supports periodic
> mode, is not reading the time. It relies on the clock event device
> getting it straight.

Yes.  This could be one source of error, where I compute the offset
hypervisor_time - ktime_get(), but ktime_get() may drift with respect to
hypervisor time while using a periodic jiffies timebase.

> Once we switch to NO_HZ or HIGHRES the clock event device is directly
> coupled to the clock event source.
>   
OK.  Erm, but not in the sense that you always choose the xen/hpet/lapic
clocksource+clockevent together; there's no direct linkage between the
two kinds of device.  But there's the coupling where the clocksource is
always used to directly measure the clockevent's behaviour.

> Once we switched over to the clocksource, everything should be in
> perfect sync.
>   

Assuming that the clocksource and the clockevent device have
close-enough timebases.

>> Or perhaps this is a property of the whole clock subsystem: that
>> clockevents must be paired with clocksources.  But its not obvious to me
>> that this enforced, or even acknowledged.
>> 
>
> It's simply enforced in NO_HZ, HIGHRES mode as we operate in absolute
> time, which is read back from the clocksource, even if we use a relative
> value for real hardware clock event devices to program the next event.
> We calculate the delta between the absolute event and now. So we never
> get an accumulating error.
>   

Right, but suppose the clocksource and the clockevent devices have a
relative drift.  If we use the clocksource to compute that we need a
500ns delay, but the clockevent device ends up delivering the oneshot
event 750ns (or 250ns) later, then things are going to be locally upset,
even if the next time the clockevent oneshot is programmed it will take
the overshoot into account.  (Of course, you'd hope the drift would never
really be that bad, and 2^32 ns only gives you a ~4s window to screw up.)

> What problem are you observing ?
>   

Unexpected pauses during boot.  I think the real problem is that Xen
periodic timer events are not delivered unless the vcpu is actually
running (ie, they're specifically intended for timeslicing rather than
general periodic events).  Perhaps the real fix in this case is to just
remove the periodic feature flag.


J


Re: RSDL v0.28 for 2.6.20

2007-03-10 Thread Con Kolivas
On Sunday 11 March 2007 06:11, Willy Tarreau wrote:
> On Sat, Mar 10, 2007 at 01:09:35PM -0500, Stephen Clark wrote:
> > Con Kolivas wrote:
> > >Here is an update for RSDL to version 0.28
> > >
> > >Full patch:
> > >http://ck.kolivas.org/patches/staircase-deadline/2.6.20-sched-rsdl-0.28.
> > >patch
> > >
> > >Series:
> > >http://ck.kolivas.org/patches/staircase-deadline/2.6.20/
> > >
> > >The patch to get you from 0.26 to 0.28:
> > >http://ck.kolivas.org/patches/staircase-deadline/2.6.20/sched-rsdl-0.26-
> > >0.28.patch
> > >
> > >A similar patch and directories will be made for 2.6.21-rc3 without
> > >further announcement
> >
> > doesn't apply against 2.6.20.2:
> >
> > patch -p1 <~/2.6.20-sched-rsdl-0.28.patch --dry-run
> > patching file include/linux/list.h
> > patching file fs/proc/array.c
> > patching file fs/pipe.c
> > patching file include/linux/sched.h
> > patching file include/asm-generic/bitops/sched.h
> > patching file include/asm-s390/bitops.h
> > patching file kernel/sched.c
> > Hunk #41 FAILED at 3531.
> > 1 out of 62 hunks FAILED -- saving rejects to file kernel/sched.c.rej
> > patching file include/linux/init_task.h
> > patching file Documentation/sched-design.txt
>
> It is easier to apply 2.6.20.2 on top of 2.6.20+RSDL. The .2 patch
> is a one-liner that you can easily fix by hand, and I'm not even
> certain that it is still required :
>
> --- ./kernel/sched.c.orig 2007-03-10 13:03:51 +0100
> +++ ./kernel/sched.c  2007-03-10 13:08:02 +0100
> @@ -3544,7 +3544,7 @@
>   next = list_entry(queue->next, struct task_struct, run_list);
>   }
>
> - if (dependent_sleeper(cpu, rq, next))
> + if (rq->nr_running == 1 && dependent_sleeper(cpu, rq, next))
>   next = rq->idle;
>  switch_tasks:
>   if (next == rq->idle)
>
> BTW, Con, I think that you should base your work on 2.6.20.[23] and not
> 2.6.20 next time, due to this conflict. It will get wider adoption.

Gotcha. This bugfix for 2.6.20.2 was controversial anyway so it probably won't 
hurt if you don't apply it.

Has anyone had any trouble with RSDL on the stable kernels (ie not -mm)?

-- 
-ck


Re: PROBLEM: "Make nenuconfig" does not save parameters.

2007-03-10 Thread Jan Engelhardt

On Mar 10 2007 23:45, Sam Ravnborg wrote:
>> >On Sat, Mar 10, 2007 at 07:23:41PM +0100, Jan Engelhardt wrote:
>> >> 
>> >> Whether the 'working config file path' should change when you do
>> >> 'Save as Alternate' or not, is a menuconfig axiom. Ask Sam Ravnborg
>> >> if you want it changed :-)
>> >
>> >Current behaviour is not logical but on the other hand I do not
>> >see a big need to make it so.
>> >It seems that people very seldom uses "save alternate" anyway.
>> >
>> >But patches are welcome.
>> 
>> ^_^ The patch has already been posted, has not it?
>
>No.

http://lkml.org/lkml/2007/3/10/163 ? Not that I have tried it personally.

>Either we keep current behaviour or we change to the "normal"
>behaviour with a "Save as..." as know from all other programs.


Jan
-- 


Re: 2.6.21-rc3-mm1 RSDL results

2007-03-10 Thread Con Kolivas
On Sunday 11 March 2007 10:34, Con Kolivas wrote:
> On Sunday 11 March 2007 05:21, Mark Lord wrote:
> > Con Kolivas wrote:
> > > On Saturday 10 March 2007 05:07, Mark Lord wrote:
> > >> Mmm.. when it's good, it's *really* good.
> > >> My desktop feels snappier and all of that.
> > >
> > >..
> > >
> > >> But when it's bad, it stinks.
> > >> Like when a "make -j2" kernel rebuild is happening in a background
> > >> window
> > >
> > > And that's bad. When you say "it stinks" is it more than 3 times
> > > slower? It should be precisely 3 times slower under that load (although
> > > low cpu using things like audio wont be affected by running 3 times
> > > slower). If it feels like much more than that much slower, there is a
> > > bug there somewhere.
> >
> > Scrolling windows is incredibly jerkey, and very very sluggish
> > when images are involved (eg. a large web page in firefox).
> >
> > > As another reader suggested, how does it run with the compile 'niced'?
> > > How does it perform with make (without a -j number).
> >
> > Yes, it behaves itself when the "make -j2" is nice'd.
> >
> > >> This is on a Pentium-M 760 single-core, w/2GB SDRAM (notebook).
> > >
> > > What HZ are you running? Are you running a Beryl desktop?
> >
> > HZ==1000, NO_HZ, Kubunutu Dapper Drake distro, ATI X300 open-source X.org
> > driver.
>
> Can you try the new version of RSDL. Assuming it doesn't oops on you it has
> some accounting bugfixes which may have been biting you.

Oh I just checked the mesa repo for that driver as well. It seems the r300 
drivers have sched_yield in them as well, but not all components. You may be 
getting bitten by this too.

http://webcvs.freedesktop.org/mesa/Mesa/src/mesa/drivers/dri/r300/radeon_ioctl.c?revision=1.14=markup

I don't really know what the radeon and other models are so I'm not sure if it 
applies to your hardware; I just did a random search through the r300 
directory.

-- 
-ck


Re: Use of absolute timeouts for oneshot timers

2007-03-10 Thread Thomas Gleixner
On Sat, 2007-03-10 at 14:52 -0800, Jeremy Fitzhardinge wrote:
> When booting under Xen, you'll get this if you're using both the xen
> clocksource and clockevent drivers.  However, it seems that during boot
> on a NO_HZ HIGHRES_TIMERS system, the kernel does not use the Xen
> clocksource until it switches to highres timer mode.  This means that
> during boot the kernel's monotonic clock is drifting with respect to the
> hypervisor, and all timeouts are unreliable.

The clocksource is not used until the clocksource is installed. Also the
periodic mode during boot, when the clock event device supports periodic
mode, is not reading the time. It relies on the clock event device
getting it straight. That's not a big deal during boot and on a kernel
with NO_HZ=n and HIGHRES=n the periodic tick only updates jiffies. If
the only clocksource is jiffies, then we have to live with it and we do
not switch to NO_HZ/HIGHRES as we would lose track of time.

Once we switch to NO_HZ or HIGHRES the clock event device is directly
coupled to the clock event source.

> Initially I was just computing the kernel-hypervisor offset at boot
> time, but then I changed it to recompute it every time the timer mode
> changes.  However, this didn't really help, and I was still getting
> unpredictable timeouts during boot.  I've changed it to just compute the
> hypervisor absolute time directly using the delta each time the oneshot
> timer is set, which will definitely be reliable (if the kernel and
> hypervisor have drifting timebases then the meaning of Xns delta will be
> different, but at least thats a local error rather than a long-term
> cumulative error).

We do not really care up to the point, where the high resolution
clocksource (e.g. TSC, PM-Timer or HPET on real hardware) becomes
active. Early boot is fragile and we switch over to high res clocksource
and highres/nohz when things have stabilized. 

> My analysis might be wrong here (I suspect the Xen periodic timer may
> have unexpected behaviour), but the overall conclusion still stands:
> using an absolute timeout only works if the kernel and hypervisor have
> non-drifting timebases.  I think its too fragile for a clockevent
> implementation to assume that a particular clocksource is in use to get
> reliable results.

Once we switched over to the clocksource, everything should be in
perfect sync.

> Or perhaps this is a property of the whole clock subsystem: that
> clockevents must be paired with clocksources.  But its not obvious to me
> that this enforced, or even acknowledged.

It's simply enforced in NO_HZ, HIGHRES mode as we operate in absolute
time, which is read back from the clocksource, even if we use a relative
value for real hardware clock event devices to program the next event.
We calculate the delta between the absolute event and now. So we never
get an accumulating error.
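
A small simulation may help show why the delta-from-absolute scheme does
not accumulate error. This is a sketch under simplifying assumptions
(the device always fires a fixed late_ns late, and late_ns is smaller
than the period); the function names are invented and nothing here is
the actual clockevents code.

```c
#include <assert.h>
#include <stdint.h>

/* Delta handed to a relative-only event device for an absolute expiry.
 * An expiry already in the past is clamped to "as soon as possible". */
static int64_t delta_to_program(int64_t expires_ns, int64_t now_ns)
{
    int64_t delta = expires_ns - now_ns;
    return delta > 0 ? delta : 0;
}

/* n periodic events, each programmed from an absolute expiry.
 * Returns the clocksource time at which the n-th event is handled. */
static int64_t run_absolute(int n, int64_t period_ns, int64_t late_ns)
{
    int64_t now = 0, expires = 0;

    assert(n >= 0 && period_ns > 0 && late_ns >= 0);
    for (int i = 0; i < n; i++) {
        expires += period_ns;                             /* absolute target */
        now += delta_to_program(expires, now) + late_ns;  /* fires late */
    }
    return now;
}

/* Same device, but each event is programmed as "one period from now",
 * so every late delivery pushes all later events back. */
static int64_t run_relative(int n, int64_t period_ns, int64_t late_ns)
{
    int64_t now = 0;

    assert(n >= 0 && period_ns > 0 && late_ns >= 0);
    for (int i = 0; i < n; i++)
        now += period_ns + late_ns;
    return now;
}
```

In the absolute variant the n-th event is late by late_ns no matter how
large n gets; in the relative variant it is late by n * late_ns.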

What problem are you observing ?

tglx




Re: 2.6.21-rc3-mm1 RSDL results

2007-03-10 Thread Con Kolivas
On Sunday 11 March 2007 05:21, Mark Lord wrote:
> Con Kolivas wrote:
> > On Saturday 10 March 2007 05:07, Mark Lord wrote:
> >> Mmm.. when it's good, it's *really* good.
> >> My desktop feels snappier and all of that.
> >
> >..
> >
> >> But when it's bad, it stinks.
> >> Like when a "make -j2" kernel rebuild is happening in a background
> >> window
> >
> > And that's bad. When you say "it stinks" is it more than 3 times slower?
> > It should be precisely 3 times slower under that load (although low cpu
> > using things like audio wont be affected by running 3 times slower). If
> > it feels like much more than that much slower, there is a bug there
> > somewhere.
>
> Scrolling windows is incredibly jerkey, and very very sluggish
> when images are involved (eg. a large web page in firefox).
>
> > As another reader suggested, how does it run with the compile 'niced'?
> > How does it perform with make (without a -j number).
>
> Yes, it behaves itself when the "make -j2" is nice'd.
>
> >> This is on a Pentium-M 760 single-core, w/2GB SDRAM (notebook).
> >
> > What HZ are you running? Are you running a Beryl desktop?
>
> HZ==1000, NO_HZ, Kubunutu Dapper Drake distro, ATI X300 open-source X.org
> driver.

Can you try the new version of RSDL? Assuming it doesn't oops on you, it 
fixes some accounting bugs which may have been biting you.

Thanks
-- 
-ck


[RFC] Configuration generic drivers at runtime

2007-03-10 Thread Laurent Pinchart
Hi everybody,

I'm writing a Linux driver for USB Video Class (UVC) devices. Before 
submitting it to the kernel, there are still a few rough corners I'd like to 
polish. Comments would be appreciated for the following one.

The UVC spec defines a way for device vendors to provide extensions to the 
standard through so-called extension units, identified by a GUID (Globally 
Unique IDentifier). An extension unit can define any number of controls 
(think of controls as simple parameters such as brightness, zoom, pan/tilt, 
shutter speed, ...). Devices advertise in their USB descriptors the extension 
units they support, along with the controls that are supported in each 
extension unit.

To access those extension units from user-space, the UVC driver will offer two 
methods. One of them will map the controls defined by extension units to V4L2 
controls. The question that arises is how to define and store those mappings.

An obvious solution would be to have an ever-growing array in the driver, 
storing control information for all possible extension units ever defined by 
webcam vendors. While this is quite straightforward, it might not be the most 
usable solution for device vendors who wouldn't want debug controls to be 
included in the kernel by default, or who wouldn't want to submit new control 
definitions for inclusion in the kernel (with the implied delay) every time a 
new device comes out.

Another solution would be to introduce a way to define controls and mappings 
at runtime. Mappings would be stored in text-based user-space configuration 
files, distributed by vendors. A small user-space utility would add them 
through a few ioctls. This obviously raises some security concerns (regarding 
which users will be allowed to add mappings, or how many of them they can 
add).
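
As a rough illustration of the second solution, a runtime mapping table
with a hard cap might look like the sketch below. Everything in it is
hypothetical: the struct layout, the names and the limit of 4 are
placeholders chosen for the example, not part of the UVC driver or of
V4L2.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_DYNAMIC_MAPPINGS 4   /* hypothetical cap on runtime-added entries */

/* Hypothetical record: one extension-unit control -> one V4L2 control. */
struct xu_mapping {
    uint8_t  guid[16];   /* extension unit GUID from the USB descriptors */
    uint8_t  selector;   /* control selector within that unit */
    uint32_t v4l2_id;    /* control ID exposed to applications */
};

static struct xu_mapping mappings[MAX_DYNAMIC_MAPPINGS];
static int mapping_count;

/* Return the table index of a (GUID, selector) pair, or -1 if absent. */
static int find_mapping(const uint8_t guid[16], uint8_t selector)
{
    for (int i = 0; i < mapping_count; i++)
        if (mappings[i].selector == selector &&
            memcmp(mappings[i].guid, guid, 16) == 0)
            return i;
    return -1;
}

/* Add a mapping; returns 0 on success, -1 if the table is full or the
 * pair is already mapped. */
static int add_mapping(const struct xu_mapping *m)
{
    assert(m != NULL);
    if (mapping_count >= MAX_DYNAMIC_MAPPINGS ||
        find_mapping(m->guid, m->selector) >= 0)
        return -1;
    mappings[mapping_count++] = *m;
    return 0;
}

/* Look up the control ID for a pair; 0 means "not mapped" in this sketch. */
static uint32_t lookup_v4l2_id(const uint8_t guid[16], uint8_t selector)
{
    int i = find_mapping(guid, selector);
    return i >= 0 ? mappings[i].v4l2_id : 0;
}
```

The cap and the duplicate check are where the security questions above
would have to be answered: who may call add_mapping(), and how many
entries they may create.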

I would like comments regarding the second solution. Is this something that is 
likely to be accepted in the mainline kernel? I don't know of any other 
Linux driver implementing this kind of dynamic runtime configuration.

Best regards,

Laurent Pinchart


Use of absolute timeouts for oneshot timers

2007-03-10 Thread Jeremy Fitzhardinge
I've been thinking a bit more about how useful an absolute timeout is
for a oneshot timer in a virtual environment.

In principle, absolute times are generally preferable.  A relative
timeout means "timeout in X ns from now", but the meaning of "now" is
ambiguous, particularly if the vcpu can be preempted at any time, which
means the determination of "now" can be arbitrarily deferred.

However, an absolute time is only meaningful if the kernel and
hypervisor are operating off the same timebase (ie, no drift).  In
general, the kernel's monotonic timer is going to start from 0ns when
the virtual machine is booted, and the hypervisor's is going to start at
0ns when the hypervisor is booted.  If they're operating off the same
timebase, then in principle you can work out a constant offset between
the two, and use that for converting a kernel absolute time into a
hypervisor absolute time.

When booting under Xen, you'll get this if you're using both the xen
clocksource and clockevent drivers.  However, it seems that during boot
on a NO_HZ HIGHRES_TIMERS system, the kernel does not use the Xen
clocksource until it switches to highres timer mode.  This means that
during boot the kernel's monotonic clock is drifting with respect to the
hypervisor, and all timeouts are unreliable.

Initially I was just computing the kernel-hypervisor offset at boot
time, but then I changed it to recompute it every time the timer mode
changes.  However, this didn't really help, and I was still getting
unpredictable timeouts during boot.  I've changed it to just compute the
hypervisor absolute time directly using the delta each time the oneshot
timer is set, which will definitely be reliable (if the kernel and
hypervisor have drifting timebases then the meaning of Xns delta will be
different, but at least that's a local error rather than a long-term
cumulative error).
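
The two conversions can be compared with a few lines of arithmetic. This
is only a sketch with invented function names and made-up numbers; it is
not the Xen clockevent code.

```c
#include <assert.h>
#include <stdint.h>

/* Constant-offset conversion: valid only while the kernel and the
 * hypervisor clocks do not drift apart after the offset was taken. */
static int64_t hv_expiry_from_offset(int64_t k_expires, int64_t offset)
{
    return k_expires + offset;
}

/* Delta conversion: take the remaining time on the kernel clock and
 * add it to the hypervisor's current time. */
static int64_t hv_expiry_from_delta(int64_t k_expires, int64_t k_now,
                                    int64_t hv_now)
{
    assert(k_expires >= k_now);   /* sketch only handles future expiries */
    return hv_now + (k_expires - k_now);
}
```

Suppose the offset was taken when the hypervisor clock read 1000s and
the kernel clock read 0, and that 10s later the kernel clock has gained
5ms on the hypervisor. A 1ms timeout converted with the stale offset
fires 6ms from now on the hypervisor clock, while the delta conversion
fires 1ms from now, with only the drift over that 1ms as error.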

My analysis might be wrong here (I suspect the Xen periodic timer may
have unexpected behaviour), but the overall conclusion still stands:
using an absolute timeout only works if the kernel and hypervisor have
non-drifting timebases.  I think it's too fragile for a clockevent
implementation to assume that a particular clocksource is in use to get
reliable results.

Or perhaps this is a property of the whole clock subsystem: that
clockevents must be paired with clocksources.  But it's not obvious to me
that this is enforced, or even acknowledged.

(Of course, if the drift can be characterized, then you can compensate
for it, but this seems too complex to be the right answer.  And drift
compensation is numerically much simpler for small 32-bit deltas
compared to 64-bit absolute times.)

J


Re: 2.6.21-rc3-mm1 RSDL results

2007-03-10 Thread Con Kolivas
On Sunday 11 March 2007 04:01, James Cloos wrote:
> > "Con" == Con Kolivas <[EMAIL PROTECTED]> writes:
>
> Con> It's sad that sched_yield is still in our graphics card drivers ...
>
> I just did a recursive grep(1) on my mirror of the freedesktop git
> repos for sched_yield.  This only checked the master branches as I
> did not bother to script up something to clone each, check out all
> branches in turn, and grep(1) each possibility.
>
> The output is just:
> :; grep -r sched_yield FDO/xorg
>
> FDO/xorg/xserver/hw/kdrive/via/viadraw.c: sched_yield();
> FDO/xorg/driver/xf86-video-glint/src/pm2_video.c:if (sync) /*
> sched_yield? */
>
> Is there something else I should grep(1) for?  If not, it looks as
> if sched_yield(2) has been evicted from the drivers.

See:

http://webcvs.freedesktop.org/mesa/Mesa/src/mesa/drivers/dri/r200/r200_ioctl.c?revision=1.37=markup

-- 
-ck


Re: Problem: cat < /dev/my_ttyS0 is not blocked

2007-03-10 Thread Denis Vlasenko
On Saturday 10 March 2007 13:16, Mockern wrote:
> I have a problem with  cat < /dev/my_ttyS0 (see strace output below).
> cat function is not blocked. I don't understand why it is not stopped
> at read(0, __  and terminated?  
> Thank you

Because /dev/my_ttyS0 is probably a null file.

Please show output of 'ls -l /dev/*ttyS*'

--
vda


Re: PROBLEM: "Make nenuconfig" does not save parameters.

2007-03-10 Thread Sam Ravnborg
On Sat, Mar 10, 2007 at 10:34:41PM +0100, Jan Engelhardt wrote:
> 
> On Mar 10 2007 22:27, Sam Ravnborg wrote:
> >On Sat, Mar 10, 2007 at 07:23:41PM +0100, Jan Engelhardt wrote:
> >> 
> >> Whether the 'working config file path' should change when you do
> >> 'Save as Alternate' or not, is a menuconfig axiom. Ask Sam Ravnborg
> >> if you want it changed :-)
> >
> >Current behaviour is not logical but on the other hand I do not
> >see a big need to make it so.
> >It seems that people very seldom uses "save alternate" anyway.
> >
> >But patches are welcome.
> 
> ^_^ The patch has already been posted, has not it?
No.
Either we keep current behaviour or we change to the "normal"
behaviour with a "Save as..." as know from all other programs.

Sam


Re: [patch 6/9] signalfd/timerfd v1 - timerfd core ...

2007-03-10 Thread Linus Torvalds


On Sat, 10 Mar 2007, Nicholas Miell wrote:
> 
> Care to elaborate on why they're a horrible crock?

It's a *classic* case of an interface that tries to do everything under 
the sun.

Here's a clue: look at any system call that takes a union as part of its 
arguments. Count them. I think we have two:
 - struct siginfo
 - struct sigevent
and they are both broken horrible interfaces where the data structures 
depend on various flags.

It's just not the UNIX system call way. And none of it really makes sense 
if you already have a file descriptor, since at that point you know what 
the notification mechanism is.

I'd actually much rather do POSIX timers the other way around: associate a 
generic notification mechanism with the file descriptor, and then 
implement posix_timer_create() on top of timerfd. Now THAT sounds like a 
clean unix-like interface ("everything is a file") and would imply that 
you'd be able to do the same kind of notification for any file descriptor, 
not just timers.

But posix timers as they are done now are just an abomination. They are 
not unix-like at all.

> And are the bugs fixed? If so, why replace them? They work now.

.. but the reason for the bugs was largely a very baroque interface, which 
didn't get fixed (because it's specified by the standard).

I'd rather have straightforward interfaces. The timerfd() one looked a lot 
more straightforward than posix timers.

(That said, using "struct itimerspec" might be a good idea. That would 
also obviate the need for TFD_TIMER_SEQ, since an itimerspec automatically 
has both "base" and "incremental" parts).

Linus


Re: [PATCH][RSDL-mm 0/6] Rotating Staircase DeadLine scheduler for -mm

2007-03-10 Thread Con Kolivas
On Sunday 11 March 2007 03:53, Nicolas Mailhot wrote:
> Le dimanche 11 mars 2007 à 01:03 +1100, Con Kolivas a écrit :
> > On Saturday 10 March 2007 22:49, Nicolas Mailhot wrote:
> > > Oops
> > >
> > > ⇒ http://bugzilla.kernel.org/show_bug.cgi?id=8166
> >
> > Thanks very much. I can't get your config to boot on qemu, but could you
> > please try this debugging patch? It's not a patch you can really run the
> > machine with but might find where the problem occurs. Specifically I'm
> > looking for the warning MISSING STATIC BIT in your case.
> >
> > http://ck.kolivas.org/patches/crap/sched-rsdl-0.28-stuff.patch
>
> I attached a screenshot of the patched kernel boot

Thanks. Darn, the debugging didn't catch anything. Did you see any BUG during 
the boot earlier than that screenshot? Probably not. 

If you have the time I would appreciate you testing 2.6.20 with the rsdl 0.28 
patch for it with a config as close to this -mm2 one as possible.

http://ck.kolivas.org/patches/staircase-deadline/2.6.20-sched-rsdl-0.28.patch

and see if the bug recurs please?

Thanks!

-- 
-ck


Re: [patch 6/9] signalfd/timerfd v1 - timerfd core ...

2007-03-10 Thread Davide Libenzi
On Sat, 10 Mar 2007, Nicholas Miell wrote:

> I never complained about one timer per fd (although, now that you
> mention it, that would get a bit excessive if you have thousands of
> outstanding timers).

Right, of course.



> > The real-time and monotonic selection can be added. 
> 
> IOW, the timerfd patch is not suitable for inclusion as-is. (While
> you're at it, you should probably add a flags argument for future
> expansion.)

That's already in.



> > If you look at the posix timers code, that's a bunch of code over the real 
> > meat of it, that is hrtimer.c. The timerfd interface goes straight to 
> > that, without adding yet another meaning to the sigevent structure,
> 
> That's what the sigevent structure is for -- to describe how events
> should be signaled to userspace, whether by signal delivery, thread
> creation, or queuing to event completion ports. If you think
> extending it would be bad, I can show you the line in POSIX where it
> encourages the contrary.

I'm sorry, I already explained to you that linking the two (files and posix 
timers) is going to create more trouble than it actually solves.
The timerfd code provides the same functionality, with zero intrusion in 
existing code, and basically zero code (once you remove the usual fd 
creation/cleanup).
The code of adding posix timers support would be *all* the existing one
(that is already a thin wrapper that calls hrtimer.c support - like 
posix timers do), plus adding more crud into the posix timers code, plus 
adding file references handling. If *you* want to do that, I can open you 
a door into the timerfd.




- Davide




Re: [PATCH] Software Suspend: Fix suspend when console is in VT_AUTO/KD_GRAPHICS mode

2007-03-10 Thread Pavel Machek
Hi!

> > It should explain why it is okay to proceed when we can't change to
> > text console.
> > 
> See updated comment in attached patch.  It's really up to the caller to
> decide what to do if we can't switch the console - currently all callers
> ignore the return code so I assume that it's okay to proceed anyway.

Ok, I guess the patch is the right thing to do after all.  Fix the issues
below, append a changelog, and send a patch to lkml, cc Andrew Morton
and me. Oh and you have my ACK.
Pavel

> Signed-off-by: Andrew Johnson <[EMAIL PROTECTED]>
> ---
> diff -rup linux-2.6.20.1/drivers/char/vt.c linux/drivers/char/vt.c
> --- linux-2.6.20.1/drivers/char/vt.c  2007-02-19 22:34:32.0 -0800
> +++ linux/drivers/char/vt.c   2007-03-09 15:48:29.0 -0800
> @@ -2188,10 +2188,30 @@ static void console_callback(struct work
>   release_console_sem();
>  }
>  
> -void set_console(int nr)
> +extern char vt_dont_switch;
> +
> +int set_console(int nr)
>  {
> + struct vc_data *vc = vc_cons[fg_console].d;
> +
> + if(!vc_cons_allocated(nr) || vt_dont_switch || 

There should be a space between "if" and "(".

> diff -rup linux-2.6.20.1/drivers/char/vt_ioctl.c
> linux/drivers/char/vt_ioctl.c
> --- linux-2.6.20.1/drivers/char/vt_ioctl.c2007-02-19 22:34:32.0
> -0800
> +++ linux/drivers/char/vt_ioctl.c 2007-03-08 14:15:41.0

And your mailer still wordwraps.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [ck] Re: RSDL v0.28 for 2.6.20

2007-03-10 Thread michael chang

On 3/10/07, Willy Tarreau <[EMAIL PROTECTED]> wrote:
> On Sat, Mar 10, 2007 at 04:56:57PM -0500, michael chang wrote:
> > On 3/10/07, Willy Tarreau <[EMAIL PROTECTED]> wrote:
> > >BTW, Con, I think that you should base your work on 2.6.20.[23] and not
> > >2.6.20 next time, due to this conflict. It will get wider adoption.
>           ^^^^^^^^^
>
> > Maybe I'm naive, but I find this hard to understand -- 2.6.20.2 didn't
> > exist when Con published his patch. (Con published it ~12 hours before
> > the release of 2.6.20.2, from what I can tell.) How can he base his
> > work on something that didn't yet exist? (And it applied cleanly to
> > 2.6.20.1, the latest when he published it.)
>
> You see the words I have underlined? "next time". I know for sure he
> published it before 2.6.20.2, but now that it is out, I suggested that
> Con rebases his work on this version for new releases.

Oh. That's my mistake, then. That makes sense. To me, it sounded like
you were implying he was supposed to base it on 2.6.20.2 in advance,
for some reason. *sigh*

--
~Mike
- Just the crazy copy cat.


[PATCH] Delete unused header file.

2007-03-10 Thread Robert P. J. Day

  Delete apparently unused header file include/linux/elfnote.h.

Signed-off-by: Robert P. J. Day <[EMAIL PROTECTED]>

---

  not sure who's responsible for this.

diff --git a/include/linux/elfnote.h b/include/linux/elfnote.h
deleted file mode 100644
index 67396db..000
--- a/include/linux/elfnote.h
+++ /dev/null
@@ -1,90 +0,0 @@
-#ifndef _LINUX_ELFNOTE_H
-#define _LINUX_ELFNOTE_H
-/*
- * Helper macros to generate ELF Note structures, which are put into a
- * PT_NOTE segment of the final vmlinux image.  These are useful for
- * including name-value pairs of metadata into the kernel binary (or
- * modules?) for use by external programs.
- *
- * Each note has three parts: a name, a type and a desc.  The name is
- * intended to distinguish the note's originator, so it would be a
- * company, project, subsystem, etc; it must be in a suitable form for
- * use in a section name.  The type is an integer which is used to tag
- * the data, and is considered to be within the "name" namespace (so
- * "FooCo"'s type 42 is distinct from "BarProj"'s type 42).  The
- * "desc" field is the actual data.  There are no constraints on the
- * desc field's contents, though typically they're fairly small.
- *
- * All notes from a given NAME are put into a section named
- * .note.NAME.  When the kernel image is finally linked, all the notes
- * are packed into a single .notes section, which is mapped into the
- * PT_NOTE segment.  Because notes for a given name are grouped into
- * the same section, they'll all be adjacent the output file.
- *
- * This file defines macros for both C and assembler use.  Their
- * syntax is slightly different, but they're semantically similar.
- *
- * See the ELF specification for more detail about ELF notes.
- */
-
-#ifdef __ASSEMBLER__
-/*
- * Generate a structure with the same shape as Elf{32,64}_Nhdr (which
- * turn out to be the same size and shape), followed by the name and
- * desc data with appropriate padding.  The 'desctype' argument is the
- * assembler pseudo op defining the type of the data e.g. .asciz while
- * 'descdata' is the data itself e.g.  "hello, world".
- *
- * e.g. ELFNOTE(XYZCo, 42, .asciz, "forty-two")
- *  ELFNOTE(XYZCo, 12, .long, 0xdeadbeef)
- */
-#define ELFNOTE(name, type, desctype, descdata)\
-.pushsection .note.name;   \
-  .align 4 ;   \
-  .long 2f - 1f/* namesz */;   \
-  .long 4f - 3f/* descsz */;   \
-  .long type   ;   \
-1:.asciz "name";   \
-2:.align 4 ;   \
-3:desctype descdata;   \
-4:.align 4 ;   \
-.popsection;
-#else  /* !__ASSEMBLER__ */
-#include 
-/*
- * Use an anonymous structure which matches the shape of
- * Elf{32,64}_Nhdr, but includes the name and desc data.  The size and
- * type of name and desc depend on the macro arguments.  "name" must
- * be a literal string, and "desc" must be passed by value.  You may
- * only define one note per line, since __LINE__ is used to generate
- * unique symbols.
- */
-#define _ELFNOTE_PASTE(a,b)a##b
-#define _ELFNOTE(size, name, unique, type, desc)   \
-   static const struct {   \
-   struct elf##size##_note _nhdr;  \
-   unsigned char _name[sizeof(name)]   \
-   __attribute__((aligned(sizeof(Elf##size##_Word)))); \
-   typeof(desc) _desc  \
-   __attribute__((aligned(sizeof(Elf##size##_Word)))); \
-   } _ELFNOTE_PASTE(_note_, unique)\
-   __attribute_used__  \
-   __attribute__((section(".note." name),  \
-  aligned(sizeof(Elf##size##_Word)),   \
-  unused)) = { \
-   {   \
-   sizeof(name),   \
-   sizeof(desc),   \
-   type,   \
-   },  \
-   name,   \
-   desc\
-   }
-#define ELFNOTE(size, name, type, desc)\
-   _ELFNOTE(size, name, __LINE__, type, desc)
-
-#define ELFNOTE32(name, type, desc) ELFNOTE(32, name, type, desc)
-#define ELFNOTE64(name, type, desc) ELFNOTE(64, name, type, desc)
-#endif /* __ASSEMBLER__ */
-
-#endif /* _LINUX_ELFNOTE_H */
-- 
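
The note layout described in the header comment above (namesz, descsz,
type, then the name and the desc, each padded to a 4-byte boundary) can
be sketched in plain C. This is an illustration under stated
assumptions, not code from the deleted header: elfnote_pack() and
align4() are hypothetical names, the words are written in host byte
order, and padding is filled with zeros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round n up to the next multiple of 4, as ".align 4" does. */
static size_t align4(size_t n)
{
	return (n + 3) & ~(size_t)3;
}

/*
 * Hypothetical helper: serialise one note into out[] in host byte order.
 * Returns the number of bytes written, or 0 if out[] is too small.
 */
size_t elfnote_pack(uint8_t *out, size_t cap, const char *name,
		    uint32_t type, const void *desc, uint32_t descsz)
{
	uint32_t hdr[3];
	size_t namesz, total;

	assert(out != NULL && name != NULL);

	namesz = strlen(name) + 1;	/* like .asciz, count the NUL */
	total = sizeof(hdr) + align4(namesz) + align4(descsz);
	if (total > cap)
		return 0;

	hdr[0] = (uint32_t)namesz;
	hdr[1] = descsz;
	hdr[2] = type;

	memset(out, 0, total);		/* zero padding in this sketch */
	memcpy(out, hdr, sizeof(hdr));
	memcpy(out + sizeof(hdr), name, namesz);
	memcpy(out + sizeof(hdr) + align4(namesz), desc, descsz);
	return total;
}
```

With the example from the comment, name "XYZCo" (6 bytes with the NUL,
padded to 8) and desc "forty-two" (10 bytes with the NUL, padded to 12),
the note occupies 12 + 8 + 12 = 32 bytes.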

Re: [PATCH] Software Suspend: Fix suspend when console is in VT_AUTO/KD_GRAPHICS mode

2007-03-10 Thread Pavel Machek
Hi!

> > ...how does qpe know when to repaint the screen, anyway?
> 
> QPE doesn't need to repaint the screen after wake-up - the framebuffer
> memory is retained so the PXA270 lcd controller simply displays what was
> last on the screen when it is re-enabled.

That probably means QPE is broken on machines that do not preserve
framebuffer over suspend :-(.
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


[PATCH] rtc: Add RTC class driver for the Maxim MAX6900

2007-03-10 Thread Dale Farnsworth
From: Dale Farnsworth <[EMAIL PROTECTED]>

Signed-off-by: Dale Farnsworth <[EMAIL PROTECTED]>

---
 drivers/rtc/Kconfig   |   10 +
 drivers/rtc/Makefile  |1 
 drivers/rtc/rtc-max6900.c |  312 
 3 files changed, 323 insertions(+)

Index: linux-2.6-powerpc-df/drivers/rtc/Kconfig
===
--- linux-2.6-powerpc-df.orig/drivers/rtc/Kconfig
+++ linux-2.6-powerpc-df/drivers/rtc/Kconfig
@@ -334,6 +334,16 @@ config RTC_DRV_TEST
  This driver can also be built as a module. If so, the module
  will be called rtc-test.
 
+config RTC_DRV_MAX6900
+   tristate "Maxim 6900"
+   depends on RTC_CLASS && I2C
+   help
+ If you say yes here you will get support for the
+ Maxim MAX6900 I2C RTC chip.
+
+ This driver can also be built as a module. If so, the module
+ will be called rtc-max6900.
+
 config RTC_DRV_MAX6902
tristate "Maxim 6902"
depends on RTC_CLASS && SPI
Index: linux-2.6-powerpc-df/drivers/rtc/Makefile
===
--- linux-2.6-powerpc-df.orig/drivers/rtc/Makefile
+++ linux-2.6-powerpc-df/drivers/rtc/Makefile
@@ -34,6 +34,7 @@ obj-$(CONFIG_RTC_DRV_EP93XX)  += rtc-ep93
 obj-$(CONFIG_RTC_DRV_SA1100)   += rtc-sa1100.o
 obj-$(CONFIG_RTC_DRV_VR41XX)   += rtc-vr41xx.o
 obj-$(CONFIG_RTC_DRV_PL031)+= rtc-pl031.o
+obj-$(CONFIG_RTC_DRV_MAX6900)  += rtc-max6900.o
 obj-$(CONFIG_RTC_DRV_MAX6902)  += rtc-max6902.o
 obj-$(CONFIG_RTC_DRV_V3020)+= rtc-v3020.o
 obj-$(CONFIG_RTC_DRV_AT91RM9200)+= rtc-at91rm9200.o
Index: linux-2.6-powerpc-df/drivers/rtc/rtc-max6900.c
===
--- /dev/null
+++ linux-2.6-powerpc-df/drivers/rtc/rtc-max6900.c
@@ -0,0 +1,312 @@
+/*
+ * rtc class driver for the Maxim MAX6900 chip
+ *
+ * Author: Dale Farnsworth <[EMAIL PROTECTED]>
+ *
+ * based on previously existing rtc class drivers
+ *
+ * 2007 (c) MontaVista, Software, Inc.  This file is licensed under
+ * the terms of the GNU General Public License version 2.  This program
+ * is licensed "as is" without any warranty of any kind, whether express
+ * or implied.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define DRV_NAME "max6900"
+#define DRV_VERSION "0.1"
+
+/*
+ * register indices
+ */
+#define MAX6900_REG_SC 0   /* seconds  00-59 */
+#define MAX6900_REG_MN 1   /* minutes  00-59 */
+#define MAX6900_REG_HR 2   /* hours00-23 */
+#define MAX6900_REG_DT 3   /* day of month 00-31 */
+#define MAX6900_REG_MO 4   /* month01-12 */
+#define MAX6900_REG_DW 5   /* day of week   1-7  */
+#define MAX6900_REG_YR 6   /* year 00-99 */
+#define MAX6900_REG_CT 7   /* control */
+#define MAX6900_REG_LEN8
+
+#define MAX6900_REG_CT_WP  (1 << 7)/* Write Protect */
+
+/*
+ * register read/write commands
+ */
+#define MAX6900_REG_CONTROL_WRITE  0x8e
+#define MAX6900_REG_BURST_READ 0xbf
+#define MAX6900_REG_BURST_WRITE0xbe
+#define MAX6900_REG_RESERVED_READ  0x96
+
+#define MAX6900_IDLE_TIME_AFTER_WRITE  3   /* specification says 2.5 mS */
+
+#define MAX6900_I2C_ADDR   0xa0
+
+static unsigned short normal_i2c[] = {
+   MAX6900_I2C_ADDR >> 1,
+   I2C_CLIENT_END
+};
+
+I2C_CLIENT_INSMOD; /* defines addr_data */
+
+static int max6900_probe(struct i2c_adapter *adapter, int addr, int kind);
+
+static int max6900_i2c_read_regs(struct i2c_client *client, u8 *buf)
+{
+   u8 reg_addr[1] = { MAX6900_REG_BURST_READ };
+   struct i2c_msg msgs[2] = {
+   {
+   client->addr,
+   0, /* write */
+   sizeof(reg_addr),
+   reg_addr
+   },
+   {
+   client->addr,
+   I2C_M_RD,
+   MAX6900_REG_LEN,
+   buf
+   }
+   };
+   int rc;
+
+   rc = i2c_transfer(client->adapter, msgs, ARRAY_SIZE(msgs));
+   if (rc != ARRAY_SIZE(msgs)) {
+   dev_err(&client->dev, "%s: register read failed\n",
+   __FUNCTION__);
+   return -EIO;
+   }
+   return 0;
+}
+
+static int max6900_i2c_write_regs(struct i2c_client *client, u8 const *buf)
+{
+   u8 i2c_buf[MAX6900_REG_LEN + 1] = { MAX6900_REG_BURST_WRITE };
+   struct i2c_msg msgs[1] = {
+   {
+   client->addr,
+   0, /* write */
+   MAX6900_REG_LEN + 1,
+   i2c_buf
+   }
+   };
+   int rc;
+
+   memcpy(&i2c_buf[1], buf, MAX6900_REG_LEN);
+
+   
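
The visible part of the driver builds a burst write as one command byte
followed by the eight register values, and registers the chip under its
7-bit address (MAX6900_I2C_ADDR >> 1). A userspace sketch of just that
framing, reusing the constants from the patch; the helper names
max6900_addr7() and max6900_build_burst_write() are hypothetical and no
I2C traffic is performed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX6900_REG_LEN		8	/* from the patch */
#define MAX6900_REG_BURST_WRITE	0xbe	/* from the patch */
#define MAX6900_I2C_ADDR	0xa0	/* from the patch: 8-bit form */

/* Linux i2c clients use the 7-bit address, hence the shift by one. */
uint8_t max6900_addr7(void)
{
	return MAX6900_I2C_ADDR >> 1;
}

/*
 * Hypothetical helper: lay out a burst write as one command byte
 * followed by the eight register values.  Returns the frame length.
 */
size_t max6900_build_burst_write(uint8_t frame[MAX6900_REG_LEN + 1],
				 const uint8_t regs[MAX6900_REG_LEN])
{
	assert(frame != NULL && regs != NULL);

	frame[0] = MAX6900_REG_BURST_WRITE;
	memcpy(&frame[1], regs, MAX6900_REG_LEN);
	return MAX6900_REG_LEN + 1;
}
```

The 8-bit value 0xa0 therefore becomes the 7-bit address 0x50, and every
burst write frame is nine bytes long.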

[PATCH] swsusp: Fix resume error path in platform mode

2007-03-10 Thread Rafael J. Wysocki
From: Rafael J. Wysocki <[EMAIL PROTECTED]>

If swsusp is using the platform mode during the resume and the image cannot be
read, the platform mode should be switched off before software_resume() returns.
Make it happen.
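
A userspace model of that error path, as a sketch only. The struct and
function names are hypothetical, and the model assumes platform mode was
switched on earlier in the resume sequence, which is not visible in the
hunk below.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the state touched by software_resume(). */
struct resume_model {
	int platform_mode_on;	/* 1 while platform mode is active */
	int image_loaded;	/* 1 once the image has been read */
};

/*
 * Model of the error path: if the image cannot be read, release it
 * and switch platform mode off again before returning the error.
 */
int model_software_resume(struct resume_model *m, int image_readable)
{
	assert(m != NULL);

	m->platform_mode_on = 1;	/* assumed: enabled earlier in resume */

	if (!image_readable) {
		m->image_loaded = 0;		/* stands in for swsusp_free() */
		m->platform_mode_on = 0;	/* stands in for platform_finish() */
		return -EIO;
	}

	m->image_loaded = 1;
	return 0;
}
```

Without the platform_mode_on = 0 line, the failing call would return
with platform mode still enabled, which is the situation the patch
addresses.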

Signed-off-by: Rafael J. Wysocki <[EMAIL PROTECTED]>
Acked-by: Pavel Machek <[EMAIL PROTECTED]>
---
 kernel/power/disk.c |1 +
 1 file changed, 1 insertion(+)

Index: linux-2.6.21-rc3/kernel/power/disk.c
===
--- linux-2.6.21-rc3.orig/kernel/power/disk.c
+++ linux-2.6.21-rc3/kernel/power/disk.c
@@ -260,6 +260,7 @@ static int software_resume(void)
error = swsusp_read();
if (error) {
swsusp_free();
+   platform_finish();
goto Thaw;
}
 


[PATCH] CRIS: Delete unused header file.

2007-03-10 Thread Robert P. J. Day

  Delete apparently unused header file drivers/serial/crisv10.h.

Signed-off-by: Robert P. J. Day <[EMAIL PROTECTED]>

---

diff --git a/drivers/serial/crisv10.h b/drivers/serial/crisv10.h
deleted file mode 100644
index 4a23340..000
--- a/drivers/serial/crisv10.h
+++ /dev/null
@@ -1,136 +0,0 @@
-/*
- * serial.h: Arch-dep definitions for the Etrax100 serial driver.
- *
- * Copyright (C) 1998, 1999, 2000 Axis Communications AB
- */
-
-#ifndef _ETRAX_SERIAL_H
-#define _ETRAX_SERIAL_H
-
-#include 
-#include 
-
-/* Software state per channel */
-
-#ifdef __KERNEL__
-/*
- * This is our internal structure for each serial port's state.
- *
- * Many fields are paralleled by the structure used by the serial_struct
- * structure.
- *
- * For definitions of the flags field, see tty.h
- */
-
-#define SERIAL_RECV_DESCRIPTORS 8
-
-struct etrax_recv_buffer {
-   struct etrax_recv_buffer *next;
-   unsigned short length;
-   unsigned char error;
-   unsigned char pad;
-
-   unsigned char buffer[0];
-};
-
-struct e100_serial {
-   int baud;
-   volatile u8 *port; /* R_SERIALx_CTRL */
-   u32 irq;  /* bitnr in R_IRQ_MASK2 for dmaX_descr */
-
-   /* Output registers */
-   volatile u8 *oclrintradr; /* adr to R_DMA_CHx_CLR_INTR */
-   volatile u32*ofirstadr;   /* adr to R_DMA_CHx_FIRST */
-   volatile u8 *ocmdadr; /* adr to R_DMA_CHx_CMD */
-   const volatile u8   *ostatusadr;  /* adr to R_DMA_CHx_STATUS */
-
-   /* Input registers */
-   volatile u8 *iclrintradr; /* adr to R_DMA_CHx_CLR_INTR */
-   volatile u32*ifirstadr;   /* adr to R_DMA_CHx_FIRST */
-   volatile u8 *icmdadr; /* adr to R_DMA_CHx_CMD */
-   volatile u32*idescradr;   /* adr to R_DMA_CHx_DESCR */
-
-   int flags;  /* defined in tty.h */
-
-   u8  rx_ctrl; /* shadow for R_SERIALx_REC_CTRL */
-   u8  tx_ctrl; /* shadow for R_SERIALx_TR_CTRL */
-   u8  iseteop; /* bit number for R_SET_EOP for the input dma */
-   int enabled; /* Set to 1 if the port is enabled in HW config */
-
-   u8  dma_out_enabled:1; /* Set to 1 if DMA should be used */
-   u8  dma_in_enabled:1;  /* Set to 1 if DMA should be used */
-
-   /* end of fields defined in rs_table[] in .c-file */
-   u8  uses_dma_in;  /* Set to 1 if DMA is used */
-   u8  uses_dma_out; /* Set to 1 if DMA is used */
-   u8  forced_eop;   /* a fifo eop has been forced */
-   int baud_base; /* For special baudrates */
-   int custom_divisor; /* For special baudrates */
-   struct etrax_dma_descr  tr_descr;
-   struct etrax_dma_descr  rec_descr[SERIAL_RECV_DESCRIPTORS];
-   int cur_rec_descr;
-
-   volatile inttr_running; /* 1 if output is running */
-
-   struct tty_struct   *tty;
-   int read_status_mask;
-   int ignore_status_mask;
-   int x_char; /* xon/xoff character */
-   int close_delay;
-   unsigned short  closing_wait;
-   unsigned short  closing_wait2;
-   unsigned long   event;
-   unsigned long   last_active;
-   int line;
-   int type;  /* PORT_ETRAX */
-   int count;  /* # of fd on device */
-   int blocked_open; /* # of blocked opens */
-   struct circ_buf xmit;
-   struct etrax_recv_buffer *first_recv_buffer;
-   struct etrax_recv_buffer *last_recv_buffer;
-   unsigned intrecv_cnt;
-   unsigned intmax_recv_cnt;
-
-   struct work_struct  work;
-   struct async_icount icount;   /* error-statistics etc.*/
-   struct ktermios normal_termios;
-   struct ktermios callout_termios;
-#ifdef DECLARE_WAITQUEUE
-   wait_queue_head_t   open_wait;
-   wait_queue_head_t   close_wait;
-#else
-   struct wait_queue   *open_wait;
-   struct wait_queue   *close_wait;
-#endif
-
-   unsigned long   char_time_usec;   /* The time for 1 char, in usecs */
-   unsigned long   flush_time_usec;  /* How often we should flush */
-   unsigned long   last_tx_active_usec;  /* Last tx usec in the jiffies */
-   unsigned long   last_tx_active;   /* Last tx time in jiffies */
-   unsigned long   last_rx_active_usec;  /* Last rx usec in the jiffies */
-   unsigned long   last_rx_active;   /* Last rx time in jiffies */
-
-   int 
