Re: [PATCH 2/5] revoke: core code

2007-03-15 Thread Pekka Enberg

Hi Andrew,

On Sun, 11 Mar 2007 13:30:49 +0200 (EET) Pekka J Enberg
<[EMAIL PROTECTED]> wrote:

On 3/16/07, Andrew Morton <[EMAIL PROTECTED]> wrote:

> All system calls must return long.


Fixed.

On 3/16/07, Andrew Morton <[EMAIL PROTECTED]> wrote:

> So the modification of vm_flags is racy?
>
> > + smp_mb();
>
> Please always document barriers.  There's presumably some vm_flags reader
> we're concerned about here, but how is the code reader to know what the
> code writer was thinking?


We need to watch out for page faults after the shared mappings have
been taken down and for mmap(2) trying to remap. I'll add a comment here.

On 3/16/07, Andrew Morton <[EMAIL PROTECTED]> wrote:

> This all looks very strange.  If the calling process expires its timeslice,
> the entire system call fails?
>
> What's happening here?


Me being stupid. I followed what unmap_mapping_range_vma is doing but
failed to see what its callers are doing. I'll fix it up.

On 3/16/07, Andrew Morton <[EMAIL PROTECTED]> wrote:

> do_fsync() is seriously suboptimal - it will run an ext3 commit.
> do_sync_file_range(...,
> SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER)
> will run maybe five times quicker.
>
> But otoh, do_sync_file_range() will fail to write back the pages for a
> data=journal ext3 file, I expect (oops).


But it's good enough for generic_file_revoke, no? Ext3 should probably
implement its own revoke hook, so you can drop the ext2 and ext3 hooks
if you're worried; I did them mostly for testing.

On 3/16/07, Andrew Morton <[EMAIL PROTECTED]> wrote:

> Why is this code using invalidate_inode_pages2()?  That function keeps on
> breaking, has ill-defined semantics and will probably change in the future.
>
> Exactly what semantics are you looking for here, and why?


What the comment says: "make pending reads fail." When revoking an
inode, we need to make sure there is no pending I/O that will
complete after revocation and thus leak information.

On 3/16/07, Andrew Morton <[EMAIL PROTECTED]> wrote:

> The blank line before the EXPORT_SYMBOL() is a waste of space.


I'll fix that up.

On 3/16/07, Andrew Morton <[EMAIL PROTECTED]> wrote:

> > +static struct inode *revokefs_alloc_inode(struct super_block *sb)
> > +{
> > + struct revokefs_inode_info *info;
> > +
> > + info = kmem_cache_alloc(revokefs_inode_cache, GFP_NOFS);
> > + if (!info)
> > + return NULL;
> > +
> > + return &info->vfs_inode;
> > +}
>
> Why GFP_NOFS?


GFP_KERNEL should be sufficient. I'll fix that up.

On 3/16/07, Andrew Morton <[EMAIL PROTECTED]> wrote:

> > ===
> > --- /dev/null 1970-01-01 00:00:00.0 +
> > +++ uml-2.6/include/linux/revoked_fs_i.h  2007-03-11 13:09:20.0 +0200
> > @@ -0,0 +1,20 @@
> > +#ifndef _LINUX_REVOKED_FS_I_H
> > +#define _LINUX_REVOKED_FS_I_H
> > +
> > +#define REVOKEFS_MAGIC 0x5245564B  /* REVK */
>
> This is supposed to go into magic.h.


Will do. Thank you Andrew.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] ACPI: ibm-acpi: allow module to load when acpi notifiers can't be set (v2)

2007-03-15 Thread Len Brown
Applied.


On Thursday 15 March 2007 15:15, Henrique de Moraes Holschuh wrote:
> This patch allows for ibm-acpi to coexist (with diminished functionality) with
> other drivers like ACPI_BAY.  ibm-acpi will simply disable the functions it is
> not able to register ACPI notifiers for.
> 
> Signed-off-by: Henrique de Moraes Holschuh <[EMAIL PROTECTED]>
> Cc: Chris Wedgwood <[EMAIL PROTECTED]>
> Cc: Kristen Carlson Accardi <[EMAIL PROTECTED]>
> ---
> 
>  There was a minor problem in the first version of the patch, which I didn't
>  notice when backporting from acpi-test.  This is a fixed version.  Sorry
>  about this.
> 
>  Len, you can pull this patch from:
>  git://repo.or.cz/linux-2.6/linux-acpi-2.6/ibm-acpi-2.6.git
>  branch for-upstream/acpi-release
>  
>  Please send it to Linus for merge in 2.6.21.
>  
>  It will clash with the patches in acpi-test that are waiting for 2.6.22.
>  I will rediff those, and send you a pull request when this patch
>  gets accepted in mainline.

ok

thanks Henrique,
-Len


>  drivers/acpi/ibm_acpi.c |   19 ++++++++++++++++---
>  1 files changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/acpi/ibm_acpi.c b/drivers/acpi/ibm_acpi.c
> index 3690136..dc10966 100644
> --- a/drivers/acpi/ibm_acpi.c
> +++ b/drivers/acpi/ibm_acpi.c
> @@ -2507,7 +2507,7 @@ static int __init setup_notify(struct ibm_struct *ibm)
>   ret = acpi_bus_get_device(*ibm->handle, &ibm->device);
>   if (ret < 0) {
>   printk(IBM_ERR "%s device not present\n", ibm->name);
> - return 0;
> + return -ENODEV;
>   }
>  
>   acpi_driver_data(ibm->device) = ibm;
> @@ -2516,8 +2516,13 @@ static int __init setup_notify(struct ibm_struct *ibm)
>   status = acpi_install_notify_handler(*ibm->handle, ibm->type,
>dispatch_notify, ibm);
>   if (ACPI_FAILURE(status)) {
> - printk(IBM_ERR "acpi_install_notify_handler(%s) failed: %d\n",
> -ibm->name, status);
> + if (status == AE_ALREADY_EXISTS) {
> + printk(IBM_NOTICE "another device driver is already handling %s events\n",
> + ibm->name);
> + } else {
> + printk(IBM_ERR "acpi_install_notify_handler(%s) failed: %d\n",
> + ibm->name, status);
> + }
>   return -ENODEV;
>   }
>   ibm->notify_installed = 1;
> @@ -2553,6 +2558,8 @@ static int __init register_driver(struct ibm_struct *ibm)
>   return ret;
>  }
>  
> +static void ibm_exit(struct ibm_struct *ibm);
> +
>  static int __init ibm_init(struct ibm_struct *ibm)
>  {
>   int ret;
> @@ -2594,6 +2601,12 @@ static int __init ibm_init(struct ibm_struct *ibm)
>  
>   if (ibm->notify) {
>   ret = setup_notify(ibm);
> + if (ret == -ENODEV) {
> + printk(IBM_NOTICE "disabling subdriver %s\n",
> + ibm->name);
> + ibm_exit(ibm);
> + return 0;
> + }
>   if (ret < 0)
>   return ret;
>   }
> -- 
> 1.5.0.3
> 


Re: 2.6.21-rc3-mm1

2007-03-15 Thread Mariusz Kozlowski
Hello Mel,

> > > > >   Today after +- 24h of uptime I found some more page allocation
> > > > > failures ('eth1: Can't allocate skb for Rx'). You'll find more here:
> > > > > 
> > > > > http://tuxland.pl/misc/2.6.21-rc3-mm1-page-allocation-failure.txt
> > > > > 
> > > > > System wasn't doing anything unusual, as usual ;-) X, some p2p
> > > > > software, firefox+flash playing music.
> > > > 
> > > > Do other kernels do this, or is 2.6.21-rc3-mm1 worse?
> > > > 
> > > > It is of course a non-fatal problem and will inevitably happen sometimes,
> > > > but we would like the VM to be able to minimise the occurrence of this
> > > > problem.
> > > 
> > > Mariusz, I would be interested in finding out if this problem still occurs
> > > when you set min_free_kbytes to 16384 via /proc/sys/vm/min_free_kbytes. I
> > > understand that the problem is not easily reproduced and requiring
> > > configuration changes is far from ideal but it'd allow me to find out if
> > > options 2 or 3 below make sense in advance.
> > 
> > After a few hours I can confirm that this happens with 
> > 
> > $ cat /proc/sys/vm/min_free_kbytes 
> > 16384
> > 
> > as well. See the syslog output below. Feel free to mail me to do some more
> > tests.
> 
> Ok, great. Well, not great because it's broken, but I know what's going
> on. I was able to reproduce the problem based on your report on my desktop
> and put together a fix for it. Full regression tests are still running but
> it should be in good enough state for you to test.
> 
> Without this patch, I got allocation failures within 15 minutes by stressing
> the machine. With the patch below, it's been up an hour and 15 minutes and
> I'm seeing no problems so far. Will keep the machine running a few days to
> see what happens.
> 

[...]
 
> Mariusz, please try the following patch. It should not be necessary to
> adjust your min_free_kbytes again but if you see a failure, please try
> with min_free_kbytes set to 16384. Thanks a lot.

Works for me. min_free_kbytes was left at the default of 2791. I left the laptop
running X + aMule + azureus + firefox (playing music) + a kernel compilation, so
the box was pushed a bit. Uptime is close to 9 hours with no page allocation
failures. I'll leave it running some more. If anything pops up you'll know :-)

Thanks,

Mariusz Kozlowski


Re: [patch 1/4] signalfd v1 - signalfd core ...

2007-03-15 Thread Davide Libenzi
On Thu, 15 Mar 2007, Ulrich Drepper wrote:

> On 3/7/07, Davide Libenzi  wrote:
> > Let's do this. How about you throw this way one of the case that would
> > possibly break, and I test it?
> 
> Since you make such claims I assume your signalfd() implementation
> considers a signal delivered once it is reported to an epoll() caller.
> Right?

The wakeup phase in epoll does not mean the signal is delivered. That
happens when the caller does a read(2) on the signalfd. The read(2) on the
signalfd ends up calling dequeue_signal(), the same function used by the
kernel to pull out signals for delivery (they take from the same queue).



> This is not what you really want, at least not in all cases.  A signal
> might be something you want to react on right away.  Unless
> pthread_kill() is used it is delivered to the _process_ and not a
> specific thread.  But this means if epoll() reports two events to one
> thread calling epoll() (one of them being a signal) and this thread is
> then stuck processing the other request, the signal is not handled
> even though there might be a second or third thread available to
> receive the signal.  Those threads have the same right to receive the
> signal and the current implementation always looks for the
> best/fastest way to deliver the signal.

The behaviour depends on the sigmask you pass to signalfd(). You can 
select signals that you want to handle in a standard way, and the ones 
that you want to handle with signalfd.
Typically programs using signalfd() do not want the asynchronous behaviour 
of signals at all (with all the limits you have in the handler), and the 
event dispatch loop never blocks by definition (otherwise you have more 
serious problems than a signal not delivered). They are also very likely 
to be single threaded.



> This means to me that reporting the signal in epoll() does _not_ mark
> the signal as handled.  Somehow (probably using the signalfd()
> descriptor) the thread must explicitly request the signal to be
> delivered.  But if you do this the epoll() handling is fantastically
> racy if the signal is not blocked.

As I said, when a signal hits send_signal (or the queued versions), a
wakeup is done on the wait queue head that poll (or select/epoll) is
sleeping on. This ends up delivering a POLLIN, but the signal is not fetched
(by means of dequeue_signal) until a read(2) on the signalfd is done.
Since both standard delivery and signalfd's read(2) fish from the same
queue, you have to block the signals that you want to be guaranteed to
fetch with a read(2) (signalfd also supports O_NONBLOCK).
If you do not block the signal, you get a wakeup, but you may not find a
signal to dequeue at the next read(2), because a standard delivery might
have stolen the signal by preceding you in dequeue_signal.



- Davide




Re: [stable] Three critical patches still aren't merged in 2.6.21

2007-03-15 Thread Greg KH
On Thu, Mar 15, 2007 at 12:34:07PM -0400, Chuck Ebbert wrote:
> I've been holding off sending these in for -stable until they're
> merged, but now I wonder when that will happen.

Feel free to send them to stable@ when they go to Linus as it sounds
like they are almost there.

Thanks,

greg k-h


Re: [RFC, PATCH] Fixup COMPAT_VDSO to work with CONFIG_PARAVIRT

2007-03-15 Thread Zachary Amsden

Jeremy Fitzhardinge wrote:
  

>>>> +} else if (strcmp(secstrings+sechdrs[i].sh_name,
>>>> ".dynamic") == 0) {
>>>> +Elf32_Dyn *dyn = (void *)hdr + sechdrs[i].sh_offset;
>>>> +int tag;
>>>> +while ((tag = (++dyn)->d_tag) != DT_NULL)
>>>
>>> Um, no.
>>
>> Walk based on size instead?
>
> No, I was just complaining about the embedded assignment, before dinner,
> so I was overly terse.
  


My last embedded assignment was a robot microcontroller, and I dropped 
out of that class.  So I _need_ embedded assignments.


Zach


Re: core2 duo, interrupts: is this normal?

2007-03-15 Thread Len Brown
On Thursday 15 March 2007 21:24, Norberto Bensa wrote:
> Hello,
> 
> Is this output normal? I mean, why are the counters on CPU1 zero? Isn't this
> balanced?

yes, it is normal.

If you had an interrupt-limited workload then irqbalance
would pick things up and spread them out.

-Len


> $ cat /proc/interrupts
>            CPU0       CPU1
>   0:    4180170          0   IO-APIC-edge      timer
>   1:       8060          0   IO-APIC-edge      i8042
>   7:          0          0   IO-APIC-edge      parport0
>   9:          0          0   IO-APIC-fasteoi   acpi
>  12:          5          0   IO-APIC-edge      i8042
>  16:     322297          0   IO-APIC-fasteoi   uhci_hcd:usb3, libata, nvidia, EMU10K1
>  17:     896399          0   IO-APIC-fasteoi   bttv0, eth0, libata
>  18:      72867          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb7
>  19:      27770          0   IO-APIC-fasteoi   ehci_hcd:usb2, uhci_hcd:usb5
>  20:          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
>  21:          0          0   IO-APIC-fasteoi   uhci_hcd:usb6
>  22:          3          0   IO-APIC-fasteoi   ohci1394
>  23:        155          0   IO-APIC-fasteoi   HDA Intel
> 219:     103056          0   PCI-MSI-edge      libata
> NMI:          0          0
> LOC:    4077613    4077622
> ERR:          0
> MIS:          0
> 
> 
> Many thanks in advance,
> Norberto


Re: [patch 1/4] signalfd v1 - signalfd core ...

2007-03-15 Thread Ulrich Drepper

On 3/7/07, Davide Libenzi  wrote:

> Let's do this. How about you throw this way one of the case that would
> possibly break, and I test it?


Since you make such claims I assume your signalfd() implementation
considers a signal delivered once it is reported to an epoll() caller.
Right?

This is not what you really want, at least not in all cases.  A signal
might be something you want to react on right away.  Unless
pthread_kill() is used it is delivered to the _process_ and not a
specific thread.  But this means if epoll() reports two events to one
thread calling epoll() (one of them being a signal) and this thread is
then stuck processing the other request, the signal is not handled
even though there might be a second or third thread available to
receive the signal.  Those threads have the same right to receive the
signal and the current implementation always looks for the
best/fastest way to deliver the signal.

This means to me that reporting the signal in epoll() does _not_ mark
the signal as handled.  Somehow (probably using the signalfd()
descriptor) the thread must explicitly request the signal to be
delivered.  But if you do this the epoll() handling is fantastically
racy if the signal is not blocked.


Re: [PATCH] PPC: Delete unused header file.

2007-03-15 Thread Paul Mackerras
Robert P. J. Day writes:

>   Delete apparently unused header file arch/ppc/syslib/cpc710.h.

I suggest you send this to [EMAIL PROTECTED] and Matt Porter
<[EMAIL PROTECTED]> for review.

Paul.


Re: [RFC, PATCH] Fixup COMPAT_VDSO to work with CONFIG_PARAVIRT

2007-03-15 Thread Jeremy Fitzhardinge
Zachary Amsden wrote:
> Well testing that is not so fun.  I installed SUSE Pro 9.0, and
> strings on ld.so contains the magic at_sysinfo assert!  But it doesn't
> install TLS libraries, so I'll have to install them by hand.
>
> It works - in theory.  Look, a puppy!
>
> Scratchbox is rumored to produce the fabled assertion even on modern
> distros by installing its own toolchain which includes the dreaded glibc.

I think Andi and Andrew have boxes which are afflicted.

> I'm playing safe.  Binary identical relocation to 0xe000 was my goal.

Yeah, fair enough.  But as Eric likes to keep pointing out, an
executable ELF file need not have any sections at all, so the only safe
course for anything "real" is via the program headers.

So I guess the right thing to do is relocate the dynamic stuff via
PT_DYNAMIC, and relocate the symtab if it's present.

>>> +} else if (strcmp(secstrings+sechdrs[i].sh_name,
>>> ".dynamic") == 0) {
>>> +Elf32_Dyn *dyn = (void *)hdr + sechdrs[i].sh_offset;
>>> +int tag;
>>> +while ((tag = (++dyn)->d_tag) != DT_NULL)
>>>   
>>
>> Um, no.
>>   
>
> Walk based on size instead?

No, I was just complaining about the embedded assignment, before dinner,
so I was overly terse.
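
For illustration, the same kind of walk can be written with the tag test
kept as a plain comparison rather than an assignment inside the loop
condition. This is a userspace sketch, not the patch code:
count_dynamic_entries() is a hypothetical name, the types come from
<elf.h>, and unlike the quoted loop (which pre-increments dyn) it starts
at the first entry.

```c
#include <elf.h>
#include <stddef.h>

/*
 * Count the entries of a dynamic array up to, but not including, the
 * DT_NULL terminator. Hypothetical helper for illustration only.
 */
size_t count_dynamic_entries(const Elf32_Dyn *dyn)
{
    size_t n = 0;

    for (; dyn->d_tag != DT_NULL; dyn++) {
        /* A real fixup would inspect dyn->d_tag and adjust dyn->d_un here. */
        n++;
    }

    return n;
}
```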

J


Re: [PATCH] Improve error recovery in serial mouse driver

2007-03-15 Thread Dmitry Torokhov
On Thursday 15 March 2007 15:16, Peter Osterlund wrote:
> If bytes get lost in the communication with a serial mouse using the
> MS protocol, the kernel driver could do a better job getting back in
> sync. The first byte in a packet has bit 6 set, and no other bytes
> have that bit set. Therefore, if a byte is received with bit 6 cleared
> when the driver thinks it is at byte 0 in the packet, the driver thinks
> wrong and the byte should just be ignored.
> 
> This fix prevents spurious left/right button events when the serial
> communication is disturbed by a CPU-hungry real-time process.
> 
> Signed-off-by: Peter Osterlund <[EMAIL PROTECTED]>

Applied, thank you Peter.
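
The resynchronisation rule from the patch description can be sketched in
userspace C roughly like this. It is not the driver code: struct ms_parser,
ms_feed() and the fixed 3-byte packet length are assumptions made for the
example.

```c
#include <stddef.h>

#define MS_SYNC_BIT   0x40  /* bit 6: set only on the first byte of a packet */
#define MS_PACKET_LEN 3     /* assumption: plain 3-byte MS protocol packets */

struct ms_parser {
    unsigned char buf[MS_PACKET_LEN];
    size_t idx;             /* index of the next byte expected in the packet */
};

/*
 * Feed one received byte to the parser. Returns 1 when p->buf holds a
 * complete packet, 0 otherwise.
 */
int ms_feed(struct ms_parser *p, unsigned char byte)
{
    /* Expecting byte 0 but bit 6 is clear: we are out of sync, drop it. */
    if (p->idx == 0 && !(byte & MS_SYNC_BIT))
        return 0;

    p->buf[p->idx++] = byte;
    if (p->idx < MS_PACKET_LEN)
        return 0;

    p->idx = 0;             /* packet complete, start over */
    return 1;
}
```

A stray data byte arriving at position 0 is discarded instead of being
treated as the start of a packet, which is what avoids the spurious
button events.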

-- 
Dmitry


Re: thread stacks and strict vm overcommit accounting

2007-03-15 Thread KAMEZAWA Hiroyuki
On Thu, 15 Mar 2007 11:06:21 -0800
Andrew Morton <[EMAIL PROTECTED]> wrote:

> > On Tue, 13 Mar 2007 18:33:20 +0200 Dan Aloni <[EMAIL PROTECTED]> wrote:
> > Hello,
> > 
> > This question is relevent to 2.6.20.
> > 
> > I noticed that if the RSS for the stack size is say, 8MB, running
> > a single-threaded process doesn't incur an increase of 8MB to
> > Committed_AS (/proc/meminfo).
> > 
> > However, on multi-threaded apps linked with pthread (on Debian
> > Etch with 2.6.20 vanilla x86_64), every thread will incur the
> > the specified maximum stack size RSS (assuming that you use
> > the default attr). In other words, it appears that vm accounting
> > works differently in that case.
> > 
> > Is this the intended behaviour?
> 
> That sounds like a bug to me.

AFAIK, the "main" thread's stack is marked VM_GROWSDOWN (or VM_GROWSUP) and
its size can change dynamically. The stacks of "other" threads are allocated
by mmap (or maybe malloc) and never grow. This is the difference between
multi-threaded and single-threaded processes.

So you should be careful about the stack size when you run multi-threaded apps
with strict overcommit accounting (vm.overcommit_ratio). Because even
MAP_NORESERVE mappings are accounted when
sysctl_overcommit_memory == OVERCOMMIT_NEVER, a program like java will
sometimes fail to create a new thread.
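
One way an application can limit the per-thread charge is to request a
smaller stack explicitly. A minimal sketch using the standard pthread
attribute calls follows; run_with_stack_size() and worker() are
hypothetical names, and the right size depends entirely on the application.

```c
#include <pthread.h>
#include <stddef.h>

static void *worker(void *arg)
{
    return arg;   /* trivial body: hand the argument back to the joiner */
}

/*
 * Start one thread with an explicit stack size and join it.
 * Returns 0 on success, otherwise a pthread error number.
 */
int run_with_stack_size(size_t stack_bytes)
{
    pthread_attr_t attr;
    pthread_t tid;
    int err;

    err = pthread_attr_init(&attr);
    if (err)
        return err;

    /* Ask for a fixed mapping of this size instead of the default. */
    err = pthread_attr_setstacksize(&attr, stack_bytes);
    if (!err)
        err = pthread_create(&tid, &attr, worker, NULL);
    if (!err)
        err = pthread_join(tid, NULL);

    pthread_attr_destroy(&attr);
    return err;
}
```

For example, run_with_stack_size(256 * 1024) asks for a 256 KB stack rather
than the default, which under glibc is typically taken from the stack
rlimit.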

I have no good idea to fix this difference, sorry.

-Kame



Re: [PATCH 00/18] Make common x86 arch area for i386 and x86_64 - Take 2

2007-03-15 Thread Christoph Lameter
On Thu, 15 Mar 2007, Steven Rostedt wrote:

> On Thu, 2007-03-15 at 17:06 +0100, Andi Kleen wrote:
> 
> > Well I just see a lot of pain from these patches but I doubt 
> > they will avoid any bugs. If people don't compile test both
> > archs they will always likely break on another. There are lots
> > of subtle dependencies that are not expressed in the pathname
> > even after this intrusive operation (e.g. in the includes).
> > 
> > That's just how it is.
> 
> Or that's just how you see it.

In the future it is likely that x86_64 will significantly deviate from
i386. i386 is going to be gradually abandoned because it does not support
the ever larger memory sizes, and it will mainly be used for embedded
devices. x86_64 is going to acquire more functionality that will not be
available for i386. For example, we plan to add virtual memmap support for
x86_64. Virtual memmap support may require a large chunk of virtual memory
space that is not available on i386. It's not good to have to deal with
i386 issues when doing x86_64 arch development.



Re: [PATCH 00/18] Make common x86 arch area for i386 and x86_64 - Take 2

2007-03-15 Thread Kasper Sandberg
On Wed, 2007-03-14 at 01:08 -0400, Steven Rostedt wrote:
> [Hopefully fixed email client to make it to the list this time]
> [This series has changed by using git-diff -M]

> Seems appropriate, but I really don't care what it's called.  One thing about
> this name, is that typing arch/x86 doesn't tab complete x86_64 anymore.
> But if you can think of something better, I'd be happy to apply it.
> 
Sorry for being so late, but as for what it could be called, what
about common_x86 or common/x86 or something?


> 
> -- Steve
> 
> PS. Sorry for the spam. I need to figure out how to tame quilt mail!
> 
> 


Re: [PATCH] mm/filemap.c: unconditionally call mark_page_accessed

2007-03-15 Thread Rik van Riel

[EMAIL PROTECTED] wrote:


> On the other hand, Andreas suggested only marking it once every 32 calls,
> but that required a helper variable.  Statistically, jiffies%32 should
> end up about the same as a helper variable %32.
>
> This of course, if just calling mark_page_accessed() is actually expensive
> enough that we don't want to do it unconditionally.
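
The two sampling schemes being compared can be sketched in plain C as
below. This is only an illustration, not code from the patch:
MARK_INTERVAL, sample_by_counter() and sample_by_ticks() are hypothetical
names, and `ticks` stands in for jiffies.

```c
/* Hypothetical sampling interval, taken from the discussion above. */
#define MARK_INTERVAL 32

/* Helper-variable scheme: true on every 32nd call. */
int sample_by_counter(unsigned int *calls)
{
    return ++(*calls) % MARK_INTERVAL == 0;
}

/* Clock scheme: true whenever the tick count is a multiple of 32. */
int sample_by_ticks(unsigned long ticks)
{
    return ticks % MARK_INTERVAL == 0;
}
```

The counter fires exactly once per 32 calls. The clock variant depends on
when the calls happen: every call made during the same tick gets the same
answer.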


Not caching a needed page and having to wait for a disk seek
to complete will be *way* more expensive than any call to
mark_page_accessed().

A modern CPU can do somewhere on the order of 50 million
instructions in the time it takes to bring one page in from
disk.

However, this does not mean we should unconditionally call
mark_page_accessed(), since that could cause us to push
wanted data out of the cache because of one program that
does its streaming accesses in a strange way...

This is a situation where getting it right almost certainly
matters.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: [PATCH 2/2] Replace pid_t in autofs with struct pid reference

2007-03-15 Thread Ian Kent
On Mon, 12 Mar 2007, [EMAIL PROTECTED] wrote:

> 
> From: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
> Subject: [PATCH 2/2] Replace pid_t in autofs with struct pid reference.
> 
> Make autofs container-friendly by caching struct pid reference rather
> than pid_t and using pid_nr() to retrieve a task's pid_t.
> 
> ChangeLog:
>   - Fix Eric Biederman's comments - Use find_get_pid() to hold a
> reference to oz_pgrp and release while unmounting; separate out
> changes to autofs and autofs4.

What changes to autofs4?
Do you intend this change to be made for autofs4 also?
Perhaps you expected me to do them, in which case you probably should 
ask me to do the patch.

>   - Fix Cedric's comments: retain old prototype of parse_options()
> and move necessary change to its caller.
> 
> Signed-off-by: Sukadev Bhattiprolu <[EMAIL PROTECTED]>
> Cc: Cedric Le Goater <[EMAIL PROTECTED]>
> Cc: Dave Hansen <[EMAIL PROTECTED]>
> Cc: Serge Hallyn <[EMAIL PROTECTED]>
> Cc: Eric Biederman <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED]
> Acked-by: Eric W. Biederman <[EMAIL PROTECTED]>
> 
> ---
>  fs/autofs/autofs_i.h |    4 ++--
>  fs/autofs/inode.c    |   20 ++++++++++++++++----
>  fs/autofs/root.c     |    6 ++++--
>  3 files changed, 22 insertions(+), 8 deletions(-)
> 
> Index: lx26-21-rc3-mm2/fs/autofs/autofs_i.h
> ===
> --- lx26-21-rc3-mm2.orig/fs/autofs/autofs_i.h 2007-03-12 17:12:05.0 -0700
> +++ lx26-21-rc3-mm2/fs/autofs/autofs_i.h  2007-03-12 17:18:55.0 -0700
> @@ -101,7 +101,7 @@ struct autofs_symlink {
>  struct autofs_sb_info {
>   u32 magic;
>   struct file *pipe;
> - pid_t oz_pgrp;
> + struct pid *oz_pgrp;
>   int catatonic;
>   struct super_block *sb;
>   unsigned long exp_timeout;
> @@ -122,7 +122,7 @@ static inline struct autofs_sb_info *aut
> filesystem without "magic".) */
>  
>  static inline int autofs_oz_mode(struct autofs_sb_info *sbi) {
> - return sbi->catatonic || process_group(current) == sbi->oz_pgrp;
> + return sbi->catatonic || task_pgrp(current) == sbi->oz_pgrp;
>  }
>  
>  /* Hash operations */
> Index: lx26-21-rc3-mm2/fs/autofs/inode.c
> ===
> --- lx26-21-rc3-mm2.orig/fs/autofs/inode.c 2007-03-12 17:18:48.0 -0700
> +++ lx26-21-rc3-mm2/fs/autofs/inode.c 2007-03-12 17:18:55.0 -0700
> @@ -37,6 +37,8 @@ void autofs_kill_sb(struct super_block *
>   if (!sbi->catatonic)
>   autofs_catatonic_mode(sbi); /* Free wait queues, close pipe */
>  
> + put_pid(sbi->oz_pgrp);
> +
>   autofs_hash_nuke(sbi);
>   for (n = 0 ; n < AUTOFS_MAX_SYMLINKS ; n++) {
>   if (test_bit(n, sbi->symlink_bitmap))
> @@ -139,6 +141,7 @@ int autofs_fill_super(struct super_block
>   int pipefd;
>   struct autofs_sb_info *sbi;
>   int minproto, maxproto;
> + pid_t pgid;
>  
>   sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
>   if (!sbi)
> @@ -150,7 +153,6 @@ int autofs_fill_super(struct super_block
>   sbi->pipe = NULL;
>   sbi->catatonic = 1;
>   sbi->exp_timeout = 0;
> - sbi->oz_pgrp = process_group(current);
>   autofs_initialize_hash(&sbi->dirhash);
>   sbi->queues = NULL;
>   memset(sbi->symlink_bitmap, 0, sizeof(long)*AUTOFS_SYMLINK_BITMAP_LEN);
> @@ -171,7 +173,7 @@ int autofs_fill_super(struct super_block
>  
>   /* Can this call block?  - WTF cares? s is locked. */
>   if (parse_options(data, &pipefd, &root_inode->i_uid,
> - &root_inode->i_gid, &sbi->oz_pgrp, &minproto,
> + &root_inode->i_gid, &pgid, &minproto,
>   &maxproto)) {
>   printk("autofs: called with bogus options\n");
>   goto fail_dput;
> @@ -184,13 +186,21 @@ int autofs_fill_super(struct super_block
>   goto fail_dput;
>   }
>  
> - DPRINTK(("autofs: pipe fd = %d, pgrp = %u\n", pipefd, sbi->oz_pgrp));
> + DPRINTK(("autofs: pipe fd = %d, pgrp = %u\n", pipefd, pgid));
> + sbi->oz_pgrp = find_get_pid(pgid);
> +
> + if (!sbi->oz_pgrp) {
> + printk("autofs: could not find process group %d\n", pgid);
> + goto fail_dput;
> + }
> +
>   pipe = fget(pipefd);
>   
>   if (!pipe) {
>   printk("autofs: could not open pipe file descriptor\n");
> - goto fail_dput;
> + goto fail_put_pid;
>   }
> +
>   if (!pipe->f_op || !pipe->f_op->write)
>   goto fail_fput;
>   sbi->pipe = pipe;
> @@ -205,6 +215,8 @@ int autofs_fill_super(struct super_block
>  fail_fput:
>   printk("autofs: pipe file descriptor does not contain proper ops\n");
>   fput(pipe);
> +fail_put_pid:
> + put_pid(sbi->oz_pgrp);
>  fail_dput:
>   dput(root);
>   goto fail_free;
> Index: lx26-21-rc3-mm2/fs/autofs/root.c
> ===
> --- 

Re: [PATCH] blackfin: balance parenthesis in macros

2007-03-15 Thread Wu, Bryan
On Thu, 2007-03-15 at 18:12 -0400, Mariusz Kozlowski wrote:
> Hello,
> 
> This patch (against 2.6.21-rc3-mm1) balances parenthesis in blackfin
> header files.
> 
> Signed-off-by: Mariusz Kozlowski <[EMAIL PROTECTED]>
> 
>  include/asm-blackfin/mach-bf535/bf535.h |4 ++--
>  include/asm-blackfin/scatterlist.h  |2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff -u linux-2.6.21-rc3-mm1-a/include/asm-blackfin/mach-bf535/bf535.h linux-2.6.21-rc3-mm1-b/include/asm-blackfin/mach-bf535/bf535.h
> --- linux-2.6.21-rc3-mm1-a/include/asm-blackfin/mach-bf535/bf535.h  2007-03-15 22:25:34.0 +0100
> +++ linux-2.6.21-rc3-mm1-b/include/asm-blackfin/mach-bf535/bf535.h  2007-03-15 22:33:09.0 +0100
> @@ -224,7 +224,7 @@
>  #define UART0_LSR_TEMT 0x40 /* TSR and UARTx_thr both empty */
> 
>  #define UART0_MSR_ADDR 0xffc0180c  /* UART 0 Modem status register  16 bit */
> -#define UART0_MSR  HALFWORD_REF(UART0_MSR_ADDR
> +#define UART0_MSR  HALFWORD_REF(UART0_MSR_ADDR)
>  #define UART0_SCR_ADDR 0xffc0180e  /* UART 0 Scratch register  16 bit */
>  #define UART0_SCR  HALFWORD_REF(UART0_SCR_ADDR)
>  #define UART0_IRCR_ADDR 0xffc01810  /* UART 0 IrDA Control register  16 bit */
> @@ -331,7 +331,7 @@
>  #define UART1_LSR_TEMT  0x40   /* TSR and UARTx_thr both empty */
> 
>  #define UART1_MSR_ADDR 0xffc01c0c  /* UART 1 Modem status register  16 bit */
> -#define UART1_MSR  HALFWORD_REF(UART1_MSR_ADDR
> +#define UART1_MSR  HALFWORD_REF(UART1_MSR_ADDR)
>  #define UART1_SCR_ADDR 0xffc01c0e  /* UART 1 Scratch register  16 bit */
>  #define UART1_SCR  HALFWORD_REF(UART1_SCR_ADDR)
> 
> diff -upr linux-2.6.21-rc3-mm1-a/include/asm-blackfin/scatterlist.h linux-2.6.21-rc3-mm1-b/include/asm-blackfin/scatterlist.h
> --- linux-2.6.21-rc3-mm1-a/include/asm-blackfin/scatterlist.h   2007-03-15 22:25:34.0 +0100
> +++ linux-2.6.21-rc3-mm1-b/include/asm-blackfin/scatterlist.h   2007-03-15 22:30:18.0 +0100
> @@ -17,7 +17,7 @@ struct scatterlist {
>   * returns, or alternatively stop on the first sg_dma_len(sg) which
>   * is 0.
>   */
> -#define sg_address(sg) (page_address((sg)->page) + (sg)->offset
> +#define sg_address(sg) (page_address((sg)->page) + (sg)->offset)
>  #define sg_dma_address(sg)  ((sg)->dma_address)
>  #define sg_dma_len(sg)  ((sg)->length)
> 
> 
> 
> Regards,
> 
> Mariusz Kozlowski

Thank you Mariusz,

Mike applied your patch to our SVN repo. I will send out a blackfin-arch
update patch including your contribution later.

Your review is very helpful for our development.

Best regards,
-Bryan Wu
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC, PATCH] Fixup COMPAT_VDSO to work with CONFIG_PARAVIRT

2007-03-15 Thread Zachary Amsden

Jeremy Fitzhardinge wrote:
> Zachary Amsden wrote:
> > Invoke black magic to relocate the VDSO even when COMPAT_VDSO is enabled
> > by fixing up the ELF object.
>
> So does it actually work?  Can you boot the broken distros with this in
> place?

Well testing that is not so fun.  I installed SUSE Pro 9.0, and strings
on ld.so contains the magic at_sysinfo assert!  But it doesn't install
TLS libraries, so I'll have to install them by hand.

It works - in theory.  Look, a puppy!

Scratchbox is rumored to produce the fabled assertion even on modern
distros by installing its own toolchain which includes the dreaded glibc.

> Using sections is wrong; you should be going through the phdrs, and
> looking for PT_DYNAMIC for relocation.

Will do.

> Does anyone expect the symbolic info to be correct?  It might be better
> to just stomp it so nobody gets any ideas.
>
> On the other hand, we don't want to break compatibility with anything...

I'm playing safe.  Binary identical relocation to 0xe000 was my goal.

> > +   } else if (strcmp(secstrings+sechdrs[i].sh_name, ".dynamic") == 0) {
> > +   Elf32_Dyn *dyn = (void *)hdr + sechdrs[i].sh_offset;
> > +   int tag;
> > +   while ((tag = (++dyn)->d_tag) != DT_NULL)
>
> Um, no.

Walk based on size instead?

> > +   } else if (strcmp(secstrings+sechdrs[i].sh_name, ".useless") == 0) {
> > +   /* This is demonic; see vsyscall.lds.S; it puts the
> > +* .got in a section named .useless */
> > +   uint32_t *got = (void *)hdr + sechdrs[i].sh_offset;
> > +   *got += VDSO_HIGH_BASE;
> > +   }
>
> This won't get relocated with one of the other relocations?  It's in the
> text phdr.

Hmm, I can try that.  Thanks for the suggestions / fixes.

Zach
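
For reference, a rough user-space sketch of the two suggestions discussed above: find the dynamic section through the PT_DYNAMIC program header rather than by section name, and walk it from entry 0, bounded by the segment size. The function name relocate_dynamic_sketch() and the short list of tags it adjusts are illustrative assumptions, not what the patch does. Note that the quoted loop pre-increments dyn before reading d_tag, so it never looks at the first entry; that may be what the "Um, no." refers to.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <string.h>

/*
 * Add `delta` to a few address-valued entries of the dynamic segment of an
 * ELF32 image held in memory.  Returns the number of entries adjusted, or
 * -1 if the image fails the basic bounds checks.
 *
 * Assumption: the tag list below is illustrative only, not a complete list
 * of tags that carry addresses.
 */
static int relocate_dynamic_sketch(unsigned char *image, size_t len,
				   Elf32_Addr delta)
{
	Elf32_Ehdr *hdr = (Elf32_Ehdr *)image;
	Elf32_Phdr *phdr;
	int i, fixed = 0;

	if (len < sizeof(*hdr) || memcmp(hdr->e_ident, ELFMAG, SELFMAG) != 0)
		return -1;
	if (hdr->e_phoff > len ||
	    (size_t)hdr->e_phnum * sizeof(*phdr) > len - hdr->e_phoff)
		return -1;

	phdr = (Elf32_Phdr *)(image + hdr->e_phoff);
	for (i = 0; i < hdr->e_phnum; i++) {
		Elf32_Dyn *dyn;
		size_t n, j;

		if (phdr[i].p_type != PT_DYNAMIC)
			continue;
		if (phdr[i].p_offset > len ||
		    phdr[i].p_filesz > len - phdr[i].p_offset)
			return -1;

		dyn = (Elf32_Dyn *)(image + phdr[i].p_offset);
		n = phdr[i].p_filesz / sizeof(*dyn);

		/* Start at entry 0; stop at DT_NULL or the segment size. */
		for (j = 0; j < n && dyn[j].d_tag != DT_NULL; j++) {
			switch (dyn[j].d_tag) {
			case DT_HASH:
			case DT_STRTAB:
			case DT_SYMTAB:
				dyn[j].d_un.d_ptr += delta;
				fixed++;
				break;
			default:
				break;
			}
		}
	}
	return fixed;
}
```

Bounding the walk by p_filesz means a missing DT_NULL terminator cannot run the loop off the end of the page.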


Re: [PATCH] mm/filemap.c: unconditionally call mark_page_accessed

2007-03-15 Thread Valdis . Kletnieks
On Thu, 15 Mar 2007 14:35:17 EDT, Rik van Riel said:
> [EMAIL PROTECTED] wrote:
> > On Wed, 14 Mar 2007 22:33:17 BST, Andreas Mohr said:
> > 
> >> it'd seem we need some kind of state management here to figure out good
> >> intervals of when to call mark_page_accessed() *again* for this page. E.g.
> >> despite non-changing access patterns you could still call 
> >> mark_page_accessed()
> >> every 32 calls or so to avoid expiry, but this would need extra helper
> >> variables.
> > 
> > What if you did something like
> > 
> > if (jiffies%32) {...
> > 
> > (Possibly scaling it so the low-order bits change).  No need to lock it, as
> > "right most of the time" is close enough.
> 
> Bad idea.  That way you would only count page accesses if the
> phase of the moon^Wjiffie is just right.

On the other hand, Andreas suggested only marking it once every 32 calls,
but that required a helper variable.  Statistically, jiffies%32 should
end up about the same as a helper variable %32.

Of course, this only matters if calling mark_page_accessed() is actually
expensive enough that we don't want to do it unconditionally.
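
A rough user-space sketch of the two sampling schemes, for comparison. All names here (SAMPLE_INTERVAL, should_mark_counter(), should_mark_jiffies(), simulate()) are made up for illustration, and the tick value is passed in rather than read from the kernel's jiffies. Also note that `jiffies%32` as literally written above is true 31 times out of 32; the sketch tests `== 0` to get the once-in-32 behaviour.

```c
#include <assert.h>

#define SAMPLE_INTERVAL 32u

/* Counter-based: needs a helper variable (*calls) kept per page. */
static int should_mark_counter(unsigned int *calls)
{
	return (++*calls % SAMPLE_INTERVAL) == 0;
}

/* Time-based: no helper variable; depends only on the current tick. */
static int should_mark_jiffies(unsigned long tick)
{
	return (tick % SAMPLE_INTERVAL) == 0;
}

/*
 * Count how many of `ncalls` accesses would be marked under each scheme
 * when `calls_per_tick` accesses happen within each simulated tick.
 */
static void simulate(unsigned int ncalls, unsigned int calls_per_tick,
		     unsigned int *marked_counter,
		     unsigned int *marked_jiffies)
{
	unsigned int calls = 0, i;

	assert(calls_per_tick > 0);
	*marked_counter = 0;
	*marked_jiffies = 0;
	for (i = 0; i < ncalls; i++) {
		unsigned long tick = i / calls_per_tick;

		*marked_counter += should_mark_counter(&calls);
		*marked_jiffies += should_mark_jiffies(tick);
	}
}
```

With one access per tick, both schemes mark 32 of 1024 accesses. With a burst of 16 accesses inside a single tick, the counter marks none, while the time-based check marks all 16 or none depending on which tick the burst lands in. That is the "phase of the moon" effect Rik describes, even though the long-run averages agree.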





Re: [patch 10/34] Xen-pv_ops: Simplify smp_call_function*() by using common implementation

2007-03-15 Thread Randy Dunlap
On Tue, 13 Mar 2007 16:30:27 -0700 Jeremy Fitzhardinge wrote:

> smp_call_function and smp_call_function_single are almost complete
> duplicates of the same logic.  This patch combines them by
> implementing them in terms of the more general
> smp_call_function_mask().

The kernel-doc is still not quite correct.  Patch below applies
on top of this patch from Jeremy.
---

From: Randy Dunlap <[EMAIL PROTECTED]>

Clean up arch/i386/kernel/smp.c after the Xen pv_ops patches for
smp_call_function variants.

Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
---
 arch/i386/kernel/smp.c |   13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

--- linux-2621-rc3.orig/arch/i386/kernel/smp.c
+++ linux-2621-rc3/arch/i386/kernel/smp.c
@@ -517,14 +517,14 @@ static struct call_data_struct *call_dat
 
 
 /**
- * smp_call_function_mask(): Run a function on a set of other CPUs.
+ * smp_call_function_mask - Run a function on a set of other CPUs.
  * @mask: The set of cpus to run on.  Must not include the current cpu.
  * @func: The function to run. This must be fast and non-blocking.
  * @info: An arbitrary pointer to pass to the function.
  * @wait: If true, wait (atomically) until function has completed on other 
CPUs.
  *
  * Returns 0 on success, else a negative status code. Does not return until
- * remote CPUs are nearly ready to execute <> or are or have finished.
+ * remote CPUs are nearly ready to execute func() or are or have finished.
  *
  * You must not call this function with disabled interrupts or from a
  * hardware interrupt handler or from a bottom half handler.
@@ -583,14 +583,14 @@ int smp_call_function_mask(cpumask_t mas
 }
 
 /**
- * smp_call_function(): Run a function on all other CPUs.
+ * smp_call_function - Run a function on all other CPUs.
  * @func: The function to run. This must be fast and non-blocking.
  * @info: An arbitrary pointer to pass to the function.
  * @nonatomic: currently unused.
  * @wait: If true, wait (atomically) until function has completed on other 
CPUs.
  *
  * Returns 0 on success, else a negative status code. Does not return until
- * remote CPUs are nearly ready to execute <> or are or have executed.
+ * remote CPUs are nearly ready to execute func() or are or have executed.
  *
  * You must not call this function with disabled interrupts or from a
  * hardware interrupt handler or from a bottom half handler.
@@ -602,8 +602,9 @@ int smp_call_function(void (*func) (void
 }
 EXPORT_SYMBOL(smp_call_function);
 
-/*
+/**
  * smp_call_function_single - Run a function on another CPU
+ * @cpu: The target (destination) CPU number.
  * @func: The function to run. This must be fast and non-blocking.
  * @info: An arbitrary pointer to pass to the function.
  * @nonatomic: Currently unused.
@@ -611,7 +612,7 @@ EXPORT_SYMBOL(smp_call_function);
  *
  * Retrurns 0 on success, else a negative status code.
  *
- * Does not return until the remote CPU is nearly ready to execute 
+ * Does not return until the remote CPU is nearly ready to execute func()
  * or is or has executed.
  */
 int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
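
For readers unfamiliar with the conventions this patch applies, here is a small user-space sketch. It shows the kernel-doc layout (a `/**` opener, a `name - summary` first line, one `@param:` line per parameter) on a general mask-based helper and a wrapper built on it, mirroring the structure described in the quoted changelog. Everything here is hypothetical: the `_sketch` names, the cpumask_sketch_t type and the bump() callback are invented, and the "CPUs" are simulated by calling func() sequentially in the caller.

```c
#include <assert.h>

typedef unsigned long cpumask_sketch_t;	/* hypothetical: one bit per CPU */

/**
 * call_function_mask_sketch - Run a function once for every CPU in a mask.
 * @mask: The set of (simulated) CPUs to run on.
 * @func: The function to run.
 * @info: An arbitrary pointer to pass to the function.
 *
 * Returns the number of CPUs for which func() was called.
 */
static int call_function_mask_sketch(cpumask_sketch_t mask,
				     void (*func)(void *info), void *info)
{
	int cpu, ran = 0;

	assert(func);
	for (cpu = 0; cpu < (int)(8 * sizeof(mask)); cpu++) {
		if (mask & (1UL << cpu)) {
			func(info);
			ran++;
		}
	}
	return ran;
}

/**
 * call_function_single_sketch - Run a function for one CPU.
 * @cpu: The target (destination) CPU number.
 * @func: The function to run.
 * @info: An arbitrary pointer to pass to the function.
 *
 * Implemented in terms of call_function_mask_sketch().
 * Returns 1, the number of CPUs for which func() was called.
 */
static int call_function_single_sketch(int cpu, void (*func)(void *info),
				       void *info)
{
	assert(cpu >= 0 && cpu < (int)(8 * sizeof(cpumask_sketch_t)));
	return call_function_mask_sketch(1UL << cpu, func, info);
}

/* Example callback: increment the int that info points to. */
static void bump(void *info)
{
	++*(int *)info;
}
```

Calling call_function_mask_sketch(0x5UL, bump, &hits) runs bump() for simulated CPUs 0 and 2.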


Re: [Fastboot] [PATCH 1/1] Allow i386 crash kernels to handle x86_64 dumps

2007-03-15 Thread Vivek Goyal
On Fri, Mar 16, 2007 at 11:40:07AM +0900, Magnus Damm wrote:
> On 3/16/07, Horms <[EMAIL PROTECTED]> wrote:
> >On Thu, Mar 15, 2007 at 06:56:16PM +0530, Vivek Goyal wrote:
> >> On Thu, Mar 15, 2007 at 12:22:57PM +, Ian Campbell wrote:
> >> > On Thu, 2007-03-15 at 11:17 +0530, Vivek Goyal wrote:
> >> > > > > But I think changing this macro might run into issues. It is
> >> > > > > being used at few places in kernel, for example while loading
> >> > > > > module. This will essentially mean that we allow loading 64bit
> >> > > > > x86_64 modules on 32bit i386 systems?
> >> >
> >> > Yes, not sure how I missed that fact...
> >> >
> >> > > Kexec will also not allow loading an x86_64 kernel on a 32bit 
> >machine.
> >> >
> >> > For crash kernel only or for regular kexec too?
> >> >
> >>
> >> I think for both. One of the possible reasons I think is that one never
> >> knows is underlying machine has got 64bit extensions or not. So even if
> >> we load the kernel it will never boot. Secondly, we might not be able to
> >> handle 64bit address in 32bit kernel/user space?
> >
> >Perhaps I am miss-understanding what you are saying, but I do
> >recally kexecing from 32->64 and 64->32 bit kernels on x86_64 hardware.
> >I can run these checks again if it helps.
> 

I stand corrected. I can kexec a bzImage 32->64bit. It did run into some
initrd issues later, but that is a separate matter; fundamentally kexec
could load the 64bit kernel bzImage and make the transition successfully.

So it will now be left to the user. If he tries to kexec to a 64bit kernel
on a machine that does not support 64bit extensions, kexec will not give
any advance warning.

Thanks
Vivek


Re: [PATCH 0/3] Lumpy Reclaim V5

2007-03-15 Thread Andrew Morton
On Mon, 12 Mar 2007 18:22:45 + Andy Whitcroft <[EMAIL PROTECTED]> wrote:

> Following this email are three patches which represent the
> current state of the lumpy reclaim patches; collectively lumpy V5.

So where do we stand with this now?  Does it make anything get better?

I (continue to) think that if this is to be truly useful, we need some way
of using it from kswapd to keep a certain minimum number of order-1,
order-2, etc pages in the freelists.


Re: [RFC, PATCH] Fixup COMPAT_VDSO to work with CONFIG_PARAVIRT

2007-03-15 Thread Jeremy Fitzhardinge
Zachary Amsden wrote:
> Invoke black magic to relocate the VDSO even when COMPAT_VDSO is enabled
> by fixing up the ELF object.
>   

So does it actually work?  Can you boot the broken distros with this in
place?

> Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]>
>
> Index: linux-2.6.21/arch/i386/kernel/entry.S
> ===
> --- linux-2.6.21.orig/arch/i386/kernel/entry.S2007-03-06 
> 18:51:33.0 -0800
> +++ linux-2.6.21/arch/i386/kernel/entry.S 2007-03-15 18:14:11.0 
> -0800
> @@ -305,16 +305,12 @@ sysenter_past_esp:
>   pushl $(__USER_CS)
>   CFI_ADJUST_CFA_OFFSET 4
>   /*CFI_REL_OFFSET cs, 0*/
> -#ifndef CONFIG_COMPAT_VDSO
>   /*
>* Push current_thread_info()->sysenter_return to the stack.
>* A tiny bit of offset fixup is necessary - 4*4 means the 4 words
>* pushed above; +8 corresponds to copy_thread's esp0 setting.
>*/
>   pushl (TI_sysenter_return-THREAD_SIZE+8+4*4)(%esp)
> -#else
> - pushl $SYSENTER_RETURN
> -#endif
>   CFI_ADJUST_CFA_OFFSET 4
>   CFI_REL_OFFSET eip, 0
>  
> Index: linux-2.6.21/arch/i386/kernel/sysenter.c
> ===
> --- linux-2.6.21.orig/arch/i386/kernel/sysenter.c 2007-03-06 
> 18:51:34.0 -0800
> +++ linux-2.6.21/arch/i386/kernel/sysenter.c  2007-03-15 18:27:43.0 
> -0800
> @@ -72,6 +72,99 @@ extern const char vsyscall_int80_start, 
>  extern const char vsyscall_sysenter_start, vsyscall_sysenter_end;
>  static struct page *syscall_pages[1];
>  
> +#ifdef CONFIG_COMPAT_VDSO
> +static void fixup_vsyscall_elf(char *page)
> +{
> + Elf32_Ehdr *hdr;
> + Elf32_Shdr *sechdrs;
> + Elf32_Phdr *phdr;
> + char *secstrings;
> + int i, j, n;
> +
> + hdr = (Elf32_Ehdr *)page;
> + 
> + printk("Remapping vsyscall page to %08x\n", (unsigned 
> int)VDSO_HIGH_BASE);
> +
> + /* Sanity checks against insmoding binaries or wrong arch,
> +   weird elf version */
> + if (memcmp(hdr->e_ident, ELFMAG, 4) != 0 ||
> + !elf_check_arch(hdr) ||
> + hdr->e_type != ET_DYN)
> + panic("Bogus ELF in vsyscall DSO\n");
> +
> + hdr->e_entry += VDSO_HIGH_BASE;
> + sechdrs = (void *)hdr + hdr->e_shoff;
> + secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;
> +
> + for (i = 1; i < hdr->e_shnum; i++) {
>   

Using sections is wrong; you should be going through the phdrs, and
looking for PT_DYNAMIC for relocation.

> + if (!(sechdrs[i].sh_flags & SHF_ALLOC))
> + continue;
> +
> + sechdrs[i].sh_addr += VDSO_HIGH_BASE;
> + if (strcmp(secstrings+sechdrs[i].sh_name, ".dynsym") == 0) {
> + Elf32_Sym  *sym =  (void *)hdr + sechdrs[i].sh_offset;
> + n = sechdrs[i].sh_size / sizeof(*sym);
> + for (j = 1; j < n;  j++) {
> + int ndx = sym[j].st_shndx;
> + if (ndx == SHN_UNDEF || ndx == SHN_ABS)
> + continue;
> + sym[j].st_value += VDSO_HIGH_BASE;
> + }
>   

Does anyone expect the symbolic info to be correct?  It might be better
to just stomp it so nobody gets any ideas.

On the other hand, we don't want to break compatibility with anything...

> + } else if (strcmp(secstrings+sechdrs[i].sh_name, ".dynamic") == 
> 0) {
> + Elf32_Dyn *dyn = (void *)hdr + sechdrs[i].sh_offset;
> + int tag;
> + while ((tag = (++dyn)->d_tag) != DT_NULL)
>   

Um, no.

> + } else if (strcmp(secstrings+sechdrs[i].sh_name, ".useless") == 
> 0) {
> + /* This is demonic; see vsyscall.lds.S; it puts the
> +  * .got in a section named .useless */
> + uint32_t *got = (void *)hdr + sechdrs[i].sh_offset;
> + *got += VDSO_HIGH_BASE;
> + }
>   

This won't get relocated with one of the other relocations?  It's in the
text phdr.

> + }
> + phdr = (void *)hdr + hdr->e_phoff;
> + for (i = 0; i < hdr->e_phnum; i++) {
> + phdr[i].p_vaddr += VDSO_HIGH_BASE;
> + phdr[i].p_paddr += VDSO_HIGH_BASE;
> + }
> +
> +#if 0
> +/* 
> + * To verify the binary image in memory is identical, linked in the VDSO page
> + * from a COMPAT_VDSO compile without this patch; then diff the two.  For a
> + * non-relocated fixmap, the VDSO image is identical.
> + */
> +{
> + extern const char vsyscall_orig_start, vsyscall_orig_end;
> + int *l1 = (int *)page, *l2 = (int *)_orig_start;
> + int foo = vsyscall_orig_end - vsyscall_orig_start / 4;
> + for (i = 0; i < foo; i++) {
> + if (l1[i] != l2[i]) {
> + printk("vsyscall - delta [%03x] orig %08x new 

Re: [PATCH 10/22 take 3] UBI: EBA unit

2007-03-15 Thread Randy Dunlap
On Thu, 15 Mar 2007 18:29:51 -0500 Josh Boyer wrote:

> On Thu, Mar 15, 2007 at 02:24:10PM -0700, Randy Dunlap wrote:
> > On Thu, 15 Mar 2007 11:07:03 -0800 Andrew Morton wrote:
> > 
> > > 
> > > There's way too much code here to expect it to get decently reviewed, 
> > > alas.
> > 
> > Yes.
> > 
> > /me repeats wish that Not Everything Should Be Sent to lkml.  :(
> 
> Just curious, but where would you suggest this be sent to for review then?

Valid question.  I should have chosen some other more appropriate
patch to make that comment.

I don't see a better list for UBI patches, so lkml is OK IMO.


Here is a summary of my thinking on Linux-related mailing lists.

1.  Bug reports can go to lkml or focused mailing lists.

2.  Development (like patches) should go to focused mailing lists
if there is such a list and they have enough usage.

Development areas that qualify for this IMO are:
- ACPI
- ATA
- file systems
- frame buffer
- ieee1394
- MM/VM
- multimedia
- networking
- PCI
- power management, suspend/resume
- SCSI
- sound
- USB
- virtualization


(not that I expect anything close to consensus on this)
---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***


Re: [BUGFIX][PATCH] fixing placement of register stack under ulimit -s

2007-03-15 Thread KAMEZAWA Hiroyuki
Please allow me to explain more.

Why having the register stack and memory stack upside down is bad is a bit
complicated. So... here is a test and its result to explain the bug.

This is the sample code and its result on 2.6.21-rc3.
Note: the base address of the memory stack can change randomly.

== sample code ==
[EMAIL PROTECTED] ~]$ cat sample.c
#include <stdio.h>

/* Print num, then recurse until num reaches 0. */
void do_print(int num)
{
	if (num == 0)
		return;
	printf("%d\n", num);
	do_print(num - 1);
}

int main(int argc, char *argv[])
{
	do_print(1);
	return 0;
}

== before ulimit ==
[EMAIL PROTECTED] ~]$ uname -a
Linux drpq 2.6.21-rc3 #3 SMP Fri Mar 16 11:57:47 JST 2007 ia64 ia64 ia64 
GNU/Linux
[EMAIL PROTECTED] ~]$ ulimit -s
8192
[EMAIL PROTECTED] ~]$ ulimit -s -H
unlimited
[EMAIL PROTECTED] ~]$ ./sample
1


1
[EMAIL PROTECTED] ~]$
== after ulimit -s 8192 ==

[EMAIL PROTECTED] ~]$ ulimit -s
8192
[EMAIL PROTECTED] ~]$ ulimit -s -H
8192
[EMAIL PROTECTED] ~]$ ./sample  
1



9612
9611
9610
9609
9608
Segmentation fault

[EMAIL PROTECTED] ~]$ ./sample   (when I'm lucky)
1


1
[EMAIL PROTECTED] ~]$
=

Stopping at 9608 is far too early to have used up the whole stack. The
reason is that the combination "ulimit -s + memory stack randomization +
register-stack-expansion" is buggy.
If unlucky, the program can use only one page for the register stack.
My patch will fix this case.
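
As a side note for anyone reproducing this: the two values shown by `ulimit -s` and `ulimit -s -H` are the soft and hard RLIMIT_STACK limits, and a program can read them with getrlimit(2). A minimal sketch follows; the helper names stack_limits() and print_limit() are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/*
 * Fetch the stack limits in bytes: *soft corresponds to `ulimit -s`,
 * *hard to `ulimit -s -H` (the shell prints them in KiB).
 * Returns 0 on success, -1 on failure.
 */
static int stack_limits(rlim_t *soft, rlim_t *hard)
{
	struct rlimit rl;

	assert(soft && hard);
	if (getrlimit(RLIMIT_STACK, &rl) != 0)
		return -1;
	*soft = rl.rlim_cur;
	*hard = rl.rlim_max;
	return 0;
}

/* Print one limit the way the shell does: "unlimited" or a KiB value. */
static void print_limit(const char *name, rlim_t v)
{
	if (v == RLIM_INFINITY)
		printf("%s: unlimited\n", name);
	else
		printf("%s: %llu KiB\n", name, (unsigned long long)(v / 1024));
}
```

In the failing run above, the hard limit is finite (8192) rather than unlimited.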

-Kame










[RFC, PATCH] Fixup COMPAT_VDSO to work with CONFIG_PARAVIRT

2007-03-15 Thread Zachary Amsden
Paravirt-ops guests which move the fixmap also end up moving the syscall 
VDSO.  This fails if it is prelinked at a fixed address, which is why 
COMPAT_VDSO is broken under CONFIG_VMI (and also under CONFIG_XEN).  
Several options are available to try to address this.  Jan had cooked up 
a patch for Xen that used build magic to find the parts of the VDSO that 
need relocation.  I don't like the idea of having auto-generated 
relocations, as someday something could change between two linked 
objects (timestamp, elf notes perhaps) that is not a relocation.  So I 
prefer human supervision over the relocation and explicitly fixing 
everything by hand.  I'm not necessarily advocating one solution over 
the other; my way is more code to maintain if the VDSO linkage changes.  
I'm looking for feedback about which way is best.


Also, it appears that COMPAT_VDSO could disappear entirely.  Since this 
approach should work with older broken ld.so (2.3.2 is the version, I 
believe), we should be able to switch over completely to using the gate 
vma style of linking the vdso.  One can even get the address 
randomization benefits by simply running fixup on the vdso if you are 
prepared to take the cost of allocating an extra page per process.  Or 
you could randomize just once at boot, which makes the randomization 
per-machine, still sufficient to slow network based worm attacks which 
might rely on a fixed VDSO address.


Clearly this patch needs more testing and feedback, which I'm sure it 
will get...




Zach

P.S. - Eric, I've copied you as you appear to be an ELF expert, or at 
least have a greater grasp of Elven Magic than me, and I'm hoping I got 
all the dynamic tags which need relocation right.

Invoke black magic to relocate the VDSO even when COMPAT_VDSO is enabled
by fixing up the ELF object.

Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]>

Index: linux-2.6.21/arch/i386/kernel/entry.S
===
--- linux-2.6.21.orig/arch/i386/kernel/entry.S  2007-03-06 18:51:33.0 
-0800
+++ linux-2.6.21/arch/i386/kernel/entry.S   2007-03-15 18:14:11.0 
-0800
@@ -305,16 +305,12 @@ sysenter_past_esp:
pushl $(__USER_CS)
CFI_ADJUST_CFA_OFFSET 4
/*CFI_REL_OFFSET cs, 0*/
-#ifndef CONFIG_COMPAT_VDSO
/*
 * Push current_thread_info()->sysenter_return to the stack.
 * A tiny bit of offset fixup is necessary - 4*4 means the 4 words
 * pushed above; +8 corresponds to copy_thread's esp0 setting.
 */
pushl (TI_sysenter_return-THREAD_SIZE+8+4*4)(%esp)
-#else
-   pushl $SYSENTER_RETURN
-#endif
CFI_ADJUST_CFA_OFFSET 4
CFI_REL_OFFSET eip, 0
 
Index: linux-2.6.21/arch/i386/kernel/sysenter.c
===
--- linux-2.6.21.orig/arch/i386/kernel/sysenter.c   2007-03-06 
18:51:34.0 -0800
+++ linux-2.6.21/arch/i386/kernel/sysenter.c2007-03-15 18:27:43.0 
-0800
@@ -72,6 +72,99 @@ extern const char vsyscall_int80_start, 
 extern const char vsyscall_sysenter_start, vsyscall_sysenter_end;
 static struct page *syscall_pages[1];
 
+#ifdef CONFIG_COMPAT_VDSO
+static void fixup_vsyscall_elf(char *page)
+{
+   Elf32_Ehdr *hdr;
+   Elf32_Shdr *sechdrs;
+   Elf32_Phdr *phdr;
+   char *secstrings;
+   int i, j, n;
+
+   hdr = (Elf32_Ehdr *)page;
+   
+   printk("Remapping vsyscall page to %08x\n", (unsigned 
int)VDSO_HIGH_BASE);
+
+   /* Sanity checks against insmoding binaries or wrong arch,
+   weird elf version */
+   if (memcmp(hdr->e_ident, ELFMAG, 4) != 0 ||
+   !elf_check_arch(hdr) ||
+   hdr->e_type != ET_DYN)
+   panic("Bogus ELF in vsyscall DSO\n");
+
+   hdr->e_entry += VDSO_HIGH_BASE;
+   sechdrs = (void *)hdr + hdr->e_shoff;
+   secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;
+
+   for (i = 1; i < hdr->e_shnum; i++) {
+   if (!(sechdrs[i].sh_flags & SHF_ALLOC))
+   continue;
+
+   sechdrs[i].sh_addr += VDSO_HIGH_BASE;
+   if (strcmp(secstrings+sechdrs[i].sh_name, ".dynsym") == 0) {
+   Elf32_Sym  *sym =  (void *)hdr + sechdrs[i].sh_offset;
+   n = sechdrs[i].sh_size / sizeof(*sym);
+   for (j = 1; j < n;  j++) {
+   int ndx = sym[j].st_shndx;
+   if (ndx == SHN_UNDEF || ndx == SHN_ABS)
+   continue;
+   sym[j].st_value += VDSO_HIGH_BASE;
+   }
+   } else if (strcmp(secstrings+sechdrs[i].sh_name, ".dynamic") == 
0) {
+   Elf32_Dyn *dyn = (void *)hdr + sechdrs[i].sh_offset;
+   int tag;
+   while ((tag = (++dyn)->d_tag) != DT_NULL)
+   

Re: [PATCH 1/1] Allow i386 crash kernels to handle x86_64 dumps

2007-03-15 Thread Vivek Goyal
On Fri, Mar 16, 2007 at 08:48:08AM +0900, Horms wrote:
> On Thu, Mar 15, 2007 at 06:56:16PM +0530, Vivek Goyal wrote:
> > On Thu, Mar 15, 2007 at 12:22:57PM +, Ian Campbell wrote:
> > > On Thu, 2007-03-15 at 11:17 +0530, Vivek Goyal wrote:
> > > > > > But I think changing this macro might run into issues. It is
> > > > > > being used at few places in kernel, for example while loading
> > > > > > module. This will essentially mean that we allow loading 64bit
> > > > > > x86_64 modules on 32bit i386 systems?
> > > 
> > > Yes, not sure how I missed that fact...
> > > 
> > > > Kexec will also not allow loading an x86_64 kernel on a 32bit machine.
> > > 
> > > For crash kernel only or for regular kexec too?
> > > 
> > 
> > I think for both. One of the possible reasons I think is that one never
> > knows is underlying machine has got 64bit extensions or not. So even if
> > we load the kernel it will never boot. Secondly, we might not be able to
> > handle 64bit address in 32bit kernel/user space?
> 
> Perhaps I am miss-understanding what you are saying, but I do
> recally kexecing from 32->64 and 64->32 bit kernels on x86_64 hardware.
> I can run these checks again if it helps.
> 

Yesterday I tested it. I could kexec from 64->32bit but not vice versa.
kexec-tools itself gave an error message:

"Cannot determine the file type of ../x86_64-vmlinux/vmlinux"

I did not investigate deeper, but I have a basic question. How will kexec
know whether the underlying 32bit machine supports 64bit extensions? Do
we allow loading a 64bit kernel even though the underlying machine might
not support it?

Probably you can also give it a try. 
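
One way a user-space tool could answer that question, sketched under assumptions: on x86 Linux the "flags" line of /proc/cpuinfo contains the token "lm" when the CPU supports long mode. The helper below only does the token matching on a flags string that the caller has already read; has_cpu_flag() is a made-up name, and nothing here claims that kexec-tools performs such a check.

```c
#include <assert.h>
#include <string.h>

/*
 * Return 1 if `token` appears as a whole whitespace-separated word in
 * `flags` (the value part of a /proc/cpuinfo "flags" line), else 0.
 */
static int has_cpu_flag(const char *flags, const char *token)
{
	size_t tlen = strlen(token);
	const char *p = flags;

	assert(tlen > 0);
	while ((p = strstr(p, token)) != NULL) {
		int starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
		int ends = p[tlen] == '\0' || p[tlen] == ' ' ||
			   p[tlen] == '\n';

		if (starts && ends)
			return 1;
		p += tlen;	/* substring of a longer flag; keep looking */
	}
	return 0;
}
```

Whole-word matching matters here because other flags, such as lahf_lm, contain "lm" as a substring.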
 
> > > > So how about something like vmcore_elf_allowed_cross_arch()? Vmcore
> > > > code can continue to check elf_check_arch() and if that fails it can
> > > > invoke vmcore_elf_allowed_cross_arch() to find out what cross arch are
> > > > allowed for vmcore. 
> > > 
> > > Something like this?
> > > 
> > > Ian.
> > > 
> > > ---  
> > > 
> > > Allow i386 crash kernels to handle x86_64 dumps.
> > > 
> > > The specific case I am encountering is kdump under Xen with a 64 bit
> > > hypervisor and 32 bit kernel/userspace. The dump created is a 64 bit
> > > due to the hypervisor but the dump kernel is 32 bit in for maximum
> > > compatibility.
> > > 
> > > It's possibly less likely to be useful in a purely native scenario but
> > > I see no reason to disallow it.
> > > 
> > > Signed-off-by: Ian Campbell <[EMAIL PROTECTED]>
> > > 
> > > diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> > > index d960507..523e109 100644
> > > --- a/fs/proc/vmcore.c
> > > +++ b/fs/proc/vmcore.c
> > > @@ -514,7 +514,7 @@ static int __init parse_crash_elf64_headers(void)
> > >   /* Do some basic Verification. */
> > >   if (memcmp(ehdr.e_ident, ELFMAG, SELFMAG) != 0 ||
> > >   (ehdr.e_type != ET_CORE) ||
> > > - !elf_check_arch() ||
> > > + !vmcore_elf_check_arch() ||
> > >   ehdr.e_ident[EI_CLASS] != ELFCLASS64 ||
> > >   ehdr.e_ident[EI_VERSION] != EV_CURRENT ||
> > >   ehdr.e_version != EV_CURRENT ||
> > > diff --git a/include/asm-i386/kexec.h b/include/asm-i386/kexec.h
> > > index 4dfc9f5..c76737e 100644
> > > --- a/include/asm-i386/kexec.h
> > > +++ b/include/asm-i386/kexec.h
> > > @@ -47,6 +47,9 @@
> > >  /* The native architecture */
> > >  #define KEXEC_ARCH KEXEC_ARCH_386
> > > 
> > > +/* We can also handle crash dumps from 64 bit kernel. */
> > > +#define vmcore_elf_check_arch_cross(x) ((x)->e_machine == EM_X86_64)
> > > +
> > 
> > Ideal place for this probably should have been arch dependent crash_dump.h
> > file. But we don't have one and no point introducing one just for this 
> > macro.
> > 
> > This change looks good to me.
> 
> Won't the above change break non i386 archtectures as
> vmcore_elf_check_arch_cross isn't defined for them?
> 

In the original patch he put an arch independent definition in
include/linux/crash_dump.h, which makes sure it is not broken on
other architectures.
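
The fallback pattern being described can be sketched in user space as follows. This is an assumption about the shape of the generic definition, not a copy of the patch: elf_check_arch_sketch() stands in for the architecture's own elf_check_arch(), and vmcore_arch_ok_sketch() is a made-up wrapper so the macros can be exercised.

```c
#include <assert.h>
#include <elf.h>

/* Stand-in for the architecture's elf_check_arch(); i386 flavour. */
#define elf_check_arch_sketch(x) ((x)->e_machine == EM_386)

/* Arch header (asm-i386/kexec.h in the quoted patch) opts in to x86_64. */
#define vmcore_elf_check_arch_cross(x) ((x)->e_machine == EM_X86_64)

/*
 * Generic header: architectures that did not define the cross check
 * get a default that allows no foreign dumps.
 */
#ifndef vmcore_elf_check_arch_cross
#define vmcore_elf_check_arch_cross(x) 0
#endif

/* Accept the native architecture first, then any allowed cross arch. */
#define vmcore_elf_check_arch(x) \
	(elf_check_arch_sketch(x) || vmcore_elf_check_arch_cross(x))

/* Small wrapper so the macro can be called like a function. */
static int vmcore_arch_ok_sketch(const Elf64_Ehdr *ehdr)
{
	assert(ehdr);
	return vmcore_elf_check_arch(ehdr);
}
```

With the i386 definitions above, headers with e_machine EM_386 or EM_X86_64 pass and anything else is rejected. If the arch define is dropped, only the native check remains.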

Thanks
Vivek



Re: [Fastboot] [PATCH 1/1] Allow i386 crash kernels to handle x86_64 dumps

2007-03-15 Thread Magnus Damm

On 3/16/07, Horms <[EMAIL PROTECTED]> wrote:
> On Thu, Mar 15, 2007 at 06:56:16PM +0530, Vivek Goyal wrote:
> > On Thu, Mar 15, 2007 at 12:22:57PM +, Ian Campbell wrote:
> > > On Thu, 2007-03-15 at 11:17 +0530, Vivek Goyal wrote:
> > > > > > But I think changing this macro might run into issues. It is
> > > > > > being used at few places in kernel, for example while loading
> > > > > > module. This will essentially mean that we allow loading 64bit
> > > > > > x86_64 modules on 32bit i386 systems?
> > >
> > > Yes, not sure how I missed that fact...
> > >
> > > > Kexec will also not allow loading an x86_64 kernel on a 32bit machine.
> > >
> > > For crash kernel only or for regular kexec too?
> > >
> >
> > I think for both. One of the possible reasons I think is that one never
> > knows is underlying machine has got 64bit extensions or not. So even if
> > we load the kernel it will never boot. Secondly, we might not be able to
> > handle 64bit address in 32bit kernel/user space?
>
> Perhaps I am miss-understanding what you are saying, but I do
> recally kexecing from 32->64 and 64->32 bit kernels on x86_64 hardware.
> I can run these checks again if it helps.

I recall kexecing a bzImage for x86_64 on i386, but I'm not 100% sure.
I think it worked because the bzImage loader code was regular 32 bit
x86 code, but that may be wrong as well.

> Won't the above change break non i386 archtectures as
> vmcore_elf_check_arch_cross isn't defined for them?


Right. And maybe it's a good idea to make sure that this feature is
actually supported by kexec-tools before adding code to the kernel?

My gut feeling about this is that you are begging for trouble. The
kexec/kdump solution is fragile just by itself, and trying to go
between architectures is just going to be painful.

/ magnus


Re: CONFIG_REORDER Kconfig help strange sentence.

2007-03-15 Thread Randy Dunlap
On Tue, 13 Mar 2007 17:37:35 +1100 Rusty Russell wrote:

> On Tue, 2007-03-13 at 00:56 +0100, Andi Kleen wrote:
> > On Tue, Mar 13, 2007 at 10:18:03AM +1100, Rusty Russell wrote:
> > > OK, this confused me:
> > > 
> > > Function reordering (REORDER) [N/y/?] (NEW) ?
> > > 
> > > This option enables the toolchain to reorder functions for a more 
> > > optimal TLB usage. If you have pretty much any version of 
> > > binutils, 
> > > this can increase your kernel build time by roughly one minute.
> > > 
> > > "If you have pretty much any version of binutils"?  Huh?
> > > 
> > > You mean "This will slow your kernel build by about a minute"?
> > 
> > Yes. Lots of sections seem to trigger some quadratic behaviour in ld.
> > 
> > It might be fixed in some unreleased CVS version though (not 100% sure) 
> > 
> > -Andi
> 
> OK, well here is a patch for the moment.
> 
> ==
> Clarify CONFIG_REORDER explanation
> 
> if (1 && X) => if (X).
> 
> Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>
> 
> diff -r de5618b5e562 arch/x86_64/Kconfig
> --- a/arch/x86_64/Kconfig Tue Mar 13 11:41:55 2007 +1100
> +++ b/arch/x86_64/Kconfig Tue Mar 13 17:27:05 2007 +1100
> @@ -632,8 +632,8 @@ config REORDER
>   default n
>   help
>   This option enables the toolchain to reorder functions for a more 
> - optimal TLB usage. If you have pretty much any version of binutils, 
> -  this can increase your kernel build time by roughly one minute.
> + optimal TLB usage.  This will slow your kernel build by
> +  roughly one minute.

Please consistently use  for help text.
Yes, it was already mucked up.

>  config K8_NB
>   def_bool y



---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***


Re: [PATCH 1/1] Allow i386 crash kernels to handle x86_64 dumps

2007-03-15 Thread Vivek Goyal
On Thu, Mar 15, 2007 at 01:42:39PM +, Ian Campbell wrote:
> On Thu, 2007-03-15 at 18:56 +0530, Vivek Goyal wrote:
> > 
> > Ideal place for this probably should have been arch dependent
> > crash_dump.h file. But we don't have one and no point introducing one
> > just for this  macro.
> 
> Agreed.
> 
> > This change looks good to me. 
> 
> Is there a kdump tree which you'll apply to or shall I resend CCing
> apkm? (I'll add an Acked-by if that's ok).
> 

There is no separate kdump tree. Generally Andrew picks up these changes.
I guess just resend it copying Andrew. Yes you can add Acked-by me.

Thanks
Vivek


Re: Loading both the pata_atiixp and the ahci driver causes problems

2007-03-15 Thread Jon Masters

Chuck Ebbert wrote:
> If you try to load both the pata_atiixp and the ahci driver
> (for the same ATI SB600 adapter), very strange things happen.
> The AHCI driver churns for three minutes or so, spewing
> messages like this, then nothing works:
>
> <6>ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> <4>ata3.00: qc timeout (cmd 0xec)
> <4>ata3.00: failed to IDENTIFY (I/O error, err_mask=0x104)
>
> Shouldn't it be able to tell the device has already been
> claimed by some other driver?

One would assume it'd fail to grab the PCI IO ranges twice? I haven't 
looked at the code but I have seen this bug mentioned elsewhere so I 
might well end up having to do that yet :-)


Jon.


Re: AMD64 kernel oops

2007-03-15 Thread Parag Warudkar
Joerg Platte  naasa.net> writes:

> Pid: 14, comm: events/0 Not tainted 2.6.18-4-amd64 #1
> RIP: 0010:[]  [] keyring_destroy+0x32/0x96

[Snip]

> Can this oops be caused by a known and already 
> fixed problem in a newer kernel version? In this case I would submit a bug 
> to the Debian BTS. Otherwise what can I do to further reproduce and debug 
> this oops?
> 

Check out http://bugzilla.kernel.org/show_bug.cgi?id=8067 which is a duplicate
of http://bugzilla.kernel.org/show_bug.cgi?id=7727 which is fixed. There is a
patch available on the bugzilla if you want to try it out.

HTH
Parag





[PATCH 2/2] scc_pata: move from ide/ppc to ide/pci

2007-03-15 Thread Akira Iguchi
This patch moves scc_pata from ide/ppc to ide/pci in order to
build it as a normal module.

Signed-off-by: Kou Ishizaki <[EMAIL PROTECTED]>
Signed-off-by: Akira Iguchi <[EMAIL PROTECTED]>
---

diff -Nrpu -X linux-2.6.21-rc3/Documentation/dontdiff 
linux-2.6.21-rc3/drivers/ide/pci/scc_pata.c 
linux-2.6.21-rc3.mod/drivers/ide/pci/scc_pata.c
--- linux-2.6.21-rc3/drivers/ide/pci/scc_pata.c 1970-01-01 09:00:00.0 
+0900
+++ linux-2.6.21-rc3.mod/drivers/ide/pci/scc_pata.c 2007-03-16 
18:47:36.0 +0900
@@ -0,0 +1,858 @@
+/*
+ * Support for IDE interfaces on Celleb platform
+ *
+ * (C) Copyright 2006 TOSHIBA CORPORATION
+ *
+ * This code is based on drivers/ide/pci/siimage.c:
+ * Copyright (C) 2001-2002 Andre Hedrick <[EMAIL PROTECTED]>
+ * Copyright (C) 2003  Red Hat <[EMAIL PROTECTED]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define PCI_DEVICE_ID_TOSHIBA_SCC_ATA0x01b4
+
+#define SCC_PATA_NAME   "scc IDE"
+
+#define TDVHSEL_MASTER  0x0001
+#define TDVHSEL_SLAVE   0x0004
+
+#define MODE_JCUSFEN0x0080
+
+#define CCKCTRL_ATARESET0x0004
+#define CCKCTRL_BUFCNT  0x0002
+#define CCKCTRL_CRST0x0001
+#define CCKCTRL_OCLKEN  0x0100
+#define CCKCTRL_ATACLKOEN   0x0002
+#define CCKCTRL_LCLKEN  0x0001
+
+#define QCHCD_IOS_SS   0x0001
+
+#define QCHSD_STPDIAG  0x0002
+
+#define INTMASK_MSK 0xD112
+#define INTSTS_SERROR  0x8000
+#define INTSTS_PRERR   0x4000
+#define INTSTS_RERR0x1000
+#define INTSTS_ICERR   0x0100
+#define INTSTS_BMSINT  0x0010
+#define INTSTS_BMHE0x0008
+#define INTSTS_IOIRQS   0x0004
+#define INTSTS_INTRQ0x0002
+#define INTSTS_ACTEINT  0x0001
+
+#define ECMODE_VALUE 0x01
+
+static struct scc_ports {
+   unsigned long ctl, dma;
+   unsigned char hwif_id;  /* for removing hwif from system */
+} scc_ports[MAX_HWIFS];
+
+/* PIO transfer mode  table */
+/* JCHST */
+static unsigned long JCHSTtbl[2][7] = {
+   {0x0E, 0x05, 0x02, 0x03, 0x02, 0x00, 0x00},   /* 100MHz */
+   {0x13, 0x07, 0x04, 0x04, 0x03, 0x00, 0x00}/* 133MHz */
+};
+
+/* JCHHT */
+static unsigned long JCHHTtbl[2][7] = {
+   {0x0E, 0x02, 0x02, 0x02, 0x02, 0x00, 0x00},   /* 100MHz */
+   {0x13, 0x03, 0x03, 0x03, 0x03, 0x00, 0x00}/* 133MHz */
+};
+
+/* JCHCT */
+static unsigned long JCHCTtbl[2][7] = {
+   {0x1D, 0x1D, 0x1C, 0x0B, 0x06, 0x00, 0x00},   /* 100MHz */
+   {0x27, 0x26, 0x26, 0x0E, 0x09, 0x00, 0x00}/* 133MHz */
+};
+
+
+/* DMA transfer mode  table */
+/* JCHDCTM/JCHDCTS */
+static unsigned long JCHDCTxtbl[2][7] = {
+   {0x0A, 0x06, 0x04, 0x03, 0x01, 0x00, 0x00},   /* 100MHz */
+   {0x0E, 0x09, 0x06, 0x04, 0x02, 0x01, 0x00}/* 133MHz */
+};
+
+/* JCSTWTM/JCSTWTS  */
+static unsigned long JCSTWTxtbl[2][7] = {
+   {0x06, 0x04, 0x03, 0x02, 0x02, 0x02, 0x00},   /* 100MHz */
+   {0x09, 0x06, 0x04, 0x02, 0x02, 0x02, 0x02}/* 133MHz */
+};
+
+/* JCTSS */
+static unsigned long JCTSStbl[2][7] = {
+   {0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x00},   /* 100MHz */
+   {0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05}/* 133MHz */
+};
+
+/* JCENVT */
+static unsigned long JCENVTtbl[2][7] = {
+   {0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x00},   /* 100MHz */
+   {0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02}/* 133MHz */
+};
+
+/* JCACTSELS/JCACTSELM */
+static unsigned long JCACTSELtbl[2][7] = {
+   {0x00, 0x00, 0x00, 0x00, 0x01, 0x01, 0x00},   /* 100MHz */
+   {0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01}/* 133MHz */
+};
+
+
+static u8 scc_ide_inb(unsigned long port)
+{
+   u32 data = in_be32((void*)port);
+   return (u8)data;
+}
+
+static u16 scc_ide_inw(unsigned long port)
+{
+   u32 data = in_be32((void*)port);
+   return (u16)data;
+}
+
+static void scc_ide_insw(unsigned long port, void *addr, u32 count)
+{
+   u16 *ptr = (u16 *)addr;
+   while (count--) {
+   *ptr++ = le16_to_cpu(in_be32((void*)port));
+   }
+}
+
+static void scc_ide_insl(unsigned long 

[PATCH 1/2] scc_pata: dependency fix

2007-03-15 Thread Akira Iguchi
This patch fixes:
* the dependency of scc_pata on BLK_DEV_IDEDMA_PCI
* incorrect link to ide-core
* move scc_pata from ide/ppc to ide/pci

Signed-off-by: Kou Ishizaki <[EMAIL PROTECTED]>
Signed-off-by: Akira Iguchi <[EMAIL PROTECTED]>
---

diff -Nrpu -X linux-2.6.21-rc3/Documentation/dontdiff 
linux-2.6.21-rc3/drivers/ide/Kconfig linux-2.6.21-rc3.mod/drivers/ide/Kconfig
--- linux-2.6.21-rc3/drivers/ide/Kconfig2007-03-07 13:41:20.0 
+0900
+++ linux-2.6.21-rc3.mod/drivers/ide/Kconfig2007-03-16 18:49:04.0 
+0900
@@ -769,6 +769,14 @@ config BLK_DEV_TC86C001
help
This driver adds support for Toshiba TC86C001 GOKU-S chip.
 
+config BLK_DEV_CELLEB
+   tristate "Toshiba's Cell Reference Set IDE support"
+   depends on PPC_CELLEB
+   help
+ This driver provides support for the built-in IDE controller on
+ Toshiba Cell Reference Board.
+ If unsure, say Y.
+
 endif
 
 config BLK_DEV_IDE_PMAC
@@ -800,14 +808,6 @@ config BLK_DEV_IDEDMA_PMAC
  to transfer data to and from memory.  Saying Y is safe and improves
  performance.
 
-config BLK_DEV_IDE_CELLEB
-   bool "Toshiba's Cell Reference Set IDE support"
-   depends on PPC_CELLEB
-   help
- This driver provides support for the built-in IDE controller on
- Toshiba Cell Reference Board.
- If unsure, say Y.
-
 config BLK_DEV_IDE_SWARM
tristate "IDE for Sibyte evaluation boards"
depends on SIBYTE_SB1xxx_SOC
diff -Nrpu -X linux-2.6.21-rc3/Documentation/dontdiff 
linux-2.6.21-rc3/drivers/ide/Makefile linux-2.6.21-rc3.mod/drivers/ide/Makefile
--- linux-2.6.21-rc3/drivers/ide/Makefile   2007-03-07 13:41:20.0 
+0900
+++ linux-2.6.21-rc3.mod/drivers/ide/Makefile   2007-03-16 18:48:02.0 
+0900
@@ -37,7 +37,6 @@ ide-core-$(CONFIG_BLK_DEV_Q40IDE) += leg
 # built-in only drivers from ppc/
 ide-core-$(CONFIG_BLK_DEV_MPC8xx_IDE)  += ppc/mpc8xx.o
 ide-core-$(CONFIG_BLK_DEV_IDE_PMAC)+= ppc/pmac.o
-ide-core-$(CONFIG_BLK_DEV_IDE_CELLEB)  += ppc/scc_pata.o
 
 # built-in only drivers from h8300/
 ide-core-$(CONFIG_H8300)   += h8300/ide-h8300.o
diff -Nrpu -X linux-2.6.21-rc3/Documentation/dontdiff 
linux-2.6.21-rc3/drivers/ide/pci/Makefile 
linux-2.6.21-rc3.mod/drivers/ide/pci/Makefile
--- linux-2.6.21-rc3/drivers/ide/pci/Makefile   2007-03-07 13:41:20.0 
+0900
+++ linux-2.6.21-rc3.mod/drivers/ide/pci/Makefile   2007-03-16 
18:49:05.0 +0900
@@ -3,6 +3,7 @@ obj-$(CONFIG_BLK_DEV_AEC62XX)   += aec62x
 obj-$(CONFIG_BLK_DEV_ALI15X3)  += alim15x3.o
 obj-$(CONFIG_BLK_DEV_AMD74XX)  += amd74xx.o
 obj-$(CONFIG_BLK_DEV_ATIIXP)   += atiixp.o
+obj-$(CONFIG_BLK_DEV_CELLEB)   += scc_pata.o
 obj-$(CONFIG_BLK_DEV_CMD64X)   += cmd64x.o
 obj-$(CONFIG_BLK_DEV_CS5520)   += cs5520.o
 obj-$(CONFIG_BLK_DEV_CS5530)   += cs5530.o


Re: [linux-usb-devel] USB Keyboard

2007-03-15 Thread Alan Stern
On Thu, 15 Mar 2007, linux-os (Dick Johnson) wrote:

> It's not the same hardware and all the machines that I tried that
> have keyboards end up WORKING with the USB keyboard as well!  But
> Dmitry Torokhov was right! I just burned a CD with all three modules,
> and the keyboard works! I didn't bother to check the DEBUG messages.

Congratulations.  Sometimes these problems have easy solutions.  :-)

> It's interesting that the "wrong" module loaded fine with no warnings
> that it might not be the correct one!

There's no warning because the driver doesn't know anything is wrong.  
Even though it may not find any devices to manage when it first gets 
loaded, there's nothing to prevent you adding, for example, a PC-card with 
a USB controller on it at some later time.

That's true in general for most Linux drivers.  (The ones that aren't
platform-specific, anyway.)  They don't look for devices to manage at load
time; instead the driver core calls their probe() routine later on.  
Consequently drivers can't tell at load time whether there will be any
useful work for them to do.
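The load-time behaviour described above can be illustrated with a small userspace model. This is only a sketch: toy_driver, toy_register_driver() and toy_add_device() are made-up names, not the kernel's driver-core API. The point it shows is that registration succeeds with no devices present, and that probe() only runs when a matching device appears later.

```c
#include <stddef.h>
#include <string.h>

#define TOY_MAX_DRIVERS 8

/* Hypothetical stand-in for a driver: a type to match on and a probe hook. */
struct toy_driver {
	const char *match;		/* device type this driver handles */
	int (*probe)(const char *dev);	/* called by the core, never at load time */
	int bound;			/* number of devices bound so far */
};

static struct toy_driver *toy_drivers[TOY_MAX_DRIVERS];
static size_t toy_ndrivers;

/* "Loading" a driver only records it; no device lookup happens here. */
static int toy_register_driver(struct toy_driver *drv)
{
	if (toy_ndrivers == TOY_MAX_DRIVERS)
		return -1;
	drv->bound = 0;
	toy_drivers[toy_ndrivers++] = drv;
	return 0;
}

/* A device appearing later (for example by hot-plug) is what triggers probe(). */
static int toy_add_device(const char *type)
{
	size_t i;

	for (i = 0; i < toy_ndrivers; i++) {
		struct toy_driver *drv = toy_drivers[i];

		if (strcmp(drv->match, type) == 0 && drv->probe(type) == 0) {
			drv->bound++;
			return 1;	/* device bound to a driver */
		}
	}
	return 0;			/* no registered driver claimed the device */
}

/* Example probe routine that accepts every device offered to it. */
static int toy_probe_ok(const char *dev)
{
	(void)dev;
	return 0;
}
```

In this model a driver for the "wrong" controller registers without complaint and simply never has its probe routine called, which matches the lack of any warning at load time.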

Alan Stern



Re: [PATCH v5] Fix rmmod/read/write races in /proc entries

2007-03-15 Thread Andrew Morton
On Sun, 11 Mar 2007 20:04:56 +0300 Alexey Dobriyan <[EMAIL PROTECTED]> wrote:

> Differences from version 4:
>   Updated in-code comments. Largely rewritten changelog.
>   Lockdep please. --akpm
>   ->read_proc, ->write_proc aren't special, Extend protection to
>   most methods for regular /proc files. Mentioned by viro.
> Differences from version 3:
>   Use completion instead of unlock/schedule/lock
>   Move refcount waiting business after removing PDE from lists,
>   so that *cough* possible concurrent remove_proc_entry() will
>   work.

My, what a lot of code you have here.  I note that nobody can be assed even
reviewing it.  Now why is that?

> Fix following races:
> ===
> 1. Write via ->write_proc sleeps in copy_from_user(). Module disappears
>meanwhile. Or, more generically, system call done on /proc file, method
>supplied by module is called, module disappears meanwhile.
> 
>pde = create_proc_entry()
>if (!pde)
>   return -ENOMEM;
>pde->write_proc = ...
>   open
>   write
>   copy_from_user
>pde = create_proc_entry();
>if (!pde) {
>   remove_proc_entry();
>   return -ENOMEM;
>   /* module unloaded */
>}

We usually fix that race by pinning the module: make whoever registered the
proc entries also register their THIS_MODULE, do a try_module_get() on it
before we start to play with data structures which the module owns.

Can we do that here?
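For readers unfamiliar with the pinning idiom, here is a minimal single-threaded userspace model of it. All toy_* names are hypothetical; the real try_module_get()/module_put() work on struct module with proper atomics, which this sketch does not attempt to reproduce.

```c
#include <stdbool.h>

/* Hypothetical userspace model of a module's pin count; not kernel code. */
struct toy_module {
	int refs;	/* callers currently pinned inside module code */
	bool going;	/* set once unload has started */
};

/* Like try_module_get(): refuse new users once unload has begun. */
static bool toy_try_get(struct toy_module *m)
{
	if (m->going)
		return false;
	m->refs++;
	return true;
}

/* Like module_put(): drop the pin taken by toy_try_get(). */
static void toy_put(struct toy_module *m)
{
	m->refs--;
}

/* Unload may only proceed when nobody holds a pin. */
static bool toy_try_unload(struct toy_module *m)
{
	if (m->refs > 0)
		return false;	/* e.g. a writer still asleep in copy_from_user() */
	m->going = true;
	return true;
}

/* A proxied ->write_proc call: pin the owner, call into it, unpin. */
static int toy_proc_write(struct toy_module *owner, int (*write_proc)(void))
{
	int ret;

	if (!toy_try_get(owner))
		return -1;	/* owner is going away; do not call into it */
	ret = write_proc();
	toy_put(owner);
	return ret;
}

/* Example handler standing in for a module-supplied ->write_proc. */
static int toy_write_ok(void)
{
	return 0;
}
```

While a caller holds the pin, toy_try_unload() fails, so the handler's code cannot vanish underneath it; once unload has started, new calls are turned away before they reach the module.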

And is the above race fix related to the below one in any fashion?

> ==
> 2. bogo-revoke aka proc_kill_inodes()
> 
>   remove_proc_entry   vfs_read
>   proc_kill_inodes[check ->f_op validness]
>   [check ->f_op->read validness]
>   [verify_area, security permissions checks]
>   ->f_op = NULL;
>   if (file->f_op->read)
>   /* ->f_op dereference, boom */

So you fixed this via sort-of-refcounting on pde->pde_users.

hmm.

> NOTE, NOTE, NOTE: file_operations are proxied for regular files only. Let's
> see how this scheme behaves, then extend if needed for directories.
> Directories creators in /proc only set ->owner for them, so proxying for
> directories may be unneeded.
> 
> NOTE, NOTE, NOTE: methods being proxied are ->llseek, ->read, ->write,
> ->poll, ->unlocked_ioctl, ->ioctl, ->compat_ioctl, ->open, ->release.
> If your in-tree module uses something else, yell at me. Full audit pending.
> 
> Signed-off-by: Alexey Dobriyan <[EMAIL PROTECTED]>
> ---
> 
>  fs/proc/generic.c   |   32 +
>  fs/proc/inode.c |  279 
> +++-
>  include/linux/proc_fs.h |   13 ++
>  3 files changed, 321 insertions(+), 3 deletions(-)
> 
> --- a/fs/proc/generic.c
> +++ b/fs/proc/generic.c
> @@ -20,6 +20,7 @@ #include 
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include "internal.h"
> @@ -613,6 +614,9 @@ static struct proc_dir_entry *proc_creat
>   ent->namelen = len;
>   ent->mode = mode;
>   ent->nlink = nlink;
> + ent->pde_users = 0;
> + spin_lock_init(&ent->pde_unload_lock);
> + ent->pde_unload_completion = NULL;
>   out:
>   return ent;
>  }
> @@ -734,9 +738,35 @@ void remove_proc_entry(const char *name,
>   de = *p;
>   *p = de->next;
>   de->next = NULL;
> +
> + spin_lock(&de->pde_unload_lock);
> + /*
> +  * Stop accepting new callers into module. If you're
> +  * dynamically allocating ->proc_fops, save a pointer somewhere.
> +  */
> + de->proc_fops = NULL;
> + /* Wait until all existing callers into module are done. */
> + if (de->pde_users > 0) {
> + DECLARE_COMPLETION_ONSTACK(c);
> +
> + if (!de->pde_unload_completion)
> + de->pde_unload_completion = &c;
> +
> + spin_unlock(&de->pde_unload_lock);
> + spin_unlock(&proc_subdir_lock);
> +
> + wait_for_completion(de->pde_unload_completion);
> +
> + spin_lock(&proc_subdir_lock);
> + goto continue_removing;
> + }
> + spin_unlock(&de->pde_unload_lock);
> +
> +continue_removing:
>   if (S_ISDIR(de->mode))
>   parent->nlink--;
> - proc_kill_inodes(de);
> + if (!S_ISREG(de->mode))
> + proc_kill_inodes(de);
>   de->nlink = 0;
>   WARN_ON(de->subdir);
>   if (!atomic_read(&de->count))
> --- a/fs/proc/inode.c
> +++ b/fs/proc/inode.c
> @@ -142,6 +142,277 @@ static const struct super_operations pro
>   .remount_fs = proc_remount,
>  };
>  
> 

Re: PCI DAC DMA APIs

2007-03-15 Thread David Miller
From: Christoph Hellwig <[EMAIL PROTECTED]>
Date: Thu, 15 Mar 2007 19:18:34 +

> On Thu, Mar 15, 2007 at 12:38:13PM +, Jan Beulich wrote:
> > While the kernel headers provide for this, there don't appear to be any
> > in-tree users (which seems contrary to general Linux policies). Would there
> > be objections to remove all of these?
> 
> They should go away.  Having them in for more than five years without
> any users is almost a guarantee for bitrot.

Yes, probably we should get rid of them.

The idea wasn't sparc optimizations, it was for things like those
Dolphin clustering cards that essentially want to get at all of
physical memory from the PCI card.

The IOMMU is a limited resource, so at the expense of lack of
prefetching and write caching we provide a way to do unlimited DMA
mapping with 64-bit DAC addresses.

None of these drivers ever got integrated, so it's a total loss.

Someone will complain when we pull it out, but fsck them, they
had years to do something about this. :)



Re: [PATCH 2/5] revoke: core code

2007-03-15 Thread Andrew Morton
On Sun, 11 Mar 2007 13:30:49 +0200 (EET) Pekka J Enberg <[EMAIL PROTECTED]> 
wrote:

> From: Pekka Enberg <[EMAIL PROTECTED]>
> 
> The revokeat(2) and frevoke(2) system calls invalidate open file
> descriptors and shared mappings of an inode. After a successful
> revocation, operations on file descriptors fail with the EBADF or
> ENXIO error code for regular and device files,
> respectively. Attempting to read from or write to a revoked mapping
> causes SIGBUS.
> 
> The actual operation is done in two passes:
> 
>  1. Revoke all file descriptors that point to the given inode. We do
> this under tasklist_lock so that after this pass, we don't need
> to worry about racing with close(2) or dup(2).
>
>  2. Take down shared memory mappings of the inode and close all file
> pointers.
> 
> The file descriptors and memory mapping ranges are preserved until the
> owning task does close(2) and munmap(2), respectively.
> 
> ...
>
> +asmlinkage int sys_revokeat(int dfd, const char __user *filename);
> +asmlinkage int sys_frevoke(unsigned int fd);

All system calls must return long.

> +static int revoke_vma(struct vm_area_struct *vma, struct zap_details 
> *details)
> +{
> + unsigned long restart_addr, start_addr, end_addr;
> + int need_break;
> +
> + start_addr = vma->vm_start;
> + end_addr = vma->vm_end;
> +
> + /*
> +  * Not holding ->mmap_sem here.
> +  */
> + vma->vm_flags |= VM_REVOKED;

so  the modification of vm_flags is racy?

> + smp_mb();

Please always document barriers.  There's presumably some vm_flags reader
we're concerned about here, but how is the code reader to know what the
code writer was thinking?


> +  again:
> + restart_addr = zap_page_range(vma, start_addr, end_addr - start_addr,
> +   details);
> +
> + need_break = need_resched() || need_lockbreak(details->i_mmap_lock);
> + if (need_break)
> + goto out_need_break;
> +
> + if (restart_addr < end_addr) {
> + start_addr = restart_addr;
> + goto again;
> + }
> + return 0;
> +
> +  out_need_break:
> + spin_unlock(details->i_mmap_lock);
> + cond_resched();
> + spin_lock(details->i_mmap_lock);
> + return -EINTR;
> +}
> +
> +static int revoke_mapping(struct address_space *mapping, struct file 
> *to_exclude)
> +{
> + struct vm_area_struct *vma;
> + struct prio_tree_iter iter;
> + struct zap_details details;
> + int err = 0;
> +
> + details.i_mmap_lock = &mapping->i_mmap_lock;
> +
> + spin_lock(&mapping->i_mmap_lock);
> + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX) {
> + if ((vma->vm_flags & VM_SHARED) && vma->vm_file != to_exclude) {
> + err = revoke_vma(vma, &details);
> + if (err)
> + goto out;
> + }
> + }
> +
> + list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) {
> + if ((vma->vm_flags & VM_SHARED) && vma->vm_file != to_exclude) {
> + err = revoke_vma(vma, &details);
> + if (err)
> + goto out;
> + }
> + }
> +  out:
> + spin_unlock(&mapping->i_mmap_lock);
> + return err;
> +}

This all looks very strange.  If the calling process expires its timeslice,
the entire system call fails?

What's happening here?
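To make the concern concrete, here is a hedged userspace sketch of the alternative: treat a needed reschedule as a reason to pause and resume, not as an error for the caller. Every toy_* name is hypothetical. The sketch also ignores that, in the real code, dropping i_mmap_lock may invalidate the tree walk in the caller, which is presumably the hard part.

```c
#include <stddef.h>

/* Hypothetical model: "zapping" clears entries in [start, end) chunk by chunk. */
struct toy_range {
	unsigned char *pages;	/* one byte per page; nonzero means mapped */
	size_t chunk;		/* pages handled before a break is requested */
	unsigned long breaks;	/* times the lock was dropped and retaken */
};

/* Clear up to r->chunk pages starting at start; return where to resume. */
static size_t toy_zap(struct toy_range *r, size_t start, size_t end)
{
	size_t step = r->chunk ? r->chunk : 1;	/* guard against a zero chunk */
	size_t stop = (end - start > step) ? start + step : end;
	size_t i;

	for (i = start; i < stop; i++)
		r->pages[i] = 0;
	return stop;
}

/* Stand-in for spin_unlock(); cond_resched(); spin_lock(). */
static void toy_lock_break(struct toy_range *r)
{
	r->breaks++;
}

/*
 * Revoke the whole range. After each partial zap we pretend that
 * need_resched() fired, take the lock break, and continue from the
 * restart address. The caller never sees -EINTR.
 */
static int toy_revoke_range(struct toy_range *r, size_t start, size_t end)
{
	while (start < end) {
		start = toy_zap(r, start, end);
		if (start < end)
			toy_lock_break(r);
	}
	return 0;
}
```

With ten pages and a chunk size of four, the loop takes two lock breaks and still finishes the whole range with a return value of 0.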


> +
> +int generic_file_revoke(struct file *file)
> +{
> + int err;
> +
> + /*
> +  * Flush pending writes.
> +  */
> + err = do_fsync(file, 1);
> + if (err)
> + goto out;
> +
> + /*
> +  * Make pending reads fail.
> +  */
> + err = invalidate_inode_pages2(file->f_mapping);
> +
> +  out:
> + return err;
> +}
> +
> +EXPORT_SYMBOL(generic_file_revoke);

do_fsync() is seriously suboptimal - it will run an ext3 commit. 
do_sync_file_range(...,
SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER)
will run maybe five times quicker.

But otoh, do_sync_file_range() will fail to write back the pages for a
data=journal ext3 file, I expect (oops).


Why is this code using invalidate_inode_pages2()?  That function keeps on
breaking, has ill-defined semantics and will probably change in the future.

Exactly what semantics are you looking for here, and why?

The blank line before the EXPORT_SYMBOL() is a waste of space.

> +/*
> + *   Filesystem for revoked files.
> + */
> +
> +static struct inode *revokefs_alloc_inode(struct super_block *sb)
> +{
> + struct revokefs_inode_info *info;
> +
> + info = kmem_cache_alloc(revokefs_inode_cache, GFP_NOFS);
> + if (!info)
> + return NULL;
> +
> + return &info->vfs_inode;
> +}

Why GFP_NOFS?

> ===
> --- /dev/null 1970-01-01 00:00:00.0 +
> +++ uml-2.6/include/linux/revoked_fs_i.h  2007-03-11 13:09:20.0 
> +0200
> @@ -0,0 +1,20 @@
> +#ifndef 

Re: Summary of resource management discussion

2007-03-15 Thread Srivatsa Vaddagiri
On Thu, Mar 15, 2007 at 12:12:50PM -0700, Paul Menage wrote:
> There are some things that benefit from having an abstract
> container-like object available to store state, e.g. "is this
> container deleted?", "should userspace get a callback when this
> container is empty?". 

IMO we can still get these bits of information using nsproxy itself (I
admit I haven't looked at the callback requirement yet).

But IMO a bigger use of 'struct container' object in your patches is to
store hierarchical information and avoid /repeating/ that information in
each resource object (struct cpuset, struct cpu_limit, struct rss_limit
etc) a 'struct container' is attached to (as pointed out here : 
http://lkml.org/lkml/2007/3/7/356). However I don't know how many
controllers will ever support such hierarchical res mgmt and that's why I
said option 3 [above URL] may not be a bad compromise. 

Also if you find a good answer for my earlier question "what more
task-grouping behavior do you want to implement using an additional pointer 
that you can't implement by reusing ->task_proxy", it would drive home the need for
additional pointers/structures.

> >> >a. Paul Menage's patches:
> >> >
> >> >(tsk->containers->container[cpu_ctlr.subsys_id] - X)->cpu_limit
> >>
> >> So what's the '-X' that you're referring to
> >
> >Oh ..that's to seek pointer to beginning of the cpulimit structure (subsys
> >pointer in 'struct container' points to a structure embedded in a larger
> >structure. -X gets you to point to the larger structure).
> 
> OK, so shouldn't that be listed as an overhead for your rcfs version
> too? 

X shouldn't be needed in rcfs patches, because "->ctlr_data" in nsproxy
can directly point to the larger structure (there is no 'struct
container_subsys_state' equivalent in rcfs patches).

Container patches:

(tsk->containers->container[cpu_ctlr.subsys_id] - X)->cpu_limit

rcfs:

tsk->nsproxy->ctlr_data[cpu_ctlr.subsys_id]->cpu_limit
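The difference between the two access paths can be sketched in plain C. The structure and function names below are hypothetical stand-ins, not the ones in either patch set; the "- X" step is the usual container_of-style pointer adjustment, where X is the offset of the embedded member.

```c
#include <stddef.h>

/* Hypothetical state header embedded in every controller's structure. */
struct toy_subsys_state {
	int refcnt;
};

/* Hypothetical CPU controller; the embedded header is not the first member. */
struct toy_cpu_ctlr {
	int cpu_limit;
	struct toy_subsys_state css;
};

/* The "- X" step: X is the offset of the member within the larger struct. */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Container-style lookup: the stored pointer refers to the embedded member. */
static int toy_limit_via_embedded(struct toy_subsys_state *css)
{
	return toy_container_of(css, struct toy_cpu_ctlr, css)->cpu_limit;
}

/* rcfs-style lookup: the stored pointer already refers to the larger struct. */
static int toy_limit_via_direct(void *ctlr_data)
{
	return ((struct toy_cpu_ctlr *)ctlr_data)->cpu_limit;
}
```

Both functions return the same limit; the first pays for one constant subtraction, which the compiler folds into the member offset when the embedded struct sits at offset zero.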

> >Yes me too. But maybe to keep it simple in initial versions, we should
> >avoid that optimisation and at the same time get statistics on duplicates?.
> 
> That's an implementation detail - we have more important points to
> agree on right now ...

yes :)

Eric, did you have any opinion on this thread?

-- 
Regards,
vatsa


Re: [PATCH 10/13] BLK_DEV_IDE_CELLEB dependency fix

2007-03-15 Thread Akira Iguchi
Hi,

> Bart wrote:
>> Al wrote:
>> So AFAICS the minimal fix for that sucker is dependency on BLK_DEV_IDE=y;
>> however, I really wonder if
>>  * it needs to be linked into ide-core (as opposed to being a normal
>> module of its own)
>
>AFAICS there are no legacy device ordering issues with scc_pata so it doesn't
>need to be linked into ide-core but I'll leave the definitive answer to Akira
>
>>  * alternatively, its init should be called explicitly.

I don't have the answer why scc_pata is linked into ide-core.
Reviewing your comments and codes, I will make the following fixes:
  * remove link to ide-core and make normal module
  * move from ide/ppc to ide/pci

I will send these patches later.

Best regards,
Akira Iguchi


core2 duo, interrupts: is this normal?

2007-03-15 Thread Norberto Bensa
Hello,

Is this output normal? I mean, why are the counters on CPU1 zero? Isn't this 
balanced?

$ cat /proc/interrupts
           CPU0       CPU1
  0:    4180170          0   IO-APIC-edge      timer
  1:       8060          0   IO-APIC-edge      i8042
  7:          0          0   IO-APIC-edge      parport0
  9:          0          0   IO-APIC-fasteoi   acpi
 12:          5          0   IO-APIC-edge      i8042
 16:     322297          0   IO-APIC-fasteoi   uhci_hcd:usb3, libata, nvidia, EMU10K1
 17:     896399          0   IO-APIC-fasteoi   bttv0, eth0, libata
 18:      72867          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb7
 19:      27770          0   IO-APIC-fasteoi   ehci_hcd:usb2, uhci_hcd:usb5
 20:          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
 21:          0          0   IO-APIC-fasteoi   uhci_hcd:usb6
 22:          3          0   IO-APIC-fasteoi   ohci1394
 23:        155          0   IO-APIC-fasteoi   HDA Intel
219:     103056          0   PCI-MSI-edge      libata
NMI:          0          0
LOC:    4077613    4077622
ERR:          0
MIS:          0


Many thanks in advance,
Norberto


Re: XFS internal error xfs_da_do_buf(2) at line 2087 of file fs/xfs/xfs_da_btree.c. Caller 0xc01b00bd

2007-03-15 Thread David Chinner
On Wed, Mar 14, 2007 at 12:34:29PM +0100, Marco Berizzi wrote:
> Hello everybody.
> Since 2.6.19.2 + commit 7fbbb01dca7704d52ace6f45a805c98a5b0362f9

What commit is that? gitweb search tells me it's an nmi watchdog
change. Doesn't seem likely to change XFS behaviour - can
you post a url to the commit?

> I'm experiencing these errors.
> 2.6.19.1 worked well for more
> than 30 days.

With the above commit?

> I have reverted back to 2.6.19.1 to see if
> this problem happens again.

without the above commit?

> find_or_create_page+0x37/0x8e
> _xfs_buf_lookup_pages+0x132/0x2ea
> _xfs_buf_initialize+0xc8/0xf6
> xfs_buf_get_flags+0xf8/0x11d
> xfs_buf_read_flags+0x1c/0x7f
> xfs_trans_read_buf+0x16a/0x34f
> xfs_itobp+0x7c/0x242
> xfs_iread+0x68/0x1d3
> xfs_iget_core+0xe7/0x687
> xfs_iget+0xd8/0x150
> xfs_dir_lookup_int+0x98/0x10e
> xfs_lookup+0x5a/0x90
> xfs_vn_lookup+0x52/0x93

Curious - never seen this before - possibly a corrupted inode
number in the directory has led to this.

> ba 4e 8b cd
> Mar 12 14:35:21 Pleiadi kernel: Filesystem "sda8": XFS internal error
> xfs_da_do_buf(2) at line 2087 of file fs/xfs/xfs_da_btree.c.  Caller
> 0xc01b00bd
> Mar 12 14:35:21 Pleiadi kernel:  [] xfs_da_do_buf+0x70c/0x7b1
> Mar 12 14:35:21 Pleiadi kernel:  [] xfs_da_read_buf+0x30/0x35
> Mar 12 14:35:21 Pleiadi kernel:  [] xfs_da_read_buf+0x30/0x35

Hmm - these could simply be follow-on errors from the first
problem - the buffer would now probably be bad or corrupted,
and the directory buffer read code here is saying the buffer
is bad. All the errors appear to have the same data in the buffer
(which is lacking the correct magic numbers) so I'd say they
are related to the above error.

Can you run xfs_repair on that filesystem and see if it reports
(and fixes) any problems?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


[PATCH] Return EPERM not ECHILD on security_task_wait failure

2007-03-15 Thread Roland McGrath
wait* syscalls return -ECHILD even when an individual PID of a live
child was requested explicitly, when security_task_wait denies the
operation.  This means that something like a broken SELinux policy
can produce an unexpected failure that looks just like a bug with
wait or ptrace or something.

This patch makes do_wait return -EPERM instead of -ECHILD if some
children were ruled out solely because security_task_wait failed.

Signed-off-by: Roland McGrath <[EMAIL PROTECTED]>
---
 kernel/exit.c |   12 +++-
 1 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index f132349..a41052f 100644  
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1067,7 +1067,7 @@ static int eligible_child(pid_t pid, int
return 2;
 
if (security_task_wait(p))
-   return 0;
+   return -1;
 
return 1;
 }
@@ -1449,6 +1449,7 @@ static long do_wait(pid_t pid, int optio
DECLARE_WAITQUEUE(wait, current);
struct task_struct *tsk;
int flag, retval;
+   int allowed, denied;
 
    add_wait_queue(&current->signal->wait_chldexit,&wait);
 repeat:
@@ -1457,6 +1458,7 @@ repeat:
 * match our criteria, even if we are not able to reap it yet.
 */
flag = 0;
+   allowed = denied = 0;
current->state = TASK_INTERRUPTIBLE;
    read_lock(&tasklist_lock);
tsk = current;
@@ -1472,6 +1474,12 @@ repeat:
if (!ret)
continue;
 
+   if (unlikely(ret < 0)) {
+   denied = 1;
+   continue;
+   }
+   allowed = 1;
+
switch (p->state) {
case TASK_TRACED:
/*
@@ -1570,6 +1578,8 @@ check_continued:
goto repeat;
}
retval = -ECHILD;
+   if (unlikely(denied) && !allowed)
+   retval = -EPERM;
 end:
current->state = TASK_RUNNING;
    remove_wait_queue(&current->signal->wait_chldexit,&wait);
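
The error-selection rule this patch adds can be sketched in isolation. The function below is a hypothetical userspace model, not code from kernel/exit.c: it takes one eligible_child()-style result per child and returns the errno that would be reported when no child could be waited for.

```c
#include <errno.h>
#include <stddef.h>

/*
 * results[] holds one value per child, mirroring the patched
 * eligible_child() convention:
 *    0  child does not match the request,
 *   <0  child matched but security_task_wait() denied it,
 *   >0  child matched and waiting on it is allowed.
 */
static int toy_wait_error(const int *results, size_t n)
{
	int allowed = 0, denied = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (results[i] == 0)
			continue;
		if (results[i] < 0) {
			denied = 1;
			continue;
		}
		allowed = 1;
	}

	/* Report -EPERM only if denial was the sole reason nothing matched. */
	if (denied && !allowed)
		return -EPERM;
	return -ECHILD;
}
```

A caller with no matching children still gets -ECHILD, so only the case where a security module hid every candidate changes behaviour.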


[PATCH 2.6.20] pwc : Cisco VT Camera support

2007-03-15 Thread Jean Tourrilhes
Hi,

I already sent this e-mail to Luc and on the pwc mailing list,
and got no answer. I'm trying again with the hope that this patch
would go in the kernel...


I have a Cisco VT Camera, and it was just collecting dust. I
decided to try connecting it to my Linux box at home.

Just a digression about the product. The Cisco VT Camera is a
webcam Cisco sold to work with their IP phone hardware and
software. It's mostly useless on Windows, as it interfaces only to
Cisco software. You can find some for cheap on eBay...
Physically, it's just a Logitech Pro 4000. The only difference
with the Pro 4000 is the Cisco logo and that it's grey like the Pro
3000.
I believe Cisco is now selling the Cisco VT Camera II, which
looks to be something else...

So, assuming that it was a Pro 4000 inside, I created the
little patch attached.
I'm new to webcam under Linux, but I managed to get an image
from it using xawtv, and the image looked all right, so I consider
that a success. The image seemed a bit small and I could not get the
microphone driver loaded, but I assume it's my lack of experience.
Note that I did not try any other type_id, but this one works
great.

Have fun...

Jean

---

diff -u -p linux/drivers/media/video/pwc/pwc-if.c~ 
linux/drivers/media/video/pwc/pwc-if.c
--- linux/drivers/media/video/pwc/pwc-if.c~ 2007-02-23 22:08:40.0 
-0800
+++ linux/drivers/media/video/pwc/pwc-if.c  2007-03-04 22:42:43.0 
-0800
@@ -1547,6 +1547,10 @@ static int usb_pwc_probe(struct usb_inte
features |= FEATURE_MOTOR_PANTILT;
break;
case 0x08b6:
+   PWC_INFO("Logitech/Cisco VT Camera webcam detected.\n");
+   name = "Cisco VT Camera";
+   type_id = 740; /* CCD sensor */
+   break;
case 0x08b7:
case 0x08b8:
PWC_INFO("Logitech QuickCam detected (reserved ID).\n");


Re: [RFC][PATCH 2/7] RSS controller core

2007-03-15 Thread Eric W. Biederman
Alan Cox <[EMAIL PROTECTED]> writes:

>> stuff is happening by comparing page->count and page->_mapcount, but it
>> certainly wouldn't be conclusive.  But, does this kind of nonsense even
>> happen in practice?  
>
> "Is it useful for me as a bad guy to make it happen ?"

To create a DOS attack.

- Allocate some memory you know your victim will want in the future,
  (shared libraries and the like).
- Wait until your victim is using the memory you allocated.
- Terminate your memory resource group.
- Victim is pushed over memory limits by your exiting.
- Victim can no longer allocate memory
- Victim dies

It's not quite that easy unless your victim calls mlockall(MCL_FUTURE),
but the potential is clearly there.
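
The sequence above can be replayed in a toy model. This is only a sketch under stated assumptions, not the RSS controller's code: `struct group`, `touch_page()`, `group_exit()` and `replay_attack()` are hypothetical names, a page is charged to whichever group touches it first, and when the owning group exits the charge moves to the other group still using the page.

```c
#include <stdbool.h>
#include <stddef.h>

#define NPAGES   8
#define NO_OWNER (-1)

struct group {
    size_t charged; /* pages currently charged to this group */
    size_t limit;   /* maximum pages the group may be charged */
};

struct page {
    int  owner;      /* index of the group charged for the page, or NO_OWNER */
    bool used_by[2]; /* which of the two groups currently use the page */
};

/* Group g touches page p; only the first toucher is charged (assumption). */
static void touch_page(struct group *groups, struct page *p, int g)
{
    p->used_by[g] = true;
    if (p->owner == NO_OWNER) {
        p->owner = g;
        groups[g].charged++;
    }
}

/*
 * Group g exits. Each page it was charged for is recharged to the other
 * group if that group still uses it (assumption). Returns true when the
 * surviving group ends up over its limit.
 */
static bool group_exit(struct group *groups, struct page *pages, size_t n, int g)
{
    int other = 1 - g;
    size_t i;

    for (i = 0; i < n; i++) {
        pages[i].used_by[g] = false;
        if (pages[i].owner != g)
            continue;
        groups[g].charged--;
        if (pages[i].used_by[other]) {
            pages[i].owner = other;
            groups[other].charged++;
        } else {
            pages[i].owner = NO_OWNER;
        }
    }
    return groups[other].charged > groups[other].limit;
}

/* Replays the steps listed above; group 0 is the attacker, group 1 the victim. */
static bool replay_attack(void)
{
    struct group groups[2] = { { 0, 100 }, { 0, 6 } };
    struct page pages[NPAGES];
    size_t i;

    for (i = 0; i < NPAGES; i++) {
        pages[i].owner = NO_OWNER;
        pages[i].used_by[0] = pages[i].used_by[1] = false;
    }
    for (i = 0; i < NPAGES; i++)
        touch_page(groups, &pages[i], 0); /* attacker touches first */
    for (i = 0; i < NPAGES; i++)
        touch_page(groups, &pages[i], 1); /* victim uses them for free */

    return group_exit(groups, pages, NPAGES, 0); /* attacker exits */
}
```

In this model the victim is charged nothing until the attacker exits, at which point all eight pages land on a group whose limit is six.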

Am I missing something?  Or is this fundamental to any first touch scenario?

I just know I have problems with first touch because it is darn hard to
reason about.

Eric


Re: sky2 PHY setup

2007-03-15 Thread Thomas Glanzmann
Hello Stephen,

> yesterday I pulled from Linus tree because I saw the sky2 updated and I
> tried to break it but it seems that my problems are gone. I let you know
> if anything pops up in the future.

Bad news. Today I tried the sky2 driver from Linus' kernel tree
(HEAD) on a machine with very high network load, and it stopped working
without any kernel messages after doing a flawless job under high load
for 5 hours. My watchdog rebooted the machine after 500 seconds. ;-(

Thomas


Re: [PATCH take3 00/20] Make common x86 arch area for i386 and x86_64 - Take 3

2007-03-15 Thread Rusty Russell
On Thu, 2007-03-15 at 01:13 -0400, Steven Rostedt wrote:
> Once again here's an attempt to put the shared files of x86_64 and i386
> into a separate directory.

OK, that's fine, but the next step is to have "make ARCH=x86" compile,
with a config option as to whether to build 32 or 64 bit.  This will
involve a fair amount of Makefile hair, but if you can get Andi to buy
into that then the rest is a simple matter of code churn.  For most
kernel hackers, this would be the flag day.

Moving the rest of the files across to xxx_32.c, xxx_64.h etc is going
to involve a great deal of untangling and code cleanup.  It's also going
to completely screw a whole heap of my cleanup patches.  Oh well.

(Still hoping for an executive summary from the PPC folks).

Cheers!
Rusty.




Re: [PATCH] [REPOST] x86_64, i386: Add command line length to boot protocol

2007-03-15 Thread H. Peter Anvin

Bernhard Walle wrote:

Because the command line is increased to 2048 characters after 2.6.21,
it's not possible for boot loaders and userspace tools to determine the length
of the command line the kernel can understand. The benefit of knowing the
length is that users can be warned if the command line is too long, which
prevents surprises if things don't work after bootup.

This patch updates the boot protocol to contain a field called
"cmdline_size" that contains the length of the command line (excluding
the terminating zero).

The patch also adds missing fields (of protocol version 2.05) to the x86_64
setup code.
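
A boot loader could use the new field roughly as follows. This is a minimal sketch, not code from the patch: `effective_cmdline_max()` and `cmdline_fits()` are hypothetical helpers, and the protocol version that introduces the field (assumed 2.06 here, since the description says 2.05 fields were still missing) and the 255-character legacy limit are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumed limit for kernels whose header has no cmdline_size field. */
#define LEGACY_CMDLINE_MAX 255u

/* Assumed first protocol version that carries cmdline_size (2.06). */
#define CMDLINE_SIZE_PROTOCOL 0x0206

/* Longest command line (without the terminating zero) the kernel accepts. */
static uint32_t effective_cmdline_max(uint16_t protocol_version,
                                      uint32_t cmdline_size)
{
    if (protocol_version < CMDLINE_SIZE_PROTOCOL)
        return LEGACY_CMDLINE_MAX;
    return cmdline_size;
}

/* True when cmdline can be passed to the kernel without truncation. */
static bool cmdline_fits(const char *cmdline, uint16_t protocol_version,
                         uint32_t cmdline_size)
{
    return strlen(cmdline) <= effective_cmdline_max(protocol_version,
                                                    cmdline_size);
}
```

Because the field excludes the terminating zero, the comparison is against `strlen()` directly and needs no off-by-one adjustment.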


Signed-off-by: Bernhard Walle <[EMAIL PROTECTED]>
Cc: Alon Bar-Lev <[EMAIL PROTECTED]>


Acked-by: H. Peter Anvin <[EMAIL PROTECTED]>


[patch 13/13] signal/timer/event fds v6 - KAIO eventfd support example ...

2007-03-15 Thread Davide Libenzi
This is an example of how to add eventfd support to the current KAIO code,
in order to enable KAIO to post readiness events to a pollable fd
(hence compatible with POSIX select/poll). The KAIO code simply signals
the eventfd fd when events are ready, and this triggers a POLLIN in the fd.
This patch uses a member of the struct iocb that was reserved for future use
to pass an eventfd file descriptor, which KAIO will use to post events every
time a request completes. At that point, an aio_getevents() will return the
completed result to a struct io_event.
I made a quick test program to verify the patch, and it runs fine here:

http://www.xmailserver.org/eventfd-aio-test.c

The test program uses poll(2), but it would, of course, work with select and
epoll too.
This makes it possible to schedule both block I/O and requests to other
poll-able devices, and wait for results using select/poll/epoll.
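
From userspace, the flow described above might look like the sketch below. It is written against the interface as it exists in later mainline kernels, which is an assumption relative to this patch: there the iocb must also set IOCB_FLAG_RESFD in aio_flags, whereas the patch posted here treats a non-zero aio_resfd as the request. The helper name read_with_eventfd() is hypothetical, and the AIO calls go through syscall(2) because glibc has no wrappers for them.

```c
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <poll.h>
#include <stdint.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Reads up to len bytes from the start of path with one KAIO request and
 * waits for its completion by polling an eventfd.
 * Returns the number of bytes read, or -1 on failure.
 */
static long read_with_eventfd(const char *path, void *buf, size_t len)
{
    aio_context_t ctx = 0;
    struct iocb cb;
    struct iocb *list[1] = { &cb };
    struct io_event ev;
    struct pollfd pfd;
    uint64_t completions = 0;
    long result = -1;
    int fd, efd;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    efd = eventfd(0, 0);
    if (efd < 0)
        goto out_close_fd;
    if (syscall(SYS_io_setup, 1, &ctx) < 0)
        goto out_close_efd;

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_lio_opcode = IOCB_CMD_PREAD;
    cb.aio_buf = (uint64_t)(uintptr_t)buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;
    cb.aio_flags = IOCB_FLAG_RESFD; /* mainline requirement, not in this patch */
    cb.aio_resfd = efd;             /* completions are posted to this eventfd */

    if (syscall(SYS_io_submit, ctx, 1, list) != 1)
        goto out_destroy;

    /* Wait until KAIO signals the eventfd, then drain its counter. */
    pfd.fd = efd;
    pfd.events = POLLIN;
    if (poll(&pfd, 1, 5000) != 1)
        goto out_destroy;
    if (read(efd, &completions, sizeof(completions)) !=
        (ssize_t)sizeof(completions))
        goto out_destroy;

    /* The eventfd only says "something completed"; fetch the result. */
    if (syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL) == 1)
        result = (long)ev.res;

out_destroy:
    syscall(SYS_io_destroy, ctx);
out_close_efd:
    close(efd);
out_close_fd:
    close(fd);
    return result;
}
```

The eventfd carries only a completion count, so the actual result still comes from the completion ring; the same efd could sit in a poll set next to sockets or other pollable descriptors.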




Signed-off-by: Davide Libenzi 



- Davide



Index: linux-2.6.21-rc3.quilt/fs/aio.c
===
--- linux-2.6.21-rc3.quilt.orig/fs/aio.c2007-03-15 15:52:45.0 
-0700
+++ linux-2.6.21-rc3.quilt/fs/aio.c 2007-03-15 17:15:20.0 -0700
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -421,6 +422,7 @@
req->private = NULL;
req->ki_iovec = NULL;
INIT_LIST_HEAD(&req->ki_run_list);
+   req->ki_eventfd = ERR_PTR(-EINVAL);
 
/* Check if the completion queue has enough free space to
 * accept an event from this io.
@@ -462,6 +464,8 @@
 {
assert_spin_locked(&ctx->ctx_lock);
 
+   if (!IS_ERR(req->ki_eventfd))
+   fput(req->ki_eventfd);
if (req->ki_dtor)
req->ki_dtor(req);
if (req->ki_iovec != &req->ki_inline_vec)
@@ -946,6 +950,14 @@
return 1;
}
 
+   /*
+* Check if the user asked us to deliver the result through an
+* eventfd. The eventfd_signal() function is safe to be called
+* from IRQ context.
+*/
+   if (unlikely(!IS_ERR(iocb->ki_eventfd)))
+   eventfd_signal(iocb->ki_eventfd, 1);
+
info = &ctx->ring_info;
 
/* add a completion event to the ring buffer.
@@ -1555,6 +1567,19 @@
fput(file);
return -EAGAIN;
}
+   if (iocb->aio_resfd != 0) {
+   /*
+* If the aio_resfd field of the iocb is not zero, get an
+* instance of the file* now. The file descriptor must be
+* an eventfd() fd, and will be signaled for each completed
+* event using the eventfd_signal() function.
+*/
+   req->ki_eventfd = eventfd_fget((int) iocb->aio_resfd);
+   if (IS_ERR(req->ki_eventfd)) {
+   ret = PTR_ERR(req->ki_eventfd);
+   goto out_put_req;
+   }
+   }
 
req->ki_filp = file;
ret = put_user(req->ki_key, &user_iocb->aio_key);
Index: linux-2.6.21-rc3.quilt/include/linux/aio.h
===
--- linux-2.6.21-rc3.quilt.orig/include/linux/aio.h 2007-03-15 
15:52:45.0 -0700
+++ linux-2.6.21-rc3.quilt/include/linux/aio.h  2007-03-15 16:13:45.0 
-0700
@@ -119,6 +119,12 @@
 
struct list_head ki_list; /* the aio core uses this
 * for cancellation */
+
+   /*
+* If the aio_resfd field of the userspace iocb is not zero,
+* this is the underlying file* to deliver event to.
+*/
+   struct file *ki_eventfd;
 };
 
 #define is_sync_kiocb(iocb) ((iocb)->ki_key == KIOCB_SYNC_KEY)
Index: linux-2.6.21-rc3.quilt/include/linux/aio_abi.h
===
--- linux-2.6.21-rc3.quilt.orig/include/linux/aio_abi.h 2007-03-15 
15:52:45.0 -0700
+++ linux-2.6.21-rc3.quilt/include/linux/aio_abi.h  2007-03-15 
16:13:45.0 -0700
@@ -84,7 +84,11 @@
 
/* extra parameters */
__u64   aio_reserved2;  /* TODO: use this for a (struct sigevent *) */
-   __u64   aio_reserved3;
+   __u32   aio_reserved3;
+   /*
+* If different from 0, this is an eventfd to deliver AIO results to
+*/
+   __u32   aio_resfd;
 }; /* 64 bytes */
 
 #undef IFBIG



[patch 9/13] signal/timer/event fds v6 - timerfd compat code ...

2007-03-15 Thread Davide Libenzi
This patch implements the necessary compat code for the timerfd system call.


Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.21-rc3.quilt/fs/compat.c
===
--- linux-2.6.21-rc3.quilt.orig/fs/compat.c 2007-03-15 15:53:11.0 
-0700
+++ linux-2.6.21-rc3.quilt/fs/compat.c  2007-03-15 16:11:52.0 -0700
@@ -2257,3 +2257,23 @@
return sys_signalfd(ufd, ksigmask, sizeof(sigset_t));
 }
 
+
+asmlinkage long compat_sys_timerfd(int ufd, int clockid, int flags,
+  const struct compat_itimerspec __user *utmr)
+{
+   long res;
+   struct itimerspec t;
+   struct itimerspec __user *ut;
+
+   res = -EFAULT;
+   if (get_compat_itimerspec(&t, utmr))
+   goto err_exit;
+   ut = compat_alloc_user_space(sizeof(*ut));
+   if (copy_to_user(ut, &t, sizeof(t)))
+   goto err_exit;
+
+   res = sys_timerfd(ufd, clockid, flags, ut);
+err_exit:
+   return res;
+}
+
Index: linux-2.6.21-rc3.quilt/include/linux/compat.h
===
--- linux-2.6.21-rc3.quilt.orig/include/linux/compat.h  2007-03-15 
15:53:11.0 -0700
+++ linux-2.6.21-rc3.quilt/include/linux/compat.h   2007-03-15 
16:11:52.0 -0700
@@ -225,6 +225,11 @@
return lhs->tv_nsec - rhs->tv_nsec;
 }
 
+extern int get_compat_itimerspec(struct itimerspec *dst,
+const struct compat_itimerspec __user *src);
+extern int put_compat_itimerspec(struct compat_itimerspec __user *dst,
+const struct itimerspec *src);
+
 asmlinkage long compat_sys_adjtimex(struct compat_timex __user *utp);
 
 extern int compat_printk(const char *fmt, ...);
Index: linux-2.6.21-rc3.quilt/kernel/compat.c
===
--- linux-2.6.21-rc3.quilt.orig/kernel/compat.c 2007-03-15 15:53:11.0 
-0700
+++ linux-2.6.21-rc3.quilt/kernel/compat.c  2007-03-15 16:11:52.0 
-0700
@@ -475,8 +475,8 @@
return min_length;
 }
 
-static int get_compat_itimerspec(struct itimerspec *dst, 
-struct compat_itimerspec __user *src)
+int get_compat_itimerspec(struct itimerspec *dst,
+ const struct compat_itimerspec __user *src)
 { 
if (get_compat_timespec(&dst->it_interval, &src->it_interval) ||
get_compat_timespec(&dst->it_value, &src->it_value))
@@ -484,8 +484,8 @@
return 0;
 } 
 
-static int put_compat_itimerspec(struct compat_itimerspec __user *dst, 
-struct itimerspec *src)
+int put_compat_itimerspec(struct compat_itimerspec __user *dst,
+ const struct itimerspec *src)
 { 
if (put_compat_timespec(&src->it_interval, &dst->it_interval) ||
put_compat_timespec(&src->it_value, &dst->it_value))



[patch 8/13] signal/timer/event fds v6 - timerfd wire up x86_64 arch ...

2007-03-15 Thread Davide Libenzi
This patch wires the timerfd system call to the x86_64 architecture.



Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.21-rc3.quilt/arch/x86_64/ia32/ia32entry.S
===
--- linux-2.6.21-rc3.quilt.orig/arch/x86_64/ia32/ia32entry.S2007-03-15 
15:53:13.0 -0700
+++ linux-2.6.21-rc3.quilt/arch/x86_64/ia32/ia32entry.S 2007-03-15 
16:11:50.0 -0700
@@ -720,4 +720,5 @@
.quad sys_getcpu
.quad sys_epoll_pwait
.quad sys_signalfd  /* 320 */
+   .quad sys_timerfd
 ia32_syscall_end:  
Index: linux-2.6.21-rc3.quilt/include/asm-x86_64/unistd.h
===
--- linux-2.6.21-rc3.quilt.orig/include/asm-x86_64/unistd.h 2007-03-15 
15:53:13.0 -0700
+++ linux-2.6.21-rc3.quilt/include/asm-x86_64/unistd.h  2007-03-15 
16:11:50.0 -0700
@@ -621,8 +621,10 @@
 __SYSCALL(__NR_move_pages, sys_move_pages)
 #define __NR_signalfd  280
 __SYSCALL(__NR_signalfd, sys_signalfd)
+#define __NR_timerfd   281
+__SYSCALL(__NR_timerfd, sys_timerfd)
 
-#define __NR_syscall_max __NR_signalfd
+#define __NR_syscall_max __NR_timerfd
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR



[patch 12/13] signal/timer/event fds v6 - eventfd wire up x86_64 arch ...

2007-03-15 Thread Davide Libenzi
This patch wires the eventfd system call to the x86_64 architecture.



Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.21-rc3.quilt/arch/x86_64/ia32/ia32entry.S
===
--- linux-2.6.21-rc3.quilt.orig/arch/x86_64/ia32/ia32entry.S2007-03-15 
16:11:50.0 -0700
+++ linux-2.6.21-rc3.quilt/arch/x86_64/ia32/ia32entry.S 2007-03-15 
16:13:43.0 -0700
@@ -721,4 +721,5 @@
.quad sys_epoll_pwait
.quad sys_signalfd  /* 320 */
.quad sys_timerfd
+   .quad sys_eventfd
 ia32_syscall_end:  
Index: linux-2.6.21-rc3.quilt/include/asm-x86_64/unistd.h
===
--- linux-2.6.21-rc3.quilt.orig/include/asm-x86_64/unistd.h 2007-03-15 
16:11:50.0 -0700
+++ linux-2.6.21-rc3.quilt/include/asm-x86_64/unistd.h  2007-03-15 
16:13:43.0 -0700
@@ -623,8 +623,10 @@
 __SYSCALL(__NR_signalfd, sys_signalfd)
 #define __NR_timerfd   281
 __SYSCALL(__NR_timerfd, sys_timerfd)
+#define __NR_eventfd   282
+__SYSCALL(__NR_eventfd, sys_eventfd)
 
-#define __NR_syscall_max __NR_timerfd
+#define __NR_syscall_max __NR_eventfd
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR



[patch 11/13] signal/timer/event fds v6 - eventfd wire up i386 arch ...

2007-03-15 Thread Davide Libenzi
This patch wires the eventfd system call to the i386 architecture.



Signed-off-by: Davide Libenzi 


- Davide


Index: linux-2.6.21-rc3.quilt/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.21-rc3.quilt.orig/arch/i386/kernel/syscall_table.S
2007-03-15 16:11:47.0 -0700
+++ linux-2.6.21-rc3.quilt/arch/i386/kernel/syscall_table.S 2007-03-15 
16:13:40.0 -0700
@@ -321,3 +321,4 @@
.long sys_epoll_pwait
.long sys_signalfd  /* 320 */
.long sys_timerfd
+   .long sys_eventfd
Index: linux-2.6.21-rc3.quilt/include/asm-i386/unistd.h
===
--- linux-2.6.21-rc3.quilt.orig/include/asm-i386/unistd.h   2007-03-15 
16:11:47.0 -0700
+++ linux-2.6.21-rc3.quilt/include/asm-i386/unistd.h2007-03-15 
16:13:40.0 -0700
@@ -327,10 +327,11 @@
 #define __NR_epoll_pwait   319
 #define __NR_signalfd  320
 #define __NR_timerfd   321
+#define __NR_eventfd   322
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 322
+#define NR_syscalls 323
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR



Re: thread stacks and strict vm overcommit accounting

2007-03-15 Thread Alan Cox
> > > With a typical size as a fuzz factor preaccounted in later kernels.
> > 
> > Where's that done?
> 
> I don't know what Alan is referring to there.

fs/exec.c - we add 20 pages to the stack vma size initially.

> We've no more committed to providing each instance with 8MB of stack,
> than we've committed to providing each instance with RLIMIT_AS of
> address space.  The rlimits are limits, not commitments, surely?

Yes, it's just that the C programming language is utterly and
mindbogglingly broken when it comes to resource exhaustion for the stack.

Alan


[patch 7/13] signal/timer/event fds v6 - timerfd wire up i386 arch ...

2007-03-15 Thread Davide Libenzi
This patch wires the timerfd system call to the i386 architecture.



Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.21-rc3.quilt/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.21-rc3.quilt.orig/arch/i386/kernel/syscall_table.S
2007-03-15 15:53:15.0 -0700
+++ linux-2.6.21-rc3.quilt/arch/i386/kernel/syscall_table.S 2007-03-15 
16:11:47.0 -0700
@@ -320,3 +320,4 @@
.long sys_getcpu
.long sys_epoll_pwait
.long sys_signalfd  /* 320 */
+   .long sys_timerfd
Index: linux-2.6.21-rc3.quilt/include/asm-i386/unistd.h
===
--- linux-2.6.21-rc3.quilt.orig/include/asm-i386/unistd.h   2007-03-15 
15:53:15.0 -0700
+++ linux-2.6.21-rc3.quilt/include/asm-i386/unistd.h2007-03-15 
16:11:47.0 -0700
@@ -326,10 +326,11 @@
 #define __NR_getcpu 318
 #define __NR_epoll_pwait   319
 #define __NR_signalfd  320
+#define __NR_timerfd   321
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 321
+#define NR_syscalls 322
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR



[patch 3/13] signal/timer/event fds v6 - signalfd wire up i386 arch ...

2007-03-15 Thread Davide Libenzi
This patch wires the signalfd system call to the i386 architecture.



Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.21-rc3.quilt/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.21-rc3.quilt.orig/arch/i386/kernel/syscall_table.S
2007-02-04 10:44:54.0 -0800
+++ linux-2.6.21-rc3.quilt/arch/i386/kernel/syscall_table.S 2007-03-15 
15:34:12.0 -0700
@@ -319,3 +319,4 @@
.long sys_move_pages
.long sys_getcpu
.long sys_epoll_pwait
+   .long sys_signalfd  /* 320 */
Index: linux-2.6.21-rc3.quilt/include/asm-i386/unistd.h
===
--- linux-2.6.21-rc3.quilt.orig/include/asm-i386/unistd.h   2007-02-04 
10:44:54.0 -0800
+++ linux-2.6.21-rc3.quilt/include/asm-i386/unistd.h2007-03-15 
15:34:12.0 -0700
@@ -325,10 +325,11 @@
 #define __NR_move_pages 317
 #define __NR_getcpu 318
 #define __NR_epoll_pwait   319
+#define __NR_signalfd  320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 320
+#define NR_syscalls 321
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR



[patch 10/13] signal/timer/event fds v6 - eventfd core ...

2007-03-15 Thread Davide Libenzi
This is a very simple and light file descriptor that can be used for
event wait/dispatch by userspace (both wait and dispatch) and by the
kernel (dispatch only). It can be used instead of pipe(2) in all cases
where pipes would simply be used to signal events. Its kernel overhead
is much lower than that of a pipe, and it does not consume two fds. When
used in the kernel, it can offer an fd-bridge to enable, for example,
functionality like KAIO or syslets/threadlets to signal the completion of
certain operations to an fd. More generally, an eventfd can be used by the
kernel to signal readiness, in a POSIX poll/select way, of interfaces that
would otherwise be incompatible with it. The API is:

int eventfd(unsigned int count);

The eventfd API accepts an initial "count" parameter, and returns an
eventfd fd. It supports poll(2) (POLLIN, POLLOUT, POLLERR), read(2) and 
write(2).
The POLLIN flag is raised when the internal counter is greater than zero.
The POLLOUT flag is raised when at least a value of "1" can be written to
the internal counter.
The POLLERR flag is raised when an overflow in the counter value is detected.
The write(2) operation can never overflow the counter, since it blocks
(unless O_NONBLOCK is set, in which case -EAGAIN is returned).
But the eventfd_signal() function can overflow it, since it is not supposed
to sleep during its operation.
The read(2) function reads the __u64 counter value, and resets the internal
value to zero. If the value read is equal to (__u64) -1, an overflow
happened on the internal counter (due to 2^64 eventfd_signal() posts
that have never been retired - unlikely, but possible).
The write(2) call writes a __u64 count value, and adds it
to the current counter. The eventfd fd also supports O_NONBLOCK.
On the kernel side, we have:

struct file *eventfd_fget(int fd);
int eventfd_signal(struct file *file, unsigned int n);

The eventfd_fget() function should be called to get a struct file* from an
eventfd fd (this is an fget() + check of f_op being an eventfd fops pointer).
The kernel can then call eventfd_signal() every time it wants to post
an event to userspace. The eventfd_signal() function can be called from any
context.
A simple eventfd() test and benchmark is available here:

http://www.xmailserver.org/eventfd-bench.c

This is the eventfd-based version of pipetest-4 (pipe(2) based):

http://www.xmailserver.org/pipetest-4.c

Not that performance matters much in the eventfd case, but eventfd-bench
shows almost double the performance of pipetest-4.
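
The userspace semantics described above (writes add to the counter, POLLIN while it is non-zero, read returns the counter and resets it) can be exercised with a short sketch. It assumes the interface as later exposed by glibc in <sys/eventfd.h>, where eventfd() takes an extra flags argument, rather than the single-argument form of this patch; post_and_drain() and drained_read_would_block() are hypothetical helper names.

```c
#include <errno.h>
#include <poll.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Posts two counts to a fresh eventfd, checks that POLLIN is raised, and
 * drains the counter with a single read. Returns the value read, or 0 on
 * failure.
 */
static uint64_t post_and_drain(uint64_t first, uint64_t second)
{
    struct pollfd pfd;
    uint64_t value = 0;
    int efd = eventfd(0, 0);

    if (efd < 0)
        return 0;
    if (write(efd, &first, sizeof(first)) != (ssize_t)sizeof(first) ||
        write(efd, &second, sizeof(second)) != (ssize_t)sizeof(second))
        goto out;

    pfd.fd = efd;
    pfd.events = POLLIN;
    if (poll(&pfd, 1, 0) == 1 && (pfd.revents & POLLIN)) {
        if (read(efd, &value, sizeof(value)) != (ssize_t)sizeof(value))
            value = 0;
    }
out:
    close(efd);
    return value;
}

/*
 * Returns true if, after the counter has been read once, a second read on
 * a non-blocking eventfd fails with EAGAIN (the counter was reset to zero).
 */
static bool drained_read_would_block(void)
{
    uint64_t value = 1;
    bool would_block = false;
    int efd = eventfd(0, EFD_NONBLOCK);

    if (efd < 0)
        return false;
    if (write(efd, &value, sizeof(value)) == (ssize_t)sizeof(value) &&
        read(efd, &value, sizeof(value)) == (ssize_t)sizeof(value))
        would_block = read(efd, &value, sizeof(value)) < 0 &&
                      errno == EAGAIN;
    close(efd);
    return would_block;
}
```

Two posts of 3 and 4 are retired by one read that yields 7, which is the property that lets a single descriptor stand in for a pipe used purely as a wakeup channel.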




Signed-off-by: Davide Libenzi 



- Davide



Index: linux-2.6.21-rc3.quilt/fs/Makefile
===
--- linux-2.6.21-rc3.quilt.orig/fs/Makefile 2007-03-15 15:53:07.0 
-0700
+++ linux-2.6.21-rc3.quilt/fs/Makefile  2007-03-15 16:11:54.0 -0700
@@ -11,7 +11,7 @@
attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o drop_caches.o splice.o sync.o utimes.o \
-   stack.o anon_inodes.o signalfd.o timerfd.o
+   stack.o anon_inodes.o signalfd.o timerfd.o eventfd.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y +=   buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
Index: linux-2.6.21-rc3.quilt/include/linux/syscalls.h
===
--- linux-2.6.21-rc3.quilt.orig/include/linux/syscalls.h2007-03-15 
15:53:07.0 -0700
+++ linux-2.6.21-rc3.quilt/include/linux/syscalls.h 2007-03-15 
16:11:54.0 -0700
@@ -605,6 +605,7 @@
 asmlinkage long sys_signalfd(int ufd, sigset_t __user *user_mask, size_t 
sizemask);
 asmlinkage long sys_timerfd(int ufd, int clockid, int flags,
const struct itimerspec __user *utmr);
+asmlinkage long sys_eventfd(unsigned int count);
 
 int kernel_execve(const char *filename, char *const argv[], char *const 
envp[]);
 
Index: linux-2.6.21-rc3.quilt/fs/eventfd.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.21-rc3.quilt/fs/eventfd.c 2007-03-15 16:11:54.0 -0700
@@ -0,0 +1,271 @@
+/*
+ *  fs/eventfd.c
+ *
+ *  Copyright (C) 2007  Davide Libenzi 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+
+
+struct eventfd_ctx {
+   spinlock_t lock;
+   wait_queue_head_t wqh;
+   __u64 count;
+};
+
+
+static void eventfd_cleanup(struct eventfd_ctx *ctx);
+static int eventfd_close(struct inode *inode, struct file *file);
+static unsigned int eventfd_poll(struct file *file, poll_table *wait);
+static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
+   loff_t *ppos);
+static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t 
count,
+

[patch 5/13] signal/timer/event fds v6 - signalfd compat code ...

2007-03-15 Thread Davide Libenzi
This patch implements the necessary compat code for the signalfd system call.


Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.21-rc3.quilt/fs/compat.c
===
--- linux-2.6.21-rc3.quilt.orig/fs/compat.c 2007-02-04 10:44:54.0 
-0800
+++ linux-2.6.21-rc3.quilt/fs/compat.c  2007-03-15 15:35:58.0 -0700
@@ -46,6 +46,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -2235,3 +2236,24 @@
return sys_ni_syscall();
 }
 #endif
+
+asmlinkage long compat_sys_signalfd(int ufd,
+   const compat_sigset_t __user *sigmask,
+   compat_size_t sigsetsize)
+{
+   compat_sigset_t ss32;
+   sigset_t tmp;
+   sigset_t __user *ksigmask;
+
+   if (sigsetsize != sizeof(compat_sigset_t))
+   return -EINVAL;
+   if (copy_from_user(&ss32, sigmask, sizeof(ss32)))
+   return -EFAULT;
+   sigset_from_compat(&tmp, &ss32);
+   ksigmask = compat_alloc_user_space(sizeof(sigset_t));
+   if (copy_to_user(ksigmask, &tmp, sizeof(sigset_t)))
+   return -EFAULT;
+
+   return sys_signalfd(ufd, ksigmask, sizeof(sigset_t));
+}
+



[patch 1/13] signal/timer/event fds v6 - anonymous inode source ...

2007-03-15 Thread Davide Libenzi
This patch adds an anonymous inode source, to be used for files that need
an inode only in order to create a file*. We do not care about having an
inode for each file, and we do not even care about having different names in
the associated dentries (dentry names will be the same for classes of file*).
This allows code reuse, and will be used by epoll, signalfd and timerfd
(and whatever else there'll be).



Signed-off-by: Davide Libenzi 



- Davide



Index: linux-2.6.21-rc3.quilt/fs/anon_inodes.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.21-rc3.quilt/fs/anon_inodes.c 2007-03-15 15:32:33.0 
-0700
@@ -0,0 +1,203 @@
+/*
+ *  fs/anon_inodes.c
+ *
+ *  Copyright (C) 2007  Davide Libenzi 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+
+
+static int ainofs_delete_dentry(struct dentry *dentry);
+static struct inode *aino_getinode(void);
+static struct inode *aino_mkinode(void);
+static int ainofs_get_sb(struct file_system_type *fs_type, int flags,
+const char *dev_name, void *data, struct vfsmount 
*mnt);
+
+
+
+static struct vfsmount *aino_mnt __read_mostly;
+static struct inode *aino_inode;
+static struct file_operations aino_fops = { };
+static struct file_system_type aino_fs_type = {
+   .name   = "ainofs",
+   .get_sb = ainofs_get_sb,
+   .kill_sb= kill_anon_super,
+};
+static struct dentry_operations ainofs_dentry_operations = {
+   .d_delete   = ainofs_delete_dentry,
+};
+
+
+
+int aino_getfd(int *pfd, struct inode **pinode, struct file **pfile,
+  char const *name, const struct file_operations *fops, void *priv)
+{
+   struct qstr this;
+   struct dentry *dentry;
+   struct inode *inode;
+   struct file *file;
+   int error, fd;
+
+   error = -ENFILE;
+   file = get_empty_filp();
+   if (!file)
+   goto eexit_1;
+
+   inode = aino_getinode();
+   if (IS_ERR(inode)) {
+   error = PTR_ERR(inode);
+   goto eexit_2;
+   }
+
+   error = get_unused_fd();
+   if (error < 0)
+   goto eexit_3;
+   fd = error;
+
+   /*
+* Link the inode to a directory entry by creating a unique name
+* using the inode sequence number.
+*/
+   error = -ENOMEM;
+   this.name = name;
+   this.len = strlen(name);
+   this.hash = 0;
+   dentry = d_alloc(aino_mnt->mnt_sb->s_root, &this);
+   if (!dentry)
+   goto eexit_4;
+   dentry->d_op = &ainofs_dentry_operations;
+   /* Do not publish this dentry inside the global dentry hash table */
+   dentry->d_flags &= ~DCACHE_UNHASHED;
+   d_instantiate(dentry, inode);
+
+   file->f_path.mnt = mntget(aino_mnt);
+   file->f_path.dentry = dentry;
+   file->f_mapping = inode->i_mapping;
+
+   file->f_pos = 0;
+   file->f_flags = O_RDWR;
+   file->f_op = fops;
+   file->f_mode = FMODE_READ | FMODE_WRITE;
+   file->f_version = 0;
+   file->private_data = priv;
+
+   fd_install(fd, file);
+
+   *pfd = fd;
+   *pinode = inode;
+   *pfile = file;
+   return 0;
+
+eexit_4:
+   put_unused_fd(fd);
+eexit_3:
+   iput(inode);
+eexit_2:
+   put_filp(file);
+eexit_1:
+   return error;
+}
+
+
+static int ainofs_delete_dentry(struct dentry *dentry)
+{
+   /*
+* We faked vfs to believe the dentry was hashed when we created it.
+* Now we restore the flag so that dput() will work correctly.
+*/
+   dentry->d_flags |= DCACHE_UNHASHED;
+   return 1;
+}
+
+
+static struct inode *aino_getinode(void)
+{
+   return igrab(aino_inode);
+}
+
+
+/*
+ * A single inode exists for all aino files. Unlike pipes,
+ * aino inodes have no per-instance data associated, so we can avoid
+ * allocating more than one of them.
+ */
+static struct inode *aino_mkinode(void)
+{
+   int error = -ENOMEM;
+   struct inode *inode = new_inode(aino_mnt->mnt_sb);
+
+   if (!inode)
+   goto eexit_1;
+
+   inode->i_fop = &aino_fops;
+
+   /*
+* Mark the inode dirty from the very beginning,
+* that way it will never be moved to the dirty
+* list because mark_inode_dirty() will think
+* that it already _is_ on the dirty list.
+*/
+   inode->i_state = I_DIRTY;
+   inode->i_mode = S_IRUSR | S_IWUSR;
+   inode->i_uid = current->fsuid;
+   inode->i_gid = current->fsgid;
+   inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+   return inode;
+
+eexit_1:
+   return ERR_PTR(error);
+}
+
+
+static int ainofs_get_sb(struct file_system_type *fs_type, int flags,
+const char *dev_name, void *data, struct vfsmount *mnt)
+{
+   return get_sb_pseudo(fs_type, "aino:", NULL, 

[patch 2/13] signal/timer/event fds v6 - signalfd core ...

2007-03-15 Thread Davide Libenzi
This patch series implements the new signalfd() system call.
I took part of the original Linus code (and you know how
badly it can be broken :), and I added even more breakage ;)
Signals are fetched from the same signal queue used by the process,
so signalfd will compete with standard kernel delivery in dequeue_signal().
If you want to reliably fetch signals on the signalfd file, you need to
block them with sigprocmask(SIG_BLOCK).
This seems to be working fine on my Dual Opteron machine. I made a quick 
test program for it:

http://www.xmailserver.org/signafd-test.c

The signalfd() system call implements signal delivery into a file
descriptor receiver. The signalfd file descriptor is created with the
following API:

int signalfd(int ufd, const sigset_t *mask, size_t masksize);

The "ufd" parameter allows changing the sigmask of an existing signalfd
without going through a close/create cycle (Linus' idea). Use "ufd" == -1 if
you want a brand new signalfd file.
The "mask" parameter specifies the set of signals that we are
interested in. The "masksize" parameter is the size of "mask".
The signalfd fd supports the poll(2) and read(2) system calls. The poll(2)
call will return POLLIN when signals are available to be dequeued. As a direct
consequence of supporting the Linux poll subsystem, the signalfd fd can be
used together with epoll(2) too.
The read(2) system call will return a "struct signalfd_siginfo" structure
in the userspace supplied buffer. The return value is the number of bytes
copied into the supplied buffer, or -1 in case of error. The read(2) call
can also return 0, in case the sighand structure to which the signalfd
was attached has been orphaned. The O_NONBLOCK flag is also supported, and
read(2) will return -EAGAIN in case no signal is available.
The format of the struct signalfd_siginfo is shown below; which fields are
valid depends on the (->code & __SI_MASK) value, in the same way as for a
struct siginfo:

struct signalfd_siginfo {
	__u32 signo;	/* si_signo */
	__s32 err;	/* si_errno */
	__s32 code;	/* si_code */
	__u32 pid;	/* si_pid */
	__u32 uid;	/* si_uid */
	__s32 fd;	/* si_fd */
	__u32 tid;	/* si_tid */
	__u32 band;	/* si_band */
	__u32 overrun;	/* si_overrun */
	__u32 trapno;	/* si_trapno */
	__s32 status;	/* si_status */
	__s32 svint;	/* si_int */
	__u64 svptr;	/* si_ptr */
	__u64 utime;	/* si_utime */
	__u64 stime;	/* si_stime */
	__u64 addr;	/* si_addr */
};



Signed-off-by: Davide Libenzi 



- Davide



Index: linux-2.6.21-rc3.quilt/fs/signalfd.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.21-rc3.quilt/fs/signalfd.c2007-03-15 15:33:52.0 
-0700
@@ -0,0 +1,381 @@
+/*
+ *  fs/signalfd.c
+ *
+ *  Copyright (C) 2003  Linus Torvalds
+ *
+ *  Mon Mar 5, 2007: Davide Libenzi 
+ *  Changed ->read() to return a siginfo structure instead of signal number.
+ *  Fixed locking in ->poll().
+ *  Added sighand-detach notification.
+ *  Added fd re-use in sys_signalfd() syscall.
+ *  Now using anonymous inode source.
+ *  Thanks to Oleg Nesterov for useful code review and suggestions.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+
+
+struct signalfd_ctx {
+   struct list_head lnk;
+   wait_queue_head_t wqh;
+   sigset_t sigmask;
+   struct task_struct *tsk;
+};
+
+
+
+static struct sighand_struct *signalfd_get_sighand(struct signalfd_ctx *ctx,
+  unsigned long *flags);
+static void signalfd_put_sighand(struct signalfd_ctx *ctx,
+struct sighand_struct *sighand,
+unsigned long *flags);
+static void signalfd_cleanup(struct signalfd_ctx *ctx);
+static int signalfd_close(struct inode *inode, struct file *file);
+static unsigned int signalfd_poll(struct file *file, poll_table *wait);
+static int signalfd_copyinfo(struct signalfd_siginfo __user *uinfo,
+siginfo_t const *kinfo);
+static ssize_t signalfd_read(struct file *file, char __user *buf, size_t count,
+loff_t *ppos);
+
+
+
+static const struct file_operations signalfd_fops = {
+   .release= signalfd_close,
+   .poll   = signalfd_poll,
+   .read   = signalfd_read,
+};
+static struct kmem_cache *signalfd_ctx_cachep;
+
+
+
+static struct sighand_struct *signalfd_get_sighand(struct signalfd_ctx *ctx,
+  unsigned long *flags)
+{
+   struct sighand_struct *sighand;
+
+   rcu_read_lock();
+   sighand = lock_task_sighand(ctx->tsk, flags);
+   rcu_read_unlock();
+
+   if (sighand && 

[patch 6/13] signal/timer/event fds v6 - timerfd core ...

2007-03-15 Thread Davide Libenzi
This patch introduces a new system call for timer events delivered
through file descriptors. This allows timer events to be used with
standard POSIX poll(2), select(2) and read(2). As a consequence of
supporting the Linux f_op->poll subsystem, they can be used with
epoll(2) too.
The system call is defined as:

int timerfd(int ufd, int clockid, int flags, const struct itimerspec *utmr);

The "ufd" parameter allows for re-use (re-programming) of an existing
timerfd without going through the close/open cycle (same as signalfd).
If "ufd" is -1, a new file descriptor will be created, otherwise the
existing "ufd" will be re-programmed.
The "clockid" parameter is either CLOCK_MONOTONIC or CLOCK_REALTIME.
The time specified in the "utmr->it_value" parameter is the expiry
time for the timer.
If the TFD_TIMER_ABSTIME flag is set in "flags", this is an absolute
time, otherwise it's a relative time.
If the time specified in "utmr->it_interval" is not zero (zero meaning
tv_sec == 0 and tv_nsec == 0), this is the period at which the following
ticks should be generated.
The "utmr->it_interval" should be set to zero if only one tick is requested.
Setting the "utmr->it_value" to zero will disable the timer, or will create
a timerfd without the timer enabled.
The function returns the new (or same, in case "ufd" is a valid timerfd
descriptor) file descriptor, or -1 in case of error.
As stated before, the timerfd file descriptor supports poll(2), select(2)
and epoll(2). When a timer event happens on the timerfd, a POLLIN mask
will be returned.
The read(2) call can be used, and it will return a u32 variable holding
the number of "ticks" that happened on the interface since the last call
to read(2). The read(2) call supports the O_NONBLOCK flag too, and EAGAIN
will be returned if no ticks happened.
A quick test program shows timerfd working correctly on my amd64 box:

http://www.xmailserver.org/timerfd-test.c




Signed-off-by: Davide Libenzi 



- Davide



Index: linux-2.6.21-rc3.quilt/fs/timerfd.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.21-rc3.quilt/fs/timerfd.c 2007-03-15 16:08:05.0 -0700
@@ -0,0 +1,257 @@
+/*
+ *  fs/timerfd.c
+ *
+ *  Copyright (C) 2007  Davide Libenzi 
+ *
+ *
+ *  Thanks to Thomas Gleixner for code reviews and useful comments.
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+
+
+struct timerfd_ctx {
+   struct hrtimer tmr;
+   ktime_t texp, tintv;
+   spinlock_t lock;
+   wait_queue_head_t wqh;
+   unsigned long ticks;
+};
+
+
+static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr);
+static void timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags,
+ const struct itimerspec *ktmr);
+static int timerfd_close(struct inode *inode, struct file *file);
+static unsigned int timerfd_poll(struct file *file, poll_table *wait);
+static ssize_t timerfd_read(struct file *file, char __user *buf, size_t count,
+   loff_t *ppos);
+
+
+
+static const struct file_operations timerfd_fops = {
+   .release= timerfd_close,
+   .poll   = timerfd_poll,
+   .read   = timerfd_read,
+};
+static struct kmem_cache *timerfd_ctx_cachep;
+
+
+
+static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr)
+{
+   struct timerfd_ctx *ctx = container_of(htmr, struct timerfd_ctx, tmr);
+   enum hrtimer_restart rval = HRTIMER_NORESTART;
+   unsigned long flags;
+
+   spin_lock_irqsave(&ctx->lock, flags);
+   ctx->ticks++;
+   wake_up_locked(&ctx->wqh);
+   if (ctx->tintv.tv64 != 0) {
+   hrtimer_forward(htmr, hrtimer_cb_get_time(htmr), ctx->tintv);
+   rval = HRTIMER_RESTART;
+   }
+   spin_unlock_irqrestore(&ctx->lock, flags);
+
+   return rval;
+}
+
+
+static void timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags,
+ const struct itimerspec *ktmr)
+{
+   enum hrtimer_mode htmode;
+
+   htmode = (flags & TFD_TIMER_ABSTIME) ? HRTIMER_MODE_ABS: HRTIMER_MODE_REL;
+
+   ctx->ticks = 0;
+   ctx->texp = timespec_to_ktime(ktmr->it_value);
+   ctx->tintv = timespec_to_ktime(ktmr->it_interval);
+   hrtimer_init(&ctx->tmr, clockid, htmode);
+   ctx->tmr.expires = ctx->texp;
+   ctx->tmr.function = timerfd_tmrproc;
+   if (ctx->texp.tv64 != 0)
+   hrtimer_start(&ctx->tmr, ctx->texp, htmode);
+}
+
+
+asmlinkage long sys_timerfd(int ufd, int clockid, int flags,
+   const struct itimerspec __user *utmr)
+{
+   int error;
+   struct timerfd_ctx *ctx;
+   struct file *file;
+   struct inode *inode;
+   struct itimerspec ktmr;
+
+   if (copy_from_user(&ktmr, utmr, sizeof(ktmr)))
+   return -EFAULT;
+
+   if (clockid != 

[patch 4/13] signal/timer/event fds v6 - signalfd wire up x86_64 arch ...

2007-03-15 Thread Davide Libenzi
This patch wires the signalfd system call into the x86_64 architecture.



Signed-off-by: Davide Libenzi 


- Davide



Index: linux-2.6.21-rc3.quilt/include/asm-x86_64/unistd.h
===
--- linux-2.6.21-rc3.quilt.orig/include/asm-x86_64/unistd.h 2007-02-04 10:44:54.0 -0800
+++ linux-2.6.21-rc3.quilt/include/asm-x86_64/unistd.h  2007-03-15 15:34:29.0 -0700
@@ -619,8 +619,10 @@
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages	279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_signalfd  280
+__SYSCALL(__NR_signalfd, sys_signalfd)
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_signalfd
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.21-rc3.quilt/arch/x86_64/ia32/ia32entry.S
===
--- linux-2.6.21-rc3.quilt.orig/arch/x86_64/ia32/ia32entry.S	2007-03-15 15:19:20.0 -0700
+++ linux-2.6.21-rc3.quilt/arch/x86_64/ia32/ia32entry.S 2007-03-15 15:35:35.0 -0700
@@ -714,9 +714,10 @@
.quad compat_sys_get_robust_list
.quad sys_splice
.quad sys_sync_file_range
-   .quad sys_tee
+   .quad sys_tee   /* 315 */
.quad compat_sys_vmsplice
.quad compat_sys_move_pages
.quad sys_getcpu
.quad sys_epoll_pwait
+   .quad sys_signalfd  /* 320 */
 ia32_syscall_end:  

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1] Allow i386 crash kernels to handle x86_64 dumps

2007-03-15 Thread Horms
On Thu, Mar 15, 2007 at 01:42:39PM +, Ian Campbell wrote:
> On Thu, 2007-03-15 at 18:56 +0530, Vivek Goyal wrote:
> > 
> > Ideal place for this probably should have been arch dependent
> > crash_dump.h file. But we don't have one and no point introducing one
> > just for this  macro.
> 
> Agreed.
> 
> > This change looks good to me. 
> 
> Is there a kdump tree which you'll apply to or shall I resend CCing
> akpm? (I'll add an Acked-by if that's ok).

There isn't a kexec tree at this time (though I am happy to entertain
creating one). For now most patches go in either through Andrew or the
relevant architecture maintainers.

-- 
Horms
  H: http://www.vergenet.net/~horms/
  W: http://www.valinux.co.jp/en/



Re: [PATCH 1/1] Allow i386 crash kernels to handle x86_64 dumps

2007-03-15 Thread Horms
On Thu, Mar 15, 2007 at 06:56:16PM +0530, Vivek Goyal wrote:
> On Thu, Mar 15, 2007 at 12:22:57PM +, Ian Campbell wrote:
> > On Thu, 2007-03-15 at 11:17 +0530, Vivek Goyal wrote:
> > > > > But I think changing this macro might run into issues. It is
> > > > > being used at few places in kernel, for example while loading
> > > > > module. This will essentially mean that we allow loading 64bit
> > > > > x86_64 modules on 32bit i386 systems?
> > 
> > Yes, not sure how I missed that fact...
> > 
> > > Kexec will also not allow loading an x86_64 kernel on a 32bit machine.
> > 
> > For crash kernel only or for regular kexec too?
> > 
> 
> I think for both. One of the possible reasons I think is that one never
> knows if the underlying machine has got 64bit extensions or not. So even if
> we load the kernel it will never boot. Secondly, we might not be able to
> handle 64bit address in 32bit kernel/user space?

Perhaps I am misunderstanding what you are saying, but I do
recall kexecing from 32->64 and 64->32 bit kernels on x86_64 hardware.
I can run these checks again if it helps.

> > > So how about something like vmcore_elf_allowed_cross_arch()? Vmcore
> > > code can continue to check elf_check_arch() and if that fails it can
> > > invoke vmcore_elf_allowed_cross_arch() to find out what cross arch are
> > > allowed for vmcore. 
> > 
> > Something like this?
> > 
> > Ian.
> > 
> > ---  
> > 
> > Allow i386 crash kernels to handle x86_64 dumps.
> > 
> > The specific case I am encountering is kdump under Xen with a 64 bit
> > hypervisor and 32 bit kernel/userspace. The dump created is 64 bit
> > due to the hypervisor but the dump kernel is 32 bit for maximum
> > compatibility.
> > 
> > It's possibly less likely to be useful in a purely native scenario but
> > I see no reason to disallow it.
> > 
> > Signed-off-by: Ian Campbell <[EMAIL PROTECTED]>
> > 
> > diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> > index d960507..523e109 100644
> > --- a/fs/proc/vmcore.c
> > +++ b/fs/proc/vmcore.c
> > @@ -514,7 +514,7 @@ static int __init parse_crash_elf64_headers(void)
> > /* Do some basic Verification. */
> > if (memcmp(ehdr.e_ident, ELFMAG, SELFMAG) != 0 ||
> > (ehdr.e_type != ET_CORE) ||
> > -   !elf_check_arch(&ehdr) ||
> > +   !vmcore_elf_check_arch(&ehdr) ||
> > ehdr.e_ident[EI_CLASS] != ELFCLASS64 ||
> > ehdr.e_ident[EI_VERSION] != EV_CURRENT ||
> > ehdr.e_version != EV_CURRENT ||
> > diff --git a/include/asm-i386/kexec.h b/include/asm-i386/kexec.h
> > index 4dfc9f5..c76737e 100644
> > --- a/include/asm-i386/kexec.h
> > +++ b/include/asm-i386/kexec.h
> > @@ -47,6 +47,9 @@
> >  /* The native architecture */
> >  #define KEXEC_ARCH KEXEC_ARCH_386
> > 
> > +/* We can also handle crash dumps from 64 bit kernel. */
> > +#define vmcore_elf_check_arch_cross(x) ((x)->e_machine == EM_X86_64)
> > +
> 
> Ideal place for this probably should have been arch dependent crash_dump.h
> file. But we don't have one and no point introducing one just for this 
> macro.
> 
> This change looks good to me.

Won't the above change break non-i386 architectures, as
vmcore_elf_check_arch_cross isn't defined for them?
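
One possible way to keep other architectures building is a generic fallback
that defines the cross-arch check to 0 unless the architecture overrides it.
The following is only a user-space sketch of that idea, not the patch itself:
struct ehdr_sketch, vmcore_accepts() and the stand-in elf_check_arch() are
hypothetical, and only the two vmcore_elf_check_arch* macro names come from
the quoted diff.

```c
#include <stdint.h>

#define EM_386		3	/* ELF machine numbers from the ELF spec */
#define EM_X86_64	62

/* Hypothetical stand-in for the ELF header; only e_machine matters here. */
struct ehdr_sketch {
	uint16_t e_machine;
};

/* Stand-in for the native check an i386 kernel would perform. */
#define elf_check_arch(x) ((x)->e_machine == EM_386)

/* What the i386 header in the quoted patch adds. */
#define vmcore_elf_check_arch_cross(x) ((x)->e_machine == EM_X86_64)

/*
 * Generic fallback: architectures that define nothing accept no
 * foreign dumps, so they keep compiling and keep their old behaviour.
 */
#ifndef vmcore_elf_check_arch_cross
#define vmcore_elf_check_arch_cross(x) 0
#endif

#define vmcore_elf_check_arch(x) \
	(elf_check_arch(x) || vmcore_elf_check_arch_cross(x))

/* Returns non-zero if a dump for `machine` would be accepted. */
int vmcore_accepts(uint16_t machine)
{
	struct ehdr_sketch ehdr = { .e_machine = machine };

	return vmcore_elf_check_arch(&ehdr);
}
```

With the i386 override in place both EM_386 and EM_X86_64 dumps pass;
without it, the #ifndef branch reduces the check to plain elf_check_arch().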

-- 
Horms
  H: http://www.vergenet.net/~horms/
  W: http://www.valinux.co.jp/en/



Re: [PATCH 2/3] swsusp: Do not use page flags

2007-03-15 Thread Rafael J. Wysocki
On Thursday, 15 March 2007 23:23, Andrew Morton wrote:
> On Thu, 15 Mar 2007 23:19:02 +0100 (CET)
> Jiri Kosina <[EMAIL PROTECTED]> wrote:
> 
> > On Thu, 15 Mar 2007, Andrew Morton wrote:
> > 
> > > > > And why _does_ suspend use GFP_ATOMIC all over the place?
> > > > Generally, because it cannot sleep.
> > > Why not?
> > 
> > I guess it's simply because of kswapd being already frozen, so there is no 
> > chance that once GFP_KERNEL allocation goes to sleep, it is going to get 
> > any free pages eventually ... ?
> 
> No, things should run fine with a dead kswapd.
> 
> There are reasons why we can't call into filesystems from there, but
> GFP_NOIO will ensure that and it is heaps better than GFP_ATOMIC.

In fact the role of swsusp_shrink_memory() is to ensure that our subsequent
atomic allocations won't fail.

Still, the particular allocations in create_basic_memory_bitmaps() are made
before we call swsusp_shrink_memory(), so it's better to use GFP_NOIO in there.

I'll prepare a patch for that on top of the current series.


Re: thread stacks and strict vm overcommit accounting

2007-03-15 Thread Dan Aloni
On Thu, Mar 15, 2007 at 03:36:13PM -0700, Andrew Morton wrote:
> 
> > > > > Is this the intended behaviour?
> > > > 
> > > > That sounds like a bug to me.
> > > 
> > > I'm suspecting it's an oddity rather than a bug.
> > 
> > It is intended behaviour.
> 
> Each instance of
> 
> main()
> {
>   sleep(100);
> }
> 
> appears to increase Committed_AS by around 200kb.  But we've committed to
> providing it with 8MB for stack.
> 
> How come this is correct?

Perhaps it makes a lot of sense if you regard stack growth in
the same sense that you regard heap growth by means of brk().

The fact that the stack is limited by default and RLIMIT_DATA
is unlimited doesn't mean that we need to account for the maximum
stack size.

Perhaps for embedded systems where you want to have overcommit_memory=2 
overcommit_ratio=100 and no swap (for design constraints), just to make
sure that allocations fail *always before* OOM gets triggered (and 
therefore OOM never gets triggered, thankfully), it would have been
useful to look at Committed_AS to see how close the system is
to its maximum memory utilization potential.

Learning about this 'oddity' in Committed_AS, I'd guess it would be 
better for me not to rely on it for measurements and perhaps tweak 
smaller values of RLIMIT_STACK for processes on that embedded system.
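
As a rough illustration of the kind of measurement meant here, the sketch
below parses /proc/meminfo-style text and reports how much of CommitLimit is
currently in Committed_AS. The helper names meminfo_kb() and commit_percent()
are made up for this example; the caller is assumed to have read the file
contents into a string already.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up a "<key>: <value> kB" line in /proc/meminfo-style text.
 * Returns the value in kB, or -1 if the key is missing or malformed.
 */
long meminfo_kb(const char *text, const char *key)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
			long val;

			if (sscanf(line + klen + 1, "%ld", &val) == 1)
				return val;
			return -1;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Percentage of CommitLimit currently committed, or -1 on parse error. */
long commit_percent(const char *text)
{
	long limit = meminfo_kb(text, "CommitLimit");
	long used = meminfo_kb(text, "Committed_AS");

	if (limit <= 0 || used < 0)
		return -1;
	return used * 100 / limit;
}
```

Given the accounting discussed in this thread, the figure reflects current
stack vma sizes, not RLIMIT_STACK, so it understates the worst case.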

-- 
Dan Aloni
XIV LTD, http://www.xivstorage.com
da-x (at) monatomic.org, dan (at) xiv.co.il


Re: [PATCH] [REPOST] x86_64, i386: Add command line length to boot protocol

2007-03-15 Thread H. Peter Anvin

Alon Bar-Lev wrote:

Hello,

I really don't understand why you insist that the boot protocol
>=2.02 had 255 limit!

Please remove this from the description.
You want to add size, that's OK, but please don't mess with previous
definitions.
Boot protocol 2.02 introduced the null terminated string truncated by
kernel, which can be at any size.



Well, except for a very brief window, the limit *was* 255.  If the boot 
loader wants to verify nontruncation, this is a valid concern.


-hpa


Re: [PATCH] mm/filemap.c: unconditionally call mark_page_accessed

2007-03-15 Thread Andrea Arcangeli
On Thu, Mar 15, 2007 at 06:15:45PM -0500, Dave Kleikamp wrote:
> On Thu, 2007-03-15 at 23:59 +0100, Andrea Arcangeli wrote:
> > On Thu, Mar 15, 2007 at 05:44:01PM +, Hugh Dickins wrote:
> > > who removed the !offset condition, he should be consulted on its
> > > reintroduction.
> > 
> > the !offset check looks a pretty broken heuristic indeed, it would
> > break random I/O.
> 
> I wouldn't call it broken.  At worst, I'd say it's imperfect.  But
> that's the nature of a heuristic.  It most likely works in a huge
> majority of cases.

well, IMHO in the huge majority of cases the prev_page check isn't
necessary in the first place (and IMHO it hurts a lot more than it can
help, as demonstrated by specweb, since we'll bite on the good guys to
help the bad guys).

The only case where I can imagine the prev_page to make sense is to
handle contiguous I/O made with a small buffer, so clearly an
inefficient code in the first place. But if this guy is reading with


Re: [PATCH] i386: Simplify smp_call_function*() by using common implementation

2007-03-15 Thread Jeremy Fitzhardinge
Andrew Morton wrote:
> Hopeless, sorry.   It's probably time to start thinking about raising x86
> patches against the x86 tree (at least).
>   

How's this?

J

Subject: Simplify smp_call_function*() by using common implementation

smp_call_function and smp_call_function_single are almost complete
duplicates of the same logic.  This patch combines them by
implementing them in terms of the more general
smp_call_function_mask().

[ Jan, Andi: This only changes arch/i386; can x86_64 be changed in the
  same way? ]

[ Rebased onto Jan's x86_64-mm-consolidate-smp_send_stop patch ]
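
To make the mask handling easier to follow, here is a small user-space model
of how the target set and the wait count are derived. It is a sketch only:
cpumask_sketch_t (a 64-bit bitmap), struct ipi_plan and plan_call_function()
are hypothetical stand-ins, and no IPIs are sent.

```c
#include <stdint.h>

/* One bit per CPU; hypothetical stand-in for the kernel's cpumask_t. */
typedef uint64_t cpumask_sketch_t;

struct ipi_plan {
	cpumask_sketch_t targets;	/* CPUs that would receive the IPI */
	int cpus;			/* number of responses to wait for */
	int allbutself;			/* 1 if the broadcast shortcut applies */
};

/* `self` must be below 64 in this model. */
struct ipi_plan plan_call_function(cpumask_sketch_t mask,
				   cpumask_sketch_t online, unsigned int self)
{
	struct ipi_plan plan = { 0, 0, 0 };
	cpumask_sketch_t allbutself = online & ~((cpumask_sketch_t)1 << self);
	cpumask_sketch_t m;

	/* Like cpus_and(): never target offline CPUs or the caller. */
	plan.targets = mask & allbutself;

	/* Like cpus_weight(): count set bits, clearing the lowest each time. */
	for (m = plan.targets; m; m &= m - 1)
		plan.cpus++;

	/* Broadcast only when every other online CPU is targeted. */
	plan.allbutself = plan.cpus > 0 && plan.targets == allbutself;
	return plan;
}
```

Passing the full online map models smp_call_function(); a single-bit mask
models smp_call_function_single(), and a mask naming only the caller yields
zero CPUs, which corresponds to the early return in the patch.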

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Jan Beulich <[EMAIL PROTECTED]>
Cc: Stephane Eranian <[EMAIL PROTECTED]>
Cc: Andrew Morton <[EMAIL PROTECTED]>
Cc: Andi Kleen <[EMAIL PROTECTED]>
Cc: "Randy.Dunlap" <[EMAIL PROTECTED]>
Cc: Ingo Molnar <[EMAIL PROTECTED]>

---
 arch/i386/kernel/smp.c |  177 +++-
 1 file changed, 86 insertions(+), 91 deletions(-)

===
--- a/arch/i386/kernel/smp.c
+++ b/arch/i386/kernel/smp.c
@@ -515,14 +515,26 @@ void unlock_ipi_call_lock(void)
 
 static struct call_data_struct *call_data;
 
-static void __smp_call_function(void (*func) (void *info), void *info,
-   int nonatomic, int wait)
+
+static int __smp_call_function_mask(cpumask_t mask,
+   void (*func)(void *), void *info,
+   int wait)
 {
struct call_data_struct data;
-   int cpus = num_online_cpus() - 1;
+   cpumask_t allbutself;
+   int cpus;
+
+   /* Can deadlock when called with interrupts disabled */
+   WARN_ON(irqs_disabled());
+
+   allbutself = cpu_online_map;
+   cpu_clear(smp_processor_id(), allbutself);
+
+   cpus_and(mask, mask, allbutself);
+   cpus = cpus_weight(mask);
 
if (!cpus)
-   return;
+   return 0;
 
data.func = func;
data.info = info;
@@ -533,9 +545,12 @@ static void __smp_call_function(void (*f
 
call_data = &data;
mb();
-   
-   /* Send a message to all other CPUs and wait for them to respond */
-   send_IPI_allbutself(CALL_FUNCTION_VECTOR);
+
+   /* Send a message to other CPUs */
+   if (cpus_equal(mask, allbutself))
+   send_IPI_allbutself(CALL_FUNCTION_VECTOR);
+   else
+   send_IPI_mask(mask, CALL_FUNCTION_VECTOR);
 
/* Wait for response */
while (atomic_read(&data.started) != cpus)
@@ -544,6 +559,34 @@ static void __smp_call_function(void (*f
if (wait)
while (atomic_read(&data.finished) != cpus)
cpu_relax();
+
+   return 0;
+}
+
+/**
+ * smp_call_function_mask(): Run a function on a set of other CPUs.
+ * @mask: The set of cpus to run on.  Must not include the current cpu.
+ * @func: The function to run. This must be fast and non-blocking.
+ * @info: An arbitrary pointer to pass to the function.
+ * @wait: If true, wait (atomically) until function has completed on other CPUs.
+ *
+ * Returns 0 on success, else a negative status code. Does not return until
+ * remote CPUs are nearly ready to execute <<func>> or are or have finished.
+ *
+ * You must not call this function with disabled interrupts or from a
+ * hardware interrupt handler or from a bottom half handler.
+ */
+int smp_call_function_mask(cpumask_t mask,
+void (*func)(void *), void *info,
+int wait)
+{
+   int ret;
+
+   spin_lock(&call_lock);
+   ret = __smp_call_function_mask(mask, func, info, wait);
+   spin_unlock(&call_lock);
+
+   return ret;
 }
 
 /**
@@ -559,20 +602,43 @@ static void __smp_call_function(void (*f
  * You must not call this function with disabled interrupts or from a
  * hardware interrupt handler or from a bottom half handler.
  */
-int smp_call_function (void (*func) (void *info), void *info, int nonatomic,
-   int wait)
-{
-   /* Can deadlock when called with interrupts disabled */
-   WARN_ON(irqs_disabled());
-
-   /* Holding any lock stops cpus from going down. */
-   spin_lock(&call_lock);
-   __smp_call_function(func, info, nonatomic, wait);
-   spin_unlock(&call_lock);
-
-   return 0;
+int smp_call_function(void (*func) (void *info), void *info, int nonatomic,
+ int wait)
+{
+   return smp_call_function_mask(cpu_online_map, func, info, wait);
 }
 EXPORT_SYMBOL(smp_call_function);
+
+/*
+ * smp_call_function_single - Run a function on another CPU
+ * @func: The function to run. This must be fast and non-blocking.
+ * @info: An arbitrary pointer to pass to the function.
+ * @nonatomic: Currently unused.
+ * @wait: If true, wait until function has completed on other CPUs.
+ *
+ * Returns 0 on success, else a negative status code.
+ *
+ * Does not return until the remote CPU is nearly ready to execute 
+ 

Re: [PATCH 10/22 take 3] UBI: EBA unit

2007-03-15 Thread Josh Boyer
On Thu, Mar 15, 2007 at 02:24:10PM -0700, Randy Dunlap wrote:
> On Thu, 15 Mar 2007 11:07:03 -0800 Andrew Morton wrote:
> 
> > 
> > There's way too much code here to expect it to get decently reviewed, alas.
> 
> Yes.
> 
> /me repeats wish that Not Everything Should Be Sent to lkml.  :(

Just curious, but where would you suggest this be sent to for review then?

josh


Re: [PATCH] fix cyclades.h for x86_64 (and probably others)

2007-03-15 Thread Klaus Kudielka
On Thu, Mar 15, 2007 at 11:07:08AM -0800, Andrew Morton wrote:
> Looks OK, thanks.
> 
> It would be nice as a followup patch to simply remove ucchar, uclong and
> all that gunk altogether from that driver  and just use u8, u16 etc.
> 
> But if you decide to do that, please fix your email client first - it is
> replacing tabs with spaces.

Something like this? Applies & compiles Ok on 2.6.20.
I don't have access to the hardware right now, but am pretty sure that
the result is the same.
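
To show why the old typedefs matter on 64-bit machines, here is a small
user-space sketch comparing a register block declared with "unsigned long"
against one declared with fixed-width types. uint32_t stands in for __u32,
and the struct and function names are hypothetical; only the first three
field names are taken from struct CUSTOM_REG.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout with fixed-width fields: the same on every architecture. */
struct regs_fixed {
	uint32_t fpga_id;
	uint32_t fpga_version;
	uint32_t cpu_start;
};

/* Layout with the old "uclong == unsigned long" assumption. */
struct regs_long {
	unsigned long fpga_id;
	unsigned long fpga_version;
	unsigned long cpu_start;
};

/* Byte offset of cpu_start with fixed-width fields (always 8). */
size_t fixed_offset_of_cpu_start(void)
{
	return offsetof(struct regs_fixed, cpu_start);
}

/* Byte offset of cpu_start with unsigned long fields (8 or 16). */
size_t long_offset_of_cpu_start(void)
{
	return offsetof(struct regs_long, cpu_start);
}
```

On an LP64 system such as x86_64 the second offset is 16 rather than 8, so
the struct no longer matches the hardware register map.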

BTW, it was a copy-paste which made the spaces ;)

Regards, Klaus

--- include/linux/cyclades.h.orig   2007-03-15 23:46:00.0 +0100
+++ include/linux/cyclades.h2007-03-15 23:14:26.0 +0100
@@ -67,6 +67,8 @@
 #ifndef _LINUX_CYCLADES_H
 #define _LINUX_CYCLADES_H
 
+#include 
+
 struct cyclades_monitor {
 unsigned long   int_count;
 unsigned long   char_count;
@@ -149,15 +151,6 @@
  * architectures and compilers.
  */
 
-#if defined(__alpha__)
-typedef unsigned long  ucdouble;   /* 64 bits, unsigned */
-typedef unsigned int   uclong; /* 32 bits, unsigned */
-#else
-typedef unsigned long  uclong; /* 32 bits, unsigned */
-#endif
-typedef unsigned short ucshort;/* 16 bits, unsigned */
-typedef unsigned char  ucchar; /* 8 bits, unsigned */
-
 /*
  * Memory Window Sizes
  */
@@ -174,24 +167,24 @@
  */
 
 struct CUSTOM_REG {
-   uclong  fpga_id;/* FPGA Identification Register */
-   uclong  fpga_version;   /* FPGA Version Number Register */
-   uclong  cpu_start;  /* CPU start Register (write) */
-   uclong  cpu_stop;   /* CPU stop Register (write) */
-   uclong  misc_reg;   /* Miscelaneous Register */
-   uclong  idt_mode;   /* IDT mode Register */
-   uclong  uart_irq_status;/* UART IRQ status Register */
-   uclong  clear_timer0_irq;   /* Clear timer interrupt Register */
-   uclong  clear_timer1_irq;   /* Clear timer interrupt Register */
-   uclong  clear_timer2_irq;   /* Clear timer interrupt Register */
-   uclong  test_register;  /* Test Register */
-   uclong  test_count; /* Test Count Register */
-   uclong  timer_select;   /* Timer select register */
-   uclong  pr_uart_irq_status; /* Prioritized UART IRQ stat Reg */
-   uclong  ram_wait_state; /* RAM wait-state Register */
-   uclong  uart_wait_state;/* UART wait-state Register */
-   uclong  timer_wait_state;   /* timer wait-state Register */
-   uclong  ack_wait_state; /* ACK wait State Register */
+   __u32   fpga_id;/* FPGA Identification Register */
+   __u32   fpga_version;   /* FPGA Version Number Register */
+   __u32   cpu_start;  /* CPU start Register (write) */
+   __u32   cpu_stop;   /* CPU stop Register (write) */
+   __u32   misc_reg;   /* Miscelaneous Register */
+   __u32   idt_mode;   /* IDT mode Register */
+   __u32   uart_irq_status;/* UART IRQ status Register */
+   __u32   clear_timer0_irq;   /* Clear timer interrupt Register */
+   __u32   clear_timer1_irq;   /* Clear timer interrupt Register */
+   __u32   clear_timer2_irq;   /* Clear timer interrupt Register */
+   __u32   test_register;  /* Test Register */
+   __u32   test_count; /* Test Count Register */
+   __u32   timer_select;   /* Timer select register */
+   __u32   pr_uart_irq_status; /* Prioritized UART IRQ stat Reg */
+   __u32   ram_wait_state; /* RAM wait-state Register */
+   __u32   uart_wait_state;/* UART wait-state Register */
+   __u32   timer_wait_state;   /* timer wait-state Register */
+   __u32   ack_wait_state; /* ACK wait State Register */
 };
 
 /*
@@ -201,34 +194,34 @@
  */
 
 struct RUNTIME_9060 {
-   uclong  loc_addr_range; /* 00h - Local Address Range */
-   uclong  loc_addr_base;  /* 04h - Local Address Base */
-   uclong  loc_arbitr; /* 08h - Local Arbitration */
-   uclong  endian_descr;   /* 0Ch - Big/Little Endian Descriptor */
-   uclong  loc_rom_range;  /* 10h - Local ROM Range */
-   uclong  loc_rom_base;   /* 14h - Local ROM Base */
-   uclong  loc_bus_descr;  /* 18h - Local Bus descriptor */
-   uclong  loc_range_mst;  /* 1Ch - Local Range for Master to PCI */
-   uclong  loc_base_mst;   /* 20h - Local Base for Master PCI */
-   uclong  loc_range_io;   /* 24h - Local Range for Master IO */
-   uclong  pci_base_mst;   /* 28h - PCI Base for Master PCI */
-   uclong  pci_conf_io;/* 2Ch - PCI configuration for Master IO */
-   uclong  filler1;/* 30h */
-   uclong  filler2;/* 34h */
-   uclong  filler3;/* 38h */
-   uclong  filler4;  

Re: [PATCH] mm/filemap.c: unconditionally call mark_page_accessed

2007-03-15 Thread Dave Kleikamp
On Thu, 2007-03-15 at 23:59 +0100, Andrea Arcangeli wrote:
> On Thu, Mar 15, 2007 at 05:44:01PM +, Hugh Dickins wrote:
> > who removed the !offset condition, he should be consulted on its
> > reintroduction.
> 
> the !offset check looks a pretty broken heuristic indeed, it would
> break random I/O.

I wouldn't call it broken.  At worst, I'd say it's imperfect.  But
that's the nature of a heuristic.  It most likely works in a huge
majority of cases.

> The real fix is to add a ra.prev_offset along with
> ra.prev_page, and if whoever implements it wants to be stylish he can as
> well use a ra.last_contiguous_read structure that has a page and
> offset fields (and then of course remove ra.prev_page).

I suggested something along these lines, but I wonder if it's overkill.
The !offset check is simple and appears to be a decent improvement over
the current code.
-- 
David Kleikamp
IBM Linux Technology Center



Re: [PATCH] mm/filemap.c: unconditionally call mark_page_accessed

2007-03-15 Thread Andrea Arcangeli
On Thu, Mar 15, 2007 at 03:06:01PM -0700, Andrew Morton wrote:
> On Thu, 15 Mar 2007 22:49:23 +0100
> Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
> 
> > On Thu, Mar 15, 2007 at 11:07:35AM -0800, Andrew Morton wrote:
> > > > On Thu, 15 Mar 2007 01:22:45 -0400 (EDT) Ashif Harji <[EMAIL PROTECTED]> wrote:
> > > > I still think the simple fix of removing the 
> > > > condition is the best approach, but I'm certainly open to alternatives.
> > > 
> > > Yes, the problem of falsely activating pages when the file is read in small
> > > hunks is worse than the problem which your patch fixes.
> > 
> > Really? I would have expected all performance sensitive apps to read
> > in >=PAGE_SIZE chunks. And if they don't because they split their
> > dataset in blocks (like some database), it may not be so wrong to
> > activate those pages that have two "hot" blocks more aggressively than
> > those pages with a single hot block.
> 
> But the problem which is being fixed here is really obscure: an application
> repeatedly reading the first page and only the first page of a file, always
> via the same fd.
>
> I'd expect that the sub-page-size read scenario happens heaps more often
> than that, especially when dealing with larger PAGE_SIZEs.

Whatever that app is doing, clearly we have to keep those 4k in cache!
Like obviously the specweb demonstrated that as long as you are
_repeating_ the same read, it's correct to activate the page even if
it was reading from the same page as before.

What is wrong is to activate the page more aggressively if it's
_different_ parts of the page that are being read in a contiguous
way. I thought that the whole point of the ra.prev_page was to detect
_contiguous_ (not random) I/O made with a small buffer, anything else
doesn't make much sense to me.

In short I think taking a ra.prev_offset into account as suggested by
Dave Kleikamp is the best, it may actually benefit the obscure app too ;)
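
A rough user-space model of that idea, under my own assumptions about how
the state would be kept: remember where the previous read ended, and skip the
activation only when the next read continues from exactly that point inside
the same page. struct ra_sketch and should_mark_accessed() are hypothetical
names, and each read is assumed to stay within one page.

```c
/* Hypothetical per-file state: where the previous read ended. */
struct ra_sketch {
	unsigned long prev_index;	/* page index of the previous read */
	unsigned long prev_end;		/* offset in that page where it ended */
	int valid;			/* 0 until the first read is recorded */
};

/*
 * Returns 1 if the page should be marked accessed for this read, 0 if the
 * read merely continues the previous one within the same page.
 */
int should_mark_accessed(struct ra_sketch *ra, unsigned long index,
			 unsigned long offset, unsigned long len)
{
	int contiguous = ra->valid &&
			 index == ra->prev_index &&
			 offset == ra->prev_end;

	/* Record this read so the next call can compare against it. */
	ra->prev_index = index;
	ra->prev_end = offset + len;
	ra->valid = 1;

	return !contiguous;
}
```

In this model, re-reading the same 512 bytes of page 0 marks the page every
time, while a second 512-byte read that starts where the first one ended
does not.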


Re: [QUICKLIST 0/4] Arch independent quicklists V2

2007-03-15 Thread William Lee Irwin III
On Tue, Mar 13, 2007 at 06:12:44PM -0700, William Lee Irwin III wrote:
> There are furthermore distinctions to make between fork() and execve().
> fork() stomps over the entire process address space copying pagetables
> en masse. After execve() a process incrementally faults in PTE's one at
> a time. It should be clear that if case analyses are of interest at
> all, fork() will want cache-hot pages (cache-preloaded pages?) where
> such are largely wasted on incremental faults after execve(). The copy
> operations in fork() should probably also be examined in the context of
> shared pagetables at some point.

To make this perfectly clear, we can deal with the varying usage cases
with hot/cold flags to the pagetable allocator functions. Where bulk
copies such as fork() are happening, it makes perfect sense to
precharge the cache by eager zeroing. Where sparse single pte affairs
such as incrementally faulting things in after execve() are involved,
cache cold preconstructed pagetable pages are ideal. Address hints
could furthermore be used to precharge single cachelines (e.g. via
prefetch) in the sparse usage case.


-- wli


Re: thread stacks and strict vm overcommit accounting

2007-03-15 Thread Hugh Dickins
On Thu, 15 Mar 2007, Andrew Morton wrote:
> On Thu, 15 Mar 2007 23:33:43 +
> Alan Cox <[EMAIL PROTECTED]> wrote:
> 
> > > Stack RSS should certainly be included in Committed_AS,
> > > but RLIMIT_STACK merely limits how big the stack vma may grow to:
> > > at any moment the stack vma is probably very much smaller,
> > > and only its current size is accounted in Committed_AS.
> > 
> > With a typical size as a fuzz factor preaccounted in later kernels.
> 
> Where's that done?

I don't know what Alan is referring to there.

> 
> > > > > Is this the intended behaviour?
> > > > 
> > > > That sounds like a bug to me.
> > > 
> > > I'm suspecting it's an oddity rather than a bug.
> > 
> > It is intended behaviour.

Intended in the way the different stacks are implemented,
but odd enough for us to wonder at the difference.

> 
> Each instance of
> 
> main()
> {
>   sleep(100);
> }
> 
> appears to increase Committed_AS by around 200kb.  But we've committed to
> providing it with 8MB for stack.
> 
> How come this is correct?

We've no more committed to providing each instance with 8MB of stack,
than we've committed to providing each instance with RLIMIT_AS of
address space.  The rlimits are limits, not commitments, surely?

Hugh
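
To make the limit-versus-commitment distinction concrete, here is a minimal userspace sketch, assuming a Linux system that exposes Committed_AS in /proc/meminfo. The helper names (stack_limit_kb, parse_committed_as_kb, read_committed_as_kb) are invented for illustration; getrlimit() and RLIMIT_STACK are the standard interfaces.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: soft RLIMIT_STACK in kB, or -1 if unlimited/unknown.
 * This is only a ceiling on stack growth; nothing is reserved up front. */
static long stack_limit_kb(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_STACK, &rl) != 0 || rl.rlim_cur == RLIM_INFINITY)
		return -1;
	return (long)(rl.rlim_cur / 1024);
}

/* Hypothetical helper: extract the Committed_AS value (in kB) from one
 * /proc/meminfo line. Returns -1 if the line is some other field. */
static long parse_committed_as_kb(const char *line)
{
	long kb;

	assert(line != NULL);
	if (sscanf(line, "Committed_AS: %ld kB", &kb) == 1)
		return kb;
	return -1;
}

/* Scan /proc/meminfo for Committed_AS; returns -1 if it is unavailable. */
static long read_committed_as_kb(void)
{
	char line[256];
	long kb = -1;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return -1;
	while (kb < 0 && fgets(line, sizeof(line), f))
		kb = parse_committed_as_kb(line);
	fclose(f);
	return kb;
}
```

Printing stack_limit_kb() next to read_committed_as_kb() before and after starting a few idle processes should show the accounted figure moving by far less than the stack limit per process, which is the behaviour being discussed in this thread.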


Re: [patch 6/13] signalfd/timerfd/asyncfd v5 - timerfd core ...

2007-03-15 Thread Davide Libenzi
On Thu, 15 Mar 2007, Thomas Gleixner wrote:

> Davide,
> 
> On Wed, 2007-03-14 at 15:19 -0700, Davide Libenzi wrote:
> 
> > +static int timerfd_tmrproc(struct hrtimer *htmr)
> > +{
> > +   struct timerfd_ctx *ctx = container_of(htmr, struct timerfd_ctx, tmr);
> > +   int rval = HRTIMER_NORESTART;
> > +   unsigned long flags;
> > +
> > +   spin_lock_irqsave(&ctx->lock, flags);
> > +   ctx->ticks++;
> > +   wake_up_locked(&ctx->wqh);
> > +   if (ctx->tintv.tv64 != 0) {
> > +   hrtimer_forward(htmr, htmr->base->softirq_time, ctx->tintv);
> 
> Sorry, I missed that in the first reviews. Please use
> hrtimer_cb_get_time(htmr) instead of htmr->base->softirq_time, so this
> is high res timer safe.

Heh, I was actually looking for a function instead of peeking into the
timer structure, but 2.6.20 did not have one. Rebased over 2.6.21-rc3 now, so I
can use it.
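
Why the source of "now" matters when forwarding a periodic timer can be seen in a small userspace model. This is a sketch of the arithmetic only, not the kernel's hrtimer_forward(); model_timer and model_timer_forward are hypothetical names, and times are plain nanosecond integers.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model only; this is not the kernel's struct hrtimer. */
struct model_timer {
	int64_t expires_ns;	/* absolute expiry time */
};

/*
 * Push the expiry past 'now_ns' in whole multiples of 'interval_ns'
 * and return how many intervals were consumed (the overrun count).
 * Returns 0 and leaves the timer alone if it has not expired yet.
 */
static uint64_t model_timer_forward(struct model_timer *t,
				    int64_t now_ns, int64_t interval_ns)
{
	int64_t delta;
	uint64_t orun;

	assert(interval_ns > 0);
	delta = now_ns - t->expires_ns;
	if (delta < 0)
		return 0;
	orun = (uint64_t)(delta / interval_ns) + 1;
	t->expires_ns += (int64_t)orun * interval_ns;
	return orun;
}
```

With an expiry of 1000, an interval of 100 and an accurate "now" of 1350, the model moves the expiry to 1400 and reports 4 intervals. Fed a stale "now" of 1050 instead, it moves the expiry only to 1100, which is already in the past relative to the real time. That is the kind of error a stale time base can introduce, and why taking the time from the timer's own callback clock is the safer choice.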




> > +   rval = HRTIMER_RESTART;
> > +   }
> > +   spin_unlock_irqrestore(&ctx->lock, flags);
> > +
> > +   return rval;
> > +}
> > +
> > +
> > +static int timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags,
> > +const struct itimerspec *ktmr)
> > +{
> 
> Make this void, returns 0 anyway

Ack



> > +   enum hrtimer_mode htmode;
> > +
> > +   htmode = (flags & TFD_TIMER_ABSTIME) ? HRTIMER_ABS: HRTIMER_REL;
> > +
> > +   ctx->ticks = 0;
> > +   ctx->clockid = clockid;
> > +   ctx->flags = flags;
> > +   ctx->texp = timespec_to_ktime(ktmr->it_value);
> 
> clockid is stored in the timer on setup, so no need to store it again.
> expiry time and flags are not used after setup.
> 
> Please remove those fields.

Ack



> > +   if (ufd == -1) {
> > +   ctx = kmem_cache_alloc(timerfd_ctx_cachep, GFP_KERNEL);
> > +   if (!ctx)
> > +   return -ENOMEM;
> > +
> > +   init_waitqueue_head(&ctx->wqh);
> > +   spin_lock_init(&ctx->lock);
> > +   ctx->clockid = -1;
> > +
> > +   error = timerfd_setup(ctx, clockid, flags, );
> > +   if (error)
> > +   goto err_ctxfree;
> 
> Timer setup can not fail

Ack, the new version can't.



> > +   /*
> > +* When we call this, the initialization must be complete, since
> > +* aino_getfd() will install the fd.
> > +*/
> > +   error = aino_getfd(, , , "[timerfd]",
> > +  _fops, ctx);
> > +   if (error)
> > +   goto err_ctxfree;
> 
> Again: Please turn this around. No need to start the timer before we
> know, that everything works. 

The timerfd_setup() is not locked, so we need to make sure everything is
set up before advertising the fd (and aino_getfd does that).



> > +   kmem_cache_free(timerfd_ctx_cachep, ctx);
> > +}
> > +
> > +
> > +static int timerfd_close(struct inode *inode, struct file *file)
> > +{
> > +   timerfd_cleanup(file->private_data);
> > +   return 0;
> > +}
> > +
> 
> Please move the timerfd_cleanup code into close(). 

I usually prefer to have a cleanup function that works on the file's data,
but I have moved the code into the release function now.
Thx for the review! I'll repost a new version based on 2.6.21-rc3 ...




- Davide




Re: [PATCH 2/3] swsusp: Do not use page flags

2007-03-15 Thread Pavel Machek
Hi!

> > > > On Mon, 12 Mar 2007 22:19:20 +0100 "Rafael J. Wysocki" <[EMAIL PROTECTED]> wrote:
> > > > +int create_basic_memory_bitmaps(void)
> > > > +{
> > > > +   struct memory_bitmap *bm1, *bm2;
> > > > +   int error = 0;
> > > > +
> > > > +   BUG_ON(forbidden_pages_map || free_pages_map);
> > > > +
> > > > +   bm1 = kzalloc(sizeof(struct memory_bitmap), GFP_ATOMIC);
> > > > +   if (!bm1)
> > > > +   return -ENOMEM;
> > > > +
> > > > +   error = memory_bm_create(bm1, GFP_ATOMIC | __GFP_COLD, PG_ANY);
> > > > +   if (error)
> > > > +   goto Free_first_object;
> > > > +
> > > > +   bm2 = kzalloc(sizeof(struct memory_bitmap), GFP_ATOMIC);
> > > > +   if (!bm2)
> > > > +   goto Free_first_bitmap;
> > > > +
> > > > +   error = memory_bm_create(bm2, GFP_ATOMIC | __GFP_COLD, PG_ANY);
> > > > +   if (error)
> > > 
> > > What is the risk that we'll go OOM here?  GFP_ATOMIC is rather unreliable.
> > 
> > Well, this can be called after processes (including kswapd) has been frozen.
> > We can't go to sleep at this point.
> 
> So it _is_ unreliable?

We are careful to leave some memory aside for suspend... We actually
free memory at the beginning of suspend, and there's some simple "add a few
percent for our overhead" there.
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [PATCH 2/3] FUTEX : introduce private hashtables

2007-03-15 Thread William Lee Irwin III
On Fri, Mar 16, 2007 at 07:25:53AM +1100, Nick Piggin wrote:
> I would just avoid the complexity and setup/teardown costs, and just
> use a vmalloc'ed global hash for NUMA.

This patch is not the way to go, but neither are vmalloc()'d global
hashtables. When you just happen to hash to the wrong node, you're in
for quasi-unreproducible poor performance. The size is never right, at
which point RCU resizing is required with all its overhead and memory
freeing delays and failure to resize (even if only to contract) under
pressure. Better would be to use a different data structure admitting
locality of reference and adaptively sizing itself, furthermore
localized to the appropriate sharing domain.  For file-backed futexes,
this would be the struct address_space. For anonymous-backed futexes,
this would be the COW sharing group, which an anon_vma could almost be
used to represent. Using an object to properly represent the COW
sharing group (i.e. Hugh's struct anon) would do the trick, and one
might as well move the rmap code over to it while we're at it since the
anon_vma scanning tricks are all pointless overhead once the COW
sharing group is accurately tracked (the scanning around for nearby vmas
with ->anon_vma set is not great anyway, though the overhead is hidden
in the noise of large teardown and setup operations; inheriting on
fork() is much simpler and faster).

In such a manner localization is accomplished while no interface
extensions are required.


-- wli


Kconfig style question

2007-03-15 Thread Kumar Gala

For source lines I've seen both:

source "arch/powerpc/platforms/52xx/Kconfig"

and

source arch/powerpc/platforms/85xx/Kconfig

Is there a preferred style?  Quotes or not?

- k


Re: [PATCH] mm/filemap.c: unconditionally call mark_page_accessed

2007-03-15 Thread Andrea Arcangeli
On Thu, Mar 15, 2007 at 05:44:01PM +, Hugh Dickins wrote:
> who removed the !offset condition, he should be consulted on its
> reintroduction.

The !offset check does look like a pretty broken heuristic; it would
break random I/O. The real fix is to add a ra.prev_offset alongside
ra.prev_page, and whoever implements it can, to be stylish, instead
use a ra.last_contiguous_read structure that has page and offset
fields (and then of course remove ra.prev_page).
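
One possible reading of that suggestion, as a userspace sketch rather than a patch: remember the page and the offset where the previous read ended, and treat a read as a fresh access unless it starts exactly there. The struct layout and both function names are hypothetical, and the model assumes each read stays within a single page.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical layout, following the naming suggested above. */
struct last_contiguous_read {
	unsigned long page;	/* index of the last page read */
	unsigned long offset;	/* offset just past the last byte read */
};

/* True if a read at (page, offset) starts where the previous one ended. */
static bool continues_previous_read(const struct last_contiguous_read *last,
				    unsigned long page, unsigned long offset)
{
	return page == last->page && offset == last->offset;
}

/*
 * Record a read of 'len' bytes and report whether the page should be
 * counted as newly accessed (true) or is just the continuation of a
 * sequential read of the same page (false).
 */
static bool note_read(struct last_contiguous_read *last,
		      unsigned long page, unsigned long offset,
		      unsigned long len)
{
	bool fresh = !continues_previous_read(last, page, offset);

	assert(len > 0);
	last->page = page;
	last->offset = offset + len;
	return fresh;
}
```

Unlike a bare !offset test, this counts a random read that lands in the middle of a page as an access, while a page consumed by several small sequential reads is still counted only once.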


Re: [PATCH] Fix COMPAT_VDSO regression bug

2007-03-15 Thread Leroy van Logchem
Andrew Morton  linux-foundation.org> writes:

> > Revert "[PATCH] Fix CONFIG_COMPAT_VDSO"
> > This reverts commit a1f3bb9ae4497a2ed3eac773fd7798ac33a0371f.
> > 
> > Several systems couldnt boot using CONFIG_HIGHMEM64G=y as
> > reported in bug #8040. Reverting the above patch solved the problem.

> I think reverting it is probably the right thing to do, unless we can fix
> it for real quite promptly.

Chuck Ebbert at redhat.com asked:

> Can you please double check this by trying with/without again -- sometimes
> bisects go bad.

As requested, I redid the test, this time without git, using kernel.org
tarballs. The results, still using the same .config, are:
linux-2.6.20.tar.gz  : bad
linux-2.6.20.1.tar.gz: bad (boot log equal)
linux-2.6.20.2.tar.gz: good
linux-2.6.20.3.tar.gz: good
(triple checked)

Chuck is right, the bisect went bad.
I asked Nilshar to try these kernels too with:
COMPAT_VDSO=y
CONFIG_HIGHMEM64G=y

He did and says 2.6.20.3 works fine. So only 2.6.20 and 2.6.20.1 had
this 'hang' at boot behavior on my Supermicro 7044 while Nilshar's
machine started working with 2.6.20.3

So the revert can be avoided, IMO. I hope more of the people who reported
bug #8040 will speak up and confirm it's fine with the latest stable.



Re: taskstats accounting info

2007-03-15 Thread Balbir Singh

Andrew Morton wrote:
> On Wed, 14 Mar 2007 17:48:32 +0530 Balbir Singh <[EMAIL PROTECTED]> wrote:
> > Randy.Dunlap wrote:
> > > Hi,
> > >
> > > Documentation/accounting/delay-accounting.txt says that the
> > > getdelays program has a "-c cmd" argument, but that option
> > > does not seem to exist in Documentation/account/getdelays.c.
> > >
> > > Do you have an updated version of getdelays.c?
> > > If not, please correct that documentation.
> >
> > Yes, I did, but then I changed my laptop. I should have it archived
> > at some place, I'll dig it out or correct the documentation.
> >
> > > Is getdelays.c the best available example of a program
> > > using the taskstats netlink interface?
> >
> > It's the most portable example, since it does not depend on libnl.
>
> err, what is libnl?

libnl is a library abstraction for netlink (libnetlink).

> If there exists some real userspace infrastructure which utilises
> taskstats, can we please get a reference to it into the kernel
> Documentation?  Perhaps in the TASKSTATS Kconfig entry, thanks.

That sounds like a good idea. I'll check for details and get back.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


Re: [BUG] Linux 2.6.20.2 - unable to handle kernel paging request - still accessing freed memory

2007-03-15 Thread Greg KH
On Wed, Mar 14, 2007 at 01:23:02PM +0200, Pekka Enberg wrote:
> Hi Greg,
> 
> I think there's some sort of reference counting problem with sysfs in
> 2.6.20 kernels. Can you please help us debug it further?

Is there any way you can use 'git bisect' to try to track down the root
cause of this?

thanks,

greg k-h


Re: Move to unshared VMAs in NOMMU mode?

2007-03-15 Thread David Howells
Hugh Dickins <[EMAIL PROTECTED]> wrote:

> But if "the SYSV SHM problem" you mention at the beginning
> is just the "nattch" problem you mention at the end, I doubt
> that's worth such a redesign as you're considering here.

Yes, as far as I know that's the problem.  nattch is available to userspace and
seems to misbehave as far as userspace programs are concerned (I think the
program sees that it is 1 and assumes itself to be the last user).
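
For reference, this is roughly how a userspace program observes nattch, as a minimal sketch using the standard SysV calls (shmget, shmat, shmctl with IPC_STAT). The helper names shm_attach_count and demo_nattch are hypothetical, and the expected counts of 1 and then 0 are what a conventional MMU kernel reports for a private segment with a single attachment.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Return the current attach count of a SysV segment, or -1 on error. */
static long shm_attach_count(int shmid)
{
	struct shmid_ds ds;

	assert(shmid >= 0);
	if (shmctl(shmid, IPC_STAT, &ds) < 0)
		return -1;
	return (long)ds.shm_nattch;
}

/*
 * Attach a fresh private segment, sample nattch, detach, sample again.
 * Returns 0 if the counts were 1 and then 0, 1 if they were anything
 * else, and -1 if SysV shared memory is not usable here.
 */
static int demo_nattch(void)
{
	int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
	long during, after;
	void *p;

	if (id < 0)
		return -1;
	p = shmat(id, NULL, 0);
	if (p == (void *)-1) {
		shmctl(id, IPC_RMID, NULL);
		return -1;
	}
	during = shm_attach_count(id);
	shmdt(p);
	after = shm_attach_count(id);
	shmctl(id, IPC_RMID, NULL);	/* always remove the segment */
	return (during == 1 && after == 0) ? 0 : 1;
}
```

A program that decides it is the last user because it sees a count of 1 relies on exactly this field, so a kernel that undercounts attachments would mislead it.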

> Actually, I'm rather surprised SHM needs any such nattch count,
> I'd expect it to deducible from file->f_count and mode_DEST
> (but haven't investigated whether that really works out at all).

Ummm...  Currently file->f_count doesn't count the number of shmats because the
VMAs are shared.  If they are no longer shared then the problem goes away.

There may be several VMLs for a particular process pointing to a VMA.

sys_shmdt() doesn't malfunction because it's not possible to split a VMA in
NOMMU mode, and so the whole VMA must match.

Actually, looking carefully at it, it might go wrong if someone does shmat(),
munmap(), shmdt().  do_munmap(), however, protects against too many munmaps (in
whatever form they're issued).

> If you just need a little CONFIG_MMU in ipc/shm.c to solve your
> problem, I don't think more is justified.

Hmmm... I'm not sure it's quite that simple.  SYSV SHM is provided by a chain
of shm -> tiny-shmem -> ramfs.  The mapping is actually managed by ramfs.

> Your struct vm_region idea does look more to my taste than what
> you presently have; yet if you pursue it, I think it would just
> make divergence worse wouldn't it?  NOMMU wanting vma to contain
> a pointer to vm_region, MMU wanting vm_region embedded in vma.

That bit of divergence is, in effect, already there.  In NOMMU-mode the VMA
owns the backing store; in MMU-mode it does not.  This would, at least, rectify
that: fixing it would mean that the backing store is no longer owned by the
VMA, and would permit more flexibility in overlapping mappings.

> I don't really understand why NOMMU chooses to share vmas, or
> vm_regions, rather than just sharing the data which they indicate.

Where would that data be?  How do you keep track of it?  How do you know when
to deallocate it?

I have considered co-opting the pagecache attached to the mapped inode (which
is exactly how I do shared-writable mappings on ramfs), but that only works for
shared mappings.  I still have to have a way to handle unshareable mappings.
At the moment, they're both the same way (unless overridden by the driver/fs),
and I just share the VMA.

> Just because you can use less memory that way?

That's one consideration.  The other is that it makes management of these
chunks of data simpler.  If the memory isn't attached to the VMA then it must
be managed in some other manner.

David


Re: thread stacks and strict vm overcommit accounting

2007-03-15 Thread Andrew Morton
On Thu, 15 Mar 2007 23:33:43 +
Alan Cox <[EMAIL PROTECTED]> wrote:

> > Stack RSS should certainly be included in Committed_AS,
> > but RLIMIT_STACK merely limits how big the stack vma may grow to:
> > at any moment the stack vma is probably very much smaller,
> > and only its current size is accounted in Committed_AS.
> 
> With a typical size as a fuzz factor preaccounted in later kernels.

Where's that done?

> > > > Is this the intended behaviour?
> > > 
> > > That sounds like a bug to me.
> > 
> > I'm suspecting it's an oddity rather than a bug.
> 
> It is intended behaviour.

Each instance of

main()
{
sleep(100);
}

appears to increase Committed_AS by around 200kb.  But we've committed to
providing it with 8MB for stack.

How come this is correct?



Re: thread stacks and strict vm overcommit accounting

2007-03-15 Thread Alan Cox
> Stack RSS should certainly be included in Committed_AS,
> but RLIMIT_STACK merely limits how big the stack vma may grow to:
> at any moment the stack vma is probably very much smaller,
> and only its current size is accounted in Committed_AS.

With a typical size as a fuzz factor preaccounted in later kernels.

> > > Is this the intended behaviour?
> > 
> > That sounds like a bug to me.
> 
> I'm suspecting it's an oddity rather than a bug.

It is intended behaviour.


Re: [BUGFIX][PATCH] fixing placement of register stack under ulimit -s

2007-03-15 Thread KAMEZAWA Hiroyuki
On Fri, 16 Mar 2007 06:20:47 +0900
KAMEZAWA Hiroyuki <[EMAIL PROTECTED]> wrote:

> On Thu, 15 Mar 2007 09:57:28 -0600
> "David Mosberger-Tang" <[EMAIL PROTECTED]> wrote:
> 
> > But aren't you going to be limited to less than a page worth of
> > register-backing store even with your patch applied because the
> > backing store will end up overflowing the memory stack?
> > 
> 
> I think pthread's stack, which is created by malloc, is also shared
> among register-stack and memory-stack. 
> (glibc's pthread's stack is limited by ulimit, too.)
> 
> So, it seems stack_size_limit = register_stack_limit + memory_stack_limit
> is a consistent way. I'm sorry if I don't catch your point.
> 
BTW, which way do you recommend to fix this register-stack/memory-stack
upside-down problem?

Plan A) Just handle the upside-down case in the page fault handler.
        This means the ulimit -s limit will apply to the memory stack and
        the register stack independently.
Plan B) Handle the upside-down case in the page fault handler and modify
        acct_stack_growth() so that it can enforce the limit on the sum
        of the separate vmas (the vma for the reg stack and the one for
        the mem stack).
Plan C) Don't allow the upside-down layout, as in this patch, but change
        the calculation of rbs_top.

Note:
To see the problem my patch wants to fix, run the following code under
ulimit -s.

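==
#include <stdio.h>

/* No base case: the recursion keeps growing the stack until the
 * limit set by ulimit -s is reached. */
void eat_stack(int num) {
    printf("%d\n", num);
    eat_stack(num - 1);
}

int main (void) {
    eat_stack(1);
    return 0;
}
==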

-- Kame



[PATCH 07 of 33] IB/ipath - support larger IB_QP_MAX_DEST_RD_ATOMIC and IB_QP_MAX_QP_RD_ATOMIC

2007-03-15 Thread Bryan O'Sullivan
# HG changeset patch
# User Ralph Campbell <[EMAIL PROTECTED]>
# Date 1173994464 25200
# Node ID 02b57b02578b7ffb189de66f7886214e9d5f2045
# Parent  78ae7bddbd5e205adc12993ad2956e0402ca01d7
IB/ipath - support larger IB_QP_MAX_DEST_RD_ATOMIC and IB_QP_MAX_QP_RD_ATOMIC

This patch adds support for multiple RDMA reads and atomics to be
sent before an ACK is required to be seen by the requester.

Signed-off-by: Bryan O'Sullivan <[EMAIL PROTECTED]>

diff -r 78ae7bddbd5e -r 02b57b02578b drivers/infiniband/hw/ipath/ipath_qp.c
--- a/drivers/infiniband/hw/ipath/ipath_qp.c   Thu Mar 15 14:34:24 2007 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_qp.c   Thu Mar 15 14:34:24 2007 -0700
@@ -320,7 +320,8 @@ static void ipath_reset_qp(struct ipath_
qp->remote_qpn = 0;
qp->qkey = 0;
qp->qp_access_flags = 0;
-   clear_bit(IPATH_S_BUSY, &qp->s_flags);
+   qp->s_busy = 0;
+   qp->s_flags &= ~IPATH_S_SIGNAL_REQ_WR;
qp->s_hdrwords = 0;
qp->s_psn = 0;
qp->r_psn = 0;
@@ -333,7 +334,6 @@ static void ipath_reset_qp(struct ipath_
qp->r_state = IB_OPCODE_UC_SEND_LAST;
}
qp->s_ack_state = IB_OPCODE_RC_ACKNOWLEDGE;
-   qp->r_ack_state = IB_OPCODE_RC_ACKNOWLEDGE;
qp->r_nak_state = 0;
qp->r_wrid_valid = 0;
qp->s_rnr_timeout = 0;
@@ -344,6 +344,10 @@ static void ipath_reset_qp(struct ipath_
qp->s_ssn = 1;
qp->s_lsn = 0;
qp->s_wait_credit = 0;
+   memset(qp->s_ack_queue, 0, sizeof(qp->s_ack_queue));
+   qp->r_head_ack_queue = 0;
+   qp->s_tail_ack_queue = 0;
+   qp->s_num_rd_atomic = 0;
if (qp->r_rq.wq) {
qp->r_rq.wq->head = 0;
qp->r_rq.wq->tail = 0;
@@ -503,6 +507,10 @@ int ipath_modify_qp(struct ib_qp *ibqp, 
attr->path_mig_state != IB_MIG_REARM)
goto inval;
 
+   if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC)
+   if (attr->max_dest_rd_atomic > IPATH_MAX_RDMA_ATOMIC)
+   goto inval;
+
switch (new_state) {
case IB_QPS_RESET:
ipath_reset_qp(qp);
@@ -558,6 +566,12 @@ int ipath_modify_qp(struct ib_qp *ibqp, 
 
if (attr_mask & IB_QP_QKEY)
qp->qkey = attr->qkey;
+
+   if (attr_mask & IB_QP_MAX_DEST_RD_ATOMIC)
+   qp->r_max_rd_atomic = attr->max_dest_rd_atomic;
+
+   if (attr_mask & IB_QP_MAX_QP_RD_ATOMIC)
+   qp->s_max_rd_atomic = attr->max_rd_atomic;
 
qp->state = new_state;
spin_unlock_irqrestore(&qp->s_lock, flags);
@@ -598,8 +612,8 @@ int ipath_query_qp(struct ib_qp *ibqp, s
attr->alt_pkey_index = 0;
attr->en_sqd_async_notify = 0;
attr->sq_draining = 0;
-   attr->max_rd_atomic = 1;
-   attr->max_dest_rd_atomic = 1;
+   attr->max_rd_atomic = qp->s_max_rd_atomic;
+   attr->max_dest_rd_atomic = qp->r_max_rd_atomic;
attr->min_rnr_timer = qp->r_min_rnr_timer;
attr->port_num = 1;
attr->timeout = qp->timeout;
@@ -614,7 +628,7 @@ int ipath_query_qp(struct ib_qp *ibqp, s
init_attr->recv_cq = qp->ibqp.recv_cq;
init_attr->srq = qp->ibqp.srq;
init_attr->cap = attr->cap;
-   if (qp->s_flags & (1 << IPATH_S_SIGNAL_REQ_WR))
+   if (qp->s_flags & IPATH_S_SIGNAL_REQ_WR)
init_attr->sq_sig_type = IB_SIGNAL_REQ_WR;
else
init_attr->sq_sig_type = IB_SIGNAL_ALL_WR;
@@ -786,7 +800,7 @@ struct ib_qp *ipath_create_qp(struct ib_
qp->s_size = init_attr->cap.max_send_wr + 1;
qp->s_max_sge = init_attr->cap.max_send_sge;
if (init_attr->sq_sig_type == IB_SIGNAL_REQ_WR)
-   qp->s_flags = 1 << IPATH_S_SIGNAL_REQ_WR;
+   qp->s_flags = IPATH_S_SIGNAL_REQ_WR;
else
qp->s_flags = 0;
dev = to_idev(ibpd->device);
diff -r 78ae7bddbd5e -r 02b57b02578b drivers/infiniband/hw/ipath/ipath_rc.c
--- a/drivers/infiniband/hw/ipath/ipath_rc.c   Thu Mar 15 14:34:24 2007 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_rc.c   Thu Mar 15 14:34:24 2007 -0700
@@ -37,6 +37,19 @@
 /* cut down ridiculously long IB macro names */
 #define OP(x) IB_OPCODE_RC_##x
 
+static u32 restart_sge(struct ipath_sge_state *ss, struct ipath_swqe *wqe,
+  u32 psn, u32 pmtu)
+{
+   u32 len;
+
+   len = ((psn - wqe->psn) & IPATH_PSN_MASK) * pmtu;
+   ss->sge = wqe->sg_list[0];
+   ss->sg_list = wqe->sg_list + 1;
+   ss->num_sge = wqe->wr.num_sge;
+   ipath_skip_sge(ss, len);
+   return wqe->length - len;
+}
+
 /**
  * ipath_init_restart- initialize the qp->s_sge after a restart
  * @qp: the QP who's SGE we're restarting
@@ -47,15 +60,9 @@ static void ipath_init_restart(struct ip
 static void ipath_init_restart(struct ipath_qp *qp, struct ipath_swqe *wqe)
 {
struct ipath_ibdev *dev;
-   u32 len;
-
-   len = 

[PATCH 06 of 33] IB/ipath - NMI cpu lockup if local loopback used

2007-03-15 Thread Bryan O'Sullivan
# HG changeset patch
# User Ralph Campbell <[EMAIL PROTECTED]>
# Date 1173994464 25200
# Node ID 78ae7bddbd5e205adc12993ad2956e0402ca01d7
# Parent  fa38a027a0853a80c4f7dfc50345c89f195bc85b
IB/ipath - NMI cpu lockup if local loopback used

If a post send is done in loopback and there is no receive queue entry,
the sending QP is put on a timeout list for a while so the receiver has
a chance to post a receive buffer. If another post send is done,
the code incorrectly tried to put the QP on the timeout list again and
corrupted the timeout list. This eventually leads to a spin lock deadlock
NMI due to the timer function looping forever with the lock held.

Signed-off-by: Bryan O'Sullivan <[EMAIL PROTECTED]>

diff -r fa38a027a085 -r 78ae7bddbd5e drivers/infiniband/hw/ipath/ipath_ruc.c
--- a/drivers/infiniband/hw/ipath/ipath_ruc.c   Thu Mar 15 14:34:24 2007 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_ruc.c   Thu Mar 15 14:34:24 2007 -0700
@@ -265,7 +265,8 @@ again:
 again:
spin_lock_irqsave(&sqp->s_lock, flags);
 
-   if (!(ib_ipath_state_ops[sqp->state] & IPATH_PROCESS_SEND_OK)) {
+   if (!(ib_ipath_state_ops[sqp->state] & IPATH_PROCESS_SEND_OK) ||
+   qp->s_rnr_timeout) {
spin_unlock_irqrestore(&sqp->s_lock, flags);
goto done;
}


[PATCH 19 of 33] IB/ipath - Discard multicast packets without a GRH

2007-03-15 Thread Bryan O'Sullivan
# HG changeset patch
# User Bryan O'Sullivan <[EMAIL PROTECTED]>
# Date 1173994465 25200
# Node ID c96d13efde155eb60dc0eca0bd56e81ecd36281b
# Parent  878b6054e9ca5327db9c9438f66265afaf88b055
IB/ipath - Discard multicast packets without a GRH

This patch fixes a bug where multicast packets without a GRH
were not being dropped as per the IB spec.

Signed-off-by: Ralph Campbell <[EMAIL PROTECTED]>
Signed-off-by: Bryan O'Sullivan <[EMAIL PROTECTED]>

diff -r 878b6054e9ca -r c96d13efde15 drivers/infiniband/hw/ipath/ipath_verbs.c
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Mar 15 14:34:25 2007 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Mar 15 14:34:25 2007 -0700
@@ -438,6 +438,10 @@ void ipath_ib_rcv(struct ipath_ibdev *de
struct ipath_mcast *mcast;
struct ipath_mcast_qp *p;
 
+   if (lnh != IPATH_LRH_GRH) {
+   dev->n_pkt_drops++;
+   goto bail;
+   }
mcast = ipath_mcast_find(&hdr->u.l.grh.dgid);
if (mcast == NULL) {
dev->n_pkt_drops++;
@@ -445,8 +449,7 @@ void ipath_ib_rcv(struct ipath_ibdev *de
}
dev->n_multicast_rcv++;
list_for_each_entry_rcu(p, &mcast->qp_list, list)
-   ipath_qp_rcv(dev, hdr, lnh == IPATH_LRH_GRH, data,
-tlen, p->qp);
+   ipath_qp_rcv(dev, hdr, 1, data, tlen, p->qp);
/*
 * Notify ipath_multicast_detach() if it is waiting for us
 * to finish.


[PATCH 15 of 33] IB/ipath - allow receive ports mapped into userspace to be shared

2007-03-15 Thread Bryan O'Sullivan
# HG changeset patch
# User Mark Debbage <[EMAIL PROTECTED]>
# Date 1173994465 25200
# Node ID 5ff8f23d0e61169f598ab1d93aa6324d88c17921
# Parent  62da2fb770b66310ac06ba0190bf2bed2a5a764f
IB/ipath - allow receive ports mapped into userspace to be shared

Improve port-sharing performance by allowing any process to receive
packets from the shared hardware port under a spin lock for mutual
exclusion. Previously, one process was nominated as the master and
that process was responsible for receiving all packets from the shared
hardware port and either consuming them or forwarding them to their
destination. This led to starvation problems for other processes when
the master process was busy in computation phases.

Signed-off-by: Bryan O'Sullivan <[EMAIL PROTECTED]>

diff -r 62da2fb770b6 -r 5ff8f23d0e61 drivers/infiniband/hw/ipath/ipath_common.h
--- a/drivers/infiniband/hw/ipath/ipath_common.h   Thu Mar 15 14:34:25 2007 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_common.h   Thu Mar 15 14:34:25 2007 -0700
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006 QLogic, Inc. All rights reserved.
+ * Copyright (c) 2006, 2007 QLogic Corporation. All rights reserved.
  * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
  *
  * This software is available to you under a choice of one of two
@@ -318,10 +318,16 @@ struct ipath_base_info {
/* address of readonly memory copy of the rcvhdrq tail register. */
__u64 spi_rcvhdr_tailaddr;
 
-   /* shared memory pages for subports if IPATH_RUNTIME_MASTER is set */
+   /* shared memory pages for subports if port is shared */
__u64 spi_subport_uregbase;
__u64 spi_subport_rcvegrbuf;
__u64 spi_subport_rcvhdr_base;
+
+   /* shared memory page for hardware port if it is shared */
+   __u64 spi_port_uregbase;
+   __u64 spi_port_rcvegrbuf;
+   __u64 spi_port_rcvhdr_base;
+   __u64 spi_port_rcvhdr_tailaddr;
 
 } __attribute__ ((aligned(8)));
 
@@ -346,7 +352,7 @@ struct ipath_base_info {
  * may not be implemented; the user code must deal with this if it
  * cares, or it must abort after initialization reports the difference.
  */
-#define IPATH_USER_SWMINOR 3
+#define IPATH_USER_SWMINOR 4
 
 #define IPATH_USER_SWVERSION ((IPATH_USER_SWMAJOR<<16) | IPATH_USER_SWMINOR)
 
@@ -420,7 +426,7 @@ struct ipath_user_info {
 #define IPATH_CMD_TID_UPDATE   19  /* update expected TID entries */
 #define IPATH_CMD_TID_FREE 20  /* free expected TID entries */
 #define IPATH_CMD_SET_PART_KEY 21  /* add partition key */
-#define IPATH_CMD_SLAVE_INFO   22  /* return info on slave processes */
+#define __IPATH_CMD_SLAVE_INFO 22  /* return info on slave processes (for old user code) */
 #define IPATH_CMD_ASSIGN_PORT  23  /* allocate HCA and port */
 #define IPATH_CMD_USER_INIT24  /* set up userspace */
 
@@ -432,7 +438,7 @@ struct ipath_port_info {
__u16 port; /* port on unit assigned to caller */
__u16 subport;  /* subport on unit assigned to caller */
__u16 num_ports;/* number of ports available on unit */
-   __u16 num_subports; /* number of subport slaves opened on port */
+   __u16 num_subports; /* number of subports opened on port */
 };
 
 struct ipath_tid_info {
diff -r 62da2fb770b6 -r 5ff8f23d0e61 drivers/infiniband/hw/ipath/ipath_file_ops.c
--- a/drivers/infiniband/hw/ipath/ipath_file_ops.c   Thu Mar 15 14:34:25 2007 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c   Thu Mar 15 14:34:25 2007 -0700
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006 QLogic, Inc. All rights reserved.
+ * Copyright (c) 2006, 2007 QLogic Corporation. All rights reserved.
  * Copyright (c) 2003, 2004, 2005, 2006 PathScale, Inc. All rights reserved.
  *
  * This software is available to you under a choice of one of two
@@ -99,7 +99,7 @@ static int ipath_get_base_info(struct fi
sz = sizeof(*kinfo);
/* If port sharing is not requested, allow the old size structure */
if (!shared)
-   sz -= 3 * sizeof(u64);
+   sz -= 7 * sizeof(u64);
if (ubase_size < sz) {
ipath_cdbg(PROC,
   "Base size %zu, need %zu (version mismatch?)\n",
@@ -177,38 +177,30 @@ static int ipath_get_base_info(struct fi
kinfo->spi_piobufbase = (u64) pd->port_piobufs +
dd->ipath_palign *
(dd->ipath_pbufsport - kinfo->spi_piocnt);
-   kinfo->__spi_uregbase = (u64) dd->ipath_uregbase +
-   dd->ipath_palign * pd->port_port;
} else {
unsigned slave = subport_fp(fp) - 1;
 
kinfo->spi_piocnt = dd->ipath_pbufsport / subport_cnt;
kinfo->spi_piobufbase = (u64) pd->port_piobufs +
dd->ipath_palign * kinfo->spi_piocnt * slave;
+   }
+   if (shared) {
+   kinfo->spi_port_uregbase = 

[PATCH 18 of 33] IB/ipath - Fix calculation for number of kernel PIO buffers

2007-03-15 Thread Bryan O'Sullivan
# HG changeset patch
# User Bryan O'Sullivan <[EMAIL PROTECTED]>
# Date 1173994465 25200
# Node ID 878b6054e9ca5327db9c9438f66265afaf88b055
# Parent  a023ffe32d9df8cba7d8b15c24e7918eeb236a2c
IB/ipath - Fix calculation for number of kernel PIO buffers

If the module parameter "kpiobufs" is set too high, the calculation
to reset it to a sane value was incorrect.

Signed-off-by: Ralph Campbell <[EMAIL PROTECTED]>
Signed-off-by: Bryan O'Sullivan <[EMAIL PROTECTED]>

diff -r a023ffe32d9d -r 878b6054e9ca drivers/infiniband/hw/ipath/ipath_init_chip.c
--- a/drivers/infiniband/hw/ipath/ipath_init_chip.c  Thu Mar 15 14:34:25 2007 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c  Thu Mar 15 14:34:25 2007 -0700
@@ -668,6 +668,7 @@ int ipath_init_chip(struct ipath_devdata
 {
int ret = 0, i;
u32 val32, kpiobufs;
+   u32 piobufs, uports;
u64 val;
struct ipath_portdata *pd = NULL; /* keep gcc4 happy */
gfp_t gfp_flags = GFP_USER | __GFP_COMP;
@@ -702,16 +703,17 @@ int ipath_init_chip(struct ipath_devdata
 * the in memory DMA'ed copies of the registers.  This has to
 * be done early, before we calculate lastport, etc.
 */
-   val = dd->ipath_piobcnt2k + dd->ipath_piobcnt4k;
+   piobufs = dd->ipath_piobcnt2k + dd->ipath_piobcnt4k;
/*
 * calc number of pioavail registers, and save it; we have 2
 * bits per buffer.
 */
-   dd->ipath_pioavregs = ALIGN(val, sizeof(u64) * BITS_PER_BYTE / 2)
+   dd->ipath_pioavregs = ALIGN(piobufs, sizeof(u64) * BITS_PER_BYTE / 2)
/ (sizeof(u64) * BITS_PER_BYTE / 2);
+   uports = dd->ipath_cfgports ? dd->ipath_cfgports - 1 : 0;
if (ipath_kpiobufs == 0) {
/* not set by user (this is default) */
-   if ((dd->ipath_piobcnt2k + dd->ipath_piobcnt4k) > 128)
+   if (piobufs >= (uports * IPATH_MIN_USER_PORT_BUFCNT) + 32)
kpiobufs = 32;
else
kpiobufs = 16;
@@ -719,31 +721,25 @@ int ipath_init_chip(struct ipath_devdata
else
kpiobufs = ipath_kpiobufs;
 
-   if (kpiobufs >
-   (dd->ipath_piobcnt2k + dd->ipath_piobcnt4k -
-(dd->ipath_cfgports * IPATH_MIN_USER_PORT_BUFCNT))) {
-   i = dd->ipath_piobcnt2k + dd->ipath_piobcnt4k -
-   (dd->ipath_cfgports * IPATH_MIN_USER_PORT_BUFCNT);
+   if (kpiobufs + (uports * IPATH_MIN_USER_PORT_BUFCNT) > piobufs) {
+   i = (int) piobufs -
+   (int) (uports * IPATH_MIN_USER_PORT_BUFCNT);
if (i < 0)
i = 0;
-   dev_info(&dd->pcidev->dev, "Allocating %d PIO bufs for "
-"kernel leaves too few for %d user ports "
+   dev_info(&dd->pcidev->dev, "Allocating %d PIO bufs of "
+"%d for kernel leaves too few for %d user ports "
 "(%d each); using %u\n", kpiobufs,
-dd->ipath_cfgports - 1,
-IPATH_MIN_USER_PORT_BUFCNT, i);
+piobufs, uports, IPATH_MIN_USER_PORT_BUFCNT, i);
/*
 * shouldn't change ipath_kpiobufs, because could be
 * different for different devices...
 */
kpiobufs = i;
}
-   dd->ipath_lastport_piobuf =
-   dd->ipath_piobcnt2k + dd->ipath_piobcnt4k - kpiobufs;
-   dd->ipath_pbufsport = dd->ipath_cfgports > 1
-   ? dd->ipath_lastport_piobuf / (dd->ipath_cfgports - 1)
-   : 0;
-   val32 = dd->ipath_lastport_piobuf -
-   (dd->ipath_pbufsport * (dd->ipath_cfgports - 1));
+   dd->ipath_lastport_piobuf = piobufs - kpiobufs;
+   dd->ipath_pbufsport =
+   uports ? dd->ipath_lastport_piobuf / uports : 0;
+   val32 = dd->ipath_lastport_piobuf - (dd->ipath_pbufsport * uports);
if (val32 > 0) {
ipath_dbg("allocating %u pbufs/port leaves %u unused, "
  "add to kernel\n", dd->ipath_pbufsport, val32);
@@ -754,8 +750,7 @@ int ipath_init_chip(struct ipath_devdata
dd->ipath_lastpioindex = dd->ipath_lastport_piobuf;
ipath_cdbg(VERBOSE, "%d PIO bufs for kernel out of %d total %u "
   "each for %u user ports\n", kpiobufs,
-  dd->ipath_piobcnt2k + dd->ipath_piobcnt4k,
-  dd->ipath_pbufsport, dd->ipath_cfgports - 1);
+  piobufs, dd->ipath_pbufsport, uports);
 
dd->ipath_f_early_init(dd);
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
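
The buffer split that patch 18 reworks can be tried out in user space. The following is only a sketch of the arithmetic visible in the diff, not the driver code: split_piobufs(), struct pio_split and the value of MIN_USER_PORT_BUFCNT are stand-ins invented for illustration, and the final step where the driver folds leftover buffers back into the kernel's share is left out because the diff does not show it.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative value only; the driver defines IPATH_MIN_USER_PORT_BUFCNT. */
#define MIN_USER_PORT_BUFCNT 8u

/* Hypothetical result type, not a driver structure. */
struct pio_split {
	uint32_t kpiobufs;   /* buffers kept for the kernel */
	uint32_t pbufsport;  /* buffers given to each user port */
	uint32_t leftover;   /* remainder after the even split */
};

/* requested == 0 means "kpiobufs not set by the user". */
static struct pio_split split_piobufs(uint32_t piobufs, uint32_t cfgports,
				      uint32_t requested)
{
	struct pio_split s;
	/* One configured port belongs to the kernel; the rest are user ports. */
	uint32_t uports = cfgports ? cfgports - 1 : 0;
	uint32_t user_min = uports * MIN_USER_PORT_BUFCNT;
	uint32_t user_total;

	if (requested == 0)
		s.kpiobufs = (piobufs >= user_min + 32) ? 32 : 16;
	else
		s.kpiobufs = requested;

	/* Compare sums in 64 bits so neither side can wrap around. */
	if ((uint64_t)s.kpiobufs + user_min > piobufs)
		s.kpiobufs = (piobufs > user_min) ? piobufs - user_min : 0;

	/* After clamping, the kernel share never exceeds the total. */
	assert(s.kpiobufs <= piobufs);

	user_total = piobufs - s.kpiobufs;
	s.pbufsport = uports ? user_total / uports : 0;
	s.leftover = user_total - s.pbufsport * uports;
	return s;
}
```

With 144 buffers and 5 configured ports the sketch keeps 32 for the kernel and gives each of the 4 user ports 28. Asking for 200 kernel buffers out of 40 is clamped to 8, which leaves the assumed minimum of 8 for each user port.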


[PATCH 02 of 33] IB/ipath - fix user memory region creation when IOMMU present

2007-03-15 Thread Bryan O'Sullivan
# HG changeset patch
# User Bryan O'Sullivan <[EMAIL PROTECTED]>
# Date 1173994464 25200
# Node ID 3337d450afeebc553a09fe5c18ed0b2444547c24
# Parent  b1d05f3486f8bba1dd3c5cbca39f06a5e1b3d6fb
IB/ipath - fix user memory region creation when IOMMU present

The loop which initializes the user memory region from an array
of pages was using the wrong limit for the array.  This worked
OK when dma_map_sg() returned the same number as the number of pages.
This patch fixes the problem.

Signed-off-by: Ralph Campbell <[EMAIL PROTECTED]>
Signed-off-by: Bryan O'Sullivan <[EMAIL PROTECTED]>

diff -r b1d05f3486f8 -r 3337d450afee drivers/infiniband/hw/ipath/ipath_mr.c
--- a/drivers/infiniband/hw/ipath/ipath_mr.c  Thu Mar 15 14:34:24 2007 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_mr.c  Thu Mar 15 14:34:24 2007 -0700
@@ -210,9 +210,15 @@ struct ib_mr *ipath_reg_user_mr(struct i
m = 0;
n = 0;
list_for_each_entry(chunk, &region->chunk_list, list) {
-   for (i = 0; i < chunk->nmap; i++) {
-   mr->mr.map[m]->segs[n].vaddr =
-   page_address(chunk->page_list[i].page);
+   for (i = 0; i < chunk->nents; i++) {
+   void *vaddr;
+
+   vaddr = page_address(chunk->page_list[i].page);
+   if (!vaddr) {
+   ret = ERR_PTR(-EINVAL);
+   goto bail;
+   }
+   mr->mr.map[m]->segs[n].vaddr = vaddr;
mr->mr.map[m]->segs[n].length = region->page_size;
n++;
if (n == IPATH_SEGSZ) {
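
The point of patch 02 is that the page array has nents entries, while the number of DMA mappings (nmap) can be smaller once an IOMMU merges adjacent pages. A user-space sketch of that loop bound follows; struct toy_chunk and fill_segments() are hypothetical stand-ins, not the kernel structures.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PAGES 8

/* Hypothetical stand-in for the chunk structure used in the patch. */
struct toy_chunk {
	int nents;               /* number of valid entries in pages[] */
	int nmap;                /* number of DMA mappings; may be < nents */
	void *pages[MAX_PAGES];  /* virtual address of each page */
};

/*
 * Copy one address per page into segs[].  Returns the number of segments
 * written, or -1 if a page has no usable address or segs[] is too small.
 */
static int fill_segments(const struct toy_chunk *chunk, void **segs,
			 int max_segs)
{
	int i, n = 0;

	assert(chunk->nents <= MAX_PAGES);

	/* Bound the walk by the page count, not by the mapping count. */
	for (i = 0; i < chunk->nents; i++) {
		if (!chunk->pages[i] || n == max_segs)
			return -1;
		segs[n++] = chunk->pages[i];
	}
	return n;
}
```

Looping to nmap instead would silently skip the last pages whenever the two counts differ, which matches the symptom described in the changelog.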


Re: [PATCH] blackfin: balance parenthesis in macros

2007-03-15 Thread Mike Frysinger

On 3/15/07, Mariusz Kozlowski <[EMAIL PROTECTED]> wrote:

This patch (against 2.6.21-rc3-mm1) balances parenthesis in blackfin
header files.

Signed-off-by: Mariusz Kozlowski <[EMAIL PROTECTED]>


thanks, added to our repo
-mike


[PATCH 00 of 33] Set of ipath patches for 2.6.22

2007-03-15 Thread Bryan O'Sullivan


Re: [PATCH 10/22 take 3] UBI: EBA unit

2007-03-15 Thread Randy Dunlap
On Thu, 15 Mar 2007 11:07:03 -0800 Andrew Morton wrote:

> 
> There's way too much code here to expect it to get decently reviewed, alas.

Yes.

/me repeats wish that Not Everything Should Be Sent to lkml.  :(

> > On Wed, 14 Mar 2007 17:20:24 +0200 Artem Bityutskiy <[EMAIL PROTECTED]> wrote:
> >
> > ...
> >
> > +/**
> > + * leb_get_ver - get logical eraseblock version.
> > + *
> > + * @ubi: the UBI device description object
> > + * @vol_id: the volume ID
> > + * @lnum: the logical eraseblock number
> > + *
> > + * The logical eraseblock has to be locked. Note, all this leb_ver stuff is
> > + * obsolete and will be removed eventually. FIXME: to be removed together with
> > + * leb_ver support.
> > + */

Please use kernel-doc syntax and test it.  Using and testing it
are really easy to do.  It's just a simple language.  Don't make
(even trivial) problems for others to clean up...

Documentation/kernel-doc-nano-HOWTO.txt

Above:  no "blank" line between the function name and its parameters.
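
For reference, the layout being asked for puts the parameter lines directly under the name line, with the blank comment line only before the longer description. The sketch below shows that shape on a toy function; struct toy_ubi and toy_leb_get_ver() are hypothetical and only loosely modelled on the quoted code (no locking, no volume lookup).

```c
/* Hypothetical, simplified stand-in for the quoted structure. */
struct toy_ubi {
	const int *leb_ver;  /* version of each logical eraseblock */
	int leb_count;       /* number of entries in leb_ver[] */
};

/**
 * toy_leb_get_ver - get logical eraseblock version
 * @ubi: the UBI device description object
 * @lnum: the logical eraseblock number
 *
 * Returns the stored version for @lnum, or -1 if @lnum is out of range.
 */
static int toy_leb_get_ver(const struct toy_ubi *ubi, int lnum)
{
	if (lnum < 0 || lnum >= ubi->leb_count)
		return -1;
	return ubi->leb_ver[lnum];
}
```

The HOWTO named above describes how to run the comment through the kernel-doc script to check it.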

> > +static inline int leb_get_ver(struct ubi_info *ubi, int vol_id, int lnum)
> > +{
> > +   int idx, leb_ver;
> > +
> > +   idx = vol_id2idx(ubi, vol_id);
> > +
> > +   spin_lock(&ubi->eba.eba_tbl_lock);
> > +   ubi_assert(ubi->eba.eba_tbl[idx].recs);
> > +   leb_ver = ubi->eba.eba_tbl[idx].recs[lnum].leb_ver;
> > +   spin_unlock(&ubi->eba.eba_tbl_lock);
> > +
> > +   return leb_ver;
> > +}


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***


[PATCH 01 of 33] IB/ipath - add ability to set and clear IB local loopback

2007-03-15 Thread Bryan O'Sullivan
# HG changeset patch
# User Bryan O'Sullivan <[EMAIL PROTECTED]>
# Date 1173994464 25200
# Node ID b1d05f3486f8bba1dd3c5cbca39f06a5e1b3d6fb
# Parent  0d37971d4ab0c8b6f7a8f6e8222112321982498f
IB/ipath - add ability to set and clear IB local loopback

This is a sticky state.  It is useful for diagnosing problems with boards
versus cable/switch problems.

Signed-off-by: Dave Olson <[EMAIL PROTECTED]>
Signed-off-by: Bryan O'Sullivan <[EMAIL PROTECTED]>

diff -r 0d37971d4ab0 -r b1d05f3486f8 drivers/infiniband/hw/ipath/ipath_common.h
--- a/drivers/infiniband/hw/ipath/ipath_common.h  Wed Mar 14 17:53:43 2007 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_common.h  Thu Mar 15 14:34:24 2007 -0700
@@ -78,6 +78,8 @@
 #define IPATH_IB_LINKINIT  3
 #define IPATH_IB_LINKDOWN_SLEEP4
 #define IPATH_IB_LINKDOWN_DISABLE  5
+#define IPATH_IB_LINK_LOOPBACK 6 /* enable local loopback */
+#define IPATH_IB_LINK_EXTERNAL 7 /* normal, disable local loopback */
 
 /*
  * stats maintained by the driver.  For now, at least, this is global
diff -r 0d37971d4ab0 -r b1d05f3486f8 drivers/infiniband/hw/ipath/ipath_driver.c
--- a/drivers/infiniband/hw/ipath/ipath_driver.c  Wed Mar 14 17:53:43 2007 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c  Thu Mar 15 14:34:24 2007 -0700
@@ -1662,6 +1662,22 @@ int ipath_set_linkstate(struct ipath_dev
lstate = IPATH_LINKACTIVE;
break;
 
+   case IPATH_IB_LINK_LOOPBACK:
+   dev_info(&dd->pcidev->dev, "Enabling IB local loopback\n");
+   dd->ipath_ibcctrl |= INFINIPATH_IBCC_LOOPBACK;
+   ipath_write_kreg(dd, dd->ipath_kregs->kr_ibcctrl,
+dd->ipath_ibcctrl);
+   ret = 0;
+   goto bail; // no state change to wait for
+
+   case IPATH_IB_LINK_EXTERNAL:
+   dev_info(&dd->pcidev->dev, "Disabling IB local loopback (normal)\n");
+   dd->ipath_ibcctrl &= ~INFINIPATH_IBCC_LOOPBACK;
+   ipath_write_kreg(dd, dd->ipath_kregs->kr_ibcctrl,
+dd->ipath_ibcctrl);
+   ret = 0;
+   goto bail; // no state change to wait for
+
default:
ipath_dbg("Invalid linkstate 0x%x requested\n", newstate);
ret = -EINVAL;
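
The "sticky" behaviour in patch 01 comes from keeping a software copy of the control register: the loopback bit is set or cleared in that copy and the whole value is written out, so every other bit is preserved. A minimal user-space sketch follows, with a made-up bit position and a plain variable standing in for the hardware register; all toy_ names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical bit position; the driver defines INFINIPATH_IBCC_LOOPBACK. */
#define TOY_IBCC_LOOPBACK (1ull << 0)

struct toy_dev {
	uint64_t ibcctrl_shadow;  /* software copy of the control register */
	uint64_t ibcctrl_hw;      /* stands in for the hardware register */
};

/* Push the shadow copy out to the (simulated) register. */
static void toy_write_ibcctrl(struct toy_dev *dd)
{
	dd->ibcctrl_hw = dd->ibcctrl_shadow;
}

/* Set or clear only the loopback bit; the state persists until changed. */
static void toy_set_loopback(struct toy_dev *dd, int enable)
{
	if (enable)
		dd->ibcctrl_shadow |= TOY_IBCC_LOOPBACK;
	else
		dd->ibcctrl_shadow &= ~TOY_IBCC_LOOPBACK;
	toy_write_ibcctrl(dd);
}
```

Because the write changes no link state, there is nothing to wait for afterwards, which is why both new cases in the diff jump straight to bail.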


[PATCH 20 of 33] IB/ipath - call free_irq on chip specific initialization failure

2007-03-15 Thread Bryan O'Sullivan
# HG changeset patch
# User Arthur Jones <[EMAIL PROTECTED]>
# Date 1173994465 25200
# Node ID 8a013b707785accfd71589334bbf8e4029ffa892
# Parent  c96d13efde155eb60dc0eca0bd56e81ecd36281b
IB/ipath - call free_irq on chip specific initialization failure

In initialization, if we bailed at chip specific initialization, we
forgot to clean up the irq we had requested.

Signed-off-by: Bryan O'Sullivan <[EMAIL PROTECTED]>

diff -r c96d13efde15 -r 8a013b707785 drivers/infiniband/hw/ipath/ipath_driver.c
--- a/drivers/infiniband/hw/ipath/ipath_driver.c  Thu Mar 15 14:34:25 2007 -0700
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c  Thu Mar 15 14:34:25 2007 -0700
@@ -486,7 +486,7 @@ static int __devinit ipath_init_one(stru
 
ret = ipath_init_chip(dd, 0);   /* do the chip-specific init */
if (ret)
-   goto bail_iounmap;
+   goto bail_irqsetup;
 
ret = ipath_enable_wc(dd);
 
@@ -504,6 +504,9 @@ static int __devinit ipath_init_one(stru
ipath_register_ib_device(dd);
 
goto bail;
+
+bail_irqsetup:
+   if (pdev->irq) free_irq(pdev->irq, dd);
 
 bail_iounmap:
iounmap((volatile void __iomem *) dd->ipath_kregbase);
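
Patch 20 relies on the usual goto ladder: cleanup labels are listed in reverse order of acquisition, and a failure jumps to the label that undoes everything acquired so far. The sketch below shows the shape in user space; struct toy_state, toy_init_one() and the fail_at knob are hypothetical, with flags standing in for the register mapping and the IRQ.

```c
/* Hypothetical resources standing in for the mapping and the IRQ. */
struct toy_state {
	int mapped;    /* set while the "registers" are mapped */
	int irq_held;  /* set while the "IRQ" is requested */
};

/*
 * fail_at: 0 = succeed, 1 = fail right after mapping,
 *          2 = fail after the IRQ was requested (the chip-init case).
 */
static int toy_init_one(struct toy_state *st, int fail_at)
{
	int ret = 0;

	st->mapped = 1;                 /* like ioremap() */
	if (fail_at == 1) {
		ret = -1;
		goto bail_iounmap;
	}

	st->irq_held = 1;               /* like request_irq() */
	if (fail_at == 2) {
		ret = -1;
		goto bail_irqsetup;     /* must release the IRQ too */
	}

	goto bail;                      /* success: keep both resources */

bail_irqsetup:
	st->irq_held = 0;               /* like free_irq() */
	/* fall through to undo the mapping as well */
bail_iounmap:
	st->mapped = 0;                 /* like iounmap() */
bail:
	return ret;
}
```

Jumping to bail_iounmap from the second failure point, as the old code did, would leave irq_held set on return.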


Re: [PATCH 2/3] swsusp: Do not use page flags

2007-03-15 Thread Andrew Morton
On Thu, 15 Mar 2007 23:19:02 +0100 (CET)
Jiri Kosina <[EMAIL PROTECTED]> wrote:

> On Thu, 15 Mar 2007, Andrew Morton wrote:
> 
> > > > And why _does_ suspend use GFP_ATOMIC all over the place?
> > > Generally, because it cannot sleep.
> > Why not?
> 
> I guess it's simply because of kswapd being already frozen, so there is no
> chance that once GFP_KERNEL allocation goes to sleep, it is going to get 
> any free pages eventually ... ?

No, things should run fine with a dead kswapd.

There are reasons why we can't call into filesystems from there, but
GFP_NOIO will ensure that and it is heaps better than GFP_ATOMIC.





Description of the ipath patches I just sent

2007-03-15 Thread Bryan O'Sullivan

My patch mailer decided to send them without a summary.  Oops.

This series is a variety of bugfixes and cleanups for the ipath driver.
It doesn't touch anything in IB-land.  The patches apply cleanly and
run happily against 2.6.21-rc3.


 ipath_common.h    |   25
 ipath_cq.c        |   38
 ipath_debug.h     |    1
 ipath_diag.c      |   19
 ipath_driver.c    |  125
 ipath_eeprom.c    |    4
 ipath_file_ops.c  |  307
 ipath_iba6110.c   |  154
 ipath_iba6120.c   |   73
 ipath_init_chip.c |   88
 ipath_intr.c      |  100
 ipath_kernel.h    |   10
 ipath_keys.c      |   14
 ipath_mr.c        |   12
 ipath_qp.c        |  133
 ipath_rc.c        |  960
 ipath_registers.h |   22
 ipath_ruc.c       |   63
 ipath_stats.c     |   16
 ipath_uc.c        |    6
 ipath_ud.c        |    8
 ipath_verbs.c     |   14
 ipath_verbs.h     |   57
 23 files changed, 1387 insertions(+), 862 deletions(-)

