Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-25 Thread Jamie Lokier
Jeff Garzik wrote:
> Jamie Lokier wrote:
> >By durable, I mean that fsync() should actually commit writes to
> >physical stable storage,
> 
> Yes, it should.

Glad we agree :-)

> >I was surprised that fsync() doesn't do this already.  There was a lot
> >of effort put into block I/O write barriers during 2.5, so that
> >journalling filesystems can force correct write ordering, using disk
> >flush cache commands.
> >
> >After all that effort, I was very surprised to notice that Linux 2.6.x
> >doesn't use that capability to ensure fsync() flushes the disk cache
> >onto stable storage.
> 
> It's surprising you are surprised, given that this [lame] fsync behavior 
> has remained consistently lame throughout Linux's history.

I was surprised because of the effort put into IDE write barriers to
get it right for in-kernel filesystems, and the messages in 2004
telling concerned users that fsync would use barriers in 2.6, which it
does sometimes but not always.

> [snip huge long proposal]
> 
> Rather than invent new APIs, we should fix the existing ones to _really_ 
> flush data to physical media.
>
> Linux should default to SAFE data storage, and permit users to retain 
> the older unsafe behavior via an option.  It's completely ridiculous 
> that we default to an unsafe fsync.

Well, I agree with you.  Which is why the "new API" I suggested, being
really just an extension of an existing one, allows fsync() to be SAFE
if that's what people want.

To be fair, fsync() is rather overkill for some apps.
sync_file_range() is obviously the right place for fine tuning "less
safe" variations.

> And [anticipating a common response from others] it is completely 
> irrelevant that POSIX fsync(2) permits Linux's current behavior.  The 
> current behavior is unsafe.
> 
> Safety before performance -- ESPECIALLY when it comes to storing user data.

Especially now that people work a lot in guest VMs, where the IDE
barrier stuff doesn't work if the host fdatasync() doesn't work.

Since it happened with Mac OS X, I wouldn't be surprised if changing
fsync(), and only that, proved unpopular.  Heck, you already get people
asking "how to turn off fsync in PostgreSQL"...  (Haven't those people
heard of transactions...?)

But with changes to sync_file_range() [or whatever... I don't care] to
support databases' finely tuned commit needs, and then adoption of
that by database vendors, perhaps nobody will mind fsync() becoming
safe then.

Nobody seems bothered by its performance for other things.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: linux-next: Tree for Feb 24

2008-02-25 Thread Geert Uytterhoeven
On Tue, 26 Feb 2008, Stephen Rothwell wrote:
> On Mon, 25 Feb 2008 22:56:04 +0100 (CET) Geert Uytterhoeven <[EMAIL PROTECTED]> wrote:
> >
> > Can you please add
> > http://linux-m68k-cvs.ubb.ca/~geert/linux-m68k-patches-2.6/series?
> > So far there's only one patch in between NEXT_PATCHES_{START,END} yet,
> > though.
> 
> Added, thanks.
> 
> > Ah, I see the m68k cross-compiler is still missing? ;-)
> 
> Please look again :-)  Are there any particular configs we should build?

Great!

I really should update the defconfigs, and add a `defconfig_all' which
builds a kernel for all platforms (except Sun-3).

For now defconfig is OK, I think.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [EMAIL PROTECTED]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


[PATCH] ide-cd: fix some codestyle and most of the checkpatch.pl issues

2008-02-25 Thread Borislav Petkov
Signed-off-by: Borislav Petkov <[EMAIL PROTECTED]>
---
 drivers/ide/ide-cd.c |  634 +
 1 files changed, 323 insertions(+), 311 deletions(-)

diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
index 3600648..3853eb5 100644
--- a/drivers/ide/ide-cd.c
+++ b/drivers/ide/ide-cd.c
@@ -13,8 +13,8 @@
  *
  * Suggestions are welcome. Patches that work are more welcome though. ;-)
  * For those wishing to work on this driver, please be sure you download
- * and comply with the latest Mt. Fuji (SFF8090 version 4) and ATAPI 
- * (SFF-8020i rev 2.6) standards. These documents can be obtained by 
+ * and comply with the latest Mt. Fuji (SFF8090 version 4) and ATAPI
+ * (SFF-8020i rev 2.6) standards. These documents can be obtained by
  * anonymous ftp from:
  * ftp://fission.dt.wdc.com/pub/standards/SFF_atapi/spec/SFF8020-r2.6/PS/8020r26.ps
  * ftp://ftp.avc-pioneer.com/Mtfuji4/Spec/Fuji4r10.pdf
@@ -41,17 +41,17 @@
 
 #include  /* For SCSI -> ATAPI command conversion */
 
-#include 
-#include 
+#include 
+#include 
 #include 
-#include 
+#include 
 #include 
 
 #include "ide-cd.h"
 
 static DEFINE_MUTEX(idecd_ref_mutex);
 
-#define to_ide_cd(obj) container_of(obj, struct cdrom_info, kref) 
+#define to_ide_cd(obj) container_of(obj, struct cdrom_info, kref)
 
 #define ide_cd_g(disk) \
container_of((disk)->private_data, struct cdrom_info, driver)
@@ -77,13 +77,12 @@ static void ide_cd_put(struct cdrom_info *cd)
mutex_unlock(&idecd_ref_mutex);
 }
 
-/
+/*
  * Generic packet command support and error handling routines.
  */
 
-/* Mark that we've seen a media change, and invalidate our internal
-   buffers. */
-static void cdrom_saw_media_change (ide_drive_t *drive)
+/* Mark that we've seen a media change, and invalidate our internal buffers. */
+static void cdrom_saw_media_change(ide_drive_t *drive)
 {
struct cdrom_info *cd = drive->driver_data;
 
@@ -100,46 +99,45 @@ static int cdrom_log_sense(ide_drive_t *drive, struct request *rq,
return 0;
 
switch (sense->sense_key) {
-   case NO_SENSE: case RECOVERED_ERROR:
-   break;
-   case NOT_READY:
-   /*
-* don't care about tray state messages for
-* e.g. capacity commands or in-progress or
-* becoming ready
-*/
-   if (sense->asc == 0x3a || sense->asc == 0x04)
-   break;
-   log = 1;
-   break;
-   case ILLEGAL_REQUEST:
-   /*
-* don't log START_STOP unit with LoEj set, since
-* we cannot reliably check if drive can auto-close
-*/
-   if (rq->cmd[0] == GPCMD_START_STOP_UNIT && sense->asc == 0x24)
-   break;
-   log = 1;
-   break;
-   case UNIT_ATTENTION:
-   /*
-* Make good and sure we've seen this potential media
-* change. Some drives (i.e. Creative) fail to present
-* the correct sense key in the error register.
-*/
-   cdrom_saw_media_change(drive);
+   case NO_SENSE: case RECOVERED_ERROR:
+   break;
+   case NOT_READY:
+   /*
+* don't care about tray state messages for
+* e.g. capacity commands or in-progress or
+* becoming ready
+*/
+   if (sense->asc == 0x3a || sense->asc == 0x04)
break;
-   default:
-   log = 1;
+   log = 1;
+   break;
+   case ILLEGAL_REQUEST:
+   /*
+* don't log START_STOP unit with LoEj set, since
+* we cannot reliably check if drive can auto-close
+*/
+   if (rq->cmd[0] == GPCMD_START_STOP_UNIT &&
+   sense->asc == 0x24)
break;
+   log = 1;
+   break;
+   case UNIT_ATTENTION:
+   /*
+* Make good and sure we've seen this potential media
+* change. Some drives (i.e. Creative) fail to present
+* the correct sense key in the error register.
+*/
+   cdrom_saw_media_change(drive);
+   break;
+   default:
+   log = 1;
+   break;
}
return log;
 }
 
-static
-void cdrom_analyze_sense_data(ide_drive_t *drive,
- struct request *failed_command,
- struct request_sense *sense)

ide-cd: trivial fixes

2008-02-25 Thread Borislav Petkov
Hi Bart,

here are some trivial fixes that I wanted to get out the door.


[PATCH] ide-cd: put proc-related functions together under single ifdef

2008-02-25 Thread Borislav Petkov
Signed-off-by: Borislav Petkov <[EMAIL PROTECTED]>
---
 drivers/ide/ide-cd.c |   29 +
 1 files changed, 13 insertions(+), 16 deletions(-)

diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
index 546f436..3600648 100644
--- a/drivers/ide/ide-cd.c
+++ b/drivers/ide/ide-cd.c
@@ -1894,19 +1894,6 @@ int ide_cdrom_setup (ide_drive_t *drive)
return 0;
 }
 
-#ifdef CONFIG_IDE_PROC_FS
-static
-sector_t ide_cdrom_capacity (ide_drive_t *drive)
-{
-   unsigned long capacity, sectors_per_frame;
-
-   if (cdrom_read_capacity(drive, &capacity, &sectors_per_frame, NULL))
-   return 0;
-
-   return capacity * sectors_per_frame;
-}
-#endif
-
 static void ide_cd_remove(ide_drive_t *drive)
 {
struct cdrom_info *info = drive->driver_data;
@@ -1940,14 +1927,24 @@ static void ide_cd_release(struct kref *kref)
 static int ide_cd_probe(ide_drive_t *);
 
 #ifdef CONFIG_IDE_PROC_FS
-static int proc_idecd_read_capacity
-   (char *page, char **start, off_t off, int count, int *eof, void *data)
+static sector_t ide_cdrom_capacity(ide_drive_t *drive)
+{
+   unsigned long capacity, sectors_per_frame;
+
+   if (cdrom_read_capacity(drive, &capacity, &sectors_per_frame, NULL))
+   return 0;
+
+   return capacity * sectors_per_frame;
+}
+
+static int proc_idecd_read_capacity(char *page, char **start, off_t off,
+   int count, int *eof, void *data)
 {
ide_drive_t *drive = data;
int len;
 
len = sprintf(page,"%llu\n", (long long)ide_cdrom_capacity(drive));
-   PROC_IDE_READ_RETURN(page,start,off,count,eof,len);
+   PROC_IDE_READ_RETURN(page, start, off, count, eof, len);
 }
 
 static ide_proc_entry_t idecd_proc[] = {
-- 
1.5.4.1



Re: [PATCH 00/10] CGroup API files: Various cleanup to CGroup control files

2008-02-25 Thread Li Zefan
Paul Menage wrote:
> On Mon, Feb 25, 2008 at 7:23 PM, Li Zefan <[EMAIL PROTECTED]> wrote:
>>  Should those patches be rebased against 2.6.25-rc3 ?
>>
> 
> No, because they're against 2.6.25-rc2-mm1, which already has (I
> think) any of the new bits in 2.6.25-rc3 that would be affected by
> these patches.
> 
> Paul

-rc2-mm1 came out on 2008-02-16, but the patches I posted several days ago
have been merged into -rc3, so your patches don't apply now. :(

Please think about updating ./MAINTAINERS and setting up a git tree for
development of cgroup and cgroup subsystems, as Andrew suggested. :)


Re: Compex FreedomLine 32 PnP-PCI2 broken with de2104x

2008-02-25 Thread Grant Grundler
On Mon, Feb 25, 2008 at 02:30:00AM -0500, Jeff Garzik wrote:
> Grant Grundler wrote:
>> On Mon, Feb 18, 2008 at 05:40:42PM +0100, Ondrej Zary wrote:
>>> I think that de2104x driver should be removed (or at least its 
>>> MODULE_DEVICE_TABLE) and MODULE_DEVICE_TABLE with only 21040 and 21041 
>>> PCI IDs added to de4x5.
>>>
>>> I can send a patch if this is acceptable.
>> It's acceptable to me. Jeff? (jgarzik)
>
> NAK, sorry, for two reasons:
>
> 1) we don't delete otherwise clean, working drivers

Just to be clear - he's not trying to remove the driver.
He's just interested in making de4x5 the "default" for this set of boards
by doctoring the PCI device IDs each driver will claim.

> simply because of a bug triggered by unplugging a cable.

Ondrej would be happy to test any patches sent. Tracking this sort
of bug down usually isn't as trivial as the statement above implies.

> 2) de4x5 needs to go away.

Ok. I'd prefer to wait until someone demonstrates that the de2104x driver
is a fully functional alternative for all the PCI IDs it claims.

thanks,
grant


Re: Linux 2.6.24.3

2008-02-25 Thread Tino Keitel
On Mon, Feb 25, 2008 at 17:00:24 -0800, Greg Kroah-Hartman wrote:
> We (the -stable team) are announcing the release of the 2.6.24.3
> kernel.

Hi,

I can see the patch in http://www.kernel.org/pub/linux/kernel/v2.6/,
but no incremental patch in
http://www.kernel.org/pub/linux/kernel/v2.6/incr/. Is this due to some
delay, or was it just not uploaded?

Regards,
Tino


Re: PROBLEM: 2.4.36.1 hangs.

2008-02-25 Thread Glen Nakamura
Aloha,

The "ext2_readdir() filp->f_pos fix" patch looks weird...
Perhaps the "filp->f_pos += le16_to_cpu(de->rec_len);" line should be
outside of the if statement like the indentation implies?
As it is, filp->f_pos gets corrupted if de->inode is ever zero...
This could possibly explain why I had a few strange directory
entries until I checked the filesystem with:
e2fsck -D -F -f /dev/{ext2 partition}

- glen

Here is an updated (untested) patch:

--- linux-2.4.36.orig/fs/ext2/dir.c
+++ linux-2.4.36/fs/ext2/dir.c
@@ -240,7 +240,7 @@ ext2_readdir (struct file * filp, void *
loff_t pos = filp->f_pos;
struct inode *inode = filp->f_dentry->d_inode;
struct super_block *sb = inode->i_sb;
-   unsigned offset = pos & ~PAGE_CACHE_MASK;
+   unsigned int offset = pos & ~PAGE_CACHE_MASK;
unsigned long n = pos >> PAGE_CACHE_SHIFT;
unsigned long npages = dir_pages(inode);
unsigned chunk_mask = ~(ext2_chunk_size(inode)-1);
@@ -258,8 +258,13 @@ ext2_readdir (struct file * filp, void *
ext2_dirent *de;
struct page *page = ext2_get_page(inode, n);
 
-   if (IS_ERR(page))
+   if (IS_ERR(page)) {
+   ext2_error(sb, __FUNCTION__,
+  "bad page in #%lu",
+  inode->i_ino);
+   filp->f_pos += PAGE_CACHE_SIZE - offset;
continue;
+   }
kaddr = page_address(page);
if (need_revalidate) {
offset = ext2_validate_entry(kaddr, offset, chunk_mask);
@@ -267,7 +272,7 @@ ext2_readdir (struct file * filp, void *
}
de = (ext2_dirent *)(kaddr+offset);
limit = kaddr + PAGE_CACHE_SIZE - EXT2_DIR_REC_LEN(1);
-   for ( ;(char*)de <= limit; de = ext2_next_entry(de))
+   for ( ;(char*)de <= limit; de = ext2_next_entry(de)) {
if (de->inode) {
int over;
unsigned char d_type = DT_UNKNOWN;
@@ -284,11 +289,12 @@ ext2_readdir (struct file * filp, void *
goto done;
}
}
+   filp->f_pos += le16_to_cpu(de->rec_len);
+   }
ext2_put_page(page);
}
 
 done:
-   filp->f_pos = (n << PAGE_CACHE_SHIFT) | offset;
filp->f_version = inode->i_version;
UPDATE_ATIME(inode);
return 0;


Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-25 Thread Jeff Garzik

Jamie Lokier wrote:
> By durable, I mean that fsync() should actually commit writes to
> physical stable storage,

Yes, it should.

> I was surprised that fsync() doesn't do this already.  There was a lot
> of effort put into block I/O write barriers during 2.5, so that
> journalling filesystems can force correct write ordering, using disk
> flush cache commands.
>
> After all that effort, I was very surprised to notice that Linux 2.6.x
> doesn't use that capability to ensure fsync() flushes the disk cache
> onto stable storage.

It's surprising you are surprised, given that this [lame] fsync behavior
has remained consistently lame throughout Linux's history.

[snip huge long proposal]

Rather than invent new APIs, we should fix the existing ones to _really_
flush data to physical media.

Linux should default to SAFE data storage, and permit users to retain
the older unsafe behavior via an option.  It's completely ridiculous
that we default to an unsafe fsync.

And [anticipating a common response from others] it is completely
irrelevant that POSIX fsync(2) permits Linux's current behavior.  The
current behavior is unsafe.

Safety before performance -- ESPECIALLY when it comes to storing user data.

Regards,

Jeff (Linux ATA driver dude)




Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-25 Thread Andrew Morton
On Tue, 26 Feb 2008 07:26:50 + Jamie Lokier <[EMAIL PROTECTED]> wrote:

> (It would be nicer if sync_file_range()
> took a vector of ranges for better elevator scheduling, but let's
> ignore that :-)

Two passes:

Pass 1: shove each of the segments into the queue with
SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE

Pass 2: wait for them all to complete and return accumulated result
with SYNC_FILE_RANGE_WAIT_AFTER
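
A minimal C sketch of the two passes described above, assuming Linux with
glibc.  The names `struct byte_range` and `sync_ranges()` are hypothetical
and exist only for illustration; sync_file_range() and the
SYNC_FILE_RANGE_* flags are the real interface.  Keep in mind that
sync_file_range() does not write out file metadata, so this only
illustrates how writeout can be queued and then waited for; it is not a
replacement for fsync().

```c
#define _GNU_SOURCE             /* exposes sync_file_range() in <fcntl.h> */
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical helper type: one byte range of the file. */
struct byte_range {
    off_t offset;
    off_t nbytes;
};

/*
 * Two-pass flush of several ranges of one file.
 * Returns 0 if every call succeeded, -1 if any call failed
 * (errno is then left over from the last failing call).
 */
int sync_ranges(int fd, const struct byte_range *r, size_t n)
{
    int result = 0;
    size_t i;

    /* Pass 1: shove each segment into the queue. */
    for (i = 0; i < n; i++) {
        if (sync_file_range(fd, r[i].offset, r[i].nbytes,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE) != 0)
            result = -1;
    }

    /* Pass 2: wait for all of them and accumulate the result. */
    for (i = 0; i < n; i++) {
        if (sync_file_range(fd, r[i].offset, r[i].nbytes,
                            SYNC_FILE_RANGE_WAIT_AFTER) != 0)
            result = -1;
    }

    return result;
}
```

A caller would fill an array of ranges and call sync_ranges(fd, ranges, n)
once per commit, so that every range has been queued before the second
pass starts waiting for completion.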




Re: [PATCH] x86_64: remove wrong setting about CONSTANT_TSC for intel cpu

2008-02-25 Thread H. Peter Anvin

Yinghai Lu wrote:
>>  which is the same. set_cpu_cap() is indeed the cleaner form to do this
>>  so your patch is correct as a cleanup.
> set_cpu_cap is right
> ==
> set_bit(X86_FEATURE_CONSTANT_TSC, &c->x86_capability); ===> is wrong
> should be
> set_bit(X86_FEATURE_CONSTANT_TSC, c->x86_capability);
>
> x86_capability is a array ...

For an array, the & is optional and has no effect.

So they mean the same thing.

-hpa


Re: [PATCH] x86_64: remove wrong setting about CONSTANT_TSC for intel cpu

2008-02-25 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> > set_cpu_cap is right
> > ==
> > set_bit(X86_FEATURE_CONSTANT_TSC, &c->x86_capability); ===> is wrong
> > should be
> > set_bit(X86_FEATURE_CONSTANT_TSC, c->x86_capability);
> > 
> > x86_capability is a array ...
> > 
> > so this could prevent some data corruption.
> 
> ah, right you are! [...]

actually, not: &c->x86_capability and c->x86_capability result in the 
same address (it's an array, not a pointer), so there's no "data 
corruption". If x86_capability were a pointer then you would be right - 
so this is all worth cleaning up.
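
The array-versus-pointer point can be checked with a small stand-alone C
sketch.  The struct, the function names (cpuinfo_sketch,
array_forms_match(), pointer_forms_match()) and the array size are made up
for illustration and are not the kernel's definitions.  For an array
member, `&c->x86_capability` and `c->x86_capability` have different types
but designate the same address; for a pointer member the two forms
differ.

```c
#include <stdint.h>

#define NCAPINTS_SKETCH 8       /* assumed size, illustration only */

/* Hypothetical stand-in for the kernel's per-CPU info structure. */
struct cpuinfo_sketch {
    int       x86_model;
    uint32_t  x86_capability[NCAPINTS_SKETCH];  /* an array member  */
    uint32_t *cap_ptr;                          /* a pointer member */
};

/*
 * Array member: "&c->x86_capability" is a pointer to the whole array,
 * "c->x86_capability" decays to a pointer to its first element.
 * Both hold the same address, so this returns 1.
 */
int array_forms_match(struct cpuinfo_sketch *c)
{
    return (void *)&c->x86_capability == (void *)c->x86_capability;
}

/*
 * Pointer member: "&c->cap_ptr" is where the pointer itself is stored,
 * "c->cap_ptr" is where it points.  These normally differ, so this
 * returns 0 unless the pointer happens to point at itself.
 */
int pointer_forms_match(struct cpuinfo_sketch *c)
{
    return (void *)&c->cap_ptr == (void *)c->cap_ptr;
}
```

So passing either form of the array member to a function that takes an
address sets the same bits; only the pointer case would touch the wrong
memory.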

Ingo


Re: [PATCH] x86_64: remove wrong setting about CONSTANT_TSC for intel cpu

2008-02-25 Thread Ingo Molnar

* Yinghai Lu <[EMAIL PROTECTED]> wrote:

> >   #define set_cpu_cap(c, bit) set_bit(bit, (unsigned long *)((c)->x86_capability)
> >
> >  which is the same. set_cpu_cap() is indeed the cleaner form to do this
> >  so your patch is correct as a cleanup.
> set_cpu_cap is right
> ==
> set_bit(X86_FEATURE_CONSTANT_TSC, &c->x86_capability); ===> is wrong
> should be
> set_bit(X86_FEATURE_CONSTANT_TSC, c->x86_capability);
> 
> x86_capability is a array ...
> 
> so this could prevent some data corruption.

ah, right you are! The commit was done in a sloppy, incomplete way, 
leaving around lots of direct c->x86_capability references and giving 
room for this bug ...

Btw., there's one other place that has the same bug. I'll fix that up 
too.

Ingo


Re: [xfs-masters] Re: filesystem corruption on xfs after 2.6.25-rc1 (bisected, powerpc related?)

2008-02-25 Thread Gaudenz Steinlin
On Tue, Feb 26, 2008 at 01:13:56AM +0100, Rafael J. Wysocki wrote:
> On Tuesday, 26 of February 2008, Christoph Hellwig wrote:
> > On Tue, Feb 26, 2008 at 12:52:56AM +0100, Rafael J. Wysocki wrote:
> > > > I'm not suggesting a partial revert; I just wonder which part of the
> > > > change is causing the problem, as part of the debugging process.
> > > 
> > > Understood.
> > > 
> > > My point is, if that's not practical (whatever the reason), I'd consider
> > > reverting all of the commits in question.
> > 
> > If you could revert all of them and verify it makes the problem go away
> > that would be a very good start already.
> 
> The original reporter (CC added) said exactly that, if I understood him
> correctly:
> 
> http://lkml.org/lkml/2008/2/25/123

Sorry if I was not clear. The problematic commit after bisecting is
a69b176df246d59626e6a9c640b44c0921fa4566. Reverting this commit and
commit edd319dc527733e61eec5bdc9ce20c94634b6482 fixes the problem. So
all other commits in the XFS merge for 2.6.25 seem to be OK. I had to
revert the second commit only to avoid a merge conflict.

And I forgot to mention in my first post: Please CC me on all replies.
I'm not subscribed to the lists.

Gaudenz

-- 
Ever tried. Ever failed. No matter.
Try again. Fail again. Fail better.
~ Samuel Beckett ~


Re: [PATCH] x86: PARAVIRT needed by PARAVIRT_GUEST or X86_VSMP

2008-02-25 Thread Ingo Molnar

* Yinghai Lu <[EMAIL PROTECTED]> wrote:

> so it could be off automatically when PARAVIRT_GUEST or X86_VSMP is 
> not there

thanks, applied. This whole VSMP + PARAVIRT business has to be done 
more cleanly though before it's upstream ready; all the Kconfig magic 
looks rather twisted.

Ingo


Proposal for "proper" durable fsync() and fdatasync()

2008-02-25 Thread Jamie Lokier
Dear kernel,

This is a proposal to add "proper" durable fsync() and fdatasync() to Linux.

First the problem, then a proposed solution "with benefits", so to speak.

I need feedback on the details, before implementing anything.  Or
(hopefully) someone else thinks it's very important and does it
themselves :-)

By durable, I mean that fsync() should actually commit writes to
physical stable storage, not just the disk write cache when that is
enabled.  Databases and guest VMs need this, or an equivalent
feature, if they aren't to face occasional corruption after power
failure and perhaps some crashes.

The alternative is to disable the disk write cache.  But that isn't
modern practice or recommendation, since I/O write barriers were
implemented and they are much faster.

I was surprised that fsync() doesn't do this already.  There was a lot
of effort put into block I/O write barriers during 2.5, so that
journalling filesystems can force correct write ordering, using disk
flush cache commands.

After all that effort, I was very surprised to notice that Linux 2.6.x
doesn't use that capability to ensure fsync() flushes the disk cache
onto stable storage.

I noticed this following up discussions on the Qemu mailing list,
about guest VMs and how their IDE flush cache command should translate
to fsync() to avoid data loss.  (For guest VMs, fsync() isn't
necessary if the host machine is fine, and it isn't enough (on Linux
host) if the host machine loses power or the hard disk crashes another
way.)

Then I noticed it again, when I was designing a database engine with
filesystem characteristics.  I thought "how do I ensure ordered
journal writes; can I use fdatasync()?" and was surprised to find the
answer is no, I have to use hacks like calling hdparm, and the authors
of major SQL databases seem to brush the problem under a carpet.

(Interestingly, in the Linux 2.4 patches for write barriers, fsync()
seems to be fine, if a bit slow.)

It isn't the first time this topic has come up:


http://groups.google.com.br/group/linux.kernel/browse_thread/thread/d343e51655b4ac7c/7ee9bca80977c2d1?#7ee9bca80977c2d1
("True fsync() in Linux (on IDE)")

In that thread, it was implied that this would be fixed in 2.6.  So I bet
some people are under the illusion that it's fixed in 2.6...


For a while, I've been meaning to bring it up on linux-kernel...


The fsync problem
-----------------

Chris Wedgwood wrote:
> On Mon, Feb 25, 2008 at 08:50:40PM +, Jamie Lokier wrote:
> 
> > On Linux (and other host OSes), fdatasync() and fsync() don't always
> > commit data to hard storage; they sometimes only commit it to the hard
> > drive cache.
> 
> That's a filesystem bug IMO.  People should be able to use f[data]sync
> with some level of confidence or else it's basically pointless.

I agree, I consider it a serious bug, and I would be pleased if
someone paid it some love and attention.

Right now, if you want a reliable database on Linux, you _cannot_
properly depend on fsync() or fdatasync().  Considering how much Linux
is used for critical databases, using these functions, this amazes me.

Also, if you have a guest VM, then the guest's filesystem journalling
is not reliable.  Not only can it lose data on power loss, it can
corrupt the guest filesystem too, due to reordering.  This is contrary
to what people expect, I think.

I'm not sure if a system reset can cause similar loss; I don't know
how disks react to that.

Also, for the person porting ZFS to run on FUSE, same applies...

Linux fsync is faulty in two ways:

  1. Database commits aren't _durable_ against power failure, because
     fsync doesn't flush the disk's cache.  This means committed data
     is not guaranteed to have reached stable storage.

  2. It's unsafe for write-ahead logging, because it doesn't really
     guarantee any _ordering_ for the writes at the hard storage
     level.  So aside from losing committed data, it can also corrupt
     structural metadata.

With ext3 it's quite easy to verify that fsync/fdatasync don't always
write a journal entry.  (Apart from looking at the kernel code :-)

Just write some data, fsync(), and observe the number of writes in
/proc/diskstats.  If the current mtime second _hasn't_ changed, the
inode isn't written.  If you write data, say, 10 times a second to the
same place followed by fsync(), you'll see a little more than 10 write
I/Os, and less than 20.

By the way, this shows a trick for fixing #2 (ordering): use fchmod()
to toggle the file attributes, and that will force the next fsync() to
write a journal entry, which _does_ issue a write barrier.  If you do
that with each write as above (write, fchmod change, fsync 10 times a
second), you will clearly see more write I/Os, and you'll hear the
disk behaving differently: it's seeking more.
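
The fchmod() trick above could be sketched in C roughly as follows.  This
only illustrates the sequence (write, toggle a mode bit, fsync).  The
helper name write_fchmod_fsync() and the choice of S_IXOTH as the bit to
toggle are assumptions made for the example, and whether the fsync() then
issues a write barrier depends on the filesystem and kernel, as described
above for ext3.

```c
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write a buffer, flip one permission bit so the
 * inode is dirty, then fsync().  Returns 0 on success, -1 on error.
 *
 * Assumption: toggling S_IXOTH is acceptable for this file.  The
 * file's permissions really do change on every call.
 */
int write_fchmod_fsync(int fd, const void *buf, size_t len)
{
    struct stat st;
    ssize_t n = write(fd, buf, len);

    /* Treat errors and short writes alike as failure in this sketch. */
    if (n < 0 || (size_t)n != len)
        return -1;

    if (fstat(fd, &st) != 0)
        return -1;

    /* Toggle one mode bit so the next fsync() has an inode change to log. */
    if (fchmod(fd, (st.st_mode & 07777) ^ S_IXOTH) != 0)
        return -1;

    return fsync(fd);
}
```

Calling this ten times a second on the same file and watching
/proc/diskstats is one way to repeat the comparison described above.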

However, even this ugly trick has problems:

  3. Using the fchmod() trick or good fortune, fsync() issues a write
 barrier.  Right now, this does commit data (if 

Re: compile problem in current x86.git

2008-02-25 Thread Ingo Molnar

* Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:

> Ingo Molnar wrote:
>> Jeremy, you might want to start tracking x86.git#testing:
>>
>>   http://people.redhat.com/mingo/x86.git/README
>>
>> if you want to follow the latest & greatest x86.git code.
>>   
>
> Right, will do.

generally it's as well tested on x86 as #mm (so you should not notice 
any difference while tracking it), but it also includes work in progress 
bits that might touch other subsystems as well.

Ingo


[PATCH 2/2] [GFS2] re-support special inode

2008-02-25 Thread Denis Cheng
commit feaa7bba026c181ce071d5a4884f7f9dd26207a1 removed the call to
init_special_inode from inode lookup, which causes problems such as:

 # mknod /mnt/gfs2/dev/null c 1 3
 # cat /mnt/gfs2/dev/null
 cat: /mnt/gfs2/dev/null: Invalid argument

without special inode support, GFS2 cannot handle char device files,
block device files, FIFOs, and socket files, and so loses many important
features of a general-purpose file system.

this one-line patch re-adds special inode support.

Signed-off-by: Denis Cheng <[EMAIL PROTECTED]>
---
 fs/gfs2/inode.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 1069ceb..18d8e0b 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -150,6 +150,7 @@ void gfs2_set_iop(struct inode *inode)
inode->i_op = &gfs2_symlink_iops;
} else {
inode->i_op = &gfs2_file_iops;
+   init_special_inode(inode, inode->i_mode, inode->i_rdev);
}
 
unlock_new_inode(inode);
-- 
1.5.4.2



Re: [PATCH] x86_64: force re setting the mmconf for fam10h if acpi=off

2008-02-25 Thread Ingo Molnar

* Yinghai Lu <[EMAIL PROTECTED]> wrote:

> some BIOSes only let AMD fam 10h handle bus0, and let nvidia mcp55/ck804
> handle the other buses. in that case MCFG will cover all of them.
> 
> but with acpi=off, we can not use MCFG. this patch will double check
> the busnbits, and if it covers fewer than 256 buses and acpi=off, will
> forcibly reset the mmconf in the MSR, so we still use mmconf in the above case.

thanks, applied.

> @@ -720,14 +720,21 @@ static void __cpuinit fam10h_check_enabl

btw., a cleanliness suggestion: wouldn't it be cleaner to separate all 
the extra AMD family 16 code into a separate file? That should make it 
more focused and more isolated as well.

Ingo


[PATCH 1/2] [GFS2] remove gfs2_dev_iops

2008-02-25 Thread Denis Cheng
struct inode_operations gfs2_dev_iops has always been identical to
gfs2_file_iops, since Jan 2006, when GFS2 was merged into the mainline kernel.

So one of them could be removed.

Signed-off-by: Denis Cheng <[EMAIL PROTECTED]>
---
 fs/gfs2/inode.c |2 +-
 fs/gfs2/ops_inode.c |   10 --
 fs/gfs2/ops_inode.h |1 -
 3 files changed, 1 insertions(+), 12 deletions(-)

diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 37725ad..1069ceb 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -149,7 +149,7 @@ void gfs2_set_iop(struct inode *inode)
} else if (S_ISLNK(mode)) {
inode->i_op = &gfs2_symlink_iops;
} else {
-   inode->i_op = &gfs2_dev_iops;
+   inode->i_op = &gfs2_file_iops;
}
 
unlock_new_inode(inode);
diff --git a/fs/gfs2/ops_inode.c b/fs/gfs2/ops_inode.c
index e874129..ab9a073 100644
--- a/fs/gfs2/ops_inode.c
+++ b/fs/gfs2/ops_inode.c
@@ -1148,16 +1148,6 @@ const struct inode_operations gfs2_file_iops = {
.removexattr = gfs2_removexattr,
 };
 
-const struct inode_operations gfs2_dev_iops = {
-   .permission = gfs2_permission,
-   .setattr = gfs2_setattr,
-   .getattr = gfs2_getattr,
-   .setxattr = gfs2_setxattr,
-   .getxattr = gfs2_getxattr,
-   .listxattr = gfs2_listxattr,
-   .removexattr = gfs2_removexattr,
-};
-
 const struct inode_operations gfs2_dir_iops = {
.create = gfs2_create,
.lookup = gfs2_lookup,
diff --git a/fs/gfs2/ops_inode.h b/fs/gfs2/ops_inode.h
index fd8cee2..14b4b79 100644
--- a/fs/gfs2/ops_inode.h
+++ b/fs/gfs2/ops_inode.h
@@ -15,7 +15,6 @@
 extern const struct inode_operations gfs2_file_iops;
 extern const struct inode_operations gfs2_dir_iops;
 extern const struct inode_operations gfs2_symlink_iops;
-extern const struct inode_operations gfs2_dev_iops;
 extern const struct file_operations gfs2_file_fops;
 extern const struct file_operations gfs2_dir_fops;
 extern const struct file_operations gfs2_file_fops_nolock;
-- 
1.5.4.2



Re: [PATCH] x86_64: remove wrong setting about CONSTANT_TSC for intel cpu

2008-02-25 Thread Yinghai Lu
On Mon, Feb 25, 2008 at 11:20 PM, Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
>  * Yinghai Lu <[EMAIL PROTECTED]> wrote:
>
>
> > also fix error in early_init_intel and reference about x86_capality,
>  > because it is array already.., prevent possible data corruption...
>
>  hm, why should there be data corruption:
>
>
>  > - set_bit(X86_FEATURE_CONSTANT_TSC, &c->x86_capability);
>  > + set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
>
>  set_cpu_cap() is currently defined as:
>
>   #define set_cpu_cap(c, bit) set_bit(bit, (unsigned long *)((c)->x86_capability))
>
>  which is the same. set_cpu_cap() is indeed the cleaner form to do this
>  so your patch is correct as a cleanup.
set_cpu_cap is right.

set_bit(X86_FEATURE_CONSTANT_TSC, &c->x86_capability); ===> is wrong
it should be
set_bit(X86_FEATURE_CONSTANT_TSC, c->x86_capability);

x86_capability is an array ...

so this could prevent some data corruption.

YH


Re: [PATCH] x86_64: remove wrong setting about CONSTANT_TSC for intel cpu

2008-02-25 Thread Ingo Molnar

* Yinghai Lu <[EMAIL PROTECTED]> wrote:

> also fix error in early_init_intel and reference about x86_capality, 
> because it is array already.., prevent possible data corruption...

hm, why should there be data corruption:

> - set_bit(X86_FEATURE_CONSTANT_TSC, &c->x86_capability);
> + set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);

set_cpu_cap() is currently defined as:

 #define set_cpu_cap(c, bit) set_bit(bit, (unsigned long *)((c)->x86_capability))

which is the same. set_cpu_cap() is indeed the cleaner form to do this 
so your patch is correct as a cleanup.

Ingo
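
[The point above, that the array form and the address-of-array form
reach the same memory, can be checked with a small user-space sketch.
Everything here is an assumption made for illustration: the struct
name, the array size, and sketch_set_bit(), which is a non-atomic toy
and not the kernel's set_bit(). The two expressions differ only in
type, not in address.]

```c
#include <assert.h>
#include <stdint.h>

#define NCAPINTS_SKETCH 8	/* hypothetical size, for illustration only */

struct cpuinfo_sketch {
	uint32_t x86_capability[NCAPINTS_SKETCH];
};

/* Returns 1 when both spellings refer to the same starting byte. */
static int same_address(struct cpuinfo_sketch *c)
{
	void *as_array = c->x86_capability;	/* decays to a pointer to element 0 */
	void *as_addr_of = &c->x86_capability;	/* pointer to the whole array */

	return as_array == as_addr_of;
}

/* Toy, non-atomic stand-in for set_bit(): sets bit nr in the words at addr. */
static void sketch_set_bit(unsigned int nr, void *addr)
{
	uint32_t *words = addr;

	words[nr / 32] |= (uint32_t)1 << (nr % 32);
}
```

[With an explicit cast or a void * parameter, as in the sketch, both
calls set the same bit. Without a cast the address-of form has a
different pointer type, which is where a compiler would be expected to
complain.]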


Re: iwl4965 dropping packets and __dev_addr_discard: address leakage! da_users=1

2008-02-25 Thread Tim Connors
On Mon, 25 Feb 2008, Andrew Morton wrote:

> On Tue, 26 Feb 2008 16:22:43 +1100 Tim Connors <[EMAIL PROTECTED]> wrote:
> 
> > Possibly because of the frequent renegotiating my iwl4965 card has
> > been making, it has now decided it's not going to pass packets
> > reliably until presumably next time I reboot.
> > 
> > I've noticed messages in syslog that I hadn't seen when things were
> > working fine.  The problem possibly has only surfaced after rmmodding
> > the iwl4965 module, then remodding it.  It stays with me through being
> > removed and modprobed again.  It is being bonded in conjunction with a
> > physical ethernet, if that is relevant, although I think I reproduced
> > it when it was on its lonesome.
> > 
> > 
> > Fresh after reboot:
> > 
> > Feb 22 09:49:31 dirac kernel: iwl4965: Intel(R) Wireless WiFi Link 4965AGN 
> > driver for Linux, 1.1.17kds
> > Feb 22 09:49:31 dirac kernel: iwl4965: Copyright(c) 2003-2007 Intel 
> > Corporation
> > Feb 22 09:49:31 dirac kernel: ACPI: PCI Interrupt :0c:00.0[A] -> GSI 17 
> > (level, low) -> IRQ 17
> > Feb 22 09:49:31 dirac kernel: PCI: Setting latency timer of device 
> > :0c:00.0 to 64
> > Feb 22 09:49:31 dirac kernel: iwl4965: Detected Intel Wireless WiFi Link 
> > 4965AGN
> > Feb 22 09:49:31 dirac kernel: ACPI: PCI Interrupt :00:1b.0[A] -> GSI 21 
> > (level, low) -> IRQ 21
> > Feb 22 09:49:31 dirac kernel: PCI: Setting latency timer of device 
> > :00:1b.0 to 64
> > Feb 22 09:49:31 dirac kernel: iwl4965: Tunable channels: 11 802.11bg, 13 
> > 802.11a channels
> > Feb 22 09:49:31 dirac kernel: phy0: Selected rate control algorithm 
> > 'iwl-4965-rs'
> > 
> > now finished with it, will be deconfigured and rmmodded:
> > 
> > Feb 23 03:05:23 dirac kernel: __dev_addr_discard: address leakage! 
> > da_users=1
> > Feb 23 03:05:23 dirac kernel: ACPI: PCI interrupt for device :0c:00.0 
> > disabled
> > 
> > The modprobed again:
> > 
> > Feb 23 03:05:35 dirac kernel: iwl4965: Intel(R) Wireless WiFi Link 4965AGN 
> > driver for Linux, 1.1.17kds
> > Feb 23 03:05:35 dirac kernel: iwl4965: Copyright(c) 2003-2007 Intel 
> > Corporation
> > Feb 23 03:05:35 dirac kernel: ACPI: PCI Interrupt :0c:00.0[A] -> GSI 17 
> > (level, low) -> IRQ 17
> > Feb 23 03:05:35 dirac kernel: PCI: Setting latency timer of device 
> > :0c:00.0 to 64
> > Feb 23 03:05:35 dirac kernel: iwl4965: Detected Intel Wireless WiFi Link 
> > 4965AGN
> > Feb 23 03:05:35 dirac kernel: iwl4965: Tunable channels: 11 802.11bg, 13 
> > 802.11a channels
> > Feb 23 03:05:35 dirac kernel: phy9: Selected rate control algorithm 
> > 'iwl-4965-rs'
> > 
> > according to lspci -vvv, 0c:00.0 is:
> > 
> > 0c:00.0 Network controller: Intel Corporation PRO/Wireless 4965 AG or AGN 
> > Network Connection (rev 61)
> > Subsystem: Intel Corporation Unknown device 1120
> > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
> > Stepping- SERR+ FastB2B- DisINTx+
> > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 
> > SERR-  > Latency: 0, Cache Line Size: 64 bytes
> > Interrupt: pin A routed to IRQ 379
> > Region 0: Memory at f9ffe000 (64-bit, non-prefetchable) [size=8K]
> > Capabilities: [c8] Power Management version 3
> > Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA 
> > PME(D0+,D1-,D2-,D3hot+,D3cold+)
> > Status: D0 PME-Enable- DSel=0 DScale=0 PME-
> > Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ 
> > Queue=0/0 Enable+
> > Address: fee0300c  Data: 4194
> > Capabilities: [e0] Express (v1) Endpoint, MSI 00
> > DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s 
> > <512ns, L1 unlimited
> > ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
> > DevCtl: Report errors: Correctable- Non-Fatal- Fatal- 
> > Unsupported-
> > RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
> > MaxPayload 128 bytes, MaxReadReq 128 bytes
> > DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ 
> > TransPend-
> > LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, 
> > Latency L0 <128ns, L1 <64us
> > ClockPM+ Suprise- LLActRep- BwNot-
> > LnkCtl: ASPM L0s Enabled; RCB 64 bytes Disabled- Retrain- 
> > CommClk+
> > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ 
> > DLActive- BWMgmt- ABWMgmt-
> > Capabilities: [100] Advanced Error Reporting 
> > Capabilities: [140] Device Serial Number d7-36-9b-ff-ff-e8-13-00
> > Kernel driver in use: iwl4965
> > Kernel modules: iwl4965
> 
> (cc linux-wireless)
> 
> What kernel version is this?
> 
> Is this a regression from an earlier kernel version?  If so, which?

kernel 2.6.24
version 4.44.1.20 of the 

Re: [PATCH] x86_64: remove wrong setting about CONSTANT_TSC for intel cpu

2008-02-25 Thread Ingo Molnar

* Yinghai Lu <[EMAIL PROTECTED]> wrote:

> early_init_intel is introduced by
> 
> commit 2b16a2353814a513cdb5c5c739b76a19d7ea39ce
> Author: Andi Kleen <[EMAIL PROTECTED]>
> Date:   Wed Jan 30 13:32:40 2008 +0100
> 
> x86: move X86_FEATURE_CONSTANT_TSC into early cpu feature detection
> 
> set CONSTANT_TSC for intel cpus
> 
> but it already set in init_intel
> 
> don't need to set that two times in early_init_intel() and init_intel. this 
> patch remove one.
> 
> also fix error in early_init_intel and reference about x86_capality, because 
> it
> is array already.., prevent possible data corruption...
> 
> this should be applied for 2.6.25

thanks Yinghai, applied.

Ingo


Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal

2008-02-25 Thread David Brownell
This "flaw" isn't a new thing, of course.  I remember pointing out the rather
annoying proclivity of the PM framework to deadlock when suspend() tried to
remove USB devices ... back around 2.6.10 or so.  Things have shuffled around
a bit, and gotten better in some cases, but not fundamentally changed.

It may be more accurate to say that now we understand some constraints on
device tree management policies ... ones we had previously assumed should
not be issues.  (But AFAICT, without actually considering the question.
Now we know the right question to ask!)


On Monday 25 February 2008, Rafael J. Wysocki wrote:
> IMO the device driver should assure that no new children will be registered
> concurrently with the ->suspend() method (IOW, ->suspend() should wait for
> all such registrations to complete and should prevent any new ones from
> being started) and it should make it impossible to register any new children
> after ->suspend() has run.  It's the driver's problem how to achieve that.

There's also the case where it's framework code that handles the additions
rather than the parent device.  That would be typical for many bridge, hub,
or adapter type drivers ... you may be thinking mostly about drivers acting
as "leaf" nodes in the device tree, at least in terms of real hardware nodes.

Yes, "require that policy from such framework code too".  Just trying to be
sure the description doesn't have gaping holes in the middle.  :)

I can think of a bunch of serial busses where framework code has that sort
of responsibility.  USB, SPI, I2C ... "legacy" I2C drivers would all need
to be taught not to create/remove children during those intervals, lacking
"new style" conversion (which makes them work more like normal drivers).

- Dave



[PATCH] x86_64: remove wrong setting about CONSTANT_TSC for intel cpu

2008-02-25 Thread Yinghai Lu
early_init_intel is introduced by

commit 2b16a2353814a513cdb5c5c739b76a19d7ea39ce
Author: Andi Kleen <[EMAIL PROTECTED]>
Date:   Wed Jan 30 13:32:40 2008 +0100

x86: move X86_FEATURE_CONSTANT_TSC into early cpu feature detection

set CONSTANT_TSC for intel cpus

but it is already set in init_intel

there is no need to set it twice, in early_init_intel() and init_intel(); this 
patch removes one.

also fix an error in early_init_intel in the reference to x86_capability: because it
is an array already, this prevents possible data corruption...

this should be applied for 2.6.25

Signed-off-by: Yinghai Lu <[EMAIL PROTECTED]> 

diff --git a/arch/x86/kernel/setup_64.c b/arch/x86/kernel/setup_64.c
index 62d3f14..210134c 100644
--- a/arch/x86/kernel/setup_64.c
+++ b/arch/x86/kernel/setup_64.c
@@ -1027,7 +1027,7 @@ static void __cpuinit early_init_intel(struct cpuinfo_x86 *c)
 {
 	if ((c->x86 == 0xf && c->x86_model >= 0x03) ||
 	    (c->x86 == 0x6 && c->x86_model >= 0x0e))
-		set_bit(X86_FEATURE_CONSTANT_TSC, &c->x86_capability);
+		set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
 }
 
 static void __cpuinit init_intel(struct cpuinfo_x86 *c)
@@ -1071,9 +1071,6 @@ static void __cpuinit init_intel(struct cpuinfo_x86 *c)
 
if (c->x86 == 15)
c->x86_cache_alignment = c->x86_clflush_size * 2;
-   if ((c->x86 == 0xf && c->x86_model >= 0x03) ||
-   (c->x86 == 0x6 && c->x86_model >= 0x0e))
-   set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
if (c->x86 == 6)
set_cpu_cap(c, X86_FEATURE_REP_GOOD);
set_cpu_cap(c, X86_FEATURE_LFENCE_RDTSC);


Re: [PATCH] Memory Resource Controller Add Boot Option

2008-02-25 Thread Hirokazu Takahashi
Hi,

> >>> I'll send out a prototype for comment.
> > 
> > Something like the patch below. The effects of cgroup_disable=foo are:
> > 
> > - foo doesn't show up in /proc/cgroups
> 
> Or we can print out the disable flag, maybe this will be better?
> Because we can distinguish from disabled and not compiled in from
> /proc/cgroups.

It would be neat if the disable flag /proc/cgroups can be cleared/set
on demand. It will depend on the implementation of each controller
whether it works or not.

> > - foo isn't auto-mounted if you mount all cgroups in a single hierarchy
> > - foo isn't visible as an individually mountable subsystem
> 
> You mentioned in a previous mail if we mount a disabled subsystem we
> will get an error. Here we just ignore the mount option. Which makes
> more sense ?
> 
> > 
> > As a result there will only ever be one call to foo->create(), at init
> > time; all processes will stay in this group, and the group will never be
> > mounted on a visible hierarchy. Any additional effects (e.g. not
> > allocating metadata) are up to the foo subsystem.
> > 
> > This doesn't handle early_init subsystems (their "disabled" bit isn't
> > set), but it could easily be extended to do so if any of the
> > early_init systems wanted it - I think it would just involve some
> > nastier parameter processing since it would occur before the
> > command-line argument parser had been run.
> > 
> > include/linux/cgroup.h |1 +
> > kernel/cgroup.c|   29 +++--
> > 2 files changed, 28 insertions(+), 2 deletions(-)
> > 
> > Index: cgroup_disable-2.6.25-rc2-mm1/include/linux/cgroup.h
> > ===
> > --- cgroup_disable-2.6.25-rc2-mm1.orig/include/linux/cgroup.h
> > +++ cgroup_disable-2.6.25-rc2-mm1/include/linux/cgroup.h
> > @@ -256,6 +256,7 @@ struct cgroup_subsys {
> > void (*bind)(struct cgroup_subsys *ss, struct cgroup *root);
> > int subsys_id;
> > int active;
> > +int disabled;
> > int early_init;
> > #define MAX_CGROUP_TYPE_NAMELEN 32
> > const char *name;
> > Index: cgroup_disable-2.6.25-rc2-mm1/kernel/cgroup.c
> > ===
> > --- cgroup_disable-2.6.25-rc2-mm1.orig/kernel/cgroup.c
> > +++ cgroup_disable-2.6.25-rc2-mm1/kernel/cgroup.c
> > @@ -790,7 +790,14 @@ static int parse_cgroupfs_options(char *
> > if (!*token)
> > return -EINVAL;
> > if (!strcmp(token, "all")) {
> > -opts->subsys_bits = (1 << CGROUP_SUBSYS_COUNT) - 1;
> > +/* Add all non-disabled subsystems */
> > +int i;
> > +opts->subsys_bits = 0;
> > +for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
> > +struct cgroup_subsys *ss = subsys[i];
> > +if (!ss->disabled)
> > +opts->subsys_bits |= 1ul << i;
> > +}
> > } else if (!strcmp(token, "noprefix")) {
> > set_bit(ROOT_NOPREFIX, &opts->flags);
> > } else if (!strncmp(token, "release_agent=", 14)) {
> > @@ -808,7 +815,8 @@ static int parse_cgroupfs_options(char *
> > for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
> > ss = subsys[i];
> > if (!strcmp(token, ss->name)) {
> > -set_bit(i, &opts->subsys_bits);
> > +if (!ss->disabled)
> > +set_bit(i, &opts->subsys_bits);
> > break;
> > }
> > }
> > @@ -2596,6 +2606,8 @@ static int proc_cgroupstats_show(struct
> > mutex_lock(&cgroup_mutex);
> > for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
> > struct cgroup_subsys *ss = subsys[i];
> > +if (ss->disabled)
> > +continue;
> > seq_printf(m, "%s\t%lu\t%d\n",
> >ss->name, ss->root->subsys_bits,
> >ss->root->number_of_cgroups);
> > @@ -2991,3 +3003,16 @@ static void cgroup_release_agent(struct
> > spin_unlock(&release_list_lock);
> > mutex_unlock(&cgroup_mutex);
> > }
> > +
> > +static int __init cgroup_disable(char *str)
> > +{
> > +int i;
> > +for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
> > +struct cgroup_subsys *ss = subsys[i];
> > +if (!strcmp(str, ss->name)) {
> > +ss->disabled = 1;
> > +break;
> > +}
> > +}
> > +}
> > +__setup("cgroup_disable=", cgroup_disable);
> > 
> > 
> >>
> >> Sure thing, if css has the flag, then it would be nice. Could you wrap it
> >> up to say
> >> something like css_disabled(&mem_cgroup_subsys)
> >>
> >>
> > 
> > It's the subsys object rather than the css (cgroup_subsys_state).
> > 
> >  We could have something like:
> > 
> > #define cgroup_subsys_disabled(_ss) ((_ss)->disabled)
> > 
> > but I don't see that
> >  cgroup_subsys_disabled(&mem_cgroup_subsys)
> > is better than just putting
> > 
> >  mem_cgroup_subsys.disabled
> > 
> > Paul
> > 
> > 
> 
> 
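
[The two pieces of the quoted prototype, marking a subsystem disabled
by name and building the "all" mask from the non-disabled ones, can be
tried in user space with a sketch like the one below. The struct and
function names are hypothetical stand-ins, not the kernel's. One
deliberate difference: the quoted cgroup_disable() is declared int but
has no return statement, so the sketch returns an explicit found /
not-found value instead.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the relevant fields of struct cgroup_subsys. */
struct subsys_sketch {
	const char *name;
	int disabled;
};

/* Build the mask used for the "all" mount option, skipping disabled entries. */
static unsigned long all_enabled_bits(const struct subsys_sketch *ss, size_t count)
{
	unsigned long bits = 0;
	size_t i;

	for (i = 0; i < count; i++)
		if (!ss[i].disabled)
			bits |= 1ul << i;
	return bits;
}

/* Mark the subsystem called str as disabled; returns 1 if found, else 0. */
static int disable_by_name(struct subsys_sketch *ss, size_t count, const char *str)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (!strcmp(str, ss[i].name)) {
			ss[i].disabled = 1;
			return 1;
		}
	}
	return 0;
}
```

[Whether an unknown name should be reported or silently ignored is the
same open question raised above for mounting a disabled subsystem; the
sketch only makes the outcome visible to the caller.]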

Re: [patch 3/6] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-25 Thread David Rientjes
On Mon, 25 Feb 2008, Paul Jackson wrote:

> $ grep mpol_store_user_nodemask mm/mempolicy.c
> static inline int mpol_store_user_nodemask(const struct mempolicy *pol)
> if (mpol_store_user_nodemask(policy))
> if (!mpol_store_user_nodemask(a))
> if (!mpol_store_user_nodemask(pol) &&
> 
> So I see no need to waste the instructions needed (in the three copies
> of this code, since it's static inline) to convert a non-zero value to
> exactly the value 1.
> 

Done, thanks.

> Hmmm ... speaking of static inline ... I can knock 600 bytes (that's
> IA64 bytes, so equivalent to about 300 x86 bytes) off the kernel text
> size by not inlining the mm/mempolicy.c routines check_pgd_range() and
> interleave_nid().  I wonder if that would be worth doing.  Perhaps
> those two routines are in sufficiently tight corners that the duplicate
> copies of them is needed.
> 

It seems like a worthwhile change to me even though gcc will pass the 
actuals to check_pgd_range() on the stack.  The callers to check_range() 
shouldn't be in any fast paths: migrate_to_node() can take a long time 
depending on the length of the page list and do_mbind() sleeps on 
mmap_sem.

   text    data     bss     dec     hex filename
  11695      24      24   11743    2ddf mm/mempolicy.o.before
  11215      24      24   11263    2bff mm/mempolicy.o.after


[GIT PULL] XFS update for 2.6.25-rc4

2008-02-25 Thread Lachlan McIlroy
Please pull from the for-linus branch:
git pull git://oss.sgi.com:8090/xfs/xfs-2.6.git for-linus

This will update the following files:

 fs/xfs/xfs_bit.c |  103 ++
 fs/xfs/xfs_bit.h |   27 ++---
 fs/xfs/xfs_rtalloc.c |   19 ++---
 3 files changed, 120 insertions(+), 29 deletions(-)

through these commits:

commit ef8ece55d9b6825c28a5c1a4bd89b94040cb7b32
Author: Lachlan McIlroy <[EMAIL PROTECTED]>
Date:   Tue Feb 26 17:00:22 2008 +1100

[XFS] Undo bit ops cleanup mod due to regression on 32-bit powermac
platform.

SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30559a

Signed-off-by: Lachlan McIlroy <[EMAIL PROTECTED]>

commit db69c915e67705daac25cad06d816c09be634de0
Author: Lachlan McIlroy <[EMAIL PROTECTED]>
Date:   Tue Feb 26 17:00:14 2008 +1100

[XFS] Undo bit ops cleanup mod due to regression on 32-bit powermac
platform.

SGI-PV: 974005
SGI-Modid: xfs-linux-melb:xfs-kern:30558a

Signed-off-by: Lachlan McIlroy <[EMAIL PROTECTED]>


Re: [PATCH] x86: add the debugfs interface for the sysprof tool

2008-02-25 Thread Pekka Enberg
Hi,

On Tue, Feb 26, 2008 at 8:27 AM, Pekka Enberg <[EMAIL PROTECTED]> wrote:
> >  You could try passing the --callgraph option to opcontrol.
>
>  Hmm, perhaps I am missing something but I don't think that does what
>  sysprof does. At least I can't find where in the oprofile kernel code
>  does it save the full stack trace for user-space. John?

Ok, so as pointed out by Nicholas/Andrew, oprofile does indeed do
exactly what sysprof does (see
arch/x86/oprofile/backtrace.c::backtrace_address, for example). So,
Soeren, any other reason we can't use the oprofile kernel module for
sysprof?


Re: [patch 5/6] mempolicy: add MPOL_F_RELATIVE_NODES flag

2008-02-25 Thread David Rientjes
On Tue, 26 Feb 2008, Paul Jackson wrote:

> David wrote:
> +static nodemask_t mpol_relative_nodemask(const nodemask_t *orig,
> +  const nodemask_t *rel)
> +{
> + nodemask_t ret;
> + nodemask_t tmp;
> 
> Could you avoid needing the nodemask_t 'ret' on the stack, by passing
> in a "nodemask_t *" pointer to where you want the resulting nodemask_t
> written, rather than by returning it by value?
> 
>  static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
>const nodemask_t *rel)
> 

Done, thanks.
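
[For reference, the shape of the change requested above, sketched in
user space. mask_sketch_t is a hypothetical stand-in for nodemask_t,
and the AND operation is only a placeholder; it is not what
mpol_relative_nodemask() computes. The point is the signature: the
out-parameter version writes straight into the caller's storage and
declares no local result object.]

```c
#include <assert.h>
#include <string.h>

#define SKETCH_WORDS 16	/* hypothetical: 16 x 64 = 1024 bits */

typedef struct {
	unsigned long long bits[SKETCH_WORDS];
} mask_sketch_t;

/* Before: result built in a local object and returned by value. */
static mask_sketch_t mask_and_by_value(const mask_sketch_t *a,
				       const mask_sketch_t *b)
{
	mask_sketch_t ret;
	int i;

	for (i = 0; i < SKETCH_WORDS; i++)
		ret.bits[i] = a->bits[i] & b->bits[i];
	return ret;
}

/* After: the caller passes in where the result should be written. */
static void mask_and(mask_sketch_t *ret, const mask_sketch_t *a,
		     const mask_sketch_t *b)
{
	int i;

	for (i = 0; i < SKETCH_WORDS; i++)
		ret->bits[i] = a->bits[i] & b->bits[i];
}
```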


Re: iwl4965 dropping packets and __dev_addr_discard: address leakage! da_users=1

2008-02-25 Thread Andrew Morton
On Tue, 26 Feb 2008 16:22:43 +1100 Tim Connors <[EMAIL PROTECTED]> wrote:

> Possibly because of the frequent renegotiating my iwl4965 card has
> been making, it has now decided it's not going to pass packets
> reliably until presumably next time I reboot.
> 
> I've noticed messages in syslog that I hadn't seen when things were
> working fine.  The problem possibly has only surfaced after rmmodding
> the iwl4965 module, then remodding it.  It stays with me through being
> removed and modprobed again.  It is being bonded in conjunction with a
> physical ethernet, if that is relevant, although I think I reproduced
> it when it was on its lonesome.
> 
> 
> Fresh after reboot:
> 
> Feb 22 09:49:31 dirac kernel: iwl4965: Intel(R) Wireless WiFi Link 4965AGN 
> driver for Linux, 1.1.17kds
> Feb 22 09:49:31 dirac kernel: iwl4965: Copyright(c) 2003-2007 Intel 
> Corporation
> Feb 22 09:49:31 dirac kernel: ACPI: PCI Interrupt :0c:00.0[A] -> GSI 17 
> (level, low) -> IRQ 17
> Feb 22 09:49:31 dirac kernel: PCI: Setting latency timer of device 
> :0c:00.0 to 64
> Feb 22 09:49:31 dirac kernel: iwl4965: Detected Intel Wireless WiFi Link 
> 4965AGN
> Feb 22 09:49:31 dirac kernel: ACPI: PCI Interrupt :00:1b.0[A] -> GSI 21 
> (level, low) -> IRQ 21
> Feb 22 09:49:31 dirac kernel: PCI: Setting latency timer of device 
> :00:1b.0 to 64
> Feb 22 09:49:31 dirac kernel: iwl4965: Tunable channels: 11 802.11bg, 13 
> 802.11a channels
> Feb 22 09:49:31 dirac kernel: phy0: Selected rate control algorithm 
> 'iwl-4965-rs'
> 
> now finished with it, will be deconfigured and rmmodded:
> 
> Feb 23 03:05:23 dirac kernel: __dev_addr_discard: address leakage! da_users=1
> Feb 23 03:05:23 dirac kernel: ACPI: PCI interrupt for device :0c:00.0 
> disabled
> 
> The modprobed again:
> 
> Feb 23 03:05:35 dirac kernel: iwl4965: Intel(R) Wireless WiFi Link 4965AGN 
> driver for Linux, 1.1.17kds
> Feb 23 03:05:35 dirac kernel: iwl4965: Copyright(c) 2003-2007 Intel 
> Corporation
> Feb 23 03:05:35 dirac kernel: ACPI: PCI Interrupt :0c:00.0[A] -> GSI 17 
> (level, low) -> IRQ 17
> Feb 23 03:05:35 dirac kernel: PCI: Setting latency timer of device 
> :0c:00.0 to 64
> Feb 23 03:05:35 dirac kernel: iwl4965: Detected Intel Wireless WiFi Link 
> 4965AGN
> Feb 23 03:05:35 dirac kernel: iwl4965: Tunable channels: 11 802.11bg, 13 
> 802.11a channels
> Feb 23 03:05:35 dirac kernel: phy9: Selected rate control algorithm 
> 'iwl-4965-rs'
> 
> according to lspci -vvv, 0c:00.0 is:
> 
> 0c:00.0 Network controller: Intel Corporation PRO/Wireless 4965 AG or AGN 
> Network Connection (rev 61)
> Subsystem: Intel Corporation Unknown device 1120
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
> Stepping- SERR+ FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 
> SERR-  Latency: 0, Cache Line Size: 64 bytes
> Interrupt: pin A routed to IRQ 379
> Region 0: Memory at f9ffe000 (64-bit, non-prefetchable) [size=8K]
> Capabilities: [c8] Power Management version 3
> Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA 
> PME(D0+,D1-,D2-,D3hot+,D3cold+)
> Status: D0 PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ 
> Queue=0/0 Enable+
> Address: fee0300c  Data: 4194
> Capabilities: [e0] Express (v1) Endpoint, MSI 00
> DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s 
> <512ns, L1 unlimited
> ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
> DevCtl: Report errors: Correctable- Non-Fatal- Fatal- 
> Unsupported-
> RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
> MaxPayload 128 bytes, MaxReadReq 128 bytes
> DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ 
> TransPend-
> LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, 
> Latency L0 <128ns, L1 <64us
> ClockPM+ Suprise- LLActRep- BwNot-
> LnkCtl: ASPM L0s Enabled; RCB 64 bytes Disabled- Retrain- 
> CommClk+
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ 
> DLActive- BWMgmt- ABWMgmt-
> Capabilities: [100] Advanced Error Reporting 
> Capabilities: [140] Device Serial Number d7-36-9b-ff-ff-e8-13-00
> Kernel driver in use: iwl4965
> Kernel modules: iwl4965

(cc linux-wireless)

What kernel version is this?

Is this a regression from an earlier kernel version?  If so, which?


[PATCH] update efi region debugging to use MB, GB and TB as well as KB

2008-02-25 Thread Simon Horman
When EFI_DEBUG is defined to a non-zero value in arch/ia64/kernel/efi.c,
the efi memory regions are displayed. This patch enhances the
display code in a few ways:

1. Use TB, GB and MB as well as KB as units.
   Although this introduces rounding errors (KB doesn't, as
   size is always a multiple of 4KB), it does make
   things a lot more readable.

   Also as the range is also shown, it is possible to note the exact size
   if it is important. In my experience, the size field is mostly useful
   for getting a general idea of the size of a region.

   On the rx2620 that I use, there actually is an 8TB region (though not
   backed by physical memory), and 8TB really is a lot more readable than
   8589934592KB.

2. pad the size field with leading spaces to further improve readability

   ...
   ... (   8MB)
   ... ( 928MB)
   ... (   3MB)
   ...

   vs

   ...
   ... (8MB)
   ... (928MB)
   ... (3MB)
   ...


3. Pad the attr field out to 64bits using leading zeros,
   to further improve readability.

   ...
   mem05: type= 2, attr=0x0000000000000008, range=[0x0000000004000000-0x000000000481f000) (   8MB)
   mem06: type= 7, attr=0x0000000000000008, range=[0x000000000481f000-0x000000003e876000) ( 928MB)
   mem07: type= 5, attr=0x8000000000000008, range=[0x000000003e876000-0x000000003eb8e000) (   3MB)
   mem08: type= 4, attr=0x0000000000000008, range=[0x000000003eb8e000-0x000000003ee7a000) (   2MB)
   ...

   ...
   mem05: type= 2, attr=0x8, range=[0x0000000004000000-0x000000000481f000) (   8MB)
   mem06: type= 7, attr=0x8, range=[0x000000000481f000-0x000000003e876000) ( 928MB)
   mem07: type= 5, attr=0x8000000000000008, range=[0x000000003e876000-0x000000003eb8e000) (   3MB)
   mem08: type= 4, attr=0x8, range=[0x000000003eb8e000-0x000000003ee7a000) (   2MB)

4. Use %d instead of %u for the index field, as i is a signed int.

N.B: This code is not compiled unless EFI_DEBUG is non 0.

Signed-off-by: Simon Horman <[EMAIL PROTECTED]>

Index: linux-2.6/arch/ia64/kernel/efi.c
===
--- linux-2.6.orig/arch/ia64/kernel/efi.c   2008-02-26 15:07:57.0 +0900
+++ linux-2.6/arch/ia64/kernel/efi.c    2008-02-26 15:25:33.0 +0900
@@ -543,12 +543,30 @@ efi_init (void)
for (i = 0, p = efi_map_start; p < efi_map_end;
 ++i, p += efi_desc_size)
{
+   const char *unit;
+   unsigned long size;
+
md = p;
-   printk("mem%02u: type=%u, attr=0x%lx, "
-  "range=[0x%016lx-0x%016lx) (%luMB)\n",
+   size = md->num_pages << EFI_PAGE_SHIFT;
+
+   if ((size >> 40) > 0) {
+   size >>= 40;
+   unit = "TB";
+   } else if ((size >> 30) > 0) {
+   size >>= 30;
+   unit = "GB";
+   } else if ((size >> 20) > 0) {
+   size >>= 20;
+   unit = "MB";
+   } else {
+   size >>= 10;
+   unit = "KB";
+   }
+
+   printk("mem%02d: type=%2u, attr=0x%016lx, "
+  "range=[0x%016lx-0x%016lx) (%4lu%s)\n",
   i, md->type, md->attribute, md->phys_addr,
-  md->phys_addr + efi_md_size(md),
-  md->num_pages >> (20 - EFI_PAGE_SHIFT));
+  md->phys_addr + efi_md_size(md), size, unit);
}
}
 #endif
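
[The unit selection in the patch can be exercised outside the kernel
with a sketch like the following. scale_size() and format_size() are
hypothetical helper names introduced here; the shifts and the "%4lu%s"
style padding follow the patch, with unsigned long long used so the
sketch does not depend on a 64-bit long. The input is a size in bytes,
as produced by md->num_pages << EFI_PAGE_SHIFT in the patch.]

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Scale a byte count down to the largest unit that leaves a non-zero
 * value, truncating any remainder, and report the unit via *unit.
 */
static unsigned long long scale_size(unsigned long long size, const char **unit)
{
	if ((size >> 40) > 0) {
		size >>= 40;
		*unit = "TB";
	} else if ((size >> 30) > 0) {
		size >>= 30;
		*unit = "GB";
	} else if ((size >> 20) > 0) {
		size >>= 20;
		*unit = "MB";
	} else {
		size >>= 10;
		*unit = "KB";
	}
	return size;
}

/* Format the size field the way the patched printk does: "(%4lu%s)". */
static void format_size(char *buf, size_t len, unsigned long long bytes)
{
	const char *unit;
	unsigned long long scaled = scale_size(bytes, &unit);

	snprintf(buf, len, "(%4llu%s)", scaled, unit);
}
```

[Because the value is truncated rather than rounded, a region just
under 3MB is shown as 2MB; the range field remains the place to read
the exact size.]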


Re: [PATCH] x86: add the debugfs interface for the sysprof tool

2008-02-25 Thread Pekka Enberg
On Sun, Feb 24, 2008 at 5:12 AM, Nicholas Miell <[EMAIL PROTECTED]> wrote:
> > Sysprof tracks the full stack frame so it can provide meaningful call
> > tree (who called what) which is invaluable for spotting hot _paths_. I
> > don't see how oprofile can do that as it tracks instruction pointers only.
>
>  You could try passing the --callgraph option to opcontrol.

Hmm, perhaps I am missing something but I don't think that does what
sysprof does. At least I can't find where in the oprofile kernel code
does it save the full stack trace for user-space. John?

Pekka


Re: 2.6.25-rc3: Reported regressions from 2.6.24

2008-02-25 Thread Michael S. Tsirkin
2008/2/25 Rafael J. Wysocki <[EMAIL PROTECTED]>:
> This message contains a list of some regressions from 2.6.24 reported since
>  2.6.25-rc1 was released, for which there are no fixes in the mainline I know
>  of.  If any of them have been fixed already, please let me know.

If you want that, I think Cc-ing reporters might be a good idea.

>  If you know of any other unresolved regressions from 2.6.24, please let me 
> know
>  either and I'll add them to the list.  Also, please let me know if any of the
>  entries below are invalid.

Two other issues, both with tested patches which do not seem to be in
mainline yet:

http://lkml.org/lkml/2008/2/22/127
2.6.25-rc[1,2]: failed to setup dm-crypt key mapping

http://lkml.org/lkml/2008/2/25/400
new regression in 2.6.25-rc3: no keyboard/lid acpi events on thinkpad T61p
I think this last one is bug 10100

Thanks,
MST


Re: Contact info

2008-02-25 Thread Willy Tarreau
On Tue, Feb 26, 2008 at 12:23:34AM +0100, [EMAIL PROTECTED] wrote:
> Hello,
> 
> I would like to know who is in charge with the linux credit list since i will 
> no longer use my current email. I'm using temporary this email to update 
> existing info.

you'd better submit a patch for the files with your new address.

Willy



Re: [PATCH 00/10] CGroup API files: Various cleanup to CGroup control files

2008-02-25 Thread Paul Menage
On Mon, Feb 25, 2008 at 7:23 PM, Li Zefan <[EMAIL PROTECTED]> wrote:
>
>  Should those patches be rebased against 2.6.25-rc3 ?
>

No, because they're against 2.6.25-rc2-mm1, which already has (I
think) any of the new bits in 2.6.25-rc3 that would be affected by
these patches.

Paul


Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)

2008-02-25 Thread Nick Piggin
On Thursday 21 February 2008 21:58, Robin Holt wrote:
> On Thu, Feb 21, 2008 at 03:20:02PM +1100, Nick Piggin wrote:
> > > > So why can't you export a device from your xpmem driver, which
> > > > can be mmap()ed to give out "anonymous" memory pages to be used
> > > > for these communication buffers?
> > >
> > > Because we need to have heap and stack available as well.  MPT does
> > > not control all the communication buffer areas.  I haven't checked, but
> > > this is the same problem that IB will have.  I believe they are
> > > actually allowing any memory region be accessible, but I am not sure of
> > > that.
> >
> > Then you should create a driver that the user program can register
> > and unregister regions of their memory with. The driver can do a
> > get_user_pages to get the pages, and then you'd just need to set up
> > some kind of mapping so that userspace can unmap pages / won't leak
> > memory (and an exit_mm notifier I guess).
>
> OK.  You need to explain this better to me.  How would this driver
> supposedly work?  What we have is an MPI library.  It gets invoked at
> process load time to establish its rank-to-rank communication regions.
> It then turns control over to the processes main().  That is allowed to
> run until it hits the
>   MPI_Init(, );
>
> The process is then totally under the user's control until:
>   MPI_Send(intmessage, m_size, MPI_INT, my_rank+half, tag, MPI_COMM_WORLD);
>   MPI_Recv(intmessage, m_size, MPI_INT, my_rank+half,tag, MPI_COMM_WORLD, );
>
> That is it.  That is all our allowed interaction with the user's process.

OK, when you said something along the lines of "the MPT library has
control of the comm buffer", then I assumed it was an area of virtual
memory which is set up as part of initialization, rather than during
runtime. I guess I jumped to conclusions.


> That doesn't seem too unreasonable, except when you compare it to how the
> driver currently works.  Remember, this is done from a library which has
> no insight into what the user has done to its own virtual address space.
> As a result, each MPI_Send() would result in a system call (or we would
> need to have a set of callouts for changes to a processes VMAs) which
> would be a significant increase in communication overhead.
>
> Maybe I am missing what you intend to do, but what we need is a means of
> tracking one processes virtual address space changes so other processes
> can do direct memory accesses without the need for a system call on each
> communication event.

Yeah, it's tricky. BTW, what is the performance difference between
having a system call and not having one?


> > Because you don't need to swap, you don't need coherency, and you
> > are in control of the areas, then this seems like the best choice.
> > It would allow you to use heap, stack, file-backed, anything.
>
> You are missing one point here.  The MPI specifications that have
> been out there for decades do not require the process use a library
> for allocating the buffer.  I realize that is a horrible shortcoming,
> but that is the world we live in.  Even if we could change that spec,

Can you change the spec? Are you working on it?


> we would still need to support the existing specs.  As a result, the
> user can change their virtual address space as they need and still expect
> communications be cheap.

That's true. How has it been supported up to now? Are you using
this kind of notifier in patched kernels?



Re: [patch 5/6] mempolicy: add MPOL_F_RELATIVE_NODES flag

2008-02-25 Thread Paul Jackson
David wrote:
+static nodemask_t mpol_relative_nodemask(const nodemask_t *orig,
+const nodemask_t *rel)
+{
+   nodemask_t ret;
+   nodemask_t tmp;

Could you avoid needing the nodemask_t 'ret' on the stack, by passing
in a "nodemask_t *" pointer to where you want the resulting nodemask_t
written, rather than by returning it by value?

 static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
 const nodemask_t *rel)
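
The difference between the two calling conventions can be sketched in
user space.  This is only an illustration: demo_nodemask_t and both
helper names are made up here, standing in for the kernel's nodemask_t
and the function in the patch, and the AND operation is a placeholder
for whatever the real routine computes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's nodemask_t: a fixed-size bitmap. */
#define DEMO_MAX_NODES     1024
#define DEMO_BITS_PER_LONG (8 * sizeof(unsigned long))
#define DEMO_LONGS         (DEMO_MAX_NODES / DEMO_BITS_PER_LONG)

typedef struct {
	unsigned long bits[DEMO_LONGS];
} demo_nodemask_t;

/* By value: needs a full-size local 'ret' in the callee, then a copy out. */
static demo_nodemask_t demo_and_by_value(const demo_nodemask_t *a,
					 const demo_nodemask_t *b)
{
	demo_nodemask_t ret;
	size_t i;

	for (i = 0; i < DEMO_LONGS; i++)
		ret.bits[i] = a->bits[i] & b->bits[i];
	return ret;
}

/* Via pointer: the result is written straight into the caller's storage. */
static void demo_and_into(demo_nodemask_t *ret, const demo_nodemask_t *a,
			  const demo_nodemask_t *b)
{
	size_t i;

	for (i = 0; i < DEMO_LONGS; i++)
		ret->bits[i] = a->bits[i] & b->bits[i];
}
```

With a large node count the by-value form puts one more mask-sized
object on the stack, which is the cost the suggestion is trying to avoid.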

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214


linux-next: Tree for Feb 26

2008-02-25 Thread Stephen Rothwell
Hi all,

I have created today's linux-next tree at
git://git.kernel.org/pub/scm/linux/kernel/git/sfr/linux-next.git.

You can see which trees have been included by looking in the Next/Trees
file in the source.  There are also quilt-import.log and merge.log files
in the Next directory.  Between each merge, the tree was built with
allmodconfig for both powerpc and x86_64.

There were only two minor merge problems.

We are up to 34 trees, more are welcome (even if they are currently
empty).

Status of my local build tests is at
http://kisskb.ellerman.id.au/kisskb/branch/9/. We have added arm and m68k
to the architectures built.

-- 
Cheers,
Stephen Rothwell    [EMAIL PROTECTED]




Re: Linux 2.6.24.3

2008-02-25 Thread Greg KH
On Tue, Feb 26, 2008 at 02:39:23PM +0900, Samuel Masham wrote:
> On Tue, Feb 26, 2008 at 10:00 AM, Greg Kroah-Hartman <[EMAIL PROTECTED]> wrote:
> > We (the -stable team) are announcing the release of the 2.6.24.3
> >  kernel.
> >
> 
> Hi Greg, Stable people :)
> 
> Can you confirm you have the mips irq probe crash fix in your queue
> for the next stable release

Yes, you just sent this to me on Monday, give me at least a week to
ignore it before you worry :)

It's in my queue, I'll let you know when I get to it.

thanks,

greg k-h


Re: [PATCH 00/28] Swap over NFS -v16

2008-02-25 Thread Neil Brown
On Saturday February 23, [EMAIL PROTECTED] wrote:
> On Wed, 20 Feb 2008 15:46:10 +0100 Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> 
> > Another posting of the full swap over NFS series. 
> 
> Well I looked.  There's rather a lot of it and I wouldn't pretend to
> understand it.

But pretending is fun :-)

> 
> What is the NFS and net people's take on all of this?

Well I'm only vaguely an NFS person, barely a net person, sporadically
an mm person, but I've had a look and it seems to mostly make sense.

We introduce a new "emergency" concept for page allocation.
The size of the emergency pool is set by various reservations by
different potential users.
If the number of free pages is below the "emergency" size, then only
users with a "MEMALLOC" flag get to allocate pages.  Further, those
pages get a "reserve" flag set which propagates into slab/slub so
kmalloc/kmemalloc only return memory from those pages to MEMALLOC
users. 
MEMALLOC users are those that set PF_MEMALLOC.  A socket can get
SOCK_MEMALLOC set which will cause certain pieces of code to
temporarily set PF_MEMALLOC while working on that socket.

The upshot is that provided any MEMALLOC user reserves an appropriate
amount of emergency space, returns the emergency memory promptly, and
sets PF_MEMALLOC whenever allocating memory, its memory allocations
should never fail.
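
As a rough model of that gating rule (not the code from the series;
struct demo_zone and demo_alloc_page are names invented for this
sketch, and treating "free == emergency" as already inside the reserve
is an assumption):

```c
#include <stdbool.h>

/* Hypothetical model of the allocator state described above. */
struct demo_zone {
	unsigned long free_pages;      /* pages currently free */
	unsigned long emergency_pages; /* sum of all reservations */
};

/*
 * Try to allocate one page.
 * 'memalloc' models a task that has PF_MEMALLOC set.
 * '*from_reserve' models the "reserve" flag set on the returned page.
 */
static bool demo_alloc_page(struct demo_zone *z, bool memalloc,
			    bool *from_reserve)
{
	*from_reserve = false;
	if (z->free_pages == 0)
		return false;
	if (z->free_pages <= z->emergency_pages) {
		/* Taking a page now would dip into the emergency pool. */
		if (!memalloc)
			return false;
		*from_reserve = true;
	}
	z->free_pages--;
	return true;
}
```

An ordinary caller is refused as soon as the pool would be touched,
while a MEMALLOC caller still succeeds and gets a page marked as coming
from the reserve.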

As memory is requested in small units, but allocated as pages, there
needs to be a conversion from small-units to pages.  One of the
patches does this and appears to err on the side of being over-generous,
which is the right thing to do.
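
A minimal sketch of such a conversion, assuming a 4096-byte page and
ignoring slab metadata entirely (demo_estimate_pages is a made-up name,
not one of the functions the patch introduces).  Every division is
rounded up so the estimate can only be too large, never too small:

```c
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096UL /* assumed page size */

/* Upper estimate of the pages needed for 'objects' items of 'size' bytes. */
static unsigned long demo_estimate_pages(unsigned long objects, size_t size)
{
	unsigned long per_page;

	if (objects == 0 || size == 0)
		return 0;

	if (size >= DEMO_PAGE_SIZE) {
		/* Each object occupies a whole number of pages, rounded up. */
		unsigned long pages_per_obj =
			(size + DEMO_PAGE_SIZE - 1) / DEMO_PAGE_SIZE;

		return objects * pages_per_obj;
	}

	/* Several objects share a page; round the page count up. */
	per_page = DEMO_PAGE_SIZE / size;
	return (objects + per_page - 1) / per_page;
}
```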


Memory reservations are organised in a tree.  I really don't
understand the tree.  Is it just to make /proc/reserve_info look more
helpful?
Certainly all the individual reservations need to be recorded, and the
cumulative reservation needs also to be recorded (currently in the
root of the tree) but what are all the other levels used for?

Reservations are used for all the transient memory that might be used
by the network stack.  This particularly includes the route cache and
skbs for incoming messages.  I have no idea if there is anything else
that needs to be allowed for.

Filesystems can advertise (via address_space_operations) that files
may be used as swap files.  They then provide swapout/swapin methods
which are like writepage/readpage but may behave differently and have
a different way to get credentials from a 'struct file'.


So in general, the patch set looks to have the right sort of shape.  I
cannot be very authoritative on the details as there are a lot of
them, and they touch code that I'm not very familiar with.

Some specific comments on patches:


reserve-slub.patch

   Please avoid irrelevant reformatting in patches.  It makes them
   harder to read.  e.g.:

-static void setup_object(struct kmem_cache *s, struct page *page,
-   void *object)
+static void setup_object(struct kmem_cache *s, struct page *page, void *object)


mm-kmem_estimate_pages.patch

   This introduces
 kestimate
 kestimate_single
 kmem_estimate_pages

   The last obviously returns a number of pages.  The contrast seems
   to suggest the others don't.   But they do...
   I don't think the names are very good, but I concede that it is
   hard to choose good names here.  Maybe:
  kmalloc_estimate_variable
  kmalloc_estimate_fixed
  kmem_alloc_estimate
   ???

mm-reserve.patch

   I'm confused by __mem_reserve_add.

+   reserve = mem_reserve_root.pages;
+   __calc_reserve(res, pages, 0);
+   reserve = mem_reserve_root.pages - reserve;

   __calc_reserve will always add 'pages' to mem_reserve_root.pages.
   So this is a complex way of doing

	reserve = pages;
	__calc_reserve(res, pages, 0);

   And as you can calculate reserve before calling __calc_reserve
   (which seems odd when stated that way), the whole function looks
   like it could become:

	ret = adjust_memalloc_reserve(pages);
	if (!ret)
		__calc_reserve(res, pages, limit);
	return ret;

   What am I missing?

Also, mem_reserve_disconnect really should be a "void" function.
Just put a BUG_ON(ret) and don't return anything.

Finally, I'll just repeat that the purpose of the tree structure
eludes me.

net-sk_allocation.patch

Why are the "GFP_KERNEL" call sites just changed to
"sk->sk_allocation" rather than "sk_allocation(sk, GFP_KERNEL)" ??

I assume there is a good reason, and seeing it in the change log
would educate me and make the patch more obviously correct.

netvm-reserve.patch

Function names again:

 sk_adjust_memalloc
 sk_set_memalloc

sound similar.  Purpose is completely different.

mm-page_file_methods.patch

This makes page_offset and others more expensive by adding a
conditional jump to a function call that is not usually made.

Why do swap pages have a different index to 

Re: boot_delay broken ?

2008-02-25 Thread Dave Young
On Tue, Feb 26, 2008 at 1:48 PM, Dave Young <[EMAIL PROTECTED]> wrote:
> On Tue, Feb 26, 2008 at 1:22 PM, Randy Dunlap <[EMAIL PROTECTED]> wrote:
>  > On Mon, 25 Feb 2008 10:14:36 +0800 Dave Young wrote:
>  >
>  >  > On Sun, Feb 24, 2008 at 8:46 AM, Dave Jones <[EMAIL PROTECTED]> wrote:
>  >  > > The boot_delay switch seems to be behaving strangely in the
>  >  > >  current -git.  Setting it to =10 makes the output 'bursty'
>  >  > >  it becomes slow for some printk's whilst others scroll by
>  >  > >  at regular speed.
>  >  > >  Setting it any higher than that seems to make it pause for
>  >  > >  a really long time before it outputs any text at all.
>  >  >
>  >  > On my side there's this issue for a long time
>  >  > http://lkml.org/lkml/2007/8/8/79
>  >
>  >  [http://marc.info/?l=linux-kernel=118655896515049=2]
>  >
>  >  You asked questions and they were answered.  Perhaps you didn't like
>  >  the answers.
>
>  No, I like it.  Thanks.
>
>  But I still want to know why mdelay can not be used.
>  is it not available for all archs or something else?
>
>
>  >
>  >  Here's a question for you.  What kernel boot options did you use?
>  >  Specifically, for lpj= and boot_delay= ?
>
>  I tried boot_delay=100 and boot_delay=200 without lpj set, The result
>  was really slow. It was better with lpj copied from dmesg, but was
>  still slower then mdelay.

Especially at the very beginning: after the message "Booting the kernel",
I need to wait several minutes before any further messages appear.

>
>  I think we can firstly use preset lpj, after delay calibrating just
>  use the system lpj
>
>
>
>  >
>  >  > >
>  >  > >  x86 timer changes perhaps ?
>  >
>  >
>  >  ---
>  >  ~Randy
>  >
>


Re: boot_delay broken ?

2008-02-25 Thread Dave Young
On Tue, Feb 26, 2008 at 1:22 PM, Randy Dunlap <[EMAIL PROTECTED]> wrote:
> On Mon, 25 Feb 2008 10:14:36 +0800 Dave Young wrote:
>
>  > On Sun, Feb 24, 2008 at 8:46 AM, Dave Jones <[EMAIL PROTECTED]> wrote:
>  > > The boot_delay switch seems to be behaving strangely in the
>  > >  current -git.  Setting it to =10 makes the output 'bursty'
>  > >  it becomes slow for some printk's whilst others scroll by
>  > >  at regular speed.
>  > >  Setting it any higher than that seems to make it pause for
>  > >  a really long time before it outputs any text at all.
>  >
>  > On my side there's this issue for a long time
>  > http://lkml.org/lkml/2007/8/8/79
>
>  [http://marc.info/?l=linux-kernel=118655896515049=2]
>
>  You asked questions and they were answered.  Perhaps you didn't like
>  the answers.

No, I like it.  Thanks.

But I still want to know why mdelay cannot be used.
Is it not available on all archs, or is there some other reason?

>
>  Here's a question for you.  What kernel boot options did you use?
>  Specifically, for lpj= and boot_delay= ?

I tried boot_delay=100 and boot_delay=200 without lpj set. The result
was really slow. It was better with lpj copied from dmesg, but was
still slower than mdelay.

I think we can first use the preset lpj and then, once delay calibration
has run, just use the system lpj.
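
Roughly what that would look like, as a sketch only.  The helper names
are invented for illustration, HZ and the lpj values are passed in as
plain arguments rather than read from the kernel, and "calibrated lpj
is zero until calibration has run" is an assumption of this model:

```c
/* Use the calibrated value once it exists, otherwise the boot-time preset. */
static unsigned long demo_pick_lpj(unsigned long preset_lpj,
				   unsigned long calibrated_lpj)
{
	return calibrated_lpj ? calibrated_lpj : preset_lpj;
}

/* loops per millisecond = loops per jiffy * jiffies per second / 1000 */
static unsigned long long demo_loops_per_msec(unsigned long lpj,
					      unsigned int hz)
{
	return (unsigned long long)lpj * hz / 1000;
}

/* Busy-wait iterations needed for one boot_delay= interval. */
static unsigned long long demo_delay_loops(unsigned long lpj, unsigned int hz,
					   unsigned int boot_delay_ms)
{
	return demo_loops_per_msec(lpj, hz) * boot_delay_ms;
}
```

If the preset is far from the real loops-per-jiffy, every delay before
calibration is off by the same factor, which would fit the very slow
early output described above.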

>
>  > >
>  > >  x86 timer changes perhaps ?
>
>
>  ---
>  ~Randy
>


Re: [patch 3/6] mempolicy: add MPOL_F_STATIC_NODES flag

2008-02-25 Thread Paul Jackson
David wrote:
+static inline int mpol_store_user_nodemask(const struct mempolicy *pol)
+{
+   return !!(pol->flags & MPOL_F_STATIC_NODES);
+}

Why the double-negative?  As best as I can tell, the return value of
mpol_store_user_nodemask() is only used in conditional contexts:

$ grep mpol_store_user_nodemask mm/mempolicy.c
static inline int mpol_store_user_nodemask(const struct mempolicy *pol)
if (mpol_store_user_nodemask(policy))
if (!mpol_store_user_nodemask(a))
if (!mpol_store_user_nodemask(pol) &&

So I see no need to waste the instructions needed (in the three copies
of this code, since it's static inline) to convert a non-zero value to
exactly the value 1.
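
To make the point concrete, here is a small stand-alone comparison.
The struct, the helper names and the flag value are all invented for
this sketch and are not taken from the patch:

```c
#define DEMO_F_STATIC_NODES (1 << 15) /* illustrative flag bit only */

struct demo_policy {
	unsigned int flags;
};

/* Normalised: always returns exactly 0 or 1. */
static inline int demo_store_nodemask_norm(const struct demo_policy *pol)
{
	return !!(pol->flags & DEMO_F_STATIC_NODES);
}

/* Unnormalised: returns 0 or the raw flag bit. */
static inline int demo_store_nodemask_raw(const struct demo_policy *pol)
{
	return pol->flags & DEMO_F_STATIC_NODES;
}
```

In an if () or under a ! both forms select the same branch.  The
normalisation only matters when the result is compared against 1,
stored in a narrower type, or when the flag bit would not fit in the
return type.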

Hmmm ... speaking of static inline ... I can knock 600 bytes (that's
IA64 bytes, so equivalent to about 300 x86 bytes) off the kernel text
size by not inlining the mm/mempolicy.c routines check_pgd_range() and
interleave_nid().  I wonder if that would be worth doing.  Perhaps
those two routines are in sufficiently tight corners that the duplicate
copies of them are needed.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214


Re: Linux 2.6.24.3

2008-02-25 Thread Samuel Masham
On Tue, Feb 26, 2008 at 10:00 AM, Greg Kroah-Hartman <[EMAIL PROTECTED]> wrote:
> We (the -stable team) are announcing the release of the 2.6.24.3
>  kernel.
>

Hi Greg, Stable people :)

Can you confirm you have the mips irq probe crash fix in your queue
for the next stable release?

See:

http://lkml.org/lkml/2008/2/24/255

Is there anything more I should do to help the process along?

Thanks

Samuel


iwl4965 dropping packets and __dev_addr_discard: address leakage! da_users=1

2008-02-25 Thread Tim Connors
Possibly because of the frequent renegotiating my iwl4965 card has
been making, it has now decided it's not going to pass packets
reliably until presumably next time I reboot.

I've noticed messages in syslog that I hadn't seen when things were
working fine.  The problem possibly has only surfaced after rmmodding
the iwl4965 module, then remodding it.  It stays with me through being
removed and modprobed again.  It is being bonded in conjunction with a
physical ethernet, if that is relevant, although I think I reproduced
it when it was on its lonesome.


Fresh after reboot:

Feb 22 09:49:31 dirac kernel: iwl4965: Intel(R) Wireless WiFi Link 4965AGN driver for Linux, 1.1.17kds
Feb 22 09:49:31 dirac kernel: iwl4965: Copyright(c) 2003-2007 Intel Corporation
Feb 22 09:49:31 dirac kernel: ACPI: PCI Interrupt :0c:00.0[A] -> GSI 17 (level, low) -> IRQ 17
Feb 22 09:49:31 dirac kernel: PCI: Setting latency timer of device :0c:00.0 to 64
Feb 22 09:49:31 dirac kernel: iwl4965: Detected Intel Wireless WiFi Link 4965AGN
Feb 22 09:49:31 dirac kernel: ACPI: PCI Interrupt :00:1b.0[A] -> GSI 21 (level, low) -> IRQ 21
Feb 22 09:49:31 dirac kernel: PCI: Setting latency timer of device :00:1b.0 to 64
Feb 22 09:49:31 dirac kernel: iwl4965: Tunable channels: 11 802.11bg, 13 802.11a channels
Feb 22 09:49:31 dirac kernel: phy0: Selected rate control algorithm 'iwl-4965-rs'

now finished with it, will be deconfigured and rmmodded:

Feb 23 03:05:23 dirac kernel: __dev_addr_discard: address leakage! da_users=1
Feb 23 03:05:23 dirac kernel: ACPI: PCI interrupt for device :0c:00.0 disabled

Then modprobed again:

Feb 23 03:05:35 dirac kernel: iwl4965: Intel(R) Wireless WiFi Link 4965AGN driver for Linux, 1.1.17kds
Feb 23 03:05:35 dirac kernel: iwl4965: Copyright(c) 2003-2007 Intel Corporation
Feb 23 03:05:35 dirac kernel: ACPI: PCI Interrupt :0c:00.0[A] -> GSI 17 (level, low) -> IRQ 17
Feb 23 03:05:35 dirac kernel: PCI: Setting latency timer of device :0c:00.0 to 64
Feb 23 03:05:35 dirac kernel: iwl4965: Detected Intel Wireless WiFi Link 4965AGN
Feb 23 03:05:35 dirac kernel: iwl4965: Tunable channels: 11 802.11bg, 13 802.11a channels
Feb 23 03:05:35 dirac kernel: phy9: Selected rate control algorithm 'iwl-4965-rs'

according to lspci -vvv, 0c:00.0 is:

0c:00.0 Network controller: Intel Corporation PRO/Wireless 4965 AG or AGN Network Connection (rev 61)
	Subsystem: Intel Corporation Unknown device 1120
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- 
	Capabilities: [140] Device Serial Number d7-36-9b-ff-ff-e8-13-00
	Kernel driver in use: iwl4965
	Kernel modules: iwl4965


Re: [PATCH] x86_64: make amd quad core 8 socket system not be clustered_box v2

2008-02-25 Thread Yinghai Lu
On Mon, Feb 25, 2008 at 8:05 PM, Ravikiran Thirumalai <[EMAIL PROTECTED]> wrote:
> On Tue, Feb 26, 2008 at 04:46:25AM +0100, Andi Kleen wrote:
>  >> I don't quite understand the purpose of the patch to begin with.  Looking at
>  >> the current x86 git tree, apic_is_clustered_box is used only to determine if
>  >> tsc is synchronized on the platform.  The AMD docs  imply that TSC's are not
>  >> guaranteed to be synced across cores between nodes -- Opteron BKDG for
>  >> family 10h, Section 2.9.4:
>  >
>  >After long discussions with AMD they determined the CPUID flag
>  >for sync RDTSC will imply synchronization between nodes.
>
>  Ah!
>
>
>  >
>  >If you can't support that in your hardware you're supposed
>  >to clear it.
>
>  Hmm! How would a hardware vendor do that? That doesn't seem to be clear in
>  the BKDG. (Well, this is the problem with undocumented features :()
>
any good sign for APIC_clustered box? there is apicid between cpus
even all cpu are quadcore and fully populated?

YH


Re: boot_delay broken ?

2008-02-25 Thread Randy Dunlap
On Mon, 25 Feb 2008 10:14:36 +0800 Dave Young wrote:

> On Sun, Feb 24, 2008 at 8:46 AM, Dave Jones <[EMAIL PROTECTED]> wrote:
> > The boot_delay switch seems to be behaving strangely in the
> >  current -git.  Setting it to =10 makes the output 'bursty'
> >  it becomes slow for some printk's whilst others scroll by
> >  at regular speed.
> >  Setting it any higher than that seems to make it pause for
> >  a really long time before it outputs any text at all.
> 
> On my side there's this issue for a long time
> http://lkml.org/lkml/2007/8/8/79

[http://marc.info/?l=linux-kernel=118655896515049=2]

You asked questions and they were answered.  Perhaps you didn't like
the answers.

Here's a question for you.  What kernel boot options did you use?
Specifically, for lpj= and boot_delay= ?

> >
> >  x86 timer changes perhaps ?


---
~Randy


Re: [Pcihpd-discuss] 2.6.25-rc3 -- SHPC hotplug driver - very long timeouts?

2008-02-25 Thread Greg KH
On Mon, Feb 25, 2008 at 11:30:03PM -0500, Miles Lane wrote:
> Hello,
> 
> When I booted this kernel, the process was hugely delayed in shpchp.
> I don't think I usually build this driver, so perhaps this is its
> standard behavior when the hardware is missing, or some such?  Or, is
> this a bug in the driver?  Either way, the timeouts seem excessively
> long.
> 
> [   24.471137] CPA self-test:
> [   24.474565]  4k 8192 large 216 gb 0 x 8408[c000-f7fff000] miss 0
> [   24.491200]  4k 196608 large 32 gb 0 x 196640[c000-f7fff000] miss 0
> [   24.503549]  4k 196608 large 32 gb 0 x 196640[c000-f7fff000] miss 0
> [   24.504440] ok.
> [   42.209886] shpchp: gave up waiting for init of module pci_hotplug.
> [   42.213190] shpchp: Unknown symbol acpi_run_oshp
> [   72.052520] shpchp: gave up waiting for init of module pci_hotplug.
> [   72.055826] shpchp: Unknown symbol pci_hp_change_slot_info
> [  101.931221] shpchp: gave up waiting for init of module pci_hotplug.
> [  101.934526] shpchp: Unknown symbol pci_hp_register
> [  131.789952] shpchp: gave up waiting for init of module pci_hotplug.
> [  131.793258] shpchp: Unknown symbol pci_hp_deregister
> [  161.683306] shpchp: gave up waiting for init of module pci_hotplug.
> [  161.686611] shpchp: Unknown symbol acpi_get_hp_params_from_firmware
> [  162.935681] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4

This is not a hotplug-specific bug, but rather one in the module loading
logic.  People who had seen this on -rc1 said it went away in -rc3.  I
suggest poking Rusty about it...

thanks,

greg k-h


Re: [PATCH] maple: remove unused variable

2008-02-25 Thread Paul Mundt
On Sat, Feb 16, 2008 at 11:37:33PM +, Adrian McMenamin wrote:
> Remove an unused variable from the definition of struct maple_device
> 
> Signed-off-by: Adrian McMenamin <[EMAIL PROTECTED]>
> ---
> 
> diff -ruN a/include/linux/maple.h b/include/linux/maple.h
> --- a/include/linux/maple.h   2008-02-16 20:52:09.0 +
> +++ b/include/linux/maple.h   2008-02-16 21:42:06.0 +
> @@ -64,7 +64,6 @@
>   int (*connect) (struct maple_device * dev);
>   void (*disconnect) (struct maple_device * dev);
>   struct device_driver drv;
> - int registered;
>  };
>  
Applied, thanks.


Re: [PATCH] arch/sh/drivers/heartbeat.c ioremap is expected to succeed

2008-02-25 Thread Paul Mundt
On Mon, Feb 18, 2008 at 02:09:10PM +0100, Roel Kluin wrote:
> !unlikely(hd->base) is equivalent to likely(!hd->base) (for instance see
> comments with commit fd312561adcc90e924f35d3032d5493aeb4c3017), I think
> the ioremap is expected to succeed? please confirm that's right.
> The patch below was *not* tested.
> ---
> ioremap is expected to succeed
> 
> Signed-off-by: Roel Kluin <[EMAIL PROTECTED]>

Indeed it is. Applied, thanks.
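
For anyone following along, the equivalence quoted above can be shown
with a small stand-alone sketch.  The demo_* macros below are written
in the style of the kernel's likely()/unlikely() and need GCC or Clang
for __builtin_expect; the function names and the -1 return value are
made up for illustration and are not the driver's code:

```c
#include <stddef.h>

/* Branch-hint macros in the style of the kernel's; GCC/Clang only. */
#define demo_likely(x)   __builtin_expect(!!(x), 1)
#define demo_unlikely(x) __builtin_expect(!!(x), 0)

/* Hint reads "base is usually NULL", i.e. the mapping usually fails. */
static int demo_check_misplaced(const void *base)
{
	if (!demo_unlikely(base))
		return -1;
	return 0;
}

/* Hint reads "base is rarely NULL", i.e. the mapping usually succeeds. */
static int demo_check_intended(const void *base)
{
	if (demo_unlikely(!base))
		return -1;
	return 0;
}
```

Both functions return the same result for every input; only the hint
given to the compiler about which branch is hot differs.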


Re: [PATCH] maple: fix device detection

2008-02-25 Thread Paul Mundt
On Mon, Feb 25, 2008 at 07:40:26AM +, Adrian McMenamin wrote:
> On Mon, 2008-02-25 at 14:33 +0900, Paul Mundt wrote:
> > On Sun, Feb 24, 2008 at 10:32:53PM +, Adrian McMenamin wrote:
> > > On Sun, 2008-02-24 at 21:50 +, Adrian McMenamin wrote:
> > > > On Sun, 2008-02-24 at 14:30 +, Adrian McMenamin wrote:
> > > > > The maple bus driver that went into the kernel mainline in
> > > > > September 2007 contained some bugs which were revealed by the
> > > > > update of the kobj code for the current release series.
> > > > > Unfortunately those bugs also helped ensure maple devices were
> > > > > properly detected. This patch (against the current git) now ensures
> > > > > that devices are properly detected again.
> > > > > 
> > > > 
> > > > Further testing has shown this has introduced another bug, this time
> > > > limiting the effectiveness of subdevice detection. Please ignore this
> > > > while I work on a fix.
> > > > 
> > > Sorry for the confusion, in fact there is nothing wrong with this code
> > > (ie it should be applied), the error was in the driver for the Dreamcast
> > > controller (the device, in general, into which the subdevices are
> > > plugged in and out).
> > > 
> > > I will post a fix for that.
> > > 
> > > Sorry again.
> > > 
> > So what exactly is supposed to be applied here?
> 
> The patch at the start of this thread - ie
> http://lkml.org/lkml/2008/2/24/125  - this should really go in now as it
> fixes a problem with current code.
> 
Ok, that's applied. Note that the original body was horribly word
wrapped, and your patch was not in -p1 format (while others in the series
are, for reasons that aren't entirely obvious).

> In addition there are two patch sets to add new device support:
> 
> http://lkml.org/lkml/2008/2/24/211  - maple controller
> 
> http://lkml.org/lkml/2008/2/24/121  (thread) - maple mouse
> 
This is not 2.6.25 material. Once you have Acked-by's from the input
folks, I'll queue these up in my 2.6.26 tree. The bus unplug thing is
rather unusual, I don't know if Greg has any comments on that or not.


Re: [PATCH 3/4] autofs4 - track uid and gid of last mount requestor - correction

2008-02-25 Thread Ian Kent
On Tue, 26 Feb 2008, Ian Kent wrote:
> +
> + /* Set mount requestor */
> + if (ino) {
> + if (ino) {
> + ino->uid = wq->uid;
> + ino->gid = wq->gid;
> + }
> + }
> +

As has been spotted, this is obviously wrong.
And here is the correction.

Signed-off-by: Ian Kent <[EMAIL PROTECTED]>

Ian

---
diff -up linux-2.6.25-rc2-mm1/fs/autofs4/waitq.c.track-last-mount-ids-fix linux-2.6.25-rc2-mm1/fs/autofs4/waitq.c
--- linux-2.6.25-rc2-mm1/fs/autofs4/waitq.c.track-last-mount-ids-fix	2008-02-26 14:02:05.0 +0900
+++ linux-2.6.25-rc2-mm1/fs/autofs4/waitq.c	2008-02-26 14:02:20.0 +0900
@@ -385,10 +385,8 @@ int autofs4_wait(struct autofs_sb_info *
 
 	/* Set mount requestor */
 	if (ino) {
-		if (ino) {
-			ino->uid = wq->uid;
-			ino->gid = wq->gid;
-		}
+		ino->uid = wq->uid;
+		ino->gid = wq->gid;
 	}
 
 	if (de)


Re: [PATCH] x86: add the debugfs interface for the sysprof tool

2008-02-25 Thread Anton Blanchard
 
> It was only later I tried oprofile and found it not only much more
> difficult to use, but also much less useful when I did get it to work.

This surprises me. Can you please elaborate on why oprofile is "much
less useful" than sysprof?

Anton - who has used oprofile to analyse and tune databases, JVMs,
compilers and operating systems. Maybe I've been missing out on
the killer app for all this time!!!


Re: [PATCH] x86: add the debugfs interface for the sysprof tool

2008-02-25 Thread Anton Blanchard

Hi Peter,
 
> Usable for me is a cli interface. Also, I absolutely love opannotate.

I definitely agree there.

It's interesting to note that sysprof requires you to run the GUI as
root in order to work. Maybe Ingo and Arjan are confident there are no
bugs in all the libraries that sysprof links to:

# ldd `which sysprof` | wc -l
39

I'm not.

Actually before someone converted it to debugfs, it was even worse, the
sysprof kernel module exported all profiling information to the world:

-r--r--r-- 1 root root 2060 2008-02-25 23:00 /proc/sysprof-trace

Anton


Re: [patch 1/6] mempolicy: convert MPOL constants to enum

2008-02-25 Thread Paul Jackson
David wrote:
+   /* add additional MPOL_* modes here */

That doesn't explicitly say what I was most worried about saying, which
is that those MPOL_* terms have values visible in the kernel's public
API, and it does say more than might be true, and it doesn't explain
why it says what it says.

It kinda looks like an ugly "maybe this will shut Paul up" patch.
I'd rather leave the code the way it was than add that
comment ... sorry.

>  I'd like to avoid respinning this set

Ah - now we get to the real issue ?;).

There would be no need to respin; one could do just as you proposed
doing with the above change, queue a little add-on patch to the
existing set.

Really ... look around the kernel.  I believe you'll see many instances
of enum values being spelled out, even ones that count 0, 1, 2, ...,
in situations where the values are constrained by outside forces.
People really do avoid relying on the default enum behaviour having any
particular numbering.
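
A minimal sketch of the convention described above, not code from the
patch set. The SKETCH_* names and sketch_mode_is_valid() are made up for
illustration; the values 0..3 are the ones proposed elsewhere in this
thread. It uses C11 _Static_assert, where kernel code would more likely
use BUILD_BUG_ON().

```c
#include <assert.h>

/*
 * Hypothetical SKETCH_* names stand in for the MPOL_* modes.  Each
 * value is spelled out because user space is assumed to depend on it.
 */
enum sketch_mpol_mode {
	SKETCH_MPOL_DEFAULT	= 0,
	SKETCH_MPOL_PREFERRED	= 1,
	SKETCH_MPOL_BIND	= 2,
	SKETCH_MPOL_INTERLEAVE	= 3,
	SKETCH_MPOL_MAX,	/* one past the last mode */
};

/* Compile-time checks: an edit that shifts a value breaks the build. */
_Static_assert(SKETCH_MPOL_DEFAULT == 0, "user-visible value changed");
_Static_assert(SKETCH_MPOL_INTERLEAVE == 3, "user-visible value changed");
_Static_assert(SKETCH_MPOL_MAX == 4, "mode count changed");

/* Range-check a mode number that arrived from outside, e.g. a syscall. */
int sketch_mode_is_valid(int mode)
{
	return mode >= SKETCH_MPOL_DEFAULT && mode < SKETCH_MPOL_MAX;
}
```

The explicit initializers do not enforce anything by themselves; they
document that the numbering is fixed, and the static assertions turn an
accidental renumbering into a build failure.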

Whatever ... do as you will.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214


[PATCH 2/4] autofs4 - add mount option to display mount device

2008-02-25 Thread Ian Kent
Hi Andrew,

Patch to add a display mount option to show the device number
of the autofs mount super block.

Signed-off-by: Ian Kent < [EMAIL PROTECTED]>

Ian

---
diff -up linux-2.6.25-rc2-mm1/fs/autofs4/inode.c.add-mount-device-display-option linux-2.6.25-rc2-mm1/fs/autofs4/inode.c
--- linux-2.6.25-rc2-mm1/fs/autofs4/inode.c.add-mount-device-display-option 2008-02-20 13:01:06.0 +0900
+++ linux-2.6.25-rc2-mm1/fs/autofs4/inode.c 2008-02-20 13:03:45.0 +0900
@@ -190,6 +190,7 @@ static int autofs4_show_options(struct s
seq_printf(m, ",timeout=%lu", sbi->exp_timeout/HZ);
seq_printf(m, ",minproto=%d", sbi->min_proto);
seq_printf(m, ",maxproto=%d", sbi->max_proto);
+   seq_printf(m, ",dev=%d", autofs4_get_dev(sbi));
 
if (sbi->type & AUTOFS_TYPE_OFFSET)
seq_printf(m, ",offset");
@@ -332,7 +333,7 @@ int autofs4_fill_super(struct super_bloc
sbi->sb = s;
sbi->version = 0;
sbi->sub_version = 0;
-   sbi->type = 0;
+   sbi->type = AUTOFS_TYPE_INDIRECT;
sbi->min_proto = 0;
sbi->max_proto = 0;
mutex_init(&sbi->wq_mutex);
 


Re: [PATCH] cgroup: fix default notify_on_release setting

2008-02-25 Thread Paul Jackson
>  @@ -2242,6 +2241,9 @@ static long cgroup_create(struct cgroup *parent, 
> struct dentry *dentry,
> cgrp->root = parent->root;
> cgrp->top_cgroup = parent->top_cgroup;
>
>  +   if (notify_on_release(parent))
>  +   set_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);


Good catch, Li Zefan - thanks.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214


Re: 2.6.25-rc1/2 CD/DVD burning broken

2008-02-25 Thread Borislav Petkov
On Mon, Feb 25, 2008 at 11:08:55PM +0100, Andreas Schwab wrote:
> Borislav Petkov <[EMAIL PROTECTED]> writes:
> 
> > On Mon, Feb 25, 2008 at 08:38:22PM +0100, Andreas Schwab wrote:
> >> "Kiyoshi Ueda" <[EMAIL PROTECTED]> writes:
> >> 
> >> > I'm looking at this problem, but currently no idea why the conversion
> >> > to blk_end_request causes it.
> >> 
> >> cdrom_newpc_intr apparently never sets rq->sense_len.
> >> 
> >
> > actually it does, see the code chunk around line 1188 in 2.6.25-rc2, for
> > example.
> 
> Yes, it does, but it always adds zero.

yep, true. Does that fix your dvd burning problem?

> Move counting of sense bytes into the transfer loop.
> 
> Signed-off-by: Andreas Schwab <[EMAIL PROTECTED]>
> 
> ---
>  drivers/ide/ide-cd.c |5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> --- linux-2.6.25-rc3.orig/drivers/ide/ide-cd.c 2008-02-25 01:03:31.0 +0100
> +++ linux-2.6.25-rc3/drivers/ide/ide-cd.c 2008-02-25 22:54:42.0 +0100
> @@ -1182,11 +1182,10 @@ static ide_startstop_t cdrom_newpc_intr(
>   else
>   rq->data += blen;
>   }
> + if (!write && blk_sense_request(rq))
> + rq->sense_len += blen;
>   }
>  
> - if (write && blk_sense_request(rq))
> - rq->sense_len += thislen;
> -
>   /*
>* pad, if necessary
>*/
> 
> Andreas.
> 
> -- 
> Andreas Schwab, SuSE Labs, [EMAIL PROTECTED]
> SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
> PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
> "And now for something completely different."

-- 
Regards/Gruß,
Boris.


Re: [patch 1/6] mempolicy: convert MPOL constants to enum

2008-02-25 Thread David Rientjes
On Mon, 25 Feb 2008, Paul Jackson wrote:

> Eh ... each bit of added clarity to the code reduces
> errors.
> 
> Looking around the kernel, it really seems to me to be
> a common coding practice to explicitly spell out enum values
> when for some reason they actually matter.
> 
> Like most coding mechanisms, nothing guarantees anything.
> 
> It just nicely represents one particular detail, that
> the values of these MPOL_* terms are not arbitrary.
> 

Of course the MPOL_* modes aren't arbitrary; they are defined in an enum 
that has a well-defined and explicit behavior for how they are mapped to 
int values based on a standard.

I have more mempolicy patches that add additional behavior and cleanups so 
I can queue the following for a later posting.  I'd like to avoid 
respinning this set unless there are actual design or implementation 
concerns that are raised. 
---
 include/linux/mempolicy.h |4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -20,7 +20,9 @@ enum {
MPOL_PREFERRED,
MPOL_BIND,
MPOL_INTERLEAVE,
-   MPOL_MAX,   /* always last member of enum */
+   /* add additional MPOL_* modes here */
+
+   MPOL_MAX,
 };
 
 /* Flags for set_mempolicy */


Re: [PATCH] x86: add the debugfs interface for the sysprof tool

2008-02-25 Thread Anton Blanchard
 
> > From: Soren Sandmann <[EMAIL PROTECTED]>
> > Subject: [PATCH] x86: add the debugfs interface for the sysprof tool
> > 
> > The sysprof tool is a very easy to use GUI tool to find out where 
> > userspace is spending CPU time. See 
> > http://www.daimi.au.dk/~sandmann/sysprof/ for more information and 
> > screenshots on this tool.
> > 
> > Sysprof needs a 200 line kernel module to do its work, this module 
> > puts some simple profiling data into debugfs.
> 
> thanks, looks good to me - applied.

Woah slow down guys. Did I miss the review?

Yes it's a 200 line patch, but it's a 200 line x86 patch. Surely we
should apply some of the same rigour we did when we merged the oprofile
patch? Is it biarch safe? Will it run on powerpc, arm etc?

I'm still struggling to understand why we need this functionality at
all. Let's argue the ABI and not cloud it with a discussion about
userspace eye candy. What does this give you that is an improvement over
the oprofile kernel-user ABI?

Anton


Re: new regression in 2.6.25-rc3: no keyboard/lid acpi events on thinkpad T61p - resume hang

2008-02-25 Thread Jeff Chua


On Tue, Feb 26, 2008 at 4:45 AM, Michael S. Tsirkin <[EMAIL PROTECTED]> wrote:
On Mon, Feb 25, 2008 at 9:46 PM, Andrew Morton <[EMAIL PROTECTED]> wrote:
> On Mon, 25 Feb 2008 21:19:24 +0200 "Michael S. Tsirkin" <[EMAIL PROTECTED]> wrote:

 >  You mean suspend-to-ram works correctly on your t61p?
 >  Mine suspends, then five seconds later magically resumes itself and the
 >  screen is all black.
 Sorry, have not noticed what you were asking about.
 Yes, rc2 seems to suspend/resume fine.
 And after reverting
 revert commit 559bbe6cbd0d8c68d40076a5f7dc98e3bf5864b2.


commit 559bbe6cbd0d8c68d40076a5f7dc98e3bf5864b2
Author: Pavel Machek <[EMAIL PROTECTED]>
Date:   Thu Feb 21 13:56:55 2008 +0100

power_state: get rid of write-only variable in SATA

power_state is scheduled for removal, and libata uses it in write-only
mode. Remove it.

Signed-off-by: Pavel Machek <[EMAIL PROTECTED]>
Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]>


I'm experiencing a hang after resume from STR with Linus's latest git 
tree. Reverting the above patch solved the problem.



Thanks,
Jeff


Here's the patch for reference ...


diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 4cf8662..9812bbf 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -6560,8 +6560,6 @@ int ata_host_suspend(struct ata_host *host, pm_message_t mesg)
ata_lpm_enable(host);

rc = ata_host_request_pm(host, mesg, 0, ATA_EHI_QUIET, 1);
-   if (rc == 0)
-   host->dev->power.power_state = mesg;
return rc;
 }

@@ -6580,7 +6578,6 @@ void ata_host_resume(struct ata_host *host)
 {
ata_host_request_pm(host, PMSG_ON, ATA_EH_SOFTRESET,
ATA_EHI_NO_AUTOPSY | ATA_EHI_QUIET, 0);
-   host->dev->power.power_state = PMSG_ON;

/* reenable link pm */
ata_lpm_disable(host);


Re: [PATCH] x86_64: make amd quad core 8 socket system not be clustered_box v2

2008-02-25 Thread Ravikiran Thirumalai
On Tue, Feb 26, 2008 at 04:46:25AM +0100, Andi Kleen wrote:
>> I don't quite understand the purpose of the patch to begin with.  Looking at
>> the current x86 git tree, apic_is_clustered_box is used only to determine if
>> tsc is synchronized on the platform.  The AMD docs  imply that TSC's are not
>> guaranteed to be synced across cores between nodes -- Opteron BKDG for
>> family 10h, Section 2.9.4:
>
>After long discussions with AMD they determined the CPUID flag
>for sync RDTSC will imply synchronization between nodes.

Ah!

>
>If you can't support that in your hardware you're supposed
>to clear it.

Hmm! How would a hardware vendor do that? That doesn't seem to be clear in
the BKDG. (Well, this is the problem with undocumented features :()

Thanks,
Kiran


Re: [patch 1/6] mempolicy: convert MPOL constants to enum

2008-02-25 Thread Paul Jackson
David wrote:
> I don't suspect that a kernel developer is going
> to make such an egregious error.

Eh ... each bit of added clarity to the code reduces
errors.

Looking around the kernel, it really seems to me to be
a common coding practice to explicitly spell out enum values
when for some reason they actually matter.

Like most coding mechanisms, nothing guarantees anything.

It just nicely represents one particular detail, that
the values of these MPOL_* terms are not arbitrary.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214


Re: Announce: Linux-next (Or Andrew's dream :-))

2008-02-25 Thread Stephen Rothwell
Hi Russell,

On Sat, 16 Feb 2008 00:09:43 + Russell King <[EMAIL PROTECTED]> wrote:
>
> On Tue, Feb 12, 2008 at 12:02:08PM +1100, Stephen Rothwell wrote:
> > I will attempt to build the tree between each merge (and a failed build
> > will again cause the offending tree to be dropped).  These builds will be
> > necessarily restricted to probably one architecture/config.  I will build
> > the entire tree on as many architectures/configs as seem sensible and
> > the results of that will be available on a web page (to be announced).
> 
> This restriction means that the value for the ARM architecture is so
> limited it's probably not worth the hassle participating in this project.
> 
> We already know that -mm picks up on very few ARM conflicts because
> Andrew doesn't run through the entire set of configurations; unfortunately
> ARM is one of those architectures which is very diverse [*], and because
> of that, ideas like "allyesconfig" are just completely irrelevant to it.
> 
> As mentioned elsewhere, what we need for ARM is to extend the kautobuild
> infrastructure (see armlinux.simtec.co.uk) so that we can have more trees
> at least compile tested regularly - but that requires the folk there to
> have additional compute power (which isn't going to happen unless folk
> stamp up some machines _or_ funding).

I now have an arm cross compiler (gcc-4.0.2-glibc-2.3.6
arm-unknown-linux-gnu).  (See the results page at
http://kisskb.ellerman.id.au/kisskb/branch/9/ - I must get a better
name/place :-(.)  Is this sufficient to help you out?  What configs would
be useful to build (as Andrew said, they don't take very long each)?

I really want as many subsystems as possible in the linux-next tree in an
attempt to avoid some of the merge/conflict problems we have had in the
past.  What can we do to help?

-- 
Cheers,
Stephen Rothwell                    [EMAIL PROTECTED]
http://www.canb.auug.org.au/~sfr/




Re: [Bluez-devel] forcing SCO connection patch

2008-02-25 Thread Louis JANG
Hi Marcel


>> --- linux-2.6.23/net/bluetooth/hci_event.c.orig 2008-02-25
>> 17:17:11.0 +0900
>> +++ linux-2.6.23/net/bluetooth/hci_event.c 2008-02-25
>> 17:30:23.0 +0900
>> @@ -1313,8 +1313,17 @@
>> hci_dev_lock(hdev);
>>
>> conn = hci_conn_hash_lookup_ba(hdev, ev->link_type, &ev->bdaddr);
>> - if (!conn)
>> - goto unlock;
>> + if (!conn) {
>> + if (ev->link_type != ACL_LINK) {
>> + __u8 link_type = (ev->link_type == ESCO_LINK) ? SCO_LINK : ESCO_LINK;
>> +
>> + conn = hci_conn_hash_lookup_ba(hdev, link_type, &ev->bdaddr);
>> + if (conn)
>> + conn->type = ev->link_type;
>> + }
>> + if (!conn)
>> + goto unlock;
>> + }
>
> NAK. There is no need to check for ACL_LINK. The sync_complete will
> only be called for SCO or eSCO connections.
I see. I removed this check line in the patch.

Thanks.
Louis JANG
Signed-off-by: Louis JANG <[EMAIL PROTECTED]>
--- linux-2.6.23/net/bluetooth/hci_event.c.orig 2008-02-26 12:46:36.0 +0900
+++ linux-2.6.23/net/bluetooth/hci_event.c  2008-02-26 12:47:23.0 +0900
@@ -1313,8 +1313,15 @@
hci_dev_lock(hdev);
 
conn = hci_conn_hash_lookup_ba(hdev, ev->link_type, &ev->bdaddr);
-   if (!conn)
-   goto unlock;
+   if (!conn) {
+   __u8 link_type = (ev->link_type == ESCO_LINK) ? SCO_LINK : ESCO_LINK;
+
+   conn = hci_conn_hash_lookup_ba(hdev, link_type, &ev->bdaddr);
+   if (conn)
+   conn->type = ev->link_type;
+   else
+   goto unlock;
+   }
 
if (!ev->status) {
conn->handle = __le16_to_cpu(ev->handle);
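
The fallback lookup in the patch above can be sketched in user space as
follows. This is an illustration under stated assumptions, not the
kernel code: struct sketch_conn, the flat table and both functions are
hypothetical stand-ins for the hci_conn hash, and the SKETCH_* link
constants are local to the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical link types, local to this sketch. */
enum sketch_link { SKETCH_SCO_LINK = 0, SKETCH_ACL_LINK = 1, SKETCH_ESCO_LINK = 2 };

/* Hypothetical stand-in for a connection entry keyed by type and address. */
struct sketch_conn {
	int type;
	unsigned char bdaddr[6];
};

/* Exact-match lookup on (type, address); returns NULL when not found. */
struct sketch_conn *sketch_lookup(struct sketch_conn *tbl, size_t n,
				  int type, const unsigned char *addr)
{
	size_t k;

	for (k = 0; k < n; k++)
		if (tbl[k].type == type && memcmp(tbl[k].bdaddr, addr, 6) == 0)
			return &tbl[k];
	return NULL;
}

/*
 * Look up by the link type reported in the event.  If that fails, retry
 * with the other synchronous type and relabel the entry that was found.
 */
struct sketch_conn *sketch_lookup_sync(struct sketch_conn *tbl, size_t n,
				       int ev_type, const unsigned char *addr)
{
	struct sketch_conn *conn = sketch_lookup(tbl, n, ev_type, addr);

	if (!conn) {
		int other = (ev_type == SKETCH_ESCO_LINK) ?
				SKETCH_SCO_LINK : SKETCH_ESCO_LINK;

		conn = sketch_lookup(tbl, n, other, addr);
		if (conn)
			conn->type = ev_type;
	}
	return conn;
}
```

As in the revised patch, there is no separate ACL check: the caller is
assumed to pass only SCO or eSCO types.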


Re: Linux 2.6.24.3

2008-02-25 Thread Randy Dunlap
On Mon, 25 Feb 2008 17:00:24 -0800 Greg Kroah-Hartman wrote:

> We (the -stable team) are announcing the release of the 2.6.24.3
> kernel.
> 
> It fixes a number of different bugs and all users of the 2.6.24 series
> are encouraged to upgrade.
> 
> I'll also be replying to this message with a copy of the patch between
> 2.6.24.2 and 2.6.24.3
> 
> The updated 2.6.24.y git tree can be found at:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.24.y.git
> and can be browsed at the normal kernel.org git web browser:
> 
> http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.24.y.git;a=summary

When HEADERS_CHECK=y:

make[3]: *** No rule to make target `/local/linsrc/linux-2.6.24.3/include/linux/if_addrlabel.h', needed by `/local/linsrc/linux-2.6.24.3/usr/include/linux/if_addrlabel.h'.  Stop.
make[2]: *** [linux] Error 2

---
~Randy


Re: 2.6.24-sha1: RIP [] iov_iter_advance+0x38/0x70

2008-02-25 Thread Nick Piggin
On Wednesday 20 February 2008 09:01, Alexey Dobriyan wrote:
> On Tue, Feb 19, 2008 at 11:47:11PM +0300,  wrote:

> > > Are you reproducing it simply by running the
> > > ftest03 binary directly from the shell? How many times between oopses?
> > > It is multi-process but no threads, so races should be minimal down
> > > this path -- can you get an strace of the failing process?
>
> Speaking of multi-processness, changing MAXCHILD to 1, nchild to 1,
> AFAICS, generates one child which oopses the very same way (in parallel
> with generic LTP). But lowering MAXIOVCNT to 8 generates no oops.

Thanks, I was able to reproduce quite easily with these settings.
I think I have the correct patch now (at least it isn't triggerable
any more here).

Thanks,
Nick
diff --git a/mm/filemap.c b/mm/filemap.c
index 5c74b68..2650073 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1750,14 +1750,18 @@ static void __iov_iter_advance_iov(struct iov_iter *i, size_t bytes)
 	} else {
 		const struct iovec *iov = i->iov;
 		size_t base = i->iov_offset;
+		size_t copied = 0;
 
 		/*
 		 * The !iov->iov_len check ensures we skip over unlikely
-		 * zero-length segments.
+		 * zero-length segments (without overrunning the iovec).
 		 */
-		while (bytes || !iov->iov_len) {
-			int copy = min(bytes, iov->iov_len - base);
+		while (copied < bytes ||
+				unlikely(!iov->iov_len && copied < i->count)) {
+			int copy;
 
+			copy = min(bytes, iov->iov_len - base);
+			copied += copy;
 			bytes -= copy;
 			base += copy;
 			if (iov->iov_len == base) {
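
A user-space sketch of the idea behind this fix: step over zero-length
segments, but only while the iterator still has data left, so the walk
never reads a segment past the end of the array. This is one way to
express that intent and is not the patch itself; struct sketch_iovec,
struct sketch_iter and sketch_iter_advance() are hypothetical stand-ins
for the kernel types.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct iovec and struct iov_iter. */
struct sketch_iovec {
	void *base;
	size_t len;
};

struct sketch_iter {
	const struct sketch_iovec *iov;	/* current segment */
	size_t iov_offset;		/* offset into the current segment */
	size_t count;			/* bytes left in the whole iterator */
};

/* Advance the iterator by 'bytes', which must not exceed i->count. */
void sketch_iter_advance(struct sketch_iter *i, size_t bytes)
{
	const struct sketch_iovec *iov = i->iov;
	size_t base = i->iov_offset;
	size_t remaining;

	assert(bytes <= i->count);
	remaining = i->count - bytes;	/* bytes left after this advance */

	/*
	 * Keep going while bytes remain to be consumed, or while we sit
	 * on a zero-length segment and more data follows.  When nothing
	 * follows, 'remaining' is 0 and iov->len is never read.
	 */
	while (bytes || (remaining && !iov->len)) {
		size_t copy = iov->len - base;

		if (copy > bytes)
			copy = bytes;
		bytes -= copy;
		base += copy;
		if (iov->len == base) {
			iov++;
			base = 0;
		}
	}

	i->iov = iov;
	i->iov_offset = base;
	i->count = remaining;
}
```

With segments of length 4, 0, 0 and 3, advancing by 4 leaves the
iterator at the start of the last segment; consuming the final bytes
leaves it one past the end without touching that position.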


Re: [PATCH] x86_64: make amd quad core 8 socket system not be clustered_box v2

2008-02-25 Thread Andi Kleen
> I don't quite understand the purpose of the patch to begin with.  Looking at
> the current x86 git tree, apic_is_clustered_box is used only to determine if
> tsc is synchronized on the platform.  The AMD docs  imply that TSC's are not
> guaranteed to be synced across cores between nodes -- Opteron BKDG for
> family 10h, Section 2.9.4:

After long discussions with AMD they determined the CPUID flag
for sync RDTSC will imply synchronization between nodes.

If you can't support that in your hardware you're supposed
to clear it.

-Andi


Re: [PATCH] x86_64: make amd quad core 8 socket system not be clustered_box v2

2008-02-25 Thread Ravikiran Thirumalai
On Mon, Feb 25, 2008 at 02:05:45PM -0800, Yinghai Lu wrote:
>On Mon, Feb 25, 2008 at 11:08 AM, Ravikiran Thirumalai
><[EMAIL PROTECTED]> wrote:
>> ...
>>  Andi, Yes.  AMD based vSMPowered systems use clustered APICs, and this
>>  check based on cpu vendor id is not good :(.
>
>please check if you are happy with
>
>http://lkml.org/lkml/2008/2/24/273
>

I don't quite understand the purpose of the patch to begin with.  Looking at
the current x86 git tree, apic_is_clustered_box is used only to determine if
tsc is synchronized on the platform.  The AMD docs  imply that TSC's are not
guaranteed to be synced across cores between nodes -- Opteron BKDG for
family 10h, Section 2.9.4:


Note: Timers associated with different CPU cores in the same processor
increment at the same rate. Timers associated with different CPU cores
in different processors increment at slightly different rates if (1) they
are located on different nodes and (2) CLKIN for these nodes is derived from
different, non-synchronized oscillator sources.


But that is not what the x86 tree does (with your patches) -- it looks for
X86_FEATURE_CONSTANT_TSC in unsynchronized_tsc() and assumes a synchronized
clock.  Huh!??  Am I missing something here?  X86_FEATURE_CONSTANT_TSC
is set from CPUID Fn8000_0007 -- TscInvariant bit, which implies TSC is
not affected by changes in PM states.  It says nothing about whether CLKIN
for different packages comes from synchronized or non-synchronized
oscillator sources, as described in the quote above.
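
For reference, a small user-space sketch of reading the bit discussed
above: TscInvariant is bit 8 of EDX in CPUID Fn8000_0007. The function
and macro names are made up for this example, and it relies on the
<cpuid.h> helper shipped with GCC and Clang. It only reports the bit;
whether that bit also implies synchronization across nodes is the open
question in this thread.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

#define SKETCH_CPUID_APM_LEAF		0x80000007u
#define SKETCH_TSC_INVARIANT_BIT	(1u << 8)	/* EDX bit 8 */

/* Decode the TscInvariant bit from an EDX value of leaf 0x80000007. */
int sketch_edx_has_invariant_tsc(uint32_t edx)
{
	return (edx & SKETCH_TSC_INVARIANT_BIT) != 0;
}

/*
 * Returns 1 if the bit is set, 0 if it is clear, and -1 if the leaf is
 * not available or the build target is not x86.
 */
int sketch_cpu_has_invariant_tsc(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(SKETCH_CPUID_APM_LEAF, &eax, &ebx, &ecx, &edx))
		return -1;
	return sketch_edx_has_invariant_tsc(edx);
#else
	return -1;
#endif
}
```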

Thanks,
Kiran


Re: Configure MSI-X vectors to target different CPUs

2008-02-25 Thread caiying
Thanks, Robert. My device does support multiple vectors.

When looking into functions called by pci_enable_msix(), I found 
msi_compose_msg() in arch/i386/kernel/io_apic.c. It tries to get destination 
CPU (TARGET_CPUS) and set this information to msg->address_lo. My question is 
about TARGET_CPUS. Under the asm-i386/mach-default, it is the cpu_online_map. 
Under asm-i386/mach-bigsmp, it is the cpumask_of_cpu(cpu), where cpu is a 
single one. I would guess if a single CPU is set as destination, only that CPU 
will be interrupted. But what will happen when the cpu_online_map is set as 
destination? Any CPU can be interrupted then? Or depending on affinity of the 
corresponding irq?

Please CC me ([EMAIL PROTECTED]) on answers/comments in response to this
posting.

Thanks,
Ying

----- Original Message -----
From: Robert Hancock <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: linux-kernel@vger.kernel.org
Sent: Thursday, February 21, 2008 7:59:14 PM
Subject: Re: Configure MSI-X vectors to target different CPUs

[EMAIL PROTECTED] wrote:
> Hi,
>
> In MSI-HOWTO, it's said:
>
> "Using MSI enables the device functions to support two or more vectors,
> which can be configured to target different CPUs to increase scalability."
>
> So how can I set up MSI-X vectors to target different CPUs? I want to
> allocate the same number of MSI-X vectors as CPUs, and equally distribute
> them to every CPU.
>
> Is it automatically done by Linux when I call pci_enable_msix()? If yes,
> how? If not, what should I do? My guess is to set the affinity of the
> interrupts manually. Am I right?
>
> Please CC'ed me ([EMAIL PROTECTED]) answers/comments in response to this
> posting.
>
> Thanks,
> Ying

If the device actually supports multiple vectors (not all do), I think they
should show up as separate interrupts in /proc/interrupts and you can
either set the affinity manually, or maybe irqbalance is smart enough for
this.

Careful, though, as in some cases this may reduce performance due to
causing more cache line bouncing between CPUs.
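
Setting the affinity manually, as suggested above, usually means writing
a hexadecimal CPU mask to /proc/irq/<irq>/smp_affinity. A minimal
user-space sketch follows; the function names are made up for this
example, it handles only CPUs 0..31 to keep the mask in one word, and
the IRQ numbers are assumed to be read from /proc/interrupts beforehand.
Writing the file normally requires root.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format the hex mask that selects a single CPU (0..31 in this sketch). */
int sketch_cpu_mask_hex(unsigned int cpu, char *buf, size_t len)
{
	int n;

	if (cpu >= 32)
		return -1;
	n = snprintf(buf, len, "%x", 1u << cpu);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Bind one IRQ to one CPU via procfs; returns 0 on success, -1 on error. */
int sketch_set_irq_affinity(unsigned int irq, unsigned int cpu)
{
	char path[64], mask[16];
	FILE *f;

	if (sketch_cpu_mask_hex(cpu, mask, sizeof(mask)))
		return -1;
	snprintf(path, sizeof(path), "/proc/irq/%u/smp_affinity", irq);

	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fprintf(f, "%s\n", mask) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}
```

To spread N vectors over N CPUs, a caller could loop over the IRQ
numbers of the device and call sketch_set_irq_affinity(irq[k], k),
keeping in mind the cache line bouncing caveat mentioned above.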





Re: [patch 1/6] mempolicy: convert MPOL constants to enum

2008-02-25 Thread David Rientjes
On Mon, 25 Feb 2008, Paul Jackson wrote:

> +enum {
> + MPOL_DEFAULT,
> + MPOL_PREFERRED,
> + MPOL_BIND,
> + MPOL_INTERLEAVE,
> + MPOL_MAX,   /* always last member of enum */
> 
> Aren't the values that these constants take part of the
> user visible kernel API?
> 
> In other words, if someone added another MPOL_* in the middle
> of this enum, it would break mbind/set_mempolicy/get_mempolicy
> users, right:
> 
> +enum {
> + MPOL_DEFAULT,
> + MPOL_PREFERRED,
> + MPOL_YET_ANOTHER_FLAG,  /* <== added flag ... oops */
> + MPOL_BIND,
> + MPOL_INTERLEAVE,
> + MPOL_MAX,   /* always last member of enum */
> 

I don't suspect that a kernel developer is going to make such an egregious 
error.  The user would need to be using a new linux/mempolicy.h with an 
old kernel to get the wrong behavior.

> I'm thinking that we should still specify the specific value
> of each of these flags, by way of documenting these necessary
> values, as in:
> 
> +enum {
> + MPOL_DEFAULT = 0,
> + MPOL_PREFERRED = 1,
> + MPOL_BIND = 2,
> + MPOL_INTERLEAVE = 3,
> + MPOL_MAX,   /* always last member of enum */
> 

That looks overly redundant to me and doesn't protect against adding 
MPOL_YET_ANOTHER_FLAG in the middle of preferred and bind to get two mode 
values with the int value of 1.
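
A small sketch of the point about duplicates, using hypothetical
SKETCH2_* names rather than the real MPOL_* constants. C accepts two
enumerators with the same value without complaint. In this sketch the
new enumerator has no initializer, so it takes the previous value plus
one and the collision lands on 2.

```c
#include <assert.h>

/* Hypothetical names; explicit values as proposed earlier in the thread. */
enum sketch_mode2 {
	SKETCH2_DEFAULT		= 0,
	SKETCH2_PREFERRED	= 1,
	SKETCH2_YET_ANOTHER,		/* no initializer: previous + 1 == 2 */
	SKETCH2_BIND		= 2,	/* same value, still compiles */
	SKETCH2_INTERLEAVE	= 3,
	SKETCH2_MAX,
};

/* Returns 1 when the two distinct names share one value. */
int sketch2_has_duplicate(void)
{
	return SKETCH2_YET_ANOTHER == SKETCH2_BIND;
}
```

A switch statement with a case for both names would fail to compile
with a duplicate case value, which is one place such a mistake tends to
surface.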

David


Re: [PATCH] Memory Resource Controller use strstrip while parsing arguments

2008-02-25 Thread Balbir Singh
Andrew Morton wrote:
> On Mon, 25 Feb 2008 23:57:46 +0530 Balbir Singh <[EMAIL PROTECTED]> wrote:
> 
>> The memory controller has a requirement that while writing values, we need
>> to use echo -n. This patch fixes the problem and makes the UI more 
>> consistent.
> 
> that's a decent improvement ;)
> 
> btw, could I ask that you, Paul and others who work on this and cgroups
> have a think about a ./MAINTAINERS update?

Aah.. yes.. we should do that.



-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


[PATCH 4/4] autofs4 - add miscellaneous device for ioctls

2008-02-25 Thread Ian Kent
Hi Andrew,

Patch to add miscellaneous device to autofs4 module for
ioctls.

Signed-off-by: Ian Kent < [EMAIL PROTECTED]>

Ian

---
diff -up linux-2.6.25-rc2-mm1/fs/autofs4/expire.c.device-node-ioctl linux-2.6.25-rc2-mm1/fs/autofs4/expire.c
--- linux-2.6.25-rc2-mm1/fs/autofs4/expire.c.device-node-ioctl 2008-01-25 07:58:37.0 +0900
+++ linux-2.6.25-rc2-mm1/fs/autofs4/expire.c 2008-02-22 11:51:41.0 +0900
@@ -244,10 +244,10 @@ cont:
 }
 
 /* Check if we can expire a direct mount (possibly a tree) */
-static struct dentry *autofs4_expire_direct(struct super_block *sb,
-   struct vfsmount *mnt,
-   struct autofs_sb_info *sbi,
-   int how)
+struct dentry *autofs4_expire_direct(struct super_block *sb,
+struct vfsmount *mnt,
+struct autofs_sb_info *sbi,
+int how)
 {
unsigned long timeout;
struct dentry *root = dget(sb->s_root);
@@ -281,10 +281,10 @@ static struct dentry *autofs4_expire_dir
  *  - it is unused by any user process
  *  - it has been unused for exp_timeout time
  */
-static struct dentry *autofs4_expire_indirect(struct super_block *sb,
- struct vfsmount *mnt,
- struct autofs_sb_info *sbi,
- int how)
+struct dentry *autofs4_expire_indirect(struct super_block *sb,
+  struct vfsmount *mnt,
+  struct autofs_sb_info *sbi,
+  int how)
 {
unsigned long timeout;
struct dentry *root = sb->s_root;
diff -up linux-2.6.25-rc2-mm1/fs/autofs4/init.c.device-node-ioctl linux-2.6.25-rc2-mm1/fs/autofs4/init.c
--- linux-2.6.25-rc2-mm1/fs/autofs4/init.c.device-node-ioctl 2008-01-25 07:58:37.0 +0900
+++ linux-2.6.25-rc2-mm1/fs/autofs4/init.c 2008-02-22 11:51:41.0 +0900
@@ -29,11 +29,20 @@ static struct file_system_type autofs_fs
 
 static int __init init_autofs4_fs(void)
 {
-   return register_filesystem(&autofs_fs_type);
+   int err;
+
+   err = register_filesystem(&autofs_fs_type);
+   if (err)
+   return err;
+
+   err = autofs_dev_ioctl_init();
+
+   return err;
 }
 
 static void __exit exit_autofs4_fs(void)
 {
+   autofs_dev_ioctl_exit();
unregister_filesystem(&autofs_fs_type);
 }
 
diff -up linux-2.6.25-rc2-mm1/fs/autofs4/autofs_i.h.device-node-ioctl linux-2.6.25-rc2-mm1/fs/autofs4/autofs_i.h
--- linux-2.6.25-rc2-mm1/fs/autofs4/autofs_i.h.device-node-ioctl 2008-02-22 11:51:41.0 +0900
+++ linux-2.6.25-rc2-mm1/fs/autofs4/autofs_i.h  2008-02-22 11:51:41.0 +0900
@@ -14,6 +14,7 @@
 /* Internal header file for autofs */
 
 #include 
+#include 
 #include 
 #include 
 
@@ -40,6 +41,16 @@
 #define DPRINTK(fmt,args...) do {} while(0)
 #endif
 
+#define WARN(fmt,args...) \
+do { \
+   printk(KERN_WARNING "pid %d: %s: " fmt "\n" , current->pid , __FUNCTION__ , ##args); \
+} while(0)
+
+#define ERROR(fmt,args...) \
+do { \
+   printk(KERN_ERR "pid %d: %s: " fmt "\n" , current->pid , __FUNCTION__ , ##args); \
+} while(0)
+
 /* Unified info structure.  This is pointed to by both the dentry and
inode structures.  Each file in the filesystem has an instance of this
structure.  It holds a reference to the dentry, so dentries are never
@@ -172,6 +183,17 @@ int autofs4_expire_run(struct super_bloc
struct autofs_packet_expire __user *);
 int autofs4_expire_multi(struct super_block *, struct vfsmount *,
struct autofs_sb_info *, int __user *);
+struct dentry *autofs4_expire_direct(struct super_block *sb,
+struct vfsmount *mnt,
+struct autofs_sb_info *sbi, int how);
+struct dentry *autofs4_expire_indirect(struct super_block *sb,
+  struct vfsmount *mnt,
+  struct autofs_sb_info *sbi, int how);
+
+/* Device node initialization */
+
+int autofs_dev_ioctl_init(void);
+void autofs_dev_ioctl_exit(void);
 
 /* Operations structures */
 
diff -up linux-2.6.25-rc2-mm1/fs/autofs4/Makefile.device-node-ioctl linux-2.6.25-rc2-mm1/fs/autofs4/Makefile
--- linux-2.6.25-rc2-mm1/fs/autofs4/Makefile.device-node-ioctl  2008-01-25 07:58:37.0 +0900
+++ linux-2.6.25-rc2-mm1/fs/autofs4/Makefile 2008-02-22 11:51:41.0 +0900
@@ -4,4 +4,4 @@
 
 obj-$(CONFIG_AUTOFS4_FS) += autofs4.o
 
-autofs4-objs := init.o inode.o root.o symlink.o waitq.o expire.o
+autofs4-objs := init.o inode.o root.o symlink.o waitq.o expire.o dev-ioctl.o
diff -up /dev/null linux-2.6.25-rc2-mm1/fs/autofs4/dev-ioctl.c
--- /dev/null   

[PATCH 3/4] autofs4 - track uid and gid of last mount requestor

2008-02-25 Thread Ian Kent
Hi Andrew,

Patch to track the uid and gid of the last process to request
a mount on an autofs dentry.

Signed-off-by: Ian Kent < [EMAIL PROTECTED]>

Ian

---
diff -up linux-2.6.25-rc2-mm1/fs/autofs4/inode.c.track-last-mount-ids linux-2.6.25-rc2-mm1/fs/autofs4/inode.c
--- linux-2.6.25-rc2-mm1/fs/autofs4/inode.c.track-last-mount-ids 2008-02-20 13:11:28.0 +0900
+++ linux-2.6.25-rc2-mm1/fs/autofs4/inode.c 2008-02-20 13:12:23.0 +0900
@@ -43,6 +43,8 @@ struct autofs_info *autofs4_init_ino(str
 
ino->flags = 0;
ino->mode = mode;
+   ino->uid = 0;
+   ino->gid = 0;
ino->inode = NULL;
ino->dentry = NULL;
ino->size = 0;
diff -up linux-2.6.25-rc2-mm1/fs/autofs4/autofs_i.h.track-last-mount-ids linux-2.6.25-rc2-mm1/fs/autofs4/autofs_i.h
--- linux-2.6.25-rc2-mm1/fs/autofs4/autofs_i.h.track-last-mount-ids 2008-02-20 13:14:03.0 +0900
+++ linux-2.6.25-rc2-mm1/fs/autofs4/autofs_i.h  2008-02-20 13:14:34.0 +0900
@@ -58,6 +58,9 @@ struct autofs_info {
unsigned long last_used;
atomic_t count;
 
+   uid_t uid;
+   gid_t gid;
+
mode_t  mode;
size_t  size;
 
diff -up linux-2.6.25-rc2-mm1/fs/autofs4/waitq.c.track-last-mount-ids 
linux-2.6.25-rc2-mm1/fs/autofs4/waitq.c
--- linux-2.6.25-rc2-mm1/fs/autofs4/waitq.c.track-last-mount-ids
2008-02-20 13:06:20.0 +0900
+++ linux-2.6.25-rc2-mm1/fs/autofs4/waitq.c 2008-02-20 13:10:23.0 
+0900
@@ -363,6 +363,38 @@ int autofs4_wait(struct autofs_sb_info *
 
status = wq->status;
 
+   /*
+* For direct and offset mounts we need to track the requestor
+* uid and gid in the dentry info struct. This is so it can be
+* supplied, on request, by the misc device ioctl interface.
+* This is needed during daemon restart when reconnecting
+* to existing, active, autofs mounts. The uid and gid (and
+* related string values) may be used for macro substitution
+* in autofs mount maps.
+*/
+   if (!status) {
+   struct dentry *de = NULL;
+
+   /* direct mount or browsable map */
+   ino = autofs4_dentry_ino(dentry);
+   if (!ino) {
+   /* If not lookup actual dentry used */
+   de = d_lookup(dentry->d_parent, &dentry->d_name);
+   ino = autofs4_dentry_ino(de);
+   }
+
+   /* Set mount requestor */
+   if (ino) {
+   if (ino) {
+   ino->uid = wq->uid;
+   ino->gid = wq->gid;
+   }
+   }
+
+   if (de)
+   dput(de);
+   }
+
/* Are we the last process to need status? */
if (atomic_dec_and_test(&wq->wait_ctr))
kfree(wq);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/4] autofs4 - check for invalid dentry in getpath

2008-02-25 Thread Ian Kent
Hi Andrew,

Patch to catch an invalid dentry when calculating its path.

Signed-off-by: Ian Kent <[EMAIL PROTECTED]>

Ian

---
diff -up linux-2.6.25-rc2-mm1/fs/autofs4/waitq.c.getpath-check-valid-dentry 
linux-2.6.25-rc2-mm1/fs/autofs4/waitq.c
--- linux-2.6.25-rc2-mm1/fs/autofs4/waitq.c.getpath-check-valid-dentry  
2008-02-20 12:55:39.0 +0900
+++ linux-2.6.25-rc2-mm1/fs/autofs4/waitq.c 2008-02-20 12:54:46.0 
+0900
@@ -171,7 +171,7 @@ static int autofs4_getpath(struct autofs
for (tmp = dentry ; tmp != root ; tmp = tmp->d_parent)
len += tmp->d_name.len + 1;
 
-   if (--len > NAME_MAX) {
+   if (!len || --len > NAME_MAX) {
spin_unlock(_lock);
return 0;
}


Re: [PATCH 00/10] CGroup API files: Various cleanup to CGroup control files

2008-02-25 Thread Li Zefan
[EMAIL PROTECTED] wrote:
> This patchset is a roll-up of the non-controversial items of the
> various patches that I've sent out recently, fixed according to the
> feedback received.
> 
> In summary they are:
> 
> - general rename of read_uint/write_uint to read_u64/write_u64
> 
> - use these methods for cpusets and memory controller files
> 
> - add a read_map cgroup file method, and use it in the memory
>   controller
> 
> - move the "releasable" control file to the debug subsystem
> 
> - make the debug subsystem config option default to "n"
> 
> The only user-visible changes are the movement of the "releasable"
> file and the fact that some write_u64()-based control files are now
> more forgiving of additional whitespace at the end of their input.
> 
> Signed-off-by: Paul Menage <[EMAIL PROTECTED]>
> 
> --
> --

Should those patches be rebased against 2.6.25-rc3?


[PATCH] cifs: remove GLOBAL_EXTERN macro

2008-02-25 Thread Harvey Harrison
Global variables should be defined in C files, not in headers.

1) Comment out unused vars
GlobalDnotifyRsp_Q
GlobalUidList

2) Declare vars in cifsfs.c that need it and change to extern in
cifsglob.h

3) Change to extern in cifsglob.h for vars that were already being
declared in cifsfs.c

4) Remove GLOBAL_EXTERN

Signed-off-by: Harvey Harrison <[EMAIL PROTECTED]>
---
Steven, here is a revised patch that has a bit more thought behind it.


 fs/cifs/cifsfs.c   |   31 -
 fs/cifs/cifsglob.h |   76 
 2 files changed, 64 insertions(+), 43 deletions(-)

diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index fcc4342..aae6752 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -37,7 +37,6 @@
 #include 
 #include "cifsfs.h"
 #include "cifspdu.h"
-#define DECLARE_GLOBALS_HERE
 #include "cifsglob.h"
 #include "cifsproto.h"
 #include "cifs_debug.h"
@@ -85,6 +84,34 @@ module_param(cifs_max_pending, int, 0);
 MODULE_PARM_DESC(cifs_max_pending, "Simultaneous requests to server. "
   "Default: 50 Range: 2 to 256");
 
+struct list_head GlobalSMBSessionList;
+struct list_head GlobalTreeConnectionList;
+rwlock_t GlobalSMBSeslock;
+
+struct list_head GlobalOplock_Q;
+
+struct list_head GlobalDnotifyReqList;
+
+unsigned int GlobalCurrentXid;
+unsigned int GlobalTotalActiveXid;
+unsigned int GlobalMaxActiveXid;
+spinlock_t GlobalMid_Lock;
+char Local_System_Name[15];
+
+atomic_t sesInfoAllocCount;
+atomic_t tconInfoAllocCount;
+atomic_t tcpSesAllocCount;
+atomic_t tcpSesReconnectCount;
+atomic_t tconInfoReconnectCount;
+
+atomic_t bufAllocCount;/* current number allocated  */
+#ifdef CONFIG_CIFS_STATS2
+atomic_t totBufAllocCount; /* total allocated over all time */
+atomic_t totSmBufAllocCount;
+#endif
+atomic_t smBufAllocCount;
+atomic_t midCount;
+
 extern mempool_t *cifs_sm_req_poolp;
 extern mempool_t *cifs_req_poolp;
 extern mempool_t *cifs_mid_poolp;
@@ -1001,7 +1028,7 @@ init_cifs(void)
INIT_LIST_HEAD(_Q);
 #ifdef CONFIG_CIFS_EXPERIMENTAL
INIT_LIST_HEAD();
-   INIT_LIST_HEAD(_Q);
+/* INIT_LIST_HEAD(_Q); */
 #endif
 /*
  *  Initialize Global counters
diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index 5d32d8d..c45acfd 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -583,79 +583,73 @@ require use of the stronger protocol */
  *
  /
 
-#ifdef DECLARE_GLOBALS_HERE
-#define GLOBAL_EXTERN
-#else
-#define GLOBAL_EXTERN extern
-#endif
-
 /*
  * The list of servers that did not respond with NT LM 0.12.
  * This list helps improve performance and eliminate the messages indicating
  * that we had a communications error talking to the server in this list.
  */
 /* Feature not supported */
-/* GLOBAL_EXTERN struct servers_not_supported *NotSuppList; */
+/* extern struct servers_not_supported *NotSuppList; */
 
 /*
  * The following is a hash table of all the users we know about.
  */
-GLOBAL_EXTERN struct smbUidInfo *GlobalUidList[UID_HASH];
+/* extern struct smbUidInfo *GlobalUidList[UID_HASH]; */
 
-/* GLOBAL_EXTERN struct list_head GlobalServerList; BB not implemented yet */
-GLOBAL_EXTERN struct list_head GlobalSMBSessionList;
-GLOBAL_EXTERN struct list_head GlobalTreeConnectionList;
-GLOBAL_EXTERN rwlock_t GlobalSMBSeslock;  /* protects list inserts on 3 above */
+/* extern struct list_head GlobalServerList; BB not implemented yet */
+extern struct list_head GlobalSMBSessionList;
+extern struct list_head GlobalTreeConnectionList;
+extern rwlock_t GlobalSMBSeslock;  /* protects list inserts on 3 above */
 
-GLOBAL_EXTERN struct list_head GlobalOplock_Q;
+extern struct list_head GlobalOplock_Q;
 
 /* Outstanding dir notify requests */
-GLOBAL_EXTERN struct list_head GlobalDnotifyReqList;
+extern struct list_head GlobalDnotifyReqList;
 /* DirNotify response queue */
-GLOBAL_EXTERN struct list_head GlobalDnotifyRsp_Q;
+/* extern struct list_head GlobalDnotifyRsp_Q; */
 
 /*
  * Global transaction id (XID) information
  */
-GLOBAL_EXTERN unsigned int GlobalCurrentXid;   /* protected by GlobalMid_Sem */
-GLOBAL_EXTERN unsigned int GlobalTotalActiveXid; /* prot by GlobalMid_Sem */
-GLOBAL_EXTERN unsigned int GlobalMaxActiveXid; /* prot by GlobalMid_Sem */
-GLOBAL_EXTERN spinlock_t GlobalMid_Lock;  /* protects above & list operations */
+extern unsigned int GlobalCurrentXid;  /* protected by GlobalMid_Sem */
+extern unsigned int GlobalTotalActiveXid; /* prot by GlobalMid_Sem */
+extern unsigned int GlobalMaxActiveXid;/* prot by GlobalMid_Sem */
+extern spinlock_t GlobalMid_Lock;  /* protects above & list operations */
  /* on midQ entries */
-GLOBAL_EXTERN char Local_System_Name[15];
+extern char Local_System_Name[15];
 
 /*
  *  Global counters, updated atomically
  */
-GLOBAL_EXTERN atomic_t sesInfoAllocCount;
-GLOBAL_EXTERN atomic_t tconInfoAllocCount;

[PATCH 0/4] autofs4 - autofs needs a miscellaneous device for ioctls

2008-02-25 Thread Ian Kent

Hi Andrew,

There is a problem with active restarts in autofs (that is to
say restarting autofs when there are busy mounts).

Currently autofs uses "umount -l" to clear active mounts at
restart. While using lazy umount works for most cases, anything
that needs to walk back up the mount tree to construct a path,
such as getcwd(2) and the proc file system /proc/<pid>/cwd, no
longer works because the point from which the path is constructed
has been detached from the mount tree.

The actual problem with autofs is that it can't reconnect to
existing mounts. One immediately thinks that just adding the
ability to remount autofs file systems would solve it, but
alas, that can't work. This is because autofs direct mounts
and the implementation of "on demand mount and expire" of
nested mount trees have the file system mounted on top of
the mount trigger dentry.

To resolve this a miscellaneous device node for routing ioctl
commands to these mount points has been implemented for the
autofs4 kernel module.

For those wishing to test this out an updated user space daemon
is needed. Checking out and building from the git repo or
applying all the current patches to the 5.0.3 tar distribution
will do the trick. This is all available at the usual location
on kernel.org.

Ian


Re: [patch 1/6] mempolicy: convert MPOL constants to enum

2008-02-25 Thread Paul Jackson
David wrote:

+enum {
+   MPOL_DEFAULT,
+   MPOL_PREFERRED,
+   MPOL_BIND,
+   MPOL_INTERLEAVE,
+   MPOL_MAX,   /* always last member of enum */

Aren't the values that these constants take part of the
user visible kernel API?

In other words, if someone added another MPOL_* in the middle
of this enum, it would break mbind/set_mempolicy/get_mempolicy
users, right?

+enum {
+   MPOL_DEFAULT,
+   MPOL_PREFERRED,
+   MPOL_YET_ANOTHER_FLAG,  /* <== added flag ... oops */
+   MPOL_BIND,
+   MPOL_INTERLEAVE,
+   MPOL_MAX,   /* always last member of enum */

I'm thinking that we should still specify the specific value
of each of these flags, by way of documenting these necessary
values, as in:

+enum {
+   MPOL_DEFAULT = 0,
+   MPOL_PREFERRED = 1,
+   MPOL_BIND = 2,
+   MPOL_INTERLEAVE = 3,
+   MPOL_MAX,   /* always last member of enum */
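
One hedged way to make that documentation self-enforcing is a set of compile-time checks next to the enum. The sketch below is a userspace copy of the enum for illustration only; the `enum mpol_mode` tag and the `mpol_mode_valid()` helper are assumptions and are not part of the quoted patch.

```c
#include <assert.h>

/* Userspace copy of the policy modes, for illustration only. */
enum mpol_mode {
	MPOL_DEFAULT    = 0,
	MPOL_PREFERRED  = 1,
	MPOL_BIND       = 2,
	MPOL_INTERLEAVE = 3,
	MPOL_MAX,	/* always last member of enum */
};

/* The build fails if someone reorders the enum or inserts in the middle. */
_Static_assert(MPOL_DEFAULT == 0, "MPOL_DEFAULT is user-visible");
_Static_assert(MPOL_PREFERRED == 1, "MPOL_PREFERRED is user-visible");
_Static_assert(MPOL_BIND == 2, "MPOL_BIND is user-visible");
_Static_assert(MPOL_INTERLEAVE == 3, "MPOL_INTERLEAVE is user-visible");

/* Hypothetical helper: nonzero if mode is one of the known policy modes. */
static int mpol_mode_valid(int mode)
{
	return mode >= MPOL_DEFAULT && mode < MPOL_MAX;
}
```

With the explicit values in place, a new mode can only be appended before MPOL_MAX without tripping the checks.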


-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214


Re: Kernel oops with bluetooth usb dongle

2008-02-25 Thread Marcel Holtmann

Hi Quel,


-- Original message --
From: Thomas Gleixner <[EMAIL PROTECTED]>

On Fri, 22 Feb 2008, David Woodhouse wrote:


On Fri, 2008-02-22 at 08:23 +0100, Thomas Gleixner wrote:


+   del_timer(&conn->info_timer);
+
   hcon->l2cap_data = NULL;
   kfree(conn);


Shouldn't that be del_timer_sync() ?


Hmm, probably yes.


Hi,

Great news: just adding del_timer_sync() to 2.6.25-rc3 does
prevent the crash.


Bad news: I still cannot use the device.

hcitool inq, hcitool scan, hcitool name and hcitool info commands work.

hcitool cc, sdptool, rfcomm connect commands fail, most of them
with a 'Connection reset by peer' error.


what does "hciconfig hci0 version" tell you about your device? Some of  
the none major based Bluetooth chips are broken and might need an  
extra tweak within the USB driver.


Regards

Marcel



Re: [Bluez-devel] forcing SCO connection patch

2008-02-25 Thread Marcel Holtmann

Hi Louis,


I fixed all of errors except 80 characters warning.
Thanks

Louis JANG

Signed-off-by: Louis JANG <[EMAIL PROTECTED]>

--- linux-2.6.23/net/bluetooth/hci_event.c.orig	2008-02-25  
17:17:11.0 +0900
+++ linux-2.6.23/net/bluetooth/hci_event.c	2008-02-25  
17:30:23.0 +0900

@@ -1313,8 +1313,17 @@
hci_dev_lock(hdev);

conn = hci_conn_hash_lookup_ba(hdev, ev->link_type, &ev->bdaddr);
-   if (!conn)
-   goto unlock;
+   if (!conn) {
+   if (ev->link_type != ACL_LINK) {
+			__u8 link_type = (ev->link_type == ESCO_LINK) ? SCO_LINK : ESCO_LINK;
+
+   conn = hci_conn_hash_lookup_ba(hdev, link_type, &ev->bdaddr);
+   if (conn)
+   conn->type = ev->link_type;
+   }
+   if (!conn)
+   goto unlock;
+   }


NAK. There is no need to check for ACL_LINK. The sync_complete will  
only be called for SCO or eSCO connections.


diff -uNr linux-2.6.23/include/net/bluetooth-orig/sco.h linux-2.6.23/ 
include/net/bluetooth/sco.h
--- linux-2.6.23/include/net/bluetooth-orig/sco.h	2007-10-10  
05:31:38.0 +0900
+++ linux-2.6.23/include/net/bluetooth/sco.h	2008-02-25  
18:04:20.0 +0900

@@ -51,6 +51,8 @@
__u8  dev_class[3];
};

+#define SCO_FORCESCO   0x03
+


NAK. We don't need this. And even if we really wanted this, we
would do it via extra parameters inside sockaddr_sco. In that case we
would do it right and expose the eSCO settings rather than some boolean
parameter.


Regards

Marcel



Re: [PATCH] Memory Resource Controller Add Boot Option

2008-02-25 Thread Li Zefan
Paul Menage wrote:
>>> I'll send out a prototype for comment.
> 
> Something like the patch below. The effects of cgroup_disable=foo are:
> 
> - foo doesn't show up in /proc/cgroups

Or we could print out the disabled flag; maybe that would be better,
because then we can distinguish between disabled and not compiled in
from /proc/cgroups.

> - foo isn't auto-mounted if you mount all cgroups in a single hierarchy
> - foo isn't visible as an individually mountable subsystem

You mentioned in a previous mail that if we mount a disabled subsystem
we will get an error. Here we just ignore the mount option. Which makes
more sense?

> 
> As a result there will only ever be one call to foo->create(), at init
> time; all processes will stay in this group, and the group will never be
> mounted on a visible hierarchy. Any additional effects (e.g. not
> allocating metadata) are up to the foo subsystem.
> 
> This doesn't handle early_init subsystems (their "disabled" bit isn't
> set be, but it could easily be extended to do so if any of the
> early_init systems wanted it - I think it would just involve some
> nastier parameter processing since it would occur before the
> command-line argument parser had been run.
> 
> include/linux/cgroup.h |1 +
> kernel/cgroup.c|   29 +++--
> 2 files changed, 28 insertions(+), 2 deletions(-)
> 
> Index: cgroup_disable-2.6.25-rc2-mm1/include/linux/cgroup.h
> ===
> --- cgroup_disable-2.6.25-rc2-mm1.orig/include/linux/cgroup.h
> +++ cgroup_disable-2.6.25-rc2-mm1/include/linux/cgroup.h
> @@ -256,6 +256,7 @@ struct cgroup_subsys {
> void (*bind)(struct cgroup_subsys *ss, struct cgroup *root);
> int subsys_id;
> int active;
> +int disabled;
> int early_init;
> #define MAX_CGROUP_TYPE_NAMELEN 32
> const char *name;
> Index: cgroup_disable-2.6.25-rc2-mm1/kernel/cgroup.c
> ===
> --- cgroup_disable-2.6.25-rc2-mm1.orig/kernel/cgroup.c
> +++ cgroup_disable-2.6.25-rc2-mm1/kernel/cgroup.c
> @@ -790,7 +790,14 @@ static int parse_cgroupfs_options(char *
> if (!*token)
> return -EINVAL;
> if (!strcmp(token, "all")) {
> -opts->subsys_bits = (1 << CGROUP_SUBSYS_COUNT) - 1;
> +/* Add all non-disabled subsystems */
> +int i;
> +opts->subsys_bits = 0;
> +for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
> +struct cgroup_subsys *ss = subsys[i];
> +if (!ss->disabled)
> +opts->subsys_bits |= 1ul << i;
> +}
> } else if (!strcmp(token, "noprefix")) {
> set_bit(ROOT_NOPREFIX, >flags);
> } else if (!strncmp(token, "release_agent=", 14)) {
> @@ -808,7 +815,8 @@ static int parse_cgroupfs_options(char *
> for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
> ss = subsys[i];
> if (!strcmp(token, ss->name)) {
> -set_bit(i, >subsys_bits);
> +if (!ss->disabled)
> +set_bit(i, >subsys_bits);
> break;
> }
> }
> @@ -2596,6 +2606,8 @@ static int proc_cgroupstats_show(struct
> mutex_lock(_mutex);
> for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
> struct cgroup_subsys *ss = subsys[i];
> +if (ss->disabled)
> +continue;
> seq_printf(m, "%s\t%lu\t%d\n",
>ss->name, ss->root->subsys_bits,
>ss->root->number_of_cgroups);
> @@ -2991,3 +3003,16 @@ static void cgroup_release_agent(struct
> spin_unlock(_list_lock);
> mutex_unlock(_mutex);
> }
> +
> +static int __init cgroup_disable(char *str)
> +{
> +int i;
> +for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
> +struct cgroup_subsys *ss = subsys[i];
> +if (!strcmp(str, ss->name)) {
> +ss->disabled = 1;
> +break;
> +}
> +}
> +}
> +__setup("cgroup_disable=", cgroup_disable);
> 
> 
>>
>> Sure thing, if css has the flag, then it would nice. Could you wrap it
>> up to say
>> something like css_disabled(&mem_cgroup_subsys)
>>
>>
> 
> It's the subsys object rather than the css (cgroup_subsys_state).
> 
>  We could have something like:
> 
> #define cgroup_subsys_disabled(_ss) ((_ss)->disabled)
> 
> but I don't see that
>  cgroup_subsys_disabled(&mem_cgroup_subsys)
> is better than just putting
> 
>  mem_cgroup_subsys.disabled
> 
> Paul
> 
> 
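
The effect of the quoted prototype on the "all" mount option can be tried out in userspace. The following is only a sketch under stated assumptions: `struct subsys`, `subsys_table`, `subsys_disable()` and `all_enabled_bits()` are hypothetical stand-ins for the kernel's struct cgroup_subsys, subsys[] array and the code paths in the patch, and the subsystem names are examples.

```c
#include <string.h>

/* Hypothetical stand-in for struct cgroup_subsys: just a name and a flag. */
struct subsys {
	const char *name;
	int disabled;
};

static struct subsys subsys_table[] = {
	{ "cpuset", 0 },
	{ "memory", 0 },
	{ "debug",  0 },
};
#define SUBSYS_COUNT (sizeof(subsys_table) / sizeof(subsys_table[0]))

/* Mirrors cgroup_disable(): mark the named subsystem as disabled.
 * Returns 1 if the name matched a subsystem, 0 otherwise. */
static int subsys_disable(const char *str)
{
	for (size_t i = 0; i < SUBSYS_COUNT; i++) {
		if (!strcmp(str, subsys_table[i].name)) {
			subsys_table[i].disabled = 1;
			return 1;
		}
	}
	return 0;
}

/* Mirrors the "all" branch: one bit per subsystem that is not disabled. */
static unsigned long all_enabled_bits(void)
{
	unsigned long bits = 0;

	for (size_t i = 0; i < SUBSYS_COUNT; i++)
		if (!subsys_table[i].disabled)
			bits |= 1ul << i;
	return bits;
}
```

Unlike the quoted prototype, the sketch returns a value from the disable routine; whether a mount that names a disabled subsystem should fail or be silently ignored is the open question raised above and is not decided here.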




Re: [PATCH] Don't risk NULL deref in marker

2008-02-25 Thread Mathieu Desnoyers
* Jesper Juhl ([EMAIL PROTECTED]) wrote:
> 
> get_marker() may return NULL, so test for it.
> 

Hrm, yes, if we have two marker_probe_unregister callers calling it for
the same marker, one expecting it to fail and they race, yes, it can
happen. Although this is not expected to happen often if the caller acts
sanely, it's a good thing to fix it. Thanks!

Acked-by: Mathieu Desnoyers <[EMAIL PROTECTED]>

> 
> Signed-off-by: Jesper Juhl <[EMAIL PROTECTED]>
> ---
> 
> diff --git a/kernel/marker.c b/kernel/marker.c
> index 50effc0..f211f08 100644
> --- a/kernel/marker.c
> +++ b/kernel/marker.c
> @@ -698,12 +698,11 @@ int marker_probe_unregister(const char *name,
>  {
>   struct marker_entry *entry;
>   struct marker_probe_closure *old;
> - int ret = 0;
> + int ret = -ENOENT;
>  
>   mutex_lock(_mutex);
>   entry = get_marker(name);
>   if (!entry) {
> - ret = -ENOENT;
>   goto end;
>   }
>   if (entry->rcu_pending)
> @@ -713,12 +712,16 @@ int marker_probe_unregister(const char *name,
>   marker_update_probes(); /* may update entry */
>   mutex_lock(_mutex);
>   entry = get_marker(name);
> + if (!entry) {
> + goto end;
> + }
>   entry->oldptr = old;
>   entry->rcu_pending = 1;
>   /* write rcu_pending before calling the RCU callback */
>   smp_wmb();
>   call_rcu(>rcu, free_old_closure);
>   remove_marker(name);/* Ignore busy error message */
> + ret = 0;
>  end:
>   mutex_unlock(_mutex);
>   return ret;
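
The pattern the patch fixes (look up, drop the lock, retake it, look up again and re-check) can be sketched in userspace with a pthread mutex. This is an illustration under assumptions, not the marker code: `struct entry`, `table`, `lookup()`, `unlocked_hook` and `racing_remove()` are hypothetical names, and the hook exists only so a racing unregister can be simulated deterministically.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical one-slot registry standing in for the marker hash table. */
struct entry {
	const char *name;
	int present;
};

static struct entry table[] = { { "probe_a", 1 } };
static pthread_mutex_t table_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Optional hook run while the lock is dropped; lets a test simulate a race. */
static void (*unlocked_hook)(void);

/* Must be called with table_mutex held. Returns NULL if not registered. */
static struct entry *lookup(const char *name)
{
	for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++)
		if (table[i].present && !strcmp(table[i].name, name))
			return &table[i];
	return NULL;
}

/* Simulated second caller that removes "probe_a" while the lock is free. */
static void racing_remove(void)
{
	pthread_mutex_lock(&table_mutex);
	table[0].present = 0;
	pthread_mutex_unlock(&table_mutex);
}

static int entry_unregister(const char *name)
{
	struct entry *e;
	int ret = -ENOENT;

	pthread_mutex_lock(&table_mutex);
	e = lookup(name);
	if (!e)
		goto end;
	pthread_mutex_unlock(&table_mutex);

	if (unlocked_hook)
		unlocked_hook();	/* work done without the lock held */

	pthread_mutex_lock(&table_mutex);
	e = lookup(name);		/* the entry may be gone by now */
	if (!e)
		goto end;
	e->present = 0;
	ret = 0;
end:
	pthread_mutex_unlock(&table_mutex);
	return ret;
}
```

Initialising ret to -ENOENT and setting it to 0 only on the success path, as the patch does, keeps both failed lookups on the same exit path.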

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


Re: [RFC] mmiotrace full patch, preview 1

2008-02-25 Thread Pavel Roskin

Quoting Christoph Hellwig <[EMAIL PROTECTED]>:


On Mon, Feb 25, 2008 at 02:49:22PM -0800, Andrew Morton wrote:

the things which it finds.

> +static DECLARE_MUTEX(kmmio_init_mutex);

That's not a mutex.

> +  down(&kmmio_init_mutex);

It's a semaphore.  Please do convert it to a mutex.

Andy, I'd say that addition of new semaphores is worth a warning - they're
rarely legitimate.


I'm not sure that any semaphore should be a warning, but the initializers
for a semaphore used as a binary mutex (DECLARE_MUTEX and init_MUTEX) are
worth it.


It looks like a mutex, it acts like a mutex, but it isn't a mutex,  
it's a trap for the unwary.  Weird.  I was annoyed by it before; now I  
see a fellow developer actually getting into that trap.


I'd say, rename DECLARE_MUTEX to DECLARE_SEMAPHORE and let external  
code be fixed one way or another (i.e. stick with the "mutex" name or  
stick with the semaphore functionality if it's really needed).


--
Regards,
Pavel Roskin


Re: Tiny cpusets -- cpusets for small systems?

2008-02-25 Thread Max Krasnyanskiy

Paul Jackson wrote:

So. I see cpusets as a higher level API/mechanism and cpu_isolated_map as lower
level mechanism that actually makes kernel aware of what's isolated what's not.
Kind of like sched domain/cpuset relationship. ie cpusets affect sched domains
but scheduler does not use cpusets directly.


One could use cpusets to control the setting of cpu_isolated_map,
separate from the code such as your select_irq_affinity() that
uses it.
Yes. That's what I proposed too, in one of the CPU isolation threads with
Peter. The only issue is that you need to simulate a CPU_DOWN hotplug event in
order to clean up what's already running on those CPUs.



In the foreseeable future 2-8 cores will be the most common configuration.
Do you think that cpusets are needed/useful for those machines ?
The reason I'm asking is because given the restrictions you mentioned
above it seems that you might as well just do
	taskset -c 1,2,3 app1
	taskset -c 3,4,5 app2


People tend to manage the CPU and memory placement of the threads
and processes within a single co-operating job using taskset
(sched_setaffinity) and numactl (mbind, set_mempolicy.)

They tend to manage the placement of multiple unrelated jobs onto
a single system, whether on separate or shared CPUs and nodes,
using cpusets.


Something like cpu_isolated_map looks to me like a system-wide
mechanism, which should, like sched_domains, be managed system-wide.
Managing it with a mechanism that encourages each thread to update
it directly, as if that thread owned the system, will break down,
resulting in conflicting updates, as multiple, insufficiently
co-operating threads issue conflicting settings.
I'm not sure how to interpret that. I think you might have mixed a couple of
things I asked about in one reply ;-).
The question was: given the restrictions you talked about when you
explained the tiny-cpusets functionality, how much does one gain from using
them compared to taskset/numactl? ie On machines with 2-8 cores it's
fairly easy to manage cpus with simple affinity masks.


The second part of your reply seems to imply that I somehow made you think 
that I suggested that cpu_isolated_map is managed per thread. That is of 
course not the case. It's definitely a system-wide mechanism and individual 
threads have nothing to do with it.
btw I just re-read my prev reply. I definitely did not say anything about 
threads managing cpu_isolated_map :).



Stuff that I'm working on these days (wireless basestations) is designed
with the following model:
	cpuN - runs soft-RT networking and management code
	cpuN+1 to cpuN+x - are used as dedicated engines
ie Simplest example would be
	cpu0 - runs IP, L2 and control plane
	cpu1 - runs hard-RT MAC

So if CPU isolation is implemented on top of the cpusets what kind of API do 
you envision for such an app ?


That depends on what more API is needed.  Do we need to place
irqs better ... cpusets might not be a natural for that use.
Aren't irqs directed to specific CPUs, not to hierarchically
nested subsets of CPUs?


You clipped the part where I elaborated. Which was:
So if CPU isolation is implemented on top of the cpusets what kind of API do
you envision for such an app ? I mean currently cpusets seems to be mostly dealing
with entire processes, whereas in this case we're really dealing with the threads.
ie Different threads of the same process require different policies, some must run
on isolated cpus some must not. I guess one could write a thread's pid into cpusets
fs but that's not very convenient. pthread_set_affinity() is exactly what's needed.
In other words how would an app place its individual threads into the
different cpusets.
IRQ stuff is separate, like we said above cpusets could simply update 
cpu_isolated_map which would take care of IRQs. I was talking specifically 
about the thread management.
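
For reference, per-thread placement with plain affinity masks looks roughly like the sketch below. It assumes glibc, where the call is spelled pthread_setaffinity_np() and needs _GNU_SOURCE; the helper names `pin_self_to_cpu()`, `first_allowed_cpu()` and `allowed_cpu_count()` are made up for the example.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single CPU.
 * Returns 0 on success or an errno value on failure. */
static int pin_self_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* First CPU the calling thread is currently allowed to run on, or -1. */
static int first_allowed_cpu(void)
{
	cpu_set_t set;

	if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0)
		return -1;
	for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &set))
			return cpu;
	return -1;
}

/* Number of CPUs in the calling thread's affinity mask, or -1 on error. */
static int allowed_cpu_count(void)
{
	cpu_set_t set;

	if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0)
		return -1;
	return CPU_COUNT(&set);
}
```

Each thread would call pin_self_to_cpu() from its own start routine, so the hard-RT thread and the control-plane threads of one process can end up on different CPUs. This only sets the mask; it does not by itself keep other tasks or IRQs off the chosen CPU, which is the part the isolation work is about.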



Separate question:
  Is it desired that the dedicated CPUs cpuN+1 ... cpuN+x even appear
  as general purpose systems running a Linux kernel in your systems?
  These dedicated engines seem more like intelligent devices to me,
  such as disk controllers, which the kernel controls via device
  drivers, not by loading itself on them too.
We still want to be able to run normal threads on them. Which means IPI, 
memory management, etc is still needed. So yes they better show up as normal 
CPUs :)
Also with dynamic isolation you can for example un-isolate a cpu when you're 
compiling stuff on the machine and then isolate it when you're running special 
app(s).


Max


[RFC][PATCH] page reclaim throttle take2

2008-02-25 Thread KOSAKI Motohiro
Hi

This patch is a page reclaim improvement.

o previous discussion:
http://marc.info/?l=linux-mm&m=120339997125985&w=2

o test method
  $ ./hackbench 120 process 1000

o test result (average of 5 measurements)

limit      hackbench   sys-time   major-fault   max-spent-time
           time (s)    (s)                      in shrink_zone()
                                                (jiffies)
---------------------------------------------------------------
3          42.06       378.70     5336          6306


o reason for restricting parallel reclaim to 3 tasks per zone

we tested various limits.
  - a limit of 1 gives the fewest major faults,
    but the worst max time spent in reclaim.
  - a limit of 3 gives the best max time spent in reclaim and the best
    hackbench result.

I think a limit of 3 gives the best overall experience.


limit      hackbench   sys-time   major-fault   max-spent-time
           time (s)    (s)                      in shrink_zone()
                                                (jiffies)
---------------------------------------------------------------
1          48.50       283.89     3690          9057
2          44.43       350.94     5245          7159
3          42.06       378.70     5336          6306
4          48.84       401.87     5474          6669
unlimited  282.30      1248.47    29026         -



Any comments are welcome!
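
The admission check the patch relies on (take a slot unless the limit has already been reached) can be sketched in userspace with C11 atomics. This is a rough model only: `add_unless()` is a stand-in written for this example that follows the semantics of the kernel's atomic_add_unless(), `reclaim_enter()` and `reclaim_exit()` are hypothetical names, and the sketch returns false where the patch would sleep in wait_event().

```c
#include <stdatomic.h>
#include <stdbool.h>

#define RECLAIM_LIMIT 3

/* Add a to *v unless *v == u. Returns true if the add happened. */
static bool add_unless(atomic_int *v, int a, int u)
{
	int old = atomic_load(v);

	while (old != u) {
		if (atomic_compare_exchange_weak(v, &old, old + a))
			return true;
		/* A failed compare-exchange reloaded old; check it again. */
	}
	return false;
}

/* Number of tasks currently reclaiming from the (single, simulated) zone. */
static atomic_int nr_reclaimers;

/* Try to become a reclaimer; false means the zone is at its limit. */
static bool reclaim_enter(void)
{
	return add_unless(&nr_reclaimers, 1, RECLAIM_LIMIT);
}

/* Give the slot back; the patch also wakes all waiters at this point. */
static void reclaim_exit(void)
{
	atomic_fetch_sub(&nr_reclaimers, 1);
}
```

Because the increment and the limit check happen in one atomic step, two tasks cannot both take the last slot.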



Signed-off-by: KOSAKI Motohiro <[EMAIL PROTECTED]>
CC: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]>
CC: Balbir Singh <[EMAIL PROTECTED]>
CC: Rik van Riel <[EMAIL PROTECTED]>
CC: Lee Schermerhorn <[EMAIL PROTECTED]>
CC: Nick Piggin <[EMAIL PROTECTED]>


---
 include/linux/mmzone.h |3 +
 mm/page_alloc.c|4 +
 mm/vmscan.c|  101 -
 3 files changed, 99 insertions(+), 9 deletions(-)

Index: b/include/linux/mmzone.h
===
--- a/include/linux/mmzone.h2008-02-25 21:37:49.0 +0900
+++ b/include/linux/mmzone.h2008-02-26 10:12:12.0 +0900
@@ -335,6 +335,9 @@ struct zone {
unsigned long   spanned_pages;  /* total size, including holes */
unsigned long   present_pages;  /* amount of memory (excluding holes) */
 
+
+   atomic_tnr_reclaimers;
+   wait_queue_head_t   reclaim_throttle_waitq;
/*
 * rarely used fields:
 */
Index: b/mm/page_alloc.c
===
--- a/mm/page_alloc.c   2008-02-25 21:37:49.0 +0900
+++ b/mm/page_alloc.c   2008-02-26 10:12:12.0 +0900
@@ -3466,6 +3466,10 @@ static void __meminit free_area_init_cor
zone->nr_scan_inactive = 0;
zap_zone_vm_stats(zone);
zone->flags = 0;
+
+   zone->nr_reclaimers = ATOMIC_INIT(0);
+   init_waitqueue_head(>reclaim_throttle_waitq);
+
if (!size)
continue;
 
Index: b/mm/vmscan.c
===
--- a/mm/vmscan.c   2008-02-25 21:37:49.0 +0900
+++ b/mm/vmscan.c   2008-02-26 10:59:38.0 +0900
@@ -1252,6 +1252,55 @@ static unsigned long shrink_zone(int pri
return nr_reclaimed;
 }
 
+
+#define RECLAIM_LIMIT (3)
+
+static int do_shrink_zone_throttled(int priority, struct zone *zone,
+   struct scan_control *sc,
+   unsigned long *ret_reclaimed)
+{
+   u64 start_time;
+   int ret = 0;
+
+   start_time = jiffies_64;
+
+   wait_event(zone->reclaim_throttle_waitq,
+  atomic_add_unless(>nr_reclaimers, 1, RECLAIM_LIMIT));
+
+   /* more reclaim until needed? */
+   if (scan_global_lru(sc) &&
+   !(current->flags & PF_KSWAPD) &&
+   time_after64(jiffies, start_time + HZ/10)) {
+   if (zone_watermark_ok(zone, sc->order, 4*zone->pages_high,
+ MAX_NR_ZONES-1, 0)) {
+   ret = -EAGAIN;
+   goto out;
+   }
+   }
+
+   *ret_reclaimed += shrink_zone(priority, zone, sc);
+
+out:
+   atomic_dec(>nr_reclaimers);
+   wake_up_all(>reclaim_throttle_waitq);
+
+   return ret;
+}
+
+static unsigned long shrink_zone_throttled(int priority, struct zone *zone,
+  struct scan_control *sc)
+{
+   unsigned long nr_reclaimed = 0;
+   int ret;
+
+   ret = do_shrink_zone_throttled(priority, zone, sc, _reclaimed);
+
+   if (ret == -EAGAIN)
+   nr_reclaimed = 1;
+
+   return nr_reclaimed;
+}
+
 /*
  * This is the direct reclaim path, for page-allocating processes.  We only
  * try to reclaim pages from zones which will satisfy the caller's allocation
@@ -1268,12 +1317,11 @@ static unsigned long shrink_zone(int pri
  * If a zone is deemed to be full of pinned 

[PATCH] x86_64: force re setting the mmconf for fam10h if acpi=off

2008-02-25 Thread Yinghai Lu

Some BIOSes let only the AMD Fam10h CPU handle bus 0 and leave the other
buses to the nvidia mcp55/ck804. In that case MCFG covers all of them.

But with acpi=off we cannot use MCFG. This patch double-checks
busnbits and, if it covers fewer than 256 buses and acpi=off is set,
forcibly resets the mmconf in the MSR, so we can still use mmconf in the
above case.
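
As a rough illustration of the check, the bus range field can be decoded like this. The shift and mask values below are assumptions chosen for the example (the patch uses the kernel's FAM10H_MMIO_CONF_BUSRANGE_SHIFT and FAM10H_MMIO_CONF_BUSRANGE_MASK, whose values are not shown in this mail), and the helper names are hypothetical. The sketch also assumes the field encodes the bus count as a power of two, which matches the "busnbits >= 8 means 256 buses" test in the patch.

```c
#include <stdint.h>

/* Assumed field layout, for illustration only. */
#define BUSRANGE_SHIFT 2
#define BUSRANGE_MASK  0xfULL

/* Extract the bus range field from the MMIO config MSR value. */
static unsigned int mmconf_busnbits(uint64_t val)
{
	return (unsigned int)((val >> BUSRANGE_SHIFT) & BUSRANGE_MASK);
}

/* Number of buses covered, assuming a power-of-two encoding. */
static unsigned int mmconf_bus_count(uint64_t val)
{
	return 1u << mmconf_busnbits(val);
}

/* Mirrors the patch's condition: with acpi=off, only trust a BIOS
 * setting that covers all 256 buses. */
static int mmconf_trusted(uint64_t val, int acpi_pci_disabled)
{
	return !acpi_pci_disabled || mmconf_busnbits(val) >= 8;
}
```

A BIOS setting that covers only bus 0 is therefore still accepted when ACPI is enabled, since MCFG covers the rest, and rejected when it is not.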

Signed-off-by: Yinghai Lu <[EMAIL PROTECTED]>

Index: linux-2.6/arch/x86/kernel/setup_64.c
===
--- linux-2.6.orig/arch/x86/kernel/setup_64.c
+++ linux-2.6/arch/x86/kernel/setup_64.c
@@ -720,14 +720,21 @@ static void __cpuinit fam10h_check_enabl
 
/* try to make sure that AP's setting is identical to BSP setting */
if (val & FAM10H_MMIO_CONF_ENABLE) {
-   u64 base;
-   base = val & (0xULL << 32);
-   if (fam10h_pci_mmconf_base_status <= 0) {
-   fam10h_pci_mmconf_base = base;
-   fam10h_pci_mmconf_base_status = 1;
-   return;
-   } else if (fam10h_pci_mmconf_base ==  base)
-   return;
+   unsigned busnbits;
+   busnbits = (val >> FAM10H_MMIO_CONF_BUSRANGE_SHIFT) &
+   FAM10H_MMIO_CONF_BUSRANGE_MASK;
+
+   /* only trust the one handle 256 buses, if acpi=off */
+   if (!acpi_pci_disabled || busnbits >= 8) {
+   u64 base;
+   base = val & (0xULL << 32);
+   if (fam10h_pci_mmconf_base_status <= 0) {
+   fam10h_pci_mmconf_base = base;
+   fam10h_pci_mmconf_base_status = 1;
+   return;
+   } else if (fam10h_pci_mmconf_base ==  base)
+   return;
+   }
}
 
/*
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Aborted commands with arcmsr and 2xWD1500ADFD in RAID1

2008-02-25 Thread nickcheng
Hi Aron,
Thanks for your patience.
If you still run into trouble, please let me know.
Thank you again,
-Original Message-
From: Aron Stansvik [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 26, 2008 6:52 AM
To: erich
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED];
[EMAIL PROTECTED]; linux-kernel@vger.kernel.org
Subject: Re: Aborted commands with arcmsr and 2xWD1500ADFD in RAID1

Hi Erich.

2008/2/25, nickcheng <[EMAIL PROTECTED]>:
> Hi Aron,
>  From our field experience and customers' feedback, all of these cases
>  point to vibration and power issues.
>  The vibration could be caused by fans, not only by the drives themselves.

Okay. I have a chassis fan that is quite close to the drives, I will
try disabling it. I've also ordered two Nexus TwinDisk anti vibration
harddrive mounts with which I'll place the disks in my 5.25" slots
instead, away from any fans.

If this doesn't work, I'm stumped, as I really don't think it's the
power supply and I don't have the money to buy a new one.

>  You mentioned it could be the F/W issue.
>  If the environment does not meet the prerequisite, FW could not work
>  correctly.
>  Actually FW just reacts to the situations not it causes the issue.

Of course, I understand this. Just trying to figure this problem out..

>  Please check it out!!

I'll report back with my findings after moving the disks away from the
fans and using anti-vibration mounts.

Thanks for taking your time to reply.

Aron

>  Thank you,
>
>
>  -Original Message-
>  From: Aron Stansvik [mailto:[EMAIL PROTECTED]
>  Sent: Sunday, February 24, 2008 1:54 AM
>  To: [EMAIL PROTECTED]
>  Cc: erich; [EMAIL PROTECTED]; [EMAIL PROTECTED];
>  linux-kernel@vger.kernel.org
>
> Subject: Re: Aborted commands with arcmsr and 2xWD1500ADFD in RAID1
>
>  Hello again Areca and LKML hackers.
>
>  2008/2/18, Aron Stansvik <[EMAIL PROTECTED]>:
>  > Hello Nick.
>  >
>  >  Sorry that I'm not answering until now. I've been busy.
>  >
>  >  2008/2/13, nickcheng <[EMAIL PROTECTED]>:
>  >
>  > > Hi Aron,
>  >  >  From our experience and some customers' feedback, your issue could
be
>  caused
>  >  >  by power instability or vibration to your HDs.
>  >  >  Please check step by step:
>  >  >  (1).under your original environment, increase the SCSI command
value,
>  >  >  default=30, with the shell script, set_scsicmd_timeout(). 90 or 120
is
>  >  >  enough.
>  >  >  (2).if method 1 does not work, find out the vibration source or
change
>  the
>  >  >  power supply
>  >
>  >
>  > I will try to increase that value. I don't think it's vibration; the
>  >  disks are firmly in place in a very heavy chassi (Silverstone
>  >  SST-TJ05B-T). And I really don't think there's something wrong with
>  >  the power supply, it's a pretty new Silverstone ST65ZF 650W. This is
>  >  my own personal workstation, so I don't just have another power supply
>  >  to test with :/
>  >
>  >  I will report back on my success/failure. Thanks for your answer.
>
>  I've now tried with both 90 and 120 for the timeout value, and the
>  problem still persists. It seems to happen when lots of small writes
>  are occuring, e.g. when installing something.
>
>  I really don't think the disks are vibrating, I don't see how they
>  could. One more thing I'm going to test is to use the legacy ATA power
>  connector instead of the SATA power connector. This was what I was
>  using before when I only had a single drive and no RAID controller.
>  Maybe my power supply is malfunctioning and not giving enough power on
>  the SATA power connectors.. but I doubt it.
>
>  Is there anything else that could cause this? Have you guys at Areca
>  tested the ARC-1200 with Raptors in RAID1?
>
>  :(
>
>  Regards,
>  Aron
>
>  >
>  >
>  >  Aron
>  >
>  >
>  >  >  If your still have any questions, please feel free to let me know.
>  >  >
>  >  >  P.S. The attached driver source, arcmsr-1.20.00.15-71224, has been
>  >  >  upstreamed to kernel.org and will be released in kernel 2.6.25. If
you
>  like,
>  >  >  you could update your driver with it.
>  >  >  It fixes some minor bugs, but these bugs are nothing to do with
your
>  issue.
>  >  >
>  >  >
>  >  >  -Original Message-
>  >  >  From: erich [mailto:[EMAIL PROTECTED]
>  >  >  Sent: Wednesday, February 13, 2008 4:33 PM
>  >  >  To: (廣安科技)鄭守謙
>  >  >  Subject: Fw: Aborted commands with arcmsr and 2xWD1500ADFD in RAID1
>  >  >
>  >  >
>  >  >
>  >  >  - Original Message -
>  >  >  From: "Andrew Morton" <[EMAIL PROTECTED]>
>  >  >  To: "Aron Stansvik" <[EMAIL PROTECTED]>
>  >  >  Cc: ; <[EMAIL PROTECTED]>;
>  "erich"
>  >  >  <[EMAIL PROTECTED]>
>  >  >  Sent: Wednesday, February 13, 2008 4:03 PM
>  >  >  Subject: Re: Aborted commands with arcmsr and 2xWD1500ADFD in RAID1
>  >  >
>  >  >
>  >  >  >
>  >  >  > (cc's added)
>  >  >  >
>  >  >  > On Mon, 11 Feb 2008 17:44:08 +0100 "Aron Stansvik"
>  <[EMAIL PROTECTED]>
>  >  >  > wrote:
>  >  >  >
>  >  >  >> Hello LKML.
>  >  >  >>
>  >  >  >> Under semi-high disk I/O (e.g. installing 

Re: GAK!!!! Re: PCI: AMD SATA IDE mode quirk

2008-02-25 Thread Crane Cai
On Tue, 2008-02-26 at 06:53 +0800, Jeff Garzik wrote:
> Alan Cox wrote:
> >> Signed-off-by: Crane Cai <[EMAIL PROTECTED]>
> >> Acked-by: Jeff Garzik <[EMAIL PROTECTED]>
> >> Signed-off-by: Greg Kroah-Hartman <[EMAIL PROTECTED]>
> >
> > Vomitted-upon-by: Alan Cox <[EMAIL PROTECTED]>
> >
> >> -if ((pdev->class >> 8) == PCI_CLASS_STORAGE_IDE) {
> >> -u8 tmp;
> >> +/* set sb600/sb700/sb800 sata to ahci mode */
> >> +u8 tmp;
> >> 
> >> +pci_read_config_byte(pdev, PCI_CLASS_DEVICE, &tmp);
> >> +if (tmp == 0x01) {
> >
> > CLASS_DEVICE is cached in pdev->class so why not simply do:
> >
> >   if (pdev->class & (1 << 8))
> 
> I agree.  I [obviously] missed this when I ack'd, mainly ack'ing the
> overall change.
> 
> BIOS certainly may modify that PCI config register, but that's before
> the kernel boots.  So, using pdev->class is fine.
> 
> Jeff

pdev->class is also quirked on resume. We need to reread the PCI
configuration on resume.



Re: [PATCH sched-devel 0/7] CPU isolation extensions

2008-02-25 Thread Max Krasnyanskiy

Hi Peter,

Sorry for delay in reply.


Please, wrap your emails at 78 - most mailers can do this.

Done.


On Fri, 2008-02-22 at 14:05 -0800, Max Krasnyanskiy wrote:

Peter Zijlstra wrote:

On Thu, 2008-02-21 at 18:38 -0800, Max Krasnyanskiy wrote:




List of commits
   cpuisol: Make cpu isolation configrable and export isolated map
 
cpu_isolated_map was a bad hack when it was introduced, I feel we should

deprecate it and fully integrate the functionality into cpusets. That would
give a much more flexible end-result.

That's not currently possible and would introduce a lot of complexity.
I'm pretty sure you missed the discussion I had with Paul (you were cc'ed on
that btw).
In fact I provided the link to that discussion in the original email.
Here it is again:
http://marc.info/?l=linux-kernel&m=120180692331461&w=2


I read it, I just firmly disagree.

Basically the problem is very simple. CPU isolation needs a simple/efficient way to check 
if CPU N is isolated.


I'm not seeing the need for that outside of setting up the various
states. That is, once all the affinities are set up, you would hardly ever
need (nor, imho, should you need) to know if a particular CPU is isolated or not.
Unless I'm missing something that's only possible for a very static system. 
What I mean is that yes you could go and set irq affinity, apps affinity, 
workqueue thread affinity, etc not to run on the isolated cpus. It works 
_until_ something changes, at which point the system needs to know that it's 
not supposed to touch CPU N.
For example a new IRQ is registered, a new workqueue is created (fs mounted,
network interface created, etc), a new kthread is started, etc.
Sure we can introduce default affinity masks for irqs, workqueues, etc. But 
that's essentially just duplicating cpu_isolated_map.
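
One way to picture the "default affinity mask" idea mentioned above is the following sketch, in which every newly created IRQ, workqueue thread or kthread starts from the online CPUs minus the isolated ones. The 64-bit `cpu_mask_t` type and both helper names are made up for illustration and are not kernel APIs.

```c
#include <stdint.h>

/* One bit per CPU; 64 CPUs is enough for a sketch. */
typedef uint64_t cpu_mask_t;

/*
 * Default placement for a newly registered IRQ, workqueue thread or
 * kthread: any online CPU that is not isolated.
 */
static cpu_mask_t default_affinity(cpu_mask_t online, cpu_mask_t isolated)
{
	return online & ~isolated;
}

/* Lowest-numbered CPU in the mask, or -1 if the mask is empty. */
static int first_cpu(cpu_mask_t mask)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (mask & ((cpu_mask_t)1 << cpu))
			return cpu;
	return -1;
}
```

Whichever subsystem creates the new object would consult the same isolated mask, which is why a per-subsystem default mask ends up carrying the same information as a single shared map.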



 cpuset/cgroup APIs are not designed for that. In order to figure
out whether CPU N is isolated, one has to iterate through all cpusets and check
their cpu maps.
That requires several levels of locking (cgroup and cpuset).
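
To make the cost argument concrete, here is a hedged userspace sketch contrasting a single global map with a per-set flag. `struct toy_cpuset` and all three functions are hypothetical stand-ins; real cpusets are hierarchical and need locking that this sketch leaves out entirely.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t cpu_mask_t;

/* Hypothetical stand-in for a cpuset: a CPU mask plus an "isolated" flag. */
struct toy_cpuset {
	cpu_mask_t cpus;
	bool isolated;
};

/* Global map: answering "is CPU N isolated?" is a single bit test. */
static bool cpu_isolated_by_map(cpu_mask_t isolated_map, unsigned int cpu)
{
	return cpu < 64 && ((isolated_map >> cpu) & 1);
}

/* Per-set flag: every set has to be visited to answer the same question. */
static bool cpu_isolated_by_walk(const struct toy_cpuset *sets, size_t n,
				 unsigned int cpu)
{
	if (cpu >= 64)
		return false;
	for (size_t i = 0; i < n; i++)
		if (sets[i].isolated && ((sets[i].cpus >> cpu) & 1))
			return true;
	return false;
}

/* Folding the sets into one map once keeps later lookups cheap. */
static cpu_mask_t build_isolated_map(const struct toy_cpuset *sets, size_t n)
{
	cpu_mask_t map = 0;

	for (size_t i = 0; i < n; i++)
		if (sets[i].isolated)
			map |= sets[i].cpus;
	return map;
}
```

The last helper hints at a middle ground: the sets stay the place where policy is configured, and a flat map is rebuilt from them whenever they change, so that hot paths only ever do the bit test.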



The other issue is that cpusets are a bit too dynamic (see the thread above for
more details): we'd need notifier mechanisms to tell subsystems when a CPU
becomes isolated. Again, more complexity. Since I integrated cpu isolation with
cpu hotplug, it's already addressed in a nice simple way.


I guess you have another definition of nice than I do.

No, not really.
Let's talk specifics. My goal was not to introduce a bunch of new functionality
and rewrite workqueues and stuff; instead I wanted to integrate with existing
mechanisms. CPU maps are used everywhere and exporting cpu_isolated_map was a
natural way to make other parts of the kernel aware of the isolated CPUs.



Please take a look at that discussion. I do not think it's worth the effort to
put this into cpusets. cpu_isolated_map is a very clean and simple concept and
integrates nicely with the rest of the cpu maps, i.e. it's very much the same
concept and API as cpu_online_map, etc.


I'm thinking cpu_isolated_map is a very dirty hack.

If we want to integrate this stuff with cpusets I think the best approach would be to
have cpusets update the cpu_isolated_map just like it currently updates scheduler domains.
 

CPU-sets can already isolate cpus by either creating a cpu outside of any set,
or a set with a single cpu not shared by any other sets.
This only works for user-space. As I mentioned above, for full CPU isolation various kernel
subsystems need to be aware that CPUs are isolated in order to avoid activities on them.


Yes, hence the proposed system flag to handle the kernel bits like
unbounded kernel threads and IRQs.
I do not see a specific proposal here. The funny part is that we're not even
disagreeing on the high level. Yes, it'd be nice to have such a flag ;-)

But how will genirq subsystem, for example, be aware of that flag ?
ie How would it know that by default it is not supposed to route irqs to the 
CPUs in the cpusets with that flag ?

As I explained above setting affinity for existing irqs is not enough.
Same for workqueues or any other subsystem that wants to run per-cpu threads
and stuff.



This also allows for isolated groups, there are good reasons to isolate groups,
esp. now that we have a stronger RT balancer. SMP and hard RT are not
exclusive. A design that does not take that into account is too rigid.



You're thinking scheduling only. Paul had the same confusion ;-)


I'm not, I'm thinking it ought to allow for it.
One way I can think of to support groups and allow for the RT balancer is
this: make the scheduler ignore cpu_isolated_map and give cpusets full control of
the scheduler domains. Use cpu_isolated_map only for hw irqs and other
kernel sub-systems. That way cpusets could mark cpus in the group as isolated
to get rid of the kernel activity and build sched domains such that tasks get
balanced in it.
The thing I do not like about it is that there is no way to boot the system 
with CPU N isolated 

RE: 2.6.25-current-git hangs on boot

2008-02-25 Thread Pallipadi, Venkatesh

>-Original Message-
>From: Rafael J. Wysocki [mailto:[EMAIL PROTECTED] 
>Sent: Sunday, February 24, 2008 3:18 AM
>To: Soeren Sonnenburg
>Cc: Oliver Pinter; Linux Kernel; Pallipadi, Venkatesh
>Subject: Re: 2.6.25-current-git hangs on boot
>
>On Sunday, 24 of February 2008, Soeren Sonnenburg wrote:
>> On Sat, 2008-02-23 at 20:00 +0100, Oliver Pinter wrote:
>> > the pci=nommconf kernel parameter helped it?
>> 
>> yes indeed, this switch reliably helps to overcome the hang at *this
>> stage* (I tried booting both with the switch and w/o).
>> 
>> however with 50% chance I still see a hang directly after
>> 
>> cpuidle: using governor ladder
>
>Do you have CONFIG_CPU_IDLE set?  If you have, please try to 
>unset it and
>retest.
>

Rafael,

I am looking at the CPU_IDLE part of this regression. Just want to note
that there is another regression with needing pci=nommconf in current
git which was not required in .24. I am not sure whether you are already
tracking that as a different issue.

Thanks,
Venki


Re: 2.6.25-rc2 rcupreempt WARN after suspend to ram

2008-02-25 Thread Karsten Wiese
On Sunday, 24 February 2008, Paul E. McKenney wrote:
> On Sun, Feb 24, 2008 at 04:38:15PM +0100, Karsten Wiese wrote:
> > On Saturday, 23 February 2008, Karsten Wiese wrote:
> > > On Saturday, 23 February 2008, Paul E. McKenney wrote:
> > > > On Sat, Feb 23, 2008 at 01:41:02PM +0100, Karsten Wiese wrote:
> > > > > Hi,
> > > > > 
> > > > > This appeared in dmesg after
> > > > >   $ echo core > /sys/power/pm_test
> > > > > followed by 3 cycles of
> > > > >   $ echo mem > /sys/power/state
> > > > > . .config attached.
> > > > > 
> > > > > dmesg excerpt (, full ~1MByte available):
> > > > 
> > > > Does this tree have http://lkml.org/lkml/2008/1/29/208 applied?
> > > 
> > > Yes. This tree was linus' git head as of yesterday or the day before.
> > 
> > Updated to git-head of today, same test and .config, different symptoms
> > like in this thread: http://lkml.org/lkml/2008/2/23/260
> > Later in this thread, Alan Cox said it looked like irq problems.
> > Maybe the rcupreempt-related WARN_ONs I saw are also caused by irq problems.
> 
> Might be, but am taking a closer look at the interaction between irq,
> dynticks, and rcupreempt in any case.

[Added Rafael, Thomas and Steven to CC]

The "different symptoms" above are indeed unrelated and solved by reverting
"commit 559bbe6cbd0d8c68d40076a5f7dc98e3bf5864b2
 power_state: get rid of write-only variable in SATA"

The cpu_hotplug code used by suspend together with hr_timer and nohz looks
suspicious:
$ echo 0 > /sys/devices/system/cpu/cpu1/online
$ echo 1 > /sys/devices/system/cpu/cpu1/online
(repeat until dmesg|tail shows WARNs; here it took 2 iterations)
causes symptoms like in 1st message of this thread again.

Thanks,
  Karsten


Re: Tiny cpusets -- cpusets for small systems?

2008-02-25 Thread Paul Jackson
> So. I see cpusets as a higher level API/mechanism and cpu_isolated_map as 
> lower
> level mechanism that actually makes kernel aware of what's isolated what's 
> not.
> Kind of like sched domain/cpuset relationship. ie cpusets affect sched domains
> but scheduler does not use cpusets directly.

One could use cpusets to control the setting of cpu_isolated_map,
separate from the code such as your select_irq_affinity() that
uses it.


> In a foreseeable future 2-8 cores will be most common configuration.
> Do you think that cpusets are needed/useful for those machines ?
> The reason I'm asking is because given the restrictions you mentioned
> above it seems that you might as well just do
>   taskset -c 1,2,3 app1
>   taskset -c 3,4,5 app2 

People tend to manage the CPU and memory placement of the threads
and processes within a single co-operating job using taskset
(sched_setaffinity) and numactl (mbind, set_mempolicy.)

They tend to manage the placement of multiple unrelated jobs onto
a single system, whether on separate or shared CPUs and nodes,
using cpusets.
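
As a small illustration of why unrelated jobs need coordinated placement, the sketch below parses taskset-style CPU lists into bitmasks so that two placements can be compared. In the quoted example, app1 on 1,2,3 and app2 on 3,4,5 both include CPU 3. `parse_cpu_list` and `cpu_list_mask` are hypothetical helpers written for this note, limited to 64 CPUs; they are not part of taskset or of any kernel interface.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a CPU list such as "1,2,3" or "0-3,8" into a 64-bit mask.
 * Returns 0 on success, -1 on malformed input or a CPU number >= 64.
 */
static int parse_cpu_list(const char *s, uint64_t *mask)
{
	*mask = 0;
	while (*s) {
		char *end;
		unsigned long lo, hi;

		if (*s < '0' || *s > '9')
			return -1;
		lo = hi = strtoul(s, &end, 10);
		if (*end == '-') {
			s = end + 1;
			if (*s < '0' || *s > '9')
				return -1;
			hi = strtoul(s, &end, 10);
		}
		if (lo > hi || hi >= 64)
			return -1;
		for (unsigned long cpu = lo; cpu <= hi; cpu++)
			*mask |= (uint64_t)1 << cpu;
		if (*end == ',') {
			end++;
			if (*end == '\0')
				return -1;	/* trailing comma */
		} else if (*end != '\0') {
			return -1;
		}
		s = end;
	}
	return 0;
}

/* Convenience wrapper: the mask for a list, or 0 if it does not parse. */
static uint64_t cpu_list_mask(const char *s)
{
	uint64_t mask;

	return parse_cpu_list(s, &mask) == 0 ? mask : 0;
}
```

ANDing the masks of two jobs shows the CPUs they would share. A system-wide manager can see that overlap, whereas each job calling taskset on its own cannot.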

Something like cpu_isolated_map looks to me like a system-wide
mechanism, which should, like sched_domains, be managed system-wide.
Managing it with a mechanism that encourages each thread to update
it directly, as if that thread owned the system, will break down,
resulting in conflicting updates, as multiple, insufficiently
co-operating threads issue conflicting settings.


> Stuff that I'm working on this days (wireless basestations) is designed
> with the  following model:
>   cpuN - runs soft-RT networking and management code
>   cpuN+1 to cpuN+x - are used as dedicated engines
> ie Simplest example would be
>   cpu0 - runs IP, L2 and control plane
>   cpu1 - runs hard-RT MAC 
> 
> So if CPU isolation is implemented on top of the cpusets what kind of API do 
> you envision for such an app ?

That depends on what more API is needed.  Do we need to place
irqs better ... cpusets might not be a natural fit for that use.
Aren't irqs directed to specific CPUs, not to hierarchically
nested subsets of CPUs?

Separate question:
  Is it desired that the dedicated CPUs cpuN+1 ... cpuN+x even appear
  as general purpose systems running a Linux kernel in your systems?
  These dedicated engines seem more like intelligent devices to me,
  such as disk controllers, which the kernel controls via device
  drivers, not by loading itself on them too.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214


Re: [PATCH] Make the kernel NTP code hand 64-bit *unsigned* values to do_div()

2008-02-25 Thread David Howells
Roman Zippel <[EMAIL PROTECTED]> wrote:

> I would actually prefer to introduce an explicit API for signed 64 
> divides to get rid of the temps completely

Yeah, that's a better plan.

David


Re: [PATCH 00/37] Permit filesystem local caching

2008-02-25 Thread David Howells
Daniel Phillips <[EMAIL PROTECTED]> wrote:

> On Monday 25 February 2008 15:19, David Howells wrote:
> > So I guess there's a problem in cachefiles's efficiency - possibly due
> > to the fact that it tries to be fully asynchronous.
> 
> OK, not just my imagination, and it makes me feel better about the patch 
> set because efficiency bugs are fixable while fundamental limitations 
> are not.

One can hope:-)

> How much of a hurry are you in to merge this feature?  You have bits 
> like this:

I'd like to get it upstream sooner rather than later.  As it's not upstream,
but its prerequisite patches touch a lot of code, I have to spend time
regularly making my patches work again.  Merge windows are completely not fun.

> "Add a function to install a monitor on the page lock waitqueue for a 
> particular page, thus allowing the page being unlocked to be detected.
> This is used by CacheFiles to detect read completion on a page in the 
> backing filesystem so that it can then copy the data to the waiting 
> netfs page."
> 
> We already have that hook, it is called bio_endio.

Except that isn't accessible.  CacheFiles currently has no access to the
notification from the blockdev to the backing fs, if indeed there is one.  All
we can do is trap the backing fs page becoming available.

> My strong intuition is that your whole mechanism should sit directly on the
> block device, no matter how attractive it seems to be able to piggyback on
> the namespace and layout management code of existing filesystems.

There's a place for both.

Consider a laptop with a small disk, possibly subdivided between Linux and
Windows.  Linux then subdivides its bit further to get a swap space.  What you
then propose is to break off yet another chunk to provide the cache.  You
can't then use this other chunk for anything else, even if it's, say, 1% used
by the cache.

The way CacheFiles works is that you tell it that it can use up to a certain
percentage of the otherwise free disk space on an otherwise existing
filesystem.  In the laptop case, you may just have a single big partition.  The
cache will fill up as much of it as it can, and as the other contents of the
partition consume space, the cache will be culled to make room.
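
The free-space policy described above could be sketched roughly as follows. This is an illustrative userspace model, not the CacheFiles implementation: `struct cull_policy`, its two percentage thresholds and both functions are hypothetical, and the use of two thresholds (hysteresis) is an assumption made here to keep the example from flapping between states.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical policy knobs, as percentages of the whole filesystem. */
struct cull_policy {
	unsigned int run_pct;	/* at or above this much free space, stop culling */
	unsigned int cull_pct;	/* below this much free space, start culling */
};

/* Free space as a whole-number percentage of the total (rounded down). */
static unsigned int free_pct(uint64_t free_blocks, uint64_t total_blocks)
{
	if (total_blocks == 0)
		return 0;
	/* A sketch: assumes free_blocks * 100 does not overflow 64 bits. */
	return (unsigned int)(free_blocks * 100 / total_blocks);
}

/*
 * Decide whether the cache should be discarding objects. Culling starts
 * when free space drops below cull_pct and continues until it has
 * recovered to run_pct; in between, the previous state is kept.
 */
static bool should_cull(const struct cull_policy *p, uint64_t free_blocks,
			uint64_t total_blocks, bool culling_now)
{
	unsigned int pct = free_pct(free_blocks, total_blocks);

	if (pct < p->cull_pct)
		return true;
	if (pct >= p->run_pct)
		return false;
	return culling_now;
}
```

With thresholds of 10% and 7%, for instance, a filesystem at 5% free would trigger culling, and culling would continue through 8% free until 10% is reached again.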

On the other hand, a system like my desktop, where I can slap in extra disks
with mounds of extra disk space, it might very well make sense to commit block
devices to caching, as this can be used to gain performance.

I have another cache backend (CacheFS) which takes the form of a filesystem,
thus allowing you to mount a blockdev as a cache.  It's much faster than Ext3
at storing and retrieving files... at first.  The problem is that I've mucked
up the free space retrieval such that performance degrades by 20x over time for
files of any size.

Basically any cache on a raw blockdev _is_ a filesystem, just one in which
you're randomly allowed to discard data to make life easier.

> I see your current effort as the moral equivalent of FUSE: you are able to
> demonstrate certain desirable behavioral properties, but you are unable to
> reach full theoretical efficiency because there are layers and layers of
> interface gunk interposed between the netfs user and the cache device.

The interface gunk is meant to be as thin as possible, but there are
constraints (see the documentation in the main FS-Cache patch for more
details):

 (1) It's a requirement that it not be tied only to, say, AFS.  We might have
 several netfs's that want caching: AFS, CIFS, ISOFS (okay, that last isn't
 really a netfs, but it might still want caching).

 (2) I want to be able to change the backing cache.  Under some circumstances I
 might want to use an existing filesystem, under others I might want to
 commit a blockdev.  I've even been asked about using battery-backed RAM -
 which has different design constraints.

 (3) The constraint has been imposed by the NFS team that the cache be
 completely asynchronous.  I haven't quite met this: readpages() will wait
 until the cache knows whether or not the pages are available on the
 principle that read operations done through the cache can be considered
 synchronous.  This is an attempt to reduce the context switchage involved.

Unfortunately, the asynchronicity requirement has caused the middle layer to
bloat.  Fortunately, the backing cache needn't bloat as it can use the middle
layer's bloat.

> That said, I also see you have put a huge amount of work into this over 
> the years, it is nicely broken out, you are responsive and easy to work 
> with, all arguments for an early merge.  Against that, you invade core 
> kernel for reasons that are not necessarily justified:
>
>   * two new page flags

I need to keep track of two bits of per-cached-page information:

 (1) This page is known by the cache, and that the cache must be informed if
 the page is going to go away.

 (2) This page is being written to disk by the cache, and that it cannot be
 released 

Re: Tabs, spaces, indent and 80 character lines

2008-02-25 Thread Jan Engelhardt

On Feb 25 2008 23:13, Richard Knutsson wrote:
> Miles Bader wrote:
>> Why do people even respond to these trolls...?
>>   
> Obviously, this must have been discussed before, with a clear conclusion.

It has been discussed before, at http://lkml.org/lkml/2007/11/12/19 .

What is really frustrating is that some of the people who _do_ 
enforce one style of the two (tabs-only or tabs-spaces) have only
indirectly voiced their preferred style, as in
http://lkml.org/lkml/2008/1/19/67 .

Often it was just (1)"checkpatch flagged your spaces, hence they are 
wrong" instead of (2)"in my tree, I only take tabs-only patches".


Now back to coding, oh and don't forget to send a patch for CodingStyle 
since a mail without one is often taken even less seriously.


Re: [PATCH] Make the kernel NTP code hand 64-bit *unsigned* values to do_div()

2008-02-25 Thread Roman Zippel
Hi,

On Thu, 21 Feb 2008, David Howells wrote:

> The kernel NTP code shouldn't hand 64-bit *signed* values to do_div().  Make 
> it
> instead hand 64-bit unsigned values.  This gets rid of a couple of warnings.

I would actually prefer to introduce an explicit API for signed 64 
divides to get rid of the temps completely, something like below.
Right now it uses do_div as fallback. When all archs are converted, do_div 
can become a single compatibility define and perhaps we can get rid of it
completely.
Bonus feature: implement the x86 version without the asm casts allowing 
gcc to generate better code.
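
For readers who want to see the shape of such an API outside the kernel, here is a minimal userspace sketch of a signed 64-by-32 divide layered on an unsigned one. It reuses the names div_u64_rem and div_s64_rem from the patch, but the bodies are an assumption about how an out-of-line fallback might look; the lib/div64.c hunk is not reproduced here, and the unsigned helper simply uses native C division in place of do_div.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 64-by-32 divide; stands in for a do_div()-style primitive. */
static uint64_t div_u64_rem(uint64_t dividend, uint32_t divisor,
			    uint32_t *remainder)
{
	assert(divisor != 0);
	*remainder = (uint32_t)(dividend % divisor);
	return dividend / divisor;
}

/*
 * Signed divide built on the unsigned one. The quotient truncates toward
 * zero and the remainder takes the sign of the dividend, matching C's
 * native '/' and '%' operators.
 */
static int64_t div_s64_rem(int64_t dividend, int32_t divisor,
			   int32_t *remainder)
{
	/* Negate in unsigned arithmetic so the most negative values are safe. */
	uint64_t udividend = dividend < 0 ? -(uint64_t)dividend : (uint64_t)dividend;
	uint32_t udivisor = divisor < 0 ? -(uint32_t)divisor : (uint32_t)divisor;
	uint32_t urem;
	uint64_t uquot = div_u64_rem(udividend, udivisor, &urem);

	*remainder = dividend < 0 ? -(int32_t)urem : (int32_t)urem;
	return (dividend < 0) != (divisor < 0) ? (int64_t)(0 - uquot)
					       : (int64_t)uquot;
}
```

Callers then need no temporaries or casts at the call site, which is the point of the proposal: the sign handling lives in one place instead of being repeated around every do_div.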

bye, Roman

---
 include/asm-generic/div64.h |   14 ++
 include/asm-i386/div64.h|   20 
 include/linux/calc64.h  |   28 
 kernel/time.c   |   26 +++---
 kernel/time/ntp.c   |   21 +
 lib/div64.c |   21 -
 6 files changed, 94 insertions(+), 36 deletions(-)

Index: linux-2.6/include/asm-generic/div64.h
===
--- linux-2.6.orig/include/asm-generic/div64.h
+++ linux-2.6/include/asm-generic/div64.h
@@ -35,6 +35,20 @@ static inline uint64_t div64_64(uint64_t
return dividend / divisor;
 }
 
+static inline u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder)
+{
+   *remainder = dividend % divisor;
+   return dividend / divisor;
+}
+#define div_u64_rem	div_u64_rem
+
+static inline s64 div_s64_rem(s64 dividend, s32 divisor, s32 *remainder)
+{
+   *remainder = dividend % divisor;
+   return dividend / divisor;
+}
+#define div_s64_rem	div_s64_rem
+
 #elif BITS_PER_LONG == 32
 
 extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);
Index: linux-2.6/include/asm-i386/div64.h
===
--- linux-2.6.orig/include/asm-i386/div64.h
+++ linux-2.6/include/asm-i386/div64.h
@@ -48,5 +48,25 @@ div_ll_X_l_rem(long long divs, long div,
 
 }
 
+static inline u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder)
+{
+   union {
+   u64 v64;
+   u32 v32[2];
+   } d = { dividend };
+   u32 upper;
+
+   upper = d.v32[1];
+   if (upper) {
+   upper = d.v32[1] % divisor;
+   d.v32[1] = d.v32[1] / divisor;
+   }
+   asm ("divl %2" : "=a" (d.v32[0]), "=d" (*remainder) :
+   "rm" (divisor), "0" (d.v32[0]), "1" (upper));
+   return d.v64;
+}
+#define div_u64_rem	div_u64_rem
+
 extern uint64_t div64_64(uint64_t dividend, uint64_t divisor);
+
 #endif
Index: linux-2.6/include/linux/calc64.h
===
--- linux-2.6.orig/include/linux/calc64.h
+++ linux-2.6/include/linux/calc64.h
@@ -46,4 +46,32 @@ static inline long div_long_long_rem_sig
return res;
 }
 
+#ifndef div_u64_rem
+static inline u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder)
+{
+   *remainder = do_div(dividend, divisor);
+   return dividend;
+}
+#endif
+
+#ifndef div_u64
+static inline u64 div_u64(u64 dividend, u32 divisor)
+{
+   u32 remainder;
+   return div_u64_rem(dividend, divisor, &remainder);
+}
+#endif
+
+#ifndef div_s64_rem
+extern s64 div_s64_rem(s64 dividend, s32 divisor, s32 *remainder);
+#endif
+
+#ifndef div_s64
+static inline s64 div_s64(s64 dividend, s32 divisor)
+{
+   s32 remainder;
+   return div_s64_rem(dividend, divisor, &remainder);
+}
+#endif
+
 #endif
Index: linux-2.6/kernel/time.c
===
--- linux-2.6.orig/kernel/time.c
+++ linux-2.6/kernel/time.c
@@ -661,9 +661,7 @@ clock_t jiffies_to_clock_t(long x)
 #if (TICK_NSEC % (NSEC_PER_SEC / USER_HZ)) == 0
return x / (HZ / USER_HZ);
 #else
-   u64 tmp = (u64)x * TICK_NSEC;
-   do_div(tmp, (NSEC_PER_SEC / USER_HZ));
-   return (long)tmp;
+   return div_u64((u64)x * TICK_NSEC, NSEC_PER_SEC / USER_HZ);
 #endif
 }
 EXPORT_SYMBOL(jiffies_to_clock_t);
@@ -675,16 +673,12 @@ unsigned long clock_t_to_jiffies(unsigne
return ~0UL;
return x * (HZ / USER_HZ);
 #else
-   u64 jif;
-
/* Don't worry about loss of precision here .. */
if (x >= ~0UL / HZ * USER_HZ)
return ~0UL;
 
/* .. but do try to contain it here */
-   jif = x * (u64) HZ;
-   do_div(jif, USER_HZ);
-   return jif;
+   return div_u64((u64)x * HZ, USER_HZ);
 #endif
 }
 EXPORT_SYMBOL(clock_t_to_jiffies);
@@ -692,17 +686,15 @@ EXPORT_SYMBOL(clock_t_to_jiffies);
 u64 jiffies_64_to_clock_t(u64 x)
 {
 #if (TICK_NSEC % (NSEC_PER_SEC / USER_HZ)) == 0
-   do_div(x, HZ / USER_HZ);
+   return div_u64(x, HZ / USER_HZ);
 #else
/*
 * There are better ways that don't overflow early,
 * but even this doesn't overflow in hundreds of years
 * in 64 bits, so..
 */
-   x *= 
