Re: Proposal for "proper" durable fsync() and fdatasync()
Jeff Garzik wrote: > Jamie Lokier wrote: > >By durable, I mean that fsync() should actually commit writes to > >physical stable storage, > > Yes, it should. Glad we agree :-) > >I was surprised that fsync() doesn't do this already. There was a lot > >of effort put into block I/O write barriers during 2.5, so that > >journalling filesystems can force correct write ordering, using disk > >flush cache commands. > > > >After all that effort, I was very surprised to notice that Linux 2.6.x > >doesn't use that capability to ensure fsync() flushes the disk cache > >onto stable storage. > > It's surprising you are surprised, given that this [lame] fsync behavior > has remaining consistently lame throughout Linux's history. I was surprised because of the effort put into IDE write barriers to get it right for in-kernel filesystems, and the messages in 2004 telling concerned users that fsync would use barriers in 2.6, which it does sometimes but not always. > [snip huge long proposal] > > Rather than invent new APIs, we should fix the existing ones to _really_ > flush data to physical media. > > Linux should default to SAFE data storage, and permit users to retain > the older unsafe behavior via an option. It's completely ridiculous > that we default to an unsafe fsync. Well, I agree with you. Which is why the "new API" I suggested, being really just an extension of an existing one, allows fsync() to be SAFE if that's what people want. To be fair, fsync() is rather overkill for some apps. sync_file_range() is obviously the right place for fine tuning "less safe" variations. > And [anticipating a common response from others] it is completely > irrelevant that POSIX fsync(2) permits Linux's current behavior. The > current behavior is unsafe. > > Safety before performance -- ESPECIALLY when it comes to storing user data. Especially now that people work a lot in guest VMs, where the IDE barrier stuff doesn't work if the host fdatasync() doesn't work. Since it happened with Mac OS X, I wouldn't be surprised if changing fsync() and just that wasn't popular. Heck, you already get people asking "how to turn off fsync in PostGreSQL"... (Haven't those people heard of transactions...?) But with changes to sync_file_range() [or whatever... I don't care] to support database's finely tuned commit needs, and then adoption of that by database vendors, perhaps nobody will mind fsync() becoming safe then. Nobody seems bothered by it's performance for other things. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: linux-next: Tree for Feb 24
On Tue, 26 Feb 2008, Stephen Rothwell wrote: > On Mon, 25 Feb 2008 22:56:04 +0100 (CET) Geert Uytterhoeven <[EMAIL > PROTECTED]> wrote: > > > > Can you please add > > http://linux-m68k-cvs.ubb.ca/~geert/linux-m68k-patches-2.6/series? > > So far there's only one patch in between NEXT_PATCHES_{START,END} yet, > > though. > > Added, thanks. > > > Ah, I see the m68k cross-compiler is still missing? ;-) > > Please look again :-) Are there any particular configs we should build? Great! I really should update the defconfigs, and add a `defconfig_all' which builds a kernel for all platforms (except Sun-3). For now defconfig is OK, I think. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [EMAIL PROTECTED] In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] ide-cd: fix some codestyle and most of the checkpatch.pl issues
Signed-off-by: Borislav Petkov <[EMAIL PROTECTED]> --- drivers/ide/ide-cd.c | 634 + 1 files changed, 323 insertions(+), 311 deletions(-) diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c index 3600648..3853eb5 100644 --- a/drivers/ide/ide-cd.c +++ b/drivers/ide/ide-cd.c @@ -13,8 +13,8 @@ * * Suggestions are welcome. Patches that work are more welcome though. ;-) * For those wishing to work on this driver, please be sure you download - * and comply with the latest Mt. Fuji (SFF8090 version 4) and ATAPI - * (SFF-8020i rev 2.6) standards. These documents can be obtained by + * and comply with the latest Mt. Fuji (SFF8090 version 4) and ATAPI + * (SFF-8020i rev 2.6) standards. These documents can be obtained by * anonymous ftp from: * ftp://fission.dt.wdc.com/pub/standards/SFF_atapi/spec/SFF8020-r2.6/PS/8020r26.ps * ftp://ftp.avc-pioneer.com/Mtfuji4/Spec/Fuji4r10.pdf @@ -41,17 +41,17 @@ #include /* For SCSI -> ATAPI command conversion */ -#include -#include +#include +#include #include -#include +#include #include #include "ide-cd.h" static DEFINE_MUTEX(idecd_ref_mutex); -#define to_ide_cd(obj) container_of(obj, struct cdrom_info, kref) +#define to_ide_cd(obj) container_of(obj, struct cdrom_info, kref) #define ide_cd_g(disk) \ container_of((disk)->private_data, struct cdrom_info, driver) @@ -77,13 +77,12 @@ static void ide_cd_put(struct cdrom_info *cd) mutex_unlock(_ref_mutex); } -/ +/* * Generic packet command support and error handling routines. */ -/* Mark that we've seen a media change, and invalidate our internal - buffers. */ -static void cdrom_saw_media_change (ide_drive_t *drive) +/* Mark that we've seen a media change, and invalidate our internal buffers. */ +static void cdrom_saw_media_change(ide_drive_t *drive) { struct cdrom_info *cd = drive->driver_data; @@ -100,46 +99,45 @@ static int cdrom_log_sense(ide_drive_t *drive, struct request *rq, return 0; switch (sense->sense_key) { - case NO_SENSE: case RECOVERED_ERROR: - break; - case NOT_READY: - /* -* don't care about tray state messages for -* e.g. capacity commands or in-progress or -* becoming ready -*/ - if (sense->asc == 0x3a || sense->asc == 0x04) - break; - log = 1; - break; - case ILLEGAL_REQUEST: - /* -* don't log START_STOP unit with LoEj set, since -* we cannot reliably check if drive can auto-close -*/ - if (rq->cmd[0] == GPCMD_START_STOP_UNIT && sense->asc == 0x24) - break; - log = 1; - break; - case UNIT_ATTENTION: - /* -* Make good and sure we've seen this potential media -* change. Some drives (i.e. Creative) fail to present -* the correct sense key in the error register. -*/ - cdrom_saw_media_change(drive); + case NO_SENSE: case RECOVERED_ERROR: + break; + case NOT_READY: + /* +* don't care about tray state messages for +* e.g. capacity commands or in-progress or +* becoming ready +*/ + if (sense->asc == 0x3a || sense->asc == 0x04) break; - default: - log = 1; + log = 1; + break; + case ILLEGAL_REQUEST: + /* +* don't log START_STOP unit with LoEj set, since +* we cannot reliably check if drive can auto-close +*/ + if (rq->cmd[0] == GPCMD_START_STOP_UNIT && + sense->asc == 0x24) break; + log = 1; + break; + case UNIT_ATTENTION: + /* +* Make good and sure we've seen this potential media +* change. Some drives (i.e. Creative) fail to present +* the correct sense key in the error register. +*/ + cdrom_saw_media_change(drive); + break; + default: + log = 1; + break; } return log; } -static -void cdrom_analyze_sense_data(ide_drive_t *drive, - struct request *failed_command, - struct request_sense *sense)
ide-cd: trivial fixes
Hi Bart, here some trivial fixes that i wanted to get out the door. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] ide-cd: put proc-related functions together under single ifdef
Signed-off-by: Borislav Petkov <[EMAIL PROTECTED]> --- drivers/ide/ide-cd.c | 29 + 1 files changed, 13 insertions(+), 16 deletions(-) diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c index 546f436..3600648 100644 --- a/drivers/ide/ide-cd.c +++ b/drivers/ide/ide-cd.c @@ -1894,19 +1894,6 @@ int ide_cdrom_setup (ide_drive_t *drive) return 0; } -#ifdef CONFIG_IDE_PROC_FS -static -sector_t ide_cdrom_capacity (ide_drive_t *drive) -{ - unsigned long capacity, sectors_per_frame; - - if (cdrom_read_capacity(drive, , _per_frame, NULL)) - return 0; - - return capacity * sectors_per_frame; -} -#endif - static void ide_cd_remove(ide_drive_t *drive) { struct cdrom_info *info = drive->driver_data; @@ -1940,14 +1927,24 @@ static void ide_cd_release(struct kref *kref) static int ide_cd_probe(ide_drive_t *); #ifdef CONFIG_IDE_PROC_FS -static int proc_idecd_read_capacity - (char *page, char **start, off_t off, int count, int *eof, void *data) +static sector_t ide_cdrom_capacity(ide_drive_t *drive) +{ + unsigned long capacity, sectors_per_frame; + + if (cdrom_read_capacity(drive, , _per_frame, NULL)) + return 0; + + return capacity * sectors_per_frame; +} + +static int proc_idecd_read_capacity(char *page, char **start, off_t off, + int count, int *eof, void *data) { ide_drive_t *drive = data; int len; len = sprintf(page,"%llu\n", (long long)ide_cdrom_capacity(drive)); - PROC_IDE_READ_RETURN(page,start,off,count,eof,len); + PROC_IDE_READ_RETURN(page, start, off, count, eof, len); } static ide_proc_entry_t idecd_proc[] = { -- 1.5.4.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/10] CGroup API files: Various cleanup to CGroup control files
Paul Menage wrote: > On Mon, Feb 25, 2008 at 7:23 PM, Li Zefan <[EMAIL PROTECTED]> wrote: >> Should those pathces be rebased againt 2.6.25-rc3 ? >> > > No, because they're against 2.6.25-rc2-mm1, which is already has (I > think) any of the new bits in 2.6.25-rc3 that would be affected by > these patches. > > Paul -rc2-mm1 came out on 2008-02-16, but the patches I posted several days ago has been merged into -rc3, so your patches don't apply now. :( Think about ./MAINTAINERS update and set up a git tree for development of cgroup and cgroup subsystems as Andrew suggested. :) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Compex FreedomLine 32 PnP-PCI2 broken with de2104x
On Mon, Feb 25, 2008 at 02:30:00AM -0500, Jeff Garzik wrote: > Grant Grundler wrote: >> On Mon, Feb 18, 2008 at 05:40:42PM +0100, Ondrej Zary wrote: >>> I think that de2104x driver should be removed (or at least its >>> MODULE_DEVICE_TABLE) and MODULE_DEVICE_TABLE with only 21040 and 21041 >>> PCI IDs added to de4x5. >>> >>> I can send a patch if this is acceptable. >> It's acceptable to me. Jeff? (jgarzik) > > NAK, sorry, for two reasons: > > 1) we don't delete otherwise clean, working drivers Just to be clear - he's not trying to remove the driver. He's just interested in making de4x5 the "default" for this set of boards by doctoring with the pci device ids each driver will claim. > simply because of a bug triggered by unplugging a cable. Ondrej would be happy to test any patches sent. Tracking this sort of bug down usually isn't trivial as the statement above implies. > 2) de4x5 needs to go away. Ok. I'd prefer to wait until someone demonstrates de2104x driver is a fully functional alternative for all the PCI Ids it claims. thanks, grant -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.6.24.3
On Mon, Feb 25, 2008 at 17:00:24 -0800, Greg Kroah-Hartman wrote: > We (the -stable team) are announcing the release of the 2.6.24.3 > kernel. Hi, I can see the patch in http://www.kernel.org/pub/linux/kernel/v2.6/, but no incremental patch in http://www.kernel.org/pub/linux/kernel/v2.6/incr/. Is this due to some delay, or was is just not uploaded? Regards, Tino -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PROBLEM: 2.4.36.1 hangs.
Aloha, The "ext2_readdir() filp->f_pos fix" patch looks weird... Perhaps the "filp->f_pos += le16_to_cpu(de->rec_len);" line should be outside of the if statement like the indentation implies? As it is, filp->f_pos gets corrupted if de->inode is ever zero... This could possibly explain why I had a few strange directory entries until I checked the filesystem with: e2fsck -D -F -f /dev/{ext2 partition} - glen Here is an updated (untested) patch: --- linux-2.4.36.orig/fs/ext2/dir.c +++ linux-2.4.36/fs/ext2/dir.c @@ -240,7 +240,7 @@ ext2_readdir (struct file * filp, void * loff_t pos = filp->f_pos; struct inode *inode = filp->f_dentry->d_inode; struct super_block *sb = inode->i_sb; - unsigned offset = pos & ~PAGE_CACHE_MASK; + unsigned int offset = pos & ~PAGE_CACHE_MASK; unsigned long n = pos >> PAGE_CACHE_SHIFT; unsigned long npages = dir_pages(inode); unsigned chunk_mask = ~(ext2_chunk_size(inode)-1); @@ -258,8 +258,13 @@ ext2_readdir (struct file * filp, void * ext2_dirent *de; struct page *page = ext2_get_page(inode, n); - if (IS_ERR(page)) + if (IS_ERR(page)) { + ext2_error(sb, __FUNCTION__, + "bad page in #%lu", + inode->i_ino); + filp->f_pos += PAGE_CACHE_SIZE - offset; continue; + } kaddr = page_address(page); if (need_revalidate) { offset = ext2_validate_entry(kaddr, offset, chunk_mask); @@ -267,7 +272,7 @@ ext2_readdir (struct file * filp, void * } de = (ext2_dirent *)(kaddr+offset); limit = kaddr + PAGE_CACHE_SIZE - EXT2_DIR_REC_LEN(1); - for ( ;(char*)de <= limit; de = ext2_next_entry(de)) + for ( ;(char*)de <= limit; de = ext2_next_entry(de)) { if (de->inode) { int over; unsigned char d_type = DT_UNKNOWN; @@ -284,11 +289,12 @@ ext2_readdir (struct file * filp, void * goto done; } } + filp->f_pos += le16_to_cpu(de->rec_len); + } ext2_put_page(page); } done: - filp->f_pos = (n << PAGE_CACHE_SHIFT) | offset; filp->f_version = inode->i_version; UPDATE_ATIME(inode); return 0; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for "proper" durable fsync() and fdatasync()
Jamie Lokier wrote: By durable, I mean that fsync() should actually commit writes to physical stable storage, Yes, it should. I was surprised that fsync() doesn't do this already. There was a lot of effort put into block I/O write barriers during 2.5, so that journalling filesystems can force correct write ordering, using disk flush cache commands. After all that effort, I was very surprised to notice that Linux 2.6.x doesn't use that capability to ensure fsync() flushes the disk cache onto stable storage. It's surprising you are surprised, given that this [lame] fsync behavior has remaining consistently lame throughout Linux's history. [snip huge long proposal] Rather than invent new APIs, we should fix the existing ones to _really_ flush data to physical media. Linux should default to SAFE data storage, and permit users to retain the older unsafe behavior via an option. It's completely ridiculous that we default to an unsafe fsync. And [anticipating a common response from others] it is completely irrelevant that POSIX fsync(2) permits Linux's current behavior. The current behavior is unsafe. Safety before performance -- ESPECIALLY when it comes to storing user data. Regards, Jeff (Linux ATA driver dude) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for "proper" durable fsync() and fdatasync()
On Tue, 26 Feb 2008 07:26:50 + Jamie Lokier <[EMAIL PROTECTED]> wrote: > (It would be nicer if sync_file_range() > took a vector of ranges for better elevator scheduling, but let's > ignore that :-) Two passes: Pass 1: shove each of the segments into the queue with SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE Pass 2: wait for them all to complete and return accumulated result with SYNC_FILE_RANGE_WAIT_AFTER -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86_64: remove wrong setting about CONSTANT_TSC for intel cpu
Yinghai Lu wrote: which is the same. set_cpu_cap() is indeed the cleaner form to do this so your patch is correct as a cleanup. set_cpu_cap is right == set_bit(X86_FEATURE_CONSTANT_TSC, >x86_capability); ===> is wrong should be set_bit(X86_FEATURE_CONSTANT_TSC, c->x86_capability); x86_capability is a array ... For an array, the & is optional and has no effect. So they mean the same thing. -hpa -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86_64: remove wrong setting about CONSTANT_TSC for intel cpu
* Ingo Molnar <[EMAIL PROTECTED]> wrote: > > set_cpu_cap is right > > == > > set_bit(X86_FEATURE_CONSTANT_TSC, >x86_capability); ===> is wrong > > should be > > set_bit(X86_FEATURE_CONSTANT_TSC, c->x86_capability); > > > > x86_capability is a array ... > > > > so this could prevent some data corruption. > > ah, right you are! [...] actually, not: >x86_capability and c->x86_capability result in the same address (it's an array, not a pointer), so there's no "data corruption". If x86_capability were a pointer then you would be right - so this is all worth cleaning up. Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86_64: remove wrong setting about CONSTANT_TSC for intel cpu
* Yinghai Lu <[EMAIL PROTECTED]> wrote: > > #define set_cpu_cap(c, bit) set_bit(bit, (unsigned long > > *)((c)->x86_capability) > > > > which is the same. set_cpu_cap() is indeed the cleaner form to do this > > so your patch is correct as a cleanup. > set_cpu_cap is right > == > set_bit(X86_FEATURE_CONSTANT_TSC, >x86_capability); ===> is wrong > should be > set_bit(X86_FEATURE_CONSTANT_TSC, c->x86_capability); > > x86_capability is a array ... > > so this could prevent some data corruption. ah, right you are! The commit was done in a sloppy, incomplete way, leaving around lots of direct c->x86_capability references and giving room for this bug ... Btw., there's one other place that has the same bug. I'll fix that up too. Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [xfs-masters] Re: filesystem corruption on xfs after 2.6.25-rc1 (bisected, powerpc related?)
On Tue, Feb 26, 2008 at 01:13:56AM +0100, Rafael J. Wysocki wrote: > On Tuesday, 26 of February 2008, Christoph Hellwig wrote: > > On Tue, Feb 26, 2008 at 12:52:56AM +0100, Rafael J. Wysocki wrote: > > > > I'm not suggesting a partial revert; I just wonder which part of the > > > > change is causing the problem, as part of the debugging process. > > > > > > Understood. > > > > > > My point is, if that's not practical (whatever the reason), I'd consider > > > reverting all of the commits in question. > > > > If you could revert all of them and verify it makes the problem go away > > that would be a very good start already. > > The original reporter (CC added) said exactly that, if I understood him > correctly: > > http://lkml.org/lkml/2008/2/25/123 Sorry if I was not clear. The problematic commit after bisecting is a69b176df246d59626e6a9c640b44c0921fa4566. Reverting this commit and commit edd319dc527733e61eec5bdc9ce20c94634b6482 fixes the problem. So all other commits in the XFS merge for 2.6.25 seem to be OK. I had to revert the second commit only to avoid a merge conflict. And I forget to mention on my first post: Please CC me on all replies. I'm not subscribed to the lists. Gaudenz -- Ever tried. Ever failed. No matter. Try again. Fail again. Fail better. ~ Samuel Beckett ~ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86: PARAVIRT needed by PARAVIRT_GUEST or X86_VSMP
* Yinghai Lu <[EMAIL PROTECTED]> wrote: > so it could be off automatically when PARAVIRT_GUEST or X86_VSMP is > not there thanks, applied. This whole VSMP + PARAVIRT business has to be done cleaner though before it's upstream ready, all the Kconfig magic looks rather twisted. Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Proposal for "proper" durable fsync() and fdatasync()
Dear kernel, This is a proposal to add "proper" durable fsync() and fdatasync() to Linux. First the problem, then a proposed solution "with benefits", so to speak. I need feedback on the details, before implementing anything. Or (hopefully) someone else thinks it's very important and does it themselves :-) By durable, I mean that fsync() should actually commit writes to physical stable storage, not just the disk write cache when that is enabled. Databases and guest VMs needs this, or an equivalent feature, if they aren't to face occasional corruption after power failure and perhaps some crashes. The alternative is to disable the disk write cache. But that isn't modern practice or recommendation, since I/O write barriers were implemented and they are much faster. I was surprised that fsync() doesn't do this already. There was a lot of effort put into block I/O write barriers during 2.5, so that journalling filesystems can force correct write ordering, using disk flush cache commands. After all that effort, I was very surprised to notice that Linux 2.6.x doesn't use that capability to ensure fsync() flushes the disk cache onto stable storage. I noticed this following up discussions on the Qemu mailing list, about guest VMs and how their IDE flush cache command should translate to fsync() to avoid data loss. (For guest VMs, fsync() isn't necessary if the host machine is fine, and it isn't enough (on Linux host) if the host machine loses power or the hard disk crashes another way.) Then I noticed it again, when I was designing a database engine with filesystem characteristics. I thought "how do I ensure ordered journal writes; can I use fdatasync()?" and was surprised to find the answer is no, I have to use hacks like calling hdparm, and the authors of major SQL databases seem to brush the problem under a carpet. (Interestingly, in the Linux 2.4 patches for write barriers, fsync() seems to be fine, if a bit slow.) It isn't the first time this topic has come up: http://groups.google.com.br/group/linux.kernel/browse_thread/thread/d343e51655b4ac7c/7ee9bca80977c2d1?#7ee9bca80977c2d1 ("True fsync() in Linux (on IDE)") In that thread, it was implied that would be fixed in 2.6. So I bet some people are under the illusion that it's fixed in 2.6... For a while, I've been meaning to bring it up on linux-kernel... The fsync problem - Chris Wedgwood wrote: > On Mon, Feb 25, 2008 at 08:50:40PM +, Jamie Lokier wrote: > > > On Linux (and other host OSes), fdatsync() and fsync() don't always > > commit data to hard storage; it sometimes only commits it to the hard > > drive cache. > > That's a filesystem bug IMO. People should be able to use f[data]sync > with some level onf confidence or else it's basically pointless. I agree, I consider it a serious bug, and I would be pleased if someone paid it some love and attention. Right now, if you want a reliable database on Linux, you _cannot_ properly depend on fsync() or fdatasync(). Considering how much Linux is used for critical databases, using these functions, this amazes me. Also, if you have a guest VM, then the guest's filesystem journalling is not reliable. Not only can it lose data on power loss, it can corrupt the guest filesystem too, due to reordering. This is contrary to what people expect, I think. I'm not sure if a system reset can cause similar loss; I don't know how disks react to that. Also, for the person porting ZFS to run on FUSE, same applies... Linux fsync is faulty in two ways: 1. Database commits aren't _durable_ against power failure, because fsync doesn't flush the disk's cache. This means data stored is not guaranteed to be stored at the expected durability. 2. It's unsafe for write-ahead logging, because it doesn't really guarantee any _ordering_ for the writes at the hard storage level. So aside from losing committed data, it can also corrupt structural metadata. With ext3 it's quite easy to verify that fsync/fdatasync don't always write a journal entry. (Apart from looking at the kernel code :-) Just write some data, fsync(), and observe the number of writes in /proc/diskstats. If the current mtime second _hasn't_ changed, the inode isn't written. If you write data, say, 10 times a second to the same place followed by fsync(), you'll see a little more than 10 write I/Os, and less than 20. By the way, this shows a trick for fixing #2 (ordering): use fchmod() to toggle the file attributes, and that will force the next fsync() to write a journal entry, which _does_ issue a write barrier. If you do that with each write as above (write, fchmod change, fsync 10 times a second), you will clearly see more write I/Os, and you'll hear the disk behaving differently: it's seeking more. However, even this ugly trick has problems: 3. Using the fchmod() trick or good fortune, fsync() issues a write barrier. Right now, this does commit data (if
Re: compile problem in current x86.git
* Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote: > Ingo Molnar wrote: >> Jeremy, you might want to start tracking x86.git#testing: >> >> http://people.redhat.com/mingo/x86.git/README >> >> if you want to follow the latest & greatest x86.git code. >> > > Right, will do. generally it's as well tested on x86 as #mm (so you should not notice any difference while tracking it), but it also includes work in progress bits that might touch other subsystems as well. Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/2] [GFS2] re-support special inode
commit feaa7bba026c181ce071d5a4884f7f9dd26207a1 removed call to init_special_inode from inode lookuping, this cause problems as: # mknod /mnt/gfs2/dev/null c 1 3 # cat /mnt/gfs2/dev/null cat: /mnt/gfs2/dev/null: Invalid argument without special inode, GFS2 cannot support char device file, block device file, fifo pipe, and socket file, lose many important features as a common file system. this one line patch re add special inode support. Signed-off-by: Denis Cheng <[EMAIL PROTECTED]> --- fs/gfs2/inode.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c index 1069ceb..18d8e0b 100644 --- a/fs/gfs2/inode.c +++ b/fs/gfs2/inode.c @@ -150,6 +150,7 @@ void gfs2_set_iop(struct inode *inode) inode->i_op = _symlink_iops; } else { inode->i_op = _file_iops; + init_special_inode(inode, inode->i_mode, inode->i_rdev); } unlock_new_inode(inode); -- 1.5.4.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86_64: force re setting the mmconf for fam10h if acpi=off
* Yinghai Lu <[EMAIL PROTECTED]> wrote: > some BIOS only let AMD fam 10h handle bus0, and nvidia mcp55/ck804 to > handle other buses. at that case MCFG will cover all over them. > > but with acpi=off, we can not use MCFG. this patch will double check > the busnbits, and if it is less handling 256 bues, and acpi=off will > forcely reset the mmconf in msr, so we still use mmconf in above case. thanks, applied. > @@ -720,14 +720,21 @@ static void __cpuinit fam10h_check_enabl btw., a cleanliness suggestion: wouldnt it be cleaner to separate all the extra AMD family 16 code into a separate file? That should make it more focused and more isolated as well. Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2] [GFS2] remove gfs2_dev_iops
struct inode_operations gfs2_dev_iops is always the same as gfs2_file_iops, since Jan 2006, when GFS2 merged into mainstream kernel. So one of them could be removed. Signed-off-by: Denis Cheng <[EMAIL PROTECTED]> --- fs/gfs2/inode.c |2 +- fs/gfs2/ops_inode.c | 10 -- fs/gfs2/ops_inode.h |1 - 3 files changed, 1 insertions(+), 12 deletions(-) diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c index 37725ad..1069ceb 100644 --- a/fs/gfs2/inode.c +++ b/fs/gfs2/inode.c @@ -149,7 +149,7 @@ void gfs2_set_iop(struct inode *inode) } else if (S_ISLNK(mode)) { inode->i_op = _symlink_iops; } else { - inode->i_op = _dev_iops; + inode->i_op = _file_iops; } unlock_new_inode(inode); diff --git a/fs/gfs2/ops_inode.c b/fs/gfs2/ops_inode.c index e874129..ab9a073 100644 --- a/fs/gfs2/ops_inode.c +++ b/fs/gfs2/ops_inode.c @@ -1148,16 +1148,6 @@ const struct inode_operations gfs2_file_iops = { .removexattr = gfs2_removexattr, }; -const struct inode_operations gfs2_dev_iops = { - .permission = gfs2_permission, - .setattr = gfs2_setattr, - .getattr = gfs2_getattr, - .setxattr = gfs2_setxattr, - .getxattr = gfs2_getxattr, - .listxattr = gfs2_listxattr, - .removexattr = gfs2_removexattr, -}; - const struct inode_operations gfs2_dir_iops = { .create = gfs2_create, .lookup = gfs2_lookup, diff --git a/fs/gfs2/ops_inode.h b/fs/gfs2/ops_inode.h index fd8cee2..14b4b79 100644 --- a/fs/gfs2/ops_inode.h +++ b/fs/gfs2/ops_inode.h @@ -15,7 +15,6 @@ extern const struct inode_operations gfs2_file_iops; extern const struct inode_operations gfs2_dir_iops; extern const struct inode_operations gfs2_symlink_iops; -extern const struct inode_operations gfs2_dev_iops; extern const struct file_operations gfs2_file_fops; extern const struct file_operations gfs2_dir_fops; extern const struct file_operations gfs2_file_fops_nolock; -- 1.5.4.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86_64: remove wrong setting about CONSTANT_TSC for intel cpu
On Mon, Feb 25, 2008 at 11:20 PM, Ingo Molnar <[EMAIL PROTECTED]> wrote: > > * Yinghai Lu <[EMAIL PROTECTED]> wrote: > > > > also fix error in early_init_intel and reference about x86_capality, > > because it is array already.., prevent possible data corruption... > > hm, why should there be data corruption: > > > > - set_bit(X86_FEATURE_CONSTANT_TSC, >x86_capability); > > + set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC); > > cpu_cpu_cap() is currently defined as: > > #define set_cpu_cap(c, bit) set_bit(bit, (unsigned long > *)((c)->x86_capability) > > which is the same. set_cpu_cap() is indeed the cleaner form to do this > so your patch is correct as a cleanup. set_cpu_cap is right == set_bit(X86_FEATURE_CONSTANT_TSC, >x86_capability); ===> is wrong should be set_bit(X86_FEATURE_CONSTANT_TSC, c->x86_capability); x86_capability is a array ... so this could prevent some data corruption. YH -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86_64: remove wrong setting about CONSTANT_TSC for intel cpu
* Yinghai Lu <[EMAIL PROTECTED]> wrote: > also fix error in early_init_intel and reference about x86_capality, > because it is array already.., prevent possible data corruption... hm, why should there be data corruption: > - set_bit(X86_FEATURE_CONSTANT_TSC, >x86_capability); > + set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC); cpu_cpu_cap() is currently defined as: #define set_cpu_cap(c, bit) set_bit(bit, (unsigned long *)((c)->x86_capability) which is the same. set_cpu_cap() is indeed the cleaner form to do this so your patch is correct as a cleanup. Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: iwl4965 dropping packets and __dev_addr_discard: address leakage! da_users=1
On Mon, 25 Feb 2008, Andrew Morton wrote: > On Tue, 26 Feb 2008 16:22:43 +1100 Tim Connors <[EMAIL PROTECTED]> wrote: > > > Possibly because of the frequent renegotiating my iwl4965 card has > > been making, it has now decided it's not going to pass packets > > reliably until presumably next time I reboot. > > > > I've noticed messages in syslog that I hadn't seen when things were > > working fine. The problem possibly has only surfaced after rmmodding > > the iwl4965 module, then remodding it. It stays with me through being > > removed and modprobed again. It is being bonded in conjunction with a > > physical ethernet, if that is relevant, although I think I reproduced > > it when it was on its lonesome. > > > > > > Fresh after reboot: > > > > Feb 22 09:49:31 dirac kernel: iwl4965: Intel(R) Wireless WiFi Link 4965AGN > > driver for Linux, 1.1.17kds > > Feb 22 09:49:31 dirac kernel: iwl4965: Copyright(c) 2003-2007 Intel > > Corporation > > Feb 22 09:49:31 dirac kernel: ACPI: PCI Interrupt :0c:00.0[A] -> GSI 17 > > (level, low) -> IRQ 17 > > Feb 22 09:49:31 dirac kernel: PCI: Setting latency timer of device > > :0c:00.0 to 64 > > Feb 22 09:49:31 dirac kernel: iwl4965: Detected Intel Wireless WiFi Link > > 4965AGN > > Feb 22 09:49:31 dirac kernel: ACPI: PCI Interrupt :00:1b.0[A] -> GSI 21 > > (level, low) -> IRQ 21 > > Feb 22 09:49:31 dirac kernel: PCI: Setting latency timer of device > > :00:1b.0 to 64 > > Feb 22 09:49:31 dirac kernel: iwl4965: Tunable channels: 11 802.11bg, 13 > > 802.11a channels > > Feb 22 09:49:31 dirac kernel: phy0: Selected rate control algorithm > > 'iwl-4965-rs' > > > > now finished with it, will be deconfigured and rmmodded: > > > > Feb 23 03:05:23 dirac kernel: __dev_addr_discard: address leakage! > > da_users=1 > > Feb 23 03:05:23 dirac kernel: ACPI: PCI interrupt for device :0c:00.0 > > disabled > > > > The modprobed again: > > > > Feb 23 03:05:35 dirac kernel: iwl4965: Intel(R) Wireless WiFi Link 4965AGN > > driver for Linux, 1.1.17kds > > Feb 23 03:05:35 dirac kernel: iwl4965: Copyright(c) 2003-2007 Intel > > Corporation > > Feb 23 03:05:35 dirac kernel: ACPI: PCI Interrupt :0c:00.0[A] -> GSI 17 > > (level, low) -> IRQ 17 > > Feb 23 03:05:35 dirac kernel: PCI: Setting latency timer of device > > :0c:00.0 to 64 > > Feb 23 03:05:35 dirac kernel: iwl4965: Detected Intel Wireless WiFi Link > > 4965AGN > > Feb 23 03:05:35 dirac kernel: iwl4965: Tunable channels: 11 802.11bg, 13 > > 802.11a channels > > Feb 23 03:05:35 dirac kernel: phy9: Selected rate control algorithm > > 'iwl-4965-rs' > > > > according to lspci -vvv, 0c:00.0 is: > > > > 0c:00.0 Network controller: Intel Corporation PRO/Wireless 4965 AG or AGN > > Network Connection (rev 61) > > Subsystem: Intel Corporation Unknown device 1120 > > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- > > Stepping- SERR+ FastB2B- DisINTx+ > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- > > SERR- > Latency: 0, Cache Line Size: 64 bytes > > Interrupt: pin A routed to IRQ 379 > > Region 0: Memory at f9ffe000 (64-bit, non-prefetchable) [size=8K] > > Capabilities: [c8] Power Management version 3 > > Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA > > PME(D0+,D1-,D2-,D3hot+,D3cold+) > > Status: D0 PME-Enable- DSel=0 DScale=0 PME- > > Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ > > Queue=0/0 Enable+ > > Address: fee0300c Data: 4194 > > Capabilities: [e0] Express (v1) Endpoint, MSI 00 > > DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s > > <512ns, L1 unlimited > > ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- > > DevCtl: Report errors: Correctable- Non-Fatal- Fatal- > > Unsupported- > > RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ > > MaxPayload 128 bytes, MaxReadReq 128 bytes > > DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ > > TransPend- > > LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, > > Latency L0 <128ns, L1 <64us > > ClockPM+ Suprise- LLActRep- BwNot- > > LnkCtl: ASPM L0s Enabled; RCB 64 bytes Disabled- Retrain- > > CommClk+ > > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- > > LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ > > DLActive- BWMgmt- ABWMgmt- > > Capabilities: [100] Advanced Error Reporting > > Capabilities: [140] Device Serial Number d7-36-9b-ff-ff-e8-13-00 > > Kernel driver in use: iwl4965 > > Kernel modules: iwl4965 > > (cc linux-wireless) > > What kernel version is this? > > Is this a regression from an earlier kenrel version? If so, which? kernel 2.6.24 version 4.44.1.20 of the
Re: [PATCH] x86_64: remove wrong setting about CONSTANT_TSC for intel cpu
* Yinghai Lu <[EMAIL PROTECTED]> wrote: > early_init_intel is introduced by > > commit 2b16a2353814a513cdb5c5c739b76a19d7ea39ce > Author: Andi Kleen <[EMAIL PROTECTED]> > Date: Wed Jan 30 13:32:40 2008 +0100 > > x86: move X86_FEATURE_CONSTANT_TSC into early cpu feature detection > > set CONSTANT_TSC for intel cpus > > but it already set in init_intel > > don't need to set that two times in early_init_intel() and init_intel. this > patch remove one. > > also fix error in early_init_intel and reference about x86_capality, because > it > is array already.., prevent possible data corruption... > > this should be applied for 2.6.25 thanks Yinghai, applied. Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [linux-pm] Fundamental flaw in system suspend, exposed by freezer removal
This "flaw" isn't a new thing, of course. I remember pointing out the rather annoying proclivity of the PM framework to deadlock when suspend() tried to remove USB devices ... back around 2.6.10 or so. Things have shuffled around a bit, and gotten better in some cases, but not fundamentally changed. It may be more accurate to say that now we understand some constraints on device tree management policies ... ones we had previously assumed should not be issues. (But AFAICT, without actually considering the question. Now we know the right question to ask!) On Monday 25 February 2008, Rafael J. Wysocki wrote: > IMO the device driver should assure that no new children will be registered > concurrently with the ->suspend() method (IOW, ->suspend() should wait for > all such registrations to complete and should prevent any new ones from > being started) and it should make it impossible to register any new children > after ->suspend() has run. It's the driver's problem how to achieve that. There's also the case where it's framework code that handles the additions rather than the parent device. That would be typical for many bridge, hub, or adapter type drivers ... you may be thinking mostly about drivers acting as "leaf" nodes in the device tree, at least in terms of real hardware nodes. Yes, "require that policy from such framework code too". Just trying to be sure the description doesn't have gaping holes in the middle. :) I can think of a bunch of serial busses where framework code has that sort of responsiblity. USB, SPI, I2C ... "legacy" I2C drivers would all need to be taught not to create/remove children during those intervals, lacking "new style" conversion (which makes them work more like normal drivers). - Dave -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] x86_64: remove wrong setting about CONSTANT_TSC for intel cpu
early_init_intel is introduced by commit 2b16a2353814a513cdb5c5c739b76a19d7ea39ce Author: Andi Kleen <[EMAIL PROTECTED]> Date: Wed Jan 30 13:32:40 2008 +0100 x86: move X86_FEATURE_CONSTANT_TSC into early cpu feature detection set CONSTANT_TSC for intel cpus but it already set in init_intel don't need to set that two times in early_init_intel() and init_intel. this patch remove one. also fix error in early_init_intel and reference about x86_capality, because it is array already.., prevent possible data corruption... this should be applied for 2.6.25 Signed-off-by: Yinghai Lu <[EMAIL PROTECTED]> diff --git a/arch/x86/kernel/setup_64.c b/arch/x86/kernel/setup_64.c index 62d3f14..210134c 100644 --- a/arch/x86/kernel/setup_64.c +++ b/arch/x86/kernel/setup_64.c @@ -1027,7 +1027,7 @@ static void __cpuinit early_init_intel(struct cpuinfo_x86 *c) { if ((c->x86 == 0xf && c->x86_model >= 0x03) || (c->x86 == 0x6 && c->x86_model >= 0x0e)) - set_bit(X86_FEATURE_CONSTANT_TSC, >x86_capability); + set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC); } static void __cpuinit init_intel(struct cpuinfo_x86 *c) @@ -1071,9 +1071,6 @@ static void __cpuinit init_intel(struct cpuinfo_x86 *c) if (c->x86 == 15) c->x86_cache_alignment = c->x86_clflush_size * 2; - if ((c->x86 == 0xf && c->x86_model >= 0x03) || - (c->x86 == 0x6 && c->x86_model >= 0x0e)) - set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC); if (c->x86 == 6) set_cpu_cap(c, X86_FEATURE_REP_GOOD); set_cpu_cap(c, X86_FEATURE_LFENCE_RDTSC); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Memory Resource Controller Add Boot Option
Hi, > >>> I'll send out a prototype for comment. > > > > Something like the patch below. The effects of cgroup_disable=foo are: > > > > - foo doesn't show up in /proc/cgroups > > Or we can print out the disable flag, maybe this will be better? > Because we can distinguish from disabled and not compiled in from > /proc/cgroups. It would be neat if the disable flag /proc/cgroups can be cleared/set on demand. It will depend on the implementation of each controller whether it works or not. > > - foo isn't auto-mounted if you mount all cgroups in a single hierarchy > > - foo isn't visible as an individually mountable subsystem > > You mentioned in a previous mail if we mount a disabled subsystem we > will get an error. Here we just ignore the mount option. Which makes > more sense ? > > > > > As a result there will only ever be one call to foo->create(), at init > > time; all processes will stay in this group, and the group will never be > > mounted on a visible hierarchy. Any additional effects (e.g. not > > allocating metadata) are up to the foo subsystem. > > > > This doesn't handle early_init subsystems (their "disabled" bit isn't > > set be, but it could easily be extended to do so if any of the > > early_init systems wanted it - I think it would just involve some > > nastier parameter processing since it would occur before the > > command-line argument parser had been run. > > > > include/linux/cgroup.h |1 + > > kernel/cgroup.c| 29 +++-- > > 2 files changed, 28 insertions(+), 2 deletions(-) > > > > Index: cgroup_disable-2.6.25-rc2-mm1/include/linux/cgroup.h > > === > > --- cgroup_disable-2.6.25-rc2-mm1.orig/include/linux/cgroup.h > > +++ cgroup_disable-2.6.25-rc2-mm1/include/linux/cgroup.h > > @@ -256,6 +256,7 @@ struct cgroup_subsys { > > void (*bind)(struct cgroup_subsys *ss, struct cgroup *root); > > int subsys_id; > > int active; > > +int disabled; > > int early_init; > > #define MAX_CGROUP_TYPE_NAMELEN 32 > > const char *name; > > Index: cgroup_disable-2.6.25-rc2-mm1/kernel/cgroup.c > > === > > --- cgroup_disable-2.6.25-rc2-mm1.orig/kernel/cgroup.c > > +++ cgroup_disable-2.6.25-rc2-mm1/kernel/cgroup.c > > @@ -790,7 +790,14 @@ static int parse_cgroupfs_options(char * > > if (!*token) > > return -EINVAL; > > if (!strcmp(token, "all")) { > > -opts->subsys_bits = (1 << CGROUP_SUBSYS_COUNT) - 1; > > +/* Add all non-disabled subsystems */ > > +int i; > > +opts->subsys_bits = 0; > > +for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { > > +struct cgroup_subsys *ss = subsys[i]; > > +if (!ss->disabled) > > +opts->subsys_bits |= 1ul << i; > > +} > > } else if (!strcmp(token, "noprefix")) { > > set_bit(ROOT_NOPREFIX, >flags); > > } else if (!strncmp(token, "release_agent=", 14)) { > > @@ -808,7 +815,8 @@ static int parse_cgroupfs_options(char * > > for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { > > ss = subsys[i]; > > if (!strcmp(token, ss->name)) { > > -set_bit(i, >subsys_bits); > > +if (!ss->disabled) > > +set_bit(i, >subsys_bits); > > break; > > } > > } > > @@ -2596,6 +2606,8 @@ static int proc_cgroupstats_show(struct > > mutex_lock(_mutex); > > for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { > > struct cgroup_subsys *ss = subsys[i]; > > +if (ss->disabled) > > +continue; > > seq_printf(m, "%s\t%lu\t%d\n", > >ss->name, ss->root->subsys_bits, > >ss->root->number_of_cgroups); > > @@ -2991,3 +3003,16 @@ static void cgroup_release_agent(struct > > spin_unlock(_list_lock); > > mutex_unlock(_mutex); > > } > > + > > +static int __init cgroup_disable(char *str) > > +{ > > +int i; > > +for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { > > +struct cgroup_subsys *ss = subsys[i]; > > +if (!strcmp(str, ss->name)) { > > +ss->disabled = 1; > > +break; > > +} > > +} > > +} > > +__setup("cgroup_disable=", cgroup_disable); > > > > > >> > >> Sure thing, if css has the flag, then it would nice. Could you wrap it > >> up to say > >> something like css_disabled(_cgroup_subsys) > >> > >> > > > > It's the subsys object rather than the css (cgroup_subsys_state). > > > > We could have something like: > > > > #define cgroup_subsys_disabled(_ss) ((ss_)->disabled) > > > > but I don't see that > > cgroup_subsys_disabled(_cgroup_subsys) > > is better than just putting > > > > mem_cgroup_subsys.disabled > > > > Paul > > > > > > -- To unsubscribe from this list: send the line
Re: [patch 3/6] mempolicy: add MPOL_F_STATIC_NODES flag
On Mon, 25 Feb 2008, Paul Jackson wrote: > $ grep mpol_store_user_nodemask mm/mempolicy.c > static inline int mpol_store_user_nodemask(const struct mempolicy *pol) > if (mpol_store_user_nodemask(policy)) > if (!mpol_store_user_nodemask(a)) > if (!mpol_store_user_nodemask(pol) && > > So I see no need to waste the instructions needed (in the three copies > of this code, since it's static inline) to convert a non-zero value to > exactly the value 1. > Done, thanks. > Hmmm ... speaking of static inline ... I can knock 600 bytes (that's > IA64 bytes, so equivalent to about 300 x86 bytes) off the kernel text > size by not inlining the mm/mempolicy.c routines check_pgd_range() and > interleave_nid(). I wonder if that would be worth doing. Perhaps > those two routines are in sufficiently tight corners that the duplicate > copies of them is needed. > It seems like a worthwhile change to me even though gcc will pass the actuals to check_pgd_range() on the stack. The callers to check_range() shouldn't be in any fast paths: migrate_to_node() can take a long time depending on the length of the page list and do_mbind() sleeps on mmap_sem. textdata bss dec hex filename 11695 24 24 117432ddf mm/mempolicy.o.before 11215 24 24 112632bff mm/mempolicy.o.after -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[GIT PULL] XFS update for 2.6.25-rc4
Please pull from the for-linus branch: git pull git://oss.sgi.com:8090/xfs/xfs-2.6.git for-linus This will update the following files: fs/xfs/xfs_bit.c | 103 ++ fs/xfs/xfs_bit.h | 27 ++--- fs/xfs/xfs_rtalloc.c | 19 ++--- 3 files changed, 120 insertions(+), 29 deletions(-) through these commits: commit ef8ece55d9b6825c28a5c1a4bd89b94040cb7b32 Author: Lachlan McIlroy <[EMAIL PROTECTED]> Date: Tue Feb 26 17:00:22 2008 +1100 [XFS] Undo bit ops cleanup mod due to regression on 32-bit powermac platform. SGI-PV: 971186 SGI-Modid: xfs-linux-melb:xfs-kern:30559a Signed-off-by: Lachlan McIlroy <[EMAIL PROTECTED]> commit db69c915e67705daac25cad06d816c09be634de0 Author: Lachlan McIlroy <[EMAIL PROTECTED]> Date: Tue Feb 26 17:00:14 2008 +1100 [XFS] Undo bit ops cleanup mod due to regression on 32-bit powermac platform. SGI-PV: 974005 SGI-Modid: xfs-linux-melb:xfs-kern:30558a Signed-off-by: Lachlan McIlroy <[EMAIL PROTECTED]> -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86: add the debugfs interface for the sysprof tool
Hi, On Tue, Feb 26, 2008 at 8:27 AM, Pekka Enberg <[EMAIL PROTECTED]> wrote: > > You could try passing the --callgraph option to opcontrol. > > Hmm, perhaps I am missing something but I don't think that does what > sysprof does. At least I can't find where in the oprofile kernel code > does it save the full stack trace for user-space. John? Ok, so as pointed out by Nicholas/Andrew, oprofile does indeed do exactly what sysprof does (see arch/x86/oprofile/backtrace.c::backtrace_address, for example). So, Soeren, any other reason we can't use the oprofile kernel module for sysprof? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 5/6] mempolicy: add MPOL_F_RELATIVE_NODES flag
On Tue, 26 Feb 2008, Paul Jackson wrote: > David wrote: > +static nodemask_t mpol_relative_nodemask(const nodemask_t *orig, > + const nodemask_t *rel) > +{ > + nodemask_t ret; > + nodemask_t tmp; > > Could you avoid needing the nodemask_t 'ret' on the stack, by passing > in a "nodemask_t *" pointer to where you want the resulting nodemask_t > written, rather than by returning it by value? > > static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig, >const nodemask_t *rel) > Done, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: iwl4965 dropping packets and __dev_addr_discard: address leakage! da_users=1
On Tue, 26 Feb 2008 16:22:43 +1100 Tim Connors <[EMAIL PROTECTED]> wrote: > Possibly because of the frequent renegotiating my iwl4965 card has > been making, it has now decided it's not going to pass packets > reliably until presumably next time I reboot. > > I've noticed messages in syslog that I hadn't seen when things were > working fine. The problem possibly has only surfaced after rmmodding > the iwl4965 module, then remodding it. It stays with me through being > removed and modprobed again. It is being bonded in conjunction with a > physical ethernet, if that is relevant, although I think I reproduced > it when it was on its lonesome. > > > Fresh after reboot: > > Feb 22 09:49:31 dirac kernel: iwl4965: Intel(R) Wireless WiFi Link 4965AGN > driver for Linux, 1.1.17kds > Feb 22 09:49:31 dirac kernel: iwl4965: Copyright(c) 2003-2007 Intel > Corporation > Feb 22 09:49:31 dirac kernel: ACPI: PCI Interrupt :0c:00.0[A] -> GSI 17 > (level, low) -> IRQ 17 > Feb 22 09:49:31 dirac kernel: PCI: Setting latency timer of device > :0c:00.0 to 64 > Feb 22 09:49:31 dirac kernel: iwl4965: Detected Intel Wireless WiFi Link > 4965AGN > Feb 22 09:49:31 dirac kernel: ACPI: PCI Interrupt :00:1b.0[A] -> GSI 21 > (level, low) -> IRQ 21 > Feb 22 09:49:31 dirac kernel: PCI: Setting latency timer of device > :00:1b.0 to 64 > Feb 22 09:49:31 dirac kernel: iwl4965: Tunable channels: 11 802.11bg, 13 > 802.11a channels > Feb 22 09:49:31 dirac kernel: phy0: Selected rate control algorithm > 'iwl-4965-rs' > > now finished with it, will be deconfigured and rmmodded: > > Feb 23 03:05:23 dirac kernel: __dev_addr_discard: address leakage! da_users=1 > Feb 23 03:05:23 dirac kernel: ACPI: PCI interrupt for device :0c:00.0 > disabled > > The modprobed again: > > Feb 23 03:05:35 dirac kernel: iwl4965: Intel(R) Wireless WiFi Link 4965AGN > driver for Linux, 1.1.17kds > Feb 23 03:05:35 dirac kernel: iwl4965: Copyright(c) 2003-2007 Intel > Corporation > Feb 23 03:05:35 dirac kernel: ACPI: PCI Interrupt :0c:00.0[A] -> GSI 17 > (level, low) -> IRQ 17 > Feb 23 03:05:35 dirac kernel: PCI: Setting latency timer of device > :0c:00.0 to 64 > Feb 23 03:05:35 dirac kernel: iwl4965: Detected Intel Wireless WiFi Link > 4965AGN > Feb 23 03:05:35 dirac kernel: iwl4965: Tunable channels: 11 802.11bg, 13 > 802.11a channels > Feb 23 03:05:35 dirac kernel: phy9: Selected rate control algorithm > 'iwl-4965-rs' > > according to lspci -vvv, 0c:00.0 is: > > 0c:00.0 Network controller: Intel Corporation PRO/Wireless 4965 AG or AGN > Network Connection (rev 61) > Subsystem: Intel Corporation Unknown device 1120 > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- > Stepping- SERR+ FastB2B- DisINTx+ > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- > SERR- Latency: 0, Cache Line Size: 64 bytes > Interrupt: pin A routed to IRQ 379 > Region 0: Memory at f9ffe000 (64-bit, non-prefetchable) [size=8K] > Capabilities: [c8] Power Management version 3 > Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA > PME(D0+,D1-,D2-,D3hot+,D3cold+) > Status: D0 PME-Enable- DSel=0 DScale=0 PME- > Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ > Queue=0/0 Enable+ > Address: fee0300c Data: 4194 > Capabilities: [e0] Express (v1) Endpoint, MSI 00 > DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s > <512ns, L1 unlimited > ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- > DevCtl: Report errors: Correctable- Non-Fatal- Fatal- > Unsupported- > RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ > MaxPayload 128 bytes, MaxReadReq 128 bytes > DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ > TransPend- > LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, > Latency L0 <128ns, L1 <64us > ClockPM+ Suprise- LLActRep- BwNot- > LnkCtl: ASPM L0s Enabled; RCB 64 bytes Disabled- Retrain- > CommClk+ > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- > LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ > DLActive- BWMgmt- ABWMgmt- > Capabilities: [100] Advanced Error Reporting > Capabilities: [140] Device Serial Number d7-36-9b-ff-ff-e8-13-00 > Kernel driver in use: iwl4965 > Kernel modules: iwl4965 (cc linux-wireless) What kernel version is this? Is this a regression from an earlier kenrel version? If so, which? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] update efi region debugging to use MB, GB and TB as well as KB
When EFI_DEBUG is defined to a non-zero value in arch/ia64/kernel/efi.c, the efi memory regions are displayed. This patch enhances the display code in a few ways: 1. Use TB, GB and MB as well as KB as units. Although this introduces rounding errors (KB doesn't as size is always a multiple of 4Kb), it does make things a lot more readable. Also as the range is also shown, it is possible to note the exact size if it is important. In my experience, the size field is mostly useful for getting a general idea of the size of a region. On the rx2620 that I use, there actually is an 8TB region (though not backed by physical memory, and 8TB really is a lot more readable than 8589934592KB. 2. pad the size field with leading spaces to further improve readability ... ... ( 8MB) ... ( 928MB) ... ( 3MB) ... vs ... ... (8MB) ... (928MB) ... (3MB) ... 3. Pad the attr field out to 64bits using leading zeros, to further improve readability. ... mem05: type= 2, attr=0x0008, range=[0x0400-0x0481f000) ( 8MB) mem06: type= 7, attr=0x0008, range=[0x0481f000-0x3e876000) ( 928MB) mem07: type= 5, attr=0x8008, range=[0x3e876000-0x3eb8e000) ( 3MB) mem08: type= 4, attr=0x0008, range=[0x3eb8e000-0x3ee7a000) ( 2MB) ... ... mem05: type= 2, attr=0x8, range=[0x0400-0x0481f000) ( 8MB) mem06: type= 7, attr=0x8, range=[0x0481f000-0x3e876000) ( 928MB) mem07: type= 5, attr=0x8008, range=[0x3e876000-0x3eb8e000) ( 3MB) mem08: type= 4, attr=0x8, range=[0x3eb8e000-0x3ee7a000) ( 2MB) ... 4. Use %d instead of %u for the index field, as i is a signed int. N.B: This code is not compiled unless EFI_DEBUG is non 0. Signed-off-by: Simon Horman <[EMAIL PROTECTED]> Index: linux-2.6/arch/ia64/kernel/efi.c === --- linux-2.6.orig/arch/ia64/kernel/efi.c 2008-02-26 15:07:57.0 +0900 +++ linux-2.6/arch/ia64/kernel/efi.c2008-02-26 15:25:33.0 +0900 @@ -543,12 +543,30 @@ efi_init (void) for (i = 0, p = efi_map_start; p < efi_map_end; ++i, p += efi_desc_size) { + const char *unit; + unsigned long size; + md = p; - printk("mem%02u: type=%u, attr=0x%lx, " - "range=[0x%016lx-0x%016lx) (%luMB)\n", + size = md->num_pages << EFI_PAGE_SHIFT; + + if ((size >> 40) > 0) { + size >>= 40; + unit = "TB"; + } else if ((size >> 30) > 0) { + size >>= 30; + unit = "GB"; + } else if ((size >> 20) > 0) { + size >>= 20; + unit = "MB"; + } else { + size >>= 10; + unit = "KB"; + } + + printk("mem%02d: type=%2u, attr=0x%016lx, " + "range=[0x%016lx-0x%016lx) (%4lu%s)\n", i, md->type, md->attribute, md->phys_addr, - md->phys_addr + efi_md_size(md), - md->num_pages >> (20 - EFI_PAGE_SHIFT)); + md->phys_addr + efi_md_size(md), size, unit); } } #endif -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86: add the debugfs interface for the sysprof tool
On Sun, Feb 24, 2008 at 5:12 AM, Nicholas Miell <[EMAIL PROTECTED]> wrote: > > Sysprof tracks the full stack frame so it can provide meaningful call > > tree (who called what) which is invaluable for spotting hot _paths_. I > > don't see how oprofile can do that as it tracks instruction pointers only. > > You could try passing the --callgraph option to opcontrol. Hmm, perhaps I am missing something but I don't think that does what sysprof does. At least I can't find where in the oprofile kernel code does it save the full stack trace for user-space. John? Pekka -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.25-rc3: Reported regressions from 2.6.24
2008/2/25 Rafael J. Wysocki <[EMAIL PROTECTED]>: > This message contains a list of some regressions from 2.6.24 reported since > 2.6.25-rc1 was released, for which there are no fixes in the mainline I know > of. If any of them have been fixed already, please let me know. If you want that, I think Cc-ing reporters might be a good idea. > If you know of any other unresolved regressions from 2.6.24, please let me > know > either and I'll add them to the list. Also, please let me know if any of the > entries below are invalid. Two other issues, both with tested patches which do not seem to be in mainline yet: http://lkml.org/lkml/2008/2/22/127 2.6.25-rc[1,2]: failed to setup dm-crypt key mapping http://lkml.org/lkml/2008/2/25/400 new regression in 2.6.25-rc3: no keyboard/lid acpi events on thinkpad T61p I think this last one is bug 10100 Thanks, MST -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Contact info
On Tue, Feb 26, 2008 at 12:23:34AM +0100, [EMAIL PROTECTED] wrote: > Hello, > > I would like to know who is in charge with the linux credit list since i will > no longer use my current email. I'm using temporary this email to update > existing info. you'd better submit a patch for the files with your new address. Willy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/10] CGroup API files: Various cleanup to CGroup control files
On Mon, Feb 25, 2008 at 7:23 PM, Li Zefan <[EMAIL PROTECTED]> wrote: > > Should those pathces be rebased againt 2.6.25-rc3 ? > No, because they're against 2.6.25-rc2-mm1, which is already has (I think) any of the new bits in 2.6.25-rc3 that would be affected by these patches. Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 5/6] mmu_notifier: Support for drivers with revers maps (f.e. for XPmem)
On Thursday 21 February 2008 21:58, Robin Holt wrote: > On Thu, Feb 21, 2008 at 03:20:02PM +1100, Nick Piggin wrote: > > > > So why can't you export a device from your xpmem driver, which > > > > can be mmap()ed to give out "anonymous" memory pages to be used > > > > for these communication buffers? > > > > > > Because we need to have heap and stack available as well. MPT does > > > not control all the communication buffer areas. I haven't checked, but > > > this is the same problem that IB will have. I believe they are > > > actually allowing any memory region be accessible, but I am not sure of > > > that. > > > > Then you should create a driver that the user program can register > > and unregister regions of their memory with. The driver can do a > > get_user_pages to get the pages, and then you'd just need to set up > > some kind of mapping so that userspace can unmap pages / won't leak > > memory (and an exit_mm notifier I guess). > > OK. You need to explain this better to me. How would this driver > supposedly work? What we have is an MPI library. It gets invoked at > process load time to establish its rank-to-rank communication regions. > It then turns control over to the processes main(). That is allowed to > run until it hits the > MPI_Init(, ); > > The process is then totally under the users control until: > MPI_Send(intmessage, m_size, MPI_INT, my_rank+half, tag, > MPI_COMM_WORLD); > MPI_Recv(intmessage, m_size, MPI_INT, my_rank+half,tag, MPI_COMM_WORLD, > ); > > That is it. That is all our allowed interaction with the users process. OK, when you said something along the lines of "the MPT library has control of the comm buffer", then I assumed it was an area of virtual memory which is set up as part of initialization, rather than during runtime. I guess I jumped to conclusions. > That doesn't seem too unreasonable, except when you compare it to how the > driver currently works. Remember, this is done from a library which has > no insight into what the user has done to its own virtual address space. > As a result, each MPI_Send() would result in a system call (or we would > need to have a set of callouts for changes to a processes VMAs) which > would be a significant increase in communication overhead. > > Maybe I am missing what you intend to do, but what we need is a means of > tracking one processes virtual address space changes so other processes > can do direct memory accesses without the need for a system call on each > communication event. Yeah it's tricky. BTW. what is the performance difference between having a system call or no? > > Because you don't need to swap, you don't need coherency, and you > > are in control of the areas, then this seems like the best choice. > > It would allow you to use heap, stack, file-backed, anything. > > You are missing one point here. The MPI specifications that have > been out there for decades do not require the process use a library > for allocating the buffer. I realize that is a horrible shortcoming, > but that is the world we live in. Even if we could change that spec, Can you change the spec? Are you working on it? > we would still need to support the existing specs. As a result, the > user can change their virtual address space as they need and still expect > communications be cheap. That's true. How has it been supported up to now? Are you using these kind of notifiers in patched kernels? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 5/6] mempolicy: add MPOL_F_RELATIVE_NODES flag
David wrote: +static nodemask_t mpol_relative_nodemask(const nodemask_t *orig, +const nodemask_t *rel) +{ + nodemask_t ret; + nodemask_t tmp; Could you avoid needing the nodemask_t 'ret' on the stack, by passing in a "nodemask_t *" pointer to where you want the resulting nodemask_t written, rather than by returning it by value? static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig, const nodemask_t *rel) -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
linux-next: Tree for Feb 26
Hi all, I have created today's linux-next tree at git://git.kernel.org/pub/scm/linux/kernel/git/sfr/linux-next.git. You can see which trees have been included by looking in the Next/Trees file in the source. There are also quilt-import.log and merge.log files in the Next directory. Between each merge, the tree was built with allmodconfig for both powerpc and x86_64. There were only two minor merge problems. We are up to 34 trees, more are welcome (even if they are currently empty). Status of my local build tests is at http://kisskb.ellerman.id.au/kisskb/branch/9/. We have added arm and m68k to the architectures built. -- Cheers, Stephen Rothwell[EMAIL PROTECTED] pgpwRtZRnmKMM.pgp Description: PGP signature
Re: Linux 2.6.24.3
On Tue, Feb 26, 2008 at 02:39:23PM +0900, Samuel Masham wrote: > On Tue, Feb 26, 2008 at 10:00 AM, Greg Kroah-Hartman <[EMAIL PROTECTED]> > wrote: > > We (the -stable team) are announcing the release of the 2.6.24.3 > > kernel. > > > > Hi Greg, Stable people :) > > Can you confirm you have the mips irq probe crash fix in your queue > for the next stable release Yes, you just sent this to me on Monday, give me at least a week to ignore it before you worried :) It's in my queue, I'll let you know when I get to it. thanks, greg k-h -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/28] Swap over NFS -v16
On Saturday February 23, [EMAIL PROTECTED] wrote: > On Wed, 20 Feb 2008 15:46:10 +0100 Peter Zijlstra <[EMAIL PROTECTED]> wrote: > > > Another posting of the full swap over NFS series. > > Well I looked. There's rather a lot of it and I wouldn't pretend to > understand it. But pretending is fun :-) > > What is the NFS and net people's take on all of this? Well I'm only vaguely an NFS person, barely a net person, sporadically an mm person, but I've had a look and it seems to mostly make sense. We introduce a new "emergency" concept for page allocation. The size of the emergency pool is set by various reservations by different potential users. If the number of free pages is below the "emergency" size, then only users with a "MEMALLOC" flag get to allocate pages. Further, those pages get a "reserve" flag set which propagates into slab/slub so kmalloc/kmemalloc only return memory from those pages to MEMALLOC users. MEMALLOC users are those that set PF_MEMALLOC. A socket can get SOCK_MEMALLOC set which will cause certain pieces of code to temporarily set PF_MEMALLOC while working on that socket. The upshot is that providing any MEMALLOC user reserves an appropriate amount of emergency space, returns the emergency memory promptly, and sets PF_MEMALLOC whenever allocating memory, it's memory allocations should never fail. As memory is requested is small units, but allocated as pages, there needs to be a conversion from small-units to pages. One of the patches does this and appears to err on the side of be over-generous, which is the right thing to do. Memory reservations are organised in a tree. I really don't understand the tree. Is it just to make /proc/reserve_info look more helpful? Certainly all the individual reservations need to be recorded, and the cumulative reservation needs also to be recorded (currently in the root of the tree) but what are all the other levels used for? Reservations are used for all the transient memory that might be used by the network stack. This particularly includes the route cache and skbs for incoming messages. I have no idea if there is anything else that needs to be allowed for. Filesystems can advertise (via address_space_operations) that files may be used as swap file. They then provide swapout/swapin methods which are like writepage/readpage but may behave differently and have a different way to get credentials from a 'struct file'. So in general, the patch set looks to have the right sort of shape. I cannot be very authoritative on the details as there are a lot of them, and they touch code that I'm not very familiar with. Some specific comments on patches: reserve-slub.patch Please avoid irrelevant reformatting in patches. It makes them harder to read. e.g.: -static void setup_object(struct kmem_cache *s, struct page *page, - void *object) +static void setup_object(struct kmem_cache *s, struct page *page, void *object) mm-kmem_estimate_pages.patch This introduces kestimate kestimate_single kmem_estimate_pages The last obviously returns a number of pages. The contrast seems to suggest the others don't. But they do... I don't think the names are very good, but I concede that it is hard to choose good names here. Maybe: kmalloc_estimate_variable kmalloc_estimate_fixed kmem_alloc_estimate ??? mm-reserve.patch I'm confused by __mem_reserve_add. + reserve = mem_reserve_root.pages; + __calc_reserve(res, pages, 0); + reserve = mem_reserve_root.pages - reserve; __calc_reserve will always add 'pages' to mem_reserve_root.pages. So this is a complex way of doing reserve = pages; __calc_reserve(res, pages, 0); And as you can calculate reserve before calling __calc_reserve (which seems odd when stated that way), the whole function looks like it could become: ret = adjust_memalloc_reserve(pages); if (!ret) __calc_reserve(res, pages, limit); return ret; What am I missing? Also, mem_reserve_disconnect really should be a "void" function. Just put a BUG_ON(ret) and don't return anything. Finally, I'll just repeat that the purpose of the tree structure eludes me. net-sk_allocation.patch Why are the "GFP_KERNEL" call sites just changed to "sk->sk_allocation" rather than "sk_allocation(sk, GFP_KERNEL)" ?? I assume there is a good reason, and seeing it in the change log would educate me and make the patch more obviously correct. netvm-reserve.patch Function names again: sk_adjust_memalloc sk_set_memalloc sound similar. Purpose is completely different. mm-page_file_methods.patch This makes page_offset and others more expensive by adding a conditional jump to a function call that is not usually made. Why do swap pages have a different index to
Re: boot_delay broken ?
On Tue, Feb 26, 2008 at 1:48 PM, Dave Young <[EMAIL PROTECTED]> wrote: > On Tue, Feb 26, 2008 at 1:22 PM, Randy Dunlap <[EMAIL PROTECTED]> wrote: > > On Mon, 25 Feb 2008 10:14:36 +0800 Dave Young wrote: > > > > > On Sun, Feb 24, 2008 at 8:46 AM, Dave Jones <[EMAIL PROTECTED]> wrote: > > > > The boot_delay switch seems to be behaving strangely in the > > > > current -git. Setting it to =10 makes the output 'bursty' > > > > it becomes slow for some printk's whilst others scroll by > > > > at regular speed. > > > > Setting it any higher than that seems to make it pause for > > > > a really long time before it outputs any text at all. > > > > > > On my side there's this issue for a long time > > > http://lkml.org/lkml/2007/8/8/79 > > > > [http://marc.info/?l=linux-kernel=118655896515049=2] > > > > You asked questions and they were answered. Perhaps you didn't like > > the answers. > > No, I like it. Thanks. > > But I still want to know why mdelay can not be used. > is it not available for all archs or something else? > > > > > > Here's a question for you. What kernel boot options did you use? > > Specifically, for lpj= and boot_delay= ? > > I tried boot_delay=100 and boot_delay=200 without lpj set, The result > was really slow. It was better with lpj copied from dmesg, but was > still slower then mdelay. Especially at the very beginning after the message "Booting the kernel", I need to wait several minutes to see the afterwards messages > > I think we can firstly use preset lpj, after delay calibrating just > use the system lpj > > > > > > > > > > > > > x86 timer changes perhaps ? > > > > > > --- > > ~Randy > > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: boot_delay broken ?
On Tue, Feb 26, 2008 at 1:22 PM, Randy Dunlap <[EMAIL PROTECTED]> wrote: > On Mon, 25 Feb 2008 10:14:36 +0800 Dave Young wrote: > > > On Sun, Feb 24, 2008 at 8:46 AM, Dave Jones <[EMAIL PROTECTED]> wrote: > > > The boot_delay switch seems to be behaving strangely in the > > > current -git. Setting it to =10 makes the output 'bursty' > > > it becomes slow for some printk's whilst others scroll by > > > at regular speed. > > > Setting it any higher than that seems to make it pause for > > > a really long time before it outputs any text at all. > > > > On my side there's this issue for a long time > > http://lkml.org/lkml/2007/8/8/79 > > [http://marc.info/?l=linux-kernel=118655896515049=2] > > You asked questions and they were answered. Perhaps you didn't like > the answers. No, I like it. Thanks. But I still want to know why mdelay can not be used. is it not available for all archs or something else? > > Here's a question for you. What kernel boot options did you use? > Specifically, for lpj= and boot_delay= ? I tried boot_delay=100 and boot_delay=200 without lpj set, The result was really slow. It was better with lpj copied from dmesg, but was still slower then mdelay. I think we can firstly use preset lpj, after delay calibrating just use the system lpj > > > > > > > x86 timer changes perhaps ? > > > --- > ~Randy > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 3/6] mempolicy: add MPOL_F_STATIC_NODES flag
David wrote: +static inline int mpol_store_user_nodemask(const struct mempolicy *pol) +{ + return !!(pol->flags & MPOL_F_STATIC_NODES); +} Why the double-negative? As best as I can tell, the return value of mpol_store_user_nodemask() is only used in conditional contexts: $ grep mpol_store_user_nodemask mm/mempolicy.c static inline int mpol_store_user_nodemask(const struct mempolicy *pol) if (mpol_store_user_nodemask(policy)) if (!mpol_store_user_nodemask(a)) if (!mpol_store_user_nodemask(pol) && So I see no need to waste the instructions needed (in the three copies of this code, since it's static inline) to convert a non-zero value to exactly the value 1. Hmmm ... speaking of static inline ... I can knock 600 bytes (that's IA64 bytes, so equivalent to about 300 x86 bytes) off the kernel text size by not inlining the mm/mempolicy.c routines check_pgd_range() and interleave_nid(). I wonder if that would be worth doing. Perhaps those two routines are in sufficiently tight corners that the duplicate copies of them is needed. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 2.6.24.3
On Tue, Feb 26, 2008 at 10:00 AM, Greg Kroah-Hartman <[EMAIL PROTECTED]> wrote: > We (the -stable team) are announcing the release of the 2.6.24.3 > kernel. > Hi Greg, Stable people :) Can you confirm you have the mips irq probe crash fix in your queue for the next stable release See: http://lkml.org/lkml/2008/2/24/255 Is there anymore I should do to help the process along? Thanks Samuel -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
iwl4965 dropping packets and __dev_addr_discard: address leakage! da_users=1
Possibly because of the frequent renegotiating my iwl4965 card has been making, it has now decided it's not going to pass packets reliably until presumably next time I reboot. I've noticed messages in syslog that I hadn't seen when things were working fine. The problem possibly has only surfaced after rmmodding the iwl4965 module, then remodding it. It stays with me through being removed and modprobed again. It is being bonded in conjunction with a physical ethernet, if that is relevant, although I think I reproduced it when it was on its lonesome. Fresh after reboot: Feb 22 09:49:31 dirac kernel: iwl4965: Intel(R) Wireless WiFi Link 4965AGN driver for Linux, 1.1.17kds Feb 22 09:49:31 dirac kernel: iwl4965: Copyright(c) 2003-2007 Intel Corporation Feb 22 09:49:31 dirac kernel: ACPI: PCI Interrupt :0c:00.0[A] -> GSI 17 (level, low) -> IRQ 17 Feb 22 09:49:31 dirac kernel: PCI: Setting latency timer of device :0c:00.0 to 64 Feb 22 09:49:31 dirac kernel: iwl4965: Detected Intel Wireless WiFi Link 4965AGN Feb 22 09:49:31 dirac kernel: ACPI: PCI Interrupt :00:1b.0[A] -> GSI 21 (level, low) -> IRQ 21 Feb 22 09:49:31 dirac kernel: PCI: Setting latency timer of device :00:1b.0 to 64 Feb 22 09:49:31 dirac kernel: iwl4965: Tunable channels: 11 802.11bg, 13 802.11a channels Feb 22 09:49:31 dirac kernel: phy0: Selected rate control algorithm 'iwl-4965-rs' now finished with it, will be deconfigured and rmmodded: Feb 23 03:05:23 dirac kernel: __dev_addr_discard: address leakage! da_users=1 Feb 23 03:05:23 dirac kernel: ACPI: PCI interrupt for device :0c:00.0 disabled The modprobed again: Feb 23 03:05:35 dirac kernel: iwl4965: Intel(R) Wireless WiFi Link 4965AGN driver for Linux, 1.1.17kds Feb 23 03:05:35 dirac kernel: iwl4965: Copyright(c) 2003-2007 Intel Corporation Feb 23 03:05:35 dirac kernel: ACPI: PCI Interrupt :0c:00.0[A] -> GSI 17 (level, low) -> IRQ 17 Feb 23 03:05:35 dirac kernel: PCI: Setting latency timer of device :0c:00.0 to 64 Feb 23 03:05:35 dirac kernel: iwl4965: Detected Intel Wireless WiFi Link 4965AGN Feb 23 03:05:35 dirac kernel: iwl4965: Tunable channels: 11 802.11bg, 13 802.11a channels Feb 23 03:05:35 dirac kernel: phy9: Selected rate control algorithm 'iwl-4965-rs' according to lspci -vvv, 0c:00.0 is: 0c:00.0 Network controller: Intel Corporation PRO/Wireless 4965 AG or AGN Network Connection (rev 61) Subsystem: Intel Corporation Unknown device 1120 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- Capabilities: [140] Device Serial Number d7-36-9b-ff-ff-e8-13-00 Kernel driver in use: iwl4965 Kernel modules: iwl4965 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86_64: make amd quad core 8 socket system not be clustered_box v2
On Mon, Feb 25, 2008 at 8:05 PM, Ravikiran Thirumalai <[EMAIL PROTECTED]> wrote: > On Tue, Feb 26, 2008 at 04:46:25AM +0100, Andi Kleen wrote: > >> I don't quite understand the purpose of the patch to begin with. Looking > at > >> the current x86 git tree, apic_is_clustered_box is used only to determine > if > >> tsc is synchronized on the platform. The AMD docs imply that TSC's are > not > >> guaranteed to be synced across cores between nodes -- Opteron BKDG for > >> family 10h, Section 2.9.4: > > > >After long discussions with AMD they determined the CPUID flag > >for sync RDTSC will imply synchronization between nodes. > > Ah! > > > > > >If you can't support that in your hardware you're supposed > >to clear it. > > Hmm! How would a hardware vendor do that? That doesn't seem to be clear in > the BKDG. (Well, this is the problem with undocumented features :() > any good sign for APIC_clustered box? there is apicid between cpus even all cpu are quadcore and fully populated? YH -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: boot_delay broken ?
On Mon, 25 Feb 2008 10:14:36 +0800 Dave Young wrote: > On Sun, Feb 24, 2008 at 8:46 AM, Dave Jones <[EMAIL PROTECTED]> wrote: > > The boot_delay switch seems to be behaving strangely in the > > current -git. Setting it to =10 makes the output 'bursty' > > it becomes slow for some printk's whilst others scroll by > > at regular speed. > > Setting it any higher than that seems to make it pause for > > a really long time before it outputs any text at all. > > On my side there's this issue for a long time > http://lkml.org/lkml/2007/8/8/79 [http://marc.info/?l=linux-kernel=118655896515049=2] You asked questions and they were answered. Perhaps you didn't like the answers. Here's a question for you. What kernel boot options did you use? Specifically, for lpj= and boot_delay= ? > > > > x86 timer changes perhaps ? --- ~Randy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Pcihpd-discuss] 2.6.25-rc3 -- SHPC hotplug driver - very long timeouts?
On Mon, Feb 25, 2008 at 11:30:03PM -0500, Miles Lane wrote: > Hello, > > When I booted this kernel, the process was hugely delayed in shpchp. > I don't think I usually build this driver, so perhaps this is its > standard behavior when the hardware is missing, or some such? Or, is > this a bug in the driver? Either way, the timeouts seem excessively > long. > > [ 24.471137] CPA self-test: > [ 24.474565] 4k 8192 large 216 gb 0 x 8408[c000-f7fff000] miss 0 > [ 24.491200] 4k 196608 large 32 gb 0 x 196640[c000-f7fff000] miss 0 > [ 24.503549] 4k 196608 large 32 gb 0 x 196640[c000-f7fff000] miss 0 > [ 24.504440] ok. > [ 42.209886] shpchp: gave up waiting for init of module pci_hotplug. > [ 42.213190] shpchp: Unknown symbol acpi_run_oshp > [ 72.052520] shpchp: gave up waiting for init of module pci_hotplug. > [ 72.055826] shpchp: Unknown symbol pci_hp_change_slot_info > [ 101.931221] shpchp: gave up waiting for init of module pci_hotplug. > [ 101.934526] shpchp: Unknown symbol pci_hp_register > [ 131.789952] shpchp: gave up waiting for init of module pci_hotplug. > [ 131.793258] shpchp: Unknown symbol pci_hp_deregister > [ 161.683306] shpchp: gave up waiting for init of module pci_hotplug. > [ 161.686611] shpchp: Unknown symbol acpi_get_hp_params_from_firmware > [ 162.935681] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4 This is not a hotplug-specific bug, but rather one in the module loading logic. People who had seen this on -rc1 said it went away in -rc3. I suggest poking Rusty about it... thanks, greg k-h -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] maple: remove unused variable
On Sat, Feb 16, 2008 at 11:37:33PM +, Adrian McMenamin wrote: > Remove an unused variable from the definition of struct maple_device > > Signed-off-by: Adrian McMenamin <[EMAIL PROTECTED]> > --- > > diff -ruN a/include/linux/maple.h b/include/linux/maple.h > --- a/include/linux/maple.h 2008-02-16 20:52:09.0 + > +++ b/include/linux/maple.h 2008-02-16 21:42:06.0 + > @@ -64,7 +64,6 @@ > int (*connect) (struct maple_device * dev); > void (*disconnect) (struct maple_device * dev); > struct device_driver drv; > - int registered; > }; > Applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] arch/sh/drivers/heartbeat.c ioremap is expected to succeed
On Mon, Feb 18, 2008 at 02:09:10PM +0100, Roel Kluin wrote: > !unlikely(hd->base) is equivalent to likely(!hd->base) (for instance see > comments with commit fd312561adcc90e924f35d3032d5493aeb4c3017), I think > the ioremap is expected to succeed? please confirm that's right. > The patch below was *not* tested. > --- > ioremap is expected to succeed > > Signed-off-by: Roel Kluin <[EMAIL PROTECTED]> Indeed it is. Applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] maple: fix device detection
On Mon, Feb 25, 2008 at 07:40:26AM +, Adrian McMenamin wrote: > On Mon, 2008-02-25 at 14:33 +0900, Paul Mundt wrote: > > On Sun, Feb 24, 2008 at 10:32:53PM +, Adrian McMenamin wrote: > > > On Sun, 2008-02-24 at 21:50 +, Adrian McMenamin wrote: > > > > On Sun, 2008-02-24 at 14:30 +, Adrian McMenamin wrote: > > > > > The maple bus driver that went into the kernel mainline in > > > > > September 2007 contained some bugs which were revealed by the > > > > > update of the kobj code for the current release series. > > > > > Unfortunately those bugs also helped ensure maple devices were > > > > > properly detected. This patch (against the current git) now ensures > > > > > that devices are properly detected again. > > > > > > > > > > > > > Further testing has shown this has introduced another bug, this time > > > > limiting the effectiveness of subdevice detection. Please ignore this > > > > while I work on a fix. > > > > > > > Sorry for the confusion, in fact there is nothing wrong with this code > > > (ie it should be applied), the error was in the driver for the Dreamcast > > > controller (the device, in general, into which the subdevices are > > > plugged in and out). > > > > > > I will post a fix for that. > > > > > > Sorry again. > > > > > So what exactly is supposed to be applied here? > > The patch at the start of this thread - ie > http://lkml.org/lkml/2008/2/24/125 - this should really go in now as it > fixes a problem with current code. > Ok, that's applied. Note that the original body was horribly word wrapped, and your patch was not in -p1 format (while others in the series are, for reasons that aren't entirely obvious). > In addition there are two patch sets to add new device support: > > http://lkml.org/lkml/2008/2/24/211- maple controller > > http://lkml.org/lkml/2008/2/24/121 (thread) - maple mouse > This is not 2.6.25 material. Once you have Acked-by's from the input folks, I'll queue these up in my 2.6.26 tree. The bus unplug thing is rather unusual, I don't know if Greg has any comments on that or not. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/4] autofs4 - track uid and gid of last mount requestor - correction
On Tue, 26 Feb 2008, Ian Kent wrote: > + > + /* Set mount requestor */ > + if (ino) { > + if (ino) { > + ino->uid = wq->uid; > + ino->gid = wq->gid; > + } > + } > + As has been spotted, this is obviously wrong. And here is the correction. Signed-off-by: Ian Kent <[EMAIL PROTECTED]> Ian --- diff -up linux-2.6.25-rc2-mm1/fs/autofs4/waitq.c.track-last-mount-ids-fix linux-2.6.25-rc2-mm1/fs/autofs4/waitq.c --- linux-2.6.25-rc2-mm1/fs/autofs4/waitq.c.track-last-mount-ids-fix 2008-02-26 14:02:05.0 +0900 +++ linux-2.6.25-rc2-mm1/fs/autofs4/waitq.c 2008-02-26 14:02:20.0 +0900 @@ -385,10 +385,8 @@ int autofs4_wait(struct autofs_sb_info * /* Set mount requestor */ if (ino) { - if (ino) { - ino->uid = wq->uid; - ino->gid = wq->gid; - } + ino->uid = wq->uid; + ino->gid = wq->gid; } if (de) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86: add the debugfs interface for the sysprof tool
> It was only later I tried oprofile and found it not only much more > difficult to use, but also much less useful when I did get it to work. This surprises me. Can you please elaborate on why oprofile is "much less useful" than sysprof? Anton - who has used oprofile to analyse and tune databases, JVMs, compilers and operating systems. Maybe I've been missing out on the killer app for all this time!!! -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86: add the debugfs interface for the sysprof tool
Hi Peter, > Usable for me is a cli interface. Also, I absolutely love opannotate. I definitely agree there. It's interesting to note that sysprof requires you to run the GUI as root in order to work. Maybe Ingo and Arjan are confident there are no bugs in all the libraries that sysprof links to: # ldd `which sysprof` | wc -l 39 I'm not. Actually before someone converted it to debugfs, it was even worse, the sysprof kernel module exported all profiling information to the world: -r--r--r-- 1 root root 2060 2008-02-25 23:00 /proc/sysprof-trace Anton -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 1/6] mempolicy: convert MPOL constants to enum
David wrote: + /* add additional MPOL_* modes here */ That doesn't explicitly say what I was most worried about saying, which is that those MPOL_* terms have values visible in the kernel's public API, and it does say more than might be true, and it doesn't explain why it says what it says. It kinda looks like an ugly "maybe this will shut Paul up patch ". I'd rather leave the code the way it was than add that comment ... sorry. > I'd like to avoid respinning this set Ah - now we get to the real issue ?;). There would be no need to respin; one could do just as you proposed doing with the above change, queue a little add-on patch to the existing set. Really ... look around the kernel. I believe you'll see many instances of enum values being spelled out, even ones that count 0, 1, 2, ..., in situations where the values are constrained by outside forces. People really do avoid relying on the default enum behaviour having any particular numbering. Whatever ... do as you will. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/4] autofs4 - add mount option to display mount device
Hi Andrew, Patch to add a display mount option to show the device number of the autofs mount super block. Signed-off-by: Ian Kent < [EMAIL PROTECTED]> Ian --- diff -up linux-2.6.25-rc2-mm1/fs/autofs4/inode.c.add-mount-device-display-option linux-2.6.25-rc2-mm1/fs/autofs4/inode.c --- linux-2.6.25-rc2-mm1/fs/autofs4/inode.c.add-mount-device-display-option 2008-02-20 13:01:06.0 +0900 +++ linux-2.6.25-rc2-mm1/fs/autofs4/inode.c 2008-02-20 13:03:45.0 +0900 @@ -190,6 +190,7 @@ static int autofs4_show_options(struct s seq_printf(m, ",timeout=%lu", sbi->exp_timeout/HZ); seq_printf(m, ",minproto=%d", sbi->min_proto); seq_printf(m, ",maxproto=%d", sbi->max_proto); + seq_printf(m, ",dev=%d", autofs4_get_dev(sbi)); if (sbi->type & AUTOFS_TYPE_OFFSET) seq_printf(m, ",offset"); @@ -332,7 +333,7 @@ int autofs4_fill_super(struct super_bloc sbi->sb = s; sbi->version = 0; sbi->sub_version = 0; - sbi->type = 0; + sbi->type = AUTOFS_TYPE_INDIRECT; sbi->min_proto = 0; sbi->max_proto = 0; mutex_init(>wq_mutex); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] cgroup: fix default notify_on_release setting
> @@ -2242,6 +2241,9 @@ static long cgroup_create(struct cgroup *parent, > struct dentry *dentry, > cgrp->root = parent->root; > cgrp->top_cgroup = parent->top_cgroup; > > + if (notify_on_release(parent)) > + set_bit(CGRP_NOTIFY_ON_RELEASE, >flags); Good catch, Li Zefan - thanks. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.25-rc1/2 CD/DVD burning broken
On Mon, Feb 25, 2008 at 11:08:55PM +0100, Andreas Schwab wrote: > Borislav Petkov <[EMAIL PROTECTED]> writes: > > > On Mon, Feb 25, 2008 at 08:38:22PM +0100, Andreas Schwab wrote: > >> "Kiyoshi Ueda" <[EMAIL PROTECTED]> writes: > >> > >> > I'm looking at this problem, but currently no idea why the conversion > >> > to blk_end_request causes it. > >> > >> cdrom_newpc_intr apparently never sets rq->sense_len. > >> > > > > actually it does, see the code chunk around line 1188 in 2.6.25-rc2, for > > example. > > Yes, it does, but it always adds zero. yep, true. Does that fix your dvd burning problem? > Move counting of sense bytes into the transfer loop. > > Signed-off-by: Andreas Schwab <[EMAIL PROTECTED]> > > --- > drivers/ide/ide-cd.c |5 ++--- > 1 file changed, 2 insertions(+), 3 deletions(-) > > --- linux-2.6.25-rc3.orig/drivers/ide/ide-cd.c2008-02-25 > 01:03:31.0 +0100 > +++ linux-2.6.25-rc3/drivers/ide/ide-cd.c 2008-02-25 22:54:42.0 > +0100 > @@ -1182,11 +1182,10 @@ static ide_startstop_t cdrom_newpc_intr( > else > rq->data += blen; > } > + if (!write && blk_sense_request(rq)) > + rq->sense_len += blen; > } > > - if (write && blk_sense_request(rq)) > - rq->sense_len += thislen; > - > /* >* pad, if necessary >*/ > > Andreas. > > -- > Andreas Schwab, SuSE Labs, [EMAIL PROTECTED] > SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany > PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 > "And now for something completely different." -- Regards/Gruß, Boris. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 1/6] mempolicy: convert MPOL constants to enum
On Mon, 25 Feb 2008, Paul Jackson wrote: > Eh ... each bit of added clarity to the code reduces > errors. > > Looking around the kernel, it really seems to me to be > a common coding to explicitly spell out enum values > when for some reason they actually matter. > > Like most coding mechanisms, nothing guarantees anything. > > It just nicely represents one particular detail, that > the values of these MPOL_* terms are not arbitrary. > Of course the MPOL_* modes aren't arbitrary; they are defined in an enum that has a well-defined and explicit behavior for how they are mapped to int values based on a standard. I have more mempolicy patches that add additional behavior and cleanups so I can queue the following for a later posting. I'd like to avoid respinning this set unless there are actual design or implementation concerns that are raised. --- include/linux/mempolicy.h |4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -20,7 +20,9 @@ enum { MPOL_PREFERRED, MPOL_BIND, MPOL_INTERLEAVE, - MPOL_MAX, /* always last member of enum */ + /* add additional MPOL_* modes here */ + + MPOL_MAX, }; /* Flags for set_mempolicy */ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86: add the debugfs interface for the sysprof tool
> > From: Soren Sandmann <[EMAIL PROTECTED]> > > Subject: [PATCH] x86: add the debugfs interface for the sysprof tool > > > > The sysprof tool is a very easy to use GUI tool to find out where > > userspace is spending CPU time. See > > http://www.daimi.au.dk/~sandmann/sysprof/ for more information and > > screenshots on this tool. > > > > Sysprof needs a 200 line kernel module to do it's work, this module > > puts some simple profiling data into debugfs. > > thanks, looks good to me - applied. Woah slow down guys. Did I miss the review? Yes it's a 200 line patch, but it's a 200 line x86 patch. Surely we should apply some of the same rigour we did when we merged the oprofile patch? Is it biarch safe? Will it run on powerpc, arm etc? I'm still struggling to understand why we need this functionality at all. Lets argue the ABI and not cloud it with a discussion about userspace eye candy. What does this give you that is an improvement over the oprofile kernel-user ABI? Anton -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: new regression in 2.6.25-rc3: no keyboard/lid acpi events on thinkpad T61p - resume hang
On Tue, Feb 26, 2008 at 4:45 AM, Michael S. Tsirkin <[EMAIL PROTECTED]> wrote: On Mon, Feb 25, 2008 at 9:46 PM, Andrew Morton <[EMAIL PROTECTED]> wrote: > On Mon, 25 Feb 2008 21:19:24 +0200 "Michael S. Tsirkin" <[EMAIL PROTECTED]> wrote: > You mean suspend-to-ram works correctly on your t61p? > Mine suspends, then five seconds later magically resumes itself and the > screen is all black. Sorry, have not noticed what you were asking about. Yes, rc2 seems to suspend/resume fine. And after reverting revert commit 559bbe6cbd0d8c68d40076a5f7dc98e3bf5864b2. commit 559bbe6cbd0d8c68d40076a5f7dc98e3bf5864b2 Author: Pavel Machek <[EMAIL PROTECTED]> Date: Thu Feb 21 13:56:55 2008 +0100 power_state: get rid of write-only variable in SATA power_state is scheduled for removal, and libata uses it in write-only mode. Remove it. Signed-off-by: Pavel Machek <[EMAIL PROTECTED]> Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]> I'm experiencing hang after resume from STR with the latest Linus's git tree. Reverting the above patch solved the problem. Thanks, Jeff Here's the patch for reference ... diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c index 4cf8662..9812bbf 100644 --- a/drivers/ata/libata-core.c +++ b/drivers/ata/libata-core.c @@ -6560,8 +6560,6 @@ int ata_host_suspend(struct ata_host *host, pm_message_t mesg) ata_lpm_enable(host); rc = ata_host_request_pm(host, mesg, 0, ATA_EHI_QUIET, 1); - if (rc == 0) - host->dev->power.power_state = mesg; return rc; } @@ -6580,7 +6578,6 @@ void ata_host_resume(struct ata_host *host) { ata_host_request_pm(host, PMSG_ON, ATA_EH_SOFTRESET, ATA_EHI_NO_AUTOPSY | ATA_EHI_QUIET, 0); - host->dev->power.power_state = PMSG_ON; /* reenable link pm */ ata_lpm_disable(host); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86_64: make amd quad core 8 socket system not be clustered_box v2
On Tue, Feb 26, 2008 at 04:46:25AM +0100, Andi Kleen wrote: >> I don't quite understand the purpose of the patch to begin with. Looking at >> the current x86 git tree, apic_is_clustered_box is used only to determine if >> tsc is synchronized on the platform. The AMD docs imply that TSC's are not >> guaranteed to be synced across cores between nodes -- Opteron BKDG for >> family 10h, Section 2.9.4: > >After long discussions with AMD they determined the CPUID flag >for sync RDTSC will imply synchronization between nodes. Ah! > >If you can't support that in your hardware you're supposed >to clear it. Hmm! How would a hardware vendor do that? That doesn't seem to be clear in the BKDG. (Well, this is the problem with undocumented features :() Thanks, Kiran -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 1/6] mempolicy: convert MPOL constants to enum
David wrote: > I don't suspect that a kernel developer is going > to make such an egregious error. Eh ... each bit of added clarity to the code reduces errors. Looking around the kernel, it really seems to me to be a common coding to explicitly spell out enum values when for some reason they actually matter. Like most coding mechanisms, nothing guarantees anything. It just nicely represents one particular detail, that the values of these MPOL_* terms are not arbitrary. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Announce: Linux-next (Or Andrew's dream :-))
Hi Russell, On Sat, 16 Feb 2008 00:09:43 + Russell King <[EMAIL PROTECTED]> wrote: > > On Tue, Feb 12, 2008 at 12:02:08PM +1100, Stephen Rothwell wrote: > > I will attempt to build the tree between each merge (and a failed build > > will again cause the offending tree to be dropped). These builds will be > > necessarily restricted to probably one architecture/config. I will build > > the entire tree on as many architectures/configs as seem sensible and > > the results of that will be available on a web page (to be announced). > > This restriction means that the value for the ARM architecture is soo > limited it's probably not worth the hastle participating in this project. > > We already know that -mm picks up on very few ARM conflicts because > Andrew doesn't run through the entire set of configurations; unfortunately > ARM is one of those architectures which is very diverse [*], and because > of that, ideas like "allyconfig" are just completely irrelevant to it. > > As mentioned elsewhere, what we need for ARM is to extend the kautobuild > infrastructure (see armlinux.simtec.co.uk) so that we can have more trees > at least compile tested regularly - but that requires the folk there to > have additional compute power (which isn't going to happen unless folk > stamp up some machines _or_ funding). I now have an arm cross compiler (gcc-4.0.2-glibc-2.3.6 arm-unknown-linux-gnu). (See the results page at http://kisskb.ellerman.id.au/kisskb/branch/9/ - I must get a better name/place :-(.) Is this sufficient to help you out? What configs would be useful to build (as Andrew said, they don't take very long each). I really want as many subsystems as possible in the linux-next tree in an attempt to avoid some of the merge/conflict problems we have had in the past. What can we do to help? -- Cheers, Stephen Rothwell[EMAIL PROTECTED] http://www.canb.auug.org.au/~sfr/ pgp7FBm2sHObu.pgp Description: PGP signature
Re: [Bluez-devel] forcing SCO connection patch
Hi Marcel >> --- linux-2.6.23/net/bluetooth/hci_event.c.orig 2008-02-25 >> 17:17:11.0 +0900 >> +++ linux-2.6.23/net/bluetooth/hci_event.c 2008-02-25 >> 17:30:23.0 +0900 >> @@ -1313,8 +1313,17 @@ >> hci_dev_lock(hdev); >> >> conn = hci_conn_hash_lookup_ba(hdev, ev->link_type, >bdaddr); >> - if (!conn) >> - goto unlock; >> + if (!conn) { >> + if (ev->link_type != ACL_LINK) { >> + __u8 link_type = (ev->link_type == ESCO_LINK) ? SCO_LINK : ESCO_LINK; >> + >> + conn = hci_conn_hash_lookup_ba(hdev, link_type, >bdaddr); >> + if (conn) >> + conn->type = ev->link_type; >> + } >> + if (!conn) >> + goto unlock; >> + } > > NAK. There is no need to check for ACL_LINK. The sync_complete will > only be called for SCO or eSCO connections. I see. I removed this check line in the patch. Thanks. Louis JANG Signed-off-by: Louis JANG <[EMAIL PROTECTED]> --- linux-2.6.23/net/bluetooth/hci_event.c.orig 2008-02-26 12:46:36.0 +0900 +++ linux-2.6.23/net/bluetooth/hci_event.c 2008-02-26 12:47:23.0 +0900 @@ -1313,8 +1313,15 @@ hci_dev_lock(hdev); conn = hci_conn_hash_lookup_ba(hdev, ev->link_type, >bdaddr); - if (!conn) - goto unlock; + if (!conn) { + __u8 link_type = (ev->link_type == ESCO_LINK) ? SCO_LINK : ESCO_LINK; + + conn = hci_conn_hash_lookup_ba(hdev, link_type, >bdaddr); + if (conn) + conn->type = ev->link_type; + else + goto unlock; + } if (!ev->status) { conn->handle = __le16_to_cpu(ev->handle);
Re: Linux 2.6.24.3
On Mon, 25 Feb 2008 17:00:24 -0800 Greg Kroah-Hartman wrote: > We (the -stable team) are announcing the release of the 2.6.24.3 > kernel. > > It fixes a number of different bugs and all users of the 2.6.24 series > are encouraged to upgrade. > > I'll also be replying to this message with a copy of the patch between > 2.6.24.2 and 2.6.24.3 > > The updated 2.6.24.y git tree can be found at: > > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.24.y.git > and can be browsed at the normal kernel.org git web browser: > > http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.24.y.git;a=summary When HEADERS_CHECK=y: make[3]: *** No rule to make target `/local/linsrc/linux-2.6.24.3/include/linux/if_addrlabel.h', needed by `/local/linsrc/linux-2.6.24.3/usr/include/linux/if_addrlabel.h'. Stop. make[2]: *** [linux] Error 2 --- ~Randy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.24-sha1: RIP [] iov_iter_advance+0x38/0x70
On Wednesday 20 February 2008 09:01, Alexey Dobriyan wrote: > On Tue, Feb 19, 2008 at 11:47:11PM +0300, wrote: > > > Are you reproducing it simply by running the > > > ftest03 binary directly from the shell? How many times between oopses? > > > It is multi-process but no threads, so races should be minimal down > > > this path -- can you get an strace of the failing process? > > Speaking of multi-proceseness, changing MAXCHILD to 1, nchild to 1, > AFAICS, generates one child which oopses the very same way (in parallel > with generic LTP) But, lowering MAXIOVCNT to 8 generates no oops. Thanks, I was able to reproduce quite easily with these settings. I think I have the correct patch now (at least it isn't triggerable any more here). Thanks, Nick diff --git a/mm/filemap.c b/mm/filemap.c index 5c74b68..2650073 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1750,14 +1750,18 @@ static void __iov_iter_advance_iov(struct iov_iter *i, size_t bytes) } else { const struct iovec *iov = i->iov; size_t base = i->iov_offset; + size_t copied = 0; /* * The !iov->iov_len check ensures we skip over unlikely - * zero-length segments. + * zero-length segments (without overruning the iovec). */ - while (bytes || !iov->iov_len) { - int copy = min(bytes, iov->iov_len - base); + while (copied < bytes || +unlikely(!iov->iov_len && copied < i->count)) { + int copy; + copy = min(bytes, iov->iov_len - base); + copied += copy; bytes -= copy; base += copy; if (iov->iov_len == base) {
Re: [PATCH] x86_64: make amd quad core 8 socket system not be clustered_box v2
> I don't quite understand the purpose of the patch to begin with. Looking at > the current x86 git tree, apic_is_clustered_box is used only to determine if > tsc is synchronized on the platform. The AMD docs imply that TSC's are not > guaranteed to be synced across cores between nodes -- Opteron BKDG for > family 10h, Section 2.9.4: After long discussions with AMD they determined the CPUID flag for sync RDTSC will imply synchronization between nodes. If you can't support that in your hardware you're supposed to clear it. -Andi -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86_64: make amd quad core 8 socket system not be clustered_box v2
On Mon, Feb 25, 2008 at 02:05:45PM -0800, Yinghai Lu wrote: >On Mon, Feb 25, 2008 at 11:08 AM, Ravikiran Thirumalai ><[EMAIL PROTECTED]> wrote: >> ... >> Andi, Yes. AMD based vSMPowered systems uses clustered APICs, and this >> check base on cpu vendor id is not good :(. > >please check if you happy with > >http://lkml.org/lkml/2008/2/24/273 > I don't quite understand the purpose of the patch to begin with. Looking at the current x86 git tree, apic_is_clustered_box is used only to determine if tsc is synchronized on the platform. The AMD docs imply that TSC's are not guaranteed to be synced across cores between nodes -- Opteron BKDG for family 10h, Section 2.9.4: Note: Timers associated with different CPU cores in the same processor increment at the same rate. Timers associated with different CPU cores in different processors increment at slightly different rates if (1) they are located on different nodes and (2) CLKIN for these nodes is derived from different, non-synchronized oscillator sources. But that is not what the x86 tree does (with your patches) -- it looks for the X86_FEATURE_CONSTANT_TSC at unsynchronized_tsc() and assumes a synchronized clock. Huh!?? Am i missing something here? X86_FEATURE_CONSTANT_TSC is set from CPUID Fn8000_0007 -- TscInvariant bit, which implies TSC is not affected by change in PM states. This does not talk about whether CLKIN for different packages are from synchronized/non synchronized oscillator sources in the above quote. Thanks, Kiran -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Configure MSI-X vectors to target different CPUs
Thanks, Robert. My device does support multiple vectors. When looking into functions called by pci_enable_msix(), I found msi_compose_msg() in arch/i386/kernel/io_apic.c. It tries to get destination CPU (TARGET_CPUS) and set this information to msg->address_lo. My question is about TARGET_CPUS. Under the asm-i386/mach-default, it is the cpu_online_map. Under asm-i386/mach-bigsmp, it is the cpumask_of_cpu(cpu), where cpu is a single one. I would guess if a single CPU is set as destination, only that CPU will be interrupted. But what will happen when the cpu_online_map is set as destination? Any CPU can be interrupted then? Or depending on affinity of the corresponding irq? Please CC'ed me ([EMAIL PROTECTED]) answers/comments in response to this posting. Thanks, Ying - Original Message From: Robert Hancock <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Cc: linux-kernel@vger.kernel.org Sent: Thursday, February 21, 2008 7:59:14 PM Subject: Re: Configure MSI-X vectors to target different CPUs [EMAIL PROTECTED] wrote: > Hi, > > In MSI-HOWTO, it's said: > > "Using MSI enables the device functions to support two or more vectors, which can be configured to target different CPUs to increase scalability." > > So how can I set up MSI-X vectors to target different CPUs? I want to allocate the same number of MSI-X vectors as CPUs, and equally distribute them to every CPU. > > Is it automatically done by Linux when I call pci_enable_msix()? If yes, how? If not, what should I do? My guess is to set the affinity of the interrupts manually. Am I right? > > Please CC'ed me ([EMAIL PROTECTED]) answers/comments in response to this posting. > > Thanks, > Ying If the device actually supports multiple vectors (not all do), I think they should show up as separate interrupts in /proc/interrupts and you can either set the affinity manually, or maybe irqbalance is smart enough for this. Careful, though, as in some cases this may reduce performance due to causing more cache line bouncing between CPUs. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 1/6] mempolicy: convert MPOL constants to enum
On Mon, 25 Feb 2008, Paul Jackson wrote: > +enum { > + MPOL_DEFAULT, > + MPOL_PREFERRED, > + MPOL_BIND, > + MPOL_INTERLEAVE, > + MPOL_MAX, /* always last member of enum */ > > Aren't the values that these constants take part of the > user visible kernel API? > > In other words, if someone added another MPOL_* in the middle > of this enum, it would break mbind/set_mempolicy/get_mempolicy > users, right: > > +enum { > + MPOL_DEFAULT, > + MPOL_PREFERRED, > + MPOL_YET_ANOTHER_FLAG, /* <== added flag ... oops */ > + MPOL_BIND, > + MPOL_INTERLEAVE, > + MPOL_MAX, /* always last member of enum */ > I don't suspect that a kernel developer is going to make such an egregious error. The user would need to be using a new linux/mempolicy.h with an old kernel to get the wrong behavior. > I'm thinking that we should still specify the specific value > of each of these flags, by way of documenting these necessary > values, as in: > > +enum { > + MPOL_DEFAULT = 0, > + MPOL_PREFERRED = 1, > + MPOL_BIND = 2, > + MPOL_INTERLEAVE = 3, > + MPOL_MAX, /* always last member of enum */ > That looks overly redundant to me and doesn't protect against adding MPOL_YET_ANOTHER_FLAG in the middle of preferred and bind to get two mode values with the int value of 1. David -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Memory Resource Controller use strstrip while parsing arguments
Andrew Morton wrote: > On Mon, 25 Feb 2008 23:57:46 +0530 Balbir Singh <[EMAIL PROTECTED]> wrote: > >> The memory controller has a requirement that while writing values, we need >> to use echo -n. This patch fixes the problem and makes the UI more >> consistent. > > that's a decent improvement ;) > > btw, could I ask that you, Paul and others who work on this and cgroups > have a think about a ./MAINTAINERS update? Aah.. yes.. we should do that. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 4/4] autofs4 - add miscelaneous device for ioctls
Hi Andrew, Patch to add miscellaneous device to autofs4 module for ioctls. Signed-off-by: Ian Kent < [EMAIL PROTECTED]> Ian --- diff -up linux-2.6.25-rc2-mm1/fs/autofs4/expire.c.device-node-ioctl linux-2.6.25-rc2-mm1/fs/autofs4/expire.c --- linux-2.6.25-rc2-mm1/fs/autofs4/expire.c.device-node-ioctl 2008-01-25 07:58:37.0 +0900 +++ linux-2.6.25-rc2-mm1/fs/autofs4/expire.c2008-02-22 11:51:41.0 +0900 @@ -244,10 +244,10 @@ cont: } /* Check if we can expire a direct mount (possibly a tree) */ -static struct dentry *autofs4_expire_direct(struct super_block *sb, - struct vfsmount *mnt, - struct autofs_sb_info *sbi, - int how) +struct dentry *autofs4_expire_direct(struct super_block *sb, +struct vfsmount *mnt, +struct autofs_sb_info *sbi, +int how) { unsigned long timeout; struct dentry *root = dget(sb->s_root); @@ -281,10 +281,10 @@ static struct dentry *autofs4_expire_dir * - it is unused by any user process * - it has been unused for exp_timeout time */ -static struct dentry *autofs4_expire_indirect(struct super_block *sb, - struct vfsmount *mnt, - struct autofs_sb_info *sbi, - int how) +struct dentry *autofs4_expire_indirect(struct super_block *sb, + struct vfsmount *mnt, + struct autofs_sb_info *sbi, + int how) { unsigned long timeout; struct dentry *root = sb->s_root; diff -up linux-2.6.25-rc2-mm1/fs/autofs4/init.c.device-node-ioctl linux-2.6.25-rc2-mm1/fs/autofs4/init.c --- linux-2.6.25-rc2-mm1/fs/autofs4/init.c.device-node-ioctl2008-01-25 07:58:37.0 +0900 +++ linux-2.6.25-rc2-mm1/fs/autofs4/init.c 2008-02-22 11:51:41.0 +0900 @@ -29,11 +29,20 @@ static struct file_system_type autofs_fs static int __init init_autofs4_fs(void) { - return register_filesystem(_fs_type); + int err; + + err = register_filesystem(_fs_type); + if (err) + return err; + + err = autofs_dev_ioctl_init(); + + return err; } static void __exit exit_autofs4_fs(void) { + autofs_dev_ioctl_exit(); unregister_filesystem(_fs_type); } diff -up linux-2.6.25-rc2-mm1/fs/autofs4/autofs_i.h.device-node-ioctl linux-2.6.25-rc2-mm1/fs/autofs4/autofs_i.h --- linux-2.6.25-rc2-mm1/fs/autofs4/autofs_i.h.device-node-ioctl 2008-02-22 11:51:41.0 +0900 +++ linux-2.6.25-rc2-mm1/fs/autofs4/autofs_i.h 2008-02-22 11:51:41.0 +0900 @@ -14,6 +14,7 @@ /* Internal header file for autofs */ #include +#include #include #include @@ -40,6 +41,16 @@ #define DPRINTK(fmt,args...) do {} while(0) #endif +#define WARN(fmt,args...) \ +do { \ + printk("KERN_WARNING pid %d: %s: " fmt "\n" , current->pid , __FUNCTION__ , ##args); \ +} while(0) + +#define ERROR(fmt,args...) \ +do { \ + printk("KERN_ERR pid %d: %s: " fmt "\n" , current->pid , __FUNCTION__ , ##args); \ +} while(0) + /* Unified info structure. This is pointed to by both the dentry and inode structures. Each file in the filesystem has an instance of this structure. It holds a reference to the dentry, so dentries are never @@ -172,6 +183,17 @@ int autofs4_expire_run(struct super_bloc struct autofs_packet_expire __user *); int autofs4_expire_multi(struct super_block *, struct vfsmount *, struct autofs_sb_info *, int __user *); +struct dentry *autofs4_expire_direct(struct super_block *sb, +struct vfsmount *mnt, +struct autofs_sb_info *sbi, int how); +struct dentry *autofs4_expire_indirect(struct super_block *sb, + struct vfsmount *mnt, + struct autofs_sb_info *sbi, int how); + +/* Device node initialization */ + +int autofs_dev_ioctl_init(void); +void autofs_dev_ioctl_exit(void); /* Operations structures */ diff -up linux-2.6.25-rc2-mm1/fs/autofs4/Makefile.device-node-ioctl linux-2.6.25-rc2-mm1/fs/autofs4/Makefile --- linux-2.6.25-rc2-mm1/fs/autofs4/Makefile.device-node-ioctl 2008-01-25 07:58:37.0 +0900 +++ linux-2.6.25-rc2-mm1/fs/autofs4/Makefile2008-02-22 11:51:41.0 +0900 @@ -4,4 +4,4 @@ obj-$(CONFIG_AUTOFS4_FS) += autofs4.o -autofs4-objs := init.o inode.o root.o symlink.o waitq.o expire.o +autofs4-objs := init.o inode.o root.o symlink.o waitq.o expire.o dev-ioctl.o diff -up /dev/null linux-2.6.25-rc2-mm1/fs/autofs4/dev-ioctl.c --- /dev/null
[PATCH 3/4] autofs4 - track uid and gid of last mount requestor
Hi Andrew, Patch to track the uid and gid of the last process to request a mount for on an autofs dentry. Signed-off-by: Ian Kent < [EMAIL PROTECTED]> Ian --- diff -up linux-2.6.25-rc2-mm1/fs/autofs4/inode.c.track-last-mount-ids linux-2.6.25-rc2-mm1/fs/autofs4/inode.c --- linux-2.6.25-rc2-mm1/fs/autofs4/inode.c.track-last-mount-ids 2008-02-20 13:11:28.0 +0900 +++ linux-2.6.25-rc2-mm1/fs/autofs4/inode.c 2008-02-20 13:12:23.0 +0900 @@ -43,6 +43,8 @@ struct autofs_info *autofs4_init_ino(str ino->flags = 0; ino->mode = mode; + ino->uid = 0; + ino->gid = 0; ino->inode = NULL; ino->dentry = NULL; ino->size = 0; diff -up linux-2.6.25-rc2-mm1/fs/autofs4/autofs_i.h.track-last-mount-ids linux-2.6.25-rc2-mm1/fs/autofs4/autofs_i.h --- linux-2.6.25-rc2-mm1/fs/autofs4/autofs_i.h.track-last-mount-ids 2008-02-20 13:14:03.0 +0900 +++ linux-2.6.25-rc2-mm1/fs/autofs4/autofs_i.h 2008-02-20 13:14:34.0 +0900 @@ -58,6 +58,9 @@ struct autofs_info { unsigned long last_used; atomic_t count; + uid_t uid; + gid_t gid; + mode_t mode; size_t size; diff -up linux-2.6.25-rc2-mm1/fs/autofs4/waitq.c.track-last-mount-ids linux-2.6.25-rc2-mm1/fs/autofs4/waitq.c --- linux-2.6.25-rc2-mm1/fs/autofs4/waitq.c.track-last-mount-ids 2008-02-20 13:06:20.0 +0900 +++ linux-2.6.25-rc2-mm1/fs/autofs4/waitq.c 2008-02-20 13:10:23.0 +0900 @@ -363,6 +363,38 @@ int autofs4_wait(struct autofs_sb_info * status = wq->status; + /* +* For direct and offset mounts we need to track the requestor +* uid and gid in the dentry info struct. This is so it can be +* supplied, on request, by the misc device ioctl interface. +* This is needed during daemon resatart when reconnecting +* to existing, active, autofs mounts. The uid and gid (and +* related string values) may be used for macro substitution +* in autofs mount maps. +*/ + if (!status) { + struct dentry *de = NULL; + + /* direct mount or browsable map */ + ino = autofs4_dentry_ino(dentry); + if (!ino) { + /* If not lookup actual dentry used */ + de = d_lookup(dentry->d_parent, >d_name); + ino = autofs4_dentry_ino(de); + } + + /* Set mount requestor */ + if (ino) { + if (ino) { + ino->uid = wq->uid; + ino->gid = wq->gid; + } + } + + if (de) + dput(de); + } + /* Are we the last process to need status? */ if (atomic_dec_and_test(>wait_ctr)) kfree(wq); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/4] autofs4 - check for invalid dentry in getpath
Hi Andrew, Patch to catch invalid dentry when calculating it's path. Signed-off-by: Ian Kent <[EMAIL PROTECTED]> Ian --- diff -up linux-2.6.25-rc2-mm1/fs/autofs4/waitq.c.getpath-check-valid-dentry linux-2.6.25-rc2-mm1/fs/autofs4/waitq.c --- linux-2.6.25-rc2-mm1/fs/autofs4/waitq.c.getpath-check-valid-dentry 2008-02-20 12:55:39.0 +0900 +++ linux-2.6.25-rc2-mm1/fs/autofs4/waitq.c 2008-02-20 12:54:46.0 +0900 @@ -171,7 +171,7 @@ static int autofs4_getpath(struct autofs for (tmp = dentry ; tmp != root ; tmp = tmp->d_parent) len += tmp->d_name.len + 1; - if (--len > NAME_MAX) { + if (!len || --len > NAME_MAX) { spin_unlock(_lock); return 0; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/10] CGroup API files: Various cleanup to CGroup control files
[EMAIL PROTECTED] wrote: > This patchset is a roll-up of the non-contraversial items of the > various patches that I've sent out recently, fixed according to the > feedback received. > > In summary they are: > > - general rename of read_uint/write_uint to read_u64/write_u64 > > - use these methods for cpusets and memory controller files > > - add a read_map cgroup file method, and use it in the memory > controller > > - move the "releasable" control file to the debug subsystem > > - make the debug subsystem config option default to "n" > > The only user-visible changes are the movement of the "releasable" > file and the fact that some write_u64()-based control files are now > more forgiving of additional whitespace at the end of their input. > > Signed-off-by: Paul Menage <[EMAIL PROTECTED]> > > -- > -- Should those pathces be rebased againt 2.6.25-rc3 ? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] cifs: remove GLOBAL_EXTERN macro
Global variables should be defined in C files, not in headers. 1) Comment out unused vars GlobalDnotifyRsp_Q GlobalUidList 2) Declare vars in cifsfs.c that need it and change to extern in cifsglob.h 3) Change to extern in cifsglob.h for vars that were already being declared in cifsfs.c 4) Remove GLOBAL_EXTERN Signed-off-by: Harvey Harrison <[EMAIL PROTECTED]> --- Steven, here is a revised patch that has a bit more thought behind it. fs/cifs/cifsfs.c | 31 - fs/cifs/cifsglob.h | 76 2 files changed, 64 insertions(+), 43 deletions(-) diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c index fcc4342..aae6752 100644 --- a/fs/cifs/cifsfs.c +++ b/fs/cifs/cifsfs.c @@ -37,7 +37,6 @@ #include #include "cifsfs.h" #include "cifspdu.h" -#define DECLARE_GLOBALS_HERE #include "cifsglob.h" #include "cifsproto.h" #include "cifs_debug.h" @@ -85,6 +84,34 @@ module_param(cifs_max_pending, int, 0); MODULE_PARM_DESC(cifs_max_pending, "Simultaneous requests to server. " "Default: 50 Range: 2 to 256"); +struct list_head GlobalSMBSessionList; +struct list_head GlobalTreeConnectionList; +rwlock_t GlobalSMBSeslock; + +struct list_head GlobalOplock_Q; + +struct list_head GlobalDnotifyReqList; + +unsigned int GlobalCurrentXid; +unsigned int GlobalTotalActiveXid; +unsigned int GlobalMaxActiveXid; +spinlock_t GlobalMid_Lock; +char Local_System_Name[15]; + +atomic_t sesInfoAllocCount; +atomic_t tconInfoAllocCount; +atomic_t tcpSesAllocCount; +atomic_t tcpSesReconnectCount; +atomic_t tconInfoReconnectCount; + +atomic_t bufAllocCount;/* current number allocated */ +#ifdef CONFIG_CIFS_STATS2 +atomic_t totBufAllocCount; /* total allocated over all time */ +atomic_t totSmBufAllocCount; +#endif +atomic_t smBufAllocCount; +atomic_t midCount; + extern mempool_t *cifs_sm_req_poolp; extern mempool_t *cifs_req_poolp; extern mempool_t *cifs_mid_poolp; @@ -1001,7 +1028,7 @@ init_cifs(void) INIT_LIST_HEAD(_Q); #ifdef CONFIG_CIFS_EXPERIMENTAL INIT_LIST_HEAD(); - INIT_LIST_HEAD(_Q); +/* INIT_LIST_HEAD(_Q); */ #endif /* * Initialize Global counters diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h index 5d32d8d..c45acfd 100644 --- a/fs/cifs/cifsglob.h +++ b/fs/cifs/cifsglob.h @@ -583,79 +583,73 @@ require use of the stronger protocol */ * / -#ifdef DECLARE_GLOBALS_HERE -#define GLOBAL_EXTERN -#else -#define GLOBAL_EXTERN extern -#endif - /* * The list of servers that did not respond with NT LM 0.12. * This list helps improve performance and eliminate the messages indicating * that we had a communications error talking to the server in this list. */ /* Feature not supported */ -/* GLOBAL_EXTERN struct servers_not_supported *NotSuppList; */ +/* extern struct servers_not_supported *NotSuppList; */ /* * The following is a hash table of all the users we know about. */ -GLOBAL_EXTERN struct smbUidInfo *GlobalUidList[UID_HASH]; +/* extern struct smbUidInfo *GlobalUidList[UID_HASH]; */ -/* GLOBAL_EXTERN struct list_head GlobalServerList; BB not implemented yet */ -GLOBAL_EXTERN struct list_head GlobalSMBSessionList; -GLOBAL_EXTERN struct list_head GlobalTreeConnectionList; -GLOBAL_EXTERN rwlock_t GlobalSMBSeslock; /* protects list inserts on 3 above */ +/* extern struct list_head GlobalServerList; BB not implemented yet */ +extern struct list_head GlobalSMBSessionList; +extern struct list_head GlobalTreeConnectionList; +extern rwlock_t GlobalSMBSeslock; /* protects list inserts on 3 above */ -GLOBAL_EXTERN struct list_head GlobalOplock_Q; +extern struct list_head GlobalOplock_Q; /* Outstanding dir notify requests */ -GLOBAL_EXTERN struct list_head GlobalDnotifyReqList; +extern struct list_head GlobalDnotifyReqList; /* DirNotify response queue */ -GLOBAL_EXTERN struct list_head GlobalDnotifyRsp_Q; +/* extern struct list_head GlobalDnotifyRsp_Q; */ /* * Global transaction id (XID) information */ -GLOBAL_EXTERN unsigned int GlobalCurrentXid; /* protected by GlobalMid_Sem */ -GLOBAL_EXTERN unsigned int GlobalTotalActiveXid; /* prot by GlobalMid_Sem */ -GLOBAL_EXTERN unsigned int GlobalMaxActiveXid; /* prot by GlobalMid_Sem */ -GLOBAL_EXTERN spinlock_t GlobalMid_Lock; /* protects above & list operations */ +extern unsigned int GlobalCurrentXid; /* protected by GlobalMid_Sem */ +extern unsigned int GlobalTotalActiveXid; /* prot by GlobalMid_Sem */ +extern unsigned int GlobalMaxActiveXid;/* prot by GlobalMid_Sem */ +extern spinlock_t GlobalMid_Lock; /* protects above & list operations */ /* on midQ entries */ -GLOBAL_EXTERN char Local_System_Name[15]; +extern char Local_System_Name[15]; /* * Global counters, updated atomically */ -GLOBAL_EXTERN atomic_t sesInfoAllocCount; -GLOBAL_EXTERN atomic_t tconInfoAllocCount;
[PATCH 0/4] autofs4 - autofs needs a miscelaneous device for ioctls
Hi Andrew, There is a problem with active restarts in autofs (that is to say restarting autofs when there are busy mounts). Currently autofs uses "umount -l" to clear active mounts at restart. While using lazy umount works for most cases, anything that needs to walk back up the mount tree to construct a path, such as getcwd(2) and the proc file system /proc//cwd, no longer works because the point from which the path is constructed has been detached from the mount tree. The actual problem with autofs is that it can't reconnect to existing mounts. Immediately one things of just adding the ability to remount autofs file systems would solve it, but alas, that can't work. This is because autofs direct mounts and the implementation of "on demand mount and expire" of nested mount trees have the file system mounted on top of the mount trigger dentry. To resolve this a miscellaneous device node for routing ioctl commands to these mount points has been implemented for the autofs4 kernel module. For those wishing to test this out an updated user space daemon is needed. Checking out and building from the git repo or applying all the current patches to the 5.0.3 tar distribution will do the trick. This is all available at the usual location on kernel.org. Ian -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 1/6] mempolicy: convert MPOL constants to enum
David wrote: +enum { + MPOL_DEFAULT, + MPOL_PREFERRED, + MPOL_BIND, + MPOL_INTERLEAVE, + MPOL_MAX, /* always last member of enum */ Aren't the values that these constants take part of the user visible kernel API? In other words, if someone added another MPOL_* in the middle of this enum, it would break mbind/set_mempolicy/get_mempolicy users, right: +enum { + MPOL_DEFAULT, + MPOL_PREFERRED, + MPOL_YET_ANOTHER_FLAG, /* <== added flag ... oops */ + MPOL_BIND, + MPOL_INTERLEAVE, + MPOL_MAX, /* always last member of enum */ I'm thinking that we should still specify the specific value of each of these flags, by way of documenting these necessary values, as in: +enum { + MPOL_DEFAULT = 0, + MPOL_PREFERRED = 1, + MPOL_BIND = 2, + MPOL_INTERLEAVE = 3, + MPOL_MAX, /* always last member of enum */ -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Kernel oops with bluetooth usb dongle
Hi Quel, -- Original message -- From: Thomas Gleixner <[EMAIL PROTECTED]> On Fri, 22 Feb 2008, David Woodhouse wrote: On Fri, 2008-02-22 at 08:23 +0100, Thomas Gleixner wrote: + del_timer(>info_timer); + hcon->l2cap_data = NULL; kfree(conn); Shouldn't that be del_timer_sync() ? Hmm, probably yes. Hi, Great news: only adding adding del_timer_sync() to 2.6.25-rc3 does prevent the crash. Bad news: I still cannot use the device. hcitool inq, hcitool scan, hcitool name and hcitool info commands work. hcitool cc , sdptool , rfcomm connect command fail, most of them with a 'Connection reset by peer' error. what does "hciconfig hci0 version" tell you about your device? Some of the none major based Bluetooth chips are broken and might need an extra tweak within the USB driver. Regards Marcel -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Bluez-devel] forcing SCO connection patch
Hi Louis, I fixed all of errors except 80 characters warning. Thanks Louis JANG Signed-off-by: Louis JANG <[EMAIL PROTECTED]> --- linux-2.6.23/net/bluetooth/hci_event.c.orig 2008-02-25 17:17:11.0 +0900 +++ linux-2.6.23/net/bluetooth/hci_event.c 2008-02-25 17:30:23.0 +0900 @@ -1313,8 +1313,17 @@ hci_dev_lock(hdev); conn = hci_conn_hash_lookup_ba(hdev, ev->link_type, >bdaddr); - if (!conn) - goto unlock; + if (!conn) { + if (ev->link_type != ACL_LINK) { + __u8 link_type = (ev->link_type == ESCO_LINK) ? SCO_LINK : ESCO_LINK; + + conn = hci_conn_hash_lookup_ba(hdev, link_type, >bdaddr); + if (conn) + conn->type = ev->link_type; + } + if (!conn) + goto unlock; + } NAK. There is no need to check for ACL_LINK. The sync_complete will only be called for SCO or eSCO connections. diff -uNr linux-2.6.23/include/net/bluetooth-orig/sco.h linux-2.6.23/ include/net/bluetooth/sco.h --- linux-2.6.23/include/net/bluetooth-orig/sco.h 2007-10-10 05:31:38.0 +0900 +++ linux-2.6.23/include/net/bluetooth/sco.h 2008-02-25 18:04:20.0 +0900 @@ -51,6 +51,8 @@ __u8 dev_class[3]; }; +#define SCO_FORCESCO 0x03 + NAK. We don't need this. And even if we really would want this, we would do it via extra parameters inside sockaddr_sco. In that case we would do it right and exposing eSCO settings and not some boolean parameter. Regards Marcel -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Memory Resource Controller Add Boot Option
Paul Menage wrote: >>> I'll send out a prototype for comment. > > Something like the patch below. The effects of cgroup_disable=foo are: > > - foo doesn't show up in /proc/cgroups Or we can print out the disable flag, maybe this will be better? Because we can distinguish from disabled and not compiled in from /proc/cgroups. > - foo isn't auto-mounted if you mount all cgroups in a single hierarchy > - foo isn't visible as an individually mountable subsystem You mentioned in a previous mail if we mount a disabled subsystem we will get an error. Here we just ignore the mount option. Which makes more sense ? > > As a result there will only ever be one call to foo->create(), at init > time; all processes will stay in this group, and the group will never be > mounted on a visible hierarchy. Any additional effects (e.g. not > allocating metadata) are up to the foo subsystem. > > This doesn't handle early_init subsystems (their "disabled" bit isn't > set be, but it could easily be extended to do so if any of the > early_init systems wanted it - I think it would just involve some > nastier parameter processing since it would occur before the > command-line argument parser had been run. > > include/linux/cgroup.h |1 + > kernel/cgroup.c| 29 +++-- > 2 files changed, 28 insertions(+), 2 deletions(-) > > Index: cgroup_disable-2.6.25-rc2-mm1/include/linux/cgroup.h > === > --- cgroup_disable-2.6.25-rc2-mm1.orig/include/linux/cgroup.h > +++ cgroup_disable-2.6.25-rc2-mm1/include/linux/cgroup.h > @@ -256,6 +256,7 @@ struct cgroup_subsys { > void (*bind)(struct cgroup_subsys *ss, struct cgroup *root); > int subsys_id; > int active; > +int disabled; > int early_init; > #define MAX_CGROUP_TYPE_NAMELEN 32 > const char *name; > Index: cgroup_disable-2.6.25-rc2-mm1/kernel/cgroup.c > === > --- cgroup_disable-2.6.25-rc2-mm1.orig/kernel/cgroup.c > +++ cgroup_disable-2.6.25-rc2-mm1/kernel/cgroup.c > @@ -790,7 +790,14 @@ static int parse_cgroupfs_options(char * > if (!*token) > return -EINVAL; > if (!strcmp(token, "all")) { > -opts->subsys_bits = (1 << CGROUP_SUBSYS_COUNT) - 1; > +/* Add all non-disabled subsystems */ > +int i; > +opts->subsys_bits = 0; > +for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { > +struct cgroup_subsys *ss = subsys[i]; > +if (!ss->disabled) > +opts->subsys_bits |= 1ul << i; > +} > } else if (!strcmp(token, "noprefix")) { > set_bit(ROOT_NOPREFIX, >flags); > } else if (!strncmp(token, "release_agent=", 14)) { > @@ -808,7 +815,8 @@ static int parse_cgroupfs_options(char * > for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { > ss = subsys[i]; > if (!strcmp(token, ss->name)) { > -set_bit(i, >subsys_bits); > +if (!ss->disabled) > +set_bit(i, >subsys_bits); > break; > } > } > @@ -2596,6 +2606,8 @@ static int proc_cgroupstats_show(struct > mutex_lock(_mutex); > for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { > struct cgroup_subsys *ss = subsys[i]; > +if (ss->disabled) > +continue; > seq_printf(m, "%s\t%lu\t%d\n", >ss->name, ss->root->subsys_bits, >ss->root->number_of_cgroups); > @@ -2991,3 +3003,16 @@ static void cgroup_release_agent(struct > spin_unlock(_list_lock); > mutex_unlock(_mutex); > } > + > +static int __init cgroup_disable(char *str) > +{ > +int i; > +for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) { > +struct cgroup_subsys *ss = subsys[i]; > +if (!strcmp(str, ss->name)) { > +ss->disabled = 1; > +break; > +} > +} > +} > +__setup("cgroup_disable=", cgroup_disable); > > >> >> Sure thing, if css has the flag, then it would nice. Could you wrap it >> up to say >> something like css_disabled(_cgroup_subsys) >> >> > > It's the subsys object rather than the css (cgroup_subsys_state). > > We could have something like: > > #define cgroup_subsys_disabled(_ss) ((ss_)->disabled) > > but I don't see that > cgroup_subsys_disabled(_cgroup_subsys) > is better than just putting > > mem_cgroup_subsys.disabled > > Paul > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Don't risk NULL deref in marker
* Jesper Juhl ([EMAIL PROTECTED]) wrote: > > get_marker() may return NULL, so test for it. > Hrm, yes, if we have two marker_probe_unregister callers calling it for the same marker, one expecting it to fail and they race, yes, it can happen. Although this is not expected to happen often if the caller acts sanely, it's a good thing to fix it. Thanks! Acked-by: Mathieu Desnoyers <[EMAIL PROTECTED]> > > Signed-off-by: Jesper Juhl <[EMAIL PROTECTED]> > --- > > diff --git a/kernel/marker.c b/kernel/marker.c > index 50effc0..f211f08 100644 > --- a/kernel/marker.c > +++ b/kernel/marker.c > @@ -698,12 +698,11 @@ int marker_probe_unregister(const char *name, > { > struct marker_entry *entry; > struct marker_probe_closure *old; > - int ret = 0; > + int ret = -ENOENT; > > mutex_lock(_mutex); > entry = get_marker(name); > if (!entry) { > - ret = -ENOENT; > goto end; > } > if (entry->rcu_pending) > @@ -713,12 +712,16 @@ int marker_probe_unregister(const char *name, > marker_update_probes(); /* may update entry */ > mutex_lock(_mutex); > entry = get_marker(name); > + if (!entry) { > + goto end; > + } > entry->oldptr = old; > entry->rcu_pending = 1; > /* write rcu_pending before calling the RCU callback */ > smp_wmb(); > call_rcu(>rcu, free_old_closure); > remove_marker(name);/* Ignore busy error message */ > + ret = 0; > end: > mutex_unlock(_mutex); > return ret; -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] mmiotrace full patch, preview 1
Quoting Christoph Hellwig <[EMAIL PROTECTED]>: On Mon, Feb 25, 2008 at 02:49:22PM -0800, Andrew Morton wrote: the things which it finds. > +static DECLARE_MUTEX(kmmio_init_mutex); That's not a mutex. > + down(_init_mutex); It's a semaphore. Please do convert it to a mutex. Andy, I'd say that addition of new semaphores is worth a warning - they're rarely legitimate. I'm not sure that any semaphore should be a warning, but the initializer for semaphore used as binary mutex (DECLARE_MUTEX and init_MUTEX) are worth it. It looks like a mutex, it acts like a mutex, but it isn't a mutex, it's a trap for the unwary. Weird. I was annoyed by it before; now I see a fellow developer actually getting into that trap. I'd say, rename DECLARE_MUTEX to DECLARE_SEMAPHORE and let external code be fixed one way or another (i.e. stick with the "mutex" name or stick with the semaphore functionality if it's really needed). -- Regards, Pavel Roskin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Tiny cpusets -- cpusets for small systems?
Paul Jackson wrote: So. I see cpusets as a higher level API/mechanism and cpu_isolated_map as lower level mechanism that actually makes kernel aware of what's isolated what's not. Kind of like sched domain/cpuset relationship. ie cpusets affect sched domains but scheduler does not use cpusets directly. One could use cpusets to control the setting of cpu_isolated_map, separate from the code such as your select_irq_affinity() that uses it. Yes. That's what I proposed too. In one of the CPU isolation threads with Peter. The only issue is that you need to simulate CPU_DOWN hotplug even in order to cleanup what's already running on those CPUs. In a foreseeable future 2-8 cores will be most common configuration. Do you think that cpusets are needed/useful for those machines ? The reason I'm asking is because given the restrictions you mentioned above it seems that you might as well just do taskset -c 1,2,3 app1 taskset -c 3,4,5 app2 People tend to manage the CPU and memory placement of the threads and processes within a single co-operating job using taskset (sched_setaffinity) and numactl (mbind, set_mempolicy.) They tend to manage the placement of multiple unrelated jobs onto a single system, whether on separate or shared CPUs and nodes, using cpusets. > Something like cpu_isolated_map looks to me like a system-wide mechanism, which should, like sched_domains, be managed system-wide. Managing it with a mechanism that encourages each thread to update it directly, as if that thread owned the system, will break down, resulting in conflicting updates, as multiple, insufficiently co-operating threads issue conflicting settings. I'm not sure how to interpret that. I think you might have mixed a couple of things I asked about in one reply ;-). The question was that given the restrictions you talked about when you explained tiny-cpusets functionality I asked how much one gains from using them compared to the taskset/numactl. ie On the machines with 2-8 cores it's fairly easy to manage cpus with simple affinity masks. The second part of your reply seems to imply that I somehow made you think that I suggested that cpu_isolated_map is managed per thread. That is of course not the case. It's definitely a system-wide mechanism and individual threads have nothing to do with it. btw I just re-read my prev reply. I definitely did not say anything about threads managing cpu_isolated_map :). Stuff that I'm working on this days (wireless basestations) is designed with the following model: cpuN - runs soft-RT networking and management code cpuN+1 to cpuN+x - are used as dedicated engines ie Simplest example would be cpu0 - runs IP, L2 and control plane cpu1 - runs hard-RT MAC So if CPU isolation is implemented on top of the cpusets what kind of API do you envision for such an app ? That depends on what more API is needed. Do we need to place irqs better ... cpusets might not be a natural for that use. Aren't irqs directed to specific CPUs, not to hierarchically nested subsets of CPUs. You clipped the part where I elaborated. Which was: So if CPU isolation is implemented on top of the cpusets what kind of API do you envision for such an app ? I mean currently cpusets seems to be mostly dealing with entire processes, whereas in this case we're really dealing with the threads. ie Different threads of the same process require different policies, some must run on isolated cpus some must not. I guess one could write a thread's pid into cpusets fs but that's not very convenient. pthread_set_affinity() is exactly what's needed. In other words how would an app place its individual threads into the different cpusets. IRQ stuff is separate, like we said above cpusets could simply update cpu_isolated_map which would take care of IRQs. I was talking specifically about the thread management. Separate question: Is it desired that the dedicated CPUs cpuN+1 ... cpuN+x even appear as general purpose systems running a Linux kernel in your systems? These dedicated engines seem more like intelligent devices to me, such as disk controllers, which the kernel controls via device drivers, not by loading itself on them too. We still want to be able to run normal threads on them. Which means IPI, memory management, etc is still needed. So yes they better show up as normal CPUs :) Also with dynamic isolation you can for example un-isolate a cpu when you're compiling stuff on the machine and then isolate it when you're running special app(s). Max -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[RFC][PATCH] page reclaim throttle take2
Hi this patch is page reclaim improvement. o previous discussion: http://marc.info/?l=linux-mm=120339997125985=2 o test method $ ./hackbench 120 process 1000 o test result (average of 5 times measure) limit hackbench sys-time major-fault max-spent-time time(s) (s)in shrink_zone() (jiffies) 3 42.06 378.70 5336 6306 o reason why restrict parallel reclaim 3 task per zone we tested various parameter. - restrict 1 is best major fault. but worst max spent time. - restrict 3 is best max spent reclaim time and hackbench result. I think "restrict 3" cause most good experience. limit hackbench sys-time major-fault max-spent-time time(s) (s)in shrink_zone() (jiffies) 1 48.50 283.89 3690 9057 2 44.43 350.94 5245 7159 3 42.06 378.70 5336 6306 4 48.84 401.87 5474 6669 unlimited 282.301248.47 29026 - Please any comments! Signed-off-by: KOSAKI Motohiro <[EMAIL PROTECTED]> CC: KAMEZAWA Hiroyuki <[EMAIL PROTECTED]> CC: Balbir Singh <[EMAIL PROTECTED]> CC: Rik van Riel <[EMAIL PROTECTED]> CC: Lee Schermerhorn <[EMAIL PROTECTED]> CC: Nick Piggin <[EMAIL PROTECTED]> --- include/linux/mmzone.h |3 + mm/page_alloc.c|4 + mm/vmscan.c| 101 - 3 files changed, 99 insertions(+), 9 deletions(-) Index: b/include/linux/mmzone.h === --- a/include/linux/mmzone.h2008-02-25 21:37:49.0 +0900 +++ b/include/linux/mmzone.h2008-02-26 10:12:12.0 +0900 @@ -335,6 +335,9 @@ struct zone { unsigned long spanned_pages; /* total size, including holes */ unsigned long present_pages; /* amount of memory (excluding holes) */ + + atomic_tnr_reclaimers; + wait_queue_head_t reclaim_throttle_waitq; /* * rarely used fields: */ Index: b/mm/page_alloc.c === --- a/mm/page_alloc.c 2008-02-25 21:37:49.0 +0900 +++ b/mm/page_alloc.c 2008-02-26 10:12:12.0 +0900 @@ -3466,6 +3466,10 @@ static void __meminit free_area_init_cor zone->nr_scan_inactive = 0; zap_zone_vm_stats(zone); zone->flags = 0; + + zone->nr_reclaimers = ATOMIC_INIT(0); + init_waitqueue_head(>reclaim_throttle_waitq); + if (!size) continue; Index: b/mm/vmscan.c === --- a/mm/vmscan.c 2008-02-25 21:37:49.0 +0900 +++ b/mm/vmscan.c 2008-02-26 10:59:38.0 +0900 @@ -1252,6 +1252,55 @@ static unsigned long shrink_zone(int pri return nr_reclaimed; } + +#define RECLAIM_LIMIT (3) + +static int do_shrink_zone_throttled(int priority, struct zone *zone, + struct scan_control *sc, + unsigned long *ret_reclaimed) +{ + u64 start_time; + int ret = 0; + + start_time = jiffies_64; + + wait_event(zone->reclaim_throttle_waitq, + atomic_add_unless(>nr_reclaimers, 1, RECLAIM_LIMIT)); + + /* more reclaim until needed? */ + if (scan_global_lru(sc) && + !(current->flags & PF_KSWAPD) && + time_after64(jiffies, start_time + HZ/10)) { + if (zone_watermark_ok(zone, sc->order, 4*zone->pages_high, + MAX_NR_ZONES-1, 0)) { + ret = -EAGAIN; + goto out; + } + } + + *ret_reclaimed += shrink_zone(priority, zone, sc); + +out: + atomic_dec(>nr_reclaimers); + wake_up_all(>reclaim_throttle_waitq); + + return ret; +} + +static unsigned long shrink_zone_throttled(int priority, struct zone *zone, + struct scan_control *sc) +{ + unsigned long nr_reclaimed = 0; + int ret; + + ret = do_shrink_zone_throttled(priority, zone, sc, _reclaimed); + + if (ret == -EAGAIN) + nr_reclaimed = 1; + + return nr_reclaimed; +} + /* * This is the direct reclaim path, for page-allocating processes. We only * try to reclaim pages from zones which will satisfy the caller's allocation @@ -1268,12 +1317,11 @@ static unsigned long shrink_zone(int pri * If a zone is deemed to be full of pinned
[PATCH] x86_64: force re setting the mmconf for fam10h if acpi=off
some BIOS only let AMD fam 10h handle bus0, and nvidia mcp55/ck804 to handle other buses. at that case MCFG will cover all over them. but with acpi=off, we can not use MCFG. this patch will double check the busnbits, and if it is less handling 256 bues, and acpi=off will forcely reset the mmconf in msr, so we still use mmconf in above case. Signed-off-by: Yinghai Lu <[EMAIL PROTECTED]> Index: linux-2.6/arch/x86/kernel/setup_64.c === --- linux-2.6.orig/arch/x86/kernel/setup_64.c +++ linux-2.6/arch/x86/kernel/setup_64.c @@ -720,14 +720,21 @@ static void __cpuinit fam10h_check_enabl /* try to make sure that AP's setting is identical to BSP setting */ if (val & FAM10H_MMIO_CONF_ENABLE) { - u64 base; - base = val & (0xULL << 32); - if (fam10h_pci_mmconf_base_status <= 0) { - fam10h_pci_mmconf_base = base; - fam10h_pci_mmconf_base_status = 1; - return; - } else if (fam10h_pci_mmconf_base == base) - return; + unsigned busnbits; + busnbits = (val >> FAM10H_MMIO_CONF_BUSRANGE_SHIFT) & + FAM10H_MMIO_CONF_BUSRANGE_MASK; + + /* only trust the one handle 256 buses, if acpi=off */ + if (!acpi_pci_disabled || busnbits >= 8) { + u64 base; + base = val & (0xULL << 32); + if (fam10h_pci_mmconf_base_status <= 0) { + fam10h_pci_mmconf_base = base; + fam10h_pci_mmconf_base_status = 1; + return; + } else if (fam10h_pci_mmconf_base == base) + return; + } } /* -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: Aborted commands with arcmsr and 2xWD1500ADFD in RAID1
Hi Aron, Thanks for your patience. If you still got into trouble, please let me know. Thank you again, -Original Message- From: Aron Stansvik [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 26, 2008 6:52 AM To: erich Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; linux-kernel@vger.kernel.org Subject: Re: Aborted commands with arcmsr and 2xWD1500ADFD in RAID1 Hi Erich. 2008/2/25, nickcheng <[EMAIL PROTECTED]>: > Hi Aron, > From our field experiences and customers' feedbacks, all of them direct to > vibration and power issues. > The vibration could be caused by FANs not only by themselves. Okay. I have a chassi fan that is quite close to the drives, I will try disabling it. I've also ordered two Nexus TwinDisk anti vibration harddrive mounts with which I'll place the disks in my 5.25" slots instead, away from any fans. If this doesn't work, I'm stumped, as I really don't think it's the power supply and I don't have the money to buy a new one. > You mentioned it could be the F/W issue. > If the environment does not meet the prerequisite, FW could not work > correctly. > Actually FW just reacts to the situations not it causes the issue. Of course, I understand this. Just trying to figure this problem out.. > Please check it out!! I'll report back with my findings with moving disk away from fans and using anti-vibrations mounts. Thanks for taking your time to reply. Aron > Thank you, > > > -Original Message- > From: Aron Stansvik [mailto:[EMAIL PROTECTED] > Sent: Sunday, February 24, 2008 1:54 AM > To: [EMAIL PROTECTED] > Cc: erich; [EMAIL PROTECTED]; [EMAIL PROTECTED]; > linux-kernel@vger.kernel.org > > Subject: Re: Aborted commands with arcmsr and 2xWD1500ADFD in RAID1 > > Hello again Areca and LKML hackers. > > 2008/2/18, Aron Stansvik <[EMAIL PROTECTED]>: > > Hello Nick. > > > > Sorry that I'm not answering until now. I've been busy. > > > > 2008/2/13, nickcheng <[EMAIL PROTECTED]>: > > > > > Hi Aron, > > > From our experience and some customers' feedback, your issue could be > caused > > > by power instability or vibration to your HDs. > > > Please check step by step: > > > (1).under your original environment, increase the SCSI command value, > > > default=30, with the shell script, set_scsicmd_timeout(). 90 or 120 is > > > enough. > > > (2).if method 1 does not work, find out the vibration source or change > the > > > power supply > > > > > > I will try to increase that value. I don't think it's vibration; the > > disks are firmly in place in a very heavy chassi (Silverstone > > SST-TJ05B-T). And I really don't think there's something wrong with > > the power supply, it's a pretty new Silverstone ST65ZF 650W. This is > > my own personal workstation, so I don't just have another power supply > > to test with :/ > > > > I will report back on my success/failure. Thanks for your answer. > > I've now tried with both 90 and 120 for the timeout value, and the > problem still persists. It seems to happen when lots of small writes > are occuring, e.g. when installing something. > > I really don't think the disks are vibrating, I don't see how they > could. One more thing I'm going to test is to use the legacy ATA power > connector instead of the SATA power connector. This was what I was > using before when I only had a single drive and no RAID controller. > Maybe my power supply is malfunctioning and not giving enough power on > the SATA power connectors.. but I doubt it. > > Is there anything else that could cause this? Have you guys at Areca > tested the ARC-1200 with Raptors in RAID1? > > :( > > Regards, > Aron > > > > > > > Aron > > > > > > > If your still have any questions, please feel free to let me know. > > > > > > P.S. The attached driver source, arcmsr-1.20.00.15-71224, has been > > > upstreamed to kernel.org and will be released in kernel 2.6.25. If you > like, > > > you could update your driver with it. > > > It fixes some minor bugs, but these bugs are nothing to do with your > issue. > > > > > > > > > -Original Message- > > > From: erich [mailto:[EMAIL PROTECTED] > > > Sent: Wednesday, February 13, 2008 4:33 PM > > > To: (廣安科技)鄭守謙 > > > Subject: Fw: Aborted commands with arcmsr and 2xWD1500ADFD in RAID1 > > > > > > > > > > > > - Original Message - > > > From: "Andrew Morton" <[EMAIL PROTECTED]> > > > To: "Aron Stansvik" <[EMAIL PROTECTED]> > > > Cc: ; <[EMAIL PROTECTED]>; > "erich" > > > <[EMAIL PROTECTED]> > > > Sent: Wednesday, February 13, 2008 4:03 PM > > > Subject: Re: Aborted commands with arcmsr and 2xWD1500ADFD in RAID1 > > > > > > > > > > > > > > (cc's added) > > > > > > > > On Mon, 11 Feb 2008 17:44:08 +0100 "Aron Stansvik" > <[EMAIL PROTECTED]> > > > > wrote: > > > > > > > >> Hello LKML. > > > >> > > > >> Under semi-high disk I/O (e.g. installing
Re: GAK!!!! Re: PCI: AMD SATA IDE mode quirk
在 2008-02-26Tue的 06:53 +0800,Jeff Garzik写道: > Alan Cox wrote: > >> Signed-off-by: Crane Cai <[EMAIL PROTECTED]> > >> Acked-by: Jeff Garzik <[EMAIL PROTECTED]> > >> Signed-off-by: Greg Kroah-Hartman <[EMAIL PROTECTED]> > > > > Vomitted-upon-by: Alan Cox <[EMAIL PROTECTED]> > > > >> -if ((pdev->class >> 8) == PCI_CLASS_STORAGE_IDE) { > >> -u8 tmp; > >> +/* set sb600/sb700/sb800 sata to ahci mode */ > >> +u8 tmp; > >> > >> +pci_read_config_byte(pdev, PCI_CLASS_DEVICE, ); > >> +if (tmp == 0x01) { > > > > CLASS_DEVICE is cached in pdev->class so why not simply do: > > > > if (pdev->class & (1 << 8)) > > I agree. I [obviously] missed this when I ack'd, mainly ack'ing the > overall change. > > BIOS certainly may modify that PCI config register, but that's before > the kernel boots. So, using pdev->class is fine. > > Jeff pdev->class is also quirked when resume. We need to reread PCI configuation on resume. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH sched-devel 0/7] CPU isolation extensions
Hi Peter, Sorry for delay in reply. Please, wrap your emails at 78 - most mailers can do this. Done. On Fri, 2008-02-22 at 14:05 -0800, Max Krasnyanskiy wrote: Peter Zijlstra wrote: On Thu, 2008-02-21 at 18:38 -0800, Max Krasnyanskiy wrote: List of commits cpuisol: Make cpu isolation configrable and export isolated map cpu_isolated_map was a bad hack when it was introduced, I feel we should deprecate it and fully integrate the functionality into cpusets. That would give a much more flexible end-result. That's not not currently possible and will introduce a lot of complexity. I'm pretty sure you missied the discussion I had with Paul (you were cc'ed on that btw). In fact I provided the link to that discussion in the original email. Here it is again: http://marc.info/?l=linux-kernel=120180692331461=2 I read it, I just firmly disagree. Basically the problem is very simple. CPU isolation needs a simple/efficient way to check if CPU N is isolated. I'm not seeing the need for that outside of setting up the various states. That is, once all the affinities are set up, you'd hardly ever would (or should - imho) need to know if a particular CPU is isolated or not. Unless I'm missing something that's only possible for a very static system. What I mean is that yes you could go and set irq affinity, apps affinity, workqueue thread affinity, etc not to run on the isolated cpus. It works _until_ something changes, at which point the system needs to know that it's not supposed to touch CPU N. For example new IRQ is registered, new workqueue is created (fs mounted, net network interface is created, etc), new kthread is started, etc. Sure we can introduce default affinity masks for irqs, workqueues, etc. But that's essentially just duplicating cpu_isolated_map. cpuset/cgroup APIs are not designed for that. In other to figure out if a CPU N is isolated one has to iterate through all cpusets and checks their cpu maps. That requires several levels of locking (cgroup and cpuset). The other issue is that cpusets are a bit too dynamic (see the thread above for more details) we'd need notified mechanisms to notify subsystems when a CPUs become isolated. Again more complexity. Since I integrated cpu isolation with cpu hotplug it's already addressed in a nice simple way. I guess you have another definition of nice than I do. No, not really. Lets talk specifics. My goal was not to introduce a bunch of new functionality and rewrite workqueues and stuff, instead I wanted to integrated with existing mechanisms. CPU maps are used everywhere and exporting cpu_isolated_map was a natural way to make other parts of the kernel aware of the isolated CPUs. Please take a look at that discussion. I do not think it's worth the effort to put this into cpusets. cpu_isolated_map is very clean and simple concept and integrates nicely with the rest of the cpu maps. ie It's very much the same concept and API as cpu_online_map, etc. I'm thinking cpu_isolated_map is a very dirty hack. If we want to integrate this stuff with cpusets I think the best approach would be is to have cpusets update the cpu_isolated_map just like it currently updates scheduler domains. CPU-sets can already isolate cpus by either creating a cpu outside of any set, or a set with a single cpu not shared by any other sets. This only works for user-space. As I mentioned about for full CPU isolation various kernel subsystems need to be aware of that CPUs are isolated in order to avoid activities on them. Yes, hence the proposed system flag to handle the kernel bits like unbounded kernel threads and IRQs. I do not see a specific proposals here. The funny part that we're not even disagreeing on the high level. Yes It'd be nice to have such a flag ;-) But how will genirq subsystem, for example, be aware of that flag ? ie How would it know that by default it is not supposed to route irqs to the CPUs in the cpusets with that flag ? As I explained above setting affinity for existing irqs is not enough. Same for workqueus or any other susbsytem that wants to run per-cpu threads and stuff. This also allows for isolated groups, there are good reasons to isolate groups, esp. now that we have a stronger RT balancer. SMP and hard RT are not exclusive. A design that does not take that into account is too rigid. You're thinking scheduling only. Paul had the same confusion ;-) I'm not, I'm thinking it ought to allow for it. One way I can think of how to support groups and allow for RT balancer is this: Make scheduler ignore cpu_isolated_map and give cpusets full control of the scheduler domains. Use cpu_isolated_map to only for hw irq and other kernel sub-systems. That way cpusets could mark cpus in the group as isolated to get rid of the kernel activity and build sched domain such that tasks get balanced in it. The thing I do not like about it is that there is no way to boot the system with CPU N isolated
RE: 2.6.25-current-git hangs on boot
>-Original Message- >From: Rafael J. Wysocki [mailto:[EMAIL PROTECTED] >Sent: Sunday, February 24, 2008 3:18 AM >To: Soeren Sonnenburg >Cc: Oliver Pinter; Linux Kernel; Pallipadi, Venkatesh >Subject: Re: 2.6.25-current-git hangs on boot > >On Sunday, 24 of February 2008, Soeren Sonnenburg wrote: >> On Sat, 2008-02-23 at 20:00 +0100, Oliver Pinter wrote: >> > the pci=nommconf kernel parameter helped it? >> >> yes indeed, this switch reliably helps to over come the hang at *this >> stage* (I tried booting with booth the switch and w/o). >> >> however with 50% chance I still see a hang directly after >> >> cpuidle: using governor ladder > >Do you have CONFIG_CPU_IDLE set? If you have, please try to >unset it and >retest. > Rafael, I am looking at the CPU_IDLE part of this regression. Just want to note that there is another regression with needing pci=nommconf in current git which was not required in .24. I am not sure whether you are already tracking that as a different issue. Thanks, Venki -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.25-rc2 rcupreempt WARN after suspend to ram
Am Sonntag, 24. Februar 2008 schrieb Paul E. McKenney: > On Sun, Feb 24, 2008 at 04:38:15PM +0100, Karsten Wiese wrote: > > Am Samstag, 23. Februar 2008 schrieb Karsten Wiese: > > > Am Samstag, 23. Februar 2008 schrieb Paul E. McKenney: > > > > On Sat, Feb 23, 2008 at 01:41:02PM +0100, Karsten Wiese wrote: > > > > > Hi, > > > > > > > > > > This appeared in dmesg after > > > > > $ echo core > /sys/power/pm_test > > > > > followed by 3 cycles of > > > > > $ echo mem > /sys/power/state > > > > > . .config attached. > > > > > > > > > > dmesg excerpt (, full ~1MByte available): > > > > > > > > Does this tree have http://lkml.org/lkml/2008/1/29/208 applied? > > > > > > Yes. This tree was linus' git head as of yesterday or the day before. > > > > Updated to git-head of today, same test and .config, different symptoms > > like in this thread: http://lkml.org/lkml/2008/2/23/260 > > Later in this thread, Alan Cox said it looked like irq problems. > > Maybe also the rcupreemt related WARN_ON I saw are caused by irq problems. > > Might be, but am taking a closer look at the interaction between irq, > dynticks, and rcupreempt in any case. [Added Rafael, Thomas and Steven to CC] The "different symptoms" above are indeed unrelated and solved by reverting "commit 559bbe6cbd0d8c68d40076a5f7dc98e3bf5864b2 power_state: get rid of write-only variable in SATA" The cpu_hotplug code used by suspend together with hr_timer and nohz looks suspicious: $ echo 0 > /sys/devices/system/cpu/cpu1/online $ echo 1 > /sys/devices/system/cpu/cpu1/online (repeat until dmesg|tail shows WARNs; here it took 2 iterations) causes symptoms like in 1st message of this thread again. Thanks, Karsten -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Tiny cpusets -- cpusets for small systems?
> So. I see cpusets as a higher level API/mechanism and cpu_isolated_map as > lower > level mechanism that actually makes kernel aware of what's isolated what's > not. > Kind of like sched domain/cpuset relationship. ie cpusets affect sched domains > but scheduler does not use cpusets directly. One could use cpusets to control the setting of cpu_isolated_map, separate from the code such as your select_irq_affinity() that uses it. > In a foreseeable future 2-8 cores will be most common configuration. > Do you think that cpusets are needed/useful for those machines ? > The reason I'm asking is because given the restrictions you mentioned > above it seems that you might as well just do > taskset -c 1,2,3 app1 > taskset -c 3,4,5 app2 People tend to manage the CPU and memory placement of the threads and processes within a single co-operating job using taskset (sched_setaffinity) and numactl (mbind, set_mempolicy.) They tend to manage the placement of multiple unrelated jobs onto a single system, whether on separate or shared CPUs and nodes, using cpusets. Something like cpu_isolated_map looks to me like a system-wide mechanism, which should, like sched_domains, be managed system-wide. Managing it with a mechanism that encourages each thread to update it directly, as if that thread owned the system, will break down, resulting in conflicting updates, as multiple, insufficiently co-operating threads issue conflicting settings. > Stuff that I'm working on this days (wireless basestations) is designed > with the following model: > cpuN - runs soft-RT networking and management code > cpuN+1 to cpuN+x - are used as dedicated engines > ie Simplest example would be > cpu0 - runs IP, L2 and control plane > cpu1 - runs hard-RT MAC > > So if CPU isolation is implemented on top of the cpusets what kind of API do > you envision for such an app ? That depends on what more API is needed. Do we need to place irqs better ... cpusets might not be a natural for that use. Aren't irqs directed to specific CPUs, not to hierarchically nested subsets of CPUs. Separate question: Is it desired that the dedicated CPUs cpuN+1 ... cpuN+x even appear as general purpose systems running a Linux kernel in your systems? These dedicated engines seem more like intelligent devices to me, such as disk controllers, which the kernel controls via device drivers, not by loading itself on them too. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <[EMAIL PROTECTED]> 1.940.382.4214 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Make the kernel NTP code hand 64-bit *unsigned* values to do_div()
Roman Zippel <[EMAIL PROTECTED]> wrote: > I would actually prefer to introduce an explicit API for signed 64 > divides to get rid of the temps completely Yeah, that's a better plan. David -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/37] Permit filesystem local caching
Daniel Phillips <[EMAIL PROTECTED]> wrote: > On Monday 25 February 2008 15:19, David Howells wrote: > > So I guess there's a problem in cachefiles's efficiency - possibly due > > to the fact that it tries to be fully asynchronous. > > OK, not just my imagination, and it makes me feel better about the patch > set because efficiency bugs are fixable while fundamental limitations > are not. One can hope:-) > How much of a hurry are you in to merge this feature? You have bits > like this: I'd like to get it upstream sooner rather than later. As it's not upstream, but it's prerequisite patches touch a lot of code, I have to spend time regularly making my patches work again. Merge windows are completely not fun. > "Add a function to install a monitor on the page lock waitqueue for a > particular page, thus allowing the page being unlocked to be detected. > This is used by CacheFiles to detect read completion on a page in the > backing filesystem so that it can then copy the data to the waiting > netfs page." > > We already have that hook, it is called bio_endio. Except that isn't accessible. CacheFiles currently has no access to the notification from the blockdev to the backing fs, if indeed there is one. All we can do it trap the backing fs page becoming available. > My strong intuition is that your whole mechanism should sit directly on the > block device, no matter how attractive it seems to be able to piggyback on > the namespace and layout management code of existing filesystems. There's a place for both. Consider a laptop with a small disk, possibly subdivided between Linux and Windows. Linux then subdivides its bit further to get a swap space. What you then propose is to break off yet another chunk to provide the cache. You can't then use this other chunk for anything else, even if it's, say, 1% used by the cache. The way CacheFiles works is that you tell it that it can use up to a certain percentage of the otherwise free disk space on an otherwise existing filesystem. In the laptop case, you may just have a single big partition. The cache will fill up as much of it can, and as the other contents of the partition consume space, the cache will be culled to make room. On the other hand, a system like my desktop, where I can slap in extra disks with mound of extra disk space, it might very well make sense to commit block devices to caching, as this can be used to gain performance. I have another cache backend (CacheFS) which takes the form of a filesystem, thus allowing you to mount a blockdev as a cache. It's much faster than Ext3 at storing and retrieving files... at first. The problem is that I've mucked up the free space retrieval such that performance degrades by 20x over time for files of any size. Basically any cache on a raw blockdev _is_ a filesystem, just one in which you're randomly allowed to discard data to make life easier. > I see your current effort as the moral equivalent of FUSE: you are able to > demonstrate certain desirable behavioral properties, but you are unable to > reach full theoretical efficiency because there are layers and layers of > interface gunk interposed between the netfs user and the cache device. The interface gunk is meant to be as thin as possible, but there are constraints (see the documentation in the main FS-Cache patch for more details): (1) It's a requirement that it only be tied to, say, AFS. We might have several netfs's that want caching: AFS, CIFS, ISOFS (okay, that last isn't really a netfs, but it might still want caching). (2) I want to be able to change the backing cache. Under some circumstances I might want to use an existing filesystem, under others I might want to commit a blockdev. I've even been asked about using battery-backed RAM - which has different design constraints. (3) The constraint has been imposed by the NFS team that the cache be completely asynchronous. I haven't quite met this: readpages() will wait until the cache knows whether or not the pages are available on the principle that read operations done through the cache can be considered synchronous. This is an attempt to reduce the context switchage involved. Unfortunately, the asynchronicity requirement has caused the middle layer to bloat. Fortunately, the backing cache needn't bloat as it can use the middle layer's bloat. > That said, I also see you have put a huge amount of work into this over > the years, it is nicely broken out, you are responsive and easy to work > with, all arguments for an early merge. Against that, you invade core > kernel for reasons that are not necessarily justified: > > * two new page flags I need to keep track of two bits of per-cached-page information: (1) This page is known by the cache, and that the cache must be informed if the page is going to go away. (2) This page is being written to disk by the cache, and that it cannot be released
Re: Tabs, spaces, indent and 80 character lines
On Feb 25 2008 23:13, Richard Knutsson wrote: > Miles Bader wrote: >> Why do people even respond to these trolls...? >> > Obviously, this must to have been discussed before, with a clear conclusion. It has been discussed before, at http://lkml.org/lkml/2007/11/12/19 . What is really frustrating is that some of the people which _do_ enforce one style of the two (tabs-only or tabs-spaces) have only indirectly voiced their preferred style, as in http://lkml.org/lkml/2008/1/19/67 . Often it was just (1)"checkpatch flagged your spaces, hence they are wrong" instead of (2)"in my tree, I only take tabs-only patches". Now back to coding, oh and don't forget send a patch for CodingStyle since a mail without one is often taken even less seriously. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Make the kernel NTP code hand 64-bit *unsigned* values to do_div()
Hi, On Thu, 21 Feb 2008, David Howells wrote: > The kernel NTP code shouldn't hand 64-bit *signed* values to do_div(). Make > it > instead hand 64-bit unsigned values. This gets rid of a couple of warnings. I would actually prefer to introduce an explicit API for signed 64 divides to get rid of the temps completely, something like below. Right now it uses do_div as fallback. When all archs are converted, do_div can be single compatibility define and perhaps we can get rid of it completely. Bonus feature: implement the x86 version without the asm casts allowing gcc to generate better code. bye, Roman --- include/asm-generic/div64.h | 14 ++ include/asm-i386/div64.h| 20 include/linux/calc64.h | 28 kernel/time.c | 26 +++--- kernel/time/ntp.c | 21 + lib/div64.c | 21 - 6 files changed, 94 insertions(+), 36 deletions(-) Index: linux-2.6/include/asm-generic/div64.h === --- linux-2.6.orig/include/asm-generic/div64.h +++ linux-2.6/include/asm-generic/div64.h @@ -35,6 +35,20 @@ static inline uint64_t div64_64(uint64_t return dividend / divisor; } +static inline u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder) +{ + *remainder = dividend % divisor; + return dividend / divisor; +} +#define div_u64_remdiv_u64_rem + +static inline s64 div_s64_rem(s64 dividend, s32 divisor, s32 *remainder) +{ + *remainder = dividend % divisor; + return dividend / divisor; +} +#define div_s64_remdiv_s64_rem + #elif BITS_PER_LONG == 32 extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor); Index: linux-2.6/include/asm-i386/div64.h === --- linux-2.6.orig/include/asm-i386/div64.h +++ linux-2.6/include/asm-i386/div64.h @@ -48,5 +48,25 @@ div_ll_X_l_rem(long long divs, long div, } +static inline u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder) +{ + union { + u64 v64; + u32 v32[2]; + } d = { dividend }; + u32 upper; + + upper = d.v32[1]; + if (upper) { + upper = d.v32[1] % divisor; + d.v32[1] = d.v32[1] / divisor; + } + asm ("divl %2" : "=a" (d.v32[0]), "=d" (*remainder) : + "rm" (divisor), "0" (d.v32[0]), "1" (upper)); + return d.v64; +} +#define div_u64_remdiv_u64_rem + extern uint64_t div64_64(uint64_t dividend, uint64_t divisor); + #endif Index: linux-2.6/include/linux/calc64.h === --- linux-2.6.orig/include/linux/calc64.h +++ linux-2.6/include/linux/calc64.h @@ -46,4 +46,32 @@ static inline long div_long_long_rem_sig return res; } +#ifndef div_u64_rem +static inline u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder) +{ + *remainder = do_div(dividend, divisor); + return dividend; +} +#endif + +#ifndef div_u64 +static inline u64 div_u64(u64 dividend, u32 divisor) +{ + u32 remainder; + return div_u64_rem(dividend, divisor, ); +} +#endif + +#ifndef div_s64_rem +extern s64 div_s64_rem(s64 dividend, s32 divisor, s32 *remainder); +#endif + +#ifndef div_s64 +static inline s64 div_s64(s64 dividend, s32 divisor) +{ + s32 remainder; + return div_s64_rem(dividend, divisor, ); +} +#endif + #endif Index: linux-2.6/kernel/time.c === --- linux-2.6.orig/kernel/time.c +++ linux-2.6/kernel/time.c @@ -661,9 +661,7 @@ clock_t jiffies_to_clock_t(long x) #if (TICK_NSEC % (NSEC_PER_SEC / USER_HZ)) == 0 return x / (HZ / USER_HZ); #else - u64 tmp = (u64)x * TICK_NSEC; - do_div(tmp, (NSEC_PER_SEC / USER_HZ)); - return (long)tmp; + return div_u64((u64)x * TICK_NSEC, NSEC_PER_SEC / USER_HZ); #endif } EXPORT_SYMBOL(jiffies_to_clock_t); @@ -675,16 +673,12 @@ unsigned long clock_t_to_jiffies(unsigne return ~0UL; return x * (HZ / USER_HZ); #else - u64 jif; - /* Don't worry about loss of precision here .. */ if (x >= ~0UL / HZ * USER_HZ) return ~0UL; /* .. but do try to contain it here */ - jif = x * (u64) HZ; - do_div(jif, USER_HZ); - return jif; + return div_u64((u64)x * HZ, USER_HZ); #endif } EXPORT_SYMBOL(clock_t_to_jiffies); @@ -692,17 +686,15 @@ EXPORT_SYMBOL(clock_t_to_jiffies); u64 jiffies_64_to_clock_t(u64 x) { #if (TICK_NSEC % (NSEC_PER_SEC / USER_HZ)) == 0 - do_div(x, HZ / USER_HZ); + return div_u64(x, HZ / USER_HZ); #else /* * There are better ways that don't overflow early, * but even this doesn't overflow in hundreds of years * in 64 bits, so.. */ - x *=