Re: Bugreport: Kernel 2.4.x crash
Hi!

> > I have no experience with kernel debugging, but so far, I have found no log entry giving me a hint and the screen is blank after the crash
>
> Could you disable console blanking (setterm -blank 0)? We really need a hint where it crashed.

Over the Easter weekend I took some time for testing. One IDE channel does not work with DMA enabled, which is the bootup default. After about 30 seconds, the channel is switched to PIO and the machine is running again.

Funny though: before, I could not return from console blanking or reach the machine through the network.

But as with any production system, I would rather keep it running than spend downtime seeking the error.

Thank you all.

Jörn

-- 
Jörn Engel
mailto: [EMAIL PROTECTED]
http://wohnheim.fh-wedel.de/~joern

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Bugreport: Kernel 2.4.x crash
1. Kernel crash without error message or logfile entry

2. A fileserver with an ABIT Hotrod 66 (hpt366) controller will crash within 5-60 minutes after boot with a 2.4.x kernel. 2.2.x works fine. No other exotic hardware. Another possibility might be Reiserfs, which I use for all partitions except /. I have no experience with kernel debugging, but so far, I have found no log entry giving me a hint and the screen is blank after the crash. There might have been some output before, but the machine is in the basement and too important for excessive testing. I have tried 2.4.2 and 2.4.3 once each.

3. ide, hpt366

4. 2.4.2, 2.4.3

5. -

6. -

7. All this information is taken from the running 2.2.18 kernel.

7.1. sh /usr/src/linux/scripts/ver_linux

-- Versions installed: (if some fields are empty or look
-- unusual then possibly you have very old versions)
Linux belfast 2.2.18 #1 Fri Feb 23 14:47:14 CET 2001 i586 unknown
Kernel modules         2.4.2
Gnu C                  2.95.3
Gnu Make               3.79.1
Binutils               2.11.90.0.1
Linux C Library        2.2.2
Dynamic linker         ldd (GNU libc) 2.2.2
Procps                 2.0.7
Mount                  2.11b
Net-tools              2.05
Console-tools          0.2.3
Sh-utils               2.0.11
Modules Loaded         sb uart401 sound soundcore

7.2. cat /proc/cpuinfo

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 5
model           : 4
model name      : Pentium MMX
stepping        : 3
cpu MHz         : 200.459
fdiv_bug        : no
hlt_bug         : no
sep_bug         : no
f00f_bug        : yes
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr mce cx8 mmx
bogomips        : 399.76

7.3. cat /var/log/ksymoops/20010401164317.modules (2.4.3)

sb                      2128   0 (unused)
sb_lib                 33936   0 [sb]
uart401                 6352   0 [sb_lib]
sound                  56400   0 [sb_lib uart401]
soundcore               3792   5 [sb_lib sound]
raid1                  12784   0 (unused)
raid0                   3520   0 (unused)
md                     41056   0 [raid1 raid0]

7.4. cat /proc/ioports

0000-001f : dma1
0020-003f : pic1
0040-005f : timer
0060-006f : keyboard
0080-008f : dma page reg
00a0-00bf : pic2
00c0-00df : dma2
00f0-00ff : fpu
01f0-01f7 : ide0
0220-022f : soundblaster
02f8-02ff : serial(set)
0330-0333 : MPU-401 UART
03c0-03df : vga+
03e8-03ef : serial(auto)
03f6-03f6 : ide0
03f8-03ff : serial(set)
6100-6107 : ide2
6202-6202 : ide2
6400-6407 : ide3
6502-6502 : ide3
6700-677f : eth0
f000-f007 : ide0
f008-f00f : ide1

7.5. lspci -vvv

00:00.0 Host bridge: Intel Corporation 430HX - 82439HX TXC [Triton II] (rev 03)
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR-
        Latency: 32

00:07.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] (rev 01)
        Control: I/O+ Mem+ BusMaster+ SpecCycle+ MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0

00:07.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II] (prog-if 80 [Master])
        Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 32
        Region 4: I/O ports at f000

00:08.0 Unknown mass storage controller: Triones Technologies, Inc. HPT366 (rev 01)
        Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 248 (2000ns min, 2000ns max), cache line size 08
        Interrupt: pin A routed to IRQ 11
        Region 0: I/O ports at 6100
        Region 1: I/O ports at 6200
        Region 4: I/O ports at 6300

00:08.1 Unknown mass storage controller: Triones Technologies, Inc. HPT366 (rev 01)
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 248 (2000ns min, 2000ns max), cache line size 08
        Interrupt: pin A routed to IRQ 11
        Region 0: I/O ports at 6400
        Region 1: I/O ports at 6500
        Region 4: I/O ports at 6600

00:0a.0 VGA compatible controller: S3 Inc. Trio 64V2/DX or /GX (rev 16) (prog-if 00 [VGA])
        Subsystem: Elsa AG: Unknown device 0935
        Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Interrupt:
Re: [RFC] MTD driver for MMC cards
On Mon, 16 April 2007 01:33:17 +0200, Arnd Bergmann wrote:
>
> There is also still some need for performance testing. Jörn brought up the point that if a specific card can't have multiple open erase blocks simultaneously, it's rather pointless for logfs. It might still be useful to use jffs2 on those cards, because AFAIK that only writes to one erase block at any time.

This appears to be a problem for practically all consumer-available flash media. The manufacturers spend a lot of effort trying to hide any flash properties from their users. And while this is a decent strategy for FAT, ext3, ntfs and similar, it is actually very inefficient for a flash filesystem.

After talking to several manufacturers, most seemed to be fairly open-minded towards supporting an alternate interface with raw flash access. So much for the good news. The bad news is that such an alternate interface still needs to be defined.

Jörn

-- 
Courage is not the absence of fear, but rather the judgement that something else is more important than fear.
-- Ambrose Redmoon
Re: ZFS with Linux: An Open Plea
On Mon, 16 April 2007 17:46:50 +0200, Tomasz Kłoczko wrote:
> On Mon, 16 Apr 2007, Christoph Hellwig wrote:
> >
> > Numbers, please. So far in all interesting benchmarks it actually was slower. But when they're faster than XFS somewhere I'd definitely be interested in looking at why this is true and if possible and important enough fix it.

Christoph, could you show some numbers as well? While I usually trust your opinion, I have yet to see any substantial argument against ZFS from your side.

> http://cmynhier.blogspot.com/2006/05/zfs-io-reordering-benchmark.html
> http://blogs.sun.com/bill/#zfs_vs_the_benchmark

If you read closely you may notice that ZFS had relatively little to do with read performance under heavy write load. ZFS simply has some fancy I/O scheduling code that in particular deals with deadlines. The Linux equivalent appears to be CONFIG_IOSCHED_DEADLINE. But the quoted benchmark does not mention which scheduler was used for Linux.

So unless the benchmark is redone and properly documented, its numbers are fairly worthless. Bummer.

> http://cmynhier.blogspot.com/2006/05/zfs-benchmarking.html

"The company I work for would probably balk if I put that script here"

No publicly available benchmark. So even if a third party wanted to, it couldn't recreate the benchmark. Again, fairly worthless.

So by my count, neither side has shown any worthwhile numbers. Whether ZFS performance is better or worse is anyone's guess.

Jörn

-- 
Simplicity is prerequisite for reliability.
-- Edsger W. Dijkstra
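Recording the I/O scheduler alongside benchmark numbers is cheap. A minimal sketch, assuming a Linux system that exposes the scheduler through sysfs at /sys/block/<device>/queue/scheduler, where the active entry is shown in square brackets. The function names and the default device name "sda" are made up for illustration and do not come from the thread:

```python
def active_io_scheduler(sysfs_line):
    """Return the scheduler marked as active in a sysfs scheduler line.

    The kernel lists all available schedulers on one line and puts the
    active one in square brackets, e.g. "noop [deadline] cfq".
    Returns None if no bracketed entry is present.
    """
    for token in sysfs_line.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_active_io_scheduler(device="sda"):
    """Read the active scheduler for a block device (hypothetical helper)."""
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        return active_io_scheduler(f.read())
```

A benchmark script could log the result of read_active_io_scheduler() next to its timings, so that a third party can tell whether the deadline scheduler was in use.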
Re: [00/17] Large Blocksize Support V3
On Tue, 24 April 2007 15:21:05 -0700, [EMAIL PROTECTED] wrote:
>
> This patchset modifies the Linux kernel so that larger block sizes than page size can be supported. Larger block sizes are handled by using compound pages of an arbitrary order for the page cache instead of single pages with order 0.

I like to see this.

> 2. 32/64k blocksize is also used in flash devices. Same issues.

Actually most chips I encounter these days already have 128KiB. And some people seem to do some kind of raid-0 in the drivers to increase bandwidth. FS-visible blocksize is also increased by that.

> Unsupported
> - Mmapping blocks larger than page size

Bummer. Can this change in the future?

> Issues:
> - There are numerous places where the kernel can no longer assume that the page cache consists of PAGE_SIZE pages that have not been fixed yet.
> - Defrag warning: The patch set can fragment memory very fast. It is likely that Mel Gorman's anti-frag patches and some more work by him on defragmentation may be needed if one wants to use super sized pages. If you run a 2.6.21 kernel with this patch and start a kernel compile on a 4k volume with a concurrent copy operation to a 64k volume on a system with only 1 Gig then you will go boom (ummm no ... OOM) fast. How well Mel's antifrag/defrag methods address this issue still has to be seen.

only 1 Gig :)

With my LogFS hat on, I don't care too much whether data is cached in terms of pages or blocks. What matters to me most is to get fed blocksize'd chunks on writeback and to be able to read blocksize'd chunks. Compressing 64KiB at a time gives somewhere around 10% (don't remember the exact number) better compression when compared to 4KiB. JFFS2 can benefit from this as well.

That should also be sufficient for cross-platform compatibility, shouldn't it?

Better performance for the pagecache is also nice to have, no doubt. But if system stability remains an issue, I'd rather keep slow and stable.

Jörn

-- 
More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason - including blind stupidity.
-- W. A. Wulf
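The effect of chunk size on compression can be tried out in user space. A rough sketch, assuming zlib as a stand-in compressor and synthetic word-based input; neither is taken from LogFS or JFFS2, and the helper names are invented for this illustration. The actual gain depends heavily on the data, so the roughly 10% figure mentioned above is not something this sketch reproduces or confirms:

```python
import random
import zlib


def compressed_size(data, chunk_size):
    """Total size when each chunk_size piece is compressed independently."""
    return sum(
        len(zlib.compress(data[i:i + chunk_size]))
        for i in range(0, len(data), chunk_size)
    )


def sample_text(n_bytes, seed=0):
    """Deterministic pseudo-text built from a small, made-up vocabulary."""
    rng = random.Random(seed)
    words = [b"inode", b"segment", b"journal", b"erase",
             b"block", b"flash", b"cleaner", b"page"]
    out = bytearray()
    while len(out) < n_bytes:
        out += rng.choice(words) + b" "
    return bytes(out[:n_bytes])


if __name__ == "__main__":
    data = sample_text(256 * 1024)
    small = compressed_size(data, 4 * 1024)
    large = compressed_size(data, 64 * 1024)
    print(f"4KiB chunks: {small} bytes, 64KiB chunks: {large} bytes")
```

Each independently compressed chunk starts with an empty dictionary and carries its own header, which is why fewer, larger chunks tend to come out smaller in total.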
Re: Documenting MS_RELATIME
On Mon, 12 February 2007 18:49:39 +0100, Jan Engelhardt wrote:
> On Feb 12 2007 10:40, Dave Jones wrote:
> >
> > The one problem with noatime is that mutt's 'new mail arrived' breaks
>
> Just why does not it use mtime then to check for New Mail Arrived, like bash does?

Just a guess: because it has to compare the time? Bash can simply compare mtime of (single) mailbox with time of last login. Mutt would have to compare mtime of (many) mailboxes with... I believe with atime of mailboxes.

Jörn

-- 
Joern's library part 1:
http://lwn.net/Articles/2.6-kernel-api/
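The comparison guessed at above can be written down in a few lines. This is a sketch of the general mtime-versus-atime heuristic only; it is an assumption, not mutt's actual code, and has_new_mail is a name invented here. It also shows why noatime hurts: if atime is never updated on read, a mailbox that was written once keeps looking unread.

```python
import os
import tempfile


def has_new_mail(path):
    """Heuristic: a non-empty mailbox modified after it was last read."""
    st = os.stat(path)
    return st.st_size > 0 and st.st_mtime > st.st_atime


if __name__ == "__main__":
    fd, mailbox = tempfile.mkstemp()
    os.write(fd, b"From someone\n")
    os.close(fd)
    os.utime(mailbox, (100, 200))   # (atime, mtime): written after last read
    print(has_new_mail(mailbox))
    os.utime(mailbox, (300, 200))   # read after the last write
    print(has_new_mail(mailbox))
    os.remove(mailbox)
```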
Re: [PATCH x86 for review III] [10/29] i386: don't include bugs.h
On Mon, 12 February 2007 17:51:30 +0100, Andi Kleen wrote:
>
> From: Andrew Morton [EMAIL PROTECTED]
>
> That stupid non-inlined-static function in bugs.h causes:
>
> include/asm/bugs.h:186: warning: 'check_bugs' defined but not used
>
> But fortunately the include isn't needed.
>
> Cc: Andi Kleen [EMAIL PROTECTED]
> Signed-off-by: Andrew Morton [EMAIL PROTECTED]
> Signed-off-by: Andi Kleen [EMAIL PROTECTED]
> ---
>
>  arch/i386/kernel/alternative.c |    1 -
>  1 file changed, 1 deletion(-)
>
> Index: linux/arch/i386/kernel/alternative.c
> ===================================================================
> --- linux.orig/arch/i386/kernel/alternative.c
> +++ linux/arch/i386/kernel/alternative.c
> @@ -4,7 +4,6 @@
>  #include <linux/list.h>
>  #include <asm/alternative.h>
>  #include <asm/sections.h>
> -#include <asm/bugs.h>
>  
>  static int no_replacement    = 0;
>  static int smp_alt_once      = 0;

Didn't your patchset also include a near-identical patch from Adrian Bunk (with - and + exchanged)?

Jörn

-- 
Courage is not the absence of fear, but rather the judgement that something else is more important than fear.
-- Ambrose Redmoon
Re: SATA-performance: Linux vs. FreeBSD
On Tue, 13 February 2007 11:27:58 +, Alan wrote:
>
> isn't yet a heavily optimised libata path. Secondly erase block size matters with flash drives so the bigger each I/O the better erase block behaviour we should get.

Although that should max out somewhere between 16KiB and 128KiB, depending on the chips being used.

Jörn

-- 
If you're willing to restrict the flexibility of your approach, you can almost always do something better.
-- John Carmack
Re: SATA-performance: Linux vs. FreeBSD
On Tue, 13 February 2007 11:29:18 +0100, Martin A. Fink wrote:
>
> Please Read Carefully! I talk about flash disk, not normal harddisks. There are no mechanical parts in flash disks, only flash memory. And therefore 48MB/s is excellent (compared to all other available disks)
>
> [...]
>
> Well. The testdrive has 27GB. The final drive will have 225 GB. And there will be 3 cameras and thus 3 disks. This means we talk about 140 MB/s for around 90 minutes.

Do you have any numbers on the performance for the final drive? Single flash chips are relatively slow; the high bandwidth is usually achieved by writing in parallel to several of them. With the bigger drive you get more chips and the manufacturer could run more of them in parallel.

Jörn

-- 
With a PC, I always felt limited by the software available. On Unix, I am limited only by my knowledge.
-- Peter J. Schoenster
Re: GPL vs non-GPL device drivers
On Thu, 15 February 2007 00:40:31 -0800, v j wrote:
>
> Oh, I am sorry. Seems like the German courts have spoken. I am not sure about what, but they have spoken. Sorry for the confusion.

In short, there seem to be two classes of closed-source drivers:

1. ATI and nVidia. Both are well-known, in both cases they seem to avoid the legally important aspect of shipping their driver along with a kernel, and they seem to be in relatively safe water legally. At least I haven't heard about them getting sued yet.

2. The embedded companies. By the very nature of selling an embedded device they are shipping their drivers along with a kernel and seem to be in very shallow water. Dozens of them have received letters from lawyers and didn't even dare go to court - they just complied.

While this list is not exhaustive and your company's case may be different from all others, it does give you a hint of what your chances might be in court. Go to http://gpl-violations.org/ and do your research.

The question whether a specific closed-source driver is legal or not can only be answered in court and only on a case-by-case basis. You should have a good idea of what many developers' personal opinion is, and with the research you can also estimate your legal position. Then make your decision, as no one here is going to make it for you - even if some would like to.

Jörn

-- 
Eighty percent of success is showing up.
-- Woody Allen
Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation
On Thu, 15 February 2007 19:38:14 +0100, Juan Piernas Canovas wrote:
>
> The patch for 2.6.11 is not still stable enough to be released. Be patient ;-)

While I don't want to discourage you, this is about the point in development where most log-structured filesystems stopped. Doing a little web research, you will notice those todo-lists with the cleaner being the top item for...years!

Getting that one to work robustly is _very_ hard work and just today I've noticed that mine was not as robust as I would have liked to think. Also, you may note that by updating to newer kernels, the VM writeout policies can change and impact your cleaner. Even to the extent that you had a rock-solid filesystem with 2.6.18 and things crumble between your fingers in 2.6.19 or later.

If the latter happens, most likely the VM is not to blame; it just proved that your cleaner is still getting some corner cases wrong and needs more work. There goes another week of debugging. :(

Jörn
Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation
On Thu, 15 February 2007 23:59:14 +0100, Juan Piernas Canovas wrote:
>
> Actually, the version of DualFS for Linux 2.4.19 implements a cleaner. In our case, the cleaner is not really a problem because there is not too much to clean (the meta-data device only contains meta-data blocks which are 5-6% of the file system blocks; you do not have to move data blocks).

That sounds as if you have not hit the interesting cases yet. Fun starts when your device is near-full and you have a write-intensive workload. In your case, that would be metadata-write-intensive.

For one, this is where write performance of log-structured filesystems usually goes down the drain. And worse, it is where the cleaner can run into a deadlock.

Being good where log-structured filesystems usually are horrible is a challenge. And I'm sure many people are more interested in those performance numbers than in the ones you shine at. :)

Jörn

-- 
Joern's library part 14:
http://www.sandpile.org/
Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation
On Fri, 16 February 2007 18:47:48 -0500, Bill Davidsen wrote:
>
> Actually I am interested in the common case, where the machine is not out of space, or memory, or CPU, but when it is appropriately sized to the workload. Not that I lack interest in corner cases, but the running flat out case doesn't reflect the case where there's enough hardware, now the o/s needs to use it well.

There is one detail about this specific corner case you may be missing. Most log-structured filesystems don't just drop in performance - they can run into a deadlock and the only recovery from this is the lovely backup-mkfs-restore procedure.

If it was just performance, I would agree with you.

Jörn

-- 
He that composes himself is wiser than he that composes a book.
-- B. Franklin
Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation
On Sat, 17 February 2007 13:10:23 -0500, Bill Davidsen wrote:
>
> I missed that. Which corner case did you find triggers this in DualFS?

This is not specific to DualFS; it applies to any log-structured filesystem.

Garbage collection always needs at least one spare segment to collect valid data into. Regular writes may require additional free segments, so GC has to kick in and free those when space is getting tight. (1)

GC frees segments by writing all valid data in them into the spare segment. If there is remaining space in the spare segment, GC can move more data from further segments. Nice and simple.

The requirement is that GC *always* frees more segments than it uses up doing so. If that requirement is not fulfilled, GC will simply use up its last spare segment without freeing a new one. We have a deadlock.

Now imagine your filesystem is 90% full and all data is spread perfectly across all segments. The best segment you could pick for GC is 90% full. One would imagine that GC would only need to copy those 90% into a spare segment and have freed 100%, making overall progress.

But most log-structured filesystems maintain a tree of some sort on the medium. If you move data elsewhere, you also need to update the indirect block pointing to it. So that has to get written as well. If you have doubly or triply indirect blocks, those need to get written. So you can end up writing 180% or more to free 100%. Deadlock.

And if you read the documentation of the original Sprite LFS or any of the newer log-structured filesystems, you usually won't see a solution to this problem, or even an acknowledgement that the problem exists in the first place. But there is no shortage of log-structured filesystem projects that were abandoned years ago and have the cleaner or garbage collector as the top item on their todo-list. Coincidence?

(1) GC may also kick in earlier, but that is just an optimization and doesn't change the worst case, so that bit is irrelevant here.

Btw, the deadlock problem is solvable and I definitely don't want to discourage further work in this area. DualFS does look interesting. But my solution for this problem will likely eat up all the performance DualFS has gained and more, as it isn't aimed at hard disks. So someone has to come up with a different idea.

Jörn

-- 
To recognize individual spam features you have to try to get into the mind of the spammer, and frankly I want to spend as little time inside the minds of spammers as possible.
-- Paul Graham
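The 90% example above can be checked with a few lines of arithmetic. This is a minimal sketch of the worst-case accounting only, assuming every moved data block forces a fixed number of extra indirect-block writes; the function and parameter names are invented for illustration and do not belong to any filesystem:

```python
from fractions import Fraction


def gc_segments_written(valid_fraction, indirect_writes_per_block=0):
    """Segments' worth of writes needed to clean one segment (worst case).

    valid_fraction: share of the victim segment that still holds valid data.
    indirect_writes_per_block: extra tree blocks rewritten per moved block.
    """
    return valid_fraction * (1 + indirect_writes_per_block)


def gc_net_gain(valid_fraction, indirect_writes_per_block=0):
    """Segments freed minus segments consumed by cleaning one segment."""
    return 1 - gc_segments_written(valid_fraction, indirect_writes_per_block)


if __name__ == "__main__":
    victim = Fraction(9, 10)          # best available segment is 90% full
    print(gc_net_gain(victim))        # data only: GC still makes progress
    print(gc_net_gain(victim, 1))     # one indirect block per data block
```

With one indirect write per data block the cleaner writes 9/5 of a segment (the 180% from the mail) to free one segment, so the net gain is negative and the last spare segment is used up.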
Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation
On Sat, 17 February 2007 15:47:01 -0500, Sorin Faibish wrote:
>
> DualFS can probably get around this corner case as it is up to the user to select the size of the MD device size. If you want to prevent this corner case you can always use a device bigger than 10% of the data device which is exaggerated for any FS assuming that the directory files are so large (this is when you have billions of files with long names). In general the problem you mention is mainly due to the data blocks filling the file system. In DualFS case you have the choice of selecting different sizes for the MD and Data volume. When Data volume gets full the GC will have a problem but the MD device will not have a problem. It is my understanding that most of the GC problem you mention is due to the filling of the FS with data and the result is a MD operation being disrupted by the filling of the FS with data blocks. As about the performance impact on solving this problem, as you mentioned all journal FSs will have this problem, I am sure that DualFS performance impact will be less than others at least due to using only one MD write instead of 2.

You seem to make the usual mistakes people make when they start to think about this problem. But I could be misinterpreting you, so let me paraphrase your mail in questions and answer what I believe you said.

Q: Are journaling filesystems identical to log-structured filesystems?

Not quite. Journaling filesystems usually have a very small journal (or log, same thing) and only store the information necessary for atomic transactions in the journal. Not sure what a journal FS is, but the name seems closer to a journaling filesystem.

Q: DualFS separates Data and Metadata. Does that make a difference?

Not really. What I called data in my previous mail is a log-structured filesystem's view of data. DualFS stores file content separately, so from an lfs view, that doesn't even exist. But directory content exists and behaves just like file content wrt. the deadlock problem. Any data or metadata that cannot be GC'd by simply copying but requires writing further information like indirect blocks, B-Tree nodes, etc. will cause the problem.

Q: If the user simply reserves some extra space, does the problem go away?

Definitely not. It will be harder to hit, but a rare deadlock is still a deadlock. Again, this is only concerned with the log-structured part of DualFS, so we can ignore the Data volume.

When data is spread perfectly across all segments, the best segment one can pick for GC is just as bad as the worst. So let us take some examples. If 50% of the lfs is free, you can pick a 50% segment for GC. Writing every single block in it may require writing one additional indirect block, so GC is required to write out a 100% segment. It doesn't make any progress at 50% (in a worst case scenario) and could deadlock if less than 50% were free.

If, however, GC has to write out a singly and a doubly indirect block, 67% of the lfs needs to be free. In general, if the maximum height of your tree is N, you need (N-1)/N * 100% free space. Most people refer to that as too much.

If you have less free space, the filesystem will work just fine most of the time. That is nice and cool, but it won't help your rare user that happens to hit the rare deadlock. Any lfs needs a strategy to prevent this deadlock for good, not just make it mildly unlikely.

Jörn

-- 
Error protection by error detection and correction.
-- from a university class
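The (N-1)/N rule above follows from requiring that cleaning a segment never costs more than the one segment it frees. A small sketch, assuming N counts the data block itself plus every level of indirect blocks above it, which is the reading that matches the 50% and 67% examples in the mail; min_free_fraction is a made-up name:

```python
from fractions import Fraction


def min_free_fraction(tree_height):
    """Worst-case free space needed so that GC at least breaks even.

    Moving one valid block costs tree_height writes in the worst case
    (the block plus one indirect block per level above it). A victim
    segment with valid share v then costs v * tree_height segments to
    clean and frees 1, so v <= 1 / tree_height, and the free share is
    1 - v >= (tree_height - 1) / tree_height.
    """
    if tree_height < 1:
        raise ValueError("tree_height must be at least 1")
    return Fraction(tree_height - 1, tree_height)


if __name__ == "__main__":
    for height in (1, 2, 3, 4):
        print(height, min_free_fraction(height))
```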
Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation
Maybe this is a decent approach to deal with the problem. First some definitions. T is the target segment to be cleaned, S is the spare segment that valid data is written to, O are other segments that contain indirect blocks I for valid data D in T.

Have two different GC mechanisms to choose between:

1. Regular GC that copies D and I into S. On average D+I should require less space than S can offer.

2. Slow GC only copies D into S. Indirect blocks get modified in-place in O. This variant requires more seeks due to writing in various O, but it guarantees that D always requires less space than S can offer.

Whenever you are running out of spare segments and are in danger of the deadlock, switch to mechanism 2. Now your correctness problem is reduced to a performance problem.

Jörn

-- 
To recognize individual spam features you have to try to get into the mind of the spammer, and frankly I want to spend as little time inside the minds of spammers as possible.
-- Paul Graham
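The switch between the two mechanisms could be modelled roughly as follows. This is only a sketch of the decision and of the space accounting, not a cleaner; the low-water mark of two spare segments and all names are assumptions chosen for illustration:

```python
def choose_gc_mode(spare_segments, low_water=2):
    """Pick "slow" GC when spare segments are nearly exhausted."""
    return "slow" if spare_segments <= low_water else "regular"


def blocks_into_spare(mode, data_blocks, indirect_blocks):
    """Blocks written into the spare segment S when cleaning T.

    regular: valid data D and its indirect blocks I both go into S.
    slow:    only D goes into S; I is rewritten in place in O.
    """
    if mode == "regular":
        return data_blocks + indirect_blocks
    if mode == "slow":
        return data_blocks
    raise ValueError(f"unknown GC mode: {mode}")


if __name__ == "__main__":
    for spare in (5, 1):
        mode = choose_gc_mode(spare)
        print(spare, mode, blocks_into_spare(mode, 900, 900))
```

In slow mode the amount written into S never exceeds the valid data in T, which is what turns the deadlock into a matter of extra seeks.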
Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation
On Tue, 20 February 2007 00:57:50 +0100, Juan Piernas Canovas wrote:
>
> I understand the problem that you describe with respect to the GC, but let me explain why I think that it has a small impact on DualFS.
>
> Actually, the GC may become a problem when the number of free segments is 50% or less. If your LFS always guarantees, at least, 50% of free segments (note that I am talking about segments, not free space), the deadlock problem disappears, right? This is a quite naive solution, but it works.

I don't see how you can guarantee 50% free segments. Can you explain that bit?

> In a traditional LFS, with data and meta-data blocks, 50% of free segments represents a huge amount of wasted disk space. But, in DualFS, 50% of free segments in the meta-data device is not too much. In a typical Ext2, or Ext3 file system, there are 20 data blocks for every meta-data block (that is, meta-data blocks are 5% of the disk blocks used by files). Since files are implemented in DualFS in the same way, we can suppose the same ratio for DualFS (1).

This will work fairly well for most people. It is possible to construct metadata-heavy workloads, however. Many large directories containing symlinks or special files (char/block devices, sockets, fifos, whiteouts) come to mind. Most likely none of your users will ever want that, but a malicious attacker might.

That, btw, brings me to a completely unrelated topic. Having a fixed ratio of metadata to data is simple to implement, but allowing this ratio to change dynamically would be nicer for administration. You can add that to the Christmas wishlist for the nice boys, if you like.

> Remember, I am supposing a naive implementation of the cleaner. With a cleverer one, the meta-data device can be smaller, and the amount of disk space finally wasted can be smaller too. The following paper proposes some improvements:
>
> - Jeanna Neefe Matthews, Drew Roselli, Adam Costello, Randy Wang, and Thomas Anderson. Improving the Performance of Log-structured File Systems with Adaptive Methods. Proc. Sixteenth ACM Symposium on Operating Systems Principles (SOSP), October 1997, pages 238 - 251.
>
> BTW, I think that what they propose is very similar to the two-strategies GC that you propose in a separate e-mail.

Will have to read it up after I get some sleep. It is late.

> The point of all the above is that you must improve the common case, and manage the worst case correctly. And that is the idea behind DualFS :)

A fine principle to work with. Surprisingly, what is the worst case for you is the common case for LogFS, so maybe I'm more interested in it than most people. Or maybe I'm just more paranoid.

Anyway, keep up the work. It is an interesting idea to pursue.

Jörn

-- 
He who knows that enough is enough will always have enough.
-- Lao Tsu
Re: [PATCH] slab: deal with NULL pointers passed to kmem_cache_free
On Mon, 19 March 2007 14:10:38 -0700, Andrew Morton wrote:
>
> Would prefer to do:
>
> static inline void kmem_cache_free_if_not_null(struct kmem_cache *cachep, void *objp)
> {
> 	if (objp)
> 		kmem_cache_free(cachep, objp);
> }
>
> so that we don't add extra overhead to all the thousands of existing, well-behaved callsites.

In principle, this would work. But two things need changing, imho:

1. Don't inline the function. kmem_cache_free() has only about 34 NULL callers, if my grep is reliable, so this case is arguable. But in general, out-of-line functions are better than many extra conditionals pulled in through the inline one.

2. Switch the names. According to Rusty's benchmark, the easiest way to use an interface should be the correct one. Every new driver written by a rookie will call kmem_cache_free(), simply because the name seems simpler.

void kmem_cache_free_fast(struct kmem_cache *cachep, void *objp)
{
	/* old kmem_cache_free() */
}

void kmem_cache_free(struct kmem_cache *cachep, void *objp)
{
	if (likely(objp))
		kmem_cache_free_fast(cachep, objp);
}

Jörn

-- 
Correctness comes second.
Features come third.
Performance comes last.
Maintainability is easily forgotten.
Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images
On Tue, 20 March 2007 01:42:46 +0100, Thomas Gleixner wrote:
> On Mon, 2007-03-19 at 17:32 -0500, Matt Mackall wrote:
> > > > 4. JFFS2 has its own wear-leveling scheme, as do several other filesystems, so they probably want to bypass this piece of the stack.
> > >
> > > JFFS2 on top of UBI delegates the wear levelling to UBI, as JFFS2s own wear levelling sucks.
> >
> > Ok, fine. How about LogFS, then?
>
> LogFS can easily leverage UBI's wear algorithm.

Ok, now we have reached the absurd. UBI quite fundamentally cannot do wear leveling as well as LogFS can. Simply because UBI has zero knowledge of the _contents_ of its blocks. Knowing whether a block is 90% garbage or not makes a great difference.

Also LogFS currently requires erasesizes of 2^n.

Thomas, I can give you my opinion on this flamewar in private - after you have cooled off.

Jörn

-- 
When I am working on a problem I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong.
-- R. Buckminster Fuller
Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images
On Wed, 21 March 2007 12:25:34 +0100, Thomas Gleixner wrote:
> On Wed, 2007-03-21 at 12:05 +0100, Jörn Engel wrote:
> > Ok, now we have reached the absurd. UBI quite fundamentally cannot do wear leveling as good as LogFS can. Simply because UBI has zero knowledge of the _contents_ of its blocks. Knowing whether a block is 90% garbage or not makes a great difference.
> >
> > Also LogFS currently requires erasesizes of 2^n.
>
> Last time I talked to you about that, you said it would be possible and fixable. We talked about several mechanisms, which would allow a filesystem or other users to hint such things to UBI.

Note the word "currently". And yes, we did talk about hints. Back then I still believed in UBI. That has changed and I would like to spare myself another flamewar, so please leave it at that.

> Even if the LogFS wear levelling is so superior, it CAN'T do across device wear levelling.

Correct. And I don't see any problem with this. I see two classes of usecases for flash, with some amount of overlap in between.

1. Small amounts of flash. Here the flash contains a large ratio of read-only data. Bootloader, kernel, etc. Having wear levelling across the device will gain you something. This is what you designed UBI for.

2. Large amounts of flash. Just to be precise, "large" can go well into the Terabyte range and beyond. I don't mean "large" as in the biggest embedded device I worked on last year - that is still small. Even if such flashes still contain a bootloader and a kernel, that will occupy less than 1% of the device. Wear leveling across the device is fairly pointless here. This is what I designed LogFS for.

There is some middle ground where a combination of UBI and LogFS may make sense. LogFS can still make sense for devices as small as 64MiB. But I'm not too concerned about that because flashes will continue to grow and the advantages of cross-device wear leveling will continue to diminish.

Jörn

-- 
Security vulnerabilities are here to stay.
-- Scott Culp, Manager of the Microsoft Security Response Center, 2001
Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images
On Wed, 21 March 2007 12:57:42 +0100, Thomas Gleixner wrote:
> On Wed, 2007-03-21 at 12:35 +0100, Jörn Engel wrote:
> > Even if such flashes still contain a bootloader and a kernel, that will occupy less than 1% of the device. Wear leveling across the device is fairly pointless here. This is what I designed LogFS for.
>
> Still you need to have a solution for handling bitflips in those bootloader and kernel areas.

Correct. It may make sense to use UBI for that, I don't know. What I do know is that UBI cannot make wear leveling decisions as well as LogFS. And that is all I care about wrt. this discussion.

Jörn

-- 
Joern's library part 8: http://citeseer.ist.psu.edu/plank97tutorial.html
Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images
*sigh*

I really did not want to become involved in this. So please be nice and leave the flamethrower in your weapon closet or I will disappear again before you can say "fire".

On Tue, 20 March 2007 21:32:40 +, David Woodhouse wrote:
> On Tue, 2007-03-20 at 10:58 -0800, David Lang wrote:
> > What Matt and Ted are looking at is the question 'are flash devices close enough to other block devices that it would make sense to change the existing linux definition of a block device to handle the special requirements of flash'
>
> I've seen no real proposals about how this could be done, so it's a purely academic question.

What you have seen and shot down were patches to make mtd more generic. So let me just assume both mtd and jffs2 were generic, even though they currently aren't.

In very broad terms, an mtd is a device with:
1. a read operation
2. a write operation
3. an erase operation
4. a minimal write blocksize
5. a minimal erase blocksize
6. a method to query bad eraseblocks
7. a method to mark bad eraseblocks

Anything else? There are many more fields, but I believe these are the essential ones. point() and unpoint() were omitted, because they are just one option to provide XIP. filemap_xip.c is another, used for block devices.

In very broad terms, a block device has:
1. a read operation
2. a write operation
3. some devices have an ioctl() for erase, but that is uncommon
4. a blocksize

What is missing? Obviously the erase operation needs to become a first-class citizen and block devices need two fields for the two meaningful blocksizes. And they need methods to query and set bad blocks.

So far it looks simple enough. Obviously there are many messy details left out, so it will be a lot of work in practice. So the question is: is it worth it? What are the gains from combining mtd and block devices?

[ And at this point I would like to state again that I don't want to become involved in the UBI discussion. The question whether two separate subsystems make sense is quite independent and I don't want both discussions to get mixed up. ]

Jörn

-- 
He who knows others is wise. He who knows himself is enlightened.
-- Lao Tsu
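A minimal userspace sketch of what such a combined interface could look like, covering the seven items listed for mtd above. Everything here is hypothetical: `toy_dev`, `toy_ops` and the RAM-backed implementation are illustration only and are not the kernel's mtd or block layer API. The toy write can only clear bits, and only erase sets them again, which is the property that makes erase a first-class operation.

```c
#include <stddef.h>
#include <string.h>

#define TOY_ERASE_SIZE 64u   /* minimal erase blocksize (hypothetical value) */
#define TOY_WRITE_SIZE 1u    /* minimal write blocksize (hypothetical value) */
#define TOY_BLOCKS     4u

struct toy_dev;

/* The operations from the list above, as one hypothetical ops table. */
struct toy_ops {
	int (*read)(struct toy_dev *dev, size_t ofs, void *buf, size_t len);
	int (*write)(struct toy_dev *dev, size_t ofs, const void *buf, size_t len);
	int (*erase)(struct toy_dev *dev, size_t ofs);
	int (*block_isbad)(struct toy_dev *dev, size_t ofs);
	int (*block_markbad)(struct toy_dev *dev, size_t ofs);
};

struct toy_dev {
	const struct toy_ops *ops;
	size_t write_size;
	size_t erase_size;
	unsigned char mem[TOY_ERASE_SIZE * TOY_BLOCKS];
	unsigned char bad[TOY_BLOCKS];
};

static int in_range(size_t ofs, size_t len)
{
	size_t total = TOY_ERASE_SIZE * TOY_BLOCKS;

	return ofs <= total && len <= total - ofs;
}

static int toy_read(struct toy_dev *dev, size_t ofs, void *buf, size_t len)
{
	if (!in_range(ofs, len))
		return -1;
	memcpy(buf, dev->mem + ofs, len);
	return 0;
}

/* Flash-like write: bits can only go from 1 to 0. */
static int toy_write(struct toy_dev *dev, size_t ofs, const void *buf, size_t len)
{
	const unsigned char *src = buf;
	size_t i;

	if (!in_range(ofs, len))
		return -1;
	for (i = 0; i < len; i++)
		dev->mem[ofs + i] &= src[i];
	return 0;
}

/* Erase works on whole, aligned eraseblocks and sets all bits to 1. */
static int toy_erase(struct toy_dev *dev, size_t ofs)
{
	if (ofs % TOY_ERASE_SIZE || !in_range(ofs, TOY_ERASE_SIZE))
		return -1;
	memset(dev->mem + ofs, 0xFF, TOY_ERASE_SIZE);
	return 0;
}

static int toy_block_isbad(struct toy_dev *dev, size_t ofs)
{
	if (!in_range(ofs, 1))
		return -1;
	return dev->bad[ofs / TOY_ERASE_SIZE];
}

static int toy_block_markbad(struct toy_dev *dev, size_t ofs)
{
	if (!in_range(ofs, 1))
		return -1;
	dev->bad[ofs / TOY_ERASE_SIZE] = 1;
	return 0;
}

static const struct toy_ops toy_ram_ops = {
	toy_read, toy_write, toy_erase, toy_block_isbad, toy_block_markbad,
};

void toy_init(struct toy_dev *dev)
{
	dev->ops = &toy_ram_ops;
	dev->write_size = TOY_WRITE_SIZE;
	dev->erase_size = TOY_ERASE_SIZE;
	memset(dev->mem, 0xFF, sizeof(dev->mem));   /* start fully erased */
	memset(dev->bad, 0, sizeof(dev->bad));
}
```

A plain block device would fit the same table with erase as a no-op and both blocksizes equal, which is roughly the "what is missing" comparison made in the mail.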
Re: [PATCH] slab: deal with NULL pointers passed to kmem_cache_free
On Wed, 21 March 2007 08:30:27 -0800, Andrew Morton wrote:
> On Wed, 21 Mar 2007 16:41:19 +0200 Pekka Enberg [EMAIL PROTECTED] wrote:
> > Yeah, I'll try to sneak a patch past Andrew.
>
> That would be sneaky. Thing is, such a patch would amount to adding a test-for-NULL to codepaths which we *know* do not need it. There is no point in doing that.

How about two patches, one renaming kmem_cache_free to kmem_cache_free_fast or __kmem_cache_free or whatever pleases you most, the second adding kmem_cache_free with a NULL check?

The point is that the easiest way to use kmem_cache_free should be the safest, but not necessarily the fastest. Existing well-tuned and NULL-aware code paths can remain fast, random new code will be safe.

Jörn

-- 
Joern's library part 14: http://www.sandpile.org/
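The two-patch idea can be tried out in userspace. The sketch below is an assumption-laden stand-in, not slab code: `toy_cache`, `toy_cache_alloc()`, `toy_cache_free_fast()` and `toy_cache_free()` are made-up names, and malloc()/free() replace the real allocator. It only shows the naming split: the short name is safe with NULL, the longer name is the unchecked fast path.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a slab cache; only counts live objects. */
struct toy_cache {
	size_t obj_size;
	long live;
};

void *toy_cache_alloc(struct toy_cache *c)
{
	void *p = malloc(c->obj_size);

	if (p)
		c->live++;
	return p;
}

/* Fast path: the caller guarantees obj is not NULL. */
void toy_cache_free_fast(struct toy_cache *c, void *obj)
{
	c->live--;
	free(obj);
}

/* Default path: the easiest name to reach for tolerates NULL. */
void toy_cache_free(struct toy_cache *c, void *obj)
{
	if (obj)
		toy_cache_free_fast(c, obj);
}
```

Callers that already know their pointer is valid would be switched to the fast variant by hand; everyone else gets the check without thinking about it.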
Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images
On Wed, 21 March 2007 12:25:34 +0100, Thomas Gleixner wrote:
> On Wed, 2007-03-21 at 12:05 +0100, Jörn Engel wrote:
> > Also LogFS currently requires erasesizes of 2^n.
>
> Last time I talked to you about that, you said it would be possible and fixable.

Actually, no. LogFS is not broken, there is nothing to fix. And there is no fundamental reason why UBI should export blocks with non-power-of-two sizes.

UBI currently consists of two parts that are intimately intertwined in the current implementation, but have relatively little connection otherwise.
1. Logical volume management.
2. Static volumes.

Logical volume management can just as easily move its management information into a table, instead of having it spread across all blocks. Blocks can keep their original size. Since you have to scan flash anyway, you can also scan for a table, compare a magic number and do some extra check to protect yourself against a UBI image inside some logical volume. No big deal.

Static volumes can keep a header inside their volumes. The tiny first-stage bootloader is currently scanning flash and can continue to do so. But at least this header no longer causes trouble for LogFS or any other UBI user.

UBI is just as broken as LogFS is. It works with every user in mainline (which comes down to JFFS2). LogFS works with every MTD device in mainline. The only combination that doesn't work is LogFS on UBI - due to deliberate design decisions on both sides.

Jörn

-- 
Joern's library part 8: http://citeseer.ist.psu.edu/plank97tutorial.html
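A small sketch of why per-block management information collides with a power-of-two requirement. The helper names and the 128-byte header size are assumptions for illustration, not UBI's real layout: the point is only that subtracting any header from a 2^n eraseblock leaves a size that is no longer 2^n, while a separate table leaves the block size untouched.

```c
#include <stdint.h>

/* True if n is a power of two (exactly one bit set). */
int is_power_of_two(uint32_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Usable bytes per eraseblock when a header lives inside every block. */
uint32_t usable_with_inline_header(uint32_t erase_size, uint32_t header_size)
{
	return erase_size - header_size;
}

/* Usable bytes per eraseblock when the information lives in a table elsewhere. */
uint32_t usable_with_table(uint32_t erase_size)
{
	return erase_size;
}
```

For a 128 KiB eraseblock, `usable_with_inline_header(131072, 128)` gives 130944, which fails `is_power_of_two()`, whereas `usable_with_table(131072)` still passes.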
Re: [2.6 patch] block2mtd_paramline[] mustn't be __initdata
On Sun, 25 March 2007 16:58:05 +0200, Adrian Bunk wrote:
>
> block2mtd_paramline[] is used in the non-__init block2mtd_setup()
>
> Signed-off-by: Adrian Bunk [EMAIL PROTECTED]

Acked-By: Jörn Engel [EMAIL PROTECTED]

Adrian, can you put me on Cc: next time?

> ---
>
> --- linux-2.6.21-rc4-mm1/drivers/mtd/devices/block2mtd.c.old 2007-03-25 15:56:10.0 +0200
> +++ linux-2.6.21-rc4-mm1/drivers/mtd/devices/block2mtd.c 2007-03-25 15:56:31.0 +0200
> @@ -423,7 +423,7 @@
>  #ifndef MODULE
>  static int block2mtd_init_called = 0;
> -static __initdata char block2mtd_paramline[80 + 12]; /* 80 for device, 12 for erase size */
> +static char block2mtd_paramline[80 + 12]; /* 80 for device, 12 for erase size */
>  #endif

Jörn

-- 
Das Aufregende am Schreiben ist es, eine Ordnung zu schaffen, wo vorher keine existiert hat.
-- Doris Lessing
Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images
On Sun, 25 March 2007 13:49:58 -0800, David Lang wrote:
> On Sun, 25 Mar 2007, Jörn Engel wrote:
> > Logical volume management can just as easily move its management information into a table, instead of having it spread across all blocks. Blocks can keep their original size. Since you have to scan flash anyway, you can also scan for a table, compare a magical number and do some extra check to protect yourself against a UBI image inside some logical volume. No big deal.

[ This was not a request for UBI to be changed. The only purpose was to illustrate that LogFS is not broken. The previous thread suggested otherwise and I just couldn't leave it at that. ]

> if you are being paranoid about write cycles putting the write count in the block you are writing avoids doing an erase/write elsewhere
>
> although, since you can flip bits to 1 without requiring an erase you

[ vice versa. you can flip bits to 0 without erasing. ]

> could sacrifice some space and say that your table has a normal counter for the number of times the block has been erased, but a 'tally counter' where you turn one bit on each time you erase the block, and when you fill up the tally block you re-write the entire table, clearing all the tallys.
>
> if you have relatively large eraseblocks it seems like you could afford to sacrifice the space in your master table to avoid erases of it

Or you could have a table and any number of updates to it. Erase one block, append a small update marker to the table.

There are plenty of options. All have in common that code would be more complicated. Another advantage is that erase counts don't get reset if the race against a power failure during erase is lost.

Whether the advantages of power-of-two blocksizes and safe erasecounts are worth it, I leave for others to decide.

Jörn

-- 
Fools ignore complexity. Pragmatists suffer it. Some can avoid it. Geniuses remove it.
-- Perlis's Programming Proverb #58, SIGPLAN Notices, Sept. 1982
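The tally counter discussed above can be sketched in a few lines, using the corrected bit direction (erased flash reads as all ones, and bits can be cleared to 0 without an erase). The struct layout, the 32-bit tally width and the function names are assumptions for illustration; in this toy, "rewriting the table" is just resetting the entry in memory.

```c
#include <stdint.h>
#include <string.h>

#define TALLY_BYTES 4u   /* 32 tally bits per entry; hypothetical size */
#define TALLY_BITS  (TALLY_BYTES * 8u)

struct erase_entry {
	uint32_t base;                     /* erase count as of the last table rewrite */
	unsigned char tally[TALLY_BYTES];  /* one bit cleared per erase since then */
};

void entry_init(struct erase_entry *e, uint32_t base)
{
	e->base = base;
	memset(e->tally, 0xFF, sizeof(e->tally));   /* erased state: all ones */
}

/* Number of cleared bits, i.e. erases recorded since the last rewrite. */
static unsigned tally_count(const struct erase_entry *e)
{
	unsigned i, n = 0;

	for (i = 0; i < TALLY_BITS; i++)
		if (!(e->tally[i / 8] & (1u << (i % 8))))
			n++;
	return n;
}

uint32_t erase_count(const struct erase_entry *e)
{
	return e->base + tally_count(e);
}

/*
 * Record one more erase. Returns 1 if the tally was full and the entry
 * had to be rewritten (on real flash: the expensive table rewrite),
 * 0 if clearing a single bit was enough.
 */
int record_erase(struct erase_entry *e)
{
	unsigned n = tally_count(e);

	if (n == TALLY_BITS) {
		entry_init(e, e->base + n + 1);
		return 1;
	}
	e->tally[n / 8] &= (unsigned char)~(1u << (n % 8));
	return 0;
}
```

With 32 tally bits, only every 33rd erase of a block forces a rewrite of its table entry; the other 32 are single-bit programs.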
Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images
On Mon, 26 March 2007 00:46:33 +0100, David Woodhouse wrote:
> On Mon, 2007-03-26 at 00:55 +0200, Jörn Engel wrote:
> > > although, since you can flip bits to 1 without requiring an erase you
> >
> > [ vice versa. you can flip bits to 0 without erasing. ]
>
> And on NAND flash you can't just do it in multiple cycles one bit at a time. The 'tally' trick isn't viable there.

You can on NAND. ECC is done in software. And for a data structure as simple as the 'tally', foregoing ECC is not a huge problem - most bitflips are easily detected and the remaining ones only cause off-by-a-few on the erase count.

On NOR with transparent (hardware) ECC you can't.

Jörn

-- 
Homo Sapiens is a goal, not a description.
-- unknown
Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images
On Mon, 26 March 2007 01:21:25 +0100, David Woodhouse wrote:
> On Mon, 2007-03-26 at 02:01 +0200, Jörn Engel wrote:
> > You can on NAND. ECC is done in software. And for a data structure as simple as the 'tally', foregoing ECC is not a huge problem - most bitflips are easily detected and the remaining only cause off-by-a-few on the erase count.
>
> You're only allowed a limited number of write cycles to each page though. So you can't just clear the bits in a 2112-byte page one at a time; typically when you clear the fifth bit, the contents of the whole page become undefined until the next erase cycle.

That limitation stems from ECC and ECC is done in software. Currently everyone and his dog is doing ECC in chunks of 256 bytes on NAND. So your minimum write size is 256 bytes _if you care about ECC_. If you don't care, you can write single bits on NAND, just as you can on NOR.

Controlling ECC in software means we are quite flexible. Given sufficient incentive, we can change the rules quite significantly.

Jörn

-- 
You can't tell where a program is going to spend its time. Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you've proven that's where the bottleneck is.
-- Rob Pike
Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images
On Mon, 26 March 2007 10:45:57 +0100, David Woodhouse wrote:
>
> No, on NAND flash it's a limitation of the hardware. The number of write cycles you can perform to a given page is limited. Exceed it and the contents of that page become undefined due to leakage, until you next erase it.

Are you sure? Do you have any specs or similar that state this? So far I have only encountered this limitation by word of mouth. And such a myth coming from ECC effects is nothing that would surprise me.

Jörn

-- 
The cheapest, fastest and most reliable components of a computer system are those that aren't there.
-- Gordon Bell, DEC laboratories
Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images
On Mon, 26 March 2007 13:49:06 +0300, Artem Bityutskiy wrote:
> On Sun, 2007-03-25 at 22:08 +0200, Jörn Engel wrote:
> > Logical volume management can just as easily move its management information into a table, instead of having it spread across all blocks. Blocks can keep their original size. Since you have to scan flash anyway, you can also scan for a table, compare a magical number and do some extra check to protect yourself against a UBI image inside some logical volume. No big deal.
>
> First off, I see these "no big deal" statements for years already, and no decent implementation proved by usage in real world. Could we please, move these academic discussions to another thread?

You could wait a day, then reread what I wrote. Maybe you will notice that what I wrote is not identical to what we have discussed about a year ago and you seem to have read.

You may also want to reread this:
||[ This was not a request for UBI to be changed. The only purpose was to
||illustrate that LogFS is not broken. The previous thread suggested
||otherwise and I just couldn't leave it at that. ]

Jörn

-- 
tglx1 thinks that joern should get a (TM) for Thinking Is Hard
-- Thomas Gleixner
Re: If not readdir() then what?
On Sun, 8 April 2007 11:11:20 -0700, H. Peter Anvin wrote:
>
> Well, the question is if you can keep the seekdir/telldir cookie around as a pointer -- preferably in userspace, of course. You would presumably garbage-collect them on closedir() -- there is no other point at which you could.

Garbage-collecting them on closedir() does not work. It surprised me as well, but there seem to be applications that keep the telldir() cookie around after closedir(). Iirc, "rm -r" was one of them.

Neil, is this correct?

Jörn

-- 
Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
-- Rob Pike
Re: If not readdir() then what?
On Sun, 8 April 2007 21:44:26 -0400, Theodore Tso wrote:
>
> Well, Joern thought that "rm -rf" might be relying on the telldir cookie being valid in precisely that circumstance. If that is true, I'd argue that this is a BUG in GNU coreutils that should be fixed...

I heard it and accepted that claim without checking it. Might have been a mistake. But the claim came from an NFS developer, which may explain a thing or two.

NFS clients have to deal with a server rebooting underneath them and should still behave as expected. An "rm -r" running on the client concurrently to a rebooting server is a problem indeed and could be solved with seekdir/telldir.

That surely doesn't make life any easier for filesystem developers, I agree. From that point of view, all telldir cookies should end their life at closedir time. For "rm -r" it would be sufficient if the nfs client simply didn't seekdir at all. For "ls -lR", this would return duplicate dentries.

Jörn

-- 
My second remark is that our intellectual powers are rather geared to master static relations and that our powers to visualize processes evolving in time are relatively poorly developed.
-- Edsger W. Dijkstra
Re: Interface for the new fallocate() system call
On Mon, 9 April 2007 23:01:42 +1000, Paul Mackerras wrote:
> Jörn Engel writes:
> > Wouldn't that work be confined to fallocate()? If I understand Heiko correctly, the alternative would slow s390 down for every syscall, including more performance-critical ones.
>
> The alternative that Jakub suggested wouldn't slow s390 down.

True. And it appears to be one of the least offensive options we have.

Jörn

-- 
My second remark is that our intellectual powers are rather geared to master static relations and that our powers to visualize processes evolving in time are relatively poorly developed.
-- Edsger W. Dijkstra
Re: Add a norecovery option to ext3/4?
On Mon, 9 April 2007 12:21:15 -0500, Eric Sandeen wrote:
> Phillip Susi wrote:
> > When the filesystem is told to mount the disk read only, that means it should not write to it.
>
> It means the filesystem should not be writeable when it is mounted. This is not the same as saying that the filesystem itself should do no IO in the course of making that read-only mount available.

The filesystem has two interfaces. One to the device underneath, one to userspace. Read-only should certainly mean that no writes cross the userspace interface. Traditionally it has implicitly also meant that no writes are crossing the device interface. Whether that was/is an explicit requirement - who knows.

Journaling filesystems have introduced this thing called journal replay. And I have to admit, it makes things _a lot_ easier to always replay the journal, even when being mounted read-only. But "it is easier" is a pretty lame excuse.

> Under all conditions it should be safe to mount a read-only block device, but that is not the same as mounting a filesystem read-only.

In particular, it is a lame excuse when this claim is true. If the block-device is read-only, then journal replay will not work as expected and all the "not so easy" work has to be done anyway.

Did I miss anything? Is it actually easier to mount a read-only device with unclean journal than mounting a read-write device and not replay the journal?

Jörn

-- 
Joern's library part 8: http://citeseer.ist.psu.edu/plank97tutorial.html
Re: Add a norecovery option to ext3/4?
On Tue, 10 April 2007 07:27:18 -0400, Theodore Tso wrote:
>
> I suppose what you could do is to read in the journal, and use it to create a remapping table so that when you want to read block #5126, and block number 5126 is in the journal, to read the journal version of the block instead of the one on disk. That would allow for safe access to a filesystem being mounted read-only without the journal being present.

Another option would be to access the medium through a mapping inode, replay the journal into the mapping inode and _not_ flush the dirty pages. But as long as a remapping table is sufficient for ext3 journal format, such a table should be simpler and faster.

> Patches gratefully accepted

Not likely to come from me anytime soon. There's a certain other filesystem I have to finish first that still suffers from the same problem.

Jörn

-- 
Do not stop an army on its way home.
-- Sun Tzu
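A minimal sketch of the remapping table idea quoted above. All names (`remap_table`, `remap_add()`, `remap_lookup()`) are hypothetical, nothing here parses a real ext3/JBD journal, and the sketch assumes the journal copies live on the same device so that both sides of the mapping are plain device block numbers. It also assumes the caller only feeds in blocks from committed transactions, in journal order.

```c
#include <stddef.h>
#include <stdint.h>

struct remap {
	uint64_t fs_block;       /* block number the filesystem asks for */
	uint64_t journal_block;  /* where the newest journalled copy lives */
};

struct remap_table {
	struct remap *slots;
	size_t used;
	size_t cap;
};

/*
 * Record that fs_block has a copy at journal_block. A later journal
 * entry for the same block overrides the earlier one.
 * Returns 0 on success, -1 if the table is full.
 */
int remap_add(struct remap_table *t, uint64_t fs_block, uint64_t journal_block)
{
	size_t i;

	for (i = 0; i < t->used; i++) {
		if (t->slots[i].fs_block == fs_block) {
			t->slots[i].journal_block = journal_block;
			return 0;
		}
	}
	if (t->used == t->cap)
		return -1;
	t->slots[t->used].fs_block = fs_block;
	t->slots[t->used].journal_block = journal_block;
	t->used++;
	return 0;
}

/* Block to actually read: the journal copy if there is one, else the block itself. */
uint64_t remap_lookup(const struct remap_table *t, uint64_t fs_block)
{
	size_t i;

	for (i = 0; i < t->used; i++)
		if (t->slots[i].fs_block == fs_block)
			return t->slots[i].journal_block;
	return fs_block;
}
```

The linear search keeps the sketch short; a hash or sorted array would be the obvious choice for a journal with many blocks. Nothing is ever written to the device, which is the whole point for a read-only mount.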
Re: [PATCH 1/13] fs: convert core functions to zero_user_page
On Tue, 10 April 2007 22:56:38 -0700, Andrew Morton wrote:
>
> And I'm surprised that this:
>
> +static inline void memclear_highpage_flush(struct page *page, unsigned int offset, unsigned int size)
> +{
> +	return zero_user_page(page, offset, size);
> +}
>
> compiled. zero_user_page() returns void...

As does memclear_highpage_flush(). Some of my code looks like:

void some_func(...)
{
	if (foo)
		return do_foo(...);
	if (bar)
		return do_bar(...);
	...
}

do_foo() and do_bar() also return void. Saves an extra line for the return statement and the brackets. Doesn't help in the code you quoted, of course.

Jörn

-- 
Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest.
-- Rob Pike
Re: If not readdir() then what?
On Wed, 11 April 2007 16:23:21 -0700, H. Peter Anvin wrote:
> David Lang wrote:
> > On Thu, 12 Apr 2007, Neil Brown wrote:
> > > For the second. You say that you would need at least 96 bits in order to make that guarantee; 64 bits of hash, plus a 32-bit count value in the hash collision chain. I think 96 is a bit greedy. Surely 48 bits of hash and 16 bits of collision-chain-position would be plenty. You would need 65537 entries before a collision was even possible, and billions before it was at all likely. (How big does a set of 48bit numbers have to get before the probability that "No subset of 65536 numbers are all the same" drops below 0.95?)
> >
> > you can get a hash collision with two entries.
>
> Yes, but the probability is 2^-n for an n-bit hash, assuming it's uniformly distributed.
>
> The probability approaches 1/2 as the number of hashed entries approaches 2^(n/2) (birthday number.)

I believe you are both barking up the wrong tree. Neil proposed a 16bit collision chain. With that, it takes 65537 entries before a collision chain overflow is possible.

Calling a collision chain overflow "collision" is inviting confusion, of course. :)

Jörn

-- 
The competent programmer is fully aware of the strictly limited size of his own skull; therefore he approaches the programming task in full humility, and among other things he avoids clever tricks like the plague.
-- Edsger W. Dijkstra
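For reference, the plain hash collision numbers quoted above follow from the usual birthday approximation, assuming a uniformly distributed n-bit hash. This covers only the "two entries share a hash" event; a collision chain overflow with a 16bit chain position needs 65537 entries sharing one hash value and is a much rarer event that this formula does not describe.

```latex
% n-bit hash, N = 2^n possible values, k directory entries
P_{\text{coll}}(k) = 1 - \prod_{i=0}^{k-1}\left(1 - \frac{i}{N}\right)
  \approx 1 - \exp\!\left(-\frac{k(k-1)}{2N}\right), \qquad N = 2^{n}

% two entries
k = 2:\quad P_{\text{coll}} = \frac{1}{N} = 2^{-n}

% at the birthday number
k = 2^{n/2}:\quad P_{\text{coll}} \approx 1 - e^{-1/2} \approx 0.39

% where the probability actually reaches one half
P_{\text{coll}} = \tfrac{1}{2} \quad\text{at}\quad k \approx \sqrt{2 \ln 2 \cdot N} \approx 1.18 \cdot 2^{n/2}
```

So "approaches 1/2 near 2^(n/2)" is right to within a small constant factor: for a 64bit hash the even-odds point sits at roughly 5 billion entries.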
Re: If not readdir() then what?
On Thu, 12 April 2007 11:46:41 +1000, Neil Brown wrote:
>
> I could argue that nfs came before ext3+dirindex, so ext3 should have been designed to work properly with NFS. You could argue that fixing it in nfsd fixes it for all filesystems. But I'm not sure either of those arguments are likely to be at all convincing...

Caring about a non-ext3 filesystem, I sure would like an nfs solution as well. :)

> Hmmm. I wonder. Which is more likely?
> - That two 64bit hashes from some set are the same
> - or that 65536 48bit hashes from a set of equal size are the same.

The former. Each bit going from hash strength to collision chain length reduces the likelihood of an overflow. In the extreme case of a 0bit hash and 64bit collision chain, you need 2^64 entries compared to 2^32 for the other extreme.

However, the collision chain gives me quite a bit of headache. One would have to store each entry's position on the chain, deal with older entries getting deleted, newer entries getting removed, etc. All this requires a lot of complicated code that basically never gets tested in the wild.

Just settling for a 64bit hash and returning -EEXIST when someone causes a collision on creat() sounds more appealing. Directories with 4 billion entries will cause problems, but that is hardly news to anyone.

Jörn

-- 
Fantasy is more important than knowledge. Knowledge is limited, while fantasy embraces the whole world.
-- Albert Einstein
Re: If not readdir() then what?
On Thu, 12 April 2007 15:57:41 +1000, Neil Brown wrote:
> > However, the collision chain gives me quite a bit of headache. One would have to store each entry's position on the chain, deal with older entries getting deleted, newer entries getting removed, etc. All this requires a lot of complicated code that basically never gets tested in the wild.
>
> This is a simple consequence of the design decision to use hashes as the search key. They aren't dense and they will collide. So the solution will be a bit fuzzy around the edges. And maybe that is an acceptable tradeoff. But the filesystem should take full responsibility for it, whether in performance or correctness :-)

Sure. And seeing that not using hashes would kill performance long before 4 billion dentries are reached, there don't seem to be many downsides to hashing in principle.

> > Just settling for a 64bit hash and returning -EEXIST when someone causes a collision an creat() sounds more appealing. Directories with 4 billion entries will cause problems, but that is hardly news to anyone.
>
> I think you want -EFBIG or -ENOSPC. -EEXIST sounds just wrong.

None of them are 100% correct. But you are right, -ENOSPC seems to do less harm.

> But there are alternatives. e.g. internal chaining. Insist on a unique 64bit hash for every file. If the hash is in use, increment and try again. On lookup, if the hash leads you to a file with the wrong name, increment and try again until you find a hole (hash value that is not stored). When you delete an entry, leave a place holder if the next hash is in use. Conversely if the next hash is not in use, delete the entry and delete the previous one if it is a place holder.

That would work and is limited to reasonable complexity. It still suffers from getting virtually no testing in the wild and therefore being one of the dark corners little critters thrive in. But one can at least add a config option to fold the hash to 16bit or so. And cross fingers that at least one person will occasionally test with that option.

> You have to require 64bit cookies/fpos, but I think that today, that is a reasonable thing to require (5 years ago it might not have been).

Which brings us back to the start of this thread.

Jörn

-- 
Why do musicians compose symphonies and poets write poems? They do it because life wouldn't have any meaning for them if they didn't. That's why I draw cartoons. It's my life.
-- Charles Schulz
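The internal chaining scheme quoted above can be sketched as follows. This is a toy under loud assumptions: the hash is folded to 4 bits so collisions are easy to provoke (in the spirit of the "fold the hash" test option), `name_hash()` is a deliberately weak byte sum, and `dir`, `slot` and the function names are made up. The slot index plays the role of the unique hash value / directory cookie. Reusing place holders on insert and removing a whole run of trailing place holders on delete are choices of this sketch, not something the mail spells out.

```c
#include <string.h>

#define HASH_BITS 4u
#define NSLOTS    (1u << HASH_BITS)
#define HASH_MASK (NSLOTS - 1u)
#define NAME_LEN  32

enum slot_state { SLOT_HOLE = 0, SLOT_USED, SLOT_PLACEHOLDER };

struct slot {
	enum slot_state state;
	char name[NAME_LEN];
};

struct dir {
	struct slot slots[NSLOTS];
};

void dir_init(struct dir *d)
{
	memset(d, 0, sizeof(*d));   /* every slot starts as a hole */
}

/* Deliberately weak hash (byte sum) so that "ab", "ba" and "c" collide. */
unsigned name_hash(const char *name)
{
	unsigned h = 0;

	while (*name)
		h += (unsigned char)*name++;
	return h & HASH_MASK;
}

/* Returns the slot (cookie) holding name, or -1 once a hole is reached. */
int dir_lookup(const struct dir *d, const char *name)
{
	unsigned h = name_hash(name), i;

	for (i = 0; i < NSLOTS; i++, h = (h + 1) & HASH_MASK) {
		if (d->slots[h].state == SLOT_HOLE)
			return -1;
		if (d->slots[h].state == SLOT_USED && !strcmp(d->slots[h].name, name))
			return (int)h;
	}
	return -1;
}

/* Returns the cookie, -1 if name exists, -2 if no slot is free or name is too long. */
int dir_insert(struct dir *d, const char *name)
{
	unsigned h = name_hash(name), i;

	if (strlen(name) >= NAME_LEN)
		return -2;
	if (dir_lookup(d, name) >= 0)
		return -1;
	for (i = 0; i < NSLOTS; i++, h = (h + 1) & HASH_MASK) {
		if (d->slots[h].state != SLOT_USED) {   /* hole or place holder */
			d->slots[h].state = SLOT_USED;
			strcpy(d->slots[h].name, name);
			return (int)h;
		}
	}
	return -2;
}

/* Returns 0 on success, -1 if name is not present. */
int dir_delete(struct dir *d, const char *name)
{
	int h = dir_lookup(d, name);
	unsigned p;

	if (h < 0)
		return -1;
	if (d->slots[((unsigned)h + 1) & HASH_MASK].state != SLOT_HOLE) {
		/* Next hash in use: keep the chain intact with a place holder. */
		d->slots[h].state = SLOT_PLACEHOLDER;
		return 0;
	}
	d->slots[h].state = SLOT_HOLE;
	/* Place holders directly before the new hole are no longer needed. */
	for (p = ((unsigned)h - 1) & HASH_MASK;
	     d->slots[p].state == SLOT_PLACEHOLDER;
	     p = (p - 1) & HASH_MASK)
		d->slots[p].state = SLOT_HOLE;
	return 0;
}
```

Inserting "ab", "ba" and "c" lands them in slots 3, 4 and 5. Deleting "ba" leaves a place holder in slot 4 so that "c" stays reachable; deleting "c" afterwards turns both slots back into holes.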
Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation
On Wed, 21 February 2007 05:36:22 +0100, Juan Piernas Canovas wrote:
> > I don't see how you can guarantee 50% free segments. Can you explain that bit?
>
> It is quite simple. If 50% of your segments are busy, and the other 50% are free, and the file system needs a new segment, the cleaner starts freeing some of the busy ones. If the cleaner is unable to free at least one segment, your file system gets full (and it returns a nice ENOSPC error). This solution wastes half of your storage device, but it is deadlock-free. Obviously, there are better approaches.

Ah, ok. It is deadlock free if the maximal height of your tree is 2. It is not 100% deadlock free if the height is 3 or more.

Also, I strongly suspect that your tree is higher than 2. A medium sized directory will have data blocks, indirect blocks and the inode proper, which gives you a height of 3. Your inodes need to get accessed somehow and unless they have fixed positions like in ext2, you need a further tree structure of some sort, so you're more likely looking at a height of 5.

With a height of 5, you would need to keep 80% of your metadata free. That is starting to get wasteful.

So I suspect that my proposed alternate cleaner mechanism or the even better hole plugging mechanism proposed in the paper a few posts above would be a better path to follow.

> > A fine principle to work with. Surprisingly, what is the worst case for you is the common case for LogFS, so maybe I'm more interested in it than most people. Or maybe I'm just more paranoid.
>
> No, you are right. It is the common case for LogFS because it has data and meta-data blocks in the same address space, but that is not the case of DualFS. Anyway, I'm very interested in your work because any solution to the problem of the GC will be also applicable to DualFS. So, keep up with it. ;-)

Actually, no. It is the common case for LogFS because it is designed for flash media. Unlike hard disks, flash lifetime is limited by the amount of data written to it. Therefore, having a cleaner run when the filesystem is idle would cause unnecessary writes and reduce lifetime. As a result, the LogFS cleaner runs as lazily as possible and the filesystem tries hard not to mix data with different lifetimes in one segment.

LogFS tries to avoid the cleaner like the plague. But if it ever needs to run it, the deadlock scenario is very close and I need to be very aware of it. :)

In a way, the DualFS approach does not change rules for the log-structured filesystem at all. If you had designed your filesystem in such a way that you simply used two existing filesystems and wrote Actual Data (AD) to one, Metadata (MD) to another, what is MD to DualFS is plain data to one of your underlying filesystems. It can cause a bit of confusion, because I tend to call MD data and you tend to call AD data, but that is about all.

Jörn

--
But this is not to say that the main benefit of Linux and other GPL software is lower-cost. Control is the main benefit--cost is secondary.
-- Bruce Perens
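The height argument in the mail above comes down to simple arithmetic. A minimal sketch, assuming (as the two figures quoted there suggest: height 2 gives 50%, height 5 gives 80%) that a tree of height N needs (N-1)/N of its segments kept free; the function name is made up for illustration and is not part of DualFS or LogFS:

```python
from fractions import Fraction

def free_reserve(height: int) -> Fraction:
    """Fraction of segments that must stay free for a tree of the given height.

    Assumption taken from the mail: height 2 -> 1/2 and height 5 -> 4/5,
    i.e. (height - 1) / height.
    """
    if height < 1:
        raise ValueError("tree height must be at least 1")
    return Fraction(height - 1, height)

for h in (2, 3, 5):
    print(f"height {h}: keep {float(free_reserve(h)):.0%} free")
```

The exact fractions are kept as `Fraction` values so the 50% and 80% cases can be compared without rounding.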
Re: [PATCH] [MTD] CHIPS: oops in cfi_amdstd_sync
On Tue, 20 February 2007 17:46:13 -0800, Vijay Sampath wrote:
> The files cfi_cmdset_0002.c and cfi_cmdset_0020.c do not initialize their wait queues like is done in cfi_cmdset_0001.c. This causes an oops when the wait queue is accessed. I have copied the code from cfi_cmdset_0001.c that is pertinent to initialization of the wait queue.

Patch looks good, but I can no longer test it. Josh may still have access to some commandset 20 chips. Josh, any objections?

Jörn

--
The only real mistake is the one from which we learn nothing.
-- John Powell
Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation
On Wed, 21 February 2007 19:31:40 +0100, Juan Piernas Canovas wrote:
> I do not understand. Do you mean that if I have 10 segments, 5 busy and 5 free, after cleaning I could need 6 segments? How? Where do the extra blocks come from?

This is a fairly complicated subject and I have trouble explaining it to people - even though I hope that maybe one or two dozen understand it by now. So let me try to give you an example:

In LogFS, inodes are stored in an inode file. There are no B-Trees yet, so the regular unix indirect blocks are used. My example will be writing to a directory, so that should only involve metadata by your definition and be a valid example for DualFS as well. If it is not, please tell me where the difference lies.

The directory is large, so appending to it involves writing a datablock (D0), an indirect block (D1) and a doubly indirect block (D2).

Before:
Segment 1: [some data] [ D1 ] [more data]
Segment 2: [some data] [ D0 ] [more data]
Segment 3: [some data] [ D2 ] [more data]
Segment 4: [ empty ]
...

After:
Segment 1: [some data] [garbage] [more data]
Segment 2: [some data] [garbage] [more data]
Segment 3: [some data] [garbage] [more data]
Segment 4: [D0][D1][D2][ empty ]
...

Ok. After this, the position of D2 on the medium has changed. So we need to update the inode and write that as well. If the inode number for this directory is high, we will need to write the inode (I0), an indirect block (I1) and a doubly indirect block (I2). The picture becomes a bit more complicated.

Before:
Segment 1: [some data] [ D1 ] [more data]
Segment 2: [some data] [ D0 ] [more data]
Segment 3: [some data] [ D2 ] [more data]
Segment 4: [ empty ]
Segment 5: [some data] [ I1 ] [more data]
Segment 6: [some data] [ I0 ] [more data]
Segment 7: [some data] [ I2 ] [more data]
...

After:
Segment 1: [some data] [garbage] [more data]
Segment 2: [some data] [garbage] [more data]
Segment 3: [some data] [garbage] [more data]
Segment 4: [D0][D1][D2][I0][I1][I2][ empty ]
Segment 5: [some data] [garbage] [more data]
Segment 6: [some data] [garbage] [more data]
Segment 7: [some data] [garbage] [more data]
...

So what has just happened? The user did a single "touch foo" in a large directory and has caused six objects to move. Unless some of those objects were in the same segment before, we now have six segments containing a tiny amount of garbage.

And there is almost no way to squeeze that garbage back out. The cleaner will fundamentally do the same thing as a regular write - it will move objects. So if you want to clean a segment containing the block of a different directory, you may again have to move five additional objects: the indirect blocks, inode and ifile indirect blocks.

At this point, your cleaner is becoming a threat. There is a real danger that it will create more garbage in unrelated segments than it frees up. I claim that you cannot keep 50% clean segments, unless you move away from the simplistic cleaner I described above.

Jörn

--
If you're willing to restrict the flexibility of your approach, you can almost always do something better.
-- John Carmack
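The cascade described in the mail above can be replayed with a toy model. This is only a sketch under stated assumptions: segments are plain Python lists, the block names are the ones used in the diagrams, and `append_with_cascade` is a hypothetical helper, not LogFS code.

```python
def append_with_cascade(segments, log_head, chain):
    """Rewrite every block in `chain` at the head of the log.

    segments: dict of segment number -> list of block names
              ("garbage" marks dead space)
    chain:    block names from leaf to root, e.g. D0, D1, D2, I0, I1, I2
    Returns the set of old segments that now contain garbage.
    """
    dirtied = set()
    for name in chain:
        # The old copy of the block becomes garbage where it used to live.
        for seg_no, blocks in segments.items():
            if seg_no != log_head and name in blocks:
                blocks[blocks.index(name)] = "garbage"
                dirtied.add(seg_no)
        # The new copy is appended to the segment at the head of the log.
        segments[log_head].append(name)
    return dirtied

# Layout from the mail: every object lives in its own segment, segment 4 is empty.
segments = {
    1: ["data", "D1", "data"],
    2: ["data", "D0", "data"],
    3: ["data", "D2", "data"],
    4: [],
    5: ["data", "I1", "data"],
    6: ["data", "I0", "data"],
    7: ["data", "I2", "data"],
}
chain = ["D0", "D1", "D2", "I0", "I1", "I2"]
dirtied = append_with_cascade(segments, log_head=4, chain=chain)
print(sorted(dirtied))   # segments that picked up garbage
print(segments[4])       # contents of the log head
```

One logical update leaves garbage in six previously untouched segments, which is the effect the mail warns about when the cleaner itself performs such moves.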
Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation
On Thu, 22 February 2007 05:30:03 +0100, Juan Piernas Canovas wrote:
> DualFS writes meta-blocks in variable-sized chunks that we call partial segments. The meta-data device, however, is divided into segments, which have the same size. A partial segment can be as large as a segment, but a segment usually has more than one partial segment. Besides, a partial segment can not cross a segment boundary.

Sure, that's a fairly common approach.

> A partial segment is a transaction unit, and contains all the blocks modified by a file system operation, including indirect blocks and i-nodes (actually, it contains the blocks modified by several file system operations, but let us assume that every partial segment only contains the blocks modified by a single file system operation).
>
> So, the above figure is as follows in DualFS:
>
> Before:
> Segment 1: [some data] [ D0 D1 D2 I ] [more data]
> Segment 2: [ some data ]
> Segment 3: [ empty ]
>
> If the datablock D0 is modified, what you get is:
>
> Segment 1: [some data] [ garbage ] [more data]
> Segment 2: [ some data ]
> Segment 3: [ D0 D1 D2 I ] [ empty ]

You have fairly strict assumptions about the Before: picture. But what happens if those assumptions fail? To give you an example, imagine the following small script:

$ for i in `seq 1000000`; do touch $i; done

This will create a million dentries in one directory. It will also create a million inodes, but let us ignore those for a moment. It is fairly unlikely that you can fit a million dentries into [D0], so you will need more than one block. Let's call them [DA], [DB], [DC], etc. So you have to write out the first block [DA].

Before:
Segment 1: [some data] [ DA D1 D2 I ] [more data]
Segment 2: [ some data ]
Segment 3: [ empty ]

If the datablock DA is modified, what you get is:

Segment 1: [some data] [ garbage ] [more data]
Segment 2: [ some data ]
Segment 3: [ DA D1 D2 I ] [ empty ]

That is exactly your picture. Fine. Next you write [DB].

Before: see above
After:
Segment 1: [some data] [ garbage ] [more data]
Segment 2: [ some data ]
Segment 3: [ DA][garbage] [ DB D1 D2 I ] [ empty ]

You write [DC]. Note that Segment 3 does not have enough space for another partial segment:

Segment 1: [some data] [ garbage ] [more data]
Segment 2: [ some data ]
Segment 3: [ DA][garbage] [ DB][garbage] [wasted]
Segment 4: [ DC D1 D2 I ] [ empty ]

You write [DD] and [DE]:

Segment 1: [some data] [ garbage ] [more data]
Segment 2: [ some data ]
Segment 3: [ DA][garbage] [ DB][garbage] [wasted]
Segment 4: [ DC][garbage] [ DD][garbage] [wasted]
Segment 5: [ DE D1 D2 I ] [ empty ]

And some time later you even have to switch to a new indirect block, so you get:

Before:
Segment n : [ DX D1 D2 I ] [ empty ]

After:
Segment n : [ DX D1][garb] [ DY DI D2 I ] [ empty ]

What you end up with after all this is quite unlike your Before picture. Instead of this:

Segment 1: [some data] [ D0 D1 D2 I ] [more data]

You may have something closer to this:

Segment 1: [some data] [ D1 ] [more data]
Segment 2: [some data] [ D0 ] [more data]
Segment 3: [some data] [ D2 ] [more data]

You should try the testcase and look at a dump of your filesystem afterwards. I usually just read the raw device in a hex editor.

Jörn

--
Beware of bugs in the above code; I have only proved it correct, but not tried it.
-- Donald Knuth
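The sequence of pictures in the mail above can be reproduced with a small simulation. This is a sketch, not DualFS code: it assumes a segment of 10 blocks (an invented number, chosen so that two four-block partial segments fit and two blocks are wasted, as in the diagrams) and a hypothetical helper `write_directory_blocks`.

```python
def write_directory_blocks(names, segment_blocks=10):
    """Append one partial segment [name, D1, D2, I] per directory block.

    A partial segment may not cross a segment boundary; the tail of a
    segment that is too small for the next one is counted as wasted.
    Returns (segments, live, garbage, wasted) with counts in blocks.
    """
    segments = [[]]
    wasted = 0
    for name in names:
        partial = [name, "D1", "D2", "I"]
        if len(segments[-1]) + len(partial) > segment_blocks:
            wasted += segment_blocks - len(segments[-1])
            segments.append([])
        # The older copies of D1, D2 and I are obsolete from now on.
        for seg in segments:
            for i, block in enumerate(seg):
                if block in ("D1", "D2", "I"):
                    seg[i] = "garbage"
        segments[-1].extend(partial)
    live = sum(b != "garbage" for seg in segments for b in seg)
    garbage = sum(b == "garbage" for seg in segments for b in seg)
    return segments, live, garbage, wasted

segments, live, garbage, wasted = write_directory_blocks(
    ["DA", "DB", "DC", "DD", "DE"])
print(len(segments), live, garbage, wasted)
```

Under these assumptions five directory-block writes occupy three segments, with more garbage blocks than live ones, which matches the shape of the Segment 3 to Segment 5 picture in the mail.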
Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation
On Thu, 22 February 2007 20:57:12 +0100, Juan Piernas Canovas wrote:
> I do not agree with this picture, because it does not show that all the indirect blocks which point to a direct block are along with it in the same segment. That figure should look like:
>
> Segment 1: [some data] [ DA D1' D2' ] [more data]
> Segment 2: [some data] [ D0 D1' D2' ] [more data]
> Segment 3: [some data] [ DB D1 D2 ] [more data]
>
> where D0, DA, and DB are datablocks, D1 and D2 indirect blocks which point to the datablocks, and D1' and D2' obsolete copies of those indirect blocks. By using this figure, it is clear that if you need to move D0 to clean the segment 2, you will need only one free segment at most, and not more. You will get:
>
> Segment 1: [some data] [ DA D1' D2' ] [more data]
> Segment 2: [free]
> Segment 3: [some data] [ DB D1' D2' ] [more data]
> ..
> Segment n: [ D0 D1 D2 ] [ empty ]
>
> That is, D0 needs in the new segment the same space that it needs in the previous one.
>
> The differences are subtle but important.

Ah, now I see. Yes, that is deadlock-free. If you are not accounting the bytes of used space but the number of used segments, and you count each partially used segment the same as a 100% used segment, there is no deadlock.

Some people may consider this to be cheating, however. It will cause more than 50% wasted space. All obsolete copies are garbage, after all. With a maximum tree height of N, you can have up to (N-1) / N of your filesystem occupied by garbage.

It also means that df will have unexpected output. You cannot estimate how much data can fit into the filesystem, as that depends on how much garbage you will accumulate in the segments. Admittedly this is not a problem for DualFS, as the uncertainty only exists for metadata, so df for DualFS still makes sense.

Another downside is that with large amounts of garbage between otherwise useful data, your disk cache hit rate goes down. Read performance suffers. But that may be a fair tradeoff and will only show up in large metadata reads in the uncached (per Linux) case. Seems fair.

Quite interesting, actually. The costs of your design are disk space, depending on the amount and depth of your metadata, and metadata read performance. Disk space is cheap and metadata reads tend to be slow for most filesystems, in comparison to data reads. You gain faster metadata writes and loss of journal overhead. I like the idea.

Jörn

--
All art is but imitation of nature.
-- Lucius Annaeus Seneca
Re: SLUB: The unqueued Slab allocator
On Sat, 24 February 2007 09:32:49 -0800, Christoph Lameter wrote:
> If that is a problem for particular object pools then we may be able to except those from the merging.

How much of a gain is the merging anyway? Once you start having explicit whitelists or blacklists of pools that can be merged, one can start to wonder if the result is worth the effort.

Jörn

--
Joern's library part 6:
http://www.gzip.org/zlib/feldspar.html
Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation
On Sun, 25 February 2007 03:41:40 +0100, Juan Piernas Canovas wrote:
> Well, our experimental results say another thing. As I have said, the greatest part of the files are written at once, so their meta-data blocks are together on disk. This allows DualFS to implement an explicit prefetching of meta-data blocks which is quite effective, especially when there are several processes reading from disk at the same time.
>
> On the other hand, DualFS also implements an on-line meta-data relocation mechanism which can help to improve meta-data prefetching, and garbage collection.
>
> Obviously, there can be some slow-growing files that can produce some garbage, but they do not hurt the overall performance of the file system.

Well, my concerns about the design have gone. There remain some concerns about the source code and I hope they will disappear just as fast. :)

Obviously, a patch against 2.4.x is fairly useless. Iirc, you claimed somewhere to have a patch against 2.6.11, but I was unable to find that. Porting it from 2.6.11 to 2.6.20 should be simple enough.

Then there is some assembly code inside the patch that you seem to have copied from some other project. I would be surprised if that is really required. If you can replace it with C code, please do.

If the assembly actually is a performance gain (and I consider it your duty to prove that), you can have a two-patch series with the first introducing DualFS and the second adding the assembly as a config option for one architecture.

> Yeah :) If you have taken a look at my presentation at LFS07, the disk traffic of meta-data blocks is dominated by writes.

Last time I tried it was only available to members. Is it generally available now?

Jörn

--
My second remark is that our intellectual powers are rather geared to master static relations and that our powers to visualize processes evolving in time are relatively poorly developed.
-- Edsger W. Dijkstra
Re: SLUB: The unqueued Slab allocator
On Sat, 24 February 2007 16:14:48 -0800, Christoph Lameter wrote:
> It eliminates 50% of the slab caches. Thus it reduces the management overhead by half.

How much management overhead is there left with SLUB? Is it just the one per-node slab? Is there runtime overhead as well?

In a slightly different approach, can we possibly get rid of some slab caches, instead of merging them at boot time? On my system I have 97 slab caches right now, ignoring the generic kmalloc() ones. Of those, 28 are completely empty, 23 contain <=10 objects, 23 <=100 and 23 contain >100 objects.

It is fairly obvious to me that the highly populated slab caches are a big win. But is it worth it to have slab caches with a single object inside? Maybe some of these caches are populated for some systems. But there could also be candidates for removal among them.

# active_objs num_objs name
0 0 dm-crypt_io
0 0 dm_io
0 0 dm_tio
0 0 ext3_xattr
0 0 fat_cache
0 0 fat_inode_cache
0 0 flow_cache
0 0 inet_peer_cache
0 0 ip_conntrack_expect
0 0 ip_mrt_cache
0 0 isofs_inode_cache
0 0 jbd_1k
0 0 jbd_4k
0 0 kiocb
0 0 kioctx
0 0 nfs_inode_cache
0 0 nfs_page
0 0 posix_timers_cache
0 0 request_sock_TCP
0 0 revoke_record
0 0 rpc_inode_cache
0 0 scsi_io_context
0 0 secpath_cache
0 0 skbuff_fclone_cache
0 0 tw_sock_TCP
0 0 udf_inode_cache
0 0 uhci_urb_priv
0 0 xfrm_dst_cache
1 169 dnotify_cache
1 30 arp_cache
1 7 mqueue_inode_cache
2 101 eventpoll_pwq
2 203 fasync_cache
2 254 revoke_table
2 30 eventpoll_epi
2 9 RAW
4 17 ip_conntrack
7 10 biovec-128
7 10 biovec-64
7 20 biovec-16
7 42 file_lock_cache
7 59 biovec-4
7 59 uid_cache
7 8 biovec-256
7 9 bdev_cache
8 127 inotify_event_cache
8 20 rpc_tasks
8 8 rpc_buffers
10 113 ip_fib_alias
10 113 ip_fib_hash
10 12 blkdev_queue
11 203 biovec-1
11 22 blkdev_requests
13 92 inotify_watch_cache
16 169 journal_handle
16 203 tcp_bind_bucket
16 72 journal_head
18 18 UDP
19 19 names_cache
19 28 TCP
22 30 mnt_cache
27 27 sigqueue
27 60 ip_dst_cache
32 32 sgpool-128
32 32 sgpool-32
32 32 sgpool-64
32 36 nfs_read_data
32 45 sgpool-16
32 60 sgpool-8
36 42 nfs_write_data
72 80 cfq_pool
74 127 blkdev_ioc
78 92 cfq_ioc_pool
94 94 pgd
107 113 fs_cache
108 108 mm_struct
108 140 files_cache
123 123 sighand_cache
125 140 UNIX
130 130 signal_cache
147 147 task_struct
154 174 idr_layer_cache
158 404 pid
190 190 sock_inode_cache
260 295 bio
273 273 proc_inode_cache
840 920 skbuff_head_cache
1234 1326 inode_cache
1507 1510 shmem_inode_cache
2871 3051 anon_vma
2910 3360 filp
5161 5292 sysfs_dir_cache
5762 6164 vm_area_struct
12056 19446 radix_tree_node
65776 151272 buffer_head
578304 578304 ext3_inode_cache
677490 677490 dentry_cache

Jörn

--
And spam is a useful source of entropy for /dev/random too!
-- Jasmine Strong
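The bucket counts quoted in the mail above (empty, up to 10, up to 100, more than 100 active objects) are easy to recompute from such a listing. A minimal sketch, assuming the three-column "active_objs num_objs name" layout used in the mail rather than the full /proc/slabinfo format; `bucket` and `histogram` are illustrative names:

```python
from collections import Counter

def bucket(active_objs: int) -> str:
    """Map an active object count to the buckets used in the mail."""
    if active_objs == 0:
        return "empty"
    if active_objs <= 10:
        return "<=10"
    if active_objs <= 100:
        return "<=100"
    return ">100"

def histogram(listing: str) -> Counter:
    """Count caches per bucket from 'active_objs num_objs name' lines."""
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a three-column row.
        if len(fields) != 3 or line.lstrip().startswith("#"):
            continue
        counts[bucket(int(fields[0]))] += 1
    return counts

# A handful of rows copied from the listing above.
sample = """\
# active_objs num_objs name
0 0 dm_io
1 169 dnotify_cache
10 12 blkdev_queue
11 203 biovec-1
94 94 pgd
107 113 fs_cache
677490 677490 dentry_cache
"""
counts = histogram(sample)
print(dict(counts))
```

Feeding the complete listing through `histogram` is a quick way to check the 28/23/23/23 split stated in the mail.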
Re: [RFC] Heads up on sys_fallocate()
On Sun, 4 March 2007 14:38:13 -0800, Ulrich Drepper wrote:
> When you do it like this, how can the kernel/filesystem *guarantee* that when the data is written there actually is room on the harddrive?
>
> What you described seems like using truncate/ftruncate to increase the file's size. That is not at all what posix_fallocate is for. posix_fallocate must make sure that the requested blocks on the disk are reserved (allocated) for the file's use and that at no point in the future will, say, a msync() fail because a mmap(MAP_SHARED) page has been written to.

That actually causes an interesting problem for compressing filesystems. The space consumed by blocks depends on their contents and how well they compress. At the moment, the only option I see to support posix_fallocate for LogFS is to set an inode flag disabling compression, then allocate the blocks.

But if the file already contains large amounts of compressed data, I have a problem. Disabling compression for a range within a file is not supported, so I can only return an error. But which one?

Jörn

--
A surrounded army must be given a way out.
-- Sun Tzu
Re: [RFC] Heads up on sys_fallocate()
On Mon, 5 March 2007 01:36:36 +0100, Arnd Bergmann wrote:
> Using the current glibc implementation on a compressed file system ideally should be a very expensive no-op because you won't actually allocate much space for a file when writing zeroes to it. You also don't benefit from a contiguous allocation in logfs, since flash has uniform seek times over all the medium.
>
> I'd suggest you implement posix_fallocate as a real nop and just return success without doing anything. You could also return ENOSPC in case the blocks requested by posix_fallocate don't fit on the medium without compression, but that is more or less just guesswork (like statfs is).

Quoting POSIX_FALLOCATE(3):

    The function posix_fallocate() ensures that disk space is allocated for the file referred to by the descriptor fd for the bytes in the range starting at offset and continuing for len bytes. After a successful call to posix_fallocate(), subsequent writes to bytes in the specified range are guaranteed not to fail because of lack of disk space.

    If the size of the file is less than offset+len, then the file is increased to this size; otherwise the file size is left unchanged.

Afaics, the (main) purpose of this function is not to decrease fragmentation but to ensure mmap() won't cause any problems because the medium fills up. That problem exists for LogFS as well, once rw mmap() is supported.

Simply returning success without doing anything would be a bug. -ENOSPC is a better choice, but still a lame implementation. And falling back on libc to write zeroes in a loop is an exercise in futility.

Does the allocation have to be persistent beyond lifetime of the file descriptor? It would be fairly simple to support the write guarantee while the file is open (or rather the inode remains cached) and drop it afterwards.

Jörn

--
[One] doesn't need to know [...] how to cause a headache in order to take an aspirin.
-- Scott Culp, Manager of the Microsoft Security Response Center, 2001
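For readers who have not used the call under discussion, here is a minimal sketch of posix_fallocate() from user space, using Python's `os.posix_fallocate` wrapper (available on Unix). The helper name `preallocate`, the file name and the 1 MiB length are made up for illustration; whether space is really reserved depends on the filesystem underneath, which is exactly the point of the thread.

```python
import os
import tempfile

def preallocate(path: str, length: int) -> bool:
    """Try to reserve `length` bytes for `path`; return False on ENOSPC etc."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Reserve the byte range [0, length) for this file.
        os.posix_fallocate(fd, 0, length)
        return True
    except OSError as err:
        print(f"posix_fallocate failed: {err}")
        return False
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "prealloc.bin")
    ok = preallocate(path, 1 << 20)
    size = os.path.getsize(path)
    print(ok, size)
```

On success the file size grows to offset+len, as the quoted manpage text says; on failure the caller learns up front that later writes could run out of space.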
Re: [RFC] Heads up on sys_fallocate()
On Mon, 5 March 2007 00:32:14 +, Anton Altaparmakov wrote:
> I don't know how your compression algorithm works [...]

LogFS is designed for flash media, so it does not have to worry much about reducing disk seeks. It is log-structured, which simplifies compression further.

When writing a block, it basically compresses it and appends it to the log. Writes only have to be byte-aligned, so no space is lost for padding.

The bad news for posix_fallocate() is that even if libc is smart enough to write random data, mmap() can still cause problems. If the VM decides to write a given page twice, the second write compresses better and the medium has filled up between the two writes, the users will have fun.

Jörn

--
Joern's library part 9:
http://www.scl.ameslab.gov/Publications/Gus/TwelveWays.html
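The point that the space a block occupies depends on its contents can be seen with any general-purpose compressor. A small sketch, using Python's zlib module purely as a stand-in (this is not LogFS's write path) and an assumed page size of 4096 bytes:

```python
import os
import zlib

PAGE_SIZE = 4096  # assumed page size, for illustration only

def stored_size(page: bytes) -> int:
    """Bytes a page would occupy after zlib compression."""
    return len(zlib.compress(page))

zero_page = bytes(PAGE_SIZE)          # what a naive preallocation would write
random_page = os.urandom(PAGE_SIZE)   # practically incompressible contents

print("zeroes:", stored_size(zero_page))
print("random:", stored_size(random_page))
```

The same logical page can need a few dozen bytes or slightly more than a full page, so writing zeroes in advance reserves almost nothing on a compressing filesystem.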
Re: [RFC] Heads up on sys_fallocate()
On Mon, 5 March 2007 07:08:03 -0800, Ulrich Drepper wrote:
> Jörn Engel wrote:
> > Does the allocation have to be persistent beyond lifetime of the file descriptor?
>
> Of course. You call posix_fallocate once for the lifetime of the file when it is created to ensure that all future uses will work.

That part is not quite clear from the manpage but I trust most people would assume the same.

> It seems your filesystem will not be able to support this unless compression is turned off.

Correct. Compression needs to be turned off for a file, if posix_fallocate(3) is to succeed. What I could do is disable compression (meaning that no data written in the future will be compressed) and rewrite all blocks within the given range.

Still, it is quite obvious that no one designing this interface has given much thought to compressing filesystems. Whatever I can come up with will either be incompatible or some sort of hack. :(

Jörn

--
Courage is not the absence of fear, but rather the judgement that something else is more important than fear.
-- Ambrose Redmoon
Re: [RFC] Heads up on sys_fallocate()
On Wed, 7 March 2007 09:51:35 +0100, Jan Kara wrote:
> I'll probably first write some userspace fs-reorganizer to find out how much these changes in layout are able to give you in performance (i.e. whether it's worth the effort of more complicated kernel online defragmenter).

Have you tried profiling the read accesses and prereading them asynchronously on startup? That appears to have improved E17 a lot. See http://lca2007.linux.org.au/talk/101 (and watch the video).

Jörn

--
The competent programmer is fully aware of the strictly limited size of his own skull; therefore he approaches the programming task in full humility, and among other things he avoids clever tricks like the plague.
-- Edsger W. Dijkstra
Re: [patch] add file position info to proc
On Tue, 27 March 2007 21:24:20 +, Pavel Machek wrote: From: Miklos Szeredi [EMAIL PROTECTED] This patch adds support for finding out the current file position, open flags and possibly other info in the future. These new entries are added: /proc/PID/fdinfo/FD /proc/PID/task/TID/fdinfo/FD For each fd the information is provided in the following format: pos:1234 flags: 012 Octal? Maybe we should use more traditional hex here? Or even list flags by name? The flags are defined in octal. Whether that choice makes sense or should be rethought is a different question. I would definitely prefer hex. Jörn -- You ain't got no problem, Jules. I'm on the motherfucker. Go back in there, chill them niggers out and wait for the Wolf, who should be coming directly. -- Marsellus Wallace - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
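To make the octal-versus-hex question in the mail above concrete, here is a minimal sketch that parses the two fields shown in the patch description. It assumes the "pos:" value is decimal and the "flags:" value is octal, as stated in the thread; `parse_fdinfo` is an illustrative helper, not an existing API.

```python
def parse_fdinfo(text: str) -> dict:
    """Parse 'pos:' (decimal) and 'flags:' (octal) lines into integers."""
    info = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        value = value.strip()
        if key == "pos":
            info["pos"] = int(value, 10)
        elif key == "flags":
            info["flags"] = int(value, 8)
    return info

# The example values from the patch description.
info = parse_fdinfo("pos:\t1234\nflags:\t012\n")
print(info["pos"], oct(info["flags"]), hex(info["flags"]))
```

A reader who assumes decimal or hex would misread "012"; parsing with an explicit base avoids that, whichever notation the kernel ends up printing.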
Re: Interface for the new fallocate() system call
On Fri, 30 March 2007 19:15:58 +1000, Paul Mackerras wrote:
> Heiko Carstens writes:
> > If possible I'd prefer the six-32-bit-args approach.
>
> It does mean extra unnecessary work for 64-bit platforms, though...

Wouldn't that work be confined to fallocate()? If I understand Heiko correctly, the alternative would slow s390 down for every syscall, including more performance-critical ones.

Jörn

--
tglx1 thinks that joern should get a (TM) for Thinking Is Hard
-- Thomas Gleixner
Re: [PATCH 09/16] zlib-decompression-status.diff
On Sun, 1 April 2007 20:15:42 +0200, Jan Engelhardt wrote:
> +static inline void putstr(const char *s) {
> +	printk("%s", s);
> +	return;
> +}
> +
>  static int __init crd_load(int in_fd, int out_fd)
>  {
>  	int result;
> @@ -418,7 +423,7 @@ static int __init crd_load(int in_fd, in
>  		return -1;
>  	}
>  	makecrc();
> -	result = gunzip();
> +	result = gunzip(putstr);

You are sure this wasn't meant as an April fools joke? Passing the address of an inline function certainly has a humorous aspect. ;)

Also, you can remove the return; in the void function and possibly change this bit to match Documentation/CodingStyle.

> +if(putstr != NULL) putstr("*");

The patch alternately uses puts() and putstr(), which looks rather odd. Not sure whether that makes sense or not.

Jörn

--
My second remark is that our intellectual powers are rather geared to master static relations and that our powers to visualize processes evolving in time are relatively poorly developed.
-- Edsger W. Dijkstra
Re: missing madvise functionality
On Tue, 3 April 2007 23:10:14 +0200, Eric Dumazet wrote:
> mmap()/brk() must give fresh NULL pages, but maybe madvise(MADV_DONTNEED) can relax this requirement (if the pages were reclaimed, then a page fault could bring a new page with random content)

...provided that it doesn't leak information from the kernel?

Jörn

--
All art is but imitation of nature.
-- Lucius Annaeus Seneca
Re: [PATCH] block2mtd lockdep_init_map warning
On Mon, 7 January 2008 11:05:26 +0100, Peter Zijlstra wrote:
> Would something like this work for people?

Looks a lot better than what I thought of. However, does the #ifdef within is_module_address() make sense when afaict lockdep is the only caller of that function? Looks as if the whole function should be made conditional or none of it.

> Not-Yet-Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
> ---
> Index: linux-2.6/include/linux/sched.h
> ===
> --- linux-2.6.orig/include/linux/sched.h
> +++ linux-2.6/include/linux/sched.h
> @@ -1160,6 +1160,7 @@ struct task_struct {
>  	int lockdep_depth;
>  	struct held_lock held_locks[MAX_LOCK_DEPTH];
>  	unsigned int lockdep_recursion;
> +	struct module *loading_module;
>  #endif
>
>  /* journalling filesystem info */
> Index: linux-2.6/kernel/module.c
> ===
> --- linux-2.6.orig/kernel/module.c
> +++ linux-2.6/kernel/module.c
> @@ -2023,6 +2023,9 @@ static struct module *load_module(void _
>  		printk(KERN_WARNING "%s: Ignoring obsolete parameters\n",
>  		       mod->name);
>
> +#ifdef CONFIG_LOCKDEP
> +	current->loading_module = mod;
> +#endif
>  	/* Size of section 0 is 0, so this works well if no params */
>  	err = parse_args(mod->name, mod->args,
>  			 (struct kernel_param *)
> @@ -2030,6 +2033,9 @@ static struct module *load_module(void _
>  			 sechdrs[setupindex].sh_size
>  			 / sizeof(struct kernel_param),
>  			 NULL);
> +#ifdef CONFIG_LOCKDEP
> +	current->loading_module = NULL
> +#endif
>  	if (err < 0)
>  		goto arch_cleanup;
>
> @@ -2454,6 +2460,17 @@ int is_module_address(unsigned long addr
>  		}
>  	}
>
> +#ifdef CONFIG_LOCKDEP
> +	if (current->loading_module) {
> +		mod = current->loading_module;
> +		if (within(addr, mod->module_init, mod->init_text_size)
> +		 || within(addr, mod->module_core, mod->core_text_size)) {
> +			preempt_enable();
> +			return 1;
> +		}
> +	}
> +#endif
> +
>  	preempt_enable();
>  	return 0;

Jörn

--
I don't understand it. Nobody does.
-- Richard P. Feynman
Re: [noob q. on block layer] block IO read-ahead during sequential *write*?
On Mon, 7 January 2008 13:25:09 +0100, Frantisek Rysanek wrote:
> let me start with a simple example. The following commands:
>
>   cp /dev/zero /dev/hda
>   dd if=/dev/zero of=/dev/hda [bs=512]
>
> both have one common side-effect: apart from the disk being properly overwritten with zeroes, the kernel seems to keep reading sectors ahead of the current seek position of the sequential write.

Block devices are cached in the page cache. If you write less than a full page, any remainder has to be read from the device. If you retry the dd with bs=4096 (or whatever your architecture's page size happens to be), does this still occur?

Jörn

--
Chance favors only the prepared mind.
-- Louis Pasteur
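The explanation in the mail above (a write smaller than a page forces the rest of the page to be read first) can be expressed as a small calculation. This is a simplified sketch, not kernel code: it assumes a 4096-byte page, ignores pages that are already cached, and uses a hypothetical helper name.

```python
def pages_needing_read(offset: int, length: int, page_size: int = 4096) -> list:
    """Page indexes that a write covers only partially.

    A partially covered page that is not cached yet has to be read from
    the device before the new bytes can be merged in.
    """
    if length <= 0:
        return []
    end = offset + length
    first, last = offset // page_size, (end - 1) // page_size
    partial = []
    if offset % page_size != 0:
        partial.append(first)          # write starts in the middle of a page
    if end % page_size != 0 and last not in partial:
        partial.append(last)           # write ends in the middle of a page
    return partial

print(pages_needing_read(0, 512))    # bs=512: page 0 is only partly covered
print(pages_needing_read(0, 4096))   # bs=4096: whole page, nothing to read
```

With 512-byte writes the first write into each page triggers a read of that page, which would look like the device being read just ahead of the write position.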
Re: [PATCH] Claim maintainership for block2mtd and update email addresses
On Mon, 7 January 2008 15:23:00 -0800, Andrew Morton wrote:
> On Sun, 6 Jan 2008 14:56:01 +0100 J__rn Engel [EMAIL PROTECTED] wrote:
>
> > You found a new one! That makes a round dozen, I believe.
> > http://logfs.org/logfs/joern
> >
> > - * Copyright (C) 2004-2006 Jörn Engel [EMAIL PROTECTED]
> > + * Copyright (C) 2004-2006 Joern Engel [EMAIL PROTECTED]
>
> Yup. Your name comes out like that when sylpheed does its save-email-to-a-file thing as well and I haven't got around to working out why, or to reporting it.
>
> In this case it looks like the dud characters came about due to [MTD] Fix legacy character sets throughout drivers/mtd, include/linux/mtd. Which doesn't look like it fixed anything much really.
>
> Going with the asciified/anglicised/bastardised spelling is a practical (albeit unhappy) solution.

I'm happy if people spend effort and make unicode work. Until then I'll semi-officially change my name to Joern and keep collecting unusual specimens.

Jörn

--
Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest.
-- Rob Pike
Re: [RFC PATCH] greatly reduce SLOB external fragmentation
On Thu, 10 January 2008 11:49:25 -0600, Matt Mackall wrote:
> > b) grouping objects of the same -type- (not size) together should
> > mean they have similar lifetimes and thereby keep fragmentation low
>
> (b) is known to be false, you just have to look at our dcache and
> icache pinning.

(b) is half-true, actually. The grouping by lifetime makes a lot of
sense. LogFS has a similar problem to slabs (only full segments are
useful, a single object can pin the segment). And when I grouped my
objects very roughly by their life expectancy, the impact was *HUGE*!

In both cases, you want slabs/segments that are either close to 100%
full or close to 0% full. It matters a lot when you have to move objects
around and I would bet it matters even more when you cannot move objects
and the slab just remains pinned.

So just because the type alone is a relatively bad heuristic for life
expectancy does not make the concept false. Bonwick was onto something.
He just failed in picking a good heuristic. Quite likely spreading by
type was even a bonus when slab was developed, because even such a crude
heuristic is slightly better than completely randomized lifetimes.

I've been meaning to split the dentry cache into 2-3 separate ones for a
while and kept spending my time elsewhere. But I remain convinced that
this will make a measurable difference.

Jörn

--
Never argue with idiots - first they drag you down to their level,
then they beat you with experience.
-- unknown
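The point about wanting slabs that are either nearly full or nearly empty can be illustrated with a toy model. It is only a sketch: real allocators do not know lifetimes in advance, the two lifetime classes and the function names are assumptions made for the example, and a "slab" here is just a fixed-size list.

```python
def chunk(objects, slab_size):
    """Pack objects into slabs of slab_size in allocation order."""
    return [objects[i:i + slab_size] for i in range(0, len(objects), slab_size)]


def reclaimable_slabs(lifetimes, slab_size, grouped):
    """Count slabs that are completely free once all short-lived objects die.

    lifetimes: "short"/"long" labels in allocation order.
    grouped:   True gives each lifetime class its own slabs,
               False packs objects purely in allocation order.
    """
    if grouped:
        slabs = []
        for cls in ("short", "long"):
            slabs += chunk([l for l in lifetimes if l == cls], slab_size)
    else:
        slabs = chunk(list(lifetimes), slab_size)
    # A slab can only be freed if no long-lived object pins it.
    return sum(1 for slab in slabs if all(l == "short" for l in slab))


# Seven short-lived objects followed by one long-lived one, repeated.
pattern = (["short"] * 7 + ["long"]) * 8
mixed = reclaimable_slabs(pattern, slab_size=8, grouped=False)
separated = reclaimable_slabs(pattern, slab_size=8, grouped=True)
```

In this worst-case pattern every mixed slab stays pinned by a single long-lived object, while the grouped layout frees all slabs that held only short-lived ones.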
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Sun, 16 September 2007 00:30:32 +0200, Andrea Arcangeli wrote:
> Movable? I rather assume all slab allocations aren't movable. Then
> slab defrag can try to tackle on users like dcache and inodes. Keep in
> mind that with the exception of updatedb, those inodes/dentries will
> be pinned and you won't move them, which is why I prefer to consider
> them not movable too... since there's no guarantee they are.

I have been toying with the idea of having separate caches for pinned
and movable dentries. Downside of such a patch would be the number of
memcpy() operations when moving dentries from one cache to the other.
Upside is that a fair amount of slab cache can be made movable.
memcpy() is still faster than reading an object from disk.

Most likely the current reaction to such a patch would be to shoot it
down due to overhead, so I didn't pursue it. All I have is an old patch
to separate never-cached from possibly-cached dentries. It will
increase the odds of freeing a slab, but provide no guarantee.

But the point here is: dentries/inodes can be made movable if there are
clear advantages to it. Maybe they should?

Jörn

--
Joern's library part 2:
http://www.art.net/~hopkins/Don/unix-haters/tirix/embarrassing-memo.html
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Sat, 15 September 2007 01:44:49 -0700, Andrew Morton wrote:
> On Tue, 11 Sep 2007 14:12:26 +0200 Jörn Engel [EMAIL PROTECTED] wrote:
>
> > While I agree with your concern, those numbers are quite silly. The
> > chances of 99.8% of pages being free and the remaining 0.2% being
> > perfectly spread across all 2MB large_pages are lower than those of
> > SHA1 creating a collision.
>
> Actually it'd be pretty easy to craft an application which allocates
> seven pages for pagecache, then one for something, then seven for
> pagecache, then one for something, etc.
>
> I've had test apps which do that sort of thing accidentally. The
> result wasn't pretty.

I bet! My (false) assumption was the same as Goswin's. If non-movable
pages are clearly separated from movable ones and will evict movable
ones before polluting further mixed superpages, Nick's scenario would be
nearly infinitely impossible.

Assumption doesn't reflect current code. Enforcing this assumption
would cost extra overhead. The amount of effort to make Christoph's
approach work reliably seems substantial and I have no idea whether it
would be worth it.

Jörn

--
Happiness isn't having what you want, it's wanting what you have.
-- unknown
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Sun, 16 September 2007 11:15:36 -0700, Linus Torvalds wrote:
> On Sun, 16 Sep 2007, Jörn Engel wrote:
> >
> > I have been toying with the idea of having separate caches for
> > pinned and movable dentries. Downside of such a patch would be the
> > number of memcpy() operations when moving dentries from one cache to
> > the other.
>
> Totally inappropriate.
>
> I bet 99% of all dentry_lookup() calls involve turning the last dentry
> from having a count of zero (movable) to having a count of 1 (pinned).
>
> So such an approach would fundamentally be broken. It would slow down
> all normal dentry lookups, since the *common* case for leaf dentries
> is that they have a zero count.

Why am I not surprised? :)

> So it's much better to do it on a directory/file basis, on the
> assumption that files are *mostly* movable (or just freeable). The
> fact that they aren't always (ie while kept open etc), is likely
> statistically not all that important.

My approach is to have one for mount points and ramfs/tmpfs/sysfs/etc.
which are pinned for their entire lifetime and another for regular
files/inodes. One could take a three-way approach and have
always-pinned, often-pinned and rarely-pinned.

We won't get never-pinned that way.

Jörn

--
The wise man seeks everything in himself; the ignorant man tries to get
everything from somebody else.
-- unknown
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Mon, 17 September 2007 00:06:24 +0200, Goswin von Brederlow wrote:
> How probable is it that the dentry is needed again? If you copy it and
> it is not needed then you wasted time. If you throw it out and it is
> needed then you wasted time too. Depending on the probability one of
> the two is cheaper overall. Ideally I would throw away dentries that
> haven't been accessed recently and copy recently used ones.
>
> How much of a system's ram is spent on dentries? How much on task
> structures? Does anyone have some stats on that? If it is 10% of the
> total ram combined then I don't see much point in moving them. Just
> keep them out of the way of users memory so the buddy system can work
> effectively.

As usual, the answer is "it depends". I've had up to 600MB in dentry
and inode slabs on a 1GiB machine after updatedb. This machine
currently has 13MB in dentries, which seems to be reasonable for my
purposes.

Jörn

--
Audacity augments courage; hesitation, fear.
-- Publilius Syrus
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Tue, 18 September 2007 11:00:40 +0100, Mel Gorman wrote:
> We still lack data on what sort of workloads really benefit from large
> blocks

Compressing filesystems like jffs2 and logfs gain better compression
ratio with larger blocks. Going from 4KiB to 64KiB gave somewhere
around 10% benefit iirc. Testdata was a 128MiB qemu root filesystem.

Granted, the same could be achieved by adding some extra code and a few
bounce buffers to the filesystem. How such a hack would perform I'd
prefer not to find out, though. :)

Jörn

--
Write programs that do one thing and do it well. Write programs to work
together. Write programs to handle text streams, because that is a
universal interface.
-- Doug MacIlroy
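The block-size effect is easy to reproduce in user space. The sketch below is not how jffs2 or logfs compress data; it only assumes that each block is compressed independently, uses zlib from the Python standard library, and runs on synthetic word data invented for the example, so the gain it shows says nothing about the roughly 10% quoted above.

```python
import random
import zlib


def compressed_size(data, block_size):
    """Total size when every block is compressed on its own."""
    return sum(len(zlib.compress(data[i:i + block_size]))
               for i in range(0, len(data), block_size))


def sample_data(size, seed=0):
    """Deterministic, text-like test data (purely synthetic)."""
    rng = random.Random(seed)
    words = [b"inode", b"dentry", b"segment", b"journal",
             b"block", b"erase", b"flash", b"page"]
    out = bytearray()
    while len(out) < size:
        out += rng.choice(words) + b" "
    return bytes(out[:size])


data = sample_data(256 * 1024)
small_blocks = compressed_size(data, 4 * 1024)
large_blocks = compressed_size(data, 64 * 1024)
```

Smaller blocks pay the per-stream overhead more often and give the compressor less history to refer back to, so large_blocks comes out smaller than small_blocks on this input.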
Re: x86_64: Make sparsemem/vmemmap the default memory model
On Mon, 12 November 2007 20:41:10 -0800, Christoph Lameter wrote:
> On Mon, 12 Nov 2007, Ray Lee wrote:
>
> > Discontig obviously needs to die. However, FlatMem is consistently
> > faster, averaging about 2.1% better overall for your numbers above.
> > Is the page allocator not, erm, a fast path, where that matters?
> >
> > Order    Flat   Sparse  % diff
> >     0     639      641     0.3
>
> IMHO Order 0 currently matters most and the difference is negligible
> there.

Is it? I am a bit concerned about the non-monotonic distribution.
Difference starts at near-0, grows to 4.4, drops to near-0, grows to
4.9, drops to near-0.

Order    Flat   Sparse  % diff
    0     639      641     0.3
    1     567      593     4.4
    2     679      692     1.9
    3     763      781     2.3
    4     961      962     0.1
    5    1356     1392     2.6
    6    2224     2336     4.8
    7    4869     5074     4.0
    8   12500    12732     1.8
    9   27926    28165     0.8
   10   58578    58682     0.2

Is there an explanation for this behaviour? More to the point, could
repeated runs also return 4% difference for order-0?

Jörn

--
It does not require a majority to prevail, but rather an irate,
tireless minority keen to set brush fires in people's minds.
-- Samuel Adams
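For anyone wanting to double-check the table, the percentage column can be recomputed from the two measurement columns. The sketch assumes the fused rows for orders 5 to 7 split as 1356/1392, 2224/2336 and 4869/5074, and that the difference is taken relative to the Sparse figure; both assumptions reproduce the posted percentages. The unit of the measurements is not stated in this message.

```python
# (order, flat, sparse) as posted in the thread
RESULTS = [
    (0, 639, 641), (1, 567, 593), (2, 679, 692), (3, 763, 781),
    (4, 961, 962), (5, 1356, 1392), (6, 2224, 2336), (7, 4869, 5074),
    (8, 12500, 12732), (9, 27926, 28165), (10, 58578, 58682),
]


def percent_diff(flat, sparse):
    """Difference relative to the Sparse figure, rounded like the posted column."""
    return round((sparse - flat) / sparse * 100, 1)


def local_peaks(values):
    """Indices whose value is larger than both neighbours."""
    return [i for i in range(1, len(values) - 1)
            if values[i - 1] < values[i] > values[i + 1]]


diffs = [percent_diff(flat, sparse) for _, flat, sparse in RESULTS]
peaks = local_peaks(diffs)
```

The peaks land at orders 1, 3 and 6, which is the up-and-down pattern the question is about; whether it is noise can only be settled by repeated runs.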
Re: [BUG] New Kernel Bugs
On Tue, 13 November 2007 15:18:07 -0500, Mark Lord wrote:
> I just find it weird that something can be known broken for several
> -rc* kernels before I happen to install it, discover it's broken on my
> own machine, and then I track it down, fix it, and submit the patch,
> generally all within a couple of hours. Where the heck was the
> dude(ess) that broke it ?? AWOL.
>
> And when I receive hostility from the maintainers of said code for
> fixing their bugs, well.. that really motivates me to continue
> reporting new ones..

Given a decent bug report, I agree that having the bug not looked at is
shameful. But what can a developer do if a bug report effectively reads
"there is some bug somewhere in recent kernels"? How can I know that in
this particular case it is my bug that I introduced? It could just as
easily be 50 other people and none of them are eager to debug it unless
they suspect it to be their bug.

This is a common problem and fairly unrelated to linux in general or the
kernel in particular. Who is going to be the sucker that figures out
which developer the bug belongs to? And I have yet to find a project,
commercial or opensource, where volunteers flock to become such a
sucker.

One option is to push this role to the bug reporter. Another is to
strong-arm some developers into this role, by whatever means. A third
would be for $LARGE_COMPANY to hire some people. If you have a better
idea or would volunteer your time, I'd be grateful. Simply blaming one
side, whether bug reporter or a random developer, for not being the
sucker doesn't help anyone.

Jörn

--
Joern's library part 2:
http://www.art.net/~hopkins/Don/unix-haters/tirix/embarrassing-memo.html
Re: [BUG] New Kernel Bugs
On Tue, 13 November 2007 13:56:58 -0800, Andrew Morton wrote:
> It's relatively common that a regression in subsystem A will manifest
> as a failure in subsystem B, and the report initially lands on the
> desk of the subsystem B developers.
>
> But that's OK. The subsystem B people are the ones with the expertise
> to be able to work out where the bug resides and to help the subsystem
> A people understand what went wrong.
>
> Alas, sometimes the B people will just roll eyes and do nothing
> because they know the problem wasn't in their code.

Sometimes. And sometimes the A people will ignore the B people after
the root cause has been worked out. Do you have a good idea how to
shame A into action? Should I put you on Cc:?

Right now I'm in the eye-rolling phase.

Jörn

--
The cost of changing business rules is much more expensive for software
than for a secretary.
-- unknown
Re: x86_64: Make sparsemem/vmemmap the default memory model
On Tue, 13 November 2007 13:52:17 -0800, Christoph Lameter wrote:
> Could you run your own test to verify?

You bastard! You know I'm too lazy to do that. ;)

As long as the order-0 number is stable across multiple runs I don't
mind. The numbers just looked suspiciously as if they were not stable.
That's all.

Jörn

--
Why do musicians compose symphonies and poets write poems?
They do it because life wouldn't have any meaning for them if they
didn't. That's why I draw cartoons. It's my life.
-- Charles Shultz
Re: [alsa-devel] [BUG] New Kernel Bugs
On Thu, 15 November 2007 13:26:51 +0100, Rene Herman wrote:
> Can you please just shelve this crap? You have a way of knowing that
> ALSA will accept you and that is knowing or assuming that the ALSA
> project doesn't consist of drooling retards.

Well, my experience with moderation has been that moderated mails are
stuck in some queue for weeks. Two separate lists, neither of them was
alsa. If alsa is doing a better job, great. But it still has to live
with the general reputation of non-subscriber moderation.

> When a project list goes to the difficulty of moderating
> non-subscribers it has made the explicit choice to _not_ become
> subscriber only. Then refusing valid non-subscribers after all makes
> no sense whatsoever. I'm sorry you got your feelings hurt by that
> other list but it was no doubt an accident; take it up with them.

Been there, done that. In spite of people not being drooling retards,
the amount of time and effort they invest into either moderation or
improving the ruleset is quite limited. Problems persist.

And even without mails being held hostage for weeks, every single
moderation mail is annoying. Like the one I'm sure to receive after
sending this out.

Jörn

--
Joern's library part 5:
http://www.faqs.org/faqs/compression-faq/part2/section-9.html
[PATCH] Document I_SYNC and I_DATASYNC
After some archeology (see http://logfs.org/logfs/inode_state_bits) I
finally figured out what the three I_DIRTY bits do. Maybe others would
prefer less effort to reach this insight.

Signed-off-by: Jörn Engel [EMAIL PROTECTED]
---

 include/linux/fs.h |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- git_I_DIRTY/include/linux/fs.h~I_DIRTY 2007-11-15 20:51:57.0 +0100
+++ git_I_DIRTY/include/linux/fs.h 2007-11-16 03:45:16.0 +0100
@@ -1276,8 +1276,10 @@ struct super_operations {
  *
  * Two bits are used for locking and completion notification, I_LOCK and I_SYNC.
  *
- * I_DIRTY_SYNC        Inode itself is dirty.
- * I_DIRTY_DATASYNC    Data-related inode changes pending
+ * I_DIRTY_SYNC        Inode is dirty, but doesn't have to be written on
+ *                     fdatasync().  i_atime is the usual cause.
+ * I_DIRTY_DATASYNC    Inode is dirty and must be written on fdatasync(), f.e.
+ *                     because i_size changed.
  * I_DIRTY_PAGES       Inode has dirty pages.  Inode itself may be clean.
  * I_NEW               get_new_inode() sets i_state to I_LOCK|I_NEW.  Both
  *                     are cleared by unlock_new_inode(), called from iget().
@@ -1309,8 +1311,6 @@ struct super_operations {
  *                     purpose reduces latency and prevents some filesystem-
  *                     specific deadlocks.
  *
- * Q: Why does I_DIRTY_DATASYNC exist?  It appears as if it could be replaced
- *    by (I_DIRTY_SYNC|I_DIRTY_PAGES).
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
  * Q: igrab() only checks on (I_FREEING|I_WILL_FREE).  Should it also check on
  *    I_CLEAR?  If not, why?
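The distinction the new comments draw can be restated as a small decision function. This is only a model of the documented semantics, not the kernel's writeback code; the numeric bit values and the function name are assumptions chosen for the illustration.

```python
# Bit values are illustrative assumptions for this sketch.
I_DIRTY_SYNC = 1      # inode dirty, but not needed by fdatasync() (e.g. i_atime)
I_DIRTY_DATASYNC = 2  # inode dirty and needed by fdatasync() (e.g. i_size)
I_DIRTY_PAGES = 4     # inode has dirty pages; inode itself may be clean


def must_write_inode(i_state, datasync):
    """Return True if the inode itself has to be written out.

    datasync=True models fdatasync(), datasync=False models fsync().
    Dirty pages are a separate matter and are not decided here.
    """
    if datasync:
        return bool(i_state & I_DIRTY_DATASYNC)
    return bool(i_state & (I_DIRTY_SYNC | I_DIRTY_DATASYNC))
```

An inode that is dirty only because of an atime update therefore costs nothing on fdatasync() in this model, while a size change forces the inode write in both cases.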
Re: Treat disk space like memory space
On Fri, 16 November 2007 10:30:12 -0800, H. Peter Anvin wrote:
> This, by the way, has been discussed on and off -- often in the
> context of undelete (which is an identical problem.) The problem
> usually is that performance of real storage users suffer because of
> locality issues. However, flash storage doesn't have locality
> requirements...

It does, although significantly less so than disks. Read latency is
typically between 100x and 1000x less than disk latency.

Another argument against this is that free space directly translates to
speed, both for disks and flash. Disk filesystems fragment like hell if
the disk is constantly near-full and flash filesystems require a lot
more garbage collection overhead.

Jörn

--
To my face you have the audacity to advise me to become a thief - the
worst kind of thief that is conceivable, a thief of spiritual things, a
thief of ideas! It is insufferable, intolerable!
-- M. Binet in Scarabouche
Re: [RFC] Documentation about unaligned memory access
On Fri, 23 November 2007 00:15:53 +, Daniel Drake wrote:
> What's the definition of an unaligned access?
> =============================================
>
> Unaligned memory accesses occur when you try to read N bytes of data
> starting from an address that is not evenly divisible by N (i.e. addr
> % N != 0). For example, reading 4 bytes of data from address 0x1004
> is fine, but reading 4 bytes of data from address 0x1005 would be an
> unaligned memory access.

The wording could also apply to a DMA of 8k from a 4k-aligned address.
But I don't have a good idea how to improve it.

> It's safe to assume that memcpy will always copy bytewise and hence
> will never cause an unaligned access.

s/always copy/always behave as if copying/

memcpy usually copies at least wordwise, possibly even in bigger chunks.
But that is just the inner loop. Unaligned bytes at the beginning/end
receive special treatment.

Jörn

--
The rabbit runs faster than the fox, because the rabbit is running for
his life while the fox is only running for his dinner.
-- Aesop
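Both points fit into a few lines of arithmetic. The first function applies the quoted definition literally; the second sketches the head/body/tail split described for memcpy. The split is a simplification that looks at a single address only and assumes a 4-byte word; neither function name comes from the document under review.

```python
def is_unaligned(addr, size):
    """The quoted definition, taken literally: addr % N != 0."""
    return addr % size != 0


def split_copy(addr, length, word=4):
    """Split a copy into (head bytes, whole words, tail bytes).

    Head bytes bring addr up to a word boundary, the body is copied one
    word at a time, and the tail handles whatever is left over.
    """
    head = min(length, (-addr) % word)
    body = (length - head) // word
    tail = length - head - body * word
    return head, body, tail
```

Taken literally, the definition also flags an 8k transfer from a 4k-aligned address (is_unaligned(0x1000, 8192) is true), which is the ambiguity raised in the reply. A 10-byte copy starting at 0x1005 splits into 3 head bytes, 1 aligned word and 3 tail bytes.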
Re: [BUG] Strange 1-second pauses during Resume-from-RAM
On Fri, 30 November 2007 14:43:12 +0100, Ingo Molnar wrote:
> http://redhat.com/~mingo/latency-tracing-patches/latency-tracing-v2.6.24-rc3.combo.patch
>
> does it work any better?

It compiles. It boots with 512M of RAM (384M was too little with all
the other debug options on). But it seems to lock up when running
trace-cmd. On a rerun it locks up again, but with different output.

Rerun was captured:
http://logfs.org/~joern/trace1.jpg

I should do a couple of runs, but my girlfriend claims realtime priority
for the evening.

Jörn

--
Chance favors only the prepared mind.
-- Louis Pasteur
Re: [BUG] Strange 1-second pauses during Resume-from-RAM
On Thu, 15 November 2007 20:36:12 +0100, Ingo Molnar wrote:
> * Ingo Molnar [EMAIL PROTECTED] wrote:
>
> > pick up the latest latency tracer patch from:
>
> sorry, wrong URLs, the correct links are:
>
>   http://redhat.com/~mingo/latency-tracing-patches/latency-tracer-v2.6.24-rc2-git5-combo.patch
>   http://redhat.com/~mingo/latency-tracing-patches/trace-cmd.c

Don't seem to work with plain 2.6.23:

kernel/sched.c:3384: warning: ‘struct prio_array’ declared inside parameter list
kernel/sched.c:3384: warning: its scope is only this definition or declaration, which is probably not what you want
kernel/sched.c: In function ‘trace_array’:
kernel/sched.c:3391: error: dereferencing pointer to incomplete type
kernel/sched.c:3393: error: dereferencing pointer to incomplete type
kernel/sched.c:3393: error: dereferencing pointer to incomplete type
kernel/sched.c:3396: error: dereferencing pointer to incomplete type
kernel/sched.c:3396: error: dereferencing pointer to incomplete type
kernel/sched.c: In function ‘trace_all_runnable_tasks’:
kernel/sched.c:3407: error: ‘struct rq’ has no member named ‘active’
make[1]: *** [kernel/sched.o] Error 1

And I cannot find a definition of struct prio_array in current git
either. Is another patch needed?

Jörn

--
Time? What's that? Time is only worth what you do with it.
-- Theo de Raadt
Re: [BUG] Strange 1-second pauses during Resume-from-RAM
On Fri, 30 November 2007 14:35:46 +0100, Ingo Molnar wrote:
> * Jörn Engel [EMAIL PROTECTED] wrote:
>
> > kernel/sched.c:3384: warning: ‘struct prio_array’ declared inside parameter list
> > kernel/sched.c:3384: warning: its scope is only this definition or declaration, which is probably not what you want
> > kernel/sched.c: In function ‘trace_array’:
> > kernel/sched.c:3391: error: dereferencing pointer to incomplete type
> > kernel/sched.c:3393: error: dereferencing pointer to incomplete type
> > kernel/sched.c:3393: error: dereferencing pointer to incomplete type
> > kernel/sched.c:3396: error: dereferencing pointer to incomplete type
> > kernel/sched.c:3396: error: dereferencing pointer to incomplete type
> > kernel/sched.c: In function ‘trace_all_runnable_tasks’:
> > kernel/sched.c:3407: error: ‘struct rq’ has no member named ‘active’
> > make[1]: *** [kernel/sched.o] Error 1
> >
> > And I cannot find a definition of struct prio_array in current git
> > either. Is another patch needed?
>
> change that to rt_prio_array in the code.

Solves the prio_array problem, but leaves the non-existing member
active. I've upgraded to -rc3 and will give your latest patch a whirl.

Jörn

--
Write programs that do one thing and do it well. Write programs to work
together. Write programs to handle text streams, because that is a
universal interface.
-- Doug MacIlroy
Re: [BUG] Strange 1-second pauses during Resume-from-RAM
On Fri, 30 November 2007 19:46:25 +0100, Ingo Molnar wrote:
> * Jörn Engel [EMAIL PROTECTED] wrote:
>
> > It compiles. It boots with a 512M RAM (384M was too little with all
> > the other debug options on). But it seems to lock up when running
> > trace-cmd. On a rerun it locks up again, but with different output.
>
> hm, you should decrease MAX_TRACE in kernel/latency_tracing.c from 1
> million to 16K or so. 1 million entries probably depletes lowmem
> quite seriously.

That's ok. RAM is cheap.

> > Rerun was captured:
> > http://logfs.org/~joern/trace1.jpg
>
> hm, that looks weird. if you disable CONFIG_PROVE_LOCKING, does that
> improve things? (or just turns a noisy lockup into a silent lockup?)

Not much, although the dumps look different now:
http://logfs.org/~joern/trace3.jpg
http://logfs.org/~joern/trace4.jpg

I have to change my qemu setup a little to see the top of those dumps...

> > I should do a couple of runs, but my girlfriend claims realtime
> > priority for the evening.
>
> yeah, SCHED_IDLE is not generally well received by them.

...as soon as more urgent tasks have finished (weekend is over).

Jörn

--
It does not matter how slowly you go, so long as you do not stop.
-- Confucius
Re: [BUG] Strange 1-second pauses during Resume-from-RAM
On Sat, 1 December 2007 19:32:56 +0100, Ingo Molnar wrote:
> * Jörn Engel [EMAIL PROTECTED] wrote:
>
> > I have to change my qemu setup a little to see the top of those
> > dumps...
>
> btw., if you start qemu like this:
>
>   qemu -cdrom ./cdrom.iso -hda ./hda.img -boot c -full-screen -kernel ~/bzImage -append "root=/dev/hda1 earlyprintk=serial,ttyS0,9600 console=tty console=ttyS0,9600 enforcing=0 debug"
>
> you'll get the inner kernel's serial console log to qemu's standard
> output. Pretty useful for capturing kernel crashes.

Almost. -serial stdio was missing. Much better now.

stopped custom tracer.
BUG: spinlock recursion on CPU#0, sh/953
 lock: c030f280, .magic: dead4ead, .owner: sh/953, .owner_cpu: 0
Pid: 953, comm: sh Not tainted 2.6.24-rc3-ge1cca7e8-dirty #2
 [c0103a04] show_trace_log_lvl+0x35/0x54
 [c010450a] show_trace+0x2c/0x2e
 [c0104e6d] dump_stack+0x84/0x8a
 [c01ded7c] spin_bug+0xa7/0xae
 [c01def14] _raw_spin_lock+0x45/0xfa
 [c02a02b1] _spin_lock_irqsave+0x68/0x7a
 [c01087e7] pit_read+0x14/0x99
 [c0130ee9] get_monotonic_cycles+0xf/0x2d
 [c013c0ef] now+0x2a/0x7c
 [c013c33b] trace+0x4d/0x1e8
 [c013dbf3] __mcount+0x95/0xa6
 [c010d35c] mcount+0x14/0x18
 [c0135a44] lock_acquired+0xe/0x1d7
 [c02a02b9] _spin_lock_irqsave+0x70/0x7a
 [c01087e7] pit_read+0x14/0x99
 [c0130791] update_wall_time+0x23/0x692
 [c0121756] do_timer+0x24/0xb1
 [c01331fe] tick_periodic+0x49/0x84
 [c013325b] tick_handle_periodic+0x22/0x73
 [c0106315] timer_interrupt+0x4f/0x56
 [c013e2c7] handle_IRQ_event+0x24/0x4f
 [c013f44a] handle_edge_irq+0xb8/0x125
 [c01054ee] do_IRQ+0x89/0xa3
 [c01033df] common_interrupt+0x23/0x28
 [c015d924] vfs_write+0xa6/0x14c
 [c015df6e] sys_write+0x4c/0x70
 [c0102a1f] syscall_call+0x7/0xb
 ===

I assume you have the latency tracer working. If you could send me
your config, I could do a manual config-bisect and see which part of
mine causes the problem.

Jörn

--
Admonish your friends privately, but praise them openly.
-- Publilius Syrus
Re: [BUG] Strange 1-second pauses during Resume-from-RAM
On Sat, 1 December 2007 21:54:56 +0100, Ingo Molnar wrote:
> * Jörn Engel [EMAIL PROTECTED] wrote:
>
> > stopped custom tracer.
> > BUG: spinlock recursion on CPU#0, sh/953
> >  lock: c030f280, .magic: dead4ead, .owner: sh/953, .owner_cpu: 0
> > Pid: 953, comm: sh Not tainted 2.6.24-rc3-ge1cca7e8-dirty #2
> >  [c0103a04] show_trace_log_lvl+0x35/0x54
> >  [c010450a] show_trace+0x2c/0x2e
> >  [c0104e6d] dump_stack+0x84/0x8a
> >  [c01ded7c] spin_bug+0xa7/0xae
> >  [c01def14] _raw_spin_lock+0x45/0xfa
> >  [c02a02b1] _spin_lock_irqsave+0x68/0x7a
> >  [c01087e7] pit_read+0x14/0x99
> >  [c0130ee9] get_monotonic_cycles+0xf/0x2d
>
> ah. You should mark pit_read() function as notrace. PIT clocksource
> is rare. (add the 'notrace' word to the function prototype)

Hardly a change at all. Apart from some offsets, this dump is
identical.

stopped custom tracer.
BUG: spinlock recursion on CPU#0, sh/954
 lock: c030f280, .magic: dead4ead, .owner: sh/954, .owner_cpu: 0
Pid: 954, comm: sh Not tainted 2.6.24-rc3-ge1cca7e8-dirty #3
 [c0103a04] show_trace_log_lvl+0x35/0x54
 [c010450a] show_trace+0x2c/0x2e
 [c0104e6d] dump_stack+0x84/0x8a
 [c01ded7c] spin_bug+0xa7/0xae
 [c01def14] _raw_spin_lock+0x45/0xfa
 [c02a02b1] _spin_lock_irqsave+0x68/0x7a
 [c01087e2] pit_read+0xf/0x91
 [c0130ee1] get_monotonic_cycles+0xf/0x2d
 [c013c0e7] now+0x2a/0x7c
 [c013c333] trace+0x4d/0x1e8
 [c013dbeb] __mcount+0x95/0xa6
 [c010d354] mcount+0x14/0x18
 [c0135a3c] lock_acquired+0xe/0x1d7
 [c02a02b9] _spin_lock_irqsave+0x70/0x7a
 [c01087e2] pit_read+0xf/0x91
 [c0130789] update_wall_time+0x23/0x692
 [c012174e] do_timer+0x24/0xb1
 [c01331f6] tick_periodic+0x49/0x84
 [c0133253] tick_handle_periodic+0x22/0x73
 [c0106315] timer_interrupt+0x4f/0x56
 [c013e2bf] handle_IRQ_event+0x24/0x4f
 [c013f442] handle_edge_irq+0xb8/0x125
 [c01054ee] do_IRQ+0x89/0xa3
 [c01033df] common_interrupt+0x23/0x28
 [c010d354] mcount+0x14/0x18
 [c0120130] sysctl_head_finish+0xc/0x33
 [c0192d64] proc_sys_write+0x96/0xa0
 [c015d91c] vfs_write+0xa6/0x14c
 [c015df66] sys_write+0x4c/0x70
 [c0102a1f] syscall_call+0x7/0xb
 ===

Jörn

--
Don't worry about people stealing your ideas. If your ideas are any
good, you'll have to ram them down people's throats.
-- Howard Aiken quoted by Ken Iverson quoted by Jim Horning quoted by
   Raph Levien, 1979
Re: [BUG] Strange 1-second pauses during Resume-from-RAM
On Sun, 2 December 2007 09:56:08 +0100, Ingo Molnar wrote:
> * Jörn Engel [EMAIL PROTECTED] wrote:
>
> > > ah. You should mark pit_read() function as notrace. PIT
> > > clocksource is rare. (add the 'notrace' word to the function
> > > prototype)
> >
> > Hardly a change at all. Apart from some offsets, this dump is
> > identical.
> >
> > stopped custom tracer.
> > BUG: spinlock recursion on CPU#0, sh/954
> >  lock: c030f280, .magic: dead4ead, .owner: sh/954, .owner_cpu: 0
> > Pid: 954, comm: sh Not tainted 2.6.24-rc3-ge1cca7e8-dirty #3
> >  [c0103a04] show_trace_log_lvl+0x35/0x54
> >  [c010450a] show_trace+0x2c/0x2e
> >  [c0104e6d] dump_stack+0x84/0x8a
> >  [c01ded7c] spin_bug+0xa7/0xae
> >  [c01def14] _raw_spin_lock+0x45/0xfa
> >  [c02a02b1] _spin_lock_irqsave+0x68/0x7a
> >  [c01087e2] pit_read+0xf/0x91
> >  [c0130ee1] get_monotonic_cycles+0xf/0x2d
> >  [c013c0e7] now+0x2a/0x7c
> >  [c013c333] trace+0x4d/0x1e8
> >  [c013dbeb] __mcount+0x95/0xa6
> >  [c010d354] mcount+0x14/0x18
> >  [c0135a3c] lock_acquired+0xe/0x1d7
> >  [c02a02b9] _spin_lock_irqsave+0x70/0x7a
> >  [c01087e2] pit_read+0xf/0x91
>
> hm, it seems lock_acquired() [in kernel/lockdep.c] needs to be marked
> 'notrace' too - otherwise we recurse back into pit_read().

This time not even the offsets have changed. Dump is identical.

Jörn

--
Mundie uses a textbook tactic of manipulation: start with some
reasonable talk, and lead the audience to an unreasonable conclusion.
-- Bruce Perens
Re: [BUG] Strange 1-second pauses during Resume-from-RAM
On Sun, 2 December 2007 12:31:43 +0100, Jörn Engel wrote:
> This time not even the offsets have changed. Dump is identical.

After another ten or so notrace annotations throughout the spinlock
code, the latency tracer appears to work. Not sure how much useful
information is missing through all the annotations, though.

Jörn

--
Das Aufregende am Schreiben ist es, eine Ordnung zu schaffen, wo
vorher keine existiert hat.
-- Doris Lessing
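Why the annotations matter can be shown with a toy model of the call chain in the dumps: the tracer hook needs a timestamp, the clock takes a lock, and a traced helper called under that lock re-enters the hook. This is a user-space sketch and not the tracer's implementation; the class and its flag are invented for the example, and only the names mcount, pit_read and lock_acquired are borrowed from the backtraces.

```python
class SpinlockRecursion(Exception):
    """Raised when the non-reentrant lock is taken twice on the same CPU."""


class ToyTracer:
    def __init__(self, trace_lock_helpers):
        # False models marking the lock helpers 'notrace'.
        self.trace_lock_helpers = trace_lock_helpers
        self.lock_held = False
        self.ticks = 0
        self.events = []

    def mcount(self, name):
        """Stand-in for the compiler-inserted hook: timestamp each call."""
        self.events.append((self.pit_read(), name))

    def lock_acquired(self):
        """Lock bookkeeping helper; traced unless marked notrace."""
        if self.trace_lock_helpers:
            self.mcount("lock_acquired")

    def pit_read(self):
        """Clock source that needs the lock, like the PIT in the dumps."""
        if self.lock_held:
            raise SpinlockRecursion("pit_read re-entered with the lock held")
        self.lock_held = True
        try:
            self.lock_acquired()
            self.ticks += 1
            return self.ticks
        finally:
            self.lock_held = False

    def traced_call(self, name):
        """Model a traced kernel function being entered."""
        self.mcount(name)
```

With trace_lock_helpers=True the first traced call fails the same way the dumps do; with the helper excluded, the call is logged normally. The cost is the one noted above: whatever is excluded from tracing no longer shows up in the trace.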
Re: [BUG] Strange 1-second pauses during Resume-from-RAM
On Sun, 2 December 2007 14:57:11 +0100, Ingo Molnar wrote:

hm, do you have CONFIG_FRAME_POINTERS=y, i.e. are the dumps reliable?

I do. Went through ten-odd runs and annotated the function right below mcount each time. Seems to work now.

Trouble is that it doesn't solve my real problem at hand. Something is causing significant delays when writing to logfs. Core logfs code is not running at the time, but it may be triggering whatever other code is running and burning up all the CPU time. Wasting 100ms of qemu-time to write a single page happens fairly frequently.

With the latency tracer the problem appears to have become worse. Now the softlockup code triggers quite frequently. Which makes a bit of sense, as the problem is a busy CPU, rather than an idle one. Guess I'll try oprofile or lcov instead.

Jörn

--
Joern's library part 5: http://www.faqs.org/faqs/compression-faq/part2/section-9.html
Re: [BUG] Strange 1-second pauses during Resume-from-RAM
On Sun, 2 December 2007 13:31:25 +0100, Jörn Engel wrote:

After another ten or so notrace annotations throughout the spinlock code, the latency tracer appears to work. Not sure how many useful information is missing through all the annotations, though.

And here is a patch with the needed annotations. Looks a bit shabby, as it was generated through git diff, patcher, interdiff and combinediff.

Jörn

--
Joern's library part 10: http://blogs.msdn.com/David_Gristwood/archive/2004/06/24/164849.aspx

unchanged:
--- a/arch/x86/kernel/i8253.c
+++ b/arch/x86/kernel/i8253.c
@@ -125,7 +125,7 @@ void __init setup_pit_timer(void)
  * to just read by itself. So use jiffies to emulate a free
  * running counter:
  */
-static cycle_t pit_read(void)
+static notrace cycle_t pit_read(void)
 {
 	unsigned long flags;
 	int count;
unchanged:
--- a/kernel/spinlock.c
+++ b/kernel/spinlock.c
@@ -76,7 +76,7 @@ void __lockfunc _read_lock(rwlock_t *lock)
 }
 EXPORT_SYMBOL(_read_lock);

-unsigned long __lockfunc _spin_lock_irqsave(spinlock_t *lock)
+unsigned long notrace __lockfunc _spin_lock_irqsave(spinlock_t *lock)
 {
 	unsigned long flags;

@@ -341,7 +341,7 @@ void __lockfunc _read_unlock(rwlock_t *lock)
 }
 EXPORT_SYMBOL(_read_unlock);

-void __lockfunc _spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags)
+void notrace __lockfunc _spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags)
 {
 	spin_release(&lock->dep_map, 1, _RET_IP_);
 	_raw_spin_unlock(lock);
unchanged:
--- a/lib/spinlock_debug.c
+++ b/lib/spinlock_debug.c
@@ -148,7 +148,7 @@ int _raw_spin_trylock(spinlock_t *lock)
 	return ret;
 }

-void _raw_spin_unlock(spinlock_t *lock)
+void notrace _raw_spin_unlock(spinlock_t *lock)
 {
 	debug_spin_unlock(lock);
 	__raw_spin_unlock(&lock->raw_lock);
only in patch2:
unchanged:
--- linux/arch/x86/kernel/tsc_32.c
+++ linux-2.6.24-rc3logfs/arch/x86/kernel/tsc_32.c 2007-12-02 15:21:15.0 +0100
@@ -92,7 +92,7 @@
 /*
  * Scheduler clock - returns current time in nanosec units.
  */
-unsigned long long native_sched_clock(void)
+unsigned notrace long long native_sched_clock(void)
 {
 	unsigned long long this_offset;

only in patch2:
unchanged:
--- linux/kernel/lockdep.c
+++ linux-2.6.24-rc3logfs/kernel/lockdep.c 2007-12-02 15:21:16.0 +0100
@@ -139,7 +139,7 @@
 	return i;
 }

-static void lock_time_inc(struct lock_time *lt, s64 time)
+static notrace void lock_time_inc(struct lock_time *lt, s64 time)
 {
 	if (time > lt->max)
 		lt->max = time;
@@ -198,7 +198,7 @@
 	memset(class->contention_point, 0, sizeof(class->contention_point));
 }

-static struct lock_class_stats *get_lock_stats(struct lock_class *class)
+static notrace struct lock_class_stats *get_lock_stats(struct lock_class *class)
 {
 	return &get_cpu_var(lock_stats)[class - lock_classes];
 }
@@ -208,7 +208,7 @@
 	put_cpu_var(lock_stats);
 }

-static void lock_release_holdtime(struct held_lock *hlock)
+static notrace void lock_release_holdtime(struct held_lock *hlock)
 {
 	struct lock_class_stats *stats;
 	s64 holdtime;
@@ -2872,7 +2872,7 @@
 }
 EXPORT_SYMBOL_GPL(lock_contended);

-void lock_acquired(struct lockdep_map *lock)
+void notrace lock_acquired(struct lockdep_map *lock)
 {
 	unsigned long flags;

Re: [BUG] Strange 1-second pauses during Resume-from-RAM
On Sun, 2 December 2007 16:47:46 +0100, Ingo Molnar wrote:

well what does the trace say, where do the delays come from? To get a quick overview you can make tracing lighter weight by doing:

  echo 0 > /proc/sys/kernel/mcount_enabled
  echo 1 > /proc/sys/kernel/trace_syscalls

I mistyped and did

  echo 1 > /proc/sys/kernel/mcount_enabled

Result looked like a livelock and finally convinced me to abandon the latency tracer. Sorry, but it appears to be the right tool for the wrong job.

Jörn

--
They laughed at Galileo. They laughed at Copernicus. They laughed at Columbus. But remember, they also laughed at Bozo the Clown. -- unknown
Re: Kernel Development Objective-C
On Sat, 1 December 2007 21:59:31 +0200, Avi Kivity wrote:

Object orientation in C leaves much to be desired; see the huge number of void pointers and container_of()s in the kernel.

While true, this isn't such a bad problem. A language really sucks when it tries to disallow something useful. Back in university I was forced to write system software in Pascal. Simple pointer arithmetic became a 5-line piece of code. IMO the main advantage of C is simply that it doesn't get in the way.

Jörn

--
But this is not to say that the main benefit of Linux and other GPL software is lower-cost. Control is the main benefit--cost is secondary. -- Bruce Perens
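For readers who have not met the container_of() idiom mentioned above, here is a minimal user-space sketch. The macro is a simplified form without the kernel's type checking, and the structure and function names (list_node, inode_like, ino_from_node) are hypothetical examples, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified variant: step back from a member to its enclosing object. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* A generic link that knows nothing about the object embedding it. */
struct list_node {
	struct list_node *next;
};

/* Hypothetical object that embeds the generic link. */
struct inode_like {
	int ino;
	struct list_node link;
};

/* Recover the enclosing object from a pointer to its embedded link. */
int ino_from_node(struct list_node *node)
{
	struct inode_like *obj = container_of(node, struct inode_like, link);

	return obj->ino;
}
```

Generic list code only ever sees struct list_node, and the pointer arithmetic that gets back to the outer object is a one-liner, which is the kind of thing the reply says C does not get in the way of.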
Re: [BUG] Strange 1-second pauses during Resume-from-RAM
On Sun, 2 December 2007 21:07:22 +0100, Ingo Molnar wrote:
* Jörn Engel [EMAIL PROTECTED] wrote:

Result looked like a livelock and finally convinced me to abandon the latency tracer. Sorry, but it appears to be the right tool for the wrong job.

hm, we routinely use it in -rt to capture "what on earth is happening" incidents. The snippet below is a random snippet from a trace that i've just captured, with mcount enabled. It seems to work fine here, with and without mcount. (pit clocksource is almost never used, that's why you had those early problems.) oprofile helps if you can reliably reproduce the slowdown in a loop or for a long amount of time, with lots of CPU utilization - and then it's also lower overhead. The tracer can be used to capture rare or complex events, and gives the full flow control and what is happening within the kernel.

Such a trace would be useful indeed. But so far the patch has only given me grief and nothing remotely like useful output. Maybe I should simply use the complete -rt patch instead of debugging the broken-out latency-tracer patch.

Jörn

--
Mundie uses a textbook tactic of manipulation: start with some reasonable talk, and lead the audience to an unreasonable conclusion. -- Bruce Perens
Re: [BUG] Strange 1-second pauses during Resume-from-RAM
On Sun, 2 December 2007 21:45:59 +0100, Ingo Molnar wrote:

to capture a 1 second trace of what the system is doing. I think your troubles are due to running it within a qemu guest - that is not a typical utilization so you are in uncharted waters.

Looks like it. Guess I'll switch to something else for the moment.

Jörn

--
Linux is more the core point of a concept that surrounds open source which, in turn, is based on a false concept. This concept is that people actually want to look at source code. -- Rob Enderle
Re: [BUG] Strange 1-second pauses during Resume-from-RAM
On Sun, 2 December 2007 21:45:59 +0100, Ingo Molnar wrote:

to capture that trace i did not use -rt, i just patched latest -git with:

  http://people.redhat.com/mingo/latency-tracing-patches/latency-tracing-v2.6.24-rc3.combo.patch

(this has your fixes included already)

have done:

  echo 1 > /proc/sys/kernel/mcount_enabled

and have run:

  ./trace-cmd sleep 1 > trace.txt

  http://people.redhat.com/mingo/latency-tracing-patches/trace-cmd.c

to capture a 1 second trace of what the system is doing. I think your troubles are due to running it within a qemu guest - that is not a typical utilization so you are in uncharted waters.

Maybe one more thing: can you send me the config you used for the setup above? I'd like to know whether qemu or my config is to blame.

Jörn

--
Eighty percent of success is showing up. -- Woody Allen
Re: [BUG] Strange 1-second pauses during Resume-from-RAM
On Sun, 2 December 2007 22:19:00 +0100, Ingo Molnar wrote:
* Jörn Engel [EMAIL PROTECTED] wrote:

Maybe one more thing: can you send me the config you used for the setup above? I'd like to know whether qemu or my config is to blame.

sure - attached.

After an eternity of compile time, this config does generate some useful output. qemu is not to blame.

Jörn

--
Joern's library part 9: http://www.scl.ameslab.gov/Publications/Gus/TwelveWays.html
Re: [BUG] Strange 1-second pauses during Resume-from-RAM
On Mon, 3 December 2007 01:57:02 +0100, Jörn Engel wrote:

After an eternity of compile time, this config does generate some useful output. qemu is not to blame.

Or is it? The output definitely looks suspicious. Large amounts of code get processed within a microsecond, while update_wall_time() appears to cause huge delays every time it is called:

  http://logfs.org/~joern/trace

Does this output make sense or does it rather indicate some sloppiness wrt. time in the qemu virtual machine?

Jörn

--
tglx1 thinks that joern should get a (TM) for Thinking Is Hard -- Thomas Gleixner
Re: solid state drive access and context switching
On Tue, 4 December 2007 13:54:21 -0800, Jared Hulbert wrote:

Maybe I'm missing something but I don't see it. We want a block interface for these devices, we just need a faster slimmer interface. Maybe a new mtdblock interface that doesn't do erase would be the place for?

Doesn't do erase? MTD has to learn almost all tricks from the block layer, as devices are becoming high-latency high-bandwidth, compared to what MTD was designed for. In order to get any decent performance, we need asynchronous operations, request queues and caching. The only useful advantage MTD does have over block devices is an _explicit_ erase operation. Did you mean "doesn't do _implicit_ erase"?

Jörn

--
It's just what we asked for, but not what we want! -- anonymous
Re: [x86] kernel/audit.c cleanup according to checkpatch.pl
On Thu, 3 January 2008 14:19:25 +0300, Cyrill Gorcunov wrote:

@@ -232,7 +232,8 @@ void audit_log_lost(const char *message)

 	if (print) {
 		printk(KERN_WARNING
-		       "audit: audit_lost=%d audit_rate_limit=%d audit_backlog_limit=%d\n",
+		       "audit: audit_lost=%d audit_rate_limit=%d "
+		       "audit_backlog_limit=%d\n",
 		       atomic_read(&audit_lost),
 		       audit_rate_limit,
 		       audit_backlog_limit);

This hunk is a bit questionable. It can easily deceive a reader into assuming that two separate lines are printed, and it sometimes defeats grepping for printk output to find the code generating the message.

Rest looks good to me.

Jörn

--
He that composes himself is wiser than he that composes a book. -- B. Franklin
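The objection to the split format string can be checked in plain C. This is a small sketch, assuming nothing beyond standard C string-literal concatenation; the variable and function names (fmt_single, fmt_split, count_newlines) are made up for illustration.

```c
#include <assert.h>
#include <string.h>

/* Format string kept on one source line: the whole message can be grepped. */
const char *fmt_single =
	"audit: audit_lost=%d audit_rate_limit=%d audit_backlog_limit=%d\n";

/* Same string split over two source lines; the compiler joins the literals. */
const char *fmt_split =
	"audit: audit_lost=%d audit_rate_limit=%d "
	"audit_backlog_limit=%d\n";

/* Count how many output lines a format string terminates. */
int count_newlines(const char *s)
{
	int n = 0;

	for (; *s; s++)
		if (*s == '\n')
			n++;
	return n;
}
```

Both variables hold the same bytes and end exactly one output line, so the printed message is unchanged. What changes is the source: a search for the text "audit_rate_limit=%d audit_backlog_limit=%d" as it appears in the log no longer matches any single source line.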
[PATCH] Claim maintainership for block2mtd and update email addresses
I have been prime author and maintainer of block2mtd from day one, but neither MAINTAINERS nor the module source makes this fact clear. And while I'm at it, update my email addresses tree-wide, as the old address currently bounces and change my name to joern as unicode will likely continue to cause trouble until the end of this century.

Signed-off-by: Jörn Engel [EMAIL PROTECTED]
---

 MAINTAINERS                     |   10 ++++++++--
 drivers/mtd/devices/block2mtd.c |    4 ++--
 drivers/mtd/devices/phram.c     |    4 ++--
 drivers/mtd/maps/mtx-1_flash.c  |    2 +-
 scripts/checkstack.pl           |    2 +-
 5 files changed, 14 insertions(+), 8 deletions(-)

--- linux-2.6.24-rc3logfs/drivers/mtd/devices/block2mtd.c~block2mtd_maintainer 2007-08-08 19:30:04.0 +0200
+++ linux-2.6.24-rc3logfs/drivers/mtd/devices/block2mtd.c 2008-01-06 14:22:57.0 +0100
@@ -4,7 +4,7 @@
  * block2mtd.c - create an mtd from a block device
  *
  * Copyright (C) 2001,2002 Simon Evans [EMAIL PROTECTED]
- * Copyright (C) 2004-2006 Jörn Engel [EMAIL PROTECTED]
+ * Copyright (C) 2004-2006 Joern Engel [EMAIL PROTECTED]
  *
  * Licence: GPL
  */
@@ -485,5 +485,5 @@ module_init(block2mtd_init);
 module_exit(block2mtd_exit);

 MODULE_LICENSE("GPL");
-MODULE_AUTHOR("Simon Evans [EMAIL PROTECTED] and others");
+MODULE_AUTHOR("Joern Engel [EMAIL PROTECTED]");
 MODULE_DESCRIPTION("Emulate an MTD using a block device");
--- linux-2.6.24-rc3logfs/MAINTAINERS~block2mtd_maintainer 2007-11-30 13:59:51.0 +0100
+++ linux-2.6.24-rc3logfs/MAINTAINERS 2008-01-06 14:21:49.0 +0100
@@ -835,6 +835,12 @@ L: linux-kernel@vger.kernel.org
 T: git kernel.org:/pub/scm/linux/kernel/git/axboe/linux-2.6-block.git
 S: Maintained

+BLOCK2MTD DRIVER
+P: Joern Engel
+M: [EMAIL PROTECTED]
+L: [EMAIL PROTECTED]
+S: Maintained
+
 BLUETOOTH SUBSYSTEM
 P: Marcel Holtmann
 M: [EMAIL PROTECTED]
@@ -2985,8 +2991,8 @@ L: [EMAIL PROTECTED]
 S: Maintained

 PHRAM MTD DRIVER
-P: Jörn Engel
-M: [EMAIL PROTECTED]
+P: Joern Engel
+M: [EMAIL PROTECTED]
 L: [EMAIL PROTECTED]
 S: Maintained

--- linux-2.6.24-rc3logfs/drivers/mtd/devices/phram.c~block2mtd_maintainer 2007-08-08 19:30:04.0 +0200
+++ linux-2.6.24-rc3logfs/drivers/mtd/devices/phram.c 2008-01-06 14:22:30.0 +0100
@@ -2,7 +2,7 @@
  * $Id: phram.c,v 1.16 2005/11/07 11:14:25 gleixner Exp $
  *
  * Copyright (c) Jochen Schäuble [EMAIL PROTECTED]
- * Copyright (c) 2003-2004 Jörn Engel [EMAIL PROTECTED]
+ * Copyright (c) 2003-2004 Joern Engel [EMAIL PROTECTED]
  *
  * Usage:
  *
@@ -299,5 +299,5 @@ module_init(init_phram);
 module_exit(cleanup_phram);

 MODULE_LICENSE("GPL");
-MODULE_AUTHOR("Jörn Engel [EMAIL PROTECTED]");
+MODULE_AUTHOR("Joern Engel [EMAIL PROTECTED]");
 MODULE_DESCRIPTION("MTD driver for physical RAM");
--- linux-2.6.24-rc3logfs/scripts/checkstack.pl~block2mtd_maintainer 2007-11-15 20:52:00.0 +0100
+++ linux-2.6.24-rc3logfs/scripts/checkstack.pl 2008-01-06 14:28:14.0 +0100
@@ -2,7 +2,7 @@

 # Check the stack usage of functions
 #
-# Copyright Joern Engel [EMAIL PROTECTED]
+# Copyright Joern Engel [EMAIL PROTECTED]
 # Inspired by Linus Torvalds
 # Original idea maybe from Keith Owens
 # s390 port and big speedup by Arnd Bergmann [EMAIL PROTECTED]
--- linux-2.6.24-rc3logfs/drivers/mtd/maps/mtx-1_flash.c~block2mtd_maintainer 2007-08-08 19:30:04.0 +0200
+++ linux-2.6.24-rc3logfs/drivers/mtd/maps/mtx-1_flash.c 2008-01-06 14:28:44.0 +0100
@@ -4,7 +4,7 @@
  * $Id: mtx-1_flash.c,v 1.2 2005/11/07 11:14:27 gleixner Exp $
  *
  * (C) 2005 Bruno Randolf [EMAIL PROTECTED]
- * (C) 2005 Jörn Engel [EMAIL PROTECTED]
+ * (C) 2005 Joern Engel [EMAIL PROTECTED]
  *
  */

Re: [PATCH] block2mtd lockdep_init_map warning
On Sun, 6 January 2008 14:11:47 -0500, Erez Zadok wrote:

The problem appears to be an interaction of two components--module loading and lockdep--that's perhaps why it wasn't given enough attention.

Correct. For modules lockdep depends on initializations done after module_init has finished. However block2mtd is an odd sod that can call into lockdep code during module_init, causing the bug you noticed.

Several solutions are possible. Modules could get two initcalls, one to decide whether module load should get aborted, the other run later, after the remaining module initializations are done. Or the module loader could always do the initializations and revoke them later, if module_init failed. But I personally am too unfamiliar with the module code to trust my judgement and have yet to receive feedback.

Even you seem to ignore my mails and not even Cc: me later on. I must have done something really horrible in my last life, it seems.

Jörn

--
A quarrel is quickly settled when deserted by one party; there is no battle unless there be two. -- Seneca
Re: fix typo in mtd kconfig
David, will you take this patch?

Signed-off-by: Dave Jones [EMAIL PROTECTED]
Signed-off-by: Joern Engel [EMAIL PROTECTED]

diff --git a/drivers/mtd/nand/Kconfig b/drivers/mtd/nand/Kconfig
index 8f9c3ba..246d451 100644
--- a/drivers/mtd/nand/Kconfig
+++ b/drivers/mtd/nand/Kconfig
@@ -300,7 +300,7 @@ config MTD_NAND_PLATFORM
 	  via platform_data.

 config MTD_ALAUDA
-	tristate "MTD driver for Olympus MAUSB-10 and Fijufilm DPC-R1"
+	tristate "MTD driver for Olympus MAUSB-10 and Fujifilm DPC-R1"
 	depends on MTD_NAND && USB
 	help
 	  These two (and possibly other) Alauda-based cardreaders for

Jörn

--
It does not matter how slowly you go, so long as you do not stop. -- Confucius
Re: [BLOCK2MTD] WARNING: at kernel/lockdep.c:2331 lockdep_init_map()
On Fri, 19 October 2007 13:53:40 -0400, Erez Zadok wrote:

I've been having this problem for some time with mtd, which I use to mount jffs2 images (for unionfs testing). I've seen it in several recent major kernels, including 2.6.24. Here's the sequence of ops I perform:

Since when roughly? 2.6.20ish? Before?

# cp jffs2-empty.img /tmp/foo
# losetup /dev/loop0 /tmp/foo
# modprobe mtdblock
# modprobe block2mtd block2mtd=/dev/loop0,128ki
# mount -t jffs2 /dev/mtdblock0 /n/lower/b0

Side note: you don't need mtdblock:

# cp jffs2-empty.img /tmp/foo
# losetup /dev/loop0 /tmp/foo
# modprobe block2mtd block2mtd=/dev/loop0,128ki
# mount -t jffs2 mtd0 /n/lower/b0

It doesn't really hurt, 'tis just superfluous.

The jffs2-empty.img is a small jffs2 image, of an empty directory, created w/ the jffs2 utils. At the point I modprobe block2mtd, I get the following lockdep warning and a BUG message:

BUG: key f88e1340 not in .data!
WARNING: at kernel/lockdep.c:2331 lockdep_init_map()
 [c0102bc2] show_trace_log_lvl+0x1a/0x2f
 [c0103692] show_trace+0x12/0x14
 [c01037b2] dump_stack+0x15/0x17
 [c0125432] lockdep_init_map+0x94/0x3e4
 [c0125001] debug_mutex_init+0x2c/0x3c
 [c01210d4] __mutex_init+0x38/0x40
 [f88e01d3] 0xf88e01d3
 [c011dda7] parse_args+0x123/0x200
 [c012b725] sys_init_module+0xdd0/0x122c
 [c0102586] sysenter_past_esp+0x5f/0x91
 ===
block2mtd: mtd0: [d: /dev/loop0] erase_size = 128KiB [131072]
block2mtd: version $Revision: 1.30 $

Could be my problem. I'll see if I can reproduce it. Can you send me your .config or a link to it?

Jörn

--
/* Keep these two variables together */ int bar;
Re: [PATCH] eccbuf is statically defined and always evaluate to true
On Fri, 19 October 2007 19:26:35 +0200, Samuel Tardieu wrote:

---
 drivers/mtd/devices/doc2000.c     |    4 ++--
 drivers/mtd/devices/doc2001plus.c |    2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

Acked-by: Joern Engel [EMAIL PROTECTED]

I assume you don't actually use this driver and just ran make randconfig or allyesconfig or so.

Jörn

--
Science is like sex: sometimes something useful comes out, but that is not the reason we are doing it. -- Richard Feynman
Re: [BLOCK2MTD] WARNING: at kernel/lockdep.c:2331 lockdep_init_map()
On Fri, 19 October 2007 16:04:10 -0400, Erez Zadok wrote:
In message [EMAIL PROTECTED], Jörn Engel writes:

Since when roughly? 2.6.20ish? Before?

Yeah, I guess around that time. If you want, I could go back and test each of my backports and see if it has the lockdep message or not.

That's ok. Just wanted to get a rough idea.

Side note: you don't need mtdblock:

# cp jffs2-empty.img /tmp/foo
# losetup /dev/loop0 /tmp/foo
# modprobe block2mtd block2mtd=/dev/loop0,128ki
# mount -t jffs2 mtd0 /n/lower/b0

It doesn't really hurt, 'tis just superfluous.

Neat. Curious, but where does mtd0 come from then? It's not in my /dev (which uses devfs on an FC6 system).

JFFS2 interprets that itself. The only reason why JFFS2 needed a block device was to determine the minor number of the mtd underneath. So code was added to find the correct mtd from "mtd0" or "mtd:some_name" instead. I believe you can even disable CONFIG_BLOCK now. And the code itself was moved to drivers/mtd/mtdsuper.c fairly recently.

Jörn

--
Joern's library part 2: http://www.art.net/~hopkins/Don/unix-haters/tirix/embarrassing-memo.html
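How a mount source such as mtd0 or mtd:some_name could be told apart from a block device path can be sketched in user space. This is not the code from drivers/mtd/mtdsuper.c; the type and function names (mtd_ref, parse_mtd_ref) are hypothetical and only illustrate the two spellings mentioned above.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

enum mtd_ref_kind { MTD_REF_NONE, MTD_REF_NUMBER, MTD_REF_NAME };

/* Result of classifying a mount source string (hypothetical type). */
struct mtd_ref {
	enum mtd_ref_kind kind;
	unsigned long number;   /* valid for MTD_REF_NUMBER */
	const char *name;       /* valid for MTD_REF_NAME, points into input */
};

/* Classify "mtd<N>" and "mtd:<name>"; anything else is left alone. */
struct mtd_ref parse_mtd_ref(const char *dev)
{
	struct mtd_ref ref = { MTD_REF_NONE, 0, NULL };
	char *end;

	if (strncmp(dev, "mtd", 3) != 0)
		return ref;

	if (dev[3] == ':' && dev[4] != '\0') {
		/* "mtd:some_name": look the device up by name */
		ref.kind = MTD_REF_NAME;
		ref.name = dev + 4;
	} else if (isdigit((unsigned char)dev[3])) {
		/* "mtd0": look the device up by number */
		unsigned long n = strtoul(dev + 3, &end, 10);

		if (*end == '\0') {
			ref.kind = MTD_REF_NUMBER;
			ref.number = n;
		}
	}
	return ref;
}
```

A string such as /dev/mtdblock0 comes back as MTD_REF_NONE, at which point a caller would presumably fall back to treating it as a block device path.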
Re: [BLOCK2MTD] WARNING: at kernel/lockdep.c:2331 lockdep_init_map()
On Fri, 19 October 2007 20:31:29 +0200, Peter Zijlstra wrote:

BUG: key f88e1340 not in .data!
WARNING: at kernel/lockdep.c:2331 lockdep_init_map()
 [c0102bc2] show_trace_log_lvl+0x1a/0x2f
 [c0103692] show_trace+0x12/0x14
 [c01037b2] dump_stack+0x15/0x17
 [c0125432] lockdep_init_map+0x94/0x3e4
 [c0125001] debug_mutex_init+0x2c/0x3c
 [c01210d4] __mutex_init+0x38/0x40
 [f88e01d3] 0xf88e01d3
 [c011dda7] parse_args+0x123/0x200
 [c012b725] sys_init_module+0xdd0/0x122c
 [c0102586] sysenter_past_esp+0x5f/0x91
 ===
block2mtd: mtd0: [d: /dev/loop0] erase_size = 128KiB [131072]
block2mtd: version $Revision: 1.30 $

Someone stuck a key object in non static storage. That breaks lockdep, don't do that :-) Is the mutex_init() done from a function tagged with __init?

Root cause is an ordering problem in module loading. Code flow is roughly this:

sys_init_module
`- load_module
:  `- parse_args
:     `- block2mtd_setup
:        `- __mutex_init
:           `- lockdep_init_map
:              `- static_obj
:                 `- is_module_address
`- __link_module

is_module_address() would return something sane, if __link_module() had already been called. In fact, if the parameter is passed through /sys/module/block2mtd/parameters/block2mtd _after_ module load time, the exact same code works fine. Only when passing the parameter as a module parameter do we see this problem.

So what should be done? We could move parse_args() below __link_module(), but I'd guess such a change would break some other modules that depend on certain parameters or at least should fail to load with illegal parameters. Do such modules exist?

Or we could add some kind of parse_args_late() that is called after __link_module(), if requested by a module, and annotate block2mtd to prefer that version.

[ Adding Ingo on Cc:. Since block2mtd predates lockdep I found a bug in his code and not the other way around. ;) ]

Jörn

--
Do not stop an army on its way home. -- Sun Tzu
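The ordering problem described above can be modelled with a few lines of user-space C. This is a toy model under loud assumptions: all functions are hypothetical stand-ins named after the kernel functions they imitate, the real static_obj() check also accepts addresses in the kernel's own data section, and parse_args_late() exists only as the proposal made in the mail.

```c
#include <assert.h>
#include <stdbool.h>

static bool module_linked;      /* set by the model of __link_module() */
static int lockdep_warnings;    /* counts "key ... not in .data!" events */

/* Model of is_module_address(): only knows modules that are linked in. */
static bool toy_is_module_address(void)
{
	return module_linked;
}

/* Model of lockdep_init_map(): complains about an unknown static key. */
static void toy_lockdep_init_map(void)
{
	if (!toy_is_module_address())
		lockdep_warnings++;
}

/* Model of block2mtd_setup(): initialising a mutex registers a lock key. */
static void toy_param_setup(void)
{
	toy_lockdep_init_map();
}

/* Model of module load; parse_late selects the proposed late parsing. */
int load_module_model(bool param_at_load, bool parse_late)
{
	module_linked = false;
	lockdep_warnings = 0;

	if (param_at_load && !parse_late)
		toy_param_setup();      /* parse_args() runs before linking */
	module_linked = true;           /* __link_module() */
	if (param_at_load && parse_late)
		toy_param_setup();      /* hypothetical parse_args_late() */

	return lockdep_warnings;
}

/* Model of writing the parameter through sysfs after the module loaded. */
int sysfs_write_after_load(void)
{
	module_linked = true;
	lockdep_warnings = 0;
	toy_param_setup();
	return lockdep_warnings;
}
```

In this model only the combination of a load-time parameter and early parsing produces a warning, which matches the observation that the same setup code works when the parameter arrives through sysfs later.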
Re: [PATCH 2.6.24] block2mtd: removing a device and typo fixes
On Tue, 12 February 2008 13:47:51 +, Stephane Chazelas wrote:

this patch addresses a number of small issues mainly regarding the output made by this driver to dmesg:
- Some of the blkmtd's had not been changed to block2mtd which caused display problem
- the parse_err() macro was displaying block2mtd: twice

Fairly obvious fixes.

Also, one can add a block2mtd mtd device with things like:

  echo /dev/loop3,$((256*1024)) | sudo tee /sys/module/block2mtd/parameters/block2mtd

but individual mtds cannot be removed. You can only do a modprobe -r block2mtd to remove *all* the block2mtd mtds. This patch proposes to add the capability with:

  echo /dev/loop3,remove | sudo tee /sys/module/block2mtd/parameters/block2mtd

Sounds sane enough. But I do have some reservations about the implementation.

It would be best if you split the patch in two. One with the obvious stuff above and one for this.

The core of remove_device_by_name() is shared with block2mtd_exit(), so a common helper would be good. Your error handling is better, so let's keep that version.

And independently of your patch a mutex protecting the device list from simultaneous modifications would be good to have.

Side note: I may not have internet access until 19th or so.

Jörn

--
Rules of Optimization: Rule 1: Don't do it. Rule 2 (for experts only): Don't do it yet. -- M.A. Jackson
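The suggestion of a common helper shared by removal-by-name and module exit might look roughly like the user-space sketch below. It does not reproduce the driver: the list type and all function names (toy_dev, toy_add, toy_del, toy_remove_by_name, toy_remove_all) are hypothetical, and the mutex recommended above is deliberately left out to keep the example free of dependencies. A real driver would take that lock around every list walk.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical device record on a singly linked list. */
struct toy_dev {
	char name[32];
	struct toy_dev *next;
};

static struct toy_dev *dev_list;

/* Add a device to the front of the list; returns 0 on success. */
int toy_add(const char *name)
{
	struct toy_dev *d = calloc(1, sizeof(*d));

	if (!d)
		return -1;
	strncpy(d->name, name, sizeof(d->name) - 1);
	d->next = dev_list;
	dev_list = d;
	return 0;
}

/* Common helper: unlink and free the device that *link points at. */
static void toy_del(struct toy_dev **link)
{
	struct toy_dev *d = *link;

	*link = d->next;
	free(d);
}

/* Remove one device by name; returns -1 if no such device exists. */
int toy_remove_by_name(const char *name)
{
	struct toy_dev **link;

	for (link = &dev_list; *link; link = &(*link)->next) {
		if (strcmp((*link)->name, name) == 0) {
			toy_del(link);
			return 0;
		}
	}
	return -1;
}

/* Module-exit path: remove everything through the same helper. */
int toy_remove_all(void)
{
	int removed = 0;

	while (dev_list) {
		toy_del(&dev_list);
		removed++;
	}
	return removed;
}
```

Both paths funnel through toy_del(), so any teardown step or error handling added there applies to single removal and to module exit alike.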