from:"Jörn Engel"

Re: Bugreport: Kernel 2.4.x crash

2001-04-18 Thread Jörn Engel


Hi!

  I have no experience with kernel debugging, but so far, I have found
  no log entry giving me a hint and the screen is blank after the crash
 
 Could you disable console blanking (setterm -blank 0)?
 
 We really need a hint where it crashed.

Over the Easter weekend I took some time for testing. One IDE channel does
not work with DMA enabled, which is the bootup default. After about 30 seconds,
the channel is switched to PIO and the machine is running again.

Funny though: before, I could not return from console blanking or reach the
machine through the network. But as with any production system, I would rather
keep it running than spend downtime seeking the error.

Thank you all.

Jörn

-- 
Jörn Engel
mailto: [EMAIL PROTECTED]
http://wohnheim.fh-wedel.de/~joern
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Bugreport: Kernel 2.4.x crash

2001-04-03 Thread Jörn Engel


1. Kernel crash w/out error message or logfile entry

2. A file server with an ABIT Hotrod 66 (hpt366) controller will crash within
5-60 minutes after boot with a 2.4.x kernel. 2.2.x works fine. No other
exotic hardware. Another possibility might be Reiserfs, which I use for all
partitions except /.
I have no experience with kernel debugging, but so far, I have found no log
entry giving me a hint and the screen is blank after the crash. There might
have been some output before, but the machine is in the basement and too
important for excessive testing.
I have tried 2.4.2 and 2.4.3 once each.

3. ide, hpt366

4. 2.4.2, 2.4.3

5. -

6. -

7. All this information is taken from the running 2.2.18 Kernel.

7.1. sh /usr/src/linux/scripts/ver_linux
-- Versions installed: (if some fields are empty or look
-- unusual then possibly you have very old versions)
Linux belfast 2.2.18 #1 Fri Feb 23 14:47:14 CET 2001 i586 unknown
Kernel modules         2.4.2
Gnu C                  2.95.3
Gnu Make               3.79.1
Binutils               2.11.90.0.1
Linux C Library        2.2.2
Dynamic linker         ldd (GNU libc) 2.2.2
Procps                 2.0.7
Mount                  2.11b
Net-tools              2.05
Console-tools          0.2.3
Sh-utils               2.0.11
Modules Loaded         sb uart401 sound soundcore

7.2. cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 5
model           : 4
model name      : Pentium MMX
stepping        : 3
cpu MHz         : 200.459
fdiv_bug        : no
hlt_bug         : no
sep_bug         : no
f00f_bug        : yes
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr mce cx8 mmx
bogomips        : 399.76

7.3 cat /var/log/ksymoops/20010401164317.modules (2.4.3)
sb                      2128   0 (unused)
sb_lib                 33936   0 [sb]
uart401                 6352   0 [sb_lib]
sound                  56400   0 [sb_lib uart401]
soundcore               3792   5 [sb_lib sound]
raid1                  12784   0 (unused)
raid0                   3520   0 (unused)
md                     41056   0 [raid1 raid0]

7.4. cat /proc/ioports
0000-001f : dma1
0020-003f : pic1
0040-005f : timer
0060-006f : keyboard
0080-008f : dma page reg
00a0-00bf : pic2
00c0-00df : dma2
00f0-00ff : fpu
01f0-01f7 : ide0
0220-022f : soundblaster
02f8-02ff : serial(set)
0330-0333 : MPU-401 UART
03c0-03df : vga+
03e8-03ef : serial(auto)
03f6-03f6 : ide0
03f8-03ff : serial(set)
6100-6107 : ide2
6202-6202 : ide2
6400-6407 : ide3
6502-6502 : ide3
6700-677f : eth0
f000-f007 : ide0
f008-f00f : ide1

7.5 lspci -vvv
00:00.0 Host bridge: Intel Corporation 430HX - 82439HX TXC [Triton II] (rev 03)
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR-
	Latency: 32

00:07.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] (rev 01)
	Control: I/O+ Mem+ BusMaster+ SpecCycle+ MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 0

00:07.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II] (prog-if 80 [Master])
	Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 32
	Region 4: I/O ports at f000

00:08.0 Unknown mass storage controller: Triones Technologies, Inc. HPT366 (rev 01)
	Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 248 (2000ns min, 2000ns max), cache line size 08
	Interrupt: pin A routed to IRQ 11
	Region 0: I/O ports at 6100
	Region 1: I/O ports at 6200
	Region 4: I/O ports at 6300

00:08.1 Unknown mass storage controller: Triones Technologies, Inc. HPT366 (rev 01)
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 248 (2000ns min, 2000ns max), cache line size 08
	Interrupt: pin A routed to IRQ 11
	Region 0: I/O ports at 6400
	Region 1: I/O ports at 6500
	Region 4: I/O ports at 6600

00:0a.0 VGA compatible controller: S3 Inc. Trio 64V2/DX or /GX (rev 16) (prog-if 00 [VGA])
	Subsystem: Elsa AG: Unknown device 0935
	Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Interrupt:

Re: [RFC] MTD driver for MMC cards

2007-04-15 Thread Jörn Engel

On Mon, 16 April 2007 01:33:17 +0200, Arnd Bergmann wrote:
 
 There is also still some need for performance testing. Jörn
 brought up the point that if a specific card can't have multiple
 open erase blocks simultaneously, it's rather pointless for
 logfs. It might still be useful to use jffs2 on those cards,
 because AFAIK that only writes to one erase block at any
 time.

This appears to be a problem for practically all consumer-available
flash media.  They spend a lot of effort trying to hide any flash
properties from their users.  And while this is a decent strategy for
FAT, ext3, ntfs and similar, it is actually very inefficient for a flash
filesystem.

After talking to several manufacturers, most seemed to be fairly
open-minded towards supporting an alternate interface with raw flash
access.  So much for the good news.  Bad news is that such an alternate
interface still needs to be defined.

Jörn

-- 
Courage is not the absence of fear, but rather the judgement that
something else is more important than fear.
-- Ambrose Redmoon

Re: ZFS with Linux: An Open Plea

2007-04-16 Thread Jörn Engel

On Mon, 16 April 2007 17:46:50 +0200, Tomasz Kłoczko wrote:
 On Mon, 16 Apr 2007, Christoph Hellwig wrote:
 
 Numbers, please.  So far in all interesting benchmarks it actually
 was slower.  But when they're faster than XFS somewhere I'd definitely
 be interested in looking at why this is true and if possible and
 important enough fix it.

Christoph, could you show some numbers as well?  While I usually trust
your opinion, I have yet to see any substantial argument against ZFS
from your side.

 http://cmynhier.blogspot.com/2006/05/zfs-io-reordering-benchmark.html

http://blogs.sun.com/bill/#zfs_vs_the_benchmark

If you read closely you may notice that ZFS had relatively little to do
with read performance under heavy write load.  ZFS simply has some fancy
I/O scheduling code that in particular deals with deadlines.  The Linux
equivalent appears to be CONFIG_IOSCHED_DEADLINE.  But the quoted
benchmark does not mention which scheduler was used for Linux.
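
One way to close that documentation gap would be to record the active I/O
scheduler next to the benchmark numbers. A minimal sketch, assuming the
machine exposes the scheduler list through sysfs (typically
/sys/block/<dev>/queue/scheduler, with the active entry in square brackets).
The helper name active_scheduler() is made up for illustration and is not
part of any kernel or libc API:

```c
#include <stddef.h>
#include <string.h>

/*
 * Extract the active scheduler from a line such as
 * "noop anticipatory deadline [cfq]".  Returns 0 on success and -1 if
 * no bracketed entry is found or it does not fit into out.
 * Hypothetical helper for illustration only.
 */
static int active_scheduler(const char *line, char *out, size_t outlen)
{
    const char *lb = strchr(line, '[');
    const char *rb = lb ? strchr(lb, ']') : NULL;
    size_t len;

    if (!lb || !rb)
        return -1;
    len = (size_t)(rb - lb - 1);
    if (len + 1 > outlen)
        return -1;
    memcpy(out, lb + 1, len);
    out[len] = '\0';
    return 0;
}
```

Feeding it the line read from the sysfs file and logging the result with the
benchmark output would at least make the runs comparable.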

So unless the benchmark is redone and properly documented, its numbers
are fairly worthless.  Bummer.

 http://cmynhier.blogspot.com/2006/05/zfs-benchmarking.html

The company I work for would probably balk if I put that script here

No publicly available benchmark.  So even if a third party wanted to,
it couldn't recreate the benchmark.  Again, fairly worthless.


So by my count, neither side has shown any worthwhile numbers.  Whether
ZFS performance is better or worse is anyone's guess.

Jörn

-- 
Simplicity is prerequisite for reliability.
-- Edsger W. Dijkstra

Re: [00/17] Large Blocksize Support V3

2007-04-24 Thread Jörn Engel

On Tue, 24 April 2007 15:21:05 -0700, [EMAIL PROTECTED] wrote:
 
 This patchset modifies the Linux kernel so that larger block sizes than
 page size can be supported. Larger block sizes are handled by using
 compound pages of an arbitrary order for the page cache instead of
 single pages with order 0.

I like to see this.

 2. 32/64k blocksize is also used in flash devices. Same issues.

Actually most chips I encounter these days already have 128KiB.  And
some people seem to do some kind of raid-0 in the drivers to increase
bandwidth.  FS-visible blocksize is also increased by that.

 Unsupported
 - Mmapping blocks larger than page size

Bummer.  Can this change in the future?

 Issues:
 - There are numerous places where the kernel can no longer assume that the
   page cache consists of PAGE_SIZE pages that have not been fixed yet.
 - Defrag warning: The patch set can fragment memory very fast.
   It is likely that Mel Gorman's anti-frag patches and some more
   work by him on defragmentation may be needed if one wants to use
   super sized pages.
   If you run a 2.6.21 kernel with this patch and start a kernel compile
   on a 4k volume with a concurrent copy operation to a 64k volume on
   a system with only 1 Gig then you will go boom (ummm no ... OOM) fast.
   How well Mel's antifrag/defrag methods address this issue still has to
   be seen.

only 1 Gig :)

With my LogFS hat on, I don't care too much whether data is cached in
terms of pages or blocks.  What matters to me most is to get fed
blocksize'd chunks on writeback and be able to read blocksize'd chunks.
Compressing 64KiB at a time gives somewhere around 10% (don't remember
exact number) better compression when compared to 4KiB.  JFFS2 can
benefit from this as well.

That should also be sufficient for cross-platform compatibility,
shouldn't it?

Better performance for the pagecache is also nice to have, no doubt.
But if system stability remains an issue, I'd rather keep slow and
stable.

Jörn

-- 
More computing sins are committed in the name of efficiency (without
necessarily achieving it) than for any other single reason - including
blind stupidity.
-- W. A. Wulf

Re: Documenting MS_RELATIME

2007-02-12 Thread Jörn Engel

On Mon, 12 February 2007 18:49:39 +0100, Jan Engelhardt wrote:
 On Feb 12 2007 10:40, Dave Jones wrote:
 
 The one problem with noatime is that mutt's 'new mail arrived' breaks
 
 Just why doesn't it use mtime then to check for New Mail Arrived, like
 bash does?

Just a guess: because it has to compare the time?

Bash can simply compare mtime of (single) mailbox with time of last
login.  Mutt would have to compare mtime of (many) mailboxes with...
I believe with atime of mailboxes.
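
The two comparisons guessed at above can be sketched in a few lines. This
assumes the common mbox convention that unread mail shows up as an mtime
newer than the atime; it is not taken from mutt's or bash's source, and both
function names are made up for illustration:

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <time.h>

/*
 * Assumed per-mailbox heuristic: the mailbox holds unread mail when it
 * was modified after it was last read (mtime newer than atime).
 */
static bool mailbox_has_new_mail(const struct stat *st)
{
    return st->st_mtime > st->st_atime;
}

/* Single-reference check: compare mtime against one fixed point in time. */
static bool modified_since(const struct stat *st, time_t reference)
{
    return st->st_mtime > reference;
}
```

The first variant needs no stored state per mailbox, which may be why it is
attractive with many mailboxes, but it depends on atime being updated on
every read.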

Jörn

-- 
Joern's library part 1:
http://lwn.net/Articles/2.6-kernel-api/

Re: [PATCH x86 for review III] [10/29] i386: don't include bugs.h

2007-02-12 Thread Jörn Engel

On Mon, 12 February 2007 17:51:30 +0100, Andi Kleen wrote:
 
 From: Andrew Morton [EMAIL PROTECTED]
 
 That stupid non-inlined-static function in bugs.h causes:
 
 include/asm/bugs.h:186: warning: 'check_bugs' defined but not used
 
 But fortunately the include isn't needed.
 
 Cc: Andi Kleen [EMAIL PROTECTED]
 Signed-off-by: Andrew Morton [EMAIL PROTECTED]
 Signed-off-by: Andi Kleen [EMAIL PROTECTED]
 
 ---
 
  arch/i386/kernel/alternative.c |1 -
  1 file changed, 1 deletion(-)
 
 Index: linux/arch/i386/kernel/alternative.c
 ===
 --- linux.orig/arch/i386/kernel/alternative.c
 +++ linux/arch/i386/kernel/alternative.c
 @@ -4,7 +4,6 @@
  #include <linux/list.h>
  #include <asm/alternative.h>
  #include <asm/sections.h>
 -#include <asm/bugs.h>
  
  static int no_replacement= 0;
  static int smp_alt_once  = 0;

Didn't your patchset also include a near-identical patch from Adrian
Bunk (with - and + exchanged)?

Jörn

-- 
Courage is not the absence of fear, but rather the judgement that
something else is more important than fear.
-- Ambrose Redmoon

Re: SATA-performance: Linux vs. FreeBSD

2007-02-13 Thread Jörn Engel

On Tue, 13 February 2007 11:27:58 +0000, Alan wrote:
 
 isn't yet a heavily optimised libata path. Secondly erase block size
 matters with flash drives so the bigger each I/O the better erase block
 behaviour we should get.

Although that should max out somewhere between 16KiB and 128KiB,
depending on the chips being used.

Jörn

-- 
If you're willing to restrict the flexibility of your approach,
you can almost always do something better.
-- John Carmack

Re: SATA-performance: Linux vs. FreeBSD

2007-02-13 Thread Jörn Engel

On Tue, 13 February 2007 11:29:18 +0100, Martin A. Fink wrote:
 
 Please Read Carefully! I talk about flash disk, not normal harddisks. There 
 are no mechanical parts in flash disks, only flash memory. And therefore 
 48MB/s is excellent (compared to all other available disks)
 
 [...]
 
 Well. The testdrive has 27GB. The final drive will have 225 GB. And there
 will be 3 cameras and thus 3 disks. This means we talk about 140 MB/s for
 around 90 minutes.

Do you have any numbers on the performance for the final drive?  Single
flash chips are relatively slow, the high bandwidth is usually achieved
by writing in parallel to several of them.  With the bigger drive you
get more chips and the manufacturer could run more of them in parallel.

Jörn

-- 
With a PC, I always felt limited by the software available. On Unix,
I am limited only by my knowledge.
-- Peter J. Schoenster

Re: GPL vs non-GPL device drivers

2007-02-15 Thread Jörn Engel

On Thu, 15 February 2007 00:40:31 -0800, v j wrote:
 
 Oh, I am sorry. Seems like the German courts have spoken. I am not
 sure about what, but they have spoken. Sorry for the confusion.

In short, there seem to be two classes of closed-source drivers:

1. ATI and nVidia.  Both are well-known, in both cases they seem to
avoid the legally important aspect of shipping their driver along with a
kernel and they seem to be legally in relatively safe water.  At least I
haven't heard about them getting sued yet.

2. The embedded companies.  By the very nature of selling an embedded
device they are shipping their drivers along with a kernel and seem to
be in very shallow water.  Dozens of them have received letters from
lawyers and didn't even dare go to court - they just complied.

While this list is not exhaustive and your company's case may be
different from all others, it does give you a hint of what your chances
might be in court.  Go to http://gpl-violations.org/ and do your
research.

The question whether a specific closed-source driver is legal or not can
only be answered in court and only on a case-by-case basis.  You should
have a good idea of what many developers' personal opinions are and with
the research you can also estimate your legal position.  Then make your
decision, as no one here is going to make it for you - even if some would
like to.

Jörn

-- 
Eighty percent of success is showing up.
-- Woody Allen

Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation

2007-02-15 Thread Jörn Engel

On Thu, 15 February 2007 19:38:14 +0100, Juan Piernas Canovas wrote:
 
 The patch for 2.6.11 is still not stable enough to be released. Be patient
 ;-)

While I don't want to discourage you, this is about the point in
development where most log structured filesystems stopped.  Doing a
little web research, you will notice those todo-lists with cleaner
being the top item for...years!

Getting that one to work robustly is _very_ hard work and just today
I've noticed that mine was not as robust as I would have liked to think.
Also, you may note that by updating to newer kernels, the VM writeout
policies can change and impact your cleaner.  Even to the extent that you
had a rock-solid filesystem with 2.6.18 and things crumble between your
fingers in 2.6.19 or later.

If the latter happens, most likely the VM is not to blame, it just
proved that your cleaner is still getting some corner-cases wrong and
needs more work.  There goes another week of debugging. :(

Jörn

-- 
You ain't got no problem, Jules. I'm on the motherfucker. Go back in
there, chill them niggers out and wait for the Wolf, who should be
coming directly.
-- Marsellus Wallace

Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation

2007-02-16 Thread Jörn Engel

On Thu, 15 February 2007 23:59:14 +0100, Juan Piernas Canovas wrote:
 
 Actually, the version of DualFS for Linux 2.4.19 implements a cleaner. In 
 our case, the cleaner is not really a problem because there is not too 
 much to clean (the meta-data device only contains meta-data blocks which 
 are 5-6% of the file system blocks; you do not have to move data blocks).

That sounds as if you have not hit the interesting cases yet.  Fun
starts when your device is near-full and you have a write-intensive
workload.  In your case, that would be metadata-write-intensive.  For
one, this is where write performance of log-structured filesystems
usually goes down the drain.  And worse, it is where the cleaner can
run into a deadlock.

Being good where log-structured filesystems usually are horrible is a
challenge.  And I'm sure many people are more interested in those
performance numbers than in the ones you shine at. :)

Jörn

-- 
Joern's library part 14:
http://www.sandpile.org/

Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation

2007-02-17 Thread Jörn Engel

On Fri, 16 February 2007 18:47:48 -0500, Bill Davidsen wrote:
 
 Actually I am interested in the common case, where the machine is not 
 out of space, or memory, or CPU, but when it is appropriately sized to 
 the workload. Not that I lack interest in corner cases, but the running 
 flat out case doesn't reflect the case where there's enough hardware, now
 the o/s needs to use it well.

There is one detail about this specific corner case you may be missing.
Most log-structured filesystems don't just drop in performance - they
can run into a deadlock and the only recovery from this is the lovely
backup-mkfs-restore procedure.

If it was just performance, I would agree with you.

Jörn

-- 
He that composes himself is wiser than he that composes a book.
-- B. Franklin

Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation

2007-02-17 Thread Jörn Engel

On Sat, 17 February 2007 13:10:23 -0500, Bill Davidsen wrote:
   
 I missed that. Which corner case did you find triggers this in DualFS?

This is not specific to DualFS, it applies to any log-structured
filesystem.

Garbage collection always needs at least one spare segment to collect
valid data into.  Regular writes may require additional free segments,
so GC has to kick in and free those when space is getting tight.  (1)

GC frees segments by writing all valid data in them into the spare
segment.  If there is remaining space in the spare segment, GC can move
more data from further segments.  Nice and simple.

The requirement is that GC *always* frees more segments than it uses up
doing so.  If that requirement is not fulfilled, GC will simply use up
its last spare segment without freeing a new one.  We have a deadlock.

Now imagine your filesystem is 90% full and all data is spread perfectly
across all segments.  The best segment you could pick for GC is 90%
full.  One would imagine that GC would only need to copy those 90% into
a spare segment and have freed 100%, making overall progress.

But most log-structured filesystems maintain a tree of some sort on the
medium.  If you move data elsewhere, you also need to update the
indirect block pointing to it.  So that has to get written as well.  If
you have doubly or triply indirect blocks, those need to get written.
So you can end up writing 180% or more to free 100%.  Deadlock.
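
The arithmetic behind that 180% figure, as a small sketch. It assumes the
worst case described above, where every valid block that gets moved drags
along one rewritten block per level of indirection. The function names are
made up for illustration:

```c
#include <stdbool.h>

/*
 * Worst-case amount GC must write to clean one segment, in percent of
 * a segment.  valid_pct is how full the victim segment is; each valid
 * block is assumed to force one extra write per indirection level.
 */
static unsigned gc_write_cost_pct(unsigned valid_pct, unsigned indirect_levels)
{
    return valid_pct * (1 + indirect_levels);
}

/* GC only makes progress if it writes less than the segment it frees. */
static bool gc_makes_progress(unsigned valid_pct, unsigned indirect_levels)
{
    return gc_write_cost_pct(valid_pct, indirect_levels) < 100;
}
```

With a 90% full segment and a single level of indirect blocks,
gc_write_cost_pct(90, 1) comes out at 180, so the cleaner consumes more
space than it frees.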

And if you read the documentation of the original Sprite LFS or any
other of the newer log-structured filesystems, you usually won't see a
solution to this problem, or even an acknowledgement that the problem
exists in the first place.  But there is no shortage of log-structured
filesystem projects that were abandoned years ago and have cleaner or
garbage collector as their top item on the todo-list.  Coincidence?


(1) GC may also kick in earlier, but that is just an optimization and
doesn't change the worst case, so that bit is irrelevant here.


Btw, the deadlock problem is solvable and I definitely don't want to
discourage further work in this area.  DualFS does look interesting.
But my solution for this problem will likely eat up all the performance
DualFS has gained and more, as it isn't aimed at hard disks.  So someone
has to come up with a different idea.

Jörn

-- 
To recognize individual spam features you have to try to get into the
mind of the spammer, and frankly I want to spend as little time inside
the minds of spammers as possible.
-- Paul Graham

Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation

2007-02-17 Thread Jörn Engel

On Sat, 17 February 2007 15:47:01 -0500, Sorin Faibish wrote:

 DualFS can probably get around this corner case as it is up to the user
 to select the size of the MD device size. If you want to prevent this
 corner case you can always use a device bigger than 10% of the data device
 which is exaggerated for any FS assuming that the directory files are so
 large (this is when you have billions of files with long names).
 In general the problem you mention is mainly due to the data blocks
 filling the file system. In DualFS case you have the choice of selecting
 different sizes for the MD and Data volume. When Data volume gets full
 the GC will have a problem but the MD device will not have a problem.
 It is my understanding that most of the GC problem you mention is
 due to the filling of the FS with data and the result is a MD operation
 being disrupted by the filling of the FS with data blocks. As for the
 performance impact on solving this problem, as you mentioned all
 journal FSs will have this problem, I am sure that DualFS performance
 impact will be less than others at least due to using only one MD
 write instead of 2.

You seem to make the usual mistakes when people start to think about
this problem.  But I could misinterpret you, so let me paraphrase your
mail in questions and answer what I believe you said.

Q: Are journaling filesystems identical to log-structured filesystems?

Not quite.  Journaling filesystems usually have a very small journal (or
log, same thing) and only store the information necessary for atomic
transactions in the journal.  Not sure what a journal FS is, but the
name seems closer to a journaling filesystem.

Q: DualFS separates Data and Metadata.  Does that make a difference?

Not really.  What I called data in my previous mail is a
log-structured filesystem's view of data.  DualFS stores file content
separately, so from an lfs view, that doesn't even exist.  But directory
content exists and behaves just like file content wrt. the deadlock
problem.  Any data or metadata that cannot be GC'd by simply copying but
requires writing further information like indirect blocks, B-Tree nodes,
etc. will cause the problem.

Q: If the user simply reserves some extra space, does the problem go
away?

Definitely not.  It will be harder to hit, but a rare deadlock is still
a deadlock.  Again, this is only concerned with the log-structured part
of DualFS, so we can ignore the Data volume.

When data is spread perfectly across all segments, the best segment one
can pick for GC is just as bad as the worst.  So let us take some
examples.  If 50% of the lfs is free, you can pick a 50% segment for GC.
Writing every single block in it may require writing one additional
indirect block, so GC is required to write out a 100% segment.  It
doesn't make any progress at 50% (in a worst case scenario) and could
deadlock if less than 50% were free.

If, however, GC has to write out a singly and a doubly indirect block,
67% of the lfs needs to be free.  In general, if the maximum height of
your tree is N, you need (N-1)/N * 100% free space.  Most people refer
to that as too much.
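
A worked form of that rule, as a sketch. It assumes the worst case from the
examples, where each valid block costs one write per level of a tree of
height N (N = 2 for data plus one indirect block). The names are made up for
illustration:

```c
#include <stdbool.h>

/* Free-space fraction below which worst-case GC can deadlock: (N-1)/N. */
static double min_free_fraction(unsigned tree_height)
{
    if (tree_height == 0)
        return 0.0;
    return (double)(tree_height - 1) / (double)tree_height;
}

/*
 * Integer version of the same test: cleaning a segment that is
 * (100 - free_pct)% valid writes up to tree_height times that much,
 * and GC goes backwards once this exceeds one full segment.
 */
static bool gc_can_deadlock(unsigned free_pct, unsigned tree_height)
{
    if (free_pct > 100)
        return false;
    return (100 - free_pct) * tree_height > 100;
}
```

For N = 2 the threshold is 0.5 and for N = 3 it is roughly 0.67, matching
the 50% and 67% figures above.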

If you have less free space, the filesystem will work just fine most of
the time.  That is nice and cool, but it won't help your rare user that
happens to hit the rare deadlock.  Any lfs needs a strategy to prevent
this deadlock for good, not just make it mildly unlikely.

Jörn

-- 
Error protection by error detection and correction.
-- from a university class

Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation

2007-02-18 Thread Jörn Engel

Maybe this is a decent approach to deal with the problem.  First some
definitions.  T is the target segment to be cleaned, S is the spare
segment that valid data is written to, O are other segments that contain
indirect blocks I for valid data D in T.

Have two different GC mechanisms to choose between:
1. Regular GC that copies D and I into S.  On average D+I should require
   less space than S can offer.
2. Slow GC only copies D into S.  Indirect blocks get modified in-place
   in O.  This variant requires more seeks due to writing in various O,
   but it guarantees that D always requires less space than S can offer.

Whenever you are running out of spare segments and are in danger of the
deadlock, switch to mechanism 2.  Now your correctness problem is
reduced to a performance problem.
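
A rough sketch of the switch between the two mechanisms. Everything here is
hypothetical (the enum, the reserve threshold, the block counts); it only
illustrates that mechanism 2 bounds what lands in S:

```c
enum gc_mode { GC_REGULAR, GC_SLOW };

/* Fall back to slow GC once the spare segments drop to the reserve. */
static enum gc_mode choose_gc_mode(unsigned spare_segments, unsigned reserve)
{
    return spare_segments <= reserve ? GC_SLOW : GC_REGULAR;
}

/*
 * Blocks written into the spare segment S when cleaning T.
 * Regular GC copies D and I; slow GC copies only D and rewrites I
 * in place in O, so it never writes more into S than T held.
 */
static unsigned blocks_into_spare(enum gc_mode mode, unsigned data_blocks,
                                  unsigned indirect_blocks)
{
    if (mode == GC_REGULAR)
        return data_blocks + indirect_blocks;
    return data_blocks;
}
```

The price of GC_SLOW is the extra seeks for the in-place updates in O, which
this sketch does not model.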

Jörn

-- 
To recognize individual spam features you have to try to get into the
mind of the spammer, and frankly I want to spend as little time inside
the minds of spammers as possible.
-- Paul Graham

Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation

2007-02-19 Thread Jörn Engel

On Tue, 20 February 2007 00:57:50 +0100, Juan Piernas Canovas wrote:
 
 I understand the problem that you describe with respect to the GC, but 
 let me explain why I think that it has a small impact on DualFS.
 
 Actually, the GC may become a problem when the number of free segments is 
 50% or less. If your LFS always guarantees, at least, 50% of free 
 segments (note that I am talking about segments, not free space), the 
 deadlock problem disappears, right? This is a quite naive solution, but it 
 works.

I don't see how you can guarantee 50% free segments.  Can you explain
that bit?

 In a traditional LFS, with data and meta-data blocks, 50% of free segments 
 represents a huge amount of wasted disk space. But, in DualFS, 50% of free 
 segments in the meta-data device is not too much. In a typical Ext2, 
 or Ext3 file system, there are 20 data blocks for every meta-data block 
 (that is, meta-data blocks are 5% of the disk blocks used by files). 
 Since files are implemented in DualFS in the same way, we can suppose the 
 same ratio for DualFS (1).

This will work fairly well for most people.  It is possible to construct
metadata-heavy workloads, however.  Many large directories containing
symlinks or special files (char/block devices, sockets, fifos,
whiteouts) come to mind.  Most likely none of your users will ever want
that, but a malicious attacker might.

That, btw, brings me to a completely unrelated topic.  Having a fixed
ratio of metadata to data is simple to implement, but allowing this ratio
to dynamically change would be nicer for administration.  You can add
that to the Christmas wishlist for the nice boys, if you like.

 Remember, I am supposing a naive implementation of the cleaner. With a 
 cleverer one, the meta-data device can be smaller, and the amount of
 disk space finally wasted can be smaller too. The following paper proposes 
 some improvements:
 
 - Jeanna Neefe Matthews, Drew Roselli, Adam Costello, Randy Wang, and
   Thomas Anderson.  Improving the Performance of Log-structured File
   Systems with Adaptive Methods.  Proc. Sixteenth ACM Symposium on
   Operating Systems Principles (SOSP), October 1997, pages 238 - 251.
 
 BTW, I think that what they propose is very similar to the two-strategies 
 GC that you propose in a separate e-mail.

Will have to read up on it after I get some sleep.  It is late.

 The point of all the above is that you must improve the common case, and 
 manage the worst case correctly. And that is the idea behind DualFS :)

A fine principle to work with.  Surprisingly, what is the worst case for
you is the common case for LogFS, so maybe I'm more interested in it
than most people.  Or maybe I'm just more paranoid.

Anyway, keep up the work.  It is an interesting idea to pursue.

Jörn

-- 
He who knows that enough is enough will always have enough.
-- Lao Tsu
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] slab: deal with NULL pointers passed to kmem_cache_free

2007-03-19 Thread Jörn Engel

On Mon, 19 March 2007 14:10:38 -0700, Andrew Morton wrote:
 
 Would prefer to do:
 
 static inline void kmem_cache_free_if_not_null(struct kmem_cache *cachep,
 						void *objp)
 {
 	if (objp)
 		kmem_cache_free(cachep, objp);
 }
 
 so that we don't add extra overhead to all the thousands of existing,
 well-behaved callsites.

In principle, this would work.  But two things need changing, imho:
1. Don't inline the function.  kmem_cache_free() has only about 34 NULL
   callers, if my grep is reliable, so this case is arguable.  But in
   general, out-of-line functions are better than many extra
   conditionals pulled in through the inline one.
2. Switch the names.  According to Rusty's benchmark, the easiest way to
   use an interface should be the correct one.  Every new driver
   written by a rookie will call kmem_cache_free(), simply because the
   name seems simpler.

void kmem_cache_free_fast(struct kmem_cache *cachep, void *objp)
{
	/* old kmem_cache_free() */
}

void kmem_cache_free(struct kmem_cache *cachep, void *objp)
{
	if (likely(objp))
		kmem_cache_free_fast(cachep, objp);
}

Jörn

-- 
Correctness comes second.
Features come third.
Performance comes last.
Maintainability is easily forgotten.

Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images

2007-03-21 Thread Jörn Engel

On Tue, 20 March 2007 01:42:46 +0100, Thomas Gleixner wrote:
 On Mon, 2007-03-19 at 17:32 -0500, Matt Mackall wrote:
 
4. JFFS2 has its own wear-leveling scheme, as do several other
   filesystems, so they probably want to bypass this piece of the stack.
   
   JFFS2 on top of UBI delegates the wear levelling to UBI, as JFFS2s own
   wear levelling sucks. 
  
  Ok, fine. How about LogFS, then?
 
 LogFS can easily leverage UBI's wear algorithm.

Ok, now we have reached the absurd.  UBI quite fundamentally cannot do
wear leveling as good as LogFS can.  Simply because UBI has zero
knowledge of the _contents_ of its blocks.  Knowing whether a block is
90% garbage or not makes a great difference.
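
A minimal sketch of why content knowledge matters, assuming
hypothetical per-block bookkeeping.  This is a plain greedy cleaner,
not the LogFS algorithm: a layer that only sees erase counts cannot
evaluate the first criterion at all.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-eraseblock bookkeeping; field names are illustrative. */
struct eraseblock {
	uint32_t erase_count;	/* how often this block has been erased */
	uint32_t valid_bytes;	/* live data still stored in the block */
};

/*
 * Pick the block that is cheapest to clean: the one with the least valid
 * data, i.e. the most garbage.  Ties go to the block with the lower erase
 * count, so cleaning also spreads wear.  Returns an index into blocks[],
 * or -1 if n is zero.
 */
int pick_gc_victim(const struct eraseblock *blocks, size_t n)
{
	int best = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (best < 0 ||
		    blocks[i].valid_bytes < blocks[best].valid_bytes ||
		    (blocks[i].valid_bytes == blocks[best].valid_bytes &&
		     blocks[i].erase_count < blocks[best].erase_count))
			best = (int)i;
	}
	return best;
}
```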

Also LogFS currently requires erasesizes of 2^n.

Thomas, I can give you my opinion on this flamewar in private - after
you have cooled off.

Jörn

-- 
When I am working on a problem I never think about beauty.  I think
only how to solve the problem.  But when I have finished, if the
solution is not beautiful, I know it is wrong.
-- R. Buckminster Fuller

Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images

2007-03-21 Thread Jörn Engel

On Wed, 21 March 2007 12:25:34 +0100, Thomas Gleixner wrote:
 On Wed, 2007-03-21 at 12:05 +0100, Jörn Engel wrote:
  
  Ok, now we have reached the absurd.  UBI quite fundamentally cannot do
  wear leveling as good as LogFS can.  Simply because UBI has zero
  knowledge of the _contents_ of its blocks.  Knowing whether a block is
  90% garbage or not makes a great difference.
  
  Also LogFS currently requires erasesizes of 2^n.
 
 Last time I talked to you about that, you said it would be possible and
 fixable. We talked about several mechanisms, which would allow a
 filesystem or other users to hint such things to UBI.

Note the word "currently".  And yes, we did talk about hints.  Back then
I still believed in UBI.  That has changed and I would like to spare
myself another flamewar, so please leave it at that.

 Even if the LogFS wear levelling is so superior, it CAN'T do across
 device wear levelling.

Correct.  And I don't see any problem with this.  I see two classes of
usecases for flash, with some amount of overlap in between.

1. Small amounts of flash.

Here the flash contains a large ratio of read-only data.  Bootloader,
kernel, etc.  Having wear levelling across the device will gain you
something.  This is what you designed UBI for.

2. Large amounts of flash.

Just to be precise, "large" can go well into the Terabyte range and
beyond.  I don't mean large as in "the biggest embedded device I worked
on last year" - that is still small.

Even if such flashes still contain a bootloader and a kernel, that will
occupy less than 1% of the device.  Wear leveling across the device is
fairly pointless here.  This is what I designed LogFS for.


There is some middle ground where a combination of UBI and LogFS may
make sense.  LogFS can still make sense for devices as small as 64MiB.
But I'm not too concerned about that because flashes will continue to
grow and the advantages of cross-device wear leveling will continue to
diminish.

Jörn

-- 
Security vulnerabilities are here to stay.
-- Scott Culp, Manager of the Microsoft Security Response Center, 2001

Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images

2007-03-21 Thread Jörn Engel

On Wed, 21 March 2007 12:57:42 +0100, Thomas Gleixner wrote:
 On Wed, 2007-03-21 at 12:35 +0100, Jörn Engel wrote:
  Even if such flashes still contain a bootloader and a kernel, that will
  occupy less than 1% of the device.  Wear leveling across the device is
  fairly pointless here.  This is what I designed LogFS for.
 
 Still you need to have a solution for handling bitflips in those
 bootloader and kernel areas.

Correct.  It may make sense to use UBI for that, I don't know.  What I
do know is that UBI cannot make wear leveling decisions as well as
LogFS.

And that is all I care about wrt. this discussion.

Jörn

-- 
Joern's library part 8:
http://citeseer.ist.psu.edu/plank97tutorial.html

Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images

2007-03-21 Thread Jörn Engel

*sigh*

I really did not want to become involved in this.  So please be nice and
leave the flamethrower in your weapon closet or I will disappear again
before you can say "fire".

On Tue, 20 March 2007 21:32:40 +0000, David Woodhouse wrote:
 On Tue, 2007-03-20 at 10:58 -0800, David Lang wrote:
  What Matt and Ted are looking at is the question 'are flash devices close
  enough to other block devices that it would make sense to change the
  existing linux definition of a block device to handle the special
  requirements of flash'
 
 I've seen no real proposals about how this could be done, so it's a
 purely academic question.

What you have seen and shot down were patches to make mtd more generic.
So let me just assume both mtd and jffs2 were generic, even though they
currently aren't.

In very broad terms, an mtd is a device with:
1. a read operation
2. a write operation
3. an erase operation
4. a minimal write blocksize
5. a minimal erase blocksize
6. a method to query bad eraseblocks
7. a method to mark bad eraseblocks

Anything else?  There are many more fields, but I believe these are the
essentials.  point() and unpoint() were omitted, because they are just
one option to provide XIP.  filemap_xip.c is another used for block
devices.

In very broad terms, a block device has:
1. a read operation
2. a write operation
3. some devices have an ioctl() for erase, but that is uncommon
4. a blocksize

What is missing?  Obviously the erase operation needs to become a
first-class citizen and block devices need two fields for the two
meaningful blocksizes.  And they need methods to query and set bad
blocks.
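
The combined list can be sketched as one device description.  This is
a RAM-backed toy under stated assumptions: the struct and function
names are hypothetical, not an existing kernel API, and writes are
modelled as only clearing bits, which is how flash behaves between
erases.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical unified device description; not an existing kernel API. */
struct flash_like_dev {
	size_t size;		/* total size in bytes */
	size_t writesize;	/* minimal write blocksize */
	size_t erasesize;	/* minimal erase blocksize */
	uint8_t *mem;		/* RAM backing store for this sketch */
	uint8_t *bad;		/* one flag per eraseblock */
};

int dev_read(struct flash_like_dev *d, size_t ofs, void *buf, size_t len)
{
	if (ofs > d->size || len > d->size - ofs)
		return -1;
	memcpy(buf, d->mem + ofs, len);
	return 0;
}

/* Writes can only clear bits (1 -> 0), as on real flash. */
int dev_write(struct flash_like_dev *d, size_t ofs, const void *buf, size_t len)
{
	const uint8_t *src = buf;
	size_t i;

	if (ofs % d->writesize || len % d->writesize)
		return -1;
	if (ofs > d->size || len > d->size - ofs)
		return -1;
	for (i = 0; i < len; i++)
		d->mem[ofs + i] &= src[i];
	return 0;
}

/* Erase sets every bit of one eraseblock back to 1. */
int dev_erase(struct flash_like_dev *d, size_t block)
{
	if (block >= d->size / d->erasesize || d->bad[block])
		return -1;
	memset(d->mem + block * d->erasesize, 0xff, d->erasesize);
	return 0;
}

/* Returns 1 for a bad block, 0 for a good one, -1 if out of range. */
int dev_block_isbad(const struct flash_like_dev *d, size_t block)
{
	return block < d->size / d->erasesize ? d->bad[block] : -1;
}

int dev_block_markbad(struct flash_like_dev *d, size_t block)
{
	if (block >= d->size / d->erasesize)
		return -1;
	d->bad[block] = 1;
	return 0;
}
```

A classic block device would fit the same description with writesize
equal to erasesize, an erase that is a no-op, and no bad blocks.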

So far it looks simple enough.  Obviously there are many messy details
left out, so it will be a lot of work in practice.  So the question is:
is it worth it?

What are the gains from combining mtd and block devices?

[ And at this point I would like to state again that I don't want to
become involved in the UBI discussion.  The question whether two
separate subsystems make sense is quite independent and I don't want
both discussions to get mixed up. ]

Jörn

-- 
He who knows others is wise.
He who knows himself is enlightened.
-- Lao Tsu

Re: [PATCH] slab: deal with NULL pointers passed to kmem_cache_free

2007-03-21 Thread Jörn Engel

On Wed, 21 March 2007 08:30:27 -0800, Andrew Morton wrote:
 On Wed, 21 Mar 2007 16:41:19 +0200 Pekka Enberg [EMAIL PROTECTED] wrote:
 
  Yeah, I'll try to sneak a patch past Andrew.
 
 That would be sneaky.
 
 Thing is, such a patch would amount to adding a test-for-NULL to codepaths
 which we *know* do not need it.  There is no point in doing that.

How about two patches, one renaming kmem_cache_free to
kmem_cache_free_fast or __kmem_cache_free or whatever pleases you most,
the second adding kmem_cache_free with a NULL check.

The point is that the easiest way to use kmem_cache_free should be the
safest, but not necessarily the fastest.  Existing well-tuned and
NULL-aware code paths can remain fast, random new code will be safe.
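
The same split can be tried out in userspace.  A minimal sketch,
assuming a hypothetical obj_cache that merely counts frees and uses
malloc()/free() in place of the slab allocator:

```c
#include <stdlib.h>

/* Hypothetical stand-in for a slab cache; it only counts frees. */
struct obj_cache {
	unsigned long frees;
};

/* Fast path: the caller guarantees objp is not NULL. */
void cache_free_fast(struct obj_cache *cache, void *objp)
{
	cache->frees++;
	free(objp);
}

/* Default path: safe to call with NULL, at the cost of one test. */
void cache_free(struct obj_cache *cache, void *objp)
{
	if (objp)
		cache_free_fast(cache, objp);
}
```

Tuned callers that know their pointer is valid call cache_free_fast()
directly; everyone else gets the checked variant under the short name.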

Jörn

-- 
Joern's library part 14:
http://www.sandpile.org/

Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images

2007-03-25 Thread Jörn Engel

On Wed, 21 March 2007 12:25:34 +0100, Thomas Gleixner wrote:
 On Wed, 2007-03-21 at 12:05 +0100, Jörn Engel wrote:
  
  Also LogFS currently requires erasesizes of 2^n.
 
 Last time I talked to you about that, you said it would be possible and
 fixable.

Actually, no.  LogFS is not broken, there is nothing to fix.

And there is no fundamental reason why UBI should export blocks with
non-power-of-two sizes.  UBI currently consists of two parts that are
intimately intertwined in the current implementation, but have
relatively little connection otherwise.

1. Logical volume management.
2. Static volumes.

Logical volume management can just as easily move its management
information into a table, instead of having it spread across all blocks.
Blocks can keep their original size.  Since you have to scan flash
anyway, you can also scan for a table, compare a magical number and do
some extra check to protect yourself against a UBI image inside some
logical volume.  No big deal.

Static volumes can keep a header inside their volumes.  The tiny
first-stage bootloader is currently scanning flash and can continue to
do so.  But at least this header no longer causes trouble for LogFS or
any other UBI user.

UBI is just as broken as LogFS is.  It works with every user in mainline
(which comes down to JFFS2).  LogFS works with every MTD device in
mainline.  The only combination that doesn't work is LogFS on UBI - due
to deliberate design decisions on both sides.

Jörn

-- 
Joern's library part 8:
http://citeseer.ist.psu.edu/plank97tutorial.html

Re: [2.6 patch] block2mtd_paramline[] mustn't be __initdata

2007-03-25 Thread Jörn Engel

On Sun, 25 March 2007 16:58:05 +0200, Adrian Bunk wrote:
 
 block2mtd_paramline[] is used in the non-__init block2mtd_setup()
 
 Signed-off-by: Adrian Bunk [EMAIL PROTECTED]
Acked-By: Jörn Engel [EMAIL PROTECTED]

Adrian, can you put me on Cc: next time?

 ---
 --- linux-2.6.21-rc4-mm1/drivers/mtd/devices/block2mtd.c.old  2007-03-25 15:56:10.0 +0200
 +++ linux-2.6.21-rc4-mm1/drivers/mtd/devices/block2mtd.c  2007-03-25 15:56:31.0 +0200
 @@ -423,7 +423,7 @@
  
  #ifndef MODULE
  static int block2mtd_init_called = 0;
 -static __initdata char block2mtd_paramline[80 + 12]; /* 80 for device, 12 for erase size */
 +static char block2mtd_paramline[80 + 12]; /* 80 for device, 12 for erase size */
  #endif

Jörn

-- 
Das Aufregende am Schreiben ist es, eine Ordnung zu schaffen, wo
vorher keine existiert hat.
-- Doris Lessing

Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images

2007-03-25 Thread Jörn Engel

On Sun, 25 March 2007 13:49:58 -0800, David Lang wrote:
 On Sun, 25 Mar 2007, Jörn Engel wrote:
 
 Logical volume management can just as easily move its management
 information into a table, instead of having it spread across all blocks.
 Blocks can keep their original size.  Since you have to scan flash
 anyway, you can also scan for a table, compare a magical number and do
 some extra check to protect yourself against a UBI image inside some
 logical volume.  No big deal.

[ This was not a request for UBI to be changed.  The only purpose was to
illustrate that LogFS is not broken.  The previous thread suggested
otherwise and I just couldn't leave it at that. ]

 if you are being paranoid about write cycles putting the write count in the 
 block you are writing avoids doing an erase/write elsewhere

 although, since you can flip bits to 1 without requiring an erase you 
[ vice versa.  you can flip bits to 0 without erasing. ]
 could sacrifice some space and say that your table has a normal counter for 
 the number of times the block has been erased, but a 'tally counter' where 
 you turn one bit on each time you erase the block, and when you fill up the 
 tally block you re-write the entire table, clearing all the tallies. if you 
 have relatively large eraseblocks it seems like you could afford to 
 sacrifice the space in your master table to avoid erases of it

Or you could have a table and any number of updates to it.  Erase one
block, append a small update marker to the table.  There are plenty of
options.  All have in common that code would be more complicated.
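
For reference, the tally counter from the quoted text can be sketched
as below, with the polarity corrected so that each erase clears one
bit.  The size and function names are assumptions made for
illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TALLY_BYTES 8	/* hypothetical size: room for 64 erases */

/* A fresh (erased) tally has all bits set. */
void tally_init(uint8_t tally[TALLY_BYTES])
{
	memset(tally, 0xff, TALLY_BYTES);
}

/* Number of erases recorded so far = number of cleared bits. */
unsigned tally_count(const uint8_t tally[TALLY_BYTES])
{
	unsigned count = 0;
	size_t i;
	int bit;

	for (i = 0; i < TALLY_BYTES; i++)
		for (bit = 0; bit < 8; bit++)
			if (!(tally[i] & (1u << bit)))
				count++;
	return count;
}

/*
 * Record one more erase by clearing the next set bit.  Returns 0 on
 * success, -1 if the tally is full and the table must be rewritten.
 */
int tally_increment(uint8_t tally[TALLY_BYTES])
{
	size_t i;
	int bit;

	for (i = 0; i < TALLY_BYTES; i++) {
		if (!tally[i])
			continue;
		for (bit = 0; bit < 8; bit++) {
			if (tally[i] & (1u << bit)) {
				tally[i] &= (uint8_t)~(1u << bit);
				return 0;
			}
		}
	}
	return -1;
}
```

The erase count of a block is then the normal counter from the last
table rewrite plus tally_count().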

Another advantage is that erase counts don't get reset if the race
against a power failure during erase is lost.

Whether the advantages of power-of-two blocksizes and safe erasecounts
are worth it, I leave for others to decide.

Jörn

-- 
Fools ignore complexity.  Pragmatists suffer it.
Some can avoid it.  Geniuses remove it.
-- Perlis's Programming Proverb #58, SIGPLAN Notices, Sept.  1982

Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images

2007-03-25 Thread Jörn Engel

On Mon, 26 March 2007 00:46:33 +0100, David Woodhouse wrote:
 On Mon, 2007-03-26 at 00:55 +0200, Jörn Engel wrote:
   although, since you can flip bits to 1 without requiring an erase you 
  [ vice versa.  you can flip bits to 0 without erasing. ]
 
 And on NAND flash you can't just do it in multiple cycles one bit at a
 time. The 'tally' trick isn't viable there.

You can on NAND.  ECC is done in software.  And for a data structure as
simple as the 'tally', foregoing ECC is not a huge problem - most
bitflips are easily detected and the remaining only cause off-by-a-few
on the erase count.

On NOR with transparent (hardware) ECC you can't.

Jörn

-- 
Homo Sapiens is a goal, not a description.
-- unknown

Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images

2007-03-25 Thread Jörn Engel

On Mon, 26 March 2007 01:21:25 +0100, David Woodhouse wrote:
 On Mon, 2007-03-26 at 02:01 +0200, Jörn Engel wrote:
  You can on NAND.  ECC is done in software.  And for a data structure as
  simple as the 'tally', foregoing ECC is not a huge problem - most
  bitflips are easily detected and the remaining only cause off-by-a-few
  on the erase count. 
 
 You're only allowed a limited number of write cycles to each page
 though. So you can't just clear the bits in a 2112-byte page one at a
 time; typically when you clear the fifth bit, the contents of the whole
 page become undefined until the next erase cycle.

That limitation stems from ECC and ECC is done in software.  Currently
everyone and his dog is doing ECC in chunks of 256 bytes on NAND.  So
your minimum write size is 256 bytes _if you care about ECC_.  If you
don't care, you can write single bits on NAND, just as you can on NOR.

Controlling ECC in software means we are quite flexible.  Given
sufficient incentive, we can change the rules quite significantly.

Jörn

-- 
You can't tell where a program is going to spend its time. Bottlenecks
occur in surprising places, so don't try to second guess and put in a
speed hack until you've proven that's where the bottleneck is.
-- Rob Pike

Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images

2007-03-26 Thread Jörn Engel

On Mon, 26 March 2007 10:45:57 +0100, David Woodhouse wrote:
 
 No, on NAND flash it's a limitation of the hardware. The number of write
 cycles you can perform to a given page is limited. Exceed it and the
 contents of that page become undefined due to leakage, until you next
 erase it. 

Are you sure?  Do you have any specs or similar that state this?

So far I have only encountered this limitation by word of mouth.  And
such a myth coming from ECC effects is nothing that would surprise me.

Jörn

-- 
The cheapest, fastest and most reliable components of a computer
system are those that aren't there.
-- Gordon Bell, DEC laboratories

Re: [PATCH 00/22 take 3] UBI: Unsorted Block Images

2007-03-26 Thread Jörn Engel

On Mon, 26 March 2007 13:49:06 +0300, Artem Bityutskiy wrote:
 On Sun, 2007-03-25 at 22:08 +0200, Jörn Engel wrote:
  
  Logical volume management can just as easily move its management
  information into a table, instead of having it spread across all blocks.
  Blocks can keep their original size.  Since you have to scan flash
  anyway, you can also scan for a table, compare a magical number and do
  some extra check to protect yourself against a UBI image inside some
  logical volume.  No big deal.
 
 First off, I see these "no big deal" statements for years already, and no
 decent implementation proved by usage in real world. Could we please,
 move these academic discussions to another thread?

You could wait a day, then reread what I wrote.  Maybe you will notice
that what I wrote is not identical to what we discussed about a year
ago, which is what you seem to have read.

You may also want to reread this:
||[ This was not a request for UBI to be changed.  The only purpose was to
||illustrate that LogFS is not broken.  The previous thread suggested
||otherwise and I just couldn't leave it at that. ]

Jörn

-- 
tglx1 thinks that joern should get a (TM) for Thinking Is Hard
-- Thomas Gleixner
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-08 Thread Jörn Engel

On Sun, 8 April 2007 11:11:20 -0700, H. Peter Anvin wrote:
 
 Well, the question is if you can keep the seekdir/telldir cookie around 
 as a pointer -- preferably in userspace, of course.  You would 
 presumably garbage-collect them on closedir() -- there is no other point 
 at which you could.

Garbage-collecting them on closedir() does not work.  It surprised me as
well, but there seem to be applications that keep the telldir() cookie
around after closedir().  Iirc, rm -r was one of them.

Neil, is this correct?

Jörn

-- 
Data dominates. If you've chosen the right data structures and organized
things well, the algorithms will almost always be self-evident. Data
structures, not algorithms, are central to programming.
-- Rob Pike
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-09 Thread Jörn Engel

On Sun, 8 April 2007 21:44:26 -0400, Theodore Tso wrote:
 
 Well, Joern thought that rm -rf might be relying on the telldir cookie
 being valid in precisely that circumstance.  If that is true, I'd
 argue that this is a BUG in GNU coreutils that should be fixed...

I heard it and accepted that claim without checking it.  Might have been
a mistake.  But the claim came from an NFS developer, which may explain
a thing or two.

NFS clients have to deal with a server rebooting underneath them and
should still behave as expected.  An rm -r running on the client
concurrently to a rebooting server is a problem indeed and could be
solved with seekdir/telldir.

That surely doesn't make life any easier for filesystem developers, I
agree.  From that point of view, all telldir cookies should end their
life at closedir time.  For rm -r it would be sufficient if the nfs
client simply didn't seekdir at all.  For ls -lR, this would return
duplicate dentries.
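
For readers who have not used these calls, a minimal sketch of the
well-behaved case: the cookie is obtained and consumed on the same
open directory stream, which is the only use POSIX describes for it.
The function name is hypothetical.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700	/* for telldir()/seekdir() */
#endif
#include <dirent.h>
#include <string.h>

/*
 * Remember a position with telldir(), read one entry, seek back and read
 * again.  Returns 1 if both reads yield the same name, 0 if not, -1 on
 * error.  The cookie is only used while the directory stream is open.
 */
int cookie_roundtrip(const char *path)
{
	DIR *dir = opendir(path);
	struct dirent *de;
	char first[256];
	long cookie;
	int same = -1;

	if (!dir)
		return -1;
	cookie = telldir(dir);
	de = readdir(dir);
	if (de) {
		strncpy(first, de->d_name, sizeof(first) - 1);
		first[sizeof(first) - 1] = '\0';
		seekdir(dir, cookie);
		de = readdir(dir);
		same = de && !strcmp(first, de->d_name);
	}
	closedir(dir);
	return same;
}
```

Anything that keeps the cookie beyond closedir(), or across a server
reboot, depends on the filesystem keeping it stable for longer than
this sketch requires.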

Jörn

-- 
My second remark is that our intellectual powers are rather geared to
master static relations and that our powers to visualize processes
evolving in time are relatively poorly developed.
-- Edsger W. Dijkstra
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Interface for the new fallocate() system call

2007-04-09 Thread Jörn Engel

On Mon, 9 April 2007 23:01:42 +1000, Paul Mackerras wrote:
 Jörn Engel writes:
 
  Wouldn't that work be confined to fallocate()?  If I understand Heiko
  correctly, the alternative would slow s390 down for every syscall,
  including more performance-critical ones.
 
 The alternative that Jakub suggested wouldn't slow s390 down.

True.  And it appears to be one of the least offensive options we have.

Jörn

-- 
My second remark is that our intellectual powers are rather geared to
master static relations and that our powers to visualize processes
evolving in time are relatively poorly developed.
-- Edsger W. Dijkstra
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Add a norecovery option to ext3/4?

2007-04-10 Thread Jörn Engel

On Mon, 9 April 2007 12:21:15 -0500, Eric Sandeen wrote:
 Phillip Susi wrote:
  
  When the filesystem is told to mount the disk read only, that means it 
  should not write to it.  
 
 It means the filesystem should not be writeable when it is mounted.
 This is not the same as saying that the filesystem itself should do no
 IO in the course of making that read-only mount available.

The filesystem has two interfaces.  One to the device underneath, one to
userspace.  Read-only should certainly mean that no writes cross the
userspace interface.  Traditionally it has implicitly also meant that
no writes are crossing the device interface.  Whether that was/is an
explicit requirement - who knows.

Journaling filesystems have introduced this thing called journal
replay.  And I have to admit, it makes things _a lot_ easier to always
replay the journal, even when being mounted read-only.

But "it is easier" is a pretty lame excuse.

 Under all conditions it should be safe to mount a read-only block
 device, but that is not the same as mounting a filesystem read-only.

In particular, it is a lame excuse when this claim is true.  If the
block-device is read-only, then journal replay will not work as expected
and all the not so easy work has to be done anyway.

Did I miss anything?  Is it actually easier to mount a read-only device
with unclean journal than mounting a read-write device and not replay
the journal?

Jörn

-- 
Joern's library part 8:
http://citeseer.ist.psu.edu/plank97tutorial.html
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Add a norecovery option to ext3/4?

2007-04-10 Thread Jörn Engel

On Tue, 10 April 2007 07:27:18 -0400, Theodore Tso wrote:
 
 I suppose what you could do is to read in the journal, and use it to
 create an remapping table so that when you want to read block #5126,
 and block number 5126 is in the journal, to read the journal version
 of the block instead of the one on disk.  That would allow for safe
 access to a filesystem being mounted read-only without the journal
 being present.

Another option would be to access the medium through a mapping inode,
replay the journal into the mapping inode and _not_ flush the dirty
pages.  But as long as a remapping table is sufficient for ext3 journal
format, such a table should be simpler and faster.
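
A minimal sketch of the remapping table, assuming the journal has
already been parsed into (block number, journal slot) pairs with only
the newest copy of each block kept.  All names and the tiny block size
are hypothetical; nothing here reflects the ext3 journal format.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLOCK_SIZE 16	/* tiny block size, for illustration only */

/* One entry per journaled block: where its newest copy sits in the journal. */
struct remap_entry {
	uint64_t blocknr;	/* block number on the filesystem */
	size_t journal_slot;	/* index of the copy inside the journal */
};

struct remap_table {
	const struct remap_entry *entries;
	size_t count;
};

/* Linear search keeps the sketch short; a real table would be indexed. */
const struct remap_entry *remap_lookup(const struct remap_table *t,
				       uint64_t blocknr)
{
	size_t i;

	for (i = 0; i < t->count; i++)
		if (t->entries[i].blocknr == blocknr)
			return &t->entries[i];
	return NULL;
}

/*
 * Read a block without ever writing to the medium: prefer the journal
 * copy if one exists, otherwise fall back to the on-disk block.
 */
void read_block(const struct remap_table *t, const uint8_t *disk,
		const uint8_t *journal, uint64_t blocknr, uint8_t *out)
{
	const struct remap_entry *e = remap_lookup(t, blocknr);
	const uint8_t *src = e ? journal + e->journal_slot * SKETCH_BLOCK_SIZE
			       : disk + blocknr * SKETCH_BLOCK_SIZE;

	memcpy(out, src, SKETCH_BLOCK_SIZE);
}
```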

 Patches gratefully accepted

Not likely to come from me anytime soon.  There's a certain other
filesystem I have to finish first that still suffers from the same
problem.

Jörn

-- 
Do not stop an army on its way home.
-- Sun Tzu
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/13] fs: convert core functions to zero_user_page

2007-04-11 Thread Jörn Engel

On Tue, 10 April 2007 22:56:38 -0700, Andrew Morton wrote:
 
 And I'm surprised that this:
 
 +static inline void memclear_highpage_flush(struct page *page, unsigned int offset, unsigned int size)
 +{
 + return zero_user_page(page, offset, size);
 +}
 
 compiled.  zero_user_page() returns void...

As does memclear_highpage_flush().  Some of my code looks like:
void some_func(...)
{
	if (foo)
		return do_foo(...);
	if (bar)
		return do_bar(...);
	...
}

do_foo() and do_bar() also return void.  Saves an extra line for the
return statement and the brackets.

Doesn't help in the code you quoted, of course.

Jörn

-- 
Measure. Don't tune for speed until you've measured, and even then
don't unless one part of the code overwhelms the rest.
-- Rob Pike
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-11 Thread Jörn Engel

On Wed, 11 April 2007 16:23:21 -0700, H. Peter Anvin wrote:
 David Lang wrote:
 On Thu, 12 Apr 2007, Neil Brown wrote:
 
 For the second.
  You say that you  would need at least 96 bits in order to make that
  guarantee; 64 bits of hash, plus a 32-bit count value in the hash
  collision chain.  I think 96 is a bit greedy.  Surely 48 bits of
  hash and 16 bits of collision-chain-position would plenty.  You would
  need 65537 entries before a collision was even possible, and
  billions before it was at all likely. (How big does a set of 48bit
  numbers have to get before the probability that No subset of 65536
  numbers are all the same drops below 0.95?)
 
   you can get a hash collision with two entries.
 
 Yes, but the probability is 2^-n for an n-bit hash, assuming it's 
 uniformly distributed.
 
 The probability approaches 1/2 as the number of entries hashes 
 approaches 2^(n/2) (birthday number.)

I believe you are both barking up the wrong tree.  Neil proposed a 16bit
collision chain.  With that, it takes 65537 entries before a collision
chain overflow is possible.
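
The layout under discussion, 48 bits of hash plus 16 bits of chain
position in one 64bit cookie, can be sketched as follows.  The helper
names are hypothetical.

```c
#include <stdint.h>

#define CHAIN_BITS 16
#define CHAIN_MASK ((UINT64_C(1) << CHAIN_BITS) - 1)
#define HASH_MASK  ((UINT64_C(1) << 48) - 1)

/* Pack a 48bit name hash and a 16bit chain position into one 64bit cookie. */
uint64_t make_cookie(uint64_t hash48, uint16_t chain_pos)
{
	return ((hash48 & HASH_MASK) << CHAIN_BITS) | chain_pos;
}

uint64_t cookie_hash(uint64_t cookie)
{
	return cookie >> CHAIN_BITS;
}

uint16_t cookie_chain_pos(uint64_t cookie)
{
	return (uint16_t)(cookie & CHAIN_MASK);
}

/*
 * Chain positions 0..65535 are available for names sharing one hash, so
 * the 65537th such name is the first that cannot get a cookie.
 */
int chain_has_room(uint32_t names_with_same_hash)
{
	return names_with_same_hash <= CHAIN_MASK;
}
```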

Calling a collision chain overflow a "collision" is inviting confusion, of
course. :)

Jörn

-- 
The competent programmer is fully aware of the strictly limited size of
his own skull; therefore he approaches the programming task in full
humility, and among other things he avoids clever tricks like the plague.
-- Edsger W. Dijkstra
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-11 Thread Jörn Engel

On Thu, 12 April 2007 11:46:41 +1000, Neil Brown wrote:
 
 I could argue that nfs came before ext3+dirindex, so ext3 should have
 been designed to work properly with NFS.  You could argue that fixing
 it in nfsd fixes it for all filesystems.  But I'm not sure either of
 those arguments are likely to be at all convincing...

Caring about a non-ext3 filesystem, I sure would like an nfs solution as
well. :)

 Hmmm. I wonder.  Which is more likely?
   - That two 64bit hashes from some set are the same
   - or that 65536 48bit hashes from a set of equal size are the same.

The former.  Each bit going from hash strength to collision chain length
reduces the likelihood of an overflow.  In the extreme case of a 0bit
hash and 64bit collision chain, you need 2^64 entries compared to 2^32
for the other extreme.

However, the collision chain gives me quite a bit of headache.  One
would have to store each entry's position on the chain, deal with older
entries getting deleted, newer entries getting removed, etc.  All this
requires a lot of complicated code that basically never gets tested in
the wild.

Just settling for a 64bit hash and returning -EEXIST when someone causes
a collision on creat() sounds more appealing.  Directories with 4
billion entries will cause problems, but that is hardly news to anyone.

Jörn

-- 
Fantasy is more important than knowledge. Knowledge is limited,
while fantasy embraces the whole world.
-- Albert Einstein
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-12 Thread Jörn Engel

On Thu, 12 April 2007 15:57:41 +1000, Neil Brown wrote:
 
  However, the collision chain gives me quite a bit of headache.  One
  would have to store each entry's position on the chain, deal with older
  entries getting deleted, newer entries getting removed, etc.  All this
  requires a lot of complicated code that basically never gets tested in
  the wild.
 
 This is a simple consequence of the design decision to use hashes as
 the search key.  They aren't dense and they will collide.  So the
 solution will be a bit fuzzy around the edges.  And maybe that is an
 acceptable tradeoff.  But the filesystem should take full
 responsibility for it, whether in performance or correctness :-)

Sure.  And seeing that not using hashes would kill performance long
before 4 billion dentries are reached, there don't seem to be many
downsides to hashing in principle.

  Just settling for a 64bit hash and returning -EEXIST when someone causes
  a collision on creat() sounds more appealing.  Directories with 4
  billion entries will cause problems, but that is hardly news to anyone.
 
 I think you want -EFBIG or -ENOSPC.  -EEXIST sounds just wrong.

None of them are 100% correct.  But you are right, -ENOSPC seems to do
less harm.

 But there are alternatives.  e.g. internal chaining.
 Insist on a unique 64bit hash for every file.  If the hash is in use,
 increment and try again.  On lookup, if the hash leads you to a file
 with the wrong name, increment and try again until you find a hole
 (hash value that is not stored).  When you delete an entry, leave a
 place holder if the next hash is in use.  Conversely if the next hash
 is not in use, delete the entry and delete the previous one if it is a
 place holder.

That would work and is limited to reasonable complexity.  It still
suffers from getting virtually no testing in the wild and therefore
being one of the dark corners little critters thrive in.  But one can at
least add a config option to fold the hash to 16bit or so.  And cross
fingers that at least one person will occasionally test with that
option.
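The internal chaining scheme quoted above can be sketched in a few lines.  This is a toy model, not code from any filesystem: the class name, the placeholder object and the use of blake2b as the hash are all assumptions made for illustration.  The hash_bits parameter plays the role of the proposed config option, since folding the hash to a few bits makes collisions easy to provoke in a test.

```python
import errno
import hashlib

PLACEHOLDER = object()  # marks a deleted entry that still holds a chain together

class HashedDirectory:
    """Toy directory keyed by a unique hash per name (internal chaining)."""

    def __init__(self, hash_bits=64):
        self.size = 1 << hash_bits
        self.slots = {}  # hash value -> name or PLACEHOLDER

    def _hash(self, name):
        digest = hashlib.blake2b(name.encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.size  # fold to hash_bits

    def _probe(self, name):
        """Yield hash values from the name's hash up to the first hole."""
        pos = self._hash(name)
        for _ in range(min(self.size, len(self.slots) + 1)):
            yield pos
            if pos not in self.slots:
                return
            pos = (pos + 1) % self.size  # hash in use: increment and retry

    def lookup(self, name):
        for pos in self._probe(name):
            if self.slots.get(pos) == name:
                return pos
        return None

    def insert(self, name):
        free = None
        for pos in self._probe(name):
            entry = self.slots.get(pos)
            if entry == name:
                raise FileExistsError(name)
            if free is None and (entry is None or entry is PLACEHOLDER):
                free = pos  # first reusable hash value on the chain
        if free is None:
            raise OSError(errno.ENOSPC, "directory hash space exhausted")
        self.slots[free] = name
        return free

    def delete(self, name):
        pos = self.lookup(name)
        if pos is None:
            raise FileNotFoundError(name)
        if (pos + 1) % self.size in self.slots:
            self.slots[pos] = PLACEHOLDER  # next hash in use: keep the chain
            return
        del self.slots[pos]
        prev = (pos - 1) % self.size
        while self.slots.get(prev) is PLACEHOLDER:
            del self.slots[prev]  # trailing placeholders protect nothing
            prev = (prev - 1) % self.size

d = HashedDirectory(hash_bits=4)  # tiny hash space to force collisions
for name in ("a", "b", "c"):
    d.insert(name)
d.delete("b")
print(d.lookup("a"), d.lookup("b"), d.lookup("c"))
```

The sketch removes every trailing placeholder on delete rather than only the immediately preceding one, which is a small generalisation of the description above.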

 You have to require 64bit cookies/fpos, but I think that today, that
 is a reasonable thing to require (5 years ago it might not have been).

Which brings us back to the start of this thread.

Jörn

-- 
Why do musicians compose symphonies and poets write poems?
They do it because life wouldn't have any meaning for them if they didn't.
That's why I draw cartoons.  It's my life.
-- Charles Shultz

Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation

2007-02-21 Thread Jörn Engel

On Wed, 21 February 2007 05:36:22 +0100, Juan Piernas Canovas wrote:
 
 I don't see how you can guarantee 50% free segments.  Can you explain
 that bit?
 It is quite simple. If 50% of your segments are busy, and the other 50% 
 are free, and the file system needs a new segment, the cleaner starts 
 freeing some of busy ones. If the cleaner is unable to free one segment at 
 least, your file system gets full (and it returns a nice ENOSPC error). 
 This solution wastes the half of your storage device, but it is 
 deadlock-free. Obviously, there are better approaches.

Ah, ok.  It is deadlock free, if the maximal height of your tree is 2.
It is not 100% deadlock free if the height is 3 or more.

Also, I strongly suspect that your tree is higher than 2.  A medium
sized directory will have data blocks, indirect blocks and the inode
proper, which gives you a height of 3.  Your inodes need to get accessed
somehow and unless they have fixed positions like in ext2, you need a
further tree structure of some sorts, so you're more likely looking at a
height of 5.

With a height of 5, you would need to keep 80% of your metadata free.
That is starting to get wasteful.

So I suspect that my proposed alternate cleaner mechanism or the even
better hole plugging mechanism proposed in the paper a few posts above
would be a better path to follow.

 A fine principle to work with.  Surprisingly, what is the worst case for
 you is the common case for LogFS, so maybe I'm more interested in it
 than most people.  Or maybe I'm just more paranoid.
 
 No, you are right. It is the common case for LogFS because it has data and 
 meta-data blocks in the same address space, but that is not the case of 
 DualFS. Anyway, I'm very interested in your work because any solution to 
 the problem of the GC will be also applicable to DualFS. So, keep up with 
 it. ;-)

Actually, no.  It is the common case for LogFS because it is designed
for flash media.  Unlike hard disks, flash lifetime is limited by the
amount of data written to it.  Therefore, having a cleaner run when the
filesystem is idle would cause unnecessary writes and reduce lifetime.

As a result, the LogFS cleaner runs as lazily as possible and the
filesystem tries hard not to mix data with different lifetimes in one
segment.  LogFS tries to avoid the cleaner like the plague.  But if it
ever needs to run it, the deadlock scenario is very close and I need to
be very aware of it. :)

In a way, the DualFS approach does not change rules for the
log-structured filesystem at all.  If you had designed your filesystem
in such a way that you simply used two existent filesystems and wrote
Actual Data (AD) to one, Metadata (MD) to another, what is MD to DualFS
is plain data to one of your underlying filesystems.  It can cause a bit
of confusion, because I tend to call MD data and you tend to call AD
data, but that is about all.

Jörn

-- 
But this is not to say that the main benefit of Linux and other GPL
software is lower-cost. Control is the main benefit--cost is secondary.
-- Bruce Perens

Re: [PATCH] [MTD] CHIPS: oops in cfi_amdstd_sync

2007-02-21 Thread Jörn Engel

On Tue, 20 February 2007 17:46:13 -0800, Vijay Sampath wrote:
 
 The files cfi_cmdset_0002.c and cfi_cmdset_0020.c do not initialize
 their wait queues like is done in cfi_cmdset_0001.c. This causes an
 oops when the wait queue is accessed. I have copied the code from
 cfi_cmdset_0001.c that is pertinent to initialization of the wait
 queue.

Patch looks good, but I can no longer test it.  Josh may still have
access to some commandset 20 chips.  Josh, any objections?

Jörn

-- 
The only real mistake is the one from which we learn nothing.
-- John Powell



Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation

2007-02-21 Thread Jörn Engel

On Wed, 21 February 2007 19:31:40 +0100, Juan Piernas Canovas wrote:
 
 I do not understand. Do you mean that if I have 10 segments, 5 busy and 5 
 free, after cleaning I could need 6 segments? How? Where the extra blocks 
 come from?

This is a fairly complicated subject and I have trouble explaining it to
people - even though I hope that maybe one or two dozen understand it by
now.  So let me try to give you an example:

In LogFS, inodes are stored in an inode file.  There are no B-Trees yet,
so the regular unix indirect blocks are used.  My example will be
writing to a directory, so that should only involve metadata by your
definition and be a valid example for DualFS as well.  If it is not,
please tell me where the difference lies.

The directory is large, so appending to it involves writing a datablock
(D0), an indirect block (D1) and a doubly indirect block (D2).

Before:
Segment 1: [some data] [   D1  ] [more data]
Segment 2: [some data] [   D0  ] [more data]
Segment 3: [some data] [   D2  ] [more data]
Segment 4: [ empty ]
...

After:
Segment 1: [some data] [garbage] [more data]
Segment 2: [some data] [garbage] [more data]
Segment 3: [some data] [garbage] [more data]
Segment 4: [D0][D1][D2][  empty]
...

Ok.  After this, the position of D2 on the medium has changed.  So we
need to update the inode and write that as well.  If the inode number
for this directory is high, we will need to write the inode (I0), an
indirect block (I1) and a doubly indirect block (I2).  The picture
becomes a bit more complicated.

Before:
Segment 1: [some data] [   D1  ] [more data]
Segment 2: [some data] [   D0  ] [more data]
Segment 3: [some data] [   D2  ] [more data]
Segment 4: [ empty ]
Segment 5: [some data] [   I1  ] [more data]
Segment 6: [some data] [   I0  ] [more data]
Segment 7: [some data] [   I2  ] [more data]
...

After:
Segment 1: [some data] [garbage] [more data]
Segment 2: [some data] [garbage] [more data]
Segment 3: [some data] [garbage] [more data]
Segment 4: [D0][D1][D2][I0][I1][I2][ empty ]
Segment 5: [some data] [garbage] [more data]
Segment 6: [some data] [garbage] [more data]
Segment 7: [some data] [garbage] [more data]
...

So what has just happened?  The user did a single "touch foo" in a large
directory and has caused six objects to move.  Unless some of those
objects were in the same segment before, we now have six segments
containing a tiny amount of garbage.
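The effect can be reproduced with a toy append-only log.  The model below is a sketch under simplifying assumptions (fixed-size segments counted in blocks, one block per object, class and method names invented here); it only shows how rewriting one leaf drags its whole path into the current segment and leaves one garbage block behind in each old segment.

```python
class Log:
    """Toy log-structured store: blocks are only ever appended."""

    def __init__(self, segment_size):
        self.segment_size = segment_size
        self.segments = [[]]   # each segment: list of block names, None = garbage
        self.location = {}     # block name -> (segment index, offset)

    def append(self, block):
        old = self.location.get(block)
        if old is not None:
            seg, off = old
            self.segments[seg][off] = None  # the old copy becomes garbage
        if len(self.segments[-1]) == self.segment_size:
            self.segments.append([])        # current segment full: open a new one
        self.segments[-1].append(block)
        self.location[block] = (len(self.segments) - 1,
                                len(self.segments[-1]) - 1)

    def rewrite_path(self, path):
        """Rewrite a leaf block and every block above it, leaf first."""
        for block in path:
            self.append(block)

    def segments_with_garbage(self):
        return sum(1 for seg in self.segments if None in seg)

path = ["D0", "D1", "D2", "I0", "I1", "I2"]
log = Log(segment_size=8)
for block in path:
    # One full segment per block: some data, the block, more data.
    for i in range(3):
        log.append(f"{block}-before-{i}")
    log.append(block)
    for i in range(4):
        log.append(f"{block}-after-{i}")

log.rewrite_path(path)  # the single "touch foo"
print(log.segments_with_garbage(), log.segments[-1])
```

In this setup all six old segments end up with exactly one garbage block each, and the new segment holds the six relocated objects.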

And there is almost no way to squeeze that garbage back out.
The cleaner will fundamentally do the same thing as a regular write - it
will move objects.  So if you want to clean a segment containing the
block of a different directory, you may again have to move five
additional objects, the indirect blocks, inode and ifile indirect
blocks.

At this point, your cleaner is becoming a threat.  There is a real
danger that it will create more garbage in unrelated segments than it
frees up.  I claim that you cannot keep 50% clean segments, unless you
move away from the simplistic cleaner I described above.

Jörn

-- 
If you're willing to restrict the flexibility of your approach,
you can almost always do something better.
-- John Carmack

Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation

2007-02-22 Thread Jörn Engel

On Thu, 22 February 2007 05:30:03 +0100, Juan Piernas Canovas wrote:
 
 DualFS writes meta-blocks in variable-sized chunks that we call partial 
 segments. The meta-data device, however, is divided into segments, which 
 have the same size. A partial segment can be as large as a segment, but a 
 segment usually has more than one partial segment. Besides, a partial 
 segment can not cross a segment boundary.

Sure, that's a fairly common approach.

 A partial segment is a transaction unit, and contains all the blocks 
 modified by a file system operation, including indirect blocks and i-nodes 
 (actually, it contains the blocks modified by several file system 
 operations, but let us assume that every partial segment only contains the 
 blocks modified by a single file system operation).
 
 So, the above figure is as follows in DualFS:
 
  Before:
  Segment 1: [some data] [ D0 D1 D2 I ] [more data]
  Segment 2: [ some data  ]
  Segment 3: [   empty]
 
 If the datablock D0 is modified, what you get is:
 
  Segment 1: [some data] [  garbage   ] [more data]
  Segment 2: [ some data  ]
  Segment 3: [ D0 D1 D2 I ] [   empty ]

You have fairly strict assumptions about the "Before" picture.  But
what happens if those assumptions fail?  To give you an example, imagine
the following small script:

$ for i in `seq 1000000`; do touch $i; done

This will create a million dentries in one directory.  It will also
create a million inodes, but let us ignore those for a moment.  It is
fairly unlikely that you can fit a million dentries into [D0], so you
will need more than one block.  Let's call them [DA], [DB], [DC], etc.
So you have to write out the first block [DA].

 Before:
Segment 1: [some data] [ DA D1 D2 I ] [more data]
Segment 2: [ some data  ]
Segment 3: [   empty]

If the datablock DA is written, what you get is:

Segment 1: [some data] [  garbage   ] [more data]
Segment 2: [ some data  ]
Segment 3: [ DA D1 D2 I ] [   empty ]

That is exactly your picture.  Fine.  Next you write [DB].

Before: see above
After:
Segment 1: [some data] [  garbage   ] [more data]
Segment 2: [ some data  ]
Segment 3: [ DA][garbage] [ DB D1 D2 I ] [ empty]

You write [DC].  Note that Segment 3 does not have enough space for
another partial segment:

Segment 1: [some data] [  garbage   ] [more data]
Segment 2: [ some data  ]
Segment 3: [ DA][garbage] [ DB][garbage] [wasted]
Segment 4: [ DC D1 D2 I ] [   empty ]

You write [DD] and [DE]:
Segment 1: [some data] [  garbage   ] [more data]
Segment 2: [ some data  ]
Segment 3: [ DA][garbage] [ DB][garbage] [wasted]
Segment 4: [ DC][garbage] [ DD][garbage] [wasted]
Segment 5: [ DE D1 D2 I ] [   empty ]

And some time later you even have to switch to a new indirect block, so
you get before:

Segment n  : [ DX D1 D2 I ] [   empty ]

After:

Segment n  : [ DX D1][garb] [ DY DI D2 I ] [ empty]

What you end up with after all this is quite unlike your "Before"
picture.  Instead of this:

  Segment 1: [some data] [ D0 D1 D2 I ] [more data]

You may have something closer to this:

 Segment 1: [some data] [   D1  ] [more data]
 Segment 2: [some data] [   D0  ] [more data]
 Segment 3: [some data] [   D2  ] [more data]

You should try the testcase and look at a dump of your filesystem
afterwards.  I usually just read the raw device in a hex editor.
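The bookkeeping behind the diagrams above can be tallied with a small model.  It is a sketch under stated assumptions: every write is one partial segment of the form [DX D1 D2 I], a segment holds 10 blocks (chosen so that two partial segments fit and the rest is wasted, as in the pictures), and the switch to a new indirect block is ignored.  The function name and parameters are invented for this example.

```python
def append_directory_blocks(count, segment_blocks=10, meta_blocks=3):
    """Tally block usage after `count` (>= 1) directory-block writes."""
    partial = 1 + meta_blocks            # one data block plus D1, D2, I
    used_in_segment = 0
    segments = 1
    wasted = 0
    for _ in range(count):
        if used_in_segment + partial > segment_blocks:
            # A partial segment cannot cross a segment boundary.
            wasted += segment_blocks - used_in_segment
            segments += 1
            used_in_segment = 0
        used_in_segment += partial
    live = count + meta_blocks           # all data blocks plus the newest D1, D2, I
    garbage = meta_blocks * (count - 1)  # superseded copies of D1, D2, I
    return {"segments": segments, "live": live,
            "garbage": garbage, "wasted": wasted}

print(append_directory_blocks(5))        # DA .. DE from the diagrams
print(append_directory_blocks(1000))
```

For the five writes DA to DE this gives three segments, 8 live blocks, 12 garbage blocks and 4 wasted blocks, so most of the space written is no longer live.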

Jörn

-- 
Beware of bugs in the above code; I have only proved it correct, but
not tried it.
-- Donald Knuth

Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation

2007-02-23 Thread Jörn Engel

On Thu, 22 February 2007 20:57:12 +0100, Juan Piernas Canovas wrote:
 
 I do not agree with this picture, because it does not show that all the 
 indirect blocks which point to a direct block are along with it in the 
 same segment. That figure should look like:
 
 Segment 1: [some data] [ DA D1' D2' ] [more data]
 Segment 2: [some data] [ D0 D1' D2' ] [more data]
 Segment 3: [some data] [ DB D1  D2  ] [more data]
 
 where D0, DA, and DB are datablocks, D1 and D2 indirect blocks which 
 point to the datablocks, and D1' and D2' obsolete copies of those 
 indirect blocks. By using this figure, it is clear that if you need to 
 move D0 to clean the segment 2, you will need only one free segment at 
 most, and not more. You will get:
 
 Segment 1: [some data] [ DA D1' D2' ] [more data]
 Segment 2: [free]
 Segment 3: [some data] [ DB D1' D2' ] [more data]
 ..
 Segment n: [ D0 D1 D2 ] [ empty ]
 
 That is, D0 needs in the new segment the same space that it needs in the 
 previous one.
 
 The differences are subtle but important.

Ah, now I see.  Yes, that is deadlock-free.  If you are not accounting
the bytes of used space but the number of used segments, and you count
each partially used segment the same as a 100% used segment, there is no
deadlock.

Some people may consider this to be cheating, however.  It will cause
more than 50% wasted space.  All obsolete copies are garbage, after all.
With a maximum tree height of N, you can have up to (N-1) / N of your
filesystem occupied by garbage.
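The (N-1) / N bound is easy to tabulate.  The helper below is only a worked instance of that formula; its name is made up for this example.

```python
def max_garbage_fraction(tree_height):
    """Worst-case share of space held by obsolete copies: (N - 1) / N."""
    if tree_height < 1:
        raise ValueError("tree height must be at least 1")
    return (tree_height - 1) / tree_height

for height in (2, 3, 5):
    print(height, f"{max_garbage_fraction(height):.0%}")
```

A height of 2 gives the 50% figure discussed earlier in the thread, and a height of 5 gives 80%.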

It also means that df will have unexpected output.  You cannot
estimate how much data can fit into the filesystem, as that depends on
how much garbage you will accumulate in the segments.  Admittedly this
is not a problem for DualFS, as the uncertainty only exists for
metadata, so df for DualFS still makes sense.

Another downside is that with large amounts of garbage between otherwise
useful data, your disk cache hit rate goes down.  Read performance is
suffering.  But that may be a fair tradeoff and will only show up in
large metadata reads in the uncached (per Linux) case.  Seems fair.

Quite interesting, actually.  The costs of your design are disk space,
depending on the amount and depth of your metadata, and metadata read
performance.  Disk space is cheap and metadata reads tend to be slow for
most filesystems, in comparison to data reads.  You gain faster metadata
writes and loss of journal overhead.  I like the idea.

Jörn

-- 
All art is but imitation of nature.
-- Lucius Annaeus Seneca

Re: SLUB: The unqueued Slab allocator

2007-02-24 Thread Jörn Engel

On Sat, 24 February 2007 09:32:49 -0800, Christoph Lameter wrote:
 
 If that is a problem for particular object pools then we may be able to 
 except those from the merging.

How much of a gain is the merging anyway?  Once you start having
explicit whitelists or blacklists of pools that can be merged, one can
start to wonder if the result is worth the effort.

Jörn

-- 
Joern's library part 6:
http://www.gzip.org/zlib/feldspar.html

Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation

2007-02-25 Thread Jörn Engel

On Sun, 25 February 2007 03:41:40 +0100, Juan Piernas Canovas wrote:
 
 Well, our experimental results say another thing. As I have said, the 
 greatest part of the files are written at once, so their meta-data blocks 
 are together on disk. This allows DualFS to implement an explicit 
 prefetching of meta-data blocks which is quite effective, specially when 
 there are several processes reading from disk at the same time.
 
 On the other hand, DualFS also implements an on-line meta-data relocation 
 mechanism which can help to improve meta-data prefetching, and garbage 
 collection.
 
 Obviously, there can be some slow-growing files that can produce some 
 garbage, but they do not hurt the overall performance of the file system.

Well, my concerns about the design have gone.  There remain some
concerns about the source code and I hope they will disappear just as
fast. :)

Obviously, a patch against 2.4.x is fairly useless.  Iirc, you claimed
somewhere to have a patch against 2.6.11, but I was unable to find that.
Porting 2.6.11 to 2.6.20 should be simple enough.

Then there is some assembly code inside the patch that you seem to have
copied from some other project.  I would be surprised if that is really
required.  If you can replace it with C code, please do.

If the assembly actually is a performance gain (and I consider it your
duty to prove that), you can have a two-patch series with the first
introducing DualFS and the second adding the assembly as a config option
for one architecture.

 Yeah :) If you have taken a look at my presentation at LFS07, the disk 
 traffic of meta-data blocks is dominated by writes.

Last time I tried it was only available to members.  Is it generally
available now?

Jörn

-- 
My second remark is that our intellectual powers are rather geared to
master static relations and that our powers to visualize processes
evolving in time are relatively poorly developed.
-- Edsger W. Dijkstra

Re: SLUB: The unqueued Slab allocator

2007-02-25 Thread Jörn Engel

On Sat, 24 February 2007 16:14:48 -0800, Christoph Lameter wrote:
 
 It eliminates 50% of the slab caches. Thus it reduces the management 
 overhead by half.

How much management overhead is there left with SLUB?  Is it just the
one per-node slab?  Is there runtime overhead as well?

In a slightly different approach, can we possibly get rid of some slab
caches, instead of merging them at boot time?  On my system I have 97
slab caches right now, ignoring the generic kmalloc() ones.  Of those,
28 are completely empty, 23 contain at most 10 objects, 23 at most 100
and 23 contain more than 100 objects.

It is fairly obvious to me that the highly populated slab caches are a
big win.  But is it worth it to have slab caches with a single object
inside?  Maybe some of these caches are populated for some systems.
But there could also be candidates for removal among them.

# active_objs num_objs name
0 0 dm-crypt_io
0 0 dm_io
0 0 dm_tio
0 0 ext3_xattr
0 0 fat_cache
0 0 fat_inode_cache
0 0 flow_cache
0 0 inet_peer_cache
0 0 ip_conntrack_expect
0 0 ip_mrt_cache
0 0 isofs_inode_cache
0 0 jbd_1k
0 0 jbd_4k
0 0 kiocb
0 0 kioctx
0 0 nfs_inode_cache
0 0 nfs_page
0 0 posix_timers_cache
0 0 request_sock_TCP
0 0 revoke_record
0 0 rpc_inode_cache
0 0 scsi_io_context
0 0 secpath_cache
0 0 skbuff_fclone_cache
0 0 tw_sock_TCP
0 0 udf_inode_cache
0 0 uhci_urb_priv
0 0 xfrm_dst_cache
1 169 dnotify_cache
1 30 arp_cache
1 7 mqueue_inode_cache
2 101 eventpoll_pwq
2 203 fasync_cache
2 254 revoke_table
2 30 eventpoll_epi
2 9 RAW
4 17 ip_conntrack
7 10 biovec-128
7 10 biovec-64
7 20 biovec-16
7 42 file_lock_cache
7 59 biovec-4
7 59 uid_cache
7 8 biovec-256
7 9 bdev_cache
8 127 inotify_event_cache
8 20 rpc_tasks
8 8 rpc_buffers
10 113 ip_fib_alias
10 113 ip_fib_hash
10 12 blkdev_queue
11 203 biovec-1
11 22 blkdev_requests
13 92 inotify_watch_cache
16 169 journal_handle
16 203 tcp_bind_bucket
16 72 journal_head
18 18 UDP
19 19 names_cache
19 28 TCP
22 30 mnt_cache
27 27 sigqueue
27 60 ip_dst_cache
32 32 sgpool-128
32 32 sgpool-32
32 32 sgpool-64
32 36 nfs_read_data
32 45 sgpool-16
32 60 sgpool-8
36 42 nfs_write_data
72 80 cfq_pool
74 127 blkdev_ioc
78 92 cfq_ioc_pool
94 94 pgd
107 113 fs_cache
108 108 mm_struct
108 140 files_cache
123 123 sighand_cache
125 140 UNIX
130 130 signal_cache
147 147 task_struct
154 174 idr_layer_cache
158 404 pid
190 190 sock_inode_cache
260 295 bio
273 273 proc_inode_cache
840 920 skbuff_head_cache
1234 1326 inode_cache
1507 1510 shmem_inode_cache
2871 3051 anon_vma
2910 3360 filp
5161 5292 sysfs_dir_cache
5762 6164 vm_area_struct
12056 19446 radix_tree_node
65776 151272 buffer_head
578304 578304 ext3_inode_cache
677490 677490 dentry_cache
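The grouping used above (empty, at most 10, at most 100, more than 100 active objects) can be recomputed from a listing in the same three-column format.  This is a small sketch; the function name is invented, and the sample contains only a handful of rows copied from the table.

```python
from collections import Counter

def bucket_slab_caches(listing):
    """Group caches by active object count: empty, <=10, <=100, >100."""
    buckets = Counter()
    for line in listing.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        active_objs, _num_objs, _name = line.split(None, 2)
        active = int(active_objs)
        if active == 0:
            buckets["empty"] += 1
        elif active <= 10:
            buckets["<=10"] += 1
        elif active <= 100:
            buckets["<=100"] += 1
        else:
            buckets[">100"] += 1
    return buckets

sample = """\
# active_objs num_objs name
0 0 dm_io
1 169 dnotify_cache
10 12 blkdev_queue
94 94 pgd
107 113 fs_cache
677490 677490 dentry_cache
"""
print(bucket_slab_caches(sample))
```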

Jörn

-- 
And spam is a useful source of entropy for /dev/random too!
-- Jasmine Strong

Re: [RFC] Heads up on sys_fallocate()

2007-03-04 Thread Jörn Engel

On Sun, 4 March 2007 14:38:13 -0800, Ulrich Drepper wrote:
 
 When you do it like this, how can the kernel/filesystem *guarantee* that
 when the data is written there actually is room on the harddrive?
 
 What you described seems like using truncate/ftruncate to increase the
 file's size.  That is not at all what posix_fallocate is for.
 posix_fallocate must make sure that the requested blocks on the disk are
 reserved (allocated) for the file's use and that at no point in the
 future will, say, a msync() fail because a mmap(MAP_SHARED) page has
 been written to.

That actually causes an interesting problem for compressing filesystems.
The space consumed by blocks depends on their contents and how well it
compresses.  At the moment, the only option I see to support
posix_fallocate for LogFS is to set an inode flag disabling compression,
then allocate the blocks.

But if the file already contains large amounts of compressed data, I
have a problem.  Disabling compression for a range within a file is not
supported, so I can only return an error.  But which one?

Jörn

-- 
A surrounded army must be given a way out.
-- Sun Tzu

Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Jörn Engel

On Mon, 5 March 2007 01:36:36 +0100, Arnd Bergmann wrote:
 
 Using the current glibc implementation on a compressed file system ideally
 should be a very expensive no-op because you won't actually allocate much
 space for a file when writing zeroes to it. You also don't benefit from a
 contiguous allocation in logfs, since flash has uniform seek times over
 all the medium.
 
 I'd suggest you implement posix_fallocate as a real nop and just return
 success without doing anything. You could also return ENOSPC in case
 the blocks requested by posix_fallocate don't fit on the medium without
 compression, but that is more or less just guesswork (like statfs is).

Quoting POSIX_FALLOCATE(3):
   The function posix_fallocate() ensures that disk space is allocated for
   the file referred to by the descriptor fd for the bytes  in  the range
   starting  at  offset  and continuing for len bytes.  After a successful
   call to posix_fallocate(), subsequent writes to bytes in the specified
   range are guaranteed not to fail because of lack of disk space.

   If  the  size  of  the  file  is less than offset+len, then the file is
   increased to this size; otherwise the file size is left unchanged.

Afaics, the (main) purpose of this function is not to decrease
fragmentation but to ensure mmap() won't cause any problems because the
medium fills up.  That problem exists for LogFS as well, once rw mmap()
is supported.
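From the caller's side, the contract quoted from the manpage looks roughly like this.  The sketch uses Python's os.posix_fallocate() wrapper on a temporary file; the helper name is made up, and treating ENOSPC as the only expected failure is an assumption of this example.

```python
import errno
import os
import tempfile

def reserve_space(fd, offset, length):
    """Ask for blocks to be reserved; True on success, False on ENOSPC."""
    try:
        os.posix_fallocate(fd, offset, length)
        return True
    except OSError as err:
        if err.errno == errno.ENOSPC:
            return False  # the guarantee cannot be given
        raise

with tempfile.TemporaryFile() as f:
    ok = reserve_space(f.fileno(), 0, 64 * 1024)
    size = os.fstat(f.fileno()).st_size
    print(ok, size)
```

On success the file size grows to offset+len, as the manpage describes, and later writes inside that range are supposed not to fail for lack of space.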

Simply returning success without doing anything would be a bug.  -ENOSPC
is a better choice, but still a lame implementation.  And falling back
on libc to write zeroes in a loop is an exercise in futility.

Does the allocation have to be persistent beyond lifetime of the file
descriptor?  It would be fairly simple to support the write guarantee
while the file is open (or rather the inode remains cached) and drop it
afterwards.

Jörn

-- 
[One] doesn't need to know [...] how to cause a headache in order
to take an aspirin.
-- Scott Culp, Manager of the Microsoft Security Response Center, 2001

Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Jörn Engel

On Mon, 5 March 2007 00:32:14 +, Anton Altaparmakov wrote:
 
 I don't know how your compression algorithm works [...]

LogFS is designed for flash media, so it does not have to worry much
about reducing disk seeks.  It is log-structured, which simplifies
compression further.

When writing a block, it basically compresses it and appends it to the
log.  Writes only have to be byte-aligned, so no space is lost for
padding.

The bad news for posix_fallocate() is that even if libc is smart enough
to write random data, mmap() can still cause problems.  If the VM
decides to write a given page twice, the second write compresses better
and the medium has filled up between the two writes, the users will have
fun.
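How strongly the space needed for one page depends on its contents is easy to demonstrate.  The sketch below uses zlib purely as a stand-in compressor and a 4096 byte page; both are assumptions of this example, not statements about what LogFS does internally.

```python
import random
import zlib

PAGE_SIZE = 4096

def compressed_size(page):
    """Bytes needed to store one page after compression."""
    return len(zlib.compress(page))

zero_page = bytes(PAGE_SIZE)
rng = random.Random(0)  # fixed seed so the run is repeatable
noisy_page = bytes(rng.getrandbits(8) for _ in range(PAGE_SIZE))

print(compressed_size(zero_page), compressed_size(noisy_page))
```

The page of zeroes shrinks to a few dozen bytes while the pseudo-random page does not shrink at all, so the same page can need very different amounts of space from one write to the next.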

Jörn

-- 
Joern's library part 9:
http://www.scl.ameslab.gov/Publications/Gus/TwelveWays.html

Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Jörn Engel

On Mon, 5 March 2007 07:08:03 -0800, Ulrich Drepper wrote:
 Jörn Engel wrote:
  Does the allocation have to be persistent beyond lifetime of the file
  descriptor?
 
 Of course.  You call posix_fallocate once for the lifetime of the file
 when it is created to ensure that all future uses will work.

That part is not quite clear from the manpage but I trust most people
would assume the same.

 It seems your filesystem will not be able to support this unless
 compression is turned off.

Correct.  Compression needs to be turned off for a file, if
posix_fallocate(3) is to succeed.  What I could do is disable
compression (meaning that no data written in the future will be
compressed) and rewrite all blocks within the given range.

Still, it is quite obvious that no one designing this interface has given
much thought to compressing filesystems.  Whatever I can come up with
will either be incompatible or some sort of hack.  :(

Jörn

-- 
Courage is not the absence of fear, but rather the judgement that
something else is more important than fear.
-- Ambrose Redmoon

Re: [RFC] Heads up on sys_fallocate()

2007-03-07 Thread Jörn Engel

On Wed, 7 March 2007 09:51:35 +0100, Jan Kara wrote:

   I'll probably first write some userspace fs-reorganizer to find out how
 much these changes in layout are able to give you in performance (i.e.
 whether it's worth the effort of more complicated kernel online
 defragmenter).

Have you tried profiling the read accesses and prereading them
asynchronously on startup?  That appears to have improved E17 a lot.
See http://lca2007.linux.org.au/talk/101 (and watch the video).

Jörn

-- 
The competent programmer is fully aware of the strictly limited size of
his own skull; therefore he approaches the programming task in full
humility, and among other things he avoids clever tricks like the plague.
-- Edsger W. Dijkstra

Re: [patch] add file position info to proc

2007-03-27 Thread Jörn Engel

On Tue, 27 March 2007 21:24:20 +, Pavel Machek wrote:
  From: Miklos Szeredi [EMAIL PROTECTED]
  
  This patch adds support for finding out the current file position,
  open flags and possibly other info in the future.
  
  These new entries are added:
  
/proc/PID/fdinfo/FD
/proc/PID/task/TID/fdinfo/FD
  
  For each fd the information is provided in the following format:
  
  pos:1234
  flags:  012
 
 Octal? Maybe we should use more traditional hex here? Or even list
 flags by name?

The flags are defined in octal.  Whether that choice makes sense or
should be rethought is a different question.  I would definitely prefer
hex.
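Whatever base the file uses, a reader can convert it.  The sketch below parses the two-line format shown in the patch description and prints the flags in octal and hex; the function names are invented here, and decoding only the access mode is a simplification.

```python
import os

ACCESS_MASK = os.O_RDONLY | os.O_WRONLY | os.O_RDWR

def parse_fdinfo(text):
    """Parse 'key: value' lines; 'flags' is octal, 'pos' is decimal."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "flags":
            info[key] = int(value, 8)
        elif key == "pos":
            info[key] = int(value)
        else:
            info[key] = value
    return info

def access_mode(flags):
    """Name the access mode encoded in the low bits of the open flags."""
    names = {os.O_RDONLY: "O_RDONLY", os.O_WRONLY: "O_WRONLY",
             os.O_RDWR: "O_RDWR"}
    return names.get(flags & ACCESS_MASK, "unknown")

info = parse_fdinfo("pos:\t1234\nflags:\t012\n")
print(info["pos"], oct(info["flags"]), hex(info["flags"]),
      access_mode(info["flags"]))
```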

Jörn

-- 
You ain't got no problem, Jules. I'm on the motherfucker. Go back in
there, chill them niggers out and wait for the Wolf, who should be
coming directly.
-- Marsellus Wallace

Re: Interface for the new fallocate() system call

2007-03-30 Thread Jörn Engel

On Fri, 30 March 2007 19:15:58 +1000, Paul Mackerras wrote:
 Heiko Carstens writes:
 
  If possible I'd prefer the six-32-bit-args approach.
 
 It does mean extra unnecessary work for 64-bit platforms, though...

Wouldn't that work be confined to fallocate()?  If I understand Heiko
correctly, the alternative would slow s390 down for every syscall,
including more performance-critical ones.
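For reference, the six-32-bit-args idea amounts to passing each 64-bit value as two halves.  The sketch below shows only the arithmetic; the argument order (high half first) and the function names are assumptions made for illustration, not the interface that was under discussion.

```python
MASK32 = 0xFFFFFFFF

def split64(value):
    """Split an unsigned 64-bit value into (high, low) 32-bit halves."""
    if not 0 <= value < 1 << 64:
        raise ValueError("value does not fit in 64 bits")
    return (value >> 32) & MASK32, value & MASK32

def join64(high, low):
    """Rebuild the 64-bit value from its two 32-bit halves."""
    return ((high & MASK32) << 32) | (low & MASK32)

def fallocate_args32(fd, mode, offset, length):
    """Hypothetical six-argument layout: fd, mode, offset halves, len halves."""
    return (fd, mode, *split64(offset), *split64(length))

args = fallocate_args32(3, 0, (5 << 32) | 7, 1 << 40)
print(args)
```

The receiving side would call join64() on each pair, which is the extra work on 64-bit platforms mentioned above.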

Jörn

-- 
tglx1 thinks that joern should get a (TM) for Thinking Is Hard
-- Thomas Gleixner

Re: [PATCH 09/16] zlib-decompression-status.diff

2007-04-02 Thread Jörn Engel

On Sun, 1 April 2007 20:15:42 +0200, Jan Engelhardt wrote:
  
 +static inline void putstr(const char *s) {
 +	printk("%s", s);
 +	return;
 +}
 +
  static int __init crd_load(int in_fd, int out_fd)
  {
   int result;
 @@ -418,7 +423,7 @@ static int __init crd_load(int in_fd, in
   return -1;
   }
   makecrc();
 - result = gunzip();
 + result = gunzip(putstr);

You are sure this wasn't meant as an April fools joke?  Passing the
address of an inline function certainly has a humorous aspect. ;)
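
Taking the address of an inline function is legal C; the compiler just has
to emit an out-of-line copy, and a call through the pointer will generally
not be inlined.  A minimal userspace sketch with made-up names
(print_progress and run_with_callback are not from the patch):

```c
#include <stddef.h>

static int calls; /* counts callback invocations, for illustration only */

/* Marked inline, but its address is taken by callers, so the compiler
 * must also provide an ordinary out-of-line version. */
static inline void print_progress(const char *s)
{
	(void)s;
	calls++;
}

/* Calls the callback only if one was supplied, similar to the NULL
 * check in the quoted patch. */
static void run_with_callback(void (*cb)(const char *))
{
	if (cb != NULL)
		cb("*");
}
```

Calling run_with_callback(print_progress) works as expected; the inline
keyword simply buys nothing on that path.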

Also, you can remove the return; in the void function and possibly
change this bit to match Documentation/CodingStyle.

 +if(putstr != NULL) putstr("*");

The patch alternately uses puts() and putstr(), which looks rather odd.
Not sure whether that makes sense or not.

Jörn

-- 
My second remark is that our intellectual powers are rather geared to
master static relations and that our powers to visualize processes
evolving in time are relatively poorly developed.
-- Edsger W. Dijkstra

Re: missing madvise functionality

2007-04-03 Thread Jörn Engel

On Tue, 3 April 2007 23:10:14 +0200, Eric Dumazet wrote:
 
 mmap()/brk() must give fresh NULL pages, but maybe madvise(MADV_DONTNEED) 
 can relax this requirement (if the pages were reclaimed, then a page fault 
 could bring a new page with random content)

...provided that it doesn't leak information from the kernel?

Jörn

-- 
All art is but imitation of nature.
-- Lucius Annaeus Seneca

Re: [PATCH] block2mtd lockdep_init_map warning

2008-01-07 Thread Jörn Engel

On Mon, 7 January 2008 11:05:26 +0100, Peter Zijlstra wrote:
 
 Would something like this work for people?

Looks a lot better than what I thought of.  However, does the #ifdef
within is_module_address() make sense when afaict lockdep is the only
caller of that function?  Looks as if the whole function should be made
conditional or none of it.

 Not-Yet-Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
 ---
 Index: linux-2.6/include/linux/sched.h
 ===
 --- linux-2.6.orig/include/linux/sched.h
 +++ linux-2.6/include/linux/sched.h
 @@ -1160,6 +1160,7 @@ struct task_struct {
   int lockdep_depth;
   struct held_lock held_locks[MAX_LOCK_DEPTH];
   unsigned int lockdep_recursion;
 + struct module *loading_module;
  #endif
  
  /* journalling filesystem info */
 Index: linux-2.6/kernel/module.c
 ===
 --- linux-2.6.orig/kernel/module.c
 +++ linux-2.6/kernel/module.c
 @@ -2023,6 +2023,9 @@ static struct module *load_module(void _
   printk(KERN_WARNING "%s: Ignoring obsolete parameters\n",
  mod->name);
  
 +#ifdef CONFIG_LOCKDEP
 + current->loading_module = mod;
 +#endif
   /* Size of section 0 is 0, so this works well if no params */
   err = parse_args(mod->name, mod->args,
(struct kernel_param *)
 @@ -2030,6 +2033,9 @@ static struct module *load_module(void _
sechdrs[setupindex].sh_size
/ sizeof(struct kernel_param),
NULL);
 +#ifdef CONFIG_LOCKDEP
 + current->loading_module = NULL;
 +#endif
   if (err < 0)
   goto arch_cleanup;
  
 @@ -2454,6 +2460,17 @@ int is_module_address(unsigned long addr
   }
   }
  
 +#ifdef CONFIG_LOCKDEP
 + if (current->loading_module) {
 + mod = current->loading_module;
 + if (within(addr, mod->module_init, mod->init_text_size)
 + || within(addr, mod->module_core, mod->core_text_size)) {
 + preempt_enable();
 + return 1;
 + }
 + }
 +#endif
 +
   preempt_enable();
  
   return 0;
 
 
 

Jörn

-- 
I don't understand it. Nobody does.
-- Richard P. Feynman

Re: [noob q. on block layer] block IO read-ahead during sequential write?

2008-01-07 Thread Jörn Engel

On Mon, 7 January 2008 13:25:09 +0100, Frantisek Rysanek wrote:
 
 let me start with a simple example. The following commands:
 
  cp /dev/zero /dev/hda
  dd if=/dev/zero of=/dev/hda [bs=512]
 
 both have one common side-effect: apart from the disk being properly 
 overwritten with zeroes, the kernel seems to keep reading sectors 
 ahead of the current seek position of the sequential write.

Block devices are cached in the page cache.  If you write less than a
full page, any remainder has to be read from the device.

If you retry the dd with bs=4096 (or whatever your architecture's page
size happens to be), does this still occur?
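
A simplified model of that rule, not actual block-layer code: a write can
skip the read only if it covers whole pages.  The function name and the
explicit page_size parameter are assumptions made for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if a write of len bytes at offset covers only whole
 * pages, so no partial page would have to be read from the device
 * first.  page_size must be non-zero. */
static bool write_covers_whole_pages(unsigned long long offset, size_t len,
				     size_t page_size)
{
	return len != 0 && offset % page_size == 0 && len % page_size == 0;
}
```

With a 4096-byte page, a 512-byte write at offset 0 fails the check, while
a 4096-byte write at offset 0 passes it.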

Jörn

-- 
Chance favors only the prepared mind.
-- Louis Pasteur

Re: [PATCH] Claim maintainership for block2mtd and update email addresses

2008-01-07 Thread Jörn Engel

On Mon, 7 January 2008 15:23:00 -0800, Andrew Morton wrote:
 On Sun, 6 Jan 2008 14:56:01 +0100
 J__rn Engel [EMAIL PROTECTED] wrote:

You found a new one!  That makes a round dozen, I believe.
http://logfs.org/logfs/joern

  - * Copyright (C) 2004-2006 JÃ¶rn Engel [EMAIL PROTECTED]
  + * Copyright (C) 2004-2006 Joern Engel [EMAIL PROTECTED]
 
 Yup.  Your name comes out like that when sylpheed does its
 save-email-to-a-file thing as well and I haven't got around to working out
 why, or to reporting it.
 
 In this case it looks like the dud characters came about due to [MTD] Fix
 legacy character sets throughout drivers/mtd, include/linux/mtd.  Which
 doesn't look like it fixed anything much really.
 
 Going with the asciified/anglicised/bastardised spelling is a practical
 (albeit unhappy) solution.

I'm happy if people spend effort and make unicode work.  Until then I'll
semi-officially change my name to Joern and keep collecting unusual
specimens.

Jörn

-- 
Measure. Don't tune for speed until you've measured, and even then
don't unless one part of the code overwhelms the rest.
-- Rob Pike

Re: [RFC PATCH] greatly reduce SLOB external fragmentation

2008-01-10 Thread Jörn Engel

On Thu, 10 January 2008 11:49:25 -0600, Matt Mackall wrote:
 
 b) grouping objects of the same -type- (not size) together should mean
 they have similar lifetimes and thereby keep fragmentation low
 
 (b) is known to be false, you just have to look at our dcache and icache
 pinning.

(b) is half-true, actually.  The grouping by lifetime makes a lot of
sense.  LogFS has a similar problem to slabs (only full segments are
useful, a single object can pin the segment).  And when I grouped my
objects very roughly by their life expectancy, the impact was *HUGE*!

In both cases, you want slabs/segments that are either close to 100%
full or close to 0% full.  It matters a lot when you have to move
objects around and I would bet it matters even more when you cannot move
objects and the slab just remains pinned.

So just because the type alone is a relatively bad heuristic for life
expectancy does not make the concept false.  Bonwick was onto something.
He just failed in picking a good heuristic.  Quite likely spreading by
type was even a bonus when slab was developed, because even such a crude
heuristic is slightly better than completely randomized lifetimes.

I've been meaning to split the dentry cache into 2-3 separate ones for a
while and kept spending my time elsewhere.  But I remain convinced that
this will make a measurable difference.

Jörn

-- 
Never argue with idiots - first they drag you down to their level,
then they beat you with experience.
-- unknown

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Jörn Engel

On Sun, 16 September 2007 00:30:32 +0200, Andrea Arcangeli wrote:
 
 Movable? I rather assume all slab allocations aren't movable. Then
 slab defrag can try to tackle on users like dcache and inodes. Keep in
 mind that with the exception of updatedb, those inodes/dentries will
 be pinned and you won't move them, which is why I prefer to consider
 them not movable too... since there's no guarantee they are.

I have been toying with the idea of having separate caches for pinned
and movable dentries.  Downside of such a patch would be the number of
memcpy() operations when moving dentries from one cache to the other.
Upside is that a fair amount of slab cache can be made movable.
memcpy() is still faster than reading an object from disk.

Most likely the current reaction to such a patch would be to shoot it
down due to overhead, so I didn't pursue it.  All I have is an old patch
to separate never-cached from possibly-cached dentries.  It will
increase the odds of freeing a slab, but provide no guarantee.

But the point here is: dentries/inodes can be made movable if there are
clear advantages to it.  Maybe they should?

Jörn

-- 
Joern's library part 2:
http://www.art.net/~hopkins/Don/unix-haters/tirix/embarrassing-memo.html

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Jörn Engel

On Sat, 15 September 2007 01:44:49 -0700, Andrew Morton wrote:
 On Tue, 11 Sep 2007 14:12:26 +0200 Jörn Engel [EMAIL PROTECTED] wrote:
 
  While I agree with your concern, those numbers are quite silly.  The
  chances of 99.8% of pages being free and the remaining 0.2% being
  perfectly spread across all 2MB large_pages are lower than those of SHA1
  creating a collision.
 
 Actually it'd be pretty easy to craft an application which allocates seven
 pages for pagecache, then one for something, then seven for pagecache, then
 one for something, etc.
 
 I've had test apps which do that sort of thing accidentally.  The result
 wasn't pretty.

I bet!  My (false) assumption was the same as Goswin's.  If non-movable
pages are clearly separated from movable ones and will evict movable
ones before polluting further mixed superpages, Nick's scenario would be
nearly infinitely impossible.

That assumption doesn't reflect the current code, and enforcing it
would cost extra overhead.  The amount of effort to make Christoph's
approach work reliably seems substantial and I have no idea whether it
would be worth it.
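
The arithmetic behind this exchange can be sketched with a toy model.  It
assumes 4 KiB pages, 2 MiB superpages (512 pages each) and an allocation
pattern in which every stride-th page is unmovable; the function name and
the model itself are illustrative, not kernel code.

```c
#include <stddef.h>

#define PAGES_PER_SUPERPAGE 512	/* 2 MiB / 4 KiB, assumed sizes */

/* Count the superpages that contain at least one unmovable page when
 * every stride-th page (the last of each group of stride pages) is
 * unmovable.  stride must be non-zero. */
static size_t pinned_superpages(size_t total_pages, size_t stride)
{
	size_t pinned = 0;

	for (size_t sp = 0; sp + PAGES_PER_SUPERPAGE <= total_pages;
	     sp += PAGES_PER_SUPERPAGE) {
		for (size_t i = sp; i < sp + PAGES_PER_SUPERPAGE; i++) {
			if (i % stride == stride - 1) {
				pinned++;
				break;
			}
		}
	}
	return pinned;
}
```

With a stride of 8 (seven pagecache pages, then one other allocation)
every superpage ends up holding an unmovable page.  The same is true for a
stride of 512, where only about 0.2% of all pages are unmovable.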

Jörn

-- 
Happiness isn't having what you want, it's wanting what you have.
-- unknown

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Jörn Engel

On Sun, 16 September 2007 11:15:36 -0700, Linus Torvalds wrote:
 On Sun, 16 Sep 2007, Jörn Engel wrote:
  
  I have been toying with the idea of having separate caches for pinned
  and movable dentries.  Downside of such a patch would be the number of
  memcpy() operations when moving dentries from one cache to the other.
 
 Totally inappropriate.
 
 I bet 99% of all dentry_lookup() calls involve turning the last dentry 
 from having a count of zero (movable) to having a count of 1 (pinned).
 
 So such an approach would fundamentally be broken. It would slow down all 
 normal dentry lookups, since the *common* case for leaf dentries is that 
 they have a zero count.

Why am I not surprised? :)

 So it's much better to do it on a directory/file basis, on the 
 assumption that files are *mostly* movable (or just freeable). The fact 
 that they aren't always (ie while kept open etc), is likely statistically 
 not all that important.

My approach is to have one for mount points and ramfs/tmpfs/sysfs/etc.
which are pinned for their entire lifetime and another for regular
files/inodes.  One could take a three-way approach and have
always-pinned, often-pinned and rarely-pinned.

We won't get never-pinned that way.

Jörn

-- 
The wise man seeks everything in himself; the ignorant man tries to get
everything from somebody else.
-- unknown

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-16 Thread Jörn Engel

On Mon, 17 September 2007 00:06:24 +0200, Goswin von Brederlow wrote:
 
 How probable is it that the dentry is needed again? If you copy it and
 it is not needed then you wasted time. If you throw it out and it is
 needed then you wasted time too. Depending on the probability one of
 the two is cheaper overall. Ideally I would throw away dentries that
 haven't been accessed recently and copy recently used ones.
 
 How much of a system's RAM is spent on dentries? How much on task
 structures? Does anyone have some stats on that? If it is 10% of the
 total ram combined then I don't see much point in moving them. Just
 keep them out of the way of users memory so the buddy system can work
 effectively.

As usual, the answer is "it depends".  I've had up to 600MB in dentry
and inode slabs on a 1GiB machine after updatedb.  This machine
currently has 13MB in dentries, which seems to be reasonable for my
purposes.

Jörn

-- 
Audacity augments courage; hesitation, fear.
-- Publilius Syrus

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-18 Thread Jörn Engel

On Tue, 18 September 2007 11:00:40 +0100, Mel Gorman wrote:
 
 We still lack data on what sort of workloads really benefit from large
 blocks

Compressing filesystems like jffs2 and logfs gain better compression
ratio with larger blocks.  Going from 4KiB to 64KiB gave somewhere
around 10% benefit iirc.  Testdata was a 128MiB qemu root filesystem.

Granted, the same could be achieved by adding some extra code and a few
bounce buffers to the filesystem.  How such a hack would perform I'd
prefer not to find out, though. :)

Jörn

-- 
Write programs that do one thing and do it well. Write programs to work
together. Write programs to handle text streams, because that is a
universal interface.
-- Doug McIlroy

Re: x86_64: Make sparsemem/vmemmap the default memory model

2007-11-13 Thread Jörn Engel

On Mon, 12 November 2007 20:41:10 -0800, Christoph Lameter wrote:
 On Mon, 12 Nov 2007, Ray Lee wrote:
 
  Discontig obviously needs to die. However, FlatMem is consistently
  faster, averaging about 2.1% better overall for your numbers above. Is
  the page allocator not, erm, a fast path, where that matters?
  
  Order   Flat    Sparse  % diff
  0       639     641     0.3
 
 IMHO Order 0 currently matters most and the difference is negligible 
 there.

Is it?  I am a bit concerned about the non-monotonic distribution.
Difference starts at near-0, grows to 4.4, drops to near-0, grows to 4.9,
drops to near-0.

Order   Flat    Sparse  % diff
0       639     641     0.3
1       567     593     4.4
2       679     692     1.9
3       763     781     2.3
4       961     962     0.1
5       1356    1392    2.6
6       2224    2336    4.8
7       4869    5074    4.0
8       12500   12732   1.8
9       27926   28165   0.8
10      58578   58682   0.2

Is there an explanation for this behaviour?  More to the point, could
repeated runs also return 4% difference for order-0?
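
For anyone rechecking the table: the quoted percentages appear to be the
Flat/Sparse difference taken relative to the Sparse column.  That is an
inference from the numbers, not something stated in the thread, and the
helper below is only a sketch of it.

```c
/* Percentage by which Sparse exceeds Flat, relative to Sparse.
 * The choice of denominator is an assumption inferred from the table. */
static double pct_diff(double flat, double sparse)
{
	return (sparse - flat) * 100.0 / sparse;
}
```

For order 1 this gives (593 - 567) * 100 / 593, roughly 4.4, which matches
the table.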

Jörn

-- 
It does not require a majority to prevail, but rather an irate,
tireless minority keen to set brush fires in people's minds.
-- Samuel Adams

Re: [BUG] New Kernel Bugs

2007-11-13 Thread Jörn Engel

On Tue, 13 November 2007 15:18:07 -0500, Mark Lord wrote:
 
 I just find it weird that something can be known broken for several -rc*
 kernels before I happen to install it, discover it's broken on my own machine,
 and then I track it down, fix it, and submit the patch, generally all within a
 couple of hours.  Where the heck was the dude(ess) that broke it ??  AWOL.
 
 And when I receive hostility from the maintainers of said code for fixing
 their bugs, well.. that really motivates me to continue reporting new ones..

Given a decent bug report, I agree that having the bug not looked at is
shameful.  But what can a developer do if a bug report effectively reads
"there is some bug somewhere in recent kernels"?  How can I know that in
this particular case it is my bug that I introduced?  It could just as
easily be 50 other people and none of them are eager to debug it unless
they suspect it to be their bug.

This is a common problem and fairly unrelated to linux in general or the
kernel in particular.  Who is going to be the sucker that figures out
which developer the bug belongs to?  And I have yet to find a project,
commercial or opensource, where volunteers flock to become such a
sucker.

One option is to push this role to the bug reporter.  Another is to
strong-arm some developers into this role, by whatever means.  A third
would be for $LARGE_COMPANY to hire some people.  If you have a better
idea or would volunteer your time, I'd be grateful.  Simply blaming one
side, whether bug reporter or a random developer, for not being the
sucker doesn't help anyone.

Jörn

-- 
Joern's library part 2:
http://www.art.net/~hopkins/Don/unix-haters/tirix/embarrassing-memo.html

Re: [BUG] New Kernel Bugs

2007-11-13 Thread Jörn Engel

On Tue, 13 November 2007 13:56:58 -0800, Andrew Morton wrote:
 
 It's relatively common that a regression in subsystem A will manifest as a
 failure in subsystem B, and the report initially lands on the desk of the
 subsystem B developers.
 
 But that's OK.  The subsystem B people are the ones with the expertise to
 be able to work out where the bug resides and to help the subsystem A
 people understand what went wrong.
 
 Alas, sometimes the B people will just roll eyes and do nothing because
 they know the problem wasn't in their code.  Sometimes.

And sometimes the A people will ignore the B people after the root cause
has been worked out.  Do you have a good idea how to shame A into
action?  Should I put you on Cc:?  Right now I'm in the eye-rolling
phase.

Jörn

-- 
The cost of changing business rules is much more expensive for software
than for a secretary.
-- unknown

Re: x86_64: Make sparsemem/vmemmap the default memory model

2007-11-13 Thread Jörn Engel

On Tue, 13 November 2007 13:52:17 -0800, Christoph Lameter wrote:
 
 Could you run your own test to verify?

You bastard!  You know I'm too lazy to do that. ;)

As long as the order-0 number is stable across multiple runs I don't
mind.  The numbers just looked suspiciously as if they were not stable.
That's all.

Jörn

-- 
Why do musicians compose symphonies and poets write poems?
They do it because life wouldn't have any meaning for them if they didn't.
That's why I draw cartoons.  It's my life.
-- Charles Schulz

Re: [alsa-devel] [BUG] New Kernel Bugs

2007-11-15 Thread Jörn Engel

On Thu, 15 November 2007 13:26:51 +0100, Rene Herman wrote:
 
 Can you please just shelve this crap? You have a way of knowing that ALSA 
 will accept you and that is knowing or assuming that the ALSA project 
 doesn't consist of drooling retards.

Well, my experience with moderation has been that moderated mails are
stuck in some queue for weeks.  Two separate lists, neither of them was
alsa.  If alsa is doing a better job, great.  But it still has to live
with the general reputation of non-subscriber moderation.

 When a project list goes to the difficulty of moderating non-subscribers it 
 has made the explicit choice to _not_ become subscriber only. Then refusing 
 valid non-subscribers after all makes no sense whatsoever. I'm sorry you 
 got your feelings hurt by that other list but it was no doubt an accident; 
 take it up with them.

Been there, done that.  In spite of people not being drooling retards,
the amount of time and effort they invest into either moderation or
improving the ruleset is quite limited.  Problems persist.

And even without mails being held hostage for weeks, every single
moderation mail is annoying.  Like the one I'm sure to receive after
sending this out.

Jörn

-- 
Joern's library part 5:
http://www.faqs.org/faqs/compression-faq/part2/section-9.html

[PATCH] Document I_SYNC and I_DATASYNC

2007-11-15 Thread Jörn Engel

After some archeology (see http://logfs.org/logfs/inode_state_bits) I
finally figured out what the three I_DIRTY bits do.  Maybe others would
prefer less effort to reach this insight.

Signed-off-by: Jörn Engel [EMAIL PROTECTED]
---

 include/linux/fs.h |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- git_I_DIRTY/include/linux/fs.h~I_DIRTY  2007-11-15 20:51:57.0 +0100
+++ git_I_DIRTY/include/linux/fs.h  2007-11-16 03:45:16.0 +0100
@@ -1276,8 +1276,10 @@ struct super_operations {
  *
  * Two bits are used for locking and completion notification, I_LOCK and I_SYNC.
  *
- * I_DIRTY_SYNC      Inode itself is dirty.
- * I_DIRTY_DATASYNC  Data-related inode changes pending
+ * I_DIRTY_SYNC      Inode is dirty, but doesn't have to be written on
+ *                   fdatasync().  i_atime is the usual cause.
+ * I_DIRTY_DATASYNC  Inode is dirty and must be written on fdatasync(), f.e.
+ *                   because i_size changed.
  * I_DIRTY_PAGES   Inode has dirty pages.  Inode itself may be clean.
  * I_NEW   get_new_inode() sets i_state to I_LOCK|I_NEW.  Both
  * are cleared by unlock_new_inode(), called from iget().
@@ -1309,8 +1311,6 @@ struct super_operations {
  * purpose reduces latency and prevents some filesystem-
  * specific deadlocks.
  *
- * Q: Why does I_DIRTY_DATASYNC exist?  It appears as if it could be replaced
- *by (I_DIRTY_SYNC|I_DIRTY_PAGES).
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
  * Q: igrab() only checks on (I_FREEING|I_WILL_FREE).  Should it also check on
  *I_CLEAR?  If not, why?
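
Read this way, the two flags differ only in whether fdatasync() has to
write the inode.  A minimal sketch of that decision, using made-up EX_
constants and a hypothetical helper rather than the kernel's definitions:

```c
#include <stdbool.h>

/* Illustrative bit values; the real definitions live in include/linux/fs.h. */
#define EX_I_DIRTY_SYNC     (1u << 0)
#define EX_I_DIRTY_DATASYNC (1u << 1)
#define EX_I_DIRTY_PAGES    (1u << 2)

/* Does the inode itself have to be written?  For fdatasync() only
 * EX_I_DIRTY_DATASYNC counts; for fsync() either inode-dirty bit does. */
static bool ex_inode_needs_write(unsigned int state, bool datasync)
{
	if (datasync)
		return (state & EX_I_DIRTY_DATASYNC) != 0;
	return (state & (EX_I_DIRTY_SYNC | EX_I_DIRTY_DATASYNC)) != 0;
}
```

An inode that is dirty only because of i_atime would then be skipped on
fdatasync() but written on fsync().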

Re: Treat disk space like memory space

2007-11-16 Thread Jörn Engel

On Fri, 16 November 2007 10:30:12 -0800, H. Peter Anvin wrote:
 
 This, by the way, has been discussed on and off -- often in the context 
 of undelete (which is an identical problem.)  The problem usually is 
 that performance of real storage users suffer because of locality 
 issues.  However, flash storage doesn't have locality requirements...

It does, although significantly less so than disks.  Read latency is
typically between 100x and 1000x less than disk latency.

Another argument against this is that free space directly translates to
speed, both for disks and flash.  Disk filesystems fragment like hell if
the disk is constantly near-full and flash filesystems require a lot more
garbage collection overhead.

Jörn

-- 
To my face you have the audacity to advise me to become a thief - the worst
kind of thief that is conceivable, a thief of spiritual things, a thief of
ideas! It is insufferable, intolerable!
-- M. Binet in Scaramouche

Re: [RFC] Documentation about unaligned memory access

2007-11-30 Thread Jörn Engel

On Fri, 23 November 2007 00:15:53 +, Daniel Drake wrote:
 
 What's the definition of an unaligned access?
 =
 
 Unaligned memory accesses occur when you try to read N bytes of data starting
 from an address that is not evenly divisible by N (i.e. addr % N != 0).
 For example, reading 4 bytes of data from address 0x1004 is fine, but
 reading 4 bytes of data from address 0x1005 would be an unaligned memory
 access.

The wording could also apply to a DMA of 8k from a 4k-aligned address.
But I don't have a good idea how to improve it.

 It's safe to assume that memcpy will always copy bytewise and hence will
 never cause an unaligned access.

s/always copy/always behave as if copying/

memcpy usually copies at least wordwise, possibly even in bigger chunks.
But that is just the inner loop.  Unaligned bytes at the beginning/end
receive special treatment.
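
A small userspace sketch of both points: the addr % N test from the quoted
definition, and memcpy() as the portable way to load a value from a
possibly unaligned address.  The helper names are made up for this
example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* True if addr is evenly divisible by n, i.e. an n-byte access at
 * addr is aligned in the sense of the quoted definition. */
static bool is_aligned(uintptr_t addr, size_t n)
{
	return addr % n == 0;
}

/* Load a 32-bit value from any address.  memcpy() behaves as if it
 * copied bytewise, so this never performs an unaligned 4-byte access
 * in C terms, whatever the implementation does internally. */
static uint32_t load_u32_unaligned(const void *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}
```

is_aligned(0x1004, 4) holds while is_aligned(0x1005, 4) does not, matching
the example in the quoted text.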

Jörn

-- 
The rabbit runs faster than the fox, because the rabbit is running for
his life while the fox is only running for his dinner.
-- Aesop

Re: [BUG] Strange 1-second pauses during Resume-from-RAM

2007-11-30 Thread Jörn Engel

On Fri, 30 November 2007 14:43:12 +0100, Ingo Molnar wrote:
 
   
 http://redhat.com/~mingo/latency-tracing-patches/latency-tracing-v2.6.24-rc3.combo.patch
 
 does it work any better?

It compiles.  It boots with 512M of RAM (384M was too little with all
the other debug options on).  But it seems to lock up when running
trace-cmd.  On a rerun it locks up again, but with different output.
Rerun was captured:
http://logfs.org/~joern/trace1.jpg

I should do a couple of runs, but my girlfriend claims realtime priority
for the evening.

Jörn

-- 
Chance favors only the prepared mind.
-- Louis Pasteur

Re: [BUG] Strange 1-second pauses during Resume-from-RAM

2007-11-30 Thread Jörn Engel

On Thu, 15 November 2007 20:36:12 +0100, Ingo Molnar wrote:
 * Ingo Molnar [EMAIL PROTECTED] wrote:
 
  pick up the latest latency tracer patch from:
 
 sorry, wrong URLs, the correct links are:
 

 http://redhat.com/~mingo/latency-tracing-patches/latency-tracer-v2.6.24-rc2-git5-combo.patch
http://redhat.com/~mingo/latency-tracing-patches/trace-cmd.c

Doesn't seem to work with plain 2.6.23:

kernel/sched.c:3384: warning: ‘struct prio_array’ declared inside parameter list
kernel/sched.c:3384: warning: its scope is only this definition or declaration, which is probably not what you want
kernel/sched.c: In function ‘trace_array’:
kernel/sched.c:3391: error: dereferencing pointer to incomplete type
kernel/sched.c:3393: error: dereferencing pointer to incomplete type
kernel/sched.c:3393: error: dereferencing pointer to incomplete type
kernel/sched.c:3396: error: dereferencing pointer to incomplete type
kernel/sched.c:3396: error: dereferencing pointer to incomplete type
kernel/sched.c: In function ‘trace_all_runnable_tasks’:
kernel/sched.c:3407: error: ‘struct rq’ has no member named ‘active’
make[1]: *** [kernel/sched.o] Error 1

And I cannot find a definition of struct prio_array in current git
either.  Is another patch needed?

Jörn

-- 
Time? What's that? Time is only worth what you do with it.
-- Theo de Raadt

Re: [BUG] Strange 1-second pauses during Resume-from-RAM

2007-11-30 Thread Jörn Engel

On Fri, 30 November 2007 14:35:46 +0100, Ingo Molnar wrote:
 * Jörn Engel [EMAIL PROTECTED] wrote:
  
  kernel/sched.c:3384: warning: ‘struct prio_array’ declared inside parameter list
  kernel/sched.c:3384: warning: its scope is only this definition or declaration, which is probably not what you want
  kernel/sched.c: In function ‘trace_array’:
  kernel/sched.c:3391: error: dereferencing pointer to incomplete type
  kernel/sched.c:3393: error: dereferencing pointer to incomplete type
  kernel/sched.c:3393: error: dereferencing pointer to incomplete type
  kernel/sched.c:3396: error: dereferencing pointer to incomplete type
  kernel/sched.c:3396: error: dereferencing pointer to incomplete type
  kernel/sched.c: In function ‘trace_all_runnable_tasks’:
  kernel/sched.c:3407: error: ‘struct rq’ has no member named ‘active’
  make[1]: *** [kernel/sched.o] Error 1
  
  And I cannot find a definition of struct prio_array in current git
  either.  Is another patch needed?
 
 change that to rt_prio_array in the code.

Solves the prio_array problem, but leaves the non-existent member
'active'.  I've upgraded to -rc3 and will give your latest patch a whirl.

Jörn

-- 
Write programs that do one thing and do it well. Write programs to work
together. Write programs to handle text streams, because that is a
universal interface.
-- Doug McIlroy

Re: [BUG] Strange 1-second pauses during Resume-from-RAM

2007-12-01 Thread Jörn Engel

On Fri, 30 November 2007 19:46:25 +0100, Ingo Molnar wrote:
 * Jörn Engel [EMAIL PROTECTED] wrote:
  
  It compiles.  It boots with a 512M RAM (384M was too little with all
  the other debug options on).  But it seems to lock up when running
  trace-cmd.  On a rerun it locks up again, but with different output.
 
 hm, you should decrease MAX_TRACE in kernel/latency_tracing.c from 1 
 million to 16K or so. 1 million entries probably depletes lowmem quite 
 seriously.

That's ok.  RAM is cheap.

  Rerun was captured:
  http://logfs.org/~joern/trace1.jpg
 
 hm, that looks weird. if you disable CONFIG_PROVE_LOCKING, does that 
 improve things? (or just turns a noisy lockup into a silent lockup?)

Not much, although the dumps look different now:
http://logfs.org/~joern/trace3.jpg
http://logfs.org/~joern/trace4.jpg

I have to change my qemu setup a little to see the top of those
dumps...

  I should do a couple of runs, but my girlfriend claims realtime 
  priority for the evening.
 
 yeah, SCHED_IDLE is not generally well received by them.

...as soon as more urgent tasks have finished (weekend is over).

Jörn

-- 
It does not matter how slowly you go, so long as you do not stop.
-- Confucius

Re: [BUG] Strange 1-second pauses during Resume-from-RAM

2007-12-01 Thread Jörn Engel

On Sat, 1 December 2007 19:32:56 +0100, Ingo Molnar wrote:
 * Jörn Engel [EMAIL PROTECTED] wrote:
 
  I have to change my qemu setup a little to see the top of those 
  dumps...
 
 btw., if you start qemu like this:
 
 qemu -cdrom ./cdrom.iso -hda ./hda.img -boot c -full-screen -kernel 
 ~/bzImage -append "root=/dev/hda1 earlyprintk=serial,ttyS0,9600 
 console=tty console=ttyS0,9600 enforcing=0 debug"
 
 you'll get the inner kernel's serial console log to qemu's standard 
 output. Pretty useful for capturing kernel crashes.

Almost.  -serial stdio was missing.  Much better now.

stopped custom tracer.
BUG: spinlock recursion on CPU#0, sh/953
 lock: c030f280, .magic: dead4ead, .owner: sh/953, .owner_cpu: 0
Pid: 953, comm: sh Not tainted 2.6.24-rc3-ge1cca7e8-dirty #2
 [c0103a04] show_trace_log_lvl+0x35/0x54
 [c010450a] show_trace+0x2c/0x2e
 [c0104e6d] dump_stack+0x84/0x8a
 [c01ded7c] spin_bug+0xa7/0xae
 [c01def14] _raw_spin_lock+0x45/0xfa
 [c02a02b1] _spin_lock_irqsave+0x68/0x7a
 [c01087e7] pit_read+0x14/0x99
 [c0130ee9] get_monotonic_cycles+0xf/0x2d
 [c013c0ef] now+0x2a/0x7c
 [c013c33b] trace+0x4d/0x1e8
 [c013dbf3] __mcount+0x95/0xa6
 [c010d35c] mcount+0x14/0x18
 [c0135a44] lock_acquired+0xe/0x1d7
 [c02a02b9] _spin_lock_irqsave+0x70/0x7a
 [c01087e7] pit_read+0x14/0x99
 [c0130791] update_wall_time+0x23/0x692
 [c0121756] do_timer+0x24/0xb1
 [c01331fe] tick_periodic+0x49/0x84
 [c013325b] tick_handle_periodic+0x22/0x73
 [c0106315] timer_interrupt+0x4f/0x56
 [c013e2c7] handle_IRQ_event+0x24/0x4f
 [c013f44a] handle_edge_irq+0xb8/0x125
 [c01054ee] do_IRQ+0x89/0xa3
 [c01033df] common_interrupt+0x23/0x28
 [c015d924] vfs_write+0xa6/0x14c
 [c015df6e] sys_write+0x4c/0x70
 [c0102a1f] syscall_call+0x7/0xb
 ===

I assume you have the latency tracer working.  If you could send me your
config, I could do a manual config-bisect and see which part of mine
causes the problem.

Jörn

-- 
Admonish your friends privately, but praise them openly.
-- Publilius Syrus

Re: [BUG] Strange 1-second pauses during Resume-from-RAM

2007-12-01 Thread Jörn Engel

On Sat, 1 December 2007 21:54:56 +0100, Ingo Molnar wrote:
 * Jörn Engel [EMAIL PROTECTED] wrote:
  
  stopped custom tracer.
  BUG: spinlock recursion on CPU#0, sh/953
   lock: c030f280, .magic: dead4ead, .owner: sh/953, .owner_cpu: 0
  Pid: 953, comm: sh Not tainted 2.6.24-rc3-ge1cca7e8-dirty #2
   [c0103a04] show_trace_log_lvl+0x35/0x54
   [c010450a] show_trace+0x2c/0x2e
   [c0104e6d] dump_stack+0x84/0x8a
   [c01ded7c] spin_bug+0xa7/0xae
   [c01def14] _raw_spin_lock+0x45/0xfa
   [c02a02b1] _spin_lock_irqsave+0x68/0x7a
   [c01087e7] pit_read+0x14/0x99
   [c0130ee9] get_monotonic_cycles+0xf/0x2d
 
 ah. You should mark pit_read() function as notrace. PIT clocksource is 
 rare. (add the 'notrace' word to the function prototype)

Hardly a change at all.  Apart from some offsets, this dump is
identical.

stopped custom tracer.
BUG: spinlock recursion on CPU#0, sh/954
 lock: c030f280, .magic: dead4ead, .owner: sh/954, .owner_cpu: 0
Pid: 954, comm: sh Not tainted 2.6.24-rc3-ge1cca7e8-dirty #3
 [c0103a04] show_trace_log_lvl+0x35/0x54
 [c010450a] show_trace+0x2c/0x2e
 [c0104e6d] dump_stack+0x84/0x8a
 [c01ded7c] spin_bug+0xa7/0xae
 [c01def14] _raw_spin_lock+0x45/0xfa
 [c02a02b1] _spin_lock_irqsave+0x68/0x7a
 [c01087e2] pit_read+0xf/0x91
 [c0130ee1] get_monotonic_cycles+0xf/0x2d
 [c013c0e7] now+0x2a/0x7c
 [c013c333] trace+0x4d/0x1e8
 [c013dbeb] __mcount+0x95/0xa6
 [c010d354] mcount+0x14/0x18
 [c0135a3c] lock_acquired+0xe/0x1d7
 [c02a02b9] _spin_lock_irqsave+0x70/0x7a
 [c01087e2] pit_read+0xf/0x91
 [c0130789] update_wall_time+0x23/0x692
 [c012174e] do_timer+0x24/0xb1
 [c01331f6] tick_periodic+0x49/0x84
 [c0133253] tick_handle_periodic+0x22/0x73
 [c0106315] timer_interrupt+0x4f/0x56
 [c013e2bf] handle_IRQ_event+0x24/0x4f
 [c013f442] handle_edge_irq+0xb8/0x125
 [c01054ee] do_IRQ+0x89/0xa3
 [c01033df] common_interrupt+0x23/0x28
 [c010d354] mcount+0x14/0x18
 [c0120130] sysctl_head_finish+0xc/0x33
 [c0192d64] proc_sys_write+0x96/0xa0
 [c015d91c] vfs_write+0xa6/0x14c
 [c015df66] sys_write+0x4c/0x70
 [c0102a1f] syscall_call+0x7/0xb
 ===

Jörn

-- 
Don't worry about people stealing your ideas. If your ideas are any good,
you'll have to ram them down people's throats.
-- Howard Aiken quoted by Ken Iverson quoted by Jim Horning quoted by
   Raph Levien, 1979

Re: [BUG] Strange 1-second pauses during Resume-from-RAM

2007-12-02 Thread Jörn Engel

On Sun, 2 December 2007 09:56:08 +0100, Ingo Molnar wrote:
 * Jörn Engel [EMAIL PROTECTED] wrote:
 
   ah. You should mark pit_read() function as notrace. PIT clocksource 
   is rare. (add the 'notrace' word to the function prototype)
  
  Hardly a change at all.  Apart from some offsets, this dump is 
  identical.
  
  stopped custom tracer.
  BUG: spinlock recursion on CPU#0, sh/954
   lock: c030f280, .magic: dead4ead, .owner: sh/954, .owner_cpu: 0
  Pid: 954, comm: sh Not tainted 2.6.24-rc3-ge1cca7e8-dirty #3
   [c0103a04] show_trace_log_lvl+0x35/0x54
   [c010450a] show_trace+0x2c/0x2e
   [c0104e6d] dump_stack+0x84/0x8a
   [c01ded7c] spin_bug+0xa7/0xae
   [c01def14] _raw_spin_lock+0x45/0xfa
   [c02a02b1] _spin_lock_irqsave+0x68/0x7a
   [c01087e2] pit_read+0xf/0x91
   [c0130ee1] get_monotonic_cycles+0xf/0x2d
   [c013c0e7] now+0x2a/0x7c
   [c013c333] trace+0x4d/0x1e8
   [c013dbeb] __mcount+0x95/0xa6
   [c010d354] mcount+0x14/0x18
   [c0135a3c] lock_acquired+0xe/0x1d7
   [c02a02b9] _spin_lock_irqsave+0x70/0x7a
   [c01087e2] pit_read+0xf/0x91
 
 hm, it seems lock_acquired() [in kernel/lockdep.c] needs to be marked 
 'notrace' too - otherwise we recurse back into pit_read().

This time not even the offsets have changed.  Dump is identical.
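
For readers following along, the recursion in the dump can be modelled in
user space.  This is only a sketch, not kernel code: the function names
mirror the ones in the dump, but the bodies are invented, and the
lock_acquired_is_traced switch is a made-up stand-in for a missing notrace
annotation.

```c
#include <stdbool.h>

/* Toy model of the dump above; nothing here is real kernel code. */
static bool pit_lock_held;           /* stands in for the PIT spinlock */
static bool lock_acquired_is_traced; /* false models a 'notrace' annotation */
static int recursion_bugs;           /* times the lock was taken while held */
static unsigned long ticks;

static unsigned long pit_read(void);

/* The tracer timestamps every traced function entry via the clocksource. */
static void trace(void)
{
	(void)pit_read();
}

/* Called with the lock held; enters the tracer unless marked notrace. */
static void lock_acquired(void)
{
	if (lock_acquired_is_traced)
		trace();
}

static unsigned long pit_read(void)
{
	unsigned long now;

	if (pit_lock_held) {
		/* The real kernel prints "BUG: spinlock recursion" here. */
		recursion_bugs++;
		return ticks;
	}
	pit_lock_held = true;
	lock_acquired();
	now = ++ticks;
	pit_lock_held = false;
	return now;
}

/* Run one clock read and report how many recursions it triggered. */
static int count_recursions(bool traced)
{
	lock_acquired_is_traced = traced;
	recursion_bugs = 0;
	(void)pit_read();
	return recursion_bugs;
}
```

In this model any function that runs while the clocksource lock is held
and still enters the tracer reproduces the recursion, so annotating one
function only helps if no other traced function sits on the same path.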

Jörn

-- 
Mundie uses a textbook tactic of manipulation: start with some
reasonable talk, and lead the audience to an unreasonable conclusion.
-- Bruce Perens

Re: [BUG] Strange 1-second pauses during Resume-from-RAM

2007-12-02 Thread Jörn Engel

On Sun, 2 December 2007 12:31:43 +0100, Jörn Engel wrote:
 
 This time not even the offsets have changed.  Dump is identical.

After another ten or so notrace annotations throughout the spinlock
code, the latency tracer appears to work.  Not sure how much useful
information is missing through all the annotations, though.

Jörn

-- 
Das Aufregende am Schreiben ist es, eine Ordnung zu schaffen, wo
vorher keine existiert hat.
-- Doris Lessing

Re: [BUG] Strange 1-second pauses during Resume-from-RAM

2007-12-02 Thread Jörn Engel

On Sun, 2 December 2007 14:57:11 +0100, Ingo Molnar wrote:
 
 hm, do you have CONFIG_FRAME_POINTERS=y, i.e. are the dumps reliable?

I do.  Went through 10-odd runs and annotated the function right below
mcount each time.  Seems to work now.

Trouble is that it doesn't solve my real problem at hand.  Something is
causing significant delays when writing to logfs.  Core logfs code is
not what is running, but it may trigger whatever other code is running
and burning up all the cpu time.  Wasting 100ms of qemu-time to write a
single page happens fairly frequently.

With the latency tracer the problem appears to have become worse.  Now
the softlockup code triggers quite frequently.  Which makes a bit of
sense, as the problem is a busy CPU, rather than an idle one.

Guess I'll try oprofile or lcov instead.

Jörn

-- 
Joern's library part 5:
http://www.faqs.org/faqs/compression-faq/part2/section-9.html

Re: [BUG] Strange 1-second pauses during Resume-from-RAM

2007-12-02 Thread Jörn Engel

On Sun, 2 December 2007 13:31:25 +0100, Jörn Engel wrote:
 
 After another ten or so notrace annotations throughout the spinlock
 code, the latency tracer appears to work.  Not sure how much useful
 information is missing through all the annotations, though.

And here is a patch with the needed annotations.  Looks a bit shabby, as
it was generated through git diff, patcher, interdiff and combinediff.

Jörn

-- 
Joern's library part 10:
http://blogs.msdn.com/David_Gristwood/archive/2004/06/24/164849.aspx

unchanged:
--- a/arch/x86/kernel/i8253.c
+++ b/arch/x86/kernel/i8253.c
@@ -125,7 +125,7 @@ void __init setup_pit_timer(void)
  * to just read by itself. So use jiffies to emulate a free
  * running counter:
  */
-static cycle_t pit_read(void)
+static notrace cycle_t pit_read(void)
 {
 	unsigned long flags;
 	int count;
unchanged:
--- a/kernel/spinlock.c
+++ b/kernel/spinlock.c
@@ -76,7 +76,7 @@ void __lockfunc _read_lock(rwlock_t *lock)
 }
 EXPORT_SYMBOL(_read_lock);
 
-unsigned long __lockfunc _spin_lock_irqsave(spinlock_t *lock)
+unsigned long notrace __lockfunc _spin_lock_irqsave(spinlock_t *lock)
 {
 	unsigned long flags;
 
@@ -341,7 +341,7 @@ void __lockfunc _read_unlock(rwlock_t *lock)
 }
 EXPORT_SYMBOL(_read_unlock);
 
-void __lockfunc _spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags)
+void notrace __lockfunc _spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags)
 {
 	spin_release(&lock->dep_map, 1, _RET_IP_);
 	_raw_spin_unlock(lock);
unchanged:
--- a/lib/spinlock_debug.c
+++ b/lib/spinlock_debug.c
@@ -148,7 +148,7 @@ int _raw_spin_trylock(spinlock_t *lock)
 	return ret;
 }
 
-void _raw_spin_unlock(spinlock_t *lock)
+void notrace _raw_spin_unlock(spinlock_t *lock)
 {
 	debug_spin_unlock(lock);
 	__raw_spin_unlock(&lock->raw_lock);
only in patch2:
unchanged:
--- linux/arch/x86/kernel/tsc_32.c
+++ linux-2.6.24-rc3logfs/arch/x86/kernel/tsc_32.c  2007-12-02 15:21:15.0 +0100
@@ -92,7 +92,7 @@
 /*
  * Scheduler clock - returns current time in nanosec units.
  */
-unsigned long long native_sched_clock(void)
+unsigned notrace long long native_sched_clock(void)
 {
 	unsigned long long this_offset;
 
only in patch2:
unchanged:
--- linux/kernel/lockdep.c
+++ linux-2.6.24-rc3logfs/kernel/lockdep.c  2007-12-02 15:21:16.0 +0100
@@ -139,7 +139,7 @@
 	return i;
 }
 
-static void lock_time_inc(struct lock_time *lt, s64 time)
+static notrace void lock_time_inc(struct lock_time *lt, s64 time)
 {
 	if (time > lt->max)
 		lt->max = time;
@@ -198,7 +198,7 @@
 	memset(class->contention_point, 0, sizeof(class->contention_point));
 }
 
-static struct lock_class_stats *get_lock_stats(struct lock_class *class)
+static notrace struct lock_class_stats *get_lock_stats(struct lock_class *class)
 {
 	return &get_cpu_var(lock_stats)[class - lock_classes];
 }
@@ -208,7 +208,7 @@
 	put_cpu_var(lock_stats);
 }
 
-static void lock_release_holdtime(struct held_lock *hlock)
+static notrace void lock_release_holdtime(struct held_lock *hlock)
 {
 	struct lock_class_stats *stats;
 	s64 holdtime;
@@ -2872,7 +2872,7 @@
 }
 EXPORT_SYMBOL_GPL(lock_contended);
 
-void lock_acquired(struct lockdep_map *lock)
+void notrace lock_acquired(struct lockdep_map *lock)
 {
 	unsigned long flags;
 

Re: [BUG] Strange 1-second pauses during Resume-from-RAM

2007-12-02 Thread Jörn Engel

On Sun, 2 December 2007 16:47:46 +0100, Ingo Molnar wrote:
 
 well what does the trace say, where do the delays come from? To get a 
 quick overview you can make tracing lighter weight by doing:
 
  echo 0 > /proc/sys/kernel/mcount_enabled
  echo 1 > /proc/sys/kernel/trace_syscalls

I mistyped and did
 echo 1 > /proc/sys/kernel/mcount_enabled

Result looked like a livelock and finally convinced me to abandon the
latency tracer.  Sorry, but it appears to be the right tool for the
wrong job.

Jörn

-- 
They laughed at Galileo.  They laughed at Copernicus.  They laughed at
Columbus. But remember, they also laughed at Bozo the Clown.
-- unknown

Re: Kernel Development Objective-C

2007-12-02 Thread Jörn Engel

On Sat, 1 December 2007 21:59:31 +0200, Avi Kivity wrote:
 
 Object orientation in C leaves much to be desired; see the huge number 
 of void pointers and container_of()s in the kernel.

While true, this isn't such a bad problem.  A language really sucks when
it tries to disallow something useful.  Back in university I was forced
to write system software in Pascal.  Simple pointer arithmetic became a
5-line piece of code.

Imo the main advantage of C is simply that it doesn't get in the way.
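
To make the container_of() point concrete, here is a rough user-space
sketch.  The macro is a simplified stand-in for the kernel's version (it
does no type checking), and struct list_node, struct task and
task_from_node() are invented for the example.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel macro: step back from a member
 * pointer to the structure that embeds it. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical generic list node, embedded in a hypothetical object. */
struct list_node {
	struct list_node *next;
};

struct task {
	int pid;
	struct list_node node;
};

/* Generic code only sees the node; this recovers the enclosing object. */
static struct task *task_from_node(struct list_node *n)
{
	return container_of(n, struct task, node);
}
```

The whole trick is one subtraction on a char pointer, which is exactly
the sort of pointer arithmetic that C permits without ceremony.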

Jörn

-- 
But this is not to say that the main benefit of Linux and other GPL
software is lower-cost. Control is the main benefit--cost is secondary.
-- Bruce Perens

Re: [BUG] Strange 1-second pauses during Resume-from-RAM

2007-12-02 Thread Jörn Engel

On Sun, 2 December 2007 21:07:22 +0100, Ingo Molnar wrote:
 * Jörn Engel [EMAIL PROTECTED] wrote:
 
  Result looked like a livelock and finally convinced me to abandon the 
  latency tracer.  Sorry, but it appears to be the right tool for the 
  wrong job.
 
 hm, we routinely use it in -rt to capture "what on earth is happening" 
 incidents. The snippet below is a random snippet from a trace that i've 
 just captured, with mcount enabled. It seems to work fine here, with and 
 without mcount. (pit clocksource is almost never used, that's why you 
 had those early problems.)
 
 oprofile helps if you can reliably reproduce the slowdown in a loop or 
 for a long amount of time, with lots of CPU utilization - and then it's 
 also lower overhead. The tracer can be used to capture rare or complex 
 events, and gives the full flow control and what is happening within the 
 kernel.

Such a trace would be useful indeed.  But so far the patch has only
given me grief and nothing remotely like useful output.  Maybe I should
simply use the complete -rt patch instead of debugging the broken-out
latency-tracer patch.

Jörn

-- 
Mundie uses a textbook tactic of manipulation: start with some
reasonable talk, and lead the audience to an unreasonable conclusion.
-- Bruce Perens

Re: [BUG] Strange 1-second pauses during Resume-from-RAM

2007-12-02 Thread Jörn Engel

On Sun, 2 December 2007 21:45:59 +0100, Ingo Molnar wrote:
 
 to capture a 1 second trace of what the system is doing. I think your 
 troubles are due to running it within a qemu guest - that is not a 
 typical utilization so you are in uncharted waters.

Looks like it.  Guess I'll switch to something else for the moment.

Jörn

-- 
Linux is more the core point of a concept that surrounds open source
which, in turn, is based on a false concept. This concept is that
people actually want to look at source code.
-- Rob Enderle

Re: [BUG] Strange 1-second pauses during Resume-from-RAM

2007-12-02 Thread Jörn Engel

On Sun, 2 December 2007 21:45:59 +0100, Ingo Molnar wrote:
 
 to capture that trace i did not use -rt, i just patched latest -git 
 with:
 
   
 http://people.redhat.com/mingo/latency-tracing-patches/latency-tracing-v2.6.24-rc3.combo.patch
 
 (this has your fixes included already)
 
 have done:
 
   echo 1 > /proc/sys/kernel/mcount_enabled
 
 and have run:
 
   ./trace-cmd sleep 1 > trace.txt
 
   http://people.redhat.com/mingo/latency-tracing-patches/trace-cmd.c
 
 to capture a 1 second trace of what the system is doing. I think your 
 troubles are due to running it within a qemu guest - that is not a 
 typical utilization so you are in uncharted waters.

Maybe one more thing: can you send me the config you used for the setup
above?  I'd like to know whether qemu or my config is to blame.

Jörn

-- 
Eighty percent of success is showing up.
-- Woody Allen

Re: [BUG] Strange 1-second pauses during Resume-from-RAM

2007-12-02 Thread Jörn Engel

On Sun, 2 December 2007 22:19:00 +0100, Ingo Molnar wrote:
 * Jörn Engel [EMAIL PROTECTED] wrote:
 
  Maybe one more thing: can you send me the config you used for the 
  setup above?  I'd like to know whether qemu or my config is to blame.
 
 sure - attached.

After an eternity of compile time, this config does generate some useful
output.  qemu is not to blame.

Jörn

-- 
Joern's library part 9:
http://www.scl.ameslab.gov/Publications/Gus/TwelveWays.html

Re: [BUG] Strange 1-second pauses during Resume-from-RAM

2007-12-03 Thread Jörn Engel

On Mon, 3 December 2007 01:57:02 +0100, Jörn Engel wrote:
 
 After an eternity of compile time, this config does generate some useful
 output.  qemu is not to blame.

Or is it?  The output definitely looks suspicious.  Large amounts of
code get processed within a microsecond, while update_wall_time()
appears to cause huge delays every time it is called:
http://logfs.org/~joern/trace

Does this output make sense or does it rather indicate some sloppiness
wrt. time in the qemu virtual machine?

Jörn

-- 
tglx1 thinks that joern should get a (TM) for Thinking Is Hard
-- Thomas Gleixner

Re: solid state drive access and context switching

2007-12-04 Thread Jörn Engel

On Tue, 4 December 2007 13:54:21 -0800, Jared Hulbert wrote:
 
 Maybe I'm missing something but I don't see it.  We want a block
 interface for these devices, we just need a faster slimmer interface.
 Maybe a new mtdblock interface that doesn't do erase would be the
 place for?

Doesn't do erase?  MTD has to learn almost all tricks from the block
layer, as devices are becoming high-latency high-bandwidth, compared to
what MTD was designed for.  In order to get any decent performance, we
need asynchronous operations, request queues and caching.

The only useful advantage MTD does have over block devices is an
_explicit_ erase operation.  Did you mean "doesn't do _implicit_ erase"?
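
Why the explicit erase matters can be sketched with a toy flash model.
The assumption, typical for NOR and NAND flash but not spelled out above,
is that programming can only clear bits and only erasing a whole
eraseblock sets them again.  All names and sizes are invented, and
toy_block_write() is merely one reading of what an implicit erase looks
like.

```c
#include <stddef.h>
#include <string.h>

#define ERASE_SIZE 16	/* toy eraseblock size; real ones are far larger */

struct toy_flash {
	unsigned char block[ERASE_SIZE];
};

/* Explicit erase: the only way to set bits back to 1. */
static void toy_erase(struct toy_flash *f)
{
	memset(f->block, 0xff, sizeof(f->block));
}

/* Program: can only clear bits, so old and new data are ANDed. */
static void toy_write(struct toy_flash *f, size_t ofs, unsigned char val)
{
	if (ofs < ERASE_SIZE)
		f->block[ofs] &= val;
}

/* Block-device style write: hides a read-erase-rewrite cycle (an
 * implicit erase) behind every overwrite. */
static void toy_block_write(struct toy_flash *f, size_t ofs, unsigned char val)
{
	unsigned char copy[ERASE_SIZE];

	if (ofs >= ERASE_SIZE)
		return;
	memcpy(copy, f->block, sizeof(copy));
	copy[ofs] = val;
	toy_erase(f);
	for (ofs = 0; ofs < ERASE_SIZE; ofs++)
		toy_write(f, ofs, copy[ofs]);
}
```

A filesystem that sees toy_erase() directly can decide when to pay for
it; one that only sees toy_block_write() pays on every overwrite.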

Jörn

-- 
It's just what we asked for, but not what we want!
-- anonymous

Re: [x86] kernel/audit.c cleanup according to checkpatch.pl

2008-01-03 Thread Jörn Engel

On Thu, 3 January 2008 14:19:25 +0300, Cyrill Gorcunov wrote:
 @@ -232,7 +232,8 @@ void audit_log_lost(const char *message)
  
   if (print) {
   printk(KERN_WARNING
 -"audit: audit_lost=%d audit_rate_limit=%d audit_backlog_limit=%d\n",
 +"audit: audit_lost=%d audit_rate_limit=%d "
 +"audit_backlog_limit=%d\n",
  atomic_read(&audit_lost),
  audit_rate_limit,
  audit_backlog_limit);

This hunk is a bit questionable.  It can easily mislead a reader into
assuming two separate lines are printed, and it sometimes defeats grepping
for printk output to find the code generating the message.
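
A small illustration of the concern, as a user-space sketch rather than
the audit code itself: adjacent string literals are concatenated by the
compiler, so the split format still prints a single line, but the source
no longer contains the message as one greppable string.  The array and
function names are made up.

```c
#include <string.h>

/* Format as one literal: grepping the source for the message finds it. */
static const char one_literal[] =
	"audit: audit_lost=%d audit_rate_limit=%d audit_backlog_limit=%d\n";

/* Format split to satisfy a line-length limit.  The compiler joins
 * adjacent literals, so the resulting string is the same. */
static const char split_literal[] =
	"audit: audit_lost=%d audit_rate_limit=%d "
	"audit_backlog_limit=%d\n";

/* Returns 1 if both formats are identical at run time. */
static int formats_match(void)
{
	return strcmp(one_literal, split_literal) == 0;
}

/* Counts newlines, i.e. how many lines the format prints. */
static int lines_printed(const char *fmt)
{
	int n = 0;

	for (; *fmt; fmt++)
		if (*fmt == '\n')
			n++;
	return n;
}
```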

Rest looks good to me.

Jörn

-- 
He that composes himself is wiser than he that composes a book.
-- B. Franklin

[PATCH] Claim maintainership for block2mtd and update email addresses

2008-01-06 Thread Jörn Engel

I have been prime author and maintainer of block2mtd from day one, but
neither MAINTAINERS nor the module source makes this fact clear.  And while
I'm at it, update my email addresses tree-wide, as the old address currently
bounces, and change my name to "Joern", as unicode will likely continue to
cause trouble until the end of this century.

Signed-off-by: Jörn Engel [EMAIL PROTECTED]
---

 MAINTAINERS                     |   10 ++++++++--
 drivers/mtd/devices/block2mtd.c |    4 ++--
 drivers/mtd/devices/phram.c     |    4 ++--
 drivers/mtd/maps/mtx-1_flash.c  |    2 +-
 scripts/checkstack.pl           |    2 +-
 5 files changed, 14 insertions(+), 8 deletions(-)

--- linux-2.6.24-rc3logfs/drivers/mtd/devices/block2mtd.c~block2mtd_maintainer  2007-08-08 19:30:04.0 +0200
+++ linux-2.6.24-rc3logfs/drivers/mtd/devices/block2mtd.c   2008-01-06 14:22:57.0 +0100
@@ -4,7 +4,7 @@
  * block2mtd.c - create an mtd from a block device
  *
  * Copyright (C) 2001,2002 Simon Evans [EMAIL PROTECTED]
- * Copyright (C) 2004-2006 Jörn Engel [EMAIL PROTECTED]
+ * Copyright (C) 2004-2006 Joern Engel [EMAIL PROTECTED]
  *
  * Licence: GPL
  */
@@ -485,5 +485,5 @@ module_init(block2mtd_init);
 module_exit(block2mtd_exit);
 
 MODULE_LICENSE("GPL");
-MODULE_AUTHOR("Simon Evans [EMAIL PROTECTED] and others");
+MODULE_AUTHOR("Joern Engel [EMAIL PROTECTED]");
 MODULE_DESCRIPTION("Emulate an MTD using a block device");
--- linux-2.6.24-rc3logfs/MAINTAINERS~block2mtd_maintainer  2007-11-30 13:59:51.0 +0100
+++ linux-2.6.24-rc3logfs/MAINTAINERS   2008-01-06 14:21:49.0 +0100
@@ -835,6 +835,12 @@ L: linux-kernel@vger.kernel.org
 T: git kernel.org:/pub/scm/linux/kernel/git/axboe/linux-2.6-block.git
 S: Maintained
 
+BLOCK2MTD DRIVER
+P: Joern Engel
+M: [EMAIL PROTECTED]
+L: [EMAIL PROTECTED]
+S: Maintained
+
 BLUETOOTH SUBSYSTEM
 P: Marcel Holtmann
 M: [EMAIL PROTECTED]
@@ -2985,8 +2991,8 @@ L:[EMAIL PROTECTED]
 S: Maintained
 
 PHRAM MTD DRIVER
-P: Jörn Engel
-M: [EMAIL PROTECTED]
+P: Joern Engel
+M: [EMAIL PROTECTED]
 L: [EMAIL PROTECTED]
 S: Maintained
 
--- linux-2.6.24-rc3logfs/drivers/mtd/devices/phram.c~block2mtd_maintainer  2007-08-08 19:30:04.0 +0200
+++ linux-2.6.24-rc3logfs/drivers/mtd/devices/phram.c   2008-01-06 14:22:30.0 +0100
@@ -2,7 +2,7 @@
  * $Id: phram.c,v 1.16 2005/11/07 11:14:25 gleixner Exp $
  *
  * Copyright (c)   Jochen Schäuble [EMAIL PROTECTED]
- * Copyright (c) 2003-2004 Jörn Engel [EMAIL PROTECTED]
+ * Copyright (c) 2003-2004 Joern Engel [EMAIL PROTECTED]
  *
  * Usage:
  *
@@ -299,5 +299,5 @@ module_init(init_phram);
 module_exit(cleanup_phram);
 
 MODULE_LICENSE("GPL");
-MODULE_AUTHOR("Jörn Engel [EMAIL PROTECTED]");
+MODULE_AUTHOR("Joern Engel [EMAIL PROTECTED]");
 MODULE_DESCRIPTION("MTD driver for physical RAM");
--- linux-2.6.24-rc3logfs/scripts/checkstack.pl~block2mtd_maintainer  2007-11-15 20:52:00.0 +0100
+++ linux-2.6.24-rc3logfs/scripts/checkstack.pl 2008-01-06 14:28:14.0 +0100
@@ -2,7 +2,7 @@
 
 #  Check the stack usage of functions
 #
-#  Copyright Joern Engel [EMAIL PROTECTED]
+#  Copyright Joern Engel [EMAIL PROTECTED]
 #  Inspired by Linus Torvalds
 #  Original idea maybe from Keith Owens
 #  s390 port and big speedup by Arnd Bergmann [EMAIL PROTECTED]
--- linux-2.6.24-rc3logfs/drivers/mtd/maps/mtx-1_flash.c~block2mtd_maintainer   2007-08-08 19:30:04.0 +0200
+++ linux-2.6.24-rc3logfs/drivers/mtd/maps/mtx-1_flash.c 2008-01-06 14:28:44.0 +0100
@@ -4,7 +4,7 @@
  * $Id: mtx-1_flash.c,v 1.2 2005/11/07 11:14:27 gleixner Exp $
  *
  * (C) 2005 Bruno Randolf [EMAIL PROTECTED]
- * (C) 2005 Jörn Engel [EMAIL PROTECTED]
+ * (C) 2005 Joern Engel [EMAIL PROTECTED]
+ * (C) 2005 Joern Engel [EMAIL PROTECTED]
  *
  */
 

Re: [PATCH] block2mtd lockdep_init_map warning

2008-01-06 Thread Jörn Engel

On Sun, 6 January 2008 14:11:47 -0500, Erez Zadok wrote:
 
 The problem appears to be an interaction of two components--module loading
 and lockdep--that's perhaps why it wasn't given enough attention.

Correct.  For modules lockdep depends on initializations done after
module_init has finished.  However block2mtd is an odd sod that can call
into lockdep code during module_init, causing the bug you noticed.

Several solutions are possible.  Modules could get two initcalls, one to
decide whether module load should get aborted, the other run later,
after the remaining module initializations are done.  Or the module
loader could always do the initializations and revoke them later, if
module_init failed.

But I personally am too unfamiliar with the module code to trust my
judgement and have yet to receive feedback.  Even you seem to ignore my
mails and not even Cc: me later on.  I must have done something really
horrible in my last life, it seems.

Jörn

-- 
A quarrel is quickly settled when deserted by one party; there is
no battle unless there be two.
-- Seneca

Re: fix typo in mtd kconfig

2007-10-17 Thread Jörn Engel

David, will you take this patch?

Signed-off-by: Dave Jones [EMAIL PROTECTED]
Signed-off-by: Joern Engel [EMAIL PROTECTED]

diff --git a/drivers/mtd/nand/Kconfig b/drivers/mtd/nand/Kconfig
index 8f9c3ba..246d451 100644
--- a/drivers/mtd/nand/Kconfig
+++ b/drivers/mtd/nand/Kconfig
@@ -300,7 +300,7 @@ config MTD_NAND_PLATFORM
 	  via platform_data.
 
 config MTD_ALAUDA
-	tristate "MTD driver for Olympus MAUSB-10 and Fijufilm DPC-R1"
+	tristate "MTD driver for Olympus MAUSB-10 and Fujifilm DPC-R1"
 	depends on MTD_NAND && USB
 	help
 	  These two (and possibly other) Alauda-based cardreaders for


Jörn

-- 
It does not matter how slowly you go, so long as you do not stop.
-- Confucius

Re: [BLOCK2MTD] WARNING: at kernel/lockdep.c:2331 lockdep_init_map()

2007-10-19 Thread Jörn Engel

On Fri, 19 October 2007 13:53:40 -0400, Erez Zadok wrote:

 I've been having this problem for some time with mtd, which I use to mount
 jffs2 images (for unionfs testing).  I've seen it in several recent major
 kernels, including 2.6.24.  Here's the sequence of ops I perform:

Since when roughly?  2.6.20ish?  Before?

 # cp jffs2-empty.img /tmp/foo
 # losetup /dev/loop0 /tmp/foo
 # modprobe mtdblock
 # modprobe block2mtd block2mtd=/dev/loop0,128ki
 # mount -t jffs2 /dev/mtdblock0 /n/lower/b0

Side note: you don't need mtdblock:
# cp jffs2-empty.img /tmp/foo
# losetup /dev/loop0 /tmp/foo
# modprobe block2mtd block2mtd=/dev/loop0,128ki
# mount -t jffs2 mtd0 /n/lower/b0

It doesn't really hurt, 'tis just superfluous.

 The jffs2-empty.img is a small jffs2 image, of an empty directory, created
 w/ the jffs2 utils.  At the point I modprobe block2mtd, I get the following
 lockdep warning and a BUG message:
 
 BUG: key f88e1340 not in .data!
 WARNING: at kernel/lockdep.c:2331 lockdep_init_map()
  [c0102bc2] show_trace_log_lvl+0x1a/0x2f
  [c0103692] show_trace+0x12/0x14
  [c01037b2] dump_stack+0x15/0x17
  [c0125432] lockdep_init_map+0x94/0x3e4
  [c0125001] debug_mutex_init+0x2c/0x3c
  [c01210d4] __mutex_init+0x38/0x40
  [f88e01d3] 0xf88e01d3
  [c011dda7] parse_args+0x123/0x200
  [c012b725] sys_init_module+0xdd0/0x122c
  [c0102586] sysenter_past_esp+0x5f/0x91
  ===
 block2mtd: mtd0: [d: /dev/loop0] erase_size = 128KiB [131072]
 block2mtd: version $Revision: 1.30 $

Could be my problem.  I'll see if I can reproduce it.  Can you send me
your .config or a link to it?

Jörn

-- 
/* Keep these two variables together */
int bar;

Re: [PATCH] eccbuf is statically defined and always evaluate to true

2007-10-19 Thread Jörn Engel

On Fri, 19 October 2007 19:26:35 +0200, Samuel Tardieu wrote:
 
 ---
  drivers/mtd/devices/doc2000.c |4 ++--
  drivers/mtd/devices/doc2001plus.c |2 +-
  2 files changed, 3 insertions(+), 3 deletions(-)

Acked-by: Joern Engel [EMAIL PROTECTED]

I assume you don't actually use this driver and just ran make
randconfig or allyesconfig or so..

Jörn

-- 
Science is like sex: sometimes something useful comes out,
but that is not the reason we are doing it.
-- Richard Feynman

Re: [BLOCK2MTD] WARNING: at kernel/lockdep.c:2331 lockdep_init_map()

2007-10-20 Thread Jörn Engel

On Fri, 19 October 2007 16:04:10 -0400, Erez Zadok wrote:
 In message [EMAIL PROTECTED], Jörn Engel writes:
  
  Since when roughly?  2.6.20ish?  Before?
 
 Yeah, I guess around that time.  If you want, I could go back and test each
 of my backports and see if it has the lockdep message or not.

That's ok.  Just wanted to get a rough idea.

  Side note: you don't need mtdblock:
  # cp jffs2-empty.img /tmp/foo
  # losetup /dev/loop0 /tmp/foo
  # modprobe block2mtd block2mtd=/dev/loop0,128ki
  # mount -t jffs2 mtd0 /n/lower/b0
  
  It doesn't really hurt, 'tis just superfluous.
 
 Neat.  Curious, but where does mtd0 come from then?  It's not in my /dev
 (which uses devfs on an FC6 system).

JFFS2 interprets that itself.  The only reason why JFFS2 needed a block
device was to determine the minor number of the mtd underneath.  So code
was added to find the correct mtd from mtd0 or mtd:some_name
instead.  I believe you can even disable CONFIG_BLOCK now.

And the code itself was moved to drivers/mtd/mtdsuper.c fairly recently.
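
As a rough illustration of the name handling described above, here is a
userspace sketch.  It is not the code from drivers/mtd/mtdsuper.c:
parse_mtd_name() and struct mtd_ref are made-up names, and the sketch only
covers telling "mtd<number>" apart from "mtd:<name>", not the actual
device lookup.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type: either an index ("mtd0") or a name ("mtd:foo"). */
struct mtd_ref {
	int by_name;        /* 1 if name is valid, 0 if index is valid */
	int index;          /* -1 when by_name is set */
	const char *name;   /* points into the caller's string, or NULL */
};

/*
 * Returns 0 on success, -1 if dev_name is neither "mtd<number>" nor
 * "mtd:<name>".  Userspace sketch only, not the kernel implementation.
 */
static int parse_mtd_name(const char *dev_name, struct mtd_ref *ref)
{
	char *end;
	long n;

	if (strncmp(dev_name, "mtd", 3) != 0)
		return -1;
	dev_name += 3;

	if (*dev_name == ':') {
		if (dev_name[1] == '\0')
			return -1;      /* "mtd:" with an empty name */
		ref->by_name = 1;
		ref->index = -1;
		ref->name = dev_name + 1;
		return 0;
	}

	if (!isdigit((unsigned char)*dev_name))
		return -1;
	n = strtol(dev_name, &end, 10);
	if (*end != '\0' || n < 0 || n > INT_MAX)
		return -1;              /* trailing junk or out of range */

	ref->by_name = 0;
	ref->index = (int)n;
	ref->name = NULL;
	return 0;
}
```

With this, "mtd0" yields index 0, "mtd:some_name" yields the name, and
something like "/dev/mtdblock0" is rejected, so a caller could fall back
to the old block device path in that case.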

Jörn

-- 
Joern's library part 2:
http://www.art.net/~hopkins/Don/unix-haters/tirix/embarrassing-memo.html

Re: [BLOCK2MTD] WARNING: at kernel/lockdep.c:2331 lockdep_init_map()

2007-10-21 Thread Jörn Engel

On Fri, 19 October 2007 20:31:29 +0200, Peter Zijlstra wrote:
  
  BUG: key f88e1340 not in .data!
  WARNING: at kernel/lockdep.c:2331 lockdep_init_map()
   [c0102bc2] show_trace_log_lvl+0x1a/0x2f
   [c0103692] show_trace+0x12/0x14
   [c01037b2] dump_stack+0x15/0x17
   [c0125432] lockdep_init_map+0x94/0x3e4
   [c0125001] debug_mutex_init+0x2c/0x3c
   [c01210d4] __mutex_init+0x38/0x40
   [f88e01d3] 0xf88e01d3
   [c011dda7] parse_args+0x123/0x200
   [c012b725] sys_init_module+0xdd0/0x122c
   [c0102586] sysenter_past_esp+0x5f/0x91
   ===
  block2mtd: mtd0: [d: /dev/loop0] erase_size = 128KiB [131072]
  block2mtd: version $Revision: 1.30 $
 
 Someone stuck a key object in non static storage. That breaks lockdep,
 don't do that :-)
 
 Is the mutex_init() done from a function tagged with __init?

Root cause is an ordering problem in module loading.  Code flow is
roughly this:
sys_init_module
`- load_module
:   `- parse_args
:       `- block2mtd_setup
:           `- __mutex_init
:               `- lockdep_init_map
:                   `- static_obj
:                       `- is_module_address
`- __link_module
is_module_address() would return something sane if __link_module() had
already been called.  In fact, if the parameter is passed through
/sys/module/block2mtd/parameters/block2mtd _after_ module load time,
the exact same code works fine.  Only when passing the parameter as a
module parameter do we see this problem.
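
To make the ordering concrete, here is a toy userspace model.  It is not
kernel code: toy_is_module_address() and toy_link_module() merely stand in
for is_module_address() and __link_module(), and the module base and size
used in the usage note are assumptions chosen to cover the key address
from the log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for struct module: just the address range of its image. */
struct toy_module {
	uintptr_t base;
	size_t size;
	struct toy_module *next;
};

static struct toy_module *module_list;  /* modules that are already linked */

/* Models is_module_address(): it can only see modules on module_list. */
static int toy_is_module_address(uintptr_t addr)
{
	struct toy_module *m;

	for (m = module_list; m; m = m->next)
		if (addr >= m->base && addr - m->base < m->size)
			return 1;
	return 0;
}

/* Models __link_module(): makes the module visible to address lookups. */
static void toy_link_module(struct toy_module *m)
{
	m->next = module_list;
	module_list = m;
}
```

With an assumed module image at 0xf88e0000 of size 0x2000, a lookup of
the key address 0xf88e1340 fails before toy_link_module() is called and
succeeds afterwards, which is the same difference as between passing the
parameter at load time and writing it through sysfs later.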

So what should be done?  We could move parse_args() below
__link_module(), but I'd guess such a change would break some other
modules that depend on certain parameters or that should at least fail
to load with illegal parameters.  Do such modules exist?

Or we could add some kind of parse_args_late() that is called after
__link_module(), if requested by a module, and annotate block2mtd to
prefer that version.
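
A minimal sketch of how such an opt-in could look, again as a userspace
toy and not as a patch.  struct late_mod, its late_params flag and
toy_load_module() are all hypothetical; nothing like this exists in the
module loader today.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-module description; late_params is the proposed opt-in. */
struct late_mod {
	int late_params;                      /* 1: parse after linking */
	int linked;                           /* set once "linked" */
	int (*setup)(struct late_mod *mod);   /* parameter handler */
	int saw_linked;                       /* what setup() observed */
};

/* Example handler: records whether the module was linked when it ran. */
static int record_link_state(struct late_mod *mod)
{
	mod->saw_linked = mod->linked;
	return 0;
}

static int toy_load_module(struct late_mod *mod)
{
	int err;

	if (!mod->late_params) {
		err = mod->setup(mod);   /* current order: parse_args() first */
		if (err)
			return err;      /* bad parameters still fail the load */
	}

	mod->linked = 1;                 /* stands in for __link_module() */

	if (mod->late_params) {
		err = mod->setup(mod);   /* proposed parse_args_late() */
		if (err) {
			mod->linked = 0; /* would have to unlink again */
			return err;
		}
	}
	return 0;
}
```

Modules that do not set the flag keep today's behaviour, so anything
relying on early parameter parsing is unaffected.  The price is the extra
unlink step on the late error path.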

[ Adding Ingo on Cc:.  Since block2mtd predates lockdep I found a bug in
  his code and not the other way around. ;) ]

Jörn

-- 
Do not stop an army on its way home.
-- Sun Tzu

Re: [PATCH 2.6.24] block2mtd: removing a device and typo fixes

2008-02-12 Thread Jörn Engel

On Tue, 12 February 2008 13:47:51 +, Stephane Chazelas wrote:
 
 this patch addresses a number of small issues mainly regarding
 the output made by this driver to dmesg:
   - Some of the blkmtd's had not been changed to block2mtd
 which caused display problem
   - the parse_err() macro was displaying block2mtd:  twice

Fairly obvious fixes.

 Also, one can add a block2mtd mtd device with things like:
 
 echo /dev/loop3,$((256*1024)) |
   sudo tee /sys/module/block2mtd/parameters/block2mtd
 
 but individual mtds cannot be removed. You can only do a
 modprobe -r block2mtd to remove *all* the block2mtd mtds.
 
 This patch proposes to add the capability with:
 
 echo /dev/loop3,remove |
   sudo tee /sys/module/block2mtd/parameters/block2mtd

Sounds sane enough.  But I do have some reservations about the
implementation.  It would be best if you split the patch in two.  One
with the obvious stuff above and one for this.

The core of remove_device_by_name() is shared with block2mtd_exit(),
so a common helper would be good.  Your error handling is better, so
let's keep that version.

And independently of your patch a mutex protecting the device list from
simultaneous modifications would be good to have.
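
Roughly what I have in mind, as a userspace sketch with pthreads standing
in for the kernel mutex.  All names here (struct toy_dev, toy_free_dev()
and friends) are invented for the example, and the real driver has more
to tear down per device than a name string.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified device record; the real driver keeps more state. */
struct toy_dev {
	char *name;
	struct toy_dev *next;
};

static struct toy_dev *dev_list;
static pthread_mutex_t dev_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Shared teardown used by both removal paths. */
static void toy_free_dev(struct toy_dev *dev)
{
	free(dev->name);
	free(dev);
}

static int toy_add_dev(const char *name)
{
	struct toy_dev *dev = malloc(sizeof(*dev));

	if (!dev)
		return -1;
	dev->name = malloc(strlen(name) + 1);
	if (!dev->name) {
		free(dev);
		return -1;
	}
	strcpy(dev->name, name);

	pthread_mutex_lock(&dev_list_lock);
	dev->next = dev_list;
	dev_list = dev;
	pthread_mutex_unlock(&dev_list_lock);
	return 0;
}

/* Remove one device by name; returns 0 on success, -1 if not found. */
static int toy_remove_dev(const char *name)
{
	struct toy_dev **pp, *dev = NULL;

	pthread_mutex_lock(&dev_list_lock);
	for (pp = &dev_list; *pp; pp = &(*pp)->next) {
		if (strcmp((*pp)->name, name) == 0) {
			dev = *pp;
			*pp = dev->next;   /* unlink while holding the lock */
			break;
		}
	}
	pthread_mutex_unlock(&dev_list_lock);

	if (!dev)
		return -1;
	toy_free_dev(dev);
	return 0;
}

/* Remove everything, as a module exit path would. */
static void toy_remove_all(void)
{
	struct toy_dev *dev;

	pthread_mutex_lock(&dev_list_lock);
	while ((dev = dev_list) != NULL) {
		dev_list = dev->next;
		toy_free_dev(dev);
	}
	pthread_mutex_unlock(&dev_list_lock);
}
```

Both toy_remove_dev() and toy_remove_all() go through toy_free_dev(), so
the per-device cleanup and its error handling live in one place, and
every list modification happens under dev_list_lock.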

Side note: I may not have internet access until 19th or so.

Jörn

-- 
Rules of Optimization:
Rule 1: Don't do it.
Rule 2 (for experts only): Don't do it yet.
-- M.A. Jackson
