Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation

2007-02-25 Thread Juan Piernas Canovas

On Sun, 25 Feb 2007, Jörn Engel wrote:


On Sun, 25 February 2007 03:41:40 +0100, Juan Piernas Canovas wrote:


Well, our experimental results say otherwise. As I have said, most of the
files are written at once, so their meta-data blocks are together on disk.
This allows DualFS to implement an explicit prefetching of meta-data blocks
which is quite effective, especially when there are several processes
reading from disk at the same time.

On the other hand, DualFS also implements an on-line meta-data relocation
mechanism which can help to improve meta-data prefetching and garbage
collection.

Obviously, there can be some slow-growing files that produce garbage, but
they do not hurt the overall performance of the file system.


Well, my concerns about the design have gone.  There remain some
concerns about the source code and I hope they will disappear just as
fast. :)


This is a bit more complicated ;)


Obviously, a patch against 2.4.x is fairly useless.  Iirc, you claimed
somewhere to have a patch against 2.6.11, but I was unable to find that.
Porting it from 2.6.11 to 2.6.20 should be simple enough.


I'm working on a public patch of DualFS for Linux 2.6.x. It's a matter of 
time.




Then there is some assembly code inside the patch that you seem to have
copied from some other project.  I would be surprised if that is really
required.  If you can replace it with C code, please do.

If the assembly actually is a performance gain (and I consider it your
duty to prove that), you can have a two-patch series with the first
introducing DualFS and the second adding the assembly as a config option
for one architecture.


No problem. I will see if the assembly code can be replaced with bare C.

Regards,

Juan.
--
D. Juan Piernas Cánovas
Departamento de Ingeniería y Tecnología de Computadores
Facultad de Informática. Universidad de Murcia
Campus de Espinardo - 30080 Murcia (SPAIN)
Tel.: +34968367657   Fax: +34968364151
email: [EMAIL PROTECTED]
PGP public key:
http://pgp.rediris.es:11371/pks/lookup?search=piernas%40ditec.um.es&op=index

*** Por favor, envíeme sus documentos en formato texto, HTML, PDF o PostScript 
:-) ***

Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation

2007-02-24 Thread Juan Piernas Canovas

Hi Jörn,

On Fri, 23 Feb 2007, Jörn Engel wrote:


On Thu, 22 February 2007 20:57:12 +0100, Juan Piernas Canovas wrote:


I do not agree with this picture, because it does not show that all the
indirect blocks which point to a direct block are stored along with it in
the same segment. That figure should look like:

Segment 1: [some data] [ DA D1' D2' ] [more data]
Segment 2: [some data] [ D0 D1' D2' ] [more data]
Segment 3: [some data] [ DB D1  D2  ] [more data]

where D0, DA, and DB are datablocks, D1 and D2 indirect blocks which
point to the datablocks, and D1' and D2' obsolete copies of those
indirect blocks. With this figure, it is clear that if you need to
move D0 to clean segment 2, you will need at most one free segment,
and not more. You will get:

Segment 1: [some data] [ DA D1' D2' ] [more data]
Segment 2: [free]
Segment 3: [some data] [ DB D1' D2' ] [more data]
..
Segment n: [ D0 D1 D2 ] [ empty ]

That is, D0 needs the same space in the new segment as it did in the
previous one.

The differences are subtle but important.


Ah, now I see.  Yes, that is deadlock-free.  If you are not accounting
the bytes of used space but the number of used segments, and you count
each partially used segment the same as a 100% used segment, there is no
deadlock.

Some people may consider this to be cheating, however.  It will cause
more than 50% wasted space.  All obsolete copies are garbage, after all.
With a maximum tree height of N, you can have up to (N-1) / N of your
filesystem occupied by garbage.
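
As a quick illustration of that bound (a small Python sketch written for
illustration, not LogFS or DualFS code; the block counts are made up):
appending K data blocks to the same file re-emits the N-1 blocks above each
one, and only the newest copy of that ancestor chain stays live, so the
garbage fraction tends to (N-1)/N.

def garbage_fraction(tree_height, appended_blocks):
    # Every append writes 1 data block plus (tree_height - 1) fresh ancestor copies.
    written = appended_blocks * tree_height
    # All data blocks stay live, but only the newest ancestor chain does.
    live = appended_blocks + (tree_height - 1)
    return (written - live) / written

for n in (2, 3, 5):
    print(n, round(garbage_fraction(n, 10**6), 4))   # tends to (n-1)/n: 0.5, 0.6667, 0.8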


I do not agree. Fortunately, most of the files are written at once, so
what you usually have is:


Segment 1: [  data  ]
Segment 2: [some data] [ D0 DA DB D1 D2 ] [more data]
Segment 3: [  data  ]
..

On the other hand, the DualFS cleaner tries to clean several segments
every time it runs. Therefore, if you have the following case:


Segment 1: [some data] [ DA D1' D2' ] [more data]
Segment 2: [some data] [ D0 D1' D2' ] [more data]
Segment 3: [some data] [ DB D1' D2' ] [more data]
..

after cleaning, you can have this one:

Segment 1: [  free  ]
Segment 2: [  free  ]
Segment 3: [  free  ]
..
Segment i: [D0 DA DB D1 D2 ] [   more data  ]

Moreover, if the cleaner starts running when the free space drops below a 
specific threshold, it is very difficult to waste more than 50% of the disk 
space, especially with meta-data (actually, I am unable to imagine that 
situation :).
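
A small Python sketch of that accounting (written for illustration, not the
DualFS cleaner; the segment size is made up): because every live partial
segment already carries its own copies of the indirect blocks, the cleaner
copies it unchanged, so packing the survivors of several victim segments can
never consume more fresh segments than it frees.

SEG_BLOCKS = 8                                  # hypothetical segment size, in blocks

def clean(victims):
    # victims: list of segments, each a list of (size_in_blocks, still_live) partial segments.
    fresh, used = 0, SEG_BLOCKS                 # force opening a fresh segment on the first copy
    for segment in victims:
        for size, live in segment:
            if not live:
                continue                        # garbage is simply dropped
            if used + size > SEG_BLOCKS:        # a partial segment never crosses a boundary
                fresh, used = fresh + 1, 0
            used += size
    assert fresh <= len(victims)                # frees at least as many segments as it fills
    return len(victims) - fresh                 # net number of free segments gained

# Three half-dirty segments compact into two fresh ones, for a net gain of one.
print(clean([[(3, True), (5, False)], [(4, True), (4, False)], [(5, True), (3, False)]]))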



Another downside is that with large amounts of garbage between otherwise
useful data, your disk cache hit rate goes down.  Read performance
suffers.  But that may be a fair tradeoff and will only show up in
large metadata reads in the uncached (per Linux) case.  Seems fair.


Well, our experimental results say otherwise. As I have said, most of the 
files are written at once, so their meta-data blocks are together on disk. 
This allows DualFS to implement an explicit prefetching of meta-data blocks 
which is quite effective, especially when there are several processes 
reading from disk at the same time.
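
A rough userspace analogy of that prefetching (a hedged sketch, not DualFS
code; the fd, block size and block numbers are made up): because the
meta-data blocks of a file written at once sit next to each other in the
log, one readahead hint can cover the whole contiguous run.

import os

BLOCK_SIZE = 4096                               # assumed meta-data block size

def prefetch_run(fd, first_block, block_count):
    # Hint the kernel to read the whole contiguous run of blocks ahead of time.
    os.posix_fadvise(fd, first_block * BLOCK_SIZE, block_count * BLOCK_SIZE,
                     os.POSIX_FADV_WILLNEED)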


On the other hand, DualFS also implements an on-line meta-data relocation 
mechanism which can help to improve meta-data prefetching and garbage 
collection.


Obviously, there can be some slow-growing files that produce garbage, but 
they do not hurt the overall performance of the file system.




Quite interesting, actually.  The costs of your design are disk space,
depending on the amount and depth of your metadata, and metadata read
performance.  Disk space is cheap and metadata reads tend to be slow for
most filesystems, in comparison to data reads.  You gain faster metadata
writes and lose the journal overhead.  I like the idea.



Yeah :) As you can see in my presentation at LSF07, the disk traffic of 
meta-data blocks is dominated by writes.



Jörn



Juan.
--
D. Juan Piernas Cánovas
Departamento de Ingeniería y Tecnología de Computadores
Facultad de Informática. Universidad de Murcia
Campus de Espinardo - 30080 Murcia (SPAIN)
Tel.: +34968367657   Fax: +34968364151
email: [EMAIL PROTECTED]
PGP public key:
http://pgp.rediris.es:11371/pks/lookup?search=piernas%40ditec.um.es&op=index

*** Por favor, envíeme sus documentos en formato texto, HTML, PDF o PostScript 
:-) ***

Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation

2007-02-22 Thread Juan Piernas Canovas

Hi Jörn,

On Thu, 22 Feb 2007, Jörn Engel wrote:


A partial segment is a transaction unit, and contains "all" the blocks
modified by a file system operation, including indirect blocks and i-nodes
(actually, it contains the blocks modified by several file system
operations, but let us assume that every partial segment only contains the
blocks modified by a single file system operation).

So, the above figure is as follows in DualFS:

 Before:
 Segment 1: [some data] [ D0 D1 D2 I ] [more data]
 Segment 2: [ some data  ]
 Segment 3: [   empty]

If the datablock D0 is modified, what you get is:

 Segment 1: [some data] [  garbage   ] [more data]
 Segment 2: [ some data  ]
 Segment 3: [ D0 D1 D2 I ] [   empty ]


You have fairly strict assumptions about the "Before:" picture.  But


The "Before" figure is intentionally simple because it is enough to 
understand how the meta-data device of DualFS works, and why the cleaner 
is deadlock-free ;-)



what happens if those assumptions fail.  To give you an example, imagine
the following small script:

$ for i in `seq 1000000`; do touch $i; done

This will create a million dentries in one directory.  It will also
create a million inodes, but let us ignore those for a moment.  It is
fairly unlikely that you can fit a million dentries into [D0], so you
will need more than one block.  Let's call them [DA], [DB], [DC], etc.
So you have to write out the first block [DA].

Before:
Segment 1: [some data] [ DA D1 D2 I ] [more data]
Segment 2: [ some data  ]
Segment 3: [   empty]

If the datablock D0 is modified, what you get is:

Segment 1: [some data] [  garbage   ] [more data]
Segment 2: [ some data  ]
Segment 3: [ DA D1 D2 I ] [   empty ]

That is exactly your picture.  Fine.  Next you write [DB].

Before: see above
After:
Segment 1: [some data] [  garbage   ] [more data]
Segment 2: [ some data  ]
Segment 3: [ DA][garbage] [ DB D1 D2 I ] [ empty]

You write [DC].  Note that Segment 3 does not have enough space for
another partial segment:

Segment 1: [some data] [  garbage   ] [more data]
Segment 2: [ some data  ]
Segment 3: [ DA][garbage] [ DB][garbage] [wasted]
Segment 4: [ DC D1 D2 I ] [   empty ]

You write [DD] and [DE]:
Segment 1: [some data] [  garbage   ] [more data]
Segment 2: [ some data  ]
Segment 3: [ DA][garbage] [ DB][garbage] [wasted]
Segment 4: [ DC][garbage] [ DD][garbage] [wasted]
Segment 5: [ DE D1 D2 I ] [   empty ]

And some time later you even have to switch to a new indirect block, so
you get before:

Segment n  : [ DX D1 D2 I ] [   empty ]

After:

Segment n  : [ DX D1][garb] [ DY DI D2 I ] [ empty]

What you end up with after all this is quite unlike your "Before"
picture.  Instead of this:


 Segment 1: [some data] [ D0 D1 D2 I ] [more data]




I agree with all the above, although it displays the worst case, because a 
partial segment usually contains several datablocks and a few indirect 
blocks. But it is fine for our purposes.



You may have something closer to this:


Segment 1: [some data] [   D1  ] [more data]
Segment 2: [some data] [   D0  ] [more data]
Segment 3: [some data] [   D2  ] [more data]




I do not agree with this picture, because it does not show that all the 
indirect blocks which point to a direct block are stored along with it in 
the same segment. That figure should look like:


Segment 1: [some data] [ DA D1' D2' ] [more data]
Segment 2: [some data] [ D0 D1' D2' ] [more data]
Segment 3: [some data] [ DB D1  D2  ] [more data]

where D0, DA, and DB are datablocks, D1 and D2 indirect blocks which 
point to the datablocks, and D1' and D2' obsolete copies of those 
indirect blocks. With this figure, it is clear that if you need to 
move D0 to clean segment 2, you will need at most one free segment, 
and not more. You will get:


Segment 1: [some data] [ DA D1' D2' ] [more data]
Segment 2: [free]
Segment 3: [some data] [ DB D1' D2' ] [more data]
..
Segment n: [ D0 D1 D2 ] [ empty ]

That is, D0 needs the same space in the new segment as it did in the 
previous one.


The differences are subtle but important.

Regards,

Juan.


You should try the testcase and look at a dump of your filesystem
afterwards.  I usually just read the raw device in a hex editor.

Jörn




--
D. Juan Piernas Cánovas
Departamento de Ingeniería y Tecnología de Computadores
Facultad de Informática. Universidad de Murcia
Campus de Espinardo - 30080 Murcia (SPAIN)
Tel.: +34968367657   Fax: +34968364151
email: [EMAIL PROTECTED]
PGP public key:
http://pgp.rediris.es:11371/pks/lookup?search=piernas%40ditec.um.es&op=index

*** Por favor, envíeme sus documentos en formato texto, HTML, PDF o PostScript 
:-) 

Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation

2007-02-21 Thread Juan Piernas Canovas

Hi Jörn,

I have been thinking about the problem that you describe and, definitely, 
DualFS does not have that problem. I could be wrong, but I actually 
believe that the GC implemented by DualFS is deadlock-free. The key is the 
design of the log-structured file system used by DualFS for the meta-data 
device, which is different from the design that you propose.


On Wed, 21 Feb 2007, Jörn Engel wrote:


On Wed, 21 February 2007 19:31:40 +0100, Juan Piernas Canovas wrote:


I do not understand. Do you mean that if I have 10 segments, 5 busy and 5
free, after cleaning I could need 6 segments? How? Where do the extra blocks
come from?


This is a fairly complicated subject and I have trouble explaining it to
people - even though I hope that maybe one or two dozen understand it by
now.  So let me try to give you an example:

In LogFS, inodes are stored in an inode file.  There are no B-Trees yet,
so the regular unix indirect blocks are used.  My example will be
writing to a directory, so that should only involve metadata by your
definition and be a valid example for DualFS as well.  If it is not,
please tell me where the difference lies.

The directory is large, so appending to it involves writing a datablock
(D0), an indirect block (D1) and a doubly indirect block (D2).

Before:
Segment 1: [some data] [   D1  ] [more data]
Segment 2: [some data] [   D0  ] [more data]
Segment 3: [some data] [   D2  ] [more data]
Segment 4: [ empty ]
...


DualFS writes meta-data blocks in variable-sized chunks that we call partial 
segments. The meta-data device, however, is divided into segments, which 
have the same size. A partial segment can be as large as a segment, but a 
segment usually has more than one partial segment. Besides, a partial 
segment cannot cross a segment boundary.
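
A tiny Python sketch of that placement rule (illustrative only, not DualFS
code; the segment size is made up): a partial segment that does not fit in
the remainder of the current segment starts the next one, and the tail of
the old segment is left unused.

SEG_BLOCKS = 64                                  # hypothetical segment size, in blocks

def place(partial_segment_sizes):
    # Returns the (segment, offset) at which each variable-sized partial segment lands.
    seg, off, out = 0, 0, []
    for size in partial_segment_sizes:
        assert size <= SEG_BLOCKS                # a partial segment is at most one segment
        if off + size > SEG_BLOCKS:              # would cross a boundary: skip to the next segment
            seg, off = seg + 1, 0
        out.append((seg, off))
        off += size
    return out

print(place([40, 30, 10, 50]))                   # -> [(0, 0), (1, 0), (1, 30), (2, 0)]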


A partial segment is a transaction unit, and contains "all" the blocks 
modified by a file system operation, including indirect blocks and i-nodes 
(actually, it contains the blocks modified by several file system 
operations, but let us assume that every partial segment only contains the 
blocks modified by a single file system operation).


So, the above figure is as follows in DualFS:

 Before:
 Segment 1: [some data] [ D0 D1 D2 I ] [more data]
 Segment 2: [ some data  ]
 Segment 3: [   empty]

If the datablock D0 is modified, what you get is:

 Segment 1: [some data] [  garbage   ] [more data]
 Segment 2: [ some data  ]
 Segment 3: [ D0 D1 D2 I ] [   empty ]

This is very similar to what the cleaner does. Therefore, moving a direct 
datablock (D0) to a new segment does not require more space than in the 
original segment. That is, cleaning a segment in DualFS requires at most one 
free segment, and no more.


The result is that you can use all the free segments in DualFS, and its 
cleaner is simple and deadlock-free. Probably the design is not the most 
space-efficient in the world, but it removes some other serious problems.


And, remember, we are talking about meta-data (which is a small part of 
the file system), and disk space (which is quite inexpensive).


Regards,

Juan.



After:
Segment 1: [some data] [garbage] [more data]
Segment 2: [some data] [garbage] [more data]
Segment 3: [some data] [garbage] [more data]
Segment 4: [D0][D1][D2][  empty]
...

Ok.  After this, the position of D2 on the medium has changed.  So we
need to update the inode and write that as well.  If the inode number
for this directory is high, we will need to write the inode (I0), an
indirect block (I1) and a doubly indirect block (I2).  The picture
becomes a bit more complicated.

Before:
Segment 1: [some data] [   D1  ] [more data]
Segment 2: [some data] [   D0  ] [more data]
Segment 3: [some data] [   D2  ] [more data]
Segment 4: [ empty ]
Segment 5: [some data] [   I1  ] [more data]
Segment 6: [some data] [   I0  ] [more data]
Segment 7: [some data] [   I2  ] [more data]
...

After:
Segment 1: [some data] [garbage] [more data]
Segment 2: [some data] [garbage] [more data]
Segment 3: [some data] [garbage] [more data]
Segment 4: [D0][D1][D2][I0][I1][I2][ empty ]
Segment 5: [some data] [garbage] [more data]
Segment 6: [some data] [garbage] [more data]
Segment 7: [some data] [garbage] [more data]
...

So what has just happened?  The user did a single "touch foo" in a large
directory and has caused six objects to move.  Unless some of those
objects were in the same segment before, we now have six segments
containing a tiny amount of garbage.

And there is almost no way you can squeeze that garbage back out.
The cleaner will fundamentally do the same thing as a regular write - it
will move objects.  So if you want to clean a segment containing the
block of a different directory, you may again have to move five
additional objects, the indirect blocks, inode and ifile indirect
blocks.

Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation

2007-02-21 Thread Juan Piernas Canovas

Hi Jörn,

On Wed, 21 Feb 2007, Jörn Engel wrote:


On Wed, 21 February 2007 05:36:22 +0100, Juan Piernas Canovas wrote:


I don't see how you can guarantee 50% free segments.  Can you explain
that bit?

It is quite simple. If 50% of your segments are busy, and the other 50%
are free, and the file system needs a new segment, the cleaner starts
freeing some of the busy ones. If the cleaner is unable to free at least
one segment, your file system gets "full" (and it returns a nice ENOSPC error).
This solution wastes half of your storage device, but it is
deadlock-free. Obviously, there are better approaches.


Ah, ok.  It is deadlock-free if the maximal height of your tree is 2.
It is not 100% deadlock-free if the height is 3 or more.

Also, I strongly suspect that your tree is higher than 2.  A medium
sized directory will have data blocks, indirect blocks and the inode
proper, which gives you a height of 3.  Your inodes need to get accessed
somehow and unless they have fixed positions like in ext2, you need a
further tree structure of some sort, so you're more likely looking at a
height of 5.

With a height of 5, you would need to keep 80% of your metadata free.
That is starting to get wasteful.

So I suspect that my proposed alternate cleaner mechanism or the even
better "hole plugging" mechanism proposed in the paper a few posts above
would be a better path to follow.


I do not understand. Do you mean that if I have 10 segments, 5 busy and 5 
free, after cleaning I could need 6 segments? How? Where do the extra blocks 
come from?


Juan.

--
D. Juan Piernas Cánovas
Departamento de Ingeniería y Tecnología de Computadores
Facultad de Informática. Universidad de Murcia
Campus de Espinardo - 30080 Murcia (SPAIN)
Tel.: +34968367657   Fax: +34968364151
email: [EMAIL PROTECTED]
PGP public key:
http://pgp.rediris.es:11371/pks/lookup?search=piernas%40ditec.um.es&op=index

*** Por favor, envíeme sus documentos en formato texto, HTML, PDF o PostScript 
:-) ***

Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation

2007-02-20 Thread Juan Piernas Canovas

Hi Jörn,

On Tue, 20 Feb 2007, Jörn Engel wrote:


On Tue, 20 February 2007 00:57:50 +0100, Juan Piernas Canovas wrote:


Actually, the GC may become a problem when the number of free segments is
50% or less. If your LFS always guarantees at least 50% free
"segments" (note that I am talking about segments, not free space), the
deadlock problem disappears, right? This is quite a naive solution, but it
works.


I don't see how you can guarantee 50% free segments.  Can you explain
that bit?
It is quite simple. If 50% of your segments are busy, and the other 50% 
are free, and the file system needs a new segment, the cleaner starts 
freeing some of the busy ones. If the cleaner is unable to free at least one 
segment, your file system gets "full" (and it returns a nice ENOSPC error). 
This solution wastes half of your storage device, but it is 
deadlock-free. Obviously, there are better approaches.





In a traditional LFS, with data and meta-data blocks, 50% of free segments
represents a huge amount of wasted disk space. But in DualFS, 50% of free
segments in the meta-data device is not too much. In a typical Ext2 or
Ext3 file system, there are 20 data blocks for every meta-data block
(that is, meta-data blocks are 5% of the disk blocks used by files).
Since files are implemented in DualFS in the same way, we can assume the
same ratio for DualFS (1).


This will work fairly well for most people.  It is possible to construct
metadata-heavy workloads, however.  Many large directories containing
symlinks or special files (char/block devices, sockets, fifos,
whiteouts) come to mind.  Most likely none of your users will ever want
that, but a malicious attacker might.

Quotas, a bigger meta-data device, a cleverer cleaner... there are 
solutions :)



The point of all the above is that you must improve the common case, and
manage the worst case correctly. And that is the idea behind DualFS :)


A fine principle to work with.  Surprisingly, what is the worst case for
you is the common case for LogFS, so maybe I'm more interested in it
than most people.  Or maybe I'm just more paranoid.



No, you are right. It is the common case for LogFS because it has data and 
meta-data blocks in the same address space, but that is not the case for 
DualFS. Anyway, I'm very interested in your work because any solution to 
the problem of the GC will also be applicable to DualFS. So, keep it up. ;-)


Juan.
--
D. Juan Piernas Cánovas
Departamento de Ingeniería y Tecnología de Computadores
Facultad de Informática. Universidad de Murcia
Campus de Espinardo - 30080 Murcia (SPAIN)
Tel.: +34968367657   Fax: +34968364151
email: [EMAIL PROTECTED]
PGP public key:
http://pgp.rediris.es:11371/pks/lookup?search=piernas%40ditec.um.es&op=index

*** Por favor, envíeme sus documentos en formato texto, HTML, PDF o PostScript 
:-) ***

Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation

2007-02-19 Thread Juan Piernas Canovas

Hi Jörn,

I understand the problem that you describe with respect to the GC, but 
let me explain why I think that it has a small impact on DualFS.


Actually, the GC may become a problem when the number of free segments is 
50% or less. If your LFS always guarantees at least 50% free 
"segments" (note that I am talking about segments, not free space), the 
deadlock problem disappears, right? This is quite a naive solution, but it 
works.


In a traditional LFS, with data and meta-data blocks, 50% of free segments 
represents a huge amount of wasted disk space. But in DualFS, 50% of free 
segments in the meta-data device is not too much. In a typical Ext2 or 
Ext3 file system, there are 20 data blocks for every meta-data block 
(that is, meta-data blocks are 5% of the disk blocks used by files). 
Since files are implemented in DualFS in the same way, we can assume the 
same ratio for DualFS (1).


Now, let us assume that the data device takes 90% of the disk space, and 
the meta-data device the other 10%. When the data device gets full, the 
meta-data blocks will be using half of the meta-data device, and the 
other half (5% of the entire disk) will be free. Frankly, 5% is not too 
much.
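
Back-of-the-envelope version of that argument (hypothetical sizes, Python
only for the arithmetic):

disk_gb     = 500                                # a hypothetical disk
md_fraction = 0.10                               # meta-data device takes 10% of it
free_half   = 0.50                               # the cleaner keeps >= 50% of its segments free
wasted_gb   = disk_gb * md_fraction * free_half
print(wasted_gb, wasted_gb / disk_gb)            # 25.0 GB, i.e. 0.05 = 5% of the whole disk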


Remember, I am assuming a naive implementation of the cleaner. With a 
cleverer one, the meta-data device can be smaller, and the amount of
disk space ultimately wasted can be smaller too. The following paper proposes 
some improvements:


- Jeanna Neefe Matthews, Drew Roselli, Adam Costello, Randy Wang, and
  Thomas Anderson.  "Improving the Performance of Log-structured File
  Systems with Adaptive Methods".  Proc. Sixteenth ACM Symposium on
  Operating Systems Principles (SOSP), October 1997, pages 238 - 251.

BTW, I think that what they propose is very similar to the two-strategy 
GC that you propose in a separate e-mail.


The point of all the above is that you must improve the common case, and 
manage the worst case correctly. And that is the idea behind DualFS :)


Regards,

Juan.

(1) DualFS can also use extents to implement regular files, so the ratio 
of data blocks with respect to meta-data blocks can be greater.



On Sun, 18 Feb 2007, Jörn Engel wrote:


On Sat, 17 February 2007 15:47:01 -0500, Sorin Faibish wrote:


DualFS can probably get around this corner case as it is up to the user
to select the size of the MD device. If you want to prevent this
corner case you can always use a device bigger than 10% of the data device,
which is exaggerated for any FS assuming that the directory files are so
large (this is when you have billions of files with long names).
In general the problem you mention is mainly due to the data blocks
filling the file system. In the DualFS case you have the choice of selecting
different sizes for the MD and Data volumes. When the Data volume gets full
the GC will have a problem, but the MD device will not have a problem.
It is my understanding that most of the GC problem you mention is
due to the filling of the FS with data, and the result is an MD operation
being disrupted by the filling of the FS with data blocks. As for the
performance impact of solving this problem, as you mentioned all
journal FSs will have this problem; I am sure that the DualFS performance
impact will be less than others', at least due to using only one MD
write instead of 2.


You seem to be making the usual mistakes people make when they start to
think about this problem.  But I could be misinterpreting you, so let me
paraphrase your mail in questions and answer what I believe you said.

Q: Are journaling filesystems identical to log-structured filesystems?

Not quite.  Journaling filesystems usually have a very small journal (or
log, same thing) and only store the information necessary for atomic
transactions in the journal.  Not sure what a "journal FS" is, but the
name seems closer to a journaling filesystem.

Q: DualFS separates Data and Metadata.  Does that make a difference?

Not really.  What I called "data" in my previous mail is a
log-structured filesystem's view of data.  DualFS stores file content
separately, so from an lfs view, that doesn't even exist.  But directory
content exists and behaves just like file content wrt. the deadlock
problem.  Any data or metadata that cannot be GC'd by simply copying but
requires writing further information like indirect blocks, B-Tree nodes,
etc. will cause the problem.

Q: If the user simply reserves some extra space, does the problem go
away?

Definitely not.  It will be harder to hit, but a rare deadlock is still
a deadlock.  Again, this is only concerned with the log-structured part
of DualFS, so we can ignore the Data volume.

When data is spread perfectly across all segments, the best segment one
can pick for GC is just as bad as the worst.  So let us take some
examples.  If 50% of the lfs is free, you can pick a 50% segment for GC.
Writing every single block in it may require writing one additional
indirect block, so GC is required to write out a 100% segment.
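
A minimal sketch of that write amplification (illustrative Python, not LogFS
or DualFS code; the numbers are made up): if every live block dragged out of
a victim segment needs one extra indirect block written with it, cleaning a
50% live segment produces a full segment's worth of output, so the net gain
is zero.

SEG_BLOCKS = 100                                 # hypothetical segment size, in blocks

def gc_output(live_blocks):
    # Each live block is copied and may force one indirect block to be rewritten.
    return live_blocks * 2

print(gc_output(50))                             # 100 blocks written to clean 100: no net gain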

Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation

2007-02-15 Thread Juan Piernas Canovas

Hi,

On Fri, 16 Feb 2007, Andi Kleen wrote:


If you stripe two disks with a standard fs versus use one of them
as metadata volume and the other as data volume with dualfs, I would
expect the striped variant to usually be faster because it will give
parallelism not only to data versus metadata, but also to all data
versus other data.


If you have a RAID system, both the data and meta-data devices of DualFS
can be striped, and you get the same result. No problem for DualFS :)


Sure, but then you need four disks. And if your workload happens
to be much more data intensive than metadata intensive, the
striped spindles assigned to metadata only will be more idle
than the ones doing data.

Striping everything from the same pool has the potential
to adapt itself to any workload mix better.

Why do you need four disks? The data and meta-data devices of DualFS can be on 
different disks, can be two partitions of the same disk, or can be two 
areas of the same partition. The important thing is that data and 
meta-data blocks are separated and that they are managed in different 
ways. Please, take a look at the presentation (see below).



I can see that you win for some specific workloads, but it is
hard to see how you can win over a wide range of workloads
because of that.

No, we win for a wide range of common workloads. See the results in the 
PDF (see below).





Also I would expect your design to be slow for metadata read intensive
workloads. E.g. have you tried to boot a root partition with dual fs?
That's a very important IO benchmark for desktop Linux systems.


I do not think so. The performance of DualFS is superb in meta-data read
intensive workloads. And it is also better than the performance of other
file systems when reading a directory tree with several copies of the Linux
kernel source code (I showed those results on Tuesday at the LSF07
workshop).


PDFs available?


Sure:

http://www.ditec.um.es/~piernas/dualfs/presentation-lsf07-final.pdf


Is that with running an LFS-style cleaner in between or without?


'With' a cleaner.


I would be interested in a "install distro with installer ; boot afterwards
from it" type benchmark. Do you have something like this?

-Andi


I think that the results sent by Sorin answer your question :-)

Regards,

Juan.

--
D. Juan Piernas Cánovas
Departamento de Ingeniería y Tecnología de Computadores
Facultad de Informática. Universidad de Murcia
Campus de Espinardo - 30080 Murcia (SPAIN)
Tel.: +34968367657   Fax: +34968364151
email: [EMAIL PROTECTED]
PGP public key:
http://pgp.rediris.es:11371/pks/lookup?search=piernas%40ditec.um.es&op=index

*** Por favor, envíeme sus documentos en formato texto, HTML, PDF o PostScript 
:-) ***

Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation

2007-02-15 Thread Juan Piernas Canovas

Hi Jörn,

On Thu, 15 Feb 2007, Jörn Engel wrote:


On Thu, 15 February 2007 19:38:14 +0100, Juan Piernas Canovas wrote:


The patch for 2.6.11 is still not stable enough to be released. Be patient
;-)


While I don't want to discourage you, this is about the point in
development where most log structured filesystems stopped.  Doing a
little web research, you will notice those todo-lists with "cleaner"
being the top item for...years!

Getting that one to work robustly is _very_ hard work and just today
I've noticed that mine was not as robust as I would have liked to think.
Also, you may note that by updating to newer kernels, the VM writeout
policies can change and impact your cleaner.  Even to the extent that you
had a rock-solid filesystem with 2.6.18 and things crumble between your
fingers in 2.6.19 or later.

If the latter happens, most likely the VM is not to blame, it just
proved that your cleaner is still getting some corner-cases wrong and
needs more work.  There goes another week of debugging. :(

Jörn

Actually, the version of DualFS for Linux 2.4.19 implements a cleaner. In 
our case, the cleaner is not really a problem because there is not too 
much to clean (the meta-data device only contains meta-data blocks, which 
are just 5-6% of the file system blocks; you do not have to move data 
blocks).
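
As a rough back-of-the-envelope illustration (the 100 GB volume size is 
made up; the 5-6% figure is the one quoted above):

   total file system size     S = 100 GB        (illustrative)
   meta-data fraction         f = 0.05 - 0.06
   cleaner working set        f x S = 5 - 6 GB

so the cleaner only ever has to scan and relocate blocks inside that small 
meta-data region, while the remaining ~94-95 GB of data blocks are written 
in place and never need to be moved by the cleaner.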


Anyway, thank you for warning me ;-)

Regards,

Juan.

--
D. Juan Piernas Cánovas
Departamento de Ingeniería y Tecnología de Computadores
Facultad de Informática. Universidad de Murcia
Campus de Espinardo - 30080 Murcia (SPAIN)
Tel.: +34968367657    Fax: +34968364151
email: [EMAIL PROTECTED]
PGP public key:
http://pgp.rediris.es:11371/pks/lookup?search=piernas%40ditec.um.es&op=index

*** Por favor, envíeme sus documentos en formato texto, HTML, PDF o PostScript 
:-) ***

Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation

2007-02-15 Thread Juan Piernas Canovas

Hi Andi,

On Thu, 15 Feb 2007, Andi Kleen wrote:


Juan Piernas Canovas <[EMAIL PROTECTED]> writes:

[playing devil's advocate here]


If the data and meta-data devices of DualFS can be on different disks,
DualFS is able to READ and WRITE data and meta-data blocks in
PARALLEL.


XFS can do this too using its real time volumes (which don't contain
any metadata).  It can also have a separate log.


But you still need several 'real' devices to separate data and meta-data 
blocks. DualFS does the same with just one real device. Probably the 'data 
device' and 'meta-data device' names are a bit confusing. Think about 
them as partitions, not as real devices.




Also many storage subsystems have some internal parallelism
in writing (e.g. a RAID can write on different disks in parallel for
a single partition), so I'm not sure your distinction is that useful.

But we are talking about a different case. What I have said is that if you 
use two devices, one for the 'regular' file system and another one for the 
log, DualFS is better in that case because it can also use the log for 
reads. Other journaling file systems cannot do that.



If you stripe two disks with a standard fs versus use one of them
as metadata volume and the other as data volume with dualfs, I would
expect the striped variant to usually be faster, because it will give
parallelism not only to data versus metadata, but also to all data
versus other data.

If you have a RAID system, both the data and meta-data devices of DualFS 
can be striped, and you get the same result. No problem for DualFS :)



Also I would expect your design to be slow for metadata read intensive
workloads. E.g. have you tried to boot a root partition with dual fs?
That's a very important IO benchmark for desktop Linux systems.

I do not think so. The performance of DualFS is superb in meta-data read 
intensive workloads. And it is also better than the performance of other 
file systems when reading a directory tree with several copies of the Linux 
kernel source code (I showed those results on Tuesday at the LSF07 
workshop).



-Andi



Regards,

Juan.
--
D. Juan Piernas Cánovas
Departamento de Ingeniería y Tecnología de Computadores
Facultad de Informática. Universidad de Murcia
Campus de Espinardo - 30080 Murcia (SPAIN)
Tel.: +34968367657    Fax: +34968364151
email: [EMAIL PROTECTED]
PGP public key:
http://pgp.rediris.es:11371/pks/lookup?search=piernas%40ditec.um.es&op=index

*** Por favor, envíeme sus documentos en formato texto, HTML, PDF o PostScript 
:-) ***

Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation

2007-02-15 Thread Juan Piernas Canovas

Hi all,

On Wed, 14 Feb 2007, Jan Engelhardt wrote:



On Feb 14 2007 16:10, sfaibish wrote:


1. DualFS has only one copy of every meta-data block. This copy is
in the meta-data device,


Where does this differ from typical filesystems like xfs?
At least ext3 and xfs have an option to store the log/journal
on another device too.


No, it is not the same. DualFS uses two 'logical' devices, one for data 
and one for meta-data, but these devices are usually partitions on the 
same disk; they are not two different disks. And DualFS uses the meta-data 
device to both read and write meta-data blocks, whereas the other 
journaling file systems only use the log to write meta-data.


It's true that XFS, Ext3 and other journaling file systems can use a 
separate disk for the log, but, even then, they have to write two copies 
of every meta-data element. However, in this case, DualFS is even better.


If the data and meta-data devices of DualFS are on different disks, 
DualFS is able to READ and WRITE data and meta-data blocks in PARALLEL. 
The other journaling file systems, however, can only write one of the two 
copies of every meta-data block in parallel with other file system 
operations; they cannot write the second copy, and read and write 
data and meta-data blocks, in parallel.
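
A small worked example may make the difference clearer (the figure of 
1,000 updated meta-data blocks is arbitrary, chosen only for illustration):

   journaling file system with an external journal:
      ~1,000 writes to the journal  +  ~1,000 in-place writes later,
      and meta-data reads still go to the in-place copies
   DualFS:
      ~1,000 writes to the meta-data device, and that single copy is
      also the one used for meta-data reads

So with data and meta-data on separate spindles, both kinds of traffic can 
proceed at the same time, and no second meta-data copy ever has to be 
written.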





The DualFS code, tools and performance papers are available at:

http://sourceforge.net/projects/dualfs

The code requires kernel patches to 2.4.19 (oldies but goodies) and
a separate fsck code.  The latest kernel we used it for is 2.6.11
and we hope with your help to port it to the latest Linux kernel.


Where is the patch for 2.6.11? Sorry, 2.4.19 is just too old (even if
only considering the 2.4 branch).
[ And perhaps "goldies" flew better with "oldies" ;-) ]



The patch for 2.6.11 is still not stable enough to be released. Be patient 
;-)



Juan.


Jan



--
D. Juan Piernas Cánovas
Departamento de Ingeniería y Tecnología de Computadores
Facultad de Informática. Universidad de Murcia
Campus de Espinardo - 30080 Murcia (SPAIN)
Tel.: +34968367657    Fax: +34968364151
email: [EMAIL PROTECTED]
PGP public key:
http://pgp.rediris.es:11371/pks/lookup?search=piernas%40ditec.um.es&op=index

*** Por favor, envíeme sus documentos en formato texto, HTML, PDF o PostScript 
:-) ***

[SOLVED]Re: 2.2.19 && ppa: total lockup. No problem with 2.2.17

2001-03-30 Thread Juan Piernas Canovas

On Fri, 30 Mar 2001, Tim Waugh wrote:

> On Fri, Mar 30, 2001 at 03:55:01PM +0200, Juan Piernas Canovas wrote:
> 
> > The kernel configuration is the same in both 2.2.17 and 2.2.19.
> > Perhaps, the problem is not in the ppa module, but in the parport,
> > parport_pc or parport_probe modules.
> 
> There weren't any parport changes in 2.2.18->2.2.19, and the ones in
> 2.2.17->2.2.18 won't affect you unless you are using a PCI card.
> 
> Could you please try this patch and let me know if the behaviour
> changes?
> 
> Thanks,
> Tim.
> */
> 
> --- linux/drivers/scsi/ppa.h.eh   Fri Mar 30 15:27:43 2001
> +++ linux/drivers/scsi/ppa.h  Fri Mar 30 15:27:52 2001
> @@ -178,7 +178,6 @@
>   eh_device_reset_handler:NULL,   \
>   eh_bus_reset_handler:   ppa_reset,  \
>   eh_host_reset_handler:  ppa_reset,  \
> - use_new_eh_code:1,  \
>   bios_param: ppa_biosparam,  \
>   this_id:-1, \
>   sg_tablesize:   SG_ALL, \
> 

Yes!!! It works. I am happy now :-)

Thank you very much, Tim.

Juan.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



2.2.19 && ppa: total lockup. No problem with 2.2.17

2001-03-30 Thread Juan Piernas Canovas

On Fri, 30 Mar 2001, Juan Piernas Canovas wrote:

Hi!

When I execute "modprobe ppa" while running kernel 2.2.19, my computer
hangs completely. No messages. The system request (SysRq) key does not work.

The kernel configuration is the same in both 2.2.17 and 2.2.19.
Perhaps the problem is not in the ppa module, but in the parport,
parport_pc or parport_probe modules.
 
Bye!
 
Juan.
 
 
 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Information about kernel threads

2001-02-23 Thread Juan Piernas Canovas

Hi all!

I need some information about kernel threads, specifically about signal
handling. Where can I get any documents about that?
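
(In case a concrete example helps: the sketch below shows the usual shape 
of a signal-aware kernel thread, but it is written against the much later 
2.6+ kthread API rather than the interfaces available for 2.2/2.4 kernels, 
and the thread name, poll interval and function name are purely 
illustrative.)

#include <linux/kthread.h>
#include <linux/sched/signal.h>  /* signal_pending(); <linux/sched.h> on older trees */
#include <linux/delay.h>

static int example_thread_fn(void *data)
{
	/* Kernel threads ignore all signals by default; opt in explicitly. */
	allow_signal(SIGKILL);

	while (!kthread_should_stop()) {
		if (signal_pending(current)) {
			/* A SIGKILL was delivered to this thread: clean up and exit. */
			break;
		}
		msleep_interruptible(1000);  /* illustrative 1-second poll */
	}
	return 0;
}

/*
 * Started from module init code with:
 *     struct task_struct *t = kthread_run(example_thread_fn, NULL, "example");
 * and, if it has not already exited on a signal, stopped with kthread_stop(t).
 */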

Thanks in advance.

Juan.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


