Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion
On 1-Aug-06, at 4:15 AM, Jeffrey V. Merkey wrote: ...I was and have remained loyal to Linux through it all. Except for that little fling with SCO, eh? Off topic, but no more so than your self-aggrandising. --T
Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion
A filesystem with a fixed number of inodes (= not readjustable while mounted) is, ehr... somewhat unusable for a lot of people with big and *flexible* storage needs (talking about NetApp/EMC owners). Which is untrue at least for Solaris, which allows resizing a live file system. FreeBSD and Linux require an unmount. Only for shrinking. Jan Engelhardt --
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
On Mon, Jul 31, 2006 at 09:41:02PM -0700, David Lang wrote: just because you have redundancy doesn't mean that your data is idle enough for you to run a repacker with your spare cycles. to run a repacker you need a time when the chunk of the filesystem that you are repacking is not being accessed or written to. it doesn't matter if that data lives on one disk or 9 disks all mirroring the same data, you can't just break off 1 of the copies and repack that because by the time you finish it won't match the live drives anymore. database servers have a repacker (vacuum), and they are under tremendous pressure from their users to avoid having to use it because of the performance hit that it generates. (the theory in the past was exactly what was presented in this thread: make things run faster most of the time and accept the performance hit when you repack). the trend seems to be towards a repacker thread that runs continuously, causing a small impact all the time (that can be calculated into the capacity planning) instead of a large impact once in a while. Ah, but as soon as the repacker thread runs continuously, you lose all or most of the claimed advantage of wandering logs. Specifically, the claim of the wandering log is that you don't have to write your data twice --- once to the log, and once to the final location on disk (whereas with ext3 you end up having to do double writes). But if the repacker is running continuously, you end up doing double writes anyway, as the repacker moves things from a location that is convenient for the log to a location which is efficient for reading. Worse yet, if the repacker is moving disk blocks or objects which are no longer in cache, it may end up having to read objects in before writing them to a final location on disk. So instead of a write-write overhead, you end up with a write-read-write overhead. But of course, people tend to disable the repacker when doing benchmarks because they're trying to play the my-filesystem/database-has-bigger-performance-numbers-than-yours game. - Ted
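For a rough feel of Ted's accounting argument, a toy C tally follows; the block count and the fraction the repacker moves are invented for illustration, and the sketch ignores the seek costs that matter most in practice.

/* Toy accounting of the argument above. The workload size and the
 * fraction of blocks the repacker later moves are assumptions. */
#include <stdio.h>

int main(void)
{
    long blocks = 1000;        /* blocks dirtied by the workload (invented) */
    double repacked = 0.5;     /* fraction later moved by the repacker (invented) */

    long journal_fs    = 2 * blocks;           /* journal write + in-place write */
    long wander_only   = blocks;               /* data written once; the log "wanders" */
    long wander_repack = blocks                /* initial write ... */
        + (long)(repacked * blocks) * 2;       /* ... plus read-back and rewrite */

    printf("journaling fs:            %ld block I/Os\n", journal_fs);
    printf("wandering log, no repack: %ld block I/Os\n", wander_only);
    printf("wandering log + repacker: %ld block I/Os\n", wander_repack);
    return 0;
}

With half the blocks later moved from uncached locations, the wandering log's total block I/O matches the journaling filesystem's double writes, which is the crux of the objection.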
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
Theodore Tso wrote: Ah, but as soon as the repacker thread runs continuously, then you lose all or most of the claimed advantage of wandering logs. Specifically, the claim of the wandering log is that you don't have to write your data twice --- once to the log, and once to the final location on disk (whereas with ext3 you end up having to do double writes). But if the repacker is running continuously, you end up doing double writes anyway, as the repacker moves things from a location that is convenient for the log, to a location which is efficient for reading. Worse yet, if the repacker is moving disk blocks or objects which are no longer in cache, it may end up having to read objects in before writing them to a final location on disk. So instead of a write-write overhead, you end up with a write-read-write overhead. There's no reason to repack *all* of the data. Many workloads write and delete whole files, so file data should be contiguous. The repacker would only need to move metadata and small files. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
So ZFS isn't state-of-the-art? Of course it's state-of-the-art (on Solaris ;-) ) WAFL is for high-turnover filesystems on RAID-5 (and assumes flash memory staging areas). s/RAID-5/RAID-4/ Not your run-of-the-mill desktop... The WAFL-Thing was just a joke ;-) Regards, Adrian
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
suspect, particularly with 7200/min (s)ATA crap. Quoting myself (again): A quick'n'dirty ZFS-vs-UFS-vs-Reiser3-vs-Reiser4-vs-Ext3 'benchmark' Yeah, the test ran on a single SATA hard disk (quick'n'dirty). I'm so sorry but I don't have access to a $$$ RAID system at home. Anyway: the test shows us that Reiser4 performed very well on my (common '08/15', i.e. run-of-the-mill) hardware. sdparm --clear=WCE /dev/sda # please. How about using /dev/emcpower* for the next benchmark? I might be able to re-run it in a few weeks if people are interested and if I receive constructive suggestions (= Postmark parameters, mkfs options, etc.). Regards, Adrian
Re: metadata plugins (was Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion)
On Mon, Jul 31, 2006 at 10:57:35AM -0500, David Masover wrote: Wil Reichert wrote: Any idea how the fragmentation resulting from re-syncing the tree affects performance over time? Yes, it does affect it a lot. I have no idea how much, and I've never benchmarked it, but purely subjectively, my portage has gotten slower over time. Delayed allocation still performs a lot better here than the v3 immediate allocation. In addition, tree balancing operations are performed on flush as well, so what you get on disk is basically an almost-optimal tree. Of course, this will change a bit over time, but with v4 it takes a lot longer for that to happen than with v3, afaict. There _has_ been some worthwhile development in the meantime : ) Kind regards, Chris
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
Adrian Ulrich wrote on 2006-08-01: suspect, particularly with 7200/min (s)ATA crap. Quoting myself (again): A quick'n'dirty ZFS-vs-UFS-vs-Reiser3-vs-Reiser4-vs-Ext3 'benchmark' Yeah, the test ran on a single SATA hard disk (quick'n'dirty). I'm so sorry but I don't have access to a $$$ RAID system at home. I'm not asking you to perform testing on a RAID system with SCSI or SAS, but I consider the obtained data (I am focusing on transactions per unit of time) highly suspicious, and suspect write caches might have contributed their share - I haven't seen a drive that shipped with its write cache disabled in the past years. sdparm --clear=WCE /dev/sda # please. How about using /dev/emcpower* for the next benchmark? No, it is valid to run the test on commodity hardware, but if you (or rather the benchmark) are claiming transactions, I tend to think ACID, and I highly doubt any 200 GB SATA drive manages 3000 synchronous writes per second without causing either serious fragmentation or background block moving. This is a figure I'd expect for synchronous random access to RAM disks that have no seek and rotational latencies (and research on hybrid disks with flash or other nonvolatile fast random-access media to cache actual rotating magnetic platter access is going on elsewhere). I didn't mean to say your particular drive was crap, but 200GB SATA drives are low end, like it or not -- still, I have one in my home computer because these Samsung SP2004C are so nicely quiet. I might be able to re-run it in a few weeks if people are interested and if I receive constructive suggestions (= Postmark parameters, mkfs options, etc.). I don't know Postmark, but I did suggest turning the write cache off. If your system uses hdparm -W0 /dev/sda instead, go ahead. But you're right to collect and evaluate suggestions first if you don't want to run a new benchmark every day :) -- Matthias Andree
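Matthias's suspicion is easy to sanity-check from userspace: time a loop of fsync()ed appends and compare the rate against what the spindle could plausibly sustain. A hedged sketch, not from the thread; note too that 2006-era fsync() did not reliably flush the drive's own volatile cache, which is exactly his complaint.

/* Rough sanity check: how many fsync()ed 4 KiB appends does the
 * stack really complete per second? Numbers far above the rotation
 * rate suggest a volatile write cache is absorbing the writes. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
    enum { ITERS = 200 };
    char buf[4096] = { 0 };
    int fd = open("synctest.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < ITERS; i++) {
        if (write(fd, buf, sizeof buf) != sizeof buf || fsync(fd) != 0) {
            perror("write/fsync");
            return 1;
        }
    }
    gettimeofday(&t1, NULL);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%.0f synchronous 4k writes/s\n", ITERS / secs);
    close(fd);
    unlink("synctest.dat");
    return 0;
}

A result far above the rotation rate (about 120/s at 7200 rpm) means something, somewhere, is caching the "synchronous" writes.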
Re: reiser4: maybe just fix bugs?
Andrew Morton wrote: On Mon, 31 Jul 2006 10:26:55 +0100 Denis Vlasenko [EMAIL PROTECTED] wrote: The reiser4 thread seems to be longer than usual. Meanwhile here's poor old me trying to find another four hours to finish reviewing the thing. Thanks Andrew. The writeout code is ugly, although that's largely due to a mismatch between what reiser4 wants to do and what the VFS/MM expects it to do. I agree --- both with it being ugly, and that being part of why. If it works, we can live with it, although perhaps the VFS could be made smarter. I would be curious regarding any ideas on that. Next time I read through that code, I will keep in mind that you are open to making VFS changes if it improves things, and I will try to get clever somehow and send it by you. Our squalloc code though is, I must say, the most complicated and ugliest piece of code I ever worked on for which every cumulative ugliness had a substantive performance advantage requiring us to keep it. If you spare yourself from reading that, it is understandable. I'd say that reiser4's major problem is the lack of xattrs, acls and direct-io. That's likely to significantly limit its vendor uptake. (As might the copyright assignment thing, but is that a kernel.org concern?) Thanks to you and the batch write code, direct io support will now be much easier to code, and it probably will get coded the soonest of those features. acls are on the todo list, but doing them right might require solving a few additional issues (finishing the inheritance code, etc.) The plugins appear to be wildly misnamed - they're just an internal abstraction layer which permits later feature additions to be added in a clean and safe manner. Certainly not worth all this fuss. Could I suggest that further technical critiques of reiser4 include a file-and-line reference? That should ease the load on vger. Thanks.
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
On Tue, 01 Aug 2006, Avi Kivity wrote: There's no reason to repack *all* of the data. Many workloads write and delete whole files, so file data should be contiguous. The repacker would only need to move metadata and small files. Move small files? What for? Even if it is only moving metadata, it is no different from what ext3 or xfs are doing today (rewriting metadata from the intent log or block journal to the final location). The UFS+softupdates from the BSD world looks pretty good at avoiding unnecessary writes (at the expense of a long-running but nice background fsck after a crash, which is however easy on the I/O as of recent FreeBSD versions). Which was their main point against logging/journaling BTW, but they are porting XFS as well to save those that need instant complete recovery. -- Matthias Andree
Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion
Matthias Andree wrote: Have you ever seen VxFS or WAFL in action? No I haven't. As long as they are commercial, it's not likely that I will. WAFL was well done. It has several innovations that I admire, including quota trees, non-support of fragments for performance reasons, and the basic WAFL notion applied to the special (though important) case of NFS over RAID.
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
Matthias Andree wrote: On Tue, 01 Aug 2006, Avi Kivity wrote: There's no reason to repack *all* of the data. Many workloads write and delete whole files, so file data should be contiguous. The repacker would only need to move metadata and small files. Move small files? What for? WAFL-style filesystems like contiguous space, so if small files are scattered in otherwise free space, the repacker should move them to free up that space. Even if it is only moving metadata, it is no different from what ext3 or xfs are doing today (rewriting metadata from the intent log or block journal to the final location). There is no need to repack all metadata; only that which helps in creating free space. For example: if you untar a source tree you'd get mixed metadata and small file data packed together, but there's no need to repack that data. -- error compiling committee.c: too many arguments to function
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
Theodore Tso wrote: On Mon, Jul 31, 2006 at 09:41:02PM -0700, David Lang wrote: just because you have redundancy doesn't mean that your data is idle enough for you to run a repacker with your spare cycles. to run a repacker you need a time when the chunk of the filesystem that you are repacking is not being accessed or written to. it doesn't matter if that data lives on one disk or 9 disks all mirroring the same data, you can't just break off 1 of the copies and repack that because by the time you finish it won't match the live drives anymore. database servers have a repacker (vacuum), and they are under tremendous pressure from their users to avoid having to use it because of the performance hit that it generates. (the theory in the past was exactly what was presented in this thread: make things run faster most of the time and accept the performance hit when you repack). the trend seems to be towards a repacker thread that runs continuously, causing a small impact all the time (that can be calculated into the capacity planning) instead of a large impact once in a while. Ah, but as soon as the repacker thread runs continuously, then you lose all or most of the claimed advantage of wandering logs. Wandering logs is a term specific to reiser4, and I think you are making a more general remark. You are missing the implications of the oft-cited statistic that 80% of files never or rarely move. You are also missing the implications of the repacker being able to do larger I/Os than occur for a random tiny-I/O workload hitting a filesystem that is performing allocations on the fly. Specifically, the claim of the wandering log is that you don't have to write your data twice --- once to the log, and once to the final location on disk (whereas with ext3 you end up having to do double writes). But if the repacker is running continuously, you end up doing double writes anyway, as the repacker moves things from a location that is convenient for the log to a location which is efficient for reading. Worse yet, if the repacker is moving disk blocks or objects which are no longer in cache, it may end up having to read objects in before writing them to a final location on disk. So instead of a write-write overhead, you end up with a write-read-write overhead. But of course, people tend to disable the repacker when doing benchmarks because they're trying to play the my-filesystem/database-has-bigger-performance-numbers-than-yours game. When the repacker is done, we will, just for you, run one of our benchmarks the morning after the repacker has run (and reference this email) ;-) that was what you wanted us to do to address your concern, yes? ;-) - Ted
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
Matthias Andree wrote: No, it is valid to run the test on commodity hardware, but if you (or rather the benchmark) are claiming transactions, I tend to think ACID, and I highly doubt any 200 GB SATA drive manages 3000 synchronous writes per second without causing either serious fragmentation or background block moving. You are assuming 1 transaction = 1 sync write. That's not true. Databases and log filesystems can get much more out of a disk write. -- error compiling committee.c: too many arguments to function
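Avi's point as a hedged back-of-envelope: a 7200 rpm spindle completes roughly one independent commit per revolution, but group commit amortizes many transactions over each log write. Both numbers below are assumptions, not measurements from Adrian's benchmark.

/* Illustrative arithmetic only. */
#include <stdio.h>

int main(void)
{
    double rpm = 7200.0;
    double commits_per_sec = rpm / 60.0;   /* ~120 rotationally limited sync writes/s */
    int tx_per_commit = 25;                /* assumed group-commit batch size */

    printf("%.0f transactions/s\n", commits_per_sec * tx_per_commit);
    return 0;                              /* prints 3000 */
}

An assumed batch of 25 reproduces the disputed 3000/s figure, which is one way a commodity drive could honestly report thousands of transactions per second.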
Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion
On Mon, Jul 31, 2006 at 05:59:58PM +0200, Adrian Ulrich wrote: Hello Matthias, This looks rather like an education issue rather than a technical limit. We aren't talking about the same issue: I was asking to do it on-the-fly. Unmounting the filesystem, running e2fsck and resize2fs is something different ;-) Which is untrue at least for Solaris, which allows resizing a live file system. FreeBSD and Linux require an unmount. Correct: You can add more inodes to a Solaris UFS on-the-fly if you are lucky enough to have some free space available. A colleague of mine happened to create a ~300 GB filesystem and started to migrate mailboxes (Maildir-style format = many small files (1-3 KB)) to the new LUN. At about 70% the filesystem ran out of inodes; not a big deal with VxFS, because such a problem is fixable within seconds. What would have happened if he had used UFS? mkfs -G wouldn't work because he had no additional disk space left... *ouch*.. This case is solvable by planning. When you know that the new fs must be created with all inodes from the start, simply count how many you need before migration. (And add a decent safety margin.) That's what I do with my home machine, as disks wear out every third year or so. The tools for ext2/3 tell how many inodes are in use, and the new fs can be made accordingly. The approach works for bigger machines too, of course. Helge Hafting
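The counting step Helge describes can be scripted: df -i prints the totals, and the same numbers are available programmatically via statvfs(3). A small sketch; the 25% margin is an arbitrary example.

/* Query inode totals/free before planning a migration. */
#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "/";
    struct statvfs sv;
    if (statvfs(path, &sv) != 0) {
        perror("statvfs");
        return 1;
    }
    unsigned long total = (unsigned long)sv.f_files;
    unsigned long used  = (unsigned long)(sv.f_files - sv.f_ffree);
    printf("%s: %lu inodes total, %lu in use\n", path, total, used);
    /* Size the new fs at the in-use count plus a safety margin (25% here). */
    printf("size the new fs for at least %lu inodes\n", used + used / 4);
    return 0;
}

mke2fs also accepts an explicit inode count via -N when the new filesystem is created.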
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
I didn't mean to say your particular drive was crap, but 200GB SATA drives are low end, like it or not -- And you think an 18 GB SCSI disk just does it better because it's SCSI? Esp. in long sequential reads. Jan Engelhardt --
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
Wandering logs is a term specific to reiser4, and I think you are making a more general remark. So, what is UDF's wandering log then? Jan Engelhardt --
Re: reiser4 can now bear with filled fs, looks stable to me...
Hello David, Monday, July 31, 2006, 11:46:34 PM, you wrote: You must be new here... ;-) I wanted to point out that because: Options B and C are all that ever seems to happen when reiserfs-list and lkml collide. and: The speed of a nonworking program is irrelevant. The cost-effectiveness of an impossible solution is irrelevant. maybe the more important thing is to allow people to use r4 on their own (rpms, debs, apt/gentoo repositories, etc.) rather than to push that hard for kernel inclusion. Currently the r4 patch is very easy to apply; you can apply it on top of heavily patched kernels with little or no fuzz, which is very good. But, as Hans wrote earlier, not every user knows how to patch; in Ubuntu, for example, it is fairly easy (and encouraged by the official forums/wikis) for those users to add additional repositories using synaptic or adept or by editing /etc/apt/sources.list. I mean, there were huge objections against FUSE too, remember? But Miklos built a steady and growing userbase. Maybe that is something to realize, Hans: we don't need kernel inclusion to have a growing userbase. (or at least a steady one) A side note: the only time reiserfs-list and lkml did not collide (that much) was when Andrew Morton was commenting and when Christoph made a list of things to fix - that cost people some nerve, but it nevertheless was more productive than the usual flamewars. -- Best regards, Maciej
Re: reiser4: maybe just fix bugs?
Hello On Mon, 2006-07-31 at 20:18 -0600, Hans Reiser wrote: Andrew Morton wrote: The writeout code is ugly, although that's largely due to a mismatch between what reiser4 wants to do and what the VFS/MM expects it to do. Yes. reiser4 writes out atoms. Most pages get into atoms via sys_write. But pages dirtied via shared mappings do not. They get into atoms in reiser4's writepages address space operation. That is why reiser4_sync_inodes has two steps: in the first it calls generic_sync_sb_inodes, which calls writepages for dirty inodes to capture pages dirtied via shared mappings into atoms; the second step flushes atoms. I agree --- both with it being ugly, and that being part of why. If it works, we can live with it, although perhaps the VFS could be made smarter. I would be curious regarding any ideas on that. Next time I read through that code, I will keep in mind that you are open to making VFS changes if it improves things, and I will try to get clever somehow and send it by you. Our squalloc code though is, I must say, the most complicated and ugliest piece of code I ever worked on for which every cumulative ugliness had a substantive performance advantage requiring us to keep it. If you spare yourself from reading that, it is understandable. I'd say that reiser4's major problem is the lack of xattrs, acls and direct-io. That's likely to significantly limit its vendor uptake. xattrs is really a problem.
Re: metadata plugins (was Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion)
On Mon, Jul 31, 2006 at 06:05:01PM +0200, Łukasz Mierzwa wrote: I guess that extents are much harder to reuse than normal inodes, so when you have something as big as the portage tree filled with tiny files which are being modified all the time, you just can't keep performance up all the time. You can always tar, rm -fr /usr/portage, untar, and you will probably speed things up a lot. I submitted a script to this list which takes care of everything required to recreate your fs. It even converts between different filesystems, for migration purposes or comparative tests, and currently supports ext2|3, reiser3|4 and xfs. The thing is undergoing some surgery at the moment to reduce forced disk flushes. I already replaced the call to sync() after every operation with one fsync() call on the archive file before the formatting takes place. What is still missing is functionality to retrieve things like the fs label and UUID from the existing fs and reuse them during mkfs. Testing is also pending, so you might not want to hold your breath waiting for the funky version, the idea of which is to leave everything as it was found, except for better disk layout and possibly a changed fs type. It is a completely different approach from convertfs, which tries to do the conversion in-place by moving the fs's contents into a new fs created within a sparse file on the same device and relocating the sparse file's blocks afterwards. My guess is that a failure of any kind in the latter process will destroy your data (this was the case last time I checked), while I at least try to do everything to ensure that the tarball is written to the platters before mkfs occurs. The new version will be posted to wiki.namesys.com asap, no timeframe attached though, as Thursday yields an exam, so maybe on Friday, but who knows. The version already posted to the list works well; I used it at least a hundred times, even on stuff like /home and /usr (the latter works only from a live CD or custom initramfs). Kind regards, Chris
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
Jan Engelhardt wrote on 2006-08-01: I didn't mean to say your particular drive was crap, but 200GB SATA drives are low end, like it or not -- And you think an 18 GB SCSI disk just does it better because it's SCSI? 18 GB SCSI disks are 1999 gear, so who cares? Seagate didn't sell 200 GB SATA drives at that time. Esp. in long sequential reads. You think SCSI drives aren't on par? Right, they're ahead. 98 MB/s for the fastest SCSI drives vs. 88 MB/s for the Raptor 150 GB SATA and 74 MB/s for the fastest other ATA drives. (Figures obtained from StorageReview.com's Performance Database.) -- Matthias Andree
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
I didn't mean to say your particular drive was crap, but 200GB SATA drives are low end, like it or not -- And you think an 18 GB SCSI disk just does it better because it's SCSI? 18 GB SCSI disks are 1999 gear, so who cares? Seagate didn't sell 200 GB SATA drives at that time. Esp. in long sequential reads. You think SCSI drives aren't on par? Right, they're ahead. 98 MB/s for the fastest SCSI drives vs. 88 MB/s for the Raptor 150 GB SATA and 74 MB/s for the fastest other ATA drives. Uhuh. And how do they measure that? Did they actually run something like... dd_rescue /dev/hda /dev/null Jan Engelhardt --
Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion
Bernd Schubert [EMAIL PROTECTED] wrote: On Monday 31 July 2006 21:29, Jan-Benedict Glaw wrote: The point is that it's quite hard to really fuck up ext{2,3} with only some KB being written, while it seems (due to the fragile^Wsophisticated on-disk data structures) that it's just easy to kill a reiser3 filesystem. Well, I was once very 'lucky' and after a system crash (*) e2fsck put all files into lost+found. Sure, I never experienced this again, but I also never experienced something like this with reiserfs. So please, stop this kind of FUD against reiser3.6. It isn't FUD. One data point doesn't allow you to draw conclusions. Yes, I've seen/heard of ext2/ext3 failures and data loss too. But at least the same number for ReiserFS. And I know it is outnumbered 10 to 1 or so in my sample, so that would indicate a 10-fold higher probability of catastrophic data loss, other factors mostly the same. While filesystem speed is nice, it also would be great if reiser4.x would be very robust against any kind of hardware failures. Can't have both. -- Dr. Horst H. von Brand User #22616 counter.li.org Departamento de Informatica Fono: +56 32 654431 Universidad Tecnica Federico Santa Maria +56 32 654239 Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513
Re: reiser4: maybe just fix bugs?
On Tue, 01 Aug 2006 15:24:37 +0400 Vladimir V. Saveliev [EMAIL PROTECTED] wrote: The writeout code is ugly, although that's largely due to a mismatch between what reiser4 wants to do and what the VFS/MM expects it to do. Yes. reiser4 writes out atoms. Most pages get into atoms via sys_write. But pages dirtied via shared mappings do not. They get into atoms in reiser4's writepages address space operation. I think you mean ->writepage - reiser4 doesn't implement ->writepages(). I assume you considered hooking into ->set_page_dirty() to do the add-to-atom thing earlier on? We'll merge mm-tracking-shared-dirty-pages.patch into 2.6.19-rc1, which would make that approach considerably more successful, I expect. ->set_page_dirty() is a bit awkward because it can be called under spinlock. Maybe something could also be gained from the new vm_operations_struct.page_mkwrite(), although that's less obvious... That is why reiser4_sync_inodes has two steps: in the first it calls generic_sync_sb_inodes, which calls writepages for dirty inodes to capture pages dirtied via shared mappings into atoms; the second step flushes atoms. I agree --- both with it being ugly, and that being part of why. If it works, we can live with it, although perhaps the VFS could be made smarter. I would be curious regarding any ideas on that. Next time I read through that code, I will keep in mind that you are open to making VFS changes if it improves things, and I will try to get clever somehow and send it by you. Our squalloc code though is, I must say, the most complicated and ugliest piece of code I ever worked on for which every cumulative ugliness had a substantive performance advantage requiring us to keep it. If you spare yourself from reading that, it is understandable. I'd say that reiser4's major problem is the lack of xattrs, acls and direct-io. That's likely to significantly limit its vendor uptake. xattrs is really a problem. That's not good. The ability to properly support SELinux is likely to be important.
Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion
While filesystem speed is nice, it also would be great if reiser4.x would be very robust against any kind of hardware failures. Can't have both. ..and some people simply don't care about this: If you are running a 'big' Storage-System with battery protected WriteCache, Mirroring between 2 Datacenters, snapshotting.. etc.. you don't need your filesystem being super-robust against bad sectors and such stuff because: a) You've paid enough money to let the storage care about hardware issues. b) If your storage is on fire you can do a failover using the mirror. c) And if someone ran dd if=/dev/urandom of=/dev/sda you could even roll back your snapshot. (Btw: I did this once to a Reiser4 filesystem (overwrote about 1.2 GB). fsck.reiser4 --rebuild-sb was able to fix it.) ..but what you really need is a flexible and **fast** filesystem: like Reiser4. (Yeah.. yeah.. I know: ext3 is also flexible and fast.. but Reiser4 simply is *MUCH* faster than ext3 for 'my' workload/application). Regards, Adrian
Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion
On Tue, 2006-08-01 at 16:52 +0200, Adrian Ulrich wrote: WriteCache, Mirroring between 2 Datacenters, snapshotting.. etc.. you don't need your filesystem being super-robust against bad sectors and such stuff because: You do, it turns out. It's becoming an issue more and more that the sheer amount of storage means that the undetected error rate from disks, hosts, memory, cables and everything else is rising. There has been a great deal of discussion about this at the filesystem and kernel summits - and data is getting kicked the way of networking - end to end, not reliability in the middle. The sort of changes this needs hit the block layer and every fs.
Re: reiser4: maybe just fix bugs?
Hello On Tue, 2006-08-01 at 07:33 -0700, Andrew Morton wrote: On Tue, 01 Aug 2006 15:24:37 +0400 Vladimir V. Saveliev [EMAIL PROTECTED] wrote: The writeout code is ugly, although that's largely due to a mismatch between what reiser4 wants to do and what the VFS/MM expects it to do. Yes. reiser4 writes out atoms. Most pages get into atoms via sys_write. But pages dirtied via shared mappings do not. They get into atoms in reiser4's writepages address space operation. I think you mean ->writepage - reiser4 doesn't implement ->writepages(). No, there is one. It is reiser4/plugin/file/file.c:writepages_unix_file(). reiser4_writepage just kicks a kernel thread (entd) which works similarly to reiser4_sync_inodes() (described earlier) and waits until several pages (including the one reiser4_writepage is called with) are written. I assume you considered hooking into ->set_page_dirty() to do the add-to-atom thing earlier on? Currently, add-to-atom is not simple. It may require memory allocations and disk I/Os. I guess these are not supposed to be called in ->set_page_dirty(). That is why in reiser4_set_page_dirty we just mark the page in the mapping's tree and delay adding to atoms until flush time. We'll merge mm-tracking-shared-dirty-pages.patch into 2.6.19-rc1, which would make that approach considerably more successful, I expect. ->set_page_dirty() is a bit awkward because it can be called under spinlock. Maybe something could also be gained from the new vm_operations_struct.page_mkwrite(), although that's less obvious... That is why reiser4_sync_inodes has two steps: in the first it calls generic_sync_sb_inodes, which calls writepages for dirty inodes to capture pages dirtied via shared mappings into atoms; the second step flushes atoms. I agree --- both with it being ugly, and that being part of why. If it works, we can live with it, although perhaps the VFS could be made smarter. I would be curious regarding any ideas on that. Next time I read through that code, I will keep in mind that you are open to making VFS changes if it improves things, and I will try to get clever somehow and send it by you. Our squalloc code though is, I must say, the most complicated and ugliest piece of code I ever worked on for which every cumulative ugliness had a substantive performance advantage requiring us to keep it. If you spare yourself from reading that, it is understandable. I'd say that reiser4's major problem is the lack of xattrs, acls and direct-io. That's likely to significantly limit its vendor uptake. xattrs is really a problem. That's not good. The ability to properly support SELinux is likely to be important. Do you think that if reiser4 supported xattrs it would increase its chances of inclusion? PS: what exactly did you refer to when you said that the writeout code is ugly?
Re: metadata plugins (was Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion)
On Fri, 28 Jul 2006 18:33:56 +0200, Linus Torvalds [EMAIL PROTECTED] wrote: In other words, if a filesystem wants to do something fancy, it needs to do so WITH THE VFS LAYER, not as some plugin architecture of its own. We already have exactly the plugin interface we need, and it literally _is_ the VFS interfaces - you can plug in your own filesystems with register_filesystem(), which in turn indirectly allows you to plug in your per-file and per-directory operations for things like lookup etc. What fancy things (besides cryptocompress) does reiser4 do now? Can someone point me to a list of things that are required by kernel maintainers to merge reiser4 into vanilla? I feel like I'm getting lost with the current reiser4 status and the things that need to be done. Łukasz Mierzwa
Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion
Alan Cox wrote: On Tue, 2006-08-01 at 16:52 +0200, Adrian Ulrich wrote: WriteCache, Mirroring between 2 Datacenters, snapshotting.. etc.. you don't need your filesystem being super-robust against bad sectors and such stuff because: You do, it turns out. It's becoming an issue more and more that the sheer amount of storage means that the undetected error rate from disks, hosts, memory, cables and everything else is rising. Yikes. Undetected. Wait, what? Disks, at least, would be protected by RAID. Are you telling me RAID won't detect such an error? It just seems wholly alien to me that errors would go undetected, and we're OK with that, so long as our filesystems are robust enough. If it's an _undetected_ error, doesn't that cause way more problems (impossible problems) than FS corruption? Ok, your FS is fine -- but now your bank database shows $1k less on random accounts -- is that ok? There has been a great deal of discussion about this at the filesystem and kernel summits - and data is getting kicked the way of networking - end to end, not reliability in the middle. Sounds good, but I've never let discussions by people smarter than me prevent me from asking the stupid questions. The sort of changes this needs hit the block layer and every fs. Seems it would need to hit every application also...
Re: reiser4: maybe just fix bugs?
Vladimir V. Saveliev wrote: Do you think that if reiser4 supported xattrs - it would increase its chances on inclusion? Probably the opposite. If I understand it right, the original Reiser4 model of file metadata is the file-as-directory stuff that caused such a furor the last big push for inclusion (search for Silent semantic changes in Reiser4): foo.mp3/.../rwx# permissions foo.mp3/.../artist # part of the id3 tag So I suspect xattrs would just be a different interface to this stuff, maybe just a subset of it (to prevent namespace collisions): foo.mp3/.../xattr/ # contains files representing attributes Of course, you'd be able to use the standard interface for getting/setting these. The point is, I don't think Hans/Namesys wants to do this unless they're going to do it right, especially because they already have the file-as-dir stuff somewhat done. Note that these are neither mutually exclusive nor mutually dependent -- you don't have to enable file-as-dir to make xattrs work. I know it's not done yet, though. I can understand Hans dragging his feet here, because xattrs and traditional acls are examples of things Reiser4 is supposed to eventually replace. Anyway, if xattrs were done now, the only good that would come of it is building a userbase outside the vanilla kernel. I can't see it as doing anything but hurting inclusion by introducing more confusion about plugins. I could be entirely wrong, though. I speak for neither Hans/Namesys/reiserfs nor LKML. Talk amongst yourselves...
Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion
Horst H. von Brand wrote: Bernd Schubert [EMAIL PROTECTED] wrote: While filesystem speed is nice, it also would be great if reiser4.x would be very robust against any kind of hardware failures. Can't have both. Why not? I mean, other than TANSTAAFL, is there a technical reason for them being mutually exclusive? I suspect it's more we haven't found a way yet...
Re: metadata plugins (was Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion)
Hello On Tue, 2006-08-01 at 17:32 +0200, Łukasz Mierzwa wrote: On Fri, 28 Jul 2006 18:33:56 +0200, Linus Torvalds [EMAIL PROTECTED] wrote: In other words, if a filesystem wants to do something fancy, it needs to do so WITH THE VFS LAYER, not as some plugin architecture of its own. We already have exactly the plugin interface we need, and it literally _is_ the VFS interfaces - you can plug in your own filesystems with register_filesystem(), which in turn indirectly allows you to plug in your per-file and per-directory operations for things like lookup etc. What fancy things (besides cryptocompress) does reiser4 do now? It is supposed to provide the ability to easily modify filesystem behaviour in various aspects without breaking compatibility. Can someone point me to a list of things that are required by kernel maintainers to merge reiser4 into vanilla? List of features reiser4 does not have now: O_DIRECT support (we are working on it now), various block size support, quota support, xattrs and acls. List of warnings about reiser4 code: I think that the last big list of useful comments (from Christoph Hellwig [EMAIL PROTECTED]) has been addressed. Well, except for one minor (I believe) place in file release. Currently, Andrew is trying to find some time to review the reiser4 code. I feel like I'm getting lost with the current reiser4 status and the things that need to be done. Łukasz Mierzwa
Re: metadata plugins (was Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion)
Christian Trefzer wrote: On Mon, Jul 31, 2006 at 10:57:35AM -0500, David Masover wrote: Wil Reichert wrote: Any idea how the fragmentation resulting from re-syncing the tree affects performance over time? Yes, it does affect it a lot. I have no idea how much, and I've never benchmarked it, but purely subjectively, my portage has gotten slower over time. Delayed allocation still performs a lot better here than the v3 immediate allocation. In addition, tree balancing operations are performed on flush as well, so what you get on disk is basically an almost-optimal tree. Of course, this will change a bit over time, but with v4 it takes a lot longer for that to happen than with v3, afaict. There _has_ been some worthwhile development in the meantime : ) Hmm. The thing is, I don't remember v3 slowing down much at all, whereas v4 slowed down pretty dramatically after the first few weeks. It does seem pretty stable now, though, and it doesn't seem to be getting any slower. I've had this particular FS since... hmm... Is there an FS tool to check mkfs time? I think it's a year now, but I'd like to be sure. If not, I'll just find the oldest file, but the clock on this machine isn't reliable (have to set it with NTP every boot)...
Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion
On Tue, 2006-08-01 at 11:44 -0500, David Masover wrote: Yikes. Undetected. Wait, what? Disks, at least, would be protected by RAID. Are you telling me RAID won't detect such an error? Yes. RAID deals with the case where a device fails. RAID 1 with 2 disks can in theory detect an internal inconsistency but cannot fix it. we're OK with that, so long as our filesystems are robust enough. If it's an _undetected_ error, doesn't that cause way more problems (impossible problems) than FS corruption? Ok, your FS is fine -- but now your bank database shows $1k less on random accounts -- is that ok? Not really, no. Your bank is probably using a machine (hopefully using a machine) with ECC memory, ECC cache and the like. The UDMA and SATA storage subsystems use CRC checksums between the controller and the device. SCSI uses various similar systems - some older ones just use a parity bit, so they have only a 50/50 chance of noticing a bit error. Similarly, the media itself is recorded with a lot of FEC (forward error correction), so it will spot most changes. Unfortunately, when you throw this lot together with astronomical amounts of data, you get burned now and then, especially as most systems are not using ECC RAM, do not have ECC on the CPU registers and may not even have ECC on the caches in the disks. The sort of changes this needs hit the block layer and every fs. Seems it would need to hit every application also... Depending on how far you propagate it. Some people working with huge data sets already write and check user-level CRC values for this reason (in fact bitkeeper does it, for one example). It should be relatively cheap to get much of that benefit without doing application-to-application, just as TCP gets most of its benefit without going app to app. Alan
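As a generic illustration of the user-level, end-to-end checking Alan describes (a minimal sketch using zlib's crc32(), not bitkeeper's actual mechanism):

/* Compute a CRC32 over a file's contents; store it alongside the
 * data, recompute and compare on every read, and silent corruption
 * anywhere in the stack becomes detectable.
 * Build: cc crcfile.c -o crcfile -lz */
#include <stdio.h>
#include <zlib.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 2;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) {
        perror("fopen");
        return 2;
    }
    unsigned char buf[65536];
    uLong crc = crc32(0L, Z_NULL, 0);      /* canonical initial value */
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        crc = crc32(crc, buf, (uInt)n);
    fclose(f);
    printf("%08lx\n", crc);                /* store this next to the data */
    return 0;
}

The CRC only makes corruption detectable at read time; repair still has to come from redundancy elsewhere.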
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
Theodore Tso wrote: Ah, but as soon as the repacker thread runs continuously, then you lose all or most of the claimed advantage of wandering logs. [...] So instead of a write-write overhead, you end up with a write-read-write overhead. This would tend to suggest that the repacker should not run constantly, but also that while it's running, performance could be almost as good as ext3. But of course, people tend to disable the repacker when doing benchmarks because they're trying to play the my filesystem/database has bigger performance numbers than yours game So you run your own benchmarks, I'll run mine... Benchmarks for everyone! I'd especially like to see what performance is like with the repacker not running, and during the repack. If performance during a repack is comparable to ext3, I think we win, although we have to amend that statement to My filesystem/database has the same or bigger perfomance numbers than yours.
Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion
On 8/1/06, David Masover [EMAIL PROTECTED] wrote: Yikes. Undetected. Wait, what? Disks, at least, would be protected by RAID. Are you telling me RAID won't detect such an error? Unless the disk ECC catches it, RAID won't know anything is wrong. This is why ZFS offers block checksums... it can then try all the permutations of RAID regens to find a solution which gives the right checksum. Every level of the system must be paranoid and take measures to avoid corruption if the system is to avoid it... it's a tough problem. It seems that the ZFS folks have addressed this challenge by building much of what are classically separate layers into one part.
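What "trying the permutations" amounts to for a single-parity stripe, as a toy model with an invented checksum and invented data, not ZFS code:

/* One RAID-5-style stripe: data blocks plus XOR parity. If the stored
 * stripe checksum mismatches, assume each "disk" in turn is the liar,
 * regenerate its block from parity, and keep the candidate whose
 * checksum matches. */
#include <stdio.h>
#include <string.h>

#define NDATA 3
#define BLK   4           /* tiny blocks, illustration only */

/* Stand-in for a real block checksum (position-weighted byte sum). */
static unsigned char toy_sum(unsigned char d[NDATA][BLK])
{
    unsigned s = 0;
    for (int i = 0; i < NDATA; i++)
        for (int j = 0; j < BLK; j++)
            s += d[i][j] * (unsigned)(i * BLK + j + 1);
    return (unsigned char)s;
}

/* Rebuild block 'bad' from XOR parity and the remaining blocks. */
static void regen(unsigned char d[NDATA][BLK], const unsigned char p[BLK], int bad)
{
    for (int j = 0; j < BLK; j++) {
        unsigned char x = p[j];
        for (int i = 0; i < NDATA; i++)
            if (i != bad)
                x ^= d[i][j];
        d[bad][j] = x;
    }
}

int main(void)
{
    unsigned char d[NDATA][BLK] = { "abc", "def", "ghi" };
    unsigned char p[BLK];
    for (int j = 0; j < BLK; j++)
        p[j] = d[0][j] ^ d[1][j] ^ d[2][j];
    unsigned char want = toy_sum(d);   /* checksum stored at write time */

    d[1][0] ^= 0x40;                   /* silent corruption on "disk" 1 */

    for (int bad = 0; bad < NDATA; bad++) {   /* try each permutation */
        unsigned char trial[NDATA][BLK];
        memcpy(trial, d, sizeof trial);
        regen(trial, p, bad);
        if (toy_sum(trial) == want) {
            printf("disk %d held the bad block; recovered \"%s\"\n",
                   bad, (char *)trial[bad]);
            return 0;
        }
    }
    puts("no single-disk assumption matches the checksum");
    return 1;
}

ZFS's real checksums are stronger and are stored in the block pointers, but the search loop above is the essence of a self-healing read.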
Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion
Alan Cox wrote: On Tue, 2006-08-01 at 11:44 -0500, David Masover wrote: Yikes. Undetected. Wait, what? Disks, at least, would be protected by RAID. Are you telling me RAID won't detect such an error? Yes. RAID deals with the case where a device fails. RAID 1 with 2 disks can in theory detect an internal inconsistency but cannot fix it. Still, if it does that, that should be enough. The scary part wasn't that there's an internal inconsistency, but that you wouldn't know. And it can fix it if you can figure out which disk went. Or give it 3 disks and it should be entirely automatic -- admin gets paged, admin hotswaps in a new disk, done. we're OK with that, so long as our filesystems are robust enough. If it's an _undetected_ error, doesn't that cause way more problems (impossible problems) than FS corruption? Ok, your FS is fine -- but now your bank database shows $1k less on random accounts -- is that ok? Not really, no. Your bank is probably using a machine (hopefully using a machine) with ECC memory, ECC cache and the like. The UDMA and SATA storage subsystems use CRC checksums between the controller and the device. SCSI uses various similar systems - some older ones just use a parity bit, so they have only a 50/50 chance of noticing a bit error. Similarly, the media itself is recorded with a lot of FEC (forward error correction), so it will spot most changes. Unfortunately, when you throw this lot together with astronomical amounts of data, you get burned now and then, especially as most systems are not using ECC RAM, do not have ECC on the CPU registers and may not even have ECC on the caches in the disks. It seems like this is the place to fix it, not the software. If the software can fix it easily, great. But I'd much rather rely on the hardware looking after itself, because when hardware goes bad, all bets are off. Specifically, it seems like you do mention lots of hardware solutions that just aren't always used. It seems like storage itself is getting cheap enough that it's time to step back a year or two in Moore's Law to get the reliability. The sort of changes this needs hit the block layer and every fs. Seems it would need to hit every application also... Depending on how far you propagate it. Some people working with huge data sets already write and check user-level CRC values for this reason (in fact bitkeeper does it, for one example). It should be relatively cheap to get much of that benefit without doing application-to-application, just as TCP gets most of its benefit without going app to app. And yet, if you can do that, I'd suspect you can, should, must do it at a lower level than the FS. Again, FS robustness is good, but if the disk itself is going, what good is having your directory (mostly) intact if the files themselves have random corruptions? If you can't trust the disk, you need more than just an FS which can mostly survive hardware failure. You also need the FS itself (or maybe the block layer) to support bad block relocation and all that good stuff, or you need your apps designed to do that job by themselves. It just doesn't make sense to me to do this at the FS level. You mention TCP -- ok, but if TCP is doing its job, I shouldn't also need to implement checksums and other robustness at the protocol layer (http, ftp, ssh), should I? Because in this analogy, it looks like TCP is the block layer and a protocol is the fs. As I understand it, TCP only lets the protocol/application know when something's seriously FUBARed and it has to drop the connection.
Similarly, the FS (and the apps) shouldn't have to know about hardware problems until it really can't do anything about it anymore, at which point the right thing to do is for the FS and apps to go oh shit and drop what they're doing, and the admin replaces hardware and restores from backup. Or brings a backup server online, or... I guess my main point was that _undetected_ problems are serious, but if you can detect them, and you have at least a bit of redundancy, you should be good. For instance, if your RAID reports errors that it can't fix, you bring that server down and let the backup server run.
Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion
Gregory Maxwell wrote: On 8/1/06, David Masover [EMAIL PROTECTED] wrote: Yikes. Undetected. Wait, what? Disks, at least, would be protected by RAID. Are you telling me RAID won't detect such an error? Unless the disk ECC catches it raid won't know anything is wrong. This is why ZFS offers block checksums... it can then try all the permutations of raid regens to find a solution which gives the right checksum. Isn't there a way to do this at the block layer? Something in device-mapper? Every level of the system must be paranoid and take measure to avoid corruption if the system is to avoid it... it's a tough problem. It seems that the ZFS folks have addressed this challenge by building as much of what is classically separate layers into one part. Sounds like bad design to me, and I can point to the antipattern, but what do I know?
Ebuild/rpm/deb repo's (was Re: reiser4 can now bear with filled fs, looks stable to me...)
On Tue, 2006-08-01 at 13:28 +0200, Maciej Sołtysiak wrote: Hello David, Monday, July 31, 2006, 11:46:34 PM, you wrote: You must be new here... ;-) I wanted to point out that because: Options B and C are all that ever seems to happen when reiserfs-list and lkml collide. and: The speed of a nonworking program is irrelevant. The cost-effectiveness of an impossible solution is irrelevant. maybe the more important thing is to allow people to use r4 on their own (rpms, debs, apt/gentoo repositories, etc.) rather than to push that hard for kernel inclusion. Yes, and in the case of gentoo there are already people on the wiki maintaining an ebuild which pulls in r4. http://gentoo-wiki.com/HOWTO_Reiser4_With_Gentoo-Sources When you make it easy for people to use reiser4 by providing ebuilds, rpms or debs, more users will be tempted to try out reiser4 who would normally not be able or willing to patch the kernel. Maintaining an ebuild, for example, is easy. And adding another patch to a kernel deb/rpm should also not be too difficult. It will take some time each month, but sacrificing a few hours to update these would, to me, be worth it. Maybe the reiser community can help out the namesys devs? Greets Sander
Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion
You do, it turns out. It's becoming an issue more and more that the sheer amount of storage means that the undetected error rate from disks, hosts, memory, cables and everything else is rising. IMHO the possibility of hitting such a random, so-far-undetected corruption is very low with one of the big/expensive RAID systems, as they do fancy stuff like 'disk scrubbing' and usually fail disks at very early stages.. * I've seen storage systems from a BIG vendor die due to firmware bugs * I've seen FC cards die.. SAN switches rebooted.. people used my cables to do rope skipping * We had fire, a non-working UPS and faulty diesel generators.. but so far the FSes (and applications) on the storage never complained about corrupted data. ..YMMV.. Btw: I don't think that ReiserFS really behaves this badly with broken hardware. So far Reiser3 has survived 2 broken hard drives without problems, while I've seen ext2/3 die 4 times so far... (= everything inside /lost+found). Reiser4 survived # mkisofs . /dev/sda Lucky me.. maybe.. To get back on-topic: Some people try very hard to claim that the world doesn't need Reiser4 and that you can do everything with ext3. Ext3 may be fine for them, but some people (like me) really need Reiser4 because they have applications/workloads that won't work well (fast) on ext3. Why is it such a big thing to include a filesystem? Even if it's unstable: does anyone care? E.g. the HFS+ driver is buggy (it corrupted the FS of my OSX installation 3 times so far) but does this bugginess affect people *not* using it? No. Regards, Adrian
Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion
This is why ZFS offers block checksums... it can then try all the permutations of raid regens to find a solution which gives the right checksum. Isn't there a way to do this at the block layer? Something in device-mapper? Remember: Suns new Filesystem + Suns new Volume Manager = ZFS
Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion
Alan Cox wrote: On Tue, 2006-08-01 at 16:52 +0200, Adrian Ulrich wrote: WriteCache, Mirroring between 2 Datacenters, snapshotting.. etc.. you don't need your filesystem being super-robust against bad sectors and such stuff because: You do, it turns out. It's becoming an issue more and more that the sheer amount of storage means that the undetected error rate from disks, hosts, memory, cables and everything else is rising. I agree with Alan despite being an enthusiastic supporter of neat array-based technologies. Most people use absolutely giant disks in laptops and desktop systems (300 GB and 500 GB are common, 750 GB on the way). File systems need to be as robust as possible for users of these systems, as people are commonly storing personal critical data like photos mostly on these unprotected drives. Even for high-end users, array-based mirroring and so on can only do so much to protect you. Mirroring a corrupt file system to a remote data center will mirror your corruption. Rolling back to a snapshot typically only happens when you notice a corruption, which can go undetected for quite a while, so even that will benefit from having reliability baked into the file system (i.e., it should grumble about corruption to let you know that you need to roll back or fsck or whatever). An even larger issue is that our tools, like fsck, which are used to uncover these silent corruptions, need to scale up to the point that they can uncover issues in minutes instead of days. A lot of the focus at the file system workshop was around how to dramatically reduce the repair time of file systems. In a way, having super reliable storage hardware is only as good as the file system layer on top of it - reliability needs to be baked into the entire IO system stack... ric
Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion
Alan, I have seen only anecdotal evidence against reiserfsck, and I have seen formal tests from Vitaly (which it seems a user has replicated) where our fsck did better than ext3's. Note that these tests are of the latest fsck from us: I am sure everyone understands that it takes time for an fsck to mature, and that our early fscks were poor. I will also say that V4's fsck is more robust than V3's, because we made disk format changes specifically to help fsck. Now I am not dismissing your anecdotes, as I will never dismiss data I have not seen, and it sounds like you have seen more data than most people, but I must dismiss your explanation of them. Being able to throw away all of the tree but the leaves and twigs with extent pointers, and rebuild all of it, makes V4 very robust, more so than ext3. This business of inodes not moving, I don't see what the advantage is; we can lose the directory entry and rebuild just as well as ext3, probably better, because we can at least figure out what directory it was in. Vitaly can say all of this more expertly than I. Hans
Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion
Ric Wheeler wrote: Alan Cox wrote: You do, it turns out. It's becoming an issue more and more that the sheer amount of storage means that the undetected error rate from disks, hosts, memory, cables and everything else is rising. I agree with Alan. You will want to try our compression plugin; it has an ECC for every 64k. Hans
Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion
Gregory Maxwell wrote: This is why ZFS offers block checksums... it can then try all the permutations of raid regens to find a solution which gives the right checksum. ZFS performance is pretty bad in the only benchmark I have seen of it. Does anyone have serious benchmarks of it? I suspect that our compression plugin (with ecc) will outperform it.
Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion
Ric Wheeler wrote: Alan Cox wrote: On Tue, 2006-08-01 at 16:52 +0200, Adrian Ulrich wrote: WriteCache, Mirroring between 2 Datacenters, snapshotting.. etc.. you don't need your filesystem being super-robust against bad sectors and such stuff because: You do, it turns out. It's becoming an issue more and more that the sheer amount of storage means that the undetected error rate from disks, hosts, memory, cables and everything else is rising. Most people use absolutely giant disks in laptops and desktop systems (300 GB and 500 GB are common, 750 GB on the way). File systems need to be as robust as possible for users of these systems, as people are commonly storing personal critical data like photos mostly on these unprotected drives. Their loss. A robust FS is good, but really, if you aren't doing backups, you are going to lose data. End of story. Even for high-end users, array-based mirroring and so on can only do so much to protect you. Mirroring a corrupt file system to a remote data center will mirror your corruption. Assuming it's undetected. Why would it be undetected? Rolling back to a snapshot typically only happens when you notice a corruption, which can go undetected for quite a while, so even that will benefit from having reliability baked into the file system (i.e., it should grumble about corruption to let you know that you need to roll back or fsck or whatever). Yes, the filesystem should complain about corruption. So should the block layer -- if you don't trust the FS, use a checksum at the block layer. So should... There are just so many other, better places to do this than the FS. The FS should complain, yes, but if the disk is bad, there's going to be corruption. An even larger issue is that our tools, like fsck, which are used to uncover these silent corruptions, need to scale up to the point that they can uncover issues in minutes instead of days. A lot of the focus at the file system workshop was around how to dramatically reduce the repair time of file systems. That would be interesting. I know from experience that fsck.reiser4 is amazing. Blew away my data with something akin to an rm -rf, and fsck fixed it. Tons of crashing/instability in the early days, but only once -- before they even had a version instead of a date, I think -- did I ever have a case where fsck couldn't fix it. So I guess the next step would be to make fsck faster. Someone mentioned a fsck that repairs the FS in the background? In a way, having super reliable storage hardware is only as good as the file system layer on top of it - reliability needs to be baked into the entire IO system stack... That bit makes no sense. If you have super reliable storage hardware (never dies), and your FS is also reliable (never dies unless hardware does, but may go bat-shit insane when hardware dies), then you've got a super reliable system. You're right, running Linux's HFS+ or NTFS write support is generally a bad idea, no matter how reliable your hardware is. But this discussion was not about whether an FS is stable, but how well an FS survives hardware corruption.
Re: reiser4: maybe just fix bugs?
On 8/1/06, Andrew Morton [EMAIL PROTECTED] wrote: On Tue, 01 Aug 2006 15:24:37 +0400 Vladimir V. Saveliev [EMAIL PROTECTED] wrote: The writeout code is ugly, although that's largely due to a mismatch between what reiser4 wants to do and what the VFS/MM expects it to do. Yes. reiser4 writes out atoms. Most pages get into atoms via sys_write. But pages dirtied via shared mappings do not. They get into atoms in reiser4's writepages address space operation. I think you mean ->writepage - reiser4 doesn't implement ->writepages(). I assume you considered hooking into ->set_page_dirty() to do the add-to-atom thing earlier on? We'll merge mm-tracking-shared-dirty-pages.patch into 2.6.19-rc1, which would make that approach considerably more successful, I expect. ->set_page_dirty() is a bit awkward because it can be called under spinlock. Maybe something could also be gained from the new vm_operations_struct.page_mkwrite(), although that's less obvious... That is why reiser4_sync_inodes has two steps: on the first one it calls generic_sync_sb_inodes to call writepages for dirty inodes, capturing pages dirtied via shared mappings into atoms. The second step flushes atoms. I agree --- both with it being ugly, and with that being part of why. If it works, we can live with it, although perhaps the VFS could be made smarter. I would be curious to hear any ideas on that. Next time I read through that code, I will keep in mind that you are open to making VFS changes if it improves things, and I will try to get clever somehow and send it by you. Our squalloc code, though, is I must say the most complicated and ugliest piece of code I ever worked on, for which every cumulative ugliness had a substantive performance advantage requiring us to keep it. If you spare yourself from reading that, it is understandable. I'd say that reiser4's major problem is the lack of xattrs, acls and direct-io. That's likely to significantly limit its vendor uptake. xattrs is really a problem. That's not good. The ability to properly support SELinux is likely to be important. i disagree that it will be difficult. unfortunately, the patch that I am working on right now, which fixes the various reiser4-specific functions to avoid using VFS data structures unless needed, is a prerequisite to enabling xattrs. creating it is a time of tedium for me, and it will cause a bit of internal churn (1000 lines and counting). it's all in the fs/reiser4 directory though, and it should cause minimal disruption as far as runtime bugs introduced. once that's taken care of, i will be delighted to enable xattr support in a way that will make selinux and beagle and such run as expected, and will have the added advantage of some major scalability improvements for certain lookup and update operations. NATE
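For reference, the hook Andrew suggests would slot into the address_space_operations table roughly like this. A sketch only, not reiser4 code: capture_page_into_atom() and my_writepage() are hypothetical stand-ins, and -- per the caveat above -- the capture step must not sleep, since ->set_page_dirty() can be called under spinlock:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/writeback.h>

extern int my_writepage(struct page *page, struct writeback_control *wbc);
/* hypothetical: queue the page for atom capture without sleeping */
extern void capture_page_into_atom(struct page *page);

/* Mark the page dirty the usual way, then capture it into an atom
 * immediately, instead of discovering it at writeout time via
 * generic_sync_sb_inodes -> ->writepage. */
static int my_set_page_dirty(struct page *page)
{
	int ret = __set_page_dirty_nobuffers(page);

	capture_page_into_atom(page);
	return ret;
}

static const struct address_space_operations my_aops = {
	.writepage      = my_writepage,
	.set_page_dirty = my_set_page_dirty,
	/* ... */
};

The attraction is that shared-mapping dirtying is then caught at the moment it happens, which would let reiser4_sync_inodes drop its first pass.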
Re: reiser4-2.6.18-rc2-mm1: possible circular locking dependency detected in txn_end
Hello Ingo, there is a new reiser4 / lock validator problem: On Sunday 30 July 2006 22:57, Laurent Riffard wrote: === [ INFO: possible circular locking dependency detected ] --- mv/29012 is trying to acquire lock: (txnh->hlock){--..}, at: [<e0c8e09b>] txn_end+0x191/0x368 [reiser4] but task is already holding lock: (atom->alock){--..}, at: [<e0c8a640>] txnh_get_atom+0xf6/0x39e [reiser4] which lock already depends on the new lock. it is absolutely legal in reiser4 to lock the atom first, then lock the transaction handle. i guess the lock validator recorded a wrong dependency rule from one place where the spinlocks are taken in reverse order. that place is fs/reiser4/txnmgr.c:atom_begin_and_assign_to_txnh; the atom there is a new, just-kmalloc'ed object which is inaccessible to others, so it can't be a source of deadlock. but how to explain that to the lock validator? the existing dependency chain (in reverse order) is: - #1 (atom->alock){--..}: [<c012ce2f>] lock_acquire+0x60/0x80 [<c0292968>] _spin_lock+0x19/0x28 [<e0c8bbd7>] try_capture+0x7cf/0x1cd7 [reiser4] [<e0c786e1>] longterm_lock_znode+0x427/0x84f [reiser4] [<e0ca55dc>] coord_by_handle+0x2be/0x7f7 [reiser4] [<e0ca5f89>] coord_by_key+0x1e3/0x22d [reiser4] [<e0c7dbd2>] insert_by_key+0x8f/0xe0 [reiser4] [<e0cbf7f1>] write_sd_by_inode_common+0x361/0x61a [reiser4] [<e0cbfce4>] create_object_common+0xf1/0xf6 [reiser4] [<e0cbaebf>] create_vfs_object+0x51d/0x732 [reiser4] [<e0cbb1fd>] mkdir_common+0x43/0x4b [reiser4] [<c015ed33>] vfs_mkdir+0x5a/0x9d [<c0160f5e>] sys_mkdirat+0x88/0xc0 [<c0160fa6>] sys_mkdir+0x10/0x12 [<c0102c2d>] sysenter_past_esp+0x56/0x8d - #0 (txnh->hlock){--..}: [<c012ce2f>] lock_acquire+0x60/0x80 [<c0292968>] _spin_lock+0x19/0x28 [<e0c8e09b>] txn_end+0x191/0x368 [reiser4] [<e0c7f97d>] reiser4_exit_context+0x1c2/0x571 [reiser4] [<e0cbb091>] create_vfs_object+0x6ef/0x732 [reiser4] [<e0cbb1fd>] mkdir_common+0x43/0x4b [reiser4] [<c015ed33>] vfs_mkdir+0x5a/0x9d [<c0160f5e>] sys_mkdirat+0x88/0xc0 [<c0160fa6>] sys_mkdir+0x10/0x12 [<c0102c2d>] sysenter_past_esp+0x56/0x8d other info that might help us debug this: 2 locks held by mv/29012: #0: (inode->i_mutex/1){--..}, at: [<c015f50b>] lookup_create+0x1d/0x73 #1: (atom->alock){--..}, at: [<e0c8a640>] txnh_get_atom+0xf6/0x39e [reiser4] stack backtrace: [<c0104df0>] show_trace+0xd/0x10 [<c0104e0c>] dump_stack+0x19/0x1d [<c012bc62>] print_circular_bug_tail+0x59/0x64 [<c012cc3e>] __lock_acquire+0x814/0x9a5 [<c012ce2f>] lock_acquire+0x60/0x80 [<c0292968>] _spin_lock+0x19/0x28 [<e0c8e09b>] txn_end+0x191/0x368 [reiser4] [<e0c7f97d>] reiser4_exit_context+0x1c2/0x571 [reiser4] [<e0cbb091>] create_vfs_object+0x6ef/0x732 [reiser4] [<e0cbb1fd>] mkdir_common+0x43/0x4b [reiser4] [<c015ed33>] vfs_mkdir+0x5a/0x9d [<c0160f5e>] sys_mkdirat+0x88/0xc0 [<c0160fa6>] sys_mkdir+0x10/0x12 [<c0102c2d>] sysenter_past_esp+0x56/0x8d (Linux antares.localdomain 2.6.18-rc2-mm1 #77 Sun Jul 30 15:09:34 CEST 2006 i686 AMD Athlon(TM) XP 1600+ unknown GNU/Linux) -- Alex.
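For what it's worth, the stock lockdep answer to "how to explain that to the lock validator" is a nesting annotation on the lock that is taken out of order on the freshly allocated object. A sketch against the names in the report (assuming alock/hlock are plain spinlock_t fields), not a tested patch:

#include <linux/spinlock.h>

/* Sketch of fs/reiser4/txnmgr.c:atom_begin_and_assign_to_txnh. The atom
 * was just kmalloc'ed and is invisible to every other thread, so taking
 * atom->alock *after* txnh->hlock cannot deadlock here. The
 * SINGLE_DEPTH_NESTING subclass keeps lockdep from recording this safe
 * acquisition as a reversed txnh->hlock -> atom->alock dependency. */
static void atom_begin_and_assign_to_txnh(txn_atom *atom, txn_handle *txnh)
{
	spin_lock(&txnh->hlock);
	spin_lock_nested(&atom->alock, SINGLE_DEPTH_NESTING);
	/* ... wire the fresh atom and the handle together ... */
	spin_unlock(&atom->alock);
	spin_unlock(&txnh->hlock);
}

spin_lock_nested() is part of the lockdep infrastructure that went in with 2.6.18, so it is available in the -mm tree the report came from.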
Re: reiser4: maybe just fix bugs?
On 8/1/06, David Masover [EMAIL PROTECTED] wrote: Vladimir V. Saveliev wrote: Do you think that if reiser4 supported xattrs - it would increase its chances of inclusion? Probably the opposite. If I understand it right, the original Reiser4 model of file metadata is the file-as-directory stuff that caused such a furor during the last big push for inclusion (search for Silent semantic changes in Reiser4): foo.mp3/.../rwx # permissions foo.mp3/.../artist # part of the id3 tag So I suspect xattrs would just be a different interface to this stuff, maybe just a subset of it (to prevent namespace collisions): foo.mp3/.../xattr/ # contains files representing attributes Of course, you'd be able to use the standard interface for getting/setting these. The point is, I don't think Hans/Namesys want to do this unless they're going to do it right, especially because they already have the file-as-dir stuff somewhat done. Note that these are neither mutually exclusive nor mutually dependent -- you don't have to enable file-as-dir to make xattrs work. I know it's not done yet, though. I can understand Hans dragging his feet here, because xattrs and traditional acls are examples of things Reiser4 is supposed to eventually replace. Anyway, if xattrs were done now, the only good that would come of it is building a userbase outside the vanilla kernel. I can't see it doing anything but hurting inclusion by introducing more confusion about plugins. I could be entirely wrong, though. I speak for neither Hans/Namesys/reiserfs nor LKML. Talk amongst yourselves... i should clarify things a bit here. yes, hans' goal is for there to be no difference between the xattr namespace and the readdir one. unfortunately, this is not feasible with the current VFS, and some major work would have to be done to enable this without some pathological cases cropping up. some very smart people think that it cannot be done at all. xattr is a separate VFS interface, which avoids those problems by defining certain restrictions on how the 'files' which live in that namespace can be manipulated. for instance, hard links are non-existent, and the 'mv' command cannot move a file between different xattr namespaces. enabling xattr would have no connection to the file-as-directory stuff, and (without extra work) would not even allow access to the things reiser4 defined in the '...' directory. also, enabling xattr in the way i intend would in no way compromise hans' long-term vision. HOWEVER, i *need* to point out that hans and i disagree somewhat on the specifics here, so i should say adamantly that i don't speak here on behalf of hans or namesys. that won't stop me from submitting my own patch though :) NATE
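From userspace, the xattr interface nate describes is reached through dedicated syscalls rather than readdir, which is exactly where those restrictions come from. A minimal sketch mirroring the id3 example above -- the attribute name user.artist is illustrative, not a reiser4 API:

#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

int main(void)
{
	char buf[256];
	ssize_t len;

	/* roughly the same idea as foo.mp3/.../artist, via the xattr VFS */
	if (setxattr("foo.mp3", "user.artist", "Unknown",
		     strlen("Unknown"), 0) != 0)
		perror("setxattr");

	len = getxattr("foo.mp3", "user.artist", buf, sizeof(buf) - 1);
	if (len >= 0) {
		buf[len] = '\0';
		printf("artist: %s\n", buf);
	} else {
		perror("getxattr");
	}
	return 0;
}

Note there is no open()/readdir() on the attribute: the value moves as one blob through get/set calls, which is how the VFS sidesteps the hard-link and cross-namespace-mv pathologies mentioned above.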
Re: [BUG] nikita-1481, nikita-717 and nikita-373 here and there
On Fri, 2006-06-23 at 02:51 +0300, Jussi Judin wrote: After that I upgraded to Debian patched kernel 2.6.16-14 and to reiser4 patch 2.6.16-4 for that kernel and ran fsck.reiser4. Then I got errors like this in kern.log after a while: WARNING: Error for inode 1731981 (-2) reiser4[nfsd(3817)]: key_warning (fs/reiser4/plugin/file_plugin_common.c:513)[nikita-717]: WARNING: Error for inode 1703086 (-2) reiser4[nfsd(3818)]: key_warning (fs/reiser4/plugin/file_plugin_common.c:513)[nikita-717]: WARNING: Error for inode 1726433 (-2) reiser4[nfsd(3818)]: key_warning (fs/reiser4/plugin/file_plugin_common.c:513)[nikita-717]: I too am getting these warnings: Jul 27 06:28:15 prometheus kernel: reiser4[find(10770)]: key_warning (fs/reiser4/plugin/file_plugin_common.c:513)[nikita-717]: Jul 27 06:28:15 prometheus kernel: WARNING: Error for inode 3922698 (-2) [REPEATED 17 TIMES] Jul 27 06:28:15 prometheus kernel: reiser4[find(10770)]: key_warning (fs/reiser4/plugin/file_plugin_common.c:513)[nikita-717]: Jul 27 06:28:15 prometheus kernel: WARNING: Error for inode 3922697 (-2) [REPEATED 17 TIMES] Jul 27 06:28:16 prometheus kernel: reiser4[find(10770)]: key_warning (fs/reiser4/plugin/file_plugin_common.c:513)[nikita-717]: Jul 27 06:28:16 prometheus kernel: WARNING: Error for inode 3922696 (-2) [REPEATED 17 TIMES] ... ... Jul 27 06:28:19 prometheus kernel: reiser4[find(10770)]: cbk_level_lookup (fs/reiser4/search.c:961)[vs-3533]: Jul 27 06:28:19 prometheus kernel: WARNING: Keys are inconsistent. Fsck? Jul 27 06:28:19 prometheus kernel: reiser4[find(10770)]: key_warning (fs/reiser4/plugin/file_plugin_common.c:513)[nikita-717]: Jul 27 06:28:19 prometheus kernel: WARNING: Error for inode 3922690 (-5) System information: Kernel: 2.6.16.20 Patches: reiser4-for-2.6.16-4.patch.gz Reiser4progs: 1.0.5 This machine is used for recording TV with a DVB card, compressing the files, and serving them via NFS and Samba. Until recently, the system ran kernel linux-2.6.11.6 and performed flawlessly for over a year. After upgrading the kernel, I upgraded reiser4progs and fscked all reiser4 partitions. No errors were found. The system runs on a UPS and does not have a history of memory or IO trouble. I am currently investigating why the Samba shares have failed, and noticed this in the log. I believe this problem/bug is related to the kernel upgrade rather than some random corruption, because it seems too much of a coincidence for it to happen so soon after upgrading the kernel. Any help/advice is greatly appreciated. ...back to the samba investigation. Many Thanks, -- Craig Shelley EMail: [EMAIL PROTECTED] Jabber: [EMAIL PROTECTED]
Re: Ebuild/rpm/deb repo's (was Re: reiser4 can now bear with filled fs, looks stable to me...)
Hello Sander, Tuesday, August 1, 2006, 8:10:34 PM, you wrote: Yes, and in the case of Gentoo there are already people maintaining an ebuild which pulls in r4 on the wiki. http://gentoo-wiki.com/HOWTO_Reiser4_With_Gentoo-Sources Debian has reiser4progs and kernel-patch-2.6-reiser4: - stable: 20040813-6 - testing: 20050715-1 - unstable: 20050715-1 Very old patches. Also, the patch description says: WARNING: this software is to be considered usable but its deployment in production environments is still not recommended. Use at your own risk. I know it is very easy to create Ubuntu kernel packages (I have done a few); I might try to do one for the current Dapper kernel for i386. But it would have to wait due to my personal time constraints (projects, etc.) -- Best regards, Maciej
reiserfs 3.6 with 2TB file size limitation!
Hi, I read on the reiserfs site, FAQ #1, about max file sizes: max file size 2^60 bytes = 1 EiB, but the page cache limits this to 8 TiB on architectures with a 32-bit int, for reiserfs 3.6. I do have a reiserfs 3.6 filesystem, on kernel 2.6.12-21mdksmp (Mandriva 2006), that would not take a filesize greater than 2TB! Filesize in question: -rw-r--r-- 1 myuser users 2147483647 Aug 1 16:41 myfile.out Error received: Filesize limit exceeded This filesystem is on a SCSI device connected via an Adaptec AIC-7899P U160/m card (if it matters). # debugreiserfs /dev/sda1 debugreiserfs 3.6.19 (2003 www.namesys.com) Filesystem state: consistency is not checked after last mounting Reiserfs super block in block 16 on 0x801 of format 3.6 with standard journal Count of blocks on the device: 292967356 Number of bitmaps: 8941 Blocksize: 4096 Free blocks (count of blocks - used [journal, bitmaps, data, reserved] blocks): 172310677 Root block: 121672 Filesystem is NOT clean Tree height: 5 Hash function used to sort names: r5 Objectid map size 2, max 972 Journal parameters: Device [0x0] Magic [0x5613f7b2] Size 8193 blocks (including 1 for journal header) (first block 18) Max transaction length 1024 blocks Max batch size 900 blocks Max commit age 30 Blocks reserved by journal: 0 Fs state field: 0x0: sb_version: 2 inode generation number: 1507612 UUID: 78e233e1-8210-4c9f-8f5d-7159c754db16 LABEL: Set flags in SB: ATTRIBUTES CLEAN Any idea why I would receive this limitation with a 2.6 kernel and reiserfs 3.6? Can this be corrected to allow the full file size? I would appreciate any hints. I checked the kernel sources for the kernel version I am running, and I don't see any mention of the large file/filesystem support that I recall was available in older kernel compiles, so it is probably integrated now? Thanks, Richard
Re: Ebuild/rpm/deb repo's (was Re: reiser4 can now bear with filled fs, looks stable to me...)
On Tue, 2006-08-01 at 23:12 +0200, Maciej Sołtysiak wrote: Hello Sander, Hey Tuesday, August 1, 2006, 8:10:34 PM, you wrote: Yes, and in the case of Gentoo there are already people maintaining an ebuild which pulls in r4 on the wiki. http://gentoo-wiki.com/HOWTO_Reiser4_With_Gentoo-Sources Debian has reiser4progs and kernel-patch-2.6-reiser4: Nice :) - stable: 20040813-6 - testing: 20050715-1 - unstable: 20050715-1 Ouch :( It is in serious need of updating. With the approval of Namesys I would like to add a new entry to the wiki frontpage. It would be something like "Get reiser4 now" or "Howto install reiser4". Under that we detail the steps to get kernels for distros which include reiser4, and how to patch it yourself. I know it is very easy to create Ubuntu kernel packages (I have done a few); I might try to do one for the current Dapper kernel for i386. But it would have to wait due to my personal time constraints (projects, etc.) Great :) Is there anyone on the list who knows of RPMs for SuSE/Red Hat/Mandrake that include reiser4? Greets Sander
Re: reiserfs 3.6 with 2TB file size limitation!
Ricardo (Tru64 User) wrote: Hi, Hello I read on the reiserfs site, FAQ #1, about max file sizes: [snip] I do have a reiserfs 3.6 filesystem, on kernel 2.6.12-21mdksmp (Mandriva 2006), that would not take a filesize greater than 2TB! Filesize in question: -rw-r--r-- 1 myuser users 2147483647 Aug 1 16:41 myfile.out Hmm.. actually this file has size 2GB, not 2TB. Did you try to specify O_LARGEFILE when creating the file? (man creat) [rest of the quoted report snipped]
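For anyone hitting the same wall, here is a minimal way to test the O_LARGEFILE theory (a sketch; compile as a 32-bit program). Without the flag -- or without building with -D_FILE_OFFSET_BITS=64 -- a 2.6 kernel refuses writes past 2^31-1 bytes with EFBIG and raises SIGXFSZ, which the shell reports as exactly the "Filesize limit exceeded" message above:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* with O_LARGEFILE the 2GB barrier goes away on a 2.6 kernel */
	int fd = open("myfile.out", O_WRONLY | O_CREAT | O_LARGEFILE, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* seek to 3GB and write one byte; without O_LARGEFILE this
	 * write would fail with EFBIG / SIGXFSZ */
	if (lseek64(fd, 3LL << 30, SEEK_SET) < 0 || write(fd, "x", 1) != 1)
		perror("write past 2GB");
	close(fd);
	return 0;
}

So the limit here is the LFS rules of the application's open(), not reiserfs 3.6 itself -- which is consistent with the file stopping at exactly 2147483647 bytes.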
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
David Masover wrote: David Lang wrote: On Mon, 31 Jul 2006, David Masover wrote: Oh, I'm curious -- do hard drives ever carry enough battery/capacitance to cover their caches? It doesn't seem like it would be that hard/expensive, and if it is done that way, then I think it's valid to leave them on. You could just say that other filesystems aren't taking as much advantage of newer drive features as Reiser :P there are no drives that have the ability to flush their cache after they lose power. Aha, so back to the usual argument: UPS! It takes a fraction of a second to flush that cache. You probably don't actually want to flush the cache - but to write it to a journal. 16M of cache, split into 32000 writes to single sectors spread over the disk, could well take several minutes to write. Slapping it onto a journal would take well under 0.2 seconds. That's a non-trivial amount of storage though - 3J or so, [EMAIL PROTECTED] - a moderately large/expensive capacitor. And if you've got to spin the drive up, you've just added another order of magnitude. You can see why a flash backup of the write cache may be nicer. You can do it if the disk isn't spinning. It uses moderately less energy - and at a much lower rate, which means the power supply can be _much_ cheaper. I'd guess it's the difference between under $2 and $10. And if you can use it as a lazy write cache for laptops - things just got better, battery-life-wise, too.
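Rough numbers behind Ian's estimate (assumed figures: ~5 ms per scattered single-sector write, ~80 MB/s sequential media rate):

scattered: 16 MiB / 512 B = 32768 writes x 5 ms = ~164 s, i.e. several minutes
journal: 16 MiB streamed sequentially at 80 MB/s = ~0.2 s

That three-orders-of-magnitude gap is why dumping the cache to a journal needs only a small capacitor's worth of energy, while flushing it in place needs minutes of power.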
Re: Solaris ZFS on Linux [Was: Re: the 'official' point of view expressed by kernelnewbies.org regarding reiser4 inclusion]
Ian Stirling wrote: David Masover wrote: [snip] Aha, so back to the usual argument: UPS! It takes a fraction of a second to flush that cache. You probably don't actually want to flush the cache - but to write it to a journal. 16M of cache, split into 32000 writes to single sectors spread over the disk, could well take several minutes to write. Slapping it onto a journal would take well under 0.2 seconds. That's a non-trivial amount of storage though - 3J or so, [EMAIL PROTECTED] - a moderately large/expensive capacitor. Before we get ahead of ourselves, remember: ~$200 buys you a huge amount of battery storage. We're talking several minutes for several boxes, at the very least -- more like 10 minutes. But yes, a journal or a software suspend.
Re: reiser4: maybe just fix bugs?
Nate Diller wrote: On 8/1/06, David Masover [EMAIL PROTECTED] wrote: Vladimir V. Saveliev wrote: I could be entirely wrong, though. I speak for neither Hans/Namesys/reiserfs nor LKML. Talk amongst yourselves... i should clarify things a bit here. yes, hans' goal is for there to be no difference between the xattr namespace and the readdir one. unfortunately, this is not feasible with the current VFS, and some major work would have to be done to enable this without some pathological cases cropping up. some very smart people think that it cannot be done at all. But an xattr interface should work just fine, even if the rest of the system is inaccessible (no readdir interface) -- preventing all these pathological problems, except the one where Hans implements it the way I'm thinking, and kernel people hate it.