Re: Question of stability
On Mon, 20 Sep 2010 07:30:57 -0400 Chris Mason <chris.ma...@oracle.com> wrote:

> On Mon, Sep 20, 2010 at 11:00:08AM +, Lubos Kolouch wrote:
> > No, not stable!
> > Again, after powerloss, I have *two* damaged btrfs filesystems.
>
> Please tell me more about your system. I do extensive power fail testing
> here without problems, and corruptions after powerloss are very often
> caused by the actual hardware.
>
> So, what kind of drives do you have, do they have writeback caching on,
> and what are you layering on top of the drive between btrfs and the
> kernel?
>
> -chris

Chris,

the actual way a fs was damaged must not be relevant. From a new fs design one should expect that the tree can be mounted no matter what corruption took place, up to the case where the fs is indeed empty after mounting because it was completely corrupted. If parts were corrupt, then the fs should either be able to assist the user in correcting the damage _online_, or at least simply exclude the damaged parts from the mounted fs tree. The basic thought must be "show me what you have", not "shit, how do I get access to the working but not mountable fs parts again?". Would you buy a car that refuses to drive if the ash tray is broken?

--
Regards,
Stephan

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SSD Optimizations
On Thu, 11 Mar 2010 13:00:17 -0500 Chris Mason <chris.ma...@oracle.com> wrote:

> On Thu, Mar 11, 2010 at 06:35:06PM +0100, Stephan von Krawczynski wrote:
> > On Thu, 11 Mar 2010 15:39:05 +0100 Sander <san...@humilis.net> wrote:
> > > Stephan von Krawczynski wrote (ao):
> > > > Honestly I would just drop the idea of an SSD option simply because
> > > > the vendors implement all kinds of neat strategies in their
> > > > devices. So in the end you cannot really tell if the option does
> > > > something constructive and not destructive in combination with an
> > > > SSD controller.
> > >
> > > My understanding of the ssd mount option is also that the fs doesn't
> > > try to do all kinds of smart (and potentially expensive) things which
> > > make sense for rotating media to reduce seeks and the like.
> > >
> > > Sander
> >
> > Such an optimization sounds valid at first sight. But re-think closely:
> > how does the fs really know about the seeks needed during some
> > operation?
>
> Well, the FS makes a few assumptions (in the non-ssd case). First it
> assumes the storage is not a memory device. If things would fit in
> memory we wouldn't need filesystems in the first place.

Ok, here is the bad news. This assumption can be everything from right to completely wrong, and you cannot really tell which is the mainstream answer. Two examples from opposite ends of the technology world:

- History: way back in the 80's there was 3rd-party hardware for the C=1541 (the floppy drive of the C=64) that read in the complete floppy and served all incoming requests from its RAM buffer. So your assumption can already be wrong for a trivial floppy drive from ancient times.

- Nowadays: being a linux installation today, chances are that the matrix has you. Quite a lot of installations are virtualized. So your storage is a virtual one as well, which means it is quite likely a fs buffer of the host system, i.e. RAM.

And sorry to say: even if things would fit in memory, you probably still need a fs, simply because there is no actual way to organize data (be it executable or not) in RAM without a fs layer. You can't save data without an abstract file data type, and to have one accessible you need a fs. Btw, the other way round is just as interesting: there is currently no fs for linux that knows how to execute in place. Meaning: if you really had only RAM and a fs to organize your data, it would be only logical to have ways to _not_ load data (into other parts of the RAM), but to use it in its original (RAM-)storage space.

> Then it assumes that adjacent blocks are cheap to read and blocks that
> are far away are expensive to read. Given expensive raid controllers,
> cache, and everything else, you're correct that sometimes this
> assumption is wrong.

As already mentioned, this assumption may be completely wrong even without a raid controller, e.g. within a virtual environment. Even far-away blocks can be one byte apart in the next fs buffer of the underlying host fs (assuming your device is in fact a file on the host ;-).

> But, on average seeking hurts. Really a lot.

Yes, seeking hurts. But there is no way to know if there is any seeking at all. On the other hand, if your storage is a netblock device, seeking on the server is probably your smallest problem compared to the network latency in between.

> We try to organize files such that files that are likely to be read
> together are found together on disk. Btrfs is fairly good at this during
> file creation and not as good as ext*/xfs as files are overwritten and
> modified again and again (due to cow).

You are basically saying that btrfs perfectly organizes write-once devices ;-)

> If you turn mount -o ssd on for your drive and do a test, you might not
> notice much difference right away. ssds tend to be pretty good right out
> of the box. Over time it tends to help, but it is a very hard thing to
> benchmark in general.

Honestly, this sounds like "I give up" to me ;-) You just said that generally it is very hard to benchmark. Which means, in non-tech language, that nobody can see or feel it in the real world.

Please understand that I am the last one criticizing your and others' brilliant work and the time you spend on btrfs. Only I do believe that if you spent one hour on some fs like glusterfs for every 10 hours you spend on btrfs, you would be both king and queen of the linux HA community :-) (but probably unemployed, so I can't really beat you for it)

> -chris

--
Regards,
Stephan
Re: SSD Optimizations
On Fri, 12 Mar 2010 02:07:40 +0100 Hubert Kario <h...@qbs.com.pl> wrote:

> [...]
> > If the FS were to be smart and know about the 256kb requirement, it
> > would do a read/modify/write cycle somewhere and then write the 4KB.
>
> If all the free blocks have been TRIMmed, the FS should pick a
> completely free erasure-size block and write those 4KiB of data. A
> correct implementation of wear leveling in the drive should notice that
> the write is entirely inside a free block and make just a write cycle,
> adding zeros to the end of the supplied data.

Your assumption here is that your _addressed_ block layout is completely identical to the SSD's disk layout. Otherwise you cannot know where a free erasure block is located and how to address it from the FS. I really wonder what this assumption is based on. You still think an SSD is a true disk with linear addressing. I doubt that very much. Even on true spinning disks your assumption is wrong for relocated sectors. Which basically means that every disk controller firmware has been fiddling around with the physical layout for decades. Please accept that you cannot do a disk's job in the FS. The more advanced technology gets, the more disks become black boxes with a defined software interface. Use this interface and drop the idea of having inside knowledge of such a device. That's other people's work. If you want to design smart SSD controllers, hire on at a company that builds them.

--
Regards,
Stephan
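The read/modify/write scenario debated above can be made concrete with a toy calculation: given an erase block and a smaller filesystem write, which erase blocks does the write touch? The 256 KiB erase block and 4 KiB fs block are illustrative assumptions taken from the thread, not properties of any real drive.

```python
# Toy model of the alignment question: does a 4 KiB logical write straddle
# an (assumed) 256 KiB erase-block boundary, and which erase blocks would
# a naive read/modify/write cycle have to touch?
ERASE_BLOCK = 256 * 1024   # assumed erase block size (bytes)
FS_BLOCK = 4 * 1024        # assumed filesystem block size (bytes)

def erase_blocks_touched(offset: int, length: int) -> range:
    """Range of erase-block indices covered by a write at `offset`."""
    first = offset // ERASE_BLOCK
    last = (offset + length - 1) // ERASE_BLOCK
    return range(first, last + 1)

def straddles_boundary(offset: int, length: int) -> bool:
    return len(erase_blocks_touched(offset, length)) > 1

# An aligned 4 KiB write stays inside one erase block ...
print(straddles_boundary(0, FS_BLOCK))                   # False
# ... while a write starting 2 KiB before a boundary touches two erase
# blocks, forcing a read/modify/write of both if they hold live data.
print(straddles_boundary(ERASE_BLOCK - 2048, FS_BLOCK))  # True
```

Whether the drive's internal layout matches these logical offsets at all is exactly Stephan's objection.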
Re: SSD Optimizations
On Thu, 11 Mar 2010 11:59:57 +0100 Hubert Kario <h...@qbs.com.pl> wrote:

> On Thursday 11 March 2010 08:38:53 Sander wrote:
> > Hello Gordan,
> >
> > Gordan Bobic wrote (ao):
> > > Mike Fedyk wrote:
> > > > On Wed, Mar 10, 2010 at 11:49 AM, Gordan Bobic <gor...@bobich.net> wrote:
> > > > > Are there options available comparable to ext2/ext3 to help
> > > > > reduce wear and improve performance?
> > > >
> > > > With SSDs you don't have to worry about wear.
>
> Sorry, but you do have to worry about wear. I was able to destroy a
> relatively new SD card (2007 or early 2008) just by writing to the first
> 10MiB over and over again for two or three days. The end of the card
> still works without problems, but about 10 sectors at the beginning give
> write errors.

Sorry, the topic was SSD, not SD. SSDs have controllers that contain heavy closed magic to circumvent all kinds of troubles you get when using classical flash and SD cards. Honestly I would just drop the idea of an SSD option, simply because the vendors implement all kinds of neat strategies in their devices. So in the end you cannot really tell if the option does something constructive and not destructive in combination with an SSD controller. Of course you may well discuss an option for passive flash devices like IDE-CF/SD or the like. There is no controller involved, so your fs implementation may well work out.

--
Regards,
Stephan
Re: SSD Optimizations
On Thu, 11 Mar 2010 12:17:30 +0000 Gordan Bobic <gor...@bobich.net> wrote:

> On Thu, 11 Mar 2010 12:31:03 +0100, Stephan von Krawczynski
> <sk...@ithnet.com> wrote:
> > > > > On Wed, Mar 10, 2010 at 11:49 AM, Gordan Bobic <gor...@bobich.net> wrote:
> > > > > > Are there options available comparable to ext2/ext3 to help
> > > > > > reduce wear and improve performance?
> > > > >
> > > > > With SSDs you don't have to worry about wear.
> > > >
> > > > Sorry, but you do have to worry about wear. I was able to destroy a
> > > > relatively new SD card (2007 or early 2008) just by writing to the
> > > > first 10MiB over and over again for two or three days. The end of
> > > > the card still works without problems, but about 10 sectors at the
> > > > beginning give write errors.
> >
> > Sorry, the topic was SSD, not SD.
>
> SD == SSD with an SD interface.

That really is quite a statement. You really talk of a few-bucks SD card (like the one in my Android phone) as an SSD comparable with an Intel XE, only with a different interface? Come on, stay serious. The product is not only made of SLCs and some raw logic.

> > SSDs have controllers that contain heavy closed magic to circumvent
> > all kinds of troubles you get when using classical flash and SD cards.
>
> There is absolutely no basis for thinking that SD cards don't contain
> wear leveling logic. The SD standard, and thus SD cards, supports a lot
> of fancy copy-protection capabilities, which means there is a lot of
> firmware involvement on SD cards. It is unlikely that any reputable SD
> card manufacturer wouldn't also build wear leveling logic into it.

I really don't guess about what is built into an SD or even CF card. But we hopefully agree that there is a significant difference compared to a product that calls itself a _disk_.

> > Honestly I would just drop the idea of an SSD option simply because
> > the vendors implement all kinds of neat strategies in their devices.
> > So in the end you cannot really tell if the option does something
> > constructive and not destructive in combination with an SSD
> > controller.
>
> You can make an educated guess. For starters, given that visible sector
> sizes are not equal to FS block sizes, it means that FS blocks can
> straddle erase block boundaries without the flash controller, no matter
> how fancy, being able to determine this. Thus, at the very least,
> aligning FS structures so that they do not straddle erase block
> boundaries is useful in ALL cases. Thinking otherwise is just sticking
> your head in the sand because you cannot be bothered to think.

And your guess is that intel engineers had no clue when designing the XE, including its controller? You think they did not know what you and I know, and therefore pray every day that some smart fs designer falls from heaven and saves their product from dying in between? Really?

> > Of course you may well discuss an option for passive flash devices
> > like IDE-CF/SD or the like. There is no controller involved, so your
> > fs implementation may well work out.
>
> I suggest you educate yourself on the nature of IDE and CF (which is
> just IDE with a different connector). There most certainly are
> controllers involved. The days when disks (mechanical or solid state)
> didn't integrate controllers ended with MFM/RLL and ESDI disks some 20+
> years ago.

I suggest you don't talk to someone who has administered some hundred boxes based on CF and SSD media for _years_ about the pros and cons of the respective implementations and their long-term usage. Sorry, the world is not built out of paper; sometimes you meet the hard facts. And one of them is that the ssd option in the fs is very likely already overrun by the SSD controller designers and mostly _superfluous_. The market has already decided to make SSDs compatible with standard fs layouts.

--
Regards,
Stephan
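Gordan's alignment argument is usually applied at partition-creation time: start the partition on an erase-block boundary so that FS blocks can never straddle one. A minimal sketch of that arithmetic, assuming a 512-byte sector and a 256 KiB erase block (both illustrative values, not measured from any drive):

```python
# Round a partition's start sector up to the next erase-block boundary so
# that FS blocks never straddle erase blocks. Sector and erase-block
# sizes are assumed values for the sake of the example.
SECTOR = 512                 # assumed logical sector size (bytes)
ERASE_BLOCK = 256 * 1024     # assumed erase block size (bytes)

def aligned_start_sector(start_sector: int) -> int:
    """Smallest sector >= start_sector lying on an erase-block boundary."""
    sectors_per_eb = ERASE_BLOCK // SECTOR        # 512 sectors here
    return -(-start_sector // sectors_per_eb) * sectors_per_eb  # ceil

# The classic DOS default of sector 63 is misaligned; the next
# erase-block boundary is sector 512.
print(aligned_start_sector(63))    # 512
print(aligned_start_sector(512))   # 512
```

Note this is exactly the kind of help Stephan argues is superfluous once the controller remaps blocks freely; the sketch only shows what "aligning FS structures" means mechanically.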
Re: SSD Optimizations
On Thu, 11 Mar 2010 15:01:55 +0100 Hubert Kario <h...@qbs.com.pl> wrote:

> [...]
> The _SD_standard_ states that the media has to implement wear-leveling.
> So any card with an SD logo implements it. As I stated previously, the
> algorithms used in SD cards may not be as advanced as those in
> top-of-the-line Intel SSDs, but I bet they don't differ by much from the
> ones used in the cheapest SSD drives.

Well, we are all pretty sure about that. And that is exactly the reason why these are not surviving the market pressure. Why should one care about bad products that have possibly already gone extinct because of their bad performance by the time the fs is production-ready some day?

> Besides, why shouldn't we help the drive firmware by
> - writing the data only in erase-block sizes
> - trying to write blocks that are smaller than the erase-block in a way
>   that won't cross the erase-block boundary

Because if the designing engineer of a good SSD controller wasn't able to cope with that, he will have no chance to design a second one.

> - using TRIM on deallocated parts of the drive

Another story. That is a designed part of a software interface between fs and drive bios on which both agreed in its usage pattern. Whereas the above points are pure guesswork based on dumb and old hardware and its behaviour.

> This will not only increase the life of the SSD but also increase its
> performance.

TRIM: maybe yes. Rest: pure handwaving.

> [...]
> > And your guess is that intel engineers had no clue when designing the
> > XE, including its controller? You think they did not know what you and
> > I know, and therefore pray every day that some smart fs designer falls
> > from heaven and saves their product from dying in between? Really?
>
> I am saying that there are problems that CANNOT be solved on the disk
> firmware level. Some problems HAVE to be addressed higher up the stack.
> Exactly, you can't assume that the SSD's firmware understands any and
> all file system layouts, especially if they are on fragmented LVM or
> other logical volume manager partitions.

Hopefully the firmware understands exactly no fs layout at all. That would be braindead. Instead it should understand how to arrange incoming and outgoing data in a way that its own technical requirements are met as perfectly as possible. This is no spinning disk; it is completely irrelevant what the data layout looks like as long as the controller finds its way through and copes best with read/write/erase cycles. It may well use additional RAM for caching and data reordering. Do you really believe ascending block numbers are placed at ascending addresses inside the disk (as an example)? Why should they be? What does that mean for fs block ordering? If you don't know anyway what a controller does to your data ordering, how do you want to help it with its job? Please accept that we are _not_ talking about trivial flash mem here, or pseudo-SSDs consisting of SD cards. The market has already evolved better products. The dinosaurs are extinct, even if some still look alive.

> --
> Hubert Kario

--
Regards,
Stephan
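Both sides of this exchange can be illustrated with a toy flash translation layer: logical block numbers map to arbitrary physical pages (Stephan's point that the fs cannot know the layout), while TRIM is the one agreed-upon channel by which the fs tells the controller a block is dead (Hubert's point). This is a deliberately naive model, not how any real firmware works.

```python
# A toy flash translation layer (FTL): logical block addresses map to
# arbitrary physical pages, every overwrite goes to a fresh page, and
# TRIM drops the mapping so the old page can be erased and reused.
class ToyFTL:
    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))  # physical pages available
        self.mapping = {}                          # logical block -> physical page

    def write(self, lba: int) -> int:
        """Write a logical block; returns the physical page actually used."""
        if lba in self.mapping:                    # old page becomes garbage
            self.free_pages.append(self.mapping[lba])
        page = self.free_pages.pop(0)
        self.mapping[lba] = page
        return page

    def trim(self, lba: int) -> None:
        """TRIM: the fs declares the block dead; free its physical page."""
        if lba in self.mapping:
            self.free_pages.append(self.mapping.pop(lba))

ftl = ToyFTL(num_pages=4)
ftl.write(lba=0)             # lba 0 -> page 0
ftl.write(lba=1)             # lba 1 -> page 1
ftl.write(lba=0)             # rewrite: lba 0 moves to page 2, page 0 is garbage
assert ftl.mapping[0] != 0   # ascending lba no longer means ascending page
ftl.trim(1)                  # without TRIM, page 1 would hold "live" data forever
assert 1 not in ftl.mapping
```

The asserts show why layout-based help from the fs is guesswork while TRIM is not: only the second goes through a defined interface.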
Re: SSD Optimizations
On Thu, 11 Mar 2010 15:39:05 +0100 Sander <san...@humilis.net> wrote:

> Stephan von Krawczynski wrote (ao):
> > Honestly I would just drop the idea of an SSD option simply because
> > the vendors implement all kinds of neat strategies in their devices.
> > So in the end you cannot really tell if the option does something
> > constructive and not destructive in combination with an SSD
> > controller.
>
> My understanding of the ssd mount option is also that the fs doesn't try
> to do all kinds of smart (and potentially expensive) things which make
> sense for rotating media to reduce seeks and the like.
>
> Sander

Such an optimization sounds valid at first sight. But re-think closely: how does the fs really know about the seeks needed during some operation? If your disk is a single-platter one, your seeks are completely different from multi-platter. So even a simple case is more or less unpredictable. If you consider a RAID or SAN as the device base, it should be clear that trying to optimize for certain device types is just fake. What does that tell you? The optimization was a pure loss of work hours in the first place. In fact, if you look at this list, a lot of the talk going on is highly academic and has no real usage scenario. Sometimes trying to be super-smart is indeed not useful (for a fs) ...

--
Regards,
Stephan
Re: severe hardlink bug
On Sun, 24 Jan 2010 09:09:44 +0100 Goffredo Baroncelli <kreij...@gmail.com> wrote:

> On Sunday 24 January 2010, Michael Niederle wrote:
> > I'm using btrfs with a kernel 2.6.32.2 (builtin) as the root file
> > system of a Gentoo Linux installation. While attempting to install the
> > plt-scheme package a strange error about link counts occurred
> > ([Error 31] Too many Links).
>
> See this thread: "Mass-Hardlinking Oops" -
> http://thread.gmane.org/gmane.comp.file-systems.btrfs/3427
>
> There is a limit on the number of hardlinks for a file. The maximum
> number of links depends on the name length.

Honestly, this dependency is braindead. How do the fs authors think an application programmer should judge how many hardlinks are possible on a certain fs? This is a really bad design issue. Are we really in the year 2010?

> BR
> Goffredo

--
Regards,
Stephan
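The name-length dependency Goffredo describes came from btrfs (at the time) storing all back-references for an inode's links from one directory inside a single btree item, so the item size capped the link count. A rough sketch of that estimate; the constants (usable item bytes in a 4 KiB leaf, per-reference overhead) are assumptions for illustration, not authoritative on-disk numbers.

```python
# Rough illustration of why the old btrfs hardlink limit depended on name
# length: all back-references for an inode had to fit into one btree
# item. USABLE_ITEM_BYTES and PER_REF_OVERHEAD are assumed values chosen
# only to show the shape of the dependency.
USABLE_ITEM_BYTES = 3900   # assumed max payload of one btree item
PER_REF_OVERHEAD = 10      # assumed fixed bytes per inode reference

def approx_max_links(name_len: int) -> int:
    return USABLE_ITEM_BYTES // (PER_REF_OVERHEAD + name_len)

# Short names allow far more links than long ones -- which is exactly the
# "how should an application judge the limit?" complaint in the thread.
print(approx_max_links(10))   # 195
print(approx_max_links(100))  # 35
```

An application cannot predict this limit portably, since it varies with the names used; that is the design complaint above.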
Re: Phoronix article slaming BTRFS
On Wed, 24 Jun 2009 19:38:37 +0200 Jens Axboe <jens.ax...@oracle.com> wrote:

> [...]
> It's easy to throw cache at the problem and make it faster. That's like
> shaving weight off a car. Might make it go faster, definitely won't make
> it safer.

Interestingly, nobody talks about the other end of the ssd market. Ok, a CF card isn't really an ssd, but it is basically the same technology without very intelligent controllers in front. So if you really want to see improvements from ssd options, this might be the most visible platform for playing. And again, this is indeed a mainstream market; lots of routers and other embedded gadgets use this - currently mostly implementing ram disks for performance reasons.

--
Regards,
Stephan
Re: Btrfs development plans
On Mon, 20 Apr 2009 12:38:57 -0400 Chris Mason <chris.ma...@oracle.com> wrote:

> On Mon, 2009-04-20 at 18:10 +0200, Ahmed Kamal wrote:
> > But now Oracle can re-license Solaris and merge ZFS with btrfs. Just
> > kidding, I don't think it would be technically feasible. May I suggest
> > the name ZbtrFS :) Sorry, couldn't resist. On a more serious note
> > though, are there any technical benefits that justify continuing to
> > push money into btrfs?
>
> The short answer from my point of view is yes. This doesn't really
> change the motivations for working on btrfs or the problems we're trying
> to solve.

... which sounds logical to me. From looking at the project for a while, one can see you are trying to solve problems that are not really linux' ones ...

--
Regards,
Stephan
Re: Some very basic questions
On Wed, 22 Oct 2008 16:35:55 +0200 dbz <[EMAIL PROTECTED]> wrote:

> concerning this discussion, I'd like to put up some requests which
> strongly oppose those brought up initially:
>
> - if you run into an error in the fs structure or any IO error that
>   prevents you from bringing the fs into a consistent state, please
>   simply oops. If a user feels that availability is a main issue, he has
>   to use a failover solution. In this case a fast and clean cut is
>   desirable, and no pray-and-hope mode or 90% mode. If availability is
>   not the issue, it is in any case most important that the data on the
>   fs is safe. If you don't oops, you risk inflicting further damage on
>   the filesystem and ending up with a completely destroyed fs.

Hi Gerald,

this is a good proposal to explain why most failover setups do indeed not work. If you look at numerous internet howtos about building failover, you will recognise that 95% talk about servers that synchronise their fs by all kinds of tools _offline_, like drbd - or choose some network-dependent raid, like nbd or enbd. All these have in common that they are unreliable just because of the needed mounting during failover. In your example: if box 1 oopses because of some error, chances are that box 2, trying to mount the very same data (which it should have because of raid or sync), will indeed fail to mount, too. That leaves you with exactly nothing in hand.

> - if you get any IO error, please **don't** put up a number of retries
>   or anything. If the device reports an error, simply believe it. It is
>   bad enough that many block drivers or controllers try to be smart and
>   put up hundreds of retries. By adding further retries you only end up
>   wasting hours on useless retries. If availability is an issue, the
>   user again has to put up a failover solution. Again, a clean cut is
>   what is needed. The user has to make sure he uses an appropriate
>   configuration according to the importance of his data (mirroring on
>   the fs and/or RAID, failover ...)

Well, this leaves you with my proposal to optionally stop retrying, marking files or (better) blocks as dead.

> - if during mount something unexpected comes up and you can't be sure
>   that the fs will work properly, please deny mounting and request a
>   fsck. This can be easily handled by a start- or mount-script. During
>   mount, take the time you need to ensure that the fs looks proper and
>   safe to use. I'd rather know during boot that something is wrong than
>   run with a foul fs and end up with data loss or any other mixup later
>   on.

As explained above, it is exactly the lack of parallel mounts that leaves you without a lot of time during mount. A failover that takes 10 minutes just for the re-mount is no failover, it is sh.t. ext?, btw, hardly ever mounts TBs in under 10 minutes.

> - btrfs is no cluster fs, so there is no point in even thinking about
>   it. If somebody feels he needs multiple writeable mounts of the same
>   fs, please use a cluster fs. Of course, you have to live with the
>   tradeoffs. Dreaming of a fs that uses something like witchcraft to do
>   things like locking, quorums, cache synchronisation without penalty
>   and, of course, without any configuration, is pointless.

This reads pretty much like "a processor is a processor, and not multiple processors". We all know today that this time has passed. In 5 years you will pretty much say the same for single fs vs. cluster fs.

> In my opinion, the whole thing comes from the idea of using cheap
> hardware and out-of-the-box configurations to keep promises of
> reliability and availability which are not realistic. There is a reason
> why there are more expensive HDDs, RAIDs, SANs with volume mirroring,
> multipathing and so on. Simply ignoring the fact that you have to use
> the proper tools to address specific problems, and praying to the tooth
> fairy to put a solve-all-my-problems fs under your pillow, is no
> solution. I'd rather have a solid fs with deterministic behavior and
> some state-of-the-art features.

Well, sorry to say, but I begin to sound a bit like Joseph Stiglitz trying to explain why neoliberalism does not work out. Please accept that this world is full of failure of all kinds. If you deny that, all your models and ideas will only be failures, too. All I am saying is that we should accept that dead sectors, braindead firmware programmers, production in jungle environments, transportation in rough areas, high temperatures, high humidity, harddisks that have no disks, and so on are facts of life. And only a child's answer can be: "oops" (sorry, could not resist this one ;-)

> Just my 2c.
> (Gerald)

--
Regards,
Stephan
Re: Some very basic questions
On Tue, 21 Oct 2008 13:15:13 -0400 Christoph Hellwig <[EMAIL PROTECTED]> wrote:

> On Tue, Oct 21, 2008 at 07:01:36PM +0200, Stephan von Krawczynski wrote:
> > Sure, but what you say only reflects the ideal world. On a file server
> > you never have that. In fact you do not even have good control over
> > what is going on. Let's say you have a setup that creates, reads and
> > deletes files 24h a day from numerous clients. At two o'clock in the
> > morning some hd decides to partially die. Files get created on it,
> > fill data up to errors, get deleted, and another bunch of data arrives
> > and yet again the fs tries to allocate the same dead areas. You lose a
> > lot more data only because the fs did not map out the already-known
> > dead blocks. Of course you would replace the dead drive later on, but
> > in the meantime you have a lot of fun. In other words: give me a tool
> > to freeze the world right at the time the errors show up, or map out
> > dead blocks (only because it is a lot easier).
>
> When modern disks can't solve the problems with their internal drive
> remapping anymore, you'd better replace them ASAP, as it is a very
> strong disk failure indication. Last year's FAST had some very
> interesting statistics showing this in the field.

And of course a disk is always a disk, right?

--
Regards,
Stephan
Re: Some very basic questions
On Tue, 21 Oct 2008 18:09:40 +0200 Andi Kleen <[EMAIL PROTECTED]> wrote:

> While that's true today, I'm not sure it has to be true always. I always
> thought traditional fsck user interfaces were a UI disaster and could be
> done much better with some simple tweaks.
> [...]

You are completely right.

> -Andi

--
Regards,
Stephan
Re: Some very basic questions
On Tue, 21 Oct 2008 18:59:26 +0200 Andi Kleen <[EMAIL PROTECTED]> wrote:

> Stephan von Krawczynski <[EMAIL PROTECTED]> writes:
> > Yes, we hear and say that all the time; name one linux fs doing it,
> > please.
>
> ext[234] support it to some extent. It has some limitations (especially
> when the files are large, and you shouldn't do too much follow-on IO to
> prevent the data from being overwritten) and the user frontends are not
> very nice, but it's there.

Well, they must be pretty ugly; I really never heard of that. But really, it is not very important, because extX is completely useless with TB-size disks unless you feel good waiting hours for fsck (I did, and will never do it again). _All_ customers we deployed ext3 to urged us to go back to reiserfs3 ...

> -Andi

--
Regards,
Stephan
Re: Some very basic questions
On Tue, 21 Oct 2008 11:34:20 -0400 jim owens <[EMAIL PROTECTED]> wrote:

> Hearing what users think they want is always good, but...
>
> Stephan von Krawczynski wrote:
> > thanks for your feedback. Understand "minimum requirement" as the
> > minimum requirement to drop the current installation and migrate the
> > data to a new fs platform.
>
> I would sure like to know what existing platform and filesystem you have
> that you think has all 10 of your features.

Obviously none, else I would not speak up and try to find one. :-)

> [...]
> > 1) parallel mounts
>
> What I see from that explanation is you have a system design idea using
> parallel machines to fix problems you have had in the past. To implement
> your design, you need a filesystem to fit it.

Well, I can hardly deny that. Let's just name the (simple) problem, different names for the very same thing: uptime, availability, redundancy.

> I think it is better to just design a filesystem without the problems
> and configure the hardware to handle the necessary load.

Ok, now you see me astonished. You really think that there is one piece of software around that is without problems? My idea of the world is really very different from that: the world is far from perfect. That is why I try to deploy solutions that have redundancy for all kinds of problems I can think of, and hopefully for a few that I haven't thought of.

> > 2) mounting must not delay the system startup significantly
> > 3) errors in parts of the fs are no reason for a fs to go offline as a
> >    whole
> > 4) power loss at any time must not corrupt the fs
> > 5) fsck on a mounted fs, interactively, not part of the mount (all
> >    fsck features)
>
> I think all of these are part of the reliability goal for btrfs, and
> when you say fsck it is probably misleading if I understand your real
> requirement to be the same as my customers':
>
> - *NO* fsck
> - filesystem design prevents problems we have had before
> - filesystem autodetects, isolates, and (possibly) repairs errors
> - online scan, check, repair filesystem tool initiated by admin
> - reliability so high that they never run that check-and-fix tool

That is _wrong_ (to a certain extent). You _want to run_ diagnostic tools to make sure that there is no problem. And you don't want some software (not even HAL) to repair errors without prior admin knowledge/permission.

> Note that I personally have never seen a first release meet the "no
> problems, no need to fix" criteria that would obviate any need for a
> check/fix tool.

That really does not depend on the release number of _your_ special software. Your software always depends on other components (hw or sw) that (can) have bugs and weird behaviour. And this is the fact: no perfect world, so don't count on your or others' perfection. If you do, you will fail.

> jim

--
Regards,
Stephan
Re: Some very basic questions
On Tue, 21 Oct 2008 13:49:43 -0400 Chris Mason [EMAIL PROTECTED] wrote: On Tue, 2008-10-21 at 18:27 +0200, Stephan von Krawczynski wrote: 2. general requirements - fs errors without file/dir names are useless - errors in parts of the fs are no reason for a fs to go offline as a whole These two are in progress. Btrfs won't always be able to give a file and directory name, but it will be able to give something that can be turned into a file or directory name. You don't want important diagnostic messages delayed by name lookup. That's a point I really never understood. Why is it non-trivial for a fs to know what file or dir (name) it is currently working on? The name lives in block A, but you might find a corruption while processing block B. Block A might not be in ram anymore, or it might be in ram but locked by another process. On top of all of that, when we print errors it's because things haven't gone well. They are deep inside of various parts of the filesystem, and we might not be able to take the required locks or read from the disk in order to find the name of the thing we're operating on. Ok, this is interesting. In another thread I was told parallel mounts are really complex and you cannot do good things in such an environment that you can do with single mount. Well, then, why don't we do it? All boxes I know have tons of RAM, but fs finds no place in RAM to put large parts (if not all) of the structural fs data including filenames? Besides the simple fact that RAM is always faster than any known disk be it rotating or not, and that RAM is just there, whats the word for not doing it? - parallel mounts (very important!) (two or more hosts mount the same fs concurrently for reading and writing) As Jim and Andi have said, parallel mounts are not in the feature list for Btrfs. Network filesystems will provide these features. Can you explain what network filesystems stands for in this statement, please name two or three examples. 
NFS (done), CRFS (under development), maybe Ceph as well, which is also under development.

NFS is a good example of a fs that never got redesigned for the modern world. I hope it will be, but currently it's like a Model T on a highway. You have an NFS server with clients. Your NFS server dies, and your backup server cannot take over the clients without them resetting their NFS link (which means a reboot for many applications) - no way. Besides that, you still need another fs below NFS to bring your data onto some medium, which means you still have the problem of how to create redundancy in your server architecture.

- versioning (file and dir)

From a data structure point of view, version control is fairly easy. From a user interface and policy point of view, it gets difficult very quickly. Aside from snapshotting, version control is outside the scope of btrfs. There are lots of good version control systems available; I'd suggest you use them instead.

To me versioning sounds like a not-so-easy-to-implement feature. Nevertheless I trust your experience. If a basic implementation is possible and not too complex, why deny a feature?

In general I think snapshotting solves enough of the problem for most of the people most of the time. I'd love for Btrfs to be the perfect FS, but I'm afraid everyone has a different definition of perfect. Storing multiple versions of something is pretty easy. Making a usable interface around those versions is the hard part, especially because you need groups of files to be versioned together in atomic groups (something that looks a lot like a snapshot). Versioning is solved in userspace. We would never be able to implement everything that git or mercurial can do inside the filesystem.

Well, quite often the question is not about whole trees of data to be versioned. Even single (few) files or dirs can be of interest. And you want people to set up a complete user space monster to version three openoffice documents (only a rather flawed example, of course)?
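For what it's worth, the userspace route Chris points to need not be a monster for a handful of documents; a throwaway sketch with git (directory and file names invented for the demo):

```shell
# A throwaway working directory standing in for a documents folder.
d=$(mktemp -d)
cd "$d"
git init -q
git config user.email demo@example.com   # local identity so commits work anywhere
git config user.name demo

# Version a document: one command per revision.
echo "draft 1" > report.odt
git add report.odt
git commit -qm "first draft"
echo "draft 2" > report.odt
git commit -qam "second draft"

# Any old revision is one command away.
git show HEAD~1:report.odt               # → draft 1
```

Whether that counts as "basic" enough is of course exactly the disagreement in this thread.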
Lots of people need a basic solution, not the groundbreaking answer to all questions.

- undelete (file and dir)

Undelete is easy.

Yes, we hear and say that all the time. Name one linux fs doing it, please.

The fact that nobody is doing it is not a good argument for why it should be done ;)

Believe me, if NTFS had a simple undelete tool come with it, we (in linux fs) would have it, too. Why do we always want to be _second best_?

Undelete is a policy decision about what to do with files as they are removed. I'd much rather see it implemented above the filesystems instead of individually in each filesystem. This doesn't mean I'll never code it; it just means it won't get implemented directly inside of Btrfs. In comparison with all of the other features pending, undelete is pretty far down on the list.
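The "above the filesystems" approach Chris prefers can indeed be tiny; a sketch of a trash-style wrapper (the `trash`/`undelete` names and layout are invented here, not real tools):

```shell
# "Deleting" moves files into a holding area; undelete is just a move back.
TRASH=$(mktemp -d)
trash()    { mv -- "$1" "$TRASH/"; }
undelete() { mv -- "$TRASH/$1" .; }

work=$(mktemp -d)
cd "$work"
echo important > notes.txt

trash notes.txt                       # gone from the directory...
[ -e notes.txt ] || echo "deleted"    # → deleted
undelete notes.txt                    # ...until someone asks for it back
cat notes.txt                         # → important
```

This is policy in userspace, exactly the split the reply argues for; desktop trash implementations work on the same principle.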
Re: Some very basic questions
On Wed, 22 Oct 2008 05:48:30 -0700 Jeff Schroeder [EMAIL PROTECTED] wrote:

NFS is a good example of a fs that never got redesigned for the modern world. I hope it will be, but currently it's like a Model T on a highway. You have an NFS server with clients. Your NFS server dies, and your backup server cannot take over the clients without them resetting their NFS link (which means a reboot for many applications) - no way. Besides that, you still need another fs below NFS to bring your data onto some medium, which means you still have the problem of how to create redundancy in your server architecture.

You are somewhat misinformed on this. Perhaps the Linux nfs server can't cope, but I doubt it. NFS was designed to be stateless. I've got a fair amount of experience with a dual-head NetApp architecture. When one head dies, the other transparently fails over. During the brief downtime, the clients will go into I/O wait, if at all, instead of being disconnected. You might be able to do something similar using nfsd and keepalived if both servers were connected to the same storage. Setting that up would be trivial. You just need the clients mounting the VIP and a reliable mechanism to provide the data from that VIP. You could use heartbeat, but it is overly complex. Also look at clustered NFS or pNFS, both of which are NFS redesigns like you speak of.

We tried that with pure linux nfs, and it does not work. The clients do not recover. After trying ourselves and failing we found several docs on the net that described just the same problem and its reasons. Very likely NetApp found that out too and did something about it. Ah yes, and btw, your description contains another discussed problem: "both servers were connected to the same storage". If you mean that both servers really access the same storage at the same time, your software options are pretty few in number.
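For reference, the keepalived VIP setup Jeff mentions might look roughly like the fragment below. This is an untested sketch: the interface name, router id, priority, and address are all placeholders, and whether clients actually recover across a failover is exactly what is disputed above.

```
vrrp_instance NFS_VIP {
    state MASTER            # BACKUP on the standby head
    interface eth0
    virtual_router_id 51
    priority 100            # lower (e.g. 90) on the standby
    advert_int 1
    virtual_ipaddress {
        192.168.1.100/24    # clients mount NFS from this VIP only
    }
}
```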
-- Jeff Schroeder

--
Regards,
Stephan
Re: Some very basic questions
On Tue, 21 Oct 2008 14:13:33 +0200 Andi Kleen [EMAIL PROTECTED] wrote:

Stephan von Krawczynski [EMAIL PROTECTED] writes:

Reading the list for a while, it looks like all kinds of implementational topics are covered but no basic user requests or talks are going on. Since I have found no other list on vger covering these issues I chose this one; forgive my ignorance if it is the wrong place. Like many people on the planet we try to handle quite some amounts of data (TBs) and try to solve this with several linux-based fileservers. Years of (mostly bad) experience led us to the following minimum requirements for a new fs on our servers:

If those are the minimum requirements, what are the maximum ones? Also, you realize that some of the requirements (like parallel read/write, aka a full cluster file system) are extremely hard? Perhaps it would make more sense if you extracted the top 10 items, ranked them by importance, and posted again.

Hello Andi, thanks for your feedback. Understand "minimum requirement" as the minimum requirement to drop the current installation and migrate the data to a new fs platform. Of course you are right, dealing with multiple/parallel mounts can be quite a nasty job if the fs was not originally planned with this feature in mind. On the other hand I cannot really imagine how to deal with TBs of data in the future without such a feature. If you look at the big picture, the things I mentioned allow you to have redundant front-ends for the fileservice doing the same or completely different applications. You can use one mount (host) for tape backup purposes only, without heavy loss in standard file service. You can even mount for filesystem check purposes: a box that does nothing else but check the structure and keep you informed what is really going on with your data - and your data is still in production in the meantime.
Whatever happens, you have a real chance of keeping your file service up, even if parts of your fs go nuts because some underlying hd got partially damaged. Keeping it up and running is the most important part; performance is only second on the list. If you take a close look there are not really 10 different items on my list, depending on the level of abstraction you prefer, nevertheless:

1) parallel mounts
2) mounting must not delay the system startup significantly
3) errors in parts of the fs are no reason for a fs to go offline as a whole
4) power loss at any time must not corrupt the fs
5) fsck on a mounted fs, interactively, not part of the mount (all fsck features)
6) journaling
7) undelete (file and dir)
8) resizing during runtime (up and down)
9) snapshots
10) performant handling of large numbers of files inside single dirs

--
Regards,
Stephan
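Item 10 on the list is at least easy to probe from userspace; a crude sketch (the file count of 2000 is arbitrary):

```shell
# Populate a single directory with many files; designs that scan
# directories linearly degrade on exactly this workload.
d=$(mktemp -d)
i=1
while [ "$i" -le 2000 ]; do : > "$d/f$i"; i=$((i+1)); done

# A full listing and a single name lookup are the operations to watch.
ls "$d" | wc -l                     # → 2000
[ -e "$d/f1234" ] && echo found     # → found
```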
Re: Some very basic questions
On Tue, 21 Oct 2008 09:20:16 -0400 jim owens [EMAIL PROTECTED] wrote:

btrfs has many of the same goals... but they are goals, not code, so when you might see them is indeterminate.

No big issue, my pension is 20 years away, I got time ;-)

I believe these should not be in btrfs:

Stephan von Krawczynski wrote:

- parallel mounts (very important!)

As Andi said, you want a cluster or distributed fs. There are layered designs (CRFS or network filesystems) that can do the job, and trying to do it in btrfs causes too many problems.

Question is: if you had such an implementation, would there be drawbacks to be expected for the single-mount case? If not, I'd vote for it, because there are not really many alternatives on the market.

- journaling

I assume you *do not* mean metadata journaling; you mean sending all file updates to a single output stream (as in one disk, tape, or network link). I've done that, but would not recommend it in btrfs because it limits the total fs bandwidth to what the single stream can support. This is normally done today by applications like databases, not in the filesystem.

As far as I know metadata journaling is in, right? If what you mean is capable of creating live or offline images of the fs, you got me right.

- map out dead blocks

Useless... a waste of time, code, and metadata structures. With current device technology, any device reporting bad blocks that it cannot map out itself is about to die and needs to be replaced!

Sure, but what you say only reflects the ideal world. On a file service, you never have that. In fact you do not even have good control over what is going on. Let's say you have a setup that creates, reads and deletes files 24h a day from numerous clients. At two o'clock in the morning some hd decides to partially die. Files get created on it, fill data up to errors, get deleted, and another bunch of data arrives, and yet again the fs tries to allocate the same dead areas.
You lose a lot more data only because the fs did not map out the already known dead blocks. Of course you would replace the dead drive later on, but in the meantime you have a lot of fun. In other words: give me a tool to freeze the world right at the time the errors show up, or map out dead blocks (only because it is a lot easier).

jim

--
Regards,
Stephan
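For what it's worth, the ext family has long shipped the kind of manual dead-block map Stephan asks for; a sketch on a loopback image (e2fsprogs availability assumed, block numbers invented):

```shell
# Build a small ext2 image in a plain file; no root or real disk needed.
img=$(mktemp)
dd if=/dev/zero of="$img" bs=1M count=8 2>/dev/null
mke2fs -F -q -b 1024 "$img"

# Pretend blocks 5000 and 6000 were reported bad, and map them out.
bad=$(mktemp)
printf '5000\n6000\n' > "$bad"
e2fsck -f -y -l "$bad" "$img" > /dev/null || true  # e2fsck exits nonzero after changes

# The blocks now live in the bad-block inode and are never allocated again.
dumpe2fs -b "$img" 2>/dev/null                     # lists 5000 and 6000
```

Whether such a map still makes sense when drives remap sectors internally is exactly jim's objection; the mechanism itself is cheap either way.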