Re: Question of stability

2010-09-20 Thread Stephan von Krawczynski
On Mon, 20 Sep 2010 07:30:57 -0400
Chris Mason chris.ma...@oracle.com wrote:

 On Mon, Sep 20, 2010 at 11:00:08AM +, Lubos Kolouch wrote:
  No, not stable!
  
  Again, after powerloss, I have *two* damaged btrfs filesystems.
 
 Please tell me more about your system.  I do extensive power fail
 testing here without problems, and corruptions after powerloss are very
 often caused by the actual hardware.
 
 So, what kind of drives do you have, do they have writeback caching on,
 and what are you layering on top of the drive between btrfs and the
 kernel?
 
 -chris

Chris, the actual way a fs got damaged should not be relevant. From a new fs
design one should expect that the tree can be mounted no matter what corruption
took place, up to the extreme case where the fs is empty after mounting because
it was completely corrupted. If parts were corrupt then the fs should either be
able to assist the user in correcting the damage _online_ or at least simply
exclude the damaged fs parts from the actual mounted fs tree. The basic
thought must be "show me what you have" and not "shit, how do I get access to
the working but not mountable fs parts again?".
Would you buy a car that refuses to drive if the ash tray is broken?

-- 
Regards,
Stephan



Re: SSD Optimizations

2010-03-13 Thread Stephan von Krawczynski
On Thu, 11 Mar 2010 13:00:17 -0500
Chris Mason chris.ma...@oracle.com wrote:

 On Thu, Mar 11, 2010 at 06:35:06PM +0100, Stephan von Krawczynski wrote:
  On Thu, 11 Mar 2010 15:39:05 +0100
  Sander san...@humilis.net wrote:
  
   Stephan von Krawczynski wrote (ao):
Honestly I would just drop the idea of an SSD option simply because the
vendors implement all kinds of neat strategies in their devices. So in 
the end
you cannot really tell if the option does something constructive and not
destructive in combination with a SSD controller.
   
   My understanding of the ssd mount option is also that the fs doesn't try
   to do all kinds of smart (and potentially expensive) things which make
   sense for rotating media to reduce seeks and the like.
   
 Sander
  
  Such an optimization sounds valid at first sight. But think about it more
  closely: how does the fs really know about the seeks needed during some
  operation?
 
 Well the FS makes a few assumptions (in the nonssd case).  First it
 assumes the storage is not a memory device.  If things would fit in
 memory we wouldn't need filesystems in the first place.

Ok, here is the bad news. This assumption can be anything from right to
completely wrong, and you cannot really tell what the mainstream case is.
Two examples from opposite ends of the technology world:
- History: way back in the 80's there was 3rd-party hardware for the C=1541
(the floppy drive of the C=64) that read in the complete floppy and served all
incoming requests from a RAM buffer. So your assumption can already be wrong
for a trivial floppy drive from ancient times.
- Nowadays: for a Linux installation today, chances are that the matrix has
you. Quite a lot of installations are virtualized, so your storage is virtual
as well, which means it is likely backed by an fs buffer of the host system,
i.e. RAM.
And sorry to say: even if things did fit in memory you would probably still
need a fs, simply because there is no practical way to organize data (be it
executable or not) in RAM without a fs layer. You can't save data without an
abstract file data type, and to have one accessible you need a fs.
Btw, the other way round is just as interesting: there is currently no fs for
Linux that knows how to execute in place. Meaning: if you really had only RAM
and a fs to organize your data, it would be only logical to have a way to _not_
load data (into other parts of the RAM), but to use it in its original (RAM)
storage space.

 Then it assumes that adjacent blocks are cheap to read and blocks that
 are far away are expensive to read.  Given expensive raid controllers,
 cache, and everything else, you're correct that sometimes this
 assumption is wrong.

As already mentioned, this assumption may be completely wrong even without a
raid controller, e.g. inside a virtual environment. Even far-away blocks can
be one byte away in the next fs buffer of the underlying host fs (assuming
your device is in fact a file on the host ;-).

  But, on average seeking hurts.  Really a lot.

Yes, seeking hurts. But there is no way to know if there is seeking at all.
On the other hand, if your storage is a netblock device seeking on the server
is probably your smallest problem, compared to the network latency in between.
 
 We try to organize files such that files that are likely to be read
 together are found together on disk.  Btrfs is fairly good at this
 during file creation and not as good as ext*/xfs as files are
 overwritten and modified again and again (due to cow).

You are basically saying that btrfs perfectly organizes write-once devices ;-)

 If you turn mount -o ssd on for your drive and do a test, you might not
 notice much difference right away.  ssds tend to be pretty good right
 out of the box.  Over time it tends to help, but it is a very hard thing
 to benchmark in general.

Honestly, this sounds like "I give up" to me ;-)
You just said that generally it is very hard to benchmark, which in non-tech
language means nobody can see or feel it in the real world.
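
For reference, switching the behaviour on or off is just a mount option, so an
A/B comparison is at least cheap to set up. A minimal sketch of doing the
equivalent of mount -o ssd from C via mount(2); the device and mountpoint names
below are placeholders, not taken from this thread:

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* /dev/sda2 and /mnt are placeholders for a real btrfs device and
     * mountpoint; this needs root privileges to succeed. */
    if (mount("/dev/sda2", "/mnt", "btrfs", 0, "ssd") != 0) {
        perror("mount");
        return 1;
    }
    printf("mounted with -o ssd\n");
    return 0;
}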

Please understand that I am the last one criticizing your and others' brilliant
work and the time you spend on btrfs. I only believe that if you spent one
hour on some fs like glusterfs for every 10 hours you spend on btrfs you would
be both king and queen of the linux HA community :-)
(but probably unemployed, so I can't really beat you for it)
 
 -chris

-- 
Regards,
Stephan



Re: SSD Optimizations

2010-03-12 Thread Stephan von Krawczynski
On Fri, 12 Mar 2010 02:07:40 +0100
Hubert Kario h...@qbs.com.pl wrote:

  [...]
  If the FS were to be smart and know about the 256kb requirement, it
  would do a read/modify/write cycle somewhere and then write the 4KB.
 
 If all the free blocks have been TRIMmed, the FS should pick a completely free
 erase-size block and write those 4KiB of data.
 
 Correct implementation of wear leveling in the drive should notice that the
 write is entirely inside a free block and make just a write cycle, adding
 zeros to the end of the supplied data.

Your assumption here is that your _addressed_ block layout is completely
identical to the SSD's internal layout. Otherwise you cannot know where a free
erase block is located and how to address it from the FS.
I really wonder what this assumption is based on. You still think an SSD is a
true disk with linear addressing. I doubt that very much. Even on true
spinning disks your assumption is wrong for relocated sectors, which basically
means that every disk controller firmware has been fiddling around with the
physical layout for decades. Please accept that you cannot do a disk's job in
the FS. The more advanced technology gets, the more disks become black boxes
with a defined software interface. Use this interface and drop the idea of
having inside knowledge of such a device. That's other people's work. If you
want to design smart SSD controllers, apply at a company that builds them.

-- 
Regards,
Stephan


Re: SSD Optimizations

2010-03-11 Thread Stephan von Krawczynski
On Thu, 11 Mar 2010 11:59:57 +0100
Hubert Kario h...@qbs.com.pl wrote:

 On Thursday 11 March 2010 08:38:53 Sander wrote:
  Hello Gordan,
  
  Gordan Bobic wrote (ao):
   Mike Fedyk wrote:
   On Wed, Mar 10, 2010 at 11:49 AM, Gordan Bobic gor...@bobich.net wrote:
   Are there options available comparable to ext2/ext3 to help reduce
   wear and improve performance?
  
  With SSDs you don't have to worry about wear.
 
 Sorry, but you do have to worry about wear. I was able to destroy a relatively
 new SD card (2007 or early 2008) just by writing on the first 10MiB over and
 over again for two or three days. The end of the card still works without
 problems but about 10 sectors at the beginning give write errors.

Sorry, the topic was SSD, not SD. SSDs have controllers that contain heavy
closed-source magic to circumvent all kinds of trouble you get when using
classical flash and SD cards.
Honestly I would just drop the idea of an SSD option, simply because the
vendors implement all kinds of neat strategies in their devices. So in the end
you cannot really tell if the option does something constructive rather than
destructive in combination with an SSD controller.
Of course you may well discuss an option for passive flash devices like
ide-CF/SD or the like. There is no controller involved, so your fs
implementation may well work out.

-- 
Regards,
Stephan



Re: SSD Optimizations

2010-03-11 Thread Stephan von Krawczynski
On Thu, 11 Mar 2010 12:17:30 +
Gordan Bobic gor...@bobich.net wrote:

 On Thu, 11 Mar 2010 12:31:03 +0100, Stephan von Krawczynski
 sk...@ithnet.com wrote:
On Wed, Mar 10, 2010 at 11:49 AM, Gordan Bobic gor...@bobich.net wrote:
Are there options available comparable to ext2/ext3 to help reduce
wear and improve performance?
   
   With SSDs you don't have to worry about wear.
  
  Sorry, but you do have to worry about wear. I was able to destroy a
  relatively new SD card (2007 or early 2008) just by writing on the first
  10MiB over and over again for two or three days. The end of the card still
  works without problems but about 10 sectors at the beginning give write
  errors.
  
  Sorry, the topic was SSD, not SD.
 
 SD == SSD with an SD interface.

That really is quite a statement. You really talk of a few-bucks SD card (like
the one in my Android phone) as an SSD comparable with an Intel XE, only with a
different interface? Come on, stay serious. Such a product is made of more than
just SLCs and some raw logic.
 
  SSDs have controllers that contain heavy
  closed magic to circumvent all kinds of troubles you get when using
  classical flash and SD cards.
 
 There is absolutely no basis for thinking that SD cards don't contain wear
 leveling logic. SD standard, and thus SD cards support a lot of fancy copy
 protection capabilities, which means there is a lot of firmware involvement
 on SD cards. It is unlikely that any reputable SD card manufacturer
 wouldn't also build wear leveling logic into it.

I really don't want to guess about what is built into an SD or even a CF card.
But we hopefully agree that there is a significant difference compared to a
product that calls itself a _disk_.
 
  Honestly I would just drop the idea of an SSD option simply because the
  vendors implement all kinds of neat strategies in their devices. So in
 the
  end you cannot really tell if the option does something constructive and
 not
  destructive in combination with a SSD controller.
 
 You can make an educated guess. For starters given that visible sector
 sizes are not equal to FS block sizes, it means that FS block sizes can
 straddle erase block boundaries without the flash controller, no matter how
 fancy, being able to determine this. Thus, at the very least, aligning FS
 structures so that they do not straddle erase block boundaries is useful in
 ALL cases. Thinking otherwise is just sticking your head in the sand
 because you cannot be bothered to think.

And your guess is that Intel engineers had no clue when designing the XE,
including its controller? You think they did not know what you and I know and
therefore pray every day that some smart fs designer falls from heaven and
saves their product from dying in between? Really?
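
For what it's worth, the alignment argument quoted above is easy to state in
code. A minimal sketch, assuming a hypothetical 256 KiB erase block (real
devices vary and rarely publish the figure), that checks whether a write at a
given byte offset and length would straddle an erase-block boundary:

#include <stdint.h>
#include <stdio.h>

/* Assumed erase-block size; real SSDs vary and rarely publish it. */
#define ERASE_BLOCK (256 * 1024ULL)

/* Returns 1 if a write of 'len' bytes at byte offset 'off' crosses an
 * erase-block boundary, 0 otherwise. */
static int straddles_erase_block(uint64_t off, uint64_t len)
{
    if (len == 0)
        return 0;
    return (off / ERASE_BLOCK) != ((off + len - 1) / ERASE_BLOCK);
}

int main(void)
{
    /* A 4 KiB write just below a boundary straddles it ... */
    printf("%d\n", straddles_erase_block(ERASE_BLOCK - 2048, 4096)); /* 1 */
    /* ... while an aligned 4 KiB write does not. */
    printf("%d\n", straddles_erase_block(ERASE_BLOCK, 4096));        /* 0 */
    return 0;
}

Whether issuing the write differently actually helps a given controller is
exactly the open question of this thread; the check itself is the easy part.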
 
  Of course you may well discuss about an option for passive flash devices
  like ide-CF/SD or the like. There is no controller involved so your fs
  implementation may well work out.
 
 I suggest you educate yourself on the nature of IDE and CF (which is just
 IDE with a different connector). There most certainly are controllers
 involved. The days when disks (mechanical or solid state) didn't integrate
 controllers ended with MFM/RLL and ESDI disks some 20+ years ago.

I suggest you don't lecture someone who has been administering some hundred
boxes based on CF and SSD media for _years_ about the pros and cons of the
respective implementations and their long-term usage.
Sorry, the world is not built out of paper; sometimes you meet the hard facts.
And one of them is that the ssd option in the fs is very likely already
overtaken by the SSD controller designers and mostly _superfluous_. The market
has already decided to make SSDs compatible with standard fs layouts.

-- 
Regards,
Stephan



Re: SSD Optimizations

2010-03-11 Thread Stephan von Krawczynski
On Thu, 11 Mar 2010 15:01:55 +0100
Hubert Kario h...@qbs.com.pl wrote:

 [...]
 The _SD_standard_ states that the media has to implement wear-leveling.
 So any card with an SD logo implements it.
 
 As I stated previously, the algorithms used in SD cards may not be as advanced
 as those in top-of-the-line Intel SSDs, but I bet they don't differ by much
 from the ones used in the cheapest SSD drives.

Well, we are all pretty sure about that. And that is exactly the reason why
these will not survive the market pressure. Why should one care about bad
products that may already be extinct because of their bad performance by the
time the fs is production-ready some day?
 
 Besides, why shouldn't we help the drive firmware by 
 - writing the data only in erase-block sizes
 - trying to write blocks that are smaller than the erase-block in a way that 
 won't cross the erase-block boundary

Because if the designing engineer of a good SSD controller weren't able to cope
with that, he would have no chance to design a second one.

 - using TRIM on deallocated parts of the drive

That is a different story. TRIM is a designed part of a software interface
between fs and drive firmware, with a usage pattern both sides agreed on. The
points above, in contrast, are pure guesswork based on dumb, old hardware and
its behaviour.
 
 This will not only increase the life of the SSD but also increase its 
 performance.

TRIM: maybe yes. Rest: pure handwaving.
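
As an aside on what such an agreed software interface looks like from
userspace: on kernels and filesystems that support it, free ranges can be
handed back to the device via the FITRIM ioctl (this is what fstrim(8) drives).
A minimal sketch; the /mnt path is a placeholder, and FITRIM support on a given
kernel/fs combination is an assumption:

#include <fcntl.h>
#include <linux/fs.h>      /* FITRIM, struct fstrim_range */
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    /* /mnt is a placeholder; any path on the mounted filesystem works. */
    int fd = open("/mnt", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct fstrim_range range = {
        .start  = 0,
        .len    = UINT64_MAX,   /* cover the whole filesystem */
        .minlen = 0,            /* no minimum extent size */
    };

    /* Ask the fs to issue discard/TRIM for all free ranges in [start, len). */
    if (ioctl(fd, FITRIM, &range) < 0)
        perror("ioctl(FITRIM)");
    else
        printf("trimmed %llu bytes\n", (unsigned long long)range.len);

    close(fd);
    return 0;
}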

 [...]
   And your guess is that intel engineers had no glue when designing the XE
   including its controller? You think they did not know what you and me
   know and
   therefore pray every day that some smart fs designer falls from heaven
   and saves their product from dying in between? Really?
  
  I am saying that there are problems that CANNOT be solved on the disk
  firmware level. Some problems HAVE to be addressed higher up the stack.
 
 Exactly, you can't assume that the SSDs firmware understands any and all file 
 system layouts, especially if they are on fragmented LVM or other logical 
 volume manager partitions.

Hopefully the firmware understands exactly no fs layout at all. That would be
braindead. Instead it should understand how to arrange incoming and outgoing
data in a way that its own technical requirements are met as well as possible.
This is no spinning disk; it is completely irrelevant what the data layout
looks like as long as the controller finds its way through and copes best with
read/write/erase cycles. It may well use additional RAM for caching and data
reordering.
Do you really believe ascending block numbers are placed at ascending
addresses inside the disk (as an example)? Why should they be? What does that
mean for fs block ordering? If you don't know what a controller does to your
data ordering anyway, how do you want to help it with its job?
Please accept that we are _not_ talking about trivial flash memory here or
pseudo-SSDs consisting of SD cards. The market has already evolved better
products. The dinosaurs are extinct even if some still look alive.

 -- 
 Hubert Kario

-- 
Regards,
Stephan



Re: SSD Optimizations

2010-03-11 Thread Stephan von Krawczynski
On Thu, 11 Mar 2010 15:39:05 +0100
Sander san...@humilis.net wrote:

 Stephan von Krawczynski wrote (ao):
  Honestly I would just drop the idea of an SSD option simply because the
  vendors implement all kinds of neat strategies in their devices. So in the 
  end
  you cannot really tell if the option does something constructive and not
  destructive in combination with a SSD controller.
 
 My understanding of the ssd mount option is also that the fs doesn't try
 to do all kinds of smart (and potentially expensive) things which make
 sense for rotating media to reduce seeks and the like.
 
   Sander

Such an optimization sounds valid at first sight. But think about it more
closely: how does the fs really know about the seeks needed during some
operation? If your disk is a single-platter one, your seeks are completely
different from a multi-platter one. So even a simple case is more or less
unpredictable. If you consider a RAID or SAN as the device base, it should be
clear that trying to optimize for certain device types is just fake. What does
that tell you? The optimization was a pure loss of work hours in the first
place. In fact, if you look at this list, a lot of the discussions going on are
highly academic and have no real usage scenario.
Sometimes trying to be super-smart is indeed not useful (for a fs) ...
-- 
Regards,
Stephan


Re: severe hardlink bug

2010-01-24 Thread Stephan von Krawczynski
On Sun, 24 Jan 2010 09:09:44 +0100
Goffredo Baroncelli kreij...@gmail.com wrote:

 On Sunday 24 January 2010, Michael Niederle wrote:
  I'm using btrfs with a kernel 2.6.32.2 (builtin) as the root file system of a
  Gentoo Linux installation.
  
  While attempting to install the plt-scheme package a strange error about link
  counts occurred ([Error 31] Too many Links).
 
 See this thread:
 
 Mass-Hardlinking Oops - http://thread.gmane.org/gmane.comp.file-
 systems.btrfs/3427
 
 There is a limit on the number of hardlinks for a file. The maximum number of
 links depends on the name length.

Honestly, this dependency is braindead. How do the fs authors think an
application programmer should judge how many hardlinks are possible on a
certain fs? This is a really bad design issue. Are we really in the year 2010?
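
To give a feeling for the dependency: all back-references for a file's links
have to fit into a single metadata leaf, so shorter names allow more links. A
back-of-the-envelope sketch; the constants (roughly 3900 usable bytes per
4 KiB leaf, about 10 bytes of per-link overhead) are illustrative assumptions,
not authoritative btrfs numbers:

#include <stdio.h>

/* Illustrative assumptions only -- not authoritative btrfs constants. */
#define USABLE_LEAF_BYTES 3900   /* ~4 KiB leaf minus headers            */
#define PER_LINK_OVERHEAD 10     /* per back-reference bookkeeping bytes */

/* Rough estimate of how many hardlinks fit for a given link-name length. */
static unsigned est_max_links(unsigned name_len)
{
    return USABLE_LEAF_BYTES / (PER_LINK_OVERHEAD + name_len);
}

int main(void)
{
    unsigned lens[] = { 4, 16, 64, 255 };
    for (unsigned i = 0; i < sizeof(lens) / sizeof(lens[0]); i++)
        printf("name length %3u -> ~%u links\n", lens[i], est_max_links(lens[i]));
    return 0;
}

The point of the complaint stands either way: an application cannot know those
numbers in advance, it only sees the EMLINK error when the limit is hit.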

 BR
 Goffredo

-- 
Regards,
Stephan



Re: Phoronix article slaming BTRFS

2009-06-25 Thread Stephan von Krawczynski
On Wed, 24 Jun 2009 19:38:37 +0200
Jens Axboe jens.ax...@oracle.com wrote:

 [...]
 It's easy to throw cache at the problem and make it faster. That's like
 shaving weight off a car. Might make it go faster, definitely wont make
 it safer.

Interestingly nobody talks about the other end of the ssd market. Ok, a CF
card isn't really an SSD, but it is basically the same technology without a
very intelligent controller in front. So if you really want to see improvements
from the ssd option, this might be the most visible platform to play with. And
again, this is indeed a mainstream market: lots of routers and other embedded
gadgets use it - currently mostly implementing RAM disks for performance
reasons.

-- 
Regards,
Stephan


Re: Btrfs development plans

2009-04-21 Thread Stephan von Krawczynski
On Mon, 20 Apr 2009 12:38:57 -0400
Chris Mason chris.ma...@oracle.com wrote:

 On Mon, 2009-04-20 at 18:10 +0200, Ahmed Kamal wrote:
But now Oracle can re-license Solaris and merge ZFS with btrfs.
   Just kidding, I don't think it would be technically feasible.
  
  
  May I suggest the name ZbtrFS :)
  Sorry couldn't resist. On a more serious note though, is there any
  technical benefits that justify continuing to push money in btrfs
 
 The short answer from my point of view is yes.  This doesn't really
 change the motivations for working on btrfs or the problems we're trying
 to solve.

... which sounds logical to me. From looking at the project for a while one
can see you are trying to solve problems that are not really Linux' own...

-- 
Regards,
Stephan


Re: Some very basic questions

2008-10-27 Thread Stephan von Krawczynski
On Wed, 22 Oct 2008 16:35:55 +0200
dbz [EMAIL PROTECTED] wrote:

 concerning this discussion, I'd like to put up some requests which 
 strongly oppose to those brought up initially:
 
 - if you run into an error in the fs structure or any IO error that prevents
 you from bringing the fs into a consistent state, please simply oops. If a
 user feels that availability is a main issue, he has to use a failover
 solution. In this case a fast and clean cut is desirable and no
 pray-and-hope mode or 90% mode. If availability is not the issue, it is
 in any case most important that data on the fs is safe. If you don't oops,
 you risk doing further damage to the filesystem and ending up with a
 completely destroyed fs.

Hi Gerald,

this is a good proposal to explain why most failover setups do indeed not
work. If you look at numerous internet howtos about building failover you will
recognise that 95% talk about servers that synchronise their fs by all kinds of
tools _offline_, like drbd - or choose some network-dependent raid, like nbd
or enbd. All these have in common that they are unreliable precisely because of
the mount needed during failover. In your example: if box 1 oopses because of
some error, chances are that box 2, trying to mount the very same data (which
should be identical because of the raid or sync), will indeed fail to mount,
too. That leaves you with exactly nothing in hand.
 
 - if you get any IO error, please **don't** put up a number of retries or
 anything. If the device reports an error, simply believe it. It is bad enough
 that many block drivers or controllers try to be smart and put up hundreds
 of retries. By adding further retries you only end up wasting hours on
 useless retries. If availability is an issue, the user again has to put up a
 failover solution. Again, a clean cut is what is needed. The user has to
 make sure he uses an appropriate configuration according to the importance of
 his data (mirroring on the fs and/or RAID, failover ...)

Well, this leaves you with my proposal to optionally stop retrying, marking
files or (better) blocks as dead.
 
 - if during mount something unexpected comes up and you can't be sure that
 the fs will work properly, please deny mounting and request a fsck. This can
 be easily handled by a start- or mount-script. During mount, take the time
 you need to ensure that the fs looks proper and safe to use. I'd rather know
 during boot that something is wrong than run with a foul fs and end up
 with data loss or any other mixup later on.

As explained above, it is exactly the lack of parallel mounts that leaves you
without a lot of time during mount. A failover that takes 10 minutes for the
re-mount is no failover, it is sh.t. ext?, btw, hardly ever gets TBs mounted in
under 10 minutes.
 
 - btrfs is no cluster fs, so there is no point of even thinking about it. If 
 somebody feels he needs multiple writeable mounts of the same fs, please use 
 a cluster fs. Of course, you have to live with the tradeoffs. Dreaming of a 
 fs that uses something like witchcraft to do things like locking, quorums, 
 cache synchronisation without penalty and, of course, without any 
 configuration, is pointless.

This reads pretty much like "a processor is a processor and not multiple
processors". We all know today that that time has passed. In 5 years you will
pretty much say the same about single fs vs. cluster fs.
 
 In my opinion, the whole thing comes up from the idea of using cheap hardware 
 and out-of-the-box configurations to keep promises of reliability and 
 availability which are not realistic. There is a reason why there are more 
 expensive HDDs, RAIDs, SANs with volume mirroring, multipathing and so on. 
 Simply ignoring the fact that you have to use the proper tools to address 
 specific problems and pray to the toothfairy to put a 
 solve-all-my-problems-fs under your pillow is no solution. I'd rather have a 
 solid fs with deterministic behavior and some state-of-the-art features.

Well, sorry to say, but I begin to sound a bit like Joseph Stiglitz
trying to explain why neoliberalism does not work out.
Please accept that this world is full of failures of all kinds. If you deny
that, all your models and ideas will only be failures, too.
All I am saying is that we should accept that dead sectors, braindead
firmware programmers, production in jungle environments, transportation in
rough areas, high temperatures, high humidity, hard disks that have no disks
and so on are facts of life. And only a child's answer can be: oops
(sorry, could not resist this one ;-)
 
 Just my 2c.
 (Gerald) 

-- 
Regards,
Stephan


Re: Some very basic questions

2008-10-22 Thread Stephan von Krawczynski
On Tue, 21 Oct 2008 13:15:13 -0400
Christoph Hellwig [EMAIL PROTECTED] wrote:

 On Tue, Oct 21, 2008 at 07:01:36PM +0200, Stephan von Krawczynski wrote:
  Sure, but what you say only reflects the ideal world. On a file service, you
  never have that. In fact you do not even have good control about what is
  going on. Let's say you have a setup that creates, reads and deletes files
  24h a day from numerous clients. At two o'clock in the morning some hd
  decides to partially die. Files get created on it, fill data up to errors,
  get deleted, and another bunch of data arrives, and yet again the fs tries
  to allocate the same dead areas. You lose a lot more data only because the
  fs did not map out the already known dead blocks. Of course you would
  replace the dead drive later on, but in the meantime you have a lot of fun.
  In other words: give me a tool to freeze the world right at the time the
  errors show up, or map out dead blocks (only because it is a lot easier).
 
 When modern disks can't solve the problems with their internal drive
 remapping anymore you better replace them ASAP, as it is a very strong
 disk failure indication.  Last year's FAST had some very interesting
 statistics showing this in the field.

And of course a disk is always a disk, right? 

-- 
Regards,
Stephan



Re: Some very basic questions

2008-10-22 Thread Stephan von Krawczynski
On Tue, 21 Oct 2008 18:09:40 +0200
Andi Kleen [EMAIL PROTECTED] wrote:

 While that's true today, I'm not sure it has to be true always.
 I always thought traditional fsck user interfaces were a
 UI disaster and could be done much better with some simple tweaks. 
 [...]

You are completely right.

 -Andi

-- 
Regards,
Stephan


Re: Some very basic questions

2008-10-22 Thread Stephan von Krawczynski
On Tue, 21 Oct 2008 18:59:26 +0200
Andi Kleen [EMAIL PROTECTED] wrote:

 Stephan von Krawczynski [EMAIL PROTECTED] writes:
 
  Yes, we hear and say that all the time, name one linux fs doing it, please.
 
 ext[234] support it to some extent. It has some limitations
 (especially when the files are large, and you shouldn't do too much follow-on
 IO to prevent the data from being overwritten) and the user frontends are not
 very nice, but it's there.

Well, they must be pretty ugly, I really never heard of them. But it
is not very important, because extX is completely useless with TB-size disks
unless you feel good waiting hours for fsck (I did, and will never do again).
_All_ customers we deployed ext3 to urged us to go back to reiserfs3 ...

 -Andi

-- 
Regards,
Stephan


Re: Some very basic questions

2008-10-22 Thread Stephan von Krawczynski
On Tue, 21 Oct 2008 11:34:20 -0400
jim owens [EMAIL PROTECTED] wrote:

 Hearing what users think they want is always good, but...
 
 Stephan von Krawczynski wrote:
  
  thanks for your feedback. Understand minimum requirement as minimum
  requirement to drop the current installation and migrate the data to a
  new fs platform.
 
 I would sure like to know what existing platform and filesystem
 you have that you think has all 10 of your features.

Obviously none, else I would not speak up and try to find one. :-)

  [...]
  1) parallel mounts
 
 What I see from that explanation is you have a system design idea
 using parallel machines to fix problems you have had in the past.
 To implement your design, you need a filesystem to fit it.

Well, I can hardly deny that. Let's just name the (simple) problem - different
names for the very same thing: uptime, availability, redundancy.

  I think
 it is better to just design a filesystem without the problems and
 configure the hardware to handle the necessary load.

Ok, now you see me astonished. You really think that there is one piece of
software around that is without problems?
My idea of the world is really very different from that:
The world is far from perfect. That is why I try to deploy solutions that have
redundancy for all kinds of problems I can think of and hopefully for a few
that I haven't thought of.
 
  2) mounting must not delay the system startup significantly
  3) errors in parts of the fs are no reason for a fs to go offline as a whole
  4) power loss at any time must not corrupt the fs
  5) fsck on a mounted fs, interactively, not part of the mount (all fsck
  features)
 
 I think all of these are part of the reliability goal for btrfs
 and when you say fsck it is probably misleading if I understand
 your real requirement to be the same as my customers:
 
- *NO* fsck
- filesystem design prevents problems we have had before
- filesystem autodetects, isolates, and (possibly) repairs errors
- online scan, check, repair filesystem tool initiated by admin
- Reliability so high that they never run that check-and-fix tool

That is _wrong_ (to a certain extent). You _want to run_ diagnostic tools to
make sure that there is no problem. And you don't want some software (not even
HAL) to repair errors without prior admin knowledge/permission.
 
 Note that I personally have never seen a first release meet
 the no problems, no need to fix criteria that would obviate
 any need for a check/fix tool.

That really does not depend on the release number of _your_ special software.
Your software always depends on other components (hw or sw) that (can) have
bugs and weird behaviour. And this is the fact: it is no perfect world, so
don't count on your own or others' perfection. If you do, you will fail.
 
 jim

-- 
Regards,
Stephan



Re: Some very basic questions

2008-10-22 Thread Stephan von Krawczynski
On Tue, 21 Oct 2008 13:49:43 -0400
Chris Mason [EMAIL PROTECTED] wrote:

 On Tue, 2008-10-21 at 18:27 +0200, Stephan von Krawczynski wrote:
 
2. general requirements
- fs errors without file/dir names are useless
- errors in parts of the fs are no reason for a fs to go offline as 
a whole
   
   These two are in progress.  Btrfs won't always be able to give a file
   and directory name, but it will be able to give something that can be
   turned into a file or directory name.  You don't want important
   diagnostic messages delayed by name lookup.
  
  That's a point I really never understood. Why is it non-trivial for a fs to
  know what file or dir (name) it is currently working on?
 
 The name lives in block A, but you might find a corruption while
 processing block B.  Block A might not be in ram anymore, or it might be
 in ram but locked by another process.
 
 On top of all of that, when we print errors it's because things haven't
 gone well.  They are deep inside of various parts of the filesystem, and
 we might not be able to take the required locks or read from the disk in
 order to find the name of the thing we're operating on.

Ok, this is interesting. In another thread I was told parallel mounts are
really complex and that in such an environment you cannot do the good things
you can do with a single mount. Well then, why don't we do it? All boxes I know
have tons of RAM, but the fs finds no place in RAM to put large parts (if not
all) of the structural fs data, including filenames? Given the simple facts
that RAM is always faster than any known disk, be it rotating or not, and that
the RAM is just there, what's the reason for not doing it?

- parallel mounts (very important!)
  (two or more hosts mount the same fs concurrently for reading and
  writing)
   
   As Jim and Andi have said, parallel mounts are not in the feature list
   for Btrfs.  Network filesystems will provide these features.
  
  Can you explain what network filesystems stands for in this statement,
  please name two or three examples.
  
 NFS (done) CRFS (under development), maybe ceph as well which is also
 under development.

NFS is a good example of a fs that never got redesigned for the modern world. I
hope it will be, but currently it's like a Model T on a highway.
You have an NFS server with clients. Your NFS server dies, and your backup
server cannot take over the clients without them resetting their NFS link
(which means a reboot for many applications) - no way.
Besides that, you still need another fs below NFS to bring your data onto some
medium, which means you still have the problem of how to create redundancy in
your server architecture.

- versioning (file and dir)
   
   From a data structure point of view, version control is fairly easy.
   From a user interface and policy point of view, it gets difficult very
   quickly.  Aside from snapshotting, version control is outside the scope
   of btrfs.
   
   There are lots of good version control systems available, I'd suggest
   you use them instead.
  
  To me versioning sounds like a not-so-easy-to-implement feature. 
  Nevertheless
  I trust your experience. If a basic implementation is possible and not too
  complex, why deny a feature? 
  
 
 In general I think snapshotting solves enough of the problem for most of
 the people most of the time.  I'd love for Btrfs to be the perfect FS,
 but I'm afraid everyone has a different definition of perfect.
 
 Storing multiple versions of something is pretty easy.  Making a usable
 interface around those versions is the hard part, especially because you
 need groups of files to be versioned together in atomic groups
 (something that looks a lot like a snapshot).
 
 Versioning is solved in userspace.  We would never be able to implement
 everything that git or mercurial can do inside the filesystem.

Well, quite often the question is not about whole trees of data to be
versioned. Even single (or a few) files or dirs can be of interest. And you
want people to set up a complete user-space monster to version three
OpenOffice documents (admittedly a rather flawed example)?
Lots of people need a basic solution, not the groundbreaking answer to all
questions.

- undelete (file and dir)
   
   Undelete is easy
  
  Yes, we hear and say that all the time, name one linux fs doing it, please.
  
 
 The fact that nobody is doing it is not a good argument for why it
 should be done ;)

Believe me, if NTFS came with a simple undelete tool, we (in Linux fs land)
would have one, too. Why do we always want to be _second best_?

  Undelete is a policy decision about what to do with
 files as they are removed.  I'd much rather see it implemented above the
 filesystems instead of individually in each filesystem.
 
 This doesn't mean I'll never code it, it just means it won't get
 implemented directly inside of Btrfs.  In comparison with all of the
 other features pending, undelete is pretty far down on the list

Re: Some very basic questions

2008-10-22 Thread Stephan von Krawczynski
On Wed, 22 Oct 2008 05:48:30 -0700
Jeff Schroeder [EMAIL PROTECTED] wrote:

  NFS is a good example for a fs that never got redesigned for modern world. I
  hope it will, but currently it's like Model T on a highway.
  You have a NFS server with clients. Your NFS server dies, your backup server
  cannot take over the clients without them resetting their NFS-link (which
  means reboot to many applications) - no way.
  Besides that you still need another fs below NFS to bring your data onto 
  some
  medium, which means you still have the problem how to create redundancy in
  your server architecture.
 
 You are somewhat misinformed on this. Perhaps the Linux nfs server can't cope,
 but I doubt it. NFS was designed to be stateless. I've got a fair amount of
 experience with a dual head netapp architecture. When 1 head dies, the other
 transparently fails over. During the brief downtime, the clients will go into
 I/O wait if at all, instead of being disconnected. You might be able to do
 something similar using nfsd and keepalived if both servers were connected to
 the same storage. Setting that up would be trivial. You just need the clients
 mounting the vip and a reliable mechanism to provide the data from that vip.
 You could use heartbeat, but it is overly complex. Also look at clustered nfs
 or pnfs, both of which are nfs redesigns like you speak of.

We tried that with pure Linux NFS, and it does not work. The clients do not
recover. After trying ourselves and failing we found several docs on the net
that described just the same problem and its reasons. Very likely NetApp found
that out too and did something about it.

Ah yes, and btw, your description contains another discussed problem: "both
servers were connected to the same storage". If you mean that both servers
really access the same storage at the same time, your software options are
pretty few in number.

 -- 
 Jeff Schroeder

-- 
Regards,
Stephan



Re: Some very basic questions

2008-10-21 Thread Stephan von Krawczynski
On Tue, 21 Oct 2008 14:13:33 +0200
Andi Kleen [EMAIL PROTECTED] wrote:

 Stephan von Krawczynski [EMAIL PROTECTED] writes:
 
  reading the list for a while it looks like all kinds of implementational
  topics are covered but no basic user requests or talks are going on. Since I
  have found no other list on vger covering these issues I choose this one,
  forgive my ignorance if it is the wrong place.
  Like many people on the planet we try to handle quite some amounts of data
  (TBs) and try to solve this with several linux-based fileservers.
  Years of (mostly bad) experience led us to the following minimum 
  requirements
  for a new fs on our servers:
 
 If those are the minimum requirements, what are the maximum ones?
 
 Also you realize that some of the requirements (like parallel read/write,
 aka a full cluster file system) are extremely hard?
 
 Perhaps it would make more sense if you extracted the top 10 items
 and ranked them by importance and posted again.

Hello Andi,

thanks for your feedback. Understand "minimum requirement" as the minimum
required to drop the current installation and migrate the data to a new fs
platform.
Of course you are right, dealing with multiple/parallel mounts can be quite a
nasty job if the fs was not originally planned with this feature in mind.
On the other hand I cannot really imagine how to deal with TBs of data in the
future without such a feature.
If you look at the big picture, the things I mentioned allow you to have
redundant front-ends for the file service running the same or completely
different applications. You can use one mount (host) for tape backup purposes
only, without a heavy loss in standard file service. You can even mount for
filesystem check purposes - a box that does nothing else but check the
structure and keep you informed about what is really going on with your data -
while your data is still in production in the meantime.
Whatever happens you have a real chance of keeping your file service up, even
if parts of your fs go nuts because some underlying hd got partially damaged.
Keeping it up and running is the most important part; performance is only
second on the list.
If you take a close look there are not really 10 different items on my list,
depending on the level of abstraction you prefer, but nevertheless:

1) parallel mounts
2) mounting must not delay the system startup significantly
3) errors in parts of the fs are no reason for a fs to go offline as a whole
4) power loss at any time must not corrupt the fs
5) fsck on a mounted fs, interactively, not part of the mount (all fsck
features)
6) journaling
7) undelete (file and dir)
8) resizing during runtime (up and down)
9) snapshots
10) performant handling of large numbers of files inside single dirs


-- 
Regards,
Stephan



Re: Some very basic questions

2008-10-21 Thread Stephan von Krawczynski
On Tue, 21 Oct 2008 09:20:16 -0400
jim owens [EMAIL PROTECTED] wrote:

 btrfs has many of the same goals... but they are goals not code
 so when you might see them is indeterminate.

no big issue, my pension is 20 years away, I got time ;-)
 
 I believe these should not be in btrfs:
 
 Stephan von Krawczynski wrote:
 
  - parallel mounts (very important!)
 
 as Andi said, you want a cluster or distributed fs.  There
 are layered designs (CRFS or network filesystems) that can do
 the job and trying to do it in btrfs causes too many problems.

The question is: if you had such an implementation, would there be drawbacks to
expect for the single-mount case? If not, I'd vote for it because there are not
really many alternatives on the market.

  - journaling
 
 I assume you *do not* mean metadata journaling, you mean
 sending all file updates to a single output stream (as in one
 disk, tape, or network link).  I've done that, but would not
 recommend it in btrfs because it limits the total fs bandwidth
 to what the single stream can support.  This is normally done
 today by applications like databases, not in the filesystem.

As far as I know metadata journaling is in, right?
If what you mean is something capable of creating live or offline images of the
fs, you got me right.
 
  - map out dead blocks
 
 Useless... a waste of time, code, and metadata structures.
 With current device technology, any device reporting bad blocks
 the device can not map out is about to die and needs replaced!

Sure, but what you say only reflects the ideal world. On a file service, you
never have that. In fact you do not even have good control about what is going
on. Let's say you have a setup that creates, reads and deletes files 24h a day
from numerous clients. At two o'clock in the morning some hd decides to
partially die. Files get created on it, fill data up to errors, get
deleted, and another bunch of data arrives, and yet again the fs tries to
allocate the same dead areas. You lose a lot more data only because the fs did
not map out the already known dead blocks. Of course you would replace the dead
drive later on, but in the meantime you have a lot of fun.
In other words: give me a tool to freeze the world right at the time the
errors show up, or map out dead blocks (only because it is a lot easier).
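
The read-only half of such a tool exists today as badblocks(8); a rough
stand-in sketch that scans a device (or image file) block by block and prints
the byte offsets that fail to read - the block size and the device path are
placeholder assumptions:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLOCK_SIZE 4096   /* assumed scan granularity */

int main(int argc, char **argv)
{
    /* argv[1]: device or image file to scan, e.g. /dev/sdb (placeholder). */
    if (argc != 2) {
        fprintf(stderr, "usage: %s <device>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char buf[BLOCK_SIZE];
    off_t off = 0;

    for (;;) {
        ssize_t n = pread(fd, buf, sizeof(buf), off);
        if (n == 0)       /* end of device */
            break;
        if (n < 0)        /* unreadable block: candidate to map out */
            printf("bad block at byte offset %lld\n", (long long)off);
        off += BLOCK_SIZE;
    }

    close(fd);
    return 0;
}

Feeding such a list back into allocation decisions is the part that would need
fs support - which is exactly the feature asked for here.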

 jim

-- 
Regards,
Stephan