Re: btrfs for enterprise raid arrays

2009-04-03 Thread Sander
Dear Erwin,

Erwin van Londen wrote (ao):
 Another thing is that some arrays have the capability to
 thin-provision volumes. In the back-end, on the physical layer, the
 array configures, let's say, a 1TB volume and virtually provisions 5TB
 to the host. On writes it dynamically allocates more pages in the pool
 up to the 5TB point. Now if for some reason large holes occur on the
 volume, maybe a couple of ISO images that have been deleted, what
 normally happens is that just some pointers in the inodes get deleted,
 so from the array's perspective there is still data in those locations,
 and the array will never release those allocated blocks. New
 firmware/microcode versions are able to reclaim that space: if the
 array sees a certain number of consecutive zeros, it returns that
 space to the volume pool. Are there any thoughts on writing a
 low-priority thread that zeros out those unused blocks?

SSDs would also benefit from such a feature, as they wouldn't need to
copy deleted data when erasing blocks.

Couldn't the storage use the TRIM (ATA) and UNMAP (SCSI) commands,
surfaced in Linux as DISCARD, for that?
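
For illustration, an untested userspace sketch of what I mean: the
BLKDISCARD ioctl asks the block layer to discard a byte range, and the
kernel maps that onto whatever the device supports. The device path
and range below are just placeholders.

/* Untested sketch: discard a byte range on a block device.
 * /dev/sdX and the range are placeholders. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>           /* BLKDISCARD */

int main(void)
{
    int fd = open("/dev/sdX", O_WRONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* { byte offset, byte length } of the range to discard */
    uint64_t range[2] = { 0, 1024 * 1024 };

    if (ioctl(fd, BLKDISCARD, &range) < 0)
        perror("BLKDISCARD");

    close(fd);
    return 0;
}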

I have one question on thin provisioning: if Windows XP performs a
defrag on a 20GB 'virtual' size LUN with only 2GB in actual use, will
the volume grow to 20GB on the storage and never shrink again, even
though the client still has only 2GB in use?

This would make thin provisioning on virtual desktops less useful.

Do you have any numbers on the performance impact of thin provisioning?
I can imagine that thin provisioning causes on-storage fragmentation of
disk images, which would defeat any OS optimisations such as grouping
frequently read files.

With kind regards, Sander

-- 
Humilis IT Services and Solutions
http://www.humilis.net


Re: btrfs for enterprise raid arrays

2009-04-03 Thread David Woodhouse
On Fri, 2009-04-03 at 12:43 +0100, Ric Wheeler wrote:
  New firmware/microcode versions are able to reclaim that space: if it
  sees a certain number of consecutive zeros, it returns that
  space to the volume pool. Are there any thoughts on writing a
  low-priority thread that zeros out those unused blocks?
 
 Patches have been floating around to support this - see the recent 
 patches around DISCARD on linux-ide and lkml.  It would be great to 
 get access to a box that implemented the T10 proposed UNMAP commands 
 that we could test against. 

We've already made btrfs support TRIM, and Matthew has patches which
hook it up for ATA/IDE devices. Adding SCSI support shouldn't be hard
once the dust settles on the spec.
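
For reference, the in-kernel shape of this is roughly the following
(hedged sketch; blkdev_issue_discard()'s exact signature has shifted
between kernel versions, so treat the call as illustrative):

#include <linux/blkdev.h>

/* Hand a freed extent to the block layer, which turns it into TRIM,
 * UNMAP or whatever the underlying device understands.  Offsets are
 * in bytes; the block layer wants 512-byte sectors. */
static void fs_discard_extent(struct block_device *bdev, u64 start, u64 len)
{
        blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_NOFS);
}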

I don't think I've seen anybody talking about deliberately writing
zeroes instead of just issuing a discard command though. That doesn't
seem like a massively cunning plan.

-- 
David Woodhouse                            Open Source Technology Centre
david.woodho...@intel.com  Intel Corporation



Re: btrfs for enterprise raid arrays

2009-04-03 Thread Chris Mason
On Thu, 2009-04-02 at 21:34 -0700, Erwin van Londen wrote:
 Dear all,
 
 While going through the archived mailing list and crawling along the
 wiki, I didn't find any clues as to whether there will be any
 optimizations in Btrfs to make efficient use of the functions and
 features that exist today on enterprise-class storage arrays.
 
 One exception to that was the ssd option, which I think can make an
 improvement on read and write IOs; however, when attached to a storage
 array it doesn't really matter from an OS perspective, since the OS
 can't look behind the array's front-end interface anyhow (whether that
 is FC/iSCSI or anything else).
 
 There are, however, more options that we could think of. Almost all
 storage arrays these days have the capability to replicate volumes
 (or parts of them, in COW cases) either in the system or remotely. It
 would be handy if a Btrfs-formatted volume could make use of those
 features, since this might offload a lot of the processing time
 involved in maintaining these. The arrays already have optimized code
 to make these snapshots. I'm not saying we should step away from
 host-based snapshots, but integration would be very nice.

Storage based snapshotting would definitely be useful for replication in
btrfs, and in that case we could wire it up from userland.  Basically
there is a point during commit where a storage snapshot could be taken
and be fully consistent.
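
A rough sketch of that userland wiring (untested; FIFREEZE/FITHAW are
the generic Linux freeze ioctls, and the snapshot command here is a
made-up placeholder for whatever tool the array vendor provides):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>           /* FIFREEZE, FITHAW */

int main(void)
{
    /* Open the mount point (not the block device). */
    int fd = open("/mnt/btrfs", O_RDONLY);      /* placeholder path */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Block new modifications and flush the current commit to disk. */
    if (ioctl(fd, FIFREEZE, 0) < 0) {
        perror("FIFREEZE");
        return 1;
    }

    /* Hypothetical vendor tool; the array snapshot taken here sees a
     * fully consistent committed filesystem state. */
    system("array-snapshot-tool create lun0");

    /* Let writes continue. */
    if (ioctl(fd, FITHAW, 0) < 0)
        perror("FITHAW");

    close(fd);
    return 0;
}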

Outside of replication though, I'm not sure exactly where storage based
snapshotting would come in.  It wouldn't really be compatible with the
snapshots btrfs is already doing (but I'm always open to more ideas).

 Furthermore, some enterprise arrays have a feature that allows for
 full or partial staging of data in cache. By this I mean that when a
 volume contains a certain number of blocks, you can define the first X
 blocks to be pre-staged in cache, which enables extremely high IO
 rates on those first blocks. An option related to the -ssd parameter
 could be a mount command like mount -t btrfs -ssd 0-1, so Btrfs knows
 what to expect from the partially staged area and can perhaps optimize
 the locality of frequently used blocks to improve performance.

This would be very useful, although I would tend to export it to btrfs
as a second lun.  My long term goal is to have code in btrfs that
supports a super fast staging lun, which might be an ssd or cache carved
out of a high end array.
 
 Another thing is that some arrays have the capability to
 thin-provision volumes. In the back-end, on the physical layer, the
 array configures, let's say, a 1TB volume and virtually provisions 5TB
 to the host. On writes it dynamically allocates more pages in the pool
 up to the 5TB point. Now if for some reason large holes occur on the
 volume, maybe a couple of ISO images that have been deleted, what
 normally happens is that just some pointers in the inodes get deleted,
 so from the array's perspective there is still data in those locations,
 and the array will never release those allocated blocks. New
 firmware/microcode versions are able to reclaim that space: if the
 array sees a certain number of consecutive zeros, it returns that
 space to the volume pool. Are there any thoughts on writing a
 low-priority thread that zeros out those unused blocks?

Other people have replied about the trim commands, which btrfs can issue
on every block it frees.  But another way to look at this is that btrfs
is already thinly provisioned.  When you add storage to btrfs, it
allocates from that storage in 1GB chunks, and then hands those over to
the FS allocation code for more fine-grained use.

It may make sense to talk about how that can fit in with your own thin
provisioning.
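
To make the chunk scheme concrete, here is a toy model (not btrfs
code, just an illustration of the two allocation levels):

#include <stdint.h>
#include <stdio.h>

#define CHUNK_SIZE (1024ULL * 1024 * 1024)      /* 1GB */

struct chunk {
    uint64_t base;      /* device offset where this chunk starts */
    uint64_t used;      /* bytes handed out from it so far */
};

/* Level 1: claim the next 1GB region of the device.  Until this
 * happens, the device never sees the space touched at all -- which
 * is what makes the layout naturally thin. */
static struct chunk chunk_alloc(uint64_t *next_free)
{
    struct chunk c = { *next_free, 0 };
    *next_free += CHUNK_SIZE;
    return c;
}

/* Level 2: fine-grained allocation from inside a claimed chunk. */
static uint64_t extent_alloc(struct chunk *c, uint64_t len)
{
    uint64_t off = c->base + c->used;
    c->used += len;
    return off;
}

int main(void)
{
    uint64_t next_free = 0;
    struct chunk c = chunk_alloc(&next_free);
    printf("4K extent at device offset %llu\n",
           (unsigned long long)extent_alloc(&c, 4096));
    return 0;
}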

 Given the scalability targets of Btrfs, it will most likely be heavily
 used in enterprise environments once it reaches a stable code level.
 If we were able to interface with these array-based features, that
 would be very beneficial.
 
 Furthermore, one question pops to mind when looking at the scalability
 of Btrfs and its targeted capacity levels: I think we will run into
 problems with the capabilities of the server hardware itself. From
 what I can see now, Btrfs will not be designed as a distributed
 file-system with an integrated distributed lock manager to scale out
 over multiple nodes. (I know Oracle is working on a similar thing, but
 that might make things more complicated than they already are.) This
 might impose some serious issues for recovery scenarios like
 backup/restore, since it will take quite some time to back up or
 restore a multi-PB system when it resides on just one physical host,
 even a high-end P-series, I25K, or Superdome-class machine.

This is true.  Things like replication and failover are the best plans
for it today.

Thanks for your interest, we're always looking for ways to better
utilize high end storage features.

-chris



Re: btrfs for enterprise raid arrays

2009-04-03 Thread Matthew Wilcox
On Fri, Apr 03, 2009 at 12:58:00PM +0100, David Woodhouse wrote:
 On Fri, 2009-04-03 at 12:43 +0100, Ric Wheeler wrote:
    New firmware/microcode versions are able to reclaim that space: if it
    sees a certain number of consecutive zeros, it returns that
    space to the volume pool. Are there any thoughts on writing a
    low-priority thread that zeros out those unused blocks?
  
  Patches have been floating around to support this - see the recent 
  patches around DISCARD on linux-ide and lkml.  It would be great to 
  get access to a box that implemented the T10 proposed UNMAP commands 
  that we could test against. 
 
 We've already made btrfs support TRIM, and Matthew has patches which
 hook it up for ATA/IDE devices. Adding SCSI support shouldn't be hard
 once the dust settles on the spec.

It seems like the dust has settled ... I just need to check that
my code still conforms to the spec.  Understandably, I've been focused
on TRIM ;-)

 I don't think I've seen anybody talking about deliberately writing
 zeroes instead of just issuing a discard command though. That doesn't
 seem like a massively cunning plan.

Yeah, WRITE SAME with the discard bit.  A bit of a crappy way to go, to
be sure.  I'm not exactly sure how we're supposed to be deciding whether
to issue an UNMAP or WRITE SAME command.  Perhaps if I read the spec
properly it'll tell me.

I just had a quick chat with someone from another storage vendor that
doesn't yet implement UNMAP -- if you do a WRITE SAME with all zeroes,
their device will notice that and unmap the LBAs in question.
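
For the curious, issuing that by hand looks roughly like this
(untested sketch against SBC-3 as I read it; it assumes a 512-byte
logical block, and /dev/sdX plus the LBA range are placeholders):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <scsi/sg.h>

int main(void)
{
    int fd = open("/dev/sdX", O_RDWR);          /* placeholder device */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    uint8_t zeros[512] = { 0 };         /* one logical block of zeros */
    uint8_t sense[32];
    uint64_t lba = 0;                   /* placeholder start LBA */
    uint32_t nblocks = 8;               /* placeholder block count */
    int i;

    /* WRITE SAME(16): opcode 0x93; byte 1 bit 3 is the UNMAP bit */
    uint8_t cdb[16] = { 0x93, 0x08 };
    for (i = 0; i < 8; i++)             /* bytes 2-9: LBA, big-endian */
        cdb[2 + i] = (uint8_t)(lba >> (8 * (7 - i)));
    for (i = 0; i < 4; i++)             /* bytes 10-13: block count */
        cdb[10 + i] = (uint8_t)(nblocks >> (8 * (3 - i)));

    struct sg_io_hdr io;
    memset(&io, 0, sizeof(io));
    io.interface_id = 'S';
    io.cmdp = cdb;
    io.cmd_len = sizeof(cdb);
    io.dxfer_direction = SG_DXFER_TO_DEV;       /* we send the zero block */
    io.dxferp = zeros;
    io.dxfer_len = sizeof(zeros);
    io.sbp = sense;
    io.mx_sb_len = sizeof(sense);
    io.timeout = 60000;                         /* ms */

    if (ioctl(fd, SG_IO, &io) < 0)
        perror("SG_IO");

    close(fd);
    return 0;
}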

Something for the plane on Sunday anyway.


Re: btrfs for enterprise raid arrays

2009-04-03 Thread James Bottomley
On Fri, 2009-04-03 at 06:27 -0700, Matthew Wilcox wrote:
 On Fri, Apr 03, 2009 at 12:58:00PM +0100, David Woodhouse wrote:
  On Fri, 2009-04-03 at 12:43 +0100, Ric Wheeler wrote:
    New firmware/microcode versions are able to reclaim that space: if it
    sees a certain number of consecutive zeros, it returns that
    space to the volume pool. Are there any thoughts on writing a
    low-priority thread that zeros out those unused blocks?
   
   Patches have been floating around to support this - see the recent 
   patches around DISCARD on linux-ide and lkml.  It would be great to 
   get access to a box that implemented the T10 proposed UNMAP commands 
   that we could test against. 
  
  We've already made btrfs support TRIM, and Matthew has patches which
  hook it up for ATA/IDE devices. Adding SCSI support shouldn't be hard
  once the dust settles on the spec.
 
 It seems like the dust has settled ... I just need to check that
 my code still conforms to the spec.  Understandably, I've been focused
 on TRIM ;-)
 
  I don't think I've seen anybody talking about deliberately writing
  zeroes instead of just issuing a discard command though. That doesn't
  seem like a massively cunning plan.
 
 Yeah, WRITE SAME with the discard bit.  A bit of a crappy way to go, to
 be sure.  I'm not exactly sure how we're supposed to be deciding whether
 to issue an UNMAP or WRITE SAME command.  Perhaps if I read the spec
 properly it'll tell me.

Actually, the point about WRITE SAME is that it's a far smaller patch to
the standards (just a couple of bits).  Plus it gets around the problem
of what the array returns when an unmapped block is read (a question
which occupies pages in the UNMAP proposal), so from that point of view
it seems very logical.

 I just had a quick chat with someone from another storage vendor that
 doesn't yet implement UNMAP -- if you do a WRITE SAME with all zeroes,
 their device will notice that and unmap the LBAs in question.

I actually already looked at using WRITE SAME in sd.c ... it turns out
to be surprisingly little work ... the thing you'll like about it is
that there are no extents to worry about and if you plan on writing all
zeros, you can keep a static zeroed data buffer around for the
purpose ...

James




Re: btrfs for enterprise raid arrays

2009-04-03 Thread James Bottomley
On Fri, 2009-04-03 at 07:43 -0400, Ric Wheeler wrote:
 Erwin van Londen wrote:
  Another thing is that some arrays have the capability to thin-provision 
  volumes. In the back-end, on the physical layer, the array configures, 
  let's say, a 1TB volume and virtually provisions 5TB to the host. On 
  writes it dynamically allocates more pages in the pool up to the 5TB 
  point. Now if for some reason large holes occur on the volume, maybe a 
  couple of ISO images that have been deleted, what normally happens is 
  that just some pointers in the inodes get deleted, so from the array's 
  perspective there is still data in those locations, and the array will 
  never release those allocated blocks. New firmware/microcode versions 
  are able to reclaim that space: if the array sees a certain number of 
  consecutive zeros, it returns that space to the volume pool. Are there 
  any thoughts on writing a low-priority thread that zeros out those 
  unused blocks?

 
 Patches have been floating around to support this - see the recent 
 patches around DISCARD on linux-ide and lkml.  It would be great to 
 get access to a box that implemented the T10 proposed UNMAP commands 
 that we could test against. 

So we went around the block several times in the run-up to the upcoming
Linux Filesystem and Storage workshop to see if anyone from the array
vendors might be interested in discussing thin provisioning.  The
general answer was no, since travel budgets are tight.  The upshot is
that most of our discard infrastructure will be focussed on SSD TRIM,
but we'll try to preserve the TP option for arrays ... there are still
private conversations going on with people who know the UNMAP/WRITE
SAME requirements of the arrays at the various vendors.

James

