Re: [Lustre-discuss] external journal raid1 vs. single disk ext journal + hot spare on raid6

2009-05-15 Thread Andreas Dilger
On May 14, 2009  23:37 -0400, Robin Humble wrote:
> one problem we came across was that ext3/ldiskfs hard-codes the device
> name of the external journal (eg. /dev/md5 or /dev/sdc1 or whatever)
> into the filesystem. 
> that means that when you fail over OSSes it will look for /dev/whatever
> on the failed-over node, and won't mount if it can't find it.
> so you need non-intersecting namespaces of journal devices within an OSS
> pair, so that each regular and failed-over RAID5/6 can always find its
> correct journal device.
> I didn't manage to get ext3/ldiskfs to be sane and use UUIDs instead of
> hardcoded device names :-/

There is a "journal_device" mount option for this.  We'd like to make
mount.lustre find this device automatically, but it hasn't been fixed
yet.  See bug 16861.
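For reference, stock ext3 spells this option "journal_dev" and takes the
journal's device number rather than a path, which sidesteps the hardcoded
name (a sketch, assuming ldiskfs accepts the same option; the device
numbers here are only illustrative):

  # external journal on /dev/sdc1 = major 8, minor 33,
  # encoded as major*256 + minor = 2081
  mount -t ldiskfs -o journal_dev=2081 /dev/md5 /mnt/ost0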

> presumably you could also use tune2fs to rename or delete the external
> journal as part of a failover, but that's a horrible hack.

No, that will potentially lose data: ext3 considers data written to the
journal as "safe", so deleting the journal can discard transactions that
were committed there but not yet written back into the filesystem.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



Re: [Lustre-discuss] external journal raid1 vs. single disk ext journal + hot spare on raid6

2009-05-15 Thread Ralf Utermann
Stuart Marshall wrote:
> Hi All,
> 
> With the upgrade from 1.6.x to 1.8.x we are planning to reconfigure our
> RAID systems.
> 
> The OST RAID hardware consists of Sun 6140 arrays with 16x500GB SATA disks.
> Each 6140 tray has one OSS node (Sun X2200 M2).  We have redundant paths
> and ultimately plan a failover strategy.  The MDT will be a RAID 1+0 Sun
> 2540 with 12x73GB SAS disks.
> 
> Each 6140 tray will be configured as either 1 or 2 RAID6 volumes.  The
> Lustre manual recommends more, smaller OSTs over fewer large ones, and other
> docs I've seen suggest that the optimal number of drives is ~(6+2).  For
> these 16-disk trays, the choice would be one (12+2 RAID6) + external journal
> and/or hot spares, or two (5+2 RAID6)s + ext. journal and/or hot spares.
> 

We have a similar hardware setup: 2 OSS nodes attached to a Sun 6140 plus
one CSM200 extension tray, which means 32x500GB SATA disks. Because I assumed,
as Robin says in his post, that 2^n+parity is optimal for this hardware, I
went back to RAID5 for the OSTs and configured 2 x (4+1) and 2 x (8+1). Then
there is one RAID1 for the external journals and 2 disks left as hot spares.
So the OSTs are not all the same size, but each OSS then serves one 4+1 and
one 8+1 OST. I hope Lustre will spread the data in a reasonable way. The
chunk sizes used are 256k and 128k respectively, so a full data stripe always
adds up to 1M.
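The same stripe math can be handed to ldiskfs at format time (a sketch for
the 4+1 volume with 256k chunks; the mkfs.lustre parameters and device name
are illustrative, not our exact command):

  # 256k chunk = 64 x 4k blocks; 4 data disks -> stripe-width = 4*64 = 256
  mkfs.lustre --ost --fsname=lustre --mgsnode=mgs@tcp0 \
      --mkfsoptions="-E stride=64,stripe-width=256" /dev/sdb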


> So my questions are:
> 
> 1.) What are the trade-offs of RAID1 external journal with no hot spare
> vs. single disk ext journal with a hot spare (spare is for R6 volume)?
> Specifically:
> 
> - If a single disk external journal is lost, can we run fsck and only
> lose the transactions that have not been committed to disk?  If so, then
> the loss of the disk hosting the external journal would not be
> catastrophic for the file system as a whole.
> 
> - How comfortable are RAID6 users with no hot spares? (We'll have cold
> spares handy, but prefer to get through weekends w/out service)
> 
> 2.) The external journal only takes up ~400MB.  If we create 2 RAID6
> volumes, can we put 2 external journals on one disk or RAID1 set
> (suitably partitioned), or do we need to blow an entire disk for one
> external journal?
We have the 4 journal volumes on one RAID1 virtual disk, but I did not
compare this against other setups with performance tests.
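The journal devices are plain ext3 journal devices, created and attached
roughly like this (a sketch; device names illustrative):

  # one small journal device per partition of the RAID1 virtual disk
  mke2fs -O journal_dev -b 4096 /dev/sdj1
  # point the OST filesystem at it when formatting
  mkfs.lustre --ost --fsname=lustre --mgsnode=mgs@tcp0 \
      --mkfsoptions="-J device=/dev/sdj1" /dev/sdb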

I did some performance tests with iozone in our dual-Gigabit environment,
and I see performance going down significantly with smaller block sizes on
patchless Lustre clients. This happens for some OSTs, but not for others.
I don't know whether this has something to do with the 6140 and its setup
here; the patched clients don't show this problem, and I did not look
further into it.

Best regards, Ralf
-- 
Ralf Utermann
_
Universität Augsburg, Institut für Physik   --   EDV-Betreuer
Universitätsstr.1 
D-86135 Augsburg Phone:  +49-821-598-3231
SMTP: ralf.uterm...@physik.uni-augsburg.de Fax: -3411


Re: [Lustre-discuss] external journal raid1 vs. single disk ext journal + hot spare on raid6

2009-05-14 Thread Robin Humble
Hi Stuart,

On Thu, May 14, 2009 at 01:08:36PM -0700, Stuart Marshall wrote:
>Each 6140 tray will be configured as either 1 or 2 RAID6 volumes.  The
>Lustre manual recommends more, smaller OSTs over fewer large ones, and other
>docs I've seen suggest that the optimal number of drives is ~(6+2).  For
>these 16-disk trays, the choice would be one (12+2 RAID6) + external journal
>and/or hot spares, or two (5+2 RAID6)s + ext. journal and/or hot spares.

2^n+parity (eg. 8+2 RAID6) is generally best with software raid, and
presumably with your 6140 too. 8+2 with a 64k/128k chunk size means
512kB/1MB per data stripe, which plays nicely with Lustre's 1MB data
transfer size.
presumably you have 6+2 because that fits neatly into your 16 disk
units - these things are always a compromise :-/

>So my questions are:
>
>1.) What are the trade-offs of RAID1 external journal with no hot spare vs.
>single disk ext journal with a hot spare (spare is for R6 volume)?
>Specifically:

an external journal takes away half the seeks (the small writes to the
journal) when writing to RAID5/6, so it can double your write speeds. it
does for us with software raid. having said that, if you have a large
NVRAM cache in your hardware raid then you might not notice these extra
seeks, as they mostly go to RAM and are flushed to spinning disk much
less frequently.

also I believe Lustre 1.8 hides the slowness of internal journals
better than 1.6. IIRC, it allows multiple outstanding writes to be in
flight (like metadata in 1.6) and holds copies of data on clients for
replay in case an OSS crashes. so with 1.8 you may not notice external
journals helping all that much.

>- If a single disk external journal is lost, can we run fsck and only lose
>the transactions that have not been committed to disk?  If so, then the loss
>of the disk hosting the external journal would not be catastrophic for the
>file system as a whole.

I think so, yes, although we run external journals on RAID1. if you
lose the journal device then you might have to use tune2fs to delete the
external journal from the fs before you fsck, as fsck will go looking
for the (dead/missing) journal device and will sulk.
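something like this, I believe (untested sketch; device names illustrative):

  # drop the record of the missing external journal, then check
  tune2fs -f -O ^has_journal /dev/md0
  e2fsck -f /dev/md0
  # later, make a fresh journal device and re-attach it
  mke2fs -O journal_dev -b 4096 /dev/md5
  tune2fs -J device=/dev/md5 /dev/md0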

one problem we came across was that ext3/ldiskfs hard-codes the device
name of the external journal (eg. /dev/md5 or /dev/sdc1 or whatever)
into the filesystem. 
that means that when you fail over OSSes it will look for /dev/whatever
on the failed-over node, and won't mount if it can't find it.
so you need non-intersecting namespaces of journal devices within an OSS
pair, so that each regular and failed-over RAID5/6 can always find its
correct journal device.
I didn't manage to get ext3/ldiskfs to be sane and use UUIDs instead of
hardcoded device names :-/
presumably you could also use tune2fs to rename or delete the external
journal as part of a failover, but that's a horrible hack.
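for what it's worth, you can at least see what the fs has recorded - the
superblock carries both a journal UUID and the journal's device number
(sketch; device name illustrative):

  # show the journal binding recorded in the superblock
  dumpe2fs -h /dev/md0 | grep -i journal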

>- How comfortable are RAID6 users with no hot spares? (We'll have cold
>spares handy, but prefer to get through weekends w/out service)

fairly comfy. you can do the sums and work out the likelihood of dual
failures given your drive sizes and error rates, and it's not
outrageous. that assumes no correlation between drive failures, of course...

>2.) The external journal only takes up ~400MB.  If we create 2 RAID6
>volumes, can we put 2 external journals on one disk or RAID1 set (suitably
>partitioned), or do we need to blow an entire disk for one external journal?

ext3/ldiskfs won't let multiple filesystems share one journal (although
apparently it's technically possible), but as you say, you can just
make 2 small partitions and put a journal on each.
they will interfere if both filesystems are writing heavily (no
interference on reads), but I'd guess (only a guess - I haven't measured
it) the penalty should still be smaller than with internal journals.
the Lustre 1.8 changes should probably help both external shared and
internal journal cases.
I believe Sun folks have some numbers about such shared scenarios that
you might be able to cajole out of them.

>3.) In planning for "segment size" (chunk size in the Lustre manual) we'd have
>to go to 128kB or lower.  However, in single disk tests (SATA), it seems
>that larger is better so perhaps this argues for small RAID6 sets as
>mentioned in the manual.  Just wondering what other folks have found here
>also.

you don't want your RAID chunk size to be such that disks*chunk > 1MB,
as then every Lustre op will be hitting less than one full stripe on the
RAID, which causes read-modify-writes, and will be slow.
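to make that concrete: 12+2 RAID6 with 128kB chunks gives a 12 x 128kB =
1.5MB data stripe, so every 1MB Lustre write is a partial-stripe write,
whereas 8+2 with 128kB chunks is exactly 8 x 128kB = 1MB per stripe.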

>We have the opportunity to test several scenarios with 2 6140 trays that are
>not part of the 1.6.x production system so I expect we will test performance
>as a function of the number of drives in the RAID6 volume (eg. 12+2 vs 5+2)
>along with array write segment sizes via sgpdd-survey.
>
>I'll report back with test results once we sort out which knobs seem to make
>the most difference.

that would be great to know.
6140s are probably quite different from the software raid md SAS JBODs
we use.

[Lustre-discuss] external journal raid1 vs. single disk ext journal + hot spare on raid6

2009-05-14 Thread Stuart Marshall
Hi All,

With the upgrade from 1.6.x to 1.8.x we are planning to reconfigure our RAID
systems.

The OST RAID hardware consists of Sun 6140 arrays with 16x500GB SATA disks.  Each
6140 tray has one OSS node (Sun X2200 M2).  We have redundant paths and
ultimately plan a failover strategy.  The MDT will be a RAID 1+0 Sun 2540
with 12x73GB SAS disks.

Each 6140 tray will be configured as either 1 or 2 RAID6 volumes.  The
Lustre manual recommends more, smaller OSTs over fewer large ones, and other
docs I've seen suggest that the optimal number of drives is ~(6+2).  For
these 16-disk trays, the choice would be one (12+2 RAID6) + external journal
and/or hot spares, or two (5+2 RAID6)s + ext. journal and/or hot spares.

So my questions are:

1.) What are the trade-offs of RAID1 external journal with no hot spare vs.
single disk ext journal with a hot spare (spare is for R6 volume)?
Specifically:

- If a single disk external journal is lost, can we run fsck and only lose
the transactions that have not been committed to disk?  If so, then the loss
of the disk hosting the external journal would not be catastrophic for the
file system as a whole.

- How comfortable are RAID6 users with no hot spares? (We'll have cold
spares handy, but prefer to get through weekends w/out service)

2.) The external journal only takes up ~400MB.  If we create 2 RAID6
volumes, can we put 2 external journals on one disk or RAID1 set (suitably
partitioned), or do we need to blow an entire disk for one external journal?

3.) In planning for "segment size" (chunk size in the Lustre manual) we'd have
to go to 128kB or lower.  However, in single disk tests (SATA), it seems
that larger is better so perhaps this argues for small RAID6 sets as
mentioned in the manual.  Just wondering what other folks have found here
also.

We have the opportunity to test several scenarios with 2 6140 trays that are
not part of the 1.6.x production system so I expect we will test performance
as a function of the number of drives in the RAID6 volume (eg. 12+2 vs 5+2)
along with array write segment sizes via sgpdd-survey.
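For reference, sgpdd-survey is driven by environment variables, along the
lines of the sketch below (the values and sg device are illustrative; we'll
follow the iokit docs for the real runs):

  # raw read/write survey of one LUN through its sg device
  size=8g crghi=16 thrhi=32 scsidevs=/dev/sg2 sgpdd-survey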

I'll report back with test results once we sort out which knobs seem to make
the most difference.

Any advice or comments welcome,
Stuart