>>>>> "Ski" == Ski Kacoroski <[email protected]> writes:

Ski> John,
Ski> Thanks for your reply...

You're welcome.  Always glad to help.  

Ski> On 01/29/2010 09:12 AM, John Stoffel wrote:
>> 
Ski> I have been tussling with a SAN problem for several weeks now and
Ski> would like comments from you folks on it.
>> 
>> This doesn't sound so much like a SAN problem as a replication
>> problem between your Equallogic boxes.
>> 
Ski> Situation: Equallogic iSCSI SAN. Primary site has a 2TB volume
Ski> (1.7TB used per the SAN, 1.5TB used per the operating system).
Ski> The volume is used as the datastore for a Scalix (used to be HP
Ski> OpenMail) mail server.  File system is ext3 mounted with defaults
Ski> and yes I plan to redo this over the weekend.  It has about 30
Ski> million files of which 60% are less than 4K in size.  Equallogic
Ski> uses a 64K stripe on its arrays with a 256K block size (not
Ski> changeable).  DR site has 6TB allocated for replicas.
>> 
>> How are you doing the replication?  Block level?  Rsync?

Ski> Block level.

I suspected that, since it's the only efficient way to do the copy...

Ski> My primary problem is that the replication keeps failing because
Ski> it runs out of space, even though I have 6TB available.  I can do
Ski> the first replica, and sometimes a second or third, but then it
Ski> starts failing due to lack of space.  Change amounts range from
Ski> 200 - 500GB.  Even with that I should be able to create a few
Ski> replicas into 6TB (I would think).
>> 
>> I'm not totally surprised.  Since it sounds like you're doing block
>> level replication here, and your files are all so much smaller than
>> the minimum block size, you're hitting write amplification: when only
>> 4K of a block changes, the array has to send the entire 64K stripe or
>> 256K block over to the replica system.
>> 
>> Does the initial replica take only 2TB of space?  And then the
>> follow-ons take lots more than the size of the changed files would
>> suggest?

Ski> yes.
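
That lines up with the amplification math: every changed 4K message
file dirties at least one 256K block, so the replica can see up to 64x
the data that actually changed.  Rough numbers (the file count below
is just an illustration, not your real change rate):

   2,000,000 changed 4K files     ~=   8GB of real changes
   2,000,000 dirtied 256K blocks  ~= 512GB sent to the replica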

>> 
Ski> What have you experienced with SANs and applications that have
Ski> millions of small files?  What tricks did you use to make them
Ski> work?  Am I barking up the wrong tree and need to go in a totally
Ski> different direction?
>> 
>> I think you'll need to bite the bullet and do some sort of per-file
>> replication, just because your usage is killing your SAN replication.
>> I assume the Scalix mailstore is in maildir format, with each message
>> in its own file?  Not fun.

Ski> No, the application basically creates a massive linked list on
Ski> disk with each email broken into several smaller files (headers,
Ski> body, wrapper, attachments, etc.).  The data store has 960
Ski> directories, each with around 47,000 files (some have fewer).

So go take the application vendor outside and shoot them.  :]  It's a
piss-poor design, especially with memory being as cheap as it is these
days for stuff like this.  They should have used a proper DB instead.

>> I'm in a Netapp world these days, and while I do replication of
>> volumes with lots and lots of small files, it's not at the level of
>> churn you're at, nor is it important that I keep multiple snapshots
>> around.
>> 
>> Turning off atime updates in ext3 might be a good first step;
>> anything you can do to limit changes to the filesystem will help.
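>> 
>> For example (the mount point here is just a placeholder for wherever
>> the Scalix datastore actually lives), you can turn it on with a
>> remount and then make it stick by adding the options in /etc/fstab:
>> 
>>    mount -o remount,noatime,nodiratime /var/opt/scalix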
>> 
>> If you can break your filesystem down into smaller sub-units, that
>> might let you do rsync style file level scans more efficiently.  Or
>> maybe you just do the initial replica using the block level stuff, THEN
>> do a file level scan on the replica so you don't impact your
>> production box and keep copies there.

Ski> I tried once using rsync and after several days gave up because
Ski> of so many files.  Perhaps I should try a block level initial
Ski> replication, then write some sort of parallel rsync that will do
Ski> small directories at a time.

It's not going to be pretty, but running a group of 10 rsyncs across
those 960 directories in parallel might be your best option.  Still
sucks.  And it will put a huge load on your array as well.  
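
Something like this would be a crude way to fan them out -- the paths
and the drhost name are just placeholders for wherever the datastore
and the DR box actually live:

   # placeholders: adjust the paths and hostname to your environment
   cd /srv/scalix/datastore
   ls -d */ | xargs -P10 -I{} rsync -a --delete "{}" "drhost:/srv/scalix-replica/{}"

That keeps 10 rsyncs in flight, one per top-level directory, so no
single run has to walk all 30 million files by itself.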

You might be able to tweak the ext3 settings so that it knows it's
sitting on an array with a 64K chunk and a 256K stripe, so that it
lays things out on those boundaries, which might be a big help.  Note
that stride and stripe-width are given in 4K filesystem blocks, not
bytes, so that works out to 64K/4K = 16 and 256K/4K = 64:

   mkfs -t ext3 -E stride=16,stripe-width=64 -J size=32 -O dir_index /dev/lun...

(I dropped the extent feature there -- extents only exist on ext4, see
below.)

You might also look at the -G parameter (it packs block groups
together via the flex_bg feature, which is ext4-only), so that
filesystem metadata is more grouped, to hopefully keep the number of
changed blocks to a minimum.

Moving to ext4 with extents might be a better option though.  Can you
test this out with a small test case?  That would be the best thing to
do.
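
If you want a starting point for that test, this is roughly what I'd
try -- the device and mount point are placeholders, and the
stride/stripe-width numbers just come from your 64K chunk and 256K
stripe with 4K filesystem blocks:

   # device and mount point are placeholders for a small test LUN
   mkfs -t ext4 -E stride=16,stripe-width=64 -G 64 -O dir_index,extent,flex_bg /dev/sdX1
   mount -o noatime,nodiratime /dev/sdX1 /mnt/scalix-test

Load a copy of a few of those 960 directories into it and see how much
the replica delta shrinks.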

Good luck!
John

_______________________________________________
Tech mailing list
[email protected]
http://lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
 http://lopsa.org/
