Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?

2007-01-10 Thread Wade . Stuart






Dick Davies [EMAIL PROTECTED] wrote on 01/10/2007 05:26:45 AM:

 On 08/01/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

  I think that in addition to lzjb compression, squishing blocks that contain
  the same data would buy a lot of space for administrators working in many
  common workflows.

 This idea has occurred to me too - I think there are definite
 advantages to 'block re-use'.
 When you start talking about multiple similar zones, I suspect
 substantial space savings could be made - and if you can re-use that
 saved storage to provide additional redundancy, everyone would be happy.

Very true. Even on normal fileserver usage I have historically found
15-30% file-level duplication; added to the cheap snapshots and the
already existing compression, I think this is a big win.
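
A quick-and-dirty way to gauge file-level duplication on an existing server
(purely illustrative -- it hashes every file with digest(1) and counts repeated
checksums; it ignores file sizes, and /export/home is just a placeholder path):

   find /export/home -type f -exec digest -a sha256 {} \; | sort | uniq -d | wc -l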



  Assumptions:
 
  SHA256 hash used (Fletcher2/4 have too many collisions; SHA256 is 2^128 if
  I remember correctly).
  The SHA256 hash is taken on the data portion of the block as it exists on
  disk; the metadata structure is hashed separately.
  In the current metadata structure, there is a reserved bit portion to be
  used in the future.
 
 
  Description of change:
  Creates:
  The filesystem goes through its normal process of writing a block and
  creating the checksum.
  Before the step where the metadata tree is pushed, the checksum is checked
  against a global checksum tree to see if there is any match.
  If a match exists, insert a metadata placeholder for the block that
  references the already existing block on disk, and increment a
  number_of_links counter on the metadata blocks to keep track of the
  pointers pointing to this block. Free up the new block that was just
  written and checksummed so it can be used in the future.
  Else, if no match, update the checksum tree with the new checksum and
  continue as normal.

 Unless I'm reading this wrong, this sounds a lot like Plan 9's 'Venti'
 architecture ( http://cm.bell-labs.com/sys/doc/venti.html ).
 
 But using a hash 'label' seems the wrong approach.
 ZFS is supposed to scale to terrifying levels, and the chances of a
 collision, however small, work against that. I wouldn't want to trade
 reliability for some extra space.


That issue has already come up in the thread: SHA256 is 2^128 for random,
2^80 for targeted collisions. That is pretty darn good, but it would also
make sense to perform an rsync-like secondary check on a match using a
dissimilar crypto hash. If we hit the very unlikely chance that two blocks
match both SHA256 and whatever other secondary hash, I think that block
should be lost (act of god). =)

Even with this dual-check approach, the index (and the only hash stored)
can still be just the SHA256, as the chance of collision is close to nil
in this context.
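
For illustration, the kind of dissimilar-digest cross-check I mean, done by
hand on a copied-out 128K block (device name and offset below are made up):

   dd if=/dev/dsk/c0t0d0s0 of=/tmp/blk bs=128k iseek=12345 count=1
   digest -a sha256 /tmp/blk
   digest -a md5 /tmp/blk

Only if both digests matched an existing block's digests would the new block
be treated as a duplicate and its space handed back.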

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Re: Adding disk to a RAID-Z?

2007-01-10 Thread Tom Buskey
[i]I think the original poster was thinking that non-enterprise users
would be most interested in only having to *purchase* one drive at a time.

Enterprise users aren't likely to balk at purchasing 6-10 drives at a
time, so for them adding an additional *new* RaidZ to stripe across is
easier.
[/i]

Yes.  I have $xxx to spend on disks and can afford 3.  As my needs increase, 
I'll have saved enough to buy another disk.

Traditionally, you RAID your disks together and then use a volume manager to 
divvy the result up into partitions that can grow/shrink as needed.  The total 
size of the RAID isn't important until you've filled it.  Then you want to 
increase the RAID.

You could just add new RAID chunks and put a volume manager on each chunk.  
But you'd be wasting some of your space.  The incremental cost of the added 
space is the same as the original RAID.

3*n*R5 = 2n usable  (3 disks of size n in RAID-5)
4*n*R5 = 3n usable

Or doubling the disks:
6*n*R5 = 5n usable
   vs.
3*n*R5 + 3*n*R5 = 2n + 2n = 4n usable (6 disks)
or 3*n*R5 + 4*n*R5 = 2n + 3n = 5n usable (7 disks)

The cost of scaling / loss of space is balanced against the cost of 
backup / wipe-and-re-RAID / restore.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re[2]: [zfs-discuss] Re: Adding disk to a RAID-Z?

2007-01-10 Thread Robert Milkowski
Hello Kyle,

Wednesday, January 10, 2007, 5:33:12 PM, you wrote:

KM Remember though that it's been mathematically figured that the 
KM disadvantages to RaidZ start to show up after 9 or 10 drives. (That's 

Well, nothing like this was proven, and definitely not mathematically.

It's just common-sense advice - for many users, keeping raidz groups
below 9 disks should give good enough performance. However, if someone
creates a raidz group of 48 disks he/she probably also expects
performance, and in general raid-z wouldn't offer it.


-- 
Best regards,
 Robertmailto:[EMAIL PROTECTED]
   http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Limit ZFS Memory Utilization

2007-01-10 Thread Jason J. W. Williams

Hi Guys,

After reading through the discussion regarding ZFS memory
fragmentation on snv_53 (and forward) and going through our
::kmastat, it looks like ZFS is sucking down about 544 MB of RAM in the
various caches. About 360MB of that is in the zio_buf_65536 cache.
Next most notable is 55MB in zio_buf_32768, and 36MB in zio_buf_16384.
I don't think that's too bad, but it's worth keeping track of. At this
point our kernel memory growth seems to have slowed, with it hovering
around 5GB, and the anon column is mostly what's growing now (as
expected...MySQL).
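
(For anyone who wants to repeat this, the numbers above came from something
along the lines of:

   echo ::kmastat | mdb -k | egrep 'zio_buf|dnode|dmu|arc'

which dumps the kernel memory caches and lets you pick out the ZFS-related
ones.)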

Most of the problem in the discussion thread on this seemed to be
related to a lot of DNLC entries due to the workload of a file server.
How would this affect a database server with operations in only a
couple of very large files? Thank you in advance.

Best Regards,
Jason

On 1/10/07, Jason J. W. Williams [EMAIL PROTECTED] wrote:

Sanjeev  Robert,

Thanks guys. We put that in place last night and it seems to be doing
a much better job of keeping RAM consumption down. We set it to 4GB and
each of our 2 MySQL instances on the box to a max of 4GB, so hopefully a
slush of 4GB on the Thumper is enough. I would be interested in what the
other ZFS modules' memory behaviors are; I'll take a perusal through
the archives. In general it seems to me that a max cap for ZFS, whether
set through a series of individual tunables or a single root tunable,
would be very helpful.
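
For the archives, what we ended up with boils down to one line in /etc/system
plus a sanity check after reboot (the mdb incantation is just the one I've
seen floating around for peeking at the ARC, so treat it as illustrative):

   * 4 GB ARC cap (4 GiB = 0x100000000 bytes)
   set zfs:zfs_arc_max = 0x100000000

   # echo "arc::print -a c c_max c_min" | mdb -k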

Best Regards,
Jason

On 1/10/07, Sanjeev Bagewadi [EMAIL PROTECTED] wrote:
 Jason,

 Robert is right...

 The point is that ARC is the caching module of ZFS, and the majority of the
 memory is consumed through ARC.
 Hence, by limiting the c_max of the ARC we are limiting the amount the ARC consumes.

 However, other modules of ZFS will consume memory as well, but that may not be as
 significant as the ARC.

 Experts, please correct me if I am wrong here.

 Thanks and regards,
 Sanjeev.

 Robert Milkowski wrote:

 Hello Jason,
 
 Tuesday, January 9, 2007, 10:28:12 PM, you wrote:
 
 JJWW Hi Sanjeev,
 
 JJWW Thank you! I was not able to find anything as useful on the subject as
 JJWW that!  We are running build 54 on an X4500; would I be correct in my
 JJWW reading of that article that if I put set zfs:zfs_arc_max =
 JJWW 0x100000000 #4GB in my /etc/system, ZFS will consume no more than
 JJWW 4GB? Thank you in advance.
 
 That's the idea; however, it's not working that way right now - under some
 circumstances ZFS can still consume much more memory - see other
 recent posts here.
 
 
 


 --
 Solaris Revenue Products Engineering,
 India Engineering Center,
 Sun Microsystems India Pvt Ltd.
 Tel:x27521 +91 80 669 27521




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Adding disk to a RAID-Z?

2007-01-10 Thread Kyle McDonald

Robert Milkowski wrote:

Hello Kyle,

Wednesday, January 10, 2007, 5:33:12 PM, you wrote:

KM Remember though that it's been mathematically figured that the 
KM disadvantages to RaidZ start to show up after 9 or 10 drives. (That's 


Well, nothing like this was proved and definitely not mathematically.

It's just a common sense advise - for many users keeping raidz groups
below 9 disks should give good enough performance. However if someone
creates raidz group of 48 disks he/she probable expects also
performance and in general raid-z wouldn't offer one.


  

It's very possible I misstated something. :)

I thought I had read, though, something like: over 9 or so disks would 
mean that each FS block would be written to less than a single disk 
block on each disk?


Or maybe it was that waiting to read from all drives for files less than 
a FS block would suffer?


Ahhh...  I can't remember what the effects were thought to be. I thought 
there was some theoretical math involved though.


I do remember people advising against it though. Not just on a 
performance basis, but also on an increased-risk-of-failure basis. I 
think it was just seen as a good balancing point.


   -Kyle


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Adding disk to a RAID-Z?

2007-01-10 Thread Jason J. W. Williams

Hi Kyle,

I think there was a lot of talk about this behavior on the RAIDZ2 vs.
RAID-10 thread. My understanding from that discussion was that every
write stripes the block across all disks in a RAIDZ/Z2 group, thereby
making writes to the group no faster than writes to a single disk.
However, reads are much faster, as all the disks are active in the
read process.

The default config on the X4500 we received recently was RAIDZ-groups
of 6 disks (across the 6 controllers) striped together into one large
zpool.
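
In zpool terms the factory layout is roughly this shape (the controller/target
names below are made up, not the actual X4500 device names):

   zpool create tank \
       raidz c0t0d0 c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0 \
       raidz c0t1d0 c1t1d0 c2t1d0 c3t1d0 c4t1d0 c5t1d0 \
       ...

i.e. one raidz top-level vdev per row of 6 disks, all striped into the single
pool.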

Best Regards,
Jason

On 1/10/07, Kyle McDonald [EMAIL PROTECTED] wrote:

Robert Milkowski wrote:
 Hello Kyle,

 Wednesday, January 10, 2007, 5:33:12 PM, you wrote:

 KM Remember though that it's been mathematically figured that the
 KM disadvantages to RaidZ start to show up after 9 or 10 drives. (That's

 Well, nothing like this was proved and definitely not mathematically.

 It's just a common sense advise - for many users keeping raidz groups
 below 9 disks should give good enough performance. However if someone
 creates raidz group of 48 disks he/she probable expects also
 performance and in general raid-z wouldn't offer one.



It's very possible I misstated something. :)

I thought I had read though, something like over 9 or so disks would put
mean that each FS block would be written to less than a single disk
block on each disk?

Or maybe it was that waiting to read from all drives for files less than
a FS block would suffer?

Ahhh...  I can't remember what the effect were thought to be. I thought
there was some theoretical math involved though.

I do remember people advising against it though. Not just on a
performance basis, but also on a increased risk of failure basis. I
think it was just seen as a good balancing point.

-Kyle


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Why is + not allowed in a ZFS file system name ?

2007-01-10 Thread roland
# zpool create 500megpool /home/roland/tmp/500meg.dat
cannot create '500megpool': name must begin with a letter
pool name may have been omitted

huh?
ok - no problem if special characters aren't allowed, but why _this_
weird-looking limitation?
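
for the record, simply prefixing a letter satisfies the check, e.g.:

   # zpool create pool500meg /home/roland/tmp/500meg.dat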
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re[2]: [zfs-discuss] Re: Adding disk to a RAID-Z?

2007-01-10 Thread Robert Milkowski
Hello Jason,

Wednesday, January 10, 2007, 10:54:29 PM, you wrote:

JJWW Hi Kyle,

JJWW I think there was a lot of talk about this behavior on the RAIDZ2 vs.
JJWW RAID-10 thread. My understanding from that discussion was that every
JJWW write stripes the block across all disks on a RAIDZ/Z2 group, thereby
JJWW making writing the group no faster than writing to a single disk.
JJWW However reads are much faster, as all the disk are activated in the
JJWW read process.

The opposite, actually. Because of COW, writing (modifying as well)
will give you up to N-1 disks' performance for raid-z1 and N-2 disks'
performance for raid-z2. However, reading can be slow in the case of many
small random reads, as to read each fs block you've got to wait for all
data disks in a group.


JJWW The default config on the X4500 we received recently was RAIDZ-groups
JJWW of 6 disks (across the 6 controllers) striped together into one large
JJWW zpool.

However, the problem with that config is the lack of a hot spare.
Of course it depends on what you want (and there was no hot-spare support
in U2, which is the OS installed at the factory so far).


-- 
Best regards,
 Robertmailto:[EMAIL PROTECTED]
   http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re[2]: [zfs-discuss] Limit ZFS Memory Utilization

2007-01-10 Thread Robert Milkowski
Hello Jason,

Wednesday, January 10, 2007, 9:45:05 PM, you wrote:

JJWW Sanjeev  Robert,

JJWW Thanks guys. We put that in place last night and it seems to be doing
JJWW a lot better job of consuming less RAM. We set it to 4GB and each of
JJWW our 2 MySQL instances on the box to a max of 4GB. So hopefully slush
JJWW of 4GB on the Thumper is enough. I would be interested in what the
JJWW other ZFS modules memory behaviors are. I'll take a perusal through
JJWW the archives. In general it seems to me that a max cap for ZFS whether
JJWW set through a series of individual tunables or a single root tunable
JJWW would be very helpful.

Yes, it would. Better yet would be if memory consumed by ZFS for
caching (dnodes, vnodes, data, ...) behaved similarly to the page cache
with UFS, so that applications would be able to get back almost all
memory used for ZFS caches if needed.

I guess (and it's really a guess, based only on some emails here) that
in a worst-case scenario the ZFS caches would consume about:

  arc_max + 3*arc_max + memory lost to fragmentation

So I guess with arc_max set to 1GB you can lose even 5GB (or more), and
currently only that first 1GB can be reclaimed automatically.


-- 
Best regards,
 Robertmailto:[EMAIL PROTECTED]
   http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: Re[2]: [zfs-discuss] Re: Adding disk to a RAID-Z?

2007-01-10 Thread Wade . Stuart






[EMAIL PROTECTED] wrote on 01/10/2007 05:16:33 PM:

 Hello Jason,

 Wednesday, January 10, 2007, 10:54:29 PM, you wrote:

 JJWW Hi Kyle,

 JJWW I think there was a lot of talk about this behavior on the RAIDZ2 vs.
 JJWW RAID-10 thread. My understanding from that discussion was that every
 JJWW write stripes the block across all disks in a RAIDZ/Z2 group, thereby
 JJWW making writes to the group no faster than writes to a single disk.
 JJWW However, reads are much faster, as all the disks are active in the
 JJWW read process.

 The opposite, actually. Because of COW, writing (modifying as well)
 will give you up to N-1 disks' performance for raid-z1 and N-2 disks'
 performance for raid-z2. However, reading can be slow in the case of
 many small random reads, as to read each fs block you've got to wait
 for all data disks in a group.


 JJWW The default config on the X4500 we received recently was RAIDZ-groups
 JJWW of 6 disks (across the 6 controllers) striped together into one large
 JJWW zpool.

 However, the problem with that config is the lack of a hot spare.
 Of course it depends on what you want (and there was no hot-spare support
 in U2, which is the OS installed at the factory so far).


Yeah, this kinda ticked me off. The first thing I noticed is that the Thumper
that was on back order for 3 months waiting for U3 fixes was shipped with
U2 + patches.  I called support to try to track down whether U3 base was
installable with/without patches and spent 3 days of off-and-on calling to
get to someone who could find the info (Sun's internal documentation was
locked down and unpublished to support at the time).  5 out of 6 support
engineers I talked to did not even realize that U3 had been released (three
weeks after the fact). It also took 4 (long) calls to clarify that it did
in fact need 220V power (at the time I ordered it, it was listed as 110V, and
it shipped with 110V-rated cables).

Long story short, I wiped and reinstalled with U3 and raidz2 with
hot spares like it should have had in the first place.
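
For anyone curious, the new layout is plain raidz2 groups plus spares, roughly
of this shape (device names are illustrative, not the real c#t#d# list):

   zpool create tank \
       raidz2 c0t0d0 c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0 \
       raidz2 c0t1d0 c1t1d0 c2t1d0 c3t1d0 c4t1d0 c5t1d0 \
       spare c0t7d0 c1t7d0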

-Wade


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: Re[2]: [zfs-discuss] Re: Adding disk to a RAID-Z?

2007-01-10 Thread Jason J. W. Williams

Hi Robert,

I read the following section from
http://blogs.sun.com/roch/entry/when_to_and_not_to as indicating
random writes to a RAID-Z had the performance of a single disk
regardless of the group size:


Effectively, as a first approximation, an N-disk RAID-Z group will
behave as a single device in terms of delivered random input
IOPS. Thus a 10-disk group of devices each capable of 200 IOPS will
globally act as a 200-IOPS capable RAID-Z group.



Best Regards,
Jason

On 1/10/07, Robert Milkowski [EMAIL PROTECTED] wrote:

Hello Jason,

Wednesday, January 10, 2007, 10:54:29 PM, you wrote:

JJWW Hi Kyle,

JJWW I think there was a lot of talk about this behavior on the RAIDZ2 vs.
JJWW RAID-10 thread. My understanding from that discussion was that every
JJWW write stripes the block across all disks on a RAIDZ/Z2 group, thereby
JJWW making writing the group no faster than writing to a single disk.
JJWW However reads are much faster, as all the disk are activated in the
JJWW read process.

The opposite, actually. Because of COW, writing (modifying as well)
will give you up to N-1 disks' performance for raid-z1 and N-2 disks'
performance for raid-z2. However, reading can be slow in the case of many
small random reads, as to read each fs block you've got to wait for all
data disks in a group.


JJWW The default config on the X4500 we received recently was RAIDZ-groups
JJWW of 6 disks (across the 6 controllers) striped together into one large
JJWW zpool.

However the problem with that config is lack of hot-spare.
Of course it depends what you want (and there was no hot spare support
in U2 which is os installed in factory so far).


--
Best regards,
 Robertmailto:[EMAIL PROTECTED]
   http://milek.blogspot.com



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re[4]: [zfs-discuss] Re: Adding disk to a RAID-Z?

2007-01-10 Thread Robert Milkowski
Hello Jason,

Thursday, January 11, 2007, 12:46:32 AM, you wrote:

JJWW Hi Robert,

JJWW I read the following section from
JJWW http://blogs.sun.com/roch/entry/when_to_and_not_to as indicating
JJWW random writes to a RAID-Z had the performance of a single disk
JJWW regardless of the group size:

Effectively, as a first approximation, an N-disk RAID-Z group will
behave as a single device in terms of delivered random input
IOPS. Thus a 10-disk group of devices each capable of 200 IOPS will
globally act as a 200-IOPS capable RAID-Z group.


random input IOPS means random reads not writes.

-- 
Best regards,
 Robertmailto:[EMAIL PROTECTED]
   http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re[4]: [zfs-discuss] Re: Adding disk to a RAID-Z?

2007-01-10 Thread Robert Milkowski
Hello Wade,

Thursday, January 11, 2007, 12:30:40 AM, you wrote:

WSfc Long story short,  I wiped and reinstalled with U3 and raidz2 with
WSfc hostspares like it should have had in the first place.

The same here.

Besides, I always install my own system and don't use the preinstalled
ones - except that when x4500s arrive I run a small script (dd + scrubbing)
for 2-3 days to see if everything works fine before putting them into
production. Then I re-install.
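
Nothing fancy -- the burn-in amounts to something like (pool name, path and
sizes are only an example):

   dd if=/dev/urandom of=/tank/burnin/junk bs=1024k count=102400
   zpool scrub tank
   zpool status -v tank      (re-checked until the scrub completes clean)

run in a loop for a couple of days.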



-- 
Best regards,
 Robertmailto:[EMAIL PROTECTED]
   http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Adding disk to a RAID-Z?

2007-01-10 Thread Peter Schuller
 It's just a common sense advise - for many users keeping raidz groups
 below 9 disks should give good enough performance. However if someone
 creates raidz group of 48 disks he/she probable expects also
 performance and in general raid-z wouldn't offer one.

There is at least one reason for wanting more drives in the same 
raidz/raid5/etc: redundancy.

Suppose you have 18 drives. Having two raidz groups consisting of 9 drives 
each means you are more likely to suffer a failure than with a single raidz2 
consisting of 18 drives, since in the former case, yes - two drives can go 
down, but only if they are the *right* two drives. In the latter case any two 
drives can go down.

The ZFS administration guide mentions this recommendation, but does not give 
any hint as to why. A reader may assume/believe it's just general advice, 
based on someone's opinion that with more than 9 drives the statistical 
probability of failure is too high for raidz (or raid5). It's a shame the 
statement in the guide is not further qualified to actually explain that 
there is a concrete issue at play.

(I haven't looked into the archives to find the previously mentioned 
discussion.)

-- 
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: Re[4]: [zfs-discuss] Limit ZFS Memory Utilization

2007-01-10 Thread Jason J. W. Williams

Hi Robert,

We've got the default ncsize. I didn't see any advantage to increasing
it outside of NFS serving...which this server is not. For speed, the
X4500 is showing itself to be a killer MySQL platform. Between the blazing
fast procs and the sheer number of spindles, its performance is
tremendous. If MySQL Cluster had full disk-based support, scale-out
with X4500s a la Greenplum would be a terrific solution.

At this point, the ZFS memory gobbling is the main roadblock to its being
a good database platform.

Regarding the paging activity, we too saw tremendous paging, with up to
24% of the X4500's CPU being used for that with the default arc_max.
After changing it to 4GB, we haven't seen anything much over 5-10%.
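
(For anyone wanting to watch the same thing, the page scanner and system time
show up in the usual tools, e.g.:

   vmstat 5       the sr column is the page-scan rate
   mpstat 5       per-CPU sys time

with the usual caveat that these are only rough indicators.)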

Best Regards,
Jason

On 1/10/07, Robert Milkowski [EMAIL PROTECTED] wrote:

Hello Jason,

Thursday, January 11, 2007, 12:36:46 AM, you wrote:

JJWW Hi Robert,

JJWW Thank you! Holy mackerel! That's a lot of memory. With that type of a
JJWW calculation my 4GB arc_max setting is still in the danger zone on a
JJWW Thumper. I wonder if any of the ZFS developers could shed some light
JJWW on the calculation?

JJWW That kind of memory loss makes ZFS almost unusable for a database system.


If you leave ncsize at its default value then I believe it won't consume
that much memory.


JJWW I agree that a page cache similar to UFS would be much better.  Linux
JJWW works similarly to free pages, and it has been effective enough in the
JJWW past. Though I'm equally unhappy about Linux's tendency to grab every
JJWW bit of free RAM available for filesystem caching, and then cause
JJWW massive memory thrashing as it frees it for applications.

A page cache won't be better - just better memory control for ZFS caches
is strongly desired. Unfortunately, from time to time ZFS makes servers
page enormously :(


--
Best regards,
 Robertmailto:[EMAIL PROTECTED]
   http://milek.blogspot.com



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Why is + not allowed in a ZFS file system name ?

2007-01-10 Thread Toby Thain


On 10-Jan-07, at 5:29 PM, roland wrote:


# zpool create 500megpool /home/roland/tmp/500meg.dat
cannot create '500megpool': name must begin with a letter
pool name may have been omitted

huh?
ok - no problem if special characters aren`t allowed, but why  
_this_ weird looking limitaton ?




Potential for confusion with numbers (especially since alphabetic units are
often used as suffixes).


--T



This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re[2]: [zfs-discuss] Re: Adding disk to a RAID-Z?

2007-01-10 Thread Robert Milkowski
Hello Peter,

Thursday, January 11, 2007, 1:08:38 AM, you wrote:

 It's just a common sense advise - for many users keeping raidz groups
 below 9 disks should give good enough performance. However if someone
 creates raidz group of 48 disks he/she probable expects also
 performance and in general raid-z wouldn't offer one.

PS There is at least one reason for wanting more drives in the same 
PS raidz/raid5/etc: redundancy.

PS Suppose you have 18 drives. Having two raidz groups consisting of 9 drives
PS each means you are more likely to suffer a failure than with a single raidz2
PS consisting of 18 drives, since in the former case, yes - two drives can go
PS down, but only if they are the *right* two drives. In the latter case any
PS two drives can go down.

PS The ZFS administration guide mentions this recommendation, but does not give
PS any hint as to why. A reader may assume/believe it's just general adviced,
PS based on someone's opinion that with more than 9 drives, the statistical
PS probability of failure is too high for raidz (or raid5). It's a shame the
PS statement in the guide is not further qualified to actually explain that
PS there is a concrete issue at play.

I don't know if the ZFS man pages should teach people about RAID.

If somebody doesn't understand RAID basics, then what they really need is
some kind of tool where you just specify a pool of disks and choose from:
space-efficient, performance, or non-redundant - and that's it; all the
rest is hidden.



-- 
Best regards,
 Robertmailto:[EMAIL PROTECTED]
   http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re[6]: [zfs-discuss] Limit ZFS Memory Utilization

2007-01-10 Thread Robert Milkowski
Hello Jason,

Thursday, January 11, 2007, 1:10:10 AM, you wrote:

JJWW Hi Robert,

JJWW We've got the default ncsize. I didn't see any advantage to increasing
JJWW it outside of NFS serving...which this server is not. For speed the
JJWW X4500 is showing to be a killer MySQL platform. Between the blazing
JJWW fast procs and the sheer number of spindles, its perfromance is

Have you got any numbers you can share?

-- 
Best regards,
 Robertmailto:[EMAIL PROTECTED]
   http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Limit ZFS Memory Utilization

2007-01-10 Thread Mark Maybee

Jason J. W. Williams wrote:

Hi Robert,

Thank you! Holy mackerel! That's a lot of memory. With that type of a
calculation my 4GB arc_max setting is still in the danger zone on a
Thumper. I wonder if any of the ZFS developers could shed some light
on the calculation?


In a worst-case scenario, Robert's calculations are accurate to a
certain degree:  If you have 1GB of dnode_phys data in your arc cache
(that would be about 1,200,000 files referenced), then this will result
in another 3GB of related data held in memory: vnodes/znodes/
dnodes/etc.  This related data is the in-core data associated with
an accessed file.  It's not quite true that this data is not evictable;
it *is* evictable, but the space is returned from these kmem caches
only after the arc has cleared its blocks and triggered the free of
the related data structures (and even then, the kernel will need to
do a kmem_reap to reclaim the memory from the caches).  The
fragmentation that Robert mentions is an issue because, if we don't
free everything, the kmem_reap may not be able to reclaim all the
memory from these caches, as they are allocated in slabs.

We are in the process of trying to improve this situation.


That kind of memory loss makes ZFS almost unusable for a database system.


Note that you are not going to experience these sorts of overheads
unless you are accessing *many* files.  In a database system, there are
only going to be a few files = no significant overhead.


I agree that a page cache similar to UFS would be much better.  Linux
works similarly to free pages, and it has been effective enough in the
past. Though I'm equally unhappy about Linux's tendency to grab every
bit of free RAM available for filesystem caching, and then cause
massive memory thrashing as it frees it for applications.


The page cache is much better in the respect that it is more tightly
integrated with the VM system, so you get more efficient response to
memory pressure.  It is *much worse* than the ARC at caching data for
a file system.  In the long-term we plan to integrate the ARC into the
Solaris VM system.


Best Regards,
Jason

On 1/10/07, Robert Milkowski [EMAIL PROTECTED] wrote:

Hello Jason,

Wednesday, January 10, 2007, 9:45:05 PM, you wrote:

JJWW Sanjeev  Robert,

JJWW Thanks guys. We put that in place last night and it seems to be doing
JJWW a lot better job of consuming less RAM. We set it to 4GB and each of
JJWW our 2 MySQL instances on the box to a max of 4GB. So hopefully slush
JJWW of 4GB on the Thumper is enough. I would be interested in what the
JJWW other ZFS modules memory behaviors are. I'll take a perusal through
JJWW the archives. In general it seems to me that a max cap for ZFS whether
JJWW set through a series of individual tunables or a single root tunable
JJWW would be very helpful.

Yes it would. Better yet would be if memory consumed by ZFS for
caching (dnodes, vnodes, data, ...) would behave similar to page cache
like with UFS so applications will be able to get back almost all
memory used for ZFS caches if needed.

I guess (and it's really a guess only based on some emails here) that
in worst case scenario ZFS caches would consume about:

  arc_max + 3*arc_max + memory lost for fragmentation

So I guess with arc_max set to 1GB you can lost even 5GB (or more) and
currently only that first 1GB can be get back automatically.


--
Best regards,
 Robertmailto:[EMAIL PROTECTED]
   http://milek.blogspot.com



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Limit ZFS Memory Utilization

2007-01-10 Thread Erblichs
Hey guys,

Due to long URL lookups, the DNLC was pushed to variable-
sized entries. The hit rate was dropping because of
name-too-long misses. This was done long ago while I
was at Sun, under a bug reported by me..

I don't know your usage, but you should attempt to
estimate the amount of mem used with the default size.

Yes, this is after you start tracking your DNLC hit rate
and make sure it doesn't significantly drop if the ncsize
is decreased. You also may wish to increase the size and
again check the hit rate.. Yes, it is possible that your
access is random enough that no changes will affect the
hit rate.
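
The hit rate itself is easy to watch, e.g.:

   vmstat -s | grep 'name lookups'     (total lookups and cache-hit %)
   kstat -n dnlcstats                  (detailed DNLC counters)

and ncsize is an /etc/system tunable (set ncsize = <value>) that takes
effect at boot.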

2nd item.. Bonwick's mem allocators I think still have the
ability to limit the size of each slab. The issue is that
some parts of the code expect mem allocations not to fail when
using SLEEPs. This can result in extended SLEEPs, but it can be
done.

If your company makes changes to your local source
and then you rebuild, it is possible to pre-allocate a
fixed number of objects per cache and then use NOSLEEPs
with return values that indicate retry or failure.

3rd.. And this could be the most important: the mem cache
allocators are lazy about freeing memory when it is not
needed by anyone else. Thus, unfreed memory is effectively
used as a cache to remove the latencies of on-demand
memory allocations. This artificially keeps memory
usage high, but should have minimal latencies to realloc
when necessary.

Also, it is possible to make mods to increase the level
of mem garbage collection after some watermark code
is added to minimize repeated allocs and frees.


Mitchell Erblich


Jason J. W. Williams wrote:
 
 Hi Robert,
 
 We've got the default ncsize. I didn't see any advantage to increasing
 it outside of NFS serving...which this server is not. For speed the
 X4500 is showing to be a killer MySQL platform. Between the blazing
 fast procs and the sheer number of spindles, its perfromance is
 tremendous. If MySQL cluster had full disk-based support, scale-out
 with X4500s a-la Greenplum would be terrific solution.
 
 At this point, the ZFS memory gobbling is the main roadblock to being
 a good database platform.
 
 Regarding the paging activity, we too saw tremendous paging of up to
 24% of the X4500s CPU being used for that with the default arc_max.
 After changing it to 4GB, we haven't seen anything much over 5-10%.
 
 Best Regards,
 Jason
 
 On 1/10/07, Robert Milkowski [EMAIL PROTECTED] wrote:
  Hello Jason,
 
  Thursday, January 11, 2007, 12:36:46 AM, you wrote:
 
  JJWW Hi Robert,
 
  JJWW Thank you! Holy mackerel! That's a lot of memory. With that type of a
  JJWW calculation my 4GB arc_max setting is still in the danger zone on a
  JJWW Thumper. I wonder if any of the ZFS developers could shed some light
  JJWW on the calculation?
 
  JJWW That kind of memory loss makes ZFS almost unusable for a database 
  system.
 
 
  If you leave ncsize with default value then I belive it won't consume
  that much memory.
 
 
  JJWW I agree that a page cache similar to UFS would be much better.  Linux
  JJWW works similarly to free pages, and it has been effective enough in the
  JJWW past. Though I'm equally unhappy about Linux's tendency to grab every
  JJWW bit of free RAM available for filesystem caching, and then cause
  JJWW massive memory thrashing as it frees it for applications.
 
  Page cache won't be better - just better memory control for ZFS caches
  is strongly desired. Unfortunately from time to time ZFS makes servers
  to page enormously :(
 
 
  --
  Best regards,
   Robertmailto:[EMAIL PROTECTED]
 http://milek.blogspot.com
 
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Re[2]: Re: Adding disk to a RAID-Z?

2007-01-10 Thread Martin
 Hello Kyle,
 
 Wednesday, January 10, 2007, 5:33:12 PM, you wrote:
 
 KM Remember though that it's been mathematically figured that the
 KM disadvantages to RaidZ start to show up after 9 or 10 drives. (That's
 
 Well, nothing like this was proven, and definitely not mathematically.
 
 It's just common-sense advice - for many users, keeping raidz groups
 below 9 disks should give good enough performance. However, if someone
 creates a raidz group of 48 disks he/she probably also expects
 performance, and in general raid-z wouldn't offer it.

Wow, lots of good discussion here.  I started the idea of allowing a RAIDZ 
group to grow by an arbitrary number of drives because I was unaware of the 
downsides to massive pools.  From my RAID5 experience, a perfect world would 
be large numbers of data spindles and a sufficient number of parity spindles, 
e.g. 99+17 (99 data drives and 17 parity drives).  In RAID5 this would give 
massive iops and redundancy.

After studying the code and reading the blogs, a few things have jumped out, 
with some interesting (and sometimes goofy) implications.  Since I am still 
learning, I could be wrong on any of the following.

RAIDZ pools operate with a storage granularity of one stripe.  If you request a 
read of a block within the stripe, you get the whole stripe.  If you modify a 
block within the stripe, the whole stripe is written to a different location 
(ala COW).

This implies that ANY read will require the whole stripe, and therefore all 
spindles to seek and read a sector.  All drives will return the sectors 
(mostly) simultaneously.  For performance purposes, a RAIDZ pool seeks like a 
single drive would and has the throughput of multiple drives.  Unlike 
traditional RAID5, adding more spindles does NOT increase read IOPS.

Another implication is that ZFS checksums the stripe, not the component 
sectors.  If a drive silently returns a bad sector, ZFS only knows that the 
whole stripe is bad (which could probably also be inferred from a bogus 
parity sector).  ZFS has no clue which drive produced bad data, only that the 
whole stripe failed the checksum.  ZFS finds the offending sector by process 
of elimination: going through the sectors one at a time, throwing away the 
data actually read, reconstructing it from parity, and then determining 
whether the stripe passes the checksum.

Two parity drives make this a bigger problem still, almost squaring the number 
of computations needed.  If a stripe has enough parity drives, then the cost of 
determining N bad data sectors in a stripe is roughly O(k^N), where k is some 
constant.

Another implication is that there is no RAID5 write penalty.  More 
accurately, the write penalty is incurred during the read operation where an 
entire stripe is read.

Finally, there is no need to rotate parity.  Rotating parity was introduced in 
RAID5 because every write of a single sector in a stripe also necessitated the 
read and subsequent write of the parity sector.  Since there are no partial 
stripe writes in ZFS, there is no need to read then write the parity sector.

For those in the know, where am I off base here?

Thanks!
Marty
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Limit ZFS Memory Utilization

2007-01-10 Thread Al Hopper
On Wed, 10 Jan 2007, Mark Maybee wrote:

 Jason J. W. Williams wrote:
  Hi Robert,
 
  Thank you! Holy mackerel! That's a lot of memory. With that type of a
  calculation my 4GB arc_max setting is still in the danger zone on a
  Thumper. I wonder if any of the ZFS developers could shed some light
  on the calculation?
 
 In a worst-case scenario, Robert's calculations are accurate to a
 certain degree:  If you have 1GB of dnode_phys data in your arc cache
 (that would be about 1,200,000 files referenced), then this will result
 in another 3GB of related data held in memory: vnodes/znodes/
 dnodes/etc.  This related data is the in-core data associated with
 an accessed file.  Its not quite true that this data is not evictable,
 it *is* evictable, but the space is returned from these kmem caches
 only after the arc has cleared its blocks and triggered the free of
 the related data structures (and even then, the kernel will need to
 to a kmem_reap to reclaim the memory from the caches).  The
 fragmentation that Robert mentions is an issue because, if we don't
 free everything, the kmem_reap may not be able to reclaim all the
 memory from these caches, as they are allocated in slabs.

 We are in the process of trying to improve this situation.
 snip .

Understood (and many thanks).  In the meantime, is there a rule-of-thumb
that you could share that would allow mere humans (like me) to calculate
the best values of zfs:zfs_arc_max and ncsize, given that the machine has
n GB of RAM and is used in the following broad workload scenarios:

a) a busy NFS server
b) a general multiuser development server
c) a database server
d) an Apache/Tomcat/FTP server
e) a single user Gnome desktop running U3 with home dirs on a ZFS
filesystem

It would seem, from reading between the lines of previous emails,
particularly the ones you've (Mark M) written, that there is a rule of
thumb that would apply given a standard or modified ncsize tunable??

I'm primarily interested in a calculation that would allow settings that
would reduce the possibility of the machine descending into swap hell.

PS: Interestingly, no one has mentioned (the tunable) maxpgio.  I've
often found that increasing maxpgio is the only way to improve the odds of
a machine remaining usable when lots of swapping is taking place.
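
(For completeness, that one is also just an /etc/system tunable; the value
below is purely illustrative, not a recommendation:

   set maxpgio = 1024
)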

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
   Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
 OpenSolaris Governing Board (OGB) Member - Feb 2006
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS entry in /etc/vfstab

2007-01-10 Thread Vahid Moghaddasi
Hi,
Why would I ever need to specify ZFS mount(s) in /etc/vfstab at all? I see 
in some documents that zfs can be defined in /etc/vfstab with fstype zfs.
Thanks.
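
The form I've seen in those documents is a legacy mount, something along these
lines (pool/dataset and mount point are just examples):

   # zfs set mountpoint=legacy tank/export

   and then in /etc/vfstab:

   tank/export  -  /export  zfs  -  yes  -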
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

