Re: [zfs-discuss] Poor relative performance of SAS over SATA drives
if you get rid of the HBA and log device, and run with ZIL disabled (if your workload is compatible with a disabled ZIL). By "get rid of the HBA" I assume you mean put in a battery-backed RAID card instead? -J ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
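For reference, the "disabled ZIL" experiment mentioned above was done on builds of this era with the zil_disable tunable (later replaced by the per-dataset sync property). A minimal sketch, assuming you accept that a power loss can silently drop the last few seconds of synchronous writes:

    # /etc/system -- takes effect at the next boot; benchmark use only
    set zfs:zil_disable = 1

    # verify the live value afterwards
    echo zil_disable/D | mdb -k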
Re: [zfs-discuss] [OpenIndiana-discuss] Question about WD drives with Super Micro systems
WD's drives have gotten better the last few years but their quality is still not very good. I doubt they test their drives extensively for heavy-duty server configs, particularly since you don't see them inside any of the major server manufacturers' boxes. Hitachi in particular does well in mass storage configs. -J Sent via iPhone Is your email Premiere? On Aug 6, 2011, at 10:45, Roy Sigurd Karlsbakk r...@karlsbakk.net wrote: Hi all We have a few servers with WD Black (and some Green) drives on Super Micro systems. We've seen both drive types work well with direct attach, but with LSI controllers and Super Micro's SAS expanders, well, that's another story. With those SAS expanders, we've seen numerous drives being kicked out and flagged as bad during high load (typically scrub/resilver). We have not seen this on the units we have with Hitachi or Seagate drives. After a drive is kicked out, we run a test on it using WD's tool, and in many (or most) cases we find the drive to be error-free. We've seen these issues on several machines, so hardware failure does not seem to be the cause. Has anyone here used WD drives with LSI controllers (3801/3081/9211) with Super Micro machines? Any success stories? Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk. ___ OpenIndiana-discuss mailing list openindiana-disc...@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [OpenIndiana-discuss] Question about WD drives with Super Micro systems
This might be related to your issue: http://blog.mpecsinc.ca/2010/09/western-digital-re3-series-sata-drives.html On Saturday, August 6, 2011, Roy Sigurd Karlsbakk r...@karlsbakk.net wrote: In my experience, SATA drives behind SAS expanders just don't work. They fail in the manner you describe, sooner or later. Use SAS and be happy. Funny thing is Hitachi and Seagate drives work stably, whereas WD drives tend to fail rather quickly Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk. ___ OpenIndiana-discuss mailing list openindiana-disc...@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Long import due to spares.
Just for history as to why Fishworks was running on this box...we were in the beta program and have upgraded along the way. This box is an X4240 with 16x 146GB disks running the Feb 2010 release of FW with de-dupe. We were getting ready to re-purpose the box and were getting our data off. We then deleted a filesystem that was using de-duplication, and the box suddenly went into a freeze and the pool had activity like crazy. After several failed attempts to recover the box to a usable state (days of importing failed), we reloaded the boot drives with Nexenta 3.0 (b134) (which was our goal anyway). When we tried to import this pool again, after 24 hours the pool finally imported, but with the error that the two spares were FAULTED with too many errors. The controller is an LSI 1068E-IR. Normally, I'd believe the drive was dead, except: both spares? Could this be related to the de-dupe FS being deleted? -J ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Unusual Resilver Result
Hi, I just replaced a drive (c12t5d0 in the listing below). For the first 6 hours of the resilver I saw no issues. However, sometime during the last hour of the resilver, the new drive and two others in the same RAID-Z2 stripe threw a couple of checksum errors. Also, two of the other drives in the stripe decided sometime in that last hour that they needed to resilver small amounts of data (128K and 64K respectively). The OS is snv126. My two questions are: Should I be worried about these checksum errors? What caused the small resilverings on c8t5d0 and c11t5d0, which were not replaced or otherwise touched? Thank you in advance. -J

  pool: zpool_db_css
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 7h0m with 0 errors on Thu Sep 30 04:59:49 2010
config:

        NAME           STATE     READ WRITE CKSUM
        zpool_db_css   ONLINE       0     0     0
          raidz2-0     ONLINE       0     0     0
            c7t5d0     ONLINE       0     0     0
            c8t5d0     ONLINE       0     0     4  128K resilvered
            c10t5d0    ONLINE       0     0     0
            c11t5d0    ONLINE       0     0     2  64K resilvered
            c12t5d0    ONLINE       0     0     3  61.0G resilvered
            c13t5d0    ONLINE       0     0     0
          raidz2-1     ONLINE       0     0     0
            c7t6d0     ONLINE       0     0     0
            c8t6d0     ONLINE       0     0     0
            c10t6d0    ONLINE       0     0     0
            c11t6d0    ONLINE       0     0     0
            c12t6d0    ONLINE       0     0     0
            c13t6d0    ONLINE       0     0     0
          raidz2-2     ONLINE       0     0     0
            c7t7d0     ONLINE       0     0     0
            c8t7d0     ONLINE       0     0     0
            c10t7d0    ONLINE       0     0     0
            c11t7d0    ONLINE       0     0     0
            c12t7d0    ONLINE       0     0     0
            c13t7d0    ONLINE       0     0     0
        spares
          c13t4d0      AVAIL
          c12t4d0      AVAIL

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Unusual Resilver Result
Thanks Tuomas. I'll run the scrub. It's an aging X4500. -J On Thu, Sep 30, 2010 at 3:25 AM, Tuomas Leikola tuomas.leik...@gmail.com wrote: On Thu, Sep 30, 2010 at 9:08 AM, Jason J. W. Williams jasonjwwilli...@gmail.com wrote: Should I be worried about these checksum errors? Maybe. Your disks, cabling or disk controller is probably having some issue which caused them. Or maybe sunspots are to blame. Run a scrub often and monitor if there are more, and if there is a pattern to them. Have backups. Maybe switch hardware one by one to see if that helps. What caused the small resilverings on c8t5d0 and c11t5d0 which were not replaced or otherwise touched? It was the checksum errors. ZFS automatically read the good data on other mirrors, and replaced the broken blocks with correct data. If you run zpool clear and zpool scrub you will notice these checksum errors have vanished. If they were caused by botched writes, no new errors should probably appear. If they are botched reads, you can see some new ones appearing :( So, not critical yet but something to keep an eye on. Tuomas ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
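For reference, the clear-and-scrub cycle Tuomas describes is just two commands; a sketch assuming the pool name zpool_db_css from the original post:

    # reset the error counters, then re-read and verify every block in the pool
    zpool clear zpool_db_css
    zpool scrub zpool_db_css

    # watch progress and whether the CKSUM column starts climbing again
    zpool status -v zpool_db_css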
Re: [zfs-discuss] Long resilver time
134 it is. This is an OpenSolaris rig that's going to be replaced within the next 60 days, so I just need to get it to something that won't throw false checksum errors like the 120-123 builds do and has decent rebuild times. Future boxes will be NexentaStor. Thank you guys. :) -J On Sun, Sep 26, 2010 at 2:21 PM, Richard Elling rich...@nexenta.com wrote: On Sep 26, 2010, at 1:16 PM, Roy Sigurd Karlsbakk wrote: Upgrading is definitely an option. What is the current snv favorite for ZFS stability? I apologize, with all the Oracle/Sun changes I haven't been paying as close attention to bug reports on zfs-discuss as I used to. OpenIndiana b147 is the latest binary release, but it also includes the fix for CR6494473, ZFS needs a way to slow down resilvering http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6494473 http://www.openindiana.org Are you sure upgrading to OI is safe at this point? 134 is stable unless you start fiddling with dedup, and OI is hardly tested. For a production setup, I'd recommend 134 For a production setup? For production I'd recommend something that is supported, preferably NexentaStor 3 (which is b134 + important ZFS fixes :-) -- richard -- OpenStorage Summit, October 25-27, Palo Alto, CA http://nexenta-summit2010.eventbrite.com Richard Elling rich...@nexenta.com +1-760-896-4422 Enterprise class storage for everyone www.nexenta.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Long resilver time
Err...I meant Nexenta Core. -J On Mon, Sep 27, 2010 at 12:02 PM, Jason J. W. Williams jasonjwwilli...@gmail.com wrote: 134 it is. This is an OpenSolaris rig that's going to be replaced within the next 60 days, so I just need to get it to something that won't throw false checksum errors like the 120-123 builds do and has decent rebuild times. Future boxes will be NexentaStor. Thank you guys. :) -J On Sun, Sep 26, 2010 at 2:21 PM, Richard Elling rich...@nexenta.com wrote: On Sep 26, 2010, at 1:16 PM, Roy Sigurd Karlsbakk wrote: Upgrading is definitely an option. What is the current snv favorite for ZFS stability? I apologize, with all the Oracle/Sun changes I haven't been paying as close attention to bug reports on zfs-discuss as I used to. OpenIndiana b147 is the latest binary release, but it also includes the fix for CR6494473, ZFS needs a way to slow down resilvering http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6494473 http://www.openindiana.org Are you sure upgrading to OI is safe at this point? 134 is stable unless you start fiddling with dedup, and OI is hardly tested. For a production setup, I'd recommend 134 For a production setup? For production I'd recommend something that is supported, preferably NexentaStor 3 (which is b134 + important ZFS fixes :-) -- richard -- OpenStorage Summit, October 25-27, Palo Alto, CA http://nexenta-summit2010.eventbrite.com Richard Elling rich...@nexenta.com +1-760-896-4422 Enterprise class storage for everyone www.nexenta.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Intermittent ZFS hang
If one was sticking with OpenSolaris for the short term, is something older than 134 more stable/less buggy? Not using de-dupe. -J On Thu, Sep 23, 2010 at 6:04 PM, Richard Elling richard.ell...@gmail.com wrote: Hi Charles, There are quite a few bugs in b134 that can lead to this. Alas, due to the new regime, there was a period of time where the distributions were not being delivered. If I were in your shoes, I would upgrade to OpenIndiana b147, which has 26 weeks of maturity and bug fixes over b134. http://www.openindiana.org -- richard On Sep 23, 2010, at 2:48 PM, Charles J. Knipe wrote: So, I'm still having problems with intermittent hangs on write with my ZFS pool. Details from my original post are below. Since posting that, I've gone back and forth with a number of you, and gotten a lot of useful advice, but I'm still trying to get to the root of the problem so I can correct it. Since the original post I have: -Gathered a great deal of information in the form of kernel thread dumps, zio_state dumps, and live crash dumps while the problem is happening. -Been advised that my ruling out of dedupe was probably premature, as I still likely have a good deal of deduplicated data on-disk. -Checked just about every log and counter that might indicate a hardware error, without finding one. I was wondering at this point if someone could give me some pointers on the following: 1. Given the dumps and diagnostic data I've gathered so far, is there a way I can determine for certain where in the ZFS driver I'm spending so much time hanging? At the very least I'd like to try to determine whether it is, in fact, a deduplication issue. 2. If it is, in fact, a deduplication issue, would my only recourse be a new pool and a send/receive operation? The data we're storing is VMFS volumes for ESX. We're tossing around the idea of creating new volumes in the same pool (now that dedupe is off) and migrating VMs over in small batches. The theory is that we would be writing non-deduped data this way, and when we were done we could remove the deduplicated volumes. Is this sound? Thanks again for all the help! -Charles Howdy, We're having a ZFS performance issue over here that I was hoping you guys could help me troubleshoot. We have a ZFS pool made up of 24 disks, arranged into 7 raid-z devices of 4 disks each. We're using it as an iSCSI back-end for VMWare and some Oracle RAC clusters. Under normal circumstances performance is very good both in benchmarks and under real-world use. Every couple of days, however, I/O seems to hang for anywhere between several seconds and several minutes. The hang seems to be a complete stop of all write I/O. The following zpool iostat illustrates:

pool0  2.47T  5.13T  120  0  293K  0
pool0  2.47T  5.13T  127  0  308K  0
pool0  2.47T  5.13T  131  0  322K  0
pool0  2.47T  5.13T  144  0  347K  0
pool0  2.47T  5.13T  135  0  331K  0
pool0  2.47T  5.13T  122  0  295K  0
pool0  2.47T  5.13T  135  0  330K  0

While this is going on our VMs all hang, as do any zfs create commands or attempts to touch/create files in the zfs pool from the local system. After several minutes the system un-hangs and we see very high write rates before things return to normal across the board. Some more information about our configuration: We're running OpenSolaris snv_134. ZFS is at version 22. Our disks are 15k RPM 300GB Seagate Cheetahs, mounted in Promise J610S Dual enclosures, hanging off a Dell SAS 5/e controller. We'd tried out most of this configuration previously on OpenSolaris 2009.06 without running into this problem. 
The only thing that's new, aside from the newer OpenSolaris/ZFS is a set of four SSDs configured as log disks. At first we blamed de-dupe, but we've disabled that. Next we suspected the SSD log disks, but we've seen the problem with those removed, as well. Has anyone seen anything like this before? Are there any tools we can use to gather information during the hang which might be useful in determining what's going wrong? Thanks for any insights you may have. -Charles -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- OpenStorage Summit, October 25-27, Palo Alto, CA http://nexenta-summit2010.eventbrite.com ZFS and performance consulting http://www.RichardElling.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org
[zfs-discuss] Long resilver time
I just witnessed a resilver that took 4h for 27 GB of data. Setup is 3x raid-z2 stripes with 6 disks per raid-z2. Disks are 500 GB in size. No checksum errors. It seems like an exorbitantly long time. The other 5 disks in the stripe with the replaced disk were at 90% busy and ~150 IO/s each during the resilver. Does this seem unusual to anyone else? Could it be due to heavy fragmentation, or do I have a disk in the stripe going bad? Post-resilver no disk is above 30% util or noticeably higher than any other disk. Thank you in advance. (kernel is snv123) -J Sent via iPhone Is your e-mail Premiere? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Long resilver time
Upgrading is definitely an option. What is the current snv favorite for ZFS stability? I apologize, with all the Oracle/Sun changes I haven't been paying as close attention to bug reports on zfs-discuss as I used to. -J Sent via iPhone Is your e-mail Premiere? On Sep 26, 2010, at 10:22, Roy Sigurd Karlsbakk r...@karlsbakk.net wrote: I just witnessed a resilver that took 4h for 27gb of data. Setup is 3x raid-z2 stripes with 6 disks per raid-z2. Disks are 500gb in size. No checksum errors. It seems like an exorbitantly long time. The other 5 disks in the stripe with the replaced disk were at 90% busy and ~150io/s each during the resilver. Does this seem unusual to anyone else? Could it be due to heavy fragmentation or do I have a disk in the stripe going bad? Post-resilver no disk is above 30% util or noticeably higher than any other disk. Thank you in advance. (kernel is snv123) It surely seems a long time for 27 gigs. Scrub takes its time, but for this 50TB setup with currently ~29TB used, on WD Green drives (yeah, I know they're bad, but I didn't know that at the time I installed the box, and they have worked flawlessly for a year or so), scrub takes a bit of time, but nothing comparable to what you're reporting: scrub: scrub completed after 47h57m with 0 errors on Fri Sep 3 16:57:26 2010 Also, snv123 is quite old, is upgrading to 134 an option? Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [storage-discuss] ZFS iscsi snapshot - VSS compatible?
Since iSCSI is block-level, I don't think the file-level intelligence you're asking for is feasible at the iSCSI layer. VSS is used at the file-system level on either NTFS partitions or over CIFS. -J On Wed, Jan 7, 2009 at 5:06 PM, Mr Stephen Yum sosu...@yahoo.com wrote: Hi all, If I want to make a snapshot of an iscsi volume while there's a transfer going on, is there a way to detect this and either 1) not include the file being transferred, or 2) wait until the transfer is finished before making the snapshot? If I understand correctly, this is what Microsoft's VSS is supposed to do. Am I right? Right now, when there is a transfer going on while making the snapshot, I always end up with a corrupt file (understandably so, since the file transfer is unfinished). S ___ storage-discuss mailing list storage-disc...@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/storage-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 3ware support
X4500 problems seconded. Still having issues with port resets due to the Marvell driver. Though they seem considerably more transient and less likely to lock up the entire system in the most recent (b72) OpenSolaris builds. -J On Feb 12, 2008 9:35 AM, Carson Gaspar [EMAIL PROTECTED] wrote: Tim wrote: A much cheaper (and probably the BEST supported) card is the Supermicro based on the Marvell chipset. This is the same chipset that is used in the thumper x4500 so you know that the folks at sun are doing their due diligence to make sure the drivers are solid. Except the drivers _aren't_ solid, at least in Solaris(tm). The OpenSolaris drivers may have been fixed (I know a lot of work is going into them, but I haven't tested them), but those fixes have not made it back into the supported realm. So if you need to run a supported OS, I'd skip the Marvell chips if possible, at least for now. -- Carson ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] LVM on ZFS
Hey Thiago, SVM is a direct replacement for LVM. Also, you'll notice about a 30% performance boost if you move from LVM to SVM. At least we did when we moved a couple of years ago. -J On Jan 21, 2008 8:09 AM, Thiago Sobral [EMAIL PROTECTED] wrote: Hi folks, I need to manage volumes like LVM does on Linux or AIX, and I think that ZFS can solve this issue. I read the SVM specification and it certainly won't be the solution that I'll adopt. I don't have Veritas here. I created a pool with the name black and a volume lv00, then created a filesystem with the 'newfs' command: # newfs /dev/zvol/rdsk/black/lv00 Is this the right way? What is the best way to manage volumes in Solaris? Do you have a URL or document describing this? cheers, TS ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
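For reference, both approaches discussed in this thread are sketched below using Thiago's pool name black; the disk names are placeholders. A UFS-on-zvol setup works, but a native ZFS file system is usually the simpler way to get LVM-style management:

    # option 1: an emulated volume (zvol) with UFS on top, as in the original post
    zpool create black c1t1d0 c1t2d0
    zfs create -V 10g black/lv00
    newfs /dev/zvol/rdsk/black/lv00

    # option 2: skip newfs entirely and let ZFS manage the file system
    zfs create black/data
    zfs set quota=10g black/data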
Re: [zfs-discuss] De-duplication in ZFS
It'd be a really nice feature. Combined with baked-in replication it would be a nice alternative to our DD appliances. -J On Jan 21, 2008 2:03 PM, John Martinez [EMAIL PROTECTED] wrote: Great question. I've been wondering this myself over the past few weeks, as de-dup is becoming more popular a term in our IT department. -john On Jan 20, 2008, at 5:40 PM, Narayan Venkat wrote: Hi, Is de-duplication in ZFS an active project? If so, can somebody share details about how it's going to be implemented? Thanks. NV ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] MySQL/ZFS backup program posted.
Hey Y'all, I've posted the program (SnapBack) my company developed internally for backing up production MySQL servers using ZFS snapshots: http://blogs.digitar.com/jjww/?itemid=56 Hopefully, it'll save other folks some time. We use it a lot for standing up new MySQL slaves as well. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
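SnapBack itself lives at the URL above; the general flush-lock-snapshot pattern such tools use looks roughly like the sketch below, assuming the MySQL datadir sits on a dataset named tank/mysql (the dataset and host names are placeholders, and this is not the actual SnapBack code):

    # 1) in a mysql session: FLUSH TABLES WITH READ LOCK; record the binlog
    #    position; keep that session open while step 2 runs, then UNLOCK TABLES;
    # 2) from another shell, take an atomic snapshot of the datadir's dataset:
    zfs snapshot tank/mysql@backup-20070101
    # 3) to stand up a new slave, send the snapshot to the target host:
    zfs send tank/mysql@backup-20070101 | ssh newslave zfs recv -F tank/mysql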
[zfs-discuss] ZFS Not Offlining Disk on SCSI Sense Error (X4500)
Hello, There seems to be a persistent issue we have with ZFS where one of the SATA disks in a zpool on a Thumper starts throwing sense errors: ZFS does not offline the disk and instead hangs all zpools across the system. If it is not caught soon enough, application data ends up in an inconsistent state. We've had this issue with b54 through b77 (as of last night). Reading through the archives, we don't seem to be the only folks with this issue. Are there any plans to fix this behavior? It really makes ZFS less than desirable/reliable. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Not Offlining Disk on SCSI Sense Error (X4500)
Hi Albert, Thank you for the link. ZFS isn't offlining the disk in b77. -J On Jan 3, 2008 3:07 PM, Albert Chin [EMAIL PROTECTED] wrote: On Thu, Jan 03, 2008 at 02:57:08PM -0700, Jason J. W. Williams wrote: There seems to be a persistent issue we have with ZFS where one of the SATA disk in a zpool on a Thumper starts throwing sense errors, ZFS does not offline the disk and instead hangs all zpools across the system. If it is not caught soon enough, application data ends up in an inconsistent state. We've had this issue with b54 through b77 (as of last night). We don't seem to be the only folks with this issue reading through the archives. Are there any plans to fix this behavior? It really makes ZFS less than desirable/reliable. http://blogs.sun.com/eschrock/entry/zfs_and_fma FMA For ZFS Phase 2 (PSARC/2007/283) was integrated in b68: http://www.opensolaris.org/os/community/arc/caselog/2007/283/ http://www.opensolaris.org/os/community/on/flag-days/all/ -- albert chin ([EMAIL PROTECTED]) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Not Offlining Disk on SCSI Sense Error (X4500)
Hi Eric, Hard to say. I'll use MDB next time it happens for more info. The applications using any zpool lock up. -J On Jan 3, 2008 3:33 PM, Eric Schrock [EMAIL PROTECTED] wrote: When you say starts throwing sense errors, does that mean every I/O to the drive will fail, or some arbitrary percentage of I/Os will fail? If it's the latter, ZFS is trying to do the right thing by recognizing these as transient errors, but eventually the ZFS diagnosis should kick in. What does '::spa -ve' in 'mdb -k' show in one of these situations? How about '::zio_state'? - Eric On Thu, Jan 03, 2008 at 03:11:39PM -0700, Jason J. W. Williams wrote: Hi Albert, Thank you for the link. ZFS isn't offlining the disk in b77. -J On Jan 3, 2008 3:07 PM, Albert Chin [EMAIL PROTECTED] wrote: On Thu, Jan 03, 2008 at 02:57:08PM -0700, Jason J. W. Williams wrote: There seems to be a persistent issue we have with ZFS where one of the SATA disk in a zpool on a Thumper starts throwing sense errors, ZFS does not offline the disk and instead hangs all zpools across the system. If it is not caught soon enough, application data ends up in an inconsistent state. We've had this issue with b54 through b77 (as of last night). We don't seem to be the only folks with this issue reading through the archives. Are there any plans to fix this behavior? It really makes ZFS less than desirable/reliable. http://blogs.sun.com/eschrock/entry/zfs_and_fma FMA For ZFS Phase 2 (PSARC/2007/283) was integrated in b68: http://www.opensolaris.org/os/community/arc/caselog/2007/283/ http://www.opensolaris.org/os/community/on/flag-days/all/ -- albert chin ([EMAIL PROTECTED]) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Eric Schrock, FishWorkshttp://blogs.sun.com/eschrock ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
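For anyone wanting to capture the same data Eric asks for here, the dcmds are run against the live kernel while the hang is in progress; a minimal sketch:

    # as root
    mdb -k
    ::spa -ve        # per-pool state with verbose vdev/error detail
    ::zio_state      # state of outstanding ZIOs
    $q               # quit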
Re: [zfs-discuss] ZFS performance with Oracle
Seconded. Redundant controllers mean you can get one controller that locks them both up as much as it means you've got a backup. Best Regards, Jason On Mar 21, 2007 4:03 PM, Richard Elling [EMAIL PROTECTED] wrote: JS wrote: I'd definitely prefer owning a sort of SAN solution that would basically just be trays of JBODs exported through redundant controllers, with enterprise-level service. The world is still playing catch-up to integrate with all the possibilities of zfs. It was called the A5000, later A5100 and A5200. I've still got the scars and Torrey looks like one of the X-men. If you think that a disk drive vendor can write better code than an OS/systems vendor, then you're due for a sad realization. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] X4500 ILOM thinks disk 20 is faulted, ZFS thinks not.
Hey Guys, Have any of y'all seen a condition where the ILOM considers a disk faulted (status is 3 instead of 1), but ZFS keeps writing to the disk and doesn't report any errors? I'm going to do a scrub tomorrow and see what comes back. I'm curious what caused the ILOM to fault the disk. Any advice is greatly appreciated. Best Regards, Jason P.S. The system is running OpenSolaris Build 54. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] X4500 ILOM thinks disk 20 is faulted, ZFS thinks not.
Hi Ralf, Thank you for the suggestion. About half of the disks are reporting 1968-1969 in the Soft Errors field. All disks are reporting 1968 in the Illegal Request field. There don't appear to be any other errors; all other counters are 0. The Illegal Request count seems a little fishy...like iostat -E doesn't like the X4500 for some reason. Thank you again for your help. Best Regards, Jason On Dec 4, 2007 2:54 AM, Ralf Ramge [EMAIL PROTECTED] wrote: Jason J. W. Williams wrote: Have any of y'all seen a condition where the ILOM considers a disk faulted (status is 3 instead of 1), but ZFS keeps writing to the disk and doesn't report any errors? I'm going to do a scrub tomorrow and see what comes back. I'm curious what caused the ILOM to fault the disk. Any advice is greatly appreciated. What does `iostat -E` tell you? I've experienced several times that ZFS is very fault tolerant - a bit too tolerant for my taste - when it comes to faulting a disk. I saw external FC drives with hundreds or even thousands of errors, even entire hanging loops or drives with hardware trouble, and neither ZFS nor /var/adm/messages reported a problem. So I prefer examining the iostat output over `zpool status` - but with the unattractive side effect that it's not possible to reset the error count which iostat reports without a reboot, so this method is not suitable for monitoring purposes. -- Ralf Ramge Senior Solaris Administrator, SCNA, SCSA Tel. +49-721-91374-3963 [EMAIL PROTECTED] - http://web.de/ 11 Internet AG Brauerstraße 48 76135 Karlsruhe Amtsgericht Montabaur HRB 6484 Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, Thomas Gottschlich, Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss Aufsichtsratsvorsitzender: Michael Scheeren ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
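For reference, the check Ralf suggests is a one-liner; -E prints per-device cumulative error counters since boot, and -n adds descriptive device names:

    # all disks
    iostat -En
    # or just a single suspect drive (device name is a placeholder)
    iostat -En c5t3d0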
Re: [zfs-discuss] Yager on ZFS
A quick Google of ext3 fsck did not yield obvious examples of why people needed to run fsck on ext3, though it did remind me that by default ext3 runs fsck just for the hell of it every N (20?) mounts - could that have been part of what you were seeing? I'm not sure if that's what Robert meant, but that's been my experience with ext3. In fact, that little behavior caused a rather lengthy bit of downtime at another company in our same colo facility this week as a result of a facility-required reboot. Frankly, ext3 is an abortion of a filesystem. I'm somewhat surprised it's being used as a counterexample of journaling filesystems being no less reliable than ZFS. XFS or ReiserFS are both better examples than ext3. The primary use case for end-to-end checksumming in our environment has been exonerating the storage path when data corruption occurs. It's been crucial in a couple of instances in proving to our DB vendor that the corruption was caused by their code and not the OS, drivers, HBA, FC network, array, etc. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
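For reference, the every-N-mounts fsck described above is ext3's mount-count check and is tunable on the Linux side; a sketch with /dev/sda1 as a placeholder device:

    # show the current maximum mount count and check interval
    tune2fs -l /dev/sda1 | egrep -i 'mount count|check'
    # disable the forced periodic fsck (trade-off: no scheduled consistency checks)
    tune2fs -c 0 -i 0 /dev/sda1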
[zfs-discuss] Count objects/inodes
Hi Guys, Someone asked me how to count the number of inodes/objects in a ZFS filesystem and I wasn't exactly sure. zdb -dv filesystem seems like a likely candidate but I wanted to find out for sure. As to why you'd want to know this, I don't know their reasoning but I assume it has to do with the maximum number of files a ZFS filesystem can support (2^48 no?). Thank you in advance for your help. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
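For reference, zdb does answer this directly; a sketch assuming a dataset named tank/home:

    # the dataset summary line includes the total object count
    zdb -d tank/home
    # per-object detail (can be extremely verbose on large filesystems)
    zdb -dv tank/home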
Re: [zfs-discuss] Fracture Clone Into FS
Hi Bill, You've got it 99%. I want to roll E back to say B, and keep G intact. I really don't care about C, D or F. Essentially, B is where I want to roll back to, but in case B's data copy doesn't improve what I'm trying to fix, I want to have a copy of G's data around so I can go back to how it was. My order of operations would be something like this: 1.) Snapshot filesystem to preserve current state (snapshot F). 2.) Create clone of F (clone G). 3.) Roll the filesystem back to snapshot B. 4.) Maintain clone G data even though filesystem is at B. My concerns are: 1.) If I roll back to B after creating the clone, it will erase F and thereby the dependent clone G. 2.) If I promote the clone G, G will be the active filesystem data copy, whereas I want B to be the active data copy; I just want to keep G around. I apologize that this is coming out so confusingly. Please let me know if this is clear at all. I guess in a simple way, you could say I'd like to be able to roll back to any particular snapshot without having to lose any newer snapshot. Thereby giving the ability to roll forward and backward. Thank you in advance very much! Best Regards, Jason On 10/18/07, Bill Moore [EMAIL PROTECTED] wrote: I may not be understanding your usage case correctly, so bear with me. Here is what I understand your request to be. Time is increasing from left to right. A -- B -- C -- D -- E \ - F -- G Where E and G are writable filesystems and the others are snapshots. I think you're saying that you want to, for example, keep G and roll E back to A, keeping A, B, F, and G. If that's correct, I think you can just clone A (getting H), promote H, then delete C, D, and E. That would leave you with: A -- H \ -- B -- F -- G Is that anything at all like what you're after? --Bill On Wed, Oct 17, 2007 at 10:00:03PM -0600, Jason J. W. Williams wrote: Hey Guys, It's not possible yet to fracture a snapshot or clone into a self-standing filesystem, is it? Basically, I'd like to fracture a snapshot/clone into its own FS so I can roll back past that snapshot in the original filesystem and still keep that data. Thank you in advance. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Fracture Clone Into FS
Hi Bill, Thinking about this a little more, would this provide the ability to maintain B and G's data for a rollback followed by a possible roll forward? 1.) Create a clone of snapshot_B (clone_B). 2.) Create a new current snapshot (snapshot_F). 3.) Create a clone of snapshot_F (clone_F). 4.) Promote clone_B. 5.) If clone_B's data doesn't work out, promote clone_F to roll forward. Thank you in advance. Best Regards, Jason On 10/18/07, Jason J. W. Williams [EMAIL PROTECTED] wrote: Hi Bill, You've got it 99%. I want to roll E back to say B, and keep G intact. I really don't care about C, D or F. Essentially, B is where I want to roll back to, but in case B's data copy doesn't improve what I'm trying to fix, I want to have a copy of G's data around so I can go back to how it was. My order of operations would be something like this: 1.) Snapshot filesystem to preserve current state (snapshot F). 2.) Create clone of F (clone G). 3.) Roll the filesystem back to snapshot B. 4.) Maintain clone G data even though filesystem is at B. My concerns are: 1.) If I roll back to B after creating the clone, it will erase F and thereby the dependent clone G. 2.) If I promote the clone G, G will be the active filesystem data copy, whereas I want B to be the active data copy; I just want to keep G around. I apologize that this is coming out so confusingly. Please let me know if this is clear at all. I guess in a simple way, you could say I'd like to be able to roll back to any particular snapshot without having to lose any newer snapshot. Thereby giving the ability to roll forward and backward. Thank you in advance very much! Best Regards, Jason On 10/18/07, Bill Moore [EMAIL PROTECTED] wrote: I may not be understanding your usage case correctly, so bear with me. Here is what I understand your request to be. Time is increasing from left to right. A -- B -- C -- D -- E \ - F -- G Where E and G are writable filesystems and the others are snapshots. I think you're saying that you want to, for example, keep G and roll E back to A, keeping A, B, F, and G. If that's correct, I think you can just clone A (getting H), promote H, then delete C, D, and E. That would leave you with: A -- H \ -- B -- F -- G Is that anything at all like what you're after? --Bill On Wed, Oct 17, 2007 at 10:00:03PM -0600, Jason J. W. Williams wrote: Hey Guys, It's not possible yet to fracture a snapshot or clone into a self-standing filesystem, is it? Basically, I'd like to fracture a snapshot/clone into its own FS so I can roll back past that snapshot in the original filesystem and still keep that data. Thank you in advance. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
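For reference, the sequence being discussed maps onto plain zfs commands roughly as follows, with pool/fs, @B and @F as placeholder names. The key property is that zfs promote only reverses the clone/origin dependency; nothing is destroyed, so the current data stays reachable while the B-based copy becomes the one applications use:

    # 1) clone the old snapshot whose data you want to return to
    zfs clone pool/fs@B pool/clone_B
    # 2) preserve and clone the current state as well
    zfs snapshot pool/fs@F
    zfs clone pool/fs@F pool/clone_F
    # 3) make the B-based copy independent of pool/fs (pool/fs becomes its clone)
    zfs promote pool/clone_B
    # 4) nothing was rolled back destructively: pool/fs and pool/clone_F still
    #    hold the newest data if B's copy turns out not to help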
[zfs-discuss] Fracture Clone Into FS
Hey Guys, It's not possible yet to fracture a snapshot or clone into a self-standing filesystem, is it? Basically, I'd like to fracture a snapshot/clone into its own FS so I can roll back past that snapshot in the original filesystem and still keep that data. Thank you in advance. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Direct I/O ability with zfs?
Hi Dale, We're testing out the enhanced arc_max enforcement (tracking DNLC entries) using Build 72 right now. Hopefully, it will fix the memory creep, which is the only real downside to ZFS for DB work, it seems to me. Frankly, our DB loads have improved performance with ZFS. I suspect it's because we are write-heavy. -J On 10/3/07, Dale Ghent [EMAIL PROTECTED] wrote: On Oct 3, 2007, at 10:31 AM, Roch - PAE wrote: If the DB cache is made large enough to consume most of memory, the ZFS copy will quickly be evicted to stage other I/Os on their way to the DB cache. What problem does that pose ? Personally, I'm still not completely sold on the performance (performance as in ability, not speed) of ARC eviction. Oftentimes, especially during a resilver, a server with ~2GB of RAM free under normal circumstances will dive down to the minfree floor, causing processes to be swapped out. We've had to take to manually constraining ARC max size so this situation is avoided. This is on s10u2/3. I haven't tried anything heavy-duty with Nevada simply because I don't put Nevada in production situations. Anyhow, in the case of DBs, ARC indeed becomes a vestigial organ. I'm surprised that this is being met with skepticism considering that Oracle highly recommends direct IO be used, and, IIRC, Oracle performance was the main motivation for adding DIO to UFS back in Solaris 2.6. This isn't a problem with ZFS or any specific fs per se, it's the buffer caching they all employ. So I'm a big fan of seeing 6429855 come to fruition. /dale ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS ARC DNLC Limitation
Hello All, A while back (Feb '07) when we noticed ZFS was hogging all the memory on the system, y'all were kind enough to help us use the arc_max tunable to attempt to limit that usage to a hard value. Unfortunately, at the time a sticky problem was that the hard limit did not include DNLC entries generated by ZFS. I've been watching the list since then and trying to watch the Nevada commits. I haven't noticed that anything has been committed back so that arc_max truly enforces the max amount of memory ZFS is allowed to consume (including DNLC entries). Has this been corrected and I just missed it? Thank you in advance for any help. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS Snapshot destroy to
Hey All, Is it possible (or even technically feasible) for zfs to have a 'destroy to' feature? Basically, destroy any snapshot older than a certain date? Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Snapshot destroy to
Hi Mark, Thank you very much. That's what I was kind of afraid of. It's fine to script it; it just would be nice to have a built-in function. :-) Thank you again. Best Regards, Jason On 5/11/07, Mark J Musante [EMAIL PROTECTED] wrote: On Fri, 11 May 2007, Jason J. W. Williams wrote: Is it possible (or even technically feasible) for zfs to have a 'destroy to' feature? Basically, destroy any snapshot older than a certain date? Sorta-kinda. You can use 'zfs get' to get the creation time of a snapshot. If you give it -p, it'll provide the seconds-since-epoch time so, with a little fancy footwork, this is scriptable. Regards, markm ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
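For reference, the "fancy footwork" Mark mentions is only a few lines of shell; a sketch assuming a dataset named tank/data (leave the echo in place until you trust the selection):

    #!/bin/sh
    # destroy snapshots of tank/data older than 30 days
    cutoff=`perl -e 'print time() - 30*24*60*60'`
    for snap in `zfs list -H -t snapshot -o name | grep '^tank/data@'`; do
        created=`zfs get -Hp -o value creation $snap`
        [ "$created" -lt "$cutoff" ] && echo zfs destroy $snap
    done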
Re: [zfs-discuss] C'mon ARC, stay small...
Hi Guys, Rather than starting a new thread I thought I'd continue this one. I've been running Build 54 on a Thumper since mid-January and wanted to ask a question about the zfs_arc_max setting. We set it to 0x100000000 (4GB), however it's creeping over that until our kernel memory usage is nearly 7GB (::memstat inserted below). This is a database server, so I was curious if the DNLC would have this effect over time, as it does quite quickly when dealing with small files? Would it be worth upgrading to Build 59? Thank you in advance! Best Regards, Jason

Page Summary            Pages       MB   %Tot
Kernel                1750044     6836    42%
Anon                  1211203     4731    29%
Exec and libs            7648       29     0%
Page cache             220434      861     5%
Free (cachelist)       318625     1244     8%
Free (freelist)        659607     2576    16%
Total                 4167561    16279
Physical              4078747    15932

On 3/23/07, Roch - PAE [EMAIL PROTECTED] wrote: With the latest Nevada, setting zfs_arc_max in /etc/system is sufficient. Playing with mdb on a live system is more tricky and is what caused the problem here. -r [EMAIL PROTECTED] writes: Jim Mauro wrote: All righty...I set c_max to 512MB, c to 512MB, and p to 256MB...

arc::print -tad
{
 ...
 c02e29e8 uint64_t size = 0t299008
 c02e29f0 uint64_t p = 0t16588228608
 c02e29f8 uint64_t c = 0t33176457216
 c02e2a00 uint64_t c_min = 0t1070318720
 c02e2a08 uint64_t c_max = 0t33176457216
 ...
}
c02e2a08 /Z 0x20000000
arc+0x48: 0x7b9789000 = 0x20000000
c02e29f8 /Z 0x20000000
arc+0x38: 0x7b9789000 = 0x20000000
c02e29f0 /Z 0x10000000
arc+0x30: 0x3dcbc4800 = 0x10000000
arc::print -tad
{
 ...
 c02e29e8 uint64_t size = 0t299008
 c02e29f0 uint64_t p = 0t268435456 -- p is 256MB
 c02e29f8 uint64_t c = 0t536870912 -- c is 512MB
 c02e2a00 uint64_t c_min = 0t1070318720
 c02e2a08 uint64_t c_max = 0t536870912 --- c_max is 512MB
 ...
}

After a few runs of the workload ...

arc::print -d size
size = 0t536788992

Ah - looks like we're out of the woods. The ARC remains clamped at 512MB. Is there a way to set these fields using /etc/system? Or does this require a new or modified init script to run and do the above with each boot? Darren ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
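For reference, on builds where the zfs:0:arcstats kstat is available you can watch whether the cap is holding without a debugger (0x100000000 is simply 4GB written in hex):

    # current ARC size, target, and configured ceiling, in bytes
    kstat -p zfs:0:arcstats:size zfs:0:arcstats:c zfs:0:arcstats:c_max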
Re: [zfs-discuss] Re: Re: Re: ZFS memory and swap usage
Hi Rainer, While I would recommend upgrading to Build 54 or newer to use the system tunable, it's not that big of a deal to set the ARC on boot up. We've done it on a T2000 for a while, until we could take it down for an extended period of time to upgrade it. Definitely WOULD NOT run a database on ZFS without it. You will run out of RAM, and depending on how your DB responds to being out of RAM, you could get some very undesirable results. Just my two cents. -J On 3/19/07, Rainer Heilke [EMAIL PROTECTED] wrote: The updated information states that the kernel setting is only for the current Nevada build. We are not going to use the kernel debugger method to change the setting on a live production system (and do this every time we need to reboot). We're back to trying to set their expectations more realistically, and using proper tools to measure memory usage. As I stated at the outset, they are trying to start up a 10GB SGA database within two minutes to simulate the start-up of five 2GB databases at boot-up. I sincerely doubt they are going to start all five databases simultaneously within two minutes on a regular boot-up. So, what is the best use of the OS tools (vmstat, etc.) to show them how this would really occur? Rainer This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] X2200-M2
Hi Brian, To my understanding the X2100 M2 and X2200 M2 are basically the same board OEM'd from Quanta...except the 2200 M2 has two sockets. As to ZFS and any weirdness, it would seem to me that fixing it would be more an issue for the SATA/SCSI driver. I may be wrong here. -J On 3/12/07, Brian Hechinger [EMAIL PROTECTED] wrote: After the interesting revelations about the X2100 and its hot-swap abilities, what are the abilities of the X2200-M2's disk subsystem, and is ZFS going to tickle any weirdness out of them? -brian -- The reason I don't use Gnome: every single other window manager I know of is very powerfully extensible, where you can switch actions to different mouse buttons. Guess which one is not, because it might confuse the poor users? Here's a hint: it's not the small and fast one. --Linus ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: How much do we really want zpool remove?
Hi Przemol, I think migration is a really important feature...think I said that... ;-) SAN/RAID is not awful...frankly there's not been a better solution (outside of NetApp's WAFL) till ZFS. SAN/RAID just has its own reliability issues you accept unless you don't have to... ZFS :-) -J On 2/27/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: On Thu, Feb 22, 2007 at 12:21:50PM -0700, Jason J. W. Williams wrote: Hi Przemol, I think Casper had a good point bringing up the data integrity features when using ZFS for RAID. Big companies do a lot of things just because that's the certified way that end up biting them in the rear. Trusting your SAN arrays is one of them. That all being said, the need to do migrations is a very valid concern. Jason, I don't claim that SAN/RAID solutions are the best and don't have any mistakes/failures/problems. But if SAN/RAID is so bad, why do companies using them survive? Imagine also that some company is using SAN/RAID for a few years and doesn't have any problems (or only one every few months). Also from time to time they need to migrate between arrays (for whatever reason). Now you come and say that they have unreliable SAN/RAID and you offer something new (ZFS) which is going to make it much more reliable, but migration to another array will be painful. What do you think they will choose? BTW: I am a fan of ZFS. :-) przemol -- Ustawiaj rekordy DNS dla swojej domeny http://link.interia.pl/f1a1a ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ARGHH. An other panic!!
Hi Gino, Was there more than one LUN in the RAID-Z using the port you disabled? -J On 2/26/07, Gino Ruopolo [EMAIL PROTECTED] wrote: Hi Jason, Saturday we ran some tests and found that disabling an FC port under heavy load (MPXio enabled) often leads to a panic. (using a RAID-Z!) No problems with UFS ... later, Gino This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] HELIOS and ZFS cache
Hi Eric, Everything Mark said. We as a customer ran into this running MySQL on a Thumper (and a T2000). We solved it on the Thumper by limiting the ARC to 4GB in /etc/system:

set zfs:zfs_arc_max = 0x100000000 # 4GB

This has worked marvelously over the past 50 days. The ARC stays around 5-6GB now, leaving 11GB for the DB. Best Regards, Jason On 2/22/07, Mark Maybee [EMAIL PROTECTED] wrote: This issue has been discussed a number of times in this forum. To summarize: ZFS (specifically, the ARC) will try to use *most* of the system's available memory to cache file system data. The default is to max out at physmem-1GB (i.e., use all of physical memory except for 1GB). In the face of memory pressure, the ARC will give up memory, however there are some situations where we are unable to free up memory fast enough for an application that needs it (see the example in the HELIOS note below). In these situations, it may be necessary to lower the ARC's maximum memory footprint, so that there is a larger amount of memory immediately available for applications. This is particularly relevant in situations where there is a known amount of memory that will always be required for use by some application (databases often fall into this category). The tradeoff here is that the ARC will not be able to cache as much file system data, and that could impact performance. For example, if you know that an application will need 5GB on a 36GB machine, you could set the arc maximum to 30GB (0x780000000). In ZFS on s10 prior to update 4, you can only change the arc max size via explicit actions with mdb(1):

# mdb -kw
arc::print -a c_max
<address> c_max = <current-max>
<address>/Z <new-max>

In the current opensolaris nevada bits, and in s10u4, you can use the system variable 'zfs_arc_max' to set the maximum arc size. Just set this in /etc/system. -Mark Erik Vanden Meersch wrote: Could someone please provide comments or a solution for this? Subject: Solaris 10 ZFS problems with database applications HELIOS TechInfo #106 Tue, 20 Feb 2007 Solaris 10 ZFS problems with database applications -- We have tested Solaris 10 release 11/06 with ZFS without any problems using all HELIOS UB based products, including very high load tests. However we learned from customers that some database solutions (known are Sybase and Oracle), when allocating a large amount of memory, may slow down or even freeze the system for up to a minute. This can result in RPC timeout messages and service interrupts for HELIOS processes. ZFS is basically using most memory for file caching. Freeing this ZFS memory for the database memory allocation can result in serious delays. This does not occur when using HELIOS products only. The HELIOS test system was using 4GB memory. The customer production machine was using 16GB memory. Contact your SUN representative about how to limit the ZFS cache and what else to consider when using ZFS in your workflow. Check also with your application vendor for recommendations on using ZFS with their applications. 
Best regards, HELIOS Support HELIOS Software GmbH Steinriede 3 30827 Garbsen (Hannover) Germany Phone: +49 5131 709320 FAX:+49 5131 709325 http://www.helios.de -- http://www.sun.com/solaris * Erik Vanden Meersch * Solution Architect *Sun Microsystems, Inc.* Phone x48835/+32-2-704 8835 Mobile 0479/95 05 98 Email [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: How much do we really want zpool remove?
Hi Przemol, I think Casper had a good point bringing up the data integrity features when using ZFS for RAID. Big companies do a lot of things just because it's the certified way, and those things end up biting them in the rear. Trusting your SAN arrays is one of them. That all being said, the need to do migrations is a very valid concern. Best Regards, Jason On 2/22/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: On Wed, Feb 21, 2007 at 04:43:34PM +0100, [EMAIL PROTECTED] wrote: I cannot let you say that. Here in my company we are very interested in ZFS, but we do not care about the RAID/mirror features, because we already have a SAN with RAID-5 disks, and dual fabric connection to the hosts. But you understand that these underlying RAID mechanisms give absolutely no guarantee about data integrity but only that some data was found where some (possibly other) data was written? (RAID5 never verifies the checksum is correct on reads; it only uses it to reconstruct data when reads fail) But you understand that he perhaps knows that, but so far nothing wrong has happened [*] and migration is still a very important feature for him? [*] almost every big company has its data center with SAN and FC connections with RAID-5 or RAID-10 in their storage arrays and they are treated as reliable przemol -- Ustawiaj rekordy DNS dla swojej domeny http://link.interia.pl/f1a1a ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Zfs best practice for 2U SATA iSCSI NAS
Hi Nicholas, Actually Virtual Iron, they have a nice system at the moment with live migration of Windows guests. Ah. We looked at them for some Windows DR. They do have a nice product. 3. Which leads to: coming from Debian, how easy are system updates? I remember with OpenBSD system updates used to be a pain. Not a pain, but coming from Debian/Gentoo not great either. Packaging is one of the last areas where Solaris really needs an upgrade. You might want to take a look at Nexenta, which is OpenSolaris with a GNU userland and apt-get. Works pretty well. Once installed you can update it to Build 56 to get the iSCSI target. -J ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Zfs best practice for 2U SATA iSCSI NAS
Hi Nicholas, ZFS itself is very stable and very effective as a fast FS in our experience. If you browse the archives of the list you'll see that NFS performance is pretty acceptable, with some performance/RAM quirks around small files: http://www.opensolaris.org/jive/message.jspa?threadID=19858 http://www.opensolaris.org/jive/thread.jspa?threadID=18394 To my understanding the iSCSI driver is undergoing significant performance improvements...maybe someone close to this can help? If by VI you are referring to VMware Infrastructure...you won't get any support from VMware if you're using the iSCSI target on Solaris as it's not approved by them. Not that this is really a problem in my experience, as VMware tech support is pretty terrible anyway. Some questions: 1. How stable is zfs? I'm tolerant of some sweat work to fix problems, but data loss is unacceptable We haven't experienced any data loss, and have had some pretty nasty things thrown at it (FC array rebooted unexpectedly). 2. If drives need to be pulled and put into a new chassis, does zfs handle them having new device names and being out of order? My understanding and experience here is yes. It'll read the ZFS labels off the drives/slices. 3. Is it possible to hot swap drives with raidz(2)? Depends on your underlying hardware. To my knowledge hot-swapping is not dependent on the RAID level at all. 4. How does performance compare with 'brand name' storage systems? No clue if you're referring to NetApp. Does anyone else know? -J ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
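For reference on question 2, the move-drives-to-a-new-chassis case is handled by export/import; pool membership is recorded in labels on the disks themselves, so the import works even if every disk comes up with a new cNtNdN name. A sketch with a placeholder pool named tank:

    # on the old chassis (clean shutdown of the pool)
    zpool export tank
    # on the new chassis, after moving the disks
    zpool import          # scans device labels and lists importable pools
    zpool import tank
    # if the pool could not be exported cleanly first
    zpool import -f tank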
Re: [zfs-discuss] Re: ZFS or UFS - what to do?
Hi Jeff, Maybe I misread this thread, but I don't think anyone was saying that using ZFS on top of an intelligent array risks more corruption. Given my experience, I wouldn't run ZFS without some level of redundancy, since it will panic your kernel in a RAID-0 scenario where it detects a LUN is missing and can't fix it. That being said, I wouldn't run anything but ZFS anymore. When we had some database corruption issues a while back, ZFS made it very simple to prove it was the DB. Just did a scrub and boom, verification that the data was laid down correctly. RAID-5 will have better random read performance than RAID-Z for reasons Robert had to beat into my head. ;-) But if you really need that performance, perhaps RAID-10 is what you should be looking at? Someone smarter than I can probably give a better idea. Regarding the failure detection, does anyone on the list have the ZFS/FMA traps fed into a network management app yet? I'm curious what the experience with it is. Best Regards, Jason On 1/29/07, Jeffery Malloch [EMAIL PROTECTED] wrote: Hi Guys, SO... From what I can tell from this thread ZFS is VERY fussy about managing writes, reads and failures. It wants to be bit perfect. So if you use the hardware that comes with a given solution (in my case an Engenio 6994) to manage failures you risk a) bad writes that don't get picked up due to corruption from write cache to disk b) failures due to data changes that ZFS is unaware of that the hardware imposes when it tries to fix itself. So now I have a $70K+ lump that's useless for what it was designed for. I should have spent $20K on a JBOD. But since I didn't do that, it sounds like a traditional model works best (i.e. UFS et al) for the type of hardware I have. No sense paying for something and not using it. And by using ZFS just as a method for ease of file system growth and management I risk much more corruption. The other thing I haven't heard is why NOT to use ZFS. Or people who don't like it for some reason or another. Comments? Thanks, Jeff PS - the responses so far have been great and are much appreciated! Keep 'em coming... This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
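For the RAID-10 alternative mentioned above, a hedged example of a striped-mirror layout versus a single raidz group (pool and device names are placeholders; the two commands are alternatives, not meant to be run together):

    # two-way mirrors striped together, roughly ZFS's equivalent of RAID-10
    zpool create dbpool mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0
    # the same four disks as one raidz group: more usable capacity, slower random reads
    zpool create dbpool raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0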
Re: [zfs-discuss] Project Proposal: Availability Suite
Thank you for the detailed explanation. It is very helpful to understand the issue. Is anyone successfully using SNDR with ZFS yet? Best Regards, Jason On 1/26/07, Jim Dunham [EMAIL PROTECTED] wrote: Jason J. W. Williams wrote: Could the replication engine eventually be integrated more tightly with ZFS? Not it in the present form. The architecture and implementation of Availability Suite is driven off block-based replication at the device level (/dev/rdsk/...), something that allows the product to replicate any Solaris file system, database, etc., without any knowledge of what it is actually replicating. To pursue ZFS replication in the manner of Availability Suite, one needs to see what replication looks like from an abstract point of view. So simplistically, remote replication is like the letter 'h', where the left side of the letter is the complete I/O path on the primary node, the horizontal part of the letter is the remote replication network link, and the right side of the letter is only the bottom half of the complete I/O path on the secondary node. Next ZFS would have to have its functional I/O path split into two halves, a top and bottom piece. Next we configure replication, the letter 'h', between two given nodes, running both a top and bottom piece of ZFS on the source node, and just the bottom half of ZFS on the secondary node. Today, the SNDR component of Availability Suite works like the letter 'h' today, where we split the Solaris I/O stack into a top and bottom half. The top half is that software (file system, database or application I/O) that directs its I/Os to the bottom half (raw device, volume manager or block device). So all that needs to be done is to design and build a new variant of the letter 'h', and find the place to separate ZFS into two pieces. - Jim Dunham That would be slick alternative to send/recv. Best Regards, Jason On 1/26/07, Jim Dunham [EMAIL PROTECTED] wrote: Project Overview: I propose the creation of a project on opensolaris.org, to bring to the community two Solaris host-based data services; namely volume snapshot and volume replication. These two data services exist today as the Sun StorageTek Availability Suite, a Solaris 8, 9 10, unbundled product set, consisting of Instant Image (II) and Network Data Replicator (SNDR). Project Description: Although Availability Suite is typically known as just two data services (II SNDR), there is an underlying Solaris I/O filter driver framework which supports these two data services. This framework provides the means to stack one or more block-based, pseudo device drivers on to any pre-provisioned cb_ops structure, [ http://www.opensolaris.org/os/article/2005-03-31_inside_opensolaris__solaris_driver_programming/#datastructs ], thereby shunting all cb_ops I/O into the top of a developed filter driver, (for driver specific processing), then out the bottom of this filter driver, back into the original cb_ops entry points. Availability Suite was developed to interpose itself on the I/O stack of a block device, providing a filter driver framework with the means to intercept any I/O originating from an upstream file system, database or application layer I/O. This framework provided the means for Availability Suite to support snapshot and remote replication data services for UFS, QFS, VxFS, and more recently the ZFS file system, plus various databases like Oracle, Sybase and PostgreSQL, and also application I/Os. 
By providing a filter driver at this point in the Solaris I/O stack, it allows for any number of data services to be implemented, without regard to the underlying block storage that they will be configured on. Today, as a snapshot and/or replication solution, the framework allows both the source and destination block storage device to not only differ in physical characteristics (DAS, Fibre Channel, iSCSI, etc.), but also logical characteristics such as in RAID type, volume managed storage (i.e., SVM, VxVM), lofi, zvols, even ram disks. Community Involvement: By providing this filter-driver framework, two working filter drivers (II SNDR), and an extensive collection of supporting software and utilities, it is envisioned that those individuals and companies that adopt OpenSolaris as a viable storage platform, will also utilize and enhance the existing II SNDR data services, plus have offered to them the means in which to develop their own block-based filter driver(s), further enhancing the use and adoption on OpenSolaris. A very timely example that is very applicable to Availability Suite and the OpenSolaris community, is the recent announcement of the Project Proposal: lofi [ compression encryption ] - http://www.opensolaris.org/jive/click.jspamessageID=26841. By leveraging both the Availability Suite and the lofi OpenSolaris projects, it would be highly probable to not only offer compression encryption to lofi devices (as already proposed
Re: [zfs-discuss] hot spares - in standby?
Hi Guys, I seem to remember the Massive Array of Independent Disk guys ran into a problem I think they called static friction, where idle drives would fail on spin up after being idle for a long time: http://www.eweek.com/article2/0,1895,1941205,00.asp Would that apply here? Best Regards, Jason On 1/29/07, Toby Thain [EMAIL PROTECTED] wrote: On 29-Jan-07, at 9:04 PM, Al Hopper wrote: On Mon, 29 Jan 2007, Toby Thain wrote: Hi, This is not exactly ZFS specific, but this still seems like a fruitful place to ask. It occurred to me today that hot spares could sit in standby (spun down) until needed (I know ATA can do this, I'm supposing SCSI does too, but I haven't looked at a spec recently). Does anybody do this? Or does everybody do this already? I don't work with enough disk storage systems to know what is the industry norm. But there are 3 broad categories of disk drive spares: a) Cold Spare. A spare where the power is not connected until it is required. [1] b) Warm Spare. A spare that is active but placed into a low power mode. ... c) Hot Spare. A spare that is spun up and ready to accept read/write/position (etc) requests. Hi Al, Thanks for reminding me of the distinction. It seems very few installations would actually require (c)? Does the tub curve (chance of early life failure) imply that hot spares should be burned in, instead of sitting there doing nothing from new? Just like a data disk, seems to me you'd want to know if a hot spare fails while waiting to be swapped in. Do they get tested periodically? The ideal scenario, as you already allude to, would be for the disk subsystem to initially configure the drive as a hot spare and send it periodic test events for, say, the first 48 hours. For some reason that's a little shorter than I had in mind, but I take your word that that's enough burn-in for semiconductors, motors, servos, etc. This would get it past the first segment of the bathtub reliability curve ... If saving power was the highest priority, then the ideal situation would be where the disk subsystem could apply/remove power to the spare and move it from warm to cold upon command. I am surmising that it would also considerably increase the spare's useful lifespan versus hot and spinning. One trick with disk subsystems, like ZFS that have yet to have the FMA type functionality added and which (today) provide for hot spares only, is to initially configure a pool with one (hot) spare, and then add a 2nd hot spare, based on installing a brand new device, say, 12 months later. And another spare 12 months later. What you are trying to achieve, with this strategy, is to avoid the scenario whereby mechanical systems, like disk drives, tend to wear out within the same general, relatively short, timeframe. One (obvious) issue with this strategy, is that it may be impossible to purchase the same disk drive 12 and 24 months later. However, it's always possible to purchase a larger disk drive ...which is not guaranteed to be compatible with your storage subsystem...! --Toby and simply commit to the fact that the extra space provided by the newer drive will be wasted. [1] The most common example is a disk drive mounted on a carrier but not seated within the disk drive enclosure. Simple push in when required. ... Al Hopper Logical Approach Inc, Plano, TX. 
[EMAIL PROTECTED] Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005 OpenSolaris Governing Board (OGB) Member - Feb 2006 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
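A minimal sketch of the staggered hot-spare approach Al describes, assuming an existing pool named tank (pool and device names are placeholders):

    zpool add tank spare c2t0d0    # first hot spare at initial deployment
    # roughly 12 months later, add a second spare from a newer drive batch
    zpool add tank spare c2t1d0
    zpool status tank              # spares show up in their own section, marked AVAIL until used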
Re: [zfs-discuss] Project Proposal: Availability Suite
Hi Jim, Thank you very much for the heads up. Unfortunately, we need the write-cache enabled for the application I was thinking of combining this with. Sounds like SNDR and ZFS need some more soak time before you can use both to their full potential together? Best Regards, Jason On 1/29/07, Jim Dunham [EMAIL PROTECTED] wrote: Jason, Thank you for the detailed explanation. It is very helpful to understand the issue. Is anyone successfully using SNDR with ZFS yet? Of the opportunities I've been involved with the answer is yes, but so far I've not seen SNDR with ZFS in a production environment, but that does not mean they don't exist. It was not until late June '06 that AVS 4.0, Solaris 10 and ZFS were generally available, and to date AVS has not been made available for the Solaris Express Community Release, but it will be real soon. While I have your attention, there are two issues between ZFS and AVS that need mentioning. 1). When ZFS is given an entire LUN to place in a ZFS storage pool, ZFS detects this, enabling SCSI write-caching on the LUN, and also opens the LUN with exclusive access, preventing other data services (like AVS) from accessing this device. The work-around is to manually format the LUN, typically placing all the blocks into a single partition, then just place this partition into the ZFS storage pool. ZFS detects this, not owning the entire LUN, and doesn't enable write-caching, which means it also doesn't open the LUN with exclusive access, and therefore AVS and ZFS can share the same LUN. I thought about submitting an RFE to have ZFS provide a means to override this restriction, but I am not 100% certain that a ZFS filesystem directly accessing a write-cache enabled LUN is the same thing as a replicated ZFS filesystem accessing a write-cache enabled LUN. Even though AVS is write-order consistent, there are disaster recovery scenarios, when enacted, where block-order, versus write-order, I/Os are issued. 2). One has to be very cautious in using zpool import -f (forced import), especially on a LUN or LUNs into which SNDR is actively replicating. If ZFS complains that the storage pool was not cleanly exported when issuing a zpool import ..., and one attempts a zpool import -f, without checking the active replication state, they are sure to panic Solaris. Of course this failure scenario is no different than accessing a LUN or LUNs on dual-ported, or SAN based storage when another Solaris host is still accessing the ZFS filesystem, or controller based replication, as they are all just different operational scenarios of the same issue, data blocks changing out from underneath the ZFS filesystem, and its CRC checking mechanisms. Jim Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
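A sketch of the single-partition work-around Jim describes, assuming one shared LUN c3t0d0 (device and pool names are placeholders):

    format c3t0d0                # interactively put all usable blocks into one slice, e.g. slice 0
    zpool create tank c3t0d0s0   # handing ZFS a slice rather than the whole LUN, so it neither enables
                                 # the write cache nor takes exclusive access, and SNDR can share the LUN

And, per Jim's second caution, if zpool import on the secondary warns that the pool was not cleanly exported, verify that SNDR replication is quiesced before even considering zpool import -f.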
Re: [zfs-discuss] hot spares - in standby?
Hi Toby, You're right. The healthcheck would definitely find any issues. I misinterpreted your comment to that effect as a question and didn't quite latch on. A zpool MAID-mode with that healthcheck might also be interesting on something like a Thumper for pure-archival, D2D backup work. Would dramatically cut down on the power. What do y'all think? Best Regards, Jason On 1/29/07, Toby Thain [EMAIL PROTECTED] wrote: On 29-Jan-07, at 11:02 PM, Jason J. W. Williams wrote: Hi Guys, I seem to remember the Massive Array of Independent Disk guys ran into a problem I think they called static friction, where idle drives would fail on spin up after being idle for a long time: You'd think that probably wouldn't happen to a spare drive that was spun up from time to time. In fact this problem would be (mitigated and/or) caught by the periodic health check I suggested. --T http://www.eweek.com/article2/0,1895,1941205,00.asp Would that apply here? Best Regards, Jason On 1/29/07, Toby Thain [EMAIL PROTECTED] wrote: On 29-Jan-07, at 9:04 PM, Al Hopper wrote: On Mon, 29 Jan 2007, Toby Thain wrote: Hi, This is not exactly ZFS specific, but this still seems like a fruitful place to ask. It occurred to me today that hot spares could sit in standby (spun down) until needed (I know ATA can do this, I'm supposing SCSI does too, but I haven't looked at a spec recently). Does anybody do this? Or does everybody do this already? I don't work with enough disk storage systems to know what is the industry norm. But there are 3 broad categories of disk drive spares: a) Cold Spare. A spare where the power is not connected until it is required. [1] b) Warm Spare. A spare that is active but placed into a low power mode. ... c) Hot Spare. A spare that is spun up and ready to accept read/write/position (etc) requests. Hi Al, Thanks for reminding me of the distinction. It seems very few installations would actually require (c)? Does the tub curve (chance of early life failure) imply that hot spares should be burned in, instead of sitting there doing nothing from new? Just like a data disk, seems to me you'd want to know if a hot spare fails while waiting to be swapped in. Do they get tested periodically? The ideal scenario, as you already allude to, would be for the disk subsystem to initially configure the drive as a hot spare and send it periodic test events for, say, the first 48 hours. For some reason that's a little shorter than I had in mind, but I take your word that that's enough burn-in for semiconductors, motors, servos, etc. This would get it past the first segment of the bathtub reliability curve ... If saving power was the highest priority, then the ideal situation would be where the disk subsystem could apply/remove power to the spare and move it from warm to cold upon command. I am surmising that it would also considerably increase the spare's useful lifespan versus hot and spinning. One trick with disk subsystems, like ZFS that have yet to have the FMA type functionality added and which (today) provide for hot spares only, is to initially configure a pool with one (hot) spare, and then add a 2nd hot spare, based on installing a brand new device, say, 12 months later. And another spare 12 months later. What you are trying to achieve, with this strategy, is to avoid the scenario whereby mechanical systems, like disk drives, tend to wear out within the same general, relatively short, timeframe. 
One (obvious) issue with this strategy, is that it may be impossible to purchase the same disk drive 12 and 24 months later. However, it's always possible to purchase a larger disk drive ...which is not guaranteed to be compatible with your storage subsystem...! --Toby and simply commit to the fact that the extra space provided by the newer drive will be wasted. [1] The most common example is a disk drive mounted on a carrier but not seated within the disk drive enclosure. Simple push in when required. ... Al Hopper Logical Approach Inc, Plano, TX. [EMAIL PROTECTED] approach.com Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005 OpenSolaris Governing Board (OGB) Member - Feb 2006 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS or UFS - what to do?
Hi Jeff, We're running a FLX210 which I believe is an Engenio 2884. In our case it also is attached to a T2000. ZFS has run VERY stably for us with data integrity issues at all. We did have a significant latency problem caused by ZFS flushing the write cache on the array after every write, but that can be fixed by configuring your array to ignore cache flushes. The instructions for Engenio products are here: http://blogs.digitar.com/jjww/?itemid=44 We use the config for a production database, so I can't speak to the NFS issues. All I would mention is to watch the RAM consumption by ZFS. Does anyone on the list have a recommendation for ARC sizing with NFS? Best Regards, Jason On 1/26/07, Jeffery Malloch [EMAIL PROTECTED] wrote: Hi Folks, I am currently in the midst of setting up a completely new file server using a pretty well loaded Sun T2000 (8x1GHz, 16GB RAM) connected to an Engenio 6994 product (I work for LSI Logic so Engenio is a no brainer). I have configured a couple of zpools from Volume groups on the Engenio box - 1x2.5TB and 1x3.75TB. I then created sub zfs systems below that and set quotas and sharenfs'd them so that it appears that these file systems are dynamically shrinkable and growable. It looks very good... I can see the correct file system sizes on all types of machines (Linux 32/64bit and of course Solaris boxes) and if I resize the quota it's picked up in NFS right away. But I would be the first in our organization to use this in an enterprise system so I definitely have some concerns that I'm hoping someone here can address. 1. How stable is ZFS? The Engenio box is completely configured for RAID5 with hot spares and write cache (8GB) has battery backup so I'm not too concerned from a hardware side. I'm looking for an idea of how stable ZFS itself is in terms of corruptability, uptime and OS stability. 2. Recommended config. Above, I have a fairly simple setup. In many of the examples the granularity is home directory level and when you have many many users that could get to be a bit of a nightmare administratively. I am really only looking for high level dynamic size adjustability and am not interested in its built in RAID features. But given that, any real world recommendations? 3. Caveats? Anything I'm missing that isn't in the docs that could turn into a BIG gotchya? 4. Since all data access is via NFS we are concerned that 32 bit systems (Mainly Linux and Windows via Samba) will not be able to access all the data areas of a 2TB+ zpool even if the zfs quota on a particular share is less then that. Can anyone comment? The bottom line is that with anything new there is cause for concern. Especially if it hasn't been tested within our organization. But the convenience/functionality factors are way too hard to ignore. Thanks, Jeff This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
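A rough sketch of the per-filesystem quota and sharenfs layout Jeff describes, using placeholder pool and filesystem names:

    zfs create bigpool/projects
    zfs set quota=500G bigpool/projects    # NFS clients see a 500G filesystem
    zfs set sharenfs=rw bigpool/projects
    zfs set quota=750G bigpool/projects    # growing it later is a single command, picked up by NFS clients right away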
Re: [zfs-discuss] ZFS or UFS - what to do?
Correction: ZFS has run VERY stably for us with data integrity issues at all. should read ZFS has run VERY stably for us with NO data integrity issues at all. On 1/26/07, Jason J. W. Williams [EMAIL PROTECTED] wrote: Hi Jeff, We're running a FLX210 which I believe is an Engenio 2884. In our case it also is attached to a T2000. ZFS has run VERY stably for us with data integrity issues at all. We did have a significant latency problem caused by ZFS flushing the write cache on the array after every write, but that can be fixed by configuring your array to ignore cache flushes. The instructions for Engenio products are here: http://blogs.digitar.com/jjww/?itemid=44 We use the config for a production database, so I can't speak to the NFS issues. All I would mention is to watch the RAM consumption by ZFS. Does anyone on the list have a recommendation for ARC sizing with NFS? Best Regards, Jason On 1/26/07, Jeffery Malloch [EMAIL PROTECTED] wrote: Hi Folks, I am currently in the midst of setting up a completely new file server using a pretty well loaded Sun T2000 (8x1GHz, 16GB RAM) connected to an Engenio 6994 product (I work for LSI Logic so Engenio is a no brainer). I have configured a couple of zpools from Volume groups on the Engenio box - 1x2.5TB and 1x3.75TB. I then created sub zfs systems below that and set quotas and sharenfs'd them so that it appears that these file systems are dynamically shrinkable and growable. It looks very good... I can see the correct file system sizes on all types of machines (Linux 32/64bit and of course Solaris boxes) and if I resize the quota it's picked up in NFS right away. But I would be the first in our organization to use this in an enterprise system so I definitely have some concerns that I'm hoping someone here can address. 1. How stable is ZFS? The Engenio box is completely configured for RAID5 with hot spares and write cache (8GB) has battery backup so I'm not too concerned from a hardware side. I'm looking for an idea of how stable ZFS itself is in terms of corruptability, uptime and OS stability. 2. Recommended config. Above, I have a fairly simple setup. In many of the examples the granularity is home directory level and when you have many many users that could get to be a bit of a nightmare administratively. I am really only looking for high level dynamic size adjustability and am not interested in its built in RAID features. But given that, any real world recommendations? 3. Caveats? Anything I'm missing that isn't in the docs that could turn into a BIG gotchya? 4. Since all data access is via NFS we are concerned that 32 bit systems (Mainly Linux and Windows via Samba) will not be able to access all the data areas of a 2TB+ zpool even if the zfs quota on a particular share is less then that. Can anyone comment? The bottom line is that with anything new there is cause for concern. Especially if it hasn't been tested within our organization. But the convenience/functionality factors are way too hard to ignore. Thanks, Jeff This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: How much do we really want zpool remove?
To be fair, you can replace vdevs with same-sized or larger vdevs online. The issue is that you cannot replace with smaller vdevs nor can you eliminate vdevs. In other words, I can migrate data around without downtime, I just can't shrink or eliminate vdevs without send/recv. This is where the philosophical disconnect lies. Everytime we descend into this rathole, we stir up more confusion :-( We did just this to move off RAID-5 LUNs that were the vdevs for a pool, to RAID-10 LUNs. Worked very well, and as Richard said was done all on-line. Doesn't really address the shrinking issue though. :-) Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
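A minimal sketch of the on-line vdev migration described above, moving a pool from old RAID-5 LUNs to new RAID-10 LUNs one vdev at a time (pool and device names are placeholders):

    zpool replace tank c2t0d0 c5t0d0    # resilvers the new LUN in place of the old one while the pool stays on-line
    zpool status tank                   # watch resilver progress; repeat for each remaining old LUN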
Re: [zfs-discuss] multihosted ZFS
You could use SAN zoning of the affected LUN's to keep multiple hosts from seeing the zpool. When failover time comes, you change the zoning to make the LUN's visible to the new host, then import. When the old host reboots, it won't find any zpool. Better safe than sorry Or change the LUN masking on the array. Depending on your switch that can be less disruptive, and depending on your storage array might be able to be scripted. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
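A hedged sketch of the failover sequence, assuming the zoning or LUN-masking change has already made the LUNs visible only to the take-over host (the pool name is a placeholder):

    zpool import               # on the take-over host: confirm the pool's devices are now visible
    zpool import -f sanpool    # -f is only needed if the old host died without exporting;
                               # be certain the old host can no longer write before forcing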
Re: [zfs-discuss] Project Proposal: Availability Suite
Could the replication engine eventually be integrated more tightly with ZFS? That would be slick alternative to send/recv. Best Regards, Jason On 1/26/07, Jim Dunham [EMAIL PROTECTED] wrote: Project Overview: I propose the creation of a project on opensolaris.org, to bring to the community two Solaris host-based data services; namely volume snapshot and volume replication. These two data services exist today as the Sun StorageTek Availability Suite, a Solaris 8, 9 10, unbundled product set, consisting of Instant Image (II) and Network Data Replicator (SNDR). Project Description: Although Availability Suite is typically known as just two data services (II SNDR), there is an underlying Solaris I/O filter driver framework which supports these two data services. This framework provides the means to stack one or more block-based, pseudo device drivers on to any pre-provisioned cb_ops structure, [ http://www.opensolaris.org/os/article/2005-03-31_inside_opensolaris__solaris_driver_programming/#datastructs ], thereby shunting all cb_ops I/O into the top of a developed filter driver, (for driver specific processing), then out the bottom of this filter driver, back into the original cb_ops entry points. Availability Suite was developed to interpose itself on the I/O stack of a block device, providing a filter driver framework with the means to intercept any I/O originating from an upstream file system, database or application layer I/O. This framework provided the means for Availability Suite to support snapshot and remote replication data services for UFS, QFS, VxFS, and more recently the ZFS file system, plus various databases like Oracle, Sybase and PostgreSQL, and also application I/Os. By providing a filter driver at this point in the Solaris I/O stack, it allows for any number of data services to be implemented, without regard to the underlying block storage that they will be configured on. Today, as a snapshot and/or replication solution, the framework allows both the source and destination block storage device to not only differ in physical characteristics (DAS, Fibre Channel, iSCSI, etc.), but also logical characteristics such as in RAID type, volume managed storage (i.e., SVM, VxVM), lofi, zvols, even ram disks. Community Involvement: By providing this filter-driver framework, two working filter drivers (II SNDR), and an extensive collection of supporting software and utilities, it is envisioned that those individuals and companies that adopt OpenSolaris as a viable storage platform, will also utilize and enhance the existing II SNDR data services, plus have offered to them the means in which to develop their own block-based filter driver(s), further enhancing the use and adoption on OpenSolaris. A very timely example that is very applicable to Availability Suite and the OpenSolaris community, is the recent announcement of the Project Proposal: lofi [ compression encryption ] - http://www.opensolaris.org/jive/click.jspamessageID=26841. By leveraging both the Availability Suite and the lofi OpenSolaris projects, it would be highly probable to not only offer compression encryption to lofi devices (as already proposed), but by collectively leveraging these two project, creating the means to support file systems, databases and applications, across all block-based storage devices. 
Since Availability Suite has strong technical ties to storage, please look for email discussion for this project at: storage-discuss at opensolaris dot org A complete set of Availability Suite administration guides can be found at: http://docs.sun.com/app/docs?p=coll%2FAVS4.0 Project Lead: Jim Dunham http://www.opensolaris.org/viewProfile.jspa?username=jdunham Availability Suite - New Solaris Storage Group This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thumper Origins Q
Hi Wee, Having snapshots in the filesystem that work so well is really nice. How are y'all quiescing the DB? Best Regards, J On 1/24/07, Wee Yeh Tan [EMAIL PROTECTED] wrote: On 1/25/07, Bryan Cantrill [EMAIL PROTECTED] wrote: ... after all, what was ZFS going to do with that expensive but useless hardware RAID controller? ... I almost rolled over reading this. This is exactly what I went through when we moved our database server out from Vx** to ZFS. We had a 3510 and were thinking how best to configure the RAID. In the end, we ripped out the controller board and used the 3510 as a JBOD directly attached to the server. My DBA was so happy with this setup (especially with the snapshot capability) he is asking for another such setup. -- Just me, Wire ... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] need advice: ZFS config ideas for X4500 Thumper?
Hi Neal, We've been getting pretty good performance out of RAID-Z2 with 3x 6-disk RAID-Z2 stripes. More stripes mean better performance all around...particularly on random reads. But as a file-server that's probably not a concern. With RAID-Z2 it seems to me 2 hot-spares is very sufficient, but I'll defer to others with more knowledge. Best Regards, Jason On 1/23/07, Neal Pollack [EMAIL PROTECTED] wrote: Hi: (Warning, new zfs user question) I am setting up an X4500 for our small engineering site file server. It's mostly for builds, images, doc archives, certain workspace archives, misc data. I'd like a trade off between space and safety of data. I have not set up a large ZFS system before, and have only played with simple raidz2 with 7 disks. After reading http://blogs.sun.com/relling/entry/raid_recommendations_space_vs_mttdl; I am leaning toward a RAID-Z2 config with spares, for approx 15 terabytes, but I do not yet understand the nomenclature and exact config details. For example, the graph/chart shows that 7+2 RAID-Z2 with spares would be a good balance in capacity and data safety, but I do not know what to do with that number, how it maps to an actual setup? Does that type of config also provide a balance between performance and data safety? Can someone provide an actual example of how the config should look? If I save two disks for the boot, how do the other 46 disks get configured between spares and zfs groups? Thanks, Neal ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
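A sketch of the layout mentioned above, three 6-disk raidz2 groups striped into one pool plus two hot spares (device names are placeholders and would ideally be spread across controllers):

    zpool create data \
      raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 \
      raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
      raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 \
      spare c3t0d0 c3t1d0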
Re: [zfs-discuss] Re: Re: Re: Re: External drive enclosures + Sun
I believe the SmartArray is an LSI like the Dell PERC isn't it? Best Regards, Jason On 1/23/07, Robert Suh [EMAIL PROTECTED] wrote: People trying to hack together systems might want to look at the HP DL320s http://h10010.www1.hp.com/wwpc/us/en/ss/WF05a/15351-241434-241475-241475 -f79-3232017.html 12 drive bays, Intel Woodcrest, SAS (and SATA) controller. If you snoop around, you might be able to find drive carriers on eBay or elsewhere (*cough* search HP drive sleds HP drive carriers) $3k for the chassis. A mini thumper. Though I'm not sure if Solaris supports the Smart Array controller. Rob -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of mike Sent: Monday, January 22, 2007 1:17 PM To: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] Re: Re: Re: Re: External drive enclosures + Sun I'm dying here - does anyone know when or even if they will support these? I had this whole setup planned out but it requires eSATA + port multipliers. I want to use ZFS, but currently cannot in that fashion. I'd still have to buy some [more expensive, noisier, bulky internal drive] solution for ZFS. Unless anyone has other ideas. I'm looking to run a 5-10 drive system (with easy ability to expand) in my home office; not in a datacenter. Even opening up to iSCSI seems to not get me much - there aren't any SOHO type NAS enclosures that act as iSCSI targets. There are however handfuls of eSATA based 4, 5, and 10 drive enclosures perfect for this... but all require the port multiplier support. On 1/22/07, Frank Cusack [EMAIL PROTECTED] wrote: Unfortunately, Solaris does not support SATA port multipliers (yet) so I think you're pretty limited in how many esata drives you can connect. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] need advice: ZFS config ideas for X4500 Thumper?
Hi Peter, Perhaps I'm a bit dense, but I've been befuddled by the x+y notation myself. Is it X stripes consisting of Y disks? Best Regards, Jason On 1/23/07, Peter Tribble [EMAIL PROTECTED] wrote: On 1/23/07, Neal Pollack [EMAIL PROTECTED] wrote: Hi: (Warning, new zfs user question) I am setting up an X4500 for our small engineering site file server. It's mostly for builds, images, doc archives, certain workspace archives, misc data. ... Can someone provide an actual example of how the config should look? If I save two disks for the boot, how do the other 46 disks get configured between spares and zfs groups? What I ended up with was working with 8+2 raidz2 vdevs. It could have been 4+2, but 8+2 gives you more space, and that was more important than performance. (The performance of the 8+2 is easily adequate for our needs.) And with 46 drives to play with I can have 4 lots of that. At the moment I have 6 hot-spares (I may take some of those out later, but at the moment I don't need them). So the config looks like: zpool create images \ raidz2 c{0,1,4,6,7}t0d0 c{1,4,5,6,7}t1d0 \ raidz2 c{0,4,5,6,7}t2d0 c{0,1,5,6,7}t3d0 \ raidz2 c{0,1,4,6,7}t4d0 c{0,1,4,6,7}t5d0 \ raidz2 c{0,1,4,5,7}t6d0 c{0,1,4,5,6}t7d0 \ spare c0t1d0 c1t2d0 c4t3d0 c5t5d0 c6t6d0 c7t7d0 this spreads everything across all the controllers, and with no more than 2 disks on each controller I could survive the rather unlikely event of a controller failure (unless it's the controller with the boot drives...). -- -Peter Tribble http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] need advice: ZFS config ideas for X4500 Thumper?
Hi Peter, Ah! That clears it up for me. Thank you. Best Regards, Jason On 1/23/07, Peter Tribble [EMAIL PROTECTED] wrote: On 1/23/07, Jason J. W. Williams [EMAIL PROTECTED] wrote: Hi Peter, Perhaps I'm a bit dense, but I've been befuddled by the x+y notation myself. Is it X stripes consisting of Y disks? Sorry. Took a short cut on that bit. It's x data disks + y parity. So in the case of raidz1, y=1; in the case of raidz2, y=2. And ideally x should be a power of 2. (So 8+2 is a raidz2 stripe of 10 disks in total.) I've always used this notation, but now I think about it I'm not sure how universal it is. -- -Peter Tribble http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Thumper Origins Q
Hi All, This is a bit off-topic...but since the Thumper is the poster child for ZFS I hope its not too off-topic. What are the actual origins of the Thumper? I've heard varying stories in word and print. It appears that the Thumper was the original server Bechtolsheim designed at Kealia as a massive video server. However, when we were first told about it a year ago through Sun contacts Thumper was described as a part of a scalabe iSCSI storage system, where Thumpers would be connected to a head (which looked a lot like a pair of X4200s) via iSCSI that would then present the storage over iSCSI and NFS. Recently, other sources mentioned they were told about the same time that Thumper was part of the Honeycomb project. So I was curious if anyone had any insights into the history/origins of the Thumper...or just wanted to throw more rumors on the fire. ;-) Thanks in advance for your indulgence. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Synchronous Mount?
Hi Prashanth, My company did a lot of LVM+XFS vs. SVM+UFS testing in addition to ZFS. Overall, LVM's overhead is abysmal. We witnessed performance hits of 50%+. SVM only reduced performance by about 15%. ZFS was similar, though a tad higher. Also, my understanding is you can't write to a ZFS snapshot...unless you clone it. Perhaps, someone who knows more than I can clarify. Best Regards, Jason On 1/23/07, Prashanth Radhakrishnan [EMAIL PROTECTED] wrote: Is there someway to synchronously mount a ZFS filesystem? '-o sync' does not appear to be honoured. No there isn't. Why do you think it is necessary? Specifically, I was trying to compare ZFS snapshots with LVM snapshots on Linux. One of the tests does writes to an ext3FS (that's on top of an LVM snapshot) mounted synchronously, in order to measure the real Copy-on-write overhead. So, I was wondering if I could do the same with ZFS. Seems not. Thanks. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
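A minimal sketch of the snapshot versus clone distinction mentioned above (dataset names are placeholders):

    zfs snapshot pool/data@before              # read-only, point-in-time view
    zfs clone pool/data@before pool/scratch    # a writable filesystem backed by that snapshot
    zfs promote pool/scratch                   # optional: makes the clone the owner of the origin snapshot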
Re: [zfs-discuss] Thumper Origins Q
Wow. That's an incredibly cool story. Thank you for sharing it! Does the Thumper today pretty much resemble what you saw then? Best Regards, Jason On 1/23/07, Bryan Cantrill [EMAIL PROTECTED] wrote: This is a bit off-topic...but since the Thumper is the poster child for ZFS I hope its not too off-topic. What are the actual origins of the Thumper? I've heard varying stories in word and print. It appears that the Thumper was the original server Bechtolsheim designed at Kealia as a massive video server. That's correct -- it was originally called the StreamStor. Speaking personally, I first learned about it in the meeting with Andy that I described here: http://blogs.sun.com/bmc/entry/man_myth_legend I think it might be true that this was the first that anyone in Solaris had heard of it. Certainly, it was the first time that Andy had ever heard of ZFS. It was a very high bandwidth conversation, at any rate. ;) After the meeting, I returned post-haste to Menlo Park, where I excitedly described the box to Jeff Bonwick, Bill Moore and Bart Smaalders. Bill said something like I gotta see this thing and sometime later (perhaps the next week?) Bill, Bart and I went down to visit Andy. Andy gave us a much more detailed tour, with Bill asking all sorts of technical questions about the hardware (many of which were something like how did you get a supplier to build that for you?!). After the tour, Andy took the three of us to lunch, and it was one of those moments that I won't forget: Bart, Bill, Andy and I sitting in the late afternoon Palo Alto sun, with us very excited about his hardware, and Andy very excited about our software. Everyone realized that these two projects -- born independently -- were made for each other, that together they would change the market. It was one of those rare moments that reminds you why you got into this line of work -- and I feel lucky to have shared in it. - Bryan -- Bryan Cantrill, Solaris Kernel Development. http://blogs.sun.com/bmc ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Synchronous Mount?
Hi Prashanth, This was about a year ago. I believe I ran bonnie++ and IOzone tests. Tried also to simulate an OLTP load. The 15-20% overhead for ZFS was vs. UFS on a raw disk...UFS on SVM was almost exactly 15% lower performance than raw UFS. UFS and XFS on raw disk were pretty similar in terms of performance, until you got into small files...then XFS bogged down really badly. None of this was testing with snapshots, so I'm not sure of the effect there. I can attest we're running ZFS right now in production on a Thumper serving two MySQL instances, under an 80/20 write/read load. We use ZFS snapshots as our primary backup mechanism (flush/lock the tables, flush the logs, snap, release the locks). At the moment we have 60 ZFS snapshots across 4 filesystems (one FS per zpool). Our primary database zpool has 26 of those snapshots, and the primary DB log zpool has another 26 snapshots. Overall, we haven't noticed any performance degradation in our database serving performance. I don't have hard benchmark numbers for you on this, but anecdotally it works very well. There have been some folks complaining here of snapshot numbers in the 200+ range causing performance problems on a single FS. We don't plan to have more than about 40 snapshots on an FS right now. Hope this is somewhat helpful. Its been a long time (2+ years) since I've used Ext3 on a Linux system, so I couldn't give you a comparative benchmark. Good luck! :-) Best Regards, Jason On 1/23/07, Prashanth Radhakrishnan [EMAIL PROTECTED] wrote: Hi Jason, My company did a lot of LVM+XFS vs. SVM+UFS testing in addition to ZFS. Overall, LVM's overhead is abysmal. We witnessed performance hits of 50%+. SVM only reduced performance by about 15%. ZFS was similar, though a tad higher. Yes, LVM snapshots' overhead is high. But I've seen that as you start increasing the chunksize, they get better (though, with higher space usage). So, you saw performance reductions as much as 15% with ZFS clones/snapshots. I'm curious to know what tests and ZFS config (# of snapshots/clones) you ran on. I ran bonnie++ and din't notice any perceptible drops in the numbers. Though my config had only upto 3 clones and 3 snapshots for each of them. Also, my understanding is you can't write to a ZFS snapshot...unless you clone it. Perhaps, someone who knows more than I can clarify. Right. I wanted to check if creating snapshots affected the performance of the origin FS/clone. Thanks, Prashanth On 1/23/07, Prashanth Radhakrishnan [EMAIL PROTECTED] wrote: Is there someway to synchronously mount a ZFS filesystem? '-o sync' does not appear to be honoured. No there isn't. Why do you think it is necessary? Specifically, I was trying to compare ZFS snapshots with LVM snapshots on Linux. One of the tests does writes to an ext3FS (that's on top of an LVM snapshot) mounted synchronously, in order to measure the real Copy-on-write overhead. So, I was wondering if I could do the same with ZFS. Seems not. Thanks. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
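A rough sketch of the lock/snapshot/unlock sequence described above (dataset names and snapshot labels are placeholders, not our actual scripts):

    # 1. In a mysql session that stays connected (the lock is dropped if the session disconnects):
    #        FLUSH TABLES WITH READ LOCK; FLUSH LOGS;
    # 2. From a shell while that session remains open:
    zfs snapshot dbpool/data@nightly-20070123
    zfs snapshot dbpool/logs@nightly-20070123
    # 3. Back in the same mysql session:
    #        UNLOCK TABLES;

The snapshots themselves are nearly instantaneous, so the tables only stay locked for a moment.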
Re: [zfs-discuss] Re: External drive enclosures + Sun Server for massstorage
Hi Frank, I'm sure Richard will check it out. He's a very good guy and not trying to jerk you around. I'm sure the hostility isn't warranted. :-) Best Regards, Jason On 1/22/07, Frank Cusack [EMAIL PROTECTED] wrote: On January 22, 2007 10:03:14 AM -0800 Richard Elling [EMAIL PROTECTED] wrote: Toby Thain wrote: To be clear: the X2100 drives are neither hotswap nor hotplug under Solaris. Replacing a failed drive requires a reboot. I do not believe this is true, though I don't have one to test. Well if you won't accept multiple technically adept people's word on it, I highly suggest you get one to test instead of speculating. If this were true, then we would have had to rewrite the disk drivers to not allow us to open a device more than once, even if we also closed the device. I can't imagine anyone allowing such code to be written. Obviously you have not rewritten the disk drivers to do this, so this is the wrong line of reasoning. However, I don't believe this is the context of the issue. I believe that this release note deals with the use of NVRAID (NVidia's MCP RAID controller) which does not have a systems management interface under Solaris. The solution is to not use NVRAID for Solaris. Rather, use the proven techniques that we've been using for decades to manage hot plugging drives. No, the release note is not about NVRAID. In short, the release note is confusing, so ignore it. Use x2100 disks as hot pluggable like you've always used hot plug disks in Solaris. Again, NO these drives are not hot pluggable and the release note is accurate. PLEASE get a system to test. Or take our word for it. -frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: External drive enclosures + Sun Server for mass
Hi David, Depending on the I/O you're doing the X4100/X4200 are much better suited because of the dual HyperTransport buses. As a storage box with GigE outputs you've got a lot more I/O capacity with two HT buses than one. That plus the X4100 is just a more solid box. The X2100 M2 while a vast improvement over the X2100 in terms of reliability and features, is still an OEM'd whitebox. We use the X2100 M2s for application servers, but for anything that needs solid reliability or I/O we go Galaxy. Best Regards, Jason On 1/22/07, David J. Orman [EMAIL PROTECTED] wrote: Not to be picky, but the X2100 and X2200 series are NOT designed/targeted for disk serving (they don't even have redundant power supplies). They're compute-boxes. The X4100/X4200 are what you are looking for to get a flexible box more oriented towards disk i/o and expansion. I don't see those as being any better suited to external discs other than: #1 - They have the capacity for redundant PSUs, which is irrelevant to my needs. #2 - They only have PCI Express slots, and I can't find any good external SATA interface cards on PCI Express I can't wrap my head around the idea that I should buy a lot more than I need, which still doesn't serve my purposes. The 4 disks in an x4100 still aren't enough, and the machine is a fair amount more costly. I just need mirrored boot drives, and an external disk array. That said (if you're set on an X2200 M2), you are probably better off getting a PCI-E SCSI controller, and then attaching it to an external SCSI-SATA JBOD. There are plenty of external JBODs out there which use Ultra320/Ultra160 as a host interface and SATA as a drive interface. Sun will sell you a supported SCSI controller with the X2200 M2 (the Sun StorageTek PCI-E Dual Channel Ultra320 SCSI HBA). SCSI is far better for a host attachment mechanism than eSATA if you plan on doing more than a couple of drives, which it sounds like you are. While the SCSI HBA is going to cost quite a bit more than an eSATA HBA, the external JBODs run about the same, and the total difference is going to be $300 or so across the whole setup (which will cost you $5000 or more fully populated). So the cost to use SCSI vs eSATA as the host- attach is a rounding error. I understand your comments in some ways, in others I do not. It sounds like we're moving backwards in time. Exactly why is SCSI better than SAS/SATA for external devices? From my experience (with other OSs/hardware platforms) the opposite is true. A nice SAS/SATA controller with external ports (especially those that allow multiple SAS/SATA drives via one cable - whichever tech you use) works wonderfully for me, and I get a nice thin/clean cable which makes cable management much more enjoyable in higher density situations. I also don't agree with the logic just spend a mere $300 extra to use older technology! $300 may not be much to large business, but things like this nickle and dime small business owners. There's a lot of things I'd prefer to spend $300 on than an expensive SCSI HBA which offers no advantages over a SAS counterpart, in fact offers disadvantages instead. Your input is of course highly valued, and it's quite possible I'm missing an important piece to the puzzle somewhere here, but I am not convinced this is the ideal solution - simply a stick with the old stuff, it's easier solution, which I am very much against. 
Thanks, David This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: External drive enclosures + Sun Server for massstorage
Hi Guys, The original X2100 was a pile of doggie doo-doo. All of our problems with it go back to the atrocious quality of the nForce 4 Pro chipset. The NICs in particular are just crap. The M2s are better, but the MCP55 chipset has not resolved all of its flakiness issues. That being said Sun designed that case with hot-plug bays, if Solaris isn't going to support it, then those shouldn't be there in my opinion. Best Regards, Jason On 1/22/07, Frank Cusack [EMAIL PROTECTED] wrote: In short, the release note is confusing, so ignore it. Use x2100 disks as hot pluggable like you've always used hot plug disks in Solaris. Again, NO these drives are not hot pluggable and the release note is accurate. PLEASE get a system to test. Or take our word for it. hmm I think I may have just figured out the problem here. YES the x2100 is that bad. I too found it quite hard to believe that Sun would sell this without hot plug drives. It seems like a step backwards. (and of course I don't mean that the x2100 is awful, it's a great hardware and very well priced ... now if only hot plug worked!) My main issue is that the x2100 is advertised as hot plug working. You have to dig pretty deep -- deeper than would be expected of a typical buyer -- to find that Solaris does not support it. -frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: External drive enclosures + Sun Server for mass
Hi David, Glad to help! I don't want to bad-mouth the X2100 M2s that much, because they have been solid. I believe the M2s are made/designed just for Sun by Quanta Computer (http://www.quanta.com.tw/e_default.htm) whereas the mobos in the original X2100 was Tyan Tiger with some slight modifications. That all being said, the problem is that Nvidia chipset. The MCP55 in the X2100 M2 is an alright chipset, the nForce 4 Pro just had bugs. Best Regards, Jason On 1/22/07, David J. Orman [EMAIL PROTECTED] wrote: Hi David, Depending on the I/O you're doing the X4100/X4200 are much better suited because of the dual HyperTransport buses. As a storage box with GigE outputs you've got a lot more I/O capacity with two HT buses than one. That plus the X4100 is just a more solid box. That much makes sense, thanks for clearing that up. The X2100 M2 while a vast improvement over the X2100 in terms of reliability and features, is still an OEM'd whitebox. We use the X2100 M2s for application servers, but for anything that needs solid reliability or I/O we go Galaxy. Ahh. That explains a lot. Thank you once again! Sounds like the X2* is the red-headed stepchild of Sun's product line. They should slap disclaimers up on the product information pages so we know better than to purchase into something that doesn't fully function. Still unclear on the SAS/SATA solutions, but hopefully that'll progress further now in the thread. Cheers, David This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Understanding ::memstat in terms of the ARC
Hello all, I have a question. Below are two ::memstat outputs about 5 days apart. The interesting thing is the anonymous memory shows 2GB, though the two major hogs of that memory (two MySQL instances) claim to be consuming about 6.2GB (checked via pmap). Also, it seems like the ARC keeps creeping the kernel memory over the 4GB limit I set for the ARC (zfs_arc_max). What I was also curious about is whether ZFS affects the cachelist line, or if that is just for UFS. Thank you in advance! Best Regards, Jason

01/17/2007 02:28:50 GMT 2007

Page Summary            Pages        MB   %Tot
Kernel                1485925      5804    36%
Anon                   855812      3343    21%
Exec and libs            7438        29     0%
Page cache               3863        15     0%
Free (cachelist)       185235       723     4%
Free (freelist)       1629288      6364    39%
Total                 4167561     16279
Physical              4078747     15932

01/22/2007 01:17:32 GMT 2007

Page Summary            Pages        MB   %Tot
Kernel                1534184      5992    37%
Anon                   538054      2101    13%
Exec and libs            7497        29     0%
Page cache              18550        72     0%
Free (cachelist)      1384165      5406    33%
Free (freelist)        685111      2676    16%
Total                 4167561     16279
Physical              4078747     15932

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
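For reference, the 4GB ARC cap mentioned above is the sort of limit set in /etc/system (the value below is just the 4GB example; a reboot is required for it to take effect):

    * /etc/system entry limiting the ZFS ARC to 4 GB (0x100000000 bytes)
    set zfs:zfs_arc_max = 0x100000000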
Re: [zfs-discuss] External drive enclosures + Sun Server for mass storage
Hi Shannon, The markup is still pretty high on a per-drive basis. That being said, $1-2/GB is darn low for the capacity in a server. Plus, you're also paying for having enough HyperTransport I/O to feed the PCI-E I/O. Does anyone know what problems they had with the 250GB version of the Thumper that caused them to pull it? Best Regards, Jason On 1/20/07, Shannon Roddy [EMAIL PROTECTED] wrote: Frank Cusack wrote: thumper (x4500) seems pretty reasonable ($/GB). -frank I am always amazed that people consider thumper to be reasonable in price. 450% or more markup per drive from street price in July 2006 numbers doesn't seem reasonable to me, even after subtracting the cost of the system. I like the x4500, I wish I had one. But, I can't pay what Sun wants for it. So, instead, I am stuck buying lower end Sun systems and buying third party SCSI/SATA JBODs. I like Sun. I like their products, but I can't understand their storage pricing most of the time. -Shannon ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] External drive enclosures + Sun Server for mass storage
Hi David, I don't know if your company qualifies as a startup under Sun's regs but you can get an X4500/Thumper for $24,000 under this program: http://www.sun.com/emrkt/startupessentials/ Best Regards, Jason On 1/19/07, David J. Orman [EMAIL PROTECTED] wrote: Hi, I'm looking at Sun's 1U x64 server line, and at most they support two drives. This is fine for the root OS install, but obviously not sufficient for many users. Specifically, I am looking at the: http://www.sun.com/servers/x64/x2200/ X2200M2. It only has Riser card assembly with two internal 64-bit, 8-lane, low-profile, half length PCI-Express slots for expansion. What I'm looking for is a SAS/SATA card that would allow me to add an external SATA enclosure (or some such device) to add storage. The supported list on the HCL is pretty slim, and I see no PCI-E stuff. A card that supports SAS would be *ideal*, but I can settle for normal SATA too. So, anybody have any good suggestions for these two things: #1 - SAS/SATA PCI-E card that would work with the Sun X2200M2. #2 - Rack-mountable external enclosure for SAS/SATA drives, supporting hot swap of drives. Basically, I'm trying to get around using Sun's extremely expensive storage solutions while waiting on them to release something reasonable now that ZFS exists. Cheers, David This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: What SATA controllers are people using for ZFS?
Hi Frank, Sun doesn't support the X2100 SATA controller on Solaris 10? That's just bizarre. -J On 1/18/07, Frank Cusack [EMAIL PROTECTED] wrote: THANK YOU Naveen, Al Hopper, others, for sinking yourselves into the shit world of PC hardware and [in]compatibility and coming up with well qualified white box solutions for S10. I strongly prefer to buy Sun kit, but I am done waiting for Sun to support the SATA controller on the x2100. -frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Heavy writes freezing system
Hi Anantha, I was curious why segregating at the FS level would provide adequate I/O isolation? Since all FS are on the same pool, I assumed flogging a FS would flog the pool and negatively affect all the other FS on that pool? Best Regards, Jason On 1/17/07, Anantha N. Srirama [EMAIL PROTECTED] wrote: You're probably hitting the same wall/bug that I came across; ZFS in all versions up to and including Sol10U3 generates excessive I/O when it encounters 'fsync' or if any of the files were opened with 'O_DSYNC' option. I do believe Oracle (or any DB for that matter) opens the file with O_DSYNC option. During normal times it does result in excessive I/O but is probably well under your system capacity (it was in our case.) But when you are doing backups or clones (Oracle clones by using RMAN or copying of db files?) you are going to flood the I/O sub-system and that's when the whole ZFS excessive I/O starts to put a hurt on the DB performance. Here are a few suggestions that can give you interim relief: - Segregate your I/O at filesystem level; the bug is at the filesystem level not ZFS pool level. By this I mean ensure the online redo logs are in a ZFS FS that nobody else uses, same for control files. As long as the writes to control and online redo logs are met your system will be happy. - Ensure that your clone and RMAN (if you're going to disk) write to a separate ZFS FS that contains no production files. - If the above two items don't give you relief, then relocate the online redo log and control files to a UFS filesystem. No need to downgrade the entire ZFS to something else. - Consider Oracle ASM (DB version permitting); it works very well. Why deal with VxFS? Feel free to drop me a line, I've over 17 years of Oracle DB experience and love to troubleshoot problems like this. I've another vested interest; we're considering ZFS for widespread use in our environment and any experience is good for us. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
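(A minimal sketch of the segregation Anantha describes, assuming a pool named tank and Oracle-style usage; the dataset names are purely illustrative.)
    zfs create tank/oradata      # datafiles
    zfs create tank/oralog       # online redo logs and control files only
    zfs create tank/orabackup    # RMAN / clone target, no production files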
Re: Re[2]: [zfs-discuss] Re: Heavy writes freezing system
Hi Robert, I see. So it really doesn't get around the idea of putting DB files and logs on separate spindles? Best Regards, Jason On 1/17/07, Robert Milkowski [EMAIL PROTECTED] wrote: Hello Jason, Wednesday, January 17, 2007, 11:24:50 PM, you wrote: JJWW Hi Anantha, JJWW I was curious why segregating at the FS level would provide adequate JJWW I/O isolation? Since all FS are on the same pool, I assumed flogging a JJWW FS would flog the pool and negatively affect all the other FS on that JJWW pool? Because of the bug which forces all outstanding writes in a file system to commit to storage in case of one fsync to one file. Now when you separate data to different file systems the bug will affect only data in that file system which could greatly reduce impact on performance if it's done right. -- Best regards, Robertmailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Eliminating double path with ZFS's volume manager
Hi Philip, I'm not an expert, so I'm afraid I don't know what to tell you. I'd call Apple Support and see what they say. As horrid as they are at Enterprise support they may be the best ones to clarify if multipathing is available without Xsan. Best Regards, Jason On 1/16/07, Philip Mötteli [EMAIL PROTECTED] wrote: Looks like its got a half-way decent multipath design: http://docs.info.apple.com/article.html?path=Xsan/1.1/ en/c3xs12.html Great, but that is with Xsan. If I don't exchange our Hitachi with an Xsan, I don't have this 'cvadmin'. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Eliminating double path with ZFS's volume manager
Hi Torrey, I think it does if you buy Xsan. It's still a separate product, isn't it? I thought it's more like QFS + MPxIO. Best Regards, Jason On 1/15/07, Torrey McMahon [EMAIL PROTECTED] wrote: Robert Milkowski wrote: 2. I believe it's definitely possible to just correct your config under Mac OS without any need to use other fs or volume manager, however going to zfs could be a good idea anyway That implies that MacOS has some sort of native SCSI multipathing like Solaris Mpxio. Does such a beast exist? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS direct IO
Hi Roch, You mentioned improved ZFS performance in the latest Nevada build (60 right now?)...I was curious if one would notice much of a performance improvement between 54 and 60? Also, does anyone think the zfs_arc_max tunable-support will be made available as a patch to S10U3, or would that wait until U4? Thank you in advance! Best Regards, Jason On 1/15/07, Roch - PAE [EMAIL PROTECTED] wrote: Jonathan Edwards writes: On Jan 5, 2007, at 11:10, Anton B. Rang wrote: DIRECT IO is a set of performance optimisations to circumvent shortcomings of a given filesystem. Direct I/O as generally understood (i.e. not UFS-specific) is an optimization which allows data to be transferred directly between user data buffers and disk, without a memory-to-memory copy. This isn't related to a particular file system. true .. directio(3) is generally used in the context of *any* given filesystem to advise it that an application buffer to system buffer copy may get in the way or add additional overhead (particularly if the filesystem buffer is doing additional copies.) You can also look at it as a way of reducing more layers of indirection particularly if I want the application overhead to be higher than the subsystem overhead. Programmatically .. less is more. Direct IO makes good sense when the target disk sectors are set a priori. But in the context of ZFS, would you rather have 10 direct disk I/Os or 10 bcopies and 2 I/O (say that was possible). As for read, I can see that when the load is cached in the disk array and we're running 100% CPU, the extra copy might be noticeable. Is this the situation that longs for DIO ? What % of a system is spent in the copy ? What is the added latency that comes from the copy ? Is DIO the best way to reduce the CPU cost of ZFS ? The current Nevada code base has quite nice performance characteristics (and certainly quirks); there are many further efficiency gains to be reaped from ZFS. I just don't see DIO on top of that list for now. Or at least someone needs to spell out what is ZFS/DIO and how much better it is expected to be (back of the envelope calculation accepted). Reading RAID-Z subblocks on filesystems that have checksum disabled might be interesting. That would avoid some disk seeks.To served the subblocks directly or not is a separate matter; it's a small deal compared to the feature itself. How about disabling the DB checksum (it can't fix the block anyway) and do mirroring ? -r ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Eliminating double path with ZFS's volume manager
Hi Torrey, Looks like its got a half-way decent multipath design: http://docs.info.apple.com/article.html?path=Xsan/1.1/en/c3xs12.html Whether or not it works is another story I suppose. ;-) Best Regards, Jason On 1/15/07, Torrey McMahon [EMAIL PROTECTED] wrote: Got me. However, transport multipathing - Like Mpxio, DLM, VxDMP, etc. - is usually separated from the filesystem layers. Jason J. W. Williams wrote: Hi Torrey, I think it does if you buy Xsan. Its still a separate product isn't it? Thought its more like QFS + MPXIO. Best Regards, Jason On 1/15/07, Torrey McMahon [EMAIL PROTECTED] wrote: Robert Milkowski wrote: 2. I belive it's definitely possible to just correct your config under Mac OS without any need to use other fs or volume manager, however going to zfs could be a good idea anyway That implies that MacOS has some sort of native SCSI multipathing like Solaris Mpxio. Does such a beast exist? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re[2]: [zfs-discuss] Replacing a drive in a raidz2 group
Hi Robert, Will build 54 offline the drive? Best Regards, Jason On 1/13/07, Robert Milkowski [EMAIL PROTECTED] wrote: Hello Jason, Saturday, January 13, 2007, 12:06:57 AM, you wrote: JJWW Hi Robert, JJWW We've experienced luck with flaky SATA drives in our STK array by JJWW unseating and reseating the drive to cause a reset of the firmware. It JJWW may be a bad drive, or the firmware may just have hit a bug. Hope its JJWW the latter! :-D JJWW I'd be interested why the hot-spare didn't kick in. I thought the FMA JJWW integration would detect read errors. FMA did but ZFS+FMA we're not there in U3. -- Best regards, Robertmailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Replacing a drive in a raidz2 group
Hi Robert, We've experienced luck with flaky SATA drives in our STK array by unseating and reseating the drive to cause a reset of the firmware. It may be a bad drive, or the firmware may just have hit a bug. Hope it's the latter! :-D I'd be interested why the hot-spare didn't kick in. I thought the FMA integration would detect read errors. Best Regards, Jason On 1/12/07, Robert Milkowski [EMAIL PROTECTED] wrote: Hello zfs-discuss, One of our drives in the x4500 is failing - it periodically disconnects/connects. ZFS only reports READ errors and no hot spare automatically kicked in, which is expected currently. So I issued zpool replace with a hot-spare drive. Now it takes forever and it seems like ZFS is rebuilding the drive using checksums - wouldn't it be much faster if it just copied data from the drive being replaced (like attaching a mirror)? -- Best regards, Robert mailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Limit ZFS Memory Utilization
Hi Mark, That does help tremendously. How does ZFS decide which zio cache to use? I apologize if this has already been addressed somewhere. Best Regards, Jason On 1/11/07, Mark Maybee [EMAIL PROTECTED] wrote: Al Hopper wrote: On Wed, 10 Jan 2007, Mark Maybee wrote: Jason J. W. Williams wrote: Hi Robert, Thank you! Holy mackerel! That's a lot of memory. With that type of a calculation my 4GB arc_max setting is still in the danger zone on a Thumper. I wonder if any of the ZFS developers could shed some light on the calculation? In a worst-case scenario, Robert's calculations are accurate to a certain degree: If you have 1GB of dnode_phys data in your arc cache (that would be about 1,200,000 files referenced), then this will result in another 3GB of related data held in memory: vnodes/znodes/ dnodes/etc. This related data is the in-core data associated with an accessed file. It's not quite true that this data is not evictable, it *is* evictable, but the space is returned from these kmem caches only after the arc has cleared its blocks and triggered the free of the related data structures (and even then, the kernel will need to do a kmem_reap to reclaim the memory from the caches). The fragmentation that Robert mentions is an issue because, if we don't free everything, the kmem_reap may not be able to reclaim all the memory from these caches, as they are allocated in slabs. We are in the process of trying to improve this situation. snip . Understood (and many Thanks). In the meantime, is there a rule-of-thumb that you could share that would allow mere humans (like me) to calculate the best values of zfs:zfs_arc_max and ncsize, given that the machine has n GB of RAM and is used in the following broad workload scenarios: a) a busy NFS server b) a general multiuser development server c) a database server d) an Apache/Tomcat/FTP server e) a single user Gnome desktop running U3 with home dirs on a ZFS filesystem It would seem, from reading between the lines of previous emails, particularly the ones you've (Mark M) written, that there is a rule of thumb that would apply given a standard or modified ncsize tunable?? I'm primarily interested in a calculation that would allow settings that would reduce the possibility of the machine descending into swap hell. Ideally, there would be no need for any tunables; ZFS would always do the right thing. This is our grail. In the meantime, I can give some recommendations, but there is no rule of thumb that is going to work in all circumstances. ncsize: As I have mentioned previously, there are overheads associated with caching vnode data in ZFS. While the physical on-disk data for a znode is only 512 bytes, the related in-core cost is significantly higher. Roughly, you can expect that each ZFS vnode held in the DNLC will cost about 3K of kernel memory. So, you need to set ncsize appropriately for how much memory you are willing to devote to it. 500,000 entries is going to cost you 1.5GB of memory. zfs_arc_max: This is the maximum amount of memory you want the ARC to be able to use. Note that the ARC won't necessarily use this much memory: if other applications need memory, the ARC will shrink to accommodate. Although, also note that the ARC *can't* shrink if all of its memory is held. For example, data in the DNLC cannot be evicted from the ARC, so this data must first be evicted from the DNLC before the ARC can free up space (this is why it is dangerous to turn off the ARC's ability to evict vnodes from the DNLC).
Also keep in mind that the ARC size does not account for many in-core data structures used by ZFS (znodes/dnodes/ dbufs/etc). Roughly, for every 1MB of cached file pointers, you can expect another 3MB of memory used outside of the ARC. So, in the example above, where ncsize is 500,000, the ARC is only seeing about 400MB of the 1.5GB consumed. As I have stated previously, we consider this a bug in the current ARC accounting that we will soon fix. This is only an issue in environments where many files are being accessed. If the number of files accessed is relatively low, then the ARC size will be much closer to the actual memory consumed by ZFS. So, in general, you should not really need to tune zfs_arc_max. However, in environments where you have specific applications that consume known quantities of memory (e.g. database), it will likely
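(For anyone wanting to check these numbers on their own box, a rough sketch with purely illustrative values; per the discussion above, budget roughly 3K of kernel memory per ZFS vnode held in the DNLC.)
    # current DNLC size and activity
    echo "ncsize/D" | mdb -k
    kstat -n dnlcstats | egrep 'hits|misses'
    # /etc/system: e.g. 250,000 entries at ~3K each is roughly 0.75GB of kernel memory
    set ncsize = 250000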
Re: [zfs-discuss] Solid State Drives?
Hello all, Just my two cents on the issue. The Thumper is proving to be a terrific database server in all aspects except latency. While the latency is acceptable, being able to add some degree of battery-backed write cache that ZFS could use would be phenomenal. Best Regards, Jason On 1/11/07, Jonathan Edwards [EMAIL PROTECTED] wrote: On Jan 11, 2007, at 15:42, Erik Trimble wrote: On Thu, 2007-01-11 at 10:35 -0800, Richard Elling wrote: The product was called Sun PrestoServ. It was successful for benchmarking and such, but unsuccessful in the market because: + when there is a failure, your data is spread across multiple fault domains + it is not clusterable, which is often a requirement for data centers + it used a battery, so you had to deal with physical battery replacement and all of the associated battery problems + it had yet another device driver, so integration was a pain Google for it and you'll see all sorts of historical perspective. -- richard Yes, I remember (and used) PrestoServ. Back in the SPARCcenter 1000 days. :-) as do i .. (keep your batteries charged!! and don't panic!) And yes, local caching makes the system non-clusterable. not necessarily .. i like the javaspaces approach to coherency, and companies like gigaspaces have done some pretty impressive things with in memory SBA databases and distributed grid architectures .. intelligent coherency design with a good distribution balance for local, remote, and redundant can go a long way in improving your cache numbers. However, all the other issues are common to a typical HW raid controller, and many people use host-based HW controllers just fine and don't find their problems to be excessive. True given most workloads, but in general it's the coherency issues that drastically affect throughput on shared controllers particularly as you add and distribute the same luns or data across different control processors. Add too many and your cache hit rates might fall in the toilet. .je ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Limit ZFS Memory Utilization
Hi Guys, After reading through the discussion on this regarding ZFS memory fragmentation on snv_53 (and forward) and going through our ::kmastat...looks like ZFS is sucking down about 544 MB of RAM in the various caches. About 360MB of that is in the zio_buf_65536 cache. Next most notable is 55MB in zio_buf_32768, and 36MB in zio_buf_16384. I don't think that's too bad but worth keeping track of. At this point our kernel memory growth seems to have slowed, with it hovering around 5GB, and the anon column is mostly what's growing now (as expected...MySQL). Most of the problem in the discussion thread on this seemed to be related to a lot of DNLC entries due to the workload of a file server. How would this affect a database server with operations in only a couple very large files? Thank you in advance. Best Regards, Jason On 1/10/07, Jason J. W. Williams [EMAIL PROTECTED] wrote: Sanjeev Robert, Thanks guys. We put that in place last night and it seems to be doing a lot better job of consuming less RAM. We set it to 4GB and each of our 2 MySQL instances on the box to a max of 4GB. So hopefully a slush of 4GB on the Thumper is enough. I would be interested in what the other ZFS modules' memory behaviors are. I'll take a perusal through the archives. In general it seems to me that a max cap for ZFS whether set through a series of individual tunables or a single root tunable would be very helpful. Best Regards, Jason On 1/10/07, Sanjeev Bagewadi [EMAIL PROTECTED] wrote: Jason, Robert is right... The point is ARC is the caching module of ZFS and the majority of the memory is consumed through ARC. Hence by limiting the c_max of ARC we are limiting the amount ARC consumes. However, other modules of ZFS would consume more but that may not be as significant as ARC. Experts, please correct me if I am wrong here. Thanks and regards, Sanjeev. Robert Milkowski wrote: Hello Jason, Tuesday, January 9, 2007, 10:28:12 PM, you wrote: JJWW Hi Sanjeev, JJWW Thank you! I was not able to find anything as useful on the subject as JJWW that! We are running build 54 on an X4500, would I be correct in my JJWW reading of that article that if I put set zfs:zfs_arc_max = JJWW 0x100000000 # 4GB in my /etc/system, ZFS will consume no more than JJWW 4GB? Thank you in advance. That's the idea; however, it's not working that way now - under some circumstances ZFS could still consume much more memory - see other posts lately here. -- Solaris Revenue Products Engineering, India Engineering Center, Sun Microsystems India Pvt Ltd. Tel:x27521 +91 80 669 27521 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
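(For anyone searching the archives later, the /etc/system line being discussed looks like the following; the 4GB cap is only an example, and as noted in the thread the tunable is not fully honoured on every build.)
    * cap the ZFS ARC at 4GB (0x100000000 bytes); requires a reboot
    set zfs:zfs_arc_max = 0x100000000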
Re: [zfs-discuss] Re: Adding disk to a RAID-Z?
Hi Kyle, I think there was a lot of talk about this behavior on the RAIDZ2 vs. RAID-10 thread. My understanding from that discussion was that every write stripes the block across all disks on a RAIDZ/Z2 group, thereby making writing the group no faster than writing to a single disk. However reads are much faster, as all the disks are activated in the read process. The default config on the X4500 we received recently was RAIDZ-groups of 6 disks (across the 6 controllers) striped together into one large zpool. Best Regards, Jason On 1/10/07, Kyle McDonald [EMAIL PROTECTED] wrote: Robert Milkowski wrote: Hello Kyle, Wednesday, January 10, 2007, 5:33:12 PM, you wrote: KM Remember though that it's been mathematically figured that the KM disadvantages to RaidZ start to show up after 9 or 10 drives. (That's Well, nothing like this was proved and definitely not mathematically. It's just common-sense advice - for many users keeping raidz groups below 9 disks should give good enough performance. However if someone creates a raidz group of 48 disks he/she probably also expects performance and in general raid-z wouldn't offer it. It's very possible I misstated something. :) I thought I had read though, something like over 9 or so disks would mean that each FS block would be written to less than a single disk block on each disk? Or maybe it was that waiting to read from all drives for files less than a FS block would suffer? Ahhh... I can't remember what the effects were thought to be. I thought there was some theoretical math involved though. I do remember people advising against it though. Not just on a performance basis, but also on an increased risk of failure basis. I think it was just seen as a good balancing point. -Kyle ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re[2]: [zfs-discuss] Re: Adding disk to a RAID-Z?
Hi Robert, I read the following section from http://blogs.sun.com/roch/entry/when_to_and_not_to as indicating random writes to a RAID-Z had the performance of a single disk regardless of the group size: Effectively, as a first approximation, an N-disk RAID-Z group will behave as a single device in terms of delivered random input IOPS. Thus a 10-disk group of devices each capable of 200-IOPS, will globally act as a 200-IOPS capable RAID-Z group. Best Regards, Jason On 1/10/07, Robert Milkowski [EMAIL PROTECTED] wrote: Hello Jason, Wednesday, January 10, 2007, 10:54:29 PM, you wrote: JJWW Hi Kyle, JJWW I think there was a lot of talk about this behavior on the RAIDZ2 vs. JJWW RAID-10 thread. My understanding from that discussion was that every JJWW write stripes the block across all disks on a RAIDZ/Z2 group, thereby JJWW making writing the group no faster than writing to a single disk. JJWW However reads are much faster, as all the disks are activated in the JJWW read process. The opposite actually. Because of COW, writing (modifying as well) will give you up to N-1 disks' performance for raid-z1 and N-2 disks' performance for raid-z2. However reading can be slow in the case of many small random reads, as to read each fs block you've got to wait for all data disks in a group. JJWW The default config on the X4500 we received recently was RAIDZ-groups JJWW of 6 disks (across the 6 controllers) striped together into one large JJWW zpool. However the problem with that config is lack of a hot spare. Of course it depends what you want (and there was no hot spare support in U2, which is the OS installed at the factory so far). -- Best regards, Robertmailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re[4]: [zfs-discuss] Limit ZFS Memory Utilization
Hi Robert, We've got the default ncsize. I didn't see any advantage to increasing it outside of NFS serving...which this server is not. For speed the X4500 is proving to be a killer MySQL platform. Between the blazing fast procs and the sheer number of spindles, its performance is tremendous. If MySQL Cluster had full disk-based support, scale-out with X4500s a la Greenplum would be a terrific solution. At this point, the ZFS memory gobbling is the main roadblock to being a good database platform. Regarding the paging activity, we too saw tremendous paging of up to 24% of the X4500's CPU being used for that with the default arc_max. After changing it to 4GB, we haven't seen anything much over 5-10%. Best Regards, Jason On 1/10/07, Robert Milkowski [EMAIL PROTECTED] wrote: Hello Jason, Thursday, January 11, 2007, 12:36:46 AM, you wrote: JJWW Hi Robert, JJWW Thank you! Holy mackerel! That's a lot of memory. With that type of a JJWW calculation my 4GB arc_max setting is still in the danger zone on a JJWW Thumper. I wonder if any of the ZFS developers could shed some light JJWW on the calculation? JJWW That kind of memory loss makes ZFS almost unusable for a database system. If you leave ncsize at the default value then I believe it won't consume that much memory. JJWW I agree that a page cache similar to UFS would be much better. Linux JJWW works similarly to free pages, and it has been effective enough in the JJWW past. Though I'm equally unhappy about Linux's tendency to grab every JJWW bit of free RAM available for filesystem caching, and then cause JJWW massive memory thrashing as it frees it for applications. A page cache won't be better - just better memory control for ZFS caches is strongly desired. Unfortunately from time to time ZFS makes servers page enormously :( -- Best regards, Robertmailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Limit ZFS Memory Utilization
Sanjeev, Could you point me in the right direction as to how to convert the following GCC compile flags to Studio 11 compile flags? Any help is greatly appreciated. We're trying to recompile MySQL to give a stacktrace and core file to track down exactly why it's crashing...hopefully it will illuminate if memory truly is the issue. Thank you very much in advance! -felide-constructors -fno-exceptions -fno-rtti Best Regards, Jason On 1/7/07, Sanjeev Bagewadi [EMAIL PROTECTED] wrote: Jason, There is no documented way of limiting the memory consumption. The ARC section of ZFS tries to adapt to the memory pressure of the system. However, in your case probably it is not quick enough I guess. One way of limiting the memory consumption would be to limit arc.c_max. This (arc.c_max) is set to 3/4 of the memory available (or 1GB less than memory available). This is done when ZFS is loaded (arc_init()). You should be able to change the value of arc.c_max through mdb and set it to the value you want. Exercise caution while setting it. Make sure you don't have active zpools during this operation. Thanks and regards, Sanjeev. Jason J. W. Williams wrote: Hello, Is there a way to set a max memory utilization for ZFS? We're trying to debug an issue where ZFS is sucking all the RAM out of the box, and it's crashing MySQL as a result we think. Will ZFS reduce its cache size if it feels memory pressure? Any help is greatly appreciated. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Solaris Revenue Products Engineering, India Engineering Center, Sun Microsystems India Pvt Ltd. Tel:x27521 +91 80 669 27521 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
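(In case it helps anyone searching the archives, a rough, untested mapping for Sun Studio 11's CC: -features=no%except and -features=no%rtti are the usual analogues of -fno-exceptions and -fno-rtti, while -felide-constructors has no direct Studio switch since the optimizer performs copy elision on its own. The configure invocation is illustrative only.)
    CXX=CC
    CXXFLAGS="-g -xO2 -features=no%except -features=no%rtti"
    export CXX CXXFLAGS
    ./configure --with-debug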
Re: [zfs-discuss] Limit ZFS Memory Utilization
We're not using the Enterprise release, but we are working with them. It looks like MySQL is crashing due to lack of memory. -J On 1/8/07, Toby Thain [EMAIL PROTECTED] wrote: On 8-Jan-07, at 11:54 AM, Jason J. W. Williams wrote: ...We're trying to recompile MySQL to give a stacktrace and core file to track down exactly why its crashing...hopefully it will illuminate if memory truly is the issue. If you're using the Enterprise release, can't you get MySQL's assistance with this? --Toby ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Limit ZFS Memory Utilization
Hello, Is there a way to set a max memory utilization for ZFS? We're trying to debug an issue where ZFS is sucking all the RAM out of the box, and it's crashing MySQL as a result, we think. Will ZFS reduce its cache size if it feels memory pressure? Any help is greatly appreciated. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solid State Drives?
Could this ability (separate ZIL device) coupled with an SSD give something like a Thumper the write latency benefit of a battery-backed write cache? Best Regards, Jason On 1/5/07, Neil Perrin [EMAIL PROTECTED] wrote: Robert Milkowski wrote On 01/05/07 11:45,: Hello Neil, Friday, January 5, 2007, 4:36:05 PM, you wrote: NP I'm currently working on putting the ZFS intent log on separate devices NP which could include separate disks and nvram/solid state devices. NP This would help any application using fsync/O_DSYNC - in particular NP DB and NFS. From prototyping considerable performance improvements have NP been seen. Can you share any results from prototype testing? I'd prefer not to just yet as I don't want to raise expectations unduly. When testing I was using a simple local benchmark, whereas I'd prefer to run something more official such as TPC. I'm also missing a few required features in the prototype which may affect performance. Hopefully I can provide some results soon, but even those will be unofficial. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
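(Follow-up note for readers of the archive: in the later builds where this work integrated, a dedicated log device is specified roughly as below; device names are illustrative and the syntax is not available in the builds discussed in this thread.)
    # create a pool with a dedicated slog device
    zpool create tank mirror c1t0d0 c1t1d0 log c2t0d0
    # or add one to an existing pool
    zpool add tank log c3t0d0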
[zfs-discuss] RAIDZ2 vs. ZFS RAID-10
Hello All, I was curious if anyone had run a benchmark on the IOPS performance of RAIDZ2 vs RAID-10? I'm getting ready to run one on a Thumper and was curious what others had seen. Thank you in advance. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RAIDZ2 vs. ZFS RAID-10
Hi Richard, Hmmthat's interesting. I wonder if its worth benchmarking RAIDZ2 if those are the results you're getting. The testing is to see the performance gain we might get for MySQL moving off the FLX210 to an active/passive pair of X4500s. Was hoping with that many SATA disks RAIDZ2 would provide a nice safety net. Best Regards, Jason On 1/3/07, Richard Elling [EMAIL PROTECTED] wrote: Jason J. W. Williams wrote: Hello All, I was curious if anyone had run a benchmark on the IOPS performance of RAIDZ2 vs RAID-10? I'm getting ready to run one on a Thumper and was curious what others had seen. Thank you in advance. I've been using a simple model for small, random reads. In that model, the performance of a raidz[12] set will be approximately equal to a single disk. For example, if you have 6 disks, then the performance for the 6-disk raidz2 set will be normalized to 1, and the performance of a 3-way dynamic stripe of 2-way mirrors will have a normalized performance of 6. I'd be very interested to see if your results concur. The models for writes or large reads are much more complicated because of the numerous caches of varying size and policy throughout the system. The small, random read workload will be largely unaffected by caches and you should see the performance as predicted by the disk rpm and seek time. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RAIDZ2 vs. ZFS RAID-10
Just got an interesting benchmark. I made two zpools: RAID-10 (9x 2-way RAID-1 mirrors: 18 disks total) RAID-Z2 (3x 6-way RAIDZ2 group: 18 disks total) Copying 38.4GB of data from the RAID-Z2 to the RAID-10 took 307 seconds. Deleted the data from the RAID-Z2. Then copying the 38.4GB of data from the RAID-10 to the RAID-Z2 took 258 seconds. Would have expected the RAID-10 to write data more quickly. Its interesting to me that the RAID-10 pool registered the 38.4GB of data as 38.4GB, whereas the RAID-Z2 registered it as 56.4. Best Regards, Jason On 1/3/07, Jason J. W. Williams [EMAIL PROTECTED] wrote: Hi Richard, Hmmthat's interesting. I wonder if its worth benchmarking RAIDZ2 if those are the results you're getting. The testing is to see the performance gain we might get for MySQL moving off the FLX210 to an active/passive pair of X4500s. Was hoping with that many SATA disks RAIDZ2 would provide a nice safety net. Best Regards, Jason On 1/3/07, Richard Elling [EMAIL PROTECTED] wrote: Jason J. W. Williams wrote: Hello All, I was curious if anyone had run a benchmark on the IOPS performance of RAIDZ2 vs RAID-10? I'm getting ready to run one on a Thumper and was curious what others had seen. Thank you in advance. I've been using a simple model for small, random reads. In that model, the performance of a raidz[12] set will be approximately equal to a single disk. For example, if you have 6 disks, then the performance for the 6-disk raidz2 set will be normalized to 1, and the performance of a 3-way dynamic stripe of 2-way mirrors will have a normalized performance of 6. I'd be very interested to see if your results concur. The models for writes or large reads are much more complicated because of the numerous caches of varying size and policy throughout the system. The small, random read workload will be largely unaffected by caches and you should see the performance as predicted by the disk rpm and seek time. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
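(For anyone wanting to reproduce the comparison, the two layouts look roughly like this; the c#t#d# names are placeholders for 18 of the Thumper's disks, not the actual device paths used.)
    # ZFS RAID-10: nine 2-way mirrors
    zpool create r10pool \
        mirror c0t0d0 c1t0d0 mirror c0t1d0 c1t1d0 mirror c0t2d0 c1t2d0 \
        mirror c0t3d0 c1t3d0 mirror c0t4d0 c1t4d0 mirror c0t5d0 c1t5d0 \
        mirror c0t6d0 c1t6d0 mirror c0t7d0 c1t7d0 mirror c0t8d0 c1t8d0
    # RAID-Z2: three 6-disk raidz2 groups striped together
    zpool create rz2pool \
        raidz2 c0t0d0 c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0 \
        raidz2 c0t1d0 c1t1d0 c2t1d0 c3t1d0 c4t1d0 c5t1d0 \
        raidz2 c0t2d0 c1t2d0 c2t2d0 c3t2d0 c4t2d0 c5t2d0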
Re: Re[2]: [zfs-discuss] RAIDZ2 vs. ZFS RAID-10
Hi Robert, Our X4500 configuration is multiple 6-way (across controllers) RAID-Z2 groups striped together. Currently, 3 RZ2 groups. I'm about to test write performance against ZFS RAID-10. I'm curious why RAID-Z2 performance should be good? I assumed it was an analog to RAID-6. In our recent experience RAID-5 due to the 2 reads, a XOR calc and a write op per write instruction is usually much slower than RAID-10 (two write ops). Any advice is greatly appreciated. Best Regards, Jason On 1/3/07, Robert Milkowski [EMAIL PROTECTED] wrote: Hello Jason, Wednesday, January 3, 2007, 11:11:31 PM, you wrote: JJWW Hi Richard, JJWW Hmmthat's interesting. I wonder if its worth benchmarking RAIDZ2 JJWW if those are the results you're getting. The testing is to see the JJWW performance gain we might get for MySQL moving off the FLX210 to an JJWW active/passive pair of X4500s. Was hoping with that many SATA disks JJWW RAIDZ2 would provide a nice safety net. Well, you weren't thinking about one big raidz2 group? To get more performance you can create one pool with many smaller raidz2 groups - that way your worst case read performance should increase approximately N times where N is number of raidz-2 groups. However keep in mind that write performance should be really good with raidz2. -- Best regards, Robertmailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re[2]: [zfs-discuss] RAIDZ2 vs. ZFS RAID-10
Hi Robert, That makes sense. Thank you. :-) Also, it was zpool I was looking at. zfs always showed the correct size. -J On 1/3/07, Robert Milkowski [EMAIL PROTECTED] wrote: Hello Jason, Wednesday, January 3, 2007, 11:40:38 PM, you wrote: JJWW Just got an interesting benchmark. I made two zpools: JJWW RAID-10 (9x 2-way RAID-1 mirrors: 18 disks total) JJWW RAID-Z2 (3x 6-way RAIDZ2 group: 18 disks total) JJWW Copying 38.4GB of data from the RAID-Z2 to the RAID-10 took 307 JJWW seconds. Deleted the data from the RAID-Z2. Then copying the 38.4GB of JJWW data from the RAID-10 to the RAID-Z2 took 258 seconds. Would have JJWW expected the RAID-10 to write data more quickly. Actually with 18 disks in raid-10 in theory you get write performance equal to stripe of 9 disks. With 18 disks in 3 raidz2 groups of 6 disks each you should expect something like (6-2)*3 = 12 disk, so equal to 12 disks in stripe. JJWW Its interesting to me that the RAID-10 pool registered the 38.4GB of JJWW data as 38.4GB, whereas the RAID-Z2 registered it as 56.4. If you checked with zpool - then it's ok - it reports disk usage also wit parity overhead. If zfs list showed you that numbers then either you're using old snv bits or s10U2 as it was corrected some time ago (in U3). -- Best regards, Robertmailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re[2]: RAIDZ2 vs. ZFS RAID-10
Hi Anton, Thank you for the information. That is exactly our scenario. We're 70% write heavy, and given the nature of the workload, our typical writes are 10-20K. Again the information is much appreciated. Best Regards, Jason On 1/3/07, Anton B. Rang [EMAIL PROTECTED] wrote: In our recent experience RAID-5 due to the 2 reads, a XOR calc and a write op per write instruction is usually much slower than RAID-10 (two write ops). Any advice is greatly appreciated. RAIDZ and RAIDZ2 does not suffer from this malady (the RAID5 write hole). 1. This isn't the write hole. 2. RAIDZ and RAIDZ2 suffer from read-modify-write overhead when updating a file in writes of less than 128K, but not when writing a new file or issuing large writes. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
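(Not suggested in the thread itself, but a common mitigation worth noting here as a sketch: matching the dataset recordsize to the typical write size before loading the data reduces the read-modify-write overhead Anton describes. The dataset name and the 16K value are illustrative.)
    zfs set recordsize=16K tank/mysql-data
    zfs get recordsize tank/mysql-data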
Re: [zfs-discuss] ZFS over NFS extra slow?
Hi Brad, I believe benr experienced the same/similar issue here: http://www.opensolaris.org/jive/message.jspa?messageID=77347 If it is the same, I believe its a known ZFS/NFS interaction bug, and has to do with small file creation. Best Regards, Jason On 1/2/07, Brad Plecs [EMAIL PROTECTED] wrote: I had a user report extreme slowness on a ZFS filesystem mounted over NFS over the weekend. After some extensive testing, the extreme slowness appears to only occur when a ZFS filesystem is mounted over NFS. One example is doing a 'gtar xzvf php-5.2.0.tar.gz'... over NFS onto a ZFS filesystem. this takes: real5m12.423s user0m0.936s sys 0m4.760s Locally on the server (to the same ZFS filesystem) takes: real0m4.415s user0m1.884s sys 0m3.395s The same job over NFS to a UFS filesystem takes real1m22.725s user0m0.901s sys 0m4.479s Same job locally on server to same UFS filesystem: real0m10.150s user0m2.121s sys 0m4.953s This is easily reproducible even with single large files, but the multiple small files seems to illustrate some awful sync latency between each file. Any idea why ZFS over NFS is so bad? I saw the threads that talk about an fsync penalty, but they don't seem relevant since the local ZFS performance is quite good. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re[2]: [zfs-discuss] Re: Difference between ZFS and UFS with one LUN froma SAN
Hi Robert, MPxIO had correctly moved the paths. More than one path to controller A was OK, and one path to controller A for each LUN was active when controller B was rebooted. I have a hunch that the array was at fault, because it also rebooted a Windows server with LUNs only on Controller A. In the case of the Windows server, Engenio's RDAC was handling multipathing. Overall, not a big deal, I just wouldn't trust the array to do a hitless commanded controller failover or firmware upgrade. -J On 12/22/06, Robert Milkowski [EMAIL PROTECTED] wrote: Hello Jason, Friday, December 22, 2006, 5:55:38 PM, you wrote: JJWW Just for what its worth, when we rebooted a controller in our array JJWW (we pre-moved all the LUNs to the other controller), despite using JJWW MPXIO ZFS kernel panicked. Verified that all the LUNs were on the JJWW correct controller when this occurred. Its not clear why ZFS thought JJWW it lost a LUN but it did. We have done cable pulling using ZFS/MPXIO JJWW before and that works very well. It may well be array-related in our JJWW case, but I hate anyone to have a false sense of security. Did you first check (with format for example) if LUNs were really accessible? If MPxIO worked ok and at least one path is ok then ZFS won't panic. -- Best regards, Robertmailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Difference between ZFS and UFS with one LUN froma SAN
Just for what its worth, when we rebooted a controller in our array (we pre-moved all the LUNs to the other controller), despite using MPXIO ZFS kernel panicked. Verified that all the LUNs were on the correct controller when this occurred. Its not clear why ZFS thought it lost a LUN but it did. We have done cable pulling using ZFS/MPXIO before and that works very well. It may well be array-related in our case, but I hate anyone to have a false sense of security. -J On 12/22/06, Tim Cook [EMAIL PROTECTED] wrote: This may not be the answer you're looking for, but I don't know if it's something you've thought of. If you're pulling a LUN from an expensive array, with multiple HBA's in the system, why not run mpxio? If you ARE running mpxio, there shouldn't be an issue with a path dropping. I have the setup above in my test lab and pull cables all the time and have yet to see a zfs kernel panic. Is this something you've considered? I haven't seen the bug in question, but I definitely have not run into it when running mpxio. --Tim -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Shawn Joy Sent: Friday, December 22, 2006 7:35 AM To: zfs-discuss@opensolaris.org Subject: [zfs-discuss] Re: Difference between ZFS and UFS with one LUN froma SAN OK, But lets get back to the original question. Does ZFS provide you with less features than UFS does on one LUN from a SAN (i.e is it less stable). ZFS on the contrary checks every block it reads and is able to find the mirror or reconstruct the data in a raidz config. Therefore ZFS uses only valid data and is able to repair the data blocks automatically. This is not possible in a traditional filesystem/volume manager configuration. The above is fine. If I have two LUNs. But my original question was if I only have one LUN. What about kernel panics from ZFS if for instance access to one controller goes away for a few seconds or minutes. Normally UFS would just sit there and warn I have lost access to the controller. Then when the controller returns, after a short period, the warnings go away and the LUN continues to operate. The admin can then research further into why the controller went away. With ZFS, the above will panic the system and possibly cause other coruption on other LUNs due to this panic? I believe this was discussed in other threads? I also believe there is a bug filed against this? If so when should we expect this bug to be fixed? My understanding of ZFS is that it functions better in an environment where we have JBODs attached to the hosts. This way ZFS takes care of all of the redundancy? But what about SAN enviroments where customers have spend big money to invest in storage. I know of one instance where a customer has a growing need for more storage space. There environemt uses many inodes. Due to the UFS inode limitation, when creating LUNs over one TB, they would have to quadrulpe the about of storage usesd in there SAN in order to hold all of the files. A possible solution to this inode issue would be ZFS. However they have experienced kernel panics in there environment when a controller dropped of line. Any body have a solution to this? 
Shawn This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Difference between ZFS and UFS with one LUN froma SAN
Hi Tim, One switch environment, two ports going to the host, 4 ports going to the storage. Switch is a Brocade SilkWorm 3850 and the HBA is a dual-port QLA2342. Solaris rev is S10 update 3. Array is a StorageTek FLX210 (Engenio 2884) The LUNs had moved to the other controller and MPXIO had shown the paths change as a result, so it was a bit bizarre. Rebooting the other controller shouldn't have done anything, but it did. Could have been the array. -J On 12/22/06, Tim Cook [EMAIL PROTECTED] wrote: Always good to hear others experiences J. Maybe I'll try firing up the Nexan today and downing a controller to see how that affects it vs. downing a switch port/pulling cable. My first intuition is time-out values. A cable pull will register differently than a blatant time-out depending on where it occurs. IE: Pulling the cable from the back of the server will register instantly, vs. the storage timing out 3 switches away. I'm sure you're aware of that, but just an FYI for others following the thread less familiar with SAN technology. To get a little more background: What kind of an array is it? How do you have the controllers setup? Active/active? Active/passive? In other words do you have array side failover occurring as well or is it in *dummy mode*? Do you have multiple physical paths? IE: each controller port and each server port hitting different switches? What HBA's are you using? What switches? What version of snv are you running, and which driver? Yey for slow Friday's before x-mas, I have a bit of time to play in the lab today. --Tim -Original Message- From: Jason J. W. Williams [mailto:[EMAIL PROTECTED] Sent: Friday, December 22, 2006 10:56 AM To: Tim Cook Cc: Shawn Joy; zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] Re: Difference between ZFS and UFS with one LUN froma SAN Just for what its worth, when we rebooted a controller in our array (we pre-moved all the LUNs to the other controller), despite using MPXIO ZFS kernel panicked. Verified that all the LUNs were on the correct controller when this occurred. Its not clear why ZFS thought it lost a LUN but it did. We have done cable pulling using ZFS/MPXIO before and that works very well. It may well be array-related in our case, but I hate anyone to have a false sense of security. -J On 12/22/06, Tim Cook [EMAIL PROTECTED] wrote: This may not be the answer you're looking for, but I don't know if it's something you've thought of. If you're pulling a LUN from an expensive array, with multiple HBA's in the system, why not run mpxio? If you ARE running mpxio, there shouldn't be an issue with a path dropping. I have the setup above in my test lab and pull cables all the time and have yet to see a zfs kernel panic. Is this something you've considered? I haven't seen the bug in question, but I definitely have not run into it when running mpxio. --Tim -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Shawn Joy Sent: Friday, December 22, 2006 7:35 AM To: zfs-discuss@opensolaris.org Subject: [zfs-discuss] Re: Difference between ZFS and UFS with one LUN froma SAN OK, But lets get back to the original question. Does ZFS provide you with less features than UFS does on one LUN from a SAN (i.e is it less stable). ZFS on the contrary checks every block it reads and is able to find the mirror or reconstruct the data in a raidz config. Therefore ZFS uses only valid data and is able to repair the data blocks automatically. This is not possible in a traditional filesystem/volume manager configuration. 
The above is fine. If I have two LUNs. But my original question was if I only have one LUN. What about kernel panics from ZFS if for instance access to one controller goes away for a few seconds or minutes. Normally UFS would just sit there and warn I have lost access to the controller. Then when the controller returns, after a short period, the warnings go away and the LUN continues to operate. The admin can then research further into why the controller went away. With ZFS, the above will panic the system and possibly cause other coruption on other LUNs due to this panic? I believe this was discussed in other threads? I also believe there is a bug filed against this? If so when should we expect this bug to be fixed? My understanding of ZFS is that it functions better in an environment where we have JBODs attached to the hosts. This way ZFS takes care of all of the redundancy? But what about SAN enviroments where customers have spend big money to invest in storage. I know of one instance where a customer has a growing need for more storage space. There environemt uses many inodes. Due to the UFS inode limitation, when creating LUNs over one TB, they would have to quadrulpe the about of storage usesd in there SAN in order to hold all of the files. A possible solution to this inode issue would be ZFS. However they have experienced
Re: [zfs-discuss] What SATA controllers are people using for ZFS?
Hi Naveen, I believe the newer LSI cards work pretty well with Solaris. Best Regards, Jason On 12/20/06, Naveen Nalam [EMAIL PROTECTED] wrote: Hi, This may not be the right place to post, but hoping someone here is running a reliably working system with 12 drives using ZFS that can tell me what hardware they are using. I have on order with my server vendor a pair of 12-drive servers that I want to use with ZFS for our company file stores. We're trying to use Supermicro PDSME motherboards, and each has two Supermicro MV8 sata cards. Solaris 10U3 he's found doesn't work on these systems. And I just read a post today (and an older post) on this group about how the Marvell based cards lock up. I can't afford lockups since this is very critical and expensive data that is being stored. My goal is a single cpu board that works with Solaris, and somehow get 12-drives plus 2 system boot drives plugged into it. I don't see any suitable sata cards on the Sun HCL. Are there any 4-port PCIe cards that people know reliably work? The Adaptec 1430SA looks nice, but no idea if it works. I could potentially get two 4-port PCIe cards, a 2 port PCI sata card (for boot), and 4-port motherboard - for 14 drives total. And cough up the extra cash for a supported dual-cpu motherboard (though i'm only using one cpu). any advice greatly appreciated.. Thanks! Naveen This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS and SE 3511
Hi Toby, My understanding on the subject of SATA firmware reliability vs. FC/SCSI is that it's mostly related to SATA firmware being a lot younger. The FC/SCSI firmware that's out there has been debugged for 10 years or so, so it has a lot fewer hiccoughs. Pillar Data Systems told us once that they found most of their failed SATA disks were just fine when examined, so their policy is to issue a RESET to the drive when a SATA error is detected, then retry the write/read and keep trucking. If they continue to get SATA errors, then they'll fail the drive. Looking at the latest Engenio SATA products, I believe they do the same thing. It's probably unfair to expect defect rates out of SATA firmware equivalent to firmware that's been around for a long time...particularly with the price pressures on SATA. SAS may suffer the same issue, though they seem to have 1,000,000-hour MTBF ratings like their traditional FC/SCSI counterparts. On a side note, we experienced a path failure to a drive in our SATA Engenio array (older model); simply popping the drive out and back in fixed the issue...haven't had any notifications since. A RESET and RETRY would have been nice behavior to have, since popping and reinserting triggered a rebuild of the drive. Best Regards, Jason On 12/19/06, Toby Thain [EMAIL PROTECTED] wrote: On 19-Dec-06, at 2:42 PM, Jason J. W. Williams wrote: I do see this note in the 3511 documentation: Note - Do not use a Sun StorEdge 3511 SATA array to store single instances of data. It is more suitable for use in configurations where the array has a backup or archival role. My understanding of this particular scare-tactic wording (it's also in the SANnet II OEM version manual almost verbatim) is that it has mostly to do with the relative unreliability of SATA firmware versus SCSI/FC firmware. That's such a sad sentence to have to read. Either prices are unrealistically low, or the revenues aren't being invested properly? --Toby It's possible that the disks are lower-quality SATA disks too, but that was not what was relayed to us when we looked at buying the 3511 from Sun or the DotHill version (SANnet II). Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS in a SAN environment
Not sure. I don't see an advantage to moving off UFS for boot pools. :-) -J On 12/20/06, James C. McPherson [EMAIL PROTECTED] wrote: Jason J. W. Williams wrote: I agree with others here that the kernel panic is undesired behavior. If ZFS would simply offline the zpool and not kernel panic, that would obviate my request for an informational message. It'd be pretty darn obvious what was going on. What about the root/boot pool? James C. McPherson -- Solaris kernel software engineer, system admin and troubleshooter http://www.jmcp.homeunix.com/blog Find me on LinkedIn @ http://www.linkedin.com/in/jamescmcpherson ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS and SE 3511
I do see this note in the 3511 documentation: Note - Do not use a Sun StorEdge 3511 SATA array to store single instances of data. It is more suitable for use in configurations where the array has a backup or archival role. My understanding of this particular scare-tactic wording (it's also in the SANnet II OEM version manual almost verbatim) is that it has mostly to do with the relative unreliability of SATA firmware versus SCSI/FC firmware. It's possible that the disks are lower-quality SATA disks too, but that was not what was relayed to us when we looked at buying the 3511 from Sun or the DotHill version (SANnet II). Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS in a SAN environment
Shouldn't there be a big warning when configuring a pool with no redundancy and/or should that not require a -f flag ? why? what if the redundancy is below the pool .. should we warn that ZFS isn't directly involved in redundancy decisions? Because if the host controller port goes flaky and starts introducing checksum errors at the block level (a lady a few weeks ago reported this) ZFS will kernel panic, and most users won't expect it. It seems to me users should be warned of the real possibility of a kernel panic if they don't implement redundancy at the zpool level. Just my 2 cents. Best Regards, Jason ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
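(For completeness, a minimal sketch of pool-level redundancy over SAN LUNs, with illustrative device names; with a mirrored pool ZFS can repair checksum errors rather than merely detect them.)
    # two LUNs, ideally presented through separate controllers/HBAs
    zpool create tank mirror c2t0d0 c3t0d0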