Re: [zfs-discuss] Petabyte pool?

2013-03-15 Thread Ray Van Dolson
On Fri, Mar 15, 2013 at 06:09:34PM -0700, Marion Hakanson wrote:
 Greetings,
 
 Has anyone out there built a 1-petabyte pool?  I've been asked to look
 into this, and was told low performance is fine, workload is likely
 to be write-once, read-occasionally, archive storage of gene sequencing
 data.  Probably a single 10Gbit NIC for connectivity is sufficient.
 
 We've had decent success with the 45-slot, 4U SuperMicro SAS disk chassis,
 using 4TB nearline SAS drives, giving over 100TB usable space (raidz3).
 Back-of-the-envelope might suggest stacking up eight to ten of those,
 depending if you want a raw marketing petabyte, or a proper power-of-two
 usable petabyte.
 
 I get a little nervous at the thought of hooking all that up to a single
 server, and am a little vague on how much RAM would be advisable, other
 than as much as will fit (:-).  Then again, I've been waiting for
 something like pNFS/NFSv4.1 to be usable for gluing together multiple
 NFS servers into a single global namespace, without any sign of that
 happening anytime soon.
 
 So, has anyone done this?  Or come close to it?  Thoughts, even if you
 haven't done it yourself?
 
 Thanks and regards,
 
 Marion

We've come close:

admin@mes-str-imgnx-p1:~$ zpool list
NAME       SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
datapool   978T   298T   680T    30%  1.00x  ONLINE  -
syspool    278G   104G   174G    37%  1.00x  ONLINE  -

Using a Dell R720 head unit, plus a bunch of Dell MD1200 JBODs dual
pathed to a couple of LSI SAS switches.

Using Nexenta but no reason you couldn't do this w/ $whatever.

We did triple parity and our vdev membership is set up such that we can
lose up to three JBODs and still be functional (one vdev member disk
per JBOD).
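
For illustration only -- device names below are hypothetical, not our
actual layout -- the pattern looks roughly like this, with each raidz3
vdev built from one disk per JBOD (c1 through c9 being the enclosures):

  zpool create datapool \
    raidz3 c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0 c6t0d0 c7t0d0 c8t0d0 c9t0d0 \
    raidz3 c1t1d0 c2t1d0 c3t1d0 c4t1d0 c5t1d0 c6t1d0 c7t1d0 c8t1d0 c9t1d0
  # ...and so on for the remaining vdevs.  Losing any three whole JBODs
  # costs each vdev at most three members, which raidz3 tolerates.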

This is with 3TB NL-SAS drives.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Petabyte pool?

2013-03-15 Thread Ray Van Dolson
On Fri, Mar 15, 2013 at 06:31:11PM -0700, Marion Hakanson wrote:
 rvandol...@esri.com said:
  We've come close:
  
  admin@mes-str-imgnx-p1:~$ zpool list
  NAME       SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
  datapool   978T   298T   680T    30%  1.00x  ONLINE  -
  syspool    278G   104G   174G    37%  1.00x  ONLINE  -
  
  Using a Dell R720 head unit, plus a bunch of Dell MD1200 JBODs dual pathed
  to a couple of LSI SAS switches.
 
 Thanks Ray,
 
 We've been looking at those too (we've had good luck with our MD1200's).
 
 How many HBA's in the R720?
 
 Thanks and regards,
 
 Marion

We have qty 2 LSI SAS 9201-16e HBA's (Dell resold[1]).

Ray

[1] 
http://accessories.us.dell.com/sna/productdetail.aspx?c=us&l=en&s=hied&cs=65&sku=a4614101
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [discuss] Hardware Recommendations: SAS2 JBODs

2012-11-13 Thread Ray Van Dolson
On Tue, Nov 13, 2012 at 03:08:04PM -0500, Peter Tripp wrote:
 Hi folks,
 
 I'm in the market for a couple of JBODs.  Up until now I've been
 relatively lucky with finding hardware that plays very nicely with
 ZFS.  All my gear currently in production uses LSI SAS controllers
 (3801e, 9200-16e, 9211-8i) with backplanes powered by LSI SAS
 expanders (Sun x4250, Sun J4400, etc).  But I'm in the market for
 SAS2 JBODs to support a large number of 3.5 inch SAS disks (60+ 3TB disks
 to start).
 
 I'm aware of potential issues with SATA drives/interposers and the
 whole SATA Tunneling Protocol (STP) nonsense, so I'm going to stick
 to a pure SAS setup.  Also, since I've had trouble in the past
 with daisy-chained SAS JBODs I'll probably stick with one SAS 4x
 cable (SFF8088) per JBOD and unless there were a compelling reason
 for multi-pathing I'd probably stick to a single controller.  If
 possible I'd rather buy 20 packs of enterprise SAS disks with 5yr
 warranties and have the JBOD come with empty trays, but would also
 consider buying disks with the JBOD if the price wasn't too crazy.
 
 Does anyone have any positive/negative experiences with any of the following 
 with ZFS: 
  * SuperMicro SC826E16-R500LPB (2U 12 drives, dual 500w PS, single LSI SAS2X28 expander)
  * SuperMicro SC846BE16-R920B (4U 24 drives, dual 920w PS, single unknown expander)
  * Dell PowerVault MD1200 (2U 12 drives, dual 600w PS, dual unknown expanders)
  * HP StorageWorks D2600 (2U 12 drives, dual 460w PS, single/dual unknown expanders)
 
 I'm leaning towards the SuperMicro stuff, but every time I order
 SuperMicro gear there's always something missing or wrongly
 configured so some of the cost savings gets eaten up with my time
 figuring out where things went wrong and returning/ordering
 replacements.  The Dell/HP gear I'm sure is fine, but buying disks
 from them gets pricey quick. The last time I looked they charged $150
 extra per disk when the only added value was a proprietary sled and a
 shorter warranty (3yr vs 5yr).
 
 I'm open to other JBOD vendors too, was just really just curious what
 folks were using when they needed more than two dozen 3.5 inch SAS disks
 for use with ZFS.
 
 Thanks
 -Peter

We've had good experiences with the Dell MD line.  It's been MD1200s up
until now, but we're keeping our eyes on their MD3260 (60-bay).

You're right in that their costs are higher for disks and such, but
since we are a big Dell shop it simplifies support significantly for us
and we have quick turnaround on parts anywhere in the world.

If that weren't a significant issue I'd go SuperMicro or DataON.  We
used SuperMicro for quite a while with mixed experiences.  Best bet was
to find a chassis that works and stick with it as long as possible. :)

Even if you're not using Nexenta, their HCL is valuable for finding HW
that is likely to work for you.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-04 Thread Ray Van Dolson
On Thu, May 03, 2012 at 07:35:45AM -0700, Edward Ned Harvey wrote:
  From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
  boun...@opensolaris.org] On Behalf Of Ray Van Dolson
  
  System is a 240x2TB (7200RPM) system in 20 Dell MD1200 JBODs.  16 vdevs
  of 15 disks each -- RAIDZ3.  NexentaStor 3.1.2.
 
 I think you'll get better, both performance & reliability, if you break each
 of those 15-disk raidz3's into three 5-disk raidz1's.  Here's why:
 
 Obviously, with raidz3, if any 3 of 15 disks fail, you're still in
 operation, and on the 4th failure, you're toast.
 Obviously, with raidz1, if any 1 of 5 disks fail, you're still in operation,
 and on the 2nd failure, you're toast.
 
 So it's all about computing the probability of 4 overlapping failures in the
 15-disk raidz3, or 2 overlapping failures in a smaller 5-disk raidz1.  In
 order to calculate that, you need to estimate the time to resilver any one
 failed disk...
 
 In ZFS, suppose you have a record of 128k, and suppose you have a 2-way
 mirror vdev.  Then each disk writes 128k.  If you have a 3-disk raidz1, then
 each disk writes 64k.   If you have a 5-disk raidz1, then each disk writes
 32k.  If you have a 15-disk raidz3, then each disk writes 10.6k.  
 
 Assuming you have a machine in production, and you are doing autosnapshots.
 And your data is volatile.  Over time, it serves to fragment your data, and
 after a year or two of being in production, your resilver will be composed
 almost entirely of random IO.  Each of the non-failed disks must read their
 segment of the stripe, in order to reconstruct the data that will be written
 to the new good disk.  If you're in the 15-disk raidz3 configuration...
 Your segment size is approx 3x smaller, which means approx 3x more IO
 operations.
 
 Another way of saying that...  Assuming the amount of data you will write to
 your pool is the same regardless of which architecture you chose...  For
 discussion purposes, let's say you write 3T to your pool.  And let's
 momentarily assume you whole pool will be composed of 15 disks, in either a
 single raidz3, or in 3x 5-disk raidz1.  If you use one big raidz3, then the
 3T will require at least 24million 128k records to hold it all, and each
 128k record will be divided up onto all the disks.  If you use the smaller
 raidz1, then only 1T will get written to each vdev, and you will only need
 8million records on each disk.  Thus, to resilver the large vdev, you will
 require 3x more IO operations.
 
 Worse still, on each IO request, you have to wait for the slowest of all
 disks to return.  If you were in a 2-way mirror situation, your seek time
 would be the average seek time of a single disk.  But if you were in an
 infinite-disk situation, your seek time would be the worst case seek time on
 every single IO operation, which is about 2x longer than the average seek
 time.  So not only do you have 3x more seeks to perform, you have up to 2x
 longer to wait upon each seek...
 
 Now, to put some numbers on this...
 A single 1T disk can sustain (let's assume) 1.0 Gbit/sec read/write
 sequential.  This means resilvering the entire disk sequentially, including
 unused space, (which is not what ZFS does) would require 2.2 hours.  In
 practice, on my 1T disks, which are in a mirrored configuration, I find
 resilvering takes 12 hours.  I would expect this to be ~4 days if I were
 using 5-disk raidz1, and I would expect it to be ~12 days if I were using
 15-disk raidz3.
 
 Your disks are all 2T, so you should double all the times I just wrote.
 Your raidz3 should be able to resilver a single disk in approx 24 days.
 Your 5-disk raidz1 should be able to do one in ~ 8 days.  If you were using
 mirrors, ~ 1 day.
 
 Suddenly the prospect of multiple failures overlapping doesn't seem so
 unlikely.

Ed, thanks for taking the time to write this all out.  Definitely food
for thought.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] IOzone benchmarking

2012-05-01 Thread Ray Van Dolson
I'm trying to run some IOzone benchmarking on a new system to get a
feel for baseline performance.

Unfortunately, the system has a lot of memory (144GB), but I have some
time so am approaching my runs as follows:

Throughput:
iozone -m -t 8 -T -r 128k -o -s 36G -R -b bigfile.xls

IOPS:
iozone -O -i 0 -i 1 -i 2 -e -+n -r 128K -s 288G > iops.txt

Not sure what I gain/lose by using threads or not.

Am I off on this?

System is a 240x2TB (7200RPM) system in 20 Dell MD1200 JBODs.  16 vdevs of 15
disks each -- RAIDZ3.  NexentaStor 3.1.2.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-01 Thread Ray Van Dolson
On Tue, May 01, 2012 at 03:21:05AM -0700, Gary Driggs wrote:
 On May 1, 2012, at 1:41 AM, Ray Van Dolson wrote:
 
  Throughput:
 iozone -m -t 8 -T -r 128k -o -s 36G -R -b bigfile.xls
 
  IOPS:
 iozone -O -i 0 -i 1 -i 2 -e -+n -r 128K -s 288G > iops.txt
 
 Do you expect to be reading or writing 36GB or 288GB files very often on
 this array? The largest file size I've used in my still lengthy
 benchmarks was 16GB. If you use the sizes you've proposed, it could
 take several days or weeks to complete. Try a web search for iozone
 examples if you want more details on the command switches.
 
 -Gary

The problem is this box has 144GB of memory.  If I go with a 16GB file
size (which I did), then memory and caching influence the results
pretty severely (I get around 3GB/sec for writes!).

Obviously, I could yank RAM for purposes of benchmarking. :)
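
A less drastic option (untested here; a sketch assuming the standard
Solaris/NexentaStor ARC tunable applies) would be to temporarily cap the
ARC in /etc/system and reboot before benchmarking:

  set zfs:zfs_arc_max = 0x100000000
  * caps the ARC at 4GB so a 16GB working set mostly misses the cache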

Thanks,
Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] IOzone benchmarking

2012-05-01 Thread Ray Van Dolson
On Tue, May 01, 2012 at 07:18:18AM -0700, Bob Friesenhahn wrote:
 On Mon, 30 Apr 2012, Ray Van Dolson wrote:
 
  I'm trying to run some IOzone benchmarking on a new system to get a
  feel for baseline performance.
 
 Unfortunately, benchmarking with IOzone is a very poor indicator of 
 what performance will be like during normal use.  Forcing the system 
 to behave like it is short on memory only tests how the system will 
 behave when it is short on memory.
 
 Testing multi-threaded synchronous writes with IOzone might actually 
 mean something if it is representative of your work-load.
 
 Bob

Sounds like IOzone may not be my best option here (though it does
produce pretty graphs).

bonnie++ actually gave me more realistic-sounding numbers, and I've
been reading good things about fio.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS on Linux vs FreeBSD

2012-04-25 Thread Ray Van Dolson
On Wed, Apr 25, 2012 at 05:48:57AM -0700, Paul Archer wrote:
 This may fall into the realm of a religious war (I hope not!), but
 recently several people on this list have said/implied that ZFS was
 only acceptable for production use on FreeBSD (or Solaris, of course)
 rather than Linux with ZoL.
 
 I'm working on a project at work involving a large(-ish) amount of
 data, about 5TB, working its way up to 12-15TB eventually, spread
 among a dozen or so nodes. There may or may not be a clustered
 filesystem involved (probably gluster if we use anything). I've been
 looking at ZoL as the primary filesystem for this data. We're a Linux
 shop, so I'd rather not switch to FreeBSD, or any of the
 Solaris-derived distros--although I have no problem with them, I just
 don't want to introduce another OS into the mix if I can avoid it.
 
 So, the actual questions are:
 
 Is ZoL really not ready for production use?
 
 If not, what is holding it back? Features? Performance? Stability?
 
 If not, then what kind of timeframe are we looking at to get past
 whatever is holding it back?

I can't comment directly on experiences with ZoL as I haven't used it,
but it does seem to be under active development.  That can be a good
thing or a bad thing. :)

I for one would be hesitant to use it for anything production, based
solely on how young the effort is.

That said, might be worthwhile to check out the ZoL mailing lists and
bug reports to see what types of issues the early adopters are running
into and whether or not they are showstoppers for you or you are
willing to accept the risks.

For your size requirements and your intent to use Gluster, it sounds
like ext4 or xfs would be entirely suitable and are obviously more
mature on Linux at this point.

Regardless, curious to hear which way you end up going and how things
work out.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Unable to allocate dma memory for extra SGL

2012-01-10 Thread Ray Van Dolson
Hi all;

We have a Solaris 10 U9 x86 instance running on Silicon Mechanics /
SuperMicro hardware.

Occasionally under high load (ZFS scrub for example), the box becomes
non-responsive (it continues to respond to ping but nothing else works
-- not even the local console).  Our only solution is to hard reset
after which everything comes up normally.

Logs are showing the following:

  Jan  8 09:44:08 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
  Jan  8 09:44:08 prodsys-dmz-zfs2    MPT SGL mem alloc failed
  Jan  8 09:44:08 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
  Jan  8 09:44:08 prodsys-dmz-zfs2    Unable to allocate dma memory for extra SGL.
  Jan  8 09:44:08 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
  Jan  8 09:44:08 prodsys-dmz-zfs2    Unable to allocate dma memory for extra SGL.
  Jan  8 09:44:10 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
  Jan  8 09:44:10 prodsys-dmz-zfs2    Unable to allocate dma memory for extra SGL.
  Jan  8 09:44:10 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
  Jan  8 09:44:10 prodsys-dmz-zfs2    MPT SGL mem alloc failed
  Jan  8 09:44:11 prodsys-dmz-zfs2 rpcmod: [ID 851375 kern.warning] WARNING: svc_cots_kdup no slots free

I am able to resolve the last error by adjusting upwards the duplicate
request cache sizes, but have been unable to find anything on the MPT
SGL errors.
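
(For reference, the duplicate-request cache bump was along these lines in
/etc/system -- exact values are site-specific, so treat this as a sketch:)

  set rpcmod:cotsmaxdupreqs = 4096
  set rpcmod:maxdupreqs = 4096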

Anyone have any thoughts on what this error might be?

At this point, we are simply going to apply patches to this box (we do
see an outstanding mpt patch):

147150 --  01 R-- 124 SunOS 5.10_x86: mpt_sas patch
147702 --  03 R--  21 SunOS 5.10_x86: mpt patch

But we have another identically configured box at the same patch level
(admittedly with slightly less workload, though it also undergoes
monthly zfs scrubs) which does not experience this issue.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Unable to allocate dma memory for extra SGL

2012-01-10 Thread Ray Van Dolson
On Tue, Jan 10, 2012 at 06:23:50PM -0800, Hung-Sheng Tsao (laoTsao) wrote:
 How much RAM is there, what is the zpool setup, and what are your HBA and
 HDD size and type?

Hmm, actually this system has only 6GB of memory.  For some reason I
thought it had more.

The controller is an LSISAS2008 (which oddly enough does not seem to be
recognized by lsiutil).

There are 23x1TB disks (SATA interface, not SAS unfortunately) in the
system.  Three RAIDZ2 vdevs of seven disks each plus one spare make up
a single zpool with two zfs file systems mounted (no deduplication or
compression in use).

There are two internally mounted Intel X-25E's -- these double as the
rootpool and ZIL devices.

There is an 80GB X-25M mounted to the expander along with the 1TB
drives operating as L2ARC.

 
 On Jan 10, 2012, at 21:07, Ray Van Dolson rvandol...@esri.com wrote:
 
  Hi all;
  
  We have a Solaris 10 U9 x86 instance running on Silicon Mechanics /
  SuperMicro hardware.
  
  Occasionally under high load (ZFS scrub for example), the box becomes
  non-responsive (it continues to respond to ping but nothing else works
  -- not even the local console).  Our only solution is to hard reset
  after which everything comes up normally.
  
  Logs are showing the following:
  
   Jan  8 09:44:08 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
   Jan  8 09:44:08 prodsys-dmz-zfs2    MPT SGL mem alloc failed
   Jan  8 09:44:08 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
   Jan  8 09:44:08 prodsys-dmz-zfs2    Unable to allocate dma memory for extra SGL.
   Jan  8 09:44:08 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
   Jan  8 09:44:08 prodsys-dmz-zfs2    Unable to allocate dma memory for extra SGL.
   Jan  8 09:44:10 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
   Jan  8 09:44:10 prodsys-dmz-zfs2    Unable to allocate dma memory for extra SGL.
   Jan  8 09:44:10 prodsys-dmz-zfs2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,72@0 (mpt_sas0):
   Jan  8 09:44:10 prodsys-dmz-zfs2    MPT SGL mem alloc failed
   Jan  8 09:44:11 prodsys-dmz-zfs2 rpcmod: [ID 851375 kern.warning] WARNING: svc_cots_kdup no slots free
  
  I am able to resolve the last error by adjusting upwards the duplicate
  request cache sizes, but have been unable to find anything on the MPT
  SGL errors.
  
  Anyone have any thoughts on what this error might be?
  
  At this point, we are simply going to apply patches to this box (we do
  see an outstanding mpt patch):
  
  147150 --  01 R-- 124 SunOS 5.10_x86: mpt_sas patch
  147702 --  03 R--  21 SunOS 5.10_x86: mpt patch
  
  But we have another identically configured box at the same patch level
  (admittedly with slightly less workload, though it also undergoes
  monthly zfs scrubs) which does not experience this issue.
  
  Ray

Thanks,
Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS + Dell MD1200's - MD3200 necessary?

2012-01-05 Thread Ray Van Dolson
We are looking at building a storage platform based on Dell HW + ZFS
(likely Nexenta).

Going Dell because they can provide solid HW support globally.

Are any of you using the MD1200 JBOD with head units *without* an
MD3200 in front?  We are being told that the MD1200's won't daisy
chain unless the MD3200 is involved.

We would be looking to use some sort of LSI-based SAS controller on the
Dell front-end servers.

Looking to confirm from folks who have this deployed in the wild.
Perhaps you'd be willing to describe your setup as well and anything we
might need to take into consideration (thinking best option for getting
ZIL/L2ARC devices into Dell R510 head units for example in a supported
manner).

Thanks,
Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + Dell MD1200's - MD3200 necessary?

2012-01-05 Thread Ray Van Dolson
On Thu, Jan 05, 2012 at 06:07:33PM -0800, Craig Morgan wrote:
 Ray,
 
 If you are intending to go Nexenta then speak to your local Nexenta SE, 
 we've got HSL qualified solutions which cover our h/w support and we've
 explicitly qualed some MD1200 configs with Dell for certain deployments
 to guarantee support via both Dell h/w support and ourselves.
 
 If you don't know who that would be drop me a line and I'll find someone
 local to you …
 
 We tend to go with the LSI cards, but even there there are some issues
 with regard to Dell supply or over the counter.
 
 HTH
 
 Craig

Hi Craig;

Yep, we are doing this.  Just trying to sanity check the suggested
config against what folks are doing in the wild as our Dell partner
doesn't seem to think it should/can be done without the MD3200.  They
may have ulterior motives of course. :)

Thanks,
Ray

 
 On 6 Jan 2012, at 01:28, Ray Van Dolson wrote:
 
  We are looking at building a storage platform based on Dell HW + ZFS
  (likely Nexenta).
  
  Going Dell because they can provide solid HW support globally.
  
  Are any of you using the MD1200 JBOD with head units *without* an
  MD3200 in front?  We are being told that the MD1200's won't daisy
  chain unless the MD3200 is involved.
  
  We would be looking to use some sort of LSI-based SAS controller on the
  Dell front-end servers.
  
  Looking to confirm from folks who have this deployed in the wild.
  Perhaps you'd be willing to describe your setup as well and anything we
  might need to take into consideration (thinking best option for getting
  ZIL/L2ARC devices into Dell R510 head units for example in a supported
  manner).
  
  Thanks,
  Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resolving performance issue w/ deduplication (NexentaStor)

2011-12-30 Thread Ray Van Dolson
On Fri, Dec 30, 2011 at 05:57:47AM -0800, Hung-Sheng Tsao (laoTsao) wrote:
 now S11 supports shadow migration, just for this purpose, AFAIK
 not sure nexentaStor support shadow migration

Does not appear that it does (at least the shadow property is not in
NexentaStor's zfs man page).

Thanks for the pointer.

Ray

 
 On Dec 30, 2011, at 2:03, Ray Van Dolson rvandol...@esri.com wrote:
 
  On Thu, Dec 29, 2011 at 10:59:04PM -0800, Fajar A. Nugraha wrote:
  On Fri, Dec 30, 2011 at 1:31 PM, Ray Van Dolson rvandol...@esri.com 
  wrote:
  Is there a non-disruptive way to undeduplicate everything and expunge
  the DDT?
  
  AFAIK, no
  
   zfs send/recv and then back perhaps (we have the extra
  space)?
  
  That should work, but it's disruptive :D
  
  Others might provide better answer though.
  
  Well, slightly _less_ disruptive perhaps.  We can zfs send to another
  file system on the same system, but different set of disks.  We then
  disable NFS shares on the original, do a final zfs send to sync, then
  share out the new undeduplicated file system with the same name.
  Hopefully the window here is short enough that NFS clients are able to
  recover gracefully.
  
  We'd then wipe out the old zpool, recreate and do the reverse to get
  data back onto it..
  
  Thanks,
  Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resolving performance issue w/ deduplication (NexentaStor)

2011-12-30 Thread Ray Van Dolson
Thanks for you response, Richard.

On Fri, Dec 30, 2011 at 09:52:17AM -0800, Richard Elling wrote:
 On Dec 29, 2011, at 10:31 PM, Ray Van Dolson wrote:
 
  Hi all;
  
  We have a dev box running NexentaStor Community Edition 3.1.1 w/ 24GB
  (we don't run dedupe on production boxes -- and we do pay for Nexenta
  licenses on prd as well) RAM and an 8.5TB pool with deduplication
  enabled (1.9TB or so in use).  Dedupe ratio is only 1.26x.
 
 Yes, this workload is a poor fit for dedup.
 
  The box has an SLC-based SSD as ZIL and a 300GB MLC SSD as L2ARC.
  
  The box has been performing fairly poorly lately, and we're thinking
  it's due to deduplication:
  
   # echo ::arc | mdb -k | grep arc_meta
   arc_meta_used =  5884 MB
   arc_meta_limit=  5885 MB
 
 This can be tuned. Since you are on the community edition and thus have no 
 expectation of support, you can increase this limit yourself. In the future, 
 the
 limit will be increased OOB. For now, add something like the following to the
 /etc/system file and reboot.
 
 *** Parameter: zfs:zfs_arc_meta_limit
 ** Description: sets the maximum size of metadata stored in the ARC.
 **   Metadata competes with real data for ARC space.
 ** Release affected: NexentaStor 3.0, 3.1, not needed for 4.0
 ** Validation: none
 ** When to change: for metadata-intensive or deduplication workloads
 **   having more metadata in the ARC can improve performance.
 ** Stability: NexentaStor issue #7151 seeks to change the default 
 **   value to be larger than 1/4 of arc_max.
 ** Data type: integer
 ** Default: 1/4 of arc_max (bytes)
 ** Range: 1 to arc_max
 ** Changed by: YOUR_NAME_HERE
 ** Change date: TODAYS_DATE
 **
 *set zfs:zfs_arc_meta_limit = 1000

If we wanted to this on a running system, would the following work?

  # echo arc_meta_limit/Z 0x280000000 | mdb -kw

(To up arc_meta_limit to 10GB)
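
Either way, I assume the same ::arc dcmd used above would confirm whether
the change took:

  # echo ::arc | mdb -k | grep arc_meta_limit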

 
 
   arc_meta_max  =  5888 MB
  
   # zpool status -D
   ...
   DDT entries 24529444, size 331 on disk, 185 in core
  
  So, not only are we using up all of our metadata cache, but the DDT
  table is taking up a pretty significant chunk of that (over 70%).
  
  ARC sizing is as follows:
  
   p = 15331 MB
   c = 16354 MB
   c_min =  2942 MB
   c_max = 23542 MB
   size  = 16353 MB
  
  I'm not really sure how to determine how many blocks are on this zpool
  (is it the same as the # of DDT entries? -- deduplication has been on
  since pool creation).  If I use a 64KB block size average, I get about
  31 million blocks, but DDT entries are 24 million ….
 
 The zpool status -D output shows the number of blocks.
 
  zdb -DD and zdb -bb | grep 'bp count' both do not complete (zdb says
  I/O error).  Probably because the pool is in use and is quite busy.
 
 Yes, zdb is not expected to produce correct output for imported pools.
 
  Without the block count I'm having a hard time determining how much
  memory we _should_ have.  I can only speculate that it's more at this
  point. :)
  
  If I assume 24 million blocks is about accurate (from zpool status -D
  output above), then at 320 bytes per block we're looking at about 7.1GB
  for DDT table size.  
 
 That is the on-disk calculation. Use the in-core number for memory 
 consumption.
   RAM needed if DDT is completely in ARC = 4,537,947,140 bytes (+)
 
  We do have L2ARC, though I'm not sure how ZFS
  decides what portion of the DDT stays in memory and what can go to
  L2ARC -- if all of it went to L2ARC, then the references to this
  information in arc_meta would be (at 176 bytes * 24million blocks)
  around 4GB -- which again is a good chunk of arc_meta_max.
 
 Some of the data might already be in L2ARC. But L2ARC access is always
 slower than RAM access by a few orders of magnitude.
 
  Given that our dedupe ratio on this pool is fairly low anyways, am
  looking for strategies to back out.  Should we just disable
  deduplication and then maybe bump up the size of the arc_meta_max?
  Maybe also increase the size of arc.size as well (8GB left for the
  system seems higher than we need)?
 
 The arc_size is dynamic, but limited by another bug in Solaris to effectively
 7/8 of RAM (fixed in illumos). Since you are unsupported, you can try to add
 the following to /etc/system along with the tunable above.
 
 *** Parameter: swapfs_minfree
 ** Description: sets the minimum space reserved for the rest of the
 **   system as swapfs grows. This value is also used to calculate the
 **   dynamic upper limit of the ARC size.
 ** Release affected: NexentaStor 3.0, 3.1, not needed for 4.0
 ** Validation: none
 ** When to change: the default setting of physmem/8 caps the ARC to
 **   approximately 7/8 of physmem, a value usually much smaller than
 **   arc_max. Choosing a lower limit for swapfs_minfree can allow the
 **   ARC to grow above 7/8 of physmem.
 ** Data

[zfs-discuss] Resolving performance issue w/ deduplication (NexentaStor)

2011-12-29 Thread Ray Van Dolson
Hi all;

We have a dev box running NexentaStor Community Edition 3.1.1 w/ 24GB
(we don't run dedupe on production boxes -- and we do pay for Nexenta
licenses on prd as well) RAM and an 8.5TB pool with deduplication
enabled (1.9TB or so in use).  Dedupe ratio is only 1.26x.

The box has an SLC-based SSD as ZIL and a 300GB MLC SSD as L2ARC.

The box has been performing fairly poorly lately, and we're thinking
it's due to deduplication:

  # echo ::arc | mdb -k | grep arc_meta
  arc_meta_used =  5884 MB
  arc_meta_limit=  5885 MB
  arc_meta_max  =  5888 MB

  # zpool status -D
  ...
  DDT entries 24529444, size 331 on disk, 185 in core

So, not only are we using up all of our metadata cache, but the DDT
table is taking up a pretty significant chunk of that (over 70%).

ARC sizing is as follows:

  p = 15331 MB
  c = 16354 MB
  c_min =  2942 MB
  c_max = 23542 MB
  size  = 16353 MB

I'm not really sure how to determine how many blocks are on this zpool
(is it the same as the # of DDT entries? -- deduplication has been on
since pool creation).  If I use a 64KB block size average, I get about
31 million blocks, but DDT entries are 24 million 

zdb -DD and zdb -bb | grep 'bp count' both do not complete (zdb says
I/O error).  Probably because the pool is in use and is quite busy.

Without the block count I'm having a hard time determining how much
memory we _should_ have.  I can only speculate that it's more at this
point. :)

If I assume 24 million blocks is about accurate (from zpool status -D
output above), then at 320 bytes per block we're looking at about 7.1GB
for DDT table size.  We do have L2ARC, though I'm not sure how ZFS
decides what portion of the DDT stays in memory and what can go to
L2ARC -- if all of it went to L2ARC, then the references to this
information in arc_meta would be (at 176 bytes * 24million blocks)
around 4GB -- which again is a good chunk of arc_meta_max.
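
(Quick sanity check on those two figures, using the exact entry count from
the zpool status -D output above rather than the rounded 24 million:)

  echo '24529444 * 320 / 1024^3' | bc -l    # on-disk DDT size in GiB -> ~7.3
  echo '24529444 * 176 / 1024^3' | bc -l    # in-core references in GiB -> ~4.0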

Given that our dedupe ratio on this pool is fairly low anyways, am
looking for strategies to back out.  Should we just disable
deduplication and then maybe bump up the size of the arc_meta_max?
Maybe also increase the size of arc.size as well (8GB left for the
system seems higher than we need)?

Is there a non-disruptive way to undeduplicate everything and expunge
the DDT?  zfs send/recv and then back perhaps (we have the extra
space)?

Thanks,
Ray

[1] http://markmail.org/message/db55j6zetifn4jkd
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Resolving performance issue w/ deduplication (NexentaStor)

2011-12-29 Thread Ray Van Dolson
On Thu, Dec 29, 2011 at 10:59:04PM -0800, Fajar A. Nugraha wrote:
 On Fri, Dec 30, 2011 at 1:31 PM, Ray Van Dolson rvandol...@esri.com wrote:
  Is there a non-disruptive way to undeduplicate everything and expunge
  the DDT?
 
 AFAIK, no
 
   zfs send/recv and then back perhaps (we have the extra
  space)?
 
 That should work, but it's disruptive :D
 
 Others might provide better answer though.

Well, slightly _less_ disruptive perhaps.  We can zfs send to another
file system on the same system, but different set of disks.  We then
disable NFS shares on the original, do a final zfs send to sync, then
share out the new undeduplicated file system with the same name.
Hopefully the window here is short enough that NFS clients are able to
recover gracefully.

We'd then wipe out the old zpool, recreate and do the reverse to get
data back onto it..
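
Roughly (dataset names hypothetical), the shuffle would look like:

  zfs snapshot datapool/vol1@move1
  zfs send datapool/vol1@move1 | zfs receive tmppool/vol1
  # quiesce/unshare NFS on the original, then send a final incremental:
  zfs snapshot datapool/vol1@move2
  zfs send -i @move1 datapool/vol1@move2 | zfs receive -F tmppool/vol1
  # re-share tmppool/vol1 under the old name, then rebuild datapool later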

Thanks,
Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS in front of MD3000i

2011-10-24 Thread Ray Van Dolson
We're setting up ZFS in front of an MD3000i (and attached MD1000
expansion trays).

The rule of thumb is to let ZFS manage all of the disks, so we wanted
to expose each MD3000i spindle via a JBOD mode of some sort.

Unfortunately, it doesn't look like the MD3000i supports this (though this[1]
post seems to reference an Enhanced JBOD mode), so we decided to
create a whole bunch of RAID0 1-disk LUNs and expose those.  Great..
except that the MD3000i only lets you create 16 LUNs and we have 44
disks total. :)

Anyone tried this?  I guess our best bet will be to just do all the
RAID stuff on the MD3000i and export one LUN to ZFS.

Ray

[1] http://don.blogs.smugmug.com/2007/10/01/dell-md3000-great-das-db-storage/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Replacement for X25-E

2011-09-22 Thread Ray Van Dolson
On Thu, Sep 22, 2011 at 12:46:42PM -0700, Brandon High wrote:
 On Tue, Sep 20, 2011 at 12:21 AM, Markus Kovero markus.kov...@nebula.fi 
 wrote:
  Hi, I was wondering do you guys have any recommendations as replacement for
  Intel X25-E as it is being EOL’d? Mainly as for log device.
 
 The Intel 311 seems like a good fit. It's a 20gb SLC device intended
 to act as a cache device with the Z68 chipset.

It seems to perform similarly to the X-25E as well (3300 IOPS for
random writes).  Perhaps the drive can be overprovisioned as well?

My impression was that Intel was classifying the 3xx series as
non-Enterprise however.  Even with the SLC.

I'm not sure what its rated lifetime is (1PB of data written?).

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Replacement for X25-E

2011-09-22 Thread Ray Van Dolson
On Thu, Sep 22, 2011 at 01:21:26PM -0700, Brandon High wrote:
 On Thu, Sep 22, 2011 at 12:53 PM, Ray Van Dolson rvandol...@esri.com wrote:
  It seems to perform similarly to the X-25E as well (3300 IOPS for
  random writes).  Perhaps the drive can be overprovisioned as well?
 
  My impression was that Intel was classifying the 3xx series as
  non-Enterprise however.  Even with the SLC.
 
 I don't think the 311 has any over-provisioning (other than the 7%
 from GB to GiB conversion). I believe it is an X25-E with only 5
 channels populated. The upcoming enterprise models are MLC based and
 have greater over-provisioning AFAIK.
 
 The 20GB 311 only costs ~ $100 though. The 100GB Intel 710 costs ~ $650.
 
 The 311 is a good choice for home or budget users, and it seems that
 the 710 is much bigger than it needs to be for slog devices.

My thoughts exactly.

If the 311 is aimed at home users (wear-wise in _addition_ to marketing
wise), then it doesn't really seem there is a suitable Intel
replacement for the X-25E as far as an slog device is concerned.

The drives are all way too big. :)

We are currently looking at using the 320 or 710 overprovisioned
(though the latter is likely more than we want to spend).

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Replacement for X25-E

2011-09-22 Thread Ray Van Dolson
On Thu, Sep 22, 2011 at 01:34:09PM -0700, Bob Friesenhahn wrote:
 On Thu, 22 Sep 2011, Brandon High wrote:
 
  The 20GB 311 only costs ~ $100 though. The 100GB Intel 710 costs ~ $650.
 
  The 311 is a good choice for home or budget users, and it seems that
  the 710 is much bigger than it needs to be for slog devices.
 
 Much too big is a good thing if it results in much more space 
 available for wear-leveling.  If the device is designed well, it 
 should last longer.
 
 Bob

Of course, at $650 a pop, if you're buying two Intel 710 100GB drives
for either increased performance or redundancy, you could basically
afford a DDRdrive...

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intel 320 as ZIL?

2011-08-15 Thread Ray Van Dolson
On Fri, Aug 12, 2011 at 06:53:22PM -0700, Edward Ned Harvey wrote:
  From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
  boun...@opensolaris.org] On Behalf Of Ray Van Dolson
  
  For ZIL, I
  suppose we could get the 300GB drive and overcommit to 95%!
 
 What kind of benefit does that offer?  I suppose, if you have a 300G drive
 and the OS can only see 30G of it, then the drive can essentially treat all
 the other 270G as having been TRIM'd implicitly, even if your OS doesn't
 support TRIM.  It is certainly conceivable this could make a big difference.

Perhaps this is it.  Pulled the recommendation from Intel's "Solid-State
Drive 320 Series in Server Storage Applications" whitepaper.

Section 4.1:

  A small reduction in an SSD’s usable capacity can provide a large
  increase in random write performance and endurance. 

  All Intel SSDs have more NAND capacity than what is available for
  user data. The unused capacity is called spare capacity. This area is
  reserved for internal operations.  The larger the spare capacity, the
  more efficiently the SSD can perform random write operations and the
  higher the random write performance. 

  On the Intel SSD 320 Series, the spare capacity reserved at the
  factory is 7% to 11% (depending on the SKU) of the full NAND
  capacity. For better random write performance and endurance, the
  spare capacity can be increased by reducing the usable capacity of
  the drive; this process is called over-provisioning.

 
 
 Have you already tested it?  Anybody?  Or is it still just theoretical
 performance enhancement, compared to using a normal sized drive in a
 normal mode?
 

Haven't yet tested it, but hope to shortly.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intel 320 as ZIL?

2011-08-15 Thread Ray Van Dolson
On Mon, Aug 15, 2011 at 01:38:36PM -0700, Brandon High wrote:
 On Thu, Aug 11, 2011 at 1:00 PM, Ray Van Dolson rvandol...@esri.com wrote:
  Are any of you using the Intel 320 as ZIL?  It's MLC based, but I
  understand its wear and performance characteristics can be bumped up
  significantly by increasing the overprovisioning to 20% (dropping
  usable capacity to 80%).
 
 Intel recently added the 311, a small SLC-based drive for use as a
 temp cache with their Z68 platform. It's limited to 20GB, but it might
 be a better fit for use as a ZIL than the 320.
 
 -B

Looks interesting... specs around the same as the old X-25E.  We have
heard however, that Intel will be announcing a true successor to their
X-25E line shortly.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intel 320 as ZIL?

2011-08-12 Thread Ray Van Dolson
On Thu, Aug 11, 2011 at 09:17:38PM -0700, Cooper Hubbell wrote:
 Which 320 series drive are you targeting, specifically?  The ~$100
 80GB variant should perform as well as the more expensive versions if
 your workload is more random from what I've seen/read.

ESX NFS-attached datastore activity.  Probably up to 100 VM's (about
the same as we did with the X-25E).

Larger drives would let us set overcommit pretty high :)  For ZIL, I
suppose we could get the 300GB drive and overcommit to 95%!

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Intel 320 as ZIL?

2011-08-11 Thread Ray Van Dolson
Are any of you using the Intel 320 as ZIL?  It's MLC based, but I
understand its wear and performance characteristics can be bumped up
significantly by increasing the overprovisioning to 20% (dropping
usable capacity to 80%).

Anyone have experience with this?

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Intel 320 as ZIL?

2011-08-11 Thread Ray Van Dolson
On Thu, Aug 11, 2011 at 01:10:07PM -0700, Ian Collins wrote:
   On 08/12/11 08:00 AM, Ray Van Dolson wrote:
  Are any of you using the Intel 320 as ZIL?  It's MLC based, but I
  understand its wear and performance characteristics can be bumped up
  significantly by increasing the overprovisioning to 20% (dropping
  usable capacity to 80%).
 
 A log device doesn't have to be larger than a few GB, so that shouldn't 
 be a problem.  I've found even low cost SSDs make a huge difference to 
 the NFS write performance of a pool.

We've been using the X-25E (SLC-based).  It's getting hard to find, and
since we're trying to stick to Intel drives (Nexenta certifies them),
and Intel doesn't have a new SLC drive available until late September,
we're hoping an overprovisioned 320 could fill the gap until then and
perform at least as well as the X-25E.
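
(Sizing-wise, since a slog only needs a few GB, one approach we've been
considering -- sketched here with hypothetical device names -- is to slice
the SSD with format(1M) and hand ZFS just a small slice, leaving the rest
of the flash as extra spare area:)

  zpool add tank log c2t1d0s0    # s0 sized at a few GB; remainder of the SSD left unused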

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Adjusting HPA from Solaris on Intel 320 SSD's

2011-07-18 Thread Ray Van Dolson
Is there a way to tweak the HPA (Host Protected Area) on an Intel 320
SSD using native Solaris commands?

In this case, we'd like to shrink the usable space so as to improve
performance per the recommendation in Intel's "Solid-State Drive 320 Series
in Server Storage Applications" whitepaper, section 4.1.

hdparm on Linux is referenced, and it may be doable via the Intel Solid
State Drive Toolbox, but would be great to be able to tweak and query
this from Solaris / OpenSolaris / NexentaStor.
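
(For comparison, the Linux-side hdparm approach referenced above looks
roughly like this -- the device name and sector count are placeholders,
and the leading "p" makes the new limit persistent:)

  hdparm -N /dev/sdb              # query current and native max sector count
  hdparm -N p234441648 /dev/sdb   # set a smaller persistent max, creating an HPA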

Did come across this[1] thread from 2007, but it's not clear if
'format' or some other utility gained this functionality since.

Thanks,
Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Should Intel X25-E not be used with a SAS Expander?

2011-06-02 Thread Ray Van Dolson
On Thu, Jun 02, 2011 at 11:19:25AM -0700, Josh Simon wrote:
 I don't believe this to be the reason since there are other SATA 
 (single-port) SSD drives listed as approved in that same document.
 
 Upon further research I found some interesting links that may point to a 
 potentially different reason for not using the Intel X25-E with a SAS 
 Expander:
 
 http://gdamore.blogspot.com/2010/08/why-sas-sata-is-not-such-great-idea.html
 
 Update: At a significant account, I can say that we (meaning Nexenta) 
 have verified that SAS/SATA expanders combined with high loads of ZFS 
 activity have proven conclusively to be highly toxic. So, if you're 
 designing an enterprise storage solution, please consider using SAS all 
 the way to the disk drives, and just skip those cheaper SATA options. 
 You may think SATA looks like a bargain, but when your array goes 
 offline during ZFS scrub or resilver operations because the expander is 
 choking on cache sync commands, you'll really wish you had spent the 
 extra cash up front. Really.
 
 and
 
 http://gdamore.blogspot.com/2010/12/update-on-sata-expanders.html
 
 This sounds like it will affect a lot of people since so many are using 
 SATA SSD for their log devices connected to SAS expanders.
 
 Thanks,
 
 Josh Simon
 

Yup; reset storms affected us as well (we were using the X-25 series
for ZIL/L2ARC).  Only the ZIL drives were impacted, but it was a large
impact :)

Our solution was to move the SSD's off of the expander and remount
internally attached via one of the LSI SAS ports directly (we also had
problems with running the drives directly off the on-board SATA ports
on our SuperMicro motherboards -- occasionally the entire zpool would
freeze up).

Ray

 
 On 06/02/2011 01:25 PM, Jim Klimov wrote:
  2011-06-02 18:40, Josh Simon wrote:
  I was just doing some storage research and came across this
  http://www.nexenta.com/corp/images/stories/pdfs/hardware-supported.pdf. In
  that document for Nexenta (an opensolaris variant) it states that you
  should not use Intel X25-E SSDSA2SH032G1 SSD with a SAS Expander. Can
  anyone tell me why?
 
  This seems to be a very common drive people deploy in ZFS pools.
 
  I believe one reason is that these are single-port devices - and as such
  do not support failover to another SAS path.
 
 
  //Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Should Intel X25-E not be used with a SAS Expander?

2011-06-02 Thread Ray Van Dolson
On Thu, Jun 02, 2011 at 11:39:13AM -0700, Donald Stahl wrote:
  Yup; reset storms affected us as well (we were using the X-25 series
  for ZIL/L2ARC).  Only the ZIL drives were impacted, but it was a large
  impact :)
 What did you see with your reset storm? Were there log errors in
 /var/adm/messages or did you need to check the controller logs with
 something like lsiutil?

Yep, /var/adm/messages had Unit Attention errors.  Ref:

http://markmail.org/message/5rmfzvqwlmosh2oh

 Did the reset workaround in the blog post help?

We re-architected before reading the blog post, so I'm unsure if it
would have helped or not.  In any case, moving the SSDs internally lets
us use additional hot-swappable data disks, so it was beneficial in
other areas as well.

 
 The expanders you were using were SAS/SATA expanders? Or SAS expanders
 with adapters on the drive to allow the use of SATA disks?

The expander was a SuperMicro SAS-846EL1 which is a SAS expander but has
SFF-8482 connectors to provide compatibility with SATA drives.

 
 I've been using 4 X-25E's with Promise J610sD SAS shelves and the
 AAMUX adapters and have yet to have a problem.

It definitely seemed intermittent, and various suggestions we received
indicated we might need to downgrade our backplane/expander's firmware.
Never did try that, but it wouldn't surprise me if behavior was
better/worse on different backplanes...

 
  Our solution was to move the SSD's off of the expander and remount
  internally attached via one of the LSI SAS ports directly (we also had
  problems with running the drives directly off the on-board SATA ports
  on our SuperMicro motherboards -- occasionally the entire zpool would
  freeze up).

 I'm surprised you had problems with the internal SATA ports as well-
 any idea what was causing the problems there?

Nope.  I posted this:

http://mail.opensolaris.org/pipermail/zfs-discuss/2010-October/045625.html

But got no responses.  We resolved the NFS errors (which I believe were
coincidental), but the watchdog port issues kept reoccurring without
rhyme or reason.  The box itself wouldn't lock up, but the zpool would
become non-responsive and we'd have to hard reset.

This was all production stuff, so as soon as we were able to, we
ditched using the SATA ports entirely instead of pursuing a fix with
Sun.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Tuning disk failure detection?

2011-05-10 Thread Ray Van Dolson
We recently had a disk fail on one of our whitebox (SuperMicro) ZFS
arrays (Solaris 10 U9).

The disk began throwing errors like this:

May  5 04:33:44 dev-zfs4 scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci15d9,400@0 (mpt_sas0):
May  5 04:33:44 dev-zfs4    mptsas_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31110610

And errors for the drive were incrementing in iostat -En output.
Nothing was seen in fmdump.

Unfortunately, it took about three hours for ZFS (or maybe it was MPT)
to decide the drive was actually dead:

May  5 07:41:06 dev-zfs4 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g5000c5002cbc76c0 (sd4):
May  5 07:41:06 dev-zfs4    drive offline

During this three hours the I/O performance on this server was pretty
bad and caused issues for us.  Once the drive failed completely, ZFS
pulled in a spare and all was well.

My question is -- is there a way to tune the MPT driver or even ZFS
itself to be more/less aggressive on what it sees as a failure
scenario?

I suppose this would have been handled differently / better if we'd
been using real Sun hardware?

Our other option is to watch better for log entries similar to the
above and either alert someone or take some sort of automated action
.. I'm hoping there's a better way to tune this via driver or ZFS
settings however.
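
(One knob that gets mentioned for this class of problem -- untested by us,
so just a sketch -- is the sd target driver's per-command timeout in
/etc/system, which defaults to 60 seconds:)

  set sd:sd_io_time = 10
  * shorter SCSI command timeout so a dying drive gets declared dead sooner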

Thanks,
Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Tuning disk failure detection?

2011-05-10 Thread Ray Van Dolson
On Tue, May 10, 2011 at 02:42:40PM -0700, Jim Klimov wrote:
 In a recent post r-mexico wrote that they had to parse system
 messages and manually fail the drives on a similar, though
 different, occasion:
 
 http://opensolaris.org/jive/message.jspa?messageID=515815#515815

Thanks Jim, good pointer.

It sounds like our use of SATA disks is likely the problem and we'd
have better error reporting with SAS or some of the nearline SAS
drives (SATA drives with a real SAS controller on them).

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Tuning disk failure detection?

2011-05-10 Thread Ray Van Dolson
On Tue, May 10, 2011 at 03:57:28PM -0700, Brandon High wrote:
 On Tue, May 10, 2011 at 9:18 AM, Ray Van Dolson rvandol...@esri.com wrote:
  My question is -- is there a way to tune the MPT driver or even ZFS
  itself to be more/less aggressive on what it sees as a failure
  scenario?
 
 You didn't mention what drives you had attached, but I'm guessing they
 were normal desktop drives.
 
 I suspect (but can't confirm) that using enterprise drives with TLER /
 ERC / CCTL would have reported the failure up the stack faster than a
 consumer drive. The drives will report an error after 7 seconds rather
 than retry for several minutes.
 
 You may be able to enable the feature on your drives, depending on the
 manufacturer and firmware revision.
 
 -B

Yup, shoulda included that.  These are regular SATA drives --
supposedly "Enterprise", whatever that gives us (most likely a higher
MTBF number).

We'll probably look at going with nearline SAS drives (only increases
cost slightly) and write a small SEC rule on our syslog server to watch
for 0x3000 errors on servers with SATA disks only so we can at
least be alerted more quickly.
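
(And if the drive firmware does expose ERC, newer smartmontools can
sometimes toggle the setting Brandon describes -- a hedged example, with a
Linux-style device path shown purely for illustration:)

  smartctl -l scterc,70,70 /dev/sdb   # 7.0s read/write error recovery limits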

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-06 Thread Ray Van Dolson
On Wed, May 04, 2011 at 08:49:03PM -0700, Edward Ned Harvey wrote:
  From: Tim Cook [mailto:t...@cook.ms]
  
  That's patently false.  VM images are the absolute best use-case for dedup
  outside of backup workloads.  I'm not sure who told you/where you got the
  idea that VM images are not ripe for dedup, but it's wrong.
 
 Well, I got that idea from this list.  I said a little bit about why I
 believed it was true ... about dedup being ineffective for VM's ... Would
 you care to describe a use case where dedup would be effective for a VM?  Or
 perhaps cite something specific, instead of just wiping the whole thing and
 saying patently false?  I don't feel like this comment was productive...
 

We use dedupe on our VMware datastores and typically see 50% savings,
often times more.  We do of course keep like VM's on the same volume
(at this point nothing more than groups of Windows VM's, Linux VM's and
so on).

Note that this isn't on ZFS (yet), but we hope to begin experimenting
with it soon (using NexentaStor).

Apologies for devolving the conversation too much in the NetApp
direction -- simply was a point of reference for me to get a better
understanding of things on the ZFS side. :)

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Permanently using hot spare?

2011-05-05 Thread Ray Van Dolson
Have a failed drive on a ZFS pool (three RAIDZ2 vdevs, one hot spare).
The hot spare kicked in and all is well.

Is it possible to just make that hot spare disk -- already resilvered
into the pool -- as a permanent part of the pool?  We could then throw
in a new disk and mark it as a spare and avoid what would seem to be an
unnecessary resilver (twice, once when the spare is brought in and
again when we replace the failed disk).

This document[1] seems to make it sound like it can be done, but I'm
not really seeing how... 

Can I add the spare disk to the pool when it's already in use?
Probably not...

Note this is on Solaris 10 U9.

Thanks,
Ray

[1] http://dlc.sun.com/osol/docs/content/ZFSADMIN/gayrd.html#gcvcw 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Permanently using hot spare?

2011-05-05 Thread Ray Van Dolson
On Thu, May 05, 2011 at 03:13:06PM -0700, TianHong Zhao wrote:
 Just detach the faulty disk, then the spare will become the normal
 disk once it's finished resilvering.
 
 #zpool detach pool fault_device_name
 
 Then you need to add the new spare:
 #zpool add pool spare new_spare_device
 
 There seems to be a new feature in illumos project to support a zpool
 property like spare promotion, 
 which would not require the manual detach operation.
  
 Tianhong

Thanks!  Great tip.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Ray Van Dolson
There are a number of threads (this one[1] for example) that describe
memory requirements for deduplication.  They're pretty high.

I'm trying to get a better understanding... on our NetApps we use 4K
block sizes with their post-process deduplication and get pretty good
dedupe ratios for VM content.

Using ZFS we are using 128K record sizes by default, which nets us less
impressive savings... however, to drop to a 4K record size would
theoretically require that we have nearly 40GB of memory for only 1TB
of storage (based on 150 bytes per block for the DDT).
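
(Back-of-the-envelope for that figure, assuming the commonly cited ~150
bytes per DDT entry:)

  echo '(1024^4 / 4096) * 150 / 1024^3' | bc -l   # ~37.5 GiB of DDT per TiB at 4K blocks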

This obviously becomes prohibitively higher for 10+ TB file systems.

I will note that our NetApps are using only 2TB FlexVols, but would
like to better understand ZFS's (apparently) higher memory
requirements... or maybe I'm missing something entirely.

Thanks,
Ray

[1] http://markmail.org/message/wile6kawka6qnjdw
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Ray Van Dolson
On Wed, May 04, 2011 at 12:29:06PM -0700, Erik Trimble wrote:
 On 5/4/2011 9:57 AM, Ray Van Dolson wrote:
  There are a number of threads (this one[1] for example) that describe
  memory requirements for deduplication.  They're pretty high.
 
  I'm trying to get a better understanding... on our NetApps we use 4K
  block sizes with their post-process deduplication and get pretty good
  dedupe ratios for VM content.
 
  Using ZFS we are using 128K record sizes by default, which nets us less
  impressive savings... however, to drop to a 4K record size would
  theoretically require that we have nearly 40GB of memory for only 1TB
  of storage (based on 150 bytes per block for the DDT).
 
  This obviously becomes prohibitively higher for 10+ TB file systems.
 
  I will note that our NetApps are using only 2TB FlexVols, but would
  like to better understand ZFS's (apparently) higher memory
  requirements... or maybe I'm missing something entirely.
 
  Thanks,
  Ray
 
 I'm not familiar with NetApp's implementation, so I can't speak to
 why it might appear to use less resources.
 
 However, there are a couple of possible issues here:
 
 (1)  Pre-write vs Post-write Deduplication.
  ZFS does pre-write dedup, where it looks for duplicates before 
 it writes anything to disk.  In order to do pre-write dedup, you really 
 have to store the ENTIRE deduplication block lookup table in some sort 
 of fast (random) access media, realistically Flash or RAM.  The win is 
 that you get significantly lower disk utilization (i.e. better I/O 
 performance), as (potentially) much less data is actually written to disk.
  Post-write Dedup is done via batch processing - that is, such a 
 design has the system periodically scan the saved data, looking for 
 duplicates. While this method also greatly benefits from being able to 
 store the dedup table in fast random storage, it's not anywhere as 
 critical. The downside here is that you see much higher disk utilization 
 - the system must first write all new data to disk (without looking for 
 dedup), and then must also perform significant I/O later on to do the dedup.

Makes sense.

 (2) Block size:  a 4k block size will yield better dedup than a 128k 
 block size, presuming reasonable data turnover.  This is inherent, as 
 any single bit change in a block will make it non-duplicated.  With 32x 
 the block size, there is a much greater chance that a small change in 
 data will require a large loss of dedup ratio.  That is, 4k blocks 
 should almost always yield much better dedup ratios than larger ones. 
 Also, remember that the ZFS block size is a SUGGESTION for zfs 
 filesystems (i.e. it will use UP TO that block size, but not always that 
 size), but is FIXED for zvols.
 
 (3) Method of storing (and data stored in) the dedup table.
  ZFS's current design is (IMHO) rather piggy on DDT and L2ARC 
 lookup requirements. Right now, ZFS requires a record in the ARC (RAM) 
 for each L2ARC (cache) entry, PLUS the actual L2ARC entry.  So, it
 boils down to 500+ bytes of combined L2ARC and RAM usage per block entry
 in the DDT.  Also, the actual DDT entry itself is perhaps larger than 
 absolutely necessary.

So the addition of L2ARC doesn't necessarily reduce the need for
memory (at least not much if you're talking about 500 bytes combined)?
I was hoping we could slap in 80GB's of SSD L2ARC and get away with
only 16GB of RAM for example.

  I suspect that NetApp does the following to limit their 
 resource usage:   they presume the presence of some sort of cache that 
 can be dedicated to the DDT (and, since they also control the hardware, 
 they can make sure there is always one present).  Thus, they can make 
 their code completely avoid the need for an equivalent to the ARC-based 
 lookup.  In addition, I suspect they have a smaller DDT entry itself.  
 Which boils down to probably needing 50% of the total resource 
 consumption of ZFS, and NO (or extremely small, and fixed) RAM requirement.
 
 Honestly, ZFS's cache (L2ARC) requirements aren't really a problem. The 
 big issue is the ARC requirements, which, until they can be seriously 
 reduced (or, best case, simply eliminated), really is a significant 
 barrier to adoption of ZFS dedup.
 
 Right now, ZFS treats DDT entries like any other data or metadata in how 
 it ages from ARC to L2ARC to gone.  IMHO, the better way to do this is 
 simply require the DDT to be entirely stored on the L2ARC (if present), 
 and not ever keep any DDT info in the ARC at all (that is, the ARC 
 should contain a pointer to the DDT in the L2ARC, and that's it, 
 regardless of the amount or frequency of access of the DDT).  Frankly, 
 at this point, I'd almost change the design to REQUIRE a L2ARC device in 
 order to turn on Dedup.

Thanks for your response, Erik.  Very helpful.
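
One thing I'll probably try before committing to dedup on a given
dataset (on a build that actually has dedup -- OpenSolaris/Nexenta
rather than our Solaris 10 boxes) is letting zdb estimate the DDT:

  # zdb -S datapool     # simulate dedup and print a DDT histogram / ratio
  # zdb -DD datapool    # DDT stats for a pool that already has dedup enabled

(zdb isn't a stable interface, so take the above with a grain of salt.)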

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Ray Van Dolson
On Wed, May 04, 2011 at 02:55:55PM -0700, Brandon High wrote:
 On Wed, May 4, 2011 at 12:29 PM, Erik Trimble erik.trim...@oracle.com wrote:
         I suspect that NetApp does the following to limit their resource
  usage:   they presume the presence of some sort of cache that can be
  dedicated to the DDT (and, since they also control the hardware, they can
  make sure there is always one present).  Thus, they can make their code
 
 AFAIK, NetApp has more restrictive requirements about how much data
 can be dedup'd on each type of hardware.
 
 See page 29 of http://media.netapp.com/documents/tr-3505.pdf - Smaller
 pieces of hardware can only dedup 1TB volumes, and even the big-daddy
 filers will only dedup up to 16TB per volume, even if the volume size
 is 32TB (the largest volume available for dedup).
 
 NetApp solves the problem by putting rigid constraints around the
 problem, whereas ZFS lets you enable dedup for any size dataset. Both
 approaches have limitations, and it sucks when you hit them.
 
 -B

That is very true, although worth mentioning you can have quite a few
of the dedupe/SIS enabled FlexVols on even the lower-end filers (our
FAS2050 has a bunch of 2TB SIS enabled FlexVols).

The FAS2050 of course has a fairly small memory footprint... 

I do like the additional flexibility you have with ZFS, just trying to
get a handle on the memory requirements.

Are any of you out there using dedupe ZFS file systems to store VMware
VMDK (or any VM tech. really)?  Curious what recordsize you use and
what your hardware specs / experiences have been.
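
For anyone willing to share numbers, the knobs I'm referring to are just
the usual dataset properties (dataset name hypothetical):

  # zfs set recordsize=8K datapool/vmstore
  # zfs set dedup=on datapool/vmstore
  # zfs get recordsize,dedup datapool/vmstore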

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Ray Van Dolson
On Wed, May 04, 2011 at 03:49:12PM -0700, Erik Trimble wrote:
 On 5/4/2011 2:54 PM, Ray Van Dolson wrote:
  On Wed, May 04, 2011 at 12:29:06PM -0700, Erik Trimble wrote:
  (2) Block size:  a 4k block size will yield better dedup than a 128k
  block size, presuming reasonable data turnover.  This is inherent, as
  any single bit change in a block will make it non-duplicated.  With 32x
  the block size, there is a much greater chance that a small change in
  data will require a large loss of dedup ratio.  That is, 4k blocks
  should almost always yield much better dedup ratios than larger ones.
  Also, remember that the ZFS block size is a SUGGESTION for zfs
  filesystems (i.e. it will use UP TO that block size, but not always that
  size), but is FIXED for zvols.
 
  (3) Method of storing (and data stored in) the dedup table.
ZFS's current design is (IMHO) rather piggy on DDT and L2ARC
  lookup requirements. Right now, ZFS requires a record in the ARC (RAM)
  for each L2ARC (cache) entire, PLUS the actual L2ARC entry.  So, it
  boils down to 500+ bytes of combined L2ARC  RAM usage per block entry
  in the DDT.  Also, the actual DDT entry itself is perhaps larger than
  absolutely necessary.
  So the addition of L2ARC doesn't necessarily reduce the need for
  memory (at least not much if you're talking about 500 bytes combined)?
  I was hoping we could slap in 80GB's of SSD L2ARC and get away with
  only 16GB of RAM for example.
 
 It reduces *somewhat* the need for RAM.  Basically, if you have no L2ARC 
 cache device, the DDT must be stored in RAM.  That's about 376 bytes per 
 dedup block.
 
 If you have an L2ARC cache device, then the ARC must contain a reference 
 to every DDT entry stored in the L2ARC, which consumes 176 bytes per DDT 
 entry reference.
 
 So, adding a L2ARC reduces the ARC consumption by about 55%.
 
 Of course, the other benefit from a L2ARC is the data/metadata caching, 
 which is likely worth it just by itself.

Great info.  Thanks Erik.
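
If I'm reading those numbers right, an 8TB pool at the default 128K
recordsize works out to roughly 64 million DDT entries -- call it ~24GB
of ARC with no L2ARC (376 bytes each), or ~11GB of ARC references (176
bytes each) plus the entries themselves on the cache device.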

For dedupe workloads on larger file systems (8TB+), I wonder if it
makes sense to use SLC / enterprise-class SSD (or better) devices for
L2ARC instead of lower-end MLC stuff?  Seems like we'd be seeing more
writes to the device than in a non-dedupe scenario.

Thanks,
Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-04 Thread Ray Van Dolson
On Wed, May 04, 2011 at 04:51:36PM -0700, Erik Trimble wrote:
 On 5/4/2011 4:44 PM, Tim Cook wrote:
 
 
 
 On Wed, May 4, 2011 at 6:36 PM, Erik Trimble erik.trim...@oracle.com
 wrote:
 
 On 5/4/2011 4:14 PM, Ray Van Dolson wrote:
 
 On Wed, May 04, 2011 at 02:55:55PM -0700, Brandon High wrote:
 
 On Wed, May 4, 2011 at 12:29 PM, Erik Trimble
 erik.trim...@oracle.com  wrote:
 
I suspect that NetApp does the following to limit
 their resource
 usage:   they presume the presence of some sort of cache
 that can be
 dedicated to the DDT (and, since they also control the
 hardware, they can
 make sure there is always one present).  Thus, they can
 make their code
 
 AFAIK, NetApp has more restrictive requirements about how much
 data
 can be dedup'd on each type of hardware.
 
 See page 29 of http://media.netapp.com/documents/tr-3505.pdf -
 Smaller
 pieces of hardware can only dedup 1TB volumes, and even the
 big-daddy
 filers will only dedup up to 16TB per volume, even if the
 volume size
 is 32TB (the largest volume available for dedup).
 
 NetApp solves the problem by putting rigid constraints around
 the
 problem, whereas ZFS lets you enable dedup for any size
 dataset. Both
 approaches have limitations, and it sucks when you hit them.
 
 -B
 
 That is very true, although worth mentioning you can have quite a
 few
 of the dedupe/SIS enabled FlexVols on even the lower-end filers
 (our
 FAS2050 has a bunch of 2TB SIS enabled FlexVols).
 
 
 Stupid question - can you hit all the various SIS volumes at once, and
 not get horrid performance penalties?
 
 If so, I'm almost certain NetApp is doing post-write dedup.  That way,
 the strictly controlled max FlexVol size helps with keeping the
 resource limits down, as it will be able to round-robin the post-write
 dedup to each FlexVol in turn.
 
 ZFS's problem is that it needs ALL the resouces for EACH pool ALL the
 time, and can't really share them well if it expects to keep
 performance from tanking... (no pun intended)
 
 
 
 On a 2050?  Probably not.  It's got a single-core mobile celeron CPU and
 2GB/ram.  You couldn't even run ZFS on that box, much less ZFS+dedup.  Can
 you do it on a model that isn't 4 years old without tanking performance?
  Absolutely.
 
 Outside of those two 2000 series, the reason there are dedup limits isn't
 performance. 
 
 --Tim
 
 
 Indirectly, yes, it's performance, since NetApp has plainly chosen
 post-write dedup as a method to restrict the required hardware
 capabilities.  The dedup limits on Volsize are almost certainly
 driven by the local RAM requirements for post-write dedup.
 
 It also looks like NetApp isn't providing for a dedicated DDT cache,
 which means that when the NetApp is doing dedup, it's consuming the
 normal filesystem cache (i.e. chewing through RAM).  Frankly, I'd be
 very surprised if you didn't see a noticeable performance hit during
 the period that the NetApp appliance is performing the dedup scans.

Yep, when the dedupe process runs, there is a drop in performance
(hence we usually schedule it to run off-peak hours).  Obviously this
is a luxury that wouldn't be an option in every environment...

During normal operations outside of the dedupe period we haven't
noticed a performance hit.  I don't think we hit the filer too hard
however -- it's acting as a VMware datastore and only a few of the VM's
have higher I/O footprints.

It is a 2050C however so we spread the load across the two filer heads
(although we occasionally run everything on one head when performing
maintenance on the other).

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] detach configured log devices?

2011-03-16 Thread Ray Van Dolson
On Wed, Mar 16, 2011 at 09:33:58AM -0700, Jim Mauro wrote:
 With ZFS, Solaris 10 Update 9, is it possible to
 detach configured log devices from a zpool?
 
 I have a zpool with 3 F20 mirrors for the ZIL. They're
 coming up corrupted. I want to detach them, remake
 the devices and reattach them to the zpool.

Yup, as long as your zpool has been updated to the correct version.
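
If it helps, the sequence I'd expect (log device removal showed up in
zpool version 19 if I remember right -- zpool get version <pool> will
tell you where you're at):

  # zpool status pool            # note the log vdev names, e.g. mirror-3
  # zpool remove pool mirror-3   # repeat for each corrupted log mirror
  ... remake the devices ...
  # zpool add pool log mirror c3t0d0 c3t1d0

Device/vdev names above are made up, obviously.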

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Good SLOG devices?

2011-03-01 Thread Ray Van Dolson
On Tue, Mar 01, 2011 at 08:03:42AM -0800, Roy Sigurd Karlsbakk wrote:
 Hi
 
 I'm running OpenSolaris 148 on a few boxes, and newer boxes are
 getting installed as we speak. What would you suggest for a good SLOG
 device? It seems some new PCI-E-based ones are hitting the market,
 but will those require special drivers? Cost is obviously alsoo an
 issue here
 
 Vennlige hilsener / Best regards
 
 roy

What type of workload are you looking to handle?  We've had good luck
with pairs of Intel X-25E's for VM datastore duty.

We also have a DDRdrive X1, which is probably the best option out there
currently and will handle workloads the X-25E's can't.

I believe a lot of folks here use the Vertex SLC-based SF-15 SSD's
also.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Good SLOG devices?

2011-03-01 Thread Ray Van Dolson
On Tue, Mar 01, 2011 at 09:56:35AM -0800, Roy Sigurd Karlsbakk wrote:
  a) do you need an SLOG at all? Some workloads (asynchronous ones) will
  never benefit from an SLOG.
 
 We're planning to use this box for CIFS/NFS, so we'll need an SLOG to
 speed things up.
  
  b) form factor. at least one manufacturer uses a PCIe card which is
  not compliant with the PCIe form-factor and will not fit in many cases
  -- especially typical 1U boxes.
 
 The box is 4U with some 7 8x PCIe slots, so I think it should do fine
 
  c) driver support.
 
 That was why I asked here in the first place...
 
  d) do they really just go straight to ram/flash, or do they have an
  on-device SAS or SATA bus? Some PCIe devices just stick a small flash
  device on a SAS or SATA controller. I suspect that those devices won't
  see a lot of benefit relative to an external drive (although they
  could theoretically drive that private SAS/SATA bus at much higher rates
  than an external bus -- but I've not checked into it.)
  
  The other thing with PCIe based devices is that they consume an IO
  slot,
  which may be precious to you depending on your system board and other
  I/O needs.
 
 As I mentioned above, we have sufficient slots. As for the SATA/SAS
 onboard controller, that was the reason I asked here in the first
 place.
 
 So - do anyone know a good device for this? X25-E is rather old now,
 so there should be better ones available..

I think the OCZ Vertex 2 EX (SLC) is fairly highly regarded:

http://www.ocztechnology.com/ocz-vertex-2-ex-series-sata-ii-2-5-ssd.html

Note that if you're using an LSI backplane (probably are if you're
using SuperMicro hardware), they have tended to certify only against
the X-25E.  Other drives should work fine, but just an FYI.

This page (maybe a little dated, I'm not sure) has some pretty good
info:

http://www.nexenta.org/projects/site/wiki/About_suggested_NAS_SAN_Hardware

 
 Vennlige hilsener / Best regards
 
 roy

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] multipath used inadvertantly?

2011-02-15 Thread Ray Van Dolson
I'm troubleshooting an existing Solaris 10U9 server (x86 whitebox) and
noticed its device names are extremely hairy -- very similar to the
multipath device names: c0t5000C50026F8ACAAd0, etc, etc.

mpathadm seems to confirm:

# mpathadm list lu
/dev/rdsk/c0t50015179591CE0C1d0s2
Total Path Count: 1
Operational Path Count: 1

# ps -ef | grep mpath
root   245 1   0   Jan 05 ?  16:38 /usr/lib/inet/in.mpathd -a

The system is SuperMicro based with an LSI SAS2008 controller in it.
To my knowledge it has no multipath capabilities (or at least not as
its wired up currently).

The mpt_sas driver is in use per prtconf and modinfo.

My questions are:

- In what scenario would the multipath driver get loaded up at
  installation time for this LSI controller?  I'm guessing this is what
  happened?

- If I disabled mpathd would I get the shorter disk device names back
  again?  How would this impact existing zpools that are already on the
  system tied to these disks?  I have a feeling doing this might be a
  little bit painful. :)

I tried to glean the original device names from stmsboot -L, but it
didn't show any mappings...

Thanks,
Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [storage-discuss] multipath used inadvertantly?

2011-02-15 Thread Ray Van Dolson
Thanks Torrey.  I definitely see that multipathing is enabled... I
mainly want to understand whether or not there are installation
scenarios where multipathing is enabled by default (if the mpt driver
thinks it can support it will it enable mpathd at install time?) as
well as the consequences of disabling it now...

It looks to me as if disabling it will result in some pain. :)

Ray

On Tue, Feb 15, 2011 at 01:24:20PM -0800, Torrey McMahon wrote:
 in.mpathd is the IP multipath daemon. (Yes, it's a bit confusing that 
 mpathadm is the storage multipath admin tool. )
 
 If scsi_vhci is loaded in the kernel you have storage multipathing 
 enabled. (Check with modinfo.)
 
 On 2/15/2011 3:53 PM, Ray Van Dolson wrote:
  I'm troubleshooting an existing Solaris 10U9 server (x86 whitebox) and
  noticed its device names are extremely hair -- very similar to the
  multipath device names: c0t5000C50026F8ACAAd0, etc, etc.
 
  mpathadm seems to confirm:
 
  # mpathadm list lu
   /dev/rdsk/c0t50015179591CE0C1d0s2
   Total Path Count: 1
   Operational Path Count: 1
 
  # ps -ef | grep mpath
   root   245 1   0   Jan 05 ?  16:38 /usr/lib/inet/in.mpathd 
  -a
 
  The system is SuperMicro based with an LSI SAS2008 controller in it.
  To my knowledge it has no multipath capabilities (or at least not as
  its wired up currently).
 
  The mpt_sas driver is in use per prtconf and modinfo.
 
  My questions are:
 
  - What scenario would the multipath driver get loaded up at
 installation time for this LSI controller?  I'm guessing this is what
 happened?
 
  - If I disabled mpathd would I get the shorter disk device names back
 again?  How would this impact existing zpools that are already on the
 system tied to these disks?  I have a feeling doing this might be a
 little bit painful. :)
 
  I tried to glean the original device names from stmsboot -L, but it
  didn't show any mappings...
 
  Thanks,
  Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [storage-discuss] multipath used inadvertantly?

2011-02-15 Thread Ray Van Dolson
Thanks Cindy.

Are you (or anyone else reading) aware of a way to disable MPxIO at
install time?

I imagine there's no harm* in leaving MPxIO enabled with single-pathed
devices -- we'll likely just keep this in mind for future installs.
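
If we do decide to back it out on one of these boxes later, my
understanding of the sequence (untested here, per stmsboot(1M)) is
roughly:

  # zpool export datapool    # export data pools first, per Cindy's suggestion
  # stmsboot -d              # disable MPxIO; it will prompt for a reboot
  # zpool import datapool    # after the reboot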

Thanks,
Ray

* performance penalty -- we do see errors in our logs from time to time
  from mpathd letting us know disks have only one path

On Tue, Feb 15, 2011 at 01:50:47PM -0800, Cindy Swearingen wrote:
 Hi Ray,
 
 MPxIO is on by default for x86 systems that run the Solaris 10 9/10
 release.
 
 On my Solaris 10 9/10 SPARC system, I see this:
 
 # stmsboot -L
 stmsboot: MPxIO is not enabled
 stmsboot: MPxIO disabled
 
 You can use the stmsboot CLI to disable multipathing. You are prompted
 to reboot the system after disabling MPxIO. See stmsboot.1m for more
 info.
 
 With an x86 whitebox, I would export your ZFS storage pools first,
 but maybe it doesn't matter if the system is rebooted.
 
 ZFS should be able to identify the devices by their internal device
 IDs but I can't speak for unknown hardware. When you make hardware
 changes, always have current backups.
 
 Thanks,
 
 Cindy
 
 On 02/15/11 14:32, Ray Van Dolson wrote:
  Thanks Torrey.  I definitely see that multipathing is enabled... I
  mainly want to understand whether or not there are installation
  scenarios where multipathing is enabled by default (if the mpt driver
  thinks it can support it will it enable mpathd at install time?) as
  well as the consequences of disabling it now...
  
  It looks to me as if disabling it will result in some pain. :)
  
  Ray
  
  On Tue, Feb 15, 2011 at 01:24:20PM -0800, Torrey McMahon wrote:
  in.mpathd is the IP multipath daemon. (Yes, it's a bit confusing that 
  mpathadm is the storage multipath admin tool. )
 
  If scsi_vhci is loaded in the kernel you have storage multipathing 
  enabled. (Check with modinfo.)
 
  On 2/15/2011 3:53 PM, Ray Van Dolson wrote:
  I'm troubleshooting an existing Solaris 10U9 server (x86 whitebox) and
  noticed its device names are extremely hair -- very similar to the
  multipath device names: c0t5000C50026F8ACAAd0, etc, etc.
 
  mpathadm seems to confirm:
 
  # mpathadm list lu
   /dev/rdsk/c0t50015179591CE0C1d0s2
   Total Path Count: 1
   Operational Path Count: 1
 
  # ps -ef | grep mpath
   root   245 1   0   Jan 05 ?  16:38 
  /usr/lib/inet/in.mpathd -a
 
  The system is SuperMicro based with an LSI SAS2008 controller in it.
  To my knowledge it has no multipath capabilities (or at least not as
  its wired up currently).
 
  The mpt_sas driver is in use per prtconf and modinfo.
 
  My questions are:
 
  - What scenario would the multipath driver get loaded up at
 installation time for this LSI controller?  I'm guessing this is what
 happened?
 
  - If I disabled mpathd would I get the shorter disk device names back
 again?  How would this impact existing zpools that are already on the
 system tied to these disks?  I have a feeling doing this might be a
 little bit painful. :)
 
  I tried to glean the original device names from stmsboot -L, but it
  didn't show any mappings...
 
  Thanks,
  Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] cfgadm MPxIO aware yet in Solaris 10 U9?

2011-02-15 Thread Ray Van Dolson
I just replaced a failing disk on one of my servers running Solaris 10
U9.  The system was MPxIO enabled and I now have the old device hanging
around in the cfgadm list.

I understand from searching around that cfgadm may not be MPxIO aware
-- at least not in Solaris 10.  I see a fix was pushed to OpenSolaris
but I'm hoping someone can confirm whether or not this is in Sol10U9
yet or what my other options are (short of rebooting) to clean this old
device out.

Maybe luxadm can do it...
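
Or maybe a cleanup pass with devfsadm would at least clear out the
stale /dev links (just guessing here):

  # devfsadm -Cv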

FYI, the resilver triggered by my zpool replace has completed, so the
disk is no longer tied to the zpool.

Thanks,
Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] native ZFS on Linux

2011-02-12 Thread Ray Van Dolson
On Sat, Feb 12, 2011 at 09:18:26AM -0800, David E. Anderson wrote:
 I see that Pinguy OS, an uber-Ubuntu o/s, includes native ZFS support.
 Any pointers to more info on this?

Probably using this[1].

Ray

[1] http://kqstor.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Fwd: native ZFS on Linux

2011-02-12 Thread Ray Van Dolson
On Sat, Feb 12, 2011 at 09:36:25AM -0800, David E. Anderson wrote:
 went to IRC for the distro, will post when I have more info.  I think
 this is kqstor-based
 

Both projects would be complementary I would think.  kqstor is
definitely working on the POSIX layer that the zfsonlinux project lacks
currently.

Ray

 
 -- Forwarded message --
 From: C. Bergström codest...@osunix.org
 Date: 2011/2/12
 Subject: Re: [zfs-discuss] native ZFS on Linux
 To:
 Cc: zfs-discuss@opensolaris.org
 
 
 Ray Van Dolson wrote:
 
  On Sat, Feb 12, 2011 at 09:18:26AM -0800, David E. Anderson wrote:
 
 
  I see that Pinguy OS, an uber-Ubuntu o/s, includes native ZFS support.
  Any pointers to more info on this?
 
 
  Probably using this[1].
 
 
 doubtful.. It's more likely based on
 http://zfsonlinux.org/
 
 Why not post to the distro mailing list or look at the source though?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Looking for 3.5 SSD for ZIL

2010-12-23 Thread Ray Van Dolson
On Thu, Dec 23, 2010 at 07:35:29AM -0800, Deano wrote:
 If anybody does know of any source to the secure erase/reformatters,
 I’ll happily volunteer to do the port and then maintain it.
 
 I’m currently in talks with several SSD and driver chip hardware
 peeps with regard getting datasheets for some SSD products etc. for
 the purpose of better support under the OI/Solaris driver model but
 these things can take a while to obtain, so if anybody knows of
 existing open source versions I’ll jump on it.
 
 Thanks,
 Deano

A tool to help the end user know *when* they should run the reformatter
tool would be helpful too.

I know we can just wait until performance degrades, but it would be
nice to see what % of blocks are in use, etc.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Looking for 3.5 SSD for ZIL

2010-12-22 Thread Ray Van Dolson
On Wed, Dec 22, 2010 at 05:43:35AM -0800, Jabbar wrote:
 Hello,
  
 I was thinking of buying a couple of SSD's until I found out that Trim is only
 supported with SATA drives. I'm not sure if TRIM will work with ZFS. I was
 concerned that with trim support the SSD life and write throughput will get
 affected.
  
 Doesn't anybody have any thoughts on this?

We've been using X-25E's as ZIL for over a year.  They're cheap enough
to just replace when the time comes, especially when they last that
long... (still not seeing any reason to replace our current batch yet,
either).

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Moving rpool disks

2010-11-15 Thread Ray Van Dolson
We need to move the disks comprising our mirrored rpool on a Solaris 10
U9 x86_64 (not SPARC) system.

We'll be relocating both drives to a different controller in the same
system (should go from c1* to c0*).

We're curious as to what the best way is to go about this?  We'd love
to be able to just relocate the disks and update the system BIOS to
boot off the drives in their new location and have everything magically
work.

However, we're thinking we may need to touch GRUB config files (though
maybe not since rpool is referenced in the config file) or at least
re-run grub-install or something to update the MBR on both of these
drives.
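
If it does come to that, I assume it's the usual installgrub incantation
against the new device names (names below hypothetical; -m if the MBR
itself needs updating):

  # installgrub -m /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t0d0s0
  # installgrub -m /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t1d0s0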

Any advice?

Thanks,
Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] X4540 RIP

2010-11-09 Thread Ray Van Dolson
On Mon, Nov 08, 2010 at 11:51:02PM -0800, matthew patton wrote:
  I have this with 36 2TB drives (and 2 separate boot drives).
 
  http://www.colfax-intl.com/jlrid/SpotLight_more_Acc.asp?L=134S=58B=2267
 
 That's just a Supermicro SC847.
 
 http://www.supermicro.com/products/chassis/4U/?chs=847
 
 Stay away from the 24 port expander backplanes. I've gone thru
 several and they still don't work right - timeout and dropped drives
 under load. The 12-port works just fine connected to a variety of
 controllers. If you insist on the 24-port expander backplane, use a
 non-expander equipped LSI controller to drive it.

What do you mean by non-expander equipped LSI controller?

 
 I got fed up with the 24-port expander board and went with -A1 (all
 independent) and that's worked much more reliably.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] NFS/SATA lockups (svc_cots_kdup no slots free sata port time out)

2010-10-19 Thread Ray Van Dolson
I have a Solaris 10 U8 box (142901-14) running as an NFS server with
a 23 disk zpool behind it (three RAIDZ2 vdevs).

We have a single Intel X-25E SSD operating as an slog ZIL device
attached to a SATA port on this machine's motherboard.

The rest of the drives are in a hot-swap enclosure.

Infrequently (maybe once every 4-6 weeks), the zpool on the box stops
responding and although we can still SSH in and manage the server,
there appears to be no way to get the zpool to function again until we
hard reset.  shutdown -i6 -g0 -y simply hangs forever trying to call
'sync'.

The logs show the following:

Oct 19 11:42:42 dev-zfs1 rpcmod: [ID 851375 kern.warning] WARNING: 
svc_cots_kdup no slots free
Oct 19 11:42:50 dev-zfs1 last message repeated 189 times
Oct 19 11:42:51 dev-zfs1 rpcmod: [ID 851375 kern.warning] WARNING: 
svc_cots_kdup no slots free
Oct 19 11:42:55 dev-zfs1 last message repeated 99 times
Oct 19 11:42:56 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xfe840f453b68 timed out
Oct 19 11:42:56 dev-zfs1 rpcmod: [ID 851375 kern.warning] WARNING: 
svc_cots_kdup no slots free
Oct 19 11:44:00 dev-zfs1 last message repeated 1128 times
Oct 19 11:44:01 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xfe83dffad0e8 timed out
Oct 19 11:44:02 dev-zfs1 rpcmod: [ID 851375 kern.warning] WARNING: 
svc_cots_kdup no slots free
Oct 19 11:45:05 dev-zfs1 last message repeated 1108 times
Oct 19 11:45:06 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xbe00a008 timed out
Oct 19 11:45:06 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xac7bc7e8 timed out
Oct 19 11:45:06 dev-zfs1 rpcmod: [ID 851375 kern.warning] WARNING: 
svc_cots_kdup no slots free
Oct 19 11:46:10 dev-zfs1 last message repeated 1091 times
Oct 19 11:46:11 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xb9438008 timed out
Oct 19 11:47:16 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xb03452a8 timed out
Oct 19 11:48:21 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xfe83dfa5cd20 timed out
Oct 19 11:49:26 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xb6eaf2a0 timed out
Oct 19 11:50:31 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xfe83dfa5c380 timed out
Oct 19 11:51:36 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xfe83ca418b68 timed out
Oct 19 11:52:41 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xfe83fff758c0 timed out
Oct 19 11:53:46 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xb1144548 timed out
Oct 19 11:54:51 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xfe83dffad9a8 timed out
Oct 19 11:55:56 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xfe83e8cd18c0 timed out
Oct 19 11:57:01 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xfe83c43659a8 timed out
Oct 19 11:58:06 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xb9136468 timed out
Oct 19 11:59:11 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xfe83e9f147e0 timed out
Oct 19 12:00:16 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xb1be7d20 timed out
Oct 19 12:01:21 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xfe83dfa5fee0 timed out
Oct 19 12:02:26 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xbe6f7e08 timed out
Oct 19 12:03:31 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xb903c380 timed out
Oct 19 12:04:36 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xfe83eee6f8c8 timed out
Oct 19 12:05:41 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xb04b7000 timed out
Oct 19 12:06:46 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xfe83fff7dd28 timed out
Oct 19 12:07:51 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xb94389a8 timed out
Oct 19 12:08:56 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xae0ff388 timed out
Oct 19 12:10:01 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xfe84158032a8 timed out
Oct 19 12:11:06 dev-zfs1 ahci: [ID 517647 kern.warning] WARNING: ahci0: 
watchdog port 0 satapkt 0xfe83f07f7e00 timed out
Oct 19 12:11:25 dev-zfs1 power: [ID 199196 kern.notice] NOTICE: Power Button 
pressed 2 times, 

Re: [zfs-discuss] Multiple SLOG devices per pool

2010-10-13 Thread Ray Van Dolson
On Tue, Oct 12, 2010 at 08:49:00PM -0700, Edward Ned Harvey wrote:
  From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
  boun...@opensolaris.org] On Behalf Of Ray Van Dolson
  
  I have a pool with a single SLOG device rated at Y iops.
  
  If I add a second (non-mirrored) SLOG device also rated at Y iops will
  my zpool now theoretically be able to handle 2Y iops?  Or close to
  that?
 
 Yes.
 
 But we're specifically talking about sync mode writes.  Not async, and not
 read.  And we're not comparing apples to oranges etc, not measuring an
 actual number of IOPS, because of aggregation etc.  But I don't think that's
 what you were asking.  I don't think you are trying to quantify the number
 of IOPS.  I think you're trying to confirm the qualitative characteristic,
 If I have N slogs, I will write N times faster than a single slog.  And
 that's a simple answer.
 
 Yes.
 

Thanks. :)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Bursty writes - why?

2010-10-12 Thread Ray Van Dolson
On Tue, Oct 12, 2010 at 12:09:44PM -0700, Eff Norwood wrote:
 The NFS client in this case was VMWare ESXi 4.1 release build. What
 happened is that the file uploader behavior was changed in 4.1 to
 prevent I/O contention with the VM guests. That means when you go to
 upload something to the datastore, it only sends chunks of the file
 instead of streaming it all at once like it did in ESXi 4.0. To end
 users, something appeared to be broken because file uploads now took
 95 seconds instead of 30. Turns out that is by design in 4.1. This is
 the behavior *only* for the uploader and not for the VM guests. Their
 I/O is as expected.

Interesting.

 I have to say as a side note, the DDRdrive X1s make a day and night
 difference with VMWare. If you use VMWare via NFS, I highly recommend
 the X1s as the ZIL. Otherwise the VMWare O_SYNC (Stable = FSYNC) will
 kill your performance dead. We also tried SSDs as the ZIL which
 worked ok until they got full, then performance tanked. As I have
 posted before, SSDs as your ZIL - don't do it!  -- 

We run SSD's as ZIL here exclusively on what I'd consider fairly busy
VMware datastores and have never encountered this.

How would one know how full their SSD being used as ZIL is?  I was
under the impression that even using a full 32GB X-25E was overkill
spacewise for typical ZIL functionality...
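
The closest thing I know of is just watching how much of the slog ZFS
has allocated at any given moment, e.g. (pool name hypothetical):

  # zpool iostat -v datapool 5    # the logs section shows alloc/free per slog device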

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Multiple SLOG devices per pool

2010-10-12 Thread Ray Van Dolson
I have a pool with a single SLOG device rated at Y iops.

If I add a second (non-mirrored) SLOG device also rated at Y iops will
my zpool now theoretically be able to handle 2Y iops?  Or close to
that?

Thanks,
Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS disk space monitoring with SNMP

2010-10-01 Thread Ray Van Dolson
Hey folks;

Running on Solaris 10 U9 here.  How do most of you monitor disk usage /
capacity on your large zpools remotely via SNMP tools?

Net SNMP seems to be using a 32-bit unsigned integer (based on the MIB)
for hrStorageSize and friends, and thus we're not able to get accurate
numbers for sizes over 2TB.

Looks like potentially later versions of Net-SNMP deal with this
(though I'm not sure on that), but the version of Net-SNMP with Solaris
10 is of course, not bleeding edge. :)

Thanks,
Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS disk space monitoring with SNMP

2010-10-01 Thread Ray Van Dolson
On Fri, Oct 01, 2010 at 03:00:16PM -0700, Volker A. Brandt wrote:
 Hello Ray, hello list!
 
 
  Running on Solaris 10 U9 here.  How do most of you monitor disk usage /
  capacity on your large zpools remotely via SNMP tools?
 
  Net SNMP seems to be using a 32-bit unsigned integer (based on the MIB)
  for hrStorageSize and friends, and thus we're not able to get accurate
  numbers for sizes 2TB.
 
  Looks like potentially later versions of Net-SNMP deal with this
  (though I'm not sure on that), but the version of Net-SNMP with Solaris
  10 is of course, not bleeding edge. :)
 
 Sorry to be a lamer, but me too...
 
 Has anyone integrated an SNMP-based ZFS monitoring with their
 favorite management tool?  I am looking for disk usage warnings,
 but I am also interested in OFFLINE messages, or nonzero values
 for READ/WRITE/CKSUM errors.  Casual googling did not turn up anything
 that looked promising.
 
 There is an older ex-Sun download of an SNMP kit, but to be candid
 I haven't really looked at it yet.

Note that I'm sure we could extend Net-SNMP and configure a custom OID
to gather and present the information we're interested in.

Totally willing to go that route and standardize on it here, but am
curious if there's more of an out of the box solution -- even if I
find out it's only available in later versions of Net-SNMP (at least I
could file an RFE with Oracle for this).
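
If we do end up rolling our own, I'm picturing something as simple as an
extend entry in snmpd.conf (names hypothetical, and assuming the bundled
agent is new enough for extend -- otherwise the older exec directive):

  extend zfs-datapool-used  /usr/sbin/zfs list -H -o used datapool
  extend zfs-datapool-avail /usr/sbin/zfs list -H -o available datapool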

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SCSI write retry errors on ZIL SSD drives...

2010-09-21 Thread Ray Van Dolson
Just wanted to post a quick follow-up to this.  Original thread is
here[1] -- not quoted for brevity.

Andrew Gabriel suggested[2] that this could possibly be some workload
triggered issue.  We wanted to rule out a driver problem and so we
tested various configurations under Solaris 10U9 and OpenSolaris with
correct 4K block alignment.

The Unit Attention errors appeared under all operating environments for
any X-25E (we haven't tested other brands) when used as ZIL and attached
to one of the LSI port expanders used in Silicon Mechanics hardware.

As soon as we move the drives to the onboard SATA controller or
directly attach to the LSI controller (bypassing the expander) the
issues go away.

Perhaps tweaking the firmware on the port expander would have resolved
the issue, but we're not able to test that scenario currently.

Of note, a heavy workload wasn't what triggered the problem.  We ran
bonnie++ hard on the system -- which appeared to tax the ZIL quite a
bit -- but got no errors.

However, as soon as we set up an NFS VMware datastore and loaded a
couple VM's on it the Unit Attention errors began popping up -- even
when they weren't particularly busy.

In any case, we'll probably stop chasing our tails on this issue and
will begin mounting all drives used for ZIL internally directly
attached to the onboard SATA controllers.

Thanks,
Ray

[1] http://mail.opensolaris.org/pipermail/zfs-discuss/2010-August/044362.html
[2] http://mail.opensolaris.org/pipermail/zfs-discuss/2010-August/044364.html
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Best practice for Sol10U9 ZIL -- mirrored or not?

2010-09-16 Thread Ray Van Dolson
Best practice in Solaris 10 U8 and older was to use a mirrored ZIL.

With the ability to remove slog devices in Solaris 10 U9, we're
thinking we may get more bang for our buck by using two striped slog
devices for improved IOPS performance rather than mirroring them for
redundancy.

Any thoughts on this?

If we lost our slog devices and had to reboot, would the system come up
(e.g., could we remove the failed slog devices from the zpool so the
pool would come online)?

Thanks,
Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] dedicated ZIL/L2ARC

2010-09-14 Thread Ray Van Dolson
On Tue, Sep 14, 2010 at 06:59:07AM -0700, Wolfraider wrote:
 We are looking into the possibility of adding a dedicated ZIL and/or
 L2ARC devices to our pool. We are looking into getting 4 – 32GB
 Intel X25-E SSD drives. Would this be a good solution to slow write
 speeds? We are currently sharing out different slices of the pool to
 windows servers using comstar and fibrechannel. We are currently
 getting around 300MB/sec performance with 70-100% disk busy.
 
 Opensolaris snv_134
 Dual 3.2GHz quadcores with hyperthreading
 16GB ram
 Pool_1 – 18 raidz2 groups with 5 drives a piece and 2 hot spares
 Disks are around 30% full
 No dedup

It'll probably help.

I'd get two X-25E's for ZIL (and mirror them) and one or two of Intel's
lower end X-25M for L2ARC.

There are some SSD devices out there with a super-capacitor and
significantly higher IOPs ratings than the X-25E that might be a better
choice for a ZIL device, but the X-25E is a solid drive and we have
many of them deployed as ZIL devices here.
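
For what it's worth, adding them later is straightforward (device names
made up):

  # zpool add Pool_1 log mirror c4t0d0 c4t1d0    # mirrored X-25E slog
  # zpool add Pool_1 cache c4t2d0 c4t3d0         # X-25M L2ARC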

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NFS performance issue

2010-09-08 Thread Ray Van Dolson
On Wed, Sep 08, 2010 at 01:20:58PM -0700, Dr. Martin Mundschenk wrote:
 Hi!
 
 I searched the web for hours, trying to solve the NFS/ZFS low
 performance issue on my just setup OSOL box (snv134). The problem is
 discussed in many threads but I've found no solution. 
 
 On a nfs shared volume, I get write performance of 3,5M/sec (!!) read
 performance is about 50M/sec which is ok but on a GBit network, more
 should be possible, since the servers disk performance reaches up to
 120 M/sec.
 
 Does anyone have a solution how I can at least speed up the writes?

What's the write workload like?  You could try disabling the ZIL to see
if that makes a difference.  If it does, the addition of an SSD-based
ZIL / slog device would most certainly help.
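
On snv_134 I believe that's still the global zil_disable tunable (the
per-dataset sync property came later), so -- for testing only, and at
your own risk -- something like:

  # echo zil_disable/W0t1 | mdb -kw    # then remount the filesystem being tested
  # echo zil_disable/W0t0 | mdb -kw    # put it back when done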

Maybe you could describe the makeup of your zpool as well?

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 4k block alignment question (X-25E)

2010-09-02 Thread Ray Van Dolson
On Tue, Aug 31, 2010 at 12:47:49PM -0700, Brandon High wrote:
 On Mon, Aug 30, 2010 at 3:05 PM, Ray Van Dolson rvandol...@esri.com wrote:
  I want to fix (as much as is possible) a misalignment issue with an
  X-25E that I am using for both OS and as an slog device.
 
 It's pretty easy to get the alignment right
 
 fdisk uses a default of 63/255/*, which isn't easy to change. This
 makes each cylinder ( 63 * 255 * 512b ).  You want ( $cylinder_offset
 ) * ( 63 * 255 * 512b ) / ( $block_alignment_size ) to be evenly
 divisible. For a 4k alignment you want the offset to be 8.
 
 With fdisk, create your SOLARIS2 partition that uses the entire disk.
 The partition will be from cylinder 1 to whatever. Cylinder 0 is used
 for the MBR, so it's automatically un-aligned.
 
 When you create slices in format, the MBR cylinder isn't visible, so
 you have to subtract 1 from the offset, so your first slice should
 start on cylinder 7. Each additional slice should start on a cylinder
 that is a multiple of 8, minus 1. eg: 63, 1999, etc.
 
 It doesn't matter if the end of a slice is unaligned, other than to
 make aligning the next slice easier.
 
 -B

Thanks Brandon.
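
(For anyone else following along, the arithmetic checks out: one
cylinder at 63*255 geometry is 16,065 sectors = 8,225,280 bytes, which
is 512 bytes past a 4K boundary; eight cylinders is 65,802,240 bytes =
exactly 16,065 x 4096, so anything starting on a multiple of 8 cylinders
is 4K aligned.)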

Just a follow-up to my original post... unfortunately I couldn't try
aligning the slice on the SSD I was also using for slog/ZIL.  The
slog/ZIL slice was too small to be added to the ZIL mirror as the disk
we'd thrown in the system bypassing the expander was being used
completely (via EFI label).

Still wanted to test, however, so I pulled one of the drives from my
rpool, and added the entire disk to my mirror.  This uses the EFI label
and aligns everything correctly.

Unit Attention errors immediately began showing up.

I pulled that drive from the ZIL mirror and then used one of my two
L2ARC drives (also X-25E's) in the same fashion.

Same problem.

So I believe the problem is still expander related moreso than
alignment related.

Too bad.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 4k block alignment question (X-25E)

2010-08-31 Thread Ray Van Dolson
On Mon, Aug 30, 2010 at 10:11:32PM -0700, Christopher George wrote:
  I was wondering if anyone had a benchmarking showing this alignment 
  mattered on the latest SSDs. My guess is no, but I have no data.
 
 I don't believe there can be any doubt whether a Flash based SSD (tier1 
 or not)  is negatively affected by partition misalignment.  It is intrinsic 
 to 
 the required asymmetric erase/program dual operation and the resultant 
 RMW penalty to perform a write if unaligned.  This is detailed in the 
 following vendor benchmarking guidelines (SF-1500 controller):
 
 http://www.smartm.com/files/salesLiterature/storage/AN001_Benchmark_XceedIOPSSATA_Apr2010_.pdf
 
 Highlight from link - Proper partition alignment is one of the most critical 
 attributes that can greatly boost the I/O performance of an SSD due to 
 reduced read modify‐write operations.
 
 It should be noted, the above highlight only applies to Flash based SSD 
 as an NVRAM based SSD does *not* suffer the same fate, as its 
 performance is not bound by or vary with partition (mis)alignment.

Here's an article with some benchmarks:

  http://wikis.sun.com/pages/viewpage.action?pageId=186241353

Seems to really impact IOPS.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 4k block alignment question (X-25E)

2010-08-30 Thread Ray Van Dolson
On Mon, Aug 30, 2010 at 03:37:52PM -0700, Eric D. Mudama wrote:
 On Mon, Aug 30 at 15:05, Ray Van Dolson wrote:
 I want to fix (as much as is possible) a misalignment issue with an
 X-25E that I am using for both OS and as an slog device.
 
 This is on x86 hardware running Solaris 10U8.
 
 Partition table looks as follows:
 
 Part  TagFlag CylindersSizeBlocks
   0   rootwm   1 - 1306   10.00GB(1306/0/0) 20980890
   1 unassignedwu   0   0 (0/0/0)   0
   2 backupwm   0 - 3886   29.78GB(3887/0/0) 62444655
   3 unassignedwu1307 - 3886   19.76GB(2580/0/0) 41447700
   4 unassignedwu   0   0 (0/0/0)   0
   5 unassignedwu   0   0 (0/0/0)   0
   6 unassignedwu   0   0 (0/0/0)   0
   7 unassignedwu   0   0 (0/0/0)   0
   8   bootwu   0 -07.84MB(1/0/0)   16065
   9 unassignedwu   0   0 (0/0/0)   0
 
 And here is fdisk:
 
  Total disk size is 3890 cylinders
  Cylinder size is 16065 (512 byte) blocks
 
Cylinders
   Partition   StatusType  Start   End   Length%
   =   ==  =   ===   ==   ===
   1   ActiveSolaris   1  38893889100
 
 Slice 0 is where the OS lives and slice 3 is our slog.  As you can see
 from the fdisk partition table (and from the slice view), the OS
 partition starts on cylinder 1 -- which is not 4k aligned.
 
 I don't think there is much I can do to fix this without reinstalling.
 
 However, I'm most concerned about the slog slice and would like to
 recreate its partition such that it begins on cylinder 1312.
 
 So a few questions:
 
 - Would making s3 be 4k block aligned help even though s0 is not?
 - Do I need to worry about 4k block aligning the *end* of the
   slice?  eg instead of ending s3 on cylinder 3886, end it on 3880
   instead?
 
 Thanks,
 Ray
 
 Do you specifically have benchmark data indicating unaligned or
 aligned+offset access on the X25-E is significantly worse than aligned
 access?
 
 I'd thought the tier1 SSDs didn't have problems with these workloads.

I've been experiencing heavy Device Not Ready errors with this
configuration, and thought perhaps it could be exacerbated by the block
alignment issue.

See this thread[1].

So this would be a troubleshooting step to attempt to further isolate
the problem -- by eliminating the 4k alignment issue as a factor.

Just want to make sure I set up the alignment as optimally as possible.

Ray

[1] http://markmail.org/message/5rmfzvqwlmosh2oh
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 4k block alignment question (X-25E)

2010-08-30 Thread Ray Van Dolson
On Mon, Aug 30, 2010 at 03:56:42PM -0700, Richard Elling wrote:
 comment below...
 
 On Aug 30, 2010, at 3:42 PM, Ray Van Dolson wrote:
 
  On Mon, Aug 30, 2010 at 03:37:52PM -0700, Eric D. Mudama wrote:
  On Mon, Aug 30 at 15:05, Ray Van Dolson wrote:
  I want to fix (as much as is possible) a misalignment issue with an
  X-25E that I am using for both OS and as an slog device.
  
  This is on x86 hardware running Solaris 10U8.
  
  Partition table looks as follows:
  
  Part  TagFlag CylindersSizeBlocks
  0   rootwm   1 - 1306   10.00GB(1306/0/0) 20980890
  1 unassignedwu   0   0 (0/0/0)   0
  2 backupwm   0 - 3886   29.78GB(3887/0/0) 62444655
  3 unassignedwu1307 - 3886   19.76GB(2580/0/0) 41447700
  4 unassignedwu   0   0 (0/0/0)   0
  5 unassignedwu   0   0 (0/0/0)   0
  6 unassignedwu   0   0 (0/0/0)   0
  7 unassignedwu   0   0 (0/0/0)   0
  8   bootwu   0 -07.84MB(1/0/0)   16065
  9 unassignedwu   0   0 (0/0/0)   0
  
  And here is fdisk:
  
 Total disk size is 3890 cylinders
 Cylinder size is 16065 (512 byte) blocks
  
   Cylinders
  Partition   StatusType  Start   End   Length%
  =   ==  =   ===   ==   ===
  1   ActiveSolaris   1  38893889100
  
  Slice 0 is where the OS lives and slice 3 is our slog.  As you can see
  from the fdisk partition table (and from the slice view), the OS
  partition starts on cylinder 1 -- which is not 4k aligned.
 
 To get to a fine alignment, you need an EFI label. However, Solaris does
 not (yet) support booting from EFI labeled disks.  The older SMI labels 
 are all cylinder aligned which gives you a 1/4 chance of alignment.

Yep... our other boxes similar to this one are using whole disks as
ZIL, so we're able to use EFI.

The Device Not Ready errors happen there too (SSD's are on an
expander), but only at 5-15 errors per day (vs. the 500 per hour on the
split OS/slog setup).

 
  
  I don't think there is much I can do to fix this without reinstalling.
  
  However, I'm most concerned about the slog slice and would like to
  recreate its partition such that it begins on cylinder 1312.
  
  So a few questions:
  
- Would making s3 be 4k block aligned help even though s0 is not?
- Do I need to worry about 4k block aligning the *end* of the
  slice?  eg instead of ending s3 on cylinder 3886, end it on 3880
  instead?
  
  Thanks,
  Ray
  
  Do you specifically have benchmark data indicating unaligned or
  aligned+offset access on the X25-E is significantly worse than aligned
  access?
  
  I'd thought the tier1 SSDs didn't have problems with these workloads.
  
  I've been experiencing heavy Device Not Ready errors with this
  configuration, and thought perhaps it could be exacerbated by the block
  alignment issue.
  
  See this thread[1].
  
  So this would be a troubleshooting step to attempt to further isolate
  the problem -- by eliminating the 4k alignment issue as a factor.
 
 In my experience, port expanders with SATA drives do not handle
 the high I/O rate that can be generated by a modest server. We are
 still trying to get to the bottom of these issues, but they do not appear
 to be related to the OS, mpt driver, ZIL use, or alignment. 
  -- richard

Very interesting.  We've been looking at Nexenta as we haven't been
able to reproduce our issues on OpenSolaris -- I was hoping this meant
NexentaStor wouldn't have the issue.

In any case -- any thoughts on whether or not I'll be helping anything
if I change my slog slice starting cylinder to be 4k aligned even
though slice 0 isn't?

 
  
  Just want to make sure I set up the alignment as optimally as possible.
  
  Ray
  
  [1] http://markmail.org/message/5rmfzvqwlmosh2oh

Thanks,
Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 4k block alignment question (X-25E)

2010-08-30 Thread Ray Van Dolson
On Mon, Aug 30, 2010 at 04:12:48PM -0700, Edho P Arief wrote:
 On Tue, Aug 31, 2010 at 6:03 AM, Ray Van Dolson rvandol...@esri.com wrote:
  In any case -- any thoughts on whether or not I'll be helping anything
  if I change my slog slice starting cylinder to be 4k aligned even
  though slice 0 isn't?
 
 
 some people claim that, due to how zfs works, there will be a
 performance hit as long as the reported sector size is different from
 the physical size.
 
 This thread[1] has the discussion on what happened and how to handle
 such drives on freebsd.
 
 [1] http://marc.info/?l=freebsd-fsm=126976001214266w=2

Thanks for the pointer -- these posts seem to reference data disks
within the pool rather than disks being used for slog.

Perhaps some of the same issues could arise, but I'm not sure that
variable stripe sizing in a RAIDZ pool would change how the ZIL / slog
devices are addressed.  I'm sure someone will correct me if I'm wrong
on that...

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] VM's on ZFS - 7210

2010-08-28 Thread Ray Van Dolson
On Sat, Aug 28, 2010 at 05:50:38AM -0700, Eff Norwood wrote:
 I can't think of an easy way to measure pages that have not been consumed 
 since it's really an SSD controller function which is obfuscated from the OS, 
 and add the variable of over provisioning on top of that. If anyone would 
 like to really get into what's going on inside of an SSD that makes it a bad 
 choice for a ZIL, you can start here:
 
 http://en.wikipedia.org/wiki/TRIM_%28SSD_command%29
 
 and
 
 http://en.wikipedia.org/wiki/Write_amplification
 
 Which will be more than you might have ever wanted to know. :)

So has anyone on this list actually run into this issue?  Tons of
people use SSD-backed slog devices...

The theory sounds sound, but if it's not really happening much in
practice then I'm not too worried.  Especially when I can replace a
drive from my slog mirror for $400 or so if problems do arise... (the
alternative being much more expensive DRAM-backed devices).

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] VM's on ZFS - 7210

2010-08-27 Thread Ray Van Dolson
On Fri, Aug 27, 2010 at 05:51:38AM -0700, David Magda wrote:
 On Fri, August 27, 2010 08:46, Eff Norwood wrote:
  Saso is correct - ESX/i always uses F_SYNC for all writes and that is for
  sure your performance killer. Do a snoop | grep sync and you'll see the
  sync write calls from VMWare. We use DDRdrives in our production VMWare
  storage and they are excellent for solving this problem. Our cluster
  supports 50,000 users and we've had no issues at all. Do not use an SSD
  for the ZIL - as soon as it fills up you will be very unhappy.
 
 What do you mean by fills up? There is a very limited amount of
 data that is written to a slog device: between 5-30 seconds' worth.
 Furthermore, a log device will at maximum be ~50% of the size of
 physical memory.

I would second this.  Excellent results here with small 32GB Intel
X-25E's.

Even 32GB is overkill for ZIL 

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] VM's on ZFS - 7210

2010-08-27 Thread Ray Van Dolson
On Fri, Aug 27, 2010 at 11:57:17AM -0700, Marion Hakanson wrote:
 markwo...@yahoo.com said:
  So the question is with a proper ZIL SSD from SUN, and a RAID10... would I 
  be
  able to support all the VM's or would it still be pushing the limits a 44
  disk pool? 
 
 If it weren't a closed 7000-series appliance, I'd suggest running the
 zilstat script.  It should make it clear whether (and by how much)
 you would benefit from the Logzilla addition in your current raidz
 configuration.  Maybe there's some equivalent in the builtin FishWorks
 analytics which can give you the same information.
 

To the OP...

I'd think turning the write cache on would help if that's an option.
Does the box have reliable power (UPS, etc)?
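
(For anyone doing this on plain Solaris/OpenSolaris rather than the
appliance, zilstat is just a DTrace-based script; invocation is roughly
the following, though double-check its usage message since I'm going
from memory:)

  # ./zilstat.ksh 10 6     # six 10-second samples of ZIL bytes and ops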

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] VM's on ZFS - 7210

2010-08-27 Thread Ray Van Dolson
On Fri, Aug 27, 2010 at 12:46:42PM -0700, Mark wrote:
 It does, its on a pair of large APC's.
 
 Right now we're using NFS for our ESX Servers.  The only iSCSI LUN's
 I have are mounted inside a couple Windows VM's.   I'd have to
 migrate all our VM's to iSCSI, which I'm willing to do if it would
 help and not cause other issues.   So far the 7210 Appliance has been
 very stable.
 
 I like the zilstat script.  I emailed a support tech I am working
 with on another issue to ask if one of the built in Analytics DTrace
 scripts will get that data.   
 
 I found one called L2ARC Eligibility:  3235 true, 66 false.  This
 makes it sound like we would benefit from a READZilla, not quite what
 I had expected...  I'm sure I don't know what I'm looking at anyways
 :)

Obviously depends on your workload, and YMMV, but for us (we're also
using NFS and love the flexibility it provides w/ ESX) and without ZIL,
things are pretty dog slow.

My impression is that synchronous writes are used too with iSCSI, so if
your problems stem from not having a ZIL w/ NFS they could very easily
reappear even with iSCSI.

Someone else may correct me on that...

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] VM's on ZFS - 7210

2010-08-27 Thread Ray Van Dolson
On Fri, Aug 27, 2010 at 01:22:15PM -0700, John wrote:
 Wouldn't it be possible to saturate the SSD ZIL with enough
 backlogged sync writes? 
 
 What I mean is, doesn't the ZIL eventually need to make it to the
 pool, and if the pool as a whole (spinning disks) can't keep up with
 30+ vm's of write requests, couldn't you fill up the ZIL that way?

Depends on the workload of course, but we have 50+ VM server
environments running off of 22x1TB SATA + 32GB Intel X25-E SSD's with
no problems whatsoever.  I don't have the zilstat numbers handy, but
we're not pushing enough I/O for the slog device to even come close to
sweating.

Note that our VM's are in a LabManager environment and can be spun up
and down to do compiles mostly, so they're not pushing huge amounts of
non-random I/O.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] VM's on ZFS - 7210

2010-08-27 Thread Ray Van Dolson
On Fri, Aug 27, 2010 at 03:51:39PM -0700, Eff Norwood wrote:
 By all means please try it to validate it yourself and post your
 results from hour one, day one and week one. In a ZIL use case,
 although the data set is small it is always writing a small,
 ever-changing (from the SSD's perspective) data set. The SSD does not know
 to release previously written pages and without TRIM there is no way
 to tell it to. That means every time a ZIL write happens, new SSD
 pages are consumed. After some amount of time, all of those empty
 pages will become consumed and the SSD will now have to go into the
 read-erase-write cycle which is incredibly slow and the whole point
 of TRIM.
 
 I can assure you from my extensive benchmarking with all major SSDs
 in the role of a ZIL you will eventually not be happy. Depending on
 your use case it might take months, but eventually all those free
 pages will be consumed and read-erase-write is how the SSD world
 works after that - unless you have TRIM, which we don't yet.  -- 
 This message posted from opensolaris.org

Is there a way to measure how many SSD pages are taken up?

We've had a box running for nearly 8 months now -- it's performing
well, but I'd be interested to see if we'll be close to (theoretically)
hitting this problem or not.
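
(Partially answering my own question: the SSD won't tell you page-level
occupancy, but if smartmontools builds on the box, Intel's SMART
wear/host-writes attributes are probably the closest practical proxy.
The device path and -d option below are guesses and will depend on the
controller:)

  # smartctl -a -d sat,12 /dev/rdsk/c1t0d0s0 | \
      egrep 'Media_Wearout_Indicator|Host_Writes'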

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SCSI write retry errors on ZIL SSD drives...

2010-08-25 Thread Ray Van Dolson
On Wed, Aug 25, 2010 at 11:47:38AM -0700, Andreas Grüninger wrote:
 Ray
 
 Supermicro does not support the use of SSDs behind an expander.
 
 You must put the SSD in the head or use an interposer card, see here:
 http://www.lsi.com/storage_home/products_home/standard_product_ics/sas_sata_protocol_bridge/lsiss9252/index.html
 Supermicro offers an interposer card too: the AOC-SMP-LSISS9252.
 

Hmm, interesting.

FAQ #3 on this page[1] seems to indicate otherwise -- at least in the
case of the Intel X25-E (SSDSA2SH064G1GC) with firmware 8860 (which we
are running).

Ray

[1] http://www.supermicro.com/support/faqs/results.cfm?id=95
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] SCSI write retry errors on ZIL SSD drives...

2010-08-24 Thread Ray Van Dolson
I posted a thread on this once long ago[1] -- but we're still fighting
with this problem and I wanted to throw it out here again.

All of our hardware is from Silicon Mechanics (SuperMicro chassis and
motherboards).

Up until now, all of the hardware has had a single 24-disk expander /
backplane -- but we recently got one of the new SC847-based models with
24 disks up front and 12 in the back -- a dual backplane setup.

We're using two SSD's in the front backplane as mirrored ZIL/OS (I
don't think we have the 4K alignment set up correctly) and two drives
in the back as L2ARC.

The rest of the disks are 1TB SATA disks which make up a single large
zpool via three 8-disk RAIDZ2's.  As you can see, we don't have the
server maxed out on drives...

In any case, this new server gets between 400 and 600 of these timeout
errors an hour:

Aug 21 03:10:17 dev-zfs1 scsi: [ID 365881 kern.info] 
/p...@0,0/pci8086,3...@8/pci15d9,1...@0 (mpt0):
Aug 21 03:10:17 dev-zfs1Log info 31126000 received for target 8.
Aug 21 03:10:17 dev-zfs1scsi_status=0, ioc_status=804b, scsi_state=c
Aug 21 03:10:17 dev-zfs1 scsi: [ID 365881 kern.info] 
/p...@0,0/pci8086,3...@8/pci15d9,1...@0 (mpt0):
Aug 21 03:10:17 dev-zfs1Log info 31126000 received for target 8.
Aug 21 03:10:17 dev-zfs1scsi_status=0, ioc_status=804b, scsi_state=c
Aug 21 03:10:17 dev-zfs1 scsi: [ID 107833 kern.warning] WARNING: 
/p...@0,0/pci8086,3...@8/pci15d9,1...@0/s...@8,0 (sd0):
Aug 21 03:10:17 dev-zfs1Error for Command: write(10)   
Error Level: Retryable
Aug 21 03:10:17 dev-zfs1 scsi: [ID 107833 kern.notice]  Requested Block: 
21230708  Error Block: 21230708
Aug 21 03:10:17 dev-zfs1 scsi: [ID 107833 kern.notice]  Vendor: ATA 
   Serial Number: CVEM002600EW
Aug 21 03:10:17 dev-zfs1 scsi: [ID 107833 kern.notice]  Sense Key: Unit 
Attention
Aug 21 03:10:17 dev-zfs1 scsi: [ID 107833 kern.notice]  ASC: 0x29 (power on, 
reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Aug 21 03:10:21 dev-zfs1 scsi: [ID 365881 kern.info] 
/p...@0,0/pci8086,3...@8/pci15d9,1...@0 (mpt0):

iostat -xnMCez shows that the first of the two ZIL drives receives
about twice the number of errors as the second drive.

There are no other errors on any other drives -- including the L2ARC
SSD's and the asvc_t times seem reasonably low and don't indicate a bad
drive to my eyes...

The timeouts above exact a rather large performance penalty on the
system, both in IO and general usage from an SSH console.  Obvious
pauses and glitches when accessing the filesystem.

The problem _follows_ the ZIL and isn't tied to hardware.  IOW, if I
switch to using the L2ARC drives as ZIL, those drives suddenly exhibit
the timeout problems...

If we connect the SSD drives directly to the LSI controller instead of
hanging off the hot-swap backplane, the timeouts go away.

If we use SSD's attached to the SATA controllers as ZIL, there are also
no performance issues or timeout errors.

So the problem only occurs with SSD drives acting as ZIL attached to
the backplane.

This is leading me to believe we have a driver issue of some sort in
the mpt subsystem unable to cope with the longer command path of
multiple backplanes.  Someone alluded to this in [1] as well, and it
makes sense to me.

One quick fix to me would seem to be upping the SCSI timeout values.
How do you do this with the mpt driver?
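
(If the relevant knob turns out to live in the sd target driver rather
than mpt itself, the only tunable I'm aware of is sd_io_time in
/etc/system -- the value below is just an example and I haven't
verified it helps with this particular problem:)

  * /etc/system -- sd command timeout, default is 60 seconds
  set sd:sd_io_time = 0x78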

We haven't yet been able to try OpenSolaris or Nexenta on one of these
systems to see if the problem goes away in later releases of the kernel
or driver, but I'm curious if anyone out there has any bright ideas as
to what we might be running into here and what's involved in fixing it.

We've swapped out backplanes and drives and the problem happens on
every single Silicon Mechanics system we have, so at this point I'm
really doubting it's a hardware issue :)

Hardware details are as follows:

Silicon Mechanics Storform iServ R518
(Based on SuperMicro SC847E16-R1400 chassis)
SuperMicro X8DT3 motherboard w/ onboard LSI1068 controller.
- One LSI port goes to the front backplane (where the bulk of the
  SATA drives are, the two SSD's used as ZIL/OS)
- The other LSI port goes to the rear backplane where the two L2ARC
  drives are along with a couple SATA's)

We've got 6GB of RAM and 2 quad-core Xeons in the box as well.

The SSD's themselves are all Intel X-25E's (32GB) with firmware 8860
and the LSI 1068 is a SAS1068E B3 with firmware 011c0200 (1.28.02.00).

We're running Solaris 10U8 mostly up to date and MPT HBA Driver v1.92.

Thoughts, theories and conjectures would be much appreciated... Sun
these days wants us to be able to reproduce the problem on Sun hardware
to get much support... Silicon Mechanics has been helpful, but they
don't have a large enough inventory on hand to replicate our hardware
setup it seems. :(

Ray

[1] http://markmail.org/message/gfz2cui2iua4dxpy
___

Re: [zfs-discuss] SCSI write retry errors on ZIL SSD drives...

2010-08-24 Thread Ray Van Dolson
On Tue, Aug 24, 2010 at 04:46:23PM -0700, Andrew Gabriel wrote:
 Ray Van Dolson wrote:
  I posted a thread on this once long ago[1] -- but we're still fighting
  with this problem and I wanted to throw it out here again.
 
  All of our hardware is from Silicon Mechanics (SuperMicro chassis and
  motherboards).
 
  Up until now, all of the hardware has had a single 24-disk expander /
  backplane -- but we recently got one of the new SC847-based models with
  24 disks up front and 12 in the back -- a dual backplane setup.
 
  We're using two SSD's in the front backplane as mirrored ZIL/OS (I
  don't think we have the 4K alignment set up correctly) and two drives
  in the back as L2ARC.
 
  The rest of the disks are 1TB SATA disks which make up a single large
  zpool via three 8-disk RAIDZ2's.  As you can see, we don't have the
  server maxed out on drives...
 
  In any case, this new server gets between 400 and 600 of these timeout
  errors an hour:
 
  Aug 21 03:10:17 dev-zfs1 scsi: [ID 365881 kern.info] 
  /p...@0,0/pci8086,3...@8/pci15d9,1...@0 (mpt0):
  Aug 21 03:10:17 dev-zfs1Log info 31126000 received for target 8.
  Aug 21 03:10:17 dev-zfs1scsi_status=0, ioc_status=804b, scsi_state=c
  Aug 21 03:10:17 dev-zfs1 scsi: [ID 365881 kern.info] 
  /p...@0,0/pci8086,3...@8/pci15d9,1...@0 (mpt0):
  Aug 21 03:10:17 dev-zfs1Log info 31126000 received for target 8.
  Aug 21 03:10:17 dev-zfs1scsi_status=0, ioc_status=804b, scsi_state=c
  Aug 21 03:10:17 dev-zfs1 scsi: [ID 107833 kern.warning] WARNING: 
  /p...@0,0/pci8086,3...@8/pci15d9,1...@0/s...@8,0 (sd0):
  Aug 21 03:10:17 dev-zfs1Error for Command: write(10)   
  Error Level: Retryable
  Aug 21 03:10:17 dev-zfs1 scsi: [ID 107833 kern.notice]  Requested Block: 
  21230708  Error Block: 21230708
  Aug 21 03:10:17 dev-zfs1 scsi: [ID 107833 kern.notice]  Vendor: ATA 
 Serial Number: CVEM002600EW
  Aug 21 03:10:17 dev-zfs1 scsi: [ID 107833 kern.notice]  Sense Key: Unit 
  Attention
  Aug 21 03:10:17 dev-zfs1 scsi: [ID 107833 kern.notice]  ASC: 0x29 (power 
  on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
  Aug 21 03:10:21 dev-zfs1 scsi: [ID 365881 kern.info] 
  /p...@0,0/pci8086,3...@8/pci15d9,1...@0 (mpt0):
 
  iostat -xnMCez shows that the first of the two ZIL drives receives
  about twice the number of errors as the second drive.
 
  There are no other errors on any other drives -- including the L2ARC
  SSD's and the asvc_t times seem reasonably low and don't indicate a bad
  drive to my eyes...
 
  The timeouts above exact a rather large performance penalty on the
  system, both in IO and general usage from an SSH console.  Obvious
  pauses and glitches when accessing the filesystem.

 
 This isn't a timeout. Unit Attention is the drive saying back to the 
 computer that it's been reset and has forgotten any negotiation which 
 happened with the controller. It's a couple of decades since I was 
 working on SCSI at this level, but IIRC, a drive will return Unit 
 Attention error to the first command issued to it after a 
 reset/powerup, except for a Test Unit Ready command. As it says, this 
 might be caused by power on, reset, or bus reset occurred.

Interesting.  Thanks for the insight.

 
  The problem _follows_ the ZIL and isn't tied to hardware.  IOW, if I
  switch to using the L2ARC drives as ZIL, those drives suddenly exhibit
  the timeout problems...

 
 A possibility is that the problem is related to the nature of the load a 
 ZIL drive attracts. One scenario could be that you are crashing the 
 drive firmware, causing it to reset and reinitialize itself, and 
 therefore to return Unit Attention to the next command. (I don't know 
 if X25-E's can behave this way.)
 
 I would try and correct the 4k alignment on the ZIL at least - that does 
 significantly affect the work the drive has to do internally (as well as 
 its performance), although I've no idea if that's related to the issue 
 you're seeing.

Will definitely give this a go -- certainly can't hurt.

 
  If we connect the SSD drives directly to the LSI controller instead of
  hanging off the hot-swap backplane, the timeouts go away.

 
 Again, may be related to some combination of the load type and physical 
 characteristics.
 
  If we use SSD's attached to the SATA controllers as ZIL, there are also
  no performance issues or timeout errors.

 
 Why not do this then? It also avoids using SATA tunneling protocol 
 across the SAS and port expanders.

We may -- however, the main reason we'd gone with the port expander was
for convenient hot swappability.  Though I guess SATA is technically
hot swappable, it's not as convenient :)

 
  So the problem only occurs with SSD drives acting as ZIL attached to
  the backplane.
 
  This is leading me to believe we have a driver issue of some sort in
  the mpt subsystem unable to cope with the longer command path of
  multiple

Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-16 Thread Ray Van Dolson
On Mon, Aug 16, 2010 at 08:35:05AM -0700, Tim Cook wrote:
 No, no they don't.  You're under the misconception that they no
 longer own the code just because they released a copy as GPL.  That
 is not true.  Anyone ELSE who uses the GPL code must release
 modifications if they wish to distribute it due to the GPL.  The
 original author is free to license the code as many times under as
 many conditions as they like, and release or not release subsequent
 changes they make to their own code.
 
 I absolutely guarantee Oracle can and likely already has
 dual-licensed BTRFS.

Well, Oracle obviously would want btrfs to stay as part of the Linux
kernel rather than die a death of anonymity outside of it... 

As such, they'll need to continue to comply with GPLv2 requirements.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-16 Thread Ray Van Dolson
On Mon, Aug 16, 2010 at 08:48:31AM -0700, Joerg Schilling wrote:
 Ray Van Dolson rvandol...@esri.com wrote:
 
   I absolutely guarantee Oracle can and likely already has
   dual-licensed BTRFS.
 
  Well, Oracle obviously would want btrfs to stay as part of the Linux
  kernel rather than die a death of anonymity outside of it... 
 
  As such, they'll need to continue to comply with GPLv2 requirements.
 
 No, there is definitely no need for Oracle to comply with the GPL as they
 own the code.
 

Maybe there's not legally, but practically there is.  If they're not
GPL compliant, why would Linus or his lieutenants continue to allow the
code to remain part of the Linux kernel?

And what purpose would btrfs serve Oracle outside of the Linux kernel?

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-16 Thread Ray Van Dolson
On Mon, Aug 16, 2010 at 08:55:49AM -0700, Tim Cook wrote:
 Why would they obviously want that?  When the project started, they
 were competing with Sun.  They now own Solaris; they no longer have a
 need to produce a competing product.  I would be EXTREMELY surprised
 to see Oracle continue to push Linux as hard as they have in the
 past, over the next 5 years.
 
 --Tim

Well, we're getting into the realm of opinion here.. but if I'm a
decision maker at Oracle, I'm not abandoning Linux, nor my potential
influence in the future de facto Linux filesystem.

Oracle can gear Solaris towards big iron / Enterprisey, niche
solutions, but I'd bet a lot that they're not abandoning the Linux
space by a longshot just because they own Solaris...

But your opinion is as valid as mine on this topic... :)

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-16 Thread Ray Van Dolson
On Mon, Aug 16, 2010 at 08:58:20AM -0700, Garrett D'Amore wrote:
 On Mon, 2010-08-16 at 08:52 -0700, Ray Van Dolson wrote:
  On Mon, Aug 16, 2010 at 08:48:31AM -0700, Joerg Schilling wrote:
   Ray Van Dolson rvandol...@esri.com wrote:
   
 I absolutely guarantee Oracle can and likely already has
 dual-licensed BTRFS.
   
Well, Oracle obviously would want btrfs to stay as part of the Linux
kernel rather than die a death of anonymity outside of it... 
   
As such, they'll need to continue to comply with GPLv2 requirements.
   
   No, there is definitely no need for Oracle to comply with the GPL as they
   own the code.
   
  
  Maybe there's not legally, but practically there is.  If they're not
  GPL compliant, why would Linus or his lieutenants continue to allow the
  code to remain part of the Linux kernel?
  
  And what purpose would btrfs serve Oracle outside of the Linux kernel?
 
 If they wanted to port it to Solaris under a difference license, they
 could.  This may actually be a backup plan in case the NetApp suit goes
 badly.  But this is pure conjecture.

btrfs is often described as the next default Linux filesystem (by Ted
Ts'o and others).  It seems odd to me that Oracle wouldn't want to
retain a controlling interest (as in keeping the primary engineers) in
its development, ensuring it stays in the Linux kernel and meets these
expectations...

Seems like an excellent long-term strategy to me anyways!

Anyways, getting a bit off topic here I suppose, though it's an
interesting discussion. :)

 
   - Garrett
 
  
  Ray

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-16 Thread Ray Van Dolson
On Mon, Aug 16, 2010 at 08:57:19AM -0700, Joerg Schilling wrote:
 C. Bergström codest...@osunix.org wrote:
 
   I absolutely guarantee Oracle can and likely already has dual-licensed 
   BTRFS.
  No.. talk to Chris Mason.. it depends on the linux kernel too much 
  already to be available under anything, but GPLv2
 
 If he really believes this, then he seems to be misinformed about the
 legal background.
 
 The question is: who wrote the btrfs code and who owns it.
 
 If Oracle pays him for writing the code, then Oracle owns the code and can 
 relicense it under any license they like.
 
 Jörg

I don't think anyone is arguing that Oracle can relicense their own
copyrighted code as they see fit.

The real question is, WHY would they do it?  What would be the business
motivation here?  Chris Mason would most likely leave Oracle, Red Hat
would hire him and fork the last GPL'd version of btrfs and Oracle
would have relegated itself to a non-player in the Linux filesystem
space... 

So, yes, they can do it if they want, I just think they're not THAT
stupid. :)

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-16 Thread Ray Van Dolson
On Mon, Aug 16, 2010 at 09:08:52AM -0700, Ray Van Dolson wrote:
 On Mon, Aug 16, 2010 at 08:57:19AM -0700, Joerg Schilling wrote:
  C. Bergström codest...@osunix.org wrote:
  
I absolutely guarantee Oracle can and likely already has dual-licensed 
BTRFS.
   No.. talk to Chris Mason.. it depends on the linux kernel too much 
   already to be available under anything, but GPLv2
  
   If he really believes this, then he seems to be misinformed about the
   legal background.
  
  The question is: who wrote the btrfs code and who owns it.
  
  If Oracle pays him for writing the code, then Oracle owns the code and can 
  relicense it under any license they like.
  
  Jörg
 
 I don't think anyone is arguing that Oracle can relicense their own
 copyrighted code as they see fit.

s/can/can't/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-16 Thread Ray Van Dolson
On Mon, Aug 16, 2010 at 09:15:12AM -0700, Tim Cook wrote:
 Or, for all you know, Chris Mason's contract has a non-compete that
 states if he leaves Oracle he's not allowed to work on any project he
 was a part of for five years.
 
 The business motivation would be to set the competition back a decade.

Could be, though I still feel like there are plenty of great filesystem
people in the Linux kernel community who could pick things up just fine
.. 

Anyways, way off topic now -- we've both made our points I think. :)

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS development moving behind closed doors

2010-08-13 Thread Ray Van Dolson
On Fri, Aug 13, 2010 at 02:01:07PM -0700, C. Bergström wrote:
 Gary Mills wrote:
  If this information is correct,
 
  http://opensolaris.org/jive/thread.jspa?threadID=133043
 
  further development of ZFS will take place behind closed doors.
  Opensolaris will become the internal development version of Solaris
  with no public distributions.  The community has been abandoned.

 It was a community of system administrators and nearly no developers.  
 While this may make big news the real impact is probably pretty small.  
 Source code updates will get tossed over the fence and developer 
 partners (Intel) will still have access to onnv-gate.

I'm interested to see how this plays out in actuality.  It almost
sounded like source code wouldn't necessarily be shared until major
releases were made... which would obviously make it hard for third-party
ZFS vendors to keep up in the interim.

I guess most of this is still hearsay at this point, but if you've
read somewhere where Oracle has stated they plan to continuously share
source code and updates throughout their development processes (not
just at release time), it'd be good to see...

 
 In a way I see this as a very good thing.  It will now *force* the 
 existing (small) community of companies and developers to band together 
 to actually work together.  From there the real open source momentum can 
 happen instead of everyone depending on Sun/Oracle to give them a free 
 lunch.  The first step that I've been adamant about is making it easier 
 for developers to play and get their hands on it..  If we can enable 
 that it'll swing things around regardless of what mega-corp does or 
 doesn't do...
 
 Just my 0.02$
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Adding ZIL to pool questions

2010-08-01 Thread Ray Van Dolson
On Sun, Aug 01, 2010 at 12:36:28PM -0700, Gregory Gee wrote:
 Jim, that ACARD looks really nice, but out of the price range for a
 home server.
 
 Edward, disabling ZIL might be ok, but let me characterize what my
 home server does and tell me if disabling ZIL is ok.
 
 My home OpenSolaris server is only used for storage.  I have a
 separate linux box that runs any software I need such as media
 servers and such.  I export all pools from the OpenSolaris box to the
 linux box via NFS.
 
 The OpenSolaris box has 2 pools.  The first pool stores videos,
 pictures, various files and mail all exported via NFS to the linux
 box.  It is a mirrored zpool.  The second mirrored zpool is NFS store
 for VM images.  The linux boxes I mentioned are actually VMs running in 
 XenServer.  The VM vdisks are stored and run from the OpenSolaris NFS
 server mounted in the XenServer box.
 
 Yes, I know that this is not a typical home setup, but I'm sure that
 most here don't have a 'typical home setup'.
 
 So the question is, will disabling ZIL have negative impacts on the
 VM vdisks stored in NFS?  Or any other files on the NFS shares?

You would probably see better performance, at the expense of losing
the last few seconds of acknowledged synchronous writes if the box
crashes or loses power -- which can leave NFS clients and VM vdisks in
an inconsistent state.
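
If you do want to experiment, the only switch I'm aware of on current
Solaris 10 / 2009.06 bits is the global zil_disable tunable -- note it
affects every pool and dataset on the box, not just the NFS shares:

  * /etc/system -- disables the ZIL globally; takes effect on reboot
  set zfs:zil_disable = 1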

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Using a zvol from your rpool as zil for another zpool

2010-07-02 Thread Ray Van Dolson
We have a server with a couple X-25E's and a bunch of larger SATA
disks.

To save space, we want to install Solaris 10 (our install is only about
1.4GB) to the X-25E's and use the remaining space on the SSD's for ZIL
attached to a zpool created from the SATA drives.

Currently we do this by installing the OS using SVM+UFS (to mirror the
OS between the two SSD's) and then using the remaining space on a slice
as ZIL for the larger SATA-based zpool.

However, SVM+UFS is more annoying to work with as far as LiveUpgrade is
concerned.  We'd love to use a ZFS root, but that requires that the
entire SSD be dedicated as an rpool leaving no space for ZIL.  Or does
it?

It appears that we could do a:

  # zfs create -V 24G rpool/zil

On our rpool and then:

  # zpool add satapool log /dev/zvol/dsk/rpool/zil

(I realize 24G is probably far more than a ZIL device will ever need)

As rpool is mirrored, this would also take care of redundancy for the
ZIL as well.

This lets us have a nifty ZFS rpool for simplified LiveUpgrades and a
fast SSD-based ZIL for our SATA zpool as well...

What are the downsides to doing this?  Will there be a noticeable
performance hit?

I know I've seen this discussed here before, but wasn't able to come up
with the right search terms...

Thanks,
Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Using a zvol from your rpool as zil for another zpool

2010-07-02 Thread Ray Van Dolson
 However, SVM+UFS is more annoying to work with as far as LiveUpgrade is
 concerned.  We'd love to use a ZFS root, but that requires that the
 entire SSD be dedicated as an rpool leaving no space for ZIL.  Or does
 it?
 
 It appears that we could do a:
 
   # zfs create -V 24G rpool/zil
 
 On our rpool and then:
 
   # zpool add satapool log /dev/zvol/dsk/rpool/zil
 
 (I realize 24G is probably far more than a ZIL device will ever need)
 
 As rpool is mirrored, this would also take care of redundancy for the
 ZIL as well.
 
 This lets us have a nifty ZFS rpool for simplified LiveUpgrades and a
 fast SSD-based ZIL for our SATA zpool as well...
 
 What are the downsides to doing this?  Will there be a noticeable
 performance hit?
 
 I know I've seen this discussed here before, but wasn't able to come up
 with the right search terms...

Well, after doing a little better on my searches, it sounds like -- at
least for cache/L2ARC on zvols -- some race conditions can pop up and
this isn't necessarily the most robust or tested configuration.

Doesn't sound like something I'd want to do in production.

Perhaps the better option is to have multiple Solaris FDISK partitions
set up.  This way I could still install my rpool to the first partition
and use the remaining partition for my ZIL for the SATA zpool.

This obviously would only work on x86 systems.

Would multiple FDISK partitions be the most robust way to implement
this?

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Using a zvol from your rpool as zil for another zpool

2010-07-02 Thread Ray Van Dolson
On Fri, Jul 02, 2010 at 03:40:26AM -0700, Ben Taylor wrote:
  We have a server with a couple X-25E's and a bunch of
  larger SATA
  disks.
  
  To save space, we want to install Solaris 10 (our
  install is only about
  1.4GB) to the X-25E's and use the remaining space on
  the SSD's for ZIL
  attached to a zpool created from the SATA drives.
  
  Currently we do this by installing the OS using
  SVM+UFS (to mirror the
  OS between the two SSD's) and then using the
  remaining space on a slice
  as ZIL for the larger SATA-based zpool.
  
  However, SVM+UFS is more annoying to work with as far
  as LiveUpgrade is
  concerned.  We'd love to use a ZFS root, but that
  requires that the
  entire SSD be dedicated as an rpool leaving no space
  for ZIL.  Or does
  it?
 
 For every system I have ever done zfs root on, it's always
 been a slice on a disk.  As an example, we have an x4500
 with 1TB disks.  For that root config, we are planning on
 something like 150G on s0, and the rest on S3. s0 for
 the rpool, and s3 for the qpool.  We didn't want to have
 to deal with issues around flashing a huge volume, as
 we found out with our other x4500 with 500GB disks.
 
 AFAIK, it's only non-rpool disks that use the whole disk,
 and I doubt there's some sort of specific feature with
 an SSD, but I could be wrong.
 
 I like your idea of a reasonably sized root rpool and the
 rest used for the ZIL.  But if you're going to do LU,
 you should probably take a good look at how much space
 you need for the clones and snapshots on the rpool

Interesting.  For some reason, I coulda sworn that Sol 10 U8 installer
required you to use an entire disk for a ZFS rpool, so using only part
of the disk on a slice and leaving space for other uses wasn't an
option.

I'll revisit this though.

Thanks for the reply.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Using a zvol from your rpool as zil for another zpool

2010-07-02 Thread Ray Van Dolson
On Fri, Jul 02, 2010 at 08:18:48AM -0700, Erik Ableson wrote:
 Le 2 juil. 2010 à 16:30, Ray Van Dolson rvandol...@esri.com a écrit :
 
  On Fri, Jul 02, 2010 at 03:40:26AM -0700, Ben Taylor wrote:
  We have a server with a couple X-25E's and a bunch of larger SATA
  disks.
  
  To save space, we want to install Solaris 10 (our install is only
  about 1.4GB) to the X-25E's and use the remaining space on the
  SSD's for ZIL attached to a zpool created from the SATA drives.
  
  Currently we do this by installing the OS using SVM+UFS (to
  mirror the OS between the two SSD's) and then using the remaining
  space on a slice as ZIL for the larger SATA-based zpool.
  
  However, SVM+UFS is more annoying to work with as far as
  LiveUpgrade is concerned.  We'd love to use a ZFS root, but that
  requires that the entire SSD be dedicated as an rpool leaving no
  space for ZIL.  Or does it?
  
  For every system I have ever done zfs root on, it's always been a
  slice on a disk.  As an example, we have an x4500 with 1TB disks.
  For that root config, we are planning on something like 150G on
  s0, and the rest on S3. s0 for the rpool, and s3 for the qpool.
  We didn't want to have to deal with issues around flashing a huge
  volume, as we found out with our other x4500 with 500GB disks.
  
  AFAIK, it's only non-rpool disks that use the whole disk, and I
  doubt there's some sort of specific feature with an SSD, but I
  could be wrong.
  
  I like your idea of a reasonably sized root rpool and the rest
  used for the ZIL.  But if you're going to do LU, you should
  probably take a good look at how much space you need for the
  clones and snapshots on the rpool
  
  Interesting.  For some reason, I coulda sworn that Sol 10 U8
  installer required you to use an entire disk for a ZFS rpool, so
  using only part of the disk on a slice and leaving space for other
  uses wasn't an option.
  
  I'll revisit this though.
  
 
 It certainly works under OpenSolaris, but you might want to look into
 manually partitioning the drive to ensure that it's properly aligned
 on the 4k boundaries. Last time I did that, it showed me a tiny space
 before the manually created partition.
 
 Cheers,
 
 Erik
  

Well, everything worked fine.  ZFS rpool on s0 and ZIL for another pool
on s3.
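
(For the archives, the slog half of that boiled down to something like
the following -- controller/target numbers are made up, and the
installer handled s0/rpool itself:)

  # zpool add satapool log mirror c0t0d0s3 c0t1d0s3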

Unfortunately, I didn't end up doing the 4K block alignment.  Doesn't
look like the fdisk keyword in JumpStart lets you specify this sort
of thing, but I probably could have pre-partitioned the disk from the
shell before running my JumpStart.
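
(Note for next time: the alignment is at least easy to check after the
fact -- a slice whose starting sector is divisible by 8 begins on a 4K
boundary.  Device name below is just an example:)

  # prtvtoc /dev/rdsk/c0t0d0s2
  (look at the First Sector column for each slice)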

Lessons learned.

Thanks all,
Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] What happens when unmirrored ZIL log device is removed ungracefully

2010-06-30 Thread Ray Van Dolson
On Wed, Jun 30, 2010 at 09:47:15AM -0700, Edward Ned Harvey wrote:
  From: Arne Jansen [mailto:sensi...@gmx.net]
  
  Edward Ned Harvey wrote:
   Due to recent experiences, and discussion on this list, my colleague
   and I performed some tests:

   Using solaris 10, fully upgraded.  (zpool 15 is latest, which does
   not have log device removal that was introduced in zpool 19)  In any
   way possible, you lose an unmirrored log device, and the OS will
   crash, and the whole zpool is permanently gone, even after reboots.
  
  
  I'm a bit confused. I tried hard, but haven't been able to reproduce
  this using Sol10U8. I have a mirrored slog device. While putting it
  under load doing synchronous file creations, we pulled the power
  cords and unplugged the slog devices. After powering on zfs imported
  the pool, but prompted to acknowledge the missing slog devices with
  zpool clear. After that the pool was accessible again. That's exactly
  how it should be.
 
 Very interesting.  I did this test some months ago, so I may not recall the
 relevant details, but here are the details I do remember:
 
 I don't recall if I did this test on osol2009.06, or sol10.
 
 In Sol10u6 (and I think Sol10u8) the default zpool version is 10, but if you
 apply all your patches, then 15 becomes available.  I am sure that I've
 never upgraded any of my sol10 zpools higher than 10.  So it could be that
 an older zpool version might exhibit the problem, and you might be using a
 newer version.
 
 In osol2009.06, IIRC, the default is zpool 14, and if you upgrade fully,
 you'll get to something around 24.  So again, it's possible the bad behavior
 went away in zpool 15, or any other number from 11 to 15.
 
 I'll leave it there for now.  If that doesn't shed any light, I'll try to
 dust out some more of my mental cobwebs.

Anyone else done any testing with zpool version 15 (on Solaris 10 U8)?
Have a new system coming in shortly and will test myself, but knowing
this is a recoverable scenario would help me rest easier as I have an
unmirrored slog setup hanging around still.
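
(For my own notes, the test I have in mind is roughly the following --
pool name is a placeholder, and the zpool clear step is just what
Arne's report above describes:)

  # zpool status satapool      <- note which device is the log
  (pull the slog and power-cycle the box under synchronous write load)
  # zpool import satapool
  # zpool clear satapool       <- acknowledge the missing log device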

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] OCZ Devena line of enterprise SSD

2010-06-17 Thread Ray Van Dolson
On Thu, Jun 17, 2010 at 09:42:44AM -0700, F. Wessels wrote:
 I just looked it up again and as far as I can see the supercap is
 present in the MLC version as well as the SLC.

Very nice.  A pair of the 50GB SLC model would be great for ZIL.  Might
continue to stick with the X-25M for L2ARC though based on price.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication and ISO files

2010-06-07 Thread Ray Van Dolson
On Fri, Jun 04, 2010 at 01:10:44PM -0700, Ray Van Dolson wrote:
 On Fri, Jun 04, 2010 at 01:03:32PM -0700, Brandon High wrote:
  On Fri, Jun 4, 2010 at 12:37 PM, Ray Van Dolson rvandol...@esri.com wrote:
   Makes sense.  So, as someone else suggested, decreasing my block size
   may improve the deduplication ratio.
  
  It might. It might make your performance tank, too.
  
  Decreasing the block size increases the size of the dedup table (DDT).
  Every entry in the DDT uses somewhere around 250-270 bytes. If the DDT
  gets too large to fit in memory, it will have to be read from disk,
  which will destroy any sort of write performance (although a L2ARC on
  SSD can help)
  
  If you move to 64k blocks, you'll double the DDT size and may not
  actually increase your ratio. Moving to 8k blocks will increase your
  DDT by a factor of 16, and still may not help.
  
  Changing the recordsize will not affect files that are already in the
  dataset. You'll have to recopy them to re-write with the smaller block
  size.
  
  -B
 
 Gotcha.  Just trying to make sure I understand how all this works, and
 if I _would_ in fact see an improvement in dedupe-ratio by tweaking the
 recordsize with our data-set.
 
 Once we know that we can decide if it's worth the extra costs in
 RAM/L2ARC.
 
 Thanks all.

FYI;

With 4K recordsize, I am seeing 1.26x dedupe ratio between the RHEL 5.4
ISO and the RHEL 5.5 ISO file.

However, it took about 33 minutes to copy the 2.9GB ISO file onto the
filesystem. :)  Definitely would need more RAM in this setup...

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Deduplication and ISO files

2010-06-04 Thread Ray Van Dolson
I'm running zpool version 23 (via ZFS fuse on Linux) and have a zpool
with deduplication turned on.

I am testing how well deduplication will work for the storage of many,
similar ISO files and so far am seeing unexpected results (or perhaps
my expectations are wrong).

The ISO's I'm testing with are the 32-bit and 64-bit versions of the
RHEL5 DVD ISO's.  While both have their differences, they do contain a
lot of similar data as well.

If I explode both ISO files and copy them to my ZFS filesystem I see
about a 1.24x dedup ratio.

However, if I have only the ISO files on the ZFS filesystem, the ratio
is 1.00x -- no savings at all.

Does this make sense?  I'm going to experiment with other combinations
of ISO files as well...

Thanks,
Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication and ISO files

2010-06-04 Thread Ray Van Dolson
On Fri, Jun 04, 2010 at 11:16:40AM -0700, Brandon High wrote:
 On Fri, Jun 4, 2010 at 9:30 AM, Ray Van Dolson rvandol...@esri.com wrote:
  The ISO's I'm testing with are the 32-bit and 64-bit versions of the
  RHEL5 DVD ISO's.  While both have their differences, they do contain a
  lot of similar data as well.
 
 Similar != identical.
 
 Dedup works on blocks in zfs, so unless the iso files have identical
 data aligned at 128k boundaries you won't see any savings.
 
  If I explode both ISO files and copy them to my ZFS filesystem I see
  about a 1.24x dedup ratio.
 
 Each file starts a new block, so the identical files can be deduped.
 
 -B

Makes sense.  So, as someone else suggested, decreasing my block size
may improve the deduplication ratio.

recordsize I presume is the value to tweak?

Thanks,
Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication and ISO files

2010-06-04 Thread Ray Van Dolson
On Fri, Jun 04, 2010 at 01:03:32PM -0700, Brandon High wrote:
 On Fri, Jun 4, 2010 at 12:37 PM, Ray Van Dolson rvandol...@esri.com wrote:
  Makes sense.  So, as someone else suggested, decreasing my block size
  may improve the deduplication ratio.
 
 It might. It might make your performance tank, too.
 
 Decreasing the block size increases the size of the dedup table (DDT).
 Every entry in the DDT uses somewhere around 250-270 bytes. If the DDT
 gets too large to fit in memory, it will have to be read from disk,
 which will destroy any sort of write performance (although a L2ARC on
 SSD can help)
 
 If you move to 64k blocks, you'll double the DDT size and may not
 actually increase your ratio. Moving to 8k blocks will increase your
 DDT by a factor of 16, and still may not help.
 
 Changing the recordsize will not affect files that are already in the
 dataset. You'll have to recopy them to re-write with the smaller block
 size.
 
 -B

Gotcha.  Just trying to make sure I understand how all this works, and
if I _would_ in fact see an improvement in dedupe-ratio by tweaking the
recordsize with our data-set.

Once we know that, we can decide if it's worth the extra costs in
RAM/L2ARC.
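
For anyone who lands on this thread later, the two knobs/checks being
discussed boil down to roughly the following (pool/dataset names are
placeholders; recordsize only affects blocks written after the change,
so existing files have to be re-copied):

  # zfs set recordsize=64K tank/isos
  # zdb -DD tank
  (the DDT summary; in-core cost is roughly unique blocks x 250-270 bytes)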

Thanks all.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New SSD options

2010-05-24 Thread Ray Van Dolson
This thread has grown giant, so apologies for screwing up threading
with an out of place reply. :)

So, as far as SF-1500 based SSD's, the only ones currently in existence
are the Vertex 2 LE and Vertex 2 EX, correct (I understand the Vertex 2
Pro was never mass produced)?

Both of these are based on MLC and not SLC -- why isn't that an issue
for longevity?

Any other SF-1500 options out there?

We continue to use UPS-backed Intel X-25E's for ZIL.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New SSD options

2010-05-24 Thread Ray Van Dolson
On Mon, May 24, 2010 at 11:30:20AM -0700, Ray Van Dolson wrote:
 This thread has grown giant, so apologies for screwing up threading
 with an out of place reply. :)
 
 So, as far as SF-1500 based SSD's, the only ones currently in existence
 are the Vertex 2 LE and Vertex 2 EX, correct (I understand the Vertex 2
 Pro was never mass produced)?
 
 Both of these are based on MLC and not SLC -- why isn't that an issue
 for longevity?
 
 Any other SF-1500 options out there?
 
 We continue to use UPS-backed Intel X-25E's for ZIL.

From earlier in the thread, it sounds like none of the SF-1500 based
drives even have a supercap, so it doesn't seem that they'd necessarily
be a better choice than the SLC-based X-25E at this point unless you
need more write IOPS...

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Best practice for full stystem backup - equivelent of ufsdump/ufsrestore

2010-05-05 Thread Ray Van Dolson
On Wed, May 05, 2010 at 04:31:08PM -0700, Bob Friesenhahn wrote:
 On Thu, 6 May 2010, Ian Collins wrote:
  Bob and Ian are right.  I was trying to remember the last time I installed
  Solaris 10, and the best I can recall, it was around late fall 2007.
  The fine folks at Oracle have been making improvements to the product
  since then, even though no new significant features have been added since
  that time :-(
  
  ZFS boot?
 
 I think that Richard is referring to the fact that the PowerPC/Cell 
 Solaris 10 port for the Sony Playstation III never emerged.  ;-)
 
 Other than desktop features, as a Solaris 10 user I have seen 
 OpenSolaris kernel features continually percolate down to Solaris 10 
 so I don't feel as left out as Richard would like me to feel.
 
 From a zfs standpoint, Solaris 10 does not seem to be behind the 
 currently supported OpenSolaris release.
 
 Bob

Well, being able to remove ZIL devices is one important feature
missing.  Hopefully in U9. :)

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Best practice for full stystem backup - equivelent of ufsdump/ufsrestore

2010-05-05 Thread Ray Van Dolson
On Wed, May 05, 2010 at 05:09:40PM -0700, Erik Trimble wrote:
 On Wed, 2010-05-05 at 19:03 -0500, Bob Friesenhahn wrote:
  On Wed, 5 May 2010, Ray Van Dolson wrote:
  
   From a zfs standpoint, Solaris 10 does not seem to be behind the
   currently supported OpenSolaris release.
  
   Well, being able to remove ZIL devices is one important feature
   missing.  Hopefully in U9. :)
  
  While the development versions of OpenSolaris are clearly well beyond 
  Solaris 10, I don't believe that the supported version of OpenSolaris 
  (a year old already) has this feature yet either and Solaris 10 has 
  been released several times since then already.  When the forthcoming 
  OpenSolaris release emerges in 2011, the situation will be far 
  different.  Solaris 10 can then play catch-up with the release of U9 
  in 2012.
  
  Bob
 
 Pessimist. ;-)
 
 
 s/2011/2010/
 s/2012/2011/
 

Yeah, U9 in 2012 makes me very sad.

I would really love to see the hot-removable ZIL's this year.
Otherwise I'll need to rebuild a few zpools :)

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS monitoring - best practices?

2010-04-08 Thread Ray Van Dolson
We're starting to grow our ZFS environment and really need to start
standardizing our monitoring procedures.

OS tools are great for spot troubleshooting and sar can be used for
some trending, but we'd really like to tie this into an SNMP-based
system that can generate graphs for us (via RRD or other).

Whether or not we do this via our standard enterprise monitoring tool
or write some custom scripts I don't really care... but I do have the
following questions:

- What metrics are you guys tracking?  I'm thinking:
- IOPS
- ZIL statistics
- L2ARC hit ratio
- Throughput
- IO Wait (I know there's probably a better term here)
- How do you gather this information?  Some but not all of it is
  available via SNMP.  Has anyone written a ZFS-specific MIB or
  plugin to make the info available via the standard Solaris SNMP
  daemon?  What information is available only via zdb/mdb?  (A rough
  kstat-based sketch of what I had in mind is below.)
- Anyone have any RRD-based setups for monitoring their ZFS
  environments they'd be willing to share or talk about?
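
Here is roughly what I was picturing for the collection side -- just
polling kstats into whatever feeds RRD.  The kstat names are the
standard zfs/arcstats ones; the pool name and the wrapper itself are
only a sketch:

  #!/bin/ksh
  # dump a few ZFS counters in name=value form for an RRD/SNMP collector
  kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses
  kstat -p zfs:0:arcstats:l2_hits zfs:0:arcstats:l2_misses
  kstat -p zfs:0:arcstats:size zfs:0:arcstats:c
  # see what else the zfs module exposes on a given build
  kstat -m zfs | head -50
  # per-pool bandwidth/IOPS (second sample is the interval average)
  zpool iostat datapool 5 2 | tail -1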

Thanks in advance,
Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


  1   2   >