Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Marc Bevand
Bob Friesenhahn bfriesen at simple.dallas.tx.us writes:
 [...]
 X25-E's write cache is volatile), the X25-E has been found to offer a 
 bit more than 1000 write IOPS.

I think this is incorrect. On paper, the X25-E offers 3300 random write
4kB IOPS (and Intel is known to be very conservative about the IOPS perf 
numbers they publish). Dumb storage IOPS benchmark tools that don't issue 
parallel I/O ops to the drive tend to report numbers less than half the 
theoretical IOPS. This would explain why you see only 1000 IOPS.

I have direct evidence of this (with Intel's other line of SSDs, the MLC-based
X25-M): 35000 random read 4kB IOPS theoretical; 1 instance of a private 
benchmarking tool measures 6000; 10+ instances of this tool measure 37000 IOPS 
(slightly better than the theoretical max!)
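
To illustrate why a single-threaded benchmark under-reports, here is a rough sketch based on Little's law (throughput = requests in flight / per-request latency). The 6000 and 35000 IOPS figures are the ones quoted above; the function names and the assumption that latency stays constant as queue depth grows are mine, so treat the output as a ballpark only.

```python
# Little's law sketch: throughput = outstanding requests / per-request latency.
# The 35000 IOPS ceiling and the 6000 IOPS single-instance figure come from the
# post above; the constant-latency model is an illustrative assumption.


def implied_latency_s(iops_at_depth_one: float) -> float:
    """Per-request latency implied by a benchmark that keeps one I/O in flight."""
    return 1.0 / iops_at_depth_one


def estimated_iops(queue_depth: int, latency_s: float, device_max_iops: float) -> float:
    """IOPS expected with `queue_depth` requests in flight, capped by the device."""
    return min(queue_depth / latency_s, device_max_iops)


if __name__ == "__main__":
    latency = implied_latency_s(6000)  # roughly 167 microseconds per read
    for depth in (1, 2, 4, 10):
        print(depth, round(estimated_iops(depth, latency, 35000)))
```

Under this simple model, one outstanding request cannot exceed about 6000 IOPS no matter how fast the drive is, while ten parallel requests are limited by the drive rather than by the benchmark.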

-mrb

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Performance of ZFS and UFS inside local/global zone

2009-10-21 Thread Orvar Korvar
So is there a Change Request on this?
-- 
This message posted from opensolaris.org


Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Tristan Ball
What makes you say that the X25-E's cache can't be disabled or flushed? 
The net seems to be full of references to people who are disabling the 
cache, or flushing it frequently, and then complaining about the 
performance!


T

Frédéric VANNIERE wrote:
The ZIL is a write-only log that is only read after a power failure. Several GB is large enough for most workloads. 

You can't use the Intel X25-E because it has a 32 or 64 MB volatile cache that can be neither disabled nor flushed by ZFS. 


Imagine your server has a power failure while writing data to the pool. In a 
normal situation, with the ZIL on a reliable device, ZFS will read the ZIL and 
come back to a stable state at reboot. You may have lost some data (30 seconds) 
but the zpool works. With the Intel X25-E as ZIL, some log data has been lost 
with the power failure (32/64MB max), which leads to a corrupted log and so ... 
you lose your zpool and all your data!!

For the ZIL you need 2 reliable mirrored SSD devices with a supercapacitor that can flush the write cache to NAND when a power failure occurs. 


A hard disk has a write cache, but it can be disabled or flushed by the operating 
system.

For more information:
http://www.c0t0d0s0.org/archives/5993-Somewhat-stable-Solid-State.html
  




[zfs-discuss] Unable to destroy/rollback snapshot

2009-10-21 Thread Andrew Robert Nicols
I've been trying to roll back to a snapshot but seem to be unable to do so. Can
anyone shed some light on what I may be doing wrong?

I'm trying to roll back from thumperpool/m...@200908271200 to
thumperpool/m...@200908270100.

344 r...@thumper1:~ zfs list -t snapshot | tail
thumperpool/m...@200908261700 73.0M  -  1.45T  -
thumperpool/m...@200908261800 72.0M  -  1.45T  -
thumperpool/m...@200908261900 72.4M  -  1.45T  -
thumperpool/m...@200908262000 71.9M  -  1.45T  -
thumperpool/m...@200908262100 67.7M  -  1.45T  -
thumperpool/m...@200908262200 67.5M  -  1.45T  -
thumperpool/m...@200908262300 72.0M  -  1.45T  -
thumperpool/m...@20090827 71.0M  -  1.45T  -
thumperpool/m...@200908270100 75.0M  -  1.45T  -
thumperpool/m...@200908271200 0  -  1.45T  -

345 r...@thumper1:~ zfs list -t filesystem
NAME                     USED   AVAIL  REFER  MOUNTPOINT
rpool                    10.4G   446G    33K  /rpool
rpool/ROOT                757M   446G    21K  legacy
rpool/ROOT/solaris_u8     757M   446G   757M  /
rpool/export             6.39G   446G    23K  /export
rpool/export/home        6.39G   446G  6.39G  /export/home
rpool/opt                 234M   446G   234M  /opt
thumperpool              6.72T  9.27T  48.8K  /thumperpool
thumperpool/export       11.7G  9.27T  48.8K  /oldexport
thumperpool/export/home  11.7G  9.27T  11.7G  /oldexport/home
thumperpool/mnt          6.70T  9.27T  1.45T  /thumperpool/mnt

346 r...@thumper1:~ zfs rollback thumperpool/m...@200908270100
cannot rollback to 'thumperpool/m...@200908270100': more recent snapshots
exist
use '-r' to force deletion of the following snapshots:
thumperpool/m...@200908271200

347 r...@thumper1:~ zfs rollback -r thumperpool/m...@200908270100
cannot destroy 'thumperpool/m...@200908271200': dataset already exists

This is an X4500 running Solaris U8. I'm running zpool version 15 and zfs
version 2.

Any guidance much appreciated.

Andrew

-- 
Systems Developer

e: andrew.nic...@luns.net.uk
im: a.nic...@jabber.lancs.ac.uk
t: +44 (0)1524 5 10147

Lancaster University Network Services is a limited company registered in
England and Wales. Registered number: 04311892. Registered office:
University House, Lancaster University, Lancaster, LA1 4YW




Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Meilicke, Scott
Thank you Bob and Richard. I will go with A, as it also keeps things simple.
One physical device per pool.

-Scott


On 10/20/09 6:46 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote:

 On Tue, 20 Oct 2009, Richard Elling wrote:
 
 The ZIL device will never require more space than RAM.
 In other words, if you only have 16 GB of RAM, you won't need
 more than that for the separate log.
 
 Does the wasted storage space annoy you? :-)
 
 What happens if the machine is upgraded to 32GB of RAM later?
 
 The write performance of the X25-E is likely to be the bottleneck for a
 write-mostly storage server if the storage server has excellent
 network connectivity.
 
 Bob
 --
 Bob Friesenhahn
 bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



We value your opinion!  How may we serve you better? 
Please click the survey link to tell us how we are doing:
http://www.craneae.com/ContactUs/VoiceofCustomer.aspx
Your feedback is of the utmost importance to us. Thank you for your time.

Crane Aerospace & Electronics Confidentiality Statement:
The information contained in this email message may be privileged and is 
confidential information intended only for the use of the recipient, or any 
employee or agent responsible to deliver it to the intended recipient. Any 
unauthorized use, distribution or copying of this information is strictly 
prohibited and may be unlawful. If you have received this communication in 
error, please notify the sender immediately and destroy the original message 
and all attachments from your electronic files.



Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Meilicke, Scott
Thanks Ed. It sounds like you have run in this mode? No issues with  
the perc?


--
Scott Meilicke

On Oct 20, 2009, at 9:59 PM, Edward Ned Harvey sola...@nedharvey.com wrote:

System:
Dell 2950
16G RAM
16 1.5T SATA disks in a SAS chassis hanging off of an LSI 3801e, no
extra drive slots, a single zpool.
svn_124, but with my zpool still running at the 2009.06 version (14).

My plan is to put the SSD into an open disk slot on the 2950, but will
have to configure it as a RAID 0, since the onboard PERC5 controller
does not have a JBOD mode.

You can JBOD with the perc.  It might be technically a raid0 or raid1 with a
single disk in it, but that would be functionally equivalent to JBOD.









Re: [zfs-discuss] Sun Flash Accelerator F20

2009-10-21 Thread Robert Dupuy
There is a debate tactic known as complex argument, where so many false and 
misleading statements are made at once, that it overwhelms the respondent.

I'm just going to respond this way.

I am very disappointed in this discussion group.  The response is not genuine.

The idea that latency is not important is patently absurd.  I am not going into 
the details of my private application, so you can pick at it.

If you want to say latency has no relevance, you can defend that absurdity at 
the risk of your own reputation.

The responses are not, in my opinion, genuine.

Attacking Intel's spec sheet, while simultaneously defending no latency #'s being 
released by another vendor?

well I sent 3 emails in response earlier this morning, but I wasn't logged in, 
so I don't know if the mod will post them or not.

Moderator, you don't have to, this will suffice as my last email.

Guys, I don't have time to waste with you, and I feel that it is very wasteful 
to sit here and argue with people who either a) don't understand technology or 
more likely b) simply are being argumentative because they have a vested 
interest.

Either way, I don't see genuine help.

I am going to remove my account, and good bye!  best of luck to everyone.


Re: [zfs-discuss] Sun Flash Accelerator F20

2009-10-21 Thread Brian Hechinger
Please don't feed the troll.

:)

-brian


-- 
Coding in C is like sending a 3 year old to do groceries. You gotta
tell them exactly what you want or you'll end up with a cupboard full of
pop tarts and pancake mix. -- IRC User (http://www.bash.org/?841435)


Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Edward Ned Harvey
 Thanks Ed. It sounds like you have run in this mode? No issues with
 the perc?
 
  You can JBOD with the perc.  It might be technically a raid0 or raid1 with a
  single disk in it, but that would be functionally equivalent to JBOD.

The only time I did this was ...

I have a Windows server, on a PE2950 with Perc5i, running on a disk mirror
for the OS with hotspare.  Then I needed to add some more space, and the
only disk I had available was a single 750G.  So I added it with no problem,
and I ordered another 750G to be the mirror of the first one.  I used a
single disk successfully, until the 2nd disk arrived, and then I enabled
mirroring from the 1st to the 2nd.  Everything went well.  No interruptions.
The system was a little slow while resilvering.

The one big obvious difference between my setup and yours is the OS.  I
expect that the OS doesn't change the capabilities of the Perc card, so I
think you should be fine.

The one comment I will make, in regards to the OS, which many people might
overlook, is ...

There are two interfaces to configure your Perc card.  One is the BIOS
interface, and the other is the Dell OpenManage Server Administrator
(OMSA) managed node.  OMSA provides an interface at https://machine:1311
which allows you to configure the card, monitor health, enable/disable
hotspare, resilver a new disk, etc., all while the OS is running (no need
to shut down into the BIOS).

OMSA is required in order to replace a failed disk without a reboot.  Or add
disks, etc, or anything else you might want to do on the Perc card.  I know
OMSA is available for Windows and Linux.  How about Solaris?

Out of curiosity, I logged into Dell support just now to look up my 2950.
The supported OSes are Netware, Windows, RedHat, and Suse.  Which means, on
my system, if I were running Solaris, I could count on *not* being able to
run OMSA, and consequently the only interface to configure the Perc would be
the BIOS.  If Solaris is able to install at all, I would have to acknowledge
that I have to shut down anytime I need to change the Perc configuration,
including replacing failed disks.



Re: [zfs-discuss] Strange problem with liveupgrade on zfs (10u7 and u8)

2009-10-21 Thread Mark Horstman
I'm seeing the same lucreate error on my fresh SPARC sol10u8 install 
(and on my SPARC sol10u7 machine, which I keep patched up to date), but I don't 
have a separate /var:

# zfs list
NAME                    USED  AVAIL  REFER  MOUNTPOINT
pool00                 3.36G   532G    20K  none
pool00/global          3.51M   532G    20K  none
pool00/global/appl       20K   532G    20K  /appl
pool00/global/home      324K   532G   324K  /home
pool00/global/local      26K   532G    26K  /usr/local
pool00/global/patches  3.13M   532G  3.13M  /usr/local/patches
pool00/shared          3.35G   532G    20K  none
pool00/shared/install  2.52G   532G  2.52G  /install
pool00/shared/local     849M   532G   849M  /opt/local
rpool                  44.6G  89.2G    97K  /rpool
rpool/ROOT             4.63G  89.2G    21K  legacy
rpool/ROOT/sol10u8     4.63G  89.2G  4.63G  /
rpool/dump             8.01G  89.2G  8.01G  -
rpool/swap               32G   121G    16K  -

# lucreate -n foobar
Analyzing system configuration.
Comparing source boot environment sol10u8 file systems with the file
system(s) you specified for the new boot environment. Determining which
file systems should be in the new boot environment.
Updating boot environment description database on all BEs.
Updating system configuration files.
Creating configuration for boot environment foobar.
Source boot environment is sol10u8.
Creating boot environment foobar.
Cloning file systems from boot environment sol10u8 to create boot environment 
foobar.
Creating snapshot for rpool/ROOT/sol10u8 on rpool/ROOT/sol1...@foobar.
Creating clone for rpool/ROOT/sol1...@foobar on rpool/ROOT/foobar.
Setting canmount=noauto for / in zone global on rpool/ROOT/foobar.
WARNING: split filesystem / file system type zfs cannot inherit
mount point options - from parent filesystem / file
type - because the two file systems have different types.
Population of boot environment foobar successful.
Creation of boot environment foobar successful.

# cat /etc/vfstab
#device                   device   mount              FS       fsck  mount    mount
#to mount                 to fsck  point              type     pass  at boot  options
#
fd                        -        /dev/fd            fd       -     no       -
/proc                     -        /proc              proc     -     no       -
/dev/zvol/dsk/rpool/swap  -        -                  swap     -     no       -
/devices                  -        /devices           devfs    -     no       -
sharefs                   -        /etc/dfs/sharetab  sharefs  -     no       -
ctfs                      -        /system/contract   ctfs     -     no       -
objfs                     -        /system/object     objfs    -     no       -
swap                      -        /tmp               tmpfs    -     yes      -

I don't see anything wrong with my /etc/vfstab. Until I get this resolved, I'm 
afraid to patch and use the new BE.


[zfs-discuss] The iSCSI-backed zpool for my zone hangs.

2009-10-21 Thread Jacob Ritorto
My goal is to have a big, fast, HA filer that holds nearly everything for a 
bunch of development services, each running in its own Solaris zone.  So when I 
need a new service, test box, etc., I provision a new zone and hand it to the 
dev requesters and they load their stuff on it and go.

Each zone has zonepath on its own zpool, which is an iSCSI-backed device 
pointing to a unique sparse zvol on the filer.

If things slow down, we buy more 1U boxes with lots of CPU and RAM, don't 
care about the disk, and simply provision more LUNs on the filer.  Works great. 
 Cheap, good performance, nice and scalable.  They smiled on me for a while.

Until the filer dropped a few packets.

I know it shouldn't happen and I'm addressing that, but the failure mode 
for this eventuality is too drastic.  If the filer isn't responding nicely to 
the zone's i/o request, the zone pretty much completely hangs, responding to 
pings perhaps, but not allowing any real connections. Kind of, not 
surprisingly, like a machine whose root disk got yanked during normal 
operations.

To make it worse, the whole global zone seems unable to do anything about 
the issue.  I can't down the affected zone; zoneadm commands just put the zone 
in a shutting_down state forever.  zpool commands just hang.  Only thing I've 
found to recover (from far away in the middle of the night) is to uadmin 1 1 
the global zone.  Even reboot didn't work. So all the zones on the box get 
hard-reset and that makes all the dev guys pretty unhappy.

I thought about setting failmode to continue on these individual zone pools 
because it's set to wait right now.  How do you folks predict that action will 
change play?

thx
jake


Re: [zfs-discuss] Strange problem with liveupgrade on zfs (10u7 and u8)

2009-10-21 Thread Mark Horstman
Neither the virgin SPARC sol10u8 nor the (up-to-date) patched SPARC sol10u7 
has any local zones.


Re: [zfs-discuss] Strange problem with liveupgrade on zfs (10u7 and u8)

2009-10-21 Thread Enda O'Connor

Hi
This looks OK to me; the message is not an indicator of an issue.

Could you post
cat /etc/lu/ICF.1
cat /etc/lu/ICF.2 (the foobar BE)

also lumount foobar /a
and cat /a/etc/vfstab


Enda



--
Enda O'Connor x19781  Software Product Engineering
Patch System Test : Ireland : x19781/353-1-8199718


Re: [zfs-discuss] Strange problem with liveupgrade on zfs (10u7 and u8)

2009-10-21 Thread dick hoogendijk

Mark Horstman wrote:

I don't see anything wrong with my /etc/vfstab. Until I get this resolved, I'm 
afraid to patch and use the new BE.
  

It's the vfstab file in the newly created ABE that is wrongly written to.
Try to mount this new ABE and check it out for yourself.

--
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | SunOS 10u8 10/09 | OpenSolaris 2010.02 b123
+ All that's really worth doing is what we do for others (Lewis Carrol)



Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Scott Meilicke
sigh

Thanks Frédéric, that is a very interesting read. 

So my options as I see them now:

1. Keep the x25-e, and disable the cache. Performance should still be improved, 
but not by a *whole* lot, right? I will google for an expectation, but if 
anyone knows off the top of their head, I would be appreciative.
2. Buy a ZEUS or similar SSD with a cap backed cache. Pricing is a little hard 
to come by, based on my quick google, but I am seeing $2-3k for an 8G model. Is 
that right? Yowch.
3. Wait for the x25-e g2, which is rumored to have cap backed cache, and may or 
may not work well (but probably will).
4. Put the x25-e with disabled cache behind my PERC with the PERC cache enabled.

My budget is tight. I want better performance now. #4 sounds good. Thoughts?

Regarding mirrored SSDs for the ZIL, it was my understanding that if the SSD 
backed ZIL failed, ZFS would fall back to using the regular pool for the ZIL, 
correct? Assuming this is correct, a mirror would be to preserve performance 
during a failure?

Thanks everyone, this has been really helpful.

-Scott


Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Scott Meilicke
Ed, your comment:

If Solaris is able to install at all, I would have to acknowledge
that I have to shut down anytime I need to change the Perc configuration,
including replacing failed disks.

Replacing failed disks is easy when PERC is doing the RAID. Just remove the 
failed drive and replace with a good one, and the PERC will rebuild 
automatically. But are you talking about OpenSolaris managed RAID? I am pretty 
sure, though I have not tested it, that in pseudo JBOD mode (each disk a raid 0 
or 1), the PERC would still present a replaced disk to the OS without 
reconfiguring the PERC BIOS.

Scott


[zfs-discuss] fault.fs.zfs.vdev.io

2009-10-21 Thread Matthew C Aycock
I have several of these messages from fmdump:

 fmdump -v -u 98abae95-8053-4cdc-d91a-dad89b125db4~
TIME UUID SUNW-MSG-ID
Sep 18 00:45:23.7621 98abae95-8053-4cdc-d91a-dad89b125db4 ZFS-8000-FD
  100%  fault.fs.zfs.vdev.io

Problem in: zfs://pool=mzfs/vdev=a414878cf09644a
   Affects: zfs://pool=mzfs/vdev=a414878cf09644a
   FRU: -
  Location: -

Oct 21 10:34:41.8014 98abae95-8053-4cdc-d91a-dad89b125db4 FMD-8000-4M Repaired
  100%  fault.fs.zfs.vdev.io

Problem in: zfs://pool=mzfs/vdev=a414878cf09644a
   Affects: zfs://pool=mzfs/vdev=a414878cf09644a
   FRU: -
  Location: -

I am trying to determine which of the four vdevs is involved. How do I 
translate vdev=a414878cf09644a to a cWtXdYsZ device name?


Re: [zfs-discuss] Strange problem with liveupgrade on zfs (10u7 and u8)

2009-10-21 Thread Mark Horstman
More input:

# cat /etc/lu/ICF.1
sol10u8:-:/dev/zvol/dsk/rpool/swap:swap:67108864
sol10u8:/:rpool/ROOT/sol10u8:zfs:0
sol10u8:/appl:pool00/global/appl:zfs:0
sol10u8:/home:pool00/global/home:zfs:0
sol10u8:/rpool:rpool:zfs:0
sol10u8:/install:pool00/shared/install:zfs:0
sol10u8:/opt/local:pool00/shared/local:zfs:0
sol10u8:/usr/local:pool00/global/local:zfs:0
sol10u8:/usr/local/patches:pool00/global/patches:zfs:0

# cat /etc/lu/ICF.2
foobar:-:/dev/zvol/dsk/rpool/swap:swap:67108864
foobar:/:rpool/ROOT/foobar:zfs:0
foobar:/appl:pool00/global/appl:zfs:0
foobar:/home:pool00/global/home:zfs:0
foobar:/install:pool00/shared/install:zfs:0
foobar:/opt/local:pool00/shared/local:zfs:0
foobar:/rpool:rpool:zfs:0
foobar:/usr/local:pool00/global/local:zfs:0
foobar:/usr/local/patches:pool00/global/patches:zfs:0

Should I not be concerned about the "WARNING: split filesystem / file 
system type zfs cannot inherit" message? Do I just have to lumount the 
new BE, modify its /etc/vfstab, and then proceed as normal using luupgrade 
to apply patches to the new BE?


Re: [zfs-discuss] Strange problem with liveupgrade on zfs (10u7 and u8)

2009-10-21 Thread Mark Horstman
more input:

# lumount foobar /mnt
/mnt

# cat /mnt/etc/vfstab
#live-upgrade:Wed Oct 21 09:36:20 CDT 2009 updated boot environment foobar
#device                   device   mount              FS       fsck  mount    mount
#to mount                 to fsck  point              type     pass  at boot  options
#
fd                        -        /dev/fd            fd       -     no       -
/proc                     -        /proc              proc     -     no       -
/dev/zvol/dsk/rpool/swap  -        -                  swap     -     no       -
/devices                  -        /devices           devfs    -     no       -
sharefs                   -        /etc/dfs/sharetab  sharefs  -     no       -
ctfs                      -        /system/contract   ctfs     -     no       -
objfs                     -        /system/object     objfs    -     no       -
swap                      -        /tmp               tmpfs    -     yes      -
rpool/ROOT/foobar         -        /                  zfs      1     no       -


So I'm guessing the '/' entry has to be removed.
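
For anyone who wants to check a mounted BE for the same kind of entry, here is a small sketch that scans vfstab text for ZFS datasets mounted at "/". The helper name and the sample text are made up for illustration, and it only reports matching entries; whether the entry really needs to be removed is still my guess above, not something this confirms.

```python
# Sketch: report vfstab entries that mount a ZFS dataset at "/".
# Assumes the usual seven whitespace-separated vfstab fields; the helper
# name and SAMPLE text are hypothetical, not part of Live Upgrade.


def zfs_root_entries(vfstab_text: str) -> list:
    """Return the device field of every non-comment line with mount point / and type zfs."""
    entries = []
    for line in vfstab_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        fields = stripped.split()
        if len(fields) != 7:
            continue  # not a well-formed vfstab entry
        device, _fsck_device, mount_point, fs_type = fields[:4]
        if mount_point == "/" and fs_type == "zfs":
            entries.append(device)
    return entries


SAMPLE = """\
#device device mount FS fsck mount mount
fd - /dev/fd fd - no -
swap - /tmp tmpfs - yes -
rpool/ROOT/foobar - / zfs 1 no -
"""

if __name__ == "__main__":
    print(zfs_root_entries(SAMPLE))
```

To run it against the BE mounted above, read /mnt/etc/vfstab and pass its contents to zfs_root_entries().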


Re: [zfs-discuss] fault.fs.zfs.vdev.io

2009-10-21 Thread Cindy Swearingen

Hi Matthew,

You can use various forms of fmdump to decode this output.
It might be easier to use fmdump -eV and look for the
device info in the vdev path entry, like the one below.

Also see if the errors on these vdevs are reported in
your zpool status output.

Thanks,

Cindy

# fmdump -eV | more
TIME   CLASS
Oct 14 2009 09:56:54.639354792 ereport.fs.zfs.vdev.open_failed
nvlist version: 0
class = ereport.fs.zfs.vdev.open_failed
ena = 0xd9fa6d282c1
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = zfs
pool = 0xacea55024964f6d6
vdev = 0xf04f53d61ed76317
(end detector)

pool = mirpool
pool_guid = 0xacea55024964f6d6
pool_context = 0
pool_failmode = wait
vdev_guid = 0xf04f53d61ed76317
vdev_type = disk
vdev_path = /dev/dsk/c1t226000C0FFA001ABd3s0
vdev_devid = id1,s...@n600c0ff1ab23c5606e03/a
parent_guid = 0x6035386f7936f350
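
If there are many ereports to wade through, the lookup can be scripted. The sketch below pairs each vdev_guid with the vdev_path that follows it in saved fmdump -eV output. It assumes the vdev= value in the fault report is the vdev GUID in hex without the 0x prefix, which seems likely but is not confirmed here; the function name and the trimmed SAMPLE are only for illustration.

```python
# Sketch: map vdev GUIDs to device paths from saved `fmdump -eV` text.
# Assumption: the vdev= value in the fmdump -v fault report is the same GUID,
# printed in hex without the leading 0x.
import re


def vdev_paths_by_guid(fmdump_ev_text: str) -> dict:
    """Pair each vdev_guid line with the next vdev_path line that follows it."""
    mapping = {}
    current_guid = None
    for line in fmdump_ev_text.splitlines():
        guid_match = re.match(r"\s*vdev_guid\s*=\s*(0x[0-9a-fA-F]+)", line)
        if guid_match:
            current_guid = int(guid_match.group(1), 16)
            continue
        path_match = re.match(r"\s*vdev_path\s*=\s*(\S+)", line)
        if path_match and current_guid is not None:
            mapping[current_guid] = path_match.group(1)
            current_guid = None
    return mapping


# Trimmed from the ereport shown above.
SAMPLE = """\
vdev_guid = 0xf04f53d61ed76317
vdev_type = disk
vdev_path = /dev/dsk/c1t226000C0FFA001ABd3s0
"""

if __name__ == "__main__":
    paths = vdev_paths_by_guid(SAMPLE)
    wanted = int("f04f53d61ed76317", 16)  # GUID as it would appear without 0x
    print(paths.get(wanted))
```

Comparing the GUIDs as integers avoids mismatches caused by leading zeros or the 0x prefix.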







Re: [zfs-discuss] zvol used apparently greater than volsize for sparse volume

2009-10-21 Thread Cindy Swearingen

Hi Stuart,

I ran various forms of the zdb command to see if I could glean
the metadata accounting stuff but it is beyond my mere mortal
skills.

Maybe someone else can provide the right syntax.

Cindy


On 10/20/09 10:17, Stuart Anderson wrote:

Cindy,
Thanks for the pointer. Until this is resolved, is there some documentation
available that will let me calculate this by hand? I would like to know how
large the current 3-4% meta data storage I am observing can potentially grow.

Thanks.


On Oct 20, 2009, at 8:57 AM, Cindy Swearingen wrote:


Hi Stuart,

The reason why used is larger than the volsize is because we
aren't accounting for metadata, which is covered by this CR:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6429996
6429996 zvols don't reserve enough space for requisite meta data

Metadata is usually only a small percentage.

Sparse-ness is not a factor here.  Sparse just means we ignore the
reservation so you can create a zvol bigger than what we'd normally
allow.

Cindy

On 10/17/09 13:47, Stuart Anderson wrote:

What does it mean for the reported value of a zvol volsize to be
less than the product of used and compressratio?
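
As a rough way to put a number on that gap, the sketch below treats used x compressratio as an estimate of the uncompressed space charged to the zvol and compares it with volsize. This is a back-of-the-envelope reading of the question, not an official accounting formula; the function name and the example figures are hypothetical, and since the reported compressratio is rounded, the result is coarse.

```python
# Sketch: estimate the fraction by which used * compressratio exceeds volsize.
# Assumption: the excess is mostly metadata not covered by the reservation.


def implied_overhead(volsize_bytes: float, used_bytes: float, compressratio: float) -> float:
    """Return (used * compressratio) / volsize - 1 as a fraction, e.g. 0.04 for 4%."""
    if volsize_bytes <= 0:
        raise ValueError("volsize must be positive")
    return used_bytes * compressratio / volsize_bytes - 1.0


if __name__ == "__main__":
    gib = 2 ** 30
    # Hypothetical zvol: 100 GiB volsize, 52 GiB used, compressratio 2.00x.
    print(f"{implied_overhead(100 * gib, 52 * gib, 2.0):.1%}")
```

A positive result means the zvol is charged for more than its nominal size; a result near zero or negative says nothing either way, because a zvol that is not fully written will show less.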



--
Stuart Anderson  ander...@ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson






Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Richard Elling

On Oct 20, 2009, at 10:24 PM, Frédéric VANNIERE wrote:

The ZIL is a write-only log that is only read after a power failure.  
Several GB is large enough for most workloads.


You can't use the Intel X25-E because it has a 32 or 64 MB volatile  
cache that can be neither disabled nor flushed by ZFS.


I am surprised by this assertion and cannot find any confirmation from Intel.
Rather, the cache flush command is specifically mentioned as supported in
Section 6.1.1 of the Intel X25-E SATA Solid State Drive Product Manual.
http://download.intel.com/design/flash/nand/extreme/extreme-sata-ssd-datasheet.pdf

I suspect that this is confusion relating to the various file systems, OSes,
or virtualization platforms which may or may not by default ignore cache
flushes.

Since NTFS uses the cache flush commands, I would be very surprised if
Intel would intentionally ignore it.

Imagine your server has a power failure while writing data to the  
pool. In normal situation, with ZIL on a reliable device, ZFS will  
read the ZIL and come back to a stable state at reboot. You may have  
lost some data (30 seconds) but the zpool works.   With the Intel  
X25-E as ZIL some log data has been lost with the power failure  
(32/64MB max) which lead to a corrupted log and so ... you loose  
your zpool and all your data !!


The ZIL works fine for devices which support the cache flush command.
 -- richard



Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Bob Friesenhahn

On Wed, 21 Oct 2009, Marc Bevand wrote:


Bob Friesenhahn bfriesen at simple.dallas.tx.us writes:

[...]
X25-E's write cache is volatile), the X25-E has been found to offer a
bit more than 1000 write IOPS.


I think this is incorrect. On the paper the X25-E offers 3300 random write
4kB IOPS (and Intel is known to be very conservative about the IOPS perf
numbers they publish). Dumb storage IOPS benchmark tools that don't issue
parallel I/O ops to the drive tend to report numbers less than half the
theoretical IOPS. This would explain why you see only 1000 IOPS.


The Intel specified random write IOPS are with the cache enabled and 
without cache flushing.  They also carefully only use a limited span 
of the device, which fits most perfectly with how the device is built. 
There is no mention of burning in the device for a few days to make 
sure that it is in a useful state.  In order for the test to be 
meaningful, the device needs to be loaded up for a while before taking 
any measurements.


Device performance should be specified as a minimum assured level of 
performance and not as meaningless peak (up to) values.  I repeat: 
peak values are meaningless.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread David Dyer-Bennet

On Wed, October 21, 2009 12:21, Bob Friesenhahn wrote:


 Device performance should be specified as a minimum assured level of
 performance and not as meaningless peak (up to) values.  I repeat:
 peak values are meaningless.

Seems a little pessimistic to me.  Certainly minimum assured values are
the basic thing people need to know, but reasonably characterized peak
values can be valuable, if the conditions yielding them match possible
application usage patterns.

The obvious example in electrical wiring is that the startup surge of
motors and the short-term over-current potential of circuit breakers
actually match each other fairly well, so that most saws (for example)
that can run comfortably on a given circuit can actually be *started* on
that circuit. Peak performance can have practical applications!

Certainly a really carefully optimized peak will almost certainly NOT
represent a useful possible performance level, and they should always be
considered meaningless until you've really proven otherwise.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info



Re: [zfs-discuss] Strange problem with liveupgrade on zfs (10u7 and u8)

2009-10-21 Thread Enda O'Connor

Hi
This will boot ok in my opinion, not seeing any issues there.

Enda
Mark Horstman wrote:

more input:

# lumount foobar /mnt
/mnt

# cat /mnt/etc/vfstab
# cat /mnt/etc/vfstab
#live-upgrade:Wed Oct 21 09:36:20 CDT 2009 updated boot environment foobar
#device                    device   mount              FS       fsck  mount    mount
#to mount                  to fsck  point              type     pass  at boot  options
#
fd                         -        /dev/fd            fd       -     no       -
/proc                      -        /proc              proc     -     no       -
/dev/zvol/dsk/rpool/swap   -        -                  swap     -     no       -
/devices                   -        /devices           devfs    -     no       -
sharefs                    -        /etc/dfs/sharetab  sharefs  -     no       -
ctfs                       -        /system/contract   ctfs     -     no       -
objfs                      -        /system/object     objfs    -     no       -
swap                       -        /tmp               tmpfs    -     yes      -
rpool/ROOT/foobar          -        /                  zfs      1     no       -


So I'm guessing the '/' entry has to be removed.
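
For anyone who wants to check a mounted BE's vfstab for a root entry without
eyeballing it, here is a minimal sketch in Python. It assumes the standard
seven whitespace-separated vfstab fields; the function names and the SAMPLE
text are illustrative only and are not part of Live Upgrade. It only reports
the entry; whether to remove it is a separate decision.

```python
# Sketch: find root ('/') entries in a Solaris-style vfstab.
# Fields per line: device to mount, device to fsck, mount point,
# FS type, fsck pass, mount at boot, mount options.

FIELDS = ("device", "fsck_device", "mount_point", "fs_type",
          "fsck_pass", "at_boot", "options")


def parse_vfstab(text):
    """Yield one dict per non-comment, non-blank vfstab line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # skip malformed lines rather than guess
        yield dict(zip(FIELDS, parts))


def root_entries(text):
    """Return the entries whose mount point is '/'."""
    return [e for e in parse_vfstab(text) if e["mount_point"] == "/"]


# Hypothetical excerpt modelled on the vfstab shown above.
SAMPLE = """\
#device device mount FS fsck mount mount
fd - /dev/fd fd - no -
swap - /tmp tmpfs - yes -
rpool/ROOT/foobar - / zfs 1 no -
"""

found = root_entries(SAMPLE)
for entry in found:
    print(entry["device"], entry["fs_type"])
```

To use it against a real file, read the mounted BE's vfstab (for example
/mnt/etc/vfstab after lumount) and pass its contents to root_entries().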




Re: [zfs-discuss] Strange problem with liveupgrade on zfs (10u7 and u8)

2009-10-21 Thread Enda O'Connor

Mark Horstman wrote:

Then why the warning on the lucreate. It hasn't done that in the past.

This is from the vfstab processing code in ludo.c; in your case it is not
causing any issue, but it shall be fixed.


Enda


Mark

On Oct 21, 2009, at 12:41 PM, Enda O'Connor enda.ocon...@sun.com wrote:


Hi
This will boot ok in my opinion, not seeing any issues there.

Enda
Mark Horstman wrote:

more input:
# lumount foobar /mnt
/mnt
# cat /mnt/etc/vfstab
# cat /mnt/etc/vfstab
#live-upgrade:Wed Oct 21 09:36:20 CDT 2009 updated boot environment foobar
#device                    device   mount              FS       fsck  mount    mount
#to mount                  to fsck  point              type     pass  at boot  options
#
fd                         -        /dev/fd            fd       -     no       -
/proc                      -        /proc              proc     -     no       -
/dev/zvol/dsk/rpool/swap   -        -                  swap     -     no       -
/devices                   -        /devices           devfs    -     no       -
sharefs                    -        /etc/dfs/sharetab  sharefs  -     no       -
ctfs                       -        /system/contract   ctfs     -     no       -
objfs                      -        /system/object     objfs    -     no       -
swap                       -        /tmp               tmpfs    -     yes      -
rpool/ROOT/foobar          -        /                  zfs      1     no       -
So I'm guessing the '/' entry has to be removed.






Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Bob Friesenhahn

On Wed, 21 Oct 2009, David Dyer-Bennet wrote:


Device performance should be specified as a minimum assured level of
performance and not as meaningless peak (up to) values.  I repeat:
peak values are meaningless.


Seems a little pessimistic to me.  Certainly minimum assured values are
the basic thing people need to know, but reasonably characterized peak
values can be valuable, if the conditions yielding them match possible
application usage patterns.


Agreed. It is useful to know minimum, median, and peak values.  If 
there is a peak, it is useful to know how long that peak may be 
sustained. Intel's specifications have not characterized the actual 
performance of the device at all.
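
As a rough illustration of the kind of characterization being asked for, the
sketch below summarizes a series of IOPS samples as minimum, median, peak,
and how long the device stayed near its peak. The trace values, the 90%
threshold, and the function name are all made up for the example; they are
not measurements of any real device.

```python
import statistics


def summarize(iops_samples, near_peak=0.9):
    """Return min, median, peak, and the longest run of samples near the peak."""
    peak = max(iops_samples)
    threshold = near_peak * peak
    longest = run = 0
    for sample in iops_samples:
        # Count consecutive samples at or above the threshold.
        run = run + 1 if sample >= threshold else 0
        longest = max(longest, run)
    return {
        "min": min(iops_samples),
        "median": statistics.median(iops_samples),
        "peak": peak,
        "longest_near_peak_run": longest,
    }


# Hypothetical one-sample-per-second trace: a short burst, then steady state.
trace = [3300, 3250, 3100, 1200, 1100, 1050, 1000, 1000, 1020, 990]
summary = summarize(trace)
print(summary)
```

With a trace like this one, the peak describes only the first few seconds,
while the median and minimum are much closer to what a sustained workload
would see.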


The performance characteristics of rotating media are well understood 
since they have been observed for tens of years.  From this we already 
know that the peak performance of a hard drive does not have much to 
do with its steady-state performance since the peak performance is 
often defined by the hard drive cache size and the interface type and 
clock rate.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread David Dyer-Bennet

On Wed, October 21, 2009 12:53, Bob Friesenhahn wrote:
 On Wed, 21 Oct 2009, David Dyer-Bennet wrote:

 Device performance should be specified as a minimum assured level of
 performance and not as meaningless peak (up to) values.  I repeat:
 peak values are meaningless.

 Seems a little pessimistic to me.  Certainly minimum assured values are
 the basic thing people need to know, but reasonably characterized peak
 values can be valuable, if the conditions yielding them match possible
 application usage patterns.

 Agreed. It is useful to know minimum, median, and peak values.  If
 there is a peak, it is useful to know how long that peak may be
 sustained. Intel's specifications have not characterized the actual
 performance of the device at all.

And just a random number labeled as peak really IS meaningless, yes.

 The performance characteristics of rotating media are well understood
 since they have been observed for tens of years.  From this we already
 know that the peak performance of a hard drive does not have much to
 do with its steady-state performance since the peak performance is
 often defined by the hard drive cache size and the interface type and
 clock rate.

It strikes me that disks have been developing rather too independently of,
and sometimes in conflict with, requirements for reliable interaction with
the filesystems in various OSes.  Things like power-dependent write
caches.  Boosts peak write but not sustained write, which is probably
benchmark-friendly, AND introduces the problem of writes committed to the
drive not being safe in a power failure.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info



Re: [zfs-discuss] Strange problem with liveupgrade on zfs (10u7 and u8)

2009-10-21 Thread Enda O'Connor

Hi
Yes, sorry: remove that line from the vfstab in the new BE.

Enda
Mark wrote:
Ok. Thanks. Why does '/' show up in the newly created /BE/etc/vfstab but 
not in the current /etc/vfstab? Should '/' be in the /BE/etc/vfstab?


btw, thank you for responding so quickly to this.

Mark

On Wed, Oct 21, 2009 at 12:49 PM, Enda O'Connor enda.ocon...@sun.com wrote:


Mark Horstman wrote:

Then why the warning on the lucreate. It hasn't done that in the
past.

this is from the vfstab processing code in ludo.c, in your case not
causing any issue, but shall be fixed.

Enda


Mark

On Oct 21, 2009, at 12:41 PM, Enda O'Connor
enda.ocon...@sun.com wrote:

Hi
This will boot ok in my opinion, not seeing any issues there.

Enda
Mark Horstman wrote:

more input:
# lumount foobar /mnt
/mnt
# cat /mnt/etc/vfstab
# cat /mnt/etc/vfstab
#live-upgrade:Wed Oct 21 09:36:20 CDT 2009 updated boot environment foobar
#device                    device   mount              FS       fsck  mount    mount
#to mount                  to fsck  point              type     pass  at boot  options
#
fd                         -        /dev/fd            fd       -     no       -
/proc                      -        /proc              proc     -     no       -
/dev/zvol/dsk/rpool/swap   -        -                  swap     -     no       -
/devices                   -        /devices           devfs    -     no       -
sharefs                    -        /etc/dfs/sharetab  sharefs  -     no       -
ctfs                       -        /system/contract   ctfs     -     no       -
objfs                      -        /system/object     objfs    -     no       -
swap                       -        /tmp               tmpfs    -     yes      -
rpool/ROOT/foobar          -        /                  zfs      1     no       -

So I'm guessing the '/' entry has to be removed.








[zfs-discuss] importing pool with missing/failed log device

2009-10-21 Thread Paul B. Henson

I've had a case open for a while (SR #66210171) regarding the inability to
import a pool whose log device failed while the pool was off line.

I was told this was CR #6343667, which was supposedly fixed in patches
141444-09/141445-09. However, I recently upgraded a system to U8 which
includes that kernel patch, and still am unable to import a pool with a
failed log device:

r...@ike ~ # zpool import
  pool: export
    id: 4066329346842580031
 state: UNAVAIL
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing
        devices and try again.
   see: http://www.sun.com/msg/ZFS-8000-6X
config:

        export      UNAVAIL  missing device
          mirror    ONLINE
            c0t0d0  ONLINE
            c1t0d0  ONLINE
        [...]
        Additional devices are known to be part of this pool, though their
        exact configuration cannot be determined.

I have not as yet updated the pool to the new version included in U8, but I
was not told that was a prerequisite to availing of the fix.

Is this issue supposed to have been fixed by that CR, or did that resolve
some other issue and I was misinformed on my support ticket?

Any information appreciated, thanks...


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768


Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Paul B. Henson
On Tue, 20 Oct 2009, Frédéric VANNIERE wrote:

 You can't use the Intel X25-E because it has a 32 or 64 MB volatile cache
 that can't be disabled neither flushed by ZFS.

Say what? My understanding is that the officially supported Sun SSD for the
x4540 is an OEM'd Intel X25-E, so I don't see how it could not be a good
slog device.


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768


[zfs-discuss] Exported zpool cannot be imported or deleted.

2009-10-21 Thread Stacy Maydew
I have an exported zpool that had several drives incur errors at the same time 
and as a result became unusable.  The pool was exported at the time the drives 
had problems and now I can't find a way to either delete or import the pool.

I've tried relabeling the disks and using dd to write several MB of zero data 
to partition zero of every disk in the pool but the command zpool import 
still shows the pool as available and all drives online. 

When I attempt to import the pool using zpool import -f zpool1, the system 
panics.

Is there any way possible short of reloading the OS and completely reformatting 
the drives that I can get rid of this zpool?

Thanks,

Stacy Maydew
stacy.may...@sun.com


Re: [zfs-discuss] Sun Flash Accelerator F20

2009-10-21 Thread Dupuy, Robert
My take on the responses I've received the last days, is that it isn't
genuine.


From: Tim Cook [mailto:t...@cook.ms] 
Sent: 2009-10-20 20:57
To: Dupuy, Robert
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Sun Flash Accelerator F20

 

On Tue, Oct 20, 2009 at 3:58 PM, Robert Dupuy rdu...@umpublishing.org
wrote:

there is no consistent latency measurement in the industry

You bring up an important point, as did another poster earlier
in the thread, and certainly its an issue that needs to be addressed.


I'd be surprised if anyone could answer such a question while
simultaneously being credible.


http://download.intel.com/design/flash/nand/extreme/extreme-sata-ssd-product-brief.pdf

Intel:  X-25E read latency 75 microseconds

http://www.sun.com/storage/disk_systems/sss/f5100/specs.xml

Sun:  F5100 read latency 410 microseconds

http://www.fusionio.com/PDFs/Data_Sheet_ioDrive_2.pdf

Fusion-IO:  read latency less than 50 microseconds

Fusion-IO lists theirs as .05ms


I find the latency measures to be useful.

I know it isn't perfect, and I agree benchmarks can be
deceiving, heck I criticized one vendors benchmarks in this thread
already :)

But, I did find, that for me, I just take a very simple, single
thread, read as fast you can approach, and get the # of random access
per second, as one type of measurement, that gives you some data, on the
raw access ability of the drive.

No doubt in some cases, you want to test multithreaded IO too,
but my application is very latency sensitive, so this initial test was
telling.

As I got into the actual performance of my app, the lower
latency drives, performed better than the higher latency drives...all of
this was on SSD.

(I did not test the F5100 personally, I'm talking about the SSD
drives that I did test).

So, yes, SSD and HDD are different, but latency is still
important.



Timeout, rewind, etc.  What workload do you have that 410microsecond
latency is detrimental?  More to the point, what workload do you have
that you'd rather have 5microsecond latency with 1/10th the IOPS?
Whatever it is, I've never run across such a workload in the real world.
It sounds like you're comparing paper numbers for the sake of
comparison, rather than to solve a real-world problem...

BTW, latency does not give you # of random access per second.
5microsecond latency for one access != # of random access per second,
sorry.
--Tim 



Re: [zfs-discuss] Sun Flash Accelerator F20

2009-10-21 Thread Dupuy, Robert
I've already explained how you can scale up IOP #'s and unless that is
your real workload, you won't see that in practice.

See, running a high # of parallel jobs spread evenly across.

I don't find the conversation genuine, so I'm not going to continue it.


-Original Message-
From: Richard Elling [mailto:richard.ell...@gmail.com] 
Sent: 2009-10-20 16:39
To: Dupuy, Robert
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Sun Flash Accelerator F20

On Oct 20, 2009, at 1:58 PM, Robert Dupuy wrote:

 there is no consistent latency measurement in the industry

 You bring up an important point, as did another poster earlier in  
 the thread, and certainly its an issue that needs to be addressed.

 I'd be surprised if anyone could answer such a question while  
 simultaneously being credible.


http://download.intel.com/design/flash/nand/extreme/extreme-sata-ssd-product-brief.pdf

 Intel:  X-25E read latency 75 microseconds

... but they don't say where it was measured or how big it was...

 http://www.sun.com/storage/disk_systems/sss/f5100/specs.xml

 Sun:  F5100 read latency 410 microseconds

... for 1M transfers... I have no idea what the units are, though... bytes?

 http://www.fusionio.com/PDFs/Data_Sheet_ioDrive_2.pdf

 Fusion-IO:  read latency less than 50 microseconds

 Fusion-IO lists theirs as .05ms

...at the same time they quote 119,790 IOPS @ 4KB.  By my calculator,
that is 8.3 microseconds per IOP, so clearly the latency itself doesn't
have a direct impact on IOPs.

 I find the latency measures to be useful.

Yes, but since we are seeing benchmarks showing 1.6 MIOPS (mega-IOPS :-)
on a system which claims 410 microseconds of latency, it really isn't
clear to me how to apply the numbers to capacity planning. To wit, there
is some limit to the number of concurrent IOPS that can be processed per
device, so do I need more devices, faster devices, or devices which can
handle more concurrent IOPS?
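
The arithmetic behind these figures can be sketched with Little's law
(average I/Os in flight = IOPS x latency). This is only a rough illustration
using the vendor numbers quoted in this thread; it assumes each latency and
IOPS pair describes the same steady state, which the spec sheets do not
confirm. The function names are made up for the example.

```python
# Little's law applied to storage: I/Os in flight = IOPS * latency.
# All inputs are vendor figures quoted in the thread, not measurements.


def per_io_interval_us(iops):
    """Average time between completions, in microseconds, at a given IOPS."""
    return 1e6 / iops


def outstanding_ios(iops, latency_us):
    """Average number of I/Os in flight needed to sustain iops at latency_us."""
    return iops * latency_us / 1e6


def single_thread_iops(latency_us):
    """Upper bound on IOPS when exactly one I/O is in flight at a time."""
    return 1e6 / latency_us


fusion_interval = per_io_interval_us(119_790)   # time between completions
fusion_depth = outstanding_ios(119_790, 50)     # I/Os in flight at 50 us
f5100_depth = outstanding_ios(1_600_000, 410)   # I/Os in flight at 410 us
x25e_qd1 = single_thread_iops(75)               # one-at-a-time ceiling at 75 us

print(f"Fusion-IO: {fusion_interval:.1f} us between completions, "
      f"about {fusion_depth:.1f} I/Os in flight")
print(f"F5100 system: about {f5100_depth:.0f} I/Os in flight")
print(f"X25-E, one I/O at a time: about {x25e_qd1:.0f} IOPS")
```

Read this way, a quoted latency bounds what a single synchronous thread can
achieve, while the headline IOPS number says how much concurrency the device
(or set of devices) can absorb. Which of the two matters depends on how many
I/Os the application keeps outstanding.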

 I know it isn't perfect, and I agree benchmarks can be deceiving,  
 heck I criticized one vendors benchmarks in this thread already :)

 But, I did find, that for me, I just take a very simple, single  
 thread, read as fast you can approach, and get the # of random  
 access per second, as one type of measurement, that gives you some  
 data, on the raw access ability of the drive.

 No doubt in some cases, you want to test multithreaded IO too, but  
 my application is very latency sensitive, so this initial test was  
 telling.

cool.

 As I got into the actual performance of my app, the lower latency  
 drives, performed better than the higher latency drives...all of  
 this was on SSD.

Note: the F5100 has SAS expanders which add latency.
  -- richard

 (I did not test the F5100 personally, I'm talking about the SSD  
 drives that I did test).

 So, yes, SSD and HDD are different, but latency is still important.




Re: [zfs-discuss] Sun Flash Accelerator F20

2009-10-21 Thread Dupuy, Robert
 This is one of the skimpiest specification sheets that I have ever 
seen for an enterprise product.

At least it shows the latency.

This is some kind of technology cult I've wandered into.


I won't respond further.

-Original Message-
From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us] 
Sent: 2009-10-20 21:54
To: Richard Elling
Cc: Dupuy, Robert; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Sun Flash Accelerator F20

On Tue, 20 Oct 2009, Richard Elling wrote:
 
 Intel:  X-25E read latency 75 microseconds

 ... but they don't say where it was measured or how big it was...

Probably measured using a logic analyzer and measuring the time from 
the last bit of the request going in, to the first bit of the response 
coming out.  It is not clear if this latency is a minimum, maximum, 
median, or average.  It is not clear if this latency is while the 
device is under some level of load, or if it is in a quiescent state.

This is one of the skimpiest specification sheets that I have ever 
seen for an enterprise product.

 Sun:  F5100 read latency 410 microseconds

 ... for 1M transfers... I have no idea what the units are, though... bytes?

Sun's testing is likely done while attached to a system and done with 
some standard loading factor rather than while in a quiescent state.

 ...at the same time they quote 119,790 IOPS @ 4KB.  By my calculator,
 that is 8.3 microseconds per IOP, so clearly the latency itself doesn't
 have a direct impact on IOPs.

I would be interested to know how many IOPS an OS like Solaris is able 
to push through a single device interface.  The normal driver stack is 
likely limited as to how many IOPS it can sustain for a given LUN 
since the driver stack is optimized for high latency devices like disk 
drives.  If you are creating a driver stack, the design decisions you 
make when requests will be satisfied in about 12ms would be much 
different than if requests are satisfied in 50us.  Limitations of 
existing software stacks are likely reasons why Sun is designing 
hardware with more device interfaces and more independent devices.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us,
http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/



Re: [zfs-discuss] Strange problem with liveupgrade on zfs (10u7 and u8)

2009-10-21 Thread Mark Horstman

Then why the warning on the lucreate. It hasn't done that in the past.

Mark

On Oct 21, 2009, at 12:41 PM, Enda O'Connor enda.ocon...@sun.com  
wrote:



Hi
This will boot ok in my opinion, not seeing any issues there.

Enda
Mark Horstman wrote:

more input:
# lumount foobar /mnt
/mnt
# cat /mnt/etc/vfstab
# cat /mnt/etc/vfstab
#live-upgrade:Wed Oct 21 09:36:20 CDT 2009 updated boot environment foobar
#device                    device   mount              FS       fsck  mount    mount
#to mount                  to fsck  point              type     pass  at boot  options
#
fd                         -        /dev/fd            fd       -     no       -
/proc                      -        /proc              proc     -     no       -
/dev/zvol/dsk/rpool/swap   -        -                  swap     -     no       -
/devices                   -        /devices           devfs    -     no       -
sharefs                    -        /etc/dfs/sharetab  sharefs  -     no       -
ctfs                       -        /system/contract   ctfs     -     no       -
objfs                      -        /system/object     objfs    -     no       -
swap                       -        /tmp               tmpfs    -     yes      -
rpool/ROOT/foobar          -        /                  zfs      1     no       -
So I'm guessing the '/' entry has to be removed.





Re: [zfs-discuss] Strange problem with liveupgrade on zfs (10u7 and u8)

2009-10-21 Thread Mark
Ok. Thanks. Why does '/' show up in the newly created /BE/etc/vfstab but not
in the current /etc/vfstab? Should '/' be in the /BE/etc/vfstab?

btw, thank you for responding so quickly to this.

Mark

On Wed, Oct 21, 2009 at 12:49 PM, Enda O'Connor enda.ocon...@sun.com wrote:

 Mark Horstman wrote:

 Then why the warning on the lucreate. It hasn't done that in the past.

 this is from the vfstab processing code in ludo.c, in your case not causing
 any issue, but shall be fixed.

 Enda


 Mark

 On Oct 21, 2009, at 12:41 PM, Enda O'Connor enda.ocon...@sun.com
 wrote:

 Hi
 This will boot ok in my opinion, not seeing any issues there.

 Enda
 Mark Horstman wrote:

 more input:
 # lumount foobar /mnt
 /mnt
 # cat /mnt/etc/vfstab
 # cat /mnt/etc/vfstab
 #live-upgrade:Wed Oct 21 09:36:20 CDT 2009 updated boot environment foobar
 #device                    device   mount              FS       fsck  mount    mount
 #to mount                  to fsck  point              type     pass  at boot  options
 #
 fd                         -        /dev/fd            fd       -     no       -
 /proc                      -        /proc              proc     -     no       -
 /dev/zvol/dsk/rpool/swap   -        -                  swap     -     no       -
 /devices                   -        /devices           devfs    -     no       -
 sharefs                    -        /etc/dfs/sharetab  sharefs  -     no       -
 ctfs                       -        /system/contract   ctfs     -     no       -
 objfs                      -        /system/object     objfs    -     no       -
 swap                       -        /tmp               tmpfs    -     yes      -
 rpool/ROOT/foobar          -        /                  zfs      1     no       -
 So I'm guessing the '/' entry has to be removed.






Re: [zfs-discuss] Exported zpool cannot be imported or deleted.

2009-10-21 Thread Victor Latushkin

Stacy Maydew wrote:

I have an exported zpool that had several drives incur errors at the same time 
and as a result became unusable.  The pool was exported at the time the drives 
had problems and now I can't find a way to either delete or import the pool.

I've tried relabeling the disks and using dd to write several MB of zero data to partition zero of every disk in the pool but the command zpool import still shows the pool as available and all drives online. 



Write some (e.g. 1M) zeros to the end of disk/partition as well.
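
The reason the end of the device matters can be sketched as follows. This
assumes the standard ZFS on-disk layout of four 256 KiB vdev labels, two at
the front of the device and two at the back, with the usable size rounded
down to a multiple of the label size. The 100 MiB partition size, the 8 MiB
zeroed prefix, and the function names are hypothetical.

```python
# Sketch: where the four ZFS vdev labels sit on a device of a given size,
# and which of them survive when only the start of the device is zeroed.
# Assumes four 256 KiB labels: L0 and L1 at the front, L2 and L3 at the back.

LABEL_SIZE = 256 * 1024   # bytes per label
LABEL_COUNT = 4


def label_offsets(device_bytes):
    """Return the byte offsets of labels L0..L3 for a device of this size."""
    if device_bytes < LABEL_COUNT * LABEL_SIZE:
        raise ValueError("device too small to hold four labels")
    aligned = device_bytes - (device_bytes % LABEL_SIZE)
    return [
        0,
        LABEL_SIZE,
        aligned - 2 * LABEL_SIZE,
        aligned - LABEL_SIZE,
    ]


def surviving_labels(device_bytes, zeroed_prefix_bytes):
    """Indices of labels left intact after zeroing only the first N bytes."""
    return [
        index
        for index, offset in enumerate(label_offsets(device_bytes))
        if offset >= zeroed_prefix_bytes
    ]


# Hypothetical 100 MiB partition with the first 8 MiB zeroed.
size = 100 * 1024 * 1024
offsets = label_offsets(size)
left = surviving_labels(size, 8 * 1024 * 1024)
print("label offsets:", offsets)
print("labels still intact:", left)
```

In this example the two trailing labels are untouched by zeroing the front of
the partition, which would be consistent with zpool import still finding the
pool. Zeroing the last 1M or so covers the two trailing labels as well.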

victor



When I attempt to import the pool using zpool import -f zpool1, the system 
panics.

Is there any way possible short of reloading the OS and completely reformatting 
the drives that I can get rid of this zpool?

Thanks,

Stacy Maydew
stacy.may...@sun.com




Re: [zfs-discuss] Sun Flash Accelerator F20

2009-10-21 Thread Richard Elling

On Oct 21, 2009, at 6:14 AM, Dupuy, Robert wrote:


 This is one of the skimpiest specification sheets that I have ever
seen for an enterprise product.

At least it shows the latency.


STORAGEsearch has been trying to wade through the spec muck
for years.
http://www.storagesearch.com/ssd-fastest.html
 -- richard



Re: [zfs-discuss] Exported zpool cannot be imported or deleted.

2009-10-21 Thread Cindy Swearingen

Hi Stacy,

Can you try to forcibly create a new pool using the devices from
the corrupted pool, like this:

# zpool create -f newpool disk1 disk2 ...

Then, destroy this pool, which will release the devices.

This CR has been filed to help resolve the pool cruft problem:

6893282 Allow the zpool command to wipe labels from disks
(This CR isn't visible from the OpenSolaris bug database yet)

Cindy

On 10/21/09 14:11, Stacy Maydew wrote:

I have an exported zpool that had several drives incur errors at the same time 
and as a result became unusable.  The pool was exported at the time the drives 
had problems and now I can't find a way to either delete or import the pool.

I've tried relabeling the disks and using dd to write several MB of zero data to partition zero of every disk in the pool but the command zpool import still shows the pool as available and all drives online. 


When I attempt to import the pool using zpool import -f zpool1, the system 
panics.

Is there any way possible short of reloading the OS and completely reformatting 
the drives that I can get rid of this zpool?

Thanks,

Stacy Maydew
stacy.may...@sun.com



[zfs-discuss] Disk locating in OpenSolaris/Solaris 10

2009-10-21 Thread SHOUJIN WANG
Hi there,
What I am trying to do is build a NAS storage server based on the following
hardware architecture:
Server--SAS HBA---SAS JBOD
I plugged 2 SAS HBA cards into an x86 box, and I also have 2 SAS I/O modules
on the SAS JBOD. From each HBA card, one SAS cable connects to the SAS JBOD.
After configuring MPT successfully on the server, I can see the multipathed
disks as follows:
r...@super01:~# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
   0. c0t5000C5000D34BEDFd0 SEAGATE-ST31000640SS-0001-931.51GB
  /scsi_vhci/d...@g5000c5000d34bedf
   1. c0t5000C5000D34BF37d0 SEAGATE-ST31000640SS-0001-931.51GB
  /scsi_vhci/d...@g5000c5000d34bf37
   2. c0t5000C5000D34C727d0 SEAGATE-ST31000640SS-0001-931.51GB
  /scsi_vhci/d...@g5000c5000d34c727
   3. c0t5000C5000D34D0C7d0 SEAGATE-ST31000640SS-0001-931.51GB
  /scsi_vhci/d...@g5000c5000d34d0c7
   4. c0t5000C5000D34D85Bd0 SEAGATE-ST31000640SS-0001-931.51GB
  /scsi_vhci/d...@g5000c5000d34d85b

The problem is: if one of the disks fails, I don't know how to locate the
disk in the chassis. This makes failed-disk replacement difficult.

Is there any utility in OpenSolaris which can be used to locate/blink the
failed disk (or do we have any mechanism to issue SES commands in-band over
SAS)? Or do we have a tool to map the multipathing device ID to the original
single-path device IDs, like the following?

 c0t5000C5000D34BF37d0 
   |c2t0d0
\c3t0d0

Regards,
Autumn Wang.


[zfs-discuss] Trouble testing hot spares

2009-10-21 Thread Ian Allison

Hi,

I've been looking at a raidz using opensolaris snv_111b and I've come
across something I don't quite understand. I have 5 disks (fixed size
disk images defined in virtualbox) in a raidz configuration, with 1 disk
marked as a spare. The disks are 100m in size and I wanted to simulate data
corruption on one of them and watch the hot spare kick in, but when I do


dd if=/dev/zero of=/dev/c10t0d0 ibs=1024 count=102400

The pool remains perfectly healthy

  pool: datapool
 state: ONLINE
 scrub: scrub completed after 0h0m with 0 errors on Wed Oct 21 17:12:11 2009
config:

        NAME         STATE     READ WRITE CKSUM
        datapool     ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c10t0d0  ONLINE       0     0     0
            c10t1d0  ONLINE       0     0     0
            c10t2d0  ONLINE       0     0     0
            c10t3d0  ONLINE       0     0     0
        spares
          c10t4d0    AVAIL

errors: No known data errors


I don't understand the output, I thought I should see cksum errors
against c10t0d0. I tried exporting/importing the pool and scrubbing it
in case this was a cache thing, but nothing changes.


I've tried this on all the disks in the pool with the same result and
the datasets in the pool are uncorrupted. I guess I'm misunderstanding
something fundamental about ZFS, can anyone help me out and explain?


-Ian.




Re: [zfs-discuss] Stupid to have 2 disk raidz?

2009-10-21 Thread Marty Scholes
Erik Trimble wrote:
 As always, the devil is in the details. In this case, the primary problem
 I'm having is maintaining two different block mapping schemes (one for the
 old disk layout, and one for the new disk layout) and still being able to
 interrupt the expansion process.  My primary problem is that I have to keep
 both schemes in memory during the migration, and if something should happen
 (i.e. reboot, panic, etc) then I lose the current state of the zpool, and
 everything goes to hell in a handbasket.

It might not be that bad, if only zfs would allow mirroring a raidz pool.  Back 
when I did storage admin for a smaller company where availability was 
hyper-critical (but we couldn't afford EMC/Veritas), we had a hardware RAID5 
array.  After a few years of service, we ran into some problems:
* Need to restripe the array?  Screwed.
* Need to replace the array because current one is EOL?  Screwed.
* Array controller barfed for whatever reason?  Screwed.
* Need to flash the controller with latest firmware?  Screwed.
* Need to replace a component on the array, e.g. NIC, controller or power 
supply?  Screwed.
* Need to relocate the array?  Screwed.

If we could stomach downtime or short-lived storage solutions, none of this 
would have mattered.

To get around this, we took two hardware RAID arrays and mirrored them in 
software.  We could 
offline/restripe/replace/upgrade/relocate/whatever-we-wanted to an individual 
array since it was only a mirror which we could offline/online or detach/attach.

I suspect this could be simulated today by setting up a mirrored pool on top 
of a zvol of a raidz pool.  That involves a lot of overhead, doing 
parity/checksum calculations multiple times for the same data.  On the plus 
side, setting this up might make it possible to defrag a pool.

If zfs simply allowed mirroring one pool with another, then with a few spare 
disks lying around, altering the geometry of an existing pool could be done 
with zero downtime using steps similar to the following.
1. Create spare_pool as large as current_pool using spare disks
2. Attach spare_pool to current_pool
3. Wait for resilver to complete
4. Detach and destroy current_pool
5. Create new_pool the way you want it now
6. Attach new_pool to spare_pool
7. Wait for resilver to complete
8. Detach/destroy spare_pool
9. Chuckle at the fact that you completely remade your production pool while 
fully available

I did this dance several times over the course of many years back in the 
Disksuite days.

Thoughts?

Marty


Re: [zfs-discuss] Trouble testing hot spares

2009-10-21 Thread Richard Elling

On Oct 21, 2009, at 5:18 PM, Ian Allison wrote:


Hi,

I've been looking at a raidz using opensolaris snv_111b and I've  
come across something I don't quite understand. I have 5 disks  
(fixed size disk images defined in virtualbox) in a raidz  
configuration, with 1 disk marked as a spare. The disks are 100m in  
size and I wanted simulate data corruption on one of them and watch  
the hot spare kick in, but when I do


dd if=/dev/zero of=/dev/c10t0d0 ibs=1024 count=102400


Should be: of=/dev/c10t0d0s0
ZFS tries to hide the slice from you, but it really confuses people by
trying to not be confusing.
 -- richard




Re: [zfs-discuss] Disk locating in OpenSolaris/Solaris 10

2009-10-21 Thread Trevor Pretty





Have a look at this thread:
http://mail.opensolaris.org/pipermail/zfs-discuss/2009-September/032349.html

We discussed this a while back.



SHOUJIN WANG wrote:

  Hi there,
What I am trying to do is build a NAS storage server based on the following hardware architecture:
Server--SAS HBA---SAS JBOD
I plugged 2 SAS HBA cards into an x86 box, and I also have 2 SAS I/O modules on the SAS JBOD. From each HBA card, one SAS cable connects to the SAS JBOD. 
Having configured MPT successfully on the server, I can see the multipathed disks, like the following:
r...@super01:~# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
   0. c0t5000C5000D34BEDFd0 SEAGATE-ST31000640SS-0001-931.51GB
  /scsi_vhci/d...@g5000c5000d34bedf
   1. c0t5000C5000D34BF37d0 SEAGATE-ST31000640SS-0001-931.51GB
  /scsi_vhci/d...@g5000c5000d34bf37
   2. c0t5000C5000D34C727d0 SEAGATE-ST31000640SS-0001-931.51GB
  /scsi_vhci/d...@g5000c5000d34c727
   3. c0t5000C5000D34D0C7d0 SEAGATE-ST31000640SS-0001-931.51GB
  /scsi_vhci/d...@g5000c5000d34d0c7
   4. c0t5000C5000D34D85Bd0 SEAGATE-ST31000640SS-0001-931.51GB
  /scsi_vhci/d...@g5000c5000d34d85b

The problem is: if one of the disks fails, I don't know how to locate the disk in the chassis. That makes failed-disk replacement difficult.

Is there any utility in OpenSolaris that can be used to locate/blink the failed disk (or is there any mechanism to issue the SES command in-band over SAS)? Or is there a tool to map the multipathed device ID to the original single-path device IDs, like the following?

 c0t5000C5000D34BF37d0 
   |c2t0d0
\c3t0d0

Regards,
Autumn Wang.
  






Re: [zfs-discuss] Sun Flash Accelerator F20

2009-10-21 Thread Jake Caferilla
Clearly a lot of people don't understand latency, so I'll talk about latency, 
breaking it down into simpler components.

Sometimes it helps to use made-up numbers to simplify a point.

Imagine a non-real system that had these 'ridiculous' performance 
characteristics:

The system has a 60 second (1 minute) read latency.
The system can scale dramatically: it can do 60 billion IOs per minute.

Now some here are arguing about the term latency, but it's rather a simple term.
It simply means the amount of time it takes for data to move from one point to 
another.

And some here have argued there is no good measurement of latency, but that is 
also very simple.
It is measured in time units.

OK, so we have a latency of 1 minute in this 'explanatory' system.

That means that when I issue a read request, the flash takes 1 minute to return 
the requested data to the program.

But remember, this example system has massive parallel scalability.

I issue 2 read requests, both read requests return after 1 minute.
I issue 3 read requests, all 3 return after 1 minute.

I defined this made-up system such that if you issue 60 billion read 
requests, they all return simultaneously after 1 minute.

Let's do some math.

60,000,000,000 divided by 60 seconds: this system does 1 billion IOPS!

Wow, what wouldn't run fast with 1 billion IOPS?

The answer is that most programs would not, not with a latency as high as 
waiting 1 minute for data to return.  Most apps wouldn't run acceptably, not 
at all.

Imagine you are in Windows, or Solaris, or Linux, and every time you needed to 
go to disk, there was a 1 minute wait.  It would be totally unacceptable. 
Despite the IOPS, latency matters.

Certain types of apps wouldn't be latency sensitive; some people would love to 
have this 1 billion IOPS system :)
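
The made-up system above can be put into a few lines of Python. This is only a sketch of the arithmetic in the example; the function name iops and the idea of "requests kept in flight" are illustrative assumptions, not a model of any real device.

```python
# Sketch of the made-up system: every request takes 60 s to return,
# no matter how many requests are in flight at once.

LATENCY_SECONDS = 60.0


def iops(outstanding_requests: int, latency_seconds: float = LATENCY_SECONDS) -> float:
    """Completed I/Os per second when this many requests are always in flight."""
    return outstanding_requests / latency_seconds


# 60 billion requests in flight: the headline throughput figure.
print(f"massively parallel: {iops(60_000_000_000):,.0f} IOPS")

# One request at a time, which is what a program that waits for each read sees.
print(f"one at a time:      {iops(1):.4f} IOPS")
```

The same 60 second latency gives a billion IOPS to a workload that can keep 60 billion requests outstanding, and a small fraction of one IOPS to a program that has to wait for each answer before asking the next question.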

The good news is that the F20 latency, even if we don't know what it is, is 
certainly not 1 minute, and we can speculate that it is much better than 
traditional rotating disks.  But let's blue-sky this and make up a number, say 
.41ms (410 microseconds).  And let's say you have a competitor at .041ms (41 
microseconds).  When would the competitor have a real advantage?  Well, if it 
was an app that issued a read, waited for the results, issued a read, waited 
for the results, and did this, say, 100 million times or so, then yes, that 
low-latency card is going to help accelerate that app.  Computers are fast and 
deal with a lot of data, but in the real world a surprising lot of it doesn't 
scale.  I've seen sales and financial apps do 100 million IOs and more.  Even a 
Sun blogger I read recently did an article about how the F20, compared to 
traditional disks, speeds up PeopleSoft jobs.
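
To put rough numbers on the serial case, here is a minimal sketch using the made-up latencies from the paragraph above (.41ms and .041ms) and 100 million reads. The function name serial_wait_seconds is hypothetical, and the sketch assumes the app does nothing but wait for one read at a time.

```python
# Total time spent waiting when each read is issued only after the
# previous one has returned. Latencies are the made-up figures from the text.

def serial_wait_seconds(num_reads: int, latency_seconds: float) -> float:
    """Wait time for a strictly serial read loop."""
    return num_reads * latency_seconds


NUM_READS = 100_000_000

slower_card = serial_wait_seconds(NUM_READS, 410e-6)  # .41 ms per read
faster_card = serial_wait_seconds(NUM_READS, 41e-6)   # .041 ms per read

print(f".41 ms card:  {slower_card / 3600:.1f} hours of waiting")
print(f".041 ms card: {faster_card / 3600:.1f} hours of waiting")
```

Under these assumptions the difference is roughly 11.4 hours versus 1.1 hours of pure waiting, which is the kind of job where the lower-latency card pays off.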

Flash has lower latency than traditional disks; that's part of what makes it 
competitive. By the same token, flash with lower latency than other flash 
has a competitive advantage.

Some here say latency (that is, wait time) doesn't matter with flash, and that 
latency (waiting) only matters with traditional hard drives.

Uhm, who told you that?  I've never heard someone make that case before, 
anywhere, ever.

And let's give you credit and say you had some minor point to make about HDD and 
flash differences. Still, you are using it in such a way that someone could 
draw the wrong conclusion, so please clarify this point: you are certainly not 
suggesting that higher wait times speed up an application, correct?

Or that the F20's latency cannot impact performance, right?  C'mon, some common 
sense, anyone?


Re: [zfs-discuss] Sun Flash Accelerator F20

2009-10-21 Thread Jake Caferilla
Now let's talk about the 'latency deniers'.

First of all, they say there is no standard measurement of latency.

That isn't complicated.  Sun includes the transfer time in latency figures; 
other companies do not.

Then latency deniers say there is no way to compare the numbers.  That's what 
I'm getting from reading the thread.

Well, take a 4k block size: the transfer time isn't significant.  When someone 
publishes their 'best' latency times, transfer time is not significant.

You CAN compare latency specs.  If a latency denier says you can't, ask 
yourself why.  Is it because they wouldn't be favorable in a latency comparison?

Some say latency is not important in flash.

Well, why would someone say that?  Is it because they don't have good latency 
numbers?

The F20 card has some features to its merit, but it's not a high-performance 
card; it's not in the class of a Fusion-IO or TMS RamSan-20 in terms of IOPS.

However, it has some features you may like.  I understand it's bootable.  I'm 
told it has a supercapacitor.  It's not rotating-hard-drive slow.

Furthermore, if you need high sequential transfers, it seems to have that.  
High-IOPS card?  Well, I do say, watch out for scaled results that look good on 
benchmarks.

Your real world application better run like a benchmark.


Re: [zfs-discuss] Sun Flash Accelerator F20

2009-10-21 Thread Tim Cook
On Wed, Oct 21, 2009 at 9:15 PM, Jake Caferilla j...@tanooshka.com wrote:

 Clearly a lot of people don't understand latency, so I'll talk about
 latency, breaking it down in simpler components.

 Sometimes it helps to use made up numbers, to simplify a point.

 Imagine a non-real system that had these 'ridiculous' performance
 characteristics:

 The system has a 60 second (1 minute) read latency.
 The system can scale dramatically, it can do 60 billion IO's per minute.

 Now some here are arguing about the term latency, but its rather a simple
 term.
 It simply means the amount of time it takes, for data to move from one
 point to another.

 And some here have argued there is no good measurement of latency, but also
 it very simple.
 It is measured in time units.

 OK, so we have a latency of 1 minute, in this 'explanatory' system.

 That means, I issued a read request, the Flash takes 1 minute to return the
 data requested to the program.

 But remember, this example system, has massive parallel scalability.

 I issue 2 read requests, both read requests return after 1 minute.
 I issue 3 read requests, all 3 return after 1 minute.

 I defined this made up system, as one, such that if you issue 60 billion
 read requests, they all return, simultaneously, after 1 minute.

 Let's do some math.

 60,000,000,000 divided by 60 seconds, well this system does 1 billion IOPS!

 Wow, what wouldn't run fast with 1 billion IOPS?

 The answer, is, most programs would not, not with such a high latency as
 waiting 1 minute for data to return.  Most apps wouldn't run acceptably, no
 not at all.

 Imagine you are in Windows, or Solaris, or Linux, and every time you needed
 to go to disk, a 1 minute wait.  Wow, it would be totally unacceptable,
 despite the IOPS, latency matters.

 Certain types of apps wouldn't be latency sensitive, some people would love
 to have this 1 billion IOPs system :)

 The good news is, the F20 latency, even if we don

flash has lower latency than traditional disks, that's part of what makes it
 competitive...and by the same token, flash with lower latency than other
 flash, has a competitive advantage.

 Some here say latency (that wait times) doesn't matter with flash.  That
 latency (waiting) only matters with traditional hard drives.

 Uhm, who told you that?  I've never heard someone make that case before,
 anywhere, ever.

 And lets give you credit and say you had some minor point to make about hdd
 and flash differences...still you are using it in such a way, that someone
 could draw the wrong conclusion, so. clarify this point, you are
 certainly not suggesting that higher wait times speeds up an application,
 correct?

 Or that the F20's latency cannot impact performace, right?  C'mon, some
 common sense? anyone?


Yet again, you're making up situations on paper.  We're dealing with the
real world, not theory.  So please, describe the electronics that have been
invented that can somehow take in 1 billion IO requests, process them, have a
memory back end that can return them, but do absolutely nothing with them
for a full minute.  Even if you scale those numbers down, your theory is
absolutely ridiculous.

Of course, you also failed to address the other issue.  How exactly does a
drive have .05ms response time, yet only provide 500 IOPS?  It's IMPOSSIBLE
for those numbers to work out.
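
The mismatch between those two figures can be checked with a couple of lines of arithmetic. The sketch below assumes one request in flight at a time and ignores host-side overhead; both function names are hypothetical.

```python
# Relating response time and IOPS for a device servicing a fixed number
# of requests at a time (queue_depth=1 means strictly one after another).

def iops_from_latency(latency_ms: float, queue_depth: int = 1) -> float:
    """Requests completed per second if each one takes latency_ms."""
    return queue_depth * 1000.0 / latency_ms


def latency_from_iops(iops: float, queue_depth: int = 1) -> float:
    """Per-request time in milliseconds implied by an IOPS figure."""
    return queue_depth * 1000.0 / iops


print(f".05 ms response time -> {iops_from_latency(0.05):,.0f} IOPS")
print(f"500 IOPS             -> {latency_from_iops(500):.1f} ms per request")
```

With one request at a time, a .05ms response time corresponds to about 20,000 IOPS, while 500 IOPS corresponds to 2 ms per request, so the two quoted figures differ by a factor of 40.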

But hey, let's ignore reality and just go with vendor numbers.

--Tim


Re: [zfs-discuss] Setting up an SSD ZIL - Need A Reality Check

2009-10-21 Thread Marc Bevand
Bob Friesenhahn bfriesen at simple.dallas.tx.us writes:
 
 The Intel specified random write IOPS are with the cache enabled and 
 without cache flushing.

For random write I/O, caching improves I/O latency, not sustained I/O 
throughput (which is what random write IOPS usually refer to). So Intel can't 
cheat with caching. However, they can cheat by benchmarking a brand new drive 
instead of an aged one.

 They also carefully only use a limited span 
 of the device, which fits most perfectly with how the device is built. 

AFAIK, for the X25-E series, they benchmark random write IOPS on a 100% span. 
You may be confusing it with the X25-M series, for which they clearly 
disclose two performance numbers: 350 random write IOPS on 8GB span, and 3.3k 
on 100% span. See 
http://www.intel.com/cd/channel/reseller/asmo-na/eng/products/nand/tech/425265.htm

I agree with the rest of your email.

-mrb



Re: [zfs-discuss] Trouble testing hot spares

2009-10-21 Thread Victor Latushkin

On Oct 22, 2009, at 4:18, Ian Allison i...@pims.math.ca wrote:


Hi,

I've been looking at a raidz using opensolaris snv_111b and I've  
come across something I don't quite understand. I have 5 disks  
(fixed size disk images defined in virtualbox) in a raidz  
configuration, with 1 disk marked as a spare. The disks are 100m in  
size and I wanted simulate data corruption on one of them and watch  
the hot spare kick in, but when I do


dd if=/dev/zero of=/dev/c10t0d0 ibs=1024 count=102400

The pool remains perfectly healthy


Try of=/dev/rdsk/c10t0d0s0 and see what happens

Victor




[zfs-discuss] strange results ...

2009-10-21 Thread Jens Elkner
Hmmm,

wondering about IMHO strange ZFS results ...

X4440:  4x6 2.8GHz cores (Opteron 8439 SE), 64 GB RAM
6x Sun STK RAID INT V1.0 (Hitachi H103012SCSUN146G SAS)
Nevada b124

Started with a simple test using zfs on c1t0d0s0: cd /var/tmp

(1) time sh -c 'mkfile 32g bla ; sync' 
0.16u 19.88s 5:04.15 6.5%
(2) time sh -c 'mkfile 32g blabla ; sync'
0.13u 46.41s 5:22.65 14.4%
(3) time sh -c 'mkfile 32g blablabla ; sync'
0.19u 26.88s 5:38.07 8.0%

chmod 644 b*
(4) time dd if=bla of=/dev/null bs=128k
262144+0 records in
262144+0 records out
0.26u 25.34s 6:06.16 6.9%
(5) time dd if=blabla of=/dev/null bs=128k
262144+0 records in
262144+0 records out
0.15u 26.67s 4:46.63 9.3%
(6) time dd if=blablabla of=/dev/null bs=128k
262144+0 records in
262144+0 records out
0.10u 20.56s 0:20.68 99.9%

So 1-3 is more or less expected (~97..108 MB/s write).
However 4-6 looks strange: 89, 114 and 1585 MB/s read!
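
For reference, the MB/s figures follow directly from the file size and the elapsed times printed by time. A minimal sketch of that calculation, assuming the 32g file is 32 * 1024 MB (which matches the 262144 records of 128k reported by dd); the helper name elapsed_seconds is hypothetical:

```python
# Recompute throughput for runs (1)-(6) from the elapsed times shown above.

FILE_MB = 32 * 1024  # mkfile 32g, i.e. 262144 records of 128k


def elapsed_seconds(stamp: str) -> float:
    """Convert an elapsed time such as '5:04.15' (min:sec) to seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


runs = {
    "(1) write bla":       "5:04.15",
    "(2) write blabla":    "5:22.65",
    "(3) write blablabla": "5:38.07",
    "(4) read bla":        "6:06.16",
    "(5) read blabla":     "4:46.63",
    "(6) read blablabla":  "0:20.68",
}

for label, stamp in runs.items():
    print(f"{label:22s} {FILE_MB / elapsed_seconds(stamp):7.0f} MB/s")
```

This reproduces the ~97..108 MB/s for the writes and 89, 114 and 1585 MB/s for the reads.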

Since the ARC size is ~55+-2GB (at least arcstat.pl says so), I guess (6)
reads from memory completely. Hmm - maybe.
However, I would expect that when repeating 5-6, 'blablabla' gets replaced
by 'bla' or 'blabla'. But the numbers say that 'blablabla' is kept in the
cache, since I get almost the same results as in the first run (and zpool
iostat/arcstat.pl show almost no activity at all for blablabla).
So is this a ZFS bug? Or does the OS do some magic here?

2nd)
Never had a Sun STK RAID INT before. Actually my intention was to
create a zpool mirror of sd0 and sd1 for boot and logs, and a 2x2-way 
zpool mirror with the 4 remaining disks. However, the controller seems
not to support JBODs :( - which is also bad, since we can't simply put
those disks into another machine with a different controller without
data loss, because the controller seems to use its own format under the
hood.  Also the 256MB BBCache seems to be a little bit small for ZIL
even if one would know, how to configure it ...

So what would you recommend? Creating 2 appropriate STK INT arrays
and using both as a single zpool device, i.e. without ZFS mirror devs
and 2nd copies? 

Intended workload is MySQL DBs + VBox images on the 4 disk *mirror,
logs and OS on the 2 disk *mirror, and the box should also act as a Sun Ray
server (user homes and additional apps are coming from another server via NFS).

Any hints?

Regards,
jel.
-- 
Otto-von-Guericke University http://www.cs.uni-magdeburg.de/
Department of Computer Science   Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany Tel: +49 391 67 12768