Re: [zfs-discuss] ZFS Distro Advice
We do the same for all of our legacy operating system backups. Take a snapshot, then do an rsync — an excellent way of maintaining incremental backups for those. Magic rsync options used: -a --inplace --no-whole-file --delete-excluded. This causes rsync to overwrite the file blocks in place rather than writing to a new temporary file first. As a result, ZFS COW produces primitive deduplication of at least the unchanged blocks (by writing nothing) while writing new COW blocks for the changed blocks.

If I understand your use case correctly (the application overwrites some blocks with the exact same contents), ZFS will ignore these no-op writes only on recent OpenZFS (illumos / FreeBSD / Linux) builds with checksum=sha256 and compression!=off. AFAIK, Solaris ZFS will COW the blocks even if their content is identical to what's already there, causing the snapshots to diverge. See https://www.illumos.org/issues/3236 for details.

I think he meant to rely on rsync here to do in-place updates of files, and only for changed blocks, with the above parameters (using rsync's own delta mechanism). So if you have a file and only one block has changed, rsync will overwrite only that single block on the destination.

This is interesting. I didn't know about it. Is there an option similar to verify=on in dedup, or does it just assume that the checksum is your data?

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
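A minimal sketch of one such backup cycle, assuming the rsync flags above and a hypothetical backup dataset/host name:

    # zfs snapshot pool/backup/legacyhost@$(date +%Y-%m-%d)
    # rsync -a --inplace --no-whole-file --delete-excluded legacyhost:/export/ /backup/legacyhost/

The snapshot preserves the previous run's state, and because --inplace and --no-whole-file make rsync rewrite only the changed blocks of existing files, each snapshot keeps referencing the untouched blocks from earlier runs, so the space cost per run is roughly the size of the changed data.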
Re: [zfs-discuss] ZFS Distro Advice
Solaris 11.1 (free for non-prod use).

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Tiernan OToole
Sent: 25 February 2013 14:58
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] ZFS Distro Advice

Good morning all. My home NAS died over the weekend, and it leaves me with a lot of spare drives (five 2TB and three 1TB disks). I have a Dell PowerEdge 2900 server sitting in the house, which has not been doing much over the last while (bought it a few years back with the intent of using it as a storage box, since it has 8 hot-swap drive bays), and I am now looking at building the NAS using ZFS... But now I am confused as to what OS to use... OpenIndiana? Nexenta? FreeNAS/FreeBSD? I need something that will allow me to share files over SMB (3 if possible), NFS, AFP (for Time Machine) and iSCSI. Ideally, I would like something I can manage easily and something that works with the Dell... Any recommendations? Any comparisons between them? Thanks.

-- Tiernan O'Toole blog.lotas-smartman.net www.geekphotographer.com www.tiernanotoole.ie

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Distro Advice
Robert Milkowski wrote: Solaris 11.1 (free for non-prod use). But a ticking bomb if you use a cache device. It's been fixed in SRU (although this is only for customers with a support contract - still, will be in 11.2 as well). Then, I'm sure there are other bugs which are fixed in S11 and not in Illumos (and vice-versa). -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
It also has a lot of performance improvements and general bug fixes in the Solaris 11.1 release.

Performance improvements such as?

Dedup'ed ARC, for one. The all-zero block is automatically dedup'ed in memory. Improvements to ZIL performance. Zero-copy zfs+nfs+iscsi ...

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
From: Richard Elling
Sent: 21 January 2013 03:51

VAAI has 4 features, 3 of which have been in illumos for a long time. The remaining feature (SCSI UNMAP) was done by Nexenta and exists in their NexentaStor product, but the CEO made a conscious (and unpopular) decision to keep that code from the community. Over the summer, another developer picked up the work in the community, but I've lost track of the progress and haven't seen an RTI yet.

That is one thing that has always bothered me... so it is OK for others, like Nexenta, to keep stuff closed and not in the open, while if Oracle does it they are bad? Isn't that at least a little bit hypocritical? (bashing Oracle while doing sort of the same)

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] poor CIFS and NFS performance
Personally, I'd recommend putting a standard Solaris fdisk partition on the drive and creating the two slices under that. Why? In most cases giving zfs an entire disk is the best option. I wouldn't bother with any manual partitioning. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris 11 System Reboots Continuously Because of a ZFS-Related Panic (7191375)
Illumos is not so good at dealing with huge memory systems, but perhaps it is also more stable as well.

Well, I guess that it depends on your environment, but generally I would expect S11 to be more stable, if only because of the sheer number of bugs reported by paid customers and bug fixes made by Oracle that Illumos is not getting (lack of resources, limited usage, etc.).

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Appliance as a general-purpose server question
I am in the market for something newer than that, though. Anyone know what HP's using as a replacement for the DL320s? I have no idea... but they have dl380 Gen8 with a disk plane supporting 25x 2.5 disks (all in front), and it is Sandy Bridge based. Oracle/Sun have X3-2L - 24x 2.5 disks in front, another 2x 2.5 in rear, Sandy Bridge as well. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Appliance as a general-purpose server question
So, the only supported (or even possible) way is indeed to use it as a NAS for file or block I/O from another head running the database or application servers?..

Technically speaking you can get access to a standard shell and do whatever you want - this would essentially void the support contract though.

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zvol access rights - chown zvol on reboot / startup / boot
No, there isn't another way to do it currently. The SMF approach is probably the best option for the time being. I think that there should be a couple of additional zvol properties where permissions could be stated.

Best regards, Robert Milkowski http://milek.blogspot.com

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
Sent: 15 November 2012 19:57
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] zvol access rights - chown zvol on reboot / startup / boot

When I google around for anyone else who cares and may have already solved the problem before I came along - it seems we're all doing the same thing for the same reason. If by any chance you are running VirtualBox on a Solaris / OpenSolaris / OpenIndiana / whatever ZFS host, you could of course use .vdi files for the VM virtual disks, but a lot of us are using zvols instead, for various reasons.

To do the zvol, you first create the zvol (sudo zfs create -V), then chown it to the user who runs VBox (sudo chown someuser /dev/zvol/rdsk/...), and then create a rawvmdk that references it (VBoxManage internalcommands createrawvmdk -filename /home/someuser/somedisk.vmdk -rawdisk /dev/zvol/rdsk/...)

The problem is: during boot / reboot, or any time the zpool or zfs filesystem is mounted or remounted, exported, imported... the zvol ownership reverts back to root:root. So you have to repeat your sudo chown before the guest VM can start. And the question is... Obviously I can make an SMF service which will chown those devices automatically, but that's kind of a crappy solution. Is there any good way to assign the access rights, or persistently assign ownership of zvols?

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
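As a rough sketch of the SMF workaround discussed above, the service's method script can be as small as the following (the user name and zvol names are hypothetical, and the script would be wrapped in a transient SMF service that depends on the local filesystems being mounted):

    #!/bin/sh
    # chown-zvols: re-apply ownership of VirtualBox zvol backing devices after boot/import
    for vol in tank/vbox/win7-disk0 tank/vbox/win7-disk1; do
        chown someuser /dev/zvol/rdsk/$vol /dev/zvol/dsk/$vol
    done
    exit 0

It is still the after-the-fact workaround the poster calls crappy - ownership is reapplied rather than stored as a dataset property - but it survives reboots and pool re-imports without manual intervention.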
Re: [zfs-discuss] ARC de-allocation with large ram
Hi,

If after it decreases in size it stays there it might be similar to: 7111576 arc shrinks in the absence of memory pressure. Also, see document: ZFS ARC can shrink down without memory pressure result in slow performance [ID 1404581.1]

Specifically, check if arc_no_grow is set to 1 after the cache size is decreased, and if it stays that way. The fix is in one of the SRUs and I think it should be in 11.1. I don't know if it was fixed in Illumos or even if Illumos was affected by this at all.

-- Robert Milkowski http://milek.blogspot.com

-----Original Message-----
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Chris Nagele
Sent: 20 October 2012 18:47
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] ARC de-allocation with large ram

Hi. We're running OmniOS as a ZFS storage server. For some reason, our arc cache will grow to a certain point, then suddenly drops. I used arcstat to catch it in action, but I was not able to capture what else was going on in the system at the time. I'll do that next.

    read  hits  miss  hit%  l2read  l2hits  l2miss  l2hit%  arcsz  l2size
     166   166     0   100       0       0       0       0    85G    225G
    5.9K  5.9K     0   100       0       0       0       0    85G    225G
     755   715    40    94      40       0      40       0    84G    225G
     17K   17K     0   100       0       0       0       0    67G    225G
     409   395    14    96      14       0      14       0    49G    225G
     388   364    24    93      24       0      24       0    41G    225G
     37K   37K    20    99      20       6      14      30    40G    225G

For reference, it's a 12TB pool with 512GB SSD L2ARC and 198GB RAM. We have nothing else running on the system except NFS. We are also not using dedupe. Here is the output of memstat at one point:

    # echo ::memstat | mdb -k
    Page Summary          Pages        MB   %Tot
    Kernel             19061902     74460    38%
    ZFS File Data      28237282    110301    56%
    Anon                  43112       168     0%
    Exec and libs          1522         5     0%
    Page cache            13509        52     0%
    Free (cachelist)       6366        24     0%
    Free (freelist)     2958527     11556     6%
    Total            5030196571
    Physical           50322219    196571

According to prstat -s rss nothing else is consuming the memory.

    592 root   33M   26M sleep  59  0  0:00:33 0.0% fmd/27
     12 root   13M   11M sleep  59  0  0:00:08 0.0% svc.configd/21
    641 root   12M   11M sleep  59  0  0:04:48 0.0% snmpd/1
     10 root   14M   10M sleep  59  0  0:00:03 0.0% svc.startd/16
    342 root   12M 9084K sleep  59  0  0:00:15 0.0% hald/5
    321 root   14M 8652K sleep  59  0  0:03:00 0.0% nscd/52

So far I can't figure out what could be causing this. The only other thing I can think of is that we have a bunch of zfs send/receive operations going on as backups across 10 datasets in the pool. I am not sure how snapshots and send/receive affect the arc. Does anyone else have any ideas?

Thanks, Chris

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
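A quick sketch of checking the arc_no_grow flag mentioned above on a live system (run as root, assuming mdb is available):

    # echo "arc_no_grow/D" | mdb -k
    arc_no_grow:
    arc_no_grow:    1

A value of 1 that persists after the ARC has shrunk, while there is no memory pressure, matches the symptom described in 7111576.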
Re: [zfs-discuss] encfs on top of zfs
Once something is written deduped you will always use the memory when you want to read any files that were written while dedup was enabled, so you do not save any memory unless you do not normally access most of your data.

For reads you don't need the DDT. Also, in Solaris 11 (not in Illumos, unfortunately, AFAIK) the in-memory ARC stays deduped on reads as well (so if 10 logical blocks are deduped to 1 and you read all 10 logical copies, only one block in the ARC will be allocated). If there are no further modifications and you only read deduped data then, apart from the disk space savings, there can be a very nice improvement in performance as well (less I/O, more RAM for caching, etc.).

As far as the OP is concerned, unless you have a dataset that will dedup well don't bother with it, use compression instead (don't use both compression and dedup because you will shrink the average record size and balloon the memory usage).

Can you expand a little bit more here? Dedup+compression works pretty well actually (not counting the standard problems with current dedup - compression or not).

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NFS asynchronous writes being written to ZIL
The client is using async writes that include commits. Sync writes do not need commits. What happens is that the ZFS transaction group commit occurs at more-or-less regular intervals, likely 5 seconds for more modern ZFS systems. When the commit occurs, any data that is in the ARC but not committed in a prior transaction group gets sent to the ZIL.

Are you sure? I don't think this is the case, unless I misunderstood you or this is some recent change to Illumos. Whatever is being committed when a ZFS txg closes goes directly to the pool and not to the ZIL. Only sync writes will go to the ZIL right away (and not always - see logbias, etc.), and to the ARC to be committed later to the pool when the txg closes.

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
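For reference, the logbias property mentioned above is the per-dataset knob controlling whether those synchronous writes use a dedicated log device (dataset name is hypothetical):

    # zfs set logbias=latency tank/db       # default: small sync writes go to the slog
    # zfs set logbias=throughput tank/db    # bypass the slog; ZIL blocks go to the main pool

With logbias=throughput a pool-wide slog is left free for datasets that really need low-latency commits.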
Re: [zfs-discuss] ZFS performance on LSI 9240-8i?
Now, if anyone is still reading, I have another question. The new Solaris 11 device naming convention hides the physical tree from me. I got just a list of long disk names all starting with c0 (see below), but I need to know which disk is connected to which controller so that I can create the two parts of my mirrors on two different controllers in order to tolerate a single controller failure. I need a way of figuring out the connection path for each disk. Hope I managed to explain what I want?

See diskinfo(1M), for example:

    $ diskinfo -T bay -o Rc -h
    HDD00  -
    HDD01  -
    HDD02  c0t5000CCA00AC87F54d0
    HDD03  c0t5000CCA00AA95838d0
    HDD04  c0t5000CCA01510ECC0d0
    HDD05  c0t5000CCA01515EE78d0
    HDD06  c0t5000CCA01512DA3Cd0
    HDD07  c0t5000CCA00AB3E1C8d0
    HDD08  c0t5000CCA0151C1D18d0
    HDD09  c0t5000CCA0151F7E08d0
    HDD10  c0t5000CCA0151C7CA8d0
    HDD11  c0t5000CCA00AA9D570d0
    HDD12  c0t5000CCA0151CB180d0
    HDD13  c0t5000CCA015208C98d0
    HDD14  c0t5000CCA00AA97F04d0
    HDD15  c0t5000CCA0151A287Cd0
    HDD16  c0t5000CCA00AAA1544d0
    HDD17  c0t5000CCA01521070Cd0
    HDD18  c0t5000CCA00AA97EF4d0
    HDD19  c0t5000CCA015214F84d0
    HDD20  c0t5000CCA015214844d0
    HDD21  c0t5000CCA00AAAD154d0
    HDD22  c0t5000CCA00AA95558d0
    HDD23  c0t5000CCA00AAA0D1Cd0

In your case you will probably have to put a configuration in place for your disk slots (on Oracle's HW it works out of the box) - go to support.oracle.com and look for the document: How To : Selecting a Physical Slot for a SAS Device with a WWN for an Oracle Solaris 11 Installation [ID 1411444.1]

ps. there is also the zpool status -l option which is cool:

    $ zpool status -l cwafseng3-0
      pool: pool-0
     state: ONLINE
      scan: scrub canceled on Thu Apr 12 13:52:13 2012
    config:

        NAME                                                         STATE  READ WRITE CKSUM
        pool-0                                                       ONLINE    0     0     0
          raidz1-0                                                   ONLINE    0     0     0
            /dev/chassis/SUN-FIRE-X4270-M2-SERVER.unknown/HDD02/disk ONLINE    0     0     0
            /dev/chassis/SUN-FIRE-X4270-M2-SERVER.unknown/HDD23/disk ONLINE    0     0     0
            /dev/chassis/SUN-FIRE-X4270-M2-SERVER.unknown/HDD22/disk ONLINE    0     0     0
            /dev/chassis/SUN-FIRE-X4270-M2-SERVER.unknown/HDD21/disk ONLINE    0     0     0
            /dev/chassis/SUN-FIRE-X4270-M2-SERVER.unknown/HDD20/disk ONLINE    0     0     0
            /dev/chassis/SUN-FIRE-X4270-M2-SERVER.unknown/HDD19/disk ONLINE    0     0     0
            /dev/chassis/SUN-FIRE-X4270-M2-SERVER.unknown/HDD17/disk ONLINE    0     0     0
            /dev/chassis/SUN-FIRE-X4270-M2-SERVER.unknown/HDD15/disk ONLINE    0     0     0

    errors: No known data errors

Best regards, Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Resilver restarting several times
-----Original Message-----
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Jim Klimov
Sent: 12 May 2012 01:27
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Resilver restarting several times

2012-05-11 17:18, Bob Friesenhahn wrote:

On Fri, 11 May 2012, Jim Klimov wrote: Hello all, SHORT VERSION: What conditions can cause the reset of the resilvering process? My lost-and-found disk can't get back into the pool because of resilvers restarting...

I recall that with sufficiently old vintage zfs, resilver would restart if a snapshot was taken. What sort of zfs is being used here? Bob

Well, for the night I rebooted the machine into single-user mode, to rule out zones, crontabs and networked abusers, but I still get resilvering resets every now and then, about once an hour. I'm now trying a run with all zfs datasets unmounted, hope that helps somewhat... I'm growing puzzled now.

To double check that no snapshots, etc. are being created, run:

    zpool history -il pond

-- Best regards, Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
And he will still need an underlying filesystem like ZFS for them :) -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Nico Williams Sent: 25 April 2012 20:32 To: Paul Archer Cc: ZFS-Discuss mailing list Subject: Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD) I agree, you need something like AFS, Lustre, or pNFS. And/or an NFS proxy to those. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
Citing yourself: "The average block size for a given data block should be used as the metric to map all other datablock sizes to. For example, the ZFS recordsize is 128kb by default. If the average block (or page) size of a directory server is 2k, then the mismatch in size will result in degraded throughput for both read and write operations. One of the benefits of ZFS is that you can change the recordsize of all write operations from the time you set the new value going forward."

And the above is not even entirely correct, as if a file is bigger than the current value of the recordsize property, reducing the recordsize won't change the block size for that file (it will continue to use the previous size, for example 128K). This is why you need to set recordsize to the desired value for large files *before* you create them (or you will have to copy them later on).

From the performance point of view it really depends on the workload, but as you described in your blog, the default recordsize of 128K with an average write/read of 2K will for many workloads negatively impact performance, and lowering recordsize can potentially improve it. Nevertheless, I was referring to dedup efficiency, which with lower recordsize values should improve dedup ratios (although it will require more memory for the DDT).

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Brad Diggs
Sent: 29 December 2011 15:55
To: Robert Milkowski
Cc: 'zfs-discuss discussion list'
Subject: Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

Reducing the record size would negatively impact performance. For the rationale why, see the section titled Match Average I/O Block Sizes in my blog post on filesystem caching: http://www.thezonemanager.com/2009/03/filesystem-cache-optimization.html

Brad

Brad Diggs | Principal Sales Consultant | 972.814.3698 eMail: brad.di...@oracle.com Tech Blog: http://TheZoneManager.com/ LinkedIn: http://www.linkedin.com/in/braddiggs

On Dec 29, 2011, at 8:08 AM, Robert Milkowski wrote:

Try reducing recordsize to 8K or even less *before* you put any data. This can potentially improve your dedup ratio and keep it higher after you start modifying data.

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Brad Diggs
Sent: 28 December 2011 21:15
To: zfs-discuss discussion list
Subject: Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

As promised, here are the findings from my testing. I created 6 directory server instances where the first instance has roughly 8.5GB of data. Then I initialized the remaining 5 instances from a binary backup of the first instance. Then, I rebooted the server to start off with an empty ZFS cache. The following table shows the increased L1ARC size, increased search rate performance, and increased CPU% busy when starting and applying load to each successive directory server instance. The L1ARC cache grew a little bit with each additional instance but largely stayed the same size. Likewise, the ZFS dedup ratio remained the same because no data on the directory server instances was changing.

[image001.png - table not included in the plain-text archive]

However, once I started modifying the data of the replicated directory server topology, the caching efficiency quickly diminished. The following table shows that the delta for each instance increased by roughly 2GB after only 300k of changes.
[image002.png - table not included in the plain-text archive]

I suspect the divergence in data as seen by ZFS deduplication most likely occurs because deduplication occurs at the block level rather than at the byte level. When a write is sent to one directory server instance, the exact same write is propagated to the other 5 instances and therefore should be considered a duplicate. However, this was not the case. There could be other reasons for the divergence as well.

The two key takeaways from this exercise were as follows. There is tremendous caching potential through the use of ZFS deduplication. However, the current block-level deduplication does not benefit directory servers as much as it perhaps could if deduplication occurred at the byte level rather than the block level. It could very well be that byte-level deduplication wouldn't work much better either. Until that option is available, we won't know for sure.

Regards, Brad

Brad Diggs | Principal Sales Consultant Tech Blog: http://TheZoneManager.com/ LinkedIn: http://www.linkedin.com/in/braddiggs

On Dec 12, 2011, at 10:05 AM, Brad Diggs wrote:

Thanks everyone for your input on this thread. It sounds like there is sufficient weight behind the affirmative that I will include this methodology into my performance analysis test plan. If the performance goes well, I will share some of the results when we conclude in the January/February timeframe. Regarding the great dd use
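To make the recordsize point above concrete, the property has to be set before the data is loaded (dataset name and value are hypothetical):

    # zfs create -o recordsize=8k tank/ldap
    ... restore / initialize the directory server instances here ...

Files written afterwards use 8K blocks; files that already exist with 128K blocks keep them until they are rewritten (for example by copying them to a new file).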
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
-----Original Message-----
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Pawel Jakub Dawidek
Sent: 10 December 2011 14:05
To: Mertol Ozyoney
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

On Wed, Dec 07, 2011 at 10:48:43PM +0200, Mertol Ozyoney wrote:

Unfortunately the answer is no. Neither L1 nor L2 cache is dedup aware. The only vendor I know that can do this is NetApp.

And you really work at Oracle?:)

The answer is definitely yes. ARC caches on-disk blocks and dedup just references those blocks. When you read, dedup code is not involved at all. Let me show it to you with a simple test:

Create a file (dedup is on):

    # dd if=/dev/random of=/foo/a bs=1m count=1024

Copy this file so that it is deduped:

    # dd if=/foo/a of=/foo/b bs=1m

Export the pool so all cache is removed and reimport it:

    # zpool export foo
    # zpool import foo

Now let's read one file:

    # dd if=/foo/a of=/dev/null bs=1m
    1073741824 bytes transferred in 10.855750 secs (98909962 bytes/sec)

We read file 'a' and all its blocks are in cache now. The 'b' file shares all the same blocks, so if ARC caches blocks only once, reading 'b' should be much faster:

    # dd if=/foo/b of=/dev/null bs=1m
    1073741824 bytes transferred in 0.870501 secs (1233475634 bytes/sec)

Now look at it, 'b' was read 12.5 times faster than 'a' with no disk activity. Magic?:)

Yep, however in pre-Solaris 11 GA (and in Illumos) you would end up with 2x copies of the blocks in the ARC cache, while in S11 GA the ARC will keep only 1 copy of all blocks. This can make a big difference if there are more than just 2x files being deduped and you need ARC memory to cache other data as well.

-- Robert Milkowski

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs sync=disabled property
disk. This behavior is what makes NFS over ZFS slow without a slog: NFS does everything O_SYNC by default,

No, it doesn't. However, VMware by default issues all writes as SYNC.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
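If those synchronous writes become a bottleneck, recent builds let you override the behaviour per dataset - a sketch with a hypothetical dataset name, and with the usual caveat that sync=disabled means acknowledged writes can be lost on power failure:

    # zfs set sync=disabled tank/vmware
    # zfs get sync tank/vmware
    NAME         PROPERTY  VALUE     SOURCE
    tank/vmware  sync      disabled  local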
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On 01/ 8/11 05:59 PM, Edward Ned Harvey wrote: Has anybody measured the cost of enabling or disabling verification? The cost of disabling verification is an infinitesimally small number multiplied by possibly all your data. Basically lim-0 times lim-infinity. This can only be evaluated on a case-by-case basis and there's no use in making any more generalizations in favor or against it. The benefit of disabling verification would presumably be faster performance. Has anybody got any measurements, or even calculations or vague estimates or clueless guesses, to indicate how significant this is? How much is there to gain by disabling verification? Exactly my point and there isn't one answer which fits all environments. In the testing I'm doing so far enabling/disabling verification doesn't make any noticeable difference so I'm sticking to verify. But I have enough memory and such a workload that I see little physical reads going on. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On 01/ 7/11 09:02 PM, Pawel Jakub Dawidek wrote:

On Fri, Jan 07, 2011 at 07:33:53PM +0000, Robert Milkowski wrote: Now what if block B is a meta-data block?

Metadata is not deduplicated.

Good point, but then it depends on the perspective. What if you are storing lots of VMDKs? One corrupted block which is shared among hundreds of VMDKs will affect all of them. And it might be a block containing meta-data information within a vmdk...

Anyway, green or not, IMHO if in a given environment turning verification on still delivers acceptable performance then I would basically turn it on. In other environments it is about risk assessment.

Best regards, Robert Milkowski

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On 01/ 7/11 02:13 PM, David Magda wrote:

Given the above: most people are content enough to trust Fletcher to not have data corruption, but are worried about SHA-256 giving 'data corruption' when it comes to de-dupe? The entire rest of the computing world is content to live with 10^-15 (for SAS disks), and yet one wouldn't be prepared to have 10^-30 (or better) for dedupe?

I think you do not entirely understand the problem. Let's say two different blocks A and B have the same sha256 checksum; A is already stored in a pool, B is being written. With dedup enabled and without verify, B won't be written. The next time you ask for block B you will actually end up with block A. Now if B is relatively common in your data set, you have a relatively big impact on many files because of one corrupted block (additionally, from a filesystem point of view this is a silent data corruption). Without dedup, if you get a single block corrupted silently the impact will usually be relatively limited. Now what if block B is a meta-data block?

The point is that the potential impact of a hash collision is much bigger than a single silent data corruption of a block, not to mention that, dedup or not, all the other possible cases of data corruption are there anyway; adding yet another one might or might not be acceptable.

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On 01/ 6/11 07:44 PM, Peter Taps wrote:

Folks, I have been told that the checksum value returned by Sha256 is almost guaranteed to be unique. In fact, if Sha256 fails in some case, we have a bigger problem such as memory corruption, etc. Essentially, adding verification to sha256 is an overkill. Perhaps (Sha256+NoVerification) would work 99.99% of the time. But (Fletcher+Verification) would work 100% of the time. Which one of the two is a better deduplication strategy? If we do not use verification with Sha256, what is the worst case scenario? Is it just more disk space occupied (because of failure to detect duplicate blocks) or is there a chance of actual data corruption (because two blocks were assumed to be duplicates although they are not)?

Yes, there is a possibility of data corruption.

Or, if I go with (Sha256+Verification), how much is the overhead of verification on the overall process?

It really depends on your specific workload. If your application is mostly reading data then it might well be that you won't even notice verify. Sha256 is supposed to be almost bullet-proof, but... at the end of the day it is all about how much you value your data. As I wrote before, try with verify and see if performance is acceptable. It might well be the case. You can always disable verify at any time.

If I do go with verification, it seems (Fletcher+Verification) is more efficient than (Sha256+Verification). And both are 100% accurate in detecting duplicate blocks.

I don't believe that fletcher is still allowed for dedup - right now it is only sha256.

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
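For illustration, switching between the two strategies is just a property change (pool name is hypothetical):

    # zfs set dedup=verify tank     # sha256 match is confirmed by a byte-for-byte comparison
    # zfs set dedup=sha256 tank     # checksum match alone decides that blocks are duplicates

With dedup=verify a checksum hit only nominates a candidate block; the existing block is read back and compared before the write is deduped, which removes the collision risk at the cost of extra reads.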
Re: [zfs-discuss] A few questions
On 01/ 3/11 04:28 PM, Richard Elling wrote:

On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote:

On 12/26/10 05:40 AM, Tim Cook wrote:

On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com wrote: There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.

Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS.

Exactly my observation as well. I haven't seen any ZFS related development happening at Illumos or Nexenta, at least not yet.

I am quite sure you understand how pipelines work :-)

Are you suggesting that Nexenta is developing new ZFS features behind closed doors (like Oracle...) and then will share the code later on? Somehow I don't think so... but I would love to be proved wrong :)

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A few questions
On 01/ 4/11 11:35 PM, Robert Milkowski wrote: On 01/ 3/11 04:28 PM, Richard Elling wrote: On Jan 3, 2011, at 5:08 AM, Robert Milkowski wrote: On 12/26/10 05:40 AM, Tim Cook wrote: On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com mailto:richard.ell...@gmail.com wrote: There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now. Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing out of outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS. Exactly my observation as well. I haven't seen any ZFS related development happening at Ilumos or Nexenta, at least not yet. I am quite sure you understand how pipelines work :-) Are you suggesting that Nexenta is developing new ZFS features behind closed doors (like Oracle...) and then will share code later-on? Somehow I don't think so... but I would love to be proved wrong :) I mean I would love to see Nexenta start delivering real innovation in Solaris/Illumos kernel (zfs, networking, ...), not that I would love to see it happening behind a closed doors :) -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A few questions
On 12/26/10 05:40 AM, Tim Cook wrote:

On Sat, Dec 25, 2010 at 11:23 PM, Richard Elling richard.ell...@gmail.com wrote: There are more people outside of Oracle developing for ZFS than inside Oracle. This has been true for some time now.

Pardon my skepticism, but where is the proof of this claim (I'm quite certain you know I mean no disrespect)? Solaris 11 Express was a massive leap in functionality and bugfixes to ZFS. I've seen exactly nothing from outside of Oracle in the time since it went closed. We used to see updates bi-weekly out of Sun. Nexenta spending hundreds of man-hours on a GUI and userland apps isn't work on ZFS.

Exactly my observation as well. I haven't seen any ZFS related development happening at Illumos or Nexenta, at least not yet.

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ... open source moving forward?
On 11/12/2010 00:07, Erik Trimble wrote: The last update I see to the ZFS public tree is 29 Oct 2010. Which, I *think*, is about the time that the fork for the Solaris 11 Express snapshot was taken. I don't think this is the case. Although all the files show modification date of 29 Oct 2010 at src.opensolaris.org they are still old versions from August, at least the ones I checked. See http://src.opensolaris.org/source/history/onnv/onnv-gate/usr/src/uts/common/fs/zfs/ the mercurial gate doesn't have any updates either. Best regards, Robert Milkowski ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Increase Volume Size
On 07/12/2010 23:54, Tony MacDoodle wrote:

Is it possible to expand the size of a ZFS volume? It was created with the following command: zfs create -V 20G ldomspool/test

See the zfs man page, section about the volsize property.

Best regards, Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
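For example, a sketch of growing the volume created above (the 40G figure is illustrative; any filesystem inside the volume still has to be grown by its own tools afterwards):

    # zfs set volsize=40G ldomspool/test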
Re: [zfs-discuss] RAID-Z/mirror hybrid allocator
On 18/11/2010 17:53, Cindy Swearingen wrote: Markus, Let me correct/expand this: 1. If you create a RAIDZ pool on OS 11 Express (b151a), you will have some mirrored metadata. This feature integrated into b148 and the pool version is 29. This is the part I mixed up. 2. If you have an existing RAIDZ pool and upgrade to b151a, you would need to upgrade the pool version to use this feature. In this case, newly written metadata would be mirrored. Hi, And if one creates raid-z3 pool would meta-data be a 3-way mirror as well? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zfs send|recv and inherited recordsize
Hi,

I thought that if I use zfs send snap | zfs recv, then if on the receiving side the recordsize property is set to a different value it will be honored. But it doesn't seem to be the case, at least on snv_130.

    $ zfs get recordsize test/m1
    NAME     PROPERTY    VALUE    SOURCE
    test/m1  recordsize  128K     default

    $ ls -nil /test/m1/f1
    5 -rw-r--r--   1 0        1        1048576 Oct  4 10:31 /test/m1/f1

    $ zdb -vv test/m1 5
    Dataset test/m1 [ZPL], ID 1082, cr_txg 33413, 1.02M, 5 objects
        Object  lvl   iblk   dblk  dsize  lsize   %full  type
             5    2    16K   128K  1.00M     1M  100.00  ZFS plain file

    $ zfs snapshot test/m...@s1
    $ zfs create -o recordsize=32k test/m2
    $ zfs send test/m...@s1 | zfs recv test/m2/m1

    $ zfs get recordsize test/m2/m1
    NAME        PROPERTY    VALUE    SOURCE
    test/m2/m1  recordsize  32K      inherited from test/m2

    $ ls -lni /test/m2/m1/f1
    5 -rw-r--r--   1 0        1        1048576 Oct  4 10:31 /test/m2/m1/f1

    $ zdb -vv test/m2/m1 5
    Dataset test/m2/m1 [ZPL], ID 1110, cr_txg 33537, 1.02M, 5 objects
        Object  lvl   iblk   dblk  dsize  lsize   %full  type
             5    2    16K   128K  1.00M     1M  100.00  ZFS plain file

Well, dblk is 128KB - I would expect it to be 32K. Let's see what happens if I use cp instead:

    $ cp /test/m2/m1/f1 /test/m2/m1/f2
    $ ls -lni /test/m2/m1/f2
    6 -rw-r--r--   1 0        1        1048576 Oct  4 11:15 /test/m2/m1/f2

    $ zdb -vv test/m2/m1 6
    Dataset test/m2/m1 [ZPL], ID 1110, cr_txg 33537, 2.03M, 6 objects
        Object  lvl   iblk   dblk  dsize  lsize   %full  type
             6    2    16K    32K  1.00M     1M  100.00  ZFS plain file

Now it is fine.

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs send|recv and inherited recordsize
Thank you.

On 04/10/2010 19:55, Matthew Ahrens wrote:

That's correct. This behavior is because the send|recv operates on the DMU objects, whereas the recordsize property is interpreted by the ZPL. The ZPL checks the recordsize property when a file grows. But the recv doesn't grow any files, it just dumps data into the underlying objects. --matt

On Mon, Oct 4, 2010 at 11:20 AM, Robert Milkowski mi...@task.gda.pl wrote:

Hi,

I thought that if I use zfs send snap | zfs recv, then if on the receiving side the recordsize property is set to a different value it will be honored. But it doesn't seem to be the case, at least on snv_130.

    $ zfs get recordsize test/m1
    NAME     PROPERTY    VALUE    SOURCE
    test/m1  recordsize  128K     default

    $ ls -nil /test/m1/f1
    5 -rw-r--r--   1 0        1        1048576 Oct  4 10:31 /test/m1/f1

    $ zdb -vv test/m1 5
    Dataset test/m1 [ZPL], ID 1082, cr_txg 33413, 1.02M, 5 objects
        Object  lvl   iblk   dblk  dsize  lsize   %full  type
             5    2    16K   128K  1.00M     1M  100.00  ZFS plain file

    $ zfs snapshot test/m...@s1
    $ zfs create -o recordsize=32k test/m2
    $ zfs send test/m...@s1 | zfs recv test/m2/m1

    $ zfs get recordsize test/m2/m1
    NAME        PROPERTY    VALUE    SOURCE
    test/m2/m1  recordsize  32K      inherited from test/m2

    $ ls -lni /test/m2/m1/f1
    5 -rw-r--r--   1 0        1        1048576 Oct  4 10:31 /test/m2/m1/f1

    $ zdb -vv test/m2/m1 5
    Dataset test/m2/m1 [ZPL], ID 1110, cr_txg 33537, 1.02M, 5 objects
        Object  lvl   iblk   dblk  dsize  lsize   %full  type
             5    2    16K   128K  1.00M     1M  100.00  ZFS plain file

Well, dblk is 128KB - I would expect it to be 32K. Let's see what happens if I use cp instead:

    $ cp /test/m2/m1/f1 /test/m2/m1/f2
    $ ls -lni /test/m2/m1/f2
    6 -rw-r--r--   1 0        1        1048576 Oct  4 11:15 /test/m2/m1/f2

    $ zdb -vv test/m2/m1 6
    Dataset test/m2/m1 [ZPL], ID 1110, cr_txg 33537, 2.03M, 6 objects
        Object  lvl   iblk   dblk  dsize  lsize   %full  type
             6    2    16K    32K  1.00M     1M  100.00  ZFS plain file

Now it is fine.

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] file level clones
Hi, FYI: http://lwn.net/Articles/399148/

copyfile()

The reflink() (http://lwn.net/Articles/333783/) system call was originally proposed as a sort of fast copy operation; it would create a new copy of a file which shared all of the data blocks. If one of the files were subsequently written to, a copy-on-write operation would be performed so that the other file would not change. LWN readers last heard about this patch last September, when Linus refused to pull it (http://lwn.net/Articles/353048/) for 2.6.32. Among other things, he didn't like the name. So now reflink() is back as copyfile(), with some proposed additional features. It would make the same copy-on-write copies on filesystems that support it, but copyfile() would also be able to delegate the actual copy work to the underlying storage device when it makes sense. For example, if a file is being copied on a network-mounted filesystem, it may well make sense to have the server do the actual copy work, eliminating the need to move the data over the network twice. The system call might also do ordinary copies within the kernel if nothing faster is available.

The first question that was asked is: should copyfile() perhaps be an asynchronous interface? It could return a file descriptor which could be polled for the status of the operation. Then, graphical utilities could start a copy, then present a progress bar showing how things were going. Christoph Hellwig was adamant, though, that copyfile() should be a synchronous operation like almost all other Linux system calls; there is no need to create something weird and different here. Progress bars neither justify nor require the creation of asynchronous interfaces. There was also opposition to the mixing of the old reflink() idea with that of copying a file. There is little perceived value in creating a bad version of cp within the kernel. The two ideas were mixed because it seems that Linus wants it that way, but, after this discussion, they may yet be split apart again.

From http://en.wikipedia.org/wiki/Btrfs: Btrfs provides a clone operation which atomically creates a copy-on-write snapshot of a file; support for this was added to GNU coreutils (http://en.wikipedia.org/wiki/Coreutils) 7.5. Cloning from byte ranges in one file to another is also supported, allowing large files to be more efficiently manipulated like standard rope (http://en.wikipedia.org/wiki/Rope_%28computer_science%29) data structures.

Also see http://www.symantec.com/connect/virtualstoreserver

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
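For comparison, the btrfs clone operation mentioned above is already reachable from the shell with the coreutils release cited (7.5 or later):

    $ cp --reflink=always disk.img disk-clone.img

The copy completes almost instantly and consumes no extra space until one of the two files is modified, at which point only the changed extents get their own blocks.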
Re: [zfs-discuss] 'sync' properties and write operations.
On 28/08/2010 09:55, eXeC001er wrote:

Hi. Can you explain to me: 1. A dataset has 'sync=always'. I start writing to a file on this dataset in non-sync mode: does the system write the file in sync or async mode?

Sync.

2. A dataset has 'sync=disabled'. I start writing to a file on this dataset in sync mode: does the system write the file in sync or async mode?

Async.

The sync property takes effect immediately for all new writes, even if a file was opened before the property was changed.

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zfs set readonly=on does not entirely go into read-only mode
Hi,

When I set readonly=on on a dataset, no new files are allowed to be created. However, writes to already opened files are allowed. This is rather counterintuitive - if I set a filesystem as read-only I would expect it not to allow any modifications to it. I think it shouldn't behave this way and it should be considered a bug. What do you think?

ps. I tested it on S10u8 and snv_134.

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
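A minimal sketch that reproduces the behaviour described (dataset and file names are hypothetical):

    # zfs create tank/ro-test
    # exec 3>/tank/ro-test/f1            # create and keep a file open for writing
    # zfs set readonly=on tank/ro-test
    # touch /tank/ro-test/f2             # refused: the filesystem is read-only
    # echo "still writable" >&3          # succeeds: the already-open file keeps accepting writes
    # exec 3>&-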
Re: [zfs-discuss] iScsi slow
On 03/08/2010 23:20, Ross Walker wrote:

Nothing has been violated here. Look for the WCE flag in COMSTAR, where you can control how a given zvol should behave (synchronous or asynchronous). Additionally, in recent builds you have zfs set sync={disabled|default|always}, which also works with zvols. So you do have control over how it is supposed to behave, and to make it nice it is even on a per-zvol basis. It is just that the default is synchronous.

Ah, ok, my experience has been with Solaris and the iscsitgt which, correct me if I am wrong, is still synchronous only.

I don't remember whether it offered an ability to manipulate the zvol's WCE flag, but if it didn't you can do it anyway as it is a zvol property. For an example see http://milek.blogspot.com/2010/02/zvols-write-cache.html

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Fwd: zpool import despite missing log [PSARC/2010/292 Self Review]
fyi

-- Robert Milkowski http://milek.blogspot.com

-------- Original Message --------
Subject: zpool import despite missing log [PSARC/2010/292 Self Review]
Date: Mon, 26 Jul 2010 08:38:22 -0600
From: Tim Haley tim.ha...@oracle.com
To: psarc-...@sun.com
CC: zfs-t...@sun.com

I am sponsoring the following case for George Wilson. Requested binding is micro/patch. Since this is a straight-forward addition of a command line option, I think it qualifies for self review. If an ARC member disagrees, let me know and I'll convert to a fast-track.

Template Version: @(#)sac_nextcase 1.70 03/30/10 SMI
This information is Copyright (c) 2010, Oracle and/or its affiliates. All rights reserved.

1. Introduction
1.1. Project/Component Working Name: zpool import despite missing log
1.2. Name of Document Author/Supplier: Author: George Wilson
1.3 Date of This Document: 26 July, 2010

4. Technical Description

OVERVIEW:

ZFS maintains a GUID (global unique identifier) on each device, and the sum of all GUIDs of a pool is stored in the ZFS uberblock. This sum is used to determine the availability of all vdevs within a pool when a pool is imported or opened. Pools which contain a separate intent log device (e.g. a slog) will fail to import when that device is removed or is otherwise unavailable. This proposal aims to address this particular issue.

PROPOSED SOLUTION:

This fast-track introduces a new command line flag to the 'zpool import' sub-command. This new option, '-m', allows pools to import even when a log device is missing. The contents of that log device are obviously discarded and the pool will operate as if the log device were offlined.

MANPAGE DIFFS:

     zpool import [-o mntopts] [-p property=value] ... [-d dir | -c cachefile]
    -             [-D] [-f] [-R root] [-n] [-F] -a
    +             [-D] [-f] [-m] [-R root] [-n] [-F] -a

     zpool import [-o mntopts] [-o property=value] ... [-d dir | -c cachefile]
    -             [-D] [-f] [-R root] [-n] [-F] pool |id [newpool]
    +             [-D] [-f] [-m] [-R root] [-n] [-F] pool |id [newpool]

     zpool import [-o mntopts] [ -o property=value] ... [-d dir |
    -             -c cachefile] [-D] [-f] [-n] [-F] [-R root] -a
    +             -c cachefile] [-D] [-f] [-m] [-n] [-F] [-R root] -a

         Imports all pools found in the search directories. Identical to the previous command, except that all pools

    +    -m
    +
    +        Allows a pool to import when there is a missing log device

EXAMPLES:

1). Configuration with a single intent log device:

    # zpool status tank
      pool: tank
     state: ONLINE
      scan: none requested
    config:

        NAME      STATE   READ WRITE CKSUM
        tank      ONLINE     0     0     0
          c7t0d0  ONLINE     0     0     0
        logs
          c5t0d0  ONLINE     0     0     0

    errors: No known data errors

    # zpool import tank
    The devices below are missing, use '-m' to import the pool anyway:
            c5t0d0 [log]
    cannot import 'tank': one or more devices is currently unavailable

    # zpool import -m tank
    # zpool status tank
      pool: tank
     state: DEGRADED
    status: One or more devices could not be opened. Sufficient replicas exist for
            the pool to continue functioning in a degraded state.
    action: Attach the missing device and online it using 'zpool online'.
       see: http://www.sun.com/msg/ZFS-8000-2Q
      scan: none requested
    config:

        NAME                    STATE     READ WRITE CKSUM
        tank                    DEGRADED     0     0     0
          c7t0d0                ONLINE       0     0     0
        logs
          1693927398582730352   UNAVAIL      0     0     0  was /dev/dsk/c5t0d0

    errors: No known data errors

2). Configuration with mirrored intent log device:

    # zpool add tank log mirror c5t0d0 c5t1d0
    zr...@diskmonster:/dev/dsk# zpool status tank
      pool: tank
     state: ONLINE
      scan: none requested
    config:

        NAME        STATE   READ WRITE CKSUM
        tank        ONLINE     0     0     0
          c7t0d0    ONLINE     0     0     0
        logs
          mirror-1  ONLINE     0     0     0
            c5t0d0  ONLINE     0     0     0
            c5t1d0  ONLINE     0     0     0

    errors: No known data errors

    # zpool import 429789444028972405
    The devices below are missing, use '-m' to import the pool anyway:
            mirror-1 [log]
              c5t0d0
              c5t1d0

    # zpool import -m tank
    # zpool status tank
      pool: tank
     state: DEGRADED
    status: One or more devices could not be opened. Sufficient replicas exist for
            the pool to continue functioning in a degraded state.
    action: Attach the missing device and online it using 'zpool online'.
       see: http://www.sun.com/msg/ZFS-8000-2Q
      scan: none requested
    config:

        NAME
Re: [zfs-discuss] zfs raidz1 and traditional raid 5 perfomrance comparision
On 22/07/2010 03:25, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Robert Milkowski

I had a quick look at your results a moment ago. The problem is that you used a server with 4GB of RAM + a raid card with 256MB of cache. Then your filesize for iozone was set to 4GB - so random or not you probably had a relatively good cache hit ratio for random reads. And

Look again in the raw_results. I ran it with 4G, and also with 12G. There was no significant difference between the two, so I only compiled the 4G results into a spreadsheet PDF.

The only tests with a 12GB file size in the raw files are a mirror and a single disk configuration. There are no results for raid-z there.

even then a random read from 8 threads gave you only about 40% more IOPS for a RAID-Z made out of 5 disks than a single drive. The poor result for HW-R5 is surprising though, but it might be that the stripe size was not matched to the ZFS recordsize and iozone block size in this case.

I think what you're saying is With 5 disks performing well, you should expect 4x higher iops than a single disk, and the measured result was only 40% higher, which is a poor result. I agree. I guess the 128k recordsize used in iozone is probably large enough that it frequently causes blocks to span disks? I don't know.

Probably - but it would also depend on how you configured hw-r5 (mainly its stripe size). The other thing is that you might have had some bottleneck somewhere else, as your results for N-way mirrors aren't that good either.

The issue with raid-z and random reads is that as the cache hit ratio goes down to 0, the IOPS approach the IOPS of a single drive. For a little bit more information see http://blogs.sun.com/roch/entry/when_to_and_not_to

I don't think that's correct, unless you're using a single thread. As long as multiple threads are issuing random reads on raidz, and those reads are small enough that each one is entirely written on a single disk, then you should be able to get n-1 disks operating simultaneously, to achieve (n-1)x performance of a single disk. Even if blocks are large enough to span disks, you should be able to get (n-1)x performance of a single disk for large sequential operations.

While that is true to some degree for hw raid-5, raid-z doesn't work that way. The issue is that each zfs filesystem block is basically spread across n-1 devices. So every time you want to read back a single fs block you need to wait for all n-1 devices to provide you with a part of it - and keep in mind that in zfs you can't get a partial block even if that's what you are asking for, as zfs has to check the checksum of the entire fs block. Now multiple readers actually make it worse for raid-z (assuming a very poor cache hit ratio) - because each read from each reader involves all disk drives, basically others can't read anything until it is done. It gets really bad for random reads.

With HW raid-5, if your stripe size matches the block you are reading back, for random reads it is probable that while reader-X1 is reading from disk-Y1, reader-X2 is reading from disk-Y2, so you should end up with all disk drives (-1) contributing to better overall IOPS. Read Roch's blog entry carefully for more information.

btw: even in your results 6x disks in raid-z provided over 3x fewer IOPS than a zfs raid-10 configuration for random reads. It is a big difference if one needs performance.
-- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs raidz1 and traditional raid 5 perfomrance comparision
On 21/07/2010 15:40, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of v for zfs raidz1, I know for random io, iops of a raidz1 vdev eqaul to one physical disk iops, since raidz1 is like raid5 , so is raid5 has same performance like raidz1? ie. random iops equal to one physical disk's ipos. I tested this extensively about 6 months ago. Please see http://www.nedharvey.com for more details. I disagree with the assumptions you've made above, and I'll say this instead: Look at http://nedharvey.com/iozone_weezer/bobs%20method/iozone%20results%20summary. pdf Go down to the 2nd section, Compared to a single disk Look at single-disk and raidz-5disks and raid5-5disks-hardware You'll see that both raidz and raid5 are significantly faster than a single disk in all types of operations. In all cases, raidz is approximately equal to, or significantly faster than hardware raid5. I had a quick look at your results a moment ago. The problem is that you used a server with 4GB of RAM + a raid card with a 256MB of cache. Then your filesize for iozone was set to 4GB - so random or not you probably had a relatively good cache hit ratio for random reads. And even then a random read from 8 threads gave you only about 40% more IOPS than for a RAID-Z made out of 5 disks than a single drive. The poor result for HW-R5 is surprising though but it might be that a stripe size was not matched to ZFS recordsize and iozone block size in this case. The issue with raid-z and random reads is that as cache hit ratio goes down to 0 the IOPS approaches IOPS of a single drive. For a little bit more information see http://blogs.sun.com/roch/entry/when_to_and_not_to -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
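A rough worked example of that point, with illustrative numbers only: if a single disk delivers about 100 random-read IOPS, a 5-disk raidz1 vdev with a cold cache also delivers roughly 100 IOPS for small random reads, because every filesystem block has to be fetched from all the data disks at once, whereas a pool of mirrors built from the same disks can keep each spindle answering independent reads and approach 400-500 IOPS. So sizing raidz for a random-read target means counting vdevs, not disks: roughly 500 IOPS calls for about five raidz vdevs.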
Re: [zfs-discuss] zpool throughput: snv 134 vs 138 vs 143
On 20/07/2010 07:59, Chad Cantwell wrote: I've just compiled and booted into snv_142, and I experienced the same slow dd and scrubbing as I did with my 142 and 143 compilations and with the Nexanta 3 RC2 CD. So, this would seem to indicate a build environment/process flaw rather than a regression. Are you sure it is not a debug vs. non-debug issue? -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Debunking the dedup memory myth
On 20/07/2010 04:41, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Richard L. Hamilton

I would imagine that if it's read-mostly, it's a win, but otherwise it costs more than it saves. Even more conventional compression tends to be more resource intensive than decompression...

I would imagine it's *easier* to have a win when it's read-mostly, but the expense of computing checksums is going to be done either way, with or without dedup. The only extra cost dedup adds is to maintain a hash tree of some kind, to see if some block has already been stored on disk. So ... of course I'm speaking hypothetically and haven't been proven ... I think dedup will accelerate the system in nearly all use cases. The main exception is whenever you have highly non-duplicated data. I think the cost of dedup CPU power is tiny little small, but in the case of highly non-duplicated data, even that little expense is a waste.

Please note that by default ZFS uses fletcher4 checksums, but dedup currently allows only sha256, which is more CPU intensive. Also, from a performance point of view there will be a sudden drop in write performance the moment the DDT can't fit entirely in memory. L2ARC could mitigate the impact, though. Then there will be less memory available for data caching due to the extra memory requirements of the DDT. (However, please note that IIRC the DDT is treated as metadata and by default there is a limit on the metadata cache size to be no bigger than 20% of the ARC - there is a bug open for it, I haven't checked if it's been fixed yet or not.)

What I'm wondering is when dedup is a better value than compression. Whenever files have internal repetition, compression will be better. Whenever the repetition crosses file barriers, dedup will be better.

Not necessarily. Compression in ZFS works only within a single fs block scope. So for example if you have a large file with most of its blocks identical, dedup should compress the file much better than compression would. Also please note that you can use both compression and dedup at the same time.

-- Robert Milkowski http://milek.blogspot.com

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
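One practical way to judge up front whether the DDT will fit in memory is to let zdb simulate dedup on existing data (pool name is hypothetical and the ratios shown are illustrative output; this walks the whole pool, so it can take a while):

    # zdb -S tank
    Simulated DDT histogram:
    ...
    dedup = 1.64, compress = 1.22, copies = 1.01, dedup * compress / copies = 1.98

The histogram gives the number of unique blocks; at very roughly a few hundred bytes of core per DDT entry, that count times the per-entry size is an estimate of the RAM (or L2ARC) the table will need, and the ratio line tells you whether dedup is worth it at all for that data.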
Re: [zfs-discuss] Legality and the future of zfs...
On 12/07/2010 16:32, Erik Trimble wrote: ZFS is NOT automatically ACID. There is no guarantee of commits for async write operations. You would have to use synchronous writes to guarantee commits. And, furthermore, I think that there is a strong

# zfs set sync=always pool

will force all I/O (async or sync) to be written synchronously.

ps. still, I'm not saying it would make ZFS ACID.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] carrying on [was: Legality and the future of zfs...]
On 16/07/2010 23:57, Richard Elling wrote: On Jul 15, 2010, at 4:48 AM, BM wrote: 2. No community = stale outdated code. But there is a community. What is lacking is that Oracle, in their infinite wisdom, has stopped producing OpenSolaris developer binary releases. Not to be outdone, they've stopped other OS releases as well. Surely, this is a temporary situation. AFAIK the dev OSOL releases are still being produced - they haven't been made public since b134 though. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raid-z - not even iops distribution
On 23/06/2010 18:50, Adam Leventhal wrote: Does it mean that for dataset used for databases and similar environments where basically all blocks have fixed size and there is no other data all parity information will end-up on one (z1) or two (z2) specific disks? No. There are always smaller writes to metadata that will distribute parity. What is the total width of your raidz1 stripe? 4x disks, 16KB recordsize, 128GB file, random read with 16KB block. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raid-z - not even iops distribution
On 23/06/2010 19:29, Ross Walker wrote: On Jun 23, 2010, at 1:48 PM, Robert Milkowskimi...@task.gda.pl wrote: 128GB. Does it mean that for dataset used for databases and similar environments where basically all blocks have fixed size and there is no other data all parity information will end-up on one (z1) or two (z2) specific disks? What's the record size on those datasets? 8k? 16K ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raid-z - not even iops distribution
On 24/06/2010 14:32, Ross Walker wrote: On Jun 24, 2010, at 5:40 AM, Robert Milkowskimi...@task.gda.pl wrote: On 23/06/2010 18:50, Adam Leventhal wrote: Does it mean that for dataset used for databases and similar environments where basically all blocks have fixed size and there is no other data all parity information will end-up on one (z1) or two (z2) specific disks? No. There are always smaller writes to metadata that will distribute parity. What is the total width of your raidz1 stripe? 4x disks, 16KB recordsize, 128GB file, random read with 16KB block. From what I gather each 16KB record (plus parity) is spread across the raidz disks. This causes the total random IOPS (write AND read) of the raidz to be that of the slowest disk in the raidz. Raidz is definitely made for sequential IO patterns not random. To get good random IO with raidz you need a zpool with X raidz vdevs where X = desired IOPS/IOPS of single drive.

I know that, and it wasn't my question.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raid-z - not even iops distribution
On 24/06/2010 15:54, Bob Friesenhahn wrote: On Thu, 24 Jun 2010, Ross Walker wrote: Raidz is definitely made for sequential IO patterns not random. To get good random IO with raidz you need a zpool with X raidz vdevs where X = desired IOPS/IOPS of single drive. Remarkably, I have yet to see mention of someone testing a raidz which is comprised entirely of FLASH SSDs. This should help with the IOPS, particularly when reading.

I have. Briefly:

X4270, 2x quad-core 2.93GHz, 72GB RAM
Open Solaris 2009.06 (snv_111b), ARC limited to 4GB
44x SSD in a F5100. 4x SAS HBAs, 4x physical SAS connections to the F5100 (16x SAS channels in total), each to a different domain.

1. RAID-10 pool: 22x mirrors across domains
ZFS: 16KB recordsize, atime=off
randomread filebench benchmark with a 16KB block size with 1, 16, ..., 128 threads, 128GB working set.
maximum performance at 128 threads: ~137,000 ops/s

2. RAID-Z pool: 11x 4-way RAID-Z, each raid-z vdev across domains
ZFS: recordsize=16k, atime=off
randomread filebench benchmark with a 16KB block size with 1, 16, ..., 128 threads, 128GB working set.
maximum performance at 64-128 threads: ~34,000 ops/s
With a ZFS recordsize of 32KB it got up to ~41,000 ops/s. Larger ZFS record sizes produced worse results.

RAID-Z delivered about 3.3x fewer ops/s compared to RAID-10 here. SSDs do not make any fundamental change here and RAID-Z characteristics are basically the same whether it is configured out of SSDs or HDDs. However, SSDs could of course provide good-enough performance even with RAID-Z, as at the end of the day it is not about benchmarks but about your environment's requirements. A given number of SSDs in a RAID-Z configuration is able to deliver the same performance as a much greater number of disk drives in a RAID-10 configuration, and if you don't need much space it could make sense.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
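For reference, a rough sketch of how two such pools could be laid out (device names are made up; the real setup spread the 44 F5100 devices across four SAS domains):

(RAID-10 pool: 22 two-way mirrors, each mirror across two domains)
# zpool create t10 mirror c1t0d0 c2t0d0 mirror c1t1d0 c2t1d0 ... [22 mirrors in total]
# zfs set recordsize=16k t10
# zfs set atime=off t10

(RAID-Z pool: 11 four-way raidz groups, each group spread across the four domains)
# zpool create tz raidz c1t0d0 c2t0d0 c3t0d0 c4t0d0 raidz c1t1d0 c2t1d0 c3t1d0 c4t1d0 ... [11 groups in total]
# zfs set recordsize=16k tz
# zfs set atime=off tz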
Re: [zfs-discuss] raid-z - not even iops distribution
On 24/06/2010 20:52, Arne Jansen wrote: Ross Walker wrote: Raidz is definitely made for sequential IO patterns not random. To get good random IO with raidz you need a zpool with X raidz vdevs where X = desired IOPS/IOPS of single drive. I have seen statements like this repeated several times, though I haven't been able to find an in-depth discussion of why this is the case. From what I've gathered every block (what is the correct term for this? zio block?) written is spread across the whole raid-z. But in what units? will a 4k write be split into 512 byte writes? And in the opposite direction, every block needs to be read fully, even if only parts of it are being requested, because the checksum needs to be checked? Will the parity be read, too? If this is all the case, I can see why raid-z reduces the performance of an array effectively to one device w.r.t. random reads. http://blogs.sun.com/roch/entry/when_to_and_not_to -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raid-z - not even iops distribution
128GB. Does it mean that for a dataset used for databases and similar environments, where basically all blocks have a fixed size and there is no other data, all parity information will end up on one (z1) or two (z2) specific disks?

On 23/06/2010 17:51, Adam Leventhal wrote: Hey Robert, How big of a file are you making? RAID-Z does not explicitly do the parity distribution that RAID-5 does. Instead, it relies on non-uniform stripe widths to distribute IOPS. Adam

On Jun 18, 2010, at 7:26 AM, Robert Milkowski wrote: Hi,

zpool create test raidz c0t0d0 c1t0d0 c2t0d0 c3t0d0 \
    raidz c0t1d0 c1t1d0 c2t1d0 c3t1d0 \
    raidz c0t2d0 c1t2d0 c2t2d0 c3t2d0 \
    raidz c0t3d0 c1t3d0 c2t3d0 c3t3d0 \
    [...]
    raidz c0t10d0 c1t10d0 c2t10d0 c3t10d0
zfs set atime=off test
zfs set recordsize=16k test (I know...)

Now if I create one large file with filebench and simulate a randomread workload with 1 or more threads, then disks on the c2 and c3 controllers are getting about 80% more reads. This happens both on 111b and snv_134. I would rather expect all of them to get about the same number of iops. Any idea why?

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] hot detach of disks, ZFS and FMA integration
On 18/06/2010 00:18, Garrett D'Amore wrote: On Thu, 2010-06-17 at 18:38 -0400, Eric Schrock wrote: On the SS7000 series, you get an alert that the enclosure has been detached from the system. The fru-monitor code (generalization of the disk-monitor) that generates this sysevent has not yet been pushed to ON. [...] I guess the fact that the SS7000 code isn't kept up to date in ON means that we may wind up having to do our own thing here... it's a bit unfortunate, but ok.

Eric - is it a business decision that the discussed code is not in ON, or do you actually intend to get it integrated into ON? Because if you do, then I think that getting the Nexenta guys to expand on it would be better for everyone instead of having them reinvent the wheel...

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] raid-z - not even iops distribution
Hi,

zpool create test raidz c0t0d0 c1t0d0 c2t0d0 c3t0d0 \
    raidz c0t1d0 c1t1d0 c2t1d0 c3t1d0 \
    raidz c0t2d0 c1t2d0 c2t2d0 c3t2d0 \
    raidz c0t3d0 c1t3d0 c2t3d0 c3t3d0 \
    [...]
    raidz c0t10d0 c1t10d0 c2t10d0 c3t10d0
zfs set atime=off test
zfs set recordsize=16k test (I know...)

Now if I create one large file with filebench and simulate a randomread workload with 1 or more threads, then disks on the c2 and c3 controllers are getting about 80% more reads. This happens both on 111b and snv_134. I would rather expect all of them to get about the same number of iops. Any idea why?

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
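For anyone wanting to reproduce this, the filebench invocation looks roughly like the sketch below (quoted from memory - the randomread personality and its variable names may differ slightly between filebench releases), with iostat in a second terminal to watch the per-controller distribution:

# filebench
filebench> load randomread
filebench> set $dir=/test
filebench> set $filesize=128g
filebench> set $iosize=16k
filebench> set $nthreads=16
filebench> run 60

(in another terminal)
# iostat -xnz 5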
Re: [zfs-discuss] Question : Sun Storage 7000 dedup ratio per share
On 18/06/2010 14:47, ??? wrote: Dear All: Under the Sun Storage 7000 system, can we see a per-share ratio after enabling the dedup function? We would like to see each share's dedup ratio. The Web GUI only shows the dedup ratio for the entire storage pool.

Since dedup works across all datasets with dedup enabled in a pool, you can't really get a dedup ratio per share.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] At what level does the “zfs ” directory exist?
On 17/06/2010 09:18, MichaelHoy wrote: First thing, it’s simply not practical to have so many file systems. I’d already tested 5k and boot time was unacceptable, never mind the other inherent implications of such a strategy. Therefore, access to Previous Versions via Windows is out.

Previous Versions should work even if you have one large filesystem with all user homes as directories within it. What Solaris/OpenSolaris version did you try for the 5k test?

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OCZ Devena line of enterprise SSD
On 15/06/2010 18:46, Brandon High wrote: On Mon, Jun 14, 2010 at 2:07 PM, Roger Hernandezrhvar...@gmail.com wrote: OCZ has a new line of enterprise SSDs, based on the SandForce 1500 controller. The SLC based drive should be great as a ZIL, and the MLC drives should be a close second. Neither is cost effective as a L2ARC, since the cache device doesn't require resiliency or high random iops. A previous generation drive (such as the Vertex or X25-M) is probably sufficient.

If you don't need high random iops from your l2arc, then perhaps you don't need an l2arc at all? The whole point of having L2ARC is to serve high random read iops from RAM and the L2ARC device instead of from the disk drives in the main pool.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] At what level does the “zfs ” directory exist?
On 16/06/2010 09:11, Arne Jansen wrote: MichaelHoy wrote: I’ve posted a query regarding the visibility of snapshots via CIFS here (http://opensolaris.org/jive/thread.jspa?threadID=130577tstart=0) however, I’m beginning to suspect that it may be a more fundamental ZFS question so I’m asking the same question here. At what level does the “zfs” directory exist? If the “.zfs” subdirectory only exists as the direct child of the mount point then can someone suggest how I can make it visible lower down without requiring me (even if it were possible for 50k users) to make each users’ home folder a file system? By way of a background, I’m looking at the possibility of hosting our students personal file space on OpenSolaris since the capacities required go well beyond my budget to keep investing in our NetApp kit. So far I’ve managed to implement the same functionality however, the visibility of the snapshots to allow self-service file restores is a real issue which may prevent me for going forward on this platform. I’d appreciate any suggestions. Do you only want to share the filesystem via CIFS? Have you had a look at the shadow_copy2 extension for samba? It maps the snapshots so windows can access them via previous versions from the explorers context menu. btw: the CIFS service supports Windows Shadow Copies out-of-the-box. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
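One related knob worth knowing about: the .zfs directory only exists at the root of each filesystem, but it can at least be made visible there instead of hidden (dataset name is just an example):

# zfs get snapdir tank/home
# zfs set snapdir=visible tank/home

Users can then browse tank/home/.zfs/snapshot/<snapshot-name>/ directly, in addition to whatever Previous Versions support the CIFS service exposes.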
Re: [zfs-discuss] Scrub issues
On 14/06/2010 22:12, Roy Sigurd Karlsbakk wrote: Hi all It seems zfs scrub is taking a big bit out of I/O when running. During a scrub, sync I/O, such as NFS and iSCSI is mostly useless. Attaching an SLOG and some L2ARC helps this, but still, the problem remains in that the scrub is given full priority. Is this problem known to the developers? Will it be addressed? http://sparcv9.blogspot.com/2010/06/slower-zfs-scrubsresilver-on-way.html http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6494473 -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20
On 10/06/2010 20:43, Andrey Kuzmin wrote: As to your results, it sounds almost too good to be true. As Bob has pointed out, h/w design targeted hundreds IOPS, and it was hard to believe it can scale 100x. Fantastic. But it actually can do over 100k. Also several thousand IOPS on a single FC port is nothing unusual and has been the case for at least several years. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20
On 11/06/2010 09:22, sensille wrote: Andrey Kuzmin wrote: On Fri, Jun 11, 2010 at 1:54 AM, Richard Elling richard.ell...@gmail.commailto:richard.ell...@gmail.com wrote: On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote: Andrey Kuzmin wrote: Well, I'm more accustomed to sequential vs. random, but YMMW. As to 67000 512 byte writes (this sounds suspiciously close to 32Mb fitting into cache), did you have write-back enabled? It's a sustained number, so it shouldn't matter. That is only 34 MB/sec. The disk can do better for sequential writes. Note: in ZFS, such writes will be coalesced into 128KB chunks. So this is just 256 IOPS in the controller, not 64K. No, it's 67k ops, it was a completely ZFS-free test setup. iostat also confirmed the numbers. It's a really simple test that everyone can do:

# dd if=/dev/zero of=/dev/rdsk/cXtYdZs0 bs=512

I did a test on my workstation a moment ago and got about 21k IOPS from my sata drive (iostat). The trick here, of course, is that this is a sequential write with no other workload going on, and a drive should be able to nicely coalesce these IOs and do sequential writes with large blocks.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20
On 11/06/2010 10:58, Andrey Kuzmin wrote: On Fri, Jun 11, 2010 at 1:26 PM, Robert Milkowski mi...@task.gda.pl mailto:mi...@task.gda.pl wrote: On 11/06/2010 09:22, sensille wrote: Andrey Kuzmin wrote: On Fri, Jun 11, 2010 at 1:54 AM, Richard Elling richard.ell...@gmail.com mailto:richard.ell...@gmail.commailto:richard.ell...@gmail.com mailto:richard.ell...@gmail.com wrote: On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote: Andrey Kuzmin wrote: Well, I'm more accustomed to sequential vs. random, but YMMW. As to 67000 512 byte writes (this sounds suspiciously close to 32Mb fitting into cache), did you have write-back enabled? It's a sustained number, so it shouldn't matter. That is only 34 MB/sec. The disk can do better for sequential writes. Note: in ZFS, such writes will be coalesced into 128KB chunks. So this is just 256 IOPS in the controller, not 64K. No, it's 67k ops, it was a completely ZFS-free test setup. iostat also confirmed the numbers. It's a really simple test everyone can do it. # dd if=/dev/zero of=/dev/rdsk/cXtYdZs0 bs=512 I did a test on my workstation a moment ago and got about 21k IOPS from my sata drive (iostat). The trick here of course is that this is sequentail write with no other workload going on and a drive should be able to nicely coalesce these IOs and do a sequential writes with large blocks. Exactly, though one might still wonder where the coalescing actually happens, in the respective OS layer or in the controller. Nonetheless, this is hardly a common use-case one would design h/w for. in the above example it happens inside a disk drive. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20
On 21/10/2009 03:54, Bob Friesenhahn wrote: I would be interested to know how many IOPS an OS like Solaris is able to push through a single device interface. The normal driver stack is likely limited as to how many IOPS it can sustain for a given LUN since the driver stack is optimized for high latency devices like disk drives. If you are creating a driver stack, the design decisions you make when requests will be satisfied in about 12ms would be much different than if requests are satisfied in 50us. Limitations of existing software stacks are likely reasons why Sun is designing hardware with more device interfaces and more independent devices.

Open Solaris 2009.06, 1KB READ I/O:

# dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0
# iostat -xnzCM 1 | egrep "device|c[0123]$"
[...]
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17497.3    0.0   17.1    0.0  0.0  0.8    0.0    0.0   0  82 c0
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17498.8    0.0   17.1    0.0  0.0  0.8    0.0    0.0   0  82 c0
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17277.6    0.0   16.9    0.0  0.0  0.8    0.0    0.0   0  82 c0
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17441.3    0.0   17.0    0.0  0.0  0.8    0.0    0.0   0  82 c0
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
17333.9    0.0   16.9    0.0  0.0  0.8    0.0    0.0   0  82 c0

Now let's see how it looks for a single SAS connection, but with dd to 11x SSDs:

# dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t1d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t2d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t4d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t5d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t6d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t7d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t8d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t9d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t10d0p0
# dd of=/dev/null bs=1k if=/dev/rdsk/c0t11d0p0
# iostat -xnzCM 1 | egrep "device|c[0123]$"
[...]
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104243.3    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 968 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104249.2    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 968 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104208.1    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 967 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104245.8    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 966 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104221.9    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 968 c0
                    extended device statistics
     r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
104212.2    0.0  101.8    0.0  0.2  9.7    0.0    0.1   0 967 c0

It looks like a single CPU core still hasn't been saturated, and the bottleneck is in the device rather than the OS/CPU. So the MPT driver in Solaris 2009.06 can do at least 100,000 IOPS to a single SAS port. It also scales well - I ran the above dd's over 4x SAS ports at the same time and it scaled linearly, achieving well over 400k IOPS.

hw used: x4270, 2x Intel X5570 2.93GHz, 4x SAS SG-PCIE8SAS-E-Z (fw. 1.27.3.0), connected to F5100.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20
On 10/06/2010 15:39, Andrey Kuzmin wrote: On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski mi...@task.gda.pl mailto:mi...@task.gda.pl wrote: On 21/10/2009 03:54, Bob Friesenhahn wrote: I would be interested to know how many IOPS an OS like Solaris is able to push through a single device interface. The normal driver stack is likely limited as to how many IOPS it can sustain for a given LUN since the driver stack is optimized for high latency devices like disk drives. If you are creating a driver stack, the design decisions you make when requests will be satisfied in about 12ms would be much different than if requests are satisfied in 50us. Limitations of existing software stacks are likely reasons why Sun is designing hardware with more device interfaces and more independent devices. Open Solaris 2009.06, 1KB READ I/O: # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0 /dev/null is usually a poor choice for a test like this. Just to be on the safe side, I'd rerun it with /dev/random.

That wouldn't work, would it? Please notice that I'm reading *from* an ssd and writing *to* /dev/null

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [zones-discuss] ZFS ARC cache issue
On 04/06/2010 15:46, James Carlson wrote: Petr Benes wrote: add to /etc/system something like (value depends on your needs):

* limit greedy ZFS to 4 GiB
set zfs:zfs_arc_max = 4294967296

And yes, this has nothing to do with zones :-). That leaves unanswered the underlying question: why do you need to do this at all? Isn't the ZFS ARC supposed to release memory when the system is under pressure? Is that mechanism not working well in some cases ... ?

My understanding is that if kmem gets heavily fragmented, ZFS won't be able to give back much memory.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Odd dump volume panic
On 12/05/2010 22:19, Ian Collins wrote: On 05/13/10 03:27 AM, Lori Alt wrote: On 05/12/10 04:29 AM, Ian Collins wrote: I just tried moving a dump volume from rpool into another pool, so I used zfs send/receive to copy the volume (to keep some older dumps) then ran dumpadm -d to use the new location. This caused a panic. Nothing ended up in messages and needless to say, there isn't a dump! Creating a new volume and using that worked fine. This was on Solaris 10 update 8. Has anyone else seen anything like this? The fact that a panic occurred is some kind of bug, but I'm also not surprised that this didn't work. Dump volumes have specialized behavior and characteristics and using send/receive to move them (or any other way to move them) is probably not going to work. You need to extract the dump from the dump zvol using savecore and then move the resulting file. I'm surprised. I thought the volume used for dump is just a normal zvol or other block device. I didn't realise there was any relationship between a zvol and its contents. One odd thing I did notice was the device size was reported differently on the new pool:

zfs get all space/dump
NAME        PROPERTY              VALUE                  SOURCE
space/dump  type                  volume                 -
space/dump  creation              Wed May 12 20:56 2010  -
space/dump  used                  12.9G                  -
space/dump  available             201G                   -
space/dump  referenced            12.9G                  -
space/dump  compressratio         1.01x                  -
space/dump  reservation           none                   default
space/dump  volsize               16G                    -
space/dump  volblocksize          128K                   -
space/dump  checksum              on                     default
space/dump  compression           on                     inherited from space
space/dump  readonly              off                    default
space/dump  shareiscsi            off                    default
space/dump  copies                1                      default
space/dump  refreservation        none                   default
space/dump  primarycache          all                    default
space/dump  secondarycache        all                    default
space/dump  usedbysnapshots       0                      -
space/dump  usedbydataset         12.9G                  -
space/dump  usedbychildren        0                      -
space/dump  usedbyrefreservation  0                      -

zfs get all rpool/dump
NAME        PROPERTY        VALUE                  SOURCE
rpool/dump  type            volume                 -
rpool/dump  creation        Thu Jun 25 19:40 2009  -
rpool/dump  used            16.0G                  -
rpool/dump  available       10.4G                  -
rpool/dump  referenced      16K                    -
rpool/dump  compressratio   1.00x                  -
rpool/dump  reservation     none                   default
rpool/dump  volsize         16G                    -
rpool/dump  volblocksize    8K                     -
rpool/dump  checksum        off                    local
rpool/dump  compression     off                    local
rpool/dump  readonly        off                    default
rpool/dump  shareiscsi      off                    default
rpool/dump  copies          1                      default
rpool/dump  refreservation  none                   default
rpool/dump  primarycache    all                    default
rpool/dump  secondarycache  all                    default

A zvol used as a dump device has some constraints in regard to its settings like checksum, compression, etc. For more details see: http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zvol.c#1683

See that space/dump has checksums turned on, compression turned on, etc. while rpool/dump doesn't. Additionally, all blocks need to be pre-allocated (http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zvol.c#1785) - but zfs send|recv should replicate that, I think.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
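In other words, the safe way to relocate a dump device is to create a fresh zvol in the target pool and point dumpadm at it, rather than replicating the old dump zvol - roughly like this (sizes and names are only an example):

# savecore
(extract any saved crash dump from the current dump device first)
# zfs create -V 16G space/dump
# dumpadm -d /dev/zvol/dsk/space/dump
# zfs destroy rpool/dump
(only once the new dump device is active)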
[zfs-discuss] Heads Up: zil_disable has expired, ceased to be, ...
With the put back of:

[PSARC/2010/108] zil synchronicity

zfs datasets now have a new 'sync' property to control synchronous behaviour. The zil_disable tunable to turn synchronous requests into asynchronous requests (disable the ZIL) has been removed. For systems that use that switch, on upgrade you will now see a message on booting:

sorry, variable 'zil_disable' is not defined in the 'zfs' module

Please update your system to use the new sync property. Here is a summary of the property:

---
The options and semantics for the zfs sync property:

sync=standard
This is the default option. Synchronous file system transactions (fsync, O_DSYNC, O_SYNC, etc) are written out (to the intent log) and then secondly all devices written are flushed to ensure the data is stable (not cached by device controllers).

sync=always
For the ultra-cautious, every file system transaction is written and flushed to stable storage by system call return. This obviously has a big performance penalty.

sync=disabled
Synchronous requests are disabled. File system transactions only commit to stable storage on the next DMU transaction group commit, which can be many seconds. This option gives the highest performance, with no risk of corrupting the pool. However, it is very dangerous as ZFS is ignoring the synchronous transaction demands of applications such as databases or NFS. Setting sync=disabled on the currently active root or /var file system may result in out-of-spec behavior or application data loss and increased vulnerability to replay attacks. Administrators should only use this when these risks are understood.

The property can be set when the dataset is created, or dynamically, and will take effect immediately. To change the property, an administrator can use the standard 'zfs' command. For example:

# zfs create -o sync=disabled whirlpool/milek
# zfs set sync=always whirlpool/perrin

-- Team ZIL.

It should be in build 140. For a little bit more information on it you might look at http://milek.blogspot.com/2010/05/zfs-synchronous-vs-asynchronous-io.html

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Heads Up: zil_disable has expired, ceased to be, ...
On 06/05/2010 12:24, Pawel Jakub Dawidek wrote: I read that this property is not inherited and I can't see why. If what I read is up-to-date, could you tell why?

It is inherited. Sorry for the confusion, but there was a discussion about whether it should or should not be inherited; we proposed that it shouldn't, but it was changed again during the PSARC review so that it should. And I did a copy'n'paste here. Again, sorry for the confusion.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
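A quick way to see the inheritance in action, reusing the dataset name from the heads-up mail (any child datasets below it are hypothetical):

# zfs set sync=disabled whirlpool/milek
# zfs get -r sync whirlpool/milek

The SOURCE column shows 'local' on whirlpool/milek and 'inherited from whirlpool/milek' on its descendants.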
Re: [zfs-discuss] Heads Up: zil_disable has expired, ceased to be, ...
On 06/05/2010 13:12, Robert Milkowski wrote: On 06/05/2010 12:24, Pawel Jakub Dawidek wrote: I read that this property is not inherited and I can't see why. If what I read is up-to-date, could you tell why? It is inherited. Sorry for the confusion but there was a discussion if it should or should not be inherited, then we propose that it shouldn't but it was changed again during a PSARC review that it should. And I did a copy'n'paste here. Again, sorry for the confusion. Well, actually I did copy'n'paste a proper page as it doesn't say anything about inheritance. Nevertheless, yes it is inherited. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Loss of L2ARC SSD Behaviour
On 06/05/2010 15:31, Tomas Ögren wrote: On 06 May, 2010 - Bob Friesenhahn sent me these 0,6K bytes: On Wed, 5 May 2010, Edward Ned Harvey wrote: In the L2ARC (cache) there is no ability to mirror, because cache device removal has always been supported. You can't mirror a cache device, because you don't need it. How do you know that I don't need it? The ability seems useful to me. The gain is quite minimal.. If the first device fails (which doesn't happen too often I hope), then it will be read from the normal pool once and then stored in ARC/L2ARC again. It just behaves like a cache miss for that specific block... If this happens often enough to become a performance problem, then you should throw away that L2ARC device because it's broken beyond usability.

Well, if an L2ARC device fails there might be an unacceptable drop in delivered performance. If it were mirrored, then the drop usually would be much smaller, or there could be no drop at all if a mirror had an option to read from only one side.

Being able to mirror L2ARC might be especially useful once a persistent L2ARC is implemented, as after a node restart or a resource failover in a cluster the L2ARC will be kept warm. Then the only thing which might affect L2 performance considerably would be an L2ARC device failure...

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Loss of L2ARC SSD Behaviour
On 06/05/2010 19:08, Michael Sullivan wrote: Hi Marc, Well, if you are striping over multiple devices then your I/O should be spread over the devices and you should be reading them all simultaneously rather than just accessing a single device. Traditional striping would give 1/n performance improvement rather than 1/1 where n is the number of disks the stripe is spread across. The round-robin access I am referring to, is the way the L2ARC vdevs appear to be accessed. So, any given object will be taken from a single device rather than from several devices simultaneously, thereby increasing the I/O throughput. So, theoretically, a stripe spread over 4 disks would give 4 times the performance as opposed to reading from a single disk. This also assumes the controller can handle multiple I/O as well or that you are striped over different disk controllers for each disk in the stripe. SSD's are fast, but if I can read a block from more devices simultaneously, it will cut the latency of the overall read.

Keep in mind that the largest block is currently 128KB and you always need to read an entire block. Splitting a block across several L2ARC devices would probably decrease performance, and it would invalidate all blocks if even a single l2arc device died. Additionally, having each block on only one l2arc device still allows reading from all of the l2arc devices at the same time.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Heads Up: zil_disable has expired, ceased to be, ...
On 06/05/2010 21:45, Nicolas Williams wrote: On Thu, May 06, 2010 at 03:30:05PM -0500, Wes Felter wrote: On 5/6/10 5:28 AM, Robert Milkowski wrote: sync=disabled Synchronous requests are disabled. File system transactions only commit to stable storage on the next DMU transaction group commit which can be many seconds. Is there a way (short of DTrace) to write() some data and get notified when the corresponding txg is committed? Think of it as a poor man's group commit. fsync(2) is it. Of course, if you disable sync writes then there's no way to find out for sure. If you need to know when a write is durable, then don't disable sync writes. Nico

There is one way - issue a sync(2) - even with sync=disabled it will sync all filesystems and then return. Another workaround would be to create a snapshot... However, I agree with Nico - if you don't need sync=disabled then don't use it.

Someone else mentioned that yet another option like sync=fsync-only would be useful, so everything would be async except fsync() - but frankly I'm not convinced, as it would require support in your application, and at that point you already have full control of the behavior without the need for sync=disabled.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZIL behavior on import
On 05/05/2010 20:45, Steven Stallion wrote: All, I had a question regarding how the ZIL interacts with zpool import: Given that the intent log is replayed in the event of a system failure, does the replay behavior differ if -f is passed to zpool import? For example, if I have a system which fails prior to completing a series of writes and I reboot using a failsafe (i.e. install disc), will the log be replayed after a zpool import -f ? yes -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Pool import with failed ZIL device now possible ?
On 16/02/2010 21:54, Jeff Bonwick wrote: People used fastfs for years in specific environments (hopefully understanding the risks), and disabling the ZIL is safer than fastfs. Seems like it would be a useful ZFS dataset parameter. We agree. There's an open RFE for this: 6280630 zil synchronicity No promise on date, but it will bubble to the top eventually. So everyone knows - it has been integrated into snv_140 :) -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance of the ZIL
On 04/05/2010 18:19, Tony MacDoodle wrote: How would one determine if I should have a separate ZIL disk? We are using ZFS as the backend of our Guest Domains boot drives using LDom's. And we are seeing bad/very slow write performance?

If you can disable the ZIL and compare the performance to when it is enabled, it will give you an estimate of the absolute maximum performance increase (if any) you could get by having a dedicated ZIL device.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
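For completeness, the two ways of doing that comparison, depending on the build (dataset name is just an example - and remember to switch it back afterwards):

(builds before snv_140: system-wide only, via /etc/system plus a reboot)
set zfs:zil_disable = 1

(snv_140 and later: per dataset, takes effect immediately)
# zfs set sync=disabled tank/ldoms
# zfs set sync=standard tank/ldoms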
Re: [zfs-discuss] Performance drop during scrub?
On 28/04/2010 21:39, David Dyer-Bennet wrote: The situations being mentioned are much worse than what seem reasonable tradeoffs to me. Maybe that's because my intuition is misleading me about what's available. But if the normal workload of a system uses 25% of its sustained IOPS, and a scrub is run at low priority, I'd like to think that during a scrub I'd see a little degradation in performance, and that the scrub would take 25% or so longer than it would on an idle system. There's presumably some inefficiency, so the two loads don't just add perfectly; so maybe another 5% lost to that? That's the big uncertainty. I have a hard time believing in 20% lost to that.

Well, it's not that easy, as there are many other factors you need to take into account. For example, how many IOs are you allowing to be queued per device? This might affect latency for your application. Or if you have a disk array with its own cache - just by doing a scrub you might be pushing other entries out of the cache, which might impact the performance of your application. Then there might be a SAN, and so on.

I'm not saying there is no room for improvement here. All I'm saying is that it is not as easy a problem as it seems.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Compellant announces zNAS
On 29/04/2010 07:57, Phil Harman wrote: That screen shot looks very much like Nexenta 3.0 with a different branding. Elsewhere, The Register confirms it's OpenSolaris. Well it looks like it is running Nexenta which is based on Open Solaris. But it is not the Open Solaris *distribution*. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re-attaching zpools after machine termination [amazon ebs ec2]
On 26/04/2010 09:27, Phillip Oldham wrote: Then perhaps you should do zpool import -R / pool *after* you attach EBS. That way Solaris won't automatically try to import the pool and your scripts will do it once disks are available. zpool import doesn't work as there was no previous export. I'm trying to solve the case where the instance terminates unexpectedly; think of someone just pulling the plug. There's no way to do the export operation before it goes down, but I still need to bring it back up, attach the EBS drives and continue as previous. The start/attach/reboot/available cycle is interesting, however. I may be able to init a reboot after attaching the drives, but it's not optimal - there's always a chance the instance might not come back up after the reboot. And it still doesn't answer *why* the drives aren't showing any data after they're initially attached.

You don't have to do exports, as I suggested using 'zpool import -R / pool' (notice the -R). If you do so, the pool won't be added to zpool.cache, and therefore after a reboot (unexpected or not) you will be able to import it again (and do so with -R). That way you can easily script it so the import happens after your disks are available.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
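A minimal sketch of such a script (pool name is hypothetical; the EBS attach itself is whatever EC2 tooling you already use):

#!/bin/sh
# wait until the pool on the newly attached EBS volumes becomes visible
until zpool import 2>/dev/null | grep "pool: data" > /dev/null; do
        sleep 10
done
# import with an altroot so the pool is not recorded in zpool.cache
zpool import -R / data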
Re: [zfs-discuss] Re-attaching zpools after machine termination [amazon ebs ec2]
On 26/04/2010 11:14, Phillip Oldham wrote: You don't have to do exports as I suggested to use 'zpool -R / pool' (notice -R). I tried this after your suggestion (including the -R switch) but it failed, saying the pool I was trying to import didn't exist. which means it couldn't discover it. does 'zpool import' (no other options) list the pool? -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Pool, what happen when disk failure
On 25/04/2010 13:08, Edward Ned Harvey wrote: The system should boot-up properly even if some pools are not accessible (except rpool of course). If it is not the case then there is a bug - last time I checked it worked perfectly fine. This may be different in the latest opensolaris, but in the latest solaris, this is what I know: If a pool fails, and forces an ungraceful shutdown, then during the next bootup, the pool is treated as currently in use by another system. The OS doesn't come up all the way; you have to power cycle again, and go into failsafe mode. Then you can zpool import I think requiring the -f or -F, and reboot again normal. I just did a test on Solaris 10/09 - and system came up properly, entirely on its own, with a failed pool. zpool status showed the pool as unavailable (as I removed an underlying device) which is fine. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Benchmarking Methodologies
On 21/04/2010 18:37, Ben Rockwood wrote: You've made an excellent case for benchmarking and where it's useful, but what I'm asking for on this thread is for folks to share the research they've done with as much specificity as possible for research purposes. :)

However, you can also find some benchmarks with sysbench + mysql or oracle. I don't remember whether or not I posted some of my results, but I'm pretty sure you can find others.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
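As a concrete example of the sysbench + MySQL approach (sysbench 0.4.x-style flags; database and connection details are obviously site-specific):

# sysbench --test=oltp --mysql-user=bench --mysql-db=sbtest \
    --oltp-table-size=10000000 prepare
# sysbench --test=oltp --mysql-user=bench --mysql-db=sbtest \
    --oltp-table-size=10000000 --num-threads=16 \
    --max-time=300 --max-requests=0 run

Repeating the run while changing exactly one thing at a time (recordsize, a separate slog, another filesystem, ...) is what makes the numbers comparable.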
Re: [zfs-discuss] ZFS Pool, what happen when disk failure
On 24/04/2010 13:51, Edward Ned Harvey wrote: But what you might not know: If any pool fails, the system will crash.

This actually depends on the failmode property setting on your pools. The default is wait, but it can also be set to continue or panic - see the zpool(1M) man page for more details.

You will need to power cycle. The system won't boot up again; you'll have to

The system should boot-up properly even if some pools are not accessible (except rpool of course). If it is not the case then there is a bug - last time I checked it worked perfectly fine.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
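Checking and changing it is a one-liner (pool name is just an example); wait blocks I/O until the devices come back, continue returns errors for new I/O, and panic brings the box down:

# zpool get failmode tank
# zpool set failmode=continue tank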
Re: [zfs-discuss] Re-attaching zpools after machine termination [amazon ebs ec2]
On 23/04/2010 13:38, Phillip Oldham wrote: The instances are ephemeral; once terminated they cease to exist, as do all their settings. Rebooting an image keeps any EBS volumes attached, but this isn't the case I'm dealing with - its when the instance terminates unexpectedly. For instance, if a reboot operation doesn't succeed or if there's an issue with the data-centre. There isn't any way (yet, AFACT) to attach an EBS during the boot process, so they must be attached after boot. Then perhaps you should do zpool import -R / pool *after* you attach EBS. That way Solaris won't automatically try to import the pool and your scripts will do it once disks are available. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is file cloning anywhere on ZFS roadmap
On 21/04/2010 07:41, Schachar Levin wrote: Hi, We are currently using the NetApp file clone option to clone multiple VMs on our FS. The ZFS dedup feature is great storage-space wise, but when we need to clone a lot of VMs it just takes a lot of time. Is there a way (or a planned way) to clone a file without going through the process of actually copying the blocks, but just duplicating its metadata like NetApp does?

I don't know about file cloning, but why not put each VM on top of a zvol - then you can clone the zvol?

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
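A rough sketch of that approach (sizes and names made up): prepare one golden image on a zvol, snapshot it, then clone the snapshot once per VM - clones share blocks with their origin, so they are created in seconds and take almost no additional space up front:

# zfs create -V 20G tank/vm-gold
(install and prepare the golden VM image on /dev/zvol/rdsk/tank/vm-gold)
# zfs snapshot tank/vm-gold@golden
# zfs clone tank/vm-gold@golden tank/vm01
# zfs clone tank/vm-gold@golden tank/vm02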
Re: [zfs-discuss] Double slash in mountpoint
But it suggests that it had nothing to do with a double slash - rather some process (your shell?) had an open file within the mountpoint. But by supplying -f you forced zfs to unmount it anyway.

-- Robert Milkowski http://milek.blogspot.com

On 21/04/2010 06:16, Ryan John wrote: Thanks. That was it -Original Message- From: Brandon High [mailto:bh...@freaks.com] Sent: Wednesday, 21 April 2010 6:57 AM To: Ryan John Cc: zfs-discuss Subject: Re: [zfs-discuss] Double slash in mountpoint On Tue, Apr 20, 2010 at 7:38 PM, Ryan Johnjohn.r...@bsse.ethz.ch wrote: Anyone know how to fix it? I can't even do a zfs destroy zfs unmount -a -f -B
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Benchmarking Methodologies
On 21/04/2010 04:43, Ben Rockwood wrote: I'm doing a little research study on ZFS benchmarking and performance profiling. Like most, I've had my favorite methods, but I'm re-evaluating my choices and trying to be a bit more scientific than I have in the past. To that end, I'm curious if folks wouldn't mind sharing their work on the subject? What tool(s) to you prefer in what situations? Do you have a standard method of running them (tool args; block sizes, thread counts, ...) or procedures between runs (zpool import/export, new dataset creation,...)? etc. Any feedback is appreciated. I want to get a good sampling of opinions.

I haven't heard from you in a while! Good to see you here again :)

Sorry for stating the obvious, but at the end of the day it depends on what your goals are. Are you interested in micro-benchmarks and comparison to other file systems? I think the most relevant filesystem benchmarks for users are ones which benchmark a specific application and present results from the application's point of view. For example, given a workload for Oracle, MySQL, LDAP, ... how quickly does it complete? How much benefit is there from using SSDs? What about other filesystems? Micro-benchmarks are fine but very hard for most users to interpret properly.

Additionally, most benchmarks are almost useless if they are not compared to some other configuration with only the benchmarked component changed. For example, knowing that some MySQL load completes in 1h on ZFS is basically useless. But knowing that on the same HW with Linux/ext3 and under the same load it completes in 2h would be interesting to users. Another interesting thing would be to see the impact of different ZFS settings on benchmark results (recordsize aligned for a database vs. the default, atime off vs. on, lzjb, gzip, SSDs). Also interesting would be a comparison of results with all-default zfs settings against whatever settings gave you the best result.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
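As an illustration of the kind of settings worth varying between runs (names and values are only an example):

# zfs create -o recordsize=8k -o atime=off tank/db
# zfs set compression=lzjb tank/db
# zfs set compression=gzip tank/db
# zpool add tank log c5t0d0
(dedicated slog device, e.g. an SSD)
# zpool add tank cache c5t1d0
(L2ARC device)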
Re: [zfs-discuss] casesensitivity mixed and CIFS
On 14/04/2010 16:04, John wrote: Hello, we set our ZFS filesystems to casesensitivity=mixed when we created them. However, CIFS access to these files is still case sensitive. Here is the configuration:

# zfs get casesensitivity pool003/arch
NAME          PROPERTY         VALUE      SOURCE
pool003/arch  casesensitivity  mixed      -
#

At the pool level it's set as follows:

# zfs get casesensitivity pool003
NAME     PROPERTY         VALUE      SOURCE
pool003  casesensitivity  sensitive  -
#

From a Windows client, accessing \\filer\arch\MYFOLDER\myfile.txt fails, while accessing \\filer\arch\myfolder\myfile.txt works. Any ideas? We are running snv_130.

You are not using the Samba daemon, are you?

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
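Also worth remembering: casesensitivity is a create-time-only property, so it cannot be changed on an existing dataset - if a dataset needs a different value it has to be recreated and the data copied over, roughly (names are just an example):

# zfs create -o casesensitivity=mixed pool003/arch_new
(copy the data across, e.g. with zfs send/recv or rsync, then move the mountpoint and SMB share over to the new dataset)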
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 07/04/2010 13:58, Ragnar Sundblad wrote: Rather: ... >= 19 would be ... if you don't mind losing data written in the ~30 seconds before the crash, you don't have to mirror your log device. For a file server, mail server, etc etc, where things are stored and supposed to be available later, you almost certainly want redundancy on your slog too. (There may be file servers where this doesn't apply, but they are special cases that should not be mentioned in the general documentation.)

While I agree with you, I want to mention that it is all about understanding the risk. In this case, not only does your server have to crash in such a way that data has not been synced (sudden power loss, for example), but there would also have to be some data committed to the slog device(s) which was not yet written to the main pool, and when your server restarts the slog device would have to have completely died as well. Other than that, you are fine even with an unmirrored slog device.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 07/04/2010 15:35, Bob Friesenhahn wrote: On Wed, 7 Apr 2010, Ragnar Sundblad wrote: So the recommendation for zpool < 19 would be *strongly* recommended: Mirror your log device if you care about using your pool. And the recommendation for zpool >= 19 would be ... don't mirror your log device. If you have more than one, just add them both unmirrored. Rather: ... >= 19 would be ... if you don't mind losing data written in the ~30 seconds before the crash, you don't have to mirror your log device. It is also worth pointing out that in normal operation the slog is essentially a write-only device which is only read at boot time. The writes are assumed to work if the device claims success. If the log device fails to read (oops!), then a mirror would be quite useful.

It is only read at boot if there is uncommitted data on it - during normal reboots zfs won't read data from the slog.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Question about large pools
On 02/04/2010 05:45, Roy Sigurd Karlsbakk wrote: Hi all From http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide I read Avoid creating a RAIDZ, RAIDZ-2, RAIDZ-3, or a mirrored configuration with one logical device of 40+ devices. See the sections below for examples of redundant configurations. What do they mean by this? 40+ devices in a single raidz[123] set or 40+ devices in a pool regardless of raidz[123] sets?

It means: try to avoid a single RAID-Z group with 40+ disk drives. Creating several smaller groups in one pool is perfectly fine. So, for example, on x4540 servers try to avoid creating a pool with a single RAID-Z3 group made of 44 disks; rather, create 4 RAID-Z2 groups, each made of 11 disks, all of them in a single pool.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
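A sketch of that layout (device names are illustrative; the remaining groups follow the same pattern on the other controllers):

# zpool create tank \
    raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 \
           c0t6d0 c0t7d0 c0t8d0 c0t9d0 c0t10d0 \
    raidz2 c1t0d0 c1t1d0 ... c1t10d0 \
    raidz2 c2t0d0 c2t1d0 ... c2t10d0 \
    raidz2 c3t0d0 c3t1d0 ... c3t10d0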
Re: [zfs-discuss] To slice, or not to slice
On 03/04/2010 19:24, Tim Cook wrote: On Fri, Apr 2, 2010 at 4:05 PM, Edward Ned Harvey guacam...@nedharvey.com mailto:guacam...@nedharvey.com wrote: Momentarily, I will begin scouring the omniscient interweb for information, but I’d like to know a little bit of what people would say here. The question is to slice, or not to slice, disks before using them in a zpool. One reason to slice comes from recent personal experience. One disk of a mirror dies. Replaced under contract with an identical disk. Same model number, same firmware. Yet when it’s plugged into the system, for an unknown reason, it appears 0.001 Gb smaller than the old disk, and therefore unable to attach and un-degrade the mirror. It seems logical this problem could have been avoided if the device added to the pool originally had been a slice somewhat smaller than the whole physical device. Say, a slice of 28G out of the 29G physical disk. Because later when I get the infinitesimally smaller disk, I can always slice 28G out of it to use as the mirror device. There is some question about performance. Is there any additional overhead caused by using a slice instead of the whole physical device? There is another question about performance. One of my colleagues said he saw some literature on the internet somewhere, saying ZFS behaves differently for slices than it does on physical devices, because it doesn’t assume it has exclusive access to that physical device, and therefore caches or buffers differently … or something like that. Any other pros/cons people can think of? And finally, if anyone has experience doing this, and process recommendations? That is … My next task is to go read documentation again, to refresh my memory from years ago, about the difference between “format,” “partition,” “label,” “fdisk,” because those terms don’t have the same meaning that they do in other OSes… And I don’t know clearly right now, which one(s) I want to do, in order to create the large slice of my disks. Your experience is exactly why I suggested ZFS start doing some right sizing if you will. Chop off a bit from the end of any disk so that we're guaranteed to be able to replace drives from different manufacturers. The excuse being no reason to, Sun drives are always of identical size. If your drives did indeed come from Sun, their response is clearly not true. Regardless, I guess I still think it should be done. Figure out what the greatest variation we've seen from drives that are supposedly of the exact same size, and chop it off the end of every disk. I'm betting it's no more than 1GB, and probably less than that. When we're talking about a 2TB drive, I'm willing to give up a gig to be guaranteed I won't have any issues when it comes time to swap it out. that's what open solaris is doing more or less for some time now. look in the archives of this mailing list for more information. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 02/04/2010 16:04, casper@sun.com wrote: sync() is actually *async* and returning from sync() says nothing about to clarify - in case of ZFS sync() is actually synchronous. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 01/04/2010 13:01, Edward Ned Harvey wrote: Is that what sync means in Linux? A sync write is one in which the application blocks until the OS acks that the write has been committed to disk. An async write is given to the OS, and the OS is permitted to buffer the write to disk at its own discretion. Meaning the async write function call returns sooner, and the application is free to continue doing other stuff, including issuing more writes. Async writes are faster from the point of view of the application. But sync writes are done by applications which need to satisfy a race condition for the sake of internal consistency. Applications which need to know their next commands will not begin until after the previous sync write was committed to disk. ROTFL!!! I think you should explain it even further for Casper :) :) :) :) :) :) :) -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 01/04/2010 20:58, Jeroen Roodhart wrote: I'm happy to see that it is now the default and I hope this will cause the Linux NFS client implementation to be faster for conforming NFS servers. The interesting thing is that apparently the defaults on Solaris and Linux are chosen such that one can't signal the desired behaviour to the other. At least we didn't manage to get a Linux client to asynchronously mount a Solaris (ZFS backed) NFS export...

Which is to be expected, as it is not the nfs client which requests this behavior but rather the nfs server. Currently on Linux you can export a share as sync (the default) or async, while on Solaris you can't really force the NFS server to start working in an async mode.

-- Robert Milkowski http://milek.blogspot.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
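For comparison, the Linux-side knob lives in /etc/exports, e.g. (paths and options are only an example):

/export/data  *(rw,async,no_subtree_check)
/export/logs  *(rw,sync,no_subtree_check)

With async the Linux server acknowledges writes before they reach stable storage, which is roughly the behaviour you get from a ZFS-backed NFS server with the ZIL disabled.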
Re: [zfs-discuss] can't destroy snapshot
On 01/04/2010 15:24, Richard Elling wrote: On Mar 31, 2010, at 7:57 PM, Charles Hedrick wrote: So that eliminates one of my concerns. However, the other one is still an issue. Presumably Solaris Cluster shouldn't import a pool that's still active on the other system. We'll be looking more carefully into that.

Older releases of Solaris Cluster used SCSI reservations to help prevent such things. However, that is now tunable :-( Did you tune it?

SCSI reservations are used only if a node has left the cluster. So, for example, in a two-node cluster, while both nodes are part of the cluster both of them have full access to the shared storage, and you can force a zpool import on both nodes at the same time. When you think about it, you actually need such behaviour for RAC to work on raw devices or real cluster volumes or filesystems, etc.

-- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] can't destroy snapshot
On 01/04/2010 02:01, Charles Hedrick wrote: So we tried recreating the pool and sending the data again. 1) Compression wasn't set on the copy, even though I did send -R, which is supposed to send all properties. 2) I tried killing the send | receive pipe. Receive couldn't be killed. It hung. 3) This is Solaris Cluster. We tried forcing a failover. The pool mounted on the other server without dismounting on the first. zpool list showed it mounted on both machines. zpool iostat showed I/O actually occurring on both systems. Altogether this does not give me a good feeling about ZFS. I'm hoping the problem is just with receive and Cluster, and that it works properly on a single system, because I'm running a critical database on ZFS on another system.

1. You shouldn't allow a pool to be imported on more than one node at a time; if you do, you will probably lose the entire pool.

2. If you have a pool under cluster control and you want to import it manually, make sure you do it in this order:
- disable the HAStoragePlus resource which manages the pool
- suspend the resource group so the cluster won't start the storage resource under any circumstances
- manually import the pool and do whatever you need to do with it; however, to be on the safe side, import it with the -R / option so that if your node reboots for some reason the pool won't be automatically imported
- after you are done, make sure you export the pool, resume the resource group and re-enable the storage resource

The other approach is to keep the pool under cluster management but temporarily suspend the resource group so there won't be any unexpected failovers (it really depends on the circumstances and what you are trying to do).

-- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
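A rough sketch of that manual-import sequence using the Solaris Cluster and ZFS CLIs. The resource, group and pool names are placeholders, so adapt them to your configuration:

    clresource disable hasp-rs        # stop the HAStoragePlus resource managing the pool
    clresourcegroup suspend data-rg   # no automatic starts/failovers while you work
    zpool import -R / mypool          # altroot keeps the pool from auto-importing after a reboot
    # ... do the maintenance work ...
    zpool export mypool
    clresourcegroup resume data-rg
    clresource enable hasp-rs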
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Wed, Mar 31, 2010 at 1:00 AM, Karsten Weiss:

Use something other than Open/Solaris with ZFS as an NFS server? :) I don't think you'll find the performance you paid for with ZFS and Solaris at this time. I've been trying for more than a year, and watching dozens, if not hundreds, of threads. Getting halfway decent performance from NFS and ZFS is impossible unless you disable the ZIL.

Well, for lots of environments disabling the ZIL is perfectly acceptable. And frankly, the reason you get better performance out of the box on Linux as an NFS server is that it behaves as if the ZIL were disabled - so disabling the ZIL on ZFS for NFS shares is no worse than using Linux here, or any other OS which behaves in the same manner. Actually it makes things better: even with the ZIL disabled, the ZFS filesystem is always consistent on disk, and you still get all the other benefits of ZFS.

What would be useful, though, is to be able to easily disable the ZIL per dataset instead of via an OS-wide switch. This feature has already been coded and tested and awaits a formal process to be completed in order to get integrated. That should happen sooner rather than later.

You'd be better off getting NetApp

Well, spend some extra money on a really fast NVRAM solution for the ZIL and you will get a much faster ZFS environment than NetApp, and still spend much less money. Not to mention all the extra flexibility compared to NetApp.

-- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
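For the NVRAM/SSD log-device suggestion, the commands are the usual ones; the per-dataset switch mentioned above eventually integrated as the sync property, shown here only as a hedged illustration of where things were headed (pool and dataset names are made up, and the property was not present in builds at the time of this thread):

    zpool add tank log c4t0d0          # dedicate a fast SSD/NVRAM device as the slog
    zpool status tank                  # the device appears under a separate "logs" section

    # the per-dataset control that later shipped:
    zfs set sync=disabled tank/nfs_scratch
    zfs set sync=standard tank/nfs_scratch   # restore normal semantics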
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Just to make sure you know ... if you disable the ZIL altogether, and you have a power interruption, failed CPU, or kernel halt, then you're likely to have a corrupt unusable zpool, or at least data corruption. If that is indeed acceptable to you, go nuts. ;-)

I believe that the above is wrong information, as long as the devices involved do flush their caches when requested to. ZFS still writes data in order (at the TXG level) and advances to the next transaction group when the devices written to affirm that they have flushed their caches. Without the ZIL, data claimed to be synchronously written since the previous transaction group may be entirely lost. If the devices don't flush their caches appropriately, the ZIL is irrelevant to pool corruption.

I stand corrected. You don't lose your pool. You don't have a corrupted filesystem. But you lose whatever writes were not yet completed, so if those writes happen to be things like database transactions, you could have corrupted databases or files, or missing files if you were creating them at the time, and stuff like that. AKA data corruption. But not pool corruption, and not filesystem corruption.

Which is expected behaviour when you break NFS requirements, as Linux does out of the box. Disabling the ZIL on an NFS server makes it no worse than the standard Linux behaviour: you get decent performance at the cost of some data possibly being corrupted from the NFS client's point of view. But then there are environments where this is perfectly acceptable, because you are not running critical databases there but rather, say, user home directories, and ZFS currently flushes a transaction group after at most 30s, so users can't lose more than the last 30s of work if the NFS server suddenly loses power.

To clarify: if the ZIL is disabled, it makes no difference at all to pool/filesystem-level consistency.

-- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
standard ZIL:         7m40s  (ZFS default)
1x SSD ZIL:           4m07s  (Flash Accelerator F20)
2x SSD ZIL:           2m42s  (Flash Accelerator F20)
2x SSD mirrored ZIL:  3m59s  (Flash Accelerator F20)
3x SSD ZIL:           2m47s  (Flash Accelerator F20)
4x SSD ZIL:           2m57s  (Flash Accelerator F20)
disabled ZIL:         0m15s  (local extraction 0m0.269s)

I was not so much interested in the absolute numbers but rather in the relative performance differences between the standard ZIL, the SSD ZIL and the disabled ZIL cases.

Oh, one more comment. If you don't mirror your ZIL, and your unmirrored SSD goes bad, you lose your whole pool. Or at least suffer data corruption.

This is not true. If the ZIL device dies while the pool is imported, ZFS starts using a ZIL within the pool and continues to operate. On the other hand, if your server suddenly loses power and, when you power it up later, ZFS detects that the ZIL device is broken/gone, it will require sysadmin intervention to force the pool import and, yes, you may lose some data. But how is that different from any other solution where your log is put on a separate device? Well, it actually is different: with ZFS you can still guarantee that the pool is consistent on disk, while others generally can't, and you will often have to run fsck just to mount a filesystem read/write...

-- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
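If losing a separate log device to a power failure is a concern, adding it as a mirror is straightforward. A small sketch with made-up device names:

    zpool add tank log mirror c4t0d0 c4t1d0   # mirrored slog from the start
    zpool status tank                         # both devices listed under "logs"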
Re: [zfs-discuss] Simultaneous failure recovery
I have a pool (on an X4540 running S10U8) in which a disk failed, and the hot spare kicked in. That's perfect. I'm happy. Then a second disk fails. Now, I've replaced the first failed disk, it's resilvered, and I have my hot spare back. But: why hasn't it used the spare to cover the other failed drive? And can I hot-spare it manually? I could do a straight replace, but that isn't quite the same thing.

It seems like it is event-driven. Hmmm... perhaps it shouldn't be. Anyway, you can do a zpool replace and it is the same thing - why wouldn't it be?

-- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Simultaneous failure recovery
On Tue, Mar 30, 2010 at 10:42 PM, Eric Schrock eric.schr...@oracle.com wrote: On Mar 30, 2010, at 5:39 PM, Peter Tribble wrote: I have a pool (on an X4540 running S10U8) in which a disk failed, and the hot spare kicked in. That's perfect. I'm happy. Then a second disk fails. Now, I've replaced the first failed disk, and it's resilvered and I have my hot spare back. But: why hasn't it used the spare to cover the other failed drive? And can I hotspare it manually? I could do a straight replace, but that isn't quite the same thing.

Hot spares are only activated in response to a fault received by the zfs-retire FMA agent. There is no notion that the spares should be re-evaluated when they become available at a later point in time. Certainly a reasonable RFE, but not something ZFS does today.

Definitely an RFE I would like.

You can 'zpool attach' the spare like a normal device - that's all that the retire agent is doing under the hood.

So, given:

        NAME        STATE     READ WRITE CKSUM
        images      DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            c2t0d0  FAULTED      4     0     0  too many errors
            c3t0d0  ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c5t0d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
        spares
          c5t7d0    AVAIL

then it would be this? zpool attach images c2t0d0 c5t7d0

Which I had considered, but the man page for attach says "The existing device cannot be part of a raidz configuration." If I try that, it fails, saying:

invalid vdev specification
use '-f' to override the following errors:
/dev/dsk/c5t7d0s0 is reserved as a hot spare for ZFS pool images. Please see zpool(1M).

Thanks!

You need to use zpool replace. Once you fix the failed drive and it re-synchronizes, the hot spare will detach automatically (regardless of whether you forced it to kick in via zpool replace or it did so due to FMA). For more details see http://blogs.sun.com/eschrock/entry/zfs_hot_spares

-- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
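In other words, for the degraded raidz1 shown above, the manual activation and the later cleanup would look roughly like this (a sketch; the spare detaches on its own once the original slot is healthy again):

    zpool replace images c2t0d0 c5t7d0   # press the hot spare into service by hand
    # after physically swapping a good disk into the c2t0d0 slot:
    zpool replace images c2t0d0          # resilver the new disk; the spare detaches afterwards
    zpool status images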
Re: [zfs-discuss] bit-flipping in RAM...
On 31/03/2010 10:27, Erik Trimble wrote: Orvar's post over in opensol-discuss has me thinking: after reading the paper and looking at the design docs, I'm wondering if there is some facility for comparing data in the ARC to its corresponding checksum. That is, if I've got the data I want in the ARC, how can I be sure it's correct (and free of hardware memory errors)? I'd assume the way is to also store absolutely all the checksums for all blocks/metadata being read/written in the ARC (which, of course, means that only so much RAM corruption can be compensated for), and do a validation every time that block is used/written from the ARC. You'd likely have to do constant metadata consistency checking, and likely have to hold multiple copies of metadata in-ARC to compensate for possible corruption. I'm assuming that this has at least been explored, right?

A subset of this is already done. The ARC keeps its own in-memory checksum (because some buffers in the ARC are not yet on stable storage and so don't have a block-pointer checksum yet). See http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c - arc_buf_freeze(), arc_buf_thaw(), arc_cksum_verify(), arc_cksum_compute(). It isn't done on every access, but it can detect in-memory corruption - I've seen it happen on several occasions, though all due to errors in my code rather than bad physical memory. Doing it more frequently could cause a significant performance problem.

Or there might be an extra zpool-level (or system-wide) property to enable checking checksums on every access from the ARC - there would be a significant performance impact, but it might be acceptable for really paranoid folks, especially with modern hardware.

-- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 31/03/2010 17:31, Bob Friesenhahn wrote: On Wed, 31 Mar 2010, Edward Ned Harvey wrote: Would your users be concerned if there was a possibility that, after extracting a 50 MB tarball, files are incomplete, whole subdirectories are missing, or file permissions are incorrect?

Correction: would your users be concerned if there was a possibility that after extracting a 50 MB tarball *and having a server crash* the files could be corrupted as described above? If you disable the ZIL, the filesystem still stays correct in RAM, and the only way you lose any data such as you've described is to have an ungraceful power-down or reboot.

Yes, of course. Suppose that you are a system administrator. The server spontaneously reboots. A corporate VP (CFO) comes to you and says that he had just saved the critical presentation to be given to the board of the company (and all shareholders) later that day, and now it is gone due to your spontaneous server reboot. Due to a delayed financial statement, the corporate stock plummets. What are you to do? Do you expect that your employment will continue? Reliable NFS synchronous writes are good for system administrators.

Well, it really depends on your environment. There is a place for an Oracle database and there is a place for MySQL; you don't really need to cluster everything, and there are environments where disabling the ZIL is perfectly acceptable. One such case is when you need to re-import a database or recover lots of files over NFS - your service is down anyway, and disabling the ZIL makes the recovery MUCH faster. Then there are cases where leaving the ZIL disabled is acceptable as well.

-- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
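For completeness, the OS-wide switch under discussion in this era was the zil_disable tunable; shown here only as a hedged sketch of the bulk-restore trick, not as a recommended permanent setting:

    # /etc/system -- affects every pool and dataset on the host and takes
    # effect at the next boot; remove it again once the restore is done
    set zfs:zil_disable = 1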
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 31/03/2010 17:22, Edward Ned Harvey wrote: The advice I would give is: do zfs autosnapshots frequently (say ... every 5 minutes, keeping the most recent 2 hours of snaps) and then run with no ZIL. If you have an ungraceful shutdown or reboot, roll back to the latest snapshot ... and roll back once more for good measure. As long as you can afford to risk 5-10 minutes of the most recent work after a crash, then you can get a 10x performance boost most of the time, and no risk of the aforementioned data corruption.

I don't really get it - rolling back to the last snapshot doesn't improve anything here; it actually makes things worse, as now you are going to lose even more data. Keep in mind that currently the maximum time after which ZFS commits a transaction group is 30s - ZIL or not. So with the ZIL disabled, in the worst case you should lose no more than the last 30-60s. You can tune that down if you want. Rolling back to a snapshot will only make it worse.

Also keep in mind that this is the worst-case scenario - it may well be that there were no outstanding transactions at all. Basically it all comes down to a risk assessment, an impact assessment and a cost. Unless you are talking about taking regular snapshots and making sure the application is consistent while doing so - for example, putting all Oracle tablespaces into hot backup mode and taking a snapshot - otherwise it doesn't really make sense.

-- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
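A sketch of the application-consistent variant alluded to at the end, plus the transaction-group tunable mentioned above. The dataset name, snapshot name and tunable value are illustrative only, and the Oracle statements are assumptions about a typical hot-backup workflow rather than anything prescribed in this thread:

    # put the database into hot backup mode first (from sqlplus):
    #   ALTER DATABASE BEGIN BACKUP;
    zfs snapshot tank/oradata@pre-change
    #   ALTER DATABASE END BACKUP;

    # later, if you need to fall back to that point in time:
    zfs rollback tank/oradata@pre-change

    # the transaction-group commit interval (seconds) can be lowered in
    # /etc/system if you want to shrink the exposure window:
    #   set zfs:zfs_txg_timeout = 5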