Re: [zfs-discuss] Freeing unused space in thin provisioned zvols

2013-02-12 Thread Casper . Dik


No tools; ZFS does it automatically when freeing blocks, provided the 
underlying device advertises the functionality.

ZFS ZVOLs shared over COMSTAR advertise SCSI UNMAP as well.


If a system was running something older (e.g., Solaris 11), the free 
blocks will not be marked as such on the server, even after the system 
upgrades to Solaris 11.1.

There might be a way to force that by disabling compression, creating a 
large file full of NULs and then removing it.  But check first that this 
actually has an effect before you rely on it.
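
A minimal sketch of that zero-fill trick, run on the client that uses the
zvol (the dataset name and paths are examples only, and you should verify
on a small test zvol that it has any effect at all):

    zfs set compression=off clientpool/fs    # so the NULs are actually written out
    dd if=/dev/zero of=/clientpool/fs/zerofill bs=1M
    sync
    rm /clientpool/fs/zerofill
    zfs inherit compression clientpool/fs    # restore the previous setting

dd will stop with a "no space left on device" error; that is expected.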

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-23 Thread Casper . Dik

IIRC dump is special.

As for swap... really, you don't want to swap.  If you're swapping you
have problems.  Any swap space you have is to help you detect those
problems and correct them before apps start getting ENOMEM.  There
*are* exceptions to this, such as Varnish.  For Varnish and any other
apps like it I'd dedicate an entire flash drive to it, no ZFS, no
nothing.

Yes and no: the system reserves a lot of additional memory (Solaris 
doesn't over-commit swap) and swap is needed to support those 
reservations.  Also, some pages are dirtied early on and never touched 
again; those pages should not be kept in memory.

But continuously swapping is clearly a sign of a system too small for its 
job.

Of course, compressing and/or encrypting swap has interesting issues: 
freeing memory by swapping pages out then requires even more memory.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-23 Thread Casper . Dik

On 01/22/2013 10:50 PM, Gary Mills wrote:
 On Tue, Jan 22, 2013 at 11:54:53PM +, Edward Ned Harvey 
 (opensolarisisdeadlongliveopensolaris) wrote:
 Paging out unused portions of an executing process from real memory to
 the swap device is certainly beneficial. Swapping out complete
 processes is a desperation move, but paging out most of an idle
 process is a good thing. 

It gets even better.  Executables become part of the swap space via
mmap, so that if you have a lot of copies of the same process running in
memory, the executable bits don't waste any more space (well, unless you
use the sticky bit, although that might be deprecated, or if you copy
the binary elsewhere.)  There's lots of awesome fun optimizations in
UNIX. :)

The sticky bit has never been used in that way in SunOS for as long
as I can remember (SunOS 3.x) and probably before that.  It no longer makes 
sense for demand-paged executables.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Casper . Dik

On 01/22/2013 02:39 PM, Darren J Moffat wrote:
 
 On 01/22/13 13:29, Darren J Moffat wrote:
 Since I'm replying here are a few others that have been introduced in
 Solaris 11 or 11.1.
 
 and another one I can't believe I missed since I was one of the people
 that helped design it and I did codereview...
 
 Per-file sensitivity labels for TX (Trusted Extensions) configurations.

Can you give some details on that? Google searches are turning up pretty dry.


Start here:

http://docs.oracle.com/cd/E26502_01/html/E29017/managefiles-1.html#scrolltoc


Look for multilevel datasets.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Casper . Dik


Some vendors call this (and things like it) Thin Provisioning; I'd say 
it is more accurately described as communication between 'disk' and 
filesystem about in-use blocks.

In some cases, users of disks are charged by bytes in use; when not using
SCSI UNMAP, a set of disks used for a zpool will in the end be charged for
the whole reservation; this becomes costly when your standard usage is 
much less than your peak usage.

Thin provisioning can now be used for zpools as long as the underlying 
LUNs have support for SCSI UNMAP.


Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] deleting a link in ZFS

2012-08-29 Thread Casper . Dik

On 12-08-29 12:29 AM, Gregg Wonderly wrote:
 On Aug 28, 2012, at 6:01 AM, Murray Cullen themurma...@gmail.com wrote:

 I've copied an old home directory from an install of OS 134 to the data 
 pool on my OI install.  OpenSolaris apparently had wine installed, as I now 
 have a link to / in my data pool. I've tried everything I can think of to 
 remove this link, with one exception: I have not tried mounting the pool on 
 a different OS yet; I'm trying to avoid that.

 Does anyone have any advice or suggestions? Ulink and rm error out as root.
 What is the error?  Is it permission denied, I/O error, or what?

 Gregg
The error is unlink, not owner although I am the owner.


What exactly is the file?  In zfs you cannot create a link to a directory;
so what does the link look like?


Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Casper . Dik


You do realize that the age of the universe is only on the order of
around 10^18 seconds, do you? Even if you had a trillion CPUs each
chugging along at 3.0 GHz for all this time, the number of processor
cycles you will have executed cumulatively is only on the order 10^40,
still 37 orders of magnitude lower than the chance for a random hash
collision.



Suppose you find a weakness in a specific hash algorithm; you use this
to create hash collisions, and now imagine you store those hash collisions 
in a zfs dataset with dedup enabled using the same hash algorithm.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Casper . Dik

Sorry, but isn't this what dedup=verify solves? I don't see the problem here.
Maybe all that's needed is a comment in the manpage saying hash algorithms
aren't perfect.

The point is that hash functions are many-to-one, and I think the question
was whether verify is really needed if the hash function is good
enough.

Casper
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Casper . Dik

On 07/11/2012 12:24 PM, Justin Stringfellow wrote:
 Suppose you find a weakness in a specific hash algorithm; you use this
 to create hash collisions and now imagined you store the hash collisions 
 in a zfs dataset with dedup enabled using the same hash algorithm.
 
 Sorry, but isn't this what dedup=verify solves? I don't see the problem 
 here. Maybe all that's needed is a comment in the manpage saying hash 
 algorithms aren't perfect.

It does solve it, but at a cost to normal operation. Every write gets
turned into a read. Assuming a big enough and reasonably busy dataset,
this leads to tremendous write amplification.


If and only if the block is being dedup'ed.  In that case, you're just
changing the write of a whole block into one read (of the block) and an 
update of the dedup table entry; the whole block isn't written again.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Casper . Dik


This assumes you have low volumes of deduplicated data. As your dedup
ratio grows, so does the performance hit from dedup=verify. At, say,
dedupratio=10.0x, on average, every write results in 10 reads.

I don't follow.

If dedupratio == 10, it means that each item is *referenced* 10 times
but it is only stored *once*.  Only when you have hash collisions would 
multiple reads be needed.

Only one read is needed, except in the case of hash collisions.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Casper . Dik

On Tue, 10 Jul 2012, Edward Ned Harvey wrote:

 CPU's are not getting much faster.  But IO is definitely getting faster.  
 It's best to keep ahead of that curve.

It seems that per-socket CPU performance is doubling every year. 
That seems like faster to me.

I think that I/O isn't getting as fast as CPU is; memory capacity and
bandwidth and CPUs are getting faster.  I/O, not so much.
(Apart from the one single step from hard disk to SSD; but note that
I/O is limited to standard interfaces and as such it is likely to be
held back by the need for a new standard.)

Casper
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Casper . Dik

Unfortunately, the government imagines that people are using their home
computers to compute hashes and try and decrypt stuff.  Look at what is
happening with GPUs these days.  People are hooking up 4 GPUs in their
computers and getting huge performance gains.  5-6 char password space
covered in a few days.  12 or so chars would take one machine a couple of
years if I recall.  So, if we had 20 people with that class of machine,
we'd be down to a few months.  I'm just suggesting that while the compute
space is still huge, it's not actually undoable, it just requires some
thought into how to approach the problem, and then some time to do the
computations.

Huge space, but still finite…

Dan Brown seems to think so in Digital Fortress but it just means he 
has no grasp on big numbers.

2^128 is a huge space, finite *but* beyond brute force *forever*.

Considering that we have nearly 10 billion people: if you give each of them
1 billion computers, each able to compute 1 billion checks per second, how 
many years does it take before we get the solution?

Did you realize that that number is roughly *twice* the number of years 
needed for a *single* computer with the same specification to solve the 
same problem for 64 bits?
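
To make the arithmetic concrete (rounding freely; the exact figures below
are mine, added for illustration):

    10^10 people x 10^9 computers x 10^9 checks/s = 10^28 checks/s
    2^128 / 10^28 ~ 3.4*10^38 / 10^28 = 3.4*10^10 seconds ~ 1,100 years
    2^64  / 10^9  ~ 1.8*10^19 / 10^9  = 1.8*10^10 seconds ~ 580 years

So the whole planet, absurdly over-equipped, still needs on the order of a
thousand years for 128 bits, roughly twice what one such computer needs for
64 bits.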

There are two reasons for finding a new hash algorithm:
- a faster one on current hardware
- a better one with a larger output

But brute force is not what we are defending against: we're trying to 
defend against bugs in the hash algorithm.  In the case of md5 and the 
related hash algorithms, a new attack method was discovered and it made 
many hash algorithms obsolete/broken.

When an algorithm is broken, the work factor needed for a successful 
attack depends in part on the hash; e.g., you may be left with 64 bits
of effective hash, and that would be brute-forceable.


Casper


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Casper . Dik


Do you need assurances that in the next 5 seconds a meteorite won't fall
to Earth and crush you? No. And yet, the Earth puts on thousands of tons
of weight each year from meteoric bombardment and people have been hit
and killed by them (not to speak of mass extinction events). Nobody has
ever demonstrated being able to produce a hash collision in any
suitably long hash (128 bits plus) using a random search. All hash
collisions have been found by attacking the weaknesses in the
mathematical definition of these functions (i.e. some part of the input
didn't get obfuscated well in the hash function machinery and spilled
over into the result, resulting in a slight, but usable non-randomness).

The reason why we don't protect against such an event is that it would
be extremely expensive, with a very small chance of it ever being needed.

verify doesn't cost much, so even if the risk is as infinitesimal as a 
direct meteorite hit, it may still be cost effective.

(Just like we'd be better off preparing for the climate changing (rather 
cheap) rather than trying to keep the climate from changing (impossible 
and still extremely expensive).)

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Casper . Dik

On Wed, Jul 11, 2012 at 9:48 AM,  casper@oracle.com wrote:
Huge space, but still finite…

 Dan Brown seems to think so in Digital Fortress but it just means he
 has no grasp on big numbers.

I couldn't get past that.  I had to put the book down.  I'm guessing
it was as awful as it threatened to be.


It is *fiction*.  So just read it as if it is magical, like Harry Potter.

It's just that "well researched" in fiction means exactly the same as
"well researched" in journalism: the facts aren't actually facts but could 
pass as facts to those who haven't had a proper science education
(which unfortunately includes a large part of the population and 99.5% of
all politicians).

Casper




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] History of EPERM for unlink() of directories on ZFS?

2012-06-26 Thread Casper . Dik

 To be honest, I think we should also remove this from all other
 filesystems and I think ZFS was created this way because all modern
 filesystems do it that way.

This may be wrong way to go if it breaks existing applications which
rely on this feature. It does break applications in our case.

I don't think this is supported on most Linux filesystems either.

What do you use it for?

Anyway, we've added this to the list of mandatory features and see
what we can procure with that.


Not much?

I'd suggest checking whether you can restructure your code to work without this.

For one, it requires the application to run with root (older versions) or
with specific privileges which aren't, e.g., available in non-global zones.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] History of EPERM for unlink() of directories on ZFS?

2012-06-26 Thread Casper . Dik


We've already asked our Netapp representative. She said it's not hard
to add that.

And symlinks don't work for this?  I'm amazed because we're talking about 
the same file system.   Or is it that the code you have does the 
hardlinking?


If you want this from Oracle, you would need to talk to an Oracle 
representative and not to a mailing list (for illumos, email will work, I 
suppose).

 I'd suggest whether you can restructure your code and work without this.

It would require touching code for which we don't have sources anymore
(people gone, too). It would also require to create hard links to the
results files directly, which means linking 15000+ files per directory
with a minimum of 3 directories. Each day (this is CERN after
all).

I'm assuming then that it is the code for which you don't have the source 
which does the hardlinking?

I'm still not sure why symlinks won't work or for that matter loopback 
mounts.

Casper


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] History of EPERM for unlink() of directories on ZFS?

2012-06-25 Thread Casper . Dik

Does someone know the history which led to the EPERM for unlink() of
directories on ZFS? Why was this done this way, and not something like
allowing the unlink and execute it on the next scrub or remount?


It's not about the unlink(), it's about the link() and unlink().
By not allowing link and unlink of directories, you force the filesystem to 
contain only trees and not graphs.

Allowing it also lets you create directories where .. points to a directory 
whose inode cannot be found, simply because it was just removed.

The support for link() on directories in ufs has always given issues
and would create problems fsck couldn't fix.

To be honest, I think we should also remove this from all other
filesystems and I think ZFS was created this way because all modern
filesystems do it that way.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [developer] History of EPERM for unlink() of directories on ZFS?

2012-06-25 Thread Casper . Dik

The decision to not support link(2) of directories was very deliberate - it
is an abomination that never should have been allowed in the first place.
My guess is that the behavior of unlink(2) on directories is a direct
side-effect of that (if link isn't supported, then why support unlink?).
Also worth noting that ZFS also doesn't let you open(2) directories and
read(2) from them, something (I believe) UFS does allow.

In the very beginning, mkdir(1) was a set-uid application; it used
mknod to make a directory and then created a link from newdir to newdir/.
and from . to newdir/..

Traditionally, this was only allowed for the superuser, and when
we added privileges a specific privilege was added for it.

I think we should remove it for the other filesystems.

Casper
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Advanced Format HDD's - are we there yet? (or - how to buy a drive that won't be teh sux0rs on zfs)

2012-05-29 Thread Casper . Dik


The drives were the seagate green barracuda IIRC, and performance for 
just about everything was 20MB/s per spindle or worse, when it should 
have been closer to 100MB/s when streaming. Things were worse still when 
doing random...

It is possible that your partitions weren't aligned at 4K and that will 
give serious issues with those drives (Solaris now tries to make sure that 
all partitions are on 4K boundaries, or makes sure that the zpool dev_t is 
aligned to 4K).

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Migration of a Thumper to bigger HDDs

2012-05-15 Thread Casper . Dik

Hello all, I'd like some practical advice on migration of a
Sun Fire X4500 (Thumper) from aging data disks to a set of
newer disks. Some questions below are my own, others are
passed from the customer and I may consider not all of them
sane - but must ask anyway ;)

1) They hope to use 3Tb disks, and hotplugged an Ultrastar 3Tb
for testing. However, the system only sees it as a 802Gb
device, via Solaris format/fdisk as well as via parted [1].
Is that a limitation of the Marvell controller, disk,
the current OS (snv_117)? Would it be cleared by a reboot
and proper disk detection on POST (I'll test tonight) or
these big disks won't work in X4500, period?



Your old release of Solaris (nearly three years old) doesn't support
disks over 2TB, I would think.

(A 3TB disk is 3E12 bytes, the 2TB limit is 2^41 bytes, and the difference 
is around 800GB, which matches the 802GB the system reports.)

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Accessing Data from a detached device.

2012-03-30 Thread Casper . Dik

Hi,
As an addendum to this, I'm curious about how to grow the split pool in 
size.

Scenario, mirrored pool comprising of two disks, one 200GB and the other 
300GB, naturally the size of the mirrored pool is 200GB, i.e. the smaller 
of the two devices.

I ran some tests within vbox env and I'm curious why after a zpool split 
one of the pools does not increase in size to 300gb, yet for some reason 
both pools remain at 200gb even if I export/import them. Sizes are  
reported via zpool list.

I checked the label, both disks have a single EFI partition consuming 
100% of each disk. and format/partition shows slice 0 on both disks also 
consuming the entire disk respectively.

So how does one force the pool with the larger disk to increase in size ?


What is the autoexpand setting (I think it is off by default)?


zpool get autoexpand splitted-pool
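
If it is off, something like the following should let the pool grow into 
the larger disk (the device name here is just an example):

    zpool set autoexpand=on splitted-pool
    zpool online -e splitted-pool c0t1d0
    zpool list splitted-pool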


Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Accessing Data from a detached device.

2012-03-29 Thread Casper . Dik

Is it possible to access the data from a detached device from a 
mirrored pool?

If it is detached, I don't think there is a way to get access
to the mirror.  Had you used split, you should be able to reimport it.

(You can try aiming zpool import at the disk but I'm not hopeful)

Casper
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] test for holes in a file?

2012-03-27 Thread Casper . Dik

On Mon, 26 Mar 2012, Andrew Gabriel wrote:

 I just played and knocked this up (note the stunning lack of comments, 
 missing optarg processing, etc)...
 Give it a list of files to check...

This is a cool program, but programmers were asking (and answering) 
this same question 20+ years ago before there was anything like 
SEEK_HOLE.

If file space usage is less than file directory size then it must 
contain a hole.  Even for compressed files, I am pretty sure that 
Solaris reports the uncompressed space usage.


Unfortunately not true with filesystems which compress data.
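
For the record, a rough shell rendering of that pre-SEEK_HOLE heuristic
(this is my sketch; it assumes ls -s reports 512-byte blocks, as Solaris ls
does, and as noted above it is fooled by compression):

    f=somefile                              # example file name
    blocks=`ls -s "$f" | awk '{print $1}'`  # allocated 512-byte blocks
    size=`ls -l "$f" | awk '{print $5}'`    # logical size in bytes
    if [ `expr $blocks \* 512` -lt "$size" ]; then
            echo "$f: probably sparse (or compressed)"
    fi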

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)

2012-01-08 Thread Casper . Dik


If the performance of the outer tracks is better than the performance of the
inner tracks due to limitations of magnetic density or rotation speed (not
being limited by the head speed or bus speed), then the sequential
performance of the drive should increase as a square function, going toward
the outer tracks.  c = pi * r^2

Decrease because the outer tracks are the lower numbered tracks; they
have the same density but they are larger.

So, small variations of sequential performance are possible, jumping from
track to track, but based on what I've seen, the maximum performance
difference from the absolute slowest track to the absolute fastest track
(which may or may not have any relation to inner vs outer) ... maximum
variation on-par with 10% performance difference.  Not a square function.

I've noticed a change of 50% in speed or more between the lower and the
higher numbered tracks (60MB/s down to 30MB/s).

In benchmark land, they short-stroke disks for better performance;
I believe the Pillar boxes do similar tricks under the covers (if you want 
more performance, it gives you the faster tracks).

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] bug moving files between two zfs filesystems (too many open files)

2011-11-29 Thread Casper . Dik

I think the too many open files is a generic error message about 
running out of file descriptors. You should check your shell ulimit
information.


Yeah, but mv shouldn't run out of file descriptors, or it should be
able to handle that.

Are we moving a tree of files?

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] weird bug with Seagate 3TB USB3 drive

2011-10-13 Thread Casper . Dik

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Cindy Swearingen
 
 In the steps below, you're missing a zpool import step.
 I would like to see the error message when the zpool import
 step fails.

I see him doing this...


  # truss -t open zpool import foo

The following lines are informative, sort of.


  /8: openat64(6, c1t0d0s0, O_RDONLY)   = 7
  /4: openat64(6, c1t0d0s2, O_RDONLY)   Err#5 EIO

And the output result is:


  cannot import 'foo': no such pool available


What is the partition table?

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] weird bug with Seagate 3TB USB3 drive

2011-10-13 Thread Casper . Dik

 From: casper@oracle.com [mailto:casper@oracle.com]
 
 What is the partition table?

He also said this...


 -Original Message-
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of John D Groenveld

 # zpool create foo c1t0d0

Which, to me, suggests no partition table.

An EFI partition table (there needs to be some form of label so there
is always a partition table).

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD vs hybrid drive - any advice?

2011-07-27 Thread Casper . Dik


ZFS never does update-in-place and UFS only does update-in-place for
metadata and where the application forces update-in-place. 

ufs always updates in place (it will rewrite earlier allocated blocks).
The only time it doesn't is when the file is growing and it may move
stuff around (when the end of the file is a fragment and it needs to grow).

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD vs hybrid drive - any advice?

2011-07-26 Thread Casper . Dik


Bullshit. I just got a OCZ Vertex 3, and the first fill was 450-500MB/s.
Second and sequent fills are at half that speed. I'm quite confident
that it's due to the flash erase cycle that's needed, and if stuff can
be TRIM:ed (and thus flash erased as well), speed would be regained.
Overwriting an previously used block requires a flash erase, and if that
can be done in the background when the timing is not critical instead of
just before you can actually write the block you want, performance will
increase.

I think TRIM is needed both for flash (for speed) and for
thin provisioning; ZFS will dirty all of the volume even though only a 
small part of the volume is used at any particular time.  That makes ZFS 
more or less unusable with thin provisioning; support for TRIM would fix 
that if the underlying volume management supports TRIM.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD vs hybrid drive - any advice?

2011-07-26 Thread Casper . Dik


Shouldn't modern SSD controllers be smart enough already that they know:
- if there's a request to overwrite a sector, then the old data on
that sector is no longer needed
- allocate a clean sector from pool of available sectors (part of
wear-leveling mechanism)
- clear the old sector, and add it to the pool (possibly done in
background operation)

It seems to be the case with sandforce-based SSDs. That would pretty
much let the SSD work just fine even without TRIM (like when used
under HW raid).


That is possibly not sufficient.  If ZFS writes bytes to every sector, 
even though the pool is not full, the controller cannot know where to
reclaim the data.  If it uses spare sectors then it can map them to the 
new data and add the overwritten sectors to the free pool.

With TRIM, it gets more blocks to reuse and it gives more time to
erase them, making the SSD faster.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How about 4KB disk sectors?

2011-07-13 Thread Casper . Dik

So, what is the story about 4KB disk sectors? Should such disks be avoided 
with ZFS? Or, no problem? Or, need to modify some config file before usage?


The issue is mostly with disks that are 4K internally but present 512-byte 
sectors; for those you must make sure that all the partitions are on a 4K 
boundary.  If the disk advertises a 4K sector size, then there is no issue.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How about 4KB disk sectors?

2011-07-13 Thread Casper . Dik

On Wed, Jul 13, 2011 at 7:14 AM,  casper@oracle.com wrote:

 The issue is mostly with disks that are 4K internally but present 512-byte
 sectors; for those you must make sure that all the partitions are on a 4K
 boundary.  If the disk advertises a 4K sector size, then there is no issue.

So if you hand the entire drive to ZFS you should be OK ?

[Not applicable to the root zpool, will the OS installation utility do
the right thing ?]

I think that depends on the version of ZFS/Solaris.  I remember there were 
some issues even when you handed the whole disk.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Encryption accelerator card recommendations.

2011-06-29 Thread Casper . Dik

On Jun 27, 2011, at 17:16, Erik Trimble wrote:

 Think about how things were done with the i386 and i387.  That's what I'm
 after.  With modern CPU buses like AMD and Intel support, plopping a
 co-processor into another CPU socket would really, really help.

Given the amount of transistors that are available nowadays I think it'd be
simpler to just create a series of SIMD instructions right in/on general
CPUs, and skip the whole co-processor angle.

One of the VIA processors was one of the first with specific random and 
AES instructions.  AMD and Intel have followed suit and you can find some 
information here:

http://en.wikipedia.org/wiki/AES_instruction_set

(Similar instructions have been added for SHA, MD5 (older CPUs) and RSA,
though typically as building blocks rather than a single long-running 
instruction.)

A number of the crypto accelerators were much slower than a direct 
implementation in opcodes; one issue, though, is which register set will 
be used and where it will be saved when the thread is preempted.  (I'm 
assuming that the reason why AMD and Intel use different instructions 
from VIA is possibly because of such details.)

The current T3 implementation uses a co-processor (one per core, I 
think).

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] question about COW and snapshots

2011-06-16 Thread Casper . Dik

Op 15-06-11 05:56, Richard Elling schreef:
 You can even have applications like databases make snapshots when
 they want.

Makes me think of a backup utility called mylvmbackup, which is written
with Linux in mind - basically it locks mysql tables, takes an LVM
snapshot and releases the lock (and then you backup the database files
from the snapshot). Should work at least as well with ZFS.

If a database engine or another application keeps both the data and the
log in the same filesystem, a snapshot wouldn't create inconsistent data.
(I think this would be true for vim and a large number of database 
engines; vim will detect the swap file, and the database should be able to 
detect the inconsistency, roll back and re-apply the log file.)

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS, Oracle and Nexenta

2011-05-25 Thread Casper . Dik

However, do remember that you might not be able to import a pool from 
another system, simply because your system can't support the 
featureset.  Ideally, it would be nice if you could just import the pool 
and use the features your current OS supports, but that's pretty darned 
dicey, and I'd be very happy if importing worked when both systems 
supported the same featureset.

You can use zpool create to set a specific version; this should allow
you to create a pool usable on a number of different systems.
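
Something like this, as a sketch (the version number and devices are just 
examples; pick the highest version that every system involved understands):

    zpool create -o version=28 tank mirror c0t0d0 c0t1d0
    zpool get version tank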

Casper
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-06 Thread Casper . Dik

Op 06-05-11 05:44, Richard Elling schreef:
 As the size of the data grows, the need to have the whole DDT in RAM or L2ARC
 decreases. With one notable exception, destroying a dataset or snapshot 
 requires
 the DDT entries for the destroyed blocks to be updated. This is why people 
 can
 go for months or years and not see a problem, until they try to destroy a 
 dataset.

So what you are saying is you with your ram-starved system, don't even
try to start using snapshots on that system. Right?


I think it's more like don't use dedup when you don't have RAM.

(It is not possible to not use snapshots in Solaris; they are used for
everything)

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ls reports incorrect file size

2011-05-02 Thread Casper . Dik

On Mon, May  2 at 14:01, Bob Friesenhahn wrote:
On Mon, 2 May 2011, Eric D. Mudama wrote:


Hi.  While doing a scan of disk usage, I noticed the following oddity.
I have a directory of files (named file.dat for this example) that all
appear as ~1.5GB when using 'ls -l', but that (correctly) appear as ~250KB
files when using 'ls -s' or du commands:

These are probably just sparse files.  Nothing to be alarmed about.

They were created via CIFS.  I thought sparse files were an iSCSI concept, no?

sparse files are a concept of the underlying filesystem.

E.g., if you lseek() past the end of the file and then write, your
filesystem may not need to allocate the intervening blocks.  Most Unix 
filesystems allow sparse files; FAT/FAT32 filesystems do not.
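
A quick way to see this in action (the path is just an example):

    dd if=/dev/zero of=/tmp/sparse.dat bs=1 count=1 seek=1048576
    ls -l /tmp/sparse.dat    # logical size: 1MB + 1 byte
    du -k /tmp/sparse.dat    # allocated space: far smaller than the logical size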

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to rename rpool. Is that recommended ?

2011-04-08 Thread Casper . Dik

On Fri, Apr 8, 2011 at 2:24 PM, Arjun YK arju...@gmail.com wrote:
 Hi,

 Let me add another query.
 I would assume it would be perfectly ok to choose any name for root
 pool, instead of 'rpool', during the OS install. Please suggest
 otherwise.

Have you tried it?

Last time I tried, the pool name was predetermined; you couldn't change it.
I tried cloning an existing OpenSolaris installation manually, changing the
pool name in the process. IIRC it worked (sorry for the somewhat vague
detail, it was several years ago).

You can only rename it by exporting it and importing under a different 
name.
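
A sketch of the rename (pool names are examples; be careful with a root 
pool, as noted below):

    zpool export tank
    zpool import tank newtank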

NOTE: when you modify your root pool on a different system, throw away the 
zpool.cache file.  (Earlier implementations had a bug where, if you renamed 
your root pool and then re-imported it under a different name, zfs claimed 
to have found two pools and then started to corrupt them.)

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to get rid of phantom pool ?

2011-02-15 Thread Casper . Dik

I had a pool on an external drive.  Recently the drive failed, but the pool 
still shows up when I run 'zpool status'.

Any attempt to remove/delete/export the pool ends up with unresponsiveness 
(the system is still up and running perfectly, it's just this specific 
command that kind of hangs, so I have to open a new ssh session).

zpool status shows state: UNAVAIL

When I try zpool clear I get cannot clear errors for backup: I/O error

Please help me out to get rid of this phantom pool.

Remove the zfs cache file: /etc/zfs/zpool.cache.

Then reboot.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)

2011-01-07 Thread Casper . Dik

On Fri, January 7, 2011 01:42, Michael DeMan wrote:
 Then - there is the other side of things.  The 'black swan' event.  At
 some point, given percentages on a scenario like the example case above,
 one simply has to make the business justification case internally at their
 own company about whether to go SHA-256 only or Fletcher+Verification?
 Add Murphy's Law to the 'black swan event' and of course the only data
 that is lost is that .01% of your data that is the most critical?

The other thing to note is that by default (with de-dupe disabled), ZFS
uses Fletcher checksums to prevent data corruption. Add also the fact all
other file systems don't have any checksums, and simply rely on the fact
that disks have a bit error rate of (at best) 10^-16.

Given the above: most people are content enough to trust Fletcher to not
have data corruption, but are worried about SHA-256 giving 'data
corruption' when it comes de-dupe? The entire rest of the computing world
is content to live with 10^-15 (for SAS disks), and yet one wouldn't be
prepared to have 10^-30 (or better) for dedupe?


I would; we're not talking about flipping bits but about the OS comparing 
data using just the checksums and replacing one set with another.

You might want to create a file to show how weak fletcher really is,
but two such files wouldn't be properly stored on a de-dup zpool 
unless you verify.




Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS ... open source moving forward?

2010-12-11 Thread Casper . Dik

On 11/12/2010 00:07, Erik Trimble wrote:

 The last update I see to the ZFS public tree is 29 Oct 2010.   Which, 
 I *think*, is about the time that the fork for the Solaris 11 Express 
 snapshot was taken.


I don't think this is the case.
Although all the files show modification date of 29 Oct 2010 at 
src.opensolaris.org they are still old versions from August, at least 
the ones I checked.

See 
http://src.opensolaris.org/source/history/onnv/onnv-gate/usr/src/uts/common/fs/zfs/

the mercurial gate doesn't have any updates either.

Correct; the last public push was on 2010/8/18.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS snapshot limit?

2010-12-01 Thread Casper . Dik


In my situation it is the first option. I send snapshots to another server 
using zfs send | zfs recv, and I have a problem when the data send is 
completed: after a reboot the zpool has errors or is in state: faulted.
The first server is physical, the second is a virtual machine running under 
xenserver 5.6


What is the underlying data storage?

Typically what can happen here is that while zfs itself is safe, it needs 
to trust the hardware not to lie to the kernel.  If you write data and you 
reboot/restart the VM, the data should still be there.  If that is not the 
case, then the storage has lied to you and you may need to change something 
in the host.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Space not freed from large deleted sparse file

2010-11-02 Thread Casper . Dik

I removed the file using a simple rm /mnt/tank0/temp/mytempfile.bin.
It's definitely gone. But the space hasn't been freed.

I have been pointed in the direction of this bug
http://bugs.opensolaris.org/view_bug.do?bug_id=6792701

 It was apparently introduced in build 94 and at that time we had a zpool
 version of 11.

Can you confirm that Fixed In: snv_118 means the issue was fixed in
ZFS pool version 17?
http://hub.opensolaris.org/bin/view/Community+Group+zfs/17

The version of zpool changes only when there's a change to the on-disk
format.  It was fixed in build 118 but no change was made to the
zfs version.  zpool version was bumped in build 120.

which appears to be in ZFS version 14, and my FreeNAS distro is at
version 13. Could this be the issue? If so, what is the correct course
of action? Ditch FreeNAS and move to a distro with a more recent ZFS
version? If I do this, is it safe to upgrade my ZFS version knowing
that there is something up with the filesystem?

 I believe it will just work.

Sorry, that what will just work? Moving to a distro with more recent
ZFS support and upgrading my pool?


I believe it will work even without the upgrade, but I'm not sure.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Space not freed from large deleted sparse file

2010-11-01 Thread Casper . Dik


I removed the file using a simple rm /mnt/tank0/temp/mytempfile.bin.
It's definitely gone. But the space hasn't been freed.

I have been pointed in the direction of this bug
http://bugs.opensolaris.org/view_bug.do?bug_id=6792701

It was apparently introduced in build 94 and at that time we had a zpool
version of 11.

which appears to be in ZFS version 14, and my FreeNAS distro is at
version 13. Could this be the issue? If so, what is the correct course
of action? Ditch FreeNAS and move to a distro with a more recent ZFS
version? If I do this, is it safe to upgrade my ZFS version knowing
that there is something up with the filesystem?

I believe it will just work.


Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Unknown Space Gain

2010-10-20 Thread Casper . Dik


tank  com.sun:auto-snapshot  true   local

I don't utilize snapshots (this machine just stores media)...so what
could be up?


You've also disabled the time-slider functionality?  (automatic snapshots)

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Unknown Space Gain

2010-10-20 Thread Casper . Dik

Huh, I don't actually ever recall enabling that. Perhaps that is
connected to the message I started getting every minute recently in
the kernel buffer,

It's on by default.

You can see if it was ever enabled by using:

zfs list -t snapshot |grep @zfs-auto

Oct 20 12:20:49 megatron pcplusmp: [ID 805372 kern.info] pcplusmp: ide
(ata) instance 3 irq 0xf vector 0x45 ioapic 0x2 intin 0xf is bound to
cpu 0
Oct 20 12:21:49 megatron pcplusmp: [ID 805372 kern.info] pcplusmp: ide
(ata) instance 3 irq 0xf vector 0x45 ioapic 0x2 intin 0xf is bound to
cpu 1

This sounds more like a device driver unloaded and later it is reloaded 
because of some other service.

I just disabled it (zfs set com.sun\:auto-snapshot=false tank,
correct?), will see if the log messages disappear. Did the filesystem
kill off some snapshots or something in an effort to free up space?

Yes, but typically it will log that.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to avoid striping ?

2010-10-18 Thread Casper . Dik

 You have an application filesystem from one LUN. (vxfs is expensive, 
 ufs/svm is not really able to handle online filesystem increase. Thus we 
 plan to use zfs for application filesystems.)

What do you mean by not really?
...
Use growfs to grow UFS on the grown device.

I know it's off-topic, but the statement ``growfs will write-lock
(see lockfs(1M)) a mounted filesystem when expanding'' always made me 
uncomfortable with this online expansion. I cannot guarantee how a
specific application will behave during the expansion.


-w

 Write-lock  (wlock)  the  specified  file-system.  wlock
 suspends  writes  that  would  modify  the  file system.
 Access times are not kept while a file system is  write-
 locked.


All the applications trying to write will suspend.  What would be the
risk of that?
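
For reference, a typical online growth sequence looks something like the
following sketch (the metadevice, slice and mount point are examples only):

    metattach d10 c0t2d0s0                  # add a slice to the metadevice
    growfs -M /export /dev/md/rdsk/d10      # grow the mounted UFS in place

Writes to /export are only suspended while growfs runs; reads continue.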

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS equivalent of inotify

2010-10-08 Thread Casper . Dik


Is there a ZFS equivalent (or alternative) of inotify?

 

You have some thing, which wants to be notified whenever a specific file or
directory changes.  For example, a live sync application of some kind...

Have you looked at port_associate and its ilk?

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] TLER and ZFS

2010-10-06 Thread Casper . Dik


Changing the sector size (if it's possible at all) would require a
reformat of the drive.

The WD drives only support 4K sectors but they pretend to have 512-byte
sectors.  I don't think they need to reformat the drive when changing to 4K 
sectors.  A non-aligned write requires a read-modify-write operation and 
that makes it slower.

Casper



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] TLER and ZFS

2010-10-06 Thread Casper . Dik


This would require a low-level re-format and would significantly
reduce the available space if it was possible at all.

I don't think it is possible.

  WD has a jumper,
but is there explicitly to work with WindowsXP, and is not a real way
to dumb down the drive to 512.

All it does is offset the sector numbers by 1 so that sector 63
becomes physical sector 64 (a multiple of 4KB).

Is that all?  And this forces 4K alignment?

  I would presume that any vendor that
is shipping 4K sector size drives now, with a jumper to make it
'real' 512, would be supporting that over the long run?

I would be very surprised if any vendor shipped a drive that could
be jumpered to real 512 bytes.  The best you are going to get is
jumpered to logical 512 bytes and maybe a 1-sector offset (needed
for WindozeXP only).  These jumpers will probably last as long as
the 8GB jumpers that were needed by old BIOS code.  (Eg BIOS boots
using simulated 512-byte sectors and then the OS tells the drive to
switch to native mode).

I would assume that such a jumper would change the drive from
4K native to pretending to have 512-byte sectors.

It's unfortunate that Sun didn't bite the bullet several decades
ago and provide support for block sizes other than 512-bytes
instead of getting custom firmware for their CD drives to make
them provide 512-byte logical blocks for 2KB CD-ROMs.

Since Solaris x86 works fine with standard CD/DVD drives, that is no 
longer an issue.  Solaris does support larger sectors.

It's even more idiotic of WD to sell a drive with 4KB sectors but
not provide any way for an OS to identify those drives and perform
4KB aligned I/O.

I'm not sure that that is correct; the drive works on naive clients but I 
believe it can reveal its true colors.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] TLER and ZFS

2010-10-06 Thread Casper . Dik

On Tue, Oct 5, 2010 at 11:49 PM,  casper@sun.com wrote:
 I'm not sure that that is correct; the drive works on naive clients but I
 believe it can reveal its true colors.

The drive reports 512 byte sectors to all hosts. AFAIK there's no way
to make it report 4k sectors.


Too bad because it makes it less useful (specifically because the label 
mentions sectors and if you can use bigger sectors, you can address a 
larger drive).

They still have all sizes w/o Advanced Format (non EARS/AARS models)

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] TLER and ZFS

2010-10-05 Thread Casper . Dik


My immediate reaction to this is time to avoid WD drives for a while;
until things shake out and we know what's what reliably.

But, um, what do we know about say the Seagate Barracuda 7200.12 ($70),
the SAMSUNG Spinpoint F3 1TB ($75), or the HITACHI Deskstar 1TB 3.5
($70)?


I've seen several important features when selecting a drive for
a mirror:

TLER (the ability of the drive to timeout a command)
sector size (native vs virtual)
power use (specifically at home)
performance (mostly for work)
price

I've heard scary stories about a mismatch of the native sector size and
unaligned Solaris partitions (4K sectors, unaligned cylinder).

I was pretty happy with the WD drives (except for the one with a seriously
broken cache) but I see the reasons not to pick WD drives above the 1TB
range.

Are people now using 4K native sectors and formatting them with 4K sectors 
in (Open)Solaris?

Performance sucks when you use unaligned accesses, but is performance good 
when the accesses are aligned?

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [osol-discuss] [illumos-Developer] zpool upgrade and zfs upgrade behavior on b145

2010-09-29 Thread Casper . Dik


Additionally, even though zpool and zfs get version display the true and
updated versions, I'm not convinced that the problem is zdb, as the label
config is almost certainly set by the zpool and/or zfs commands.  Somewhere,
something is not happening that is supposed to when initiating a zpool
upgrade, but since I know virtually nothing of the internals of zfs, I do

The problem is likely in the boot block or in grub.

The development version did not update the boot block; newer versions of 
beadm do fix boot blocks.

For now, I'd recommend you upgrade the boot block on all halves of a 
bootable mirror before you upgrade the zpool version or the zfs version.
export/import won't help.
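
On x86 that would look something like this sketch (device names are 
examples; do it for each half of the mirror):

    installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t0d0s0
    installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t1d0s0

On SPARC, installboot does the equivalent job.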

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] drive speeds etc

2010-09-28 Thread Casper . Dik

I have both EVDS and EARS 2TB green drives.  And I have to say they are
not good for building storage servers.


I think both have native 4K sectors; as such, they balk or perform slowly
when a smaller I/O or an unaligned IOP hits them.

How are they formatted? Specifically, solaris slices must be aligned on a 
4K boundary or performance will stink.
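
A quick way to check, assuming the drive presents 512-byte sectors and the 
usual prtvtoc column layout (first sector in column 4; the device name is 
an example):

    prtvtoc /dev/rdsk/c0t0d0s2 | nawk '$1 !~ /^\*/ {
            printf("slice %s starts at sector %s: %s\n", $1, $4,
                ($4 % 8 == 0) ? "4K-aligned" : "NOT 4K-aligned")
    }'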

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [osol-discuss] zfs send/receive?

2010-09-26 Thread Casper . Dik

hi all

I'm using a custom snapshot scheme which snapshots every hour, day,
week and month, rotating 24h, 7d, 4w and so on. What would be the best
way to zfs send/receive these things? I'm a little confused about how
this works for delta udpates...

Vennlige hilsener / Best regards

The initial backup should look like this:


zfs snapshot -r export@backup-2010-07-12

zfs send -R export@backup-2010-07-12 | zfs receive -F -u -d 
portable/export

(portable is a portable pool; the export filesystem needs to exist; I use one
zpool to receive different zpools, each in their own directory)

An incremental backup:

zfs snapshot -r export@backup-2010-07-13
zfs send -R -I export@backup-2010-07-12 export@backup-2010-07-13 | 
zfs receive -v -u -d -F portable/export



You need to make sure you keep the last backup snapshot; when receiving
the incremental backup, destroyed filesystems and snapshots are also
destroyed in the backup.

Typically, I remove some of the snapshots *after* the backup; they are only
destroyed during the next backup.

I did notice that send/receive gets confused when older snapshots are
destroyed  by time-slider during the backup.

Casper
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] fs root inode number?

2010-09-26 Thread Casper . Dik

Typically on most filesystems, the inode number of the root
directory of the filesystem is 2, 0 being unused and 1 historically
once invisible and used for bad blocks (no longer done, but kept
reserved so as not to invalidate assumptions implicit in ufsdump tapes).

However, my observation seems to be (at least back at snv_97), the
inode number of ZFS filesystem root directories (including at the
top level of a spool) is 3, not 2.


Buggy programs may make all kinds of bad assumptions; this problem isn't 
new: the root filesystem of a zone is typically a simple directory in a 
ufs filesystem.

I seem to remember that flexlm wanted the root to be an actual root 
directory (so you can run only one copy).  They didn't realize that faking 
the hostid is just too simple.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] non-ECC Systems and ZFS for home users (was: Please warn a home user against OpenSolaris under VirtualBox under WinXP ; ))

2010-09-23 Thread Casper . Dik


I'm using ZFS on a system w/o ECC; it works (it's an Atom 230).

Note that this is no different from using another OS; the difference is 
that ZFS will complain when memory corruption leads to disk corruption; 
without ZFS you will still have memory corruption but you wouldn't know.

Is it helpful not knowing that you have memory corruption?  I don't think 
so.

I'd love to have a small (40W) system with ECC but it is difficult to 
find one.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] non-ECC Systems and ZFS for home users (was: Please warn a home user against OpenSolaris under VirtualBox under WinXP ; ))

2010-09-23 Thread Casper . Dik

  On 23-9-2010 10:25, casper@sun.com wrote:
 I'm using ZFS on a system w/o ECC; it works (it's an Atom 230).

I'm using ZFS on a non-ECC machine for years now without any issues. 
Never had errors. Plus, like others said, other OS'ses have the same 
problems and also run quite well. If not, you don't know it. With ZFS 
you will know.
I would say - just go for it. You will never want to go back.


Indeed.  While I mirror stuff on the same system, I'm now also making
backups using a USB connected disk (eSATA would be better but the box
only has USB).

My backup consists of:



for pool in $pools
do
        zfs snapshot -r $pool@$newsnapshot
        zfs send -R -I $pool@$lastsnapshot $pool@$newsnapshot |
            zfs receive -v -u -d -F portable/$pool
done

then I export and store the portable pool somewhere else.

I do run a once per two weeks scrub for all the pools, just in case.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Growing a root ZFS mirror on b134?

2010-09-23 Thread Casper . Dik

Ok, that doesn't seem to have worked so well ...

I took one of the drives offline, rebooted and it just hangs at the
splash screen after prompting for which BE to boot into.   
It gets to 
hostname: blah
and just sits there.


When you say offline, did you:

- remove the drive physically?
- or did you zfs detach it?
- or both?


In order to remove half of the mirror I suggest that you:


split the mirror (if your ZFS is recent enough; seems to be 
supported since 131) 
[ make sure you remove /etc/zfs/zpool.cache from the
  split half of the mirror. ]
or
detach 


only then remove the disk.
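
A sketch of the split route (pool names and the mount point are examples
only):

    zpool split rpool rpool2        # detach the second half as a new pool rpool2
    zpool import -R /mnt rpool2     # import it under an alternate root
    rm /mnt/etc/zfs/zpool.cache     # drop the stale cache file, per the note above
    zpool export rpool2             # now the disk can be pulled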

Depending on the hardware it may try to find the missing disk and this
may take some time.

You can boot with the debugger and/or -v to find out what is going on.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to migrate to 4KB sector drives?

2010-09-13 Thread Casper . Dik

On Sun, Sep 12, 2010 at 10:07 AM, Orvar Korvar
knatte_fnatte_tja...@yahoo.com wrote:
 No replies. Does this mean that you should avoid large drives with 4KB 
 sectors, that is, new drives? ZFS does not handle new drives?

Solaris 10u9 handles 4k sectors, so it might be in a post-b134 release of osol.



Build 118 adds support for 4K sectors with the following putback:

PSARC 2008/769 Multiple disk sector size support.
6710930 Solaris needs to support large sector size hard drive disk

But already in build 38 there is some support for large-sector disks in
ZFS.  6407365 large-sector disk support in ZFS


When new features are added to the current release, they are typically 
developed for the next release first and then backported to the current 
release.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ufs root to zfs root liveupgrade?

2010-08-28 Thread Casper . Dik


  hi all
Trying to learn how UFS root to ZFS root Live Upgrade works.

I download the vbox image of s10u8, it come up as UFS root.
add a new  disks (16GB)
create zpool rpool
run lucreate -n zfsroot -p rpool
run luactivate zfsroot
run lustatus it do show zfsroot will be active in next boot
init 6
but it come up with UFS root,
lustatus show ufsroot active
zpool rpool is mounted but not used by boot


You'll need to boot from a different disk; I don't think that the
OS can change the boot disk (it can on SPARC but it can't on x86)

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Directory tree renaming -- disk usage

2010-08-09 Thread Casper . Dik

If I have a directory with a bazillion files in it (or, let's say, a
directory subtree full of raw camera images, about 15MB each, totalling
say 50GB) on a ZFS filesystem, and take daily snapshots of it (without
altering it), the snapshots use almost no extra space, I know.

If I now rename that directory, and take another snapshot, what happens? 
Do I get two copies of the unchanged data now, or does everything still
reference the same original data (file content)?  Seems like the new
directory tree contains the same old files, same inodes and so forth, so
it shouldn't be duplicating the data as I understand it; is that correct?

This would, obviously, be fairly easy to test; and, if I removed the
snapshots afterward, wouldn't take space permanently (have to make sure
that the scheduler doesn't do one of my permanent snapshots during the
test).  But I'm interested in the theoretical answer in any case.


Snapshots never take additional space until they start to reference deleted 
data.  If the directory is renamed, then the parent directory and the 
directory's inode are changed, but the rest of the data is not modified, 
so the rename has no effect on the amount of data stored.
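
The test described above is easy to script; the dataset and directory names
here are just examples:

zfs snapshot tank/photos@before
mv /tank/photos/2009 /tank/photos/archive-2009
zfs snapshot tank/photos@after
# the snapshots should show (almost) no space consumed by the rename
zfs list -r -t snapshot -o name,used tank/photos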

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] swap - where is it coming from?

2010-06-10 Thread Casper . Dik



Swap is perhaps the wrong name; it is really virtual memory; virtual 
memory consists of real memory and swap on disk. In Solaris, a page
either exists on the physical swap device or in memory.  Of course, not
all memory is available as the kernel and other caches use a large part
of the memory.

When no disk-based swap is in use, there is still sufficient free memory;
"reserved" means pages that have been reserved, e.g., by fork() (pages that
must be copied when copy-on-write happens) or memory that has been allocated
but not yet written to.
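
The reservation accounting can be seen with the swap command:

swap -s    # total: ... allocated + ... reserved = ... used, ... available
swap -l    # the physical swap devices and the free space on each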

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] swap - where is it coming from?

2010-06-10 Thread Casper . Dik

On Thu, 10 Jun 2010, casper@sun.com wrote:

 Swap is perhaps the wrong name; it is really virtual memory; virtual
 memory consists of real memory and swap on disk. In Solaris, a page
 either exists on the physical swap device or in memory.  Of course, not
 all memory is available as the kernel and other caches use a large part
 of the memory.

Don't forget that virtual memory pages may also come from memory 
mapped files from the filesystem.  However, it seems that zfs is 
effectively diminishing this.


I should have said anonymous virtual memory.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Using WD Green drives?

2010-05-17 Thread Casper . Dik

On Thu, May 13, 2010 at 06:09:55PM +0200, Roy Sigurd Karlsbakk wrote:
 1. even though they're 5900, not 7200, benchmarks I've seen show they are 
 quite good 

Minor correction, they are 5400rpm.  Seagate makes some 5900rpm drives.

The green drives have reasonable raw throughput rate, due to the
extremely high platter density nowadays.  however, due to their low
spin speed, their average-access time is significantly slower than
7200rpm drives.

For bulk archive data containing large files, this is less of a concern.

Regarding slow reslivering times, in the absence of other disk activity,
I think that should really be limited by the throughput rate, not the
relatively slow random i/o performance...again assuming large files
(and low fragmentation, which if the archive is write-and-never-delete
is what i'd expect).

My experience is that they resilver fairly quickly and scrubs aren't slow
either. (300GB in 2hrs)

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] why both dedup and compression?

2010-05-07 Thread Casper . Dik

On 06/05/2010 21:07, Erik Trimble wrote:
 VM images contain large quantities of executable files, most of which
 compress poorly, if at all.

What data are you basing that generalisation on ?

Look at these simple examples for libc on my OpenSolaris machine:

1.6M  /usr/lib/libc.so.1*
636K  /tmp/libc.gz

I did the same thing for vim and got pretty much the same result.

It will be different (probably not quite as good) when it is at the ZFS 
block level rather than whole file but those to randomly choosen by me 
samples say otherwise to your generalisation.

Easy to test when compression is enabled for your rpool:

2191 -rwxr-xr-x   1 root bin  1794552 May  6 14:46 /usr/lib/libc.so.1*

(The actual size is 3500 blocks so we're saving quite a bit)
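
For reference, the same comparison can be made on any file or dataset
(the boot environment name below is just an example):

ls -ls /usr/lib/libc.so.1     # first column: 512-byte blocks actually allocated
zfs get compression,compressratio rpool/ROOT/snv_134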


Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Reverse lookup: inode to name lookup

2010-05-02 Thread Casper . Dik


You can do it in the kernel by calling vnodetopath(). I don't know if it
is exposed to user space.

Yes, in /proc/*/path  (kinda).

But that could be slow if you have large directories so you have to
think about where you would use it.

The kernel caches file names; however, it cannot be used for files that 
aren't in use.

It is certainly possible to create a .zfs/snapshot_byinode; it is not 
clear how much it would help, but it could be used for finding the earlier 
copy of a directory (like netapp's .snapshot)
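
The /proc exposure mentioned above can be inspected from a shell on
Solaris 10 and later:

ls -l /proc/$$/path/   # cwd, root, a.out and one entry per open fd,
                       # resolved to path names where available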

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Reverse lookup: inode to name lookup

2010-05-01 Thread Casper . Dik


I understand you cannot lookup names by inode number in general, because
that would present a security violation.  Joe User should not be able to
find the name of an item that's in a directory where he does not have
permission.

 

But, even if it can only be run by root, is there some way to lookup the
name of an object based on inode number?

Sure, that's typically how NFS works.

The inode itself is not sufficient; an inode number might be recycled, and 
an old snapshot with the same inode number may refer to a different file.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Reverse lookup: inode to name lookup

2010-05-01 Thread Casper . Dik


No, a NFS client will not ask the NFS server for a name by sending the
inode or NFS-handle. There is no need for a NFS client to do that.

The NFS clients, certainly for versions 2 and 3, only use the file handle;
the file handle can be decoded by the server.  The file handle does not
contain the name, only the FSid, the inode number and the generation.


There is no way to get a name from an inode number.

The NFS server knows how, so it is clearly possible.  It is not exported to 
userland, but the kernel can find a file by its inumber.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD best practices

2010-04-20 Thread Casper . Dik

On Mon, 19 Apr 2010, Edward Ned Harvey wrote:
 Improbability assessment aside, suppose you use something like the DDRDrive
 X1 ... Which might be more like 4G instead of 32G ... Is it even physically
 possible to write 4G to any device in less than 10 seconds?  Remember, to
 achieve worst case, highest demand on ZIL log device, these would all have
 to be 32kbyte writes (default configuration), because larger writes will go
 directly to primary storage, with only the intent landing on the ZIL.

Note that ZFS always writes data in order so I believe that the 
statement larger writes will go directly to primary storage really 
should be larger writes will go directly to the ZIL implemented in 
primary storage (which always exists).  Otherwise, ZFS would need to 
write a new TXG whenever a new large block of data appeared (which 
may be puny as far as the underlying store is concerned) in order to 
assure proper ordering.  This would result in a very high TXG issue 
rate.  Pool fragmentation would be increased.

I am sure that someone will correct me if this is wrong.

There's a difference between "written" and "the data is referenced by the 
uberblock".  There is no need to start a new TXG when a large data block
is written.  (If the system resets, the data will be on disk but not 
referenced, and it is lost unless the TXG it belongs to is committed.)

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] brtfs on Solaris? (Re: [osol-discuss] [indiana-discuss] So when are we gonna fork this sucker?)

2010-04-14 Thread Casper . Dik

brtfs could be supported on Opensolaris, too. IMO it could even
complement ZFS and spawn some concurrent development between both. ZFS
is too high end and works very poorly with less than 2GB while brtfs
reportedly works well with 128MB on ARM.

Both have license issues; Oracle can now re-license either, I believe, 
unless btrfs has escaped.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-03 Thread Casper . Dik


The only way to guarantee consistency in the snapshot is to always
(regardless of ZIL enabled/disabled) give priority for sync writes to get
into the TXG before async writes.

If the OS does give priority for sync writes going into TXG's before async
writes (even with ZIL disabled), then after spontaneous ungraceful reboot,
the latest uberblock is guaranteed to be consistent.

This is what Jeff Bonwick says in the zil synchronicity arc case:

   What I mean is that the barrier semantic is implicit even with no ZIL at 
all.
   In ZFS, if event A happens before event B, and you lose power, then
   what you'll see on disk is either nothing, A, or both A and B.  Never just B.
   It is impossible for us not to have at least barrier semantics.

So there's no chance that a *later* async write will overtake an earlier
sync *or* async write.

Casper


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Casper . Dik

On 01/04/2010 20:58, Jeroen Roodhart wrote:

 I'm happy to see that it is now the default and I hope this will cause the
 Linux NFS client implementation to be faster for conforming NFS servers.
  
 Interesting thing is that apparently defaults on Solaris and Linux are chosen 
 such that one can't signal the desired behaviour to the other. At least we 
 didn't manage to get a Linux client to asynchronously mount a Solaris 
 (ZFS backed) NFS export...


Which is to be expected as it is not a nfs client which requests the 
behavior but rather a nfs server.
Currently on Linux you can export a share with as sync (default) or 
async share while on Solaris you can't really currently force a NFS 
server to start working in an async mode.


The other part of the issue is that the Solaris clients have been 
developed against a sync server.  The client does more write-behind and
continues caching the non-acked data.  The Linux client has been developed 
against an async server and has some catching up to do.


Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Casper . Dik

  http://nfs.sourceforge.net/
 
 I think B4 is the answer to Casper's question:

We were talking about ZFS, and under what circumstances data is flushed to
disk, in what way sync and async writes are handled by the OS, and what
happens if you disable ZIL and lose power to your system.

We were talking about C/C++ sync and async.  Not NFS sync and async.

I don't think so.

http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg36783.html

(This discussion was started, I think, in the context of NFS performance)

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Casper . Dik


So you're saying that while the OS is building txg's to write to disk, the
OS will never reorder the sequence in which individual write operations get
ordered into the txg's.  That is, an application performing a small sync
write, followed by a large async write, will never have the second operation
flushed to disk before the first.  Can you support this belief in any way?

The question is not how the writes are ordered but whether an earlier
write can be in a later txg.  A transaction group is committed atomically.

In http://arc.opensolaris.org/caselog/PSARC/2010/108/mail I ask a similar 
question to make sure I understand it correctly, and the answer was:

 = Casper, the answer is from Neil Perrin:

 Is there a partialy order defined for all filesystem operations?
   

File system operations  will be written in order for all settings of 
the 
sync flag.

 Specifically, will ZFS guarantee that when fsync()/O_DATA happens on a
 file,
   
(I assume by O_DATA you meant O_DSYNC).

 that later transactions will not be in an earlier transaction group?
 (Or is this already the case?)
  
This is already the case.


So what I assumed to be true, but what you made me doubt, is apparently still
true: later transactions cannot be committed in an earlier txg.
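
For reference, the zil synchronicity case referenced above introduces a
per-dataset sync property (it integrated after many of the builds discussed
here); a sketch with a hypothetical dataset name:

zfs get sync tank/fs
zfs set sync=standard tank/fs    # default POSIX behaviour
zfs set sync=always tank/fs      # every write is synchronous
zfs set sync=disabled tank/fs    # per-dataset equivalent of disabling the ZIL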



If that's true, if there's no increased risk of data corruption, then why
doesn't everybody just disable their ZIL all the time on every system?

For an application running on the file server, there is no difference.
When the system panics you know that data might be lost.  The application 
also dies.  (The snapshot and the last valid uberblock are equally valid)

But for an application on an NFS client, without ZIL data will be lost 
while the NFS client believes the data is written and it will not try 
again.  With the ZIL, when the NFS server says that data is written then 
it is actually on stable storage.

The reason to have a sync() function in C/C++ is so you can ensure data is
written to disk before you move on.  It's a blocking call, that doesn't
return until the sync is completed.  The only reason you would ever do this
is if order matters.  If you cannot allow the next command to begin until
after the previous one was completed.  Such is the situation with databases
and sometimes virtual machines.  

So the question is: when will your data be invalid?

What happens with the data when the system dies before the fsync() call?
What happens with the data when the system dies after the fsync() call?
What happens with the data when the system dies after more I/O operations?

With the ZIL disabled, you call fsync() but after a crash you may encounter
data from before the call to fsync().  That could also happen if the system
died before the fsync() call, so I assume you can actually recover from that
situation.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Casper . Dik


Questions to answer would be:

Is a ZIL log device used only by sync() and fsync() system calls?  Is it
ever used to accelerate async writes?

There are quite a few sync writes, specifically when you mix in the 
NFS server.

Suppose there is an application which sometimes does sync writes, and
sometimes async writes.  In fact, to make it easier, suppose two processes
open two files, one of which always writes asynchronously, and one of which
always writes synchronously.  Suppose the ZIL is disabled.  Is it possible
for writes to be committed to disk out-of-order?  Meaning, can a large block
async write be put into a TXG and committed to disk before a small sync
write to a different file is committed to disk, even though the small sync
write was issued by the application before the large async write?  Remember,
the point is:  ZIL is disabled.  Question is whether the async could
possibly be committed to disk before the sync.

From what I quoted from the other discussion, it seems that later writes 
cannot be committed in an earlier TXG than your sync write or other earlier
writes.

I make the assumption that an uberblock is the term for a TXG after it is
committed to disk.  Correct?

The uberblock is the root of all the data.  All the data in a ZFS pool 
is referenced by it; after the txg is in stable storage then the uberblock 
is updated.

At boot time, or zpool import time, what is taken to be the current
filesystem?  The latest uberblock?  Something else?

The current zpool and filesystems are those referenced by the last
uberblock.

My understanding is that enabling a dedicated ZIL device guarantees sync()
and fsync() system calls block until the write has been committed to
nonvolatile storage, and attempts to accelerate by using a physical device
which is faster or more idle than the main storage pool.  My understanding
is that this provides two implicit guarantees:  (1) sync writes are always
guaranteed to be committed to disk in order, relevant to other sync writes.
(2) In the event of OS halting or ungraceful shutdown, sync writes committed
to disk are guaranteed to be equal or greater than the async writes that
were taking place at the same time.  That is, if two processes both complete
a write operation at the same time, one in sync mode and the other in async
mode, then it is guaranteed the data on disk will never have the async data
committed before the sync data.

sync() is actually *async*, and returning from sync() says nothing about 
stable storage.  After fsync() returns, it signals that all the data is
in stable storage (except if you disable the ZIL, or, apparently, on Linux
when the write caches for your disks are enabled, which is the default for
PC drives).  ZFS doesn't care about the write cache; it makes sure it is 
flushed.  (There's fsync() and open(..., O_DSYNC|O_SYNC).)

Based on this understanding, if you disable ZIL, then there is no guarantee
about order of writes being committed to disk.  Neither of the above
guarantees is valid anymore.  Sync writes may be completed out of order.
Async writes that supposedly happened after sync writes may be committed to
disk before the sync writes.

Somebody, (Casper?) said it before, and now I'm starting to realize ... This
is also true of the snapshots.  If you disable your ZIL, then there is no
guarantee your snapshots are consistent either.  Rolling back doesn't
necessarily gain you anything.

The only way to guarantee consistency in the snapshot is to always
(regardless of ZIL enabled/disabled) give priority for sync writes to get
into the TXG before async writes.

If the OS does give priority for sync writes going into TXG's before async
writes (even with ZIL disabled), then after spontaneous ungraceful reboot,
the latest uberblock is guaranteed to be consistent.


I believe that the writes are still ordered so the consistency you want is 
actually delivered even without the ZIL enabled.

Casper


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Casper . Dik

If you disable the ZIL, the filesystem still stays correct in RAM, and the
only way you lose any data such as you've described, is to have an
ungraceful power down or reboot.

The advice I would give is:  Do zfs autosnapshots frequently (say ... every
5 minutes, keeping the most recent 2 hours of snaps) and then run with no
ZIL.  If you have an ungraceful shutdown or reboot, rollback to the latest
snapshot ... and rollback once more for good measure.  As long as you can
afford to risk 5-10 minutes of the most recent work after a crash, then you
can get a 10x performance boost most of the time, and no risk of the
aforementioned data corruption.

Why do you need the rollback? The current filesystems have correct and 
consistent data, no different from the last two snapshots.
(Snapshots can happen in the middle of untarring.)

The difference between running with or without ZIL is whether the
client has lost data when the server reboots; not different from using 
Linux as an NFS server.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Casper . Dik


If you have an ungraceful shutdown in the middle of writing stuff, while the
ZIL is disabled, then you have corrupt data.  Could be files that are
partially written.  Could be wrong permissions or attributes on files.
Could be missing files or directories.  Or some other problem.

Some changes from the last 1 second of operation before crash might be
written, while some changes from the last 4 seconds might be still
unwritten.  This is data corruption, which could be worse than losing a few
minutes of changes.  At least, if you rollback, you know the data is
consistent, and you know what you lost.  You won't continue having more
losses afterward caused by inconsistent data on disk.

How exactly is this different from rolling back to some other point in 
time?

I think you don't quite understand how ZFS works; all operations are 
grouped in transaction groups, and all the transactions in a particular group 
are committed in one operation.  I don't know what partial ordering ZFS uses 
when creating transaction groups, but a snapshot just picks one
transaction group as the last group included in the snapshot.

When the system reboots, ZFS picks the most recent valid uberblock;
so the data available is correct up to transaction group N1.

If you roll back to a snapshot, you get data correct up to transaction
group N2.

But N2 < N1, so you lose more data.

Why do you think that a Snapshot has a better quality than the last 
snapshot available?

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Casper . Dik


Dude, don't be so arrogant.  Acting like you know what I'm talking about
better than I do.  Face it that you have something to learn here.

You may say that, but then you post this:


 Why do you think that a Snapshot has a better quality than the last
 snapshot available?

If you rollback to a snapshot from several minutes ago, you can rest assured
all the transaction groups that belonged to that snapshot have been
committed.  So although you're losing the most recent few minutes of data,
you can rest assured you haven't got file corruption in any of the existing
files.


But the actual fact is that there is *NO* difference between the last
uberblock and an uberblock named as snapshot-such-and-so.  All changes 
made after the uberblock was written are discarded by rolling back.


All the transaction groups referenced by last uberblock *are* written to 
disk.

Disabling the ZIL makes sure that fsync() and sync() no longer work;
whether you take a named snapshot or the uberblock is immaterial; your
strategy will cause more data to be lost.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Casper . Dik

 Is that what sync means in Linux?  

A sync write is one in which the application blocks until the OS acks that
the write has been committed to disk.  An async write is given to the OS,
and the OS is permitted to buffer the write to disk at its own discretion.
Meaning the async write function call returns sooner, and the application is
free to continue doing other stuff, including issuing more writes.

Async writes are faster from the point of view of the application.  But sync
writes are done by applications which need to satisfy a race condition for
the sake of internal consistency.  Applications which need to know their
next commands will not begin until after the previous sync write was
committed to disk.


We're talking about the "sync" option for NFS exports in Linux; what do they 
mean with "sync" NFS exports? 


Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Casper . Dik

 This approach does not solve the problem.  When you do a snapshot,
 the txg is committed.  If you wish to reduce the exposure to loss of
 sync data and run with ZIL disabled, then you can change the txg commit
 interval -- however changing the txg commit interval will not eliminate
 the
 possibility of data loss.

The default commit interval is what, 30 seconds?  Doesn't that guarantee
that any snapshot taken more than 30 seconds ago will have been fully
committed to disk?

When a system boots and it finds the snapshot, all the data referred to 
by the snapshot is on disk.  But the snapshot doesn't guarantee more than 
the last valid uberblock does.

Therefore, any snapshot older than 30 seconds old is guaranteed to be
consistent on disk.  While anything less than 30 seconds old could possibly
have some later writes committed to disk before some older writes from a few
seconds before.

If I'm wrong about this, please explain.

When a pointer to data is committed to disk by ZFS, then the data is 
also on disk.  (If the pointer is reachable from the uberblock,
then the data is also on disk and reachable from the uberblock.)

You don't need to wait 30 seconds.  If it's there, it's there.

I am envisioning a database, which issues a small sync write, followed by a
larger async write.  Since the sync write is small, the OS would prefer to
defer the write and aggregate into a larger block.  So the possibility of
the later async write being committed to disk before the older sync write is
a real risk.  The end result would be inconsistency in my database file.

If you rollback to a snapshot that's at least 30 seconds old, then all the
writes for that snapshot are guaranteed to be committed to disk already, and
in the right order.  You're acknowledging the loss of some known time worth
of data.  But you're gaining a guarantee of internal file consistency.


I don't know what ZFS guarantees when you disable the ZIL; the one broken 
promise is that the data may not have been committed to stable storage 
when fsync() returns.

I'm not sure whether there is still a barrier on sync()/fsync();
if that is the case, then ZFS is still safe for your application.



Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Casper . Dik

On 01/04/2010 13:01, Edward Ned Harvey wrote:
 Is that what sync means in Linux?
  
 A sync write is one in which the application blocks until the OS acks that
 the write has been committed to disk.  An async write is given to the OS,
 and the OS is permitted to buffer the write to disk at its own discretion.
 Meaning the async write function call returns sooner, and the application is
 free to continue doing other stuff, including issuing more writes.

 Async writes are faster from the point of view of the application.  But sync
 writes are done by applications which need to satisfy a race condition for
 the sake of internal consistency.  Applications which need to know their
 next commands will not begin until after the previous sync write was
 committed to disk.


ROTFL!!!

I think you should explain it even further for Casper :) :) :) :) :) :) :)



:-)

So what I *really* wanted to know is what "sync" means for the NFS server
in the case of Linux.

Apparently it means "implement the NFS protocol to the letter".

I'm happy to see that it is now the default and I hope this will cause the 
Linux NFS client implementation to be faster for conforming NFS servers.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Casper . Dik


It does seem like rollback to a snapshot does help here (to assure 
that sync and async data is consistent), but it certainly does not help 
any NFS clients.  Only a broken application uses sync writes 
sometimes, and async writes at other times.

But doesn't that snapshot possibly have the same issues?

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] bit-flipping in RAM...

2010-03-31 Thread Casper . Dik


I'm not saying that ZFS should consider doing this - doing a validation 
for in-memory data is non-trivially expensive in performance terms, and 
there's only so much you can do and still expect your machine to 
survive.  I mean, I've used the old NonStop stuff, and yes, you can 
shoot them with a .45 and it likely will still run, but wacking them 
with a bazooka still is guarantied to make them, well, Non-NonStop.

If we scrub the memory anyway, why not also verify the ZFS checksums of the 
data that is already in memory?

OTOH, zfs gets a lot of mileage out of cheap hardware and we know what the 
limitations are when you don't use ECC; the industry must start to require 
that all chipsets support ECC.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposition of a new zpool property.

2010-03-21 Thread Casper . Dik

 That would add unnecessary code to the ZFS layer for something that
 cron can handle in one line.

Actually ... Why should there be a ZFS property to share NFS, when you can
already do that with share and dfstab?  And still the zfs property
exists.

Probably because it is easy to create new filesystems and clone them; as 
NFS only works per filesystem, you need to edit dfstab every time you 
add a filesystem.  With the sharenfs property, zfs creates the NFS export 
for you, etc.
Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Posible newbie question about space between zpool and zfs file systems

2010-03-17 Thread Casper . Dik

Carson Gaspar wrote:
 Not quite. 
 11 x 10^12 =~ 10.004 x (1024^4).

 So, the 'zpool list' is right on, at 10T available.

 Duh, I was doing GiB math (y = x * 10^9 / 2^20), not TiB math (y = x * 
 10^12 / 2^40).

 Thanks for the correction.

You're welcome. :-)


On a not-completely-on-topic note:

Has there been a consideration by anyone to do a class-action lawsuit 
for false advertising on this?  I know they now have to include the 1GB 
= 1,000,000,000 bytes thing in their specs and somewhere on the box, 
but just because I say 1 L = 0.9 metric liters somewhere on the box, 
it shouldn't mean that I should be able to avertise in huge letters 2 L 
bottle of Coke on the outside of the package...

I think such attempts have been done and I think one was settled by 
Western Digital.

https://www.wdc.com/settlement/docs/document20.htm

This was in 2006.

I was apparently part of the 'class' as I had a disk registered; I think 
they gave some software.

See also:

http://en.wikipedia.org/wiki/Binary_prefix

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Posible newbie question about space between zpool and zfs file systems

2010-03-17 Thread Casper . Dik


IMHO, what matters is that pretty much everything from the disk controller
to the CPU and network interface is advertised in power-of-2 terms and disks
sit alone using power-of-10. And students are taught that computers work
with bits and so everything is a power of 2.

That is simply not true:

Memory: power of 2 (bytes)
Network: power of 10 (bits/s)
Disk: power of 10 (bytes)
CPU frequency: power of 10 (cycles/s)
SD/Flash/..: power of 10 (bytes)
Bus speed: power of 10

Main memory is the odd one out.

Just last week I had to remind people that a 24-disk JBOD with 1TB disks
wouldn't provide 24TB of storage since disks show up as 931GB.

Well some will say it's 24T :-)

It *is* an anomaly and I don't expect it to be fixed.
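
The arithmetic behind those numbers, for anyone who wants to check it
(assuming bc(1) is available):

echo '10^12 / 2^30' | bc        # a "1TB" disk is 931 GiB
echo '24 * 10^12 / 2^40' | bc   # 24 x 1TB is about 21 TiB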

Perhaps some disk vendor could add more bits to its drives and advertise a
real 1TB disk using power-of-2 and show how people are being misled by
other vendors that use power-of-10. Highly unlikely but would sure get some
respect from the storage community.

You've not been misled unless you have had your head in the sand for the last
five to ten years.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] swap across multiple pools

2010-03-03 Thread Casper . Dik

The default install for OpenSolaris creates a single root pool, and creates a 
swap and dump dataset within this pool.

In a multipool environment, would it make sense to add swap to a pool outside 
of the root pool, either as the sole swap dataset to be used or as extra swap?

Would this have any performance implications ?

My own experience is that the zvol swap devices are much slower than swap
directly to disk.  Perhaps because I had compression on in the rpool,
but any form of data copying/compressing or caching for swap is a no-no: 
you use more memory and you need to evict more pages.
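
A sketch of moving swap from the zvol to a raw slice (the slice name is
just an example):

swap -l                              # list the current swap devices
swap -a /dev/dsk/c0t0d0s1            # add the dedicated slice
swap -d /dev/zvol/dsk/rpool/swap     # then remove the zvol-backed swap
# at minimum, make sure compression is off on a swap zvol:
zfs set compression=off rpool/swap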

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to verify ecc for ram is active and enabled?

2010-03-03 Thread Casper . Dik

Is there a method to view the status of the rams ecc single or double bit 
errors? I would like to 
confirm that ecc on my xeon e5520 and ecc ram are performing their role since 
memtest is ambiguous.


I am running memory test on a p6t6 ws, e5520 xeon, 2gb samsung ecc modules and 
this is what is on 
the screen:

Chipset: Core IMC (ECC : Detect / Correct)

However, further down ECC is identified as being off. Yet there is a 
column for ECC Errs.

I don't know how to interpret this. Is ECC active or not?

Off but only disabled by memtest, I believe.


You can enable it in the memtest menu.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Installing Solaris 10 with ZFS Root FS

2010-03-01 Thread Casper . Dik

Hi Romain,

The option to select a ZFS root file system or a UFS root file system
is available starting in the Solaris 10 10/08 release.

(aka update 6, right?)


 I wish to install a Solaris 10 on a ZFS mirror but in the installer (in 
 interactive text mode) I don't have choice of the filesystem : I only 
 have 'SOLARIS' fs type (which is UFS if I'm right).


That sounds like the fdisk screen; use "Solaris" and then go on to
the next screen.  This screen will create an fdisk partition.  After doing 
that, you need to create slices and create a ufs filesystem or create a zfs
rpool.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Installing Solaris 10 with ZFS Root FS

2010-03-01 Thread Casper . Dik

Hi Cindy,

thanks for your quick response ! I'm trying to install Solaris 10 11/06.
I don't know how the version numbering works so I don't know if my
version is newer than 10/08.


It's month/year; 11/06 is three years and a bit old.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS replace - many to one

2010-02-25 Thread Casper . Dik

I'm looking to migrate a pool from using multiple smaller LUNs to one larger 
LUN.
I don't see a way to do a zpool replace for multiple to one. Anybody know how
to do this? It needs to be non disruptive.


It depends on the zpool's layout and on the old and the new LUNs;
you can only replace or attach vdevs one by one, and you could 
theoretically do that by making different slices on the new device.
I don't think you want that.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] chmod behavior with symbolic links

2010-02-22 Thread Casper . Dik


I know it's documented in the manual, but I find it a bit strange behaviour
that chmod -R changes the permissions of the target of a
symbolic link.


Is there any reason for this behaviour?


Symbolic links do not have a mode; so you can't chmod them; chmod(2) 
follows symbolic links (it was created before symbolic links existed).
Unfortunately, when symbolic links were created, they had an owner but
no relevant mode: so there's a readlink, symlink, lchown but no lchmod.

I think a lchmod() would be nice, if only to avoid following them.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is there something like udev in OpenSolaris

2010-02-20 Thread Casper . Dik

Hello list, 

Being a Linux guy, I'm actually quite new to OpenSolaris. One thing I miss is 
udev. I found that when using SATA disks with ZFS it always required manual 
intervention (cfgadm) to do SATA hot plug.

I would like to automate the disk replacement, so that it is a fully automatic 
process without manual intervention if: 

a) the new disk contains no ZFS labels 
b) the new disk does not contain a partition table

.. thus it is a real replacement part

On Linux I would write a udev hot plug script to automate this. 

Is there something like udev on OpenSolaris ? 
(A place / hook that is executed every time new hardware is added / detected)


Sysevent, perhaps?

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Growing ZFS Volume with SMI/VTOC label

2010-02-19 Thread Casper . Dik

So in a ZFS boot disk configuration (rpool) in a running environment, it's
not possible?

The example I have grows the rpool while running from the rpool.

But you need a recent version of zfs to grow the pool while it is in use.
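
A sketch of growing it online on a recent build (the disk/slice name is an
example):

zpool set autoexpand=on rpool
zpool online -e rpool c0t0d0s0   # ask ZFS to expand onto the grown slice
zpool list rpool                 # the new size should show up without a reboot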

On Fri, Feb 19, 2010 at 9:25 AM, casper@sun.com wrote:



 Is it possible to grow a ZFS volume on a SPARC system with a SMI/VTOC
 label
 without losing data as the OS is built on this volume?


 Sure as long as the new partition starts on the same block and is longer.

 It was a bit more difficult with UFS but for zfs it is very simple.

 I had a few systems with two ufs root slices using live upgrade:

        <slice 1><slice 2><swap>

 First I booted from <slice 2>
 ludelete "slice1"
 zpool create rpool "slice1"
 lucreate -p rpool
 luactivate slice1
 init 6
 from the zfs root:
 ludelete slice2
 format:
 remove slice2;
 grow slice1 to incorporate slice2
 label

 At that time I needed to reboot to get the new device size reflected in
 zpool list; today that is no longer needed

 Casper





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-17 Thread Casper . Dik


If there were a real-world device that tended to randomly flip bits,
or randomly replace swaths of LBA's with zeroes, but otherwise behave
normally (not return any errors, not slow down retrying reads, not
fail to attach), then copies=2 would be really valuable, but so far it
seems no such device exists.  If you actually explore the errors that
really happen I venture there are few to no cases copies=2 would save
you.

I had a device which had 256 bytes of the 32MB broken (some were 1, some
were always 0).  But I never put it online because it was so broken.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposed idea for enhancement - damage control

2010-02-17 Thread Casper . Dik



If there were a real-world device that tended to randomly flip bits,
or randomly replace swaths of LBA's with zeroes, but otherwise behave
normally (not return any errors, not slow down retrying reads, not
fail to attach), then copies=2 would be really valuable, but so far it
seems no such device exists.  If you actually explore the errors that
really happen I venture there are few to no cases copies=2 would save
you.

I had a device which had 256 bytes of the 32MB broken (some were 1, some
were always 0).  But I never put it online because it was so broken.

Of the 32MB cache, sorry.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Shrink the slice used for zpool?

2010-02-15 Thread Casper . Dik

Hi,

I recently installed OpenSolaris 2009.06 on a 10GB primary partition on
my laptop. I noticed there wasn't any option for customizing the
slices inside the solaris partition. After installation, there was
only a single slice (0) occupying the entire partition. Now the
problem is that I need to set up a UFS slice for my development. Is
there a way to shrink slice 0 (backing storage for the zpool) and make
room for a new slice to be used for UFS?

I also tried to create UFS on another primary DOS partition, but
apparently only one Solaris partition is allowed on one disk. So that
failed...


Can you create a zvol and use that for ufs?  Slow, but ...
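
A sketch of that, with hypothetical names and size:

zfs create -V 8g rpool/ufsvol
newfs /dev/zvol/rdsk/rpool/ufsvol
mount /dev/zvol/dsk/rpool/ufsvol /mnt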

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] why checksum data?

2010-01-30 Thread Casper . Dik


I find that when people take this argument, they are assuming that each 
component has perfect implementation and 100% fault coverage.  The real 
world isn't so lucky.

Recently I bought a disk with a broken 32MB buffer (256 bits were stuck
at 1 or 0).  It was corrupting data by the bucket.

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Clearing a directory with more than 60 million files

2010-01-05 Thread Casper . Dik

On Tue, January 5, 2010 05:34, Mikko Lammi wrote:

 As a result of one badly designed application running loose for some time,
 we now seem to have over 60 million files in one directory. Good thing
 about ZFS is that it allows it without any issues. Unfortunatelly now that
 we need to get rid of them (because they eat 80% of disk space) it seems
 to be quite challenging.

How about creating a new data set, moving the directory into it, and then
destroying it?

Assuming the directory in question is /opt/MYapp/data:
  1. zfs create rpool/junk
  2. mv /opt/MYapp/data /rpool/junk/
  3. zfs destroy rpool/junk

The move will create and remove the files; the removal done by mv will be 
just as inefficient, removing them one by one.

rm -rf would be at least as quick.
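
If the application has to keep running, the dataset approach quoted above is
still worth setting up for the future, so the next cleanup is a single
operation; the names here are examples:

zfs create rpool/junkdata
zfs set mountpoint=/opt/MYapp/data rpool/junkdata
# ... later, when it fills up again ...
zfs destroy rpool/junkdata    # use -f if the filesystem is still mounted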

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


  1   2   3   4   >