Re: [zfs-discuss] DDT sync?

2011-05-31 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
> 
> So here's what I'm going to do.  With arc_meta_limit at 7680M, of which
> 100M was consumed "naturally," that leaves me 7580 to play with.  Call it
> 7500M.  Divide by 412 bytes, it means I'll hit a brick wall when I reach a
> little over 19M blocks.  Which means if I set my recordsize to 32K, I'll hit
> that limit around 582G disk space consumed.  That is my hypothesis, and now
> beginning the test.

Well, this is interesting.  With 7580MB theoretically available for DDT in
ARC, the expectation was that 19M DDT entries would finally max out the ARC
and then I'd jump off a performance cliff and start seeing a bunch of pool
reads killing my write performance.
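
(For reference, a rough sketch of that arithmetic and of the kstats worth
watching while such a test runs; the 412 bytes is the per-entry size measured
earlier in this thread, everything else is just shell arithmetic, not
something from the original post:)

# Watch ARC metadata usage against its limit (standard arcstats kstats):
kstat -p zfs:0:arcstats:arc_meta_used
kstat -p zfs:0:arcstats:arc_meta_limit

# Back-of-the-envelope DDT capacity estimate:
META=$((7500 * 1024 * 1024))   # ~7500MB of arc_meta_limit left for the DDT
ENTRY=412                      # measured in-core bytes per DDT entry
BLOCKS=$((META / ENTRY))       # ~19M unique blocks before metadata ARC fills
echo "max unique blocks: $BLOCKS"
echo "data at 32K recordsize: $((BLOCKS * 32 / 1024 / 1024)) GB"   # ~582G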

In reality, what I saw was:  
* Up to a million blocks, the performance difference with/without dedup was
basically negligible.  Write time with dedup = 1x write time without dedup.
* After a million, the dedup write time consistently reached 2x the native
write time.  This happened when my ARC became full of user data (not
metadata).
* As the number of unique blocks in the pool increased, the dedup write time
gradually deviated further from the non-dedup write time: 2x, 3x, 4x.  After
the pool reached 22.5M blocks, I got a consistent 4x longer write time with
dedup enabled.
* And then it jumped off a cliff.  At 24M blocks I collected the last
datapoint: 28x slower writes with dedup (4966 sec to write 3G, compared to
178 sec), and for the first time, a nonzero rm time.  All the way up until
then, even with dedup, the rm time had been zero.  But now it was 72 sec.
* I waited another 6 hours, and never got another data point.  So I found
the limit where the pool becomes unusably slow.  

At a cursory look, you might say this supported the hypothesis.  You might
say "24M compared to 19M, that's not too far off.  This could be accounted
for by using the 376-byte size of ddt_entry_t instead of the 412-byte size
apparently measured... This would adjust the hypothesis to 21.1M blocks."

But I don't think that's quite fair, because my arc_meta_used never got
above 5,159.  And I never saw the massive read overload that was predicted
to be the cause of failure.  In fact, from 0.4M-0.5M blocks (early, early,
early on) onward, I always had 40-50 reads for every 250 writes, right to
the bitter end.  And my ARC is full of user data, not metadata.

So the conclusions I'm drawing are:

(1)  If you don't tweak arc_meta_limit and you want to enable dedup, you're
toast.  But if you do tweak arc_meta_limit, you might reasonably expect
dedup to perform 3x to 4x slower on unique data... and, based on results I
haven't talked about here yet, 3x to 4x faster on duplicate data.  So if you
have 50% or more duplicate data (dedup ratio 2x or higher), plenty of
memory, and that tweak in place, then your performance with dedup could be
comparable to, or even faster than, running without dedup.  Of course, it
depends on your data patterns and usage patterns.  YMMV.
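
(For anyone who wants to try that tweak, a sketch of the usual approach on
OpenSolaris-era kernels; the tunable is zfs_arc_meta_limit, and the value
below is just an example of roughly 7.5GB, so adjust to taste:)

# Persistent, takes effect after a reboot:
echo 'set zfs:zfs_arc_meta_limit=0x1E0000000' >> /etc/system

# Or live, via mdb (takes effect immediately, not persistent):
echo 'arc_meta_limit/Z 0x1E0000000' | mdb -kw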

(2)  The above is pretty much the best you can do if your server is going to
be a "normal" server handling both reads & writes.  Because the data and the
metadata are both stored in the ARC, the data has a tendency to push the
metadata out.  But consider a special use case: suppose you only care about
write performance and saving disk space.  For example, suppose you're the
destination server of a backup policy.  You only do writes, so you don't
care about keeping data in cache.  You want to enable dedup to save cost on
backup disks.  You only care about keeping metadata in ARC.  In that case,
set primarycache=metadata...  I'll go test this now.  The hypothesis is that
my arc_meta_used should actually climb up to the arc_meta_limit before I
start hitting any disk reads, so my write performance with/without dedup
should be pretty much equal up to that point.  I'm sacrificing the potential
read benefit of caching data in ARC in order to (hopefully) gain write
performance, so write performance can be just as good with dedup enabled or
disabled.  In fact, if there's much duplicate data, the dedup write
performance in this case should be significantly better than without dedup.
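
(The knob in question is a per-dataset property; a minimal sketch, with a
hypothetical dataset name:)

zfs set primarycache=metadata tank/backups   # keep only metadata (incl. DDT) in ARC
zfs set dedup=on tank/backups                # dedup the backup data
zfs get primarycache,dedup tank/backups      # verify the settings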



Re: [zfs-discuss] Is another drive worth anything?

2011-05-31 Thread Fajar A. Nugraha
On Wed, Jun 1, 2011 at 7:06 AM, Bill Sommerfeld  wrote:
> On 05/31/11 09:01, Anonymous wrote:
>> Hi. I have a development system on Intel commodity hardware with a 500G ZFS
>> root mirror. I have another 500G drive same as the other two. Is there any
>> way to use this disk to good advantage in this box? I don't think I need any
>> more redundancy, I would like to increase performance if possible. I have
>> only one SATA port left so I can only use 3 drives total unless I buy a PCI
>> card. Would you please advise me. Many thanks.
>
> I'd use the extra SATA port for an ssd, and use that ssd for some
> combination of boot/root, ZIL, and L2ARC.
>
> I have a couple systems in this configuration now and have been quite
> happy with the config.  While slicing an ssd and using one slice for
> root, one slice for zil, and one slice for l2arc isn't optimal from a
> performance standpoint and won't scale up to a larger configuration, it
> is a noticeable improvement from a 2-disk mirror.
>
> I used an 80G intel X25-M, with 1G for zil, with the rest split roughly
> 50:50 between root pool and l2arc for the data pool.

Does anyone have benchmarks or historical data on how reliable SSDs are nowadays?

Cheap-ish sandforce-based MLC SSDs usually say they support 1 million
write cycles, and that they have some kind of wear-leveling. How does
this translate when one is used as L2ARC? Can we expect something like
a one- to three-year lifetime when the pool is relatively busy?

-- 
Fajar


Re: [zfs-discuss] Is another drive worth anything?

2011-05-31 Thread Daniel Carosone
On Wed, Jun 01, 2011 at 05:45:14AM +0400, Jim Klimov wrote:
> Also, in a mirroring scenario is there any good reason to keep a warm spare
> instead of making a three-way mirror right away (beside energy saving)? 
> Rebuild times and non-redundant windows can be decreased considerably ;)

Perhaps where the spare may be used for any of several pools,
whichever has a failure first. Not relevant to this case..

In this case, if the drive is warm, it might as well be live.

My point was that, even as a cold spare it is worth something, and
that the sata port may be worth more, since the OP is more interested
in performance than extra redundancy.

--
Dan.





Re: [zfs-discuss] JBOD recommendation for ZFS usage

2011-05-31 Thread Rocky Shek
Thomas,

You can consider the DataON DNS-1600 (4U, 24 x 3.5" bay, 6Gb/s SAS JBOD). It is
perfect for ZFS storage as an alternative to the J4400.
http://dataonstorage.com/dns-1600

We recommend using native SAS drives such as the Seagate Constellation ES 2TB
SAS, so that two hosts can connect for a fail-over cluster.

The following is a setup diagram of an HA failover cluster with Nexenta. The
same configuration can be applied to Solaris, OpenSolaris and OpenIndiana.
http://dataonstorage.com/nexentaha

We also have DSM (Disk Shelf Management Tool) available for Solaris 10 and
Nexenta to help identify failed disks and JBODs. You can also check the status
of all FRUs.
http://dataonstorage.com/dsm

FYI, we have a reseller in Germany. If you need additional info, let me know!

Rocky


-Original Message-
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Thomas Nau
Sent: Sunday, May 29, 2011 11:07 PM
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] JBOD recommendation for ZFS usage

Dear all

Sorry if it's kind of off-topic for the list but after talking
to lots of vendors I'm running out of ideas...

We are looking for JBOD systems which

(1) hold 20+ 3.3" SATA drives

(2) are rack mountable

(3) have all the nice hot-swap stuff

(4) allow 2 hosts to connect via SAS (4+ lines per host) and see
all available drives as disks, no RAID volume.
In a perfect world both hosts would connect each using
two independent SAS connectors


The box will be used in a Solaris-based ZFS fileserver in a
fail-over cluster setup. Only one host will access a drive
at any given time.

It seems that a lot of vendors offer JBODs but so far I haven't found
one in Germany which handles (4).

Any hints?


Re: [zfs-discuss] Is another drive worth anything?

2011-05-31 Thread Jim Klimov
> If it is powered on, then it is a warm spare :-)
> Warm spares are a good idea.  For some platforms, you can 
> spin down the
> disk so it doesn't waste energy.

 
But I should note that we've had issues with a hot spare disk added to rpool
in particular, preventing boots on Solaris 10u8. It turned out to be a known 
bug which may have since been fixed...
 
Also, in a mirroring scenario is there any good reason to keep a warm spare
instead of making a three-way mirror right away (beside energy saving)? 
Rebuild times and non-redundant windows can be decreased considerably ;)
 
//Jim


Re: [zfs-discuss] Is another drive worth anything?

2011-05-31 Thread Richard Elling
On May 31, 2011, at 5:16 PM, Daniel Carosone wrote:

> Namely, leave the third drive on the shelf as a cold spare, and use
> the third sata connector for an ssd, as L2ARC, ZIL or even possibly
> both (which will affect selection of which device to use).

If it is powered on, then it is a warm spare :-)
Warm spares are a good idea.  For some platforms, you can spin down the
disk so it doesn't waste energy.
 -- richard



Re: [zfs-discuss] Is another drive worth anything?

2011-05-31 Thread Jim Klimov

> What about writes?

 
Writes to a mirror are no faster than the slowest disk: every drive in the
mirror, whether two or three, must commit a block before it is considered
written (in sync write mode). The same goes for TXG sync, though with some
optimization from caching and write coalescing.
 
//Jim
 


Re: [zfs-discuss] Ensure Newly created pool is imported automatically in new BE

2011-05-31 Thread Daniel Carosone
On Tue, May 31, 2011 at 05:32:47PM +0100, Matt Keenan wrote:
> Jim,
>
> Thanks for the response, I've nearly got it working, coming up against a  
> hostid issue.
>
> Here's the steps I'm going through :
>
> - At end of auto-install, on the client just installed before I manually  
> reboot I do the following :
>   $ beadm mount solaris /a
>   $ zpool export data
>   $ zpool import -R /a -N -o cachefile=/a/etc/zfs/zpool.cache data
>   $ beadm umount solaris
>   $ reboot
>
> - Before rebooting I check /a/etc/zfs/zpool.cache and it does contain  
> references to "data".
>
> - On reboot, the automatic import of data is attempted however following  
> message is displayed :
>
>  WARNING: pool 'data' could not be loaded as it was last accessed by  
> another system (host: ai-client hostid: 0x87a4a4). See  
> http://www.sun.com/msg/ZFS-8000-EY.
>
> - Host id on booted client is :
>   $ hostid
>   000c32eb
>
> As I don't control the import command on boot i cannot simply add a "-f"  
> to force the import, any ideas on what else I can do here ?

Can you simply export the pool again before rebooting, but after the
cachefile in /a has been unmounted? 
 
--
Dan.



Re: [zfs-discuss] Is another drive worth anything?

2011-05-31 Thread Daniel Carosone
On Wed, Jun 01, 2011 at 10:16:28AM +1000, Daniel Carosone wrote:
> On Tue, May 31, 2011 at 06:57:53PM -0400, Edward Ned Harvey wrote:
> > If you make it a 3-way mirror, your write performance will be unaffected,
> > but your read performance will increase 50% over a 2-way mirror.  All 3
> > drives can read different data simultaneously for the net effect of 3x a
> > single disk read performance.
> 
> This would be my recommendation too, but for the sake of completeness,
> there are other options that may provide better performance
> improvement (at a cost) depending on your needs. 

In fact, I should state even more clearly: do this, since there is
very little reason not to.  Measure the benefit.  Move on to the other
things if the benefit is not enough. When doing so, consider what kind
of benefit you're looking for.

> Namely, leave the third drive on the shelf as a cold spare, and use
> the third sata connector for an ssd, as L2ARC, ZIL or even possibly
> both (which will affect selection of which device to use).
> 
> L2ARC is likely to improve read latency (on average) even more than a
> third submirror.  ZIL will be unmirrored, but may improve writes at an
> acceptable risk for development system.  If this risk is acceptable,
> you may wish to consider whether setting sync=disabled is also
> acceptable at least for certain datasets.
> 
> Finally, if you're considering spending money, can you increase the
> RAM instead?  If so, do that first.
> 
> --
> Dan.







Re: [zfs-discuss] Is another drive worth anything?

2011-05-31 Thread Daniel Carosone
On Tue, May 31, 2011 at 06:57:53PM -0400, Edward Ned Harvey wrote:
> If you make it a 3-way mirror, your write performance will be unaffected,
> but your read performance will increase 50% over a 2-way mirror.  All 3
> drives can read different data simultaneously for the net effect of 3x a
> single disk read performance.

This would be my recommendation too, but for the sake of completeness,
there are other options that may provide better performance
improvement (at a cost) depending on your needs. 

Namely, leave the third drive on the shelf as a cold spare, and use
the third sata connector for an ssd, as L2ARC, ZIL or even possibly
both (which will affect selection of which device to use).

L2ARC is likely to improve read latency (on average) even more than a
third submirror.  ZIL will be unmirrored, but may improve writes at an
acceptable risk for a development system.  If this risk is acceptable,
you may wish to consider whether setting sync=disabled is also
acceptable at least for certain datasets.

Finally, if you're considering spending money, can you increase the
RAM instead?  If so, do that first.

--
Dan.



Re: [zfs-discuss] Is another drive worth anything?

2011-05-31 Thread Bill Sommerfeld
On 05/31/11 09:01, Anonymous wrote:
> Hi. I have a development system on Intel commodity hardware with a 500G ZFS
> root mirror. I have another 500G drive same as the other two. Is there any
> way to use this disk to good advantage in this box? I don't think I need any
> more redundancy, I would like to increase performance if possible. I have
> only one SATA port left so I can only use 3 drives total unless I buy a PCI
> card. Would you please advise me. Many thanks.

I'd use the extra SATA port for an ssd, and use that ssd for some
combination of boot/root, ZIL, and L2ARC.

I have a couple systems in this configuration now and have been quite
happy with the config.  While slicing an ssd and using one slice for
root, one slice for zil, and one slice for l2arc isn't optimal from a
performance standpoint and won't scale up to a larger configuration, it
is a noticeable improvement from a 2-disk mirror.

I used an 80G intel X25-M, with 1G for zil, with the rest split roughly
50:50 between root pool and l2arc for the data pool.
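
(In zpool terms that layout looks roughly like the following; the device and
slice names are invented for illustration, with s0 holding the root pool, s1
the slog and s2 the L2ARC:)

zpool add datapool log c2t0d0s1     # ~1GB slice as a separate intent log (slog)
zpool add datapool cache c2t0d0s2   # remaining slice as L2ARC for the data pool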

- Bill



Re: [zfs-discuss] Is another drive worth anything?

2011-05-31 Thread David Magda

On May 31, 2011, at 19:00, Edward Ned Harvey wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Roy Sigurd Karlsbakk
>> 
>> Theoretically, you'll get a 50% read increase, but I doubt it'll be that 
>> high in
>> practice.

What about writes?




Re: [zfs-discuss] Is another drive worth anything?

2011-05-31 Thread Bob Friesenhahn

On Tue, 31 May 2011, Edward Ned Harvey wrote:


If you make it a 3-way mirror, your write performance will be unaffected,
but your read performance will increase 50% over a 2-way mirror.  All 3
drives can read different data simultaneously for the net effect of 3x a
single disk read performance.


I think that a read performance increase of (at most) 33.3% is more 
correct.  You might obtain (at most) 50% over one disk by mirroring 
it.


ZFS makes a random selection of which disk to read from in a mirror
set, so the improvement is not truly linear.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Is another drive worth anything?

2011-05-31 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Roy Sigurd Karlsbakk
>
> Theoretically, you'll get a 50% read increase, but I doubt it'll be that high 
> in
> practice.

In my benchmarking, I found 2-way mirror reads 1.97x the speed of a single 
disk, and a 3-way mirror reads 2.91x a single disk.
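
(The post doesn't say how those numbers were measured; one crude way to
compare sequential-read throughput, with a hypothetical pool name, is to
write a large file once and time reading it back after clearing the cache,
e.g. by exporting and re-importing the pool:)

dd if=/dev/zero of=/tank/bench.dat bs=1024k count=8192   # write an 8GB test file
zpool export tank && zpool import tank                   # drop cached file data
time dd if=/tank/bench.dat of=/dev/null bs=1024k         # time the read-back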



Re: [zfs-discuss] Is another drive worth anything?

2011-05-31 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Anonymous
> 
> Hi. I have a development system on Intel commodity hardware with a 500G ZFS
> root mirror. I have another 500G drive same as the other two. Is there any
> way to use this disk to good advantage in this box? I don't think I need any
> more redundancy, I would like to increase performance if possible. I have
> only one SATA port left so I can only use 3 drives total unless I buy a PCI
> card. Would you please advise me. Many thanks.

If you make it a 3-way mirror, your write performance will be unaffected,
but your read performance will increase 50% over a 2-way mirror.  All 3
drives can read different data simultaneously for the net effect of 3x a
single disk read performance.



Re: [zfs-discuss] Experiences with 10.000+ filesystems

2011-05-31 Thread Tomas Ögren
On 31 May, 2011 - Gertjan Oude Lohuis sent me these 0,9K bytes:

> On 05/31/2011 03:52 PM, Tomas Ögren wrote:
>> I've done a not too scientific test on reboot times for Solaris 10 vs 11
>> with regard to many filesystems...
>>
>
>> http://www8.cs.umu.se/~stric/tmp/zfs-many.png
>>
>> As the picture shows, don't try 10 000 filesystems with nfs on sol10.
>> Creating more filesystems gets slower and slower the more you have as
>> well.
>>
>
> Since all filesystem would be shared via NFS, this clearly is a nogo :).  
> Thanks!
>
>> On a different setup, we have about 750 datasets where we would like to
>> use a single recursive snapshot, but when doing that all file access
>> will be frozen for varying amounts of time
>
> What version of ZFS are you using? Like Matthew Ahrens said: version 27  
> has a fix for this.

22, Solaris 10.

/Tomas
-- 
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se


Re: [zfs-discuss] Experiences with 10.000+ filesystems

2011-05-31 Thread Richard Elling
On May 31, 2011, at 2:29 PM, Gertjan Oude Lohuis wrote:

> On 05/31/2011 03:52 PM, Tomas Ögren wrote:
>> I've done a not too scientific test on reboot times for Solaris 10 vs 11
>> with regard to many filesystems...
>> 
> 
>> http://www8.cs.umu.se/~stric/tmp/zfs-many.png
>> 
>> As the picture shows, don't try 10 000 filesystems with nfs on sol10.
>> Creating more filesystems gets slower and slower the more you have as
>> well.
>> 
> 
> Since all filesystem would be shared via NFS, this clearly is a nogo :). 
> Thanks!

If you search the archives, you will find that the people who tried to do
this in the past were more successful with legacy NFS export methods than
the sharenfs property in ZFS.
 -- richard



Re: [zfs-discuss] Experiences with 10.000+ filesystems

2011-05-31 Thread Gertjan Oude Lohuis

On 05/31/2011 12:26 PM, Khushil Dep wrote:

Generally snapshots are quick operations but 10,000 such operations
would I believe take enough to time to complete as to present
operational issues - breaking these into sets would alleviate some?
Perhaps if you are starting to run into many thousands of filesystems
you would need to re-examin your rationale in creating so many.



Thanks for your feedback! My rationale is this: I have a lot of hosting
accounts which have databases. These databases need to be backed up,
preferably with mysqldump, and there needs to be historic data. I would
like to use ZFS snapshots for this. However, I have some variables that
need to be taken into account:

* Different hosting plans offer different backup schedules: every 3 hours,
every 24 hours. Backups might be kept 3 days, 14 days or 30 days. These
schedules thus need to be on separate storage, otherwise I can't create a
matching snapshot schedule to create and rotate snapshots.

* Databases are hosted on multiple database servers, and are frequently
migrated between them. I could create a ZFS filesystem for each server,
but if a hosting account is migrated, all backups will be 'lost'.

Having one filesystem for each hosting account would have solved nearly
all the disadvantages I could think of. But I don't think it is going to
work, sadly. I'll have to make some choices :).
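
(Purely as an illustration of the layout being weighed above, with invented
names: one parent filesystem per backup schedule, one child per hosting
account, and recursive snapshots rotated per schedule:)

zfs create backup/plan-3h                  # accounts backed up every 3 hours
zfs create backup/plan-24h                 # accounts backed up daily
zfs create backup/plan-3h/account1         # one filesystem per hosting account
zfs snapshot -r backup/plan-3h@`date +%Y%m%d-%H%M`   # e.g. from cron, per schedule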


Regards,
Gertjan Oude Lohuis


Re: [zfs-discuss] Experiences with 10.000+ filesystems

2011-05-31 Thread Gertjan Oude Lohuis

On 05/31/2011 03:52 PM, Tomas Ögren wrote:

I've done a not too scientific test on reboot times for Solaris 10 vs 11
with regard to many filesystems...




http://www8.cs.umu.se/~stric/tmp/zfs-many.png

As the picture shows, don't try 10 000 filesystems with nfs on sol10.
Creating more filesystems gets slower and slower the more you have as
well.



Since all filesystems would be shared via NFS, this is clearly a no-go :).
Thanks!



On a different setup, we have about 750 datasets where we would like to
use a single recursive snapshot, but when doing that all file access
will be frozen for varying amounts of time


What version of ZFS are you using? Like Matthew Ahrens said: version 27 
has a fix for this.



Re: [zfs-discuss] Ensure Newly created pool is imported automatically in new BE

2011-05-31 Thread Matt Keenan
I've written a possible solution using svc-system-config SMF where, on
first boot, it will import -f a specified list of pools, and it does
work. I was hoping to find a cleaner solution via zpool.cache, but if
there's no way to achieve it I guess I'll have to stick with the other
solution.


I even tried simply copying /etc/zfs/zpool.cache to 
/a/etc/zfs/zpool.cache and not exporting/importing the data pool at all, 
however this gave the same hostid problem.


thanks for your help.

cheers

Matt

Jim Klimov wrote:

Actually if you need beadm to "know" about the data pool,
it might be beneficial to mix both approaches - yours with
bemount, and init-script to enforce the pool import on that
first boot...
 
HTH,

//Jim Klimov
 




Re: [zfs-discuss] Is another drive worth anything?

2011-05-31 Thread Roy Sigurd Karlsbakk
> Hi. I have a development system on Intel commodity hardware with a
> 500G ZFS
> root mirror. I have another 500G drive same as the other two. Is there
> any
> way to use this disk to good advantage in this box? I don't think I
> need any
> more redundancy, I would like to increase performance if possible. I
> have
> only one SATA port left so I can only use 3 drives total unless I buy
> a PCI
> card. Would you please advise me. Many thanks.

A third drive in the mirror (aka three-way mirror) will increase read 
performance from the pool, as ZFS reads from all drives in a mirror. 
Theoretically, you'll get a 50% read increase, but I doubt it'll be that high 
in practice.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly.
It is an elementary imperative for all pedagogues to avoid excessive use of
idioms of foreign origin. In most cases, adequate and relevant synonyms exist
in Norwegian.


Re: [zfs-discuss] Is another drive worth anything?

2011-05-31 Thread Jim Klimov
> Hi. I have a development system on Intel commodity hardware with 
> a 500G ZFS
> root mirror. I have another 500G drive same as the other two. Is 
> there any
> way to use this disk to good advantage in this box? I don't 
> think I need any
> more redundancy, I would like to increase performance if 
> possible. I have
> only one SATA port left so I can only use 3 drives total unless 
> I buy a PCI
> card. Would you please advise me. Many thanks.

Well, you can use this drive as a separate "scratch area", as a separate
single-disk pool, without redundancy. You'd have a separate spindle
for some dedicated tasks with data you're okay with losing.
 
You can also make the rpool a three-way mirror which may increase
read speeds if you have enough concurrency. And when one drive 
breaks, your rpool is still mirrored.
 
HTH,
//Jim Klimov
 


Re: [zfs-discuss] not sure how to make filesystems

2011-05-31 Thread Jim Klimov
Alas, I have some notes on the subject of migration from UFS to ZFS 
with split filesystems (separate /usr /var and some /var/* subdirs), but 
they are an unpublished document in Russian ;) Here I will outline some
main points, but will probably have omitted some others :(
Hope this helps anyway...
 
Splitting off /usr and /var/* subdirs into separate datasets has been a
varying success (it worked in Solaris 10 and OpenSolaris SXCE, failed
in OpenIndiana) and may cause issues during the first reboots after OS
upgrades and after some repair reboots (system tools don't expect such
a layout), but separating /var as a single dataset is supported.
Paths like /export and /opt are not involved as "system root", so 
these can be implemented any way you want, including storage 
on a separate "data pool".
 
With the zfs root in place, you can either create a swap volume
inside the zfs root pool, or use a dedicated partition for swapping,
or do both. With a dedicated partition you might control where
on disk it is located (faster/slower tracks), but that space is
dedicated to swapping only, whether or not it is needed. With
volumes you can relatively easily resize the swap area.
 
/tmp is usually implemented as a "tmpfs" filesystem and as such it 
is stored in virtual memory, which is spread between RAM and 
swap areas, and its contents are lost on reboot - but you don't 
really care much about that implementation detail. In your vfstab
file you just have this line:
 
# grep tmp /etc/vfstab
swap    -       /tmp    tmpfs   -       yes     -

In short, you might not want to involve LU in this at all: after a successful
migration has been tested, you're likely to kill the UFS partition and use
it as part of the ZFS root pool mirror. After that you would want to start
the LU history from scratch, by naming this ZFS-rooted copy of your
installation the initial boot environment, and later LUpgrade it to newer
releases.
 
Data migration itself is rather simple: you create the zfs pool named "rpool"
in an available slice (i.e. c0t1d0s0) and in that rpool you create and mount
the needed hierarchy of filesystem datasets (compatible with LU/beadm 
expectations). Then you copy over all the file data from UFS into your 
hierarchy (ufsdump/ufsrestore or Sun cpio preferred - to keep the ACL
data), then enable booting of the ZFS root (zpool set bootfs=), and test if it 
works ;)
 
# format
... (create the slice #0 on c0t1d0 of appropriate size - see below)
 
# zpool create -f -R /a rpool c0t1d0s0
# zfs create -o mountpoint=legacy rpool/ROOT 
# zfs create -o mountpoint=/ rpool/ROOT/sol10u8 
# zfs create -o compression=on rpool/ROOT/sol10u8/var 
# zfs create -o compression=on rpool/ROOT/sol10u8/opt
# zfs create rpool/export
# zfs create -o compression=on rpool/export/home
# zpool set bootfs=rpool/ROOT/sol10u8 rpool
# zpool set failmode=continue rpool
 
Optionally create the swap and dump areas, i.e.
# zfs create -V2g rpool/dump 
# zfs create -V2g rpool/swap
 
If all goes well (and I didn't type mistakes) you should have the
hierarchy mounted under /a. Check with "df -k" to be sure...
 
One way to copy - with ufsdump:
# cd /a && ( ufsdump 0f - / | ufsrestore -rf - )
# cd /a/var && ( ufsdump 0f - /var | ufsrestore -rf - ) 
# cd /a/opt && ( ufsdump 0f - /opt | ufsrestore -rf - ) 
# cd /a/export/home && ( ufsdump 0f - /export/home | ufsrestore -rf - )
 
Another way - with Sun cpio:
# cd /a
# mkdir -p tmp proc devices var/run system/contract system/object etc/svc/volatile
# touch etc/mnttab etc/dfs/sharetab
# cd / && ( /usr/bin/find . var opt export/home -xdev -depth -print | /usr/bin/cpio -Ppvdm /a )
 
Review the /a/etc/vfstab file; you probably need to comment out the explicit
mountpoints for your new datasets, including root. It might end up looking
like this:
 
# cat /etc/vfstab
#device                  device    mount             FS      fsck  mount    mount
#to mount                to fsck   point             type    pass  at boot  options
#
/devices                 -         /devices          devfs   -     no       -
/proc                    -         /proc             proc    -     no       -
ctfs                     -         /system/contract  ctfs    -     no       -
objfs                    -         /system/object    objfs   -     no       -
sharefs                  -         /etc/dfs/sharetab sharefs -     no       -
fd                       -         /dev/fd           fd      -     no       -
swap                     -         /tmp              tmpfs   -     yes      -
/dev/zvol/dsk/rpool/swap -         -                 swap    -     no       -

 
Finally, install the right bootloader for the current OS.
 
* In case of GRUB:
# /a/sbin/installgrub /a/boot/grub/stage1 /a/boot/grub/stage2 /dev/rdsk/c0t1d0s0
# mkdir -p /a/rpool/boot/grub
# cp /boot/grub/menu.lst /a/rpool/boot/grub
 
Review and update the GRUB menu file as needed. Note that the current 
disk wh

Re: [zfs-discuss] Experiences with 10.000+ filesystems

2011-05-31 Thread Matthew Ahrens
On Tue, May 31, 2011 at 6:52 AM, Tomas Ögren  wrote:

>
> On a different setup, we have about 750 datasets where we would like to
> use a single recursive snapshot, but when doing that all file access
> will be frozen for varying amounts of time (sometimes half an hour or
> way more). Splitting it up into ~30 subsets, doing recursive snapshots
> over those instead has decreased the total snapshot time greatly and cut
> the "frozen time" down to single digit seconds instead of minutes or
> hours.
>

If you can upgrade to zpool version 27 or later, you should see much much
less "frozen time" when doing a "zfs snapshot -r" of thousands of
filesystems.
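
(For completeness, checking and upgrading the pool version looks like this;
"tank" is a placeholder pool name, and note that a pool upgrade is one-way:)

zpool upgrade -v        # list the on-disk versions this ZFS software supports
zpool get version tank  # current version of the pool
zpool upgrade tank      # upgrade the pool to the newest supported version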

--matt


[zfs-discuss] Is another drive worth anything?

2011-05-31 Thread Anonymous
Hi. I have a development system on Intel commodity hardware with a 500G ZFS
root mirror. I have another 500G drive same as the other two. Is there any
way to use this disk to good advantage in this box? I don't think I need any
more redundancy, I would like to increase performance if possible. I have
only one SATA port left so I can only use 3 drives total unless I buy a PCI
card. Would you please advise me. Many thanks.


Re: [zfs-discuss] Ensure Newly created pool is imported automatically in new BE

2011-05-31 Thread Jim Klimov
Actually if you need beadm to "know" about the data pool,
it might be beneficial to mix both approaches - yours with 
bemount, and init-script to enforce the pool import on that 
first boot...
 
HTH,
//Jim Klimov
 


Re: [zfs-discuss] Importing Corrupted zpool never ends

2011-05-31 Thread Jim Klimov
I had several pool corruptions on my test box recently, and the recovery
imports did indeed take a large part of a week (the process starved my 8GB
of RAM, so the system hung and had to be reset with the hardware reset
button, and this contributed to the large timeframe). Luckily for me, these
import attempts were cumulative, so after a while the system began working.
It seems that the system had crashed during a major deletion operation and
needed more time to find and release the deferred-free blocks.
 
Not sure if my success would apply to your situation though.
 
iostat speeds can vary during pool maintenance operations (i.e. scrub
and probably import and zdb walks too) depending on (metadata) 
fragmentation, CPU busy-ness, etc. A more relevant metric here
is %busy for the disks.
 
While researching my problem I found many older posts indicating that
this is "normal", however setting some kernel values with mdb may help
speed up the process and/or have it succeed.

To be short here, I can suggest that you read my recent threads from
that timeframe:
* (OI_148a) ZFS hangs when accessing the pool. How to trace what's happening? 
http://opensolaris.org/jive/thread.jspa?messageID=515689
* Questions on ZFS pool as a volume in another ZFS pool - details my system's 
setup
http://opensolaris.org/jive/thread.jspa?threadID=138604&tstart=0
 
Since the system froze often by dropping into swap-hell, I had to 
create a watchdog which would initiate an ungraceful reboot if the 
conditions were "right".
 
My FreeRAM-Watchdog code and compiled i386 binary and a
primitive SMF service wrapper can be found here:
http://thumper.cos.ru/~jim/freeram-watchdog-20110531-smf.tgz
 
Other related forum threads:
* zpool import hangs indefinitely (retry post in parts; too long?) 
http://opensolaris.org/jive/thread.jspa?threadID=131237
* zpool import hangs 
http://opensolaris.org/jive/thread.jspa?threadID=70205&tstart=15

- Original Message -
From: Christian Becker 
Date: Tuesday, May 31, 2011 18:02
Subject: [zfs-discuss] Importing Corrupted zpool never ends
To: zfs-discuss@opensolaris.org


> Hi There,  


> I need to import an corrupted ZPOOL after double-Crash (Mainboard and one 
> HDD) on a different system.   


> It is a RAIDZ1 - 3 HDDs - only 2 are working. 
> 

> Problem: zpool import -f poolname runs and runs and runs. Looking after 
> iostat (not zpool iostat) it is doing something - but what? And why does it 
> last so long (2x 1.5TB - Atom System).
> 

> iostat seems to read and write with something about 500kB/s - I hope that it 
> doesn't work through the whole 1500GB - that would need 40 Days...
> 

> Hope someone could help me.
> 

> Thanks a lot
> Chris

 
 
-- 

++ 
|| 
| Климов Евгений, Jim Klimov | 
| технический директор   CTO | 
| ЗАО "ЦОС и ВТ"  JSC COS&HT | 
|| 
| +7-903-7705859 (cellular)  mailto:jimkli...@cos.ru | 
|CC:ad...@cos.ru,jimkli...@gmail.com | 
++ 
| ()  ascii ribbon campaign - against html mail  | 
| /\- against microsoft attachments  | 
++


Re: [zfs-discuss] Ensure Newly created pool is imported automatically in new BE

2011-05-31 Thread Jim Klimov
I should have seen that coming, but didn't ;)
 
I think in this case I would go with a different approach: don't
import the data pool in the AI instance and save it to zpool.cache.
Instead, make sure it is cleanly exported from the AI instance, and
in the installed system create a self-destructing init script or SMF
service. For an init script it might go like this:
 
#!/bin/sh
# /etc/rc2.d/S00importdatapool
[ "$1" = start ] && zpool import -f datapool && rm -f "$0"
 
Or you can try setting the hostid in a persistent manner (perhaps
via eeprom emulation in /boot/solaris/bootenv.rc ?)

- Original Message -
From: Matt Keenan 
Date: Tuesday, May 31, 2011 21:02
Subject: Re: [zfs-discuss] Ensure Newly created pool is imported automatically 
in new BE
To: j...@cos.ru
Cc: zfs-discuss@opensolaris.org

> Jim,
> 
> Thanks for the response, I've nearly got it working, coming up 
> against a 
> hostid issue.
> 
> Here's the steps I'm going through :
> 
> - At end of auto-install, on the client just installed before I 
> manually 
> reboot I do the following :
>$ beadm mount solaris /a
>$ zpool export data
>$ zpool import -R /a -N -o 
> cachefile=/a/etc/zfs/zpool.cache data
>$ beadm umount solaris
>$ reboot
> 
> - Before rebooting I check /a/etc/zfs/zpool.cache and it does 
> contain 
> references to "data".
> 
> - On reboot, the automatic import of data is attempted however 
> following 
> message is displayed :
> 
>   WARNING: pool 'data' could not be loaded as it was last 
> accessed by 
> another system (host: ai-client hostid: 0x87a4a4). See 
> http://www.sun.com/msg/ZFS-8000-EY.
> 
> - Host id on booted client is :
>$ hostid
>000c32eb
> 
> As I don't control the import command on boot i cannot simply 
> add a "-f" 
> to force the import, any ideas on what else I can do here ?
> 
> cheers
> 
> Matt
> 
> On 05/27/11 13:43, Jim Klimov wrote:
> > Did you try it as a single command, somewhat like:
> >
> > zpool create -R /a -o cachefile=/a/etc/zfs/zpool.cache mypool c3d0
> > Using altroots and cachefile(=none) explicitly is a nearly-documented
> > way to avoid caching pools which you would not want to see after
> > reboot, i.e. removable media.
> > I think that after the AI is done and before reboot you might want to
> > reset the altroot property to point to root (or be undefined) so that
> > the data pool is mounted into your new rpools hierarchy and not
> > under "/a/mypool" again ;)
> > And if your AI setup does not use the data pool, you might be better
> > off not using altroot at all, maybe...
> >
> > - Original Message -
> > From: Matt Keenan 
> > Date: Friday, May 27, 2011 13:25
> > Subject: [zfs-discuss] Ensure Newly created pool is imported 
> > automatically in new BE
> > To: zfs-discuss@opensolaris.org
> >
> > > Hi,
> > >
> > > Trying to ensure a newly created data pool gets import on boot
> > > into a
> > > new BE.
> > >
> > > Scenario :
> > >Just completed a AI install, and on the client
> > > before I reboot I want
> > > to create a data pool, and have this pool automatically imported
> > > on boot
> > > into the newly installed AI Boot Env.
> > >
> > >Trying to use the -R altroot option to 
> zpool create
> > > to achieve this or
> > > the zpool set -o cachefile property, but having no luck, and
> > > would like
> > > some advice on what the best means of achieving this would be.
> > >
> > > When the install completes, we have a default root pool 
> "rpool", which
> > > contains a single default boot environment, rpool/ROOT/solaris
> > >
> > > This is mounted on /a so I tried :
> > > zpool create -R /a mypool c3d0
> > >
> > > Also tried :
> > > zpool create mypool c3d0
> > > zpool set -o cachefile=/a mypool
> > >
> > > I can clearly see /a/etc/zfs/zpool.cache contains information
> > > for rpool,
> > > but it does not get any information about mypool. I would expect
> > > this
> > > file to contain some reference to mypool. So I tried :
> > > zpool set -o cachefile=/a/etc/zfs/zpool.cache
> > >
> > > Which fails.
> > >
> > > Any advice would be great.
> > >
> > > cheers
> > >
> > > Matt

Re: [zfs-discuss] Experiences with 10.000+ filesystems

2011-05-31 Thread Jim Klimov
In general, you may need to keep data in one dataset if it is somehow
related (i.e. backup of a specific machine or program, a user's home, etc) 
and if you plan to manage it in a consistent manner. For example, CIFS
shares can not be nested, so for a unitary share (like "distribs") you would
probably want one dataset. Also you can only have hardlinks within one
FS dataset, so if you manage different views into a distribution set
(i.e. sorted by vendor or sorted by software type) and if you do it
by hardlinks - you need one dataset as well. If you often move (link
and unlink) files around, e.g. from an "incoming" directory to final
storage, you may or may not want to have that "incoming" in the
same dataset; this depends on some other considerations too.
 
You want to split datasets when you need them to have different
features and perhaps different uses, i.e. to have them as separate
shares, to enforce separate quotas and reservations, perhaps to
delegate administration to particular OS users (i.e. let a user manage
snapshots of his own homedir) and/or local zones. Don't forget
about individual dataset properties (i.e. you may want compression
for source code files but not for a multimedia collection), snapshots
and clones, etc.
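
(A few examples of those per-dataset knobs, with invented names; "zfs allow"
is the delegation mechanism mentioned above:)

zfs create -o compression=on -o quota=20g pond/export/home/alice
zfs set reservation=5g pond/export/home/alice
zfs allow alice snapshot,mount,clone pond/export/home/alice   # user manages her own snapshots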
 
> 2. space management (we have wasted space in some pools while others
> are starved)
Well, that's a reason to decrease the number of pools, but not datasets ;)
 
> 3. tool speed
> 
> I do not have good numbers for time to do 
> some of these operations
> as we are down to under 200 datasets (1/3 of the way through the
> migration to the new layout). I do have log entries that point to
> about a minute to complete a `zfs list` operation.
> 
> > Would I run into any problems when snapshots are taken (almost)
> > simultaneously from multiple filesystems at once?
> 
> Our logs show snapshot creation time at 2 
> seconds or less, but we
> do not try to do them all at once, we walk the list of datasets and
> process (snapshot and replicate) each in turn.

I can partially relate to that. We have a Thumper system running
OpenSolaris SXCE snv_177, with a separate dataset for each
user's home directory, for backups of each individual remote
machine, for each VM image, each local zone, etc. - in particular
so as to have a separate history of snapshots and the possibility
of cloning what we need to.
 
Its relatively large number of filesystems (about 350) may or may not be a
problem depending on the tool used. For example, a typical import of the
main pool may take up to 8 minutes in safe mode, but many of the delays
seem to be related to attempts to share_nfs and share_cifs while the
network is down ;)
 
Auto-snapshots are on, and listing them is indeed rather long:
 
[root@thumper ~]# time zfs list -tall -r pond | wc -l
   56528
real0m18.146s
user0m7.360s
sys 0m10.084s

[root@thumper ~]# time zfs list -tvolume -r pond | wc -l
   5
real0m0.096s
user0m0.025s
sys 0m0.073s

[root@thumper ~]# time zfs list -tfilesystem -r pond | wc -l
 353
real0m0.123s
user0m0.052s
sys 0m0.073s

Some operations like listing the filesystems SEEM slow due to the terminal,
but in fact are rather quick:
 
[root@thumper ~]# time df -k | wc -l
 363
real0m2.104s
user0m0.094s
sys 0m0.183s

However low-level system programs may have problems with multiple 
FSes; one known troublemaker is LiveUpgrade. Jens Elkner published
a wonderful set of patches for Solaris 10 and OpenSolaris to limit LU's
interests to just the filesystems that the admin knows to be interesting
for the OS upgrade (they also fix mount order and other known bugs
of that LU software release):
* http://iws.cs.uni-magdeburg.de/~elkner/luc/lutrouble.html
 
True, 10 000 FSes is not something I have seen myself, so some tools
(especially legacy ones) may break at the sheer number of mountpoints :)
 
One of my own tricks for cleaning snapshots, e.g. to quickly relieve pool
space starvation, is to use parallel "zfs destroy" invocations like this
(note the ampersand):

# zfs list -t snapshot -r pond/export/home/user | grep @zfs-auto-snap | \
    awk '{print $1}' | while read Z ; do zfs destroy "$Z" & done
 
This may spawn several thousand processes (if called for the root dataset), 
but they often complete in just 1-2 minutes instead of hours for a one-by-one 
series of calls; I guess because this way many ZFS metadata operations 
are requested in a small timeframe and get coalesced into few big writes.
 


Re: [zfs-discuss] Ensure Newly created pool is imported automatically in new BE

2011-05-31 Thread Matt Keenan

Jim,

Thanks for the response, I've nearly got it working, coming up against a 
hostid issue.


Here's the steps I'm going through :

- At end of auto-install, on the client just installed before I manually 
reboot I do the following :

  $ beadm mount solaris /a
  $ zpool export data
  $ zpool import -R /a -N -o cachefile=/a/etc/zfs/zpool.cache data
  $ beadm umount solaris
  $ reboot

- Before rebooting I check /a/etc/zfs/zpool.cache and it does contain 
references to "data".


- On reboot, the automatic import of data is attempted however following 
message is displayed :


 WARNING: pool 'data' could not be loaded as it was last accessed by 
another system (host: ai-client hostid: 0x87a4a4). See 
http://www.sun.com/msg/ZFS-8000-EY.


- Host id on booted client is :
  $ hostid
  000c32eb

As I don't control the import command on boot i cannot simply add a "-f" 
to force the import, any ideas on what else I can do here ?


cheers

Matt

On 05/27/11 13:43, Jim Klimov wrote:

Did you try it as a single command, somewhat like:

zpool create -R /a -o cachefile=/a/etc/zfs/zpool.cache mypool c3d0
Using altroots and cachefile(=none) explicitly is a nearly-documented
way to avoid caching pools which you would not want to see after
reboot, i.e. removable media.
I think that after the AI is done and before reboot you might want to
reset the altroot property to point to root (or be undefined) so that
the data pool is mounted into your new rpools hierarchy and not
under "/a/mypool" again ;)
And if your AI setup does not use the data pool, you might be better
off not using altroot at all, maybe...

- Original Message -
From: Matt Keenan 
Date: Friday, May 27, 2011 13:25
Subject: [zfs-discuss] Ensure Newly created pool is imported 
automatically in new BE

To: zfs-discuss@opensolaris.org

> Hi,
>
> Trying to ensure a newly created data pool gets import on boot
> into a
> new BE.
>
> Scenario :
>Just completed a AI install, and on the client
> before I reboot I want
> to create a data pool, and have this pool automatically imported
> on boot
> into the newly installed AI Boot Env.
>
>Trying to use the -R altroot option to zpool create
> to achieve this or
> the zpool set -o cachefile property, but having no luck, and
> would like
> some advice on what the best means of achieving this would be.
>
> When the install completes, we have a default root pool "rpool", which
> contains a single default boot environment, rpool/ROOT/solaris
>
> This is mounted on /a so I tried :
> zpool create -R /a mypool c3d0
>
> Also tried :
> zpool create mypool c3d0
> zpool set -o cachefile=/a mypool
>
> I can clearly see /a/etc/zfs/zpool.cache contains information
> for rpool,
> but it does not get any information about mypool. I would expect
> this
> file to contain some reference to mypool. So I tried :
> zpool set -o cachefile=/a/etc/zfs/zpool.cache
>
> Which fails.
>
> Any advice would be great.
>
> cheers
>
> Matt
--

++
||
| Климов Евгений, Jim Klimov |
| технический директор   CTO |
| ЗАО "ЦОС и ВТ"  JSC COS&HT |
||
| +7-903-7705859 (cellular)  mailto:jimkli...@cos.ru |
|CC:ad...@cos.ru,jimkli...@gmail.com |
++
| ()  ascii ribbon campaign - against html mail  |
| /\- against microsoft attachments  |
++




Re: [zfs-discuss] Experiences with 10.000+ filesystems

2011-05-31 Thread Eric D. Mudama

On Tue, May 31 at  8:52, Paul Kraus wrote:

   When we initially configured a large (20TB) files server about 5
years ago, we went with multiple zpools and multiple datasets (zfs) in
each zpool. Currently we have 17 zpools and about 280 datasets.
Nowhere near the 10,000+ you intend. We are moving _away_ from the
many dataset model to one zpool and one dataset. We are doing this for
the following reasons:

1. manageability
2. space management (we have wasted space in some pools while others
are starved)
3. tool speed

   I do not have good numbers for time to do some of these operations
as we are down to under 200 datasets (1/3 of the way through the
migration to the new layout). I do have log entries that point to
about a minute to complete a `zfs list` operation.


It would be interesting to see if you still had issues (#3) with 1 pool and
your 280 datasets.  It would definitely eliminate #2.

--
Eric D. Mudama
edmud...@bounceswoosh.org



Re: [zfs-discuss] Experiences with 10.000+ filesystems

2011-05-31 Thread Jerry Kemp
Gertjan,

In addition to the comments directly responding to your post, we have
had similar discussions previously on the zfs-discuss list.

If you care to review the list archives, we have had similar
discussions during at least the following time periods.

March 2006
May 2008
January 2010
February 2010

There may be (and probably is) more material in the list archives, but I
know from my personal archives that these are good dates.

Hope this helps,

Jerry



On 05/31/11 05:08, Gertjan Oude Lohuis wrote:
> "Filesystem are cheap" is one of ZFS's mottos. I'm wondering how far
> this goes. Does anyone have experience with having more than 10.000 ZFS
> filesystems? I know that mounting this many filesystems during boot
> while take considerable time. Are there any other disadvantages that I
> should be aware of? Are zfs-tools still usable, like 'zfs list', 'zfs
> get/set'.
> Would I run into any problems when snapshots are taken (almost)
> simultaneously from multiple filesystems at once?
> 
> Regards,
> Gertjan Oude Lohuis
>


Re: [zfs-discuss] not sure how to make filesystems

2011-05-31 Thread Enda O'Connor

On 29/05/2011 19:55, BIll Palin wrote:

I'm migrating some filesystems from UFS to ZFS and I'm not sure how to create a 
couple of them.

I want to migrate /, /var, /opt, /export/home and also want swap and /tmp.  I 
don't care about any of the others.

The first disk, and the one with the UFS filesystems, is c0t0d0 and the 2nd 
disk is c0t1d0.

I've been told that /tmp is supposed to be part of swap.  So far I have:

lucreate -m /:/dev/dsk/c0t0d0s0:ufs -m /var:/dev/dsk/c0t0d0s3:ufs -m 
/export/home:/dev/dsk/c0t0d0s5:ufs -m /opt:/dev/dsk/c0t0d0s4:ufs -m 
-:/dev/dsk/c0t1d0s2:swap -m /tmp:/dev/dsk/c0t1d0s3:swap -n zfsBE -p rootpool

And then set quotas for them.  Is this right?

Hi,
So ZFS root is very different; one cannot have a mix of UFS and zvol-based
swap at all.

Also, lucreate is a bit restricted here: one cannot split out /var.

The only one that works is
lucreate -n zfsBE -p rpool

where rpool is an SMI based pool.
To check for SMI, run format, select the rpool disk, then p, p, and check
whether it lists cylinders (SMI). If not, run format -e on the disk and
label it (delete rpool first if it already exists), then preferably (but
not necessarily) put all the space in slice 0, so that rpool has the
whole disk.


After booting zfsBE, one can modify the swap and dump zvols (look on
Google for "zfs root swap").
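
(A hypothetical outline of the whole sequence; device, BE and size values
are examples, not something from the original post:)

zpool create rpool c0t1d0s0        # pool on the SMI-labelled slice
lucreate -n zfsBE -p rpool         # copy the current UFS BE into the ZFS pool
luactivate zfsBE && init 6         # boot the new ZFS BE

# After booting zfsBE, resize the swap/dump zvols if needed:
swap -d /dev/zvol/dsk/rpool/swap   # detach swap before resizing
zfs set volsize=4g rpool/swap
swap -a /dev/zvol/dsk/rpool/swap
zfs set volsize=2g rpool/dump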


Enda


[zfs-discuss] Importing Corrupted zpool never ends

2011-05-31 Thread Christian Becker
Hi There, 

I need to import an corrupted ZPOOL after double-Crash (Mainboard and one HDD) 
on a different system.  

It is a RAIDZ1 - 3 HDDs - only 2 are working. 

Problem: zpool import -f poolname runs and runs and runs. Watching iostat
(not zpool iostat), it is doing something - but what? And why does it take
so long (2x 1.5TB, Atom system)?

iostat seems to show reads and writes at about 500kB/s - I hope it doesn't
have to work through the whole 1500GB; that would take 40 days...

Hope someone can help me.

Thanks a lot,
Chris


[zfs-discuss] not sure how to make filesystems

2011-05-31 Thread BIll Palin
I'm migrating some filesystems from UFS to ZFS and I'm not sure how to create a 
couple of them.

I want to migrate /, /var, /opt, /export/home and also want swap and /tmp.  I 
don't care about any of the others.

The first disk, and the one with the UFS filesystems, is c0t0d0 and the 2nd 
disk is c0t1d0.

I've been told that /tmp is supposed to be part of swap.  So far I have:

lucreate -m /:/dev/dsk/c0t0d0s0:ufs -m /var:/dev/dsk/c0t0d0s3:ufs -m 
/export/home:/dev/dsk/c0t0d0s5:ufs -m /opt:/dev/dsk/c0t0d0s4:ufs -m 
-:/dev/dsk/c0t1d0s2:swap -m /tmp:/dev/dsk/c0t1d0s3:swap -n zfsBE -p rootpool

And then set quotas for them.  Is this right?
-- 
This message posted from opensolaris.org


Re: [zfs-discuss] Experiences with 10.000+ filesystems

2011-05-31 Thread Tomas Ögren
On 31 May, 2011 - Khushil Dep sent me these 4,5K bytes:

> The adage that I adhere to with ZFS features is "just because you can
> doesn't mean you should!". I would suspect that with that many
> filesystems the normal zfs-tools would also take an inordinate length
> of time to complete their operations - scale according to size.

I've done a not-too-scientific test of reboot times for Solaris 10 vs 11
with regard to many filesystems...

Quad Xeon machines with a single raid10 and one boot environment. Using
more BEs with Live Upgrade on sol10 makes the situation even worse, as it's
LU that takes the time, (re)mounting all filesystems over and over and
over and over again.
http://www8.cs.umu.se/~stric/tmp/zfs-many.png

As the picture shows, don't try 10,000 filesystems with NFS on sol10.
Creating more filesystems also gets slower and slower the more you
already have.

> Generally snapshots are quick operations, but 10,000 such operations
> would, I believe, take enough time to complete as to present
> operational issues - breaking these into sets would alleviate some?
> Perhaps if you are starting to run into many thousands of filesystems
> you would need to re-examine your rationale for creating so many.

On a different setup, we have about 750 datasets where we would like to
use a single recursive snapshot, but when doing that all file access
will be frozen for varying amounts of time (sometimes half an hour or
way more). Splitting it up into ~30 subsets and doing recursive snapshots
over those instead has greatly decreased the total snapshot time and cut
the "frozen time" down to single-digit seconds instead of minutes or
hours.
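Roughly what our split looks like - only a sketch, the pool and group
names are made up, and it assumes a zfs(1M) new enough to support
'zfs list -d':

#!/bin/ksh
# one recursive snapshot per top-level group instead of one huge
# recursive snapshot over the whole pool
SNAP="auto-$(date +%Y%m%d-%H%M)"
for group in $(zfs list -H -o name -d 1 tank | grep -v '^tank$'); do
        zfs snapshot -r "${group}@${SNAP}"
done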

> My 2c. YMMV.
> 
> -- 
> Khush
> 
> On Tuesday, 31 May 2011 at 11:08, Gertjan Oude Lohuis wrote:
> 
> > "Filesystem are cheap" is one of ZFS's mottos. I'm wondering how far
> > this goes. Does anyone have experience with having more than 10.000 ZFS
> > filesystems? I know that mounting this many filesystems during boot
> > while take considerable time. Are there any other disadvantages that I
> > should be aware of? Are zfs-tools still usable, like 'zfs list', 'zfs
> > get/set'.
> > Would I run into any problems when snapshots are taken (almost)
> > simultaneously from multiple filesystems at once?
> > 
> > Regards,
> > Gertjan Oude Lohuis
> > ___
> > zfs-discuss mailing list
> > zfs-discuss@opensolaris.org (mailto:zfs-discuss@opensolaris.org)
> > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> 

> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



/Tomas
-- 
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Experiences with 10.000+ filesystems

2011-05-31 Thread Paul Kraus
On Tue, May 31, 2011 at 6:08 AM, Gertjan Oude Lohuis
 wrote:

> "Filesystem are cheap" is one of ZFS's mottos. I'm wondering how far
> this goes. Does anyone have experience with having more than 10.000 ZFS
> filesystems? I know that mounting this many filesystems during boot
> while take considerable time. Are there any other disadvantages that I
> should be aware of? Are zfs-tools still usable, like 'zfs list', 'zfs
> get/set'.

When we initially configured a large (20 TB) file server about 5
years ago, we went with multiple zpools and multiple datasets (zfs) in
each zpool. Currently we have 17 zpools and about 280 datasets.
Nowhere near the 10,000+ you intend. We are moving _away_ from the
many dataset model to one zpool and one dataset. We are doing this for
the following reasons:

1. manageability
2. space management (we have wasted space in some pools while others
are starved)
3. tool speed

I do not have good numbers for how long some of these operations take,
as we are already down to under 200 datasets (1/3 of the way through the
migration to the new layout), but I do have log entries that point to
about a minute to complete a `zfs list` operation.
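If you want comparable numbers from your own box, something like this
works (writing to a file so the timing covers only the zfs command; the
path is arbitrary):

# time a full dataset listing and count how many entries come back
time zfs list -H -t filesystem -o name > /tmp/zfslist.out
wc -l /tmp/zfslist.out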

> Would I run into any problems when snapshots are taken (almost)
> simultaneously from multiple filesystems at once?

Our logs show snapshot creation taking 2 seconds or less, but we do
not try to do them all at once; we walk the list of datasets and
process (snapshot and replicate) each one in turn.

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)

2011-05-31 Thread Jim Klimov
Interesting, although it makes sense ;)
 
Now, I wonder about reliability (with large 2-3TB drives
and long scrub/resilver/replace times): say I have 12 drives
in my box.
 
I can lay them out as 4*3-disk raidz1, 3*4-disk-raidz1
or a 1*12-disk raidz3 with nearly the same capacity (8-9
data disks plus parity). I see that with more vdevs the
IOPS will grow - does this translate to better resilver
and scrub times as well?
 
Smaller raidz sets can be more easily spread over 
different controllers and JBOD boxes, which is also
an interesting factor...
 
How good or bad is the expected reliability of
3*4-disk raidz1 vs 1*12-disk raidz3? In other words,
which tradeoff is better: more vdevs, or more parity,
which survives the loss of ANY 3 disks rather than
only the "right" 3 disks?
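My own back-of-the-envelope guess, assuming simultaneous, uniformly
random failures across the 12 drives (and ignoring resilver windows and
correlated failures): the 12-disk raidz3 survives any 3 concurrent
failures by construction, while for the 3*4-disk raidz1 layout

\[
P_{\text{survive}}(2\ \text{failures}) = 1 - \frac{3\binom{4}{2}}{\binom{12}{2}} = 1 - \frac{18}{66} \approx 0.73
\]
\[
P_{\text{survive}}(3\ \text{failures}) = \frac{4^3}{\binom{12}{3}} = \frac{64}{220} \approx 0.29
\]

So roughly 27% of the 2-disk and 71% of the 3-disk failure combinations
would kill the 3*4-disk raidz1 layout, but I am not sure how much the
faster resilver of the smaller vdevs offsets that in practice.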
 
Thanks,
//Jim Klimov
 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)

2011-05-31 Thread Paul Kraus
On Fri, May 27, 2011 at 2:49 PM, Marty Scholes  wrote:

> For what it's worth, I ran a 22 disk home array as a single RAIDZ3 vdev 
> (19+3) for several
> months and it was fine.  These days I run a 32 disk array laid out as four 
> vdevs, each an
> 8 disk RAIDZ2, i.e. 4x 6+2.

I tested 40 drives in various configurations and determined that
for random read workloads, the I/O scaled linearly with the number of
vdevs, NOT the number of drives. See
https://spreadsheets.google.com/a/kraus-haus.org/spreadsheet/pub?hl=en_US&hl=en_US&key=0AtReWsGW-SB1dFB1cmw0QWNNd0RkR1ZnN0JEb2RsLXc&output=html
for results using raidz2 vdevs. I did not test sequential read
performance here as our workload does not include any.

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Question on ZFS iSCSI

2011-05-31 Thread Jim Klimov
> The volume is exported as a whole disk. When given a whole disk, zpool
> creates a GPT partition table by default. You need to pass the partition
> (not the disk) to zdb.

Yes, that is what seems to be the problem.
However, for the zfs volumes (/dev/zvol/rdsk/pool/dcpool) there seems
to be no concept of partitions, etc. inside of them - these are defined
only for the iSCSI representation which I want to try and get rid of.
 
> In Linux you can use kpartx to make the partitions available. I don't
> know the equivalent command in Solaris.

Interesting... If only lofiadm could represent not a whole file, but a
given "window" into it ;)
 
 
At least, examining both the loopback-mounted device and the zfs volume
directly with "fdisk", "parted" and such reveals no noticeable iSCSI
service-data overhead in the addressable volume space:
 
# parted /dev/zvol/rdsk/pool/dcpool print
_device_probe_geometry: DKIOCG_PHYGEOM: Inappropriate ioctl for device
Model: Generic Ide (ide)
Disk /dev/zvol/rdsk/pool/dcpool: 4295GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Number  Start   End     Size    File system  Name  Flags
 1  131kB   4295GB  4295GB   zfs
 9  4295GB  4295GB  8389kB  

But lofiadm doesn't let me address that partition #1 as a separate device :(
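One crude check I can still do is to copy out what should be the first
ZFS label at that offset and see whether zdb can unpack it - the 131kB
that parted reports is sector 256, the paths are mine, and I am assuming
zdb -l is willing to read labels from a plain file:

# grab 256K (the size of one ZFS label) starting at sector 256 of the zvol
dd if=/dev/zvol/rdsk/pool/dcpool of=/tmp/label0 bs=512 iseek=256 count=512

# label 0 should unpack if the offset guess is right; the other three
# labels will of course fail on such a small file
zdb -l /tmp/label0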
 
Thanks,
//Jim Klimov
 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Question on ZFS iSCSI

2011-05-31 Thread Fajar A. Nugraha
On Tue, May 31, 2011 at 5:47 PM, Jim Klimov  wrote:
> However it seems that there may be some extra data beside the zfs
> pool in the actual volume (I'd at least expect an MBR or GPT, and
> maybe some iSCSI service data as an overhead). One way or another,
> the "dcpool" can not be found in the physical zfs volume:
>
> ===
> # zdb -l /dev/zvol/rdsk/pool/dcpool
>
> 
> LABEL 0
> 
> failed to unpack label 0

The volume is exported as a whole disk. When given a whole disk, zpool
creates a GPT partition table by default. You need to pass the partition
(not the disk) to zdb.

> So the questions are:
>
> 1) Is it possible to skip iSCSI-over-loopback in this configuration?

Yes. Well, maybe.

In Linux you can use kpartx to make the partitions available. I don't
know the equivalent command in Solaris.
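For what it's worth, on a Linux initiator it looks roughly like this
(device names are examples; kpartx usually ships with multipath-tools).
Afterwards you would point zdb, or whatever else needs it, at the
partition device instead of the whole disk:

# map the GPT partitions of the disk as separate device nodes
kpartx -av /dev/sdb
# expect something like /dev/mapper/sdb1 and /dev/mapper/sdb9 to appear

# remove the mappings again when done
kpartx -d /dev/sdb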

-- 
Fajar
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Question on ZFS iSCSI

2011-05-31 Thread Jim Klimov
I have an oi_148a test box with a pool on physical HDDs, a volume in
this pool shared over iSCSI with explicit commands (sbdadm and such),
and this iSCSI target is initiated by the same box. In the resulting iSCSI
device I have another ZFS pool "dcpool".

Recently I found the iSCSI part to be a potential bottleneck in my pool
operations and wanted to revert to using the ZFS volume directly as the
backing store for "dcpool".

However, it seems that there may be some extra data besides the zfs
pool in the actual volume (I'd at least expect an MBR or GPT, and 
maybe some iSCSI service data as overhead). One way or another, 
the "dcpool" cannot be found in the physical zfs volume:

===
# zdb -l /dev/zvol/rdsk/pool/dcpool 


LABEL 0

failed to unpack label 0

LABEL 1

failed to unpack label 1

LABEL 2

failed to unpack label 2

LABEL 3

failed to unpack label 3
===

So the questions are:

1) Is it possible to skip iSCSI-over-loopback in this configuration?
Ideally I would just specify a fixed offset (the byte in the volume at
which the "dcpool" data starts), remove the iSCSI/networking overheads,
and see whether they are the bottleneck.

2) This configuration "zpool -> iSCSI -> zvol" was initially proposed 
as preferable over direct volume access by Darren Moffat as the 
fully supported way, see last comments here:
http://blogs.oracle.com/darren/entry/compress_encrypt_checksum_deduplicate_with

I still wonder why: the overhead is deemed negligible, and more options
become readily available, such as mounting the iSCSI device on another
server. Now that I have hit the problem of reverting to direct volume
access, it makes more sense ;)

Thanks in advance for ideas or clarifications,
//Jim Klimov


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Experiences with 10.000+ filesystems

2011-05-31 Thread Khushil Dep
The adage that I adhere to with ZFS features is "just because you can doesn't 
mean you should!". I would suspect that with that many filesystems the normal 
zfs-tools would also take an inordinate length of time to complete their 
operations - scale according to size.

Generally snapshots are quick operations, but 10,000 such operations would, I 
believe, take enough time to complete as to present operational issues - 
breaking these into sets would alleviate some? Perhaps if you are starting to 
run into many thousands of filesystems you would need to re-examine your 
rationale for creating so many.

My 2c. YMMV.

-- 
Khush

On Tuesday, 31 May 2011 at 11:08, Gertjan Oude Lohuis wrote:

> "Filesystem are cheap" is one of ZFS's mottos. I'm wondering how far
> this goes. Does anyone have experience with having more than 10.000 ZFS
> filesystems? I know that mounting this many filesystems during boot
> while take considerable time. Are there any other disadvantages that I
> should be aware of? Are zfs-tools still usable, like 'zfs list', 'zfs
> get/set'.
> Would I run into any problems when snapshots are taken (almost)
> simultaneously from multiple filesystems at once?
> 
> Regards,
> Gertjan Oude Lohuis
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org (mailto:zfs-discuss@opensolaris.org)
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] JBOD recommendation for ZFS usage

2011-05-31 Thread Jim Klimov
So, if I may, is this the correct summary of the answer to the original
question (on JBODs for a ZFS HA cluster):
 
===
  SC847E26-RJBOD1 with dual-ported SAS drives are known
  to work in a failover HA storage scenario, allowing both servers
  (HBAs) access to each single SAS drive individually, so zpools
  can be configured from any disks regardless of which backplane
  they are connected to. HA Clusterware, such as NexentaStor
  HA-Cluster plugin should be used to ensure that only one head
  node actually uses a given disk drive in an imported ZFS pool.
===
 
Is the indented statement correct? :)
 
Other questions:
 
What clusterware would be encouraged now for OpenIndiana
boxes?
 
Also, in the case of clustered shared filesystems (like VMware VMFS),
can these JBODs allow two different servers to access one drive 
simultaneously in a safe manner (do they do fencing, reservations 
and other SCSI magic)?
 
> > Following up on some of this forum's discussions, I read the 
> manuals on SuperMicro's
> > SC847E26-RJBOD1 this weekend. 
> 
> We see quite a few of these in the NexentaStor installed base. 

> The NexentaStor HA-Cluster plugin manages STONITH and reservations.
> I do not believe programming expanders or switches for 
> clustering is the best approach.
> It is better to let the higher layers manage this.

> The cost of a SATA disk + SATA/SAS interposer is about the same 
> as a native SAS
> drive. Native SAS makes a better solution.
>  
 
-- 

++ 
|| 
| Климов Евгений, Jim Klimov | 
| технический директор   CTO | 
| ЗАО "ЦОС и ВТ"  JSC COS&HT | 
|| 
| +7-903-7705859 (cellular)  mailto:jimkli...@cos.ru | 
|CC:ad...@cos.ru,jimkli...@gmail.com | 
++ 
| ()  ascii ribbon campaign - against html mail  | 
| /\- against microsoft attachments  | 
++
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Experiences with 10.000+ filesystems

2011-05-31 Thread Gertjan Oude Lohuis
"Filesystem are cheap" is one of ZFS's mottos. I'm wondering how far
this goes. Does anyone have experience with having more than 10.000 ZFS
filesystems? I know that mounting this many filesystems during boot
while take considerable time. Are there any other disadvantages that I
should be aware of? Are zfs-tools still usable, like 'zfs list', 'zfs
get/set'.
Would I run into any problems when snapshots are taken (almost)
simultaneously from multiple filesystems at once?

Regards,
Gertjan Oude Lohuis
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss